naming a corpus

I’m creating a corpus for public distribution, but I’m stuck on a crucial point in the project: the corpus needs a name!

Fog hanging over Golden Gate Park in western San Francisco

Some corpora have better names than others. There are corpora where the name is simply the initials: the British National Corpus is just the BNC, pronounced (thankfully!) just as “bee en see.” But some corpora have really nice names, like COCA (Corpus of Contemporary American English), ICE (International Corpus of English), VOICE (Vienna-Oxford International Corpus of English), and COLT (the Bergen Corpus of London Teenage Language… which you might think would be BCLT, but COLT is sooo much better). The most clever corpus names, I think, are CHILDES (Child Language Data Exchange System; actually much more than a single corpus, but close enough) and SCOTS (Scottish Corpus of Texts and Speech), which both describe the thing they stand for (there must be a term for that kind of acronym…). There are also corpus names that are not acronyms, and which only describe what they are: Switchboard is a corpus of collected telephone calls, CallFriend is a corpus of calls specifically between friends, and the Buckeye Corpus is the speech of Buckeyes (speakers in Columbus, Ohio)!

The Sunset District, San Francisco, California

My corpus, based on my PhD research, consists of sociolinguistic interviews (in the form of anonymized sound files, transcripts, and phonetic alignments) among San Franciscans who grew up in the western neighborhood known as the Sunset District. The interviews (and reading passages and wordlists) are all in English, which thus gives me the following letters to play with for an acronym for my corpus: S(an), F(rancisco), S(unset), D(istrict), E(nglish), I(nterviews), and C(orpus). But coming up with a cute acronym has proven difficult. The most obvious one is the Corpus of San Francisco English: COSFE. But “COSFE” sounds horrible! What does that last vowel even sound like, exactly? Adding “Interviews” to the end makes it “COSFEI,” which disambiguates the vowel sound, but it still doesn’t sound very nice at all. I tried playing around with “SD,” for Sunset District, rather than “SF,” but that didn’t get me any further. So, should I just continue to call it the boring thing I’ve affectionately been calling it from the beginning: the SanFran Corpus? Or how about the Sunset Corpus (which unfortunately makes it seem like it’s the corpus to end all corpora)?

I’ve already given this waaay too much thought, so I leave it to you….

Typical Sunset District homes, in the style of Henry Doelger



12 Responses to naming a corpus

  1. Stan says:

    When you mentioned Sunset, and before I read the letters to play with, I thought Sunset Corpus would work quite well. It’s easier on the ear and the tongue than an unwieldy acronym, and its connotations are only slightly melancholy/cinematic/apocalyptic!

    Or you could mix them up for something like SunSan Corpus or SanFrin Corpus, though neither of those is particularly appealing or euphonic. SanFran Corpus is fine, I think, and it has the advantage of being descriptive.

  2. Starr says:

    Given the precedent of the Buckeye Corpus, I don’t see a problem with calling it the Sunset Corpus, it’s cute!

  3. I agree with Stan. The first thing when you described the corpus that I could think of was “Sunset.” (Like “Switchboard,” drop the “corpus” component.) That said, I see your point about its limitations, but do you plan to expand it? And, if you eventually do expand it, you could call the hybrid the Sunset-San Francisco Corpus then.

  4. vocalised says:

    Interesting consensus on here, unlike the parallel discussions on Facebook and Twitter. ‘Oiwi Parker-Jones put his finger on exactly what I didn’t like about Sunset Corpus: “Sunset Corpus suggests West Coast to me, more than ‘the corpus to end all corpora’. Although it does also sound like it might be recordings from a retirement home…” 🙂

  5. I like the “Sunset Corpus” too.

  6. I was thinking “San Francisco Sunset Corpus”. My preference is to avoid acronyms whenever possible, especially the non-initialism kind. It just causes a stutter in my reading whenever I have to parse COCA and CHILDES. BNC bothers me less, but Buckeye is by far the easiest on my brain. Officially naming it something like “San Francisco Sociolinguistic Interviews Corpus of the Sunset District” and then simply referring to it as “the Sunset Corpus” seems great.

  7. Kim Witten says:

    SLIC SF – SocioLinguistic Interview Corpus of SF
    SCorE – Sunset Corpus of English
    SunsetLing Corpus or SunsetLx Corpus
    Sunset District Corpus
    COLD, SF Corpus of Outside Lands District, SF
    SOLICE – Sociolinguistic Outside Lands Interview Corpus of English
    SLICE – Sunset Linguistic Interview Corpus of English
    oooh, I know…E COLI! (The English Corpus Of Linguistic Interviews!)

    (I may be joking about that last one)

    • vocalised says:

      Wow, Kim, you can go head-to-head with Carmen Fought (on FB) with her list! These are great! I’ll note that “E COLI” can actually apply to many other corpora, so in all fairness to other corpora, I should pass on that one. 😉 COLD cracks me up, since my interviewees do spend much of their time talking about how cold it is. BTW, nice insider local-knowledge on your part putting that one together, too! Anyway, thanks, I’ll be adding this to the growing list. Maybe I’ll run a survey later of my favs, in order to make the choice…

  8. Kim Witten says:

    I just like the thought of calling it the ‘cold corpus’…so close to corpse, so CSI! It sounds like the kind of data that will take your mom out dancing and slap your sister. Or maybe just eat the plums that were sitting in the ICEbox. So sweet and so COLD.

  9. Stan says:

    I haven’t seen the Facebook ideas, but Kim’s is a great list. I’m embarrassed about my suggestions now!

    Not a serious proposal, but you could go all in with CSI COLD: Corpus of Sunset Interviews Conducted in Outside Lands District. (Add “CORPus of Sanfrancisco English” for a grisly extension.)

  10. What about Corpus of Sunset Interviewed English and keep it nice and COSIE?
    ‘corpus, linguistic interviews to open research in Sunset’ is to be avoided however.
    Sorry, couldn’t resist
