Language Documentation and Description


In recent years, it has become standard to distinguish language documentation from description (Himmelmann 1998). Briefly, a documentation is a “record of the linguistic practices and traditions of a speech community” (Himmelmann 1998: 165-166) while a description is “the investigation of the structure of a language through the collection of primary language data gathered through interaction with native-speaker consultants” (Chelliah & de Reuse 2011: 7). More simply put, documentation means recording native speakers (on audio and video), writing down what was said, and storing everything in an archive, whereas description means analyzing the language and writing grammars, or pieces of grammars.

While I find this a useful conceptual distinction, I do not believe that language documentation and description should be separated in practice. Imagine, for example, a situation where we had hundreds of hours of audio recordings in an unknown language, but no grammar and no dictionary, and none of the recordings were transcribed. Would we know what was said in those recordings? How would we go about transcribing them? How would we know where the word boundaries are? How would we create word-by-word interlinear glosses? How would we build a dictionary and a grammar? In this situation, the recordings would be practically unusable. Hopefully, one would eventually be able to decipher them, based on knowledge of related languages, but truly raw data in an unknown language is not easily accessible, either to linguists or to endangered language communities.

For me, then, language documentation means leaving an observationally adequate record of the language. To be observationally adequate means to “present the observed primary data correctly” (Chomsky 1964: 62). Basically, this means to have a label or “pigeonhole” for every observation: we have divided each sentence into separate words, each word is spelled correctly, with all of the long vowels, nasal vowels, and tones in the right place, and each word is appropriately glossed, e.g. 3rd person plural imperfective, 2nd person singular perfective, etc. However, this is by no means an easy task:

“What data is relevant is determined in part by the possibility for a systematic theory, and one might therefore hold that the lowest level of success [i.e. observational adequacy—A.J.] is no easier to achieve than the others.…under normal conditions speech is subject to various, often violent distortions that may in themselves indicate nothing about the underlying linguistic patterns. The problem of determining what data is valuable and to the point is not an easy one. What is observed is often neither relevant nor significant, and what is relevant and significant is often very difficult to observe…” (Chomsky 1964: 63).

To put this another way, in a completely new and unfamiliar language, we don’t even know what we should be paying attention to, and we don’t know what categories we should be using to label our observations. And yet it is necessary to label or annotate our observations in some way, if they are to be useful to future linguists or community members: for example, I may want to search for all 3rd person dual optative forms in my corpus, or all occurrences of the word tabı̨́ł ‘fishnet’. On the other hand, I must have already done some elicitation or analysis, to know what the word for ‘fishnet’ is, or which forms are 3rd person dual. Thus, observation depends on analysis just as much as analysis depends on observation (in the sense of “observation” used above): there is no such thing as pure “bottom-up” linguistics (Gil 2001). In the sections below I elaborate on particular issues in language documentation and description that I deal with in my own work.


Text Collection and Lexical Database.

Descriptive linguistic fieldwork almost always involves the collection of texts. At the same time, one of the main goals of fieldwork is often the production of some sort of dictionary for the language. Using the program Toolbox (Buseman & Buseman 2008), it is possible to build a dictionary directly from texts. That is, one enters text into Toolbox line-by-line, with word-for-word glosses. Toolbox automatically generates an interlinear text (i.e. with a gloss under each word), for the words it already knows—that is, those words that are already in the database. For those words that are not in the database, Toolbox asks you to enter a gloss (which is itself then added to the database). The following is an example of what a line of text looks like in Toolbox (from the story “The Founding of Yellowknife” by Michel Paper).
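Independently of the Toolbox display, the lookup-and-prompt behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Toolbox's actual implementation: the function `gloss_line`, the dictionary `lexicon`, and the callback `ask_gloss` are hypothetical names, and the unknown word is a placeholder rather than a real form.

```python
def gloss_line(words, lexicon, ask_gloss):
    """Pair each word with its gloss; prompt for (and store) unknown words."""
    glossed = []
    for word in words:
        if word not in lexicon:
            # Unknown word: ask once, then remember the answer in the lexicon.
            lexicon[word] = ask_gloss(word)
        glossed.append((word, lexicon[word]))
    return glossed


# The only real form here is the one cited in the text; the rest is placeholder data.
FISHNET = "tabı̨́ł"
lexicon = {FISHNET: "fishnet"}

asked = []  # records which words triggered a prompt


def ask(word):
    """Stand-in for an interactive prompt (hypothetical)."""
    asked.append(word)
    return "GLOSS-OF-" + word


line = gloss_line([FISHNET, "UNKNOWN-WORD", "UNKNOWN-WORD"], lexicon, ask)
for word, gloss in line:
    print(word, "=", gloss)
```

In this sketch the second occurrence of the placeholder word is glossed without a prompt, because the first answer has already been added to the lexicon. That is the sense in which the dictionary grows directly out of the texts.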

However, in a polysynthetic language, the notion of a “list of words” is problematic. In the Northeast Dene languages (Dogrib, Slave, Dëne Sų́łiné), a simple intransitive verb will typically have 27 forms (1st, 2nd, 3rd person; singular, dual, plural; perfective, imperfective, optative). A transitive verb, with object agreement, will have over 250 forms, and the number is greater still if one considers secondary, derived forms, such as iterative and distributive forms. Should all of these inflected verb forms be listed individually, or should they be grouped together under some common lexical entry (based, e.g. on verb base, or verbal root)? In my opinion, the answer is both—and this is the approach I have taken in constructing my databases (with the help of Albert Bickford, SIL). That is, for each language I work on, Weledeh and Taltsą́ot'ıné, there are actually three databases linked together: the texts database is linked to a wordforms database, and the wordforms database is linked to a morphemes database, as shown in (2).
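The figure of 27 intransitive forms is simply the product of three person values, three number values, and three aspect/mode values. A minimal sketch enumerating the cells of such a paradigm follows; the category labels come from the paragraph above, while the variable and function names are hypothetical. The transitive count is not computed, since the inventory of object agreement categories is not given here.

```python
from itertools import product

PERSONS = ("1st", "2nd", "3rd")
NUMBERS = ("singular", "dual", "plural")
MODES = ("perfective", "imperfective", "optative")


def paradigm_cells():
    """List every person/number/mode combination as a label string."""
    return [" ".join(cell) for cell in product(PERSONS, NUMBERS, MODES)]


cells = paradigm_cells()
print(len(cells))  # 3 * 3 * 3 = 27
print("3rd dual optative" in cells)  # True
```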

For each word that occurs in an actual text, we enter a full-word gloss which is stored in the wordforms database. For each word in the wordforms database, in turn, we enter a morpheme-by-morpheme gloss which is stored in the morphemes database. An example of a word broken up into morphemes is given in (3).
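The three-way linkage can also be pictured as a small data structure. The sketch below is an assumption about how such links might be modelled, not the layout of the actual Toolbox databases: the class names, the ID scheme, and all forms and glosses are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    form: str
    gloss: str


@dataclass
class Wordform:
    form: str
    gloss: str  # full-word gloss
    morpheme_ids: list = field(default_factory=list)  # links into the morphemes database


@dataclass
class TextLine:
    wordform_ids: list  # links into the wordforms database


# Placeholder entries only; no real language data is shown.
morphemes = {
    "m1": Morpheme("PREFIX", "prefix-gloss"),
    "m2": Morpheme("STEM", "stem-gloss"),
}
wordforms = {
    "w1": Wordform("PREFIX-STEM", "whole-word gloss", ["m1", "m2"]),
}
text_line = TextLine(["w1"])


def interlinear(line):
    """Follow the links: text line -> wordforms -> morphemes."""
    rows = []
    for wid in line.wordform_ids:
        wf = wordforms[wid]
        parts = [morphemes[mid].gloss for mid in wf.morpheme_ids]
        rows.append((wf.form, wf.gloss, parts))
    return rows


print(interlinear(text_line))
```

Each level keeps its own gloss, so a word can be retrieved either as a whole or through its parts.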

The morphemes database will be discussed further in the next section. The point here is that we need not be forced to choose between recognizing words or morphemes as theoretical objects in morphology, any more than we must choose between segments and syllables in phonology, or between words and phrases in syntax; these are all perfectly valid levels of representation. It seems likely, to me anyway, that speakers memorize both individual morphemes and whole words. There is no reason why both levels of representation cannot also be recorded in our observations.


Verb Paradigms and Morphemes Database.

Transcription, segmentation into words, and interlinear glossing are not trivial matters, as has been noted in the language documentation literature; this is one area where it does not seem possible to separate documentation from description (Himmelmann 1998: 162-163). Work with Dene languages, however, poses a number of additional challenges. At the morphological level, all Dene languages exhibit complex verb morphology with stem suppletion and discontinuous dependencies, which makes lexicography difficult (e.g. Kari 1989, 1990). On top of this, in the Northeast Dene languages (Slave, Dogrib, and Dëne Sų́łiné), and especially in the languages/dialects I work on (Weledeh and Taltsą́ot'ıné), grammatical categories that used to be signaled by their own separate prefix, clearly identifiable as its own separate syllable such as ghe, the, ne, or de, are now often signaled by suprasegmental or vocalic features such as High tone, vowel lengthening, vowel raising (e > ı) and/or nasalization. On the other hand, tone, vowel length, and nasality are also the targets of a number of phonological rules in these languages, making it difficult to recover the underlying forms of words from their surface forms. Despite having several very good reference works available on related languages and dialects (Ackroyd 1982, Rice 1989, Cook 2004), the morphophonemics of Weledeh baffled me for many years, making glossing and transcription very difficult.

To address this problem, in my dissertation (Jaker 2012), I presented a complete phonology of Weledeh verbs, thus providing a principled basis to infer underlying forms from surface forms, and vice versa (see: phonology). This makes it possible to enter morpheme-by-morpheme glosses into the database, as shown in (4).

The question then arises—why should one do this? That is, why supply morpheme-by-morpheme glosses as part of a language documentation project? There are a number of reasons. It is true that underlying forms, and morpheme-by-morpheme glosses, do not constitute primary data—they represent a certain type of analysis, performed by the linguist. However, there is a great deal of uncertainty in the primary data itself: phonemic vowel length is affected by rate of speech, lexical tone can be overridden by phrasal intonation, and nasality and vowel quality can be influenced, at the phonetic level, by neighboring segments. What we hear is influenced, in large part, by what we expect to hear. For example, I may think I heard a long vowel, because I expect this verb form to be perfective (given the discourse context), or I think I heard an h-classifier before the stem, because I know that h is part of the verb theme for this verb. Native speakers use this kind of “top-down” information even more than linguists do. By providing a morpheme-by-morpheme gloss, I am at least being honest about my assumptions (i.e. what led me to believe this vowel is long). That way, if my assumptions are incorrect, future linguists (and speakers) will more easily be able to spot my errors.

Another use of the morphemes database is to be able to search for occurrences of a certain morpheme. For example, I could search the corpus for all words containing the areal prefix go, or the incorporated noun tł'ı ‘rope’, or for all verbs with incorporated nouns, etc. This could form the basis for qualitative or quantitative investigations of the behavior of certain prefixes. Finally, morpheme-by-morpheme glosses are potentially useful as a teaching tool, to help students understand verb structure: a language learner could click on (or hover over) a particular form in the verb dictionary, and see a morphemic analysis pop up, similar to those shown in (4).
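A search of this kind amounts to filtering words by the morphemes they contain. The following sketch assumes each corpus word is stored with a list of segmented morphemes; the record layout, the `category` labels, and the function name `find_words_with` are hypothetical, and the word and stem strings are placeholders rather than real forms. Only the prefix go and the noun tł'ı are taken from the text above.

```python
def find_words_with(corpus, predicate):
    """Return the words in which at least one morpheme satisfies the predicate."""
    return [w for w in corpus if any(predicate(m) for m in w["morphemes"])]


# Placeholder corpus: word and stem forms are dummies, not attested data.
corpus = [
    {"form": "WORD-A", "morphemes": [
        {"form": "go", "category": "areal prefix"},
        {"form": "STEM-1", "category": "stem"},
    ]},
    {"form": "WORD-B", "morphemes": [
        {"form": "tł'ı", "category": "incorporated noun"},
        {"form": "STEM-2", "category": "stem"},
    ]},
    {"form": "WORD-C", "morphemes": [
        {"form": "STEM-3", "category": "stem"},
    ]},
]

# All words containing the areal prefix.
areal = find_words_with(
    corpus, lambda m: m["category"] == "areal prefix" and m["form"] == "go"
)

# All words containing any incorporated noun.
incorporated = find_words_with(
    corpus, lambda m: m["category"] == "incorporated noun"
)

print([w["form"] for w in areal])
print([w["form"] for w in incorporated])
```

Matching on category as well as form matters because a short string such as go could otherwise match unrelated morphemes that happen to share the same shape.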


Archiving with ANLA.

All of my raw data (audio recordings, databases, and text files), as well as finished products such as the intermediate level readers and verb dictionaries, will be stored with the Alaska Native Language Archive (ANLA) at UAF. Copies will also be kept in our library and digital storage system at the Goyatıkǫ̀ Language Centre and the Yellowknives Dene First Nation band office. All items are labeled with appropriate metadata, according to the standards of ANLA and OLAC (Open Language Archives Community). As a general rule (i.e. by default), original recordings can be accessed only by the speakers themselves or their relatives. However, in the future we would like to make as many recordings as possible publicly accessible in the form of teaching materials, i.e. audio files to accompany written text, with individual speakers’ permission.



Ackroyd, Lynda. 1982. Dogrib Grammar. Manuscript, University of Toronto.
Buseman, Alan & Karen Buseman. 2008. Toolbox.
Chelliah, Shobhana & Willem de Reuse. 2011. Handbook of Descriptive Linguistic Fieldwork. Springer.
Chomsky, Noam. 1964. Current Issues in Linguistic Theory. In Fodor, Jerry & Jerrold Katz (eds.) The Structure of Language: Readings in the Philosophy of Language. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Cook, Eung-Do. 2004. A Grammar of Dëne Sų́łiné (Chipewyan). Algonquian and Iroquoian Linguistics, Memoir 17. John D. Nichols & H.C. Wolfart (eds.), Winnipeg, Manitoba.
Gil, David. 2001. Escaping Eurocentrism: fieldwork as a process of unlearning. In Paul Newman & Martha Ratliff (eds.) Linguistic Fieldwork. Cambridge University Press.
Himmelmann, Nikolaus. 1998. Documentary and descriptive linguistics. Linguistics 36 (1): 161-195.
Jaker, Alessandro. 2012. Prosodic Reversal in Dogrib (Weledeh Dialect). Doctoral dissertation, Stanford University.
Kari, James. 1989. Affix Positions and Zones in the Athapaskan Verb Complex: Ahtna and Navajo. International Journal of American Linguistics 55: 424-54.
Kari, James. 1990. Ahtna Athabaskan Dictionary. Fairbanks: Alaska Native Language Center.
Rice, Keren. 1989. A Grammar of Slave. Mouton de Gruyter.
