Glossary

Languoid

A dialect, language or language family. A more extensive definition can be found in Nordhoff & Hammarström 2011.

Glottocode

Each languoid has a unique and persistent identifier called a Glottocode, consisting of four alphanumeric characters (i.e. lowercase letters or decimal digits) followed by four decimal digits (abcd1234 follows this pattern, but so does b10b1234).
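
The format described above can be checked mechanically. The following is a minimal sketch, assuming only the pattern stated in this entry; the names GLOTTOCODE_PATTERN and looks_like_glottocode are illustrative and not part of any Glottolog tool.

```python
import re

# Four characters from [a-z0-9], then four decimal digits.
# [0-9] is used instead of \d so that non-ASCII digits are rejected.
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def looks_like_glottocode(value: str) -> bool:
    """Return True if value has the shape of a Glottocode.

    This checks the format only; it cannot tell whether the code
    is actually assigned to a languoid.
    """
    return GLOTTOCODE_PATTERN.fullmatch(value) is not None


for candidate in ("abcd1234", "b10b1234", "abc1234", "ABCD1234"):
    print(candidate, looks_like_glottocode(candidate))
```

The first two candidates match the pattern; the third is too short and the fourth uses uppercase letters, so both are rejected.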

Doculect

The language variety described in a specific document. For example, whatever language data is contained in a certain dictionary, grammar or specific article represents a unique doculect, without specifying whether that data constitutes an idiolect, a dialect, a variety mutually intelligible to some other variety or a mix of varieties from various locales.

Document type

The class a document belongs to. There are 16 classes, and a document can belong to more than one class. The following doctypes are distinguished:

Grammar
an extensive description of most elements of the grammar ~ 150 pages and beyond
Grammar Sketch
a less extensive description of many elements of the grammar ~ 50 pages
Dictionary
~ 75 pages and beyond
(typological) Study Of A Specific Feature
description of some element of grammar (e.g., noun class system, verb morphology)
Phonology
phonological description
Text
some amount of unanalyzed text data ~ 10 pages and beyond
New Testament
a new testament translation
Wordlist
wordlist ~ a couple of hundred words
Comparative-historical Study
the language is featured in a comparative study
Some Very Small Amount Of Data/information On A Language
some small amount of lexical or grammatical data but not sufficient for a full wordlist or a substantial account of some grammatical feature
Sociolinguistically Oriented
sociolinguistic information (where spoken, by how many, etc.)
Dialectologically Oriented
containing dialectological information, e.g., the intelligibility between different dialects, the distribution of certain isoglosses within a language
Handbook/overview
the language is featured in a handbook/overview publication
Ethnographic Work
ethnographic information (whether extensive or brief)
Bibliographically Oriented
bibliographical information (i.e., the language is featured in a bibliography)
Unknown

Macro-area

An area of the globe of roughly continent size.

The division of the inhabited landmass into the macro-areas defined here is optimal in the following sense. It is the division

  1. into 6 areas,
  2. for which there are at least 250 languages in each area, such that
  3. the distance between the component parts inside each area is minimized, and
  4. the length of intersections between pairs of macro-areas is minimized.

See Harald Hammarström and Mark Donohue 2014.

The following areas are distinguished in Glottolog:

Pseudo-families

Sign Languages, Mixed Languages, Pidgins and Artificial Languages are treated as families in the database, even though they cannot be readily classified genealogically. These groups all share some features, which makes it convenient to treat them as if they were families.

Unclassified languages

Languages for which we have some lexical and/or grammatical information, but for which this information is so scanty that it cannot be decided whether or not they are genealogically related to any other language. They are treated as belonging to the top-level pseudo-family "Unclassified". (See also the Languoids information section.)

Unclassified languoid in family X

A languoid for which

  1. enough lexical and/or grammatical information is available to assign it to a family, but for which
  2. its position within that family has not been resolved (either because of lack of data or because of lack of investigation).

(See also the Languoids information section.)

Unattested language

A language for which we have convincing evidence of its existence and distinctness from all other languages, but no grammatical or lexical information is available.

Unattested languages can often be assigned to a family on non-linguistic grounds, but here they are treated as belonging to the pseudo-family "Unclassified".

(See also the Languoids information section.)

Spurious languoid

A languoid which is cited in the literature, but whose existence is not proven beyond doubt. This includes super-families like Amerind and small languages or dialects which somehow made it into other language catalogues without proof that they actually do exist. (See also the Languoids information section.)

Most Extensive Description (MED)

The Most Extensive Description (MED) for a language is the longest document of the highest ranking document type. From highest to lowest, the ranking is grammar, grammar sketch, dictionary/phonology/specific feature/text, wordlist, followed by the remaining document types. Note that 'description' here refers to grammatical description rather than, e.g., lexical documentation, so grammar trumps dictionary.

Descriptive status of a language

The descriptive status of a language is the document type of its MED. For example, in the case of a language for which there is a grammar, a phonology and a dictionary, its MED would be the grammar and so the descriptive status would be 'grammar'. It does not matter if there are fifty grammars for the language or just one; the descriptive status would still be 'grammar'.
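
The MED ranking and the descriptive status derived from it can be sketched as follows. This is an illustration only, not Glottolog's implementation. It assumes that length is measured in pages, that document types sharing a rank tier are equal in rank, and that a document belonging to several classes counts under its highest-ranking one. The class Document, the doctype labels and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Rank tiers from the MED definition; a lower number is a higher rank.
RANK = {
    "grammar": 0,
    "grammar sketch": 1,
    "dictionary": 2,
    "phonology": 2,
    "specific feature": 2,
    "text": 2,
    "wordlist": 3,
}
OTHER_RANK = 4  # all remaining document types


def rank(doctype: str) -> int:
    return RANK.get(doctype, OTHER_RANK)


@dataclass(frozen=True)
class Document:
    title: str
    doctypes: Tuple[str, ...]  # a document may belong to several classes
    pages: int                 # assumed measure of length


def best_doctype(doc: Document) -> str:
    """Highest-ranking document type of a document (first listed wins ties)."""
    return min(doc.doctypes, key=rank)


def most_extensive_description(docs: List[Document]) -> Optional[Document]:
    """Longest document among those of the highest-ranking type."""
    if not docs:
        return None
    return min(docs, key=lambda d: (rank(best_doctype(d)), -d.pages))


def descriptive_status(docs: List[Document]) -> Optional[str]:
    """Document type of the MED, or None if there are no documents."""
    med = most_extensive_description(docs)
    return best_doctype(med) if med is not None else None


docs = [
    Document("Short sketch", ("grammar sketch",), 40),
    Document("Longer sketch with phonology", ("grammar sketch", "phonology"), 60),
    Document("Big dictionary", ("dictionary",), 900),
]
print(most_extensive_description(docs).title)
print(descriptive_status(docs))
```

In this example the 900-page dictionary is not the MED, because a grammar sketch outranks a dictionary regardless of length; among the two sketches the longer one is chosen.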

Computerized assignment

A large class of the bibliographical references have been annotated manually for language and document type, sometimes via a decent translation from an annotation scheme other than the one used in Glottolog. However, another large class of references has not been manually annotated. Such references are automatically tagged on the basis of words that occur in the title. For example, if the title of a bibliographical reference contains the name of a language, it may be guessed to pertain to that language. Similarly, if it contains the word 'dictionary', it is probably of the document type dictionary. Which words trigger which annotations is learned automatically from manually tagged training data (see Hammarström 2008 for details). The automatic annotation is far from perfect, but it is nevertheless applied since it does more good than harm. Assignments triggered by words in the title in this way, as opposed to manual annotations, are marked as "computerized assignment".

Harald Hammarström. 2008. Automatic Annotation of Bibliographical References with Target Language.
Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, 57-64. ACL.
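
The idea of title-based tagging can be illustrated with a small sketch. It is not the method of Hammarström 2008: there the trigger words are learned from manually tagged training data, whereas the two tables below are hand-written, hypothetical stand-ins, and the function name tag_title is likewise an assumption made for this example.

```python
import re

# Hypothetical trigger tables. In Glottolog the triggers are learned from
# manually tagged training data rather than written by hand.
DOCTYPE_TRIGGERS = {
    "dictionary": "dictionary",
    "grammar": "grammar",
    "wordlist": "wordlist",
}
LANGUAGE_TRIGGERS = {
    "swahili": "Swahili",
    "basque": "Basque",
}


def tag_title(title: str) -> dict:
    """Guess language and document type from the words in a title."""
    # Split the lowercased title into runs of letters.
    words = set(re.findall(r"[^\W\d_]+", title.lower()))
    return {
        "languages": sorted(LANGUAGE_TRIGGERS[w] for w in words if w in LANGUAGE_TRIGGERS),
        "doctypes": sorted(DOCTYPE_TRIGGERS[w] for w in words if w in DOCTYPE_TRIGGERS),
        # Guesses made this way are flagged so they can be told apart
        # from manual annotations.
        "source": "computerized assignment",
    }


print(tag_title("A Dictionary of Basque"))
print(tag_title("Notes on trade routes"))
```

A title without any trigger word yields empty lists, which reflects why such tagging is far from perfect: it can only react to words that actually occur in the title.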