Languoid

A dialect, language or language family. A more extensive definition can be found in Nordhoff & Hammarström 2011.


Doculect

The language variety described in a specific document. For example, whatever language data is contained in a certain dictionary, grammar or specific article represents a unique doculect, without specifying whether that data constitutes an idiolect, a dialect, a variety mutually intelligible with some other variety, or a mix of varieties from various locales.

Document type

The class a document belongs to. There are 15 classes. A document can belong to more than one class. The following doctypes are distinguished:

Bibliographical
bibliographical information (i.e., the language is featured in a bibliography)
Comparative
the language is featured in a comparative study
Dialectology
dialectological information, e.g., the intelligibility between different dialects or the distribution of certain isoglosses within a language
Dictionary
~ 75 pages and beyond
Ethnographic
ethnographic information (whether extensive or brief)
Grammar
an extensive description of most elements of the grammar, ~ 150 pages and beyond
Grammar Sketch
a less extensive description of many elements of the grammar, ~ 50 pages
Minimal
a small amount of lexical or grammatical data, not sufficient for a full wordlist or a substantial account of any grammatical feature
New Testament
a New Testament translation
Overview
the language is featured in a handbook/overview publication
Phonology
phonological description
Sociolinguistic
sociolinguistic information (where spoken, by how many speakers, etc.)
Specific Feature
description of some element of grammar (e.g., noun class system, verb morphology)
Text
some amount of unanalyzed text data, ~ 10 pages and beyond
Wordlist
~ a couple of hundred words


Macro-area

An area of the globe of roughly continent size. The following areas are distinguished in Glottolog:

Africa
The continent
Australia
The continent
Eurasia
The Eurasian landmass north of Sinai. Includes Japan and islands to the north of it. Does not include Insular South East Asia.
North America
North and Middle America up to Panama. Includes Greenland.
Papunesia
All islands between Sumatra and the Americas, excluding islands off Australia and excluding Japan and islands to the north of it.
South America
Everything south of Darién

The division of the inhabited landmass into these macro-areas is optimal in the following sense. It is the division

  1. into 6 areas,
  2. for which there are at least 250 languages in each area, such that
  3. the distance between the component parts inside each area is minimized, and
  4. the length of intersections between pairs of macro-areas is minimized.
Hammarström, Harald & Mark Donohue. 2014. Principles on Macro-Areas in Linguistic Typology.
In Harald Hammarström & Lev Michael (eds.), Quantitative Approaches to Areal Linguistic Typology (Language Dynamics & Change Special Issue), 167-187. Leiden: Brill.


Pseudo-family

Sign Languages, Mixed Languages, Pidgins and Artificial Languages are treated as families in the database, even though they cannot be readily classified genealogically. The members of each group share some features, which makes it convenient to treat the groups as if they were families.

Unclassified languages

Languages for which we have some lexical and/or grammatical information, but for which this information is so scanty that it cannot be decided whether or not they are genealogically related to any other language. They are treated as belonging to the top-level pseudo-family "Unclassified". (See also the Languoids information section.)

Unclassified languoid in family X

A languoid for which

  1. enough lexical and/or grammatical information is available to assign it to a family, but for which
  2. its position within that family has not been resolved (either because of lack of data or because of lack of investigation).

(See also the Languoids information section.)

Status in Glottolog

The vast majority of languoids have the status "Established", but there are some special cases: some languages are Unattested, some are Provisional, and some languoids are Spurious. In addition, quite a few families have the status "Retired". (See also the Languoids information section.)

Unattested language

A language for which we have convincing evidence of its existence and its distinctness from all other languages, but for which no grammatical or lexical information is available.

Unattested languages can often be assigned to a family on non-linguistic grounds, but here they are treated as belonging to the pseudo-family "Unclassified".

(See also the Languoids information section.)

Provisional language

A language whose status is currently under consideration by the Glottolog editors. (See also the Languoids information section.)

Spurious languoid

A languoid which is cited in the literature, but whose existence is not proven beyond doubt. This includes super-families like Amerind and small languages or dialects which somehow made it into other language catalogues without proof that they actually do exist. (See also the Languoids information section.)

Retired family

A family which existed in Glottolog 1 (2012), but which no longer exists in the current version of Glottolog.

Most Extensive Description (MED)

The Most Extensive Description (MED) for a language is the longest document of the highest ranking document type. From highest to lowest, the ranking is grammar, grammar sketch, dictionary/phonology/specific feature/text, wordlist, followed by the remaining document types.

Descriptive status of a language

The descriptive status of a language is the document type of its MED. For example, in the case of a language for which there is a grammar, a phonology and a dictionary, its MED would be the grammar and so the descriptive status would be 'grammar'. It does not matter whether there are fifty grammars for the language or just one; the descriptive status would still be 'grammar'.
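
The two definitions above (MED and descriptive status) can be illustrated with a minimal Python sketch. It is not Glottolog's own code. The `Document` class, the doctype spellings such as `grammar_sketch`, and the use of page count as the measure of "longest" are assumptions made for the example. Because a document can belong to more than one class, the sketch ranks each document by its highest-ranking doctype.

```python
from dataclasses import dataclass

# Rank tiers from the MED definition; a lower number means a higher rank.
# The doctype spellings are illustrative, not an official identifier list.
RANK = {
    "grammar": 0,
    "grammar_sketch": 1,
    "dictionary": 2, "phonology": 2, "specific_feature": 2, "text": 2,
    "wordlist": 3,
}
LOWEST = 4  # all remaining document types share the lowest tier


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one bibliographical reference."""
    title: str
    doctypes: tuple  # a document can belong to more than one class
    pages: int       # assumption: "longest" is measured in pages


def best_rank(doc):
    """Rank of the highest-ranking doctype the document belongs to."""
    return min((RANK.get(t, LOWEST) for t in doc.doctypes), default=LOWEST)


def most_extensive_description(docs):
    """Longest document among those of the highest-ranking doctype."""
    if not docs:
        return None
    return min(docs, key=lambda d: (best_rank(d), -d.pages))


def descriptive_status(docs):
    """Doctype of the MED (its highest-ranking one if it has several)."""
    med = most_extensive_description(docs)
    if med is None or not med.doctypes:
        return None
    return min(med.doctypes, key=lambda t: RANK.get(t, LOWEST))


docs = [
    Document("Sketch", ("grammar_sketch",), 60),
    Document("Grammar A", ("grammar",), 300),
    Document("Grammar B", ("grammar", "phonology"), 450),
    Document("Dictionary", ("dictionary",), 900),
]
print(most_extensive_description(docs).title)  # Grammar B
print(descriptive_status(docs))                # grammar
```

The 900-page dictionary does not win, because document type is compared before length. Adding more grammars changes at most which document is the MED, not the descriptive status.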

Computerized assignment

Many of the bibliographical references have been annotated manually for language and document type, sometimes via a decent translation from an annotation scheme other than the one used in Glottolog. However, many references have not been manually annotated. Such references are tagged automatically on the basis of words that occur in the title. For example, if the title of a bibliographical reference contains the name of a language, it may be guessed to pertain to that language. Similarly, if it contains the word 'dictionary', it is probably of the document type dictionary. Which words trigger which annotations is learned automatically from manually tagged training data (see Hammarström 2008 for details). The automatic annotation is far from perfect, but it is applied nevertheless because it does more good than harm. Assignments triggered by words in the title in this way, as opposed to manual annotations, are marked as "computerized assignment".
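
The general idea of title-based tagging can be shown with a toy Python sketch. It is not the method of Hammarström 2008 and not Glottolog's implementation: the precision and count thresholds, the function names and the invented training titles are all assumptions made for illustration. A word becomes a trigger for a label if it appears in enough manually tagged titles and almost always with the same label.

```python
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase word tokens of a title."""
    return re.findall(r"\w+", title.lower())


def learn_triggers(training, min_count=2, min_precision=0.9):
    """Learn word -> label triggers from (title, label) pairs.

    Thresholds are illustrative assumptions, not values from the source.
    """
    word_total = Counter()
    word_label = defaultdict(Counter)
    for title, label in training:
        for word in set(tokenize(title)):
            word_total[word] += 1
            word_label[word][label] += 1
    triggers = {}
    for word, total in word_total.items():
        label, hits = word_label[word].most_common(1)[0]
        if hits >= min_count and hits / total >= min_precision:
            triggers[word] = label
    return triggers


def annotate(title, triggers, manual=None):
    """Prefer a manual annotation; otherwise guess from title words
    and flag the result as a computerized assignment."""
    if manual is not None:
        return {"doctypes": sorted(manual), "computerized": False}
    guessed = {triggers[w] for w in tokenize(title) if w in triggers}
    return {"doctypes": sorted(guessed), "computerized": True}


# Invented training titles with manual document-type labels.
TRAINING = [
    ("A dictionary of Examplish", "dictionary"),
    ("Examplish-English dictionary", "dictionary"),
    ("A grammar of Examplish", "grammar"),
    ("A reference grammar of Samplese", "grammar"),
]
triggers = learn_triggers(TRAINING)
print(annotate("Dictionary of Samplese", triggers))
```

With this training data only 'dictionary' and 'grammar' become triggers; words such as 'of' or 'Examplish' occur with both labels and are rejected by the precision threshold. The same scheme could be applied to language names instead of document types.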

Hammarström, Harald. 2008. Automatic Annotation of Bibliographical References with Target Language.
Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, 57-64. ACL.