A dialect, language or language family. A more extensive definition can be found in Nordhoff & Hammarström 2011.
Each languoid has a unique and persistent identifier called a Glottocode, consisting of four alphanumeric characters (i.e. lowercase letters or decimal digits) followed by four decimal digits (abcd1234 follows this pattern, but so does b10b1234).
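The shape described above can be checked with a regular expression. This is a minimal sketch, not Glottolog's own code: the names GLOTTOCODE_PATTERN and is_glottocode_shaped are hypothetical, and a match only shows that a string has the right shape, not that it is assigned to a languoid.

```python
import re

# Four characters that are lowercase letters or decimal digits,
# followed by four decimal digits (shape taken from the definition above).
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode_shaped(candidate: str) -> bool:
    """Return True if the whole string has the shape of a Glottocode.

    Hypothetical helper: it checks the shape only and says nothing
    about whether the code actually exists in the catalogue.
    """
    return GLOTTOCODE_PATTERN.fullmatch(candidate) is not None
```

Both examples from the definition, abcd1234 and b10b1234, pass this check, while strings with uppercase letters or the wrong length do not.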
The language variety described in a specific document. For example, whatever language data is contained in a certain dictionary, grammar or specific article represents a unique doculect, without specifying whether that data constitutes an idiolect, a dialect, a variety mutually intelligible to some other variety or a mix of varieties from various locales.
The class a document belongs to. There are 16 classes. A document can belong to more than one class. The following doctypes are distinguished:
An area of the globe of roughly continent size. The following areas are distinguished in Glottolog:
The division of the inhabited landmass into these macro-areas is optimal in the following sense. It is the division
Hammarström, Harald & Mark Donohue. (2014) Principles on Macro-Areas in Linguistic Typology.
In Harald Hammarström & Lev Michael (eds.), Quantitative Approaches to Areal Linguistic Typology (Language Dynamics & Change Special Issue), 167-187. Leiden: Brill.
Sign Languages, Mixed Languages, Pidgins and Artificial Languages are treated as families in the database, even though they cannot be readily classified genealogically. These groups all share some features, which makes it convenient to treat them as if they were families.
Languages for which we have some lexical and/or grammatical information, but so little that it cannot be decided whether or not they are genealogically related to any other language. They are treated as belonging to the top-level pseudo-family "Unclassified". (See also the Languoids information section.)
A languoid for which
(See also the Languoids information section.)
A language for which we have convincing evidence of its existence and distinctness from all other languages, but no grammatical or lexical information is available.
Unattested languages can often be assigned to a family on non-linguistic grounds, but here they are treated as belonging to the pseudo-family "Unclassified".
(See also the Languoids information section.)
A languoid which is cited in the literature, but whose existence is not proven beyond doubt. This includes super-families like Amerind and small languages or dialects which somehow made it into other language catalogues without proof that they actually do exist. (See also the Languoids information section.)
The Most Extensive Description (MED) for a language is the longest document of the highest-ranking document type. From highest to lowest, the ranking is grammar, grammar sketch, dictionary/phonology/specific feature/text, wordlist, followed by the remaining document types. Note that 'description' here refers to grammatical description rather than, e.g., lexical documentation, so a grammar trumps a dictionary.
The descriptive status of a language is the document type of its MED. For example, in the case of a language for which there is a grammar sketch, a phonology and a dictionary, its MED would be the grammar sketch and so the descriptive status would be 'grammar sketch'. It does not matter whether there are fifty grammar sketches for the language or just one; the descriptive status would still be 'grammar sketch'.
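The two definitions above can be sketched in a few lines of code. This is an illustration under stated assumptions, not Glottolog's implementation: the Document class, the doctype spellings, and the use of a page count as the measure of length are all hypothetical, and each document is given a single doctype for simplicity.

```python
from dataclasses import dataclass

# Rank per document type; a lower number means a higher rank.
# Types listed together in the definition share a rank.
# The identifier spellings are assumptions made for this sketch.
DOCTYPE_RANK = {
    "grammar": 0,
    "grammar_sketch": 1,
    "dictionary": 2,
    "phonology": 2,
    "specific_feature": 2,
    "text": 2,
    "wordlist": 3,
}
OTHER_RANK = 4  # all remaining document types


@dataclass
class Document:
    title: str
    doctype: str  # simplification: exactly one doctype per document
    pages: int    # assumed measure of document length


def most_extensive_description(docs):
    """Return the longest document among those of the best-ranked type."""
    if not docs:
        return None
    # Sort key: best rank first, then most pages.
    return min(
        docs,
        key=lambda d: (DOCTYPE_RANK.get(d.doctype, OTHER_RANK), -d.pages),
    )


def descriptive_status(docs):
    """Return the doctype of the MED, or None if there are no documents."""
    med = most_extensive_description(docs)
    return med.doctype if med is not None else None
```

For a language with a 40-page grammar sketch, a 120-page phonology and a 600-page dictionary, the sketch outranks the longer documents, so descriptive_status returns "grammar_sketch". Length only decides between documents of equal rank.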
Many of the bibliographical references have been annotated manually for language and document type, sometimes via a decent translation of an annotation scheme other than the one used in Glottolog. However, many other references have not been manually annotated. Such references are automatically tagged on the basis of words that occur in the title. For example, if the title of a bibliographical reference contains the name of a language, it may be guessed to pertain to that language. Similarly, if it contains the word 'dictionary', it is probably of the document type dictionary. Which words trigger which annotations is automatically learned from manually tagged training data (see Hammarström 2008 for details). The automatic annotation is far from perfect, but is nevertheless applied since it does more good than harm. However, assignments which are triggered by words in the title in this way, as opposed to manual annotations, are marked as "computerized assignment".
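The general idea of title-word triggers can be illustrated with a toy example. This is not the method of Hammarström 2008: the functions learn_triggers and annotate, the thresholds min_count and min_share, and the simple counting rule are all assumptions made for this sketch.

```python
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Split a title into lowercase word tokens."""
    return re.findall(r"\w+", title.lower())


def learn_triggers(tagged_titles, min_count=2, min_share=0.8):
    """Learn word -> label triggers from manually tagged (title, label) pairs.

    Toy rule (an assumption): a word becomes a trigger if it occurs in at
    least min_count titles and at least min_share of them carry one label.
    """
    word_label_counts = defaultdict(Counter)
    for title, label in tagged_titles:
        for word in set(tokenize(title)):
            word_label_counts[word][label] += 1
    triggers = {}
    for word, counts in word_label_counts.items():
        label, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_count and hits / total >= min_share:
            triggers[word] = label
    return triggers


def annotate(title, triggers, manual_label=None):
    """Return (label, source); a manual annotation always takes precedence."""
    if manual_label is not None:
        return manual_label, "manual annotation"
    for word in tokenize(title):
        if word in triggers:
            return triggers[word], "computerized assignment"
    return None, None
```

With a handful of tagged titles, a word such as 'dictionary' becomes a trigger while frequent but uninformative words such as 'of' do not, and any label produced from a trigger is marked as a computerized assignment so that it can be told apart from manual annotations.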
Hammarström, Harald. (2008) Automatic Annotation of Bibliographical References with Target Language.
In Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, 57-64. ACL.