Glossary

Languoid

A dialect, language or language family. A more extensive definition can be found in Nordhoff & Hammarström 2011.

Doculect

The language variety described in a specific document. For example, whatever language data is contained in a certain dictionary, grammar or specific article represents a unique doculect, without specifying whether that data constitutes an idiolect, a dialect, a variety mutually intelligible to some other variety or a mix of varieties from various locales.

Document type

The class a document belongs to. There are 16 classes. A document can belong to more than one class. The following doctypes are distinguished:

Bibliographically Oriented
bibliographical information (i.e., the language is featured in a bibliography)
Comparative-historical Study
the language is featured in a comparative study
Dialectologically Oriented
containing dialectological information, e.g., the intelligibility between different dialects, the distribution of certain isoglosses within a language
Dictionary
~ 75 pages and beyond
Ethnographic Work
ethnographic information (whether extensive or brief)
Grammar
an extensive description of most elements of the grammar ~ 150 pages and beyond
Grammar Sketch
a less extensive description of many elements of the grammar ~ 50 pages
Handbook/overview
the language is featured in a handbook/overview publication
New Testament
a new testament translation
Phonology
phonological description
Sociolinguistically Oriented
sociolinguistic information (where spoken, by how many, etc.)
Some Very Small Amount Of Data/information On A Language
some small amount of lexical or grammatical data but not sufficient for a full wordlist or a substantial account of some grammatical feature
Text
some amount of unanalyzed text data ~ 10 pages and beyond
(typological) Study Of A Specific Feature
description of some element of grammar (e.g., noun class system, verb morphology, etc.)
Unknown
Wordlist
wordlist ~ a couple of hundred words

Macro-area

An area of the globe of roughly continent size. The following areas are distinguished in Glottolog:

Africa
The continent
Australia
The continent
Eurasia
The Eurasian landmass North of Sinai. Includes Japan and islands to the North of it. Does not include Insular South East Asia.
North America
North and Middle America up to Panama. Includes Greenland.
Papunesia
All islands between Sumatra and the Americas, excluding islands off Australia and excluding Japan and islands to the North of it.
South America
Everything South of Darién

The division of the inhabited landmass into these macro-areas is optimal in the following sense. It is the division

  1. into 6 areas,
  2. for which there are at least 250 languages in each area, such that
  3. the distance between the component parts inside each area is minimized, and
  4. the length of intersections between pairs of macro-areas is minimized.
Hammarström, Harald & Mark Donohue. 2014. Principles on Macro-Areas in Linguistic Typology.
In Harald Hammarström & Lev Michael (eds.), Quantitative Approaches to Areal Linguistic Typology (Language Dynamics & Change Special Issue), 167-187. Leiden: Brill.

Pseudo-families

Sign Languages, Mixed Languages, Pidgins and Artificial Languages are treated as families in the database, even though they cannot be readily classified genealogically. The members of each group share some features, which makes it convenient to treat the groups as if they were families.

Unclassified languages

Languages for which we have some lexical and/or grammatical information, but for which this information is so scanty that it cannot be decided whether or not they are genealogically related to any other language. They are treated as belonging to the top-level pseudo-family "Unclassified". (See also the Languoids information section.)

Unclassified languoid in family X

A languoid for which

  1. enough lexical and/or grammatical information is available to assign it to a family, but for which
  2. its position within that family has not been resolved (either because of lack of data or because of lack of investigation).

(See also the Languoids information section.)

Unattested language

A language for which we have convincing evidence of its existence and distinctness from all other languages, but no grammatical or lexical information is available.

Unattested languages can often be assigned to a family on non-linguistic grounds, but here they are treated as belonging to the pseudo-family "Unclassified".

(See also the Languoids information section.)

Spurious languoid

A languoid which is cited in the literature, but whose existence is not proven beyond doubt. This includes super-families like Amerind and small languages or dialects which somehow made it into other language catalogues without proof that they actually do exist. (See also the Languoids information section.)

Most Extensive Description (MED)

The Most Extensive Description (MED) for a language is the longest document of the highest-ranking document type. From highest to lowest, the ranking is grammar, grammar sketch, dictionary/phonology/specific feature/text, wordlist, followed by the remaining document types. Note that 'description' here refers to grammatical description rather than, e.g., lexical documentation, so a grammar trumps a dictionary.
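The selection rule can be illustrated with a minimal Python sketch. It is not Glottolog's actual implementation: the `Document` class, the doctype identifiers and the sample data are hypothetical, each document is assumed to carry a single document type (although a document can belong to more than one class), and document types listed together with slashes are assumed to share one rank.

```python
from dataclasses import dataclass

# Ranking tiers, highest first, following the glossary text.
# The identifiers are illustrative, not Glottolog's own codes.
RANKING = [
    {"grammar"},
    {"grammar_sketch"},
    {"dictionary", "phonology", "specific_feature", "text"},
    {"wordlist"},
]


@dataclass
class Document:
    title: str
    doctype: str  # simplification: one document type per document
    pages: int


def tier(doctype: str) -> int:
    """Return the rank tier of a document type (0 is the highest)."""
    for i, types in enumerate(RANKING):
        if doctype in types:
            return i
    return len(RANKING)  # all remaining document types rank last


def most_extensive_description(docs):
    """Return the longest document within the best-ranked tier present."""
    if not docs:
        return None
    return min(docs, key=lambda d: (tier(d.doctype), -d.pages))


# Invented sample data for one language.
docs = [
    Document("A sketch", "grammar_sketch", 60),
    Document("A big dictionary", "dictionary", 900),
    Document("Short grammar", "grammar", 180),
    Document("Long grammar", "grammar", 420),
]
med = most_extensive_description(docs)
print(med.title, "->", med.doctype)  # Long grammar -> grammar
```

In this sketch the 900-page dictionary loses to both grammars because document type is compared before length; length only decides between documents of the same rank.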

Descriptive status of a language

The descriptive status of a language is the document type of its MED. For example, in the case of a language for which there is a grammar, a phonology and a dictionary, its MED would be the grammar and so the descriptive status would be 'grammar'. It does not matter if there are fifty grammars for the language or just one; the descriptive status would still be 'grammar'.

Computerized assignment

A large class of the bibliographical references have been annotated manually for language and document type, sometimes via a decent translation from an annotation scheme other than the one used on Glottolog. However, a large class of references have not been manually annotated. Such references are automatically tagged on the basis of words that occur in the title. For example, if the title of a bibliographical reference contains the name of a language, it may be guessed to pertain to that language. Similarly, if it contains the word 'dictionary', it is probably of the document type dictionary. Which words trigger which annotations is automatically learned from manually tagged training data (see Hammarström 2008 for details). The automatic annotation is far from perfect, but is nevertheless applied since it does more good than harm. However, assignments which are triggered by words in the title in this way, as opposed to manual annotations, are marked as "computerized assignment".
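The general idea of title-word triggers can be sketched in Python as follows. This is a toy illustration only and does not reproduce the method of Hammarström 2008: the function names, the thresholds `min_count` and `min_precision`, and the training titles are all invented for the example.

```python
from collections import Counter, defaultdict


def learn_triggers(training, min_count=2, min_precision=0.9):
    """Map a title word to a label when the word almost always
    co-occurs with that label in the manually tagged data."""
    word_total = Counter()
    word_label = defaultdict(Counter)
    for title, label in training:
        for word in set(title.lower().split()):
            word_total[word] += 1
            word_label[word][label] += 1
    triggers = {}
    for word, total in word_total.items():
        label, hits = word_label[word].most_common(1)[0]
        if total >= min_count and hits / total >= min_precision:
            triggers[word] = label
    return triggers


def tag(title, triggers):
    """Tag an unannotated title; every result is flagged as automatic."""
    labels = sorted({triggers[w] for w in title.lower().split() if w in triggers})
    return [(label, "computerized assignment") for label in labels]


# Invented, manually tagged training titles (hypothetical data).
training = [
    ("A dictionary of Hausa", "dictionary"),
    ("Swahili-English dictionary", "dictionary"),
    ("A grammar of Hausa", "grammar"),
    ("Reference grammar of Swahili", "grammar"),
]
triggers = learn_triggers(training)
print(tag("Concise dictionary of Yoruba", triggers))
# [('dictionary', 'computerized assignment')]
```

Words such as 'a' and 'of' occur with several labels and therefore never become triggers, while 'dictionary' and 'grammar' do. A title without any trigger word simply receives no automatic tag.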

Harald Hammarström. 2008. Automatic Annotation of Bibliographical References with Target Language.
Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, 57-64. ACL.