About Languoids

Glottolog aims to provide a comprehensive list of languoids (families, languages, dialects) that linguists need to be able to identify. Each languoid has a unique and persistent identifier called Glottocode, consisting of four letters and four digits [abcd1234].
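As a rough illustration of the identifier format described above, the following Python sketch checks whether a string has the shape of four letters followed by four digits. The helper name and the restriction to lowercase ASCII letters are assumptions made for this example; the text specifies only "four letters and four digits".

```python
import re

# Assumed pattern: four lowercase ASCII letters followed by four digits,
# following the description "four letters and four digits [abcd1234]".
GLOTTOCODE_SHAPE = re.compile(r"[a-z]{4}[0-9]{4}")


def looks_like_glottocode(code: str) -> bool:
    """Return True if `code` has the shape described in the text (hypothetical helper)."""
    return GLOTTOCODE_SHAPE.fullmatch(code) is not None
```

A shape check of this kind says nothing about whether a given code has actually been assigned to a languoid.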

Currently (as of 2017-03-23) there are 8175 spoken L1 languages (i.e., spoken languages traditionally used by a community of speakers as their first language).

Languages are classified (see below) into 242 families and 188 isolates, i.e., one-member families. This classification is the best guess by the Glottolog editors and the classification principles are described in Figure 1 below and the accompanying text. Users should be aware that for many groups of languages, there is little available historical-comparative research, so the classifications are subject to change as scholarship and interest in those languages increase. Please contact the editors if you have corrections to the language classification.

In addition to the genealogical trees (families and isolates), the Families page also includes the following non-genealogical trees:

  • Unattested languages
  • Unclassifiable languages
  • Pidgin languages
  • Mixed languages
  • Speech registers
  • Artificial spoken languages
  • Sign languages and auxiliary sign systems

(Glottolog also contains lists of putative languages and families that are not regarded as real languoids by the editors but that are given a Glottocode for bookkeeping purposes; these are called bookkeeping languoids and they are described further below.)

Category               Count
Spoken L1 languages     8175
Unattested                68
Unclassifiable           120
Pidgin                    80
Mixed Language            23
Artificial Language       10
Speech Register            7
Sign Language            179
All                     8444

Principles

Every putative language is considered according to the decision procedure in Figure 1. All spoken languages for which a sufficient amount of linguistic data exists—the leaves of the decision tree with double boxes around them—are deemed classifiable, and are classified into genealogical families (and isolates). The other kinds of languages are filed into the other categories that were listed above. Glottolog is complete only for classifiable languages. Regarding unattested and unclassifiable languages, see Harald Hammarström (2012). A comprehensive listing of pidgins is Peter Bakker and Mikael Parkvall (2010). This listing differentiates different levels of evidence for the existence of a pidgin, rather than a strict yes/no existence-decision. Elsewhere there are extensive lists of sign languages (Taylor, Allan R. 1996, J. Albert Bickford 2005, Kamei, Nobutaka 2004, Ulrike Zeshan 2006, Roger Blench and Andy Warren 2003, Anonymous 2007), whistled languages (Julien Meyer 2005) and artificial languages (P. O. Bartlett 2006).

Figure 1: Decision procedure for inclusion in the present language listing.

Inclusion/exclusion of languages

1. Is the putative language assertably distinct from all other known languages?

For any alleged language to be considered in the classification, we must first determine whether it is (or was) distinct from all other languages. By distinct, we mean not mutually intelligible with any other language. In principle, any convincing evidence to this effect is sufficient. For example, direct comparison of language data and testimonies of non-intelligibility with all neighbouring languages are the most straightforward kinds of evidence. But various types of evidence for long isolation from all other humans could also make a convincing case that a language is indeed distinct from all others.

For example, Flecheiros is the name given to an uncontacted group in the Javari valley in Western Brazil (Carlos Alberto Ricardo 1986). Ethnographic evidence suggests that they, if akin to anyone in the vicinity, are Kanamari (a known Katukinan language, see, e.g., Zoraide dos Anjos 2011). However, Scott Wallace (2011) recounts one meeting between a Kanamari and the Flecheiros revealing that they do not speak intelligible languages (though one Kanamari woman captured at an early age was living among the Flecheiros). Even if not totally foolproof, this appears to be convincing evidence that the Flecheiros speak a language distinct from all others.

However, all the pieces of evidence must be present. There are plenty of other cases where a speech form (often extinct) is known not to have been intelligible to speakers of some or most languages around it (e.g., Yalakalore in David M. Eberhard 2009), but this is not sufficient if it cannot be asserted for every plausible candidate. A further caveat is that testimonies must themselves be convincing to count as testimonies. There are cases where unintelligibility information comes from individuals who were in no position to judge it, e.g., they might be passing on hearsay, or passing on some kind of general impression not based solely on language.

If a putative language is not (or was not) a distinct language by these criteria, it is either a dialect of a language, or it is classified as “based on misunderstanding”. In the latter case, it is listed as a type of bookkeeping languoid (see below).

2. Are there form-meaning pairs?

For a linguistic classification, we naturally require that actual linguistic data, i.e., form-meaning pairs (as opposed to purely sociolinguistic data), form the basis for the classification. That means that some linguistic data has been collected which provides the basis for classification, but does not necessarily mean that the data in question has been published. We also require that the data is not known to have vanished, meaning that once-attested languages whose attestation now appears to be lost count as unattested. For example, grammar sketches of three extinct South American languages, Taimviae, Teutae and Agoiae, that once did exist (Daniel G. Brinton 1898):203,208 now seem to have vanished completely. Thus, the three count as unattested because it is known that the attestation is gone.

3. Has it served as the main means of communication for a human society?

There are two reasons for restricting the scope to communication systems that serve(d) as the main means of communication for a human society.

First, language classification (see below) by the comparative method explicitly or implicitly assumes that language change is governed by certain (vaguely formulated) probabilistic laws. These laws have a plausible theoretical foundation if the communication system serve(d) as the main means of communication for a human society, but do not necessarily apply to all forms of normed human communication systems. For example, radical vocabulary replacement within one generation of speakers would be highly unlikely for a main means of communication of a society (communication would break down!), but might be possible in an auxiliary communication system taught to adults. Similarly, sound change is thought to come about as humans hear and (mis)interpret spoken analog communication (John J. Ohala 1993, Brown, Cecil H. and Eric W. Holman and Søren Wichmann 2013) and would, for that reason, not be expected in, e.g., computer programming languages.

Second, one of the purposes of doing language classification in the first place is to obtain insights into the history of the speakers. All human societies have a main means of communication, so such a communication system reflects the history of a human society. It is not necessarily the case that all forms of normed human communication systems reflect the history of their speakers. For example, a whistled language may come and go in the course of the history of a people, whereas a people cannot be without a main speech form for any period of history.

If a putative language is not the main means of communication for a society, it is classified as a pidgin or as a speech register. (Whistled and drummed languages as well as jargons are not currently included in Glottolog.)

4. Is the modality speech?

The present classification of languages is restricted to spoken languages for the sole reason that there exists a methodology for establishing genealogical relationships for spoken languages (Campbell, Lyle and Poser, William J. 2008). This is not necessarily the case for signed languages.

Sign languages are grouped into a variety of subgroups that are also thought to reflect genealogical history. But here the same theoretical foundation is lacking, and thus the sign language groupings are much less secure. The sign language groupings are not accountable like the spoken language groupings, i.e., accompanied by a reference that justifies the outcome according to a well-understood theory. Rather, the sign language groupings reflect the impression of origin by individual researchers and/or simple lexicostatistical counts.

5. Are the form-meaning pairs enough to distinguish between different classification proposals?

We also require that the number of form-meaning pairs is sufficient for a classification. There is no universal fixed threshold for how much is sufficient, as this depends on how closely related the language is to other known languages. An approximate minimal requirement is 50 items or so of basic vocabulary, i.e., not personal names or special domain vocabulary. For example, the extinct language Gamela of northeastern Brazil is known from 19 words only (Curt Nimuendajú 1937:68), hardly enough for a classification. It is arguable that the sound-values encoded in the Linear A script can be gauged, but little, if any, meaning can be inferred (Yves Duhoux 1998, Best, Jan 1989, K. Aartun 1997), rendering the data insufficient for classification.

If not enough form-meaning pairs are attested to allow classification, the language is filed under Unclassifiable.
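Questions 1 to 5 can be read as a chain of yes/no checks. The Python sketch below is one possible reading of that chain, not a transcription of Figure 1: it assumes the questions are asked in their numbered order, and all class, field and function names are hypothetical. Mixed and artificial languages are left out because the questions as stated in the text do not say how they are reached.

```python
from dataclasses import dataclass


@dataclass
class PutativeLanguage:
    # Each field records the answer to one of the questions 1-5 in the text.
    distinct_from_all_others: bool     # question 1
    has_form_meaning_pairs: bool       # question 2 (collected and not known to be lost)
    main_means_of_communication: bool  # question 3
    modality_is_speech: bool           # question 4
    enough_pairs_to_classify: bool     # question 5


def inclusion_category(lang: PutativeLanguage) -> str:
    """Walk through questions 1-5 in order (assumed order) and name the outcome."""
    if not lang.distinct_from_all_others:
        return "dialect or based on misunderstanding"
    if not lang.has_form_meaning_pairs:
        return "unattested"
    if not lang.main_means_of_communication:
        return "pidgin or speech register"
    if not lang.modality_is_speech:
        return "sign language"
    if not lang.enough_pairs_to_classify:
        return "unclassifiable"
    return "classifiable"
```

Under these assumptions, a spoken language attested only by a handful of words (as described for Gamela) would answer "no" only at question 5 and come out as unclassifiable, while a group whose language is distinct but undocumented (as described for Flecheiros) would stop at question 2 and come out as unattested.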

Classification

6. Are the form-meaning similarities to at least one other language best explained by inheritance from a common ancestor?

Given a language with sufficient attestation, one can compare it with the remaining languages. If there are similarities to other language(s) that can be shown to exceed chance, there are three possible kinds of explanations: universals, contact or inheritance from a common ancestor (Campbell, Lyle and Poser, William J. 2008). If the best explanation for the similarities is inheritance from a common ancestor, the languages are classified as belonging to the same family. A language which, by this principle, does not belong to the same family as any other language is called an isolate. What constitutes the “best” explanation is not a static judgment, but subject to change as new considerations and new data appear. For example, some lexical parallels between Nadahup, Kakua-Nukak and Puinave (Rivet, Paul and Constant Tastevin 1920) were for a long time considered by many to be “best” explained by a genealogical relationship. However, thanks to increased documentation and interest in the languages, the explanation of the similarities as loans, chance resemblances and even data errors is now favoured (Patience Epps 2008:5-9, Katherine Bolaños and Patience Epps 2009, Katherine Bolaños 2011, Girón, Jesús Mario 2008:419-439). Not only the state of documentation and investigation of specific groups may alter the perceived “best” explanation, but also new arguments regarding the probative value of various kinds of evidence. For example, Malcolm Ross (1995), Malcolm Ross (2001), Ross, Malcolm (2005) argues that similarities in pronoun signatures can be used to create preliminary groupings of Papuan languages, whereas Harald Hammarström (2012), using data from all over the world, argues that such usage of the evidence is not probative for genealogical groupings.

There is the theoretical possibility that a language with sufficient attestation has simply not (yet) been compared to other relevant languages to determine if there are any non-random similarities. In practice, we know of no such language, and therefore have no separate category for languages inhabiting this logical possibility.

7. Has there been sufficient comparison to determine its closest relative(s)?

Given a language and the other languages that belong to the same family, if insufficient data is available or insufficient comparative work has been done to determine the closest relative(s) of the language at hand, it is left unclassified within the finest-level (sub)family that can be discerned.

For example, the subgrouping study of the Greater Awyu subfamily by Lourens de Vries and Ruth Wester and Wilco van den Heuvel (2012) uses shared innovations in verb morphology as the most reliable indicator of linguistic ancestry because, in a landscape of dialect chains and clan loyalty shifts (de Vries, Lourens J. 2012), lexicon and phonology are thought to be particularly vulnerable to diffusion. Within the Greater Awyu languages, there is a binary split between the Becking-Dawi group and the Awyu-Dumut group. Awyu-Dumut, in turn, divides into three large dialect chains: Awyu, Dumut and Ndeiram. For one language, Sawi (clearly belonging to the Greater Awyu family on lexical and pronominal grounds), no data on verb morphology is available, so its position within the subfamily cannot be determined and it is consequently left unclassified within it.

In other cases, the bottleneck is not data availability but the work required to ascertain the subgrouping. Plenty of data exists for Adamawa Fali and other Volta-Congo languages (although patchily distributed), but subgrouping in the Volta-Congo languages is a large and complicated issue, leaving the subgrouping of Adamawa Fali unresolved (Boyd, Raymond 1989):180.

8. Is there a subgrouping based on shared innovations?

The preferred subgrouping criterion is a subgrouping based on shared innovations (Malcolm Ross 1988, Malcolm Ross 1997). For each language where such is available, that subgrouping is followed.

9. Are there other, weaker, arguments for subgrouping?

If no subgrouping based on shared innovations is available, whatever other (weaker) arguments exist are considered. Weaker arguments would be shared similarities in general, e.g., lexicostatistics, which may reflect borrowings and/or retentions. The subgrouping supported by the least bad such evidence is followed. For example, two independent published opinions exist on the internal subgrouping of the Mek languages, namely that of Volker Heeschen (1978), Volker Heeschen (1992) and that which appears in Peter J. Silzer and Heljä Heikkinen-Clouse (1991). The former gives a lexicostatistical argument for a subgrouping while the latter lists a subgrouping without pointing to any evidence at all. The lexicostatistical evidence is preferable to no evidence at all, and is therefore followed.
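The preference order in questions 8 and 9 can be sketched as a ranking over kinds of evidence. The Python example below is an illustration only: the three evidence labels, the numeric ranks and all names are assumptions for this sketch. The text does not say what happens when the only available proposal cites no evidence; this sketch simply returns it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed ranking for this sketch: a lower number means stronger evidence.
EVIDENCE_RANK = {
    "shared innovations": 0,  # preferred criterion (question 8)
    "lexicostatistics": 1,    # a weaker argument (question 9)
    "none": 2,                # subgrouping listed without any evidence
}


@dataclass
class Proposal:
    source: str    # label for the published proposal
    evidence: str  # one of the keys of EVIDENCE_RANK


def preferred_proposal(proposals: List[Proposal]) -> Optional[Proposal]:
    """Return the proposal backed by the strongest kind of evidence, if any."""
    if not proposals:
        return None
    return min(proposals, key=lambda p: EVIDENCE_RANK[p.evidence])
```

Applied to the Mek case described above, a proposal tagged "lexicostatistics" would be chosen over one tagged "none".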

Accountability

The resulting classification is presented in the Glottolog tree. Detailed evidence that the presented classification actually conforms to the principles above is provided in the form of references to work containing or subsuming the required evidence for the decisions reflected in the classification.

On the leaf level, i.e., for languages, references to actual data for each language are given, justifying principles 1-5.

For the classification, principles 6-9, references justifying nodes are displayed in the green box below the tree-fragment box. Wherever necessary, a comment accompanies the reference if the decision reflected in the tree does not follow straightforwardly from the argumentation in the referenced work(s).

We do not always conform to the interpretation and conventions of the authors cited as justification. It may be, for example, that an author states that a certain group should be assumed on purely geographic grounds, in anticipation of future work, or some other reason not admissible as justification in the present classification. In such cases, the justificational value of the reference is on the (lack of) evidence and/or arguments found in the reference, not necessarily the interpretation of this state given in that reference.

Even though the information given in the current version of Glottolog is fairly substantial, we cannot guarantee that we have included all the relevant information yet. We decided to release Glottolog early rather than wait for the completed version, which will be evolving continually anyway.

Names of families and subfamilies

Whenever possible, names of families and subfamilies are taken over from the current literature. This is considered possible when there is no name clash (with another language or (sub-)family in the world) and the name in the literature in principle refers to the intended set of languages. If the (sub-)family in the present classification differs in any significant way from that associated with a certain name, we have introduced a new unique name, which is often not found in the literature. The new names are all unique and unambiguous but otherwise, for the current edition of Glottolog, we spent little effort on finding the name optimal in describing its set of languages (e.g., with the name of a central river or by taking the word for “man”) or optimal in the system of names in the region or greater family (e.g., by using a name with a Spanish flavour if the surrounding (sub-)families have Spanish-flavoured names). A number of names may look somewhat artificial (e.g., Nuclear A, or, A-B-C) or out of place (e.g., a subfamily with an Anglophone name whose parent has a Francophone name), reflecting the fact that no particular value is attached to names beyond being unique and unambiguous.

Example

For example, Tucanoan is a South American language family. Chacon, Thiago C. (2012) contains a subgrouping based on shared phonological innovations and defines the position in the tree for all the below nodes except Arapaso, Miriti, Macaguaje and Tama, which fall outside the scope of his study. Thus, Chacon, Thiago C. (2012) is given as the reference justifying the top-level family as well as the reference justifying most intermediate nodes. The remaining languages, Arapaso, Miriti, Macaguaje and Tama do exist (or did exist) and they are arguably Tucanoan. For Macaguaje and Tama, a small amount of data is attested and published, and this is enough for Sergio Elías Ortiz (1965):133 to show that they are within the Siona-Secoya group. Thus, here Sergio Elías Ortiz (1965):133 is cited as the reference justifying the position of Macaguaje and Tama. For Miriti and Arapaso, Brüzzi Alves da Silva, Alcionilio (1972) collected short wordlists of them, and concluded that they were Tucanoan, but he gives no further information that would allow us to infer their relation to each other or to other Tucanoan languages. The wordlists themselves were never published, and are possibly now lost (but this is not certain). Hence, Arapaso and Miriti are labeled Unclassified Tucanoan languages. There is no implication that Arapaso and Miriti would form a subgroup in the sense of having a common ancestor unique only to them.
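The point that an "Unclassified" label is a holding place rather than a subgroup can be made concrete with a small tree sketch in Python. Everything here is illustrative: the `Node` class, the `is_clade` flag and the function name are hypothetical, and the tree is heavily simplified. Only the nodes named in the paragraph above are included, and intermediate nodes between Tucanoan and Siona-Secoya are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    # False marks a holding node such as "Unclassified ...", which does not
    # claim a common ancestor unique to its members.
    is_clade: bool = True


# Simplified fragment: only nodes mentioned in the text, intermediate levels omitted.
tucanoan = Node("Tucanoan", [
    Node("Siona-Secoya", [Node("Macaguaje"), Node("Tama")]),
    Node("Unclassified Tucanoan", [Node("Arapaso"), Node("Miriti")], is_clade=False),
])


def smallest_clade(root: Node, language: str) -> Optional[str]:
    """Name of the lowest genealogical (non-holding) group containing `language`."""
    def walk(node: Node, current: Optional[str]) -> Optional[str]:
        if node.is_clade and node.children:
            current = node.name  # a real group: remember it as we descend
        if node.name == language:
            return current
        for child in node.children:
            found = walk(child, current)
            if found is not None:
                return found
        return None
    return walk(root, None)
```

In this sketch the lookup for Arapaso skips the holding node and reports the family itself, whereas the lookup for Tama reports Siona-Secoya.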

Dialects

For the current edition of Glottolog, we spent little effort on making dialect classifications consistent and on providing references for dialects. Most of the information on dialects in Glottolog is lifted from the Multitree project and contains numerous errors and inconsistencies which we are aware of, but have not yet had the resources to systematically correct. We hope to provide more information on dialects in the future.

Bookkeeping languoids

Glottolog contains lists of three types of languoids that the editors do not regard as real languoids but that are included for bookkeeping purposes: languages based on misunderstanding, languages that need to be reassigned, and pseudo-families.

Languages based on misunderstanding
Sometimes linguists claim the existence of a language that later turns out to be a misunderstanding. For instance, Yarsun was once claimed by Ethnologue to be an Austronesian language of northern New Guinea, and there is still an ISO 639-3 code for it. However, recent research provided insufficient evidence that such a language ever existed in the sense of being distinct from every other language. In such cases, ISO 639-3 codes are often retired, because active ISO 639-3 codes must be about real languages. Glottolog never retires Glottocodes and keeps them also for bookkeeping purposes.
Languoids that need to be reassigned
Some languoids have been classified as non-language languoids, but have not been reassigned yet. These are put in this preliminary category. For example, ISO 639-3 once had a language “Durango Nahuatl” [nln]. This language was split into two distinct languages “Eastern Durango Nahuatl” [azd] and “Western Durango Nahuatl” [azn]. The code [nln] was retired. Glottolog still includes the former “Durango Nahuatl” with the ISO code [nln], but in the future this code will be reassigned to the family “Durango Nahuatl” [dura1246] (in contrast to ISO 639-3, Glottocodes may also be assigned to families). Similarly, Ethnologue used to have an Arawakan language “Ipeka-Tapuia” [paj], which was merged into Curripaco, because it turned out to be a dialect of this language. The code [paj] was retired, but in Glottolog it will be assigned to the Ipeka-Tapuia dialect of Curripaco.
Pseudo-families
Since Glottolog also serves to classify bibliographical references, it also contains a number of pseudo-families such as Nostratic, Altaic and Mon-Khmer, which are not recognized as families in Glottolog, but for which a substantial literature exists that linguists may still be interested in.

Acknowledgements

Thanks

  • To Tim Usher for many points of discussion re Papuan languages
  • To Mark Donohue for many points of discussion re Papuan and Austronesian languages
  • To Hilário de Sousa and Andy Hsiu for many points of discussion and clarification regarding languages in the Sinosphere
  • To Matthew Dryer for many points of discussion re Papuan and other languages
  • To Roger Blench for all things Nigerian and beyond
  • To Tom Güldemann, the 'bald eagle' of African language classification
  • To Bonny Sands for help with access to various valuable documents
  • To Mikael Parkvall for help with “Creole” language classification
  • To Willem Adelaar for many points of discussion re South American languages
  • To Raoul Zamponi for help with access to various valuable documents
  • To Guillaume Segerer for help with access to various valuable documents
  • To all authors of descriptive and comparative works on the languages of the world
  • To 25 libraries for access and services
  • To over 250 individuals who provided confirming and/or clarificatory information