Computational historical linguistics

Gerhard Jäger
University of Tübingen
Institute of Linguistics
Wilhelmstr. 19, 72074 Tübingen, Germany

Computational approaches to historical linguistics have been proposed for half a century. Within the last decade, this line of research has received a major boost, owing both to the transfer of ideas and software from computational biology and to the release of several large electronic data resources suitable for systematic comparative work.

In this article, some of the central research topics of this new wave of computational historical linguistics are introduced and discussed. These are automatic assessment of genetic relatedness, automatic cognate detection, phylogenetic inference and ancestral state reconstruction. They will be demonstrated by means of a case study of automatically reconstructing a Proto-Romance word list from lexical data of 50 modern Romance languages and dialects.

1 Introduction

Historical linguistics is the oldest sub-discipline of linguistics, and it constitutes an amazing success story. It gave us a clear idea of the laws governing language change, as well as detailed insights into the languages (and thus the cultures and living conditions) of prehistoric populations which left no written records. The diachronic dimension of languages is essential for a proper understanding of their synchronic properties. Also, the findings from historical linguistics are an important source of information for other fields of prehistory studies, such as archaeology, paleoanthropology and, in recent years, paleogenetics (Renfrew, 1987; Pietrusewsky, 2008; Anthony, 2010; Haak et al., 2015, and many others).

The success of historical linguistics is owed to a large degree to a collection of very stringent methodological principles that go by the name of the comparative method (Meillet, 1954; Weiss, 2015). It can be summarized by the following workflow (from Ross & Durie, 1996, pp. 6–7):

  1. Determine on the strength of diagnostic evidence that a set of languages are genetically related, that is, that they constitute a ‘family’.

  2. Collect putative cognate sets for the family (both morphological paradigms and lexical items).

  3. Work out the sound correspondences from the cognate sets, putting ‘irregular’ cognate sets on one side.

  4. Reconstruct the protolanguage of the family as follows:

    a. Reconstruct the protophonology from the sound correspondences worked out in (3), using conventional wisdom regarding the directions of sound changes.

    b. Reconstruct protomorphemes (both morphological paradigms and lexical items) from the cognate sets collected in (2), using the protophonology reconstructed in (4a).

  5. Establish innovations (phonological, lexical, semantic, morphological, morphosyntactic) shared by groups of languages within the family relative to the reconstructed protolanguage.

  6. Tabulate the innovations established in (5) to arrive at an internal classification of the family, a ‘family tree’.

  7. Construct an etymological dictionary, tracing borrowings, semantic change, and so forth, for the lexicon of the family (or of one language of the family).

In practice, the method is not applied in a linear, pipeline-like fashion. Rather, the results of each intermediate step are subsequently used to inform earlier as well as later steps. This workflow is graphically depicted in Figure 1.

Figure 1: Workflow of the comparative method (according to Ross & Durie, 1996)

The steps (2)–(7) each involve a systematic, almost mechanical comparison and evaluation of many options such as cognacy relations, proto-form reconstructions, or family trees. The first step, establishing genetic relatedness, is less regimented, but it generally involves a systematic comparison of many variables from multiple languages as well. It is therefore not surprising that there have been many efforts to formalize parts of this workflow to a degree sufficient to implement it on a computer.

Lexicostatistics (e.g. Swadesh, 1952, 1955, and much subsequent work) can be seen as an early attempt to give an algorithmic treatment of step (6), even though it predates the computer age. Since the 1960s, several scholars have applied computational methods within the overall framework of lexicostatistics (cf. e.g. Embleton, 1986, inter alia). Likewise, there have been repeated efforts at computational treatments of other aspects of the comparative method, such as Ringe (1992), Baxter & Manaster Ramer (2000) and Kessler (2001) for step (1), Kay (1964) for step (2), Kondrak (2002) for steps (2) and (3), Lowe & Mazaudon (1994) for steps (2) and (4), Oakes (2000) for steps (2)–(7), Covington (1996) for step (3), and Ringe et al. (2002) for step (6), to mention just a few of the earlier contributions.

There is also a plethora of exciting work using historical corpora from different stages of the same language to track lexical, grammatical and semantic change by computational means (see for instance the overview in Hilpert & Gries, 2016, and the literature cited therein).

While the proposals mentioned above mostly constitute isolated efforts of historical and computational linguists, the emerging field of computational historical linguistics has received a major impetus since the early 2000s from the work of computational biologists such as Alexandre Bouchard-Côté, Russell Gray, Robert McMahon, Mark Pagel, or Tandy Warnow and co-workers, who applied methods from their field to the problem of reconstructing language history, often in collaboration with linguists. This research trend might be dubbed computational phylogenetic linguistics, as it draws heavily on techniques of phylogenetic inference from computational biology (Gray & Jordan, 2000; Gray & Atkinson, 2003; McMahon & McMahon, 2005; Pagel et al., 2007; Atkinson et al., 2008; Gray et al., 2009; Dunn et al., 2011; Bouckaert et al., 2012; Bouchard-Côté et al., 2013; Pagel et al., 2013; Hruschka et al., 2015).

In recent years, more and more large collections of comparative linguistic data have become available in digital form, giving the field another boost. The following list gives a sample of the most commonly used databases; it is necessarily incomplete as new data sources are continuously made public.

  • Cognate-coded word lists

    • Indo-European Lexical Cognacy Database (IELex): collection of 225-concept Swadesh lists from 163 Indo-European languages (based on Dyen et al., 1992). Entries are given in orthography with manually assigned cognate classes; for some of the entries, IPA transcriptions are given.

    • Austronesian Basic Vocabulary Database (ABVD; Greenhill et al., 2008): collection of 210-item Swadesh lists for 1,467 languages from the Pacific region, mostly belonging to the Austronesian language family. Entries are given in phonetic transcription with manually assigned cognate classes.

  • Phonetically transcribed word lists

    • ASJP database (compiled by the Automatic Similarity Judgment Program; Wichmann et al., 2016): collection of word lists for 7,221 doculects (languages and dialects) over 40 concepts (100-item word lists for ca. 300 languages). Entries are given in phonetic transcription.

  • Grammatical and typological classifications

    • World Atlas of Language Structures (Haspelmath et al., 2008): manual expert classifications of 2,679 languages and dialects according to 192 typological features.

    • Syntactic Structures of the World’s Languages: classification of 274 languages according to 148 syntactic features.

  • Expert language classifications

    • Ethnologue (Lewis et al., 2016): genetic classification of 7,457 languages, along with information about number of speakers, location, and viability.

    • Glottolog (Hammarström et al., 2016): genetic classification of 7,943 languages and dialects, along with information about geographic locations and extensive bibliographic references.

Additionally, there is a growing body of diachronic corpora of various languages. The focus of this article is on computational work inspired by the comparative method, so this line of work will not be covered further here.

2 A program for computational historical linguistics

Conceived in a broad sense, computational historical linguistics comprises all efforts deploying computational methods to answer questions about the history of natural languages. As spelled out above, there is a decades-old tradition of this kind of research.

In this article, however, the term will be used in a rather narrow sense to describe an emerging subfield which has reached a certain degree of convergence regarding research goals, suitable data sources, and the computational methods and tools to be deployed. I will use the abbreviation CHL to refer to computational historical linguistics in this narrow sense. The following remarks strive to describe this emerging consensus. They are partially programmatic in nature, though; not all researchers active in this domain will agree with all of them.

CHL is informed by three intellectual traditions:

  • the comparative method of classical historical linguistics,

  • computational biology, especially regarding sequence alignment (cf. Durbin et al., 1989) and phylogenetic inference (see, e.g., Ewens & Grant, 2005; Chen et al., 2014), and

  • computational linguistics in general, especially modern statistical Natural Language Processing (NLP).

CHL shares, to a large degree, the research objectives of the comparative method. The goal is to reconstruct the historical processes that led to the observed diversity of extant or documented ancient languages. This involves, inter alia, establishing cognacy relations between words and morphemes, identifying regular sound correspondences, inferring family trees (phylogenetic trees or simply phylogenies in the biology-inspired terminology common in CHL), and reconstructing proto-forms and historical processes such as sound laws and lexical innovations.

CHL’s guiding model is adapted from computational biology. The history of a group of languages is represented by a phylogenetic tree (including branch lengths), with observed linguistic varieties at the leaves of the tree. Splits in a tree represent diversification events, i.e., the separation of an ancient language into daughter lineages. Language change is conceptualized as a continuous-time Markov process applying to discrete, finite-valued characters. (Details will be spelled out below.) Inference amounts to finding the model (a phylogenetic tree plus a parameterization of the Markov process) that best explains the observed data.

Last but not least, CHL adopts techniques and methodological guidelines from statistical NLP. The pertinent computational tools, such as string comparison algorithms, to a certain degree overlap with those inspired by computational biology. Equally important are certain methodological standards from NLP and machine learning.

Generally, work in CHL is a kind of inference, where a collection of data are used as input (premises) to produce output data (conclusions). Input data can be phonetically or orthographically transcribed word lists, pairwise or multiply aligned word lists, grammatical feature vectors etc. Output data are for instance cognate class labels, alignments, phylogenies, or proto-form reconstructions. Inference is performed by constructing a model and training its parameters. Following the standards in statistical NLP, the following guiding principles are desirable when performing inference:

  • Replicability. All data used in a study, including all manual pre-processing steps, is available to the scientific community. Likewise, each computational inference step is either documented in sufficient detail to enable re-implementation, or made available as source code.

  • Rigorous evaluation. The quality of the inference, or goodness of fit of the trained model, is evaluated by applying a well-defined quantitative measure to the output of the inference. This measure is applicable to competing models for the same inference task, facilitating model comparison and model selection.

  • Separation of training and test data. Different data sets are used for training and evaluating a model.

  • Only raw data as input. Only such data are used as input for inference that can be obtained without making prior assumptions about the inference task. For instance, word lists in orthographic or phonetic transcription are suitable as input if the transcriptions were produced without using diachronic information.

The final criterion is perhaps the most contentious one. It excludes, for instance, the use of orthographic information in languages such as English or French for training purposes, as the orthographic conventions of those languages reflect the phonetics of earlier stages. Also, it follows that the cognate class labels from databases such as IELex or ABVD, as well as expert classifications such as Ethnologue or Glottolog, are unsuitable as input for inference and should only be used as gold standard for training and testing.

Conceived this way, CHL is much narrower in scope than, e.g., computational phylogenetic linguistics. For instance, inference about the time depth and homeland of language families (such as Gray & Atkinson, 2003; Bouckaert et al., 2012) is hard to fit into this framework as long as there are no independent test data to evaluate models against (but see Rama, 2013). Also, it is common practice in computational phylogenetic linguistics to use manually collected cognate classifications as input for inference (Gray & Jordan, 2000; Gray & Atkinson, 2003; Pagel et al., 2007; Atkinson et al., 2008; Gray et al., 2009; Dunn et al., 2011; Bouckaert et al., 2012; Bouchard-Côté et al., 2013; Pagel et al., 2013; Hruschka et al., 2015). While the results obtained this way are highly valuable and insightful, they are not fully replicable, since expert cognacy judgments are necessarily subjective and variable. Also, the methods used in the work mentioned do not generalize easily to under-studied language families, since correctly identifying cognates between distantly related languages requires the prior application of the classical comparative method, and the necessary research has not been done with equal intensity for all language families.

3 A case study: reconstructing Proto-Romance

In this section a case study will be presented that illustrates many of the techniques common in current CHL. Training data are 40-item word lists from 50 Romance (excluding Latin) and 3 Albanian languages and dialects (the inclusion of Albanian will be motivated below) in phonetic transcription from the ASJP database (Wichmann et al., 2016; version 17, accessed on August 2, 2016). The inference goal is the reconstruction of the corresponding word list from the latest common ancestor of the Romance languages and dialects (Proto-Romance, i.e., some version of Vulgar Latin). The results will be tested against the Latin word lists from ASJP. A subset of the data used is shown in Table 1 for illustration.

concept   Albanian  Spanish   Italian   Romanian  Latin
horn      bri       kerno     korno     korn      kornu
knee      Tu        rodiya    jinokkyo  jenuNk    genu
mountain  mal       sero      monta5a   munte     mons
liver     m3lCi     igado     fegato    fikat     yekur
we        ne        nosotros  noi       noi       nos
you       ju        ustet     tu        tu        tu
person    vet3      persona   persona   persoan3  persona
louse     morr      pioho     pidokko   p3duke    pedikulus
new       iri       nuevo     nwovo     nou       nowus
hear      d3gyoy    oir       ud        auz       audire
sun       dyell     sol       sole      soare     sol
tree      dru       arbol     albero    pom       arbor
breast    kraharor  peCo      pEtto     pept      pektus
drink     pirye     bebe      bere      bea       bibere
hand      dor3      mano      mano      m3n3      manus
die       vdes      mori      mor       mur       mori
name      em3r      nombre    nome      nume      nomen
eye       si        oho       okkyo     ok        okulus
Table 1: Sample of word lists used

The phonetic transcriptions use the 41 ASJP sound classes (cf. Brown et al., 2013). Diacritics are removed. If the database lists more than one translation for a concept in a given language, only the first one is used.

The following steps will be performed (mirroring to a large degree the steps of the comparative method):

  1. Demonstrate that the Romance languages and dialects are related.

  2. Compute pairwise string alignments and string similarities between synonymous words from different languages/dialects.

  3. Cluster the words for each concept into automatically inferred cognate classes.

  4. Infer a phylogenetic tree (or a collection of trees).

  5. Perform Ancestral State Reconstruction for cognate classes to infer the cognate class of the Proto-Romance word for each concept.

  6. Perform multiple sequence alignment of the reflexes of those cognate classes within the Romance languages and dialects.

  7. Perform Ancestral State reconstruction to infer the state (sound class or gap) of each column in the multiple sequence alignments.

  8. Compare the results to the Latin ASJP word list.

3.1 Demonstration of genetic relationship

In Jäger (2013), a dissimilarity measure between ASJP word lists is developed. Space does not permit explaining it in any detail here. Suffice it to say that this measure is based on the average string similarity between the corresponding elements of two word lists while controlling for the possibility of chance similarities. Let us call this dissimilarity measure between two word lists the PMI distance, since it makes crucial use of the pointwise mutual information (PMI) between phonetic strings.

To demonstrate that all Romance languages and dialects used in this study are mutually related, we will use the ASJP word lists from Papunesia, i.e., “all islands between Sumatra and the Americas, excluding islands off Australia and excluding Japan and islands to the North of it” (Hammarström et al., 2016), as training data and the ASJP word lists from Africa as test data. (We chose different macro-areas for training and testing to minimize the risk that the data are non-independent due to common ancestry or language contact.) The input for inference consists of PMI distances between pairs of languages/dialects, and the output is the classification of each pair as related or unrelated, where two doculects count as related if they belong to the same language family according to the Glottolog classification.

Figure 2: PMI distances between related and unrelated doculects from Papunesia, and between the Romance doculects

The figure illustrates that all doculect pairs with a sufficiently small PMI distance are, with very high probability, related. The largest PMI distance among Romance dialects (between Aromanian and Nones) is 0.65.

A statistical test confirms this impression. We estimated the cumulative density function of the PMI distances of the unrelated doculect pairs from the training data, using the R package logspline (Kooperberg, 2016). If a pair of doculects has a PMI distance $x$, the value of the cumulative density function at $x$ can then be interpreted as the (one-sided) $p$-value for the null hypothesis that the doculects are unrelated.
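As a rough illustration of this test, the following Python sketch replaces the logspline fit with a plain empirical cumulative distribution. That substitution is an assumption of this example, not the procedure used in the study, and the function name `empirical_p_value` and the toy distances are hypothetical.

```python
from bisect import bisect_right


def empirical_p_value(unrelated_distances, x):
    """One-sided p-value for the null hypothesis 'unrelated'.

    Returns the share of unrelated training pairs whose PMI distance
    is at most x (an empirical stand-in for the fitted CDF).
    """
    ranked = sorted(unrelated_distances)
    return bisect_right(ranked, x) / len(ranked)


# Made-up PMI distances of unrelated training pairs.
training = [0.70, 0.80, 0.90, 1.00]
print(empirical_p_value(training, 0.75))
```

A smoothed density estimate such as logspline avoids the coarse steps (and the exact zeros) that an empirical distribution produces in the sparsely populated lower tail, which is where the interesting distances lie.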

Using a significance threshold $\alpha$, we say that a doculect pair is predicted to be related if the model predicts it to be unrelated with a probability below $\alpha$. In Table 2, the predictions are tabulated against the Glottolog gold standard.

                        gold: unrelated   gold: related
prediction: unrelated         1,254,726         787,023
prediction: related                 532         153,279
Table 2: Contingency table of gold standard versus prediction for the test data

These results amount to ca. 0.04% false positives (532 of the 1,255,258 unrelated pairs) and ca. 84% false negatives (787,023 of the 940,302 related pairs). This test ensures that the chosen model and threshold are sufficiently conservative to keep the risk of wrongly assessing doculects as related small. Since the method is so conservative, though, it produces a large number of false negatives.

In the next step, we compute the probability of being unrelated for all pairs of Romance doculects, using the model obtained from the training data. After applying the Holm-Bonferroni method to control for multiple tests, even the highest adjusted $p$-value for the null hypothesis that the doculect pair in question is unrelated falls below the significance threshold. We can therefore reject the null hypothesis for all Romance doculect pairs.
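The Holm-Bonferroni adjustment mentioned above can be sketched in a few lines of Python. The function name `holm_bonferroni` and the example $p$-values are hypothetical and serve only as an illustration of the general technique.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k counted from 0) is scaled by m - k.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for three doculect pairs.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

The null hypothesis is rejected for every pair whose adjusted value stays below the chosen significance threshold.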

3.2 Pairwise string comparison

All subsequent steps rely on a systematic comparison of words for the same concept from different doculects. Let us consider as an example the words for water from Catalan and Italian from ASJP, “aigua” and “acqua”. Both are descendants of the Latin “aqua” (Meyer-Lübke, 1935, p. 46). In ASJP transcription, these are aixw~3 and akwa. The sequence w~ in the Catalan word encodes a diacritic (indicating labialization of the preceding segment) and is removed in the subsequent processing steps.

A pairwise sequence alignment of two strings arranges them in such a way that corresponding segments are aligned, possibly inserting gap symbols for segments in one string that have no correspondent in the other string. For the example, the historically correct alignment would arguably be as follows:

    a i x - 3
    a - k w a

In this study, the quality of a pairwise alignment is quantified as its aggregate pointwise mutual information (PMI). (See List, 2014, for a different approach.) The PMI between two sound classes $a$ and $b$ is defined as

$$\mathrm{PMI}(a, b) \doteq \log \frac{s(a, b)}{q(a)\, q(b)} \qquad (1)$$

Here $s(a, b)$ is the probability that $a$ is aligned to $b$ in a correct alignment, and $q(a)$ and $q(b)$ are the probabilities of occurrence of $a$ and $b$ in a string. If one of the two symbols is a gap, the PMI score is a gap penalty. We use affine gap penalties, i.e., the gap penalty is reduced if the gap is preceded by another gap.

If the PMI scores for each pair of sound classes and the gap penalties are known, the best alignment between two strings (i.e., the alignment maximizing the aggregate PMI score) can efficiently be computed using the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970).
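The following Python sketch illustrates the Needleman-Wunsch dynamic program. It is a simplification under stated assumptions: it uses a single linear gap penalty rather than the affine penalties described above, and the scoring function `levenshtein_score` (0 for a match, -1 for a mismatch) is a hypothetical stand-in for real PMI scores.

```python
def needleman_wunsch(x, y, score, gap=-1.0):
    """Return (best score, aligned x, aligned y) for a linear gap penalty."""
    n, m = len(x), len(y)
    # dp[i][j] = best aggregate score for aligning x[:i] with y[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # (mis)match
                dp[i - 1][j] + gap,                            # gap in y
                dp[i][j - 1] + gap,                            # gap in x
            )
    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def levenshtein_score(a, b):
    """Hypothetical scoring: 0 for identical sound classes, -1 otherwise."""
    return 0.0 if a == b else -1.0


print(needleman_wunsch("kornu", "korn", levenshtein_score))
```

With trained PMI scores passed as `score`, the same routine would prefer alignments of historically corresponding sounds over mere identity; supporting affine gaps would require tracking separate tables for open and extended gaps.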

The quantities $s(a, b)$ and $q(a)$ must be estimated from the data. Here we follow a simplified version of the parameter estimation technique from Jäger (2013). In a first step, we set

$$\mathrm{PMI}_0(a, b) \doteq \begin{cases} 0 & \text{if } a = b \\ -1 & \text{otherwise} \end{cases}$$

Also, we set the initial gap penalties to $-1$. (This amounts to Levenshtein alignment.) Using these parameters, all pairs of words for the same concept from different doculects are aligned.

From those alignments, $s(a, b)$ is estimated as the relative frequency of $a$ and $b$ being aligned among all non-gap alignment pairs, while $q(a)$ is estimated as the relative frequency of sound class $a$ in the data. The PMI scores are then estimated using equation (1). For the gap penalties we used the values from Jäger (2013) for opening and for extending gaps. Using those parameters, all synonymous word pairs are re-aligned.
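The estimation step can be sketched as follows in Python. This is a minimal illustration, not the code used in the study: the function name `estimate_pmi` is hypothetical, each aligned pair is counted in both directions so that the scores come out symmetric (an assumption of this sketch), and no smoothing is applied, so sound-class pairs that are never aligned receive no score at all.

```python
import math
from collections import Counter


def estimate_pmi(alignments, gap="-"):
    """Estimate PMI scores from a list of (aligned_x, aligned_y) pairs."""
    pair_counts = Counter()
    sound_counts = Counter()
    for top, bottom in alignments:
        for a, b in zip(top, bottom):
            if a != gap:
                sound_counts[a] += 1
            if b != gap:
                sound_counts[b] += 1
            if a != gap and b != gap:
                # Count both directions to obtain a symmetric estimate.
                pair_counts[(a, b)] += 1
                pair_counts[(b, a)] += 1
    n_pairs = sum(pair_counts.values())
    n_sounds = sum(sound_counts.values())
    pmi = {}
    for (a, b), count in pair_counts.items():
        s_ab = count / n_pairs                # relative alignment frequency
        q_a = sound_counts[a] / n_sounds      # relative frequency of a
        q_b = sound_counts[b] / n_sounds      # relative frequency of b
        pmi[(a, b)] = math.log(s_ab / (q_a * q_b))
    return pmi


# Toy alignments (gaps written as "-").
scores = estimate_pmi([("kornu", "korn-"), ("mano", "m3n3")])
print(scores[("a", "3")])
```

Positive scores mark sound-class pairs that are aligned more often than their individual frequencies would predict under independence; negative scores mark pairs aligned less often.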

In the next step, only word pairs whose aggregate PMI score reaches a fixed threshold are used. (This threshold is taken from Jäger, 2013, as well.) Those word pairs are re-aligned and the PMI scores are re-estimated. This step is repeated ten times.

This threshold is rather strict; almost all word pairs above it are either cognates or loans. For instance, for the language pair Italian/Albanian, the only translation pair with a higher PMI score is Italian peSe / Albanian peSk (“fish”), where the former is a descendant of and the latter a loan from Latin piscis. For Spanish/Romanian, two rather divergent Romance languages, we find eight such word pairs. They are shown along with the inferred alignments in Table 3.

concept alignment PMI score
person perso-na
tooth diente
blood sangre
hand mano
one uno
die mori
come veni
name nombre
Table 3: Word pair alignments from Spanish and Romanian

The aggregate PMI score for the best alignment between two strings is a measure for the degree of similarity between the strings. We will call it the PMI similarity henceforth.

3.3 Cognate clustering

Automatic cognate detection is an area of active investigation in CHL (Dolgopolsky, 1986; Bergsma & Kondrak, 2007; Hall & Klein, 2010; Turchin et al., 2010; Hauer & Kondrak, 2011; List, 2012, 2014; Rama, 2015; Jäger & Sofroniev, 2016; Jäger et al., 2017, inter alia). For the present study, we chose a rather simple approach based on unsupervised learning.

Figure 3 shows the PMI similarities for word pairs from different doculects that have different or identical meanings.

Figure 3: PMI similarities for synonymous and non-synonymous word pairs

Within our data, synonymous word pairs are, on average, more similar to each other than non-synonymous ones. The most plausible explanation for this effect is that the synonymous word pairs contain a large proportion of cognate pairs. Therefore “identity of meaning” will be used as a proxy for “being cognate”.

We fitted a logistic regression with PMI similarity as independent and synonymy as dependent variable.

For each concept, a weighted graph is constructed, with the words denoting this concept as vertices. Two vertices are connected if the predicted probability that these words are synonymous (based on their PMI similarity and the logistic regression model) reaches a fixed threshold. The weight of each edge equals the predicted probability. The nodes of the graph are clustered using the weighted version of the Label Propagation algorithm (Raghavan et al., 2007) as implemented in the igraph software (Csardi & Nepusz, 2006). As a result, a class label is assigned to each word. Non-synonymous words never carry the same class label. (The implicit assumption underlying this procedure is that cognate words always have the same meaning. This is evidently false when considering the entire lexicon. There is a plethora of examples, such as English deer vs. German Tier “animal”, which are cognate (cf. Kroonen, 2013, p. 94) without being synonyms. However, within the 40-concept core vocabulary space covered by ASJP, such cross-concept cognate pairs are arguably very rare.) Table 4 illustrates the resulting clustering for the concept “person” and a subset of the doculects.

doculect word class label
VLACH omu 2
ASTURIAN persona 3
CATALAN p3rson3 3
ITALIAN persona 3
SPANISH persona 3
VALENCIAN persone 3
GASCON omi 7
Table 4: Automatic cognate clustering for concept “person”
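The graph construction and clustering step described above can be sketched in plain Python. This is not the igraph implementation used in the study: it is a simplified, deterministic variant of weighted label propagation (nodes are visited in a fixed order and ties go to the smallest label, whereas the original algorithm randomizes both). The function names, the toy probabilities and the cut-off of 0.5 are all hypothetical.

```python
def build_graph(probabilities, threshold):
    """Keep only word pairs whose predicted probability reaches the threshold."""
    return {pair: p for pair, p in probabilities.items() if p >= threshold}


def label_propagation(n_words, edges, max_rounds=100):
    """Cluster vertices 0..n_words-1; edges maps (i, j) to a weight."""
    neighbours = {v: {} for v in range(n_words)}
    for (i, j), weight in edges.items():
        neighbours[i][j] = weight
        neighbours[j][i] = weight
    labels = list(range(n_words))  # every word starts in its own class
    for _ in range(max_rounds):
        changed = False
        for v in range(n_words):
            if not neighbours[v]:
                continue  # isolated words keep their own label
            totals = {}
            for u, weight in neighbours[v].items():
                totals[labels[u]] = totals.get(labels[u], 0.0) + weight
            # Adopt the label with the largest total weight among neighbours.
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels


words = ["persona", "persona", "p3rson3", "omu", "omi"]
# Made-up predicted probabilities for some word pairs (by index).
probabilities = {(0, 1): 0.99, (0, 2): 0.9, (1, 2): 0.9, (3, 4): 0.8, (2, 3): 0.1}
edges = build_graph(probabilities, threshold=0.5)  # illustrative cut-off only
print(list(zip(words, label_propagation(len(words), edges))))
```

Because edges below the cut-off are dropped before clustering, words without any sufficiently similar partner end up as singleton classes.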

A manual inspection reveals that the automatic classification does not completely coincide with the cognate classification a human expert would assume. For instance, the descendants of Latin homo are split into classes 1, 2, 5, and 7. Also, Gheg Albanian 5eri and Sardinian omini have the same label but are not cognate.

Based on evaluations against manually assembled cognacy judgments for different but similar data (Jäger & Sofroniev, 2016; Jäger et al., 2017), we can expect an average F-score of 60%–80% for automatic cognate detection. This means that on average, for each word, 60%–80% of its true cognates are assigned the same label, and 60%–80% of the words carrying the same label are genuine cognates.
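The per-word reading of precision and recall given above corresponds to what is commonly called B-cubed scoring. Assuming that this is the intended measure, the following Python sketch shows how such an F-score can be computed for one concept; the function name `bcubed` and the toy labels are hypothetical.

```python
def bcubed(gold, predicted):
    """Return (precision, recall, F-score), averaged over words.

    gold and predicted are equally long lists of class labels,
    one entry per word.
    """
    n = len(gold)
    precision = 0.0
    recall = 0.0
    for i in range(n):
        same_predicted = [j for j in range(n) if predicted[j] == predicted[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        # Words sharing both the predicted and the gold class with word i.
        overlap = sum(1 for j in same_predicted if gold[j] == gold[i])
        precision += overlap / len(same_predicted)
        recall += overlap / len(same_gold)
    precision /= n
    recall /= n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: four words, two true cognate classes, imperfect clustering.
print(bcubed(gold=[1, 1, 2, 2], predicted=[1, 1, 1, 2]))
```

Each word counts itself as sharing its own class, so both averages are strictly positive and the F-score is always defined.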

3.4 Phylogenetic inference

3.4.1 General remarks

A phylogenetic tree (or simply phylogeny) is a data structure similar to the family trees of the comparative method, but there are some subtle but important differences between the two concepts. Like a family tree, a phylogeny is a tree graph, i.e., a connected acyclic graph. If one node in the graph is identified as the root, the phylogeny is rooted; otherwise it is unrooted. The branches (or edges) of a phylogeny have non-negative branch lengths. A phylogeny without branch lengths is called a topology.

Nodes with degree 1 (i.e., nodes which are the endpoint of exactly one branch) are called leaves or tips. They are usually labeled with the names of observed entities, such as documented languages. Nodes with a degree $> 1$ are the internal nodes. If the root (if present) has degree 2 and all other internal nodes have degree 3, the phylogeny is binary-branching. Most algorithms for phylogenetic inference produce binary-branching trees.
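These definitions translate directly into code. The Python sketch below represents a phylogeny as a list of branches `(node, node, length)`; this representation, the function names and the toy tree are assumptions made for illustration. The sketch only inspects node degrees and does not verify that the graph is actually connected and acyclic.

```python
from collections import Counter


def node_degrees(edges):
    """Count, for every node, the number of branches ending in it."""
    degree = Counter()
    for u, v, _length in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def leaves(edges):
    """Nodes of degree 1 are the leaves (tips)."""
    return {node for node, d in node_degrees(edges).items() if d == 1}


def is_binary_branching(edges, root=None):
    """Root (if given) must have degree 2, all other internal nodes degree 3."""
    for node, d in node_degrees(edges).items():
        if d == 1:
            continue  # leaf
        expected = 2 if node == root else 3
        if d != expected:
            return False
    return True


# Toy rooted phylogeny ((A, B), C) with made-up branch lengths.
toy = [("root", "n1", 0.5), ("root", "C", 1.0), ("n1", "A", 0.3), ("n1", "B", 0.2)]
print(sorted(leaves(toy)), is_binary_branching(toy, root="root"))
```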

Like a linguistic family tree, a rooted phylogeny is a model of the historic process leading to the observed diversity between the objects at the leaves. Time flows from the root to the leaves. Internal nodes represent unobserved historical objects, such as ancient languages. Branching nodes represent diversification events, i.e., the splitting of a lineage into several daughter lineages.

The most important difference between family trees and phylogenies is the fact that the latter have branch lengths. Depending on the context, these lengths may represent two different quantities. They may capture the historic time (measured in years) between diversification events, or they may indicate the amount of change along the branch, measured for instance as the expected number of lexical replacements or the expected number of sound changes. The two interpretations only coincide if the rate of change is constant. This assumption is known to be invalid for language change (cf. e.g. the discussion in McMahon & McMahon, 2006).

Another major difference, at least in practice, between family trees and phylogenies concerns the type of justification that is expected for the stipulation of an internal node. According to the comparative method, such a node is justified if and only if a shared innovation can be reconstructed for all daughter lineages of this node (“The only generally accepted criterion for subgrouping is shared innovation.” Campbell, 1998, p. 190, emphasis in original). Consequently, family trees obtained via the comparative method often contain multiply branching nodes because the required evidence for further subgrouping is not available. Phylogenies, by contrast, are mostly binary-branching, at least in practice. Partially this is a matter of computational convenience, since this reduces the search space. Also, algorithms working recursively from leaves to root can be formulated in a more efficient way if all internal nodes are known to have at most two daughters. Furthermore, the degree of justification of a certain topology is evaluated globally, not for each internal node individually. In the context of phylogenetic inference, it is therefore not required to identify shared innovations for individual nodes.

There is a large variety of algorithms from computational biology to infer phylogenies from observed data. The overarching theme of phylogenetic inference is that a phylogeny represents (or is part of) a mathematical model explaining the observed variety. There are criteria quantifying how good an explanation a phylogeny provides for observed data. Generally speaking, the goal is to find a phylogeny that provides an optimal explanation for the observed data. The most commonly used algorithms are (in ascending order of sophistication and computational costs) Neighbor Joining (Saitou & Nei, 1987) and its variant BIONJ (Gascuel, 1997), FastME (Desper & Gascuel, 2002), Fitch-Margoliash (Fitch & Margoliash, 1967), Maximum Parsimony (Fitch, 1971), Maximum Likelihood (this method was developed incrementally; Edwards & Cavalli-Sforza, 1964, is an early reference) and Bayesian Phylogenetic Inference (cf. Chen et al., 2014, for an overview).

The latter two approaches, Maximum Likelihood and Bayesian Phylogenetic Inference, are based on a probabilistic model of language change. To apply them, a language has to be represented as a character vector. A character is a feature with a finite number of possible values, such as “order of verb and object” or “the first person plural pronoun contains a dental consonant”. In most applications, characters are binary, with “0” and “1” as possible values. In the sequel, we will assume all characters are binary.

Diachronic change of a character value is modeled as a continuous-time Markov process. At each point in time a character can spontaneously switch to the other value with a fixed probability density. A two-state process is characterized by two parameters, $r$ and $s$, where $r$ is the rate of change of $0 \to 1$ (the probability density of a switch to 1 if the current state is 0) and $s$ the rate of change for $1 \to 0$. For a given time interval of length $t$, the probability of being in state $i$ at the start of the interval and in state $j$ at the end is then given by $P(t)_{ij}$, where

$$P(t) = \frac{1}{r+s}\begin{pmatrix} s + r e^{-(r+s)t} & r - r e^{-(r+s)t} \\ s - s e^{-(r+s)t} & r + s e^{-(r+s)t} \end{pmatrix}$$

The possibility of multiple switches occurring during the interval is factored in.

A probabilistic model for a given set of character vectors is a phylogenetic tree (with the leaves indexed by the character vectors) plus a mapping from edges to rates for each character and a probability distribution over character values at the root for each character.

Suppose we know not only the character states at the leaves of the phylogeny but also at all internal nodes. The likelihood of a given branch is then given by $P(t)_{ij}$, where $i$ and $j$ are the states at the top and the bottom of the branch respectively, and $t$ is the length of the branch. The likelihood of the entire phylogeny for a given character is then the product of all branch likelihoods, multiplied by the probability of the root state. The total likelihood of the phylogeny is the product of its likelihoods for all characters.

If only the character values for the leaves are known, the likelihood of the phylogeny given the character vectors at the leaves is the sum of its likelihoods for all possible state combinations at the internal nodes.
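The sum over all internal state combinations does not have to be enumerated explicitly; it can be accumulated leaves-to-root (Felsenstein's pruning algorithm). The following is a minimal sketch for one binary character, assuming a single pair of rates for all branches and equilibrium root probabilities. The names `r`, `s`, the dictionary-based tree encoding and the toy tree are illustrative assumptions, not part of the case study.

```python
import math


def transition_matrix(r, s, t):
    """Two-state transition probabilities for rates r (0->1) and s (1->0)."""
    lam = r + s
    e = math.exp(-lam * t)
    return [[(s + r * e) / lam, (r - r * e) / lam],
            [(s - s * e) / lam, (r + s * e) / lam]]


def partial_likelihoods(node, r, s):
    """Return [L0, L1]: likelihood of the leaves below `node`,
    conditional on `node` being in state 0 or 1."""
    if "state" in node:                      # leaf
        if node["state"] is None:            # undefined value: no constraint
            return [1.0, 1.0]
        return [1.0 if node["state"] == x else 0.0 for x in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node["children"]:
        p = transition_matrix(r, s, length)
        below = partial_likelihoods(child, r, s)
        for x in (0, 1):
            # sum over the unknown state y at the lower end of the branch
            result[x] *= sum(p[x][y] * below[y] for y in (0, 1))
    return result


def tree_likelihood(root, r, s):
    """Likelihood of one character, using equilibrium root probabilities."""
    pi = [s / (r + s), r / (r + s)]
    l0, l1 = partial_likelihoods(root, r, s)
    return pi[0] * l0 + pi[1] * l1


def leaf(state):
    return {"state": state}


# Hypothetical toy tree ((A:0.1, B:0.1):0.2, C:0.3) with states A=1, B=1, C=0.
toy = {"children": [({"children": [(leaf(1), 0.1), (leaf(1), 0.1)]}, 0.2),
                    (leaf(0), 0.3)]}
likelihood = tree_likelihood(toy, r=1.0, s=1.0)
```

The total likelihood over several characters would be the product of such per-character values (in practice, the sum of their logarithms).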

This general model is very parameter-rich since for each branch and each character, a pair of rates has to be specified. There are various ways to reduce the degrees of freedom. The simplest method is to assume that rates are constant across branches and characters, and that the root probabilities of each character equal the equilibrium probabilities of the Markov process: $(\pi_0, \pi_1) = \left(\frac{s}{r+s}, \frac{r}{r+s}\right)$, with $r$ the rate of $0 \to 1$ and $s$ the rate of $1 \to 0$. More sophisticated approaches assume that rates vary across characters and across branches according to some parameter-poor probability distribution, and the expected likelihood of the tree is obtained by integrating over this distribution. For a detailed mathematical exposition, the interested reader is referred to the relevant literature from computational biology, such as Ewens & Grant (2005).

A parameterized model, i.e., a phylogeny plus rate specifications for all characters and branches, and root probabilities for each character, assigns a certain likelihood to the observed character vectors. Maximum Likelihood (ML) inference searches for the model that maximizes this likelihood given the observations. While the optimal numerical parameters of a model, i.e., branch lengths, rates and root probabilities, can efficiently be found by standard optimization techniques, finding the topology that gives rise to the ML model is computationally hard. Existing implementations use various heuristics to search the tree space and find some local optimum, but there is no guarantee that the globally optimal topology is found. [Footnote 6: Among the best software packages currently available for ML phylogenetic inference are RAxML (Stamatakis, 2014) and IQ-Tree (Nguyen et al., 2015).]

Bayesian phylogenetic inference requires some suitable prior probability distributions over models (i.e., topologies, branch lengths, rates, possibly rate variations across characters and rate variation across branches) and produces a sample of the posterior distribution over models via a Markov Chain Monte Carlo simulation. [Footnote 7: Suitable software packages are, inter alia, MrBayes (Ronquist & Huelsenbeck, 2003) and BEAST (Bouckaert et al., 2014).]

3.4.2 Application to the case study

For the case study, doculects were represented by two types of binary characters:

  • Inferred class label characters (cf. Subsection 3.3). Each inferred class label is a character. A doculect has value 1 for such a character if and only if its word list contains a word carrying this label. [Footnote 8: If a word list contains no entry for a certain concept, all characters pertaining to this concept are undefined for this doculect. The same principle applies to the soundclass-concept characters. Leaves with undefined character values are disregarded when computing the likelihood of a phylogeny for that character.]

  • Soundclass-concept characters. There is one character for each pair of a sound class $a$ and a concept $c$. A doculect has value 1 for that character if and only if its word list contains a word for $c$ that contains $a$ in its transcription.
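The two character types can be derived mechanically from a labeled word list. The sketch below illustrates this under simplifying assumptions: every symbol of a transcription counts as one sound class, undefined values are encoded as `None`, and the entries (doculect names, labels and words) are invented toy data rather than the case-study data.

```python
from collections import defaultdict

# Hypothetical toy input: (doculect, concept, class label, transcription).
entries = [
    ("DOC_A", "eye", "eye:6", "okyu"),
    ("DOC_B", "eye", "eye:6", "oky"),
    ("DOC_B", "eye", "eye:2", "vaklo"),
    ("DOC_C", "person", "person:3", "persona"),
]


def encode_characters(entries):
    """Return {character: {doculect: 1, 0 or None}} for both character types."""
    doculects = sorted({doc for doc, _, _, _ in entries})
    concepts_of = defaultdict(set)   # doculect -> concepts it has a word for
    holders = defaultdict(set)       # character -> doculects with value 1
    concept_of = {}                  # character -> concept it pertains to
    for doc, concept, label, word in entries:
        concepts_of[doc].add(concept)
        # class label character
        holders[label].add(doc)
        concept_of[label] = concept
        # soundclass-concept characters, one per symbol in the word
        for sound in set(word):
            holders[(sound, concept)].add(doc)
            concept_of[(sound, concept)] = concept
    matrix = {}
    for char, docs in holders.items():
        concept = concept_of[char]
        matrix[char] = {
            # undefined if the doculect has no entry for the concept at all
            doc: int(doc in docs) if concept in concepts_of[doc] else None
            for doc in doculects
        }
    return matrix


matrix = encode_characters(entries)
```

Class label characters are keyed by the label string and soundclass-concept characters by a `(sound, concept)` tuple; this keying is a design choice of the sketch only.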

Both types of characters carry a diachronic signal. For instance, the mutation $0 \to 1$ for class label 6/concept person (cf. Table 4) represents a lexical replacement of Latin “homo” or “persona” by descendants of Latin “christianus” in some Romance dialects (Meyer-Lübke, 1935, p. 179). The mutation $0 \to 1$ for the soundclass-concept character k/person represents the same historical process. Soundclass-concept characters, however, also capture sound shifts. For instance, the mutation $0 \to 1$ for b/person reflects the epenthetic insertion of b in descendants of Latin “homo” in some Iberian dialects.

We performed Bayesian phylogenetic inference on those characters. The inference was carried out using the software MrBayes (Ronquist & Huelsenbeck, 2003). Separate rate models were inferred for the two character types. Rate variation across characters was modeled by a discretized Gamma distribution using 4 rate categories. We assumed no rate variation across edges. Root probabilities were identified with equilibrium probabilities. An ascertainment correction for missing all-0 characters was used.

We assumed rates to be constant across branches. This entails that the fitted branch lengths reflect the expected amount of change (i.e., the expected number of mutations) along that branch.

In such a model, the likelihood of a phylogeny does not depend on the location of the root (the assumed Markov process is time reversible). Therefore phylogenetic inference provides no information about the location of the root. This motivates the inclusion of the Albanian doculects. Those doculects were used as an outgroup, i.e., the root was placed on the branch separating the Albanian and the Romance doculects.

We obtained a sample of the posterior distribution containing 2,000 phylogenies. Figure 4 displays a representative member of this sample (the maximum clade credibility tree). The labels at the nodes indicate the posterior probability of each node, i.e., the proportion of the phylogenies in the posterior sample containing the same subgroup.

Figure 4: Representative phylogeny from the posterior distribution. Labels at the internal nodes indicate posterior probabilities

These posterior probabilities are mostly rather low, indicating a large degree of topological variation in the posterior sample. Some subgroups, such as Balkan Romance or the Piemontese dialects, achieve high posterior probabilities though.
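Since a node's posterior probability is simply the relative frequency of its subgroup in the sample, it can be computed by counting. The following sketch assumes rooted trees encoded as nested tuples of leaf names; the four-tree sample is invented for illustration and stands in for the 2,000 sampled phylogenies.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set of `tree`, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):            # a leaf is just its name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_support(sample):
    """Proportion of trees in `sample` that contain each subgroup."""
    counts = Counter()
    for tree in sample:
        counts.update(set(clades(tree)[1]))
    return {clade: n / len(sample) for clade, n in counts.items()}


# Hypothetical posterior sample over four doculects A, B, C, D.
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
support = clade_support(sample)
```

A subgroup that appears in only a few sampled trees receives a correspondingly low value, which is the pattern described above for most nodes of Figure 4.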

Notably, branch lengths carry information about the amount of change. According to the phylogeny in Figure 4, for instance, the Tuscan dialect of Italian (ITALIAN_GROSSETO_TUSCAN) is predicted to be the most conservative Romance dialect (since its distance to the latest common ancestor of all Romance dialects is shortest), and French the most innovative one.

These results indicate that the data only contain a weak tree-like signal. This is unsurprising since the Romance languages and dialects form a dialect continuum where horizontal transfer of innovations is an important factor.

Phylogenetic trees, like traditional family trees, only model vertical descent, not horizontal diffusion. They are therefore only an approximation of the historical truth. Even so, they are useful as statistical models for further inference steps.

3.5 Ancestral state reconstruction

If a fully specified model is given, it is possible to estimate the probability distributions over character states for each internal node.

Let $M$ be a model, i.e., a phylogeny plus further parameters (rates and root probabilities, possibly specifications of rate variation). Let $c$ be a character and $n$ a node within $M$.

The parameters of $M$ specify a Markov process, including rates, for the branch leading to $n$. Let $(\pi_0, \pi_1)$ be the equilibrium probabilities of that process. (If $n$ is the root, $(\pi_0, \pi_1)$ are directly given by $M$.)

Let $M^{c,n}_x$ be the same model as $M$, except that the value of character $c$ at node $n$ is fixed to the value $x$. $L(M^{c,n}_x)$ is the likelihood of model $M^{c,n}_x$ given the observed character vectors for the leaves.

The probability distribution over values of character $c$ at node $n$, given $M$, is determined by Bayes' rule:

$$P(c_n = x \mid M) = \frac{\pi_x \, L(M^{c,n}_x)}{\sum_{y} \pi_y \, L(M^{c,n}_y)}$$

Figure 5: Ancestral state reconstruction for character person:3

Figure 5 illustrates this principle with the Romance part of the tree from Figure 4 and the character person:3 (cf. Table 4). The pie charts at the nodes display the probability distribution for that node, where white represents 0 and red 1.
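A distribution of this kind can be computed with the same leaves-to-root recursion that yields the tree likelihood. The sketch below is one possible implementation for a single binary character, assuming one pair of rates for all branches and equilibrium root probabilities. It computes the joint probability of the leaf data and each state at the target node and then normalizes, which under these assumptions gives the posterior distribution at that node. The tree encoding, the parameter names and the toy tree are illustrative assumptions.

```python
import math


def transition_matrix(r, s, t):
    """Two-state transition probabilities for rates r (0->1) and s (1->0)."""
    lam = r + s
    e = math.exp(-lam * t)
    return [[(s + r * e) / lam, (r - r * e) / lam],
            [(s - s * e) / lam, (r + s * e) / lam]]


def partials(node, r, s, target=None, fixed=None):
    """Conditional likelihoods [L0, L1] of the leaves below `node`.
    At the node `target`, only the state `fixed` is allowed."""
    if "state" in node:                                  # leaf
        if node["state"] is None:
            vec = [1.0, 1.0]
        else:
            vec = [float(node["state"] == x) for x in (0, 1)]
    else:
        vec = [1.0, 1.0]
        for child, length in node["children"]:
            p = transition_matrix(r, s, length)
            below = partials(child, r, s, target, fixed)
            for x in (0, 1):
                vec[x] *= sum(p[x][y] * below[y] for y in (0, 1))
    if node is target:
        vec = [v if x == fixed else 0.0 for x, v in enumerate(vec)]
    return vec


def node_posterior(root, target, r, s):
    """Posterior distribution over states 0 and 1 at `target`."""
    pi = [s / (r + s), r / (r + s)]
    joint = []
    for x in (0, 1):
        l0, l1 = partials(root, r, s, target, x)
        joint.append(pi[0] * l0 + pi[1] * l1)
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical toy tree ((A:0.1, B:0.1):0.2, C:0.3) with states A=1, B=1, C=0.
inner = {"children": [({"state": 1}, 0.1), ({"state": 1}, 0.1)]}
root = {"children": [(inner, 0.2), ({"state": 0}, 0.3)]}
posterior = node_posterior(root, inner, r=1.0, s=1.0)
```

In this toy example both daughters of the inner node carry value 1 on short branches, so most of the probability mass at that node falls on state 1.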

This kind of computation was carried out for each class label character and each tree in the posterior sample for the latest common ancestor of the Romance doculects. For each concept, the class label for that concept with the highest average probability for value 1 at the root of the Romance subtree was inferred to represent the cognate class of the Proto-Romance word for that concept. [Footnote 9: See Jäger & List (2017) for further elaboration and justification of this method of Ancestral State Reconstruction.] For the concept person, e.g., character person:3 (representing the descendants of Latin “persona”) comes out as the best reconstruction.

3.6 Multiple sequence alignment

In the previous step, for the concept eye, the class label 6 was reconstructed for Proto-Romance. Its reflexes are given in Table 5.

doculect word
SICILIAN_UnnamedInSource okiu
VLACH okklu
Table 5: Reflexes of class label eye:6

A multiple sequence alignment (MSA) is a generalization of pairwise alignment to more than two sequences. Ideally, all segments within a column are descendants of the same sound in some common ancestor.

MSA, as applied to DNA or protein sequences, is a major research topic in bioinformatics. The techniques developed in this field are mutatis mutandis also applicable to MSA of phonetic strings. In this subsection one approach will briefly be sketched. For a wider discussion and proposals for related but different approaches, see List (2014).

Here we will follow the overall approach from Notredame et al. (2000) and combine it with the PMI-based method for pairwise alignment described in Subsection 3.2. Notredame et al. (2000) dub their approach T-Coffee (“Tree-based Consistency Objective Function For alignment Evaluation”), and we will use this name for the method sketched here as well.

In a first step, all pairwise alignments between words from the list to be multiply aligned are collected. For this purpose we use PMI pairwise alignment. Some examples would be

okiu vaklo okkyo -okyo o-ky- okru
oky- wokLu o-ky- wokyo okklu okiu

The last row shows the score of the alignment, i.e., the proportion of identical matches (disregarding gaps).
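The score of a pairwise alignment as characterized above can be sketched as follows. The sketch assumes that “disregarding gaps” means that columns containing a gap are left out of both the count of matches and the denominator; this reading, like the function name, is an assumption.

```python
def alignment_score(row_a, row_b, gap="-"):
    """Proportion of identical matches among the gap-free columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Two of the alignments listed above.
score_1 = alignment_score("okiu", "oky-")
score_2 = alignment_score("okkyo", "o-ky-")
```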

In a second step, all indirect alignments between a given word pair are collected, which are obtained via relation composition with a third word. Some examples for indirect alignments between okiu and oky would be:

okiu -okiu okiu -okiu oki-u
okyo wokyo oky- wokLu okklu
oky- -oky- oky- -oky- o-ky-

The direct pairwise alignment matches the i in okiu with the y in oky. Most indirect alignments pair these two positions as well, but not all of them. In the last column, the i from okiu is related to the k of oky, and the y from oky to a gap. For each pair of positions in two strings, the relative frequencies of their being indirectly aligned, weighted by the scores of the two pairwise alignments relating them, are summed. The result is the extended score between those positions.

The optimal MSA for the entire group of words is the one where the sum of the pairwise extended scores per column is maximized. Finding this global optimum is computationally not feasible though, since the complexity of this task grows exponentially with the number of sequences. Progressive alignment (Hogeweg & Hesper, 1984) is a method to obtain possibly sub-optimal but good MSAs in polynomial time. Using a guide tree with sequences at the leaves, MSAs are obtained recursively leaves-to-root. For each internal node, the MSAs at the daughter nodes are combined via the Needleman-Wunsch algorithm while respecting all partial alignments from the daughter nodes.
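The Needleman-Wunsch recursion at the core of this procedure is easiest to see in the pairwise case. The sketch below aligns two strings with a simple identity-based scoring scheme (match, mismatch and gap scores are placeholder assumptions). In the method described above, the scores would instead be the extended scores, and the items being aligned at internal nodes would be whole MSA columns rather than single symbols.

```python
def needleman_wunsch(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Globally align strings a and b; return (row_a, row_b, total score)."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    # fill the dynamic programming table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # trace back from the bottom-right cell to recover one optimal alignment
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("okiu", "oky")
```

When several alignments are equally good, the traceback returns just one of them; which one depends on the order in which the three moves are tried.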

For the words from Table 5, this method produces the MSA in Table 6. The tree in Figure 4, pruned to the doculects represented in the word lists, was used as guide tree.

doculect alignment
ITALIAN -okkyo
ROMANIAN_2 -o-ky-
SICILIAN_UnnamedInSource -o-kiu
VLACH -okklu
Table 6: Multiple Sequence Alignment for the words from Table 5, using the tree from Figure 4 as guide tree

Using this method, MSAs were computed for each class label that was inferred to be present in Proto-Romance via Ancestral State Reconstruction.

3.7 Proto-form reconstruction

As a final step toward the reconstruction of Proto-Romance forms, Ancestral State Reconstruction is performed for the sound classes in each column, for each MSA obtained in the previous step.

Consider the first column of the MSA in Table 6. It contains three possible states, v, w, and the gap symbol -. For each of these states, a binary presence-absence character is constructed. For doculects which do not occur in the MSA in question, this character is undefined.

The method for Ancestral State Reconstruction described in Subsection 3.5 was applied to these characters. For each phylogeny in the posterior sample, the probability of state 1 at the Proto-Romance node was computed for each character. For each column of an MSA, the state with the highest average probability was considered as reconstructed.

The reconstructed proto-form for a given concept is then obtained by concatenating the reconstructed states for the corresponding MSA and deleting all gap symbols. The results are given in Table 7.
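The last two steps (averaging over the posterior sample, then picking the best state per column and dropping gaps) amount to a few lines of code. In the sketch below, the per-tree probabilities are invented placeholder numbers, not values from the case study. Because each state is a separate presence-absence character, the probabilities within a column need not sum to 1.

```python
def average_probabilities(per_tree):
    """Average, column by column, the probability of value 1 for each state.

    `per_tree` holds one list per sampled phylogeny; each list holds one
    {state: probability} dictionary per MSA column.
    """
    n_trees = len(per_tree)
    averaged = []
    for col in range(len(per_tree[0])):
        states = per_tree[0][col].keys()
        averaged.append({
            st: sum(tree[col][st] for tree in per_tree) / n_trees
            for st in states
        })
    return averaged


def reconstruct_proto_form(column_probs, gap="-"):
    """Pick the most probable state per column and delete gap symbols."""
    best = [max(col, key=col.get) for col in column_probs]
    return "".join(state for state in best if state != gap)


# Hypothetical probabilities for a five-column MSA and two sampled trees.
per_tree = [
    [{"-": 0.9, "v": 0.1}, {"o": 1.0}, {"k": 0.8, "-": 0.2},
     {"y": 0.6, "l": 0.4}, {"u": 0.7, "o": 0.3}],
    [{"-": 0.7, "v": 0.3}, {"o": 1.0}, {"k": 0.9, "-": 0.1},
     {"y": 0.5, "l": 0.5}, {"u": 0.6, "o": 0.4}],
]
proto_form = reconstruct_proto_form(average_probabilities(per_tree))
```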

concept     Latin            reconstruction
blood       saNgw~is         saNg
bone        os               os
breast      pektus, mama     pet
come        wenire           venir
die         mori             murir
dog         kanis            kan
drink       bibere           beb3r
ear         auris            oreL3
eye         okulus           okyu
fire        iNnis            fok
fish        piskis           peS
full        plenus           plen
hand        manus            man
hear        audire           sentir
horn        kornu            korn3
I           ego              iy3
knee        genu             Z3nuL
leaf        foly~u*          foLa
liver       yekur            figat
louse       pedikulus        pidoko
mountain    mons             munta5a
name        nomen            nom
new         nowus            novo
night       noks             note
nose        nasus            nas
one         unus             unu
path        viya             strada
person      persona, homo    persona
see         widere           veder
skin        kutis            pel
star        stela            stela
stone       lapis            pEtra
sun         sol              sol
tongue      liNgw~E          liNga
tooth       dens             dEnt
tree        arbor            arbur
two         duo              dos
water       akw~a            akwa
we          nos              nos
you         tu               tu
Table 7: Reconstructions for Proto-Romance

3.8 Evaluation

To evaluate the quality of the automatic reconstructions, they were compared to the corresponding elements of the Latin word list. For each reconstructed word, the normalized Levenshtein distance (i.e., the Levenshtein distance divided by the length of the longer string) to each Latin word (without diacritics) for that concept was computed. The smallest such value counts as the score for that concept. The average score was . The extant Romance doculects have an average score of . The most conservative doculect, Sardinian, has a score of , and the least conservative, Arpitan, . The evaluation results are depicted in Figure 6.
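The per-concept score described here can be sketched as follows. The implementation is a plain dynamic-programming Levenshtein distance; the function names are illustrative, and the example inputs are taken from Table 7 (entries without diacritics).

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def concept_score(reconstruction, latin_words):
    """Smallest normalized distance to any Latin word for the concept."""
    return min(normalized_distance(reconstruction, w) for w in latin_words)


score_person = concept_score("persona", ["persona", "homo"])
score_breast = concept_score("pet", ["pektus", "mama"])
score_dog = concept_score("kan", ["kanis"])
```

The score of a doculect or of the reconstructed list would then be the mean of these values over all concepts.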

Figure 6: Average normalized Levenshtein distance to Latin words: reconstruction (dashed line) and extant Romance doculects (white bars)

These findings indicate that the automatic reconstruction does in fact capture a historical signal. Manual inspection of the reconstructed word list reveals that to a large degree, the discrepancies from Latin actually reflect language change between Classical Latin and the latest common ancestor of the modern Romance doculects, namely Vulgar Latin. To mention just a few points: (1) Modern Romance nouns are mostly derived from the Latin accusative form (Herman, 2000, p. 3), while the word list contains the nominative form. For instance, the common ancestor forms for “tooth” and “night” are dentem and noctem. The t in the corresponding reconstructed forms is therefore historically correct. (2) Some Vulgar Latin words are morphologically derived from their Classical Latin counterparts, such as mons → montanea “mountain” (Meyer-Lübke, 1935, p. 464) or genu → genukulum “knee” (Meyer-Lübke, 1935, p. 319). This is likewise partially reflected in the reconstructions. (3) For some concepts, lexical replacement by non-cognate words took place between Classical and Vulgar Latin, such as via → strata “path” [Footnote 10: Latin makes a semantic distinction between via for unpaved and strata for paved roads; cf. Meyer-Lübke (1935, p. 685).], ignis → focus “fire” (Meyer-Lübke, 1935, p. 293), or iecur → ficatum “liver” (Herman, 2000, p. 106). Again, this is reflected in the reconstruction.

On the negative side, the reconstructions occasionally reflect sound changes that only took place in the Western Romania, such as the voicing of plosives between vowels (Herman, 2000, p. 46).

Let us conclude this section with some reflections on how the reconstructions were obtained and how this relates to the comparative method.

A major difference from the traditional approach is the stochastic nature of the workflow sketched here. Both phylogenetic inference and ancestral state reconstruction are based on probabilities rather than categorical decisions. The results shown in Table 7 propose a unique reconstruction for each concept, but it would require only a minor modification of the workflow to derive a probability distribution over reconstructions instead. This probabilistic approach is arguably an advantage since it allows one to utilize uncertain and inconclusive information while taking this uncertainty properly into account.

Another major difference concerns the multiple independence assumptions implicit in the probabilistic model sketched in Subsection 3.4. The likelihood of a phylogeny is the product of its likelihoods for the individual characters. This amounts to the assumption that the characters are mutually stochastically independent.

The characters used here (and generally in computational phylogenetics as applied to historical linguistics) are, however, mutually dependent in manifold ways. For instance, the loss of a cognate class makes it more likely that the affected lineage will acquire another cognate class for the same semantic slot and vice versa.

This problem is even more severe for phonetic change. Since the work of the Neogrammarians in the 19th century, it has been recognized that many sound changes are regular, i.e., they apply to all instances of a certain sound (given contextual conditions) throughout the lexicon. Furthermore, both regular and irregular sound changes are usually dependent on their syntagmatic phonetic context, and sometimes on the paradigmatic context within inflectional paradigms as well. Bouchard-Côté et al. (2013) and Hruschka et al. (2015) propose more sophisticated probabilistic models of language change than the one used here to take these dependencies into account. [Footnote 11: So far these models have been tested on only one language family each (Austronesian and Turkic respectively), and the algorithmic tools have not been released.]

Last but not least, the treatment of borrowing (and language contact in general) is an unsolved problem for computational historical linguistics. Automatic cognate clustering does not distinguish between genuine cognates (related via unbroken chains of vertical descent) and (descendants of) loanwords. This introduces a potential bias for phylogenetic inference and ancestral state reconstruction, since borrowed items might be misconstrued as shared retentions.

4 Conclusion

This article gives a brief sketch of the state of the art in computational historical linguistics, a relatively young subfield at the interface between historical linguistics, computational linguistics and computational biology. The case study discussed in the previous section serves to illustrate some of the major research topics in this domain: identification of genetic relationships between languages, phylogenetic inference, automatic cognate detection and ancestral state reconstruction. These concern the core issues of the field; the information obtained by these methods is suitable to address questions of greater generality, pertaining to general patterns of language change as well as the relationship between the linguistic and non-linguistic history of specific human populations.


All code and data used and produced when conducting the case study in Section 3 are available for download and inspection from


References

  • Anthony, D. W. (2010). The horse, the wheel, and language: how bronze-age riders from the Eurasian steppes shaped the modern world. Princeton: Princeton University Press.
  • Atkinson, Q. D., Meade, A., Venditti, C., Greenhill, S. J., & Pagel, M. (2008). Languages evolve in punctuational bursts. Science, 319(5863), 588.
  • Baxter, W. H., & Manaster Ramer, A. (2000). Beyond lumping and splitting. Probabilistic issues in historical linguistics. In C. Renfrew, A. McMahon, & L. Trask (Eds.), Time depth in historical linguistics (Vol. 1, pp. 167–188). Cambridge: McDonald Institute for Archaeological Research.
  • Bergsma, S., & Kondrak, G. (2007). Multilingual cognate identification using integer linear programming. In Proceedings of the RANLP workshop (pp. 656–663).
  • Bouchard-Côté, A., Hall, D., Griffiths, T. L., & Klein, D. (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11), 4224–4229.
  • Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C.-H., Xie, D., … Drummond, A. J. (2014). BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 10(4), e1003537.
  • Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., … Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337(6097), 957–960.
  • Brown, C. H., Holman, E., & Wichmann, S. (2013). Sound correspondences in the world’s languages. Language, 89(1), 4–29.
  • Campbell, L. (1998). Historical linguistics. An introduction. Edinburgh: Edinburgh University Press.
  • Chen, M.-H., Kuo, L., & Lewis, P. O. (2014). Bayesian phylogenetics. Methods, algorithms and applications. Abingdon: CRC Press.
  • Covington, M. A. (1996). An algorithm to align words for historical comparison. Computational Linguistics, 22(4), 481–496.
  • Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1–9.
  • Desper, R., & Gascuel, O. (2002). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of Computational Biology, 9(5), 687–705.
  • Dolgopolsky, A. B. (1986). A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia. In V. V. Shevoroshkin (Ed.), Typology, relationship and time: A collection of papers on language change and relationship by Soviet linguists (pp. 27–50). Ann Arbor: Karoma Publisher.
  • Dunn, M., Greenhill, S. J., Levinson, S., & Gray, R. D. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345), 79–82.
  • Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1989). Biological sequence analysis. Cambridge, UK: Cambridge University Press.
  • Dyen, I., Kruskal, J. B., & Black, P. (1992). An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5), 1–132.
  • Edwards, A. W. F., & Cavalli-Sforza, L. L. (1964). Reconstruction of evolutionary trees. In V. H. Heywood & J. R. McNeill (Eds.), Phenetic and phylogenetic classification (pp. 67–76). London: Systematics Association Publisher.
  • Embleton, S. M. (1986). Statistics in historical linguistics. Bochum: Brockmeyer.
  • Ewens, W., & Grant, G. (2005). Statistical methods in bioinformatics: An introduction. New York: Springer.
  • Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Zoology, 20(4), 406–416.
  • Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155(3760), 279–284.
  • Gascuel, O. (1997). BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7), 685–695.
  • Gray, R. D., & Atkinson, Q. D. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(27), 435–439.
  • Gray, R. D., Drummond, A. J., & Greenhill, S. J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science, 323(5913), 479–483.
  • Gray, R. D., & Jordan, F. M. (2000). Language trees support the express-train sequence of Austronesian expansion. Nature, 405(6790), 1052–1055.
  • Greenhill, S. J., Blust, R., & Gray, R. D. (2008). The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4, 271–283.
  • Haak, W., Lazaridis, I., Patterson, N., Rohland, N., Mallick, S., Llamas, B., … Reich, D. (2015). Massive migration from the steppe was a source for Indo-European languages in Europe. Nature, 522(7555), 207–211.
  • Hall, D., & Klein, D. (2010). Finding cognate groups using phylogenies. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics (pp. 1030–1039). Association for Computational Linguistics.
  • Hammarström, H., Forkel, R., Haspelmath, M., & Bank, S. (2016). Glottolog 2.7. Jena: Max Planck Institute for the Science of Human History. (Available online at, Accessed on 2017-01-29)
  • Haspelmath, M., Dryer, M. S., Gil, D., & Comrie, B. (2008). The World Atlas of Language Structures online. Max Planck Digital Library, Munich. (
  • Hauer, B., & Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th international joint NLP conference (pp. 865–873).
  • Herman, J. (2000). Vulgar Latin. University Park, PA: The Pennsylvania State University Press.
  • Hilpert, M., & Gries, S. T. (2016). Quantitative approaches to diachronic corpus linguistics. In M. Kytö & P. Pahta (Eds.), The Cambridge handbook of English historical linguistics (pp. 36–53). Cambridge University Press.
  • Hogeweg, P., & Hesper, B. (1984). The alignment of sets of sequences and the construction of phyletic trees: an integrated method. Journal of Molecular Evolution, 20(2), 175–186.
  • Hruschka, D. J., Branford, S., Smith, E. D., Wilkins, J., Meade, A., Pagel, M., & Bhattacharya, T. (2015). Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology, 25(1), 1–9.
  • Jäger, G. (2013). Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change, 3(2), 245–291.
  • Jäger, G., & List, J.-M. (2017). Using ancestral state reconstruction methods for onomasiological reconstruction in multilingual word lists. (Manuscript, Tübingen and Jena)
  • Jäger, G., List, J.-M., & Sofroniev, P. (2017). Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics. ACL.
  • Jäger, G., & Sofroniev, P. (2016). Automatic cognate classification with a Support Vector Machine. In S. Dipper, F. Neubarth, & H. Zinsmeister (Eds.), Proceedings of the 13th Conference on Natural Language Processing (Vol. 16, pp. 128–134).
  • Kay, M. (1964). The logic of cognate recognition in historical linguistics. Rand Corporation.
  • Kessler, B. (2001). The significance of word lists. Stanford: CSLI Publications.
  • Kondrak, G. (2002). Algorithms for language reconstruction (Unpublished doctoral dissertation). University of Toronto.
  • Kooperberg, C. (2016). Package ‘logspline’. (version 2.1.9)
  • Kroonen, G. (2013). Etymological dictionary of Proto-Germanic. Leiden, Boston: Brill.
  • Lewis, M. P., Simons, G. F., & Fennig, C. D. (Eds.). (2016). Ethnologue: Languages of the world (Nineteenth ed.). Dallas, Texas: SIL International.
  • List, J.-M. (2012). LexStat: Automatic detection of cognates in multilingual wordlists. In M. Butt & J. Prokić (Eds.), Proceedings of LINGVIS & UNCLH, Workshop at EACL 2012 (pp. 117–125). Avignon.
  • List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.
  • Lowe, J. B., & Mazaudon, M. (1994). The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3), 381–417.
  • MacMahon, A., & MacMahon, R. (2006). Why linguists don’t do dates: evidence from Indo-European and Australian languages. In P. Forster & C. Renfrew (Eds.), Phylogenetic methods and the prehistory of languages (pp. 153–160). Cambridge, UK: McDonald Institute for Archaeological Research.
  • McMahon, A., & McMahon, R. (2005). Language classification by numbers. Oxford: Oxford University Press.
  • Meillet, A. (1954). La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Paris: Honoré Champion. (reprint)
  • Meyer-Lübke, W. (1935). Romanisches etymologisches Wörterbuch. Heidelberg: Carl Winters Universitätsbuchhandlung. (3. Auflage)
  • Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.
  • Nguyen, L.-T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2015). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1), 268–274.
  • Notredame, C., Higgins, D. G., & Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1), 205–217.
  • Oakes, M. P. (2000). Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics, 7(3), 233–243.
  • Pagel, M., Atkinson, Q. D., Calude, A. S., & Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences, 110(21), 8471–8476.
  • Pagel, M., Atkinson, Q. D., & Meade, A. (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature, 449(7163), 717–720.
  • Pietrusewsky, M. (2008). Craniometric variation in Southeast Asia and neighboring regions: a multivariate analysis of cranial measurements. Human Evolution, 23(1–2), 49–86.
  • Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 036106.
  • Rama, T. (2013). Phonotactic diversity predicts the time depth of the world’s language families. PLoS ONE, 8(5), e63238.
  • Rama, T. (2015). Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the North American Association for Computational Linguistics (pp. 1227–1231). Association for Computational Linguistics.
  • Renfrew (1987) Renfrew, C.  (1987). Archaeology and language: the puzzle of Indo-European origins. Cambridge, UK: Cambridge University Press.
  • Ringe (1992) Ringe, D. A.  (1992). On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society, 82(1), 1–110.
  • Ringe et al. (2002) Ringe, D. A., Warnow, T., & Taylor, A.  (2002). Indo-European and computational cladistics. Transactions of the Philological Society, 100(1), 59–129.
  • Ronquist & Huelsenbeck (2003) Ronquist, F., & Huelsenbeck, J. P.  (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12), 1572–1574.
  • Ross & Durie (1996) Ross, M., & Durie, M.  (1996). Introduction. In M. Durie & M. Ross (Eds.), The comparative method reviewed: Regularity and irregularity in language change (pp. 3–38). New York and Oxford: Oxford University Press.
  • Saitou & Nei (1987) Saitou, N., & Nei, M.  (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4), 406–425.
  • Stamatakis (2014) Stamatakis, A.  (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313.
  • Swadesh (1952) Swadesh, M.  (1952). Lexico-statistic dating of prehistoric ethnic contacts. Proceedings of the American Philosophical Society, 96(4), 452–463.
  • Swadesh (1955) Swadesh, M.  (1955). Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21, 121–137.
  • Turchin et al. (2010) Turchin, P., Peiros, I., & Gell-Mann, M.  (2010). Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship, 3, 117–126.
  • Weiss (2015) Weiss, M.  (2015). The comparative method. In C. Bowern & B. Evans (Eds.), The Routledge handbook of historical linguistics (pp. 119–121). Routledge.
  • Wichmann et al. (2016) Wichmann, S., Holman, E. W., & Brown, C. H.  (2016). The ASJP database (version 17).