A Coalescent Prior

Three tree priors and five datasets: A study of the effect of tree priors in Indo-European phylogenetics

Abstract

The age of the root of the Indo-European language family has received much attention since Gray and Atkinson (2003) applied Bayesian phylogenetic methods to it. With the application of new models, the estimated root age has tended to decrease from an age that supports the Anatolian origin hypothesis to an age that supports the Steppe origin hypothesis (Chang et al., 2015). However, none of the published work in Indo-European phylogenetics has studied the effect of tree priors on phylogenetic analyses of the family. In this paper, I intend to fill this gap by exploring the effect of tree priors on different aspects of the Indo-European family’s phylogenetic inference. I apply three tree priors (Uniform, Fossilized Birth-Death (FBD), and Coalescent) to five publicly available datasets of the Indo-European language family. I evaluate the posterior distribution of the trees from the Bayesian analysis using the Bayes Factor, and find that there is support for the Steppe origin hypothesis in the case of two tree priors. I report the median and 95% highest posterior density (HPD) interval of the root ages for all three tree priors. A model comparison suggests that either the Uniform prior or the FBD prior is more suitable than the Coalescent prior for the datasets belonging to the Indo-European language family.

1 Introduction

The Indo-European language family is widely spoken and consists of languages belonging to subgroups such as Albanian, Armenian, Balto-Slavic, Germanic, Greek, Indo-Iranian, and Italo-Celtic. The root age of the Indo-European family has been a heavily debated topic since the application of Bayesian phylogenetic methods to lexical cognate data. It was estimated using phylogenetic methods developed in computational biology (Gray and Atkinson, 2003; Atkinson et al., 2005; Nicholls and Gray, 2008; Ryder and Nicholls, 2011; Bouckaert et al., 2012). These methods employ lexical cognate data (from Swadesh word lists [table 5]; Swadesh, 1952) and external evidence (from archeology and history) regarding both the age of the ancient languages (such as Latin) and the age of the internal subgroups (such as Germanic) to infer the timescale of the Indo-European phylogeny. The work of Gray and colleagues produced root age estimates that supported the Anatolian origin hypothesis (8000–9500 years before present [B.P.]; Renfrew, 1987) of the Indo-European language family. In contrast, historical linguistics, based on cultural and material vocabulary, points to a Steppe origin of the Indo-European language family where the root age falls within the range 5500–6500 years B.P. (Anthony and Ringe, 2015).

In a followup work, Chang et al. (2015) corrected the IELex dataset (Dunn, 2012)—originally compiled by Dyen et al. (1992)—and tested a wide range of models and datasets. Chang et al. (2015) modified the Bayesian phylogenetic inference software BEAST (Drummond et al., 2012) such that the software samples trees that show eight ancient languages—Vedic Sanskrit, Ancient Greek, Latin, Classical Armenian, Old Irish, Old English, Old High German, and Old West Norse—as ancestors of modern descendant languages (table 1). The results of their analysis showed that the estimated median root age of the Indo-European language family falls within the age range that supports the Steppe origin of the Indo-European language family.

Ancient language      Modern descendants
Vedic Sanskrit        Indo-Aryan languages
Ancient Greek         Modern Greek
Latin                 Romance languages
Classical Armenian    Modern Armenian dialects: Adapazar, Eastern Armenian
Old Irish             Irish, Scots Gaelic
Old English           English
Old West Norse        Faroese, Icelandic, Norwegian
Old High German       German, Swiss German, Luxembourgish
Table 1: Ancestry constraints: ancient languages and their descendants employed by Chang et al. (2015).

The phylogenetic dating analyses reported by Bouckaert et al. (2012) and Chang et al. (2015) are based on a coalescent tree prior that employs both the ages of the ancient languages and the internal node ages to infer the dates of all the internal nodes (and the root) of a language tree. The coalescent tree prior described in the context of Bayesian phylogenetic inference by Yang (2014, 309–320) is based on the coalescence process studied by Kingman (1982), and is used to model the spread of viruses or alleles in a population of individuals across time.

The coalescent tree prior cannot model the linguistic reality that an ancient language such as Old English is the ancestor of Modern English. It will infer that both Old English and Modern English descended from an unattested linguistic common ancestor. This observation is the point of departure for the ancestry-constrained analyses reported by Chang et al. (2015). The authors found that constraining an ancient language to be the ancestor of modern language(s) yields a reduced age for the root of the Indo-European language family, which supports the Steppe origin hypothesis.

While discussing their results, Chang et al. (2015) observed that the coalescent tree prior without ancestry constraints does not sample trees where an ancient language can be the ancestor of modern language(s). Therefore, the coalescent tree prior might not be appropriate for modeling the evolution of the Indo-European family. This observation marks the departure point of the analyses reported in this paper, where I explore the effect of tree priors in Indo-European phylogenetics. All the previous phylogenetic studies involving the Indo-European family compare the fit of different substitution models, such as Covarion, Stochastic Dollo, and a binary state Generalized Time Reversible model, and their effect on the inferred ages. However, none of these studies examines the effect of tree priors on the dating of the Indo-European language family.

Therefore, in this paper, I attempt to fill this gap by analyzing all the five publicly available datasets (section 3.1) using FBD tree prior, uniform prior, and constant population size coalescent prior. I perform a Bayes Factor analysis similar to Chang et al. (2015) in section 3.5 and find that the trees inferred with FBD prior (Stadler, 2010, Heath et al., 2014, Gavryushkina et al., 2014, Zhang et al., 2016) and uniform tree prior (Ronquist et al., 2012a) support the Steppe origin hypothesis of the Indo-European languages. Finally, the root’s median age and 95% highest posterior density ages inferred from the coalescent analysis support an Anatolian origin of the Indo-European languages.

Unlike Bouckaert et al. (2012) and Chang et al. (2015), I do not supply the subgroup constraint information to the phylogenetic program beforehand, but allow the phylogenetic program to infer the tree topology along with the divergence times of the internal nodes. I find that the Bayesian phylogenetic program infers known subgroups correctly across tree priors. My experiments with FBD and uniform priors show that ancestry constraints are not necessary to infer support for the Steppe origin of the Indo-European family. I also performed a model comparison based on the Akaike Information Criterion through MCMC (AICM; Baele et al., 2012) and found that both uniform and FBD priors fit better than coalescent tree prior.

The rest of the paper is organized as follows. I will motivate the appropriateness of the FBD prior for the Indo-European family diversification scenario and describe the other tree priors in section 2. I will discuss the datasets, substitution model, tree prior settings, Markov chain Monte Carlo settings, and calculation of Bayes Factor support for the Steppe origin hypothesis vs. the Anatolian origin hypothesis in section 3. I will present the inferred median ages and 95% highest posterior density (HPD) age intervals, Bayes Factors, relevance of ancestry constraints, and quality of inferred trees in section 4. Finally, I will conclude the paper in section 5.

2 Tree priors

In this section, I will describe the three different tree priors used in the paper. First, I describe the coalescent tree prior in section 2.1. Next, I will motivate why FBD tree prior is more suitable than the Coalescent tree prior for the Indo-European family in section 2.2. Finally, I describe the uniform tree prior in section 2.3.

2.1 Constant size coalescent prior

The constant population size coalescent tree prior is dependent on the parameter $\theta$, which is determined by the effective population size $N_e$ and the base clock rate. The probability of a tree under this model is a product over the coalescent intervals $t_j$, where $t_j$ is the time during which there are $j$ lineages ancestral to the sequences in the data. Both $\theta$ and the base clock rate are sampled in this paper. I note that the constant size population prior was also used by Chang et al. (2015, A6,220) to perform an ancestry-constrained phylogenetic analysis which supports the Steppe origin hypothesis. To the best of my knowledge, there is no previous interpretation of the coalescent process in a linguistic scenario. I make the following interpretation when applying the constant size coalescent prior to languages. According to this interpretation, the observed languages are lineages from a large haploid population of individual languages where each language is spoken in a community.
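The role of $\theta$ and the intervals $t_j$ can be illustrated with a minimal sketch. It assumes the textbook parameterization in which an interval with $j$ lineages contributes a factor $(2/\theta)\exp(-j(j-1)t_j/\theta)$; the constant depends on how $\theta$ is scaled, so this is an assumption and not necessarily the exact form used in the analyses. The function name and the dictionary input format are hypothetical.

```python
import math


def coalescent_log_density(waiting_times, theta):
    """Log density of coalescent waiting times under a constant population size.

    waiting_times maps the number of lineages j (2..n) to the time t_j during
    which exactly j lineages exist. The scaling of theta is an assumption.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    log_p = 0.0
    for j, t_j in waiting_times.items():
        if j < 2 or t_j < 0:
            raise ValueError("need j >= 2 and non-negative waiting times")
        # Each interval contributes log(2/theta) - j(j-1) t_j / theta.
        log_p += math.log(2.0 / theta) - j * (j - 1) * t_j / theta
    return log_p


# Three tips: 3 lineages for 0.2 time units, then 2 lineages for 0.5.
example = coalescent_log_density({3: 0.2, 2: 0.5}, theta=4.0)
```

Larger values of $\theta$ make long waiting times, and hence older roots, more probable a priori, which is one way a tree prior can influence the inferred root age.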

2.2 Birth-Death priors

Birth-Death tree priors are used to model lineage diversification and to date the split events within a phylogeny. The standard birth-death prior of Yang and Rannala (1997) is conditioned on the age of the most recent common ancestor ($t_{mrca}$) and assumes that birth ($\lambda$) and death ($\mu$) rates are constant over time. In this model, all the tips in the tree are extant and do not contain any fossils (figure 3). A fossil can be the ancestor of a modern language or can be extinct without leaving any descendants. For instance, Vedic is considered to be the ancestor of all the modern Indo-Aryan languages (table 1), whereas Hittite and Gothic are languages that died out without leaving any descendant.

The birth-death model described by Yang and Rannala (1997) handles incomplete language sampling through the sampling fraction $\rho = n/N$, where $n$ is the number of languages in the sample and $N$ is the total number of extant languages in the family. The birth-death model estimates the species divergence times on a relative scale. The relative times can be converted into a geological time scale by tying one or more internal nodes to known historical or archaeological evidence. It has to be noted that the coalescent process is mathematically different from the birth-death process (Stadler, 2009, 62–63).

In the case of the Indo-European language family, the standard birth-death tree prior of Yang and Rannala (1997) only uses the internal node calibrations (for instance, the information that the Germanic subgroup is about 2200 years old; Chang et al., 2015) to infer the remaining internal nodes’ dates. This procedure is known as node dating and has been used for inferring the phylogeny of Bantu languages (Grollemund et al., 2015) and Turkic languages (Hruschka et al., 2015).

The node dating method does not utilize the available lexical cognate information about attested ancient languages that went extinct (e.g. Gothic) or evolved into modern languages (e.g. Latin). However, the node dating method indirectly uses the age information of extinct languages to apply constraints to the internal node ages of a language family. In another argument against node dating, Ronquist et al. (2012a) noted that if there is more than one fossil in the same language group, then, only the oldest fossil provides the age constraint for the associated internal node. For example, in the case of the Germanic subgroup, there are four fossil languages—Gothic, Old High German, Old English, and Old West Norse—out of which only Gothic’s age information would be used to specify the minimum age of the Germanic subgroup, whereas the rest of the fossil languages cannot provide extra information regarding the age of the Germanic subgroup.

Stadler (2010) proposed an extension to the standard birth-death prior that can handle the placement of ancient languages as tips or as internal nodes (fossils; figure 3). This prior is known as the Fossilized Birth-Death (FBD) prior since it can handle both fossil and extant species in a single model. The FBD family of priors can model the linguistic fact that Old English is the ancestor of Modern English. Along with the parameters $\lambda$ and $\mu$, the FBD prior also features a fossil sampling rate parameter $\psi$, which is the rate at which fossils are observed along a branch. The FBD tree prior requires only the ages of fossils to infer the root age of a tree and is more objective than node dating, which requires internal node age constraints that are not directly observed. The standard birth-death prior conditioned on $t_{mrca}$ is a special case of the FBD prior when $\psi = 0$ (Stadler, 2010, 401). An example of a fossilized birth-death tree is presented in figure 3.


Figure 1: Fossilized Birth-Death tree
Figure 2: Birth-Death tree
Figure 3: The red dots show fossils and the blue dots show the extant languages (Zhang et al., 2016). The left figure shows the FBD tree with fossils as both tips and ancestors of modern languages. The right figure shows the corresponding standard birth-death tree with extant languages. The time axis runs from the present to the age of the most recent common ancestor ($t_{mrca}$).

The left tree in figure 3 shows the FBD tree including lineages with sampled extant and fossil languages, whereas the right tree shows the standard birth-death tree with extant languages.

The probability of a tree under the FBD tree prior is conditioned on $t_{mrca}$ and the nature of extant taxa sampling. In this paper, I assume that the extant taxa are sampled uniformly at random. Unlike Chang et al., who impose ancestry constraints externally, the FBD tree prior can infer the ancestry relations from the data (if such a signal exists), so they do not have to be supplied beforehand. The species sampling probability $\rho$ is determined as the ratio of the number of extant languages in the dataset to the total number of extant Indo-European languages.

The probability of the tree under the FBD model (Stadler, 2010, equation 5) conditioned on the age of the most recent common ancestor ($t_{mrca}$) is given below. Here, $n$ is the number of extant sampled tips, $m$ is the number of extinct sampled tips, $k$ is the number of sampled ancestors with sampled descendants, and $y_i$ is the age of extinct sampled tip $i$.

(1)

Here, the quantities that enter equation 1 are defined as follows:

  • $p_0(t)$ is the probability that an individual present at time $t$ before present has no sampled extinct or extant descendants.

  • $p_1(t)$ is the probability that an individual present at time $t$ before present has only one sampled extant descendant and no sampled extinct descendant.

  • The remaining auxiliary terms, like the closed forms of $p_0(t)$ and $p_1(t)$, follow Stadler (2010).
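The two probabilities defined above can be evaluated numerically. The sketch below uses the closed forms as I recall them from Stadler (2010), with auxiliary constants $c_1$ and $c_2$; these expressions and all function names are assumptions to be checked against that paper, not a transcription of equation 1. Here `lam`, `mu`, `psi`, and `rho` stand for the birth rate, death rate, fossil sampling rate, and extant sampling probability.

```python
import math


def fbd_constants(lam, mu, psi, rho):
    """Auxiliary constants c1 and c2 (assumed forms, after Stadler 2010)."""
    c1 = abs(math.sqrt((lam - mu - psi) ** 2 + 4.0 * lam * psi))
    if c1 == 0.0:
        raise ValueError("degenerate rates: c1 is zero")
    c2 = -(lam - mu - 2.0 * lam * rho - psi) / c1
    return c1, c2


def p0(t, lam, mu, psi, rho):
    """Probability of leaving no sampled extinct or extant descendants."""
    c1, c2 = fbd_constants(lam, mu, psi, rho)
    e = math.exp(-c1 * t)
    frac = (e * (1.0 - c2) - (1.0 + c2)) / (e * (1.0 - c2) + (1.0 + c2))
    return (lam + mu + psi + c1 * frac) / (2.0 * lam)


def p1(t, lam, mu, psi, rho):
    """Probability of exactly one sampled extant and no sampled extinct descendant."""
    c1, c2 = fbd_constants(lam, mu, psi, rho)
    denom = (2.0 * (1.0 - c2 ** 2)
             + math.exp(-c1 * t) * (1.0 - c2) ** 2
             + math.exp(c1 * t) * (1.0 + c2) ** 2)
    return 4.0 * rho / denom
```

A quick sanity check on these forms is the boundary behaviour at the present: an individual alive at $t = 0$ is unsampled with probability $1 - \rho$ and sampled with probability $\rho$, so `p0(0, ...)` and `p1(0, ...)` should return those values.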

FBD tree priors have been used for estimating divergence times for datasets with extant and fossil species (Heath et al., 2014, Gavryushkina et al., 2014, Zhang et al., 2016). Since the Indo-European family has both fossils and extant languages, the FBD tree prior that handles attested fossil ancestors is more suitable than the coalescent tree prior that places fossils as tips. For instance, Tocharian languages went extinct without leaving any modern descendant language, whereas modern Romance languages are the descendants of Latin (an ancient language). Moreover, the data for the Indo-European language family comes from divergent languages and not from a single population. These arguments support the choice of FBD prior over a coalescent prior for modeling the evolution of the Indo-European language family.

2.3 Uniform tree prior

Similar to the coalescent tree prior, the uniform tree prior (Ronquist et al., 2012a) places fossils as tips of the tree. However, the uniform tree prior does not make any assumptions regarding the lineage diversification process. It assumes that the internal nodes’ ages are uniformly distributed between tip ages and the root age. The prior probability of a tree under the uniform model is conditioned on the root age $t_{mrca}$, which is drawn from a prior distribution. Under this model, an interior node age is drawn from a uniform distribution with a tip age as the lower bound and the root age as the upper bound. The probability of a tree under the uniform model is therefore determined, up to a constant, by the root age and the tip ages $t_i$, where $t_i$ is the age of tip $i$.

3 Methods

In this section, I describe the datasets, prior settings, inference procedure details, and calculation of the Bayes Factor.

3.1 Data

Language               Age Prior    Language               Age Prior
Hittite                             Old High German^A
Old Irish^A                         Tocharian B
Classical Armenian^A                Tocharian A
Ancient Greek^A                     Lycian
Luvian                              Old Prussian
Vedic Sanskrit^A                    Umbrian
Old English^A                       Avestan
Old Persian                         Gothic
Latin^A                             Old Norse^A
Oscan                               Old Church Slavonic
Cornish                             Sogdian
Table 2: Calibration dates for the ancient/medieval languages. All dates are given as years before present (BP). The superscript A denotes those languages that are assumed to be ancestors of extant languages by Chang et al. (2015).

All five datasets used in this paper (B1, B2, Broad, Medium, and Narrow) were assembled from IELex by Chang et al. (2015). The B1 dataset is derived from Bouckaert et al. (2012) and consists of 207 meanings for 103 languages. The B2 dataset consists of 97 languages and is a subset of the B1 dataset. The B2 dataset is obtained after discarding six languages (Lycian, Oscan, Umbrian, Old Persian, Luvian, and Kurdish) that are attested in fewer than 50% of the meanings.

The Broad dataset consists of 94 languages and 197 meaning classes. The Broad dataset is corrected for cognate judgments in the Indo-Iranian subgroup and also has an extra medieval language, Sogdian, which is not present in B1. Ten meanings that are susceptible to sound symbolism and have poor coverage in terms of number of languages are also removed from the Broad dataset (Chang et al., 2015, 213). The Medium dataset is a subset of the Broad dataset and is assembled in such a way that the languages and meanings with poor coverage are excluded. The Medium dataset has 82 languages and 143 meanings. The Narrow dataset is a subset of the Medium dataset and consists of only those modern languages that have an attested ancestor. This selection leaves the Narrow dataset with 52 languages.

3.2 Substitution models

Bayesian phylogenetics originated in evolutionary biology and works by inferring the evolutionary relationships (trees) between DNA sequences of species. The same method can also be applied to binary (morphological) traits of species (Yang, 2014). Linguistic data is binary trait data where each column in the trait matrix is a cognate class. Words that belong to a given cognate class are coded as 1 in its column; otherwise, they are coded as 0. For example, among English, German, French, Spanish, and Swedish, the words for ALL in German [alə] and Swedish [ˈalːa] belong to the same cognate set as English [ɔːl], while French [tu] and Spanish [toðo] belong to a different cognate set. The binary trait matrix for these languages for the meaning ALL is shown in table 5. If a language is missing in a cognate set, then the entry for that language is coded as ?, and is ignored in the calculation of the likelihood using the pruning algorithm (Felsenstein, 2004, 255). I used a Generalized Time Reversible model (equivalent to an F81 model in the case of binary traits) with ascertainment bias correction (Felsenstein, 1992; Lewis, 2001) for all unobserved 0 columns. The rate variation across sites is modeled using a discrete Gamma model with four rate categories (Yang, 1994), where the shape parameter of the Gamma distribution is drawn from an exponential prior.

Language    ALL          AND
English     ɔːl (1)      ænd (1)
German      alə (1)      ʊnt (1)
French      tu (2)       e (2)
Spanish     toðo (2)     i (2)
Swedish     ˈalːa (1)    ɔkː (3)
Table 3: Forms and cognate classes

Language    ALL     AND
English     1 0     1 0 0
German      1 0     1 0 0
French      0 1     0 1 0
Spanish     0 1     0 1 0
Swedish     1 0     0 0 1
Table 4: Binary Matrix

Table 5: Excerpt from the meaning list showing cognate classes (table 3) and the binary cognate matrix (table 4) for the meanings ALL and AND in five languages. The number in parentheses indicates the cognate class; words with the same number are cognate.
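The coding scheme can be sketched in a few lines. The function below is a hypothetical helper, not part of any phylogenetic package: it takes cognate class labels per meaning and produces one binary column per cognate class, with ? for languages that lack a form for that meaning.

```python
def binarize(languages, cognates):
    """Turn cognate class labels into binary character strings.

    cognates maps meaning -> {language: class label}; a language that is
    absent (or mapped to None) for a meaning is coded as '?' in all of
    that meaning's columns.
    """
    columns = []  # one (meaning, class label) pair per binary column
    for meaning, classes in cognates.items():
        labels = sorted({c for c in classes.values() if c is not None})
        columns.extend((meaning, label) for label in labels)

    matrix = {}
    for lang in languages:
        row = []
        for meaning, label in columns:
            observed = cognates[meaning].get(lang)
            if observed is None:
                row.append("?")
            else:
                row.append("1" if observed == label else "0")
        matrix[lang] = "".join(row)
    return columns, matrix


languages = ["English", "German", "French", "Spanish", "Swedish"]
cognates = {
    "ALL": {"English": 1, "German": 1, "French": 2, "Spanish": 2, "Swedish": 1},
    "AND": {"English": 1, "German": 1, "French": 2, "Spanish": 2, "Swedish": 3},
}
columns, matrix = binarize(languages, cognates)
```

With the five languages above, `matrix["Swedish"]` is `"10001"`, matching the last row of the binary matrix.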

3.3 Tree prior settings

In this paper, I assume that the extant languages are randomly sampled. The FBD tree prior depends on the number of extant languages in the sample. I estimated the number of extant Indo-European languages (400) from Glottolog (Nordhoff and Hammarström, 2011), and set the sampling probability $\rho$ accordingly for each dataset. For the FBD prior, the net diversification rate is drawn from an exponential prior, the relative extinction rate (turnover) is drawn from a Beta(1,1) prior, and the fossil sampling probability is also drawn from a Beta(1,1) prior.
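The three FBD quantities that receive priors here are transformations of the underlying birth, death, and fossil sampling rates. A minimal sketch of the conversion follows, assuming the parameterization described by Zhang et al. (2016): net diversification $d = \lambda - \mu$, turnover $r = \mu/\lambda$, and fossil sampling probability $s = \psi/(\mu + \psi)$. The function names and the example count of 90 extant languages are hypothetical.

```python
def fbd_rates(d, r, s):
    """Convert (d, r, s) to (lam, mu, psi).

    Assumes d = lam - mu, r = mu / lam, and s = psi / (mu + psi).
    """
    if d <= 0 or not 0 <= r < 1 or not 0 <= s < 1:
        raise ValueError("need d > 0, 0 <= r < 1, 0 <= s < 1")
    lam = d / (1.0 - r)        # from d = lam * (1 - r)
    mu = r * lam               # from r = mu / lam
    psi = s * mu / (1.0 - s)   # from s = psi / (mu + psi)
    return lam, mu, psi


def sampling_probability(n_extant_sampled, n_extant_total=400):
    """Extant sampling probability rho; 400 is the Glottolog-based estimate."""
    return n_extant_sampled / n_extant_total


lam, mu, psi = fbd_rates(d=0.2, r=0.5, s=0.5)
rho = sampling_probability(90)  # hypothetical count of extant languages
```

Placing Beta(1,1) priors on $r$ and $s$ keeps both ratios inside the unit interval, which is why this parameterization is convenient for sampling.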

I draw the root age $t_{mrca}$ from a uniform distribution with a fixed lower and upper bound in the case of the FBD and uniform priors. The upper bound is fixed at an age that is more than double the upper bound of the age limit of the Anatolian origin hypothesis. In fact, none of the inferred trees’ root ages are close to this upper bound. The coalescent prior, as implemented in MrBayes, is not conditioned on $t_{mrca}$. All the fossils’ age priors were drawn from uniform distributions whose age ranges are given in table 2.

In the case of the coalescent prior, the population parameter $\theta$ is drawn from a Gamma distribution and the base clock rate is drawn from an exponential prior. In all the analyses, I use an Independent Gamma Rate (IGR) model (Lepage et al., 2007), where each branch rate is drawn from a Gamma distribution whose parameters depend on $b_j$, the branch length of branch $j$, which is computed as the product of geological (or calendar) time and the base clock rate. The variance parameter of the IGR model is drawn from an exponential prior. I do not employ topology constraints and allow the software to infer the Indo-European phylogeny along with the time scale from the data.

3.4 Markov chain Monte Carlo sampling

I ran all the experiments using the MrBayes software. I ran two independent runs (each run consisted of one cold chain and two hot chains) and verified that the average standard deviation of split frequencies (Ronquist et al., 2012b) between the two runs fell below the convergence threshold. I ran all the analyses for 20–80 million states and thinned the chain by sampling at regular intervals to reduce auto-correlation between the sampled states. For each dataset, I discarded the initial 25% of the states as burn-in and generated a 50% majority rule consensus tree from the remaining 75% of the states (Felsenstein, 2004, chapter 30).

3.5 Evaluating Steppe vs. Anatolian Hypothesis

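For each dataset, I ran the MrBayes software twice: once without cognate data to generate a prior sample of trees and once with cognate data to generate a posterior sample of trees. Then, I used the Bayes Factor (BF) formulation from Chang et al. (2015) to calculate the support for the Anatolian (A) and Steppe (S) hypotheses respectively. Given data $D$, the Bayes factor is calculated as follows:

$$\mathrm{BF} = \frac{P(D \mid t_R \in S)}{P(D \mid t_R \in A)} \tag{2}$$

where $S$ and $A$ represent the ranges of Steppe and Anatolian ages and $t_R$ denotes the root age of a tree, which is $t_{mrca}$ in the case of the FBD prior. The numerator and denominator in equation 2 are computed as follows:

$$P(D \mid t_R \in S) \propto \frac{P(t_R \in S \mid D)}{P(t_R \in S)}, \qquad P(D \mid t_R \in A) \propto \frac{P(t_R \in A \mid D)}{P(t_R \in A)} \tag{3}$$

The constant of proportionality, $P(D)$, is the same in both expressions and cancels in equation 2. The numerators in equation 3 correspond to the fraction of trees in the posterior sample for which $t_R \in S$ and $t_R \in A$. The denominators correspond to the fraction of trees in the prior sample for which $t_R \in S$ and $t_R \in A$. Following the interpretation of the Bayes Factor by Kass and Raftery (1995), the support for the Steppe origin hypothesis is very strong if $\mathrm{BF} > 150$, strong if $20 < \mathrm{BF} \le 150$, positive if $3 < \mathrm{BF} \le 20$, not worth more than a bare mention (neutral) if $1 \le \mathrm{BF} \le 3$, and negative if $\mathrm{BF} < 1$.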
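The calculation can be sketched as follows, assuming the root ages of the prior and posterior tree samples have already been extracted into plain lists. The function names are hypothetical; the age ranges are the Steppe (5500–6500) and Anatolian (8000–9500) intervals from section 1, and the labels follow the Kass and Raftery (1995) scale.

```python
def fraction_in(ages, interval):
    """Fraction of sampled root ages that fall inside a closed interval."""
    low, high = interval
    return sum(low <= age <= high for age in ages) / len(ages)


def bayes_factor(prior_ages, posterior_ages,
                 steppe=(5500, 6500), anatolian=(8000, 9500)):
    """Steppe vs. Anatolian Bayes factor from sampled root ages.

    Returns None when a required fraction is zero and the ratio is undefined.
    """
    prior_s = fraction_in(prior_ages, steppe)
    prior_a = fraction_in(prior_ages, anatolian)
    post_s = fraction_in(posterior_ages, steppe)
    post_a = fraction_in(posterior_ages, anatolian)
    if prior_s == 0 or prior_a == 0 or post_a == 0:
        return None
    return (post_s / prior_s) / (post_a / prior_a)


def interpret(bf):
    """Verbal label for the support for the Steppe hypothesis."""
    if bf > 150:
        return "very strong"
    if bf > 20:
        return "strong"
    if bf > 3:
        return "positive"
    if bf >= 1:
        return "neutral"
    return "negative"


# Toy samples, for illustration only.
prior = [6000, 6200, 7000, 7200, 7400, 7600, 8200, 8500, 9000, 9400]
posterior = [5600, 5800, 6000, 6100, 6300, 6400, 7000, 7100, 7300, 8100]
bf = bayes_factor(prior, posterior)
```

Returning `None` instead of raising an error mirrors the situation where a sample contains no tree with a root age in one of the two ranges, so that no Bayes factor can be reported.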

4 Results

In this section, I present and discuss the root’s median age and 95% HPD age intervals, fit of tree prior, Bayes Factor support for the Steppe vs. the Anatolian hypotheses, comparison of subgroups’ inferred dates with expert dates, relevance of clade constraints, and ancestry constraints.

4.1 Median and 95% HPD ages

Dataset   95% HPD                                   Median age
          FBD          Coalescent    Uniform        FBD     Coalescent   Uniform
B1        6244–8766    8370–11695    5760–8115      7512    9821         6789
B2        6150–8430    7590–10913    5536–7986      7177    9133         6738
broad     5591–7585    6654–9327     5073–6947      6551    7984         5935
medium    5942–7921    7070–9818     5395–7392      6845    8345         6339
narrow    5790–7984    6826–9791     5423–7646      6826    8228         6462
Table 6: Columns 2–4 show the 95% Highest Posterior Density (HPD) and columns 5–7 show the median ages (in years before present) of the root node from the consensus tree for each dataset and a tree prior.

Table 6 shows the HPD intervals and median root ages for all dataset and tree prior combinations. None of the reported HPD age intervals lies completely within the Steppe age interval or the Anatolian age interval. The lower bounds of the HPD intervals in the case of FBD and uniform priors fall within the Steppe interval, whereas the lower bound of the coalescent prior’s HPD interval falls beyond the Steppe age interval. In the case of the narrow and medium datasets, the median root age is further reduced to 6826 and 6845 years respectively in the case of the FBD prior. The median ages inferred by the FBD prior belong neither to the Steppe hypothesis interval nor to the Anatolian hypothesis interval for all the datasets. The median ages inferred by the uniform prior for the Broad, Medium, and Narrow datasets lie within the range of the Steppe interval. All the priors infer median ages that lie beyond the Steppe interval in the case of the B1 and B2 datasets. The coalescent prior infers root ages that lie within the Anatolian interval for all the datasets except the B1 dataset. Across all the priors, the median root ages decrease when the datasets are corrected for errors. The decreasing trend in the median ages is similar to the trend observed in Chang et al. (2015).
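The summaries in table 6 are of the kind sketched below, where the HPD interval is taken as the shortest interval that contains the requested share of sorted samples. This is a generic illustration with hypothetical function names and toy numbers; the summarizing software may compute the interval differently in detail.

```python
import math
import statistics


def discard_burnin(samples, fraction=0.25):
    """Drop the initial fraction of an MCMC sample as burn-in."""
    start = int(len(samples) * fraction)
    return samples[start:]


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = min(n, max(1, math.ceil(mass * n)))  # samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Toy root ages, for illustration only.
root_ages = [9000, 8800, 6100, 6400, 6300, 6900, 6600, 7200]
kept = discard_burnin(root_ages)
median_age = statistics.median(kept)
low, high = hpd_interval(kept, mass=0.8)
```

Unlike an equal-tailed interval, the HPD interval need not be symmetric around the median, which is why the medians in table 6 do not sit at the midpoints of their intervals.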

Why does the broad dataset yield younger ages?

Chang et al. (2015) argue that sparsely attested languages can influence the chronology estimates. They observe that the ascertainment bias correction to the likelihood calculation (Felsenstein, 1992) accounts for cognate sets that are not observed in the data but does not account for the missing entries in a dataset. For example, if 50% of the data is missing for a language, then the ascertainment bias correction does not account for the missing 50%. If there are unique cognate sets in the observed 50% of the data, then there is a possibility that the unobserved 50% also has unique cognate sets that do not enter the likelihood calculation.

The likelihood calculation would only consider the observed unique cognate sets, thereby underestimating the true number of unique cognate sets for a language in a dataset. For this reason, a language with a higher number of missing entries is treated as more conservative (that is, as having undergone fewer character changes) than it should be. This is particularly true for languages such as Hittite and Tocharian A & B, which have a larger share of missing entries in the broad dataset than in the medium dataset. Since both Hittite and the Tocharian doculects are very close to the root of the Indo-European tree, this underestimation of the number of unique cognate sets leads to shorter branch lengths, which causes the median root age to be younger. Both coalescent and FBD tree priors infer a younger age for the broad dataset than for the medium and narrow datasets.

Why does the B2 dataset yield younger ages?

The B1 dataset features six sparsely attested languages (Lycian, Oscan, Umbrian, Old Persian, Luvian, and Kurdish) where more than 50% of the meanings are unattested. As explained in the previous paragraph, the inclusion of sparsely attested languages causes the Bayesian inference program to underestimate the root age. The opposite happens when a language has more unique cognates than it should have. This is the case for Luvian, where 33% of the attested cognate sets are erroneously coded as unique cognate sets although they are cognate with either Hittite or Lycian. This erroneous coding causes the Bayesian software to treat Luvian, which is one internal node away from the root node, as having evolved more and to posit longer branches, thereby pushing the root age of the tree away from the Steppe age interval. The B2 dataset excludes the six sparsely attested languages, including the erroneously coded Luvian, which lowers the median root age in the posterior sample. This effect is clearly observed in both the median root age and the 95% HPD age range for the B2 dataset. The median root age is pushed roughly 335 years downwards (from 7512 to 7177 years), towards the Steppe hypothesis, when the FBD tree prior is applied to the B2 dataset. The coalescent prior also infers a younger median age for the B2 dataset than for the B1 dataset, whereas the uniform prior is hardly influenced by the six sparsely attested languages.

4.2 Which tree prior is the best?

Tree Prior          B1            B2            broad         medium        narrow
Uniform Prior       94002.748*    90299.551     89269.61*     50769.888     32162.117*
FBD Prior           94005.297     90297.721*    89270.359     50764.79*     32163.007
Coalescent Prior    94117.099     90396.491     89374.335     50917.074     32241.019
Table 7: AICM values for each of the datasets. The lower the value, the better the model’s fit to the data. The best fitting model’s AICM value for each dataset is marked with an asterisk. The values are computed using Tracer (Rambaut et al., 2013).

I determine the best model through the Akaike Information Criterion through MCMC (AICM; Baele et al., 2012). It has to be noted that Bouckaert et al. (2012) employ both the harmonic mean and AICM to perform model comparison. In this paper, I only use AICM, since it is more accurate than the harmonic mean estimator, which is unstable. On the other hand, methods such as stepping stone sampling (Xie et al., 2010) and thermodynamic integration (Lartillot and Philippe, 2006), which are used to estimate the marginal likelihood, are more accurate than AICM but are computationally intensive: they require a multiple (usually set to 10) of the computation needed for the original MCMC runs (Yang, 2014, 258–259).

The AICM values for each dataset and tree prior are presented in table 7. The results show that the Uniform tree prior fits best for the B1, Broad, and Narrow datasets, whereas the FBD prior fits best for the B2 and Medium datasets. The difference between the AICM values of the Uniform and FBD priors is almost negligible in the case of the Broad and Narrow datasets. The coalescent prior shows the highest AICM value for every dataset and differs by a large margin from the FBD and Uniform priors. Since the uniform tree prior has fewer parameters than the FBD prior, I suggest that any future phylogenetic experiment should test the uniform tree prior as a baseline before testing more parameter-rich priors such as the FBD or Coalescent priors.
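For readers unfamiliar with AICM, a minimal sketch follows. It assumes the usual definition, in which AICM is twice the sample variance of the posterior log-likelihoods minus twice their mean; Tracer's own implementation may differ in detail, for instance in how it estimates uncertainty. The function names are hypothetical, and the dictionary simply restates the values of table 7.

```python
import statistics


def aicm(log_likelihoods):
    """AICM from post-burn-in log-likelihood samples (lower is better).

    Assumes AICM = 2 * variance - 2 * mean, using the sample variance.
    """
    mean = statistics.fmean(log_likelihoods)
    variance = statistics.variance(log_likelihoods)
    return 2.0 * variance - 2.0 * mean


# AICM values restated from table 7.
AICM_TABLE = {
    "B1": {"Uniform": 94002.748, "FBD": 94005.297, "Coalescent": 94117.099},
    "B2": {"Uniform": 90299.551, "FBD": 90297.721, "Coalescent": 90396.491},
    "broad": {"Uniform": 89269.61, "FBD": 89270.359, "Coalescent": 89374.335},
    "medium": {"Uniform": 50769.888, "FBD": 50764.79, "Coalescent": 50917.074},
    "narrow": {"Uniform": 32162.117, "FBD": 32163.007, "Coalescent": 32241.019},
}


def best_prior(dataset):
    """Tree prior with the lowest AICM value for a dataset."""
    scores = AICM_TABLE[dataset]
    return min(scores, key=scores.get)


best = {dataset: best_prior(dataset) for dataset in AICM_TABLE}
```

Because the variance term grows with the effective number of parameters, AICM penalizes a prior that fits no better but lets the likelihood fluctuate more across the posterior sample.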

4.3 Bayes Factor for Steppe vs. Anatolia

Dataset   FBD                  Coalescent   Uniform
B1        0.138 (Negative)     **           67.043 (Strong)
B2        1.015 (Neutral)      **           1022.968 (Very Strong)
broad     88.624 (Strong)      *            6728.994 (Very Strong)
medium    18.536 (Positive)    **           113.968 (Strong)
narrow    16.55 (Positive)     *            27.549 (Strong)
Table 8: Bayes Factor support for the Steppe origin vs. the Anatolian origin across different datasets and tree priors. * represents an entry where there is no tree in the prior sample with a root age that falls within the Steppe range. ** indicates those datasets that have neither a posterior nor a prior root age within the Steppe range.

I present the results of the Bayes factor (BF) analysis in table 8. In the case of the FBD prior, the BF results support the Steppe origin hypothesis for the Broad, Medium, and Narrow datasets; the support is neutral for the B2 dataset and negative for the B1 dataset. The corrected datasets thus clearly support the Steppe hypothesis in terms of the Bayes Factor in the case of the FBD prior. In the case of the uniform prior, all the datasets support the Steppe origin hypothesis over the Anatolian origin hypothesis. In the case of the coalescent prior, the Bayes Factor could not be calculated since there is no tree in either the prior or the posterior sample that has a root age belonging to the age range of the Steppe hypothesis. Overall, the interpretation of the strength of the Bayes Factor analysis suggests that appropriate tree priors and corrected datasets support the Steppe origin hypothesis of the Indo-European language family.

4.4 Internal node ages

In this subsection, for each dataset, I compare the inferred dates for the language subgroups with the historically attested dates given in table 9. The uniform tree prior, on average, overestimates the ages for all the datasets except the narrow dataset. The predicted ages from the uniform tree prior come closest to the historical ages in the case of the medium dataset. In contrast, Chang et al. present younger ages for both the narrow (100 years on average) and medium datasets (330 165 years).

Subgroup                   Historical Age  B1                B2                Broad             Medium            narrow
Germanic                   2250            2876 [2286-3572]  2816 [2256-3458]  2615 [2147-3166]  2449 [2031-2935]  2334 [1943-2807]
Romance                    1750            2987 [2400-3629]  2149 [1628-2714]  1980 [1515-2493]  1841 [1401-2345]  1736 [1309-2248]
Scandinavian               1500            1523 [1127-2016]  1469 [1102-1906]  1340 [1024-1697]  1164 [898-1477]
Slavic                     1500            1860 [1401-2423]  1822 [1378-2309]  1647 [1301-2069]  1575 [1226-1972]
East Baltic                1300            1584 [914-2356]   1561 [936-2265]   1465 [891-2086]   1460 [892-2115]
British Celtic             1250            1732 [1105-2402]  1687 [1137-2343]  1537 [1024-2093]  1450 [955-2011]
Modern Irish/Scots Gaelic  1050            1058 [530-1615]   1052 [589-1620]   967 [523-1442]    834 [451-1260]    829 [442-1290]
Persian-Tajik              750             882 [424-1412]    842 [386-1360]    819 [409-1250]    704 [336-1098]
Average difference                         -394              -256              -127              -15.875           50.33
Table 9: The first column shows the name of the language subgroup and the second column shows the ages based on historical events. The rest of the columns show the uniform prior’s median ages (in years before present) and 95% HPD (the numbers in square brackets) age intervals across the five datasets. The last row shows the average difference between historical ages and predicted ages. The historical ages are obtained from Chang et al. (2015, 226). The East Baltic group consists of Lithuanian and Latvian. The British Celtic group consists of Cornish, Breton, and Welsh. The Romance group consists of all the Romance languages except Latin.
Figure 4: The majority-rule consensus tree inferred using the uniform prior for the broad dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
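The "Average difference" row of table 9 can be reproduced with a few lines of Python. The sketch below takes the difference as historical age minus median predicted age, averaged over the subgroups that have an entry for the dataset. It assumes that the rows with only four reported values lack the narrow entry, which is the reading under which the reported averages are recovered. The variable and function names are illustrative only.

```python
# Historical ages and median predicted ages copied from table 9.
HISTORICAL = {
    "Germanic": 2250, "Romance": 1750, "Scandinavian": 1500,
    "Slavic": 1500, "East Baltic": 1300, "British Celtic": 1250,
    "Modern Irish/Scots Gaelic": 1050, "Persian-Tajik": 750,
}

MEDIAN_AGES = {
    "medium": {
        "Germanic": 2449, "Romance": 1841, "Scandinavian": 1164,
        "Slavic": 1575, "East Baltic": 1460, "British Celtic": 1450,
        "Modern Irish/Scots Gaelic": 834, "Persian-Tajik": 704,
    },
    # Assumption: only three subgroups have a narrow entry.
    "narrow": {
        "Germanic": 2334, "Romance": 1736,
        "Modern Irish/Scots Gaelic": 829,
    },
}


def average_difference(historical, predicted):
    """Mean of (historical age - predicted median age) over predicted subgroups."""
    diffs = [historical[group] - age for group, age in predicted.items()]
    return sum(diffs) / len(diffs)


for dataset, predicted in MEDIAN_AGES.items():
    print(dataset, round(average_difference(HISTORICAL, predicted), 3))
```

A negative average means that the predicted ages are, on average, older than the historical ages.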

4.5 Relevance of clade constraints

Both Bouckaert et al. (2012) and Chang et al. (2015) constrain the topologies in the tree search through clade constraints. For instance, a Germanic clade constraint means that the Bayesian software samples only those trees that place all the Germanic languages under a single node. The two studies do not follow the same set of topological constraints when inferring the dates of the Indo-European language family. Chang et al. (2015) apply a stricter set of constraints, derived from linguistic knowledge of the Indo-European language family, than Bouckaert et al. (2012). In this paper, I do not employ any clade constraints and allow the software to infer the tree topology automatically from the datasets. 11

In figure 4, I present the majority-rule consensus tree inferred using the uniform prior for the broad dataset. 12 The majority-rule consensus tree retrieves the well-established language subgroups such as Balto-Slavic, Greek, Indo-Iranian, Germanic, and Italo-Celtic correctly. I observe that all the consensus trees (appendix) retrieve the subgroups correctly without these subgroups being supplied as constraints to the phylogenetic software.
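To make the notion of an unconstrained subgroup being "retrieved" concrete, the following Python sketch computes the support of every clade in a sample of trees and keeps those that occur in more than 50% of the sample, as in a majority-rule consensus tree. It is a toy illustration under simplifying assumptions: trees are written as nested tuples of leaf names, the four-language sample is invented, and the function names are hypothetical. It is not the procedure implemented in MrBayes.

```python
from collections import Counter


def collect_clades(tree, out):
    """Add the leaf set of every internal node of `tree` to `out`.

    A tree is either a leaf name (str) or a tuple of subtrees.
    Returns the leaf set of `tree` itself.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.add(leaves)
    return leaves


def clade_support(trees):
    """Fraction of sampled trees in which each clade occurs."""
    counts = Counter()
    for tree in trees:
        found = set()
        collect_clades(tree, found)
        counts.update(found)
    return {clade: n / len(trees) for clade, n in counts.items()}


def majority_clades(trees, threshold=0.5):
    """Clades whose support exceeds the threshold."""
    return {c: s for c, s in clade_support(trees).items() if s > threshold}


# Invented toy sample of three trees over four languages.
toy_trees = [
    ((("English", "German"), "Latin"), "Hittite"),
    (("English", "German"), ("Latin", "Hittite")),
    ((("English", "Latin"), "German"), "Hittite"),
]
support = clade_support(toy_trees)
majority = majority_clades(toy_trees)
print(sorted((sorted(c), round(s, 2)) for c, s in majority.items()))
```

In this toy sample the pair English and German is recovered in two of the three trees without any constraint, which is the sense in which a subgroup is inferred directly from the data.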

Position of Anatolian and Tocharian languages

There is a general consensus among Indo-European scholars that the Anatolian language group was the first branch to split from the Proto-Indo-European stage, after which the Tocharian language group was the second to split off from the post-Anatolian Indo-European languages (Ringe et al., 2002). In fact, Chang et al. supply this linguistic knowledge as two constraints to the Bayesian software: a Nuclear Indo-European group consisting of all the non-Anatolian languages, and an Inner Indo-European group consisting of all the Nuclear Indo-European languages excluding the Tocharian languages. I observe that the majority consensus trees constructed from the analyses with the uniform tree prior always group the Anatolian and Tocharian languages as distinct subgroups unified under the same internal node, which is directly connected to the root node. This is also true of the majority consensus tree inferred when the coalescent tree prior is applied to the B2 dataset. The majority consensus trees constructed from the FBD tree prior’s analyses always show that the Anatolian languages were the first to split off, followed by the branching of the Tocharian languages from the post-Anatolian Indo-European complex. This observation also holds for the majority consensus trees inferred with the coalescent tree prior applied to the B1, broad, medium, and narrow datasets.

In conclusion, the majority consensus trees suggest that the well-established Indo-European subgroups can be inferred directly from the data and need not be supplied beforehand. The exact placement of the well-established subgroups with respect to each other within the Inner Indo-European clade is a topic of research among scholars and has yet to be determined to full satisfaction (Anthony and Ringe, 2015).

4.6 Relevance of ancestry constraints

Chang et al. (2015) introduced ancestry constraints into their phylogenetic analysis, which then supported the Steppe origin hypothesis. The FBD prior can be used to verify whether the ancestry constraints can be inferred from the data, since it can infer whether an ancient language is an ancestral language or a tip in the tree. However, the majority-rule consensus trees inferred from all the datasets using the FBD tree prior do not show any support for the ancestry relationships enforced as constraints by Chang et al. (2015). I examined the log files of the MCMC runs and found that the MCMC proposal move (delete-branch) in MrBayes, which supports the placement of an ancient language as an internal node, was never accepted during the MCMC sampling. I conclude that, at least for trees inferred from lexical datasets, the FBD prior does not infer any of the ancestry relations employed by Chang et al. (2015).

5 Conclusion

In this paper, I addressed the question of the effect of tree priors in Bayesian phylogenetic analysis and found the following.

  • The model comparison results suggest that both the Uniform and FBD priors show a better fit to the datasets of the Indo-European language family than the coalescent prior. Based on the Bayes Factor analysis, I conclude that the Steppe hypothesis is supported by the FBD and Uniform priors for the majority of the datasets.

  • The FBD tree prior does not infer any ancestry relation from any of the datasets, suggesting that the lexical datasets used in the paper do not carry a signal for ancestry relations.

  • I also observe that the Bayesian inference program can infer well-established subgroups correctly from the data and that these subgroups need not be supplied beforehand.

  • Finally, the experiments reported in the paper suggest that appropriate tree priors and corrected cognacy judgments are important for estimating the phylogeny and the age of the Indo-European language family.

Acknowledgments

The paper would not have been possible without the continuous support of Igor Yanovich, Søren Wichmann, Chris Bentz, Gerhard Jäger, Johann-Mattis List, Richard Johansson, Lilja Øvrelid, Sowmya Vajjala, Çağrı Çöltekin, and Aparna Subhakari. I thank Remco Bouckaert, Johannes Wahle, Armin Buch, Johannes Dellert, Marisa Köllner, Roland Mühlenbernd, and Vijayaditya Peddinti for all the comments and discussions that improved the paper. Finally, I thank the anonymous reviewers for their comments; one of the reviewers provided extensive comments regarding the models and results, which helped improve the paper. All the remaining errors are mine. The author is supported by BIGMED and ERC Advanced Grant 324246 EVOLAEMP, which is gratefully acknowledged.

Appendix A Coalescent Prior

Figure 5: The majority-rule consensus tree for the B1 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 6: The majority-rule consensus tree for the B2 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 7: The majority-rule consensus tree for the broad dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 8: The majority-rule consensus tree for the medium dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 9: The majority-rule consensus tree for the narrow dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.

Appendix B FBD Prior

Figure 10: The majority-rule consensus tree for the B1 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 11: The majority-rule consensus tree for the B2 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 12: The majority-rule consensus tree for the broad dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 13: The majority-rule consensus tree for the medium dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 14: The majority-rule consensus tree for the narrow dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.

Appendix C Uniform Prior

Figure 15: The majority-rule consensus tree for the B1 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 16: The majority-rule consensus tree for the B2 dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 17: The majority-rule consensus tree for the medium dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.
Figure 18: The majority-rule consensus tree for the narrow dataset. The numbers at each internal node show the support for the subtree in the posterior sample. The blue bars show the 95% HPD intervals for the node ages. The time scale shows the height of the tree in terms of age.

Footnotes

  1. The scripts, the data files, and the results of the paper are available at https://github.com/PhyloStar/ie-phylo-exps.
  2. I discovered a bug in the MrBayes implementation of the coalescent prior that caused the Metropolis-Hastings ratio to be calculated incorrectly. My implementation is available at https://github.com/PhyloStar/mrbayes-coal.
  3. This interpretation is due to Igor Yanovich.
  4. To be precise, the scholars used a pure birth (Yule) process with , a special case of birth-death process, to estimate the divergence times of the internal node splits in the Bantu language family phylogeny.
  5. Hruschka et al. (2015) use cognate sets from etymological dictionary where the reflexes within a cognate set need not have the same meaning. This approach is different from the phylogenetic approaches used in this and other papers, where the cognates are root-meaning pairs derived from Swadesh lists (Chang et al., 2015, 201).
  6. One of the reviewers asked why I did not experiment with the CoBL database (http://www.shh.mpg.de/207610/cobldatabase). The database is not publicly available for experiments.
  7. All the datasets are available at http://muse.jhu.edu/article/576999/file/supp02.zip.
  8. Available at http://mrbayes.sourceforge.net/.
  9. A 50% majority consensus tree is a summary tree that consists of only those clades that occur in more than 50% of the post burn-in sample of trees.
  10. I also present the inferred phylogenies, posterior support, and HPD intervals of the internal nodes for all the tree priors and datasets in the appendix.
  11. I note that the clade constraint information is derived from historical linguistics research that is limited to language families such as Indo-European, Dravidian, Uralic, Austronesian, and Sino-Tibetan with a long tradition of classical comparative linguistic research (Campbell and Poser, 2008).
  12. All the trees presented in this paper are visualized using FigTree (Rambaut, 2016).

References

  1. Anthony, David W and Don Ringe. 2015. The Indo-European homeland from linguistic and archaeological perspectives. Annu. Rev. Linguist. 1(1): 199–219.
  2. Atkinson, Quentin, Geoff Nicholls, David Welch, and Russell Gray. 2005. From words to dates: Water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society 103(2): 193–219.
  3. Baele, Guy, Philippe Lemey, Trevor Bedford, Andrew Rambaut, Marc A Suchard, and Alexander V Alekseyenko. 2012. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Molecular biology and evolution 29(9): 2157–2167.
  4. Bouckaert, Remco, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science 337(6097): 957–960.
  5. Campbell, Lyle and William J. Poser. 2008. Language classification: History and Method. Cambridge University Press.
  6. Chang, Will, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language 91(1): 194–244.
  7. Drummond, Alexei J, Marc A Suchard, Dong Xie, and Andrew Rambaut. 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular biology and evolution 29(8): 1969–1973.
  8. Dunn, Michael. 2012. Indo-European lexical cognacy database (IELex). Nijmegen: Max Planck Institute for Psycholinguistics.
  9. Dyen, Isidore, Joseph B. Kruskal, and Paul Black. 1992. An Indo-European classification: A lexicostatistical experiment. Transactions of the American Philosophical Society 82(5): 1–132.
  10. Felsenstein, Joseph. 1992. Phylogenies from restriction sites: A maximum-likelihood approach. Evolution 46(1): 159–173.
  11. Felsenstein, Joseph. 2004. Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates.
  12. Gavryushkina, Alexandra, David Welch, Tanja Stadler, and Alexei J Drummond. 2014. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Computational Biology 10(12): e1003919.
  13. Gray, Russell D. and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965): 435–439.
  14. Grollemund, Rebecca, Simon Branford, Koen Bostoen, Andrew Meade, Chris Venditti, and Mark Pagel. 2015. Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences 112(43): 13296–13301.
  15. Heath, Tracy A, John P Huelsenbeck, and Tanja Stadler. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. Proceedings of the National Academy of Sciences 111(29): E2957–E2966.
  16. Hruschka, Daniel J, Simon Branford, Eric D Smith, Jon Wilkins, Andrew Meade, Mark Pagel, and Tanmoy Bhattacharya. 2015. Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25(1): 1–9.
  17. Kass, Robert E and Adrian E Raftery. 1995. Bayes Factors. Journal of the American Statistical Association 90(430): 773–795.
  18. Kingman, John Frank Charles. 1982. The coalescent. Stochastic processes and their applications 13(3): 235–248.
  19. Lartillot, Nicolas and Hervé Philippe. 2006. Computing Bayes factors using thermodynamic integration. Systematic Biology 55(2): 195–207. doi:10.1080/10635150500433722.
  20. Lepage, Thomas, David Bryant, Hervé Philippe, and Nicolas Lartillot. 2007. A general comparison of relaxed molecular clock models. Molecular biology and evolution 24(12): 2669–2680.
  21. Lewis, Paul O. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic biology 50(6): 913–925.
  22. Nicholls, Geoff K and Russell D Gray. 2008. Dated ancestral trees from binary trait data and their application to the diversification of languages. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(3): 545–566.
  23. Nordhoff, Sebastian and Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science, vol. 783.
  24. Rambaut, Andrew. 2016. Figtree v1.6. URL http://tree.bio.ed.ac.uk/software/figtree/.
  25. Rambaut, Andrew, Alexei J Drummond, and Marc Suchard. 2013. Tracer. URL http://tree.bio.ed.ac.uk/software/tracer/.
  26. Renfrew, Colin. 1987. Archaeology and language: The puzzle of Indo-European origins. London: Cape.
  27. Ringe, Don, Tandy Warnow, and Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society 100(1): 59–129.
  28. Ronquist, Fredrik, Seraina Klopfstein, Lars Vilhelmsen, Susanne Schulmeister, Debra L Murray, and Alexandr P Rasnitsyn. 2012a. A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera. Systematic Biology 61(6): 973–999.
  29. Ronquist, Fredrik, Maxim Teslenko, Paul van der Mark, Daniel L Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc A Suchard, and John P Huelsenbeck. 2012b. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61(3): 539–542.
  30. Ryder, Robin J and Geoff K Nicholls. 2011. Missing data in a stochastic Dollo model for binary trait data, and its application to the dating of Proto-Indo-European. Journal of the Royal Statistical Society: Series C (Applied Statistics) 60(1): 71–92.
  31. Stadler, Tanja. 2009. On incomplete sampling under birth–death models and connections to the sampling-based coalescent. Journal of Theoretical Biology 261(1): 58–66.
  32. Stadler, Tanja. 2010. Sampling-through-time in birth–death trees. Journal of Theoretical Biology 267(3): 396–404.
  33. Swadesh, Morris. 1952. Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96(4): 452–463.
  34. Xie, Wangang, Paul O Lewis, Yu Fan, Lynn Kuo, and Ming-Hui Chen. 2010. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology 60(2): 150–160.
  35. Yang, Ziheng. 1994. Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution 39(1): 105–111.
  36. Yang, Ziheng. 2014. Molecular Evolution: A Statistical Approach. Oxford: Oxford University Press.
  37. Yang, Ziheng and Bruce Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Molecular biology and evolution 14(7): 717–724.
  38. Zhang, Chi, Tanja Stadler, Seraina Klopfstein, Tracy A. Heath, and Fredrik Ronquist. 2016. Total-evidence dating under the fossilized birth–death process. Systematic Biology 65(2): 228–249. doi:10.1093/sysbio/syv080.