Full Reconstruction of Non-Stationary Strand-Symmetric Models on Rooted Phylogenies

# Full Reconstruction of Non-Stationary Strand-Symmetric Models on Rooted Phylogenies

## Abstract

Understanding the evolutionary relationship among species is of fundamental importance to the biological sciences. The location of the root in any phylogenetic tree is critical as it gives an order to evolutionary events. None of the popular models of nucleotide evolution used in likelihood or Bayesian methods are able to infer the location of the root without exogenous information. It is known that the most general Markov models of nucleotide substitution can also not identify the location of the root or be fitted to multiple sequence alignments with less than three sequences. We prove that the location of the root and the full model can be identified and statistically consistently estimated for a non-stationary, strand-symmetric substitution model given a multiple sequence alignment with two or more sequences. We also generalise earlier work to provide a practical means of overcoming the computationally intractable problem of labelling hidden states in a phylogenetic model.

Keywords:  Identifiability; Phylogenetic inference; Markov process; Maximum likelihood

## 1 Introduction

The location of the root in a molecular phylogeny has contributed to criminal convictions (González-Candelas et al. 2013), been used to understand the source and epidemiology of human viruses (Podsiadlo and Polz-Dacewicz 2013), determined how biodiversity conservation resources were distributed (Faith and Baker 2006), been used to develop potential HIV vaccines (Nickle et al. 2003), and played an important role in our understanding of the tree of life (Murphy et al. 2007). While an unrooted phylogenetic tree can be used to infer relatedness between species, without the location of the root we know nothing of the order of evolutionary events. It might then be surprising that none of the commonly used Markov models of nucleotide or codon substitution can be used to identify the location of the root without incorporating information that is exogenous to the model. The class of Markov models to which we refer will be made precise in the next section.

ModelTest is one of the most popular pieces of software for selecting phylogenetic models of character substitution (Posada 2008). It allows users to determine which of 88 time-reversible (hereafter reversible) substitution models best fits their data. By definition, for a reversible model the location of the root in a phylogeny cannot change the probability distribution of the observed data, the column frequencies. Some software (Knight et al. 2007) allows users to fit non-stationary models of character substitution such as that of Barry and Hartigan (1987a). Unfortunately, the theoretical results that exist around fully general models (Chang 1996) explicitly state that for such models the location of the root is not statistically identifiable, that one cannot use such models to ask where on an edge a root resides, and that it is possible to reformulate the model so that any node is the root.

In practice, the location of the root is usually determined by declaring that a specific taxon in a phylogeny is an outgroup or by making a molecular clock assumption (Felsenstein 2004). The first method assumes that the location of the root is already known, that it is on the edge connected to the outgroup. The second method comes in varying degrees of complexity. In its most simple form it assumes that the tree is ultrametric, that the genetic distance from the root node to each tip is identical. In more sophisticated Bayesian approaches the location of the root enters the calculation as part of the prior distribution of tree topologies and branch lengths, so that the tree is not necessarily ultrametric but that in some sense the evolutionary time from the root of the tree to the tips is the same along every lineage (Drummond and Rambaut 2007).

A third method is to use a substitution model that is able to identify the location of the root. Such a model must not be reversible, but using a model that is not reversible does not automatically ensure that the model is able to identify the root. This statement is easily justified using the findings in Chang (1996), where a model that is non-stationary, so also non-reversible, is not able to recover the root. That a non-reversible model might not be even theoretically able to discover the root seems to have been missed by some authors.

Yang and Roberts (1995) fitted a non-stationary model to rooted topologies of real data using maximum likelihood and found that the location of the root of the tree had a significant effect on likelihood estimates. This is useful empirical evidence but the authors made no attempt to prove that their model is identifiable.

Huelsenbeck et al. (2002) fitted a non-reversible but stationary model to real and simulated data and found that while the outgroup and molecular clock methods were able to recover the location of the root in many cases, their model was not. Again, they made no effort to show that their model is theoretically capable of recovering the location of the root, so the poor performance of their model is not necessarily a reflection on the ability of all substitution models to recover the location of the root.

Yap and Speed (2005) systematically reproduced the results in Yang and Roberts (1995) and Huelsenbeck et al. (2002) and, with some small discrepancies with the earlier studies, found again that a non-stationary model was able to make statistically significant inferences about the location of the root, but that a non-reversible model did little better than a reversible model. This also was empirical research that left unanswered questions about whether any of the models were theoretically able to identify the location of the root.

The contribution of the present work is to constructively prove that there is a non-stationary substitution process that identifies the location of the root that can be statistically consistently estimated from data. Indeed, the model is shown to be consistently estimable for two taxa. This is not possible for general non-stationary processes (Chang 1996; Bonhomme et al. 2014), so we make the additional assumption that the process is strand-symmetric; that the process of evolution is identical on the sense and antisense strands of DNA. The conditions of the proof ensure that the process is non-stationary, and we show that a non-reversible, stationary model is not identifiable so cannot be consistently estimated. This observation sheds some light on the success of non-stationary processes and the failure of non-reversible, stationary processes at detecting the root in the literature.

Much has been written about the biological mechanisms that result in nucleotide substitution processes being strand asymmetric and there is now substantial empirical evidence to support strand asymmetry’s existence in nature. Touchon and Rocha (2008) provide a good review of the subject. Strand asymmetry seems to be a localised phenomenon, existing on the scale of genes rather than genomes, and appears to be common in prokaryote and organelle genomes but not in eukaryotes. Nucleotide compositional asymmetry is the most common measure used for statistical inference. Under very loose assumptions (Lobry 1995; Lobry and Lobry 1999) a strand-symmetric process should result in the proportions of As and Ts being equal and the proportions of Gs and Cs being equal on a single strand. Strand asymmetry can also be inferred by directly comparing estimated rates of nucleotide substitution, although most of the evidence seems to come from counting substitutions in ancestral state reconstructions based on maximum parsimony (eg. Wu and Maeda 1987; Bulmer 1991; Francino and Ochman 2000).

Strand-symmetric models have been used in a maximum likelihood context, although rarely for the purpose of establishing whether strand symmetry is a reasonable assumption. Yap and Pachter (2004) comment that the reversible models that they fitted to real data seemed to exhibit strand symmetry. Squartini and Arndt (2008) fitted a continuous-time, non-stationary, strand-symmetric model on a known, rooted, four-taxon topology to two genome-scale data sets. They comment that their model is not identifiable on the edges incident to the root, but that it is identifiable on the other two edges. This is not strictly correct. As stated in Chang (1996, Remark 2), the labelling of states at the internal nodes is not automatically identifiable even for a continuous-time model. Also, as the mapping from the discrete-time process considered in Chang (1996) to the continuous-time process fitted in Squartini and Arndt (2008) is not always unique (Higham 2008, Section 2.3), continuous-time models are not identifiable under the results in Chang (1996) without further constraints.

Strand-symmetric substitution models have also been approached from a theoretical perspective (Casanellas and Sullivant 2005; Jarvis and Sumner 2013) in the context of reversible models. Jarvis and Sumner (2013) make the observation that strand-symmetric models enjoy the property of closure; that a model which is strand-symmetric on two adjacent phylogenetic branches is strand-symmetric across the two branches as well. It is straightforward to show that this property extends to the non-stationary processes that we consider here.

The proofs in this work mirror those in Chang (1996), but apart from adding the assumption of strand symmetry and removing the assumption of an unrooted tree, we also relax an important assumption that Chang (1996) makes about the structure of the model. As noted in Zou et al. (2011) and addressed in Mossel and Roch (2006) the model of Barry and Hartigan (1987a) is only identifiable up to an arbitrary relabelling of states at internal nodes of the phylogeny. Chang (1996) addresses this problem by assuming that the transition probability matrix for every edge in the topology is reconstructible from rows. As we shall demonstrate, this assumption can be restrictive in practice, so, motivated by Remark 4 in Chang (1996), we partially relax it.

In Section 2 we briefly introduce the necessary notation and theoretical context. Section 3 contains the main results of the paper, where we prove that the full topology and parameters of a discrete-time, non-stationary, strand-symmetric Markov model can be recovered from the pairwise joint probability distributions of states between extant taxa. In this section we also extend the result to continuous-time models which are used more commonly in practice. Section 4 proves that the results in Section 3 provide the necessary basis for consistent statistical estimation of the models in question for multiple sequence alignments of increasing length. Section 5 gives some concluding remarks.

## 2 Markov Models on Trees: Definitions and Notation

We consider a finite set of extant taxa whose phylogenetic history we wish to infer. The history is modelled as a tree which is a set of nodes which represent extant or ancestral species and undirected edges which represent genetic descent. The edges are a set of unordered pairs of nodes, so if , an edge exists between nodes and . The nodes consist of the terminal nodes, which are just , and the internal nodes , so that . The internal nodes represent ancestral taxa at branching points.

The degree of a node is the number of edges incident on that node. We say that a tree is unrooted if it contains no internal nodes with degree less than three. A tree is rooted if it contains a single node with degree two, which we call the root. We do not consider trees with more than one internal node of degree two.

The model of character substitution is properly considered a probabilistic graphical model defined on the tree. That is, we associate a random variable with each node so that represents the history of a single column in a multiple sequence alignment. Each is independent of all other , conditional on the states at the nodes neighbouring . As we will focus on the assumption of strand symmetry, we will assume that . We also assume that each column in the alignment is an independent observation of the multivariate random variable .

The model is then specified by a marginal probability row 4-vector , where for a fixed node , and a transition probability matrix for each edge , where . As we are interested in rooted tree topologies, it is convenient to characterise the model by where is the root and the where and is further than from . In this context we will sometimes characterise each as the result of a continuous-time Markov substitution process on , in which case , where is the matrix exponential and is a transition rate matrix.

The statistical challenge is then, for a fixed set of terminal nodes and a set of observations of , which are the columns in our multiple sequence alignment, to show that as tends to infinity our estimates of the tree topology and the probabilistic parameters tend to the true values of the generating model.

## 3 Identifiability of the Full Model

### 3.1 Identifiability with a Rooted Two-Taxon Topology

It has been known at least since Chang (1996) that a full Markov model is not identifiable for a rooted topology with two terminal taxa without additional assumptions. We shall now prove, under mild conditions, that a non-stationary, strand-symmetric model is identifiable for this topology. This result will provide the foundation for showing that rooted topologies of any size are recoverable, and that the full model is identifiable using joint distributions of states between pairs of taxa.

###### Definition 3.1.

We call a matrix strand-symmetric if it takes the form

 M=⎛⎜ ⎜ ⎜ ⎜⎝δαβγζϕηθθηϕζγβαδ⎞⎟ ⎟ ⎟ ⎟⎠.

Note that if a transition probability matrix is strand-symmetric in this sense, it only fits the usual definition of strand symmetry if the states are properly ordered. Several such orderings are possible. One is .

###### Definition 3.2.

We say that a probability 4-vector is compositionally asymmetric if it has non-zero elements and , , , and .

We note that any stationary marginal distribution of a strand-symmetric process violates all four of the conditions of compositional asymmetry, provided that the states are again appropriately ordered as or similar. However, some distributions that are not the stationary distribution of a strand-symmetric process are included in the set of compositionally asymmetric distributions. The reason for this will become apparent in the following proof.

###### Lemma 3.1.

Take a two-taxon discrete-time Markov model that is defined by a root distribution and two probability transition matrices and . Define . Assume

[label=(),ref=]
1. and are strand-symmetric;

2. is compositionally asymmetric; and

3. and are invertible.

Then , , and are uniquely determined by the joint probability matrix up to one of eight permutations of the states at node . That is, the model is identifiable up to a suitable reordering of internal states.

###### Example 3.1.

It is interesting that a non-reversible, stationary, strand-symmetric model is not identifiable for two taxa, as can be shown by counterexample. Take such a model defined by

 πm =(0.200.300.300.20)and Pma =Pmb=⎛⎜ ⎜ ⎜⎝0.380.250.210.160.160.430.270.140.140.270.430.160.160.210.250.38⎞⎟ ⎟ ⎟⎠

that yields the joint probability distribution . The model defined by

 πm =(0.200.300.300.20), Pma =⎛⎜ ⎜ ⎜⎝0.350.340.120.190.150.460.240.150.150.240.460.150.190.120.340.35⎞⎟ ⎟ ⎟⎠,andPmb=⎛⎜ ⎜ ⎜⎝0.430.270.190.110.070.390.310.230.230.310.390.070.110.190.270.43⎞⎟ ⎟ ⎟⎠

is also non-reversible, stationary, and strand-symmetric, and yields the same joint probability distribution . Both models satisfy all constraints of Lemma 3.1 except assumption 2. Generating such examples is straightforward, and a Python script for generating random examples is included in the ancillary material. The above example was generated by that script and rounded to two decimal places.

###### Proof of Lemma 3.1.

Single out the permutation matrix

 S=⎛⎜ ⎜ ⎜⎝0001001001001000⎞⎟ ⎟ ⎟⎠,

which has the effect of reversing the order of rows or columns, depending on the direction of multiplication.

First note that if a matrix is strand-symmetric, then . Also, and .

By assumptions 1 and 3 and because ,

 S(Jab)−1SJab =S(Pma)−1SS(Πm)−1((Pmb)⊺)−1S(Pmb)⊺SSΠmPma =(Pma)−1S(Πm)−1SΠmPma =(Pma)−1diag(πm(1)πm(4)πm(2)πm(3)πm(3)πm(2)πm(4)πm(1))Pma.

The final line is an eigendecomposition of whose left-eigenvectors are the rows of . The eigenvectors of are unique by assumption 2, so the eigendecomposition is unique up to scaling of the eigenvectors. As the rows of must sum to one, these scaling factors are uniquely determined.

We have shown that the rows of can be uniquely recovered from . The order of these rows is not identifiable without further assumptions, so for the moment we will say that can be identified up to a set of one of eight permutations of the rows such that the resulting is strand-symmetric in form. Then, by assumption 3, and , so and are also identifiable up to one of eight permutations of their elements and rows that correspond to the permutation chosen for . ∎

### 3.2 Identifiability of the Topology

In this section we build on a result from Chang (1996) to demonstrate that, under mild conditions, the rooted tree topology is identifiable under a non-stationary, strand-symmetric Markov model from the pairwise distributions of states at terminal nodes.

We first need to also import an equivalence relation between tree topologies. Let and be trees with the same set of terminal nodes . We say that and are equivalent if there is a bijective “relabelling” function from to that is the identity function for the terminal nodes and such that the edges are obtained by applying the function to the nodes in the edges . That is, the topologies and are equivalent if they are the same up to a possible relabeling of internal nodes.

The following is Proposition 3.1 in Chang (1996):

###### Proposition 3.1.

Consider a family of Markov models satisfying the following conditions:

1. The edge transition matrices are invertible and not equal to a permutation matrix.

2. There is a node with for each , that is, each character state has positive marginal probability at .

Then the unrooted topology is identifiable from the joint distributions of character states at pairs of terminal nodes. That is, if two models in the family induce the same pairwise distributions of character states at their terminal nodes, then the topologies of those two models must be equivalent.

As stated in Chang (1996), Proposition 3.1 follows by combining a result in Buneman (1971) with the additive function defined on pairs of nodes as

 f({r,s})=−logdetPrs−logdetPsr.

The function is additive in the sense that , where is the set of edges joining any two nodes and . Chang attributed the distance to Barry and Hartigan (1987b) and Cavender and Felsenstein (1987). It is unclear whether he was aware that the above definition of is equivalent to the paralinear genetic distance measure introduced by Lake (1994). In any case we will prefer to formulate this function almost equivalently as

 f({r,s})=−logdet(PrsPsr), (3.1)

because

 det(PrsPsr)=det((diagπr)−1(Psr)⊺diagπsPsr)≥0,

where and are the marginal probability vectors at and respectively, so that the sign of becomes irrelevant. The result in Buneman (1971) shows that if the values of an additive function are known between the pairs of all terminal nodes, then under the conditions of Proposition 3.1, the topology and the value of the function on all edges of the topology are determined. As can be calculated from pairwise distributions between terminal nodes, so the topology and the value of on every edge can be determined.

###### Corollary 3.1.

Suppose the evolutionary tree has nodes of degree three or one, with the exception of one special node which we designate the root. The root can have degree one, two, or three. Assume that

[label=(),ref=]
1. is compositionally asymmetric for every internal node .

Assume also that for each edge , where is further from the root than ,

[label=(),ref=,resume]
1. is invertible,

2. is not a permutation matrix, and

3. is strand-symmetric.

Then the rooted topology of the tree is recoverable from pairwise distributions of states at terminal nodes. That is, if two models in the family induce the same pairwise distributions of character states at the terminal nodes, then the topologies recovered by those two models must be equivalent.

###### Proof.

Under assumptions 1, 1, and 3, Lemma 3.1 can be applied to any pair of terminal nodes and to obtain and , where is the most recent common ancestor of and , up to a consistent permutation of their elements and rows. We can then calculate , and so . Note that the value of is independent of the chosen permutation of by (3.1).

Now fix and calculate where is the value of for and the most recent common ancestor of and , and are the terminal nodes. The root is then the most recent common ancestor of and .

By assumptions 1, 1, and 2, Proposition 3.1 applies and the unrooted topology is identifiable. As for Proposition 3.1, we can also determine the value of for every edge in the unrooted topology. As the most recent common ancestor of and must lie on the path from to , then gives us the address of the root on that path. That is, we have determined the rooted topology and the value of for every edge in the topology. ∎

### 3.3 Pairwise Distributions Determine the Model

We now approach the main result in Theorem 3.1. We show that a non-stationary, strand-symmetric model is fully identifiable given pairwise joint state distributions between all taxa. Up to this point we have made limited assertions about the labelling of states at internal nodes, but we will now make stronger assumptions to remove this ambiguity. One value of knowing this labelling is that it will allow us to fit continuous-time models.

Start by defining two sets of matrices, the first of which is previously defined in Chang (1996).

###### Definition 3.3.

A set of matrices is reconstructible from rows if for each and each permutation matrix , we have .

As we mentioned in the introduction, the assumption in Chang (1996) that every transition probability matrix belong to a class of matrices that is reconstructible from rows is used to identify the labelling of states at internal nodes, which otherwise would not be identifiable.

###### Example 3.2.

To illustrate the abstract concept of a set of matrices that is reconstructible from rows, we borrow an example from (Chang 1996). We say that a square matrix is diagonal largest in column (DLC) if for all . The set of DLC matrices is reconstructible from rows.

While DLC serves as a good means for conceptualising a set of matrices that is reconstructible from rows, it has also been shown to be useful as an empirical test for model identifiability (Kaehler et al. 2015). As remarked in Chang (1996), it is not the only set of matrices that is reconstructible from rows but might be a reasonable assumption. Unfortunately, as branch lengths increase, the assumption of DLC becomes less reasonable.

###### Example 3.3.

Under very weak assumptions, as branch length increases the transition probability matrix on that branch must tend towards its stationary limit. That is, all its rows must tend towards equality. Rows that are closer to equality are in some sense less likely to satisfy DLC. To illustrate the point,

 P =⎛⎜⎝0.50.30.20.10.40.50.10.10.8⎞⎟⎠is DLC. P2 =⎛⎜⎝0.30.290.410.140.240.620.140.150.71⎞⎟⎠is not DLC.

However, it is an important observation that there is no way of permuting the rows of the second matrix in the above example so that it is DLC. We are therefore motivated to coin the following term.

###### Definition 3.4.

A set of matrices is sympathetic to a set of matrices if it contains all matrices such that for every permutation matrix , we have .

Note that if is reconstructible from rows then . To slightly abuse the terminology, sympathetic matrices are useful because they provide matrices that might not be reconstructible from rows, but that will not contradict the ordering of states of internal nodes that is suggested by matrices that are.

###### Example 3.4.

The second matrix in Example 3.3 is not DLC but is from the set of matrices that is sympathetic to the set of DLC matrices.

The following theorem extends and restricts Theorem 4.1 in Chang (1996). It makes an additional assumption of strand symmetry, but in so doing is able to accommodate rooted topologies. Motivated by Remark 4 in Chang (1996), it relaxes the assumption that every edge must be reconstructible from rows. Theorem 4.1 in Chang (1996) allows degenerate topologies, where nodes are allowed to have degree greater than three. We assume nondegenerate topologies for ease of exposition and because we will only make that assumption later anyway when we prove that the model can be statistically consistently estimated, as Chang does in his Theorem 5.1. We note that extension of the following theorem to degenerate topologies should be straightforward.

###### Theorem 3.1.

Assume the conditions for Corollary 3.1. Define as a set of matrices that is reconstructible from rows and as the set of matrices that are sympathetic to . Further assume that

[label=(),ref=,start=5]
1. there exists a path from every internal node to a terminal node such that for each edge in the path, where is closer to than , , and that

2. for every edge in the topology.

Then the full model is identifiable. That is, the topology and all of the transition probability matrices are uniquely determined by the joint distribution of character states at the terminal nodes of the tree.

###### Remark 3.1.

The relaxation of the assumption that every transition probability matrix must be reconstructible from rows to the assumption that every matrix must be sympathetic to those which are might not seem important, but in fact it greatly increases the range of models to which the theorem applies. By allowing sympathetic transition probability matrices, we are in effect allowing long branches in the tree as long as there are sufficient short branches to consistently reconstruct the labelling of states at internal nodes.

###### Proof of Theorem 3.1.

We have from Corollary 3.1 the full rooted topology. It is enough to then determine a transition probability matrix for each edge. We will proceed by induction.

If the topology has two terminal nodes and one node is the root then the full model is trivially determined.

If the topology has two terminal nodes and neither node is the root, then by assumptions 1, 1, and 3 we can apply Lemma 3.1 to recover the model up to a permutation of labels of internal states. At least one transition probability matrix must be reconstructible from rows by assumption 1 and the other at worst will not give an alternative ordering by assumption 2, so the ordering of states at the internal node can be recovered and the full model is determined.

If the topology has more than two terminal nodes, choose an arbitrary terminal node and denote its neighbouring node . We treat two cases separately.

If is not the root, choose another terminal node such that is the most recent common ancestor of and . By assumptions 1, 1, and 3, we can apply Lemma 3.1 to and to determine up to a permutation of labels of internal states. By assumption 2 we can deduce whether , and if it is we can infer the correct permutation to recover . Otherwise, restart this iteration with a different .

If is the root, choose two other terminal nodes and such that the most recent common ancestor of and is . Again by assumptions 1, 1, and 3, we can apply Lemma 3.1 to and to obtain a stochastic matrix and a diagonal matrix such that and for some permutation matrix . Then . Once more if , we infer the correct matrix and we know . Otherwise, restart this iteration with a different .

So we are able to determine for a terminal node and its neighbour . The induction step is then to remove the edge from the tree, and make a terminal node in two subtrees, or one subtree if happens to be the root. For any subtree, for any terminal node of the subtree . This step can be repeated until all subtrees have two taxa and the full model is recovered on all edges. ∎

### 3.4 Continuous-Time Models

We will now show that Theorem 3.1 is still applicable in almost all cases if we further restrict the model to be continuous-time. We state this explicitly as some care is required because the matrix logarithm can have multiple roots. Continuous-time models provide more information than discrete-time models, for instance for questions concerning genetic distance and relative rates of evolution.

###### Theorem 3.2.

Make the assumptions of Theorem 3.1. Additionally assume that for every , where is further from the root than , there exists a unique mapping where every off-diagonal element of is non-negative. Also replace assumption 3 with the assumption that is strand-symmetric. Then the full model is identifiable from the joint distribution of states at the terminal nodes of the tree.

###### Remark 3.2.

The restriction that the mapping be unique may seem like a barrier to implementation of this model. It is difficult to imagine a sufficiently general parametrisation of that would guarantee this property. However, empirical evidence suggests that it may be better to beg forgiveness than to ask permission (Kaehler et al. 2015). Software exists for checking whether a Markov generator enjoys this property, and it was found that it was rarely necessary to actually enforce this constraint for the data sets considered in Kaehler et al. (2015).

###### Proof.

Again set

 S=⎛⎜ ⎜ ⎜⎝0001001001001000⎞⎟ ⎟ ⎟⎠.

Note that if for a square matrix , then for . This follows from the expansion and that because for example . The assumption that is strand-symmetric therefore also ensures that assumption 3 of Theorem 3.1 is satisfied.

By Theorem 3.1, the mapping from the joint distribution of states at the terminal nodes of the tree to the root marginal probability distribution, the tree topology, and the transition probability matrices is unique, so by assumption this property is also enjoyed by the transition rate matrices . ∎

## 4 Reconstruction from Data: Consistency of Maximum Likelihood

As an application of the above identifiability results, we shall now prove that the above non-stationary, strand-symmetric models can be statistically consistently estimated from multiple sequence alignments. That is, as the length of the alignment increases, the inferred model must in some sense converge to the correct model.

We will need to use the following definition from Chang (1996), where denotes the closure of the set of matrices under the Euclidean metric.

###### Definition 4.1.

We say that a set of matrices is strongly reconstructible from rows if, for each and each permutation matrix , we have .

We also define the following set, where is the open ball of radius centred at under the Euclidean metric.

###### Definition 4.2.

A set of matrices is strongly sympathetic to a set of matrices if it contains all matrices such that for every permutation matrix , we have for some .

###### Theorem 4.1.

Let be a set of Markov models on trees that have a fixed set of terminal nodes. Suppose the models satisfy the assumptions of Theorem 3.1, with the exception of assumptions 1 and 2, which we restrict slightly to instead assume that from every internal node there exists a path to a terminal node such that for every edge in the path, , where is further from than , and for every edge , , is strongly reconstructible from rows, and is strongly sympathetic to .

Then the method of maximum likelihood consistently recovers the topology, root marginal probability distribution, and edge transition probability matrices. That is, for any , let denote the maximum likelihood estimate based on independent observations of character states at the terminal nodes of the tree, then

 ^θn\tiny a.s.−−−−−−−→θ% asn→∞.
###### Theorem 4.2.

Let be a set of Markov models on trees that have a fixed set of terminal nodes. Suppose the models satisfy the assumptions of Theorem 4.1. Additionally assume that for every , where is further from the root than , that there exists a unique mapping , where every off-diagonal element of is non-negative. Also replace assumption 3 from Theorem 3.1 with the assumption that .

Then the method of maximum likelihood consistently recovers the topology, root marginal probability distribution, and edge transition rate matrices. That is, for any , let denote the maximum likelihood estimate based on independent observations of character states at the terminal nodes of the tree, then as .

The following is a restatement of Lemma 5.1 in Chang (1996), where a sketch of its proof is provided.

###### Lemma 4.1.

Let be a finite set and let be a set of probability distributions on , where the closure of is a compact subset of a metric space. Let be independent and identically distributed random variables (or vectors) with probability distribution for some . Assume the identifiability condition

 Pθ≠Pθ0for % eachθ∈¯¯¯¯Θwithθ≠θ0.

Suppose that for each the function is continuous on and let maximise the log likelihood over . Then under , as .

###### Proof of Theorem 4.1.

The proof of Theorem 5.1 in Chang (1996) carries to this context with only slight modification. We need to show that the conditions of Lemma 4.1 are implied by the assumptions of Theorem 4.1 with and for .

For , Theorem 3.1 gives that if . It remains to show for and that if .

The proof of Theorem 5.1 in Chang (1996) shows that the assumptions that is invertible and not a permutation matrix for every edge must be satisfied for a model if for some , by virtue of the form of . The set of strand-symmetric matrices is closed, so the assumption that each such is strand-symmetric must be satisfied for all . Finally, if compositional asymmetry is violated at any internal node for a , then for at least one pair of terminal nodes and , the eigendecomposition of will have at least one repeated eigenvalue. That is, if for some , then for we must have that all internal nodes satisfy compositional asymmetry.

So far we have shown that if for and then the assumptions of Corollary 3.1 are satisfied by both models and so they must have the same rooted topology.

It remains to show that if for and then the transition probability matrices for are the same as for . We will show that the inductive steps of Theorem 3.1 still apply in this context. As the assumptions of Corollary 3.1 are satisfied for , Lemma 3.1 can be safely applied, so the only remaining question is the ordering of rows of the transition probability matrices.

Take a transition probability matrix from the model and the matrix on the corresponding edge from . Firstly assume . For some permutation matrix , , but if , as is strongly reconstructible from rows. We must also check that . By construction, and , so and . Next assume that . Again , but . In this case the order of rows of is decided by the order of those on neighbouring edges, and must be the same for and .

It is therefore possible to modify the proof of Theorem 3.1 such that it ensures that for and , if

We are therefore able to apply Lemma 4.1 and the model can be estimated consistently under maximum likelihood. ∎

###### Proof of Theorem 4.2.

This theorem follows from the proofs of Theorems 3.2 and 4.1. The only difficulty is to check whether there might exist a which is part of the model and a corresponding which is a part of the model such that but . This could only occur if the transition probability matrix from the same edge and model mapped to multiple valid Markov generators. As for the proof of Theorem 4.1, all transition probability matrices are uniquely determined by and as we must have that for the corresponding in , so by assumption must be unique and . ∎

## 5 Discussion

Assumption 1 of Corollary 3.1 implies that the process under consideration is non-stationary and we subsequently find that we can recover the root. We further show in example 3.1 that in the stationary, non-reversible case it is not possible to recover the root. This seems to fit with the collective empirical evidence of Yang and Roberts (1995), Huelsenbeck et al. (2002), and Yap and Speed (2005). That is, in some cases a non-stationary process can recover the root, whereas a stationary, non-reversible process cannot. Assumption 1 is also the source of a subtle and interesting contradiction. It can be shown under mild assumptions that the product of arbitrary stochastic matrices will eventually have almost identical rows as the number of terms in the product increases (Seneta 2006, Section 4.3). It is not difficult to show that the rows must be the stationary distribution of the product, and that if the terms are strand symmetric then the product must be strand symmetric. So for any reasonably long history of strand-symmetric processes, we must have that the marginal probabilities are also strand-symmetric. The conclusion is that for our model to work, the process should probably have been strand asymmetric prior to the root of the tree, and strand symmetric thereafter. An alternative interpretation is that our model is just a model, and that we could start by assuming that the root probabilities are sufficiently asymmetric for the process to be non-stationary, and then that the process is sufficiently strand-symmetric for the model to be approximately correct. In either case assumption 1 of Corollary 3.1 could be an interesting tool for probing the limits of inference of this model.

In the spirit of Chang (1996) we have attempted to provide a constructive proof, and hope that the results here presented will enable the implementation of new phylogenetic methods. There are several options for such an implementation. Mossel and Roch (2006) provide an algorithm for directly applying the concepts in Chang (1996) to learn the full model parameters using spectral methods. It would be relatively straightforward to adapt that approach to the proof to Theorem 3.1. Alternatively, progress is being made towards fitting models that are heterogeneous across lineages (Jayaswal et al. 2011, 2014). These methods could be adapted to our setting. Also, the method of Yap and Speed (2005) where a homogeneous process is fitted to the whole tree would be trivial to implement for a strand symmetric process, but could be done with a solid theoretical basis and new insight into the limits of inference of such a model.

In addition to these immediate applications, this work provides a foundation for theoretical developments where a non-stationary model can be fully recovered for a two-taxon rooted topology. The ideas presented here could be extended to any such model, not just the strand-symmetric one on which we focus. Another possible direction for future theoretical development is a model that is rate-heterogeneous amongst alignment columns. We have ignored this possibility, both because it is possible to work around it with careful site classification, as in Yap and Speed (2005), and because recent results in non-stationary phylogenetic processes have shown that the general time-reversible model is less biased than the usual rate-heterogeneous general time-reversible model in comparison to a general non-stationary process in some circumstances (Kaehler et al. 2015). Another possible line of enquiry is to refine the sufficient conditions of Theorem 3.1. We know that a stationary process is not identifiable for a rooted topology, but that our non-stationary strand-symmetric process is. Our class of models is slightly narrower than the class of non-stationary, strand-symmetric processes, however, so it may be possible to broaden our sufficient assumptions.

### Acknowledgements

I am grateful to Teresa Neeman, Gavin Huttley, Von Bing Yap, Michael Roper, and John Trueman for their helpful comments. I am particularly indebted to Teresa for suggesting Corollary 3.1, which greatly simplified the proof of Theorem 3.1.

This work was supported by the National Health and Medical Research Council [grant number APP1085372].

### References

1. Barry, D. and J. Hartigan (1987a). Statistical analysis of hominoid molecular evolution. Statistical Science 2(2), 191–207.
2. Barry, D. and J. A. Hartigan (1987b). Asynchronous distance between homologous DNA sequences. Biometrics 43(2), 261–76.
3. Bonhomme, S., K. Jochmans, and J.-M. Robin (2014). Nonparametric spectral-based estimation of latent structures. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.
4. Bulmer, M. (1991). Strand symmetry of mutation rates in the -globin region. Journal of Molecular Evolution 33(4), 305–310.
5. Buneman, P. (1971). The recovery of trees from measures of dissimilarity. In D. G. Kendall and P. Tǎutu (Eds.), Mathematics in the Archaeological and Historical Sciences, pp. 387–395. Edinburgh University Press, Scotland, UK.
6. Casanellas, M. and S. Sullivant (2005). The strand symmetric model. In L. Pachter and B. Sturmfels (Eds.), Algebraic Statistics for Computational Biology, Volume 13, pp. 305–321. Cambridge University Press.
7. Cavender, J. A. and J. Felsenstein (1987). Invariants of phylogenies in a simple case with discrete states. Journal of Classification 4(1), 57–71.
8. Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences 137(1), 51–73.
9. Drummond, A. J. and A. Rambaut (2007). Beast: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7(1), 1.
10. Faith, D. P. and A. M. Baker (2006). Phylogenetic diversity (pd) and biodiversity conservation: some bioinformatics challenges. Evolutionary Bioinformatics 2.
11. Felsenstein, J. (2004). Inferring Phylogenies (2 ed.). Sinauer Associates Sunderland.
12. Francino, M. P. and H. Ochman (2000). Strand symmetry around the -globin origin of replication in primates. Molecular Biology and Evolution 17(3), 416–422.
13. González-Candelas, F., M. A. Bracho, B. Wróbel, and A. Moya (2013). Molecular evolution in court: analysis of a large hepatitis c virus outbreak from an evolving source. BMC Biology 11(1), 76.
14. Higham, N. J. (2008). Functions of matrices: theory and computation. Siam.
15. Huelsenbeck, J. P., J. P. Bollback, and A. M. Levine (2002). Inferring the root of a phylogenetic tree. Systematic Biology 51(1), 32–43.
16. Jarvis, P. D. and J. G. Sumner (2013). Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model. arXiv preprint arXiv:1307.5574.
17. Jayaswal, V., F. Ababneh, L. S. Jermiin, and J. Robinson (2011). Reducing model complexity of the general Markov model of evolution. Molecular Biology and Evolution 28(11), 3045–3059.
18. Jayaswal, V., T. K. Wong, J. Robinson, L. Poladian, and L. S. Jermiin (2014). Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Systematic Biology 63(5), 726–742.
19. Kaehler, B. D., V. B. Yap, R. Zhang, and G. A. Huttley (2015). Genetic distance for a general non-stationary Markov substitution process. Systematic Biology 64(2), 281–293.
20. Knight, R., P. Maxwell, A. Birmingham, J. Carnes, J. G. Caporaso, B. C. Easton, M. Eaton, M. Hamady, H. Lindsay, Z. Liu, C. Lozupone, D. McDonald, M. Robeson, R. Sammut, S. Smit, M. J. Wakefield, J. Widmann, S. Wikman, S. Wilson, H. Ying, and G. A. Huttley (2007). PyCogent: a toolkit for making sense from sequence. Genome Biology 8(8), R171.
21. Lake, J. A. (1994). Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proceedings of the National Academy of Sciences 91(4), 1455–1459.
22. Lobry, J. (1995). Properties of a general model of DNA evolution under no-strand-bias conditions. Journal of Molecular Evolution 40(3), 326–330.
23. Lobry, J. and C. Lobry (1999). Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant. Molecular Biology and Evolution 16(6), 719–723.
24. Mossel, E. and S. Roch (2006). Learning nonsingular phylogenies and hidden Markov models. The Annals of Applied Probability 16(2), 583–614.
25. Murphy, W. J., T. H. Pringle, T. A. Crider, M. S. Springer, and W. Miller (2007). Using genomic data to unravel the root of the placental mammal phylogeny. Genome Research 17(4), 413–421.
26. Nickle, D. C., M. A. Jensen, G. S. Gottlieb, D. Shriner, G. H. Learn, A. G. Rodrigo, and J. I. Mullins (2003). Consensus and ancestral state HIV vaccines. Science 299(5612), 1515–1518.
27. Podsiadlo, L. and M. Polz-Dacewicz (2013). Molecular evolution and phylogenetic implications in clinical research. Annals of Agricultural and Environmental Medicine 20(3).
28. Posada, D. (2008). jModelTest: phylogenetic model averaging. Molecular Biology and Evolution 25(7), 1253–1256.
29. Seneta, E. (2006). Non-negative matrices and Markov chains. Springer Science; Business Media.
30. Squartini, F. and P. F. Arndt (2008). Quantifying the stationarity and time reversibility of the nucleotide substitution process. Molecular Biology and Evolution 25(12), 2525–2535.
31. Touchon, M. and E. P. Rocha (2008). From GC skews to wavelets: a gentle guide to the analysis of compositional asymmetries in genomic data. Biochimie 90(4), 648–659.
32. Wu, C.-I. and N. Maeda (1987). Inequality in mutation rates of the two strands of DNA. Nature 327(6118), 169–170.
33. Yang, Z. and D. Roberts (1995). On the use of nucleic acid sequences to infer early branchings in the tree of life. Molecular Biology and Evolution 12(3), 451–458.
34. Yap, V. B. and L. Pachter (2004). Identification of evolutionary hotspots in the rodent genomes. Genome Research 14(4), 574–579.
35. Yap, V. B. and T. Speed (2005). Rooting a phylogenetic tree with nonreversible substitution models. BMC Evolutionary Biology 5, 2.
36. Zou, L., E. Susko, C. Field, and A. J. Roger (2011). The parameters of the Barry and Hartigan general Markov model are statistically nonidentifiable. Systematic Biology 60(6), 872–5.
113413