Identifiability of Large Phylogenetic Mixture Models

# Identifiability of Large Phylogenetic Mixture Models

John A. Rhodes Department of Mathematics and Statistics
University of Alaska, Fairbanks AK 99775
and  Seth Sullivant Department of Mathematics
North Carolina State University, Raleigh, NC 27695
###### Abstract.

Phylogenetic mixture models are statistical models of character evolution allowing for heterogeneity. Each of the classes in some unknown partition of the characters may evolve by different processes, or even along different trees. The fundamental question of whether parameters of such a model are identifiable is difficult to address, due to the complexity of the parameterization. We analyze mixture models on large trees, with many mixture components, showing that both numerical and tree parameters are indeed identifiable in these models when all trees are the same. We also explore the extent to which our algebraic techniques can be employed to extend the result to mixtures on different trees.

## 1. Introduction

A fundamental question about any parametric statistical model is whether or not the parameters of that model are identifiable; that is, does a probability distribution arising from the model uniquely determine the parameters that produced it. Establishing the identifiability of parameters is important for statistical inference, especially in models where the parameters have a physical or biological interpretation. For example, it is well-known that identifiability is a necessary condition for statistical consistency of maximum likelihood estimation [14, Chapter 16].

In phylogenetics, parameters of interest include the discrete tree parameter and numerical parameters specifying substitution processes on the edges of the tree. For the simplest phylogenetic models, identifiability of both tree and numerical parameters have long been established [11]. But as models grow in complexity, with both the combinatorial description of trees and the underlying number of numerical parameters increasing, the question of identifiability is far from settled.

A particular class of complex phylogenetic models of growing interest and use are the phylogenetic mixture models. Relatively simple examples are the models with small numbers of parameters — including those with -distributed rates, invariable sites, and combinations of these — that are currently the most commonly used in data analysis. More elaborate mixtures allow across-site rate variation with more freedom in the distribution of the rate multipliers [15], the use of different rate matrices [18], or even multiple distinct trees each with their own rate and time parameters. Such models may have a large number of mixture components. For instance, a Bayesian nonparametric analysis conducted in [15] allowed a variable number of components, with a Dirichlet process prior specifying a mean of as many as 20.

However, only the simplest phylogenetic mixture models have been proven to be identifiable, typically where the number of parameters are small. The papers [1, 4, 6, 7, 10, 21] contain previous results on identifiability of such models, of various sorts. Note that only recently has it been shown that most choices of parameters of the widely-used GTR + I + model are identifiable [10], although for a certain type of rate matrix the question remains open.

Our goal in this paper is to develop methods to prove identifiability in phylogenetic models that are considerably more complex than in previous work. In particular, we investigate the identifiability of phylogenetic models with many mixing components. A consequence of our methods is the following theorem:

###### Theorem 1.1.

For an -component identical tree mixture of the general Markov model of character evolution with -state random variables on an -leaf binary phylogenetic tree, both the tree parameter and the numerical parameters are generically identifiable if .

By an identical tree mixture model we mean a mixture of probability distributions coming from the same topological phylogenetic tree. More complicated mixture models might have each distribution arising from a different topological tree. Theorem 1.1 improves substantially over past identifiability results on identical tree phylogenetic mixture models. Previously, it was only known that the tree parameter is identifiable, and then only in the case that (work of Allman and the first author [6]). This new theorem quantifies the intuition that larger taxon sets should allow for identifiability of more complex models, and is an exponential improvement over previous results.

Our strategy of proof is to combine two techniques coming from the algebraic study of phylogenetic models. First, we use the representation of probability distributions in a phylogenetic model as tensors with small tensor rank and employ a theorem of J. Kruskal to uniquely identify components of that tensor. Second, we use phylogenetic invariants as tools to identify deeply embedded features of phylogenetic trees, and to “untangle” probability distributions that have been shuffled together by the tensor analysis. While each technique by itself is only able to make a small advance on the identifiability problem, when combined they give dramatically stronger results. Background on these general techniques appears in Section 3, and the proofs of the main theorems are in Section 4.

Our techniques actually extend to mixtures from different trees provided they all share a certain type of common substructure. It is in this generality that we prove our main results, Theorems 4.6 and 4.7, with Theorem 1.1 arising as a corollary.

The assumption of any common substructure in the trees is of course false in some biological situations modeled by mixtures. For instance, if the mixture is due to the coalescent process modeling incomplete lineage sorting on a species tree of populations, then components will be present from all topological gene trees [13, 22]. However, one might also model lateral gene transfer at a number of (unknown) locations in a tree as a mixture, and for this the assumption of common substructure could be quite plausible.

## 2. Preliminaries

### 2.1. Mixture models

Consider the general Markov model of -state character evolution, , on -taxon trees (e.g., corresponding to DNA sequences). We assume the taxa labeling the leaves are identified with . Then for each rooted leaf-labeled tree , there is a parametrization map giving the joint distribution of states at the leaves of the tree as functions of continuous parameters, which specify the state distribution at the root and the transition probabilities on the edges. Let denote the continuous parameter space of on , which is a full dimensional subset of some . Then

 ψT:ST→Δκn−1,

where denotes the probability simplex comprised of non-negative real vectors summing to . The image of this map is the phylogenetic model .

The associated -component mixture model has the following parametrization: For every -tuple of trees on the same taxa , let and let

 ψT:ST→Δκn−1,

be defined by

 ψT(s1,…,sr,π)=π1ψT1(s1)+⋯+πrψTr(sr).

Thus is the vector of mixing parameters; each gives the proportion of i.i.d. sites that evolve along tree with parameter vector . The -component mixture model on is the image of the map , and is denoted

 MT=MT1∗MT2∗⋯∗MTr.

Clearly depends only on the unordered multiset of the trees in . In the case where for all , we call this an -component identical tree mixture model on .

We focus on the mixture models built from the basic model in this paper, as these are quite general algebraic models, for which the maps are naturally defined by polynomial formulas. Many models which are not polynomial (in particular, those built from the general time-reversible model) can be embedded in them. The polynomial structure of algebraic models allows them to be studied using techniques from algebraic geometry.

### 2.2. Identifiability of parameters

For algebraic models, it is convenient to slightly weaken the notion of identifiability of parameters to generic identifiability. The word “generic” is used to mean “except on a proper algebraic subvariety” of the parameter space. (See section 3.3 for a formal definition of variety.) Although it is sometimes possible to be explicit about this subvariety, we usually are not, since the key point in interpretation is that a proper subvariety is a closed set of Lebesgue measure 0 inside the larger set. Thus regardless of the precise subvariety involved, ÔrandomlyÕ chosen points are generic with probability 1.

On an unmixed model on a single tree , there are several well-understood issues with identifiability of parameters. First, at any internal node of the tree, in a phenomenon called label swapping, one may permute the names of the state space of the corresponding hidden variable (permuting the columns or rows of the Markov matrices on edges leading to or from the node) with no effect on the probability distribution. Second, while the standard parameterization of the model on a tree requires specification of the root of , for generic choices of parameters one can relocate the root (with an appropriate uniquely determined change to the parameters, up to label swapping) with no effect on the probability distribution. Third, if any internal nodes of have degree 2, they may be suppressed and the Markov matrices on incident edges combined, with no effect on the probability distribution. Thus one generally assumes trees have no such nodes. For simplicity, we do not always explicitly refer to these issues in our formal statements in this article. However, we will occasionally use the second fact to choose a convenient location for a root of a tree in our arguments.

That these are the only issues for parameter identifiability for the unmixed model is the content of the following theorem, which was essentially shown in [11].

###### Theorem 2.1.

For the model on a single tree,

1. The unrooted tree parameter is generically identifiable, in the class of binary trees.

2. For a fixed binary tree , the numerical parameters of the model on are generically identifiable, up to label swapping at internal nodes of the tree, and an arbitrary choice of a node as the root.

An additional issue for identifiability of -tree mixtures is component swapping: Interchanging the trees along with their parameters, while permuting the mixing parameters in the same way, has no effect on the resulting distribution. A useful notion of identifiability must allow for this.

###### Definition 2.2.

The tree parameters of the -tree mixture are generically identifiable if for any binary trees on the same set of taxa, and generic choices of parameters ,

 ψT(s1,…,sr,π)=ψT′(s′1,…,s′r,π′)

implies that for some , the symmetric group of permutations.

We also investigate identifiability of tree parameters when restricting to specific classes of -tuples of trees. For example, Theorem 1.1 concerns identifiability of tree parameters among all sets , where . Our main results, Theorems 4.6 and 4.7 concern identifiability in the class of -tuples of trees that all contain a specified deep common substructure, whose precise definition will be given in Section 4.

###### Definition 2.3.

The continuous parameters of an -tree mixture on are generically identifiable if for generic choices of and ,

 ψT(s1,…,sr,π)=ψT(s′1,…,s′r,π′)

implies that there is a permutation such that , , and for .

Note this definition only allows the swapping of continuous parameters with when .

### 2.3. Splits and tripartitions

We will use the combinatorial notion of a split of the leaves of a tree associated to an edge in a binary tree, as well as the analog of this concept for a node of the tree.

###### Definition 2.4.

A split of is a bipartition of with two nonempty elements. A split is said to be compatible with a tree if it arises as the partition of leaves induced by an edge in some binary resolution of .

Similarly, a tripartition of of leaves is said to be compatible with if it arises as the tripartition induced by an interior vertex in some binary resolution of .

A collection of trees is said to have a common split (or tripartition) if the split (or tripartition) is compatible with every tree in the collection.

A collection of trees has a common tripartition if, and only if, it also the three common splits , , and . For a binary tree, these are the splits associated to the edges radiating from the vertex inducing the tripartition. Note also that our definition of compatible splits differs from the standard definition (e.g., in [19]) in the case of trees with polytomies. Our notion is more useful when studying geometric properties of phylogenetic models.

## 3. Tensors and Invariants

The two main tools we use to prove our results are Kruskal’s theorem on uniqueness of tensor decompositions and phylogenetic invariants. In this section, we describe these tools. Both are connected to the notion of a flattening of the probability distribution arising from a phylogenetic model.

### 3.1. Tensors and Unique Decomposition

By a tensor, we mean simply an -way rectangular array of numbers. A 2-way tensor is thus a matrix.

For let be an matrix with th row . Let denote the 3-way tensor defined by

 [M1,M2,M3]=r∑i=1m1i⊗m2i⊗m3i.

In other words, is an array whose entry is

 [M1,M2,M3]u,v,w=r∑i=1m1i(u)m2i(v)m3i(w).

Every 3-way tensor can be expressed in this way, for sufficiently large . A nonzero tensor of this form with is said to have tensor rank 1. More generally, the minimal such that a 3-way tensor can be decomposed as such a sum is called its tensor rank. A natural question is when this expression is essentially unique.

Note there are two basic operations on the matrices which leave unchanged the tensor : one can simultaneously permute the rows of the three matrices and , or taking three numbers such that , one can replace the th rows by . Kruskal’s Theorem [16, 17] describes a situation where these operations lead to the only variants in a tensor decomposition.

Given an matrix , its Kruskal rank, denoted , is the largest value such that every subset of rows of is linearly independent. Note that .

###### Theorem 3.1 ([16, 17]).

Let , where is . If

 I1+I2+I3≥2r+2

then uniquely determines up to simultaneous permutation and scaling of the rows.

Kruskal’s theorem has proven useful for proving identifiability results of numerical parameters for both phylogenetic models [9] and for other statistical models with hidden variables [2, 3]. We will show how to combine this with other algebraic techniques to also deduce identifiability of tree parameters.

### 3.2. Flattenings

While Kruskal’s theorem concerns 3-way tensors, the tensors arising in phylogenetics are usually -way tensors, corresponding to the leaves of a phylogenetic tree. We will make frequent use of flattenings of -way tensors to lower order tensors. A flattening of a -way tensor is simply a reorganization of that tensor as a -way tensor, with , of larger dimensions. We take a tensor , with typical entry , and a partition of , and we represent this as a

 ∏a∈A1κa×⋯×∏a∈Akκa

tensor . The entry of becomes the entry of . That is, the indices for the new tensor are vectors of indices from the tensor .

Given a partition of , we denote the corresponding flattening of by .

### 3.3. Invariants, Phylogenetic and Otherwise

We begin with a little background on algebraic geometry (see [12] for more detail). Let be the set of all polynomials in the variables (or indeterminates) , with coefficients in the real numbers, . Algebraic geometry studies the zero sets of collections of polynomials. That is, to a collection of polynomials we associate the variety

 V(f1,…,fk)={a∈Rm:f1(a)=f2(a)=⋯=fk(a)=0}.

The fact that these geometric sets arise from polynomials vanishing implies they have important structural features.

Varieties arise in studying statistical models through describing models implicitly, rather than parametrically. For a fixed statistical model , an invariant of is a polynomial such that for all . In the case where is a phylogenetic model, such a polynomial is called a phylogenetic invariant.

Our main use in this paper for phylogenetic invariants is their connection to generic identifiability, through the following basic proposition from algebraic geometry.

###### Proposition 3.2.

Let and be two irreducible algebraic varieties, such as those arising from parameterized statistical models. Suppose is an invariant for , and there exists a point with Then , and the variety is of lower dimension than . That is, generic points on lie off of .

Among the most important and elementary phylogenetic invariants are the ones that arise from edge flattenings of tensors.

###### Definition 3.3.

Let be a split compatible with the tree . An edge invariant for is a phylogenetic invariant that can be expressed as a minor (i.e., the determinant of a submatrix) of the matrix .

As an indication of how edge invariants can be used to identify combinatorial information on the tree underlying a phylogenetic model, we recall the following theorem concerning models on a single tree. While this statement is well-known in the phylogenetic invariants literature, Theorem 4.1 of this article provides a more general extension to mixture models.

###### Theorem 3.4.

Suppose that and are two -leaf trees such that for , is a split compatible with and incompatible with , and let denote the -state general Markov model on . Then the -minors of vanish on and do not vanish on , and thus are edge invariants for the first model but not the second. In particular, edge invariants can be used to generically identify the tree topology.

Edge invariants have been the phylogenetic invariants most interesting for tree identifiability in the past, and contain enough information to reconstruct the combinatorial type of a single tree in some situations. However, we need some more complicated invariants to get more information in the case of the phylogenetic mixture models considered here. We describe these invariants, discovered in several different contexts [5, 20], in matrix form.

###### Theorem 3.5.

Let be a tensor giving a distribution from the model on a 3-leaf tree. For , let be the matrix slice . Then

Here denotes the classical adjoint of , which is given by polynomial expressions in the entries of . In the case of nonsingular , .

## 4. Identifiability of Mixture Models with Common Substructure

In this section, we prove our main result, that both tree parameters and numerical parameters are generically identifiable in a phylogenetic mixture model provided we restrict to multisets of trees that all share a certain substructure. More precisely, we require that all trees in have two splits in common. The number of mixing components that can be identified via our techniques will depend on the sizes of the sets in these splits. As a corollary, we deduce Theorem 1.1, after showing that if all trees are the same, there is a “deep” internal vertex with two of its incident edges giving the requisite splits.

Before proceeding to the statements and proofs of the main theorems, we prove three lemmas.

###### Lemma 4.1.

(Edge invariants for tree mixtures)

Consider the mixture model on trees . Let be a bipartition of the taxa, with

1. If is compatible with all trees in , then all -minors of vanish for all distributions arising from the model.

2. If is not compatible with at least one tree in , then for generic distributions arising from the model at least one -minor of does not vanish.

###### Proof.

The claims concerning (non)vanishing of minors are equivalent to claims that has rank at most in case (1), and generically has rank greater than in case (2). Therefore we focus on investigating ranks of flattenings.

If is compatible with all trees in , then, by passing to binary resolutions of the , we may assume it is a split associated to edge in . Then one sees that

 FlatA|B(P)=MTAQMB.

Here is the block-diagonal matrix whose th block gives the joint probability distribution of states for the random variables at and , weighted by the component proportion . The matrices , are stochastic, of sizes , , with entries in the th block of rows giving probabilities of states of variables in conditioned on states at ,. This factorization implies the claimed bound on the rank.

Suppose next that is not compatible with at least one of the trees in , say . To show that generically has rank greater than , it is enough to give a single choice of parameters producing such a rank. Indeed, this follows from Proposition 3.2, applied to the model and the variety of matrices of rank at most .

To simplify this choice, for each with choose all Markov matrices for all internal edges of to be the identity, . Since is not compatible with , by Theorem 3.8.6 of [19], it has an edge , with associated split such that all four sets , , , are nonempty. For all internal edges of except , choose Markov matrices to be as well. Since the effect of an identity matrix on an edge is the same as contracting that edge, with these choices we need henceforth argue only in the following special case: for , is a star tree with central node , and has the form of two star trees, on and on , that are joined at their central nodes by .

Now express the distribution where is the mixture component from , and the sum of the components on the star trees . Then, one can write

 M2:=FlatA|B(P′)=NTARNB,

with an diagonal matrix giving the distribution of states at in components weighted by the , and , are stochastic matrices of sizes , with entries giving conditional probabilities of states of variables in , conditioned on states/components at the . By choosing positive root distributions at the nodes , and positive , we ensure will have positive diagonal entries, and hence have full rank. Furthermore, the rows of are formed from the tensor product of corresponding rows of the Markov matrices on the edges of the star trees, and are thus generalized Vandermonde matrices. (Recall that if are a linearly independent set of polynomials, and are points, the generalized Vandermonde matrix is the matrix matrix with entry . Here the polynomials are determined by the formulae for the entries in the tensor product of the rows, and the by the entries in the Markov matrices.) A generalized Vandermonde matrix has full rank for generic choices of . Since , for generic parameters has rank .

On the other hand, consider , where we choose all matrices on pendant edges of to be , and both the root distribution at and to have all positive entries. Then

 M1:=FlatA|B(P1)=NT1,AR1N1,B,

where is a diagonal matrix with entries giving the joint distribution at and weighted by , and have all zero entries except for a single 1 in each row, and full row rank. Thus has rank . Moreover, it has at most one non-zero entry in each row and column, so both and are coordinate subspaces.

Since , our goal is to show that for generic choices of the parameters not yet specified (the Markov matrices on the trees ). Without loss of generality assume that , so to do this it is enough to make

 (1) rank(M1+M2)=min((r−1)κ+κ2,κ#B).

We use the following facts about matrices: Let and be matrices. With denoting the image and kernel of as a linear transformation from to , then implies . Also, if , then by the rank/nullity theorem .

First consider the case where . By the preceding paragraph, to show equation (1) it suffices to choose parameters so that and .

Since generically and have full rank, it follows that and . But is a coordinate subspace, so it intersects nontrivially if and only the submatrix of obtained by deleting rows corresponding to those coordinates has nontrivial kernel. That submatrix is a generalized Vandermonde matrix with , so it has full column rank. This proves that generically.

Since is also a coordinate subspace, its intersection with is isomorphic to the kernel of the submatrix of obtained by deleting the columns corresponding to required zero entries in vectors in . Since this submatrix is a generalized Vandermonde matrix, the dimension of this kernel is

 κ#B−κ2−(r−1)κ=(κ#B−κ2)+(κ#B−(r−1)κ)−κ#B.

Thus so .

In the case where the same arguments as above apply after modifying our choices so all but of the entries of are zero. Then we deduce that we can choose so that . ∎

Picking any internal vertex of a binary tree, the induced tripartition of the leaf variables allows us to create 3 agglomerate variables. In this way, we can view a phylogenetic model as one to which we can apply Kruskal’s theorem. More specifically, consider a probability distribution in the mixture model on trees , where the share a common tripartition of the leaves, arising from the vertices . Suppose is the weighted mixture component from in . Then from the parameters on , one can give , , stochastic matrices , , of conditional probabilities of states at the leaves in , , , given the state at . Letting be the matrix obtained from by multiplying rows by the corresponding entry of the root distribution at and by the weight , one checks that

 FlatA|B|C(Pi)=[˜Mi,A,Mi,B,Mi,C].

Let denote the matrix obtained by stacking the , and similarly be matrices obtained by stacking the . Then

 FlatA|B|C(P)=[MA,MB,MC].

To apply Kruskal’s theorem to this flattening, we must first show that the technical conditions on Kruskal rank of the matrices apply, at least generically.

###### Lemma 4.2.

Consider an -fold mixture model on trees with a common tripartition of the leaves. Then

 FlatA|B|C(P)=[MA,MB,MC]

for some matrices with rows. Moreover, for generic choices of the numerical parameters these matrices all have full Kruskal rank (i.e., Kruskal row rank equal to their smaller dimension).

###### Proof.

The first claim was established in the discussion preceding the lemma.

For the second, by similar reasoning as was used in Lemma 4.1, it is enough to show one choice of parameters gives these matrices full Kruskal rank. By choosing matrix parameters on all internal edges of every to be the identity matrix, we may essentially assume every is the star tree, rooted at central node . Choosing positive root distributions at , and positive mixing parameters , it then suffices to only consider one set of leaves, say .

Now, as in the discussion of in the proof of Lemma 4.1, one sees that is a generalized Vandermonde matrix. Since all its submatrices are also generalized Vandermonde matrices, it generically has full Kruskal rank. ∎

The next lemma allows us to tease apart distributions which arise from mixing together slices of distributions from different trees. After we have applied Kruskal’s Theorem via Lemma 4.2, it will be used to identify which rows of the matrices arise from the same mixture component of the model.

###### Lemma 4.3 (No Shuffling Lemma).

Let , be trees with leaves, or leaves if . For , let be a generic probability distributions from the model on the tree , scaled by positive constants . For a fixed choice of , let and form the flattenings . Form a new matrix from any rows from these flattenings (with repeats allowed), and define so that is this matrix. Then does not satisfy all the phylogenetic invariants for unless the chosen rows come from a single and is a refinement of .

###### Proof.

Note that the multiplication by the has no effect on whether the tensor satisfies non-trivial invariants, because the phylogenetic varieties for the model are invariant under the action of the general linear group at any leaf [8].

Consider first the case that , and . Suppose is constructed from rows which come from at least two different . Without loss of generality, we assume , so that in the notation of Theorem 3.5, the slices contain the entries of arising from a single row of the flattening. We will show that does not satisfy the invariants of that theorem.

For the time being, treat two of these slices as fixed, and the third slice , which we may assume does not come from the same as either or , as a variable. Generically, the matrix equation

then gives nonzero, linear constraints on the entries of .

However, for an arbitrary matrix with positive entries whose sum is less than 1, we can find a that has as any designated slice. This shows that there exist such slices not satisfying equation (2), and hence, by Proposition 3.2, that the generic slice does not.

When and , there are no non-trivial invariants for (those of Theorem 3.5 are identically zero), hence we consider , and use the edge invariants of Theorem 3.4. But for any choice of 4-leaf tree, and choice of index , we can find a in the tree model so that has any desired generic vector as its th row. Now is built from two such rows. If the and that we take these slices from are not the same, then generically, we can choose those slices to be arbitrary vectors. But then the flattening of with respect to the split of will generically be a rank 4 matrix, and hence will not satisfy the invariants for tree .

For larger , the result follows from the cases above by marginalization to 3- or 4-leaf trees. ∎

First we prove a theorem on the generic identifiability of numerical parameters in trees with a known common tripartition.

###### Theorem 4.4.

Suppose the trees have a known common tripartition , with , and . If also suppose . Then both and the numerical parameters of the mixture model on are generically identifiable.

###### Proof.

Since the trees in share a common tripartition , by Lemma 4.2 if a distribution arises from generic parameters of the model then

 FlatA|B|C(P)=[MA,MB,MC],

where , , and all have full Kruskal row rank, which will be , , and , respectively. According to Theorem 3.1, these matrices are uniquely determined up to simultaneous permutation and scaling of the rows provided

 (3) min(rκ,κ#A)+min(rκ,κ#B)+min(rκ,κ#C)≥2rκ+2.

Since and , this inequality holds for all .

At this point, we have recovered the matrices and up to scaling and permuting the rows. Each of the rows of the recovered will have entries from a scaled slice from a tree distribution on a subtree of one of the (the subtree spanning the vertex and all the leaves ). We need to group these rows by the mixture components they come from. However, the No Shuffling Lemma 4.3, says that generically it is possible to do this. Since ordering the rows of determines an order of the rows of , we can then reassemble the flattened mixture components as the product of appropriate submatrices of .

From , we recover the mixing weight via

 πi=∑(j1,…,jn)∈[κ]nPi(j1,…,jn).

Then, by Theorem 2.1, the tree and the numerical parameters on it can be identified from

Now we proceed to prove identifiability of the numerical parameters and tree parameters in our most general class of -tree mixture models, the -deep class.

###### Definition 4.5.

For a positive integer , the -deep class of -tuples of trees consists of all -tuples of binary trees such that there exists a tripartition with , , such that the splits and are compatible with all trees in .

Note that this definition does not require that be compatible with any of the trees in , so the full tripartition need not be associated to vertices in the . The trees must only share two splits, each sufficiently deep in the tree. Furthermore, if is in the -deep class, we do not assume the tripartition is known, only that it exists.

We now prove our main theorems on identifiability of parameters in -tree mixtures. We state two versions, one for when a -deep tripartition is known (including the case of when all the trees are known), and one for when it is not. The second of these requires a slightly stronger hypothesis on the number of mixture components.

###### Theorem 4.6.

Suppose is in the -deep class via a known tripartition . Then both and the numerical parameters of the mixture model associated to are generically identifiable provided and either , or and .

###### Proof.

Fix some , let , and let be the marginalization of to the leaves in . This is a probability tensor for the mixture of induced trees , with numerical parameters obtained by restricting to these induced trees. Note that the trees in share the common tripartition . Thus Theorem 4.4 applies to identify the trees and numerical parameters on them. Then by Lemma 4.2 we may write

 FlatA|B|{c}(Pc)=[MA,MB,Mc],

and for generic choices of the numerical parameters, these matrices all have full Kruskal row rank. We may further specify that the rows of these matrices, in particular , have been ordered into blocks of rows, corresponding to the various mixture components.

Note that since the matrix has full Kruskal row rank and is with , it has full row rank. Thus we may compute a left inverse , with the identity.

Returning to the consideration of the full distribution and trees , we use to disentangle the mixture components. In each let be the node in the subtree spanning through which this subtree is connected to all other leaves. Then

 FlatB∪C|A(P)=MTB∪CΠ˜MA,

where are stochastic matrices of probabilities of states at the leaves in , conditioned on components and states at the , and is a diagonal matrix with entries the product of the mixing weights, , and the root distributions at . While the ordering of the mixture components and root states in these matrices is arbitrary, we may assume it is the same as in the rows of . Then

 MA=R˜MA,

where and is a block diagonal matrix whose th block gives conditional probabilities of state changes from to on , and is generically invertible.

Thus

 FlatB∪C|A(P)QA=MTB∪CΠR−1MAQA=MTB∪CΠR−1.

This shows that by taking the columns of in blocks of we obtain entries associated to only one mixture component at a time. Moreover, multiplying a block of these columns by the corresponding block of rows of , we obtain a flattened form of a single mixture component .

Summing the entries of identifies , and hence . Then by Theorem 2.1 the tree and the numerical parameters on it are identifiable. ∎

###### Theorem 4.7.

Suppose is in the -deep class. Then both and the numerical parameters of the mixture model associated to are generically identifiable provided .

###### Proof.

Since the is in the -deep class and , for generic parameters we can use the edge invariants of Lemma 4.1 to find two splits and compatible with all trees in , with , , simply by testing for all splits of an appropriate size.

If , then implies , so . Thus for any , Theorem 4.6 applies to give the conclusion. ∎

We are now in a position to deduce Theorem 1.1, which will follow from Theorem 4.7 and the following lemma.

###### Lemma 4.8.

Let be an unrooted binary tree with leaves. Then there exists an internal vertex in inducing a tripartition such that two of the three components contain at least leaves of .

###### Proof.

According to Exercise 1.5 in [19], every tree has a centroid , which is an internal node such that each component of has at most vertices where is the number of vertices of . This same statement holds if we replace with and vertices with leaves in the definition of the centroid. Since the tree is binary and is an internal vertex, there are three components of . The largest component has at least leaves and at most . Thus there are at least leaves remaining between the other two components, which implies that in the most balanced case, one of the other two components has at least leaves. Since this proves the claim. ∎

Simple examples show the bound in this lemma is the best possible.

###### Proof of Theorem 1.1.

According to Lemma 4.8, there is an internal vertex of inducing a tripartition such that and . Thus is in the -deep class. Theorem 4.7 then applies. ∎

## 5. Further Directions

The techniques employed in this paper have been primarily concerned with, and are effective for, the identification of parameters in mixture models where the underlying trees share large common substructures. Establishing identifiability of either numerical or tree parameters in situations where there is no commonality between the trees remains an open problem.

Even in the case of general Markov mixtures of two 4-leaf trees little is understood: First, in the case of two different tree topologies being mixed, it is unknown if the tree parameters are generically identifiable. Second, if the two trees are given, it is unknown if numerical parameters are generically identifiable. These problems might be addressed by finding stronger versions of the tensor rank results we have employed (e.g., a strengthened version of Kruskal’s theorem). But it also seems likely that a solution to these these problems will require the development of new mathematical techniques.

## Acknowledgement

Thanks to John Huelsenbeck for stimulating this work through describing his own investigations with mixture models with many components.

John Rhodes was partially supported by US National Science Foundation (DMS 0714830). Seth Sullivant was partially supported by the David and Lucille Packard Foundation and the US National Science Foundation (DMS 0954865).

## References

• [1] E. S. Allman, C. Ané, and J. A. Rhodes. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. in Appl. Probab., 40:229–249, 2008. arXiv:0709.0531.
• [2] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A):3099–3132, 2009.
• [3] E. S. Allman, C. Matias, and J. A. Rhodes. Parameter identifiability in a class of random graph mixture models, 2010.
• [4] E. S. Allman, S. Petrovic, J. A. Rhodes, and S. Sullivant. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 2010.
• [5] E. S. Allman and J. A. Rhodes. Phylogenetic invariants for the general Markov model of sequence mutation. Math. Biosci., 186(2):113–144, 2003.
• [6] E. S. Allman and J. A. Rhodes. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol., 13(5):1101–1113, 2006.
• [7] E. S. Allman and J. A. Rhodes. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci., 211(1):18–33, 2008.
• [8] E. S. Allman and J. A. Rhodes. Phylogenetic ideals and varieties for the general Markov model. Adv. in Appl. Math., 40(2), 2008.
• [9] E. S. Allman and J. A. Rhodes. The identifiability of covarion models in phylogenetics. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 6(1):76–88, 2009.
• [10] J. Chai and E. A. Housworth. On Rogers’s Proof of Identifiability for the GTR + Gamma + I Model, 2010. Preprint.
• [11] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci., 137(1):51–73, 1996.
• [12] D. Cox, J. Little, and D. O’Shea. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer-Verlag, New York, second edition, 1997.
• [13] J. H. Degnan and L. A. Salter. Gene tree distributions under the coalescent process. Evolution, 59:24–37, 2005.
• [14] J. Felsenstein. Inferring Phylogenies. Sinauer and Associates, 2004.
• [15] J. P. Huelsenbeck and M. A. Suchard. A nonparametric method for accommodating and testing across-site rate variation. Syst. Biol, 56(6):975–987, 2007.
• [16] J. B. Kruskal. More factors than subjects, tests and treatments: an indeterminacy theorem for canonical decomposition and individual differences scaling. Psychometrika, 41(3):281–293, 1976.
• [17] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and Appl., 18(2):95–138, 1977.
• [18] M. Pagel and A. Meade. Mixture models in phylogenetic inference. In O. Gascuel, editor, Mathematics of Evolution and Phylogeny, pages 121–142. Oxford University Press, Oxford, 2005.
• [19] C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, 2003.
• [20] V. Strassen. Rank and optimal computation of generic tensors. Linear Algebra Appl., 52/53:645–685, 1983.
• [21] D. Štefankovič and E. Vigoda. Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J. Comput. Biol., 14(2):156–189, 2007.
• [22] J. Wakeley. Coalescent Theory. Roberts and Company, 2008.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters