A semialgebraic description of the general Markov model on phylogenetic trees

# A semialgebraic description of the general Markov model on phylogenetic trees

## Abstract

Many of the stochastic models used in inference of phylogenetic trees from biological sequence data have polynomial parameterization maps. The image of such a map — the collection of joint distributions for a model — forms the model space. Since the parameterization is polynomial, the Zariski closure of the model space is an algebraic variety which is typically much larger than the model space, but has been usefully studied with algebraic methods. Of ultimate interest, however, is not the full variety, but only the model space. Here we develop complete semialgebraic descriptions of the model space arising from the -state general Markov model on a tree, with slightly restricted parameters. Our approach depends upon both recently-formulated analogs of Cayley’s hyperdeterminant, and the construction of certain quadratic forms from the joint distribution whose positive (semi-)definiteness encodes information about parameter values. We additionally investigate the use of Sturm sequences for obtaining similar results.

p

hylogenetic tree, phylogenetic variety, semialgebraic set, general Markov model

{AMS}

60J20, 92D15, 92D20, 62P10, 14P10

## 1 Introduction

Statistical inference of evolutionary relationships among organisms from DNA sequence data is routinely performed using probabilistic models of sequence evolution along a tree. A site in a sequence is viewed as a 4-state (A,C,G, T) random variable, which undergoes state changes as it descends along the tree from an ancestral organism to its modern descendants. Such models exhibit a rich mathematical structure, which reflects both the combinatorial features of the tree, and the algebraic way in which stochastic matrices associated to edges of the tree are combined to produce a joint probability distribution describing sequences of the extant organisms.

One thread in the extensive literature on such models has utilized the viewpoint of algebraic geometry to understand the probability distributions that may arise. This is natural, since the distributions are in the image of a polynomial map, and the image thus lies in an algebraic variety. The defining equations of this variety (which depend on the tree topology), are called phylogenetic invariants. That a probability distribution satisfies them can be taken as evidence that it arose from sequence evolution along the particular tree. Phylogenetic invariants and varieties have been extensively studied by many authors [13, 26, 17, 21, 20, 3, 34, 8, 12, 30, 11] (see [7] for more references) with goals ranging from biological (improving data analysis) to statistical (establishing the identifiability of model parameters) to purely mathematical.

However, it has long been understood that, in addition to the equalities of phylogenetic invariants, inequalities should play a role in characterizing those distributions actually of interest for statistical purposes. Much of a phylogenetic variety is typically composed of points not arising from stochastic parameters, but rather from applying the same polynomial parameterization map to complex parameters. Thus the model space — the set of probability distributions arising as the image of stochastic parameters on a tree — can be considerably smaller than the set of all probability distributions on the variety. An instructive recent computation [37] demonstrated that for the -state general Markov model on the -leaf tree, for example, the model space is only about 8% of the nonnegative real points on the variety. Inequalities can thus be crucial in determining if a probability distribution arises from a model.

In the pioneering 1987 paper of Cavender and Felsenstein [13] polynomial equalities and inequalities are given that can test which of the 3 possible unrooted leaf-labeled 4-leaf trees might have produced a given probability distribution, and thus in principle determine evolutionary relationships between 4 organisms. Nonetheless, despite many advances in understanding phylogenetic invariants in the intervening years, little has been accomplished in finding or understanding the necessary inequalities. The potential usefulness of such inequalities, meanwhile, has been demonstrated in [18], where an inequality that holds for the 2-state model on all tree topologies plays a key role in studying loss of biodiversity under species extinction. In [9] a small number of inequalities, dependent on the tree, were used to show that for certain mixture models trees were identifiable from probability distributions.

Recent independent works by Smith and Zwiernik [37] and by Klaere and Liebscher [23] provided the first substantial progress on the general problem of finding sufficient inequalities to describe the model space. Both groups successfully formulated inequalities for the 2-state general Markov model on trees, using different viewpoints. While the 2-state model has some applicability to DNA sequences, through a purine/pyrimidine encoding of nucleotides, it is unfortunately not clear how to extend these works to the more general -state model, or even to the particular model that is directly applicable to DNA. Features of the statistical framework in [37] make generalizing to more states highly problematic, while the formulation of [23] involves beating through a thicket of algebraic details which are similarly difficult to generalize.

In this work we provide a third approach to understanding the model space of the general Markov model on trees which has the advantage of extending from the 2-state to the -state model with little modification. Our goal is a semialgebraic description (given by a boolean combination of polynomial equalities and inequalities) of the set of probability distributions that arise on a specific tree. Such a description exists by the Tarski-Seidenberg Theorem [36, 31], since the stochastic parameter space for any -state general Markov model is a semialgebraic set, so its image under the polynomial parameterization map must be as well. However, we seek an explicit description, and this theorem does not provide a useful means of obtaining it.

We describe below two methods for obtaining such a semialgebraic model description. In one approach, that applies equally easily to all and all binary trees, we obtain inequalities using a recently-formulated analog of Cayley’s hyperdeterminant from [1], and the construction of certain quadratic forms from the joint distribution whose positive (semi-)definiteness encodes information about parameter values. We note that the appearance of the hyperdeterminant in both [37] and [23] motivated the work of [1], but that our introduction of quadratic forms in this paper is an equally essential tool for obtaining our results. Moreover, we do not see direct precursors of this idea in either [37] or [23].

We also describe an alternative method using Sturm sequences for univariate polynomials to obtain inequalities. Specifically, we construct polynomials in the entries of a probability distribution whose roots are exactly a subset of the numerical parameters, and Sturm theory leads to inequalities stating that the roots lie in the interval , as the parameters must. Although for the -state model this leads to a complete semialgebraic description of the model on a -leaf tree, for higher it becomes more unwieldy. Nonetheless, this approach can produce inequalities of smaller degree than those found using quadratic forms, so we consider it a potentially useful technique.

In both approaches, we must impose some restrictions on the set of stochastic parameters in order to give our semialgebraic conditions. We thus formulate a notion of nonsingular parameters and mostly restrict to considering them for our results. In the case this notion is particularly natural from a statistical point of view, though it is slightly less so for higher . Indeed, an understanding of why this notion is needed algebraically illuminates, we believe, the difficulties of passing from 2-state results to -state results.

This paper is organized as follows: In §2 we formally introduce the general Markov model on trees and set basic notations and terminology, including the notion of nonsingular parameters. In §3, we give a semialgebraic description of the general Markov model on the -leaf tree using the work of [1] and Sylvester’s theorem on quadratic forms, a description that is made complete for the -state model, but holds only for nonsingular parameters of the -state model. Additionally, we discuss connections to several previous works on the -state model [29, 37, 23]. In §4, we use Sturm sequences to give partial semialgebraic descriptions of -leaf model spaces, and develop several examples. In §5, we give the main result: a semialgebraic description of the -state general Markov model on -leaf trees for nonsingular parameters. For the -state model we prove a slightly stronger result that drops the nonsingularity assumption.

## 2 Definitions and Notations

### 2.1 The general Markov model on trees

We review the -state general Markov model on trees, , whose parameters consist of a combinatorial object, a tree, and a collection of numerical parameters that are associated to a rooted version of the tree.

Let be a binary tree with leaves , , and a collection of discrete random variables associated to the nodes, all with state space . Distinguish an internal node of to serve as its root, and direct all edges of away from . (Often this model is presented with the root as a node of valence 2 which is introduced by subdividing some edge. However, under very mild assumptions this leads to the same probability distributions we consider here [33, 3], so we avoid that complication.) Though necessary for parameterizing the model, the choice of will not matter in our final results, as will be shown in §5.

For a tree rooted at , numerical parameters for the model on are: {romannum}

A root distribution row vector , with nonnegative entries summing to 1;

Markov matrices , with nonnegative entries and row sums equal to .

The vector specifies the distribution of the random variable , i.e., , and the Markov matrices , for , give transition probabilities of the various state changes in passing from the parent vertex to the child vertex . Letting , the joint probability distribution at all nodes of is thus

 Prob(X=j)=πjr∏e∈EMe(jae,jbe).

By marginalizing over all variables at internal nodes of , we obtain the joint distribution, , of states at the leaves of ; if is an assignment of states to leaf variables, then

 P(k)=∑m∈[k]|V∖L|Prob(X=(k,m))

where is an assignment of states to all the vertices of compatible with . It is natural to view as an -dimensional array, or tensor, with one index for each leaf of the tree.

For fixed and choice of , we use to denote the parameterization map

 ψT:{π,{Me}e∈E}↦P.

That the coordinate functions of are polynomial is obvious, but essential to our work here. Note that we may naturally extend the domain of the polynomial map to larger sets, by dropping the nonnegativity assumptions in (i) and (ii), but retaining the condition that rows must sum to 1. We will consider real parameters and a real parameterization map, as well as complex parameters and a complex parameterization map. In contrast, we refer to the original probabilistic model as having stochastic parameters. Since the parameterization maps are all given by the same formula, we use to denote them all, but will always indicate the current domain of interest.

The image of complex, real, or stochastic parameters under is an -dimensional tensor, whose entries sum to 1. When parameters are not stochastic, this tensor generally does not specify a probability distribution, as there can be negative or complex entries. We refer to any tensor whose entries sum to 1, regardless of whether the entries are complex, real, or nonnegative, as a distribution, but reserve the term probability distribution for a nonnegative distribution. With this language, the image of complex parameters under is a distribution, but may or may not be a probability distribution. Similarly, while the matrix parameters have rows summing to one even for complex parameters, we reserve the term Markov matrix exclusively for the stochastic setting.

### 2.2 Algebraic and semialgebraic model descriptions

Most previous algebraic analysis of the model has focused on the algebraic variety associated to it for each choice of tree . With this viewpoint one is essentially passing from the parameterization of the model, as given above, to an implicit description of the image of the parameterization as a zero set of certain polynomial functions, traditionally called phylogenetic invariants [13, 26, 7].

Whether one considers stochastic, real, or complex parameters, the collection of phylogenetic invariants for on a tree are the same. Thus they cannot distinguish probability distributions that arise from stochastic parameters from those arising from non-stochastic real or complex ones. To complicate matters further, there exist distributions that satisfy all phylogenetic invariants for the model on a given tree, but are not even in the image of complex parameters. Though the algebraic issues behind this are well understood, they prevent classical algebraic geometry from being a sufficient tool to focus exclusively on the distributions of statistical interest.

To gain a more detailed understanding, we seek to refine the algebraic description of the model given by phylogenetic invariants into a semialgebraic description: In addition to finding polynomials vanishing on the image of the parameterization (or equivalently polynomial equalities holding at all points on the image), we also seek polynomial inequalities sufficient to distinguish the stochastic image precisely.

Recall that a subset of is called a semialgebraic set if it is a boolean combination of sets each of which is defined by a single polynomial equality or inequality. The Tarski-Seidenberg Theorem [36, 31] implies that the image of a semialgebraic set under a polynomial map is also semialgebraic.

Since for all the stochastic parameter space of is clearly semialgebraic, this implies that semialgebraic descriptions exist for the images of the . Determining such descriptions explicitly is our goal.

### 2.3 Nonsingular parameters, positivity, and independence

Some of our results will be stated with additional mild conditions placed on the allowed parameters for the model. We state these conditions here, and explore their meaning.

{definition}

A choice of stochastic, real, or complex parameters for on a tree with root is said to be nonsingular provided {romannum}

at every (hidden or observed) node , the marginal distribution of has no zero entry, and

for every edge , the matrix is nonsingular. Parameters which are not nonsingular are said to be singular.

For stochastic parameters, the first condition in this definition can be replaced with a simpler one: {romannum}

the root distribution has no zero entry. Statement (i) follows from (i’) and (ii) inductively, since if all entries of are positive and is a nonsingular Markov matrix, then the distribution at a child of has positive entries. However, for complex or real parameters requirement (i) is not implied by (i’) and (ii), as a simple example shows: , and are singular parameters since , even though has no zero entries and is a nonsingular for .

It is also natural to require that all numerical parameters of on a tree be strictly positive. This means that all states may occur at the root, and every state change is possible in passing along any edge of the tree. This assumption is plausible from a modeling point of view, and can be desirable for technical statistical issues as well. Note that positivity of parameters does not ensure nonsingularity, since a Markov matrix may be singular despite all its entries being greater than zero. Similarly, nonsingularity of parameters does not ensure positivity since a nonsingular Markov matrix may have zero entries.

Given a joint probability distribution of random variables, two subsets of variables are independent when the marginal distribution for the union of the sets is the product of the marginal distributions for the two sets individually. We also use this term, in a nonstandard way, to apply to complex or real distributions when the same factorization holds.

To illustrate this usage, consider a tree with two nodes, , and one edge . For complex parameters and , the joint distribution of and is given by the matrix

 P=\diag(π)M(r,a).

Then the variables are independent exactly when is a rank matrix: . For this occurs precisely when the parameters are singular. For , however, independence implies the parameters are singular, but not vice versa. In general, singular parameters ensure that has rank strictly less than , but not that has rank .

These comments easily extend to larger trees to give the following.

{proposition}

Suppose for a choice of complex GM() parameters on an -leaf tree . If the parameters are nonsingular, then there is no proper partition of the indices of into independent sets. For , the converse also holds.

That the converse is false for is a complicating factor for the generalization of our results from the case. Indeed, this is the reason we ultimately restrict to nonsingular parameters.

In closing this section, we note that for any , there is an inherent and well-understood source of non-uniqueness of parameters giving rise to , sometimes called ‘label-swapping.’ Since internal nodes of are unobservable variables, the distribution is computed by summing over all assignments of states to such variables. As a result, if the state names were permuted for such a variable, and corresponding changes made in numerical parameters, is left unchanged. Thus parameters leading to can be determined at most up to such permutations.

In the case of nonsingular parameters, label-swapping is the only source of non-uniqueness of parameters leading to [15]. (See also [25, 2]). For singular parameters there are additional sources of non-uniqueness.

### 2.4 Marginalizations, slices, group actions, and flattenings

Viewing probability distributions on variables as -dimensional tensors gives natural associations between statistical notions and tensor operations. For example, summing tensor entries over an index, or a collection of indices, corresponds to marginalizing over a variable, or collection of variables. Considering only those entries with a fixed value of an index, or collection of indices, corresponds (after renormalization) to conditioning on an observed variable, or collection of variables. Rearranging array entries into a new array, with fewer dimensions but larger size, corresponds to agglomerating several variables into a composite one with larger state space. Here we introduce the necessary notation to formalize these tensor operations.

{definition}

For an -dimensional tensor , integer , and vector , define the -dimensional tensor by

 (P∗iv)(j1,…,^ji,…,jn)=k∑ji=1vjiP(j1,⋯,ji,⋯,jn),

where denotes omission.

Thus, the th slice of in the th index is defined by where is the th standard basis vector, and the th marginalization of is where is the vector of all s.

The above product of a tensor and vector extends naturally to tensors and matrices.

{definition}

For an -dimensional tensor and matrix , define the -dimensional tensor by

 (P∗iM)(j1,…,jn)=k∑ℓ=1P(j1,…,ji−1,ℓ,ji+1,…,jn)M(ℓ,ji).

If the above operations on a tensor by vectors or matrices are performed in different indices, then they commute. This allows the use of -tuple notation for the operation of matrices in all indices of a tensor, such as the following:

 P⋅(M1,M2,…,Mn)=(⋯((P∗1M1)∗2M2)⋯)∗nMn.

Although the need not be invertible, restricting to that case gives the natural (right) group action of on tensors. This generalizes the familiar operation on 2-dimensional tensors , i.e., on matrices, where

 P⋅(M1,M2)=(P∗1M1)∗2M2=MT1PM2.

If , then denotes the -dimensional diagonal tensor whose only nonzero entries are the in the positions. That this notion is useful for the model is made clear by the observation that for a 3-leaf star tree , rooted at the central node,

 ψT(π,{M1,M2,M3})=Diag(π)⋅(M1,M2,M3). (1)

If is an -dimensional tensor and is a disjoint union of nonempty sets, then the flattening of with respect to this bipartition, is the matrix with rows indexed by and columns indexed by , with

 FlatA|B(P)(i,j)=P(k),

where has entries matching those of and , appropriately ordered. Thus the entries of are simply rearranged into a matrix, in a manner consistent with the original tensor structure. When specifies a joint distribution for random variables, this flattening corresponds to treating the variables in and as two agglomerate variables, with state spaces the product of the state spaces of the individual variables.

Notations such as , for example, will be used to denote the matrix flattening obtained from a -dimensional tensor using the partition of indices , . If is an edge in an -leaf tree, then naturally induces a bipartition of the leaves, by removing the edge and grouping leaves according to the resulting connected components. A flattening for such a bipartition is denoted by .

Finally, we note that flattenings naturally occur in the notion of independence: If , then the sets are independent precisely when is a rank 1 matrix.

## 3 Gm(k) on 3-leaf trees

In this section we derive a semialgebraic description of GM() on the -leaf tree, the smallest example of interest. Results for the 3-leaf tree also serve as a building block for the study of the model on larger trees in §5. For this section, then, is fixed, with leaves , , and root at the central node.

When , Cayley’s hyperdeterminant plays a critical role, as has already been highlighted in [38]. Though our formulation will be different, we take the hyperdeterminant as our starting point. For any tensor , the hyperdeterminant [14, 19, 16] is

 Δ(A)=(a2111a2222+a2112a2221+a2121a2212+a2122a2211)−2(a111a112a221a222+a111a121a212a222+a111a122a211a222+a112a121a212a221+a112a122a221a211+a121a122a212a211)+4(a111a122a212a221+a112a121a211a222).

The function has the -invariance property

 Δ(P⋅(g1,g2,g3))=det(g1)2det(g2)2det(g3)2Δ(P). (2)

This fact, combined with a study of canonical forms for -orbit representatives, leads to the following theorem.

{theorem}

[[16], Theorem 7.1] A complex tensor is in the -orbit of if, and only if, . A real tensor is in the -orbit of if, and only if, .

Suppose that and arises from nonsingular parameters on . Then, equation (1) states but letting we also have

 P=D⋅(M′1,M2,M3).

Therefore by the forward implication of Theorem 3. This hyperdeterminantal inequality can thus be included in building a semialgebraic description of the GM() model when restricted to nonsingular parameters.

However, the inequality yields a weaker conclusion than that arises from stochastic, or even real, nonsingular parameters, so additional inequalities are needed for a semialgebraic model description.

Nonetheless, motivated by the role the hyperdeterminant plays in the semialgebraic description of the GM() model, in a separate work Allman, Jarvis, Rhodes, and Sumner [1] construct generalizations of for . These functions are defined by

 fi(P;x)=det(Hx(det(P∗ix))),

where is a vector of auxiliary variables, and denotes the Hessian operator. They also have invariance properties under such as

 f3(P⋅(g1,g2,g3);x)=det(g1)kdet(g2)kdet(g3)2f3(P;g3x).

The next theorem establishes that the nonvanishing of these polynomials, in conjunction with the vanishing of some others, identifies the orbit of , and thus is an analog of Theorem 3 for larger .

{theorem}

[[1]] A complex tensor lies in the -orbit of if, and only if, for some , {romannum}

for all . Here denotes the classical adjoint, and equality means as a matrix of polynomials in .

is not identically zero as a polynomial in . Moreover, if the enumerated conditions hold for one , then they hold for all.

When the -orbit of is not dense among all tensors; rather its closure is a lower dimensional subvariety. This explains the necessity of the equalities in item (i). In the case these equalities simplify to and thus hold for all tensors. One can further verify that if then , so that Theorem 3 includes the first statement of Theorem 3.

One might hope that the polynomials had a sign property similar to that given in Theorem 3 for , so that a simple test could further distinguish the image of nonsingular real parameters. For , using functions related to the , a semialgebraic description of the -orbit of can in fact be obtained in this manner (see [1]), giving a complete analog of Theorem 3. However, for no analog is known.

Finally, we emphasize that for the functions are not the ones usually referred to as hyperdeterminants [19], but rather a different generalization of .

With semialgebraic conditions ensuring a tensor is in the orbit of in hand, we wish to supplement these to ensure it arises from nonsingular stochastic parameters. We address this in several steps; first, we give requirements that a tensor is the image of complex parameters under , and then that these parameters be nonnegative.

{proposition}

Let be a complex distribution. Then is in the image of nonsingular complex parameters for GM() on the -leaf tree if, and only if, is in the -orbit of and for . Moreover, the parameters are unique up to label swapping.

{proof}

To establish the claimed reverse implication, suppose for some , and let denote the vector of row sums of . A computation shows that

 P∗31=gT1\diag(r3)g2.

Thus is equivalent to the row sums of being nonzero, and similarly for the other .

Now is a complex matrix with row sums equal to one. Letting be the vector of entry-wise products of the , the entries of are nonzero and

 P=Diag(π)⋅(M1,M2,M3).

Since is a distribution,

 1 =((P∗11)∗21)∗31 =(((Diag(π)⋅(M1,M2,M3))∗11)∗21)∗31 =((Diag(π)∗1M11)∗2M21)∗3M31 =((Diag(π)∗11)∗21)∗31 =π⋅1,

so is a valid complex root distribution. Thus, is in the image of for complex, nonsingular parameters.

The forward implication in the theorem is straightforward.

The uniqueness of nonsingular parameters up to permutation of states at the internal node of the tree was discussed at the end of subsection 2.3.

Combining this Proposition with Theorems 3 and 3 we obtain the following.

{corollary}

A complex distribution is the image of complex, nonsingular parameters for GM() on the -leaf tree if, and only if, it satisfies the semialgebraic conditions (i) and (ii) of Theorem 3 and {remunerate}

for , .

For , is the image of real nonsingular parameters for GM() on the -leaf tree if, and only if, it satisfies and the semialgebraic conditions (iii).

Next we characterize the image of nonsingular stochastic parameters, and finally of strictly positive nonsingular parameters. The key to this step is the construction of certain quadratic forms whose positive semi-definiteness (respectively definiteness) encodes nonnegativity (respectively positivity) of some of the numerical parameters. Sylvester’s Theorem [35], which we state for reference here, then gives a semialgebraic version of these conditions.

Recall that a principal minor of a matrix is the determinant of a submatrix chosen with the same row and column indices, and that a leading principal minor is one of these where the chosen indices are for any .

{theorem}

[Sylvester’s Theorem] Let be an real symmetric matrix and the associated quadratic form on . Then {remunerate}

is positive semidefinite if, and only if, all principal minors of are nonnegative, and

is positive definite if, and only if, all leading principal minors of are strictly positive.

We use Sylvester’s Theorem to establish the following theorem.

{theorem}

A tensor is the image of nonsingular stochastic parameters for the GM() model on the -leaf tree if, and only if, its entries are nonnegative and sum to , conditions (i), (ii), and (iii) of Theorem 3 and Corollary 3 are satisfied, and {remunerate}

are positive, and all principal minors of the following matrices are nonnegative:

Moreover, is the image of nonsingular positive parameters if, and only if, its entries are positive and sum to , conditions (i), (ii), and (iii) are satisfied and {remunerate}

all leading principal minors of the matrices in (3) and (3) are positive.

In both of these cases, the nonsingular parameters are unique up to label swapping.

{proof}

Let be an arbitrary nonnegative tensor whose entries sum to 1. By Corollary 3, the first 3 conditions are equivalent to for complex nonsingular parameters. We need to show the addition of assumption (iv) is equivalent to parameters being nonnegative.

Note that

 P⋅⋅+ =P∗31=MT1\diag(π)M2, P⋅+⋅ =P∗21=MT1\diag(π)M3, P+⋅⋅ =P∗11=MT2\diag(π)M3.

Since is nonsingular, we find

 PT+⋅⋅P−1⋅⋅+P⋅+⋅=MT3\diag(π)M3 (5)

is symmetric, and the matrix of a positive definite quadratic form if, and only if the entries of are positive. Equivalently, by Sylvester’s theorem, all leading principal minors of this matrix must be positive.

Similarly, using slices one has

 PTi⋅⋅P−1⋅⋅+P⋅+⋅=MT3\diag(π)Λ1,iM3

where is the diagonal matrix with entries from the th column of . Thus its principal minors being nonnegative is equivalent (since we already have that the entries of are positive) by Sylvester’s Theorem to the entries in the th column of being nonnegative. The product similarly can be used for a condition that the th column of be nonnegative, and the product for the columns of .

Multiplying all these matrices by the square of an appropriate nonzero determinant clears denominators and preserves signs, yielding (3) and (3).

For the second statement, that the matrices in (3) and (3) have positive leading principal minors is equivalent, by Sylvester’s Theorem, to the positive definiteness of the quadratic forms, which in turn is equivalent to the positiveness of parameters. Since these parameters are nonsingular, the only source of non-uniqueness is label swapping.

Remark. Matrix products such as that of equation (5) appeared in [3], where their symmetry was used to produce phylogenetic invariants, but their usefulness for stating nonnegativity of parameters was overlooked.

Remark. The minors of the matrices in (3) and (3) are polynomials in the entries of of degree , with . However, as the leading determinant in those products is real and nonzero, one can remove an even power of it without affecting the sign of the minors. Thus the polynomial inequality of degree can be replaced by one of lower degree, , where or is the parity of .

In the case of the -state model, the above result can be made more complete, by also explicitly describing the image of singular parameters. While semialgebraic characterizations of probability distributions for both nonsingular and singular parameters on the 3-leaf tree have been given previously by [32], [10], [23], and [37], we provide another since our approach is novel.

{theorem}

A tensor is in the image of the stochastic parameterization map for the GM(2) model on the 3-leaf tree if, and only if, its entries are nonnegative and sum to 1, and one of the following occur: {remunerate}

, for , all leading principal minors of

are positive, and all principal minors of the following six matrices are nonnegative:

In this case, is the image of unique (up to label swapping) nonsingular parameters.

, and all minors of at least one of the matrices , , are zero. In this case, arises from singular parameters. If has all positive entries, then it is the image of infinitely many singular stochastic parameter choices.

{proof}

Using Theorem 3, and the observations made for immediately following the statement of Theorem 3, case 1 is already established under the weaker condition that . However, since the parameters are nonsingular and real when and the conditions of case 1 are satisfied, by Theorem 3 we may assume equivalently that .

To establish case 2, first assume is the image of singular stochastic parameters. Then certainly has nonnegative entries summing to 1, and by equations (1) and (2), . Since

 Flat1|23(P)=MT1\diag(π)M,

where is the matrix obtained by taking the tensor product of corresponding rows of and , this flattening has rank 1, if has a zero entry or has rank 1. Similar products for the other flattenings show that singular parameters imply at least one of the flattenings has rank 1, and hence its minors vanish.

Conversely, suppose and at least one of the flattenings has vanishing minors, and hence rank 1. Then by the classification of orbits given in [16, Table 7.1], is in the -orbit of one of the following four tensors: the tensor (in which case all three flattenings have rank 1) or one of the 3 tensors with parallel slices and the zero matrix (in which case exactly one of the flattenings has rank 1).

If , then . Since the entries of are nonnegative and sum to 1, one sees the top rows of each can be chosen to be nonnegative, summing to 1. The bottom row of each can also be replaced with any nonnegative row summing to 1 that is independent of the top row. Taking , this gives us infinitely many choices of singular stochastic parameters giving rise to . Alternatively, one could choose each Markov matrix to have two identical rows, and any with nonzero entries to obtain other singular stochastic parameters leading to .

For the remaining cases assume, without loss of generality, that , where , and is the zero matrix. Then and . Since the entries of are nonnegative and add to 1, we may assume that the top row of is also nonnegative and adds to 1. Choose to have two identical rows matching the top row of . Now is a rank-2 nonnegative matrix with entries adding to 1. Such a matrix can be written in the form with, for instance , , . Then one has If has positive entries one may also choose sufficiently close to so that , , all have nonnegative entries, thus obtaining infinitely many singular parameter choices leading to . (The example of shows that with only nonnegative entries there may be only finitely many singular parameter choices leading to .)

Remark. The analysis of the singular parameter case in this proof, by appealing without explanation to [16, Table 7.1], has not made explicit the importance of the notion of tensor rank. Indeed, that concept is central to both [16] and [1] and thus plays a crucial behind-the-scenes role in this work as well. The first singular case, a tensor in the orbit of , is of tensor rank , while the second, a tensor in the orbit of , is of tensor rank yet multlilinear rank . The nonsingular case is those of tensor rank and multlinear rank .

Remark. In case 3 of the theorem, the polynomial inequalities are of degree 4 and 2 (from and the determinants) and degree 5 and 10 (from the minors), but the degree 10 ones can be lowered to degree 6 by removing a factor of a determinant squared. The polynomial equalities appearing in case 3 have degree and , with the quadratics simply expressing that one of the leaf variables is independent of the others.

Minor modifications to the argument give the extension to positive parameters below.

{theorem}

A tensor is in the image of the positive stochastic parameterization map for the GM(2) model on the 3-leaf tree if, and only if, its entries are positive and sum to 1, and the conditions of Theorem 3 are met with the following modification to case 1: all leading principal minors of the matrices are positive.

{proof}

Case 1 is proved by combining the arguments for Theorems 3 and 3.

For case 2, if is in the orbit of the argument of Theorem 3 still applies, replacing ‘nonnegative’ with ‘positive’, and using the second construction of singular parameters. If is in the orbit of we simply replace ‘nonnegative’ with ‘positive’ in the argument.

As mentioned above, semialgebraic descriptions of the binary general Markov model on trees have been given previously, but in ways where generalizations to -state models were not apparent. Although we have considered only the -leaf tree (with larger trees to be discussed in §5) thus far, we pause here to discuss some connections to the recent works of Zwiernik and Smith [37] and Klaere and Liebscher [23] on semialgebraic descriptions of the binary model on trees, as well as the older work of Pearl and Tarsi [29].

Though using different approaches (and in the case of [29] with different goals), all three of these works emphasize statistical interpretations of various quantities computed from a probability distribution (e.g., covariances, conditional covariances, moments, tree cumulants). While analogs of some of the same quantities appear in our generalization to -state models, we have used algebra, rather than statistics, to guide our derivation. Although an inequality such as which appears in our description can be given a simple statistical interpretation when (that two leaf variables are not independent), for larger its meaning is more subtle, as it is tied to our notion of nonsingular parameters. Thus our generalization to -state models uses a more detailed development than the simple generalization of statistical concepts from to larger .

The role played by the hyperdeterminant in giving a semialgebraic model description for was first made clear in [37]. Its role was essentially independently discovered in [23, Theorem 6], though without recognition that it is a classical algebraic object. Indeed, both works recognize as an expression in the 2-variable and 3-variable covariances (i.e., central moments). This is a fascinating intertwining of algebra and statistics, yet we did not find it helpful in understanding the correct analog of for higher ; rather, [1] develops the analog used here through algebraic motivation entirely. It would nonetheless be interesting to understand whether can be described in more statistical language.

The explicit semialgebraic model descriptions for the -state model given in both [37] and [23] take on quite different forms than ours. This is not a surprise as such a description is far from unique, and different reasoning may produce different inequalities. The version in [37], for example, is stated in a different coordinate system, using tree cumulants rather than the entries of . We find all these descriptions valuable, as what should be considered the simplest, or most natural, description is not obvious.

The focus of [29] is on recovery of parameters for on the -leaf tree from a probability distribution assumed to have arisen from stochastic parameters, in an approach based on earlier work in latent structure analysis [27]. In addressing this question, however, semialgebraic conditions on the distribution are obtained. For instance, the non-vanishing of denominators is needed for formulas to make sense, and thus certain polynomials must be nonzero. (The authors seem to assume nonsingularity of parameters, though that is never clarified in the paper.) While never arises in [29], it is remarkable to note, then, that using the results of Theorem 3 and making explicit the tacit assumptions that various rational expressions exist and are real, the conditions given in Theorem 1 of [29] are sufficient to show , and thus is in the image of stochastic nonsingular parameters. Thus one can extract a semialgebraic model description from this work, even if that was not its goal.

## 4 Inequalities for 3-leaf trees via Sturm theory

The semialgebraic model descriptions given in the previous section have the advantage of being easily describable in a uniform way for all . However, semialgebraic descriptions are not unique, and there is no clear notion of what description should be considered simplest. It is also of interest, therefore, to obtain alternative polynomial inequalities, possibly of lower degree, that must also be satisfied by probability distributions in the model, in the hopes that they lead to another, perhaps better, semialgebraic model description, or that they might be of further use for testing whether a distribution arises from the model. We explore another approach to doing so here.

### 4.1 Review of Sturm Theory

Sturm theory can be used to impose conditions that roots of a univariate polynomial lie in a certain interval. We briefly recall basic definitions and results. Suppose that is a non-constant polynomial of degree , with no multiple roots, and we wish to count the roots of in an interval where . Then a Sturm sequence for on the interval is a sequence of polynomials satisfying certain sign relationships at the zeros of the in the interval . We give an example below, but for specifics, see, for instance, [22]. For any which is not a root of any , the sign variation is the number of sign changes in the sequence .

{theorem}

[Sturm’s Theorem] Let be a non-constant polynomial and a Sturm sequence for on . Then the number of distinct roots of in is equal to .

Though other constructions of Sturm sequences exist, we use the standard sequence, derived using a modified Euclidean algorithm. If is a polynomial of degree , then set , and for , take to be the opposite of the remainder of division of by . This yields a Sturm sequence for on any interval with , .

To illustrate, suppose that . Then the standard sequence is . For the particular choice of coefficients and , using this sequence on , we calculate that is the sequence , and . Thus, , , so has roots in . Indeed, the factorization shows this directly.

Two comments are in order: First, in this example observe that is one-fourth the discriminant of , and its positivity ensures that a monic quadratic has distinct and real roots. This shows that Sturm theory can produce familiar algebraic expressions, such as the quadratic discriminant, and thus gives a tool for generalizing them. The second observation we state more formally, as it is needed in our arguments below.

{corollary}

If is of degree with neither nor a root, and is a Sturm sequence for on , then has distinct roots in this interval precisely when and .

This corollary allows us to obtain inequalities ensuring a polynomial has distinct roots in : One simply requires that the alternate in being or , while the either be all or all . We informally refer to inequalities obtained in this manner as Sturm sequence inequalities.

### 4.2 Eigenvalues and Sturm Sequences

We now give a second construction of inequalities that, if satisfied, ensure that model parameters on a 3-leaf tree are stochastic. While in §3 we constructed matrices encoding positivity of parameters through requirements on associated quadratic forms, here we instead construct matrices whose eigenvalues encode parameters.

Suppose that for complex nonsingular . Recall that for ,

 Pi⋅⋅ =P∗1ei=MT2\diag(π)Λ1,iM3, P+⋅⋅ =P∗11=MT2\diag(π)M3

where is the diagonal matrix with entries from the th column of . Thus, by nonsingularity of the parameters,

 A1,i:=P−1+⋅⋅Pi⋅⋅=M−13Λ1,iM3

has as eigenvalues the entries of the th column of . (This construction underlies the proof of parameter identifiability for the GM( model on trees [15], and the construction of phylogenetic invariants in [3], including the equality in condition (i) of Theorem 3 of this paper.) Similarly we define the matrices

 A2,i:=P−1⋅+⋅P⋅i⋅ =M−13Λ2,iM3, A3,i:=P−1⋅⋅+P⋅⋅i =M−12Λ3,iM2

with the diagonal matrix with entries from the th column of the matrix .

{proposition}

Let be a real tensor that is the image of complex nonsingular parameters. For each of the matrices , , , assume the characteristic polynomial has neither nor as a root, and let denote the standard Sturm sequence for it. Then and for all if, and only if, the matrices , , are positive Markov matrices with no repeated element in any column and is real.

{proof}

The statement about the follows from Corollary 4.1. Then, since the are real and invertible

 π=((P⋅(M−11,M−12,M−13))∗11)∗21

shows is real.

Each matrix has entries that are rational in the entries of , with denominator of degree and numerator of degree . The characteristic polynomial of thus also has coefficients that are rational functions in , and in fact the non-leading coefficients of are rational of degree over . This can be seen explicitly, for example when , in the following:

 f(x) =det(xI−P−1+⋅⋅Pi⋅⋅) =det(xP−1+⋅⋅P+⋅⋅−P−1+⋅⋅Pi⋅⋅) =det(P−1+⋅⋅(xP+⋅⋅−Pi⋅⋅)) =det(xP+⋅⋅−Pi⋅⋅)det(P+⋅⋅). (6)

It follows that the Sturm sequence inequalities, which are constructed from the coefficients , are rational in the entries of as well. Indeed, by multiplying each of these inequalities by a sufficiently high even power of to avoid changing signs, these expressions become polynomial in . Thus, one can phrase the conditions and as a collection of polynomial inequalities. Finally, note that since is a monic characteristic polynomial, then and , so the signs of all and are determined. This leads to the following:

{corollary}

Consider real tensors arising from complex nonsingular parameters. Then one can give a finite collection of strict polynomial inequalities that hold precisely when the are positive Markov matrices with no repeated column entries, and is real.

Remark. For simplicity, we have restricted our discussion to polynomials with no multiple roots, leading to the constraints on the matrix column entries given above. However, this restriction can be removed, by considering Sturm sequences for polynomials with repeated roots. We suggest [24, 28] for futher information.

Note that the non-strict versions of the inequalities of Corollary 4.2 must continue to hold on the closure of the image of parameters described in the corollary. As this closure includes the image of Markov matrices which may have repeated column entries, or entries of 0 or 1, or are singular, the non-strict inequalities hold for all stochastic parameters.

However, some distributions arising from non-stochastic parameters may satisfy the non-strict inequalities as well. Thus while we have obtained semialgebraic statements guaranteeing stochasticity of nonsingular parameters of a particular form, this does not seem to lead to a complete semialgebraic description of the image of all stochastic parameters for arbitrary . In the case , however, we can do better, as we show next.

### 4.3 Sturm sequences for GM(2)

In the specific case of the -state model, we explicitly give the inequalities of Corollary 4.2. To this end, suppose that is a probability distribution arising from nonsingular parameters.

For any , , let . The standard Sturm sequence of the characteristic polynomial of is

 f0 =t2−tr(A)t+det(A), f1 =2t−tr(A), f2 =14(tr(A)2−4det(A))=14δ(f0),

where denotes the discriminant of the monic quadratic polynomial .

Since , the nonsingularity of the together with the fact that their rows sum to 1 implies that none of their columns have repeated entries. Thus the characteristic polynomial must have distinct roots. To ensure that these roots are in , so that the matrix parameters are Markov, the variation in signs of the Sturm sequence must be:

 f0(0) =det(A)>0, f0(1)=1−tr(A)+det(A)>0, f