A new balance index for phylogenetic trees
Several indices that measure the degree of balance of a rooted phylogenetic tree have been proposed so far in the literature. In this work we define and study a new index of this kind, which we call the total cophenetic index: the sum, over all pairs of different leaves, of the depth of their least common ancestor. This index makes sense for arbitrary trees, can be computed in linear time and it has a larger range of values and a greater resolution power than other indices like Colless’ or Sackin’s. We compute its maximum and minimum values for arbitrary and binary trees, as well as exact formulas for its expected value for binary trees under the Yule and the uniform models of evolution. As a byproduct of this study, we obtain an exact formula for the expected value of the Sackin index under the uniform model, a result that seems to be new in the literature.
Keywords: Phylogenetic tree, Imbalance index, Cophenetic value, Sackin index
A phylogenetic tree is a representation of the shared evolutionary history of a set of extant species. From the mathematical point of view, it is a leaf-labeled rooted tree, with its leaves representing the extant species under study, its internal nodes representing common ancestors of some of them, the root representing the most recent common ancestor of all of them, and the arcs representing direct descent through mutations.
One of the most thoroughly studied shape properties of phylogenetic trees is their balance, that is, the degree to which the children of internal nodes tend to have the same number of descendant taxa. This global degree of balance of a tree is usually quantified by means of a single number generically called a balance index. The two most popular balance indices are Sackin’s Sackin:72 () and Colless’ Colless:82 () (see §2.2), but there are many more (fel:04, , Chap. 33), and Shao and Sokal (Shao:90, , p. 1990) explicitly advise using more than one such index to quantify tree balance.
Such balance indices only depend on the topology of the trees, not on the branch lengths or the actual taxa labeling their leaves. Since it is believed that the raw topology of a phylogenetic tree already reflects, at least to some extent, the evolutionary processes that have produced it (fel:04, , Chap. 33), these indices have also been widely used as tools to test stochastic models of evolution Mooers97 (); Shao:90 ().
Two of the most popular stochastic models of evolutionary tree growth are the Yule and the uniform models. The Yule, or Equal-Rate Markov, model Harding71 (); Yule () starts with a single node and, at every step, a leaf is chosen randomly and uniformly and replaced by a cherry, i.e., a phylogenetic tree consisting only of a root and two leaves. Finally, once the desired number of leaves is reached, the labels are assigned randomly and uniformly to the leaves. This corresponds to a model of evolution where, at each step, each currently extant species can give rise with the same probability to two new species. Under this model, different trees with the same number of leaves may have different probabilities. In contrast, the main feature of the uniform, or Proportional to Distinguishable Arrangements, model Rosen78 () is that all phylogenetic trees with the same number of leaves have the same probability. From the point of view of tree growth CS (); cherries (), this corresponds to a process where, starting with a node labeled 1, at the $k$-th step a new pendant arc, ending in the leaf labeled $k+1$, is added either to a new root or to some edge, all possible locations of this new pendant arc being equiprobable. Notice that this is not an explicit model of evolution, only of tree growth. Several properties of the distributions of Sackin’s and Colless’ indices have been studied in the literature under these models BF:05 (); BFJ:06 (); Heard92 (); KiSl:93 (); Mul11 (); Rogers:93 (); Rogers:94 (); Rogers:96 (); SM01 ().
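The Yule growth process described above can be sketched in a few lines of Python. This is only an illustration under our own conventions: the function names `yule_tree` and `freeze`, the `seed` parameter, and the nested-list encoding of trees are assumptions made for this sketch and do not come from the cited references.

```python
import random


def yule_tree(n, seed=None):
    """Grow a binary phylogenetic tree with n leaves under the Yule model.

    While growing, a leaf is an empty list and an internal node is a list
    [left, right]. At the end every leaf becomes [label].
    """
    rng = random.Random(seed)
    root = []                      # the single starting node
    leaves = [root]
    while len(leaves) < n:
        # Choose a current leaf uniformly at random ...
        leaf = leaves.pop(rng.randrange(len(leaves)))
        # ... and replace it by a cherry (a root with two leaves).
        left, right = [], []
        leaf.extend([left, right])
        leaves.extend([left, right])
    # Once n leaves are reached, assign the labels 1..n uniformly at random.
    labels = list(range(1, n + 1))
    rng.shuffle(labels)
    for leaf, label in zip(leaves, labels):
        leaf.append(label)
    return root


def freeze(node):
    """Convert the nested lists into nested tuples with integer leaves."""
    if len(node) == 1 and isinstance(node[0], int):
        return node[0]
    return tuple(freeze(child) for child in node)


tree = freeze(yule_tree(6, seed=1))
```

The resulting value encodes a leaf as its integer label and an internal node as the pair of its two subtrees. The uniform model is not covered by this sketch.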
In this paper we propose a new balance index, the total cophenetic index. It is defined as the sum of the cophenetic values Sokal:62 () of all pairs of different leaves. The main features of our index are that, unlike Colless’ index, it makes sense for arbitrary (i.e., not necessarily fully resolved) trees; like Colless’ and Sackin’s indices, it can be easily computed in linear time; its range of values is larger than Colless’ and Sackin’s (up to $O(n^3)$, instead of $O(n^2)$); and it has a greater resolution power than those indices.
We compute the maximum and minimum values of our index, both in the arbitrary and the binary cases, and explicit formulas for its average value under the Yule and the uniform models for binary trees. We actually deduce its average value under the uniform model from an explicit formula for the average value of the Sackin index. This average value was known until now only for its limit distribution BFJ:06 (), and our formula seems thus to be new in the literature.
The rest of this paper is organized as follows. In Section 2 we introduce the basic notations and facts on phylogenetic trees that will be used henceforth, and we recall some basic facts on the Sackin and the Colless indices. Then, in Section 3, we define our total cophenetic index and we establish its basic properties. In Section 4 we compute its maximum and minimum values, and then, in subsequent sections, we compute its expected value under the Yule and the uniform models. We finally devote a last section to conclusions and the discussion of two preliminary numerical experiments involving the total cophenetic index.
2.1 Phylogenetic trees
In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a rooted tree with its leaves bijectively labeled in the set $S$. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on the set $\{1,\ldots,n\}$. We shall denote by $L(T)$ the set of leaves of a phylogenetic tree $T$ and by $V_{int}(T)$ its set of internal nodes.
A phylogenetic tree is binary, or fully resolved, when all its internal nodes are bifurcating, that is, when every internal node has exactly two children.
Whenever there exists a path from $u$ to $v$ in a phylogenetic tree $T$, we shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$. The cluster of a node $v$ in $T$ is the set $C_T(v)$ of its descendant leaves, and we shall denote by $\kappa_T(v)$ the cardinal $|C_T(v)|$, that is, the number of descendant leaves of $v$.
Given a node $v$ of a phylogenetic tree $T$, the subtree of $T$ rooted at $v$ is the subgraph of $T$ induced on the set of descendants of $v$. It is a phylogenetic tree on $C_T(v)$ with root this node $v$.
The lowest common ancestor (LCA) of a pair of nodes $u,v$ of a phylogenetic tree $T$, in symbols $\mathrm{LCA}_T(u,v)$, is the unique common ancestor of them that is a descendant of every other common ancestor of them.
The depth $\delta_T(v)$ of a node $v$ in a phylogenetic tree $T$ is the length (in number of arcs) of the unique path from the root to $v$.
A rooted caterpillar is a binary phylogenetic tree all of whose internal nodes have a leaf child: see Fig. 1.(a). A rooted star is a phylogenetic tree such that all its leaves have depth 1: see Fig. 1.(b).
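Let $T$ be a binary phylogenetic tree. For every $v\in V_{int}(T)$, say with children $v_1,v_2$, the balance value of $v$ is $bal_T(v)=|\kappa_T(v_1)-\kappa_T(v_2)|$, where $\kappa_T$ counts descendant leaves. An internal node $v$ of $T$ is balanced when $bal_T(v)\leq 1$. So, a node $v$ with children $v_1$ and $v_2$ is balanced if, and only if, $\{\kappa_T(v_1),\kappa_T(v_2)\}=\{\lfloor \kappa_T(v)/2\rfloor,\lceil \kappa_T(v)/2\rceil\}$.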
We shall say that a binary phylogenetic tree is maximally balanced when all its internal nodes are balanced. Recurrently, a binary phylogenetic tree is maximally balanced when its root is balanced and both subtrees rooted at the children of the root are maximally balanced. Notice that, for any number of nodes, the topology of a maximally balanced tree with leaves is fixed, and therefore two maximally balanced trees with the same number of leaves differ only in their labeling. Fig. 2 depicts the maximally balanced trees with leaves, up to relabelings.
Let $\mathcal{T}_n$ (resp., $\mathcal{BT}_n$) be the set of isomorphism classes of phylogenetic trees (resp., binary phylogenetic trees) with $n$ leaves. It is well known (fel:04, , Ch. 3) that $|\mathcal{BT}_1|=1$ and, for every $n\geq 2$,
$$|\mathcal{BT}_n|=(2n-3)!!=(2n-3)(2n-5)\cdots 3\cdot 1.$$
No closed formula is known for the cardinal $|\mathcal{T}_n|$, only recurrences or generating functions (see again (fel:04, , Ch. 3) and the references therein).
An ordered $k$-forest on a set $S$ is an ordered sequence of $k$ phylogenetic trees $(T_1,\ldots,T_k)$, each $T_i$ on a set $S_i$ of taxa, such that these sets $S_i$ are pairwise disjoint and their union is $S$. An ordered forest is binary when it consists of binary trees. Let $\mathcal{F}_{k,n}$ (resp., $\mathcal{BF}_{k,n}$) be the set of isomorphism classes of ordered $k$-forests (resp., binary ordered $k$-forests) on a set $S$ with $|S|=n$. It is known (see, for instance, (MirR10, , Lem. 1)) that for every $1\leq k\leq n$,
$$|\mathcal{BF}_{k,n}|=\frac{(2n-k-1)!\,k}{(n-k)!\,2^{n-k}}.$$
Again, no closed formula is known for $|\mathcal{F}_{k,n}|$.
2.2 Balance indices
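Several balance indices have been proposed so far in the literature (fel:04, , p. 563). The two most popular ones are the Sackin index Sackin:72 () and the Colless index Colless:82 (). The Sackin index of a phylogenetic tree $T\in\mathcal{T}_n$ is defined as the sum of the depths $\delta_T(i)$ of its leaves:
$$S(T)=\sum_{i\in L(T)}\delta_T(i).$$
Alternatively BF:05 (), in terms of the numbers $\kappa_T(v)$ of descendant leaves of the internal nodes,
$$S(T)=\sum_{v\in V_{int}(T)}\kappa_T(v).$$
On the other hand, the Colless index of a binary phylogenetic tree $T\in\mathcal{BT}_n$ is defined as
$$C(T)=\sum_{v\in V_{int}(T)}|\kappa_T(v_1)-\kappa_T(v_2)|,$$
where $v_1,v_2$ denote the children of each internal node $v$.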
This Colless index has been extended to non-binary trees by defining for every non-bifurcating internal node Shao:90 ().
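As a small illustration of these two definitions, the following Python sketch computes both indices on trees encoded as nested tuples (a leaf is an integer label, an internal node is the tuple of its subtrees). The encoding and the function names `sackin`, `n_leaves` and `colless` are our own assumptions for this example; the Colless function only handles the binary case.

```python
def n_leaves(tree):
    """Number of descendant leaves of the root of `tree`."""
    if isinstance(tree, int):
        return 1
    return sum(n_leaves(child) for child in tree)


def sackin(tree, depth=0):
    """Sackin index: sum of the depths of the leaves."""
    if isinstance(tree, int):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)


def colless(tree):
    """Colless index of a binary tree: sum over the internal nodes of the
    absolute difference between the leaf counts of their two children."""
    if isinstance(tree, int):
        return 0
    left, right = tree  # fails on purpose if the node is not bifurcating
    return abs(n_leaves(left) - n_leaves(right)) + colless(left) + colless(right)


caterpillar = (1, (2, (3, 4)))
balanced = ((1, 2), (3, 4))
print(sackin(caterpillar), colless(caterpillar))  # 9 3
print(sackin(balanced), colless(balanced))        # 8 0
```

For simplicity `colless` recomputes leaf counts at every node, so this version is quadratic in the worst case; a single post-order pass would make it linear.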
It is straightforward to notice that these two indices depend only on the topology of the tree, and they are invariant under isomorphisms and relabelings of leaves. This is desirable in a balance index, because the degree of symmetry of a tree depends only on its shape.
Both Sackin’s and Colless’ indices reach their maximum value exactly at caterpillars, which are clearly the most imbalanced trees, and they reach their minimum on $\mathcal{BT}_n$ at the maximally balanced trees Heard92 (); Shao:90 (). In both cases, the maximum value is in $O(n^2)$. But they may also reach their minimum on $\mathcal{BT}_n$ at other trees. For instance, both indices take their minimum value at the two trees depicted in Fig. 3, only one of which is maximally balanced. Actually, it is easy to check that Sackin’s index is invariant under interchanges of cousins, which may produce trees with different degrees of symmetry but the same Sackin index.
The main drawback of Colless’ index is that it is difficult to generalize meaningfully to non-binary trees. Moreover, as Fig. 3 shows, although not every interchange of cousins yields trees with the same Colless index, there are still interchanges of cousins that modify the symmetry of the trees but preserve this index.
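The expected values of these indices on $\mathcal{BT}_n$ have been studied under the Yule and the uniform models. Recall that, under the Yule model, different trees in $\mathcal{BT}_n$ may have different probabilities: namely, a tree $T\in\mathcal{BT}_n$ has probability Brown (); SM01 ()
$$P_Y(T)=\frac{2^{n-1}}{n!}\prod_{v\in V_{int}(T)}\frac{1}{\kappa_T(v)-1},$$
where $\kappa_T(v)$ is the number of descendant leaves of $v$. Under the uniform model, all trees in $\mathcal{BT}_n$ are equiprobable, and thus they have probability
$$P_U(T)=\frac{1}{(2n-3)!!}.$$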
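The two probabilities can be evaluated exactly with rational arithmetic. The sketch below assumes the standard product formula for the Yule probability of a labeled binary tree, $2^{n-1}/n!$ times the product of $1/(\kappa_T(v)-1)$ over the internal nodes, and the count $(2n-3)!!$ of binary trees with $n$ leaves. Trees are nested tuples with integer leaves; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def cluster_sizes(tree):
    """Return (n, sizes): the number of leaves and the list of the numbers
    of descendant leaves of all internal nodes (root included)."""
    sizes = []

    def visit(node):
        if isinstance(node, int):
            return 1
        size = sum(visit(child) for child in node)
        sizes.append(size)
        return size

    return visit(tree), sizes


def yule_probability(tree):
    """Probability of a labeled binary tree under the Yule model."""
    n, sizes = cluster_sizes(tree)
    prob = Fraction(2 ** (n - 1), factorial(n))
    for size in sizes:
        prob /= size - 1
    return prob


def uniform_probability(tree):
    """Probability under the uniform model: 1 / (2n-3)!!."""
    n, _ = cluster_sizes(tree)
    count = 1
    for odd in range(1, 2 * n - 2, 2):  # 1 * 3 * ... * (2n-3)
        count *= odd
    return Fraction(1, count)


print(yule_probability((1, (2, (3, 4)))))    # 1/18
print(yule_probability(((1, 2), (3, 4))))    # 1/9
print(uniform_probability(((1, 2), (3, 4)))) # 1/15
```

With 4 leaves there are 12 labeled caterpillars and 3 labeled balanced trees, and indeed 12/18 + 3/9 = 1.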
Let $S_n$ and $C_n$ be the random variables defined by choosing a tree $T\in\mathcal{BT}_n$ and computing $S(T)$ or $C(T)$, respectively. The following facts are known about the expected values of these random variables:
3 The total cophenetic index
For every pair of leaves $i,j$ in a phylogenetic tree $T$, their cophenetic value Sokal:62 () is the depth of their lowest common ancestor:
$$\varphi_T(i,j)=\delta_T(\mathrm{LCA}_T(i,j)).$$
The total cophenetic index of a phylogenetic tree $T\in\mathcal{T}_n$ is the sum of the cophenetic values of its pairs of different leaves:
$$\Phi(T)=\sum_{1\leq i<j\leq n}\varphi_T(i,j).$$
This index can be seen as an extension of Sackin’s: instead of adding up the depths of the leaves (that is, the depths of the LCA of every leaf and itself), $\Phi$ adds up the depths of the LCA of every pair of different leaves in $T$. Notice also that, like Sackin’s and Colless’ indices, $\Phi(T)$ only depends on the topology of $T$, and in particular it is invariant under permutations of its labels.
Fig. 4 shows all possible topologies of phylogenetic trees with 5 leaves, and their total cophenetic indices. Although we shall return to this point later for trees with an arbitrary number of leaves, notice that the rooted star has the smallest total cophenetic value, 0; the binary tree with the smallest total cophenetic value is the maximally balanced one; and the tree with the largest total cophenetic value is the caterpillar.
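The definition can be turned directly into a quadratic-time computation: record, for each leaf, the path from the root, and note that the depth of the lowest common ancestor of two leaves is the length of the common prefix of their paths. The following Python sketch does this for trees encoded as nested tuples with integer leaves; the encoding and the function names are assumptions made for this illustration.

```python
from itertools import combinations


def leaf_paths(tree, path=()):
    """Map each leaf label to the tuple of child positions from the root."""
    if isinstance(tree, int):
        return {tree: path}
    paths = {}
    for position, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (position,)))
    return paths


def cophenetic_value(path_i, path_j):
    """Depth of the lowest common ancestor of two different leaves,
    computed as the length of the common prefix of their root paths."""
    depth = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        depth += 1
    return depth


def total_cophenetic_index(tree):
    """Sum of the cophenetic values over all pairs of different leaves."""
    paths = leaf_paths(tree)
    return sum(
        cophenetic_value(paths[i], paths[j])
        for i, j in combinations(sorted(paths), 2)
    )


print(total_cophenetic_index((1, 2, 3, 4, 5)))            # star: 0
print(total_cophenetic_index(((1, (2, 3)), (4, 5))))      # maximally balanced: 5
print(total_cophenetic_index((1, (2, (3, (4, 5))))))      # caterpillar: 10
```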
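The following alternative expression for $\Phi(T)$ will be useful in many proofs.
Let $T\in\mathcal{T}_n$ be a phylogenetic tree with root $r$. Then,
$$\Phi(T)=\sum_{v\in V_{int}(T)\setminus\{r\}}\binom{\kappa_T(v)}{2}.$$
For every $v\in V_{int}(T)\setminus\{r\}$ and for every $1\leq i<j\leq n$, let
$$\varepsilon_v(i,j)=\begin{cases}1 & \text{if } i,j\in C_T(v),\\ 0 & \text{otherwise.}\end{cases}$$
Then, $\varphi_T(i,j)=\sum_{v\in V_{int}(T)\setminus\{r\}}\varepsilon_v(i,j)$, because the depth of the lowest common ancestor of $i$ and $j$ is the number of non-root common ancestors of $i$ and $j$, and thus
$$\Phi(T)=\sum_{1\leq i<j\leq n}\ \sum_{v\in V_{int}(T)\setminus\{r\}}\varepsilon_v(i,j)=\sum_{v\in V_{int}(T)\setminus\{r\}}\ \sum_{1\leq i<j\leq n}\varepsilon_v(i,j)=\sum_{v\in V_{int}(T)\setminus\{r\}}\binom{\kappa_T(v)}{2}.$$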
For every $T\in\mathcal{T}_n$, $\Phi(T)$ can be computed in time $O(n)$.
The vector $(\kappa_T(v))_{v\in V_{int}(T)}$ can be computed in linear time by traversing the tree $T$ in post order (Val:02, , §3.2), and then, by the last lemma, $\Phi(T)$ is computed in linear time from this vector. ∎
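A minimal sketch of this linear-time computation: a single post-order pass returns the number of descendant leaves of each node and accumulates, for every internal node other than the root, the number of pairs of leaves below it. The nested-tuple encoding (integer leaves) and the function name are assumptions of this example, not the authors' implementation.

```python
def total_cophenetic_index(tree):
    """Total cophenetic index via one post-order traversal: sum of
    k*(k-1)/2 over all non-root internal nodes with k descendant leaves."""
    total = 0

    def visit(node, is_root):
        nonlocal total
        if isinstance(node, int):
            return 1
        # Children are processed before their parent (post order).
        size = sum(visit(child, False) for child in node)
        if not is_root:
            total += size * (size - 1) // 2
        return size

    visit(tree, True)
    return total


print(total_cophenetic_index(((1, (2, 3)), (4, 5))))  # 5
```

The recursion depth equals the depth of the tree, so very deep caterpillars would need an explicit stack instead of recursion.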
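Let $T$ be a phylogenetic tree with root $r$, and let $T_1,\ldots,T_k$, $k\geq 2$, be the subtrees rooted at the children of $r$; cf. Fig. 5. Then,
$$\Phi(T)=\sum_{i=1}^k\Phi(T_i)+\sum_{i=1}^k\binom{|L(T_i)|}{2}.$$
Let $r_i$ be the root of $T_i$, $i=1,\ldots,k$, and $r$ the root of $T$. Then, by Lemma 2,
$$\Phi(T)=\sum_{v\in V_{int}(T)\setminus\{r\}}\binom{\kappa_T(v)}{2}=\sum_{i=1}^k\Big(\binom{\kappa_T(r_i)}{2}+\sum_{v\in V_{int}(T_i)\setminus\{r_i\}}\binom{\kappa_{T_i}(v)}{2}\Big)=\sum_{i=1}^k\Big(\binom{|L(T_i)|}{2}+\Phi(T_i)\Big).$$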
This shows that the total cophenetic index is a recursive tree shape statistic in the sense of Matsen ().
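Read as a recursion on the subtrees hanging from the root, this property says that the index of a tree is the sum, over the children subtrees, of their own index plus the number of pairs of their leaves. The sketch below is one way to write that recursion in Python, under the assumption that trees are nested tuples with integer leaves; the name `size_and_phi` is hypothetical.

```python
from math import comb


def size_and_phi(tree):
    """Return (number of leaves, total cophenetic index) of `tree`,
    combining the values of the subtrees rooted at the children of the root."""
    if isinstance(tree, int):
        return 1, 0
    size, phi = 0, 0
    for child in tree:
        child_size, child_phi = size_and_phi(child)
        size += child_size
        # Each child subtree contributes its own index plus one unit of
        # depth for every pair of leaves it contains.
        phi += child_phi + comb(child_size, 2)
    return size, phi


print(size_and_phi(((1, (2, 3)), (4, 5))))  # (5, 5)
```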
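The next lemma shows that the total cophenetic index is local, in the sense that if two trees differ only on a rooted subtree, then the difference between their total cophenetic values is equal to that of these subtrees. Sackin’s and Colless’ indices also satisfy this property.
Let $T_0$ and $T_0'$ be two phylogenetic trees with $L(T_0)=L(T_0')$, let $T\in\mathcal{T}_n$ be such that its subtree rooted at some node $z$ is $T_0$, and let $T'$ be the tree obtained from $T$ by replacing $T_0$ by $T_0'$ as its subtree rooted at $z$. Then
$$\Phi(T')-\Phi(T)=\Phi(T_0')-\Phi(T_0).$$
Without any loss of generality, assume that $L(T_0)=L(T_0')=\{1,\ldots,m\}$, with $m\leq n$. Let $d=\delta_T(z)=\delta_{T'}(z)$. Then, for every $1\leq i<j\leq m$,
$$\varphi_T(i,j)=d+\varphi_{T_0}(i,j),\qquad \varphi_{T'}(i,j)=d+\varphi_{T_0'}(i,j).$$
On the other hand, $\varphi_{T'}(i,j)=\varphi_T(i,j)$ if $i>m$ or $j>m$. Therefore
$$\Phi(T')-\Phi(T)=\sum_{1\leq i<j\leq m}\big(\varphi_{T'}(i,j)-\varphi_T(i,j)\big)=\sum_{1\leq i<j\leq m}\big(\varphi_{T_0'}(i,j)-\varphi_{T_0}(i,j)\big)=\Phi(T_0')-\Phi(T_0).$$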
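The nodal distance $d_T(i,j)$ between a pair of leaves $i,j$ is the length of the unique undirected path connecting them; equivalently, it is the sum of the lengths of the paths from their lowest common ancestor to $i$ and $j$. The total area MirR10 () of a tree $T\in\mathcal{T}_n$ is defined as
$$D(T)=\sum_{1\leq i<j\leq n}d_T(i,j).$$
There is an easy relation between $\Phi(T)$, $S(T)$ and $D(T)$, which will be used several times in this paper.
For every $T\in\mathcal{T}_n$,
$$\Phi(T)=\frac{1}{2}\big((n-1)S(T)-D(T)\big).$$
It is straightforward to check that, for every $1\leq i<j\leq n$,
$$d_T(i,j)=\delta_T(i)+\delta_T(j)-2\varphi_T(i,j).$$
Summing over all pairs $i<j$, and using that every leaf belongs to exactly $n-1$ pairs, we obtain $D(T)=(n-1)S(T)-2\Phi(T)$.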
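The relation between the total cophenetic index, the Sackin index and the total area, namely that twice the first equals $(n-1)$ times the second minus the third, can be checked numerically from the definitions. In the sketch below each of the three quantities is computed from the root-to-leaf paths; the nested-tuple encoding and all function names are assumptions made for this check.

```python
from itertools import combinations


def leaf_paths(tree, path=()):
    """Map each leaf label to the tuple of child positions from the root."""
    if isinstance(tree, int):
        return {tree: path}
    paths = {}
    for position, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (position,)))
    return paths


def common_prefix_length(p, q):
    length = 0
    for a, b in zip(p, q):
        if a != b:
            break
        length += 1
    return length


def indices(tree):
    """Return (n, Sackin index, total area, total cophenetic index)."""
    paths = leaf_paths(tree)
    n = len(paths)
    sackin = sum(len(p) for p in paths.values())  # sum of leaf depths
    area, phi = 0, 0
    for i, j in combinations(sorted(paths), 2):
        lca_depth = common_prefix_length(paths[i], paths[j])
        phi += lca_depth
        # Path i -> LCA -> j: arcs below the LCA on each side.
        area += (len(paths[i]) - lca_depth) + (len(paths[j]) - lca_depth)
    return n, sackin, area, phi


n, sackin, area, phi = indices(((1, 2, 3), (4, (5, 6))))
print(2 * phi == (n - 1) * sackin - area)  # True
```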
4 Trees with maximum and minimum $\Phi$
In this section we determine which trees in $\mathcal{T}_n$ and $\mathcal{BT}_n$ have the largest and smallest total cophenetic indices. We begin by establishing two lemmas that will allow us to find the trees with the maximum $\Phi$ on $\mathcal{T}_n$.
Let , with , be an ordered forest on . Consider the trees described in Fig. 6. Then, .
With the notations of Fig. 6, notice that
For every non-binary phylogenetic tree $T\in\mathcal{T}_n$, there always exists a binary phylogenetic tree $T'\in\mathcal{BT}_n$ such that $\Phi(T')>\Phi(T)$.
Let be a non-binary phylogenetic tree. Then it contains an internal node whose rooted subtree looks like the tree in the previous lemma, for some . By Lemma 5 and the last lemma, if is the tree obtained from by replacing by as its subtree rooted at , then . ∎
Therefore, the maximum total cophenetic index is reached at a binary tree.
Let , let , let be any binary tree on , and let and be the phylogenetic trees in depicted in Fig. 7. Then, .
By Lemma 2, and recalling that , we have that
and hence, since ,
The trees in $\mathcal{T}_n$ with maximum total cophenetic index are exactly the rooted caterpillars, and this maximum is $\binom{n}{3}$.
By Corollary 8, any tree in with maximum total cophenetic index will be binary. Let now and assume that it is not a caterpillar. Therefore, it has an internal node of largest depth without any leaf child; in particular, all internal descendant nodes of have some leaf child. Thus, and up to a relabeling of its leaves, the subtree of rooted at has the form of the tree in Fig. 8, for some and some . But then, by Lemma 9 (taking as the caterpillar subtree rooted at the parent of the leaf ), the tree also depicted in Fig. 8 has a strictly larger total cophenetic index. Then, by Lemma 5, if we replace in the subtree rooted at by this tree , we obtain a new tree with . This implies that no tree other than a caterpillar can have the largest total cophenetic index.
It is obvious that the minimum total cophenetic index is 0, and it is attained only at the rooted star trees, depicted in Fig. 1.(b). Therefore, the range of $\Phi$ on $\mathcal{T}_n$ goes from 0 to $\binom{n}{3}$. This is one order of magnitude larger than the range of Sackin’s and Colless’ indices, whose maximum value, reached also at the rooted caterpillars, has order $O(n^2)$ Heard92 (); Rogers:96 (); Shao:90 ().
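For small numbers of leaves, the extremal values can be confirmed by brute force. The sketch below enumerates binary tree shapes (unlabeled, which suffices because the index depends only on the topology), evaluates the index through the sum of $\binom{k}{2}$ over non-root internal nodes with $k$ descendant leaves, and compares the maximum with $\binom{n}{3}$. The shape encoding (a leaf is `0`, an internal node a tuple) and the function names are assumptions of this example.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def binary_shapes(n):
    """Binary tree shapes with n leaves, smaller side first. A shape may be
    listed twice when both sides of a node have the same size."""
    if n == 1:
        return (0,)
    return tuple(
        (left, right)
        for k in range(1, n // 2 + 1)
        for left in binary_shapes(k)
        for right in binary_shapes(n - k)
    )


def phi(tree):
    """Total cophenetic index of a shape (leaves are integers)."""
    total = 0

    def visit(node, is_root):
        nonlocal total
        if isinstance(node, int):
            return 1
        size = sum(visit(child, False) for child in node)
        if not is_root:
            total += comb(size, 2)
        return size

    visit(tree, True)
    return total


def caterpillar(n):
    """Caterpillar shape with n leaves."""
    tree = 0
    for _ in range(n - 1):
        tree = (0, tree)
    return tree


for n in range(2, 10):
    best = max(phi(shape) for shape in binary_shapes(n))
    print(n, best, comb(n, 3), phi(caterpillar(n)))
```

Each printed row shows the same value three times, and the star shape `(0, 0, 0, 0, 0)` gives 0.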
Let us characterize now those binary phylogenetic trees with smallest total cophenetic index.
Let be an ordered binary forest on , let , for , and assume that , and . Let the phylogenetic tree depicted in Fig 9.(a), and let () be a binary phylogenetic tree having as a subtree rooted at some node. If is minimum in , then .
Assume that . We shall show that, in this case, a suitable interchange of cousins in produces a tree with smaller total cophenetic index, which in particular will imply that cannot be the minimum in .
From the proof of the last lemma we deduce that if, in the tree in Fig. 9.(a), and , and if we interchange and , then the resulting tree has always a different total cophenetic index.
Let be an ordered binary forest on , let , for , and assume that . Let the phylogenetic tree depicted in Fig 10.(a), and let be a binary phylogenetic tree having as a subtree rooted at some node. If is minimum in , then .
Assume that . We shall show that, again in this case, a suitable interchange of cousins in produces a tree with smaller total cophenetic index.
Assume that the tree in the statement has the subtree rooted at a node . Let by the tree obtained from by replacing by the subtree described in Fig. 10.(b). Then:
which shows that . ∎
The last two lemmas show that, unlike what happens with Sackin’s and Colless’ indices, any interchange of cousins that changes the balance of their grandparent always changes the total cophenetic index of a tree.
For every $T\in\mathcal{BT}_n$, $\Phi(T)$ is minimum on $\mathcal{BT}_n$ if, and only if, $T$ is maximally balanced.
Assume that is not maximally balanced, and let be a non-balanced internal node in with largest depth. Assume that and are its children, with .
If is a leaf, then, by Lemma 12, and therefore . Therefore, and are internal, and hence balanced. Let be the subtree of rooted at , represented in Fig. 9.(a), and let , for ; without any loss of generality, we shall assume that and and thus, since and are balanced, or and or . Then, implies that , and hence that .
Therefore, by Lemma 11, if is minimum in , it must happen that . Since it forbids the equality , it implies that and therefore . But then , against the assumption that is not balanced. ∎
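So, the only binary trees with minimum $\Phi$ are the maximally balanced ones. Let us compute now this minimum value of $\Phi$ on $\mathcal{BT}_n$.
For every $n\geq 1$, let $\Phi_{\min}(n)$ be the minimum of $\Phi$ on $\mathcal{BT}_n$. Then, $\Phi_{\min}(1)=0$ and, for every $n\geq 2$,
$$\Phi_{\min}(n)=\Phi_{\min}(\lceil n/2\rceil)+\Phi_{\min}(\lfloor n/2\rfloor)+\binom{\lceil n/2\rceil}{2}+\binom{\lfloor n/2\rfloor}{2}.$$
This recurrence for $\Phi_{\min}(n)$ is a direct consequence of Lemma 4 and the fact that the root of a maximally balanced tree in $\mathcal{BT}_n$ is balanced and the subtrees rooted at its children are maximally balanced. ∎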
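The recurrence can be evaluated directly. The sketch below assumes that it reads as follows: the minimum for $n$ leaves is obtained by splitting the leaves into parts of sizes $\lceil n/2\rceil$ and $\lfloor n/2\rfloor$, adding the minima of both parts and the numbers of pairs of leaves in each part, with value 0 for a single leaf. The function name `min_phi` is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def min_phi(n):
    """Minimum total cophenetic index over binary trees with n leaves,
    obtained from the balanced split of the root."""
    if n == 1:
        return 0
    big, small = (n + 1) // 2, n // 2
    return min_phi(big) + min_phi(small) + comb(big, 2) + comb(small, 2)


print([min_phi(n) for n in range(1, 9)])  # [0, 0, 1, 2, 5, 8, 12, 16]
```

Memoization keeps the number of distinct subproblems logarithmic in `n`, since only the sizes that arise from repeated halving are ever requested.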
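For every $n\geq 0$, let $a(n)$ be the exponent of the highest power of 2 that divides $n!$. Then, for every $n\geq 1$, the minimum $\Phi_{\min}(n)$ of $\Phi$ on $\mathcal{BT}_n$ is
$$\Phi_{\min}(n)=\sum_{k=0}^{n-1}a(k).$$
The sequence $(a(n))_{n\geq 0}$ is sequence A011371 in Sloane’s On-Line Encyclopedia of Integer Sequences Sloane (), where we learn that it satisfies the recurrence
$$a(n)=a(\lfloor n/2\rfloor)+\lfloor n/2\rfloor.$$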
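As a numerical sanity check, the following Python sketch compares three descriptions of the exponent of 2 in $n!$ (the halving recurrence, a direct count on $n!$, and $n$ minus the number of ones in the binary expansion of $n$) and then compares the partial sums of this sequence with the minimum of the index obtained from the balanced-split recurrence. The indexing of the partial sums (summing the terms with indices $0,\ldots,n-1$) is our reading of the statement, and the function names are hypothetical.

```python
from functools import lru_cache
from math import comb, factorial


def a(n):
    """Exponent of the highest power of 2 dividing n!, via the recurrence
    a(n) = a(n // 2) + n // 2 with a(0) = 0."""
    return 0 if n == 0 else a(n // 2) + n // 2


def a_direct(n):
    """The same exponent, obtained by dividing n! by 2 repeatedly."""
    value, exponent = factorial(n), 0
    while value % 2 == 0:
        value //= 2
        exponent += 1
    return exponent


@lru_cache(maxsize=None)
def min_phi(n):
    """Minimum total cophenetic index for n leaves (balanced split)."""
    if n == 1:
        return 0
    big, small = (n + 1) // 2, n // 2
    return min_phi(big) + min_phi(small) + comb(big, 2) + comb(small, 2)


for n in range(60):
    assert a(n) == a_direct(n) == n - bin(n).count("1")

partial_sum = 0
for n in range(1, 200):
    partial_sum += a(n - 1)        # a(0) + ... + a(n-1)
    assert partial_sum == min_phi(n)
print("checked up to n = 199")
```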
Let now denote the sequence of partial sums of , which is sequence A174605 in Sloane’s Encyclopedia. Then, the sequence starts with and it satisfies the recurrence
We want to prove that , for every . Since , it remains to check the equality
We prove this equality with the help of Lemma 4 and by distinguishing four cases, depending on the residue of mod 4.
If , then