Roadblocked monotonic paths and the enumeration of coalescent histories for non-matching caterpillar gene trees and species trees

Zoe M. Himwich
Noah A. Rosenberg
Department of Mathematics, Stanford University, Stanford, CA 94305 USA
Department of Biology, Stanford University, Stanford, CA 94305 USA. Email: noahr@stanford.edu.
September 8, 2019
Key words: Catalan numbers, coalescent histories, Dyck paths, monotonic paths, nearest-neighbor-interchange, subtree-prune-and-regraft
Running title: Coalescent histories for non-matching caterpillars
Mathematics subject classification: 05A15, 05A19, 05B35, 92B10, 92D15

Abstract. Given a gene tree topology and a species tree topology, a coalescent history represents a possible mapping of the list of gene tree coalescences to associated branches of a species tree on which those coalescences take place. Enumerative properties of coalescent histories have been of interest in the analysis of relationships between gene trees and species trees. The simplest enumerative result identifies a bijection between coalescent histories for a matching caterpillar gene tree and species tree with monotonic paths that do not cross the diagonal of a square lattice, establishing that the associated number of coalescent histories for $n$-taxon matching caterpillar trees ($n \geq 2$) is the Catalan number $C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}$. Here, we show that a similar bijection applies for non-matching caterpillars, connecting coalescent histories for a non-matching caterpillar gene tree and species tree to a class of roadblocked monotonic paths. The result provides a simplified algorithm for enumerating coalescent histories in the non-matching caterpillar case. It enables a rapid proof of a known result that given a caterpillar species tree, no non-matching caterpillar gene tree has a number of coalescent histories exceeding that of the matching gene tree. Additional results on coalescent histories can be obtained by a bijection between permissible roadblocked monotonic paths and Dyck paths. We study the number of coalescent histories for non-matching caterpillar gene trees that differ from the species tree by nearest-neighbor-interchange and subtree-prune-and-regraft moves, characterizing the non-matching caterpillar with the largest number of coalescent histories. We discuss the implications of the results for the study of the combinatorics of gene trees and species trees.

1 Introduction

In the mathematical study of evolutionary trees, genetic lineages can be treated as evolving along the branches of a species phylogeny, a tree that represents the evolutionary relationships among a set of species (Pamilo & Nei, 1988; Maddison, 1997; Degnan & Rosenberg, 2009). A tree describing a set of genetic lineages that descend from a common ancestor is a gene tree, and a tree relating the species themselves is a species tree. Looking backward in time, in a gene tree of genetic lineages sampled from representative individuals of a given set of species, a pair of genetic lineages can coalesce, or find a common ancestor, only after the common ancestor of their species is reached. More generally, a set of two or more genetic lineages has a most recent common ancestor only after the most recent common ancestor of their associated species is reached.

The study of the relationship between gene trees and species trees, usually treated as binary, rooted, and leaf-labeled, has generated a number of novel combinatorial structures (Maddison, 1997; Degnan & Salter, 2005; Rosenberg & Tao, 2008; Than & Nakhleh, 2009; Degnan et al., 2012; Stadler & Degnan, 2012; Wu, 2012, 2016; Degnan & Rhodes, 2015). Among these are coalescent histories, structures that describe the possible locations on a species tree where the coalescences of a gene tree can take place (Degnan & Salter, 2005; Rosenberg, 2007). More precisely, for a (binary, rooted, leaf-labeled) gene tree topology $G$ and a (binary, rooted, leaf-labeled) species tree topology $S$ on the same set of taxa, a coalescent history associates with each coalescence in $G$ an edge of $S$, such that two properties are satisfied: (i) the species tree edge associated with a gene tree coalescence $v$ is ancestral to all lineages that descend from $v$; (ii) for any pair of gene tree coalescences $(v_1, v_2)$ for which $v_2$ lies on a path from $v_1$ to a leaf of the gene tree, the edge associated with $v_2$ lies on a path from the edge associated with $v_1$ to a leaf of the species tree. From a biological perspective, this pair of constraints encodes the rules that (i) gene lineages can coalesce only in a branch of the species tree in which it is possible for their ancestors to coexist, and that (ii) ancestors can coalesce no more recently than their descendants.

Rosenberg (2007) provided a recursion that enumerates coalescent histories for arbitrary gene tree and species tree topologies. For gene tree topology $G$ and species tree topology $S$, so that the taxon set of $S$ is a superset of that of $G$ but not necessarily the same set, let $T(G,S)$ denote the minimal displayed subtree of $S$ that contains all the taxa of $G$, that is, the subtree of $S$ rooted at the node that corresponds to the most recent common ancestor of the taxa with the same labels as the taxa in $G$. Let $d(G,S)$ denote the number of edges that separate the root of $S$ from the root of $T(G,S)$. Let $G_L$ and $G_R$ denote the left and right subtrees of $G$. We define an integer parameter $k \geq 1$, and write a recursion for a function $B(G,S,k)$:

$$B(G,S,k) = \sum_{\ell=1}^{k} B\big(G_L, T(G_L,S), \ell + d(G_L,S)\big)\, B\big(G_R, T(G_R,S), \ell + d(G_R,S)\big). \tag{1}$$

The base case is obtained by setting $B(G,S,k)$ to 1 for all $k$ in the case that $G$ has only one taxon. With these definitions, the number of coalescent histories for gene tree topology $G$ and species tree topology $S$ is $B(G,S,1)$.
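As a hedged illustration of how a recursion of this kind can be evaluated, the following minimal Python sketch counts coalescent histories for trees written as nested tuples. The function names (`leaf_set`, `minimal_subtree`, `count_histories`) and the nested-tuple encoding are assumptions of this sketch rather than notation from Rosenberg (2007). The sketch also assumes that, at every call, the species tree passed in is already the minimal subtree containing the gene tree taxa.

```python
from functools import lru_cache


def leaf_set(tree):
    """Return the frozenset of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def minimal_subtree(species, taxa):
    """Return (subtree, depth): the smallest subtree of `species` containing
    all labels in `taxa`, and the number of edges separating its root from
    the root of `species`."""
    node, depth = species, 0
    while not isinstance(node, str):
        left, right = node
        if taxa <= leaf_set(left):
            node, depth = left, depth + 1
        elif taxa <= leaf_set(right):
            node, depth = right, depth + 1
        else:
            break  # taxa are split across both children: node is the MRCA
    return node, depth


@lru_cache(maxsize=None)
def count_histories(gene, species, k=1):
    """Count (partial) coalescent histories when the edge above the root of
    `species` is treated as k stacked edges. Assumes `species` is the
    minimal subtree containing the taxa of `gene`."""
    if isinstance(gene, str):
        return 1  # base case: a single-taxon gene tree
    g_left, g_right = gene
    t_left, d_left = minimal_subtree(species, leaf_set(g_left))
    t_right, d_right = minimal_subtree(species, leaf_set(g_right))
    # The root of `gene` coalesces on the ell-th of the k root edges.
    return sum(
        count_histories(g_left, t_left, ell + d_left)
        * count_histories(g_right, t_right, ell + d_right)
        for ell in range(1, k + 1)
    )


caterpillar = ((("A", "B"), "C"), "D")
print(count_histories(caterpillar, caterpillar))              # 5
print(count_histories(((("A", "B"), "D"), "C"), caterpillar))  # 3
```

For the matching 4-leaf caterpillar the sketch returns 5, the Catalan number $C_3$, and for a gene tree that exchanges the two leaves nearest the root it returns 3.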

Caterpillar species trees, in which an internal node exists that is descended from all other internal nodes, represent a special case in which enumeration of the coalescent histories is simpler than in the general case of arbitrary species trees. Thus, although exact and asymptotic results are known for certain additional shapes (Rosenberg, 2007, 2019; Disanto & Rosenberg, 2015), enumerative properties have been explored most extensively for caterpillar species trees and shapes that closely resemble them (Degnan, 2005; Degnan & Salter, 2005; Rosenberg, 2007, 2013; Rosenberg & Degnan, 2010; Disanto & Rosenberg, 2016). First, for a matching caterpillar gene tree and species tree—a caterpillar gene tree and species tree with the same labeled topology—Degnan (2005) found a bijection between coalescent histories and monotonic paths on a square lattice that do not cross above the diagonal, a quantity well-known to be described by the Catalan number sequence (Stanley, 2015, item 24). Eq. 1 recovers the Catalan numbers in this case (Rosenberg, 2007, Corollary 3.5), and can be used to show that the number of coalescent histories for matching gene trees and species trees in small “caterpillar-like families” is asymptotic to a constant multiple of the Catalan numbers (Rosenberg, 2007, 2013). This asymptotic behavior has been demonstrated for caterpillar-like families of arbitrary size using techniques of analytic combinatorics (Disanto & Rosenberg, 2016).

Enumerative results have been comparatively little studied, however, in the case that labeled gene trees and species trees disagree in topology. Than et al. (2007) performed a numerical investigation, finding that the number of coalescent histories for non-matching gene tree and species tree topologies generally decreases with increasing subtree-prune-and-regraft (SPR) distance between the trees. Rosenberg & Degnan (2010) demonstrated that for the caterpillar species tree topology with taxa, there exists a non-matching gene tree topology with more coalescent histories than the matching caterpillar gene tree topology. Nevertheless, for caterpillar species tree topologies, Degnan & Rhodes (2015) showed that no non-matching caterpillar gene tree topology can exceed the matching caterpillar gene tree topology in number of coalescent histories; indeed, the constructive example of Rosenberg & Degnan (2010) of a non-matching gene tree topology with more coalescent histories than the matching caterpillar was not itself a caterpillar.

Here, we extend the monotonic path approach of Degnan (2005) to non-matching caterpillar gene tree and species tree topologies. We show that coalescent histories for non-matching caterpillar gene tree and species tree topologies can be bijectively associated with a set of roadblocked monotonic paths that do not cross above the diagonal of a square lattice. The approach immediately recovers the result of Degnan & Rhodes (2015) that non-matching caterpillar gene tree topologies do not exceed the matching caterpillar gene tree topology in number of coalescent histories. It enables calculations of the number of coalescent histories for caterpillar gene tree topologies that differ from the species tree by common transformations—nearest-neighbor-interchange and subtree-prune-and-regraft. We characterize non-matching caterpillar gene trees with the largest numbers of coalescent histories, finding that the number of coalescent histories in such cases is asymptotically equivalent to that in the matching case.

2 Preliminaries

2.1 Caterpillar trees

We consider binary, rooted, leaf-labeled trees with leaf labels bijectively drawn from a label set $X$ containing $n$ distinct labels. For convenience, a “tree” refers to a binary, rooted, leaf-labeled tree. Trees contain two types of nodes, leaf nodes and non-leaf, or internal, nodes. Because trees are rooted, we say that a node $v_1$ of a tree is descended from another node $v_2$ if the shortest path from $v_1$ to the root node contains $v_2$. We also say that $v_2$ is ancestral to $v_1$. Ancestor–descendant relationships also apply to pairs of edges and to pairs containing a vertex and an edge. A node or edge is trivially descended from itself, and it is also trivially ancestral to itself. The root node is an internal node.

We focus on caterpillar trees, trees in which there exists an internal node descended from all other internal nodes (Figure 1A). A caterpillar tree has exactly one cherry node, a node with exactly two descendant leaves. Among leaves, the longest path length to the root of a caterpillar tree with $n$ leaves is $n-1$.

The number of distinct caterpillar trees possible for a label set with $n$ distinct labels is $n!/2$: the leaf separated from the root by only one edge has $n$ possible labels, the leaf two edges from the root then has $n-1$ possible labels, and so on. In this assignment of labels, the leaves descended from the cherry node are exchangeable. Hence, only one labeling is possible for these leaves, giving a total of $n!/2$ labelings. These labelings represent the $n!/2$ caterpillar labeled topologies for label set $X$.

For convenience, we organize the labels in an $n$-leaf caterpillar tree canonically in a vector of length $n$. For $3 \leq i \leq n$, entry $i$ in the vector is the label of the leaf separated from the root by $n-i+1$ edges. Entries 1 and 2 are the labels for the leaves in the cherry. Two vectors of labels $(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$ are considered to be equivalent if and only if one of the following two conditions holds: (1) $a_i = b_i$ for all $i$, or (2) $a_1 = b_2$, $a_2 = b_1$, and $a_i = b_i$ for each $i \geq 3$.
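The label-vector convention and the count of caterpillar labeled topologies can be checked with a short sketch. The helper names `canonical` and `caterpillar_topologies` are hypothetical, and choosing the sorted order of the two cherry labels as the representative of each equivalence class is an assumption made only for this illustration.

```python
from itertools import permutations
from math import factorial


def canonical(vector):
    """Pick one representative per equivalence class by sorting the two
    cherry labels (entries 1 and 2); all other entries are left in place."""
    first, second, *rest = vector
    return (min(first, second), max(first, second), *rest)


def caterpillar_topologies(labels):
    """List the distinct caterpillar labeled topologies for a label set,
    each given by its representative label vector."""
    return sorted({canonical(p) for p in permutations(labels)})


trees = caterpillar_topologies("ABCD")
print(len(trees), factorial(4) // 2)  # both are 12
```

Because exactly two label vectors map to each representative, the number of representatives is half the number of permutations of the labels.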

Two leaves in a caterpillar tree are considered to be adjacent if they are separated by exactly two or three edges (Figure 1A). Equivalently, leaves are adjacent if and only if their indices in the sequence of labels for the tree differ by 1, or if one is entry 1 and the other is entry 3.

A component of a caterpillar tree is a subset of adjacent leaves, excluding from the definition the subset consisting solely of the pair of leaves in the cherry. Formally, a subset of labels is a component of if and for any pair of labels , there exists a sequence of distinct elements in which each consecutive pair of elements labels adjacent leaves in .

It is convenient to number the internal nodes of an $n$-leaf caterpillar tree from 1 to $n-1$ in increasing order from the cherry node toward the root. These nodes are ordered by ancestor-descendant relationships, so that the node of smallest value in any nonempty subset of internal nodes descends from all other elements of the subset. We call this node the minimal node of the subset. It is also useful to consider that a tree possesses an internal edge ancestral to its root node; thus, identifying each internal node with its immediate ancestral edge, a nonempty subset of internal edges has a minimal edge.

2.2 Relationships between pairs of caterpillar trees

The labelings of distinct caterpillar trees with the same label set differ by a permutation of the vector of leaf labels. We will have occasion to examine pairs of caterpillar trees whose labelings differ by specific types of permutation: nearest-neighbor-interchange and subtree-prune-and-regraft (Steel, 2016).

Consider two distinct caterpillar trees $T_1$ and $T_2$, bijectively labeled from the same set of $n$ distinct labels.

Definition 1.

Caterpillar trees $T_1$ and $T_2$ differ by a nearest-neighbor-interchange, or NNI move, if $T_2$ can be obtained from $T_1$ by exchanging the labels of a pair of adjacent leaves in $T_1$ that are separated by exactly three edges (Figure 1B).

Note that our definition of adjacent leaves includes the leaves corresponding to labels $a_1$ and $a_2$ in the canonical ordering, that is, the two leaves of the cherry. This pair is the only pair of adjacent leaves whose exchange is not an NNI move.
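A minimal sketch of Definition 1 in terms of label vectors follows. It assumes, as the adjacency rule of Section 2.1 indicates, that the pairs of leaves separated by exactly three edges are entries 1 and 3 together with entries $i$ and $i+1$ for $i \geq 2$. The names `canonical` and `nni_neighbors` are hypothetical, and vectors are assumed to have at least three entries.

```python
def canonical(vector):
    """Representative of a label vector: sort the two cherry labels."""
    first, second, *rest = vector
    return (min(first, second), max(first, second), *rest)


def nni_neighbors(vector):
    """Caterpillar trees obtained from `vector` by exchanging one pair of
    adjacent leaves separated by exactly three edges."""
    n = len(vector)
    # 0-based index pairs: entries (1, 3), then (i, i + 1) for i >= 2.
    pairs = [(0, 2)] + [(i, i + 1) for i in range(1, n - 1)]
    neighbors = set()
    for a, b in pairs:
        swapped = list(vector)
        swapped[a], swapped[b] = swapped[b], swapped[a]
        neighbors.add(canonical(swapped))
    return sorted(neighbors)


print(nni_neighbors(("A", "B", "C", "D")))
# [('A', 'B', 'D', 'C'), ('A', 'C', 'B', 'D'), ('B', 'C', 'A', 'D')]
```

The cherry pair (entries 1 and 2) is deliberately omitted from `pairs`, since exchanging those two labels returns an equivalent label vector.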

Definition 2.

Caterpillar trees $T_1$ and $T_2$ differ by a subtree-prune-and-regraft, or SPR move, if there exists an ordered pair of edges $(e_1, e_2)$ in $T_1$ with the property that if edge $e_1$ is cut, edge $e_2$ is subdivided in two by placement of a new vertex $v$ of degree two, and the subtree descended from $e_1$ is connected to vertex $v$ such that $v$ now has degree three and is ancestral to the subtree, then tree $T_2$ is obtained (Figure 1C, 1D).

In an SPR move, note that it is possible for the edge $e_2$ to be the edge ancestral to the root of $T_1$.

Definition 3.

Caterpillar trees $T_1$ and $T_2$ differ by a cyclic permutation if there exists a component $Y_1$ of $T_1$ and a component $Y_2$ of $T_2$ such that the labels of $Y_2$ represent a cyclic permutation of the labels of $Y_1$.

By definition of a component, this definition excludes permutations that simultaneously involve leaves separated from the root by the fewest edges and leaves separated from the root by the most edges, unless all leaves are involved.

Definition 4.

Caterpillar trees $T_1$ and $T_2$ differ by an incrementation if they differ by a cyclic permutation and at most one label has positions in the canonical label vectors of $T_1$ and $T_2$ that differ by more than one.

$T_2$ can differ from $T_1$ by a forward or a reverse cycle or incrementation (Figure 1C, 1D). If $T_2$ differs from $T_1$ by a forward incrementation or cycle, then $T_1$ differs from $T_2$ by a reverse incrementation or cycle, and vice versa. Note that each cyclic permutation that exchanges two leaves is concurrently a forward incrementation, a reverse incrementation, and an NNI move.

We can immediately observe that a pair of caterpillar trees $T_1$ and $T_2$ differ by an SPR move if and only if they also differ by an incrementation of the leaf labels. SPR moves that convert caterpillars to caterpillars necessarily prune and regraft a single leaf. If a leaf is pruned from $T_1$ and regrafted to yield $T_2$, then depending on which leaf is pruned and where it is regrafted, $T_2$ can differ from $T_1$ by either a forward or a reverse incrementation. Therefore, enumeration of coalescent histories in the case that caterpillar trees differ by an SPR move is performed by enumeration in the associated case of a forward or a reverse incrementation.

Figure 1: Transformations of caterpillar trees. (A) A caterpillar tree . The vector of labels for , in canonical order, is . The adjacent pairs of leaves are , , , , , , , , , and . (B) A tree that differs from by nearest-neighbor-interchange. Leaves and are exchanged. (C) A tree obtained from by forward incrementation of leaves , , and . (D) A tree obtained from by reverse incrementation of leaves , , and . The tree in (C) can also be viewed as the result of a subtree-prune-and-regraft operation, with the branch leading to leaf pruned and regrafted; the tree in (D) can be viewed as the result of an SPR operation involving the leaf leading to . In each panel, the red line indicates which leaves are permuted.

2.3 Coalescent histories

We study coalescent histories for a caterpillar gene tree $G$ and a caterpillar species tree $S$, treated as binary, rooted, leaf-labeled caterpillar trees, each with $n$ leaves labeled by labels bijectively drawn from the same set $X$. This setting corresponds to considering $G$ to represent the tree formed by sampling a single gene lineage in each of the $n$ species present in species tree $S$. Gene tree $G$ and species tree $S$ are said to be matching if $G$ and $S$ have the same labeled topology, and they are said to be non-matching otherwise.

Formally, a coalescent history can be defined as follows (Rosenberg & Degnan, 2010).

Definition 5.

Consider an ordered pair $(G,S)$ of binary, rooted, leaf-labeled trees whose labels are bijectively drawn from the same label set $X$. A coalescent history is a function $h$ from the set of internal nodes of $G$ to the set of internal edges of $S$ that satisfies two conditions:

  1. For each internal node $v$ of $G$, all leaf labels for leaves descended from $v$ in $G$ label leaves descended from edge $h(v)$ in $S$.

  2. For all pairs of internal nodes $v_1, v_2$ in $G$, if node $v_2$ is descended from node $v_1$ in $G$, then edge $h(v_2)$ is descended from edge $h(v_1)$ in $S$.

An illustration appears in Figure 2. Recall that we consider that $S$ contains an edge ancestral to its root; this edge can be the image of an internal node of $G$ under a coalescent history mapping. Note that because an edge is trivially descended from itself, in part 2 of Definition 5, it is permissible for $h(v_1)$ to equal $h(v_2)$.
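For small caterpillar trees, the two conditions of Definition 5 can be checked exhaustively. The sketch below is an illustration under stated assumptions, not a method from the paper: both trees are given by their label vectors, internal node $i$ (numbered from the cherry) is taken to have the first $i+1$ vector entries as descendant leaves, and edge $j$ is the edge immediately above internal node $j$. The function name `brute_force_histories` is hypothetical.

```python
from itertools import product


def brute_force_histories(gene, species):
    """Enumerate coalescent histories for caterpillar label vectors by
    testing every map from gene tree nodes 1..n-1 to species edges 1..n-1."""
    n = len(gene)
    nodes = range(1, n)
    histories = []
    for images in product(nodes, repeat=n - 1):
        # images[i - 1] is the species tree edge assigned to gene node i.
        # Condition 1: leaves below gene node i must lie below its image.
        condition_1 = all(
            set(gene[: i + 1]) <= set(species[: images[i - 1] + 1])
            for i in nodes
        )
        # Condition 2: images must not move toward the cherry as nodes
        # move toward the root (equality is allowed).
        condition_2 = all(images[i] <= images[i + 1] for i in range(n - 2))
        if condition_1 and condition_2:
            histories.append(images)
    return histories


print(len(brute_force_histories("ABCD", "ABCD")))  # 5
print(brute_force_histories("ABDC", "ABCD"))
# [(1, 3, 3), (2, 3, 3), (3, 3, 3)]
```

The search examines $(n-1)^{n-1}$ candidate maps, so it is useful only as a check on faster methods for small $n$.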

We will have occasion to use the concept of a partial coalescent history.

Definition 6.

Consider an ordered pair $(G,S)$ of binary, rooted, leaf-labeled trees whose labels are drawn from the same label set $X$, not necessarily bijectively. A partial coalescent history is a function $h$ from the set of internal nodes of $G$ to the set of internal edges of $S$, satisfying the two conditions in Definition 5.

We say that if $G$ is empty, then $(G,S)$ has one partial coalescent history. For nonempty $G$, because the labels in $G$ are not necessarily the same as those of $S$, it is possible that for some nodes in $G$, $S$ has no edge that can serve as the image of a node in $G$. In this case, the pair $(G,S)$ has no partial coalescent histories. When connecting the purely graphical definition of coalescent histories in Definition 5 to the biological context in which they arise, we say that an internal node $v$ of $G$ is a gene tree coalescence; the coalescence is said to occur on edge $h(v)$ of $S$.

Figure 2: Coalescent histories. (A) A gene tree and species tree with the same label set. The gene tree appears in blue, and the species tree appears in black. (B) The coalescent history depicted in (A) for $(G,S)$. The arrows connect internal nodes of $G$ to their associated edges in $S$.

2.4 Catalan numbers and monotonic paths

We recall a number of results concerning Catalan numbers and their use in counting paths along the edges of square lattices. The Catalan sequence satisfies

$$C_n = \frac{1}{n+1}\binom{2n}{n},$$

beginning from $n=0$, with values 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, …

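Catalan numbers can be placed in the combinatorial construction known as Catalan’s triangle (Reuveni, 2014), of which we display the first several columns:

$$\begin{array}{rrrrrr}
 & & & & & 42 \\
 & & & & 14 & 42 \\
 & & & 5 & 14 & 28 \\
 & & 2 & 5 & 9 & 14 \\
 & 1 & 2 & 3 & 4 & 5 \\
1 & 1 & 1 & 1 & 1 & 1
\end{array}$$

In this triangle, the initial 1 in the lower left corner is denoted $C(0,0)$. Other entries are denoted $C(n,k)$, with $n$ as the horizontal distance from the lower left corner and $k$ as the vertical distance from this entry.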

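For $n \geq 1$ with $0 \leq k \leq n$, the entries satisfy the recursion relation

$$C(n,k) = C(n-1,k) + C(n,k-1), \tag{2}$$

with initial condition $C(0,0)=1$, where $C(n,k)$ is taken to be 0 if $k<0$ or $k>n$. The general formula for $C(n,k)$ is

$$C(n,k) = \frac{(n+k)!\,(n-k+1)}{k!\,(n+1)!}. \tag{3}$$

In particular, for $k=n$, we have $C(n,n) = C_n$.

The entry $C(n,k)$ counts the number of monotonic paths on the lattice in the first quadrant of the plane (including the coordinate axes) that do not cross the line $y=x$, where a monotonic path is a path from $(0,0)$ to $(n,k)$ that proceeds by steps upward and to the right on the lattice.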
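The entries of Catalan's triangle can be tabulated directly as a check. The sketch below assumes the standard forms of the two relations: each entry is the sum of the entry to its left and the entry below it, with entries outside the triangle equal to 0, and the closed form is $(n+k)!\,(n-k+1)/(k!\,(n+1)!)$. The function names are hypothetical.

```python
from math import comb, factorial


def catalan(n):
    """Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def triangle_entry(n, k):
    """Closed form for the entry at horizontal position n, vertical position k."""
    if k < 0 or k > n:
        return 0
    return factorial(n + k) * (n - k + 1) // (factorial(k) * factorial(n + 1))


def triangle_by_recursion(size):
    """Fill the triangle for 0 <= k <= n <= size using the recursion."""
    table = {(0, 0): 1}
    for n in range(size + 1):
        for k in range(n + 1):
            if (n, k) != (0, 0):
                # Left neighbor (n - 1, k) plus lower neighbor (n, k - 1);
                # positions outside the triangle contribute 0.
                table[(n, k)] = table.get((n - 1, k), 0) + table.get((n, k - 1), 0)
    return table


table = triangle_by_recursion(9)
print([table[(n, n)] for n in range(10)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
```

The diagonal entries reproduce the Catalan sequence listed at the start of this subsection.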

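We will also make use of extensions of Catalan’s triangle known as Catalan’s trapezoids of order $m \geq 1$, which contain an initial column of $m$ entries equal to 1, rather than a single entry (Reuveni, 2014). Entries $C_m(n,k)$ in Catalan’s trapezoids satisfy a version of eq. 2:

$$C_m(n,k) = C_m(n-1,k) + C_m(n,k-1). \tag{4}$$

We have $C_1(n,k) = C(n,k)$. The first few columns of Catalan’s trapezoid of order 3 appear below:

$$\begin{array}{rrrr}
 & & & 28 \\
 & & 9 & 28 \\
 & 3 & 9 & 19 \\
1 & 3 & 6 & 10 \\
1 & 2 & 3 & 4 \\
1 & 1 & 1 & 1
\end{array}$$

An entry in the trapezoid can be calculated in closed form as

$$C_m(n,k) = \begin{cases} \binom{n+k}{k}, & 0 \leq k < m, \\ \binom{n+k}{k} - \binom{n+k}{k-m}, & m \leq k \leq n+m-1, \\ 0, & k > n+m-1. \end{cases} \tag{5}$$

The entry $C_m(n,k)$ in Catalan’s trapezoid of order $m$ counts the number of monotonic paths from $(0,0)$ to $(n,k)$ on the lattice in the first quadrant of the plane (including the coordinate axes) that do not cross the line $y = x + m - 1$.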
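A similar check applies to Catalan's trapezoids. The sketch below assumes the closed form that follows from the reflection principle, $\binom{n+k}{k} - \binom{n+k}{k-m}$ for $m \leq k \leq n+m-1$ and $\binom{n+k}{k}$ for $0 \leq k < m$, and compares it with the additive recursion started from an initial column of $m$ ones. The function names are hypothetical.

```python
from math import comb


def trapezoid_entry(m, n, k):
    """Closed form for the entry (n, k) of the trapezoid of order m."""
    if k < 0 or k > n + m - 1:
        return 0
    if k < m:
        return comb(n + k, k)
    return comb(n + k, k) - comb(n + k, k - m)


def trapezoid_by_recursion(m, width):
    """Fill columns 0..width: column 0 holds m ones, and every other entry
    is the sum of its left and lower neighbors (0 outside the trapezoid)."""
    table = {}
    for n in range(width + 1):
        for k in range(n + m):
            if n == 0:
                table[(n, k)] = 1
            else:
                table[(n, k)] = table.get((n - 1, k), 0) + table.get((n, k - 1), 0)
    return table


order_3 = trapezoid_by_recursion(3, 3)
print([order_3[(3, k)] for k in range(6)])  # [1, 4, 10, 19, 28, 28]
```

With $m = 1$ the same functions reproduce Catalan's triangle.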

3 Bijection of coalescent histories and roadblocked monotonic paths

3.1 Matching gene trees and species trees

Degnan (2005) proved that the number of coalescent histories for a matching caterpillar gene tree and species tree with $n$ labels is the Catalan number $C_{n-1}$, demonstrating a bijection between coalescent histories and monotonic paths that do not cross the diagonal of a square lattice. We will discuss this well-known correspondence, as the bijective approach is useful for the non-matching case.

Lemma 7.

The coalescent histories for a matching $n$-leaf caterpillar gene tree and species tree can be bijectively associated with monotonic paths that do not cross the diagonal of an $(n-1) \times (n-1)$ lattice.

Proof.

Label the internal nodes of $G$ sequentially from 1 to $n-1$, using 1 for the internal node nearest the cherry and $n-1$ for the root. For each internal node of $G$, identify the label for the node with the edge immediately ancestral to it. Similarly, sequentially label the internal nodes of $S$ from 1 to $n-1$, proceeding from the cherry toward the root and identifying the label for each node with its immediate ancestral edge.

For each with , denote by the subtree of the gene tree rooted at node , and for each with , denote by the subtree of the species tree rooted at node . We also define and to be empty subtrees of the gene tree and species tree, respectively. Denote by the set of partial coalescent histories for . For matching and , for each with , . Hence, by definition of a coalescent history, for each internal node of , the image in a coalescent history of must be ancestral in to all leaves of labeled by labels in . The edges of with this property are edges . For , we have , and for all with .

Each partial coalescent history in is formed in one of two ways. Gene tree node is mapped either to species tree internal edge , or to one of the edges . The former case produces partial coalescent histories, each obtained by appending the coalescence of gene tree node to a partial coalescent history for . The latter case produces partial coalescent histories; because no gene tree coalescences in such a partial coalescent history occur on species tree edge , each such partial coalescent history for is a partial coalescent history for . Hence, we have

(6)

with the constraint for and . For and , we have by the convention that has one partial coalescent history for empty . We set for all that do not satisfy .

Recursion 6 and its base cases, with in the role of and in the role of , is precisely eq. 2. Setting , eq. 2 gives the recursion for enumerating the set of monotonic paths that do not cross the diagonal of an square lattice, a set with elements. In the bijection between coalescent histories and monotonic paths, each step to the right in the lattice, incrementing , corresponds to incorporating an additional edge of the species tree as a possible location for gene tree coalescences, and each step up, incrementing , corresponds to occurrence of a gene tree coalescence. ∎

We can read a coalescent history of $(G,S)$ from its associated monotonic path (Figure 3). For example, in a 10-leaf tree, the monotonic path that proceeds through (0,0), (3,0), (3,2), (6,2), (6,3), (7,3), (7,7), (9,7), and (9,9) has no gene tree coalescences on edge 1 or edge 2 of the species tree. Gene tree coalescences 1 and 2 occur on edge 3, the edge above species tree node 3. No gene tree coalescences occur on edges 4 or 5. Gene tree coalescence 3 occurs on edge 6. Four gene tree coalescences, numbered 4 through 7, occur on edge 7 above species tree node 7. No gene tree coalescences occur on edge 8. The two remaining gene tree coalescences, 8 and 9, occur on edge 9 above the species tree root.
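Reading a coalescent history from a path can be sketched as follows, assuming the convention of the example above: a vertical step that ends at height $i$ in column $j$ places gene tree coalescence $i$ on species tree edge $j$. The function name `history_from_path` and the representation of a path by its corner points are assumptions of this sketch.

```python
def history_from_path(corners):
    """Map each gene tree internal node to a species tree edge, given the
    corner points of a monotonic path on the lattice."""
    history = {}
    for (x0, y0), (x1, y1) in zip(corners, corners[1:]):
        if x0 == x1:
            # Vertical segment in column x0: nodes y0 + 1, ..., y1
            # coalesce on species tree edge x0.
            for node in range(y0 + 1, y1 + 1):
                history[node] = x0
        elif y0 != y1:
            raise ValueError("segments must be horizontal or vertical")
    return history


path = [(0, 0), (3, 0), (3, 2), (6, 2), (6, 3), (7, 3), (7, 7), (9, 7), (9, 9)]
print(history_from_path(path))
# {1: 3, 2: 3, 3: 6, 4: 7, 5: 7, 6: 7, 7: 7, 8: 9, 9: 9}
```

Each node's assigned edge is at least as large as the node's own index, which reflects the requirement that the path not cross above the diagonal.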

The bijection between coalescent histories and monotonic paths generates a set of values of that considers each and with and . These values can be depicted in a lattice so that the value is associated with the coordinate of lattice point (Figure 3). Indeed, they correspond exactly to the entries of Catalan’s triangle (eq. 3), with in the role of and in the role of .

Figure 3: The correspondence between monotonic paths that do not cross above the diagonal of an square lattice and coalescent histories for a matching caterpillar gene tree and species tree with leaves. The lower left corner represents the origin . Monotonic paths from to represent the partial coalescent histories for . Values are taken from eq. 2, using in place of . Species tree internal edges are read from left to right: labels the species tree internal edge from which and descend, and each successive label indicates the internal edge ancestral both to the leaf corresponding to the associated label and to the caterpillar subtree containing all prior labels. Gene tree internal nodes are read in the same manner from bottom to top. The monotonic path shown in red indicates the locations on the species tree of the gene tree coalescences of a specific coalescent history.

The construction takes advantage of the caterpillar shape of both gene tree and species tree. Because internal nodes of a caterpillar tree can be placed in order with each entry descended from the next until the root is reached, simply stating the next leaf label suffices to specify the leaves descended from the next internal node. Movement from left to right in Figure 3 indicates movement from the cherry of the species tree toward the root, and movement from bottom to top indicates coalescence in the gene tree.

3.2 Non-matching gene trees and species trees

Our key insight is that a version of the construction of Degnan (2005) linking coalescent histories and monotonic paths applies even if the gene tree and species tree are non-matching, provided that both continue to be caterpillars. Coalescent histories for non-matching caterpillars can be associated with roadblocked monotonic paths that do not cross above the diagonal of an $(n-1) \times (n-1)$ square lattice.

Definition 8.

In a lattice, a roadblocked monotonic path is a monotonic path that is not permitted to pass through certain specified lattice points. We term these lattice points roadblocks.

Consider a caterpillar gene tree $G$ and a caterpillar species tree $S$, whose leaves are both bijectively associated with the same set of $n$ leaf labels, but that do not necessarily match. As in Section 3.1, we associate points on the x-axis of an $(n-1) \times (n-1)$ lattice with species tree internal edges in $S$, and we associate points on the y-axis with gene tree internal nodes in $G$. We continue to label internal nodes of $G$ and $S$ in increasing order from 1 to $n-1$, from the cherry to the root, indexing the gene tree internal nodes by $i$ and the species tree internal nodes by $j$.

As is true in the matching case, for each from 1 to , each coalescent history must have , as a gene tree internal node must map to a species tree internal edge ancestral to at least as many leaves as descend from node in . Hence, each coalescent history for corresponds to a monotonic path that has and hence does not cross the diagonal of the lattice. However, an additional constraint is imposed by the fact that and do not necessarily match.

Given and , let denote the permutation of the gene tree leaf labels represented by the species tree leaf labels . The action of sends the vector of leaf labels from one -tuple to another, and we denote the index in of , the th label of , by .

For the leaf labels in , let denote the minimal internal edge of ancestral to leaf , the species tree leaf with label . For a matching gene tree and species tree , is the identity permutation so that ; we then have , and for .

For general that do not necessarily match, by Definition 5, (i) if or , then , and (ii) if , then . This rule encodes the fact that a gene tree coalescence can occur only on a species tree edge ancestral to all species tree leaves labeled by the elements of the set of labels for leaves descended from the gene tree coalescence.

Consider the partial coalescent histories with . As in Section 3.1, for , for all with . For each from 1 to , the minimal internal edge of that is ancestral to all leaves labeled by labels of leaves of that descend from gene tree internal node is . Therefore, for , we have for all with . Note that these are the only roadblocks: for , , as is one less than the maximum of distinct elements of , a quantity greater than or equal to . For , because for all lattice points with , all such points are roadblocks.

We also note that for , . The set of descendant leaves of internal node of contains as a subset the descendant leaves of internal node of . Hence, the minimal internal edge of ancestral to all labels that label leaves descended from internal node of has an index at least as great as the corresponding internal edge of associated with internal node of . Consequently, if is a roadblock, then because and for , we can conclude that is a roadblock for each with .

As in Section 3.1, each partial coalescent history in is formed in one of two ways. For , gene tree node is mapped either to species tree internal edge , or to one of the edges . The former case produces partial coalescent histories, and the latter produces . Hence, the recursion is still satisfied. We still have the constraints for and , for and , and for all that do not satisfy . We also have the new constraint for all that satisfy .

The set of roadblocks for is defined by . We have therefore demonstrated the following proposition.

Proposition 9.

Consider a caterpillar gene tree $G$ and a caterpillar species tree $S$, both bijectively associated with the same set of $n$ leaf labels, but that do not necessarily match. Then $(G,S)$ can be associated with a set of roadblocks such that the coalescent histories for $(G,S)$ bijectively correspond to roadblocked monotonic paths that do not cross the diagonal of an $(n-1) \times (n-1)$ lattice.
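The proposition suggests a simple dynamic program, sketched here under stated assumptions. Trees are given by label vectors. A species tree leaf in vector position 1 or 2 is taken to lie below edge 1, and a leaf in position $p \geq 3$ below edge $p-1$, which is our reading of the rule in this section. A lattice point $(j,i)$ with $i \leq j$ is then treated as a roadblock when edge $j$ is not ancestral to every leaf below gene tree node $i$. The function names are hypothetical.

```python
def roadblock_bounds(gene, species):
    """For each gene tree internal node i, return the smallest species tree
    edge index that is ancestral to all leaves below node i."""
    position = {label: p for p, label in enumerate(species, start=1)}
    bounds = {}
    running = 1
    for count, label in enumerate(gene, start=1):
        # Positions 1 and 2 lie below edge 1; position p >= 3 below edge p - 1.
        running = max(running, max(position[label], 2) - 1)
        if count >= 2:
            bounds[count - 1] = running  # node i has the first i + 1 leaves
    return bounds


def roadblocks(gene, species):
    """Lattice points (j, i) on or below the diagonal that paths must avoid."""
    bounds = roadblock_bounds(gene, species)
    return {(j, i) for i, minimum in bounds.items() for j in range(i, minimum)}


def count_roadblocked_paths(gene, species):
    """Count monotonic paths from (0, 0) to (n - 1, n - 1) that stay on or
    below the diagonal and avoid all roadblocks."""
    n = len(gene)
    blocked = roadblocks(gene, species)
    count = {}
    for j in range(n):
        for i in range(j + 1):
            if i == 0:
                count[(j, i)] = 1
            elif (j, i) in blocked:
                count[(j, i)] = 0
            else:
                # Arrive from the left (j - 1, i) or from below (j, i - 1).
                count[(j, i)] = count.get((j - 1, i), 0) + count[(j, i - 1)]
    return count[(n - 1, n - 1)]


print(count_roadblocked_paths("ABCD", "ABCD"))  # 5
print(roadblocks("ABDC", "ABCD"))               # {(2, 2)}
print(count_roadblocked_paths("ABDC", "ABCD"))  # 3
```

The table has on the order of $n^2$ entries, in contrast to an exhaustive search over all maps from gene tree nodes to species tree edges.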

By definition of , we immediately see that if is a roadblock for , then is a roadblock as well for each with . We can also see that if is a roadblock for , then is a roadblock as well for each with ; this result follows from the fact that for . We have the following remark.

Remark 10.

Consider a caterpillar gene tree and a caterpillar species tree . The roadblock set consists of a set of points with such that if , then (i) for all with , and (ii) for all with .

Figure 4 illustrates the correspondence between coalescent histories and roadblocked monotonic paths. In Figure 4, we have . Because , is a roadblock, as are , , and for the same reason ( is a roadblock if ). Because , is also a roadblock, as are and . We can also identify , , and by Remark 10 as roadblocks as a consequence of the fact that , , and are roadblocks. Continuing through all , we identify 15 roadblocks in Figure 4.

Figure 4: The correspondence between monotonic paths that do not cross above the diagonal of an square lattice and coalescent histories for a non-matching caterpillar gene tree and species tree with leaves. Roadblocks are indicated by circles on lattice points; no roadblocked monotonic paths traverse the shaded regions. The lower left corner represents the origin . Monotonic paths from to represent the partial coalescent histories for . Values are taken from eq. 2, using in place of . Species tree internal edges are read from left to right: labels the species tree internal edge from which and descend, and each successive label indicates the internal edge ancestral both to the leaf corresponding to the associated label and to the caterpillar subtree containing all prior labels. Gene tree internal nodes are read in the same manner from bottom to top.

From Proposition 9, we immediately obtain that the number of coalescent histories for is given by the number of roadblocked monotonic paths that do not cross above the diagonal of an lattice, where the roadblocks are those in the set . We also obtain a simple proof of the following corollary, which appeared as Remark 15 of Degnan & Rhodes (2015).

Corollary 11.

Consider a caterpillar gene tree topology and a caterpillar species tree topology . The number of coalescent histories for is strictly greater for than for each choice of .

Proof.

By Proposition 9, coalescent histories for correspond to roadblocked monotonic paths that do not cross the diagonal of an lattice.

In the case that , applying Lemma 7, the number of coalescent histories is the number of monotonic paths that do not cross the diagonal of the lattice.

Adding a roadblock to the lattice necessarily reduces the number of monotonic paths from to , as each lattice point has at least one monotonic path that passes through it. Because the number of coalescent histories for is equal to the number of roadblocked monotonic paths on the lattice, it suffices to show that for , at least one lattice point is a roadblock.

Because , there exists some internal node of at least one of whose descendant leaves has a label not contained in the label set of the leaves descended from internal node of . This leaf has . Hence, is a roadblock, and is associated with fewer monotonic paths than is . ∎
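The counting argument behind Proposition 9 and Corollary 11 can be illustrated with a short dynamic program. This is a minimal sketch under assumed conventions, not the authors' implementation: paths run from (0, 0) to (m, m) with m = n - 1, take unit steps to the right or upward, and stay on or below the diagonal (y <= x). The function names `count_roadblocked_paths` and `catalan`, and the representation of roadblocks as a set of (x, y) pairs, are introduced here for illustration only.

```python
from math import comb


def count_roadblocked_paths(m, roadblocks=()):
    """Count monotonic paths from (0, 0) to (m, m) that stay on or below
    the diagonal (y <= x) and avoid every lattice point in `roadblocks`."""
    blocked = set(roadblocks)
    # paths[x][y] holds the number of admissible paths from (0, 0) to (x, y).
    # Entries above the diagonal are never written and remain 0.
    paths = [[0] * (m + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(x + 1):              # only points with y <= x
            if (x, y) in blocked:
                continue                    # no path may touch a roadblock
            if (x, y) == (0, 0):
                paths[x][y] = 1
                continue
            from_left = paths[x - 1][y] if x > 0 else 0
            from_below = paths[x][y - 1] if y > 0 else 0
            paths[x][y] = from_left + from_below
    return paths[m][m]


def catalan(m):
    """Catalan number C_m."""
    return comb(2 * m, m) // (m + 1)


print(count_roadblocked_paths(3))            # no roadblocks: C_3
print(count_roadblocked_paths(3, {(1, 1)}))  # one roadblock on the diagonal
```

With no roadblocks the count is the Catalan number, as in the matching case; placing a roadblock at any lattice point on or below the diagonal strictly lowers the count, which is the step used in the proof of Corollary 11.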

3.3 Roadblock sets

Given a caterpillar species tree , Remark 10 suggests a characterization of the possible sets of roadblocks, considering all caterpillar gene trees . Each roadblock set has the property that within a row, all points to the left of a roadblock and on or below the diagonal are also roadblocks. Within a column, all points above a roadblock and on or below the diagonal are roadblocks.

Proposition 12.

Consider a caterpillar species tree topology with leaves. For each caterpillar gene tree topology with leaves, denote its associated roadblock set by . Considering all possible caterpillar gene tree topologies, the distinct roadblock sets are bijectively associated with the monotonic paths on the lattice that do not cross the diagonal.

Proof.

Consider a roadblock set . For each from 1 to , we identify the largest such that is not a roadblock. Call this value . A unique monotonic path connects : by Remark 10, for each and each , is either a roadblock or it lies above the line. Hence, denoting and , for each from 1 to , a monotonic path from to must proceed horizontally by length 1 and then vertically by length .

To show that this construction is injective, note that distinct monotonic paths are associated with distinct roadblock sets: consider a point appearing in one monotonic path but not in another one, . Because is the largest value of that is not a roadblock for path , must be a roadblock for .

For surjectivity, consider a monotonic path from to that does not cross the line. For each in the path, , we assign each point with to be a roadblock. ∎
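The two directions of the construction in this proof can be sketched in code. The sketch is illustrative and rests on assumed conventions: the lattice has m = n - 1 columns indexed x = 1, ..., m, a roadblock set is encoded by the largest non-roadblock height y_x in each column, and roadblocks occupy the points above that height up to the diagonal. The function names are hypothetical.

```python
from math import comb


def height_sequences(m):
    """Yield tuples (y_1, ..., y_m) with y_1 <= ... <= y_m, y_x <= x and
    y_m = m. Each tuple encodes one monotonic path on or below the diagonal."""
    def extend(prefix):
        x = len(prefix) + 1
        if x == m:
            yield prefix + (m,)     # the path must end at (m, m)
            return
        lowest = prefix[-1] if prefix else 0
        for y in range(lowest, x + 1):
            yield from extend(prefix + (y,))
    yield from extend(())


def roadblocks_from_heights(heights):
    """Roadblocks are the points above the path in each column,
    up to and including the diagonal."""
    return frozenset((x, y)
                     for x, top in enumerate(heights, start=1)
                     for y in range(top + 1, x + 1))


def heights_from_roadblocks(m, roadblocks):
    """Recover, for each column, the largest y that is not a roadblock."""
    return tuple(max(y for y in range(x + 1) if (x, y) not in roadblocks)
                 for x in range(1, m + 1))


m = 3
for heights in height_sequences(m):
    print(heights, sorted(roadblocks_from_heights(heights)))
```

Mapping each height sequence to its roadblock set and back returns the original sequence, and the number of distinct roadblock sets produced for a given m equals the Catalan number C_m, in line with Proposition 12.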

Figure 5: The correspondence between roadblock sets, monotonic paths that do not cross above the diagonal of an square lattice, and Dyck paths of semi-length . Given a roadblock set, the associated monotonic path is constructed by identifying for each coordinate from 0 to the lattice point of greatest coordinate, and then constructing the unique monotonic path through those points. Similarly, given a monotonic path, its roadblock set is obtained by placing roadblocks at each lattice point above and to the left of the path. (A) Roadblock set symmetric across the line . (B) Roadblock set asymmetric across the line . (C) Roadblock set asymmetric across the line , obtained by reflecting the roadblocks in (B) over this line. (D) Symmetric Dyck path associated with the roadblock set in (A). (E) Asymmetric Dyck path associated with the roadblock set in (B). (F) Asymmetric Dyck path associated with the roadblock set in (C), obtained by reversing the Dyck path in (E). The roadblock sets in (B) and (C) both generate 235 monotonic paths from to .

Figure 5 provides an illustration of Proposition 12, showing how the monotonic path associated with a roadblock set is constructed and vice versa. The monotonic path associated with a roadblock set can be viewed as the monotonic path that comes as close as possible to the roadblocks. The roadblock set for a monotonic path is the set of points above and to the left of the path.

The number of distinct caterpillar trees is , whereas the number of distinct roadblock sets is the smaller . For a given caterpillar species tree, we can place the caterpillar gene trees into equivalence classes, where two gene trees are said to be history-equivalent if and only if they are associated with the same roadblock set. Two history-equivalent caterpillar trees and have the same set of roadblocks and the same set of monotonic paths, and hence, the same set of coalescent histories, up to permutation of the leaf labels. These equivalence classes were termed history classes by Rosenberg & Tao (2008), so that two caterpillars with the same roadblocks are in the same history class.

By Proposition 12, for a fixed species tree, the number of history classes considering all caterpillar trees is $C_{n-1}$; this result accords with the computation of 5 history classes for $n=4$ (Rosenberg, 2002, Table V) and 14 for $n=5$ (Rosenberg & Tao, 2008, Table 3). We have also seen in Corollary 11 that $C_{n-1}$ is the largest possible number of coalescent histories for a pair of caterpillar trees. We now ask how many of the values from 1 to $C_{n-1}$ can be the number of coalescent histories for some caterpillar gene tree and species tree. The simplest upper bound on this quantity is $C_{n-1}$. To improve on this bound, it is convenient to use the bijection between monotonic paths that do not cross the diagonal of the $(n-1) \times (n-1)$ lattice and Dyck paths of semi-length $n-1$ (Stanley, 1999, Corollary 6.3.2). Each monotonic path represents a series of steps by $(1,0)$ or $(0,1)$ from $(0,0)$ to $(n-1,n-1)$, with $y \leq x$ at each step. Each Dyck path represents a series of steps by $(1,1)$ or $(1,-1)$ from $(0,0)$ to $(2n-2,0)$, with $y \geq 0$ at each step. The coalescent histories for a caterpillar gene tree and species tree can therefore be associated with Dyck paths, where each up-step represents addition of a species in the species tree and each down-step represents a gene tree coalescence.

A Dyck path of semi-length has total up-steps and down-steps. The steps of Dyck paths can be written as a sequence, with denoting up-steps and denoting down-steps. A Dyck path can be reversed in the following manner: we take the sequence of and steps in the path, reverse the order of steps, and exchange the positions of and steps. Thus, a path becomes . Reversing a Dyck path corresponds to traversing the path in reverse order. A reversed Dyck path is itself a Dyck path; if the sequence of and steps in a Dyck path is reversed, then at each step; exchanging the positions of the and steps reflects the path over the axis.
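The reversal operation can be made concrete with a small sketch. The encoding is an assumption made here for illustration: up-steps are written `U`, down-steps `D`, and a monotonic path is written with `E` (one step right) and `N` (one step up), so that `E` corresponds to `U` and `N` to `D`. All function names are hypothetical.

```python
def is_dyck(path):
    """True if the U/D string never dips below height 0 and ends at 0."""
    height = 0
    for step in path:
        height += 1 if step == "U" else -1
        if height < 0:
            return False
    return height == 0


def reverse_dyck(path):
    """Reverse the order of steps and exchange up-steps with down-steps."""
    swap = {"U": "D", "D": "U"}
    return "".join(swap[step] for step in reversed(path))


def dyck_to_monotonic(path):
    """Up-step -> step right (E); down-step -> step up (N)."""
    return path.replace("U", "E").replace("D", "N")


def lattice_points(monotonic):
    """List the lattice points visited by an E/N path starting at (0, 0)."""
    x = y = 0
    points = [(x, y)]
    for step in monotonic:
        x, y = (x + 1, y) if step == "E" else (x, y + 1)
        points.append((x, y))
    return points


original = "UDUUDD"
reversed_path = reverse_dyck(original)
print(reversed_path, is_dyck(reversed_path))
print(lattice_points(dyck_to_monotonic(original)))
print(lattice_points(dyck_to_monotonic(reversed_path)))
```

In this example with semi-length m = 3, each point (x, y) visited by the original monotonic path corresponds to the point (m - y, m - x) on the reversed path, and applying the reversal twice returns the original path.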

Lemma 13.

Consider a caterpillar species tree topology with leaves. Consider gene tree topologies and such that is in the roadblock set if and only if is in the roadblock set . Then and have the same number of coalescent histories.

Proof.

We show that the coalescent histories for can be bijectively associated with the coalescent histories for . Consider a coalescent history for . Identify its associated monotonic path according to Proposition 9, and identify the Dyck path associated with this monotonic path. Reverse to obtain , and identify the monotonic path associated with .

Because avoids each roadblock in , after steps, cannot have taken up-steps and down-steps. Because is the reverse of , after steps, cannot have taken up-steps and down-steps. The monotonic path therefore avoids the point for each roadblock in . Hence, avoids each roadblock in , and it therefore represents a coalescent history for . Similarly, beginning from the coalescent history for associated with , we find that represents a coalescent history for . ∎

The lemma demonstrates that if two roadblock sets are related by the transformation in Lemma 13, so that each roadblock of one set is carried to a roadblock of the other and vice versa, then the associated caterpillar gene trees have the same number of coalescent histories.

Consider a set of points on or below the diagonal of the first quadrant of the lattice (and not on lines or ) with the property that if , then for all with and for all with . By Proposition 12, given a caterpillar species tree topology, is the roadblock set for some caterpillar gene tree. We term such a set a caterpillar-friendly roadblock set.

Definition 14.

Consider a caterpillar-friendly roadblock set for the lattice. We say that is symmetric if for each , is also in . Otherwise, is asymmetric.

In a symmetric caterpillar-friendly roadblock set, when the points in the roadblock set are reflected across the line , the same roadblock set is obtained (Figure 5A). For an asymmetric caterpillar-friendly roadblock set, a different roadblock set is obtained by this reflection (Figure 5B and 5C).

For the lattice, denote by and the numbers of symmetric and asymmetric caterpillar-friendly roadblock sets, respectively. By Lemma 13, the asymmetric caterpillar-friendly roadblock sets can be partitioned into disjoint pairs such that the associated caterpillar gene trees for the two entries in a pair give rise to the same number of coalescent histories. Hence, considering all caterpillar gene trees and species trees, the number of distinct values possible for the number of coalescent histories is bounded above by , or because , by .

We obtain by counting all ways of placing roadblocks with . By symmetry we then assign points to be roadblocks as well. Because of the bijection between roadblock sets and monotonic paths (Proposition 12), each set of roadblocks with is bijectively associated with a monotonic path from to a point for some with .

Lemma 15.

The value of is .

Proof.

Using eq. 3, the number of monotonic paths from to for some with is obtained by the sum

The first sum gives for odd , and for even . The second sum gives for odd , and for even . Combining these cases, the result follows. ∎

This result appeared in Bonin et al. (2003, Theorem 2.5) as the number of distinct first halves for Dyck paths, and in Deng et al. (2015, Theorem 4.2) as the number of Dyck paths invariant under reversal.

Proposition 16.

The size of the set of values that can equal the number of coalescent histories for at least one pair consisting of an -leaf caterpillar gene tree and an -leaf caterpillar species tree is bounded above by , or

This quantity, which appeared in a bijectively related context in Bonin et al. (2003, Theorem 4.2), gives the number of distinct Dyck paths up to reversal. Numerical values of the formulas in Lemma 15 and Proposition 16 are shown in Table 1.

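| Number of leaves | Number of distinct roadblock sets | Number of roadblock sets associated with symmetric Dyck paths | Number of roadblock sets associated with asymmetric Dyck paths | Upper bound on the number of distinct values for the number of coalescent histories | Exact number of distinct values for the number of coalescent histories |
|---|---|---|---|---|---|
| Notation | | | | | |
| Formula | $C_{n-1}$ | $\binom{n-1}{\lfloor (n-1)/2 \rfloor}$ | $C_{n-1} - \binom{n-1}{\lfloor (n-1)/2 \rfloor}$ | $\frac{1}{2}\left[C_{n-1} + \binom{n-1}{\lfloor (n-1)/2 \rfloor}\right]$ | |
| OEIS record | A000108 | A001405 | | A007123 | |
| 2 | 1 | 1 | 0 | 1 | 1 |
| 3 | 2 | 2 | 0 | 2 | 2 |
| 4 | 5 | 3 | 2 | 4 | 4 |
| 5 | 14 | 6 | 8 | 10 | 10 |
| 6 | 42 | 10 | 32 | 26 | 21 |
| 7 | 132 | 20 | 112 | 76 | 56 |
| 8 | 429 | 35 | 394 | 232 | 154 |
| 9 | 1430 | 70 | 1360 | 750 | 440 |
| 10 | 4862 | 126 | 4736 | 2494 | 1373 |
| 11 | 16796 | 252 | 16544 | 8524 | 4310 |
| 12 | 58786 | 462 | 58324 | 29624 | 13925 |

Table 1: The number of distinct values possible for the number of coalescent histories of a caterpillar gene tree and a caterpillar species tree.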
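The quantities in Table 1 can be recomputed for small numbers of leaves by brute force. The sketch below is illustrative and relies on assumed conventions that are not notation from the text: roadblock sets are generated from column heights on a lattice with m = n - 1, the symmetry of Definition 14 is taken to be the reflection (x, y) -> (m - y, m - x), and all function names are hypothetical.

```python
def height_sequences(m):
    """Yield (y_1, ..., y_m), nondecreasing, with y_x <= x and y_m = m."""
    def extend(prefix):
        x = len(prefix) + 1
        if x == m:
            yield prefix + (m,)
            return
        lowest = prefix[-1] if prefix else 0
        for y in range(lowest, x + 1):
            yield from extend(prefix + (y,))
    yield from extend(())


def roadblocks_from_heights(heights):
    """Points above the path in each column, up to the diagonal."""
    return frozenset((x, y)
                     for x, top in enumerate(heights, start=1)
                     for y in range(top + 1, x + 1))


def count_paths(m, blocked):
    """Monotonic paths (0,0) -> (m,m) with y <= x that avoid `blocked`."""
    paths = [[0] * (m + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(x + 1):
            if (x, y) in blocked:
                continue
            if x == 0:
                paths[x][y] = 1
            else:
                paths[x][y] = paths[x - 1][y] + (paths[x][y - 1] if y else 0)
    return paths[m][m]


def reflect(m, blocked):
    """Assumed symmetry: reflect each roadblock across the line x + y = m."""
    return frozenset((m - y, m - x) for (x, y) in blocked)


def table_row(n):
    """Return (n, sets, symmetric, asymmetric, upper bound, exact count)."""
    m = n - 1
    sets = {roadblocks_from_heights(h) for h in height_sequences(m)}
    counts = {s: count_paths(m, s) for s in sets}
    # Numerical check of Lemma 13: reflected sets give equal path counts.
    assert all(counts[s] == counts[reflect(m, s)] for s in sets)
    symmetric = sum(1 for s in sets if reflect(m, s) == s)
    total = len(sets)
    bound = (total + symmetric) // 2
    return (n, total, symmetric, total - symmetric, bound,
            len(set(counts.values())))


for n in range(2, 8):
    print(table_row(n))
```

Each printed tuple can be compared with the corresponding row of Table 1. The enumeration grows with the Catalan numbers, so the sketch is intended only for small n.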

4 Non-recursive enumeration of coalescent histories

With the correspondence between coalescent histories for non-matching caterpillars and roadblocked monotonic paths established, we now turn to enumerating the coalescent histories of possibly non-matching caterpillar gene trees and species trees. We can do so recursively by enumerating roadblocked monotonic paths according to Proposition 9; we can also obtain a non-recursive formula by applying eq. 1.

Without loss of generality, considering the two subtrees immediately descended from the root of a tree, we treat the left subtree as having a number of leaves greater than or equal to that of the right subtree. The right subtree of a caterpillar tree then has a single leaf, so that in eq. 1, the right subtree always has exactly one leaf in each successive step of the recursion. Hence, the right-subtree term in eq. 1 follows the base case of the recursion and is equal to 1. Eq. 1, describing the number of coalescent histories for a caterpillar gene tree and a species tree , then reduces to

(7)

with initial condition for all when has a single leaf.

If is also a caterpillar tree with leaves, then we can iterate the recursion times, at each step reducing the size of the left subtree by one, until has a single leaf, the base case applies, and the summand equals 1. Each iteration introduces a new summation, with its upper limit depending on the associated , the number of edges that separate the root of from the root of . Continuing to label internal nodes of from 1 to in increasing order from the cherry to the root, we associate internal node of with index . Setting the integer parameter equal to 1, we have

(8)

where the constant represents the number of additional edges of that are possible locations for gene tree coalescence but that are not possible for gene tree coalescence .

For , consider gene tree internal node . Let be the set of labels for all leaves descended from . Following the definitions in eq. 1, let denote the smallest subtree of that has the property that each label in labels one of its leaves, and let denote the number of edges separating the root of from the root of . Then gives the number of edges of on which gene tree coalescence can occur (the +1 represents the root edge of ). The quantity , equal to the number of edges of ancestral to at least leaves (or ) but on which gene tree coalescence cannot occur, represents the number of roadblocks with fixed and .

For , the desired quantity , the number of additional edges of available for coalescence but not for coalescence , equals . We have therefore shown the following proposition.

Proposition 17.

Consider a caterpillar gene tree and a caterpillar species tree , both bijectively associated with the same set of leaf labels, but that do not necessarily match. The number of coalescent histories for is obtained by eq. 8, where the vector is obtained as a function that depends only on the topologies of and .

Note that if and match, then for each from 1 to , , and hence , , and no roadblocks occur. We have for each from 1 to , and eq. 8 becomes

equal to the Catalan number (Rosenberg, 2007, Theorem 3.4).

We take as an example the gene tree and species tree in Figure 4. We report the values of the , and in Table 2. The number of coalescent histories is

We can also obtain this result by recursive summation of roadblocked monotonic paths (Figure 4).
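The recursive summation over roadblocked monotonic paths for this example can be sketched numerically. The sketch rests on assumptions inferred from Table 2 rather than stated in the text: gene tree internal node i is placed on lattice row y = i, species tree edges are placed on columns, and the roadblocks for node i occupy consecutive points (i, i), (i + 1, i), ... starting at the diagonal. The names `ROADBLOCKS_PER_NODE` and `count_paths` are hypothetical.

```python
# Number of roadblocks for each gene tree internal node, read from Table 2.
ROADBLOCKS_PER_NODE = {9: 0, 8: 1, 7: 1, 6: 2, 5: 1, 4: 1, 3: 2, 2: 3, 1: 4}


def count_paths(m, blocked):
    """Monotonic paths (0,0) -> (m,m) with y <= x that avoid `blocked`."""
    paths = [[0] * (m + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        for y in range(x + 1):
            if (x, y) in blocked:
                continue
            if x == 0:
                paths[x][y] = 1
            else:
                paths[x][y] = paths[x - 1][y] + (paths[x][y - 1] if y else 0)
    return paths[m][m]


# Assumed placement: row y = i holds its roadblocks at x = i, ..., i + r - 1.
blocked = {(x, y)
           for y, r in ROADBLOCKS_PER_NODE.items()
           for x in range(y, y + r)}

print(len(blocked))              # total number of roadblocks
print(count_paths(9, blocked))   # roadblocked monotonic paths
```

Under these assumptions the roadblock set has 15 points, matching the count given for Figure 4, and the sketch returns 235 paths, the same value quoted in the caption of Figure 5 for the roadblock sets in panels (B) and (C).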

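| Internal node index in gene tree () | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Summation index () | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Number of roadblocks () | 0 | 1 | 1 | 2 | 1 | 1 | 2 | 3 | 4 |
| Distance between root of and root of () | 0 | 0 | 1 | 1 | 3 | 4 | 4 | 4 | 4 |
| Nodes possible for coalescence but not () | NA | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 |
| Summation term | | | | | | | | | |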