A matroid associated with a phylogenetic tree

Andreas W.M. Dress, Katharina T. Huber, and Mike Steel

(A.W.D.): CAS-MPG Partner Institute and Key Lab for Computational Biology, 320 Yue Yang Road, 200031 Shanghai, China;
(K.T.H.): School of Computing Sciences, University of East Anglia, Norwich, UK;
(M.S.): Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.

Abstract

A (pseudo-)metric $D$ on a finite set $X$ is said to be a ‘tree metric’ if there is a finite tree with leaf set $X$ and non-negative edge weights so that, for all $x, y \in X$, $D(x,y)$ is the path distance in the tree between $x$ and $y$. It is well known that not every metric is a tree metric. However, when some such tree exists, one can always find one whose interior edges have strictly positive edge weights and that has no vertices of degree $2$; any such tree is – up to canonical isomorphism – uniquely determined by $D$, and one does not even need all of the distances in order to fully (re-)construct the tree’s edge weights in this case. Thus, it seems of some interest to investigate which subsets of $\binom{X}{2}$ suffice to determine (‘lasso’) these edge weights. In this paper, we use the results of a previous paper to discuss the structure of a matroid that can be associated with an (unweighted) $X$-tree $T$, defined by the requirement that its bases are exactly the ‘tight edge-weight lassos’ for $T$, i.e., the minimal subsets of $\binom{X}{2}$ that lasso the edge weights of $T$.

keywords:
phylogenetic tree, tree metric, matroid, lasso (for a tree), cord (of a lasso).

1 Introduction

Given any finite tree $T$ without vertices of degree $2$, there is an associated matroid having ground set $\binom{X}{2}$, where $X$ is the set of leaves of $T$. In this paper, we describe this matroid and investigate a number of interesting properties it exhibits. The motivation for studying this matroid is its relevance to the problem of uniquely reconstructing an edge-weighted tree from its topology and just some of the leaf-to-leaf distances in that tree. This combinatorial problem arises in phylogenetics (the inference of evolutionary relationships from genetic data) since – due to patchy taxon coverage by available genetic loci [patchy] – reliable estimates of evolutionary distances can often be obtained only for some pairs of species.

In [dre3], we already introduced and explored related mathematical questions. We asked when knowing just some of the leaf-to-leaf distances is sufficient to uniquely determine – or, as we say, ‘lasso’ – the topology of the tree, or its edge weights, or both. In this paper, we turn our attention to a fixed (un-weighted) $X$-tree $T$ and the set of minimal subsets $\mathcal{L}$ of $\binom{X}{2}$ for which the leaf-to-leaf distances between all $x, y \in X$ with $xy \in \mathcal{L}$ relative to some edge-weighting $\omega$ of $T$ suffice to determine all the other distances relative to $\omega$ and, thus, the edge-weighting $\omega$. Indeed, these subsets form the bases of the matroid that will be studied here.

We begin by recalling some basic definitions and some relevant terminology from [dre3] on trees, lassos, and associated concepts (readers unfamiliar with basic matroid theory may wish to consult [wel] – though even Wikipedia may suffice). We then define the matroid and describe some of its basic properties before presenting our main results. Finally, we provide a number of remarks, observations, and questions for possible further study.

2 Some terminology and basic facts

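We will assume throughout that $X$ is a finite set of cardinality $n$ and, for any elements $x, y \in X$, we will usually write just $xy$ instead of $\{x, y\}$, and we will refer to any such set as a ‘cord’ whenever $x \neq y$ holds. Throughout this paper, we will assume that $T$ is an $X$-tree, i.e., a finite tree with vertex set $V$, leaf set $X \subseteq V$, and edge set $E \subseteq \binom{V}{2}$ that has no vertices of degree $2$. Two $X$-trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ are said to be ‘equivalent’ if there exists a bijection $\varphi: V_1 \to V_2$ with $\varphi(x) = x$ for all $x \in X$ and $E_2 = \{\varphi(u)\varphi(v) : uv \in E_1\}$, in which case we will also write $T_1 \simeq T_2$. In case every interior vertex of an $X$-tree $T$ (that is, every vertex in $V \setminus X$) has degree $3$, $T$ will also be said to be a ‘binary’ $X$-tree.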

Further, given any two vertices of , we denote by the set of all vertices on the path in from to and by the set of all edges in on that path so that always holds.

For each , we denote by the map (where is, of course, the Kronecker delta function). And for all and , we put in case and otherwise.

Here, given an tree , we will be mainly concerned with the -linear map

and the associated -labeled family of linear forms

Note that holds for all and all , and for all and where denotes the map

associated to the edge weighting – a map which in case is a non-negative edge weighting is nothing but the associated (pseudo-)metric on induced by the edge weighted tree much studied in phylogenetic analysis.

Recall also that, given an arbitrary metric $D$ defined on $X$,

  • the metric $D$ is dubbed a ‘tree metric’ if it is the path metric induced by some $X$-tree $T$ and some non-negative edge weighting $\omega$ of $T$,

  • which, in turn, holds if and only if $D$ satisfies the well-known ‘four-point condition’ stating that, for all $x, y, z, w$ in $X$, the larger two of the three distance sums $D(x,y) + D(z,w)$, $D(x,z) + D(y,w)$, and $D(x,w) + D(y,z)$ coincide,

  • that, in this case, one can actually always find an $X$-tree $T$ and an edge weighting $\omega$ of $T$ inducing $D$ such that $\omega$ is strictly positive on all interior edges, in which case $\omega$ is called a ‘proper’ edge weighting of $T$,

  • any such pair $(T, \omega)$ is – up to canonical isomorphism – uniquely determined by $D$,

  • and one does not even need to know the values of $D$ for all cords in $\binom{X}{2}$ in order to determine all the other distances and, thus, the edge-weighting $\omega$ in this case.
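
As a concrete illustration of the four-point condition recalled above, the following Python sketch tests it by brute force over all 4-subsets. The function name `satisfies_four_point` and the storage of distances in a dictionary keyed by `frozenset` pairs are assumptions made for this example only, and the input is assumed to be a metric already.

```python
from itertools import combinations

def satisfies_four_point(D, points, tol=1e-9):
    """Return True if D passes the four-point condition on all 4-subsets of points.

    D maps frozenset({x, y}) to the distance between x and y (hypothetical layout).
    """
    def d(x, y):
        return D[frozenset((x, y))]

    for a, b, c, e in combinations(points, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        # The two largest of the three distance sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

points = ["a", "b", "c", "d"]

# Path metric of a quartet tree separating a, b from c, d, all edge weights equal to 1.
tree_metric = {frozenset(p): v for p, v in [
    (("a", "b"), 2), (("c", "d"), 2),
    (("a", "c"), 3), (("a", "d"), 3), (("b", "c"), 3), (("b", "d"), 3)]}

# Shortest-path metric of the 4-cycle a-b-c-d-a, which is not a tree metric.
cycle_metric = {frozenset(p): v for p, v in [
    (("a", "b"), 1), (("b", "c"), 1), (("c", "d"), 1), (("a", "d"), 1),
    (("a", "c"), 2), (("b", "d"), 2)]}

print(satisfies_four_point(tree_metric, points))
print(satisfies_four_point(cycle_metric, points))
```

For the 4-cycle the three sums are 2, 2 and 4, so the two largest differ and the check fails.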

In this note, we continue our investigation of those subsets of for which – given the tree – already the restriction of the map to suffices to determine – or ‘lasso’ – the edge weighting of that we began in dre3 (). To this end, we denote, for any subset of ,

– by the -linear subspace of the dual vector space of the space generated by the maps with ,

– by the dimension of , and

– by the graph with vertex set and edge set .

Following the conventions introduced in dre3 (),

– we will refer to a subset of as being ‘connected’, ‘disconnected’ or ‘bipartite’ etc. whenever the graph is connected, disconnected, or bipartite and so on,

– a connected component of will also be called a connected component of ,

– and given any two subsets of , the subset of will be denoted by so that a subset of is bipartite if and only if there exist two disjoint subsets of with .

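Further, a subset $\mathcal{L}$ of $\binom{X}{2}$ will be called

– an ‘edge-weight lasso’ for $T$ if any two proper edge-weightings of $T$ whose induced distances agree on all cords in $\mathcal{L}$ must coincide,

– a ‘topological lasso’ for $T$ if, for any $X$-tree $T'$ and any proper edge-weightings $\omega$ of $T$ and $\omega'$ of $T'$, respectively, whose induced distances agree on all cords in $\mathcal{L}$, the trees $T$ and $T'$ must be equivalent, and

– a ‘strong lasso’ for $T$ if it is simultaneously an edge-weight and a topological lasso for $T$.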

Next, recall (see e.g. [ox; wel]) that an ‘abstract’ matroid with a given ground set can be defined in terms of its ‘rank function’ (a map from the power set of the ground set into the non-negative integers) as well as by the collection of its ‘independent sets’, the collection of its ‘generating sets’, the collection of its ‘bases’, i.e., the maximal independent sets or, just as well, the minimal generating sets, the collection of its ‘circuits’, i.e., the minimal dependent sets, as well as the associated ‘closure operator’.

Here, given any tree , we want to investigate the matroid with ground set associated to whose rank function is the map defined just above, i.e., the matroid that is ‘represented’ (over , again see e.g. ox (); wel ()) by the map

We will denote by its collection of independent sets, by its collection of generating sets, by its collection of bases, by its collection of ‘circuits’ and, given any subset of , we denote by the ‘(-)closure’ of relative to .

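It was noted already in [dre3, Theorem 1] that a subset $\mathcal{L}$ of $\binom{X}{2}$ is an edge-weight lasso for an $X$-tree $T$ if and only if the defining implication (equal distances on all cords in $\mathcal{L}$ force equal edge weightings) holds not only for any two proper edge weightings of $T$, but for any two real-valued maps defined on the edge set of $T$ and, hence, if and only if the rank of $\mathcal{L}$ coincides with the number of edges of $T$ or, using the terminology introduced above, if and only if $\mathcal{L}$ is a generating set of the matroid or, just as well, the closure of $\mathcal{L}$ is all of $\binom{X}{2}$. In particular, an edge-weight lasso $\mathcal{L}$ for $T$ is a ‘tight’ edge-weight lasso for $T$, i.e., a minimal subset of $\binom{X}{2}$ that is an edge-weight lasso for $T$, if and only if its cardinality coincides with the number of edges of $T$, if and only if it is a basis of the matroid.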
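
Since the matroid is represented by the linear forms attached to the cords, a set of cords is an edge-weight lasso exactly when these forms have rank equal to the number of edges of the tree. The following Python sketch checks this for a small example. It assumes the tree is given as an adjacency dictionary `adj` (a hypothetical encoding chosen for this illustration), builds for each cord the 0/1 indicator row of the edges on its path, and computes the rank exactly over the rationals; all function names are illustrative.

```python
from fractions import Fraction

def path_edges(adj, x, y):
    """Return the set of edges (as frozensets) on the unique path from x to y in a tree."""
    stack = [(x, None, [])]
    while stack:
        v, parent, edges = stack.pop()
        if v == y:
            return set(edges)
        for w in adj[v]:
            if w != parent:
                stack.append((w, v, edges + [frozenset((v, w))]))
    raise ValueError("vertices are not connected")

def rank(rows):
    """Rank of a list of equal-length integer rows, by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [p - f * q for p, q in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def is_edge_weight_lasso(adj, cords):
    """True if the path-indicator rows of the cords span all edge coordinates."""
    edges = list({frozenset((v, w)) for v in adj for w in adj[v]})
    rows = []
    for x, y in cords:
        on_path = path_edges(adj, x, y)
        rows.append([1 if e in on_path else 0 for e in edges])
    return rank(rows) == len(edges)

# Quartet tree separating leaves a, b from c, d; u and v are its interior vertices.
quartet = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
           "u": ["a", "b", "v"], "v": ["c", "d", "u"]}

with_cherries = [("a", "b"), ("c", "d"), ("a", "c"), ("a", "d"), ("b", "c")]
without_ab = [("c", "d"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
print(is_edge_weight_lasso(quartet, with_cherries))
print(is_edge_weight_lasso(quartet, without_ab))
```

The second set fails because the four cords joining the two sides of the interior edge satisfy one linear relation, so five cords that include all four of them only reach rank 4.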

Particular types of trees that will play an important role in this paper are shown in Figure 1. They comprise (i) the ‘star trees’, i.e., trees that have just one interior vertex and, hence, are equivalent to the tree with leaf set , vertex set and edge set where ‘’ denotes just some arbitrary, but fixed element not in ; (ii) ‘quartet trees’, i.e., binary trees that have four leaves (with denoting the quartet tree with leaf set whose central edge that will also be denoted by separates the leaves from ), and (iii) ‘caterpillar trees’, i.e. binary trees containing two interior vertices with .

Figure 1:

(i) A star tree with leaf set ; (ii) A binary tree – up to equivalence, there are two more binary trees ; (iii) a ‘caterpillar’ tree for .

3 Star trees

For the simplest type of $X$-tree, i.e., the star tree with leaf set $X$ (cf. Figure 1 (i)), the associated matroid is well known: It is easily seen to coincide exactly with the ‘biased matroid’ of the complete signed graph with vertex set $X$ all of whose edges have sign $-1$. In consequence (see e.g. [ox, Section 6.10] and the references therein to Zaslavsky’s papers on signed graphic matroids), the following results are known to hold:

Proposition 3.1

Given a finite set of cardinality , the following holds for the matroid associated to the star tree with leaf set :

  • The collection of all edge-weight lassos for the star tree coincides with the collection of all ‘strongly non-bipartite’ subsets of $\binom{X}{2}$, i.e., all subsets $\mathcal{L}$ of $\binom{X}{2}$ for which none of the connected components of the graph $(X, \mathcal{L})$ is bipartite.

  • The collection of all tight edge-weight lassos for the star tree coincides with the collection of all minimal strongly non-bipartite subsets of $\binom{X}{2}$, i.e., all subsets $\mathcal{L}$ of $\binom{X}{2}$ for which each connected component of $(X, \mathcal{L})$ contains exactly one circle and the length of this circle has odd parity. (Footnote 1: In our context, we adopt the convention of calling a graph, and hence also every subgraph of a graph, a ‘circle’ if it is connected and every vertex in that graph has degree $2$.)

  • The collection of all independent subsets of $\binom{X}{2}$ coincides with the collection of all subsets $\mathcal{L}$ of $\binom{X}{2}$ for which each connected component of $(X, \mathcal{L})$ is either a tree or contains exactly one circle and the length of this circle has odd parity.

  • The collection of all circuits of coincides with the collection of all subsets of that either form a circle of even length or a pair of circles of odd length together with a connecting simple path, such that the two circles are either disjoint (then the connecting path has one end in common with each circle and is otherwise disjoint from both) or share just a single common vertex (in this case the connecting path is that single vertex).

  • The co-rank of a subset $\mathcal{L}$ of $\binom{X}{2}$ relative to this matroid coincides with the number of bipartite connected components of $(X, \mathcal{L})$.

  • The closure of a subset $\mathcal{L}$ of $\binom{X}{2}$ relative to this matroid coincides with the union of (a) the edge set of the complete graph whose vertex set is the union of the vertex sets of all non-bipartite connected components of $(X, \mathcal{L})$ and (b) all cords $ab$ for which some bipartite connected component of $(X, \mathcal{L})$ with bipartition classes $A$ and $B$, $a \in A$, and $b \in B$ exists.
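
The first two items of Proposition 3.1 translate directly into a graph test. The Python sketch below is one way to carry it out, assuming the leaf set is passed as a list `vertices` and the cords as pairs; the function names are hypothetical. It 2-colours each connected component to detect bipartiteness, and uses the fact that a strongly non-bipartite graph has exactly one circle per component precisely when it has as many edges as vertices.

```python
def components_with_bipartiteness(vertices, cords):
    """Return a list of (component_vertices, is_bipartite) for the graph (vertices, cords)."""
    adj = {v: set() for v in vertices}
    for x, y in cords:
        adj[x].add(y)
        adj[y].add(x)
    colour = {}
    result = []
    for start in vertices:
        if start in colour:
            continue
        colour[start] = 0
        comp, bipartite, stack = [start], True, [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in colour:
                    colour[w] = 1 - colour[v]
                    comp.append(w)
                    stack.append(w)
                elif colour[w] == colour[v]:
                    bipartite = False  # an odd circle closes here
        result.append((comp, bipartite))
    return result

def is_star_lasso(vertices, cords):
    """Edge-weight lasso for the star tree: no connected component is bipartite."""
    return all(not bip for _, bip in components_with_bipartiteness(vertices, cords))

def is_tight_star_lasso(vertices, cords):
    """Tight lasso: strongly non-bipartite with exactly one (odd) circle per component."""
    distinct = {frozenset(c) for c in cords}
    return is_star_lasso(vertices, cords) and len(distinct) == len(vertices)

X = ["a", "b", "c", "d"]
triangle_plus_pendant = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
four_cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
print(is_tight_star_lasso(X, triangle_plus_pendant))
print(is_star_lasso(X, four_cycle))
```

Note that an isolated vertex counts as a bipartite component, so a triangle on three of four leaves is not a lasso for the star tree with four leaves.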

4 A recursive approach for computing the matroid

Every $X$-tree can be reduced by a sequence of edge contractions to a star tree (one may even insist that at each stage, one of the two subtrees incident with the edge being contracted has only one non-leaf vertex, though we do not require this here). Thus, Proposition 3.1 can be used as a basis for a recursive description of the matroid associated with any $X$-tree, provided that one can describe, for any $X$-tree $T$ and any interior edge $e$ of $T$, how to obtain the matroid associated with $T$ from the matroid associated with the $X$-tree obtained from $T$ by collapsing the edge $e$. We provide such a description shortly, in Proposition 4.2, using the following lemma.

Lemma 4.1

Given any tree , any subset of the set of interior edges of , any map , and any map with , let denote the tree obtained by collapsing all edges in , and let denote the restriction of to the space relative to the canonical embedding defined by extending each map to the map by putting for all . Then, one has , i.e., one has for all maps with for all .

In particular, given any edge-weight lasso for , is also an edge-weight lasso for the tree . More generally,

(1)

holds for every subset of and any subset of the set of interior edges of .

[Proof]The first part follows directly from the definitions and implies that holds for all . In particular, as the map is surjective, the maps must generate whenever the maps generate while, more generally, they generate a space whose dimension coincides with the difference of and the dimension of the kernel of the map .  

Proposition 4.2

Given an tree , an interior edge of , a pair , and a basis of , let denote the unique map in with . Then, coincides with the set

[Proof]By Lemma 4.1, there exists, for each , some with . Thus, each element of is of the form for some and some . So, must clearly hold. Now suppose that and holds. Then, denoting by the space of all maps with , it follows from (1) that holds while, by construction, we have if and only if there exists some non-zero map in .

However, given any map and any real number , it follows from the fact that, by definition, coincides with for all , one has if and only if vanishes and, hence, if and only if holds for all . Thus, one has if and only if one has and, hence, if and only if holds as claimed.  

Remark:

Similarly, suppose that is an tree and that is a core as defined in (dre3, , Section 5), i.e., a non-empty subset of for which the induced subgraph of with vertex set is connected (and, hence, a tree) and the degree of any vertex in is either or coincides with the degree of in . Then, the rank of a subset of relative to and the rank of the corresponding subset of relative to the tree are easily seen to be related by the inequality

This fact can be used to prove [dre3, Theorem 5] in the same way that Lemma 4.1 has been used above to establish Proposition 4.2.

4.1 An example

To illustrate Proposition 4.2, consider – for – the quartet tree shown in Figure 1 (ii). In this case, there is – up to scaling – only one linear relation between the six maps , viz. the relation

Thus, consists of the four -subsets of that do not contain exactly one of the four cords – or, equivalently, with – and, hence, the four subsets of whose graphs are shown in Figure 2(ii). Clearly, if coincides with the unique interior edge of (i.e., the edge denoted by in Figure 1 (ii)), is equivalent to the star tree , also shown in Figure 1 (i), and the graphs corresponding to the bases in , being minimal strongly non-bipartite graphs with vertex set , must consist of one triangle (for which there are possibilities) to which the remaining element in is appended by a single edge (for which there are possibilities). So, consists of bases that form two orbits relative to the symmetry group of representatives of which are the bases and shown in Figure 2 (i). For the two cords , we have – putting

while

and

holds implying that, to bases of type , we can add cords of type , but not cords of type .

And for the two cords , we have

as well as

and

holds implying that, to bases of type , we can add either one of the two missing cords. Obviously, this fully corroborates our previous assertion about .

Figure 2:

(i): Two graphs representing two of the twelve tight edge-weight lassos of , one from each of the two orbits of such lassos relative to the -element symmetry group of .

(ii): The four graphs associated to the four bases in .
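
The counts in this example can be confirmed by brute force. The Python sketch below enumerates all bases for the quartet tree and for the star tree with four leaves, under the assumption (made only for this illustration) that the leaves are labelled a, b, c, d and that the interior edge of the quartet tree separates a, b from c, d. Each cord is encoded by the 0/1 indicator row of the edges on its path, and a set of cords is a basis when it has as many elements as the tree has edges and its rows have full rank.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of integer rows by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [p - f * q for p, q in zip(rows[i], rows[rk])]
        rk += 1
    return rk

LEAVES = "abcd"
CORDS = list(combinations(LEAVES, 2))

def row(cord, with_interior_edge):
    """Indicator row of a cord: four pendant-edge columns, plus the interior edge if present."""
    pendant = [1 if leaf in cord else 0 for leaf in LEAVES]
    if not with_interior_edge:
        return pendant  # star tree: a path uses only its two pendant edges
    crosses = (cord[0] in "ab") != (cord[1] in "ab")
    return pendant + [1 if crosses else 0]

def bases(with_interior_edge):
    """All sets of cords whose rows form a basis of the edge-coordinate space."""
    n_edges = 5 if with_interior_edge else 4
    return [s for s in combinations(CORDS, n_edges)
            if rank([row(c, with_interior_edge) for c in s]) == n_edges]

quartet_bases = bases(True)
star_bases = bases(False)
print(len(quartet_bases), len(star_bases))
for basis in quartet_bases:
    print(sorted(set(CORDS) - set(basis)))  # the single cord missing from each basis
```

Under the stated labelling, each quartet basis omits exactly one of the four cords that cross the interior edge, while the star tree has twelve bases, matching the triangle-plus-pendant-edge description above.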

5 Some particular cases

5.1 Pointed $x$-covers of binary $X$-trees that are bases of the matroid

When is a binary tree, some particular bases in are easily described: Select any element and, for each one of the interior vertices of , consider the three components of the graph obtained from by deleting . Select an element of from each of the two components that do not contain , and denote this pair by . Put

and let denote the collection of subsets of that can be generated in this way (by the various choices of and as varies).

For example, considering again the quartet tree with its two interior vertices and as shown in Figure 1 (ii), we may choose , and obtain the lasso

as an element of .

Clearly, is a subset of for each , since the elements of correspond precisely to the so-called ‘pointed covers’ of of cardinality and, by Theorem 7 of dre3 (), any pointed cover of a binary tree is not only an edge-weight, but a strong lasso for that tree.

We note also that, given two distinct elements in , a subset of cannot simultaneously be a pointed -cover in and a pointed -cover in unless is a caterpillar tree with and at opposite ‘ends’ of the tree: Indeed, if there exists some , we must have implying that the path from to in must pass through every interior vertex of .

Our next results require two definitions that will also be important later in this paper: Recall first that, given an tree and a subset of of cardinality at least , one denotes

  • by the tree obtained from the minimal subtree of that connects the leaves in by suppressing any resulting vertices of degree 2 (see e.g. (dre3, , Section 2.3)),

  • by and its vertex and edge set, respectively,

  • and, given in addition any edge weighting of , one denotes by the ‘induced’ edge weighting of , i.e., the edge weighting that maps any edge onto the sum , yielding a surjective -linear map such that holds for all and all .

It follows that the map coincides, for all , with the map , the composition of the maps and .

So, denoting by the dual – and necessarily injective – map of the map , we have also and for every subset of . In consequence, we must also have

(2)

for every subset of of cardinality at least and every subset of , implying also that every circuit of must also be a circuit of , i.e., we have for every such subset of of cardinality at least .

Further, denoting – for every – by the unique pendant edge of containing , we say that a -subset of forms (or ‘is’) a ‘cherry’ if the two edges share a vertex, and is said to form a ‘proper cherry’ if this vertex has degree 3. Note that a -subset of forms a proper cherry if and only if holds for any two distinct elements in (if any). Note also that, in a binary tree , every cherry is a proper cherry. In addition, such a tree is a caterpillar tree if and only if holds or its leaf set contains exactly two distinct cherries. We claim:

Proposition 5.1

For an $X$-tree $T$, a cord $xy$ is a ‘co-loop’ of the associated matroid, i.e., it is contained in every edge-weight lasso for $T$, if and only if $\{x, y\}$ is a proper cherry.

[Proof]If holds, the set is the only basis of while, if holds and is a proper cherry, the cord must be contained in every edge-weight lasso for in view of (dre3, , Corollary 1). Conversely, if does not form a proper cherry, there must exist two distinct elements in such that holds, implying that cannot be a co-loop of .  
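
By Proposition 5.1, the co-loops can be read off the tree without any linear algebra: they are the cords whose two pendant edges meet at a vertex of degree 3. The following Python sketch lists them, assuming the tree is given as an adjacency dictionary (a hypothetical encoding used only for this illustration).

```python
from itertools import combinations

def proper_cherries(adj, leaves):
    """Return the cords xy whose pendant edges meet at a common vertex of degree 3."""
    result = []
    for x, y in combinations(sorted(leaves), 2):
        (px,), (py,) = adj[x], adj[y]  # a leaf has exactly one neighbour
        if px == py and len(adj[px]) == 3:
            result.append((x, y))
    return result

# Quartet tree separating a, b from c, d.
quartet = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
           "u": ["a", "b", "v"], "v": ["c", "d", "u"]}

# Star tree with four leaves: every pair is a cherry, but none is proper.
star4 = {"a": ["s"], "b": ["s"], "c": ["s"], "d": ["s"],
         "s": ["a", "b", "c", "d"]}

print(proper_cherries(quartet, "abcd"))
print(proper_cherries(star4, "abcd"))
```

For the quartet tree this returns the two cherries, which are exactly the cords contained in all four of its bases; for the star tree with four leaves it returns nothing, in line with the fact that every cord can be avoided by some triangle-plus-pendant-edge basis.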

6 Main results

6.1 The matroid determines the tree up to equivalence

We begin this section by showing that the matroid associated with an $X$-tree determines that tree up to equivalence:

Theorem 1

For any two $X$-trees $T_1$ and $T_2$, the associated matroids coincide if and only if $T_1$ and $T_2$ are equivalent.

[Proof]We first note that, if is any tree and is a -subset of , then we have if and only if there exists at least one basis of containing the set : Indeed, if holds, the four maps and, hence, also the corresponding four maps are linearly independent. So, by the matroid augmentation property of independent sets, there exists some containing these four cords. Conversely, if and, therefore, also holds, cannot be part of a basis . It follows that implies where, for any tree , is defined by . However, it has been observed already by H. Colonius and H. Schultze in col77 (); col81 () that holds for any two trees if and only if one has (for a more recent account, see (dre11, , Theorem 2.7)) or (sem, , Corollary 6.3.8).  

6.2 The rank of topological lassos

Now assume that $n \geq 4$ holds and recall that the following three assertions are – according to [dre3, Theorem 8] – equivalent in this case for any $X$-tree $T$ and any bipartition of $X$ into two disjoint non-empty subsets $A, B$:

  • The subset of is a topological lasso for ,

  • is a ‘cover’ of (i.e., given any interior vertex of and any two edges with , there exists some cord in with , see (dre11, , Section 7)).

  • Every cherry of $T$ is split by the bipartition, i.e., its two leaves lie in different parts.

Footnote 2: When stating this theorem in [dre3], we forgot to mention that one needs to assume that $n \geq 4$ holds. Indeed, it is simply wrong for $n = 3$ for trivial reasons, as the first assertion holds for all bipartitions of the leaf set of an $X$-tree with $3$ leaves, but the second and third assertions never hold in this case. Yet, the assumption $n \geq 4$ will always be made here when applying this theorem.

And it was also noted in this context that such bipartitions exist if and only if every cherry is a proper cherry and holds.

Here, we want to complement this result as follows:

Theorem 2

Given any tree , one has for every bipartite subset of . Furthermore, the following assertions are equivalent for every such subset of :

  • The rank of coincides with .

  • There exists some cord such that is an edge-weight lasso for .

  • is connected and is an edge-weight lasso for for some cord if and only if is not bipartite.

  • is connected, the closure of relative to coincides with the edge set of the necessarily unique complete bipartite graph with vertex set whose edge set contains , i.e., coincides with the set in case the two subsets of form the necessarily unique bipartition of with , and this set forms a ‘hyperplane’ in , i.e., a maximal subset of of rank smaller than .

[Proof]Assume that and are two subsets of that form a bipartition of with , and let denote the map in that maps every interior edge of onto , every pendant edge that is incident with some leaf in onto , and every pendant edge that is incident with some leaf in onto . Clearly,