UPGMA and the normalized equidistant minimum evolution problem

# UPGMA and the normalized equidistant minimum evolution problem

## Abstract

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a widely used clustering method. Here we show that UPGMA is a greedy heuristic for the normalized equidistant minimum evolution (NEME) problem, that is, finding a rooted tree that minimizes the minimum evolution score relative to the dissimilarity matrix among all rooted trees with the same leaf-set in which all leaves have the same distance to the root. We prove that the NEME problem is NP-hard. In addition, we present some heuristic and approximation algorithms for solving the NEME problem, including a polynomial time algorithm that yields a binary, rooted tree whose NEME score is within of the optimum. We expect that these results to eventually provide further insights into the behavior of the UPGMA algorithm.

###### keywords:
UPGMA, minimum evolution, balanced minimum evolution, hierarchical clustering
###### Msc:
68Q17, 05C05, 05C85, 92B05
\cortext

[cor1]Corresponding author

## 1 Introduction

Clustering (i.e. subdividing a dataset into smaller subgroups or clusters) is a fundamental task in data analysis, and has a wide range of applications (see, e.g. (1)). An important family of clustering methods aim to produce a clustering of a dataset in which the clusters form a hierarchy where the clusters nest within one another. Such hierarchies are typically represented by leaf-labeled tree structures known as dendrograms or rooted phylogenetic trees. Introduced in 1958 (2), average linkage analysis, usually referred to as UPGMA (Unweighted Pair Group Method with Arithmetic Mean), is arguably the most popular hierarchical clustering algorithm in use to date, and remains widely cited1 and extremely popular (see, e.g. (3)). This is probably because UPGMA is conceptually easy to understand and fast in practice, an important consideration as big data sets are becoming the norm in many areas. UPGMA is commonly used in phylogenetics and taxonomy to build evolutionary trees ((4), Chapter 11) as well as in related areas such as ecology (5) and metagenomics (6). In addition, it is used as a general hierarchical clustering tool in bioinformatics and other areas including data mining and pattern recognition ((7), Chapter 2).

UPGMA is a text-book algorithm that belongs to the family of agglomerative clustering methods that share the following common bottom-up scheme (cf. e.g. ((4), p.162)). They take as input a dissimilarity on a set , i.e. a real-valued, symmetric map on which vanishes on the diagonal, and build a collection of clusters or subsets of which correspond to a rooted tree with leaf-set . To do this, at each step two clusters with the minimum inter-cluster dissimilarity are combined to create a new cluster, starting with the collection of clusters consisting of singleton subsets of , and finishing when the cluster is obtained. Different formulations of the inter-cluster dissimilarity, which specifies the dissimilarity of sets as a function of the dissimilarities observed on the members of the sets, lead to different heuristic criteria of the agglomerative methods. UPGMA, as the name average linkage analysis suggests, uses the mean dissimilarity across all pairs of elements that are contained within the two clusters. Formally, two clusters are selected for merging at each iteration step of UPGMA if the average

 1|A||B|∑a∈A,b∈BD(a,b)

is minimized over all possible pairs of clusters. Since the arithmetic mean is used, UPGMA is often more stable than linkage methods in which only a subset of the elements within the clusters are used (e.g. the single-linkage method).

UPGMA is commonly thought of as a method that greedily constructs a rooted phylogenetic tree that is closest to the input dissimilarity matrix in the least squares sense (8). However, it is not guaranteed to do so, although it often does quite well in practice ((4), p.162). In (9) it was shown that the related Neighbor-Joining (10) method for constructing unrooted phylogenetic trees from dissimilarity matrices can be thought of as a greedy heuristic that minimizes the so-called balanced minimum evolution score. Here we shall observe that (see Section 3), completely analogously, UPGMA is a greedy heuristic for computing a rooted phylogenetic tree that minimizes the so-called minimum evolution score (11) over all rooted phylogenetic trees on the same fixed leaf-set in which all leaves have the same distance in the tree to the root. We refer to this optimization problem as the normalized equidistant minimum evolution (NEME) problem, and expect that a better understanding of this problem will provide further insights into the behavior of the UPGMA algorithm.

### 1.1 Related work

Theoretical properties of discrete optimization problems arising in the construction of evolutionary trees have been studied for many years (for some earlier work see, e.g. (12); (13); (14)). Among these, the problems falling under the name of minimum evolution alone form a quite diverse family (see, e.g. (15)), in which the so-called balanced minimum evolution problem (16) is a particularly well-studied member. For this problem it was recently shown in (17) that for general -input dissimilarity matrices there exists a constant such that no polynomial time algorithm can achieve an approximation factor of unless P equals NP. We note that this hardness result does not rely on the often imposed restriction (see, e.g. (18); (13)) that the edge lengths of the constructed tree must be integers. Moreover, in contrast to general input dissimilarity matrices, for inputs that are metrics (i.e. matrices that also satisfy the triangle inequality) a polynomial time algorithm with an approximation factor of is presented in (17). Interestingly, the proof of this approximation factor uses the fact that the balanced minimum evolution score of an unrooted tree can be interpreted as being the average length of a spanning cycle compatible with the structure of the tree (19).

Another recent, related direction of work considers the algebraic structure of the space of rooted phylogenetic trees induced by the UPGMA method (see, e.g. (8); (20)). This algebraic structure is tightly linked with the property of consistency of a tree construction method, that is, those conditions under which the method is able to reconstruct a tree that has been used to generate the input dissimilarity matrix (see, e.g. (21)). In the context of our work, we are particularly interested in the consistency of methods that perform a local search of the space of all rooted phylogenetic trees on a fixed set of leaves (see, e.g. (16)). Again, balanced minimum evolution is the variant of minimum evolution for which some consistency results of this type are known (22); (23).

### 1.2 Our results and organization

After presenting some preliminaries in the next section, in Section 3 we begin by giving an explicit formula of the minimum evolution score of a rooted tree as a linear combination of the input dissimilarities. This formula allows us to interpret the minimum evolution score of in terms of the average length of a minimum spanning tree compatible with the set of clusters induced by .

Using this observation, we explain how UPGMA can be regarded as a greedy heuristic for the NEME problem. In addition, we show that there are rooted phylogenetic trees with leaves on which some input dissimilarity matrix has an optimal least squares fit while the NEME score of that tree for the same dissimilarity matrix is worse than the minimum NEME score by a factor in . This highlights the fact that the NEME problem and searching for trees with minimum least squares fit are quite distinct problems.

Next, in Section 4, we explore solving the NEME problem by performing a local search of the space of binary rooted phylogenetic trees using so-called rooted nearest neighbor interchanges as the moves in the local search. We show that this approach is consistent. More specifically, for any input dissimilarity matrix that can be perfectly represented by a unique binary rooted phylogenetic tree with all leaves having the same distance from the root, we prove that the local search will arrive at this tree after a finite number of moves.

In Section 5 we show that the NEME problem is NP-hard even for input distance matrices that satisfy the triangle inequality and only take on different values. In light of this fact, in Section 6 we consider some approximation algorithms for solving the NEME problem. More specifically, we first show that the tree produced by UPGMA can have a score that is worse than the minimum score by a factor in . Then, for dissimilarity matrices that satisfy the triangle inequality, we present a polynomial time algorithm that yields a binary rooted phylogenetic tree whose NEME score is within of the optimum. We conclude in Section 7 by mentioning two possible directions for future work.

## 2 Preliminaries

In this section we give a formal definition of the NEME problem and introduce some of the notation and terminology that will be used throughout this paper.

Let be a finite non-empty set. A dissimilarity on is a symmetric map with for all . In this paper, a rooted phylogenetic tree on is a rooted tree with (i) root of degree 1, (ii) leaf set and (iii) every vertex not in having degree at least 3. Note that even though we require the root to have degree 1, we do not consider it a leaf of the tree. A normalized equidistant edge weighting (NEEW) of a rooted phylogenetic tree on is a map such that the total weight of the edges on the path from to is 0 for all . More generally, for all , we denote the total weight of the edges on the path between vertices and by . The height of any vertex is defined as for any leaf in the subtree of rooted at . The length of under the edge weighting is , that is, the total weight of all edges of .

Note that the length of any rooted phylogenetic tree on with a normalized equidistant edge weighting can also be expressed as follows:

 ℓω(T)=∑v∈(V−(X∪{ρ}))(deg(v)−2)⋅h(T,ω)(v), (1)

where denotes the degree of vertex in . Note that the restriction of to yields a dissimilarity on . Moreover, this dissimilarity is an ultrametric, that is, holds for all , if and only if the edge weighting assigns a non-negative real number to each edge not adjacent to a vertex in . We call any such edge weighting interior positive.

Let be a dissimilarity on and a rooted phylogenetic tree on . For any vertex let denote the set of children of , that is, the set of vertices that are adjacent to and for which lies on the path from to . Moreover, we refer to as the parent of the vertices and we denote by the cluster of elements in induced by , that is, the set of those leaves of for which the path from to contains . In (24) it is shown that, for any dissimilarity and any rooted phylogenetic tree on , there is a unique normalized equidistant edge weighting with

 Δ(D,D(T,ω))=∑{x,y}∈(X2)(D(x,y)−D(T,ω)(x,y))2

minimum, where denotes the set of 2-element subsets of . More precisely, this edge weighting is the unique solution of the system of linear equations

 ℓ(T,ω)(v,y) =12∑{u,u′}∈(ch(v)2)|C(u)|⋅|C(u′)|∑{u,u′}∈(ch(v)2)∑x∈C(u)x′∈C(u′)D(x,x′), (2)

for all , and , for all , . Note that this is the analogue to Vach’s theorem for unrooted trees (25). For later use, we put .

Now, the normalized equidistant minimum evolution score of a rooted phylogenetic tree on with respect to a dissimilarity  on is formally defined as

 σD(T)=ℓω(D,T)(T),

that is, the length of under the edge weighting . The NEME problem is to compute, for an input dissimilarity  on , a rooted phylogenetic tree on with minimum NEME score. Formally, it can be stated as below.

Problem NEME Problem
Instance: A distance matrix on a finite set and a number .
Question: Does there exist a rooted phylogenetic tree on such that holds.

## 3 UPGMA and the NEME problem

We begin this section by explaining how the UPGMA algorithm can be reinterpreted as a greedy approach to solving the NEME problem. First note that it follows directly from Equations (1) and (2) that the NEME score of a rooted phylogenetic tree can be written as the following linear combination of the given dissimilarity values:

 σD(T) =∑{x,y}∈(X2)αT{x,y}⋅D(x,y)with (3) αT{x,y} =deg(lca(x,y))−22∑{u,u′}∈(ch(lca(x,y))2)|C(u)|⋅|C(u′)|,

where is the lowest common ancestor in for any two vertices . In particular, in case is a binary tree, that is, every vertex not in has degree precisely 3, we obtain, for any the coefficient

 αT{x,y}=12∏u∈ch(lca(x,y))|C(u)|. (4)

As an immediate consequence of (3) we obtain that the score is linear in , that is, when can be written as for non-negative real numbers and dissimilarities , , then

 σD(T)=λ1⋅σD1(T)+λ2⋅σD2(T) (5)

To link the Formula (3) with the UPGMA algorithm, recall that this algorithm constructs a rooted phylogenetic tree by generating the list of clusters associated to the vertices of this tree. It starts with the singleton clusters associated to the leaves of the tree. Then, in each iteration of the algorithm we already have a partition of into clusters and a dissimilarity on . UPGMA then selects a pair of two distinct clusters that minimizes

 1|A||B|∑a∈A,b∈BD(a,b)

and returns the partition . Now it is easy to see that UPGMA in each iteration greedily pairs already selected clusters so as to locally minimize the value added to the score for the rooted phylogenetic tree  produced by the method.

Interestingly, the coefficients in (4) suggest the following alternative interpretation of the NEME score of a binary rooted phylogenetic tree : Consider the complete graph with vertex set . Each edge of is weighted with the value . We construct a random subgraph of as follows. For each vertex select a random edge that has precisely one end point in each cluster associated with the two children of . Let denote the resulting subgraph of . It is easy to see that is always a spanning tree of . Thus, in case is binary, can be interpreted as half the average length of a random spanning tree of that is compatible with the clusters of . Based on (3), this interpretation can be extended to the non-binary case where, instead of a spanning tree, a random spanning forest in with edges, some of which selected more than once, arises.

Next we present a technical lemma summarizing some simple observations about the NEME score that will be used later. Let denote the set of all rooted phylogenetic trees on . In addition let denote the subset of those trees in that are binary.

###### Lemma 1

Let be a non-negative dissimilarity on a finite set with . Then, for all , we have:

•  ∑{x,y}∈(X2)αT{x,y}=12(n−1),
•  2n2≤min{αT{x,y}:{x,y}∈(X2)}≤max{αT{x,y}:{x,y}∈(X2)}≤12,
•  σD(T)≤n24min{σD(T′):T′∈RX}.
{@proof}

[Proof.](i): We use induction on . For the equality clearly holds. Next assume and consider any . Let be the single child of . Put and let denote the children of . Then we have, by induction,

 ∑{x,y}∈(X2)αT{x,y} =⎡⎢ ⎢⎣k∑i=1∑{x,y}∈(C(vi)2)αT{x,y}⎤⎥ ⎥⎦+⎡⎢ ⎢ ⎢⎣∑{x,y}∈(X2)u=lca(x,y)αT{x,y}⎤⎥ ⎥ ⎥⎦ =[k∑i=112(|C(vi)|−1)]+k−12=12(n−1),

as required.

(ii): Consider any . The inequality follows immediately from the definition of the coefficients . And the inequality follows from the fact that, for any integer , the function

 f:Rk→R:(z1,z2,…,zk)↦∑1≤i

attains its maximum among all non-negative with at . Hence, for any with and , we have .

(iii): This is an immediate consequence of (ii):

 σD(T) ≤∑{x,y}∈(X2)12D(x,y) =n24⎡⎢⎣∑{x,y}∈(X2)2n2D(x,y)⎤⎥⎦≤n24min{σD(T′):T′∈RX}.

We end this section presenting a family of dissimilarities for which the closest rooted equidistant tree in the least squares sense (i.e. the tree with minimum) has an NEME score that is worse than the minimum NEME score by a quadratic factor. This illustrates, as mentioned in the introduction, that the NEME problem is quite different from the problem of finding a closest rooted tree.

###### Lemma 2

There exist dissimilarities on a set with elements for which there exists a rooted phylogenetic tree on together with a normalized equidistant edge weighting with for all but

 σD(T)≥n24min{σD(T′):T′∈RX}
{@proof}

[Proof.]Assume with for some integer . Define the dissimilarity on by putting for all odd , where is a real number, and for all other . Note that in any rooted phylogenetic tree on for which holds with each pair , odd, must form a cherry (cf. Figure 1(a)). Moreover, for any such tree we have . In contrast, in any tree with minimum NEME score for the vertex must be the single child of the root for all odd (cf. Figure 1(b)). This implies .

## 4 Searching tree space for an optimal NEME tree

In this section we shall first establish that performing a local search on the space for trees with minimum NEME score is a consistent approach, that is, if the input dissimilarity can be represented by a binary rooted phylogenetic tree with an interior positive normalized equidistant edge weighting then this tree has minimum NEME score and, under some mild technical conditions, the local search will arrive at precisely this tree after a finite number of steps.

Note that consistency is an important property and there are general conditions known that imply consistency for approaches that construct unrooted phylogenetic trees (see e.g. (21); (26)). We first show that for any generic ultrametric, that is, a dissimilarity where and is a normalized equidistant edge weighting for with for all edges  not incident to a vertex in , a local search in starting from any using rooted nearest neighbor interchanges (rNNI) will terminate in . For unrooted trees an analogous result is established in (22). Recall that an rNNI modifies a rooted phylogenetic tree locally around a vertex as depicted in Figure 2. In the following, for any vertex of a rooted phylogenetic tree on , the subtree of induced by consists of the parent of together with all the vertices of for which the path from to contains . Note that such a subtree can be viewed as a phylogenetic tree with root on the cluster of elements in induced by .

In the following we will use the well known fact that, for any generic ultrametric on , the binary rooted phylogenetic tree with for the edge weighting is unique ((27), Theorem 7.2.8). We will say that represents , for short.

###### Lemma 3

Let be a generic ultrametric on and the unique binary rooted phylogenetic tree on that represents . Then, for any , there exists an rNNI that changes to with .

{@proof}

[Proof.]We use induction on . The statement in the lemma clearly holds for in view of the fact that . So assume and consider any . The situation is depicted in Figure 3(a). Let and , respectively, denote the set of leaves in the rooted subtrees and of . Similarly, let and , respectively, denote the set of leaves in the rooted subtrees and of . Note that the restriction of to any non-empty subset is again a generic ultrametric.

First consider the case that does not represent . Then, by induction, there exists an rNNI in that results in a rooted phylogenetic tree on with strictly smaller NEME score. Hence, applying the same rNNI to yields a tree with . The case that does not represent is completely analogous.

It remains to consider the case that and represent and , respectively. Then, and immediately implies and and, thus, . Otherwise, there exists at least one with or but and . Thus, swapping the roles of either and or and , we can assume without loss of generality that the sets , and are non-empty. The structure of the tree is depicted in Figure 3(b). Put .

First assume that . Put , , and . Without loss of generality we assume . We perform an rNNI pruning and regrafting the subtree with leaf set to obtain the tree depicted in Figure 3(c). Put . To show it suffices to show . To establish the latter inequality, recall that represents and, therefore, we can assume that has been scaled so that for all , . We put

 δA=∑a′∈(A∩A′)b′∈(A∩B′)D(a′,b′)andδB=∑a′∈(B∩A′)b′∈(B∩B′)D(a′,b′).

Note that all distances that contribute to and are strictly less than , implying and . Using this notation, we obtain

 h(T′,ω′)(u′)+h(T′,ω′)(w′2)−h(T′′,ω′′)(u′)−h(T′′,ω′′)(v) =12+n1n4+n2n32(n1+n2)(n3+n4)−n22(n1+n2)−n1+n32(n1+n2+n3) +12[1(n1+n2)(n3+n4)−1(n1+n2)n3]⋅δA +12[1(n1+n2)(n3+n4)−1(n1+n2+n3)n4]⋅δB =g(δA,δB).

Note that is a linear function in and and that the coefficient of is negative. Moreover, the assumption implies that the coefficient of is negative too. Thus, using the fact that and , we have

 g(δA,δB)>g(n1n3,n2n4)=0,

from which follows, as required.

It remains to consider the case that , that is, . We apply the same rNNI to as in the previous case and, using similar calculations, we obtain

 σD(T′)−σD(T′′)=n42(n3+n4)+12[1n1(n3+n4)−1n1n3]⋅δA>0,

using again .

In the following main result of this section we note that even for the non-generic case a weak form of consistency holds.

###### Theorem 4

Let and an interior-positive normalized equidistant edge weighting for . Put . Then we have

 σD(T)=min{σD(T′):T′∈BRX}.

If is generic, then is the unique tree in minimizing the NEME score for and a local search using rooted nearest neighbor interchanges starting from any tree in will arrive at after a finite number of steps.

{@proof}

[Proof.]For generic , the theorem is an immediate consequence of Lemma 3. So, assume that is not generic and, for a contradiction, that there exists some with . For any real number , define the NEEW of by putting for all edges of not adjacent to a vertex in and for all other edges of . Put and note that, by construction, is a generic ultrametric that is represented by . As a consequence, must hold. But this contradicts in view of the fact that, as tends to , tends to while tends to .

The result in Theorem 4 immediately raises the question whether the NEME score for any input distance matrix is minimized by some binary rooted phylogenetic tree. It is known (28) that for unrooted phylogenetic trees the balanced minimum evolution score is indeed always minimized for some unrooted binary phylogenetic tree. We end this section establishing that, in contrast, the answer to the above question for the NEME score is no.

###### Lemma 5

There exist dissimilarities with

 min{σD(T):T∈RX}
{@proof}

[Proof.]Consider a binary rooted phylogenetic tree whose structure is as depicted in Figure 4(a). It consists of two rooted binary subtrees and , each having leaves. In addition, there is a single leaf adjacent to vertex . Let be an interior positive normalized equidistant edge weighting for such that and for some . Put .

Next, consider the non-binary rooted phylogenetic tree depicted in Figure 4(b). It is constructed from by contracting the edge between and into the vertex . Put . To show that , it suffices, by Equation (1), to show that

 h(T,ω)(u)+h(T,ω)(v) =s+1>2h(T′,ω′)(w)=2m2+4msm2+2m,

which can easily be checked to be the case, in view of , for any . Hence, by Theorem 4, we have .

## 5 The NEME problem is NP-hard

To establish NP-hardness of the NEME problem, we use a reduction from the well-known NP-hard graph coloring problem (see e.g. (29)). More specifically, we consider the following variant of this problem:
Input: A graph with .
Question: Can be partitioned into 4 subsets with such that no edge has both endpoints in the same set for some ? We call any such partition a 4-coloring of .

Note that the additional constraint that the sets in a 4-coloring are all of the same size is merely added to simplify the description of the reduction. It preserves the NP-hardness of the graph coloring problem in view of the fact that adding isolated vertices to any graph does not change the minimum number of colors that suffice to color . We first present a technical lemma that will be used in the construction below.

###### Lemma 6

Let be a set with elements, , , that is partitioned into the sets , and with and . In addition, let be a real number and a dissimilarity on with for all , and for all other . Then any binary rooted phylogenetic tree on with must contain two distinct vertices with (i) , (ii) , (iii) and .

{@proof}

[Proof.]First consider a binary rooted phylogenetic tree on that contains vertices and with properties (i)-(iii). Note that this implies that and have the same parent and that is the single child of the root . Moreover, for all , , we have and, by Lemma 1(ii), this is the smallest possible value for a rooted phylogenetic tree with leaves.

Next consider any binary rooted phylogenetic tree on that does not contain two vertices and satisfying properties (i)-(iii). Let denote the single child of and consider the two children and of . In particular, and must violate at least one of the properties (i)-(iii). By construction we have . Hence, one of the properties (ii) or (iii) must be violated.

First consider the case that (iii) is violated. This implies, without loss of generality, that there exist and with . In view of and Lemma 1(ii) we have

 αT′{a,b}≥2(2(m+k)−1)2=12(m+k)2−2(m+k)+1/2

Next consider the case that property (iii) is satisfied but (ii) is violated for and . Then, for any and any , we have

 αT′{a,b}≥12(m+k+1)(m+k−1)=12(m+k)2−2

Noting that , we calculate a lower bound on the difference between the coefficients in and :

 12(m+k)2−2−12(m+k)2≥12(m+k)4.

This implies, using Lemma 1(i) to obtain the upper bound in the second line below:

 σD(T) =∑{x,y}∈(X2)αT{x,y}D(x,y) ≤⎡⎣∑a∈A,b∈BαT{a,b}D(a,b)⎤⎦+s(m+k)3(m+k)5 <⎡⎣∑a∈A,b∈BαT{a,b}s⎤⎦+s2(m+k)4 ≤∑a∈A,b∈BαT′{a,b}s≤∑{x,y}∈(X2)αT′{x,y}D(x,y)=σD(T′)

Hence, cannot be an optimal tree for .

Next we describe how to construct for a given graph with a suitable dissimilarity . First construct a set that is the disjoint union of , and with and where . Note that

 |X|=2k(n+1)≤2log2(n+1)+4(n+1)=16(n+1)2.

Put , . In addition, put and, for , . Moreover, put . The values will be the possible distances between elements in .

Now, recursively partition the set so as to force a fully balanced binary tree as a backbone structure. More precisely, put and define, for all and all , sets and so that , and hold. Select an element for each .

Next we construct the dissimilarity on :

• For all and all we put . The elements in are just used to fill subtrees so that we get a fully balanced backbone tree.

• For all , , we put where is the largest index in with for some . The distances between the elements in force a fully balanced backbone tree by Lemma 6 (cf. Figure 5).

• For all and all we put . And for all and all we put . This will force that a subset of vertices of is grouped together with each , , in the same subtree.

• For all , , we put if and otherwise. These distances capture the structure of and are so small that they do not interfere with forming the fully balanced backbone tree.

###### Lemma 7

Let be a graph with vertices and let be the dissimilarity on constructed above. Then has a 4-coloring if and only if there exists a binary rooted phylogenetic tree on with not larger than

 [2k−2k∑i=1si