Metric embeddings with outliers
Abstract
We initiate the study of metric embeddings with outliers. Given some metric space $(X,\rho)$ we wish to find a small set of outlier points $K \subset X$ and either an isometric or a low-distortion embedding of $(X\setminus K,\rho)$ into some target metric space. This is a natural problem that captures scenarios where a small fraction of points in the input corresponds to noise.
For the case of isometric embeddings we derive polynomial-time approximation algorithms for minimizing the number of outliers when the target space is an ultrametric, a tree metric, or some constant-dimensional Euclidean space. The approximation factors are 3, 4, and 2, respectively. For the case of embedding into an ultrametric or tree metric, we further improve the running time to $O(n^2)$ for an $n$-point input metric space, which is optimal. We complement these upper bounds by showing that outlier embedding into ultrametrics, trees, and $d$-dimensional Euclidean space for any $d \geq 2$ are all NP-hard, as well as NP-hard to approximate within a factor better than 2 assuming the Unique Games Conjecture.
For the case of non-isometries we consider embeddings with small $\ell_\infty$ distortion. We present polynomial-time bicriteria approximation algorithms. Specifically, given some $\varepsilon > 0$, let $k_\varepsilon$ denote the minimum number of outliers required to obtain an embedding with distortion $\varepsilon$. For the case of embedding into ultrametrics we obtain a polynomial-time algorithm which computes a set of at most $3k_\varepsilon$ outliers and an embedding of the remaining points into an ultrametric with distortion $O(\varepsilon \log n)$. Finally, for embedding a metric of unit diameter into constant-dimensional Euclidean space we present a polynomial-time algorithm which computes a set of at most $2k_\varepsilon$ outliers and an embedding of the remaining points with distortion $O(\sqrt{\varepsilon})$.
1 Introduction
Metric embeddings provide a framework for addressing in a unified manner a variety of data-analytic tasks. Let $(X,\rho)$, $(Y,\rho')$ be metric spaces. At the high level, a metric embedding is a mapping $f : X \to Y$ that either is isometric or preserves the pairwise distances up to some small error called the distortion. (Various definitions of distortion have been extensively considered, including multiplicative, additive, average, and $\ell_p$ distortion, as well as expected distortion when the map is random [6].) The corresponding computational problem is to decide whether an isometry exists or, more generally, to find a mapping with minimum distortion. The space $Y$ might either be given or it might be constrained to be a member of a collection of spaces, such as trees, ultrametrics, and so on. The problems that can be geometrically abstracted using this language include phylogenetic reconstruction (e.g. via embeddings into trees [1, 2, 11] or ultrametrics [20, 4]), visualization (e.g. via embeddings into constant-dimensional Euclidean space [8, 21, 10, 32, 19, 9, 17, 22]), and many more (for a more detailed exposition we refer the reader to [26, 25]).
Despite extensive research on the above metric embedding paradigm, essentially nothing is known when the input space can contain outliers. This scenario is of interest for example in applications where outliers can arise from measurement errors. Another example is when real-world data does not perfectly fit a model due to mathematical simplifications of physical processes.
We propose a generalization of the above high-level metric embedding problem which seeks to address such scenarios: Given $(X,\rho)$ and $(Y,\rho')$ we wish to find some small $K \subseteq X$ and either an isometric or low-distortion mapping $f : X \setminus K \to Y$. We refer to the points in $K$ as outliers.
We remark that it is easy to construct examples of spaces $X$, $Y$ where any embedding $f : X \to Y$ has arbitrarily large distortion (for any “reasonable” notion of distortion), yet there exists $x \in X$ and an isometry $f' : X \setminus \{x\} \to Y$. Thus new ideas are needed to tackle the more general metric embedding problem in the presence of outliers.
1.1 Our contribution
Approximation algorithms.
We focus on embeddings into ultrametrics, trees, and constant-dimensional Euclidean space. We first consider the problem of computing a minimum-size set of outliers such that the remaining point set admits an isometry into some target space. We refer to this task as the minimum outlier embedding problem.
Outlier embeddings into ultrametrics. It is well-known that a metric space is an ultrametric if and only if any 3-point subset is an ultrametric. We may therefore obtain a 3-approximation as follows: For all $x, y, z \in X$, if the triple $(x,y,z)$ is not an ultrametric then remove $x$, $y$, and $z$ from $X$. It is rather easy to see that this gives a 3-approximation for the minimum outlier embedding problem into ultrametrics (as for every triple of points that we remove, at least one of them must be an outlier in any optimal solution), with running time $O(n^3)$. By exploiting further structural properties of ultrametrics, we obtain a 3-approximation with running time $O(n^2)$. We remark that this running time is optimal since the input has size $\Theta(n^2)$ and it is straightforward to show that any approximation has to read all the input (e.g. even to determine whether $X$ is an ultrametric, which corresponds to the case where the minimum number of outliers is zero).
Outlier embeddings into trees. Similarly to the case of ultrametrics, it is known that a space is a tree metric if and only if any 4-point subset is a tree metric. This similarly leads to a 4-approximation algorithm in $O(n^4)$ time. We further improve the running time to $O(n^2)$, which is also optimal. However, obtaining this improvement is significantly more complicated than in the case of ultrametrics.
Outlier embeddings into $\mathbb{R}^d$. It is known that for any $d$, any metric space admits an isometric embedding into $d$-dimensional Euclidean space if and only if any subset of size $d+3$ does [33]. This immediately implies a $(d+3)$-approximation algorithm for outlier embedding into $d$-dimensional Euclidean space with running time $O(n^{d+3})$, for any $d$. Using additional rigidity properties of Euclidean space we obtain a 2-approximation with the same running time.
Hardness of approximation.
We show that, assuming the Unique Games Conjecture [28], the problems of computing a minimum outlier embedding into ultrametrics, trees, and $d$-dimensional Euclidean space for any $d \geq 2$, are all NP-hard to approximate within a factor of $2-\varepsilon$, for any $\varepsilon > 0$. These inapproximability results are obtained by combining reductions from Vertex Cover to minimum outlier embedding and the known hardness result for the former problem [29]. Note that for the case of embedding into $d$-dimensional Euclidean space for any $d \geq 2$ this inapproximability result matches our upper bound.
Bicriteria approximation algorithms.
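We also consider non-isometric embeddings. All our results concern $\ell_\infty$ distortion. For some outlier set $K \subseteq X$, the $\ell_\infty$ distortion of some map $f : X \setminus K \to Y$, where $\rho$ and $\rho'$ denote the metrics of the input and target spaces, is defined to be
$$\sup_{x,y \in X \setminus K} \left| \rho(x,y) - \rho'(f(x),f(y)) \right|.$$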
In this context there are two different objectives that we wish to minimize: the number of outliers and the distortion. For a compact metric space , denote by the diameter of .
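As a small illustration of the two objectives, the following sketch computes the additive ($\ell_\infty$) distortion restricted to the non-outlier points. It assumes the usual additive definition (the largest absolute difference between an input distance and the corresponding image distance); the function name `linf_distortion` and the distance-matrix representation are hypothetical choices made for this example only.

```python
from itertools import combinations

def linf_distortion(d_in, d_out, outliers=()):
    """Largest additive error |d_in[i][j] - d_out[i][j]| over pairs of
    non-outlier points. Both arguments are symmetric distance matrices over
    the same index set; d_out holds the distances between the images."""
    removed = set(outliers)
    kept = [i for i in range(len(d_in)) if i not in removed]
    return max(
        (abs(d_in[i][j] - d_out[i][j]) for i, j in combinations(kept, 2)),
        default=0.0,
    )

# Points 0, 1, 2 lie on a line; point 3 is "noise" at distance 1 from everyone.
d_in = [[0, 1, 2, 1],
        [1, 0, 1, 1],
        [2, 1, 0, 1],
        [1, 1, 1, 0]]
# Images on the real line at positions 0, 1, 2, 5.
d_out = [[0, 1, 2, 5],
         [1, 0, 1, 4],
         [2, 1, 0, 3],
         [5, 4, 3, 0]]

all_points = linf_distortion(d_in, d_out)
without_noise = linf_distortion(d_in, d_out, outliers={3})
```

Here discarding the single noisy point turns a map with additive error 4 into an isometry on the remaining points, which is exactly the trade-off between the number of outliers and the distortion.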
Definition 1.1 (Outlier embedding).
We say that admits a outlier embedding into if there exists with and some with distortion at most . We refer to as the outlier set that witnesses a outlier embedding of .
Note that the multiplication of the distortion by is to make the parameter scale-free. An isometry can be trivially achieved by removing all but one point; thus the above notion is well-defined for all . We now state our main results concerning bicriteria approximation:
Bicriteria outlier embeddings into ultrametrics: We obtain a polynomial-time algorithm which given an $n$-point metric space and some such that admits a outlier embedding into an ultrametric, outputs a outlier embedding into an ultrametric.
Bicriteria outlier embeddings into : We present an algorithm which given an $n$-point metric space and some such that admits a outlier embedding in , outputs a outlier embedding of into . The algorithm runs in time .
Bicriteria outlier embeddings into trees: Finally we mention that one can easily derive a bicriteria approximation for outlier embedding into trees by the work of Gromov on hyperbolicity [23] (see also [14]). Formally, there exists a polynomial-time algorithm which given a metric space and some such that admits a outlier embedding into a tree, outputs a outlier embedding into a tree. Let us briefly outline the proof of this result: hyperbolicity is a four-point condition such that any hyperbolic space admits an embedding into a tree with distortion , and such an embedding can be computed in polynomial time. Any metric that admits an embedding into a tree with distortion is hyperbolic. Thus by removing all 4-tuples of points that violate the hyperbolicity condition and applying the embedding from [23] we immediately obtain an outlier embedding into a tree. We omit the details.
1.2 Previous work
Over the recent years there has been a lot of work on approximation algorithms for minimum distortion embeddings into several host spaces and under various notions of distortion. Perhaps the most well-studied case is that of multiplicative distortion. For this case, approximation algorithms and inapproximability results have been obtained for embedding into the line [34, 10, 8, 22], constant-dimensional Euclidean space [9, 17, 19, 32, 10], trees [11, 15], ultrametrics [4], and other graph-induced metrics [15]. We also mention that similar questions have been considered for the case of bijective embeddings [35, 24, 27, 19, 30]. Analogous questions have also been investigated for average [18], additive [5], $\ell_p$ [2], and $\ell_\infty$ distortion [20, 1].
Similar in spirit to the outlier embeddings introduced in this work is the notion of embeddings with slack [12, 13, 31]. In this scenario we are given a parameter $\varepsilon > 0$ and we wish to find an embedding that preserves a $(1-\varepsilon)$-fraction of all pairwise distances up to a certain distortion. We remark however that these mappings cannot in general be used to obtain outlier embeddings. This is because typically in an embedding with slack the pairwise distances that are distorted arbitrarily involve a large fraction of all points.
1.3 Discussion
Our work naturally leads to several directions for further research. Let us briefly discuss the most prominent ones.
An obvious direction is closing the gap between the approximation factors and the inapproximability results for embedding into ultrametrics and trees. Similarly, it is important to understand whether the running time of the 2-approximation for embedding into Euclidean space can be improved. More generally, an important direction is understanding the approximability of outlier embeddings into other host spaces, such as planar graphs and other graph-induced metrics.
In the context of bicriteria outlier embeddings, another direction is to investigate different notions of distortion. The case of $\ell_\infty$ distortion studied here is a natural starting point since it is very sensitive to outliers. It seems promising to try to adapt existing approximation algorithms for $\ell_p$, multiplicative, and average distortion to the outlier case.
Finally, it is important to understand whether improved guarantees or matching hardness results for bicriteria approximations are possible.
2 Definitions
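A metric space is a pair $(X,\rho)$ where $X$ is a set and $\rho : X \times X \to \mathbb{R}_{\geq 0}$ such that (i) for any $x,y \in X$, $\rho(x,y) = \rho(y,x)$, (ii) $\rho(x,y) = 0$ if and only if $x = y$, and (iii) for any $x,y,z \in X$, $\rho(x,z) \leq \rho(x,y) + \rho(y,z)$. Given two metric spaces $(X,\rho)$ and $(Y,\rho')$, an embedding of $X$ into $Y$ is simply a map $f : X \to Y$, and $f$ is an isometric embedding if for any $x,y \in X$, $\rho(x,y) = \rho'(f(x),f(y))$.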
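For finite inputs the three axioms can be verified directly. The sketch below is only illustrative: it assumes the metric is given as a square matrix `d` of pairwise distances, and both the function name `is_metric` and the numerical tolerance `tol` are choices made for this example.

```python
def is_metric(d, tol=1e-12):
    """Check symmetry, identity of indiscernibles and the triangle
    inequality for a finite distance matrix d (list of lists)."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:               # a point is at distance 0 from itself
            return False
        for j in range(i + 1, n):
            if abs(d[i][j] - d[j][i]) > tol:  # symmetry
                return False
            if d[i][j] <= tol:                # distinct points are at positive distance
                return False
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:  # triangle inequality
                    return False
    return True

line = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
broken = [[0, 1, 3],
          [1, 0, 1],
          [3, 1, 0]]   # 3 > 1 + 1 violates the triangle inequality
```

The cubic loop mirrors axiom (iii) literally; it is meant as a specification check rather than an efficient routine.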
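In this paper our input is an $n$-point metric space $(X,\rho)$, meaning that $X$ is a discrete set of cardinality $n$. Given an $n$-point metric space $(X,\rho)$ and a value $\alpha$, we denote by $(X,\rho+\alpha)$ the metric space $(X,\rho')$ where for any $x \neq y \in X$ we have $\rho'(x,y) = \rho(x,y) + \alpha$.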
Definition 2.1 (Ultrametric space).
A metric space $(X,\rho)$ is an ultrametric (tree) space if and only if the following three-point condition holds for any $x,y,z \in X$:
(1) $\rho(x,y) \leq \max\{\rho(x,z),\ \rho(z,y)\}$
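The three-point condition is easy to test on a finite distance matrix. The sketch below uses the standard equivalent form that, in every triple, the two largest pairwise distances coincide; the function names and the matrix representation are assumptions made for illustration.

```python
from itertools import combinations

def triple_is_ultrametric(d, i, j, k, tol=1e-12):
    """Three-point condition on {i, j, k}: requiring
    d(x, y) <= max(d(x, z), d(z, y)) for every labelling of the triple is
    the same as requiring the two largest of the three distances to be equal."""
    _, second, largest = sorted((d[i][j], d[i][k], d[j][k]))
    return largest - second <= tol

def is_ultrametric(d, tol=1e-12):
    """Brute-force O(n^3) check over all triples."""
    return all(triple_is_ultrametric(d, i, j, k, tol)
               for i, j, k in combinations(range(len(d)), 3))

ultra = [[0, 1, 2],
         [1, 0, 2],
         [2, 2, 0]]
line = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]   # distances 1, 1, 2: the two largest differ
```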
Definition 2.2 (Tree metric).
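A metric space $(X,\rho)$ is a tree metric if and only if the following four-point condition holds for any $x,y,z,w \in X$:
(2) $\rho(x,y) + \rho(z,w) \leq \max\{\rho(x,z) + \rho(y,w),\ \rho(x,w) + \rho(y,z)\}$
An equivalent formulation of the four-point condition is that for all $x,y,z,w \in X$, the largest two quantities of the following three terms are equal:
(3) $\rho(x,y) + \rho(z,w)$, $\quad \rho(x,z) + \rho(y,w)$, $\quad \rho(x,w) + \rho(y,z)$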
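The "largest two of three sums" formulation translates directly into code. The following is a minimal sketch on a distance matrix; the names `four_point_ok` and `is_tree_metric` are hypothetical, and the brute-force check over all 4-subsets is only meant to make the definition concrete.

```python
from itertools import combinations

def four_point_ok(d, x, y, z, w, tol=1e-12):
    """Four-point condition on {x, y, z, w}: among the three pairings
    xy|zw, xz|yw, xw|yz, the two largest sums must be equal."""
    sums = sorted((d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]))
    return sums[2] - sums[1] <= tol

def is_tree_metric(d, tol=1e-12):
    """Brute-force O(n^4) check over all 4-subsets."""
    return all(four_point_ok(d, *quad, tol=tol)
               for quad in combinations(range(len(d)), 4))

# Four points on a path with unit edges: a tree metric.
path = [[abs(i - j) for j in range(4)] for i in range(4)]
# Shortest-path metric of the 4-cycle: the classic non-tree example.
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
```

For the 4-cycle the three sums are 2, 4 and 2, so the two largest differ and the condition fails.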
In particular, an $n$-point tree metric $(X,\rho)$ can be realized by a weighted tree $T$ such that there is a map $\phi : X \to V(T)$ into the set of nodes of $T$, and that for any $x,y \in X$, the shortest path distance $d_T(\phi(x),\phi(y))$ in $T$ equals $\rho(x,y)$. In other words, $\phi$ is an isometric embedding of $(X,\rho)$ into the graphic tree metric $(T,d_T)$. An ultrametric $(X,\rho)$ is in fact a special case of tree metric, where there is an isometric embedding $\phi$ to a rooted tree $T$ such that the points $\phi(X)$ are leaves of $T$ and all leaves are at equal distance from the root of $T$.
3 Approximation algorithms for outlier embeddings
In this section we present approximation algorithms for the minimum outlier embedding problem for three types of target metric spaces: ultrametrics, tree metrics, and Euclidean metric spaces. We show in Appendix B that finding optimal solutions for each of these problems is NP-hard (and roughly speaking hard to approximate within a factor of 2 as well). In the cases of ultrametric and tree metrics, it is easy to approximate the minimum outlier embedding within a constant factor in $O(n^3)$ and $O(n^4)$ time, respectively. The key challenge (especially for embedding into a tree metric) is to improve the time complexity of the approximation algorithm to $O(n^2)$, which is optimal.
3.1 Approximating outlier embeddings into ultrametrics
Theorem 3.1.
Given an $n$-point metric space $(X,\rho)$, there exists a 3-approximation algorithm for minimum outlier embedding into ultrametrics, with running time $O(n^2)$.
Proof.
We can obtain a polynomial-time 3-approximation algorithm as follows: For each triple of points $x,y,z \in X$, considered in some arbitrary order, check whether it satisfies (1). If not, then remove $x$, $y$, and $z$ from $X$ and continue with the remaining triples. Let $K$ be the set of removed points. For every triple of points removed, at least one must be in any optimal solution; therefore the resulting solution is a 3-approximation. The running time of this method is $O(n^3)$. We next show how to improve the running time to $O(n^2)$.
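Let $X = \{x_1,\ldots,x_n\}$. We inductively compute a sequence $X_0, X_1, \ldots, X_n$, where $X_0$ is set to be $\emptyset$. Given $X_{i-1}$ for some $i$, assuming the invariant that $(X_{i-1},\rho)$ is an ultrametric, we compute $X_i$ as follows. We check whether $(X_{i-1} \cup \{x_i\},\rho)$ is an ultrametric. If it is, then we set $X_i = X_{i-1} \cup \{x_i\}$. Otherwise, there must exist a triple in $X_{i-1} \cup \{x_i\}$ that violates (1). Since $(X_{i-1},\rho)$ is an ultrametric, it follows that every such triple must contain $x_i$. Therefore it suffices to show how to quickly find $z_1,z_2 \in X_{i-1}$ such that $\{x_i,z_1,z_2\}$ violates (1), if they exist. To this end, let $y$ be a nearest neighbor of $x_i$ in $X_{i-1}$, that is $y = \arg\min_{z \in X_{i-1}} \rho(x_i,z)$, where we break ties arbitrarily. Instead of checking $x_i$ against all possible $z_1,z_2$ from $X_{i-1}$, we claim that (1) holds for all $\{x_i,z_1,z_2\}$ with $z_1,z_2 \in X_{i-1}$ if and only if for all $z \in X_{i-1}$ we have
(4) (i) $\rho(x_i,y) \leq \rho(x_i,z)$, and (ii) $\rho(x_i,z) = \max\{\rho(x_i,y),\ \rho(y,z)\}$.
Indeed, assume that (i) and (ii) above hold for all $z \in X_{i-1}$, yet there exist some $z_1,z_2 \in X_{i-1}$ such that $\{x_i,z_1,z_2\}$ violates (1), say w.l.o.g., $\rho(x_i,z_1) > \max\{\rho(x_i,z_2),\ \rho(z_1,z_2)\}$. Then by (i) and (ii) above, we have $\rho(y,z_1) = \rho(x_i,z_1)$ and $\rho(y,z_2) \leq \rho(x_i,z_2)$, implying that $\rho(y,z_1) > \max\{\rho(y,z_2),\ \rho(z_1,z_2)\}$. Hence $\{y,z_1,z_2\}$ also violates (1), contradicting the fact that $(X_{i-1},\rho)$ is an ultrametric. If instead $\rho(z_1,z_2)$ is the strict maximum, then by (ii) it also exceeds $\rho(y,z_1)$ and $\rho(y,z_2)$, and the same contradiction arises. Hence no such $z_1,z_2$ can exist, and (4) is sufficient to check whether $X_{i-1} \cup \{x_i\}$ induces an ultrametric or not.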
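The incremental scheme from this proof can be sketched as follows. The sketch makes several assumptions that are not spelled out in the text above: the metric is given as a distance matrix `d`, the nearest-neighbor test is taken in the form $\rho(x,z) = \max\{\rho(x,y), \rho(y,z)\}$ (which, for a nearest neighbor $y$ of $x$, is the three-point condition on $\{x,y,z\}$), and on a failed test the violating triple $\{x,y,z\}$ is discarded. The function name is hypothetical.

```python
def ultrametric_outliers(d, tol=1e-12):
    """Incremental 3-approximation sketch. Returns (kept, outliers), where the
    kept points induce an ultrametric and the outliers were removed in
    disjoint triples, each violating the three-point condition.
    Each insertion costs O(n), so the total running time is O(n^2)."""
    kept, outliers = [], []
    for x in range(len(d)):
        if kept:
            # Nearest neighbor of x among the points kept so far.
            y = min(kept, key=lambda z: d[x][z])
            # Look for some z with d(x, z) != max(d(x, y), d(y, z)).
            violator = next(
                (z for z in kept
                 if z != y and abs(d[x][z] - max(d[x][y], d[y][z])) > tol),
                None,
            )
            if violator is not None:
                # {x, y, violator} violates the three-point condition.
                kept = [z for z in kept if z != y and z != violator]
                outliers += [x, y, violator]
                continue
        kept.append(x)
    return kept, outliers

# Two clusters {0, 1} and {2, 3} at internal distance 1, cross distance 2.
clean = [[0, 1, 2, 2],
         [1, 0, 2, 2],
         [2, 2, 0, 1],
         [2, 2, 1, 0]]
# Same ultrametric plus a point 4 whose distances (1, 2, 2, 2) break it.
noisy = [[0, 1, 2, 2, 1],
         [1, 0, 2, 2, 2],
         [2, 2, 0, 1, 2],
         [2, 2, 1, 0, 2],
         [1, 2, 2, 2, 0]]
```

On `noisy` a single outlier (point 4) would suffice, while the sketch removes the triple {4, 0, 1}, consistent with a factor-3 guarantee.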
3.2 Approximating outlier embeddings into trees
We now present a 4-approximation algorithm for embedding a given $n$-point metric space into a tree metric with a minimum number of outliers. Using the four-point condition (2) in Definition 2.2, it is fairly simple to obtain a 4-approximation algorithm for the problem with running time $O(n^4)$ as follows: Check all 4-tuples of points $x,y,z,w \in X$. If the 4-tuple violates the four-point condition, then remove $x,y,z,w$ from $X$. It is immediate that for any such 4-tuple, at least one of its points must be an outlier in any optimal solution. It follows that the result is a 4-approximation.
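The simple quartic-time procedure just described can be sketched as below. This is only the naive baseline, not the faster algorithm developed next; the distance-matrix input and the function names are assumptions made for illustration.

```python
from itertools import combinations

def four_point_ok(d, x, y, z, w, tol=1e-12):
    """The two largest of the three pairing sums must coincide."""
    sums = sorted((d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]))
    return sums[2] - sums[1] <= tol

def tree_outliers_naive(d, tol=1e-12):
    """Scan all 4-tuples; whenever a tuple of still-present points violates
    the four-point condition, remove all four points. Returns (kept, removed)."""
    alive = set(range(len(d)))
    removed = []
    for quad in combinations(range(len(d)), 4):
        if all(p in alive for p in quad) and not four_point_ok(d, *quad, tol=tol):
            alive.difference_update(quad)
            removed.extend(quad)
    return sorted(alive), removed

# Five points on a path (a tree metric) and the 4-cycle (not a tree metric).
path = [[abs(i - j) for j in range(5)] for i in range(5)]
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
```

Because the removed 4-tuples are pairwise disjoint and each contains at least one outlier of any optimal solution, at most four times the optimal number of points is removed.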
We next show how to implement this approach in time $O(n^2)$. The main technical difficulty is in finding a set of violating 4-tuples quickly. The high-level description of the algorithm is rather simple, and is as follows. Let $(X,\rho)$ be the input metric space where $X = \{x_1,\ldots,x_n\}$. Set $X_0 = \emptyset$. For any $i = 1,\ldots,n$, we inductively define $X_i$. At the beginning of the $i$th iteration, we maintain the invariant that $(X_{i-1},\rho)$ is a tree metric. If $(X_{i-1} \cup \{x_i\},\rho)$ is a tree metric, then we set $X_i = X_{i-1} \cup \{x_i\}$. Otherwise there must exist $a,b,c \in X_{i-1}$ such that the 4-tuple $\{x_i,a,b,c\}$ violates the four-point condition; we set $X_i = X_{i-1} \setminus \{a,b,c\}$.
To implement this idea in $O(n^2)$ time, it suffices to show that for any $i$, given $X_{i-1}$, we can compute $X_i$ in time $O(n)$. The algorithm will inductively compute a collection of edge-weighted trees $T_0, T_1, \ldots, T_n$, with $T_0$ simply being the graph with $V(T_0) = \emptyset$, and maintain the following invariants for each $i$:
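(I1) $X_i \subseteq V(T_i)$ and all leaves of $T_i$ are in $X_i$. $(X_i,\rho)$ embeds isometrically into $T_i$; that is, the shortest-path metric $d_{T_i}$ of $T_i$ agrees with $\rho$ on $X_i$: for any $x,y \in X_i$, $d_{T_i}(x,y) = \rho(x,y)$.
(I2) At the $i$th iteration either $X_i = X_{i-1} \cup \{x_i\}$ or $X_i = X_{i-1} \setminus \{a,b,c\}$ where the 4-tuple $\{x_i,a,b,c\}$ violates the four-point condition under metric $\rho$.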
Definition 3.2 (Leaf augmentation).
Given , let be a tree with . Given and , the leaf augmentation of at is the tree obtained as follows. Let be the path in between and (which may contain a single vertex if ). Set . Let be a vertex in with ; if no such vertex exists then we introduce a new such vertex by subdividing the appropriate edge in and update the edge lengths accordingly. In the resulting tree we add the vertex and the edge if they do not already exist, and we set the length of to be . We call the stem of (w.r.t. the leaf augmentation). When we say that is the leaf augmentation of at , in which case is obtained from simply by adding as a leaf attached to , and is the stem of .
In what follows, we set to be the nearest neighbor of in , that is
where we break ties arbitrarily. Intuitively, if we can build a new tree from so that can be isometrically embedded in , then is a leaf augmentation of at some pair . Our approach will first compute an auxiliary structure, called orientation on , to help us identify a potential leaf augmentation. We next check for the validity of this leaf augmentation. The key is to produce this candidate leaf augmentation such that if it is not valid, then we will be able to find a 4-tuple violating the four-point condition from it quickly.
Definition 3.3 (orientation).
Let and let be a tree with . Let and . The orientation of is a partially oriented tree obtained as follows: Let be the leaf augmentation of at , and let be the stem of . We orient every edge in towards , where is the unique path in between and . All other edges in remain unoriented.
If then there exists a unique edge in (which is subdivided in ); this edge remains undirected in . We call this edge the sink edge w.r.t. . If there is no sink edge, then there is a unique vertex in with no outgoing edges in , which we call the sink vertex w.r.t. . Note that the sink is the simplex of smallest dimension that contains the stem of w.r.t. the leaf augmentation at .
See the right figure for an example: where is stem of in the leaf augmentation at . The thick path is oriented, other than the sink edge (the one that contains the stem ).
Definition 3.4 (orientation).
An orientation of is any partial orientation of obtained via the following procedure: Consider any ordering of , say . Start with , i.e. all edges in are initialized as undirected and we will iteratively modify their orientation. Process vertices in this order. For each , denote by the path in between and . Traverse starting from until we reach either or an edge which is already visited. For each unoriented edge we visit, we set its orientation to be the one in the orientation of . An edge that is visited in the above process is called masked.
Since the above procedure is performed for all leaves of , an orientation will mask all edges. However, a masked edge may not be oriented, in which case this edge must be the sink edge w.r.t. for some .
Definition 3.5 (Sinks).
Given an orientation of tree , a sink is either an unoriented edge, or a vertex such that all incident edges have an orientation toward . The former is also called a sink edge w.r.t. and the latter a sink vertex w.r.t. .
It can be shown that each sink edge/vertex must be a sink edge/vertex w.r.t. for some , and we call a generating vertex for this sink.
An orientation may have multiple sinks. We further augment the orientation to record a generating vertex for every sink (there may be multiple choices of for a single sink, and we can take an arbitrary one). We also remark that a sink w.r.t. some may not ultimately be a sink for the global orientation: see the right figure for an example, where is a sink vertex w.r.t. , but not a sink vertex for the global orientation.
The proofs of the following two results can be found in Appendix A.
Lemma 3.6.
An orientation of (together with a generating vertex for each sink) can be computed in time.
Lemma 3.7.
Any orientation of has at least one sink.
[Figure 1: three panels, (a), (b), and (c), illustrating the cases in the proof of Lemma 3.8.]
Lemma 3.8.
For any , given , we can compute and satisfying invariants (I1) and (I2) in $O(n)$ time.
Proof.
It suffices to show that in time we can either find a tuple of points in , that violates the four-point condition, or we can compute a tree having a shortest-path metric that agrees with on . By Lemma 3.6, we can compute an orientation of in time. Consider any sink of (whose existence is guaranteed by Lemma 3.7), and let be its associated generating vertex; must be in . Let be the leaf augmentation of at , and let denote the shortest path metric on the tree .
Since is the leaf augmentation of , we have for all , (the last equality is because embeds isometrically into ). Thus may only disagree with on pairs of points , for some . We check in time if, for all , we have via a traversal of starting from the stem of in . If the above holds, then obviously embeds isometrically into . We then set and output . Otherwise, let be such that . We now show that we can find a tuple including that violates the four-point condition in constant time.
Let be the stem of in . Consider as rooted at and let be the lowest common ancestor of and . Note that must be a vertex from too. Let denote the unique path in between any two and . The vertex must be in the path .

Case 1: . In this case, is either in the interior of path or of path . Assume w.l.o.g. that is in the interior of ; the handling of the other case is completely symmetric. See Figure 1 (a) for an illustration. Since is a tree metric, we know that the tuple should satisfy the four-point condition under the metric . Using the alternative formulation of the four-point condition in Definition 2.2, we have that the largest two quantities of the following three terms should be equal:
(5)
For this specific configuration of , we further have:
(6)
On the other hand, by construction, we know that agrees with on . Furthermore, since is the leaf augmentation of at , we have that and . Hence (6) can be rewritten as
(7)
If , then the largest two quantities of
can no longer be equal as . Hence the tuple violates the four-point condition under the metric (by using (3)).

Case 2: , in which case must be a sink vertex: see Figure 1 (b) for an illustration. For this configuration of , it is necessary that
(8)
Hence if , then the tuple violates the four-point condition under the metric because
What remains is to find a violating 4-tuple for the case when .
Now imagine performing the leaf augmentation of at . We first argue that the stem of w.r.t. necessarily lies in in . Let . In the augmented tree , . Combining (8) and we have that . On the other hand, following Definition 3.2, the position of is such that , while the position of was that . It then follows that must lie in the interior of path .
Since the stem of w.r.t. is in , it means that before we process in the construction of the orientation , there must exist some other leaf such that the process of assigns the orientation of the edge to be towards ; See Figure 1 (c). This is because if no such exists, then while processing , we would have oriented the edge towards stem , thus towards , as the stem is in . The point can be identified in constant time if during the construction of , we also remember, for each edge, the vertex the processing of which leads to orienting this edge. Such information can be easily computed in time during the construction of .
Now consider . If , then one can show that is necessarily the stem for w.r.t. as well (by simply computing the position of the stem using Definition 3.2). In this case, considering the 4-tuple , we are back to Case 1 (but for this new 4-tuple), which in turn means that this 4-tuple violates the four-point condition. Hence we are done.
If , then since we orient the edge towards during the process of leaf , the stem of of the leaf augmentation at is in the path . By an argument similar to the proof that is in the interior of above, we can show that . Now consider the 4-tuple : this leads us to an analogous case when for the 4-tuple . Hence by a similar argument as at the beginning of Case 2, we can show that violates the four-point condition under metric .
Putting everything together, in either case, we can identify a 4-tuple , which could be , , or as shown above, that violates the four-point condition under metric . We simply remove these four points, adjust the resulting tree to obtain and set . The overall algorithm takes time as claimed. This proves the lemma. ∎
Theorem 3.9.
There exists a 4-approximation algorithm for minimum outlier embedding into trees, with running time $O(n^2)$.
Proof.
By Lemma 3.8 and induction on $i$, it follows immediately that we can compute $X_n$ and $T_n$ in time $O(n^2)$. By invariant (I1), the output $(X_n,\rho)$ is a tree metric as it can be isometrically embedded into $T_n$. Furthermore, by invariant (I2), each 4-tuple of points we removed forms a violation of the four-point condition, and thus must contain at least one point from any optimal outlier set. As such, the total number of points we removed can be at most four times the size of the optimal solution. Hence our algorithm is a 4-approximation as claimed. ∎
3.3 Approximating outlier embeddings into $\mathbb{R}^d$
In this section, we present a 2-approximation algorithm for the minimum outlier embedding problem into the Euclidean space $\mathbb{R}^d$ in polynomial time, which matches our hardness result in Appendix B. Given two points $p,q \in \mathbb{R}^d$, let $\|p-q\|$ denote the Euclidean distance between $p$ and $q$.
Definition 3.10 ($d$-embedding).
Given a discrete metric space $(X,\rho)$, a $d$-embedding of $X$ is simply an isometric embedding $\phi$ of $X$ into $\mathbb{R}^d$; that is, for any $x,y \in X$, $\|\phi(x)-\phi(y)\| = \rho(x,y)$. We say that $X$ is strongly $d$-embeddable if it has a $d$-embedding, but cannot be isometrically embedded in $\mathbb{R}^{d-1}$. In this case, $d$ is called the embedding dimension of $X$.
Theorem 3.11.
The metric space $(X,\rho)$ is strongly $d$-embeddable in $\mathbb{R}^d$ if and only if there exist $d+1$ points, say $X_0 = \{x_0,\ldots,x_d\} \subseteq X$, such that:
(i) $X_0$ is strongly $d$-embeddable; and
(ii) for any $x,y \in X$, $X_0 \cup \{x,y\}$ is $d$-embeddable.
Furthermore, given an $n$-point metric space $(X,\rho)$, it is known that one can decide whether $(X,\rho)$ is embeddable in some Euclidean space by checking whether a certain matrix derived from the distance matrix is positive semidefinite, and the rank of this matrix gives the embedding dimension of $(X,\rho)$; see e.g. [36].
Following Theorem 3.11, one can easily come up with a $(d+3)$-approximation algorithm for minimum outlier embedding into $\mathbb{R}^d$, by simply checking whether each $(d+3)$-tuple of points is $d$-embeddable, and if not, removing all these points. Our main result below is a 2-approximation algorithm within the same running time. In particular, Algorithm 1 satisfies the requirements of Theorem 3.12, and the proof is in Appendix A.
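To make the positive-semidefiniteness test concrete, the sketch below uses one standard choice of matrix: the Gram matrix of the vectors from a base point to every other point, obtained from the distances via the law of cosines. Whether this is exactly the matrix used in [36] is an assumption. The rank is computed with a simple symmetric elimination with diagonal pivoting; the function name and the tolerance handling are illustrative choices, not a numerically hardened routine.

```python
import math

def euclidean_embedding_dimension(d, tol=1e-9):
    """Return the smallest k such that the finite metric d (distance matrix)
    embeds isometrically into R^k, or None if it is not Euclidean
    (up to the relative tolerance tol)."""
    n = len(d)
    if n <= 1:
        return 0
    m = n - 1
    # Gram matrix of the vectors p_i - p_0, i = 1..n-1 (law of cosines).
    g = [[(d[0][i + 1] ** 2 + d[0][j + 1] ** 2 - d[i + 1][j + 1] ** 2) / 2.0
          for j in range(m)] for i in range(m)]
    scale = max(max(abs(v) for v in row) for row in g) or 1.0
    eps = tol * scale
    remaining = list(range(m))
    rank = 0
    while remaining:
        if any(g[i][i] < -eps for i in remaining):
            return None                    # negative diagonal entry: not PSD
        p = max(remaining, key=lambda i: g[i][i])
        if g[p][p] <= eps:
            # A PSD matrix with zero diagonal must vanish entirely.
            if any(abs(g[i][j]) > eps for i in remaining for j in remaining):
                return None
            break
        remaining.remove(p)
        for i in remaining:                # Schur complement w.r.t. pivot p
            factor = g[i][p] / g[p][p]
            for j in remaining:
                g[i][j] -= factor * g[p][j]
        rank += 1
    return rank

s = math.sqrt(2)
square = [[0, 1, s, 1], [1, 0, 1, s], [s, 1, 0, 1], [1, s, 1, 0]]  # unit square
collinear = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# Star with three unit edges (shortest-path metric): not Euclidean.
star = [[0, 1, 1, 1], [1, 0, 2, 2], [1, 2, 0, 2], [1, 2, 2, 0]]
```

The star fails because its centre would have to be the midpoint of three different pairs of leaves at once.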
Theorem 3.12.
Given an $n$-point metric space $(X,\rho)$, for any $d \geq 1$, there exists a 2-approximation algorithm for minimum outlier embedding into $\mathbb{R}^d$, with running time $O(n^{d+3})$.
Hardness results.
In Appendix B, we show that the minimum outlier embedding problems into ultrametrics, trees and Euclidean space are all NP-hard, by reducing the Vertex Cover problem to them in each case. In fact, assuming the Unique Games Conjecture, it is NP-hard to approximate each of them within $2-\varepsilon$, for any positive $\varepsilon$. For the case of minimum outlier embedding into Euclidean space, we note that our 2-approximation algorithm above matches the hardness result.
4 Bicriteria approximation algorithms
4.1 Bicriteria approximation for embedding into ultrametrics
Let $T = (V,E)$ be a tree with non-negative edge weights. The ultrametric induced by $T$ is the ultrametric $(V,\rho_T)$ where for every $u,v \in V$ we have that $\rho_T(u,v)$ is equal to the maximum weight of the edges in the unique $u$-$v$ path in $T$ (it is easy to verify that the metric constructed as such is indeed an ultrametric [20]). Given a metric space $(X,\rho)$, we can view it as a weighted graph and talk about its minimum spanning tree (MST). The following result is from [20].
Lemma 4.1 (Farach, Kannan and Warnow [20]).
Let be a metric space and let be an ultrametric minimizing . Let be the ultrametric induced by an MST of . Then there exists , such that .
In particular, let . Then . This further implies that .
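The ultrametric induced by an MST can be computed in quadratic time. The sketch below, with hypothetical function names and a distance-matrix input, runs Prim's algorithm on the complete weighted graph and then, from every source, walks the tree while tracking the heaviest edge seen so far.

```python
def mst_edges(d):
    """Prim's algorithm on the complete graph with weights d[u][v].
    Returns the MST as a list of (parent, child, weight) edges."""
    n = len(d)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    parent = [-1] * n
    edges = []
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, d[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and d[u][v] < best[v]:
                best[v], parent[v] = d[u][v], u
    return edges

def induced_ultrametric(n, edges):
    """U[s][t] = maximum edge weight on the unique s-t path in the tree."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    U = [[0.0] * n for _ in range(n)]
    for s in range(n):
        stack = [(s, -1, 0.0)]
        while stack:
            u, par, heaviest = stack.pop()
            U[s][u] = heaviest
            for v, w in adj[u]:
                if v != par:
                    stack.append((v, u, max(heaviest, w)))
    return U

# Three points on a line at positions 0, 1, 3.
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
U = induced_ultrametric(len(d), mst_edges(d))
```

In this example the MST consists of the edges of weight 1 and 2, so the induced ultrametric shrinks the distance 3 to 2 and leaves the other two distances unchanged.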
Theorem 4.2.
There exists a polynomial-time algorithm which given an $n$-point metric space , , and , such that admits a outlier embedding into an ultrametric, outputs a outlier embedding into an ultrametric.
Proof.
For simplicity, assume that the diameter . The algorithm is as follows. We first enumerate all triples . For any such triple, if , then we remove , , and from . Let be the resulting point set. We output the ultrametric induced by an MST of the metric space . This completes the description of the algorithm.
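The filtering step of this algorithm can be sketched as follows. The exact threshold used in the test on each triple is not reproduced here, so the sketch takes it as a parameter `slack` (an assumption): a triple is discarded when its largest distance exceeds its second largest by more than `slack`, which is a relaxed form of the three-point condition. The function name and the distance-matrix input are likewise illustrative. The surviving points would then be passed to the MST-induced ultrametric construction.

```python
from itertools import combinations

def remove_relaxed_violations(d, slack):
    """Scan all triples of still-present points and remove every triple whose
    largest pairwise distance exceeds the second largest by more than slack.
    Returns (kept, removed)."""
    alive = set(range(len(d)))
    removed = []
    for i, j, k in combinations(range(len(d)), 3):
        if i in alive and j in alive and k in alive:
            _, second, largest = sorted((d[i][j], d[i][k], d[j][k]))
            if largest > second + slack:
                alive.difference_update((i, j, k))
                removed.extend((i, j, k))
    return sorted(alive), removed

# Three points on a line: distances 1, 1, 2.
line = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
strict = remove_relaxed_violations(line, slack=0.5)   # 2 > 1 + 0.5: removed
loose = remove_relaxed_violations(line, slack=1.0)    # 2 > 1 + 1 is false: kept
```

As in the isometric case, the removed triples are disjoint, which is what bounds the number of removed points by three times the number of violating triples found.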
It suffices to prove that the output is indeed an outlier embedding. Let , with , be such that admits a outlier embedding into an ultrametric. Let be such that . It follows by Lemma 4.1 that does not admit a outlier embedding into an ultrametric. Thus, . It follows that . In other words, the algorithm removes at most points.
It remains to bound the distortion between and . Let be the MST of such that the algorithm outputs the ultrametric induced by . We will prove by induction on , that for all , if the  path in contains at most edges, then
For the base case we have that . Since is the minimum spanning tree of , it follows that , proving the base case. For the inductive step, let such that the - path contains at most edges, for some . Let be such that is in the - path in , and moreover the - and - paths in have at most edges each. Since , it follows that the triple was not removed by the algorithm, and thus