Approximating Edit Distance in Near-Linear Time*

*A preliminary version of this paper appeared in Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC 2009), Bethesda, MD, USA, 2009, pp. 199–204.


Alexandr Andoni
Microsoft Research SVC
This work was done when the author was at the Massachusetts Institute of Technology, supported in part by a David and Lucile Packard Fellowship, by MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Foundation), and by NSF grant CCF-0728645.
   Krzysztof Onak
Carnegie Mellon University
Supported in part by a Symantec research fellowship, NSF grant 0728645, and NSF grant 0732334. This work was done when the author was a graduate student at Massachusetts Institute of Technology.
Abstract

We show how to compute the edit distance between two strings of length n up to a factor of 2^{Õ(√log n)} in n^{1+o(1)} time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation. Previously, a 2^{Õ(√log n)} approximation was known only for embedding edit distance into ℓ₁, and it is not known whether that embedding can be computed in less than quadratic time.

1 Introduction

The edit distance (or Levenshtein distance) between two strings is the number of insertions, deletions, and substitutions needed to transform one string into the other [Lev65]. This distance is of fundamental importance in several fields such as computational biology and text processing/searching, and consequently, problems involving edit distance were studied extensively (see [Nav01], [Gus97], and references therein). In computational biology, for instance, edit distance and its slight variants are the most elementary measures of dissimilarity for genomic data, and thus improvements on edit distance algorithms have the potential of major impact.

The basic problem is to compute the edit distance between two strings of length n over some alphabet. The text-book dynamic programming runs in O(n²) time (see [CLRS01] and references therein). This was only slightly improved by Masek and Paterson [MP80] to O(n²/log² n) time for constant-size alphabets. (The result has only recently been extended to arbitrarily large alphabets by Bille and Farach-Colton [BFC08], with a poly(log log n) factor loss in time.) Their result from 1980 remains the best algorithm to this date.
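The textbook quadratic-time dynamic program mentioned above is simple to state; the following is a standard implementation (an illustration, not code from the paper):

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook dynamic program: O(|x| * |y|) time, O(|y|) space."""
    prev = list(range(len(y) + 1))  # row 0: distance from "" to each prefix of y
    for i, cx in enumerate(x, start=1):
        curr = [i]  # first column: delete all i characters of x[:i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```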

Since near-quadratic time is too costly when working on large datasets, practitioners tend to rely on faster heuristics (see [Gus97], [Nav01]). This leads to the question of finding fast algorithms with provable guarantees, specifically: can one approximate the edit distance between two strings in near-linear time [Ind01, BEK03, BJKK04, BES06, CPSV00, Cor03, OR07, KN06, KR06]?

Prior results on approximation algorithms. (We make no attempt at presenting a complete list of results for restricted problems, such as average-case edit distance, weakly-repetitive strings, and the bounded distance regime, or for related problems, such as pattern matching/nearest neighbor and sketching. For a very thorough survey, if only slightly outdated, see [Nav01].)

A linear-time √n-approximation algorithm immediately follows from the O(n + d²)-time exact algorithm (see Landau, Myers, and Schmidt [LMS98]), where d is the edit distance between the input strings. Subsequent research improved the approximation first to n^{3/7}, and then to n^{1/3+o(1)}, due to, respectively, Bar-Yossef, Jayram, Krauthgamer, and Kumar [BJKK04], and Batu, Ergün, and Sahinalp [BES06].

A sublinear-time algorithm was obtained by Batu, Ergün, Kilian, Magen, Raskhodnikova, Rubinfeld, and Sami [BEK03]. Their algorithm distinguishes the case when the distance is O(n^{1−ε}) from the case when it is Ω(n), in sublinear time, for any ε > 0. (We use Õ(f) to denote f · log^{O(1)} f.) Note that their algorithm cannot distinguish two distances that are both sublinear, say, O(n^{0.1}) vs. O(n^{0.9}).

On a related front, in 2005, the breakthrough result of Ostrovsky and Rabani gave an embedding of the edit distance metric into ℓ₁ with distortion 2^{Õ(√log n)} [OR07] (see preliminaries for definitions). This result vastly improved related applications, namely nearest neighbor search and sketching. However, it did not have implications for computing edit distance between two strings in sub-quadratic time. In particular, to the best of our knowledge, it is not known whether it is possible to compute their embedding in less than quadratic time.

The best approximation to this date remains the 2006 result of Batu, Ergün, and Sahinalp [BES06], achieving an n^{1/3+o(1)} approximation. Even when allowed nearly quadratic time, their approximation factor remains polynomial in n.

Our result.

We obtain a 2^{Õ(√log n)} approximation in near-linear time. This is the first sub-polynomial approximation algorithm for computing the edit distance between two strings running in strongly subquadratic time.

Theorem 1.1.

The edit distance between two strings of length n can be computed up to a factor of 2^{Õ(√log n)} in n^{1+o(1)} time.

Our result immediately extends to two more related applications. The first application is to sublinear-time algorithms. In this scenario, the goal is to compute the distance between two strings of the same length n in o(n) time. For this problem, for any ε > 0, we can distinguish distance O(n^{1−ε}) from distance Ω(n) in strongly sublinear time.

The second application is to the problem of pattern matching with errors. In this application, one is given a text T of length N and a pattern P of length n, and the goal is to report the substring of T that minimizes the edit distance to P. Our result immediately gives an algorithm for this problem running in N^{1+o(1)} time with a 2^{Õ(√log n)} approximation. We note that the best exact algorithm for this problem runs in Õ(Nn) time [MP80]. Better algorithms may be obtained if we restrict the minimal distance between the pattern and the best substring of T, or for relatives of the edit distance. In particular, Sahinalp and Vishkin [SV96] and Cole and Hariharan [CH02] showed linear-time algorithms for finding all substrings at distance at most n^α, where α is a constant in (0,1). Moreover, Cormode and Muthukrishnan gave a near-linear time O(log n · log* n)-approximation algorithm when the distance is the edit distance with moves.

1.1 Preliminaries and Notation

Before describing our general approach and the techniques used, we first introduce a few definitions.

We write ed(x, y) to denote the edit distance between strings x and y. We use the notation [n] = {1, 2, …, n}. For a string x, a substring starting at position i, of length m, is denoted x[i : i+m−1]. Whenever we say with high probability (w.h.p.) throughout the paper, we mean “with probability at least 1 − 1/p(n)”, where p(n) is a sufficiently large polynomial function of the input size n.

Embeddings.

For a metric (M, d_M) and another metric (X, ρ), an embedding is a map φ : M → X such that, for all x, y ∈ M, we have d_M(x, y) ≤ ρ(φ(x), φ(y)) ≤ γ · d_M(x, y), where γ ≥ 1 is the distortion of the embedding. In particular, all embeddings in this paper are non-contracting.
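As a toy illustration of these definitions (our own example, not taken from the paper), one can check the distortion of a concrete non-contracting map between two small finite metrics:

```python
from itertools import combinations

def distortion(points, d_src, d_dst, f):
    """Distortion of a non-contracting map f between two finite metrics:
    the largest factor by which any pairwise distance is stretched."""
    worst = 1.0
    for x, y in combinations(points, 2):
        assert d_dst(f(x), f(y)) >= d_src(x, y), "map must be non-contracting"
        worst = max(worst, d_dst(f(x), f(y)) / d_src(x, y))
    return worst

# A 3-point path metric embedded into the real line, stretched by 1.2.
pts = ["a", "b", "c"]
path = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 2}
d_src = lambda u, v: path[tuple(sorted((u, v)))]
coords = {"a": 0.0, "b": 1.2, "c": 2.4}
print(distortion(pts, d_src, lambda u, v: abs(u - v), coords.get))  # 1.2
```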

We say an embedding φ is oblivious if it is randomized and, for any subset S ⊆ M of size n, the distortion guarantee holds for all pairs x, y ∈ S with high probability. The embedding φ is non-oblivious if it holds only for a specific set S (i.e., φ is allowed to depend on S).

Metrics.

The d-dimensional ℓ₁ metric is the set of points living in ℝ^d under the distance ‖x − y‖₁ = Σ_{i=1}^{d} |x_i − y_i|. We also denote it by ℓ₁^d.

We define the thresholded Earth-Mover Distance, denoted TEMD_t for a fixed threshold t > 0, as the following distance on subsets A and B, of size s ∈ ℕ, of some metric (M, d_M):

TEMD_t(A, B) = (1/s) · min_{π} Σ_{a∈A} min{ d_M(a, π(a)), t }    (1)

where π ranges over all bijections between the sets A and B. TEMD_∞ is the simple Earth-Mover Distance (EMD). We will always use t = s and thus drop the subscript t; i.e., TEMD = TEMD_s.
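For intuition, on small sets TEMD can be evaluated by brute force directly from Equation (1) (an illustrative exponential-time implementation, normalized by the set size s as above):

```python
from itertools import permutations

def temd(A, B, d, t):
    """Thresholded Earth-Mover Distance between equal-size sets A and B:
    brute force over all bijections (exponential; for intuition only)."""
    s = len(A)
    assert len(B) == s
    best = min(
        sum(min(d(a, b), t) for a, b in zip(A, matching))
        for matching in permutations(B)
    )
    return best / s

l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
A = [(0, 0), (5, 0)]
B = [(0, 1), (9, 0)]
print(temd(A, B, l1, t=2))             # far pair clipped at the threshold: 1.5
print(temd(A, B, l1, t=float("inf")))  # no threshold (plain EMD): 2.5
```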

A graph (tree) metric is a metric induced by a connected weighted graph (tree) G, where the distance between two vertices is the length of the shortest path between them. We denote an arbitrary tree metric by TM.

Semimetric spaces.

We define a semimetric to be a pair (M, d_M) that satisfies all the properties of a metric space except possibly the triangle inequality. A γ-near metric is a semimetric (M, d_M) such that there exists some metric (M, d*_M) (satisfying the triangle inequality) with the property that, for any x, y ∈ M, we have

d*_M(x, y) ≤ d_M(x, y) ≤ γ · d*_M(x, y).

Product spaces.

A sum-product over a metric (M, d_M), denoted (⊕_{ℓ₁})^k M, is a derived metric over the set M^k, where the distance between two points x = (x₁, …, x_k) and y = (y₁, …, y_k) is equal to

d_{sum}(x, y) = Σ_{i=1}^{k} d_M(x_i, y_i).

For example, the space (⊕_{ℓ₁})^k ℝ is just the k-dimensional ℓ₁.

Analogously, a min-product over M, denoted (⊕_{min})^k M, is a semimetric over the set M^k, where the distance between two points x = (x₁, …, x_k) and y = (y₁, …, y_k) is

d_{min}(x, y) = min_{i∈[k]} d_M(x_i, y_i).

We also slightly abuse the notation by writing (⊕_{min})^k TM to denote the min-product of k tree metrics (that could differ from each other).
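The two product constructions are straightforward to state in code; the following small sketch (our own illustration, with ℓ₁ as the base metric) also shows why a min-product is only a semimetric:

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def sum_product(x, y, base=l1):
    """Sum-product distance: add the base distances over the k components."""
    return sum(base(xi, yi) for xi, yi in zip(x, y))

def min_product(x, y, base=l1):
    """Min-product distance: take the best component. Only a semimetric."""
    return min(base(xi, yi) for xi, yi in zip(x, y))

# Two points with k = 2 components, each component a point of l1^2:
x = [(0, 0), (10, 10)]
y = [(3, 4), (10, 11)]
print(sum_product(x, y))  # 7 + 1 = 8
print(min_product(x, y))  # min(7, 1) = 1

# The triangle inequality can fail for the min-product:
a, b, c = [(0,), (9,)], [(0,), (0,)], [(9,), (0,)]
assert min_product(a, b) == 0 and min_product(b, c) == 0 and min_product(a, c) == 9
```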

1.2 Techniques

Our starting point is the Ostrovsky-Rabani embedding [OR07]. For the input strings x, y, as well as for all their substrings of specific lengths, we compute some vectors living in low-dimensional ℓ₁ such that the distance between two such vectors approximates the edit distance between the associated (sub-)strings. In this respect, these vectors can be seen as an embedding of the considered strings into ℓ₁ of polylogarithmic dimension. Unlike the Ostrovsky-Rabani embedding, however, our embedding is non-oblivious in the sense that the vectors are computed given all the relevant strings. In contrast, Ostrovsky and Rabani give an oblivious embedding φ such that ‖φ(x) − φ(y)‖₁ approximates ed(x, y). However, the obliviousness comes at a high price: their embedding requires a high dimension, polynomial in n, and a high computation time, at least quadratic in n (even when allowing a randomized embedding and a constant probability of correctness). We further note that reducing the dimension of this embedding seems unlikely, as suggested by the results on the impossibility of dimensionality reduction within ℓ₁ [CS02, BC03, LN04]. Nevertheless, the general recursive approach of the Ostrovsky-Rabani embedding is the starting point of the algorithm from this paper.

The heart of our algorithm is a near-linear time algorithm that, given a sequence of low-dimensional vectors v₁, …, v_n and an integer s, constructs new vectors q₁, …, q_{n′}, where n′ = n − s + 1, with the following property. For all i, j ∈ [n′], the value ‖q_i − q_j‖₁ approximates the Earth-Mover Distance (EMD) between the sets A_i = {v_i, v_{i+1}, …, v_{i+s−1}} and A_j = {v_j, v_{j+1}, …, v_{j+s−1}}. (In fact, our algorithm does this for the thresholded EMD, TEMD, but the technique is precisely the same.) To accomplish this (non-oblivious) embedding, we proceed in two stages. First, we embed (obliviously) the EMD metric into a min-product of ℓ₁'s of low dimension. In other words, for a set A_i, we associate a matrix L(A_i), of polylogarithmic size, such that the EMD distance between the sets A_i and A_j is approximated by the min-product distance between L(A_i) and L(A_j). Min-products help us simultaneously on two fronts: one is that we can apply a weak dimensionality reduction in ℓ₁, using Cauchy projections, and the second is that they enable us to accomplish a low-dimensional EMD embedding itself. Our embedding is not only low-dimensional, but it is also linear, allowing us to compute the matrices L(A_i) in near-linear time by performing one pass over the sequence v₁, …, v_n. Linearity is crucial here, as even the total size of the A_i's is n·s, which can be as high as n², and so processing each A_i separately is infeasible.

In the second stage, we show how to embed a set of points lying in a low-dimensional min-product of ℓ₁'s back into a low-dimensional ℓ₁ with only small distortion. We note that this is not possible in general, with any bounded distortion, because such a set of points does not even form a metric. We show that this is possible when we assume that the semimetric induced by the set of points approximates some metric (in our case, the set of points approximates the initial EMD metric). The embedding from this stage starts by embedding a min-product of ℓ₁'s into a low-dimensional min-product of tree metrics. We further embed the latter into an n-point metric supported by the shortest-path metric of a sparse graph. Finally, we observe that we can implement Bourgain's embedding on a sparse graph metric in near-linear time. These last two steps make our embedding non-oblivious.

1.3 Recent Work

We note that the recent work [AKO10] has shown that one can approximate the edit distance between two strings up to a multiplicative factor of (log n)^{O(1/ε)} in n^{1+ε} time, for any desired ε > 0. Although the new result obtains a polylogarithmic approximation, its running time is slightly higher than that of the algorithm presented here. For a comparable approximation factor, the algorithm of [AKO10] does not improve the running time (up to constants hidden by the big-O notation). We further remark that the techniques of [AKO10] are disjoint from the techniques presented here, and are based on asymmetric sampling of one of the strings.

2 Short Overview of the Ostrovsky-Rabani Embedding

We now briefly describe the embedding of Ostrovsky and Rabani [OR07]. Some notions introduced here are used in our algorithm described in the next section.

The embedding of Ostrovsky and Rabani is recursive. For a fixed length m, they construct the embedding of edit distance over strings of length m using the embedding of edit distance over strings of shorter lengths l < m. We denote their embedding of length-m strings by φ_m, and let d_m be the resulting distance: d_m(x, y) = ‖φ_m(x) − φ_m(y)‖₁. For two strings x, y of length m, the embedding is such that d_m approximates an “idealized” distance ed̃_m(x, y), which itself approximates the edit distance between x and y.

Before describing the “idealized” distance ed̃_m, we introduce some notation. Partition x into b blocks, called x^{(1)}, …, x^{(b)}, of length m/b. Next, fix some block index j ∈ [b] and a (shorter) substring length l ≤ m/b. We consider the set of all substrings of x^{(j)} of length l, embed each one recursively via φ_l, and define A^x_{j,l} ⊂ ℓ₁ to be the set of resulting vectors (note that |A^x_{j,l}| ≤ m/b). Formally,

A^x_{j,l} = { φ_l( x^{(j)}[i : i+l−1] ) : 1 ≤ i ≤ m/b − l + 1 }.

Taking the embeddings φ_l as given (and thus also the sets A^x_{j,l} for all j and l), define the new “idealized” distance ed̃_m approximating the edit distance between strings x, y of length m as

ed̃_m(x, y) = c · Σ_{j∈[b]} Σ_{l} l · TEMD(A^x_{j,l}, A^y_{j,l})    (2)

where the inner sum ranges over the substring lengths l used in the construction (powers of 2), TEMD is the thresholded Earth-Mover Distance (defined in Equation (1)), and c is a sufficiently large normalization constant. Using the terminology from the preliminaries, the distance function ed̃_m can be viewed as the distance function of a sum-product of TEMDs, i.e., of (⊕_{ℓ₁}) TEMD, and the embedding of x into this product space is attained by the natural identity map (on the sets A^x_{j,l}).

The key idea is that the distance ed̃_m approximates edit distance well, assuming that the embeddings φ_l approximate edit distance well for all the shorter lengths l used in the construction. Formally, Ostrovsky and Rabani show that:

Fact 2.1 ([OR07]).

Fix m and b ≤ m, and let x, y be strings of length m. Let D be an upper bound on the distortion of φ_l viewed as an embedding of edit distance on strings of length l, for all lengths l ≤ m/b used in the construction. Then,

ed(x, y) ≤ ed̃_m(x, y) ≤ ed(x, y) · (b + D) · log^{O(1)} m.

To obtain a complete embedding, it remains to construct an embedding approximating ed̃_m up to a small factor. In fact, if one manages to approximate ed̃_m up to a poly-logarithmic factor, then the final distortion comes out to be 2^{Õ(√log n)}. This follows from the following recurrence on the distortion factor. Suppose ψ_m is an embedding that approximates ed̃_m up to a factor p = log^{O(1)} m. Then, if D(m) is the distortion of the resulting φ_m (as an embedding of edit distance), Fact 2.1 immediately implies that, for b = 2^{√(log m log log m)},

D(m) ≤ p · (b + D(m/b)) · log^{O(1)} m.

This recurrence solves to D(n) = 2^{O(√(log n log log n))} = 2^{Õ(√log n)}, as proven in [OR07].

Concluding, to complete a step of the recursion, it is sufficient to embed the metric given by ed̃_m into ℓ₁ with a polylogarithmic distortion. Recall that ed̃_m is the distance function of a sum-product of TEMDs, and thus one just needs to embed TEMD into ℓ₁. Indeed, Ostrovsky and Rabani show how to embed a relaxed (but sufficient) version of TEMD into ℓ₁ with a polylogarithmic distortion, yielding the desired embedding φ_m, which approximates ed̃_m up to a polylogarithmic factor at each level of the recursion. We note that the required dimension is polynomial in m.

3 Proof of the Main Theorem

We now describe our general approach. Fix a string x of length n. For each substring y of x of certain lengths, we construct a low-dimensional vector v_y such that, for any two such substrings y, z of the same length, the edit distance between y and z is approximated by the ℓ₁ distance between the vectors v_y and v_z. We note that the embedding is non-oblivious: to construct the vectors v_y we need to know all the substrings of x in advance (akin to Bourgain's embedding guarantee). We also note that computing such vectors is enough to solve the problem of approximating the edit distance between two input strings. Specifically, we apply this procedure to the concatenation of the two input strings, and then compute the ℓ₁ distance between the vectors corresponding to the two strings, themselves substrings of the concatenation.

More precisely, for each length m ∈ W, for some set W ⊆ [n] specified later, and for each substring x[i : i+m−1], where i ∈ [n−m+1], we compute a vector v^{(m)}_i ∈ ℓ₁^α, where α = log^{O(1)} n. The construction is inductive: to compute the vectors v^{(m)}_i, we use the vectors v^{(l)}_j for lengths l ∈ W with l < m. The general approach of our construction is based on the analysis of the recursive step of Ostrovsky and Rabani, described in Section 2. In particular, our vectors will also approximate the distance ed̃_m (given in Equation (2)), with the sets A defined using the vectors v^{(l)}_j for l < m.

The main challenge is to process one level (the vectors v^{(m)}_i for a fixed m) in near-linear time. Besides the computation time itself, a fundamental difficulty in applying the approach of Ostrovsky and Rabani directly is that their embedding would give a much higher dimension, polynomial in n. Thus, if we were to use their embedding, even storing all the vectors would take quadratic space.

To overcome this last difficulty, we settle on non-obliviously embedding the set of substrings x[i : i+m−1], for i ∈ [n−m+1], under the “ideal” distance ed̃_m, with polylogarithmic distortion (formally, under the distance ed̃_m from Equation (2), where the sets A are defined using the vectors v^{(l)}_j for l < m). Existentially, we know that there exist vectors v*_i ∈ ℓ₁^{O(log² n)} such that ‖v*_i − v*_j‖₁ approximates ed̃_m(x[i : i+m−1], x[j : j+m−1]) up to an O(log n) factor for all i and j; this follows by the standard Bourgain's embedding [Bou85]. The vectors that we compute approximate the properties of the ideal vectors v*_i. Their efficient computability comes at the cost of an additional polylogarithmic loss in approximation.

The main building block is the following theorem. It shows how to approximate the TEMD distance for the desired sets A_i in near-linear time.

Theorem 3.1.

Let n ∈ ℕ and s ∈ [n]. Let v₁, v₂, …, v_n be vectors in ℓ₁^α, where α = log^{O(1)} n and the coordinates are integers bounded by n^{O(1)}. Define the sets A_i = {v_i, v_{i+1}, …, v_{i+s−1}} for i ∈ [n−s+1].

Let n′ = n − s + 1. We can compute (randomized) vectors q_i ∈ ℓ₁^{log^{O(1)} n} for i ∈ [n′] such that for any i, j ∈ [n′], with high probability, we have

TEMD(A_i, A_j) ≤ ‖q_i − q_j‖₁ ≤ TEMD(A_i, A_j) · log^{O(1)} n.

Furthermore, computing all the vectors q_i takes n · log^{O(1)} n time.

To map the statement of this theorem to the above description, we mention that, for each length m ∈ W, we apply the theorem to the vectors v^{(l)}_1, …, v^{(l)}_n, for each length l ∈ W with l < m.

We prove Theorem 3.1 in later sections. Once we have Theorem 3.1, it becomes relatively straightforward (albeit a bit technical) to prove the main theorem, Theorem 1.1. We complete the proof of Theorem 1.1 next, assuming Theorem 3.1.

Proof of Theorem 1.1.

We start by appending y to the end of x; we will work with the new version of x only, and denote its length by n. Let b = 2^{√(log n log log n)}. We construct vectors v^{(m)}_i for m ∈ W, where W ⊆ [n] is a carefully chosen set of lengths. Namely, W is the minimal set such that: n ∈ W, and, for each m ∈ W with m ≥ b, we have that 2^j ∈ W for all integers j ≤ log₂(m/b). It is easy to show by induction that the size of W is O(log n). We construct the vectors inductively in a bottom-up manner: we use the vectors v^{(l)}_i for small lengths l to build the vectors v^{(m)}_i for larger lengths m. W is exactly the set of lengths that we need in the process.

Fix an m ∈ W such that m < b. We define the vector v^{(m)}_i to be equal to m · e_{h(x[i:i+m−1])}, where h is a randomly chosen function mapping length-m strings to the coordinates of a high-dimensional ℓ₁ space, and e_r denotes the r-th standard basis vector. It is readily seen that ‖v^{(m)}_i − v^{(m)}_j‖₁ approximates ed(x[i:i+m−1], x[j:j+m−1]) up to a 2m ≤ 2b approximation factor, for each i, j.

Now consider an m ∈ W such that m ≥ b. First we construct vectors approximating TEMD on the relevant sets of recursively computed vectors. In particular, for a fixed length l equal to a power of 2, with l ≤ m/b, we apply Theorem 3.1 to the sequence of vectors v^{(l)}_1, …, v^{(l)}_n with window size s = m/b − l + 1, obtaining vectors q^{(l)}_i, where A^{(l)}_i = {v^{(l)}_i, …, v^{(l)}_{i+s−1}}. Theorem 3.1 guarantees that, for each i, j, the value ‖q^{(l)}_i − q^{(l)}_j‖₁ approximates TEMD(A^{(l)}_i, A^{(l)}_j) up to a factor of log^{O(1)} n. We can then use these vectors to obtain the vectors v^{(m)}_i that approximate the “idealized” distance ed̃_m on the substrings x[i : i+m−1], for i ∈ [n−m+1]. Specifically, we let the vector v^{(m)}_i be a concatenation of the (suitably scaled) vectors q^{(l)}_{i+(j−1)·m/b}, where j goes over [b], and l goes over all powers of 2 less than m/b:

v^{(m)}_i = ( l · q^{(l)}_{i+(j−1)·m/b} )_{j∈[b], l a power of 2, l ≤ m/b}.

Then, the vectors v^{(m)}_i approximate the distance ed̃_m (given in Equation (2)) up to a log^{O(1)} n approximation factor, with the sets A taken as

A^{x[i:i+m−1]}_{j,l} = { v^{(l)}_{i′} : i + (j−1)·m/b ≤ i′ ≤ i + j·m/b − l },

for j ∈ [b] and l a power of 2 less than m/b.

The algorithm finishes by outputting the ℓ₁ distance between the two vectors corresponding to the original strings x and y (as substrings of the concatenation), which is an approximation to the edit distance between x and y. The total running time is n · log^{O(1)} n.

It remains to analyze the resulting approximation. Let D(m) be the approximation achieved by the vectors v^{(m′)}_i for substrings of x of lengths m′ ∈ W with m′ ≤ m. Then, using Fact 2.1 and the fact that the vectors v^{(m)}_i approximate the distance ed̃_m up to a factor of log^{O(1)} n, we have that

D(m) ≤ (b + D(m/b)) · log^{O(1)} n.

Since the total number of recursion levels is bounded by O(√(log n / log log n)), we deduce that D(n) ≤ 2^{O(√(log n log log n))} = 2^{Õ(√log n)}. ∎

3.1 Proof of Theorem 3.1

The proof proceeds in two stages. In the first stage, we show an embedding of the TEMD metric into a low-dimensional space. Specifically, we show an (oblivious) embedding of TEMD into a min-product of ℓ₁'s. Recall that the min-product of ℓ₁'s, denoted (⊕_{min})^k ℓ₁^α, is a semimetric where the distance between two k-by-α matrices x and y is d_{min}(x, y) = min_{i∈[k]} ‖x_i − y_i‖₁. Our min-product of ℓ₁'s has dimensions k = O(log n) and α = log^{O(1)} n. The min-product can be seen as helping us on two fronts: one is the embedding of TEMD into ℓ₁ (of initially high dimension), and another is a weak dimensionality reduction in ℓ₁, using Cauchy projections. Both of these embeddings are of the following form: consider a randomized embedding f into (standard) ℓ₁ that has no contraction (w.h.p.) but whose expansion is bounded only in expectation (as opposed to w.h.p.). To obtain a “w.h.p.” expansion, one standard approach is to sample f many times and concentrate the expectation. This approach, however, would necessitate a high number of samples of f, and thus yield a high final dimension. Instead, the min-product allows us to take only O(log n) independent samples of f.

We note that our embedding of TEMD into a min-product of ℓ₁'s, denoted λ, is linear in the sets A: λ(A) = Σ_{a∈A} λ({a}). The linearity allows us to compute the embedding of the sets A_i in a streaming fashion: the embedding of A_{i+1} is obtained from the embedding of A_i with only small additional processing. This stage appears in Section 3.1.1.

In the second stage, we show that, given a set of n points in a min-product of ℓ₁'s, we can (non-obliviously) embed these points into low-dimensional ℓ₁ with polylogarithmic distortion. The time required is near-linear in n and in the dimensions of the min-product of ℓ₁'s.

To accomplish this step, we start by embedding the min-product of ℓ₁'s into a min-product of tree metrics. Next, we show that points in the low-dimensional min-product of tree metrics can be embedded into a graph metric supported by a sparse graph. We note that this is in general not possible, with any (even non-constant) distortion. We show that it is possible when we assume that our subset of the min-product of tree metrics approximates some actual metric (in our case, the min-product approximates the TEMD metric). Finally, we observe that we can implement Bourgain's embedding in near-linear time on a sparse graph metric. This stage appears in Section 3.1.2.

We conclude with the proof of Theorem 3.1 in Section 3.1.3.

3.1.1 Embedding TEMD into a min-product of ℓ₁'s

In the next lemma, we show how to embed TEMD into a min-product of ℓ₁'s of low dimension. Moreover, when the sets A_i are obtained from a sequence of vectors v₁, …, v_n, by taking A_i = {v_i, …, v_{i+s−1}}, we can compute the embedding in near-linear time.

Lemma 3.2.

Fix n ∈ ℕ and s ∈ [n]. Suppose we have vectors v₁, v₂, …, v_n in ℓ₁^α for some α = log^{O(1)} n, with integer coordinates bounded by n^{O(1)}. Consider the sets A_i = {v_i, v_{i+1}, …, v_{i+s−1}}, for i ∈ [n−s+1].

Let n′ = n − s + 1. We can compute (randomized) vectors q_i ∈ ℓ₁^{log^{O(1)} n} for i ∈ [n′] such that, for any i, j ∈ [n′] we have that

  • E[‖q_i − q_j‖₁] ≤ TEMD(A_i, A_j) · log^{O(1)} n, and

  • ‖q_i − q_j‖₁ ≥ TEMD(A_i, A_j) w.h.p.

The computation time is n · log^{O(1)} n.

Thus, by taking k = O(log n) independent copies of the vectors q_i, we can embed the TEMD metric over the sets A_i into (⊕_{min})^k ℓ₁^{log^{O(1)} n}, such that the distortion is log^{O(1)} n w.h.p. The computation time is n · log^{O(1)} n.

Proof.

First, we show how to embed the TEMD metric over the sets A_i into ℓ₁ of a (possibly high) dimension. For this purpose, we use a slight modification of the embedding of [AIK08] (it can also be seen as a strengthening of the TEMD embedding of Ostrovsky and Rabani).

The embedding of [AIK08] constructs embeddings ψ₁, ψ₂, …, ψ_T, one per scale, and the final embedding ψ is just the concatenation ψ = (ψ₁, …, ψ_T). For ψ_t, we impose a randomly shifted grid of side-length R_t = 2^t. That is, let u ∈ [0, R_t)^α be selected uniformly at random. A specific vector v ∈ ℝ^α falls into the cell ( ⌊(v_r + u_r)/R_t⌋ )_{r∈[α]}. Then ψ_t has a coordinate for each cell that can contain one of the vectors v₁, …, v_n; these are the only cells that can be non-empty, and there are at most n of them. The value of a specific coordinate of ψ_t(A), for a set A, equals the number of vectors from A falling into the corresponding cell, times the side-length R_t. Now, if we scale ψ up by a suitable constant factor, Theorem 3.1 from [AIK08] (the theorem is stated for EMD, while here we are concerned with TEMD; nevertheless, the whole statement still applies, because the side-length of the largest grid is bounded by the threshold) says that the vectors ψ(A_i) satisfy the condition that, for any i, j, we have:

  • E[‖ψ(A_i) − ψ(A_j)‖₁] ≤ TEMD(A_i, A_j) · log^{O(1)} n, and

  • ‖ψ(A_i) − ψ(A_j)‖₁ ≥ TEMD(A_i, A_j) w.h.p.

Thus, the vectors ψ(A_i) satisfy the promised properties, except that they have a high dimension.
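A single scale of the randomly shifted grid construction can be sketched as follows (a simplified illustration in the spirit of [AIK08]; the parameter choices are ours). Note the linearity in the multiset, which the algorithm exploits:

```python
import random
from collections import Counter

def grid_embedding(points, side, dim, seed=0):
    """One scale of a randomly shifted grid embedding: map a multiset of
    points to a sparse vector with one coordinate per non-empty cell,
    whose value is (number of points in the cell) * side."""
    rng = random.Random(seed)
    shift = [rng.uniform(0, side) for _ in range(dim)]
    cells = Counter(
        tuple(int((p[r] + shift[r]) // side) for r in range(dim))
        for p in points
    )
    return {cell: count * side for cell, count in cells.items()}

def l1_sparse(f, g):
    """l1 distance between two sparse vectors represented as dicts."""
    return sum(abs(f.get(c, 0) - g.get(c, 0)) for c in set(f) | set(g))

A = [(0, 0), (1, 1), (8, 8)]
B = [(0, 1), (1, 0), (8, 9)]
fA, fB = grid_embedding(A, 4, 2), grid_embedding(B, 4, 2)
print(l1_sparse(fA, fB))
# Linearity: the embedding of a disjoint union is the sum of the embeddings.
fAB = grid_embedding(A + B, 4, 2)
assert all(fAB[c] == fA.get(c, 0) + fB.get(c, 0) for c in fAB)
```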

To reduce the dimension of the vectors ψ(A_i), we apply a weak dimensionality reduction via 1-stable (Cauchy) projections. Namely, we pick a random matrix M of size log^{O(1)} n by the dimension of ψ, where each entry is distributed according to the Cauchy distribution, which has probability density function f(x) = 1/(π(1+x²)). Now define q_i = M·ψ(A_i). Standard properties of this dimensionality reduction guarantee that the vectors q_i satisfy the properties promised in the lemma statement, after an appropriate rescaling (see Theorem 5 of [Ind06]).
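The 1-stable projection itself can be sketched as follows (our own illustration; here we recover the ℓ₁ distance with a median estimator, whereas the algorithm above controls the heavy tails via the min-product instead):

```python
import math
import random
import statistics

def cauchy_matrix(rows, cols, seed=0):
    """Matrix with i.i.d. standard Cauchy entries; a Cauchy sample is
    tan(pi * (U - 1/2)) for U uniform in (0, 1) (a 1-stable distribution)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(cols)]
            for _ in range(rows)]

def project(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# By 1-stability, each coordinate of M(x - y) is Cauchy-distributed with
# scale ||x - y||_1, so the median absolute coordinate estimates that norm.
x = [1.0, 0.0, 3.0, 0.0]
y = [0.0, 0.0, 1.0, 1.0]
M = cauchy_matrix(200, 4, seed=1)
est = statistics.median(abs(a - b) for a, b in zip(project(M, x), project(M, y)))
print(est)  # close to ||x - y||_1 = 4, up to a small constant factor
```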

It remains to show that we can compute the vectors q_i in near-linear time. To this end, observe that the resulting embedding M·ψ is linear, namely M·ψ(A_i) = Σ_{a∈A_i} M·ψ({a}). Moreover, each M·ψ({v_j}) can be computed in log^{O(1)} n time, because each ψ_t({v_j}) has exactly one non-zero coordinate, which can be computed in O(α) time, and then M·ψ_t({v_j}) is simply the corresponding column of M multiplied by the non-zero coordinate of ψ_t({v_j}). To obtain the first vector q₁, we compute the summation of all the corresponding M·ψ({v_j}), for j ∈ [s]. To compute the remaining vectors iteratively, we use the idea of a sliding window over the sequence v₁, …, v_n. Specifically, we have

q_{i+1} = q_i − M·ψ({v_i}) + M·ψ({v_{i+s}}),

which implies that q_{i+1} can be computed in log^{O(1)} n time, given the value of q_i. Therefore, the total time required to compute all the q_i's is n · log^{O(1)} n.
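The sliding-window computation can be sketched as follows, with `phi` standing for the per-element map a ↦ M·ψ({a}) (a hypothetical toy map in this illustration):

```python
def sliding_sums(values, s, phi):
    """q_i = sum of phi(v_j) over the window j = i .. i+s-1, maintained
    incrementally: one phi-evaluation is added and one removed per step."""
    dim = len(phi(values[0]))
    q = [sum(phi(values[j])[r] for j in range(s)) for r in range(dim)]
    out = [list(q)]
    for i in range(len(values) - s):
        drop, add = phi(values[i]), phi(values[i + s])
        q = [q[r] - drop[r] + add[r] for r in range(dim)]
        out.append(list(q))
    return out

phi = lambda v: [v, v * v]  # a toy per-element embedding
print(sliding_sums([1, 2, 3, 4], 2, phi))  # [[3, 5], [5, 13], [7, 25]]
```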

Finally, we show how we obtain an efficient embedding of TEMD into a min-product of ℓ₁'s. We apply the above procedure k = O(log n) times, independently. Let q_i^{(1)}, …, q_i^{(k)} be the resulting vectors, for i ∈ [n′]. The embedding of a set A_i is the concatenation of these vectors, namely λ(A_i) = (q_i^{(1)}, …, q_i^{(k)}). The Chernoff bound implies that w.h.p., for any i, j ∈ [n′], we have that

min_{z∈[k]} ‖q_i^{(z)} − q_j^{(z)}‖₁ ≤ TEMD(A_i, A_j) · log^{O(1)} n.

Also, min_{z∈[k]} ‖q_i^{(z)} − q_j^{(z)}‖₁ ≥ TEMD(A_i, A_j) w.h.p., trivially (by a union bound over the k samples). Thus the vectors λ(A_i) give an embedding of the TEMD metric on the A_i's into (⊕_{min})^k ℓ₁ with log^{O(1)} n distortion w.h.p. ∎

3.1.2 Embedding a min-product of ℓ₁'s into low-dimensional ℓ₁

In this section, we show that n points in the semimetric space (⊕_{min})^k ℓ₁^α can be embedded into ℓ₁ of dimension log^{O(1)} n with distortion log^{O(1)} n. The embedding works under the assumption that the semimetric on the n points is a log^{O(1)} n approximation of some metric. We start by showing that we can embed a min-product of ℓ₁'s into a min-product of tree metrics.

Lemma 3.3.

Fix n, k, α ∈ ℕ such that k, α ≤ log^{O(1)} n. Consider n vectors x₁, …, x_n in (⊕_{min})^k ℓ₁^α, where each coordinate of each x_i lies in the set {0, 1, …, n^{O(1)}}. We can embed these vectors into a min-product of log^{O(1)} n tree metrics, i.e., into (⊕_{min})^{log^{O(1)} n} TM, incurring distortion O(log n) w.h.p. The computation time is n · log^{O(1)} n.

Proof.

We consider all thresholds of the form 2^t, for t = 0, 1, …, O(log n). For each threshold 2^t, and for each coordinate of the min-product (i.e., each i ∈ [k]), we create O(log n) tree metrics. Each tree metric is independently created as follows. We again use randomly shifted grids. Specifically, we define a hash function h : ℝ^α → ℤ^α as

h(x) = ( ⌊(x₁ + u₁)/2^{t+1}⌋, …, ⌊(x_α + u_α)/2^{t+1}⌋ ),

where each u_r is chosen at random from [0, 2^{t+1}). We create each tree metric so that the nodes corresponding to the points hashed by h to the same value are at distance 2^t (this creates a set of stars), and each pair of points that are hashed to different values are at distance 2^{t+1} (we connect the roots of the stars).
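The randomly shifted grid hash and the resulting star-tree distances can be sketched as follows (our own illustration; the convention chosen for this sketch puts points in the same cell at distance t and all other pairs at distance 2t):

```python
import random

def make_grid_hash(dim, side, seed=0):
    """Randomly shifted grid hash: x maps to the tuple of its cell indices."""
    rng = random.Random(seed)
    shift = [rng.uniform(0, side) for _ in range(dim)]
    def h(x):
        return tuple(int((x[r] + shift[r]) // side) for r in range(dim))
    return h

def star_tree_distance(x, y, h, t):
    """Distance in one star tree for threshold t: points hashed to the same
    cell share a star (distance t); otherwise the path goes through the
    star roots (distance 2*t)."""
    return t if h(x) == h(y) else 2 * t

h = make_grid_hash(dim=2, side=8.0, seed=3)
# Cells differ by exactly 5 in each dimension, for any random shift:
print(star_tree_distance((0.0, 0.0), (40.0, 40.0), h, t=4))  # 8
```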

For two points x and y, the probability that they are separated by the grid in the r-th dimension is at most |x_r − y_r| / 2^{t+1}, which implies by the union bound that

Pr[ h(x) ≠ h(y) ] ≤ ‖x − y‖₁ / 2^{t+1}.

On the other hand, the probability that x and y are not separated by the grid in the r-th dimension is 1 − min{1, |x_r − y_r| / 2^{t+1}}. Since the grid is shifted independently in each dimension,

Pr[ h(x) = h(y) ] = Π_{r∈[α]} ( 1 − min{1, |x_r − y_r| / 2^{t+1}} ) ≤ e^{−Σ_{r∈[α]} min{1, |x_r − y_r| / 2^{t+1}}}.

By the Chernoff bound, if x and y are at distance at most 2^t for some threshold t, they will be at distance at most 2^t in one of the O(log n) corresponding tree metrics with high probability. On the other hand, let x and y be two input vectors at distance greater than 2^{t+1} · c log n, for a sufficiently large constant c. The probability that they are at distance smaller than 2^{t+1} in any one of the tree metrics is at most n^{−Ω(c)}, by the union bound.

Therefore, we multiply the weights of all edges in all trees by a factor of Θ(log n) to achieve a proper (non-contracting) embedding. ∎

We now show that we can embed a subset of the min-product of tree metrics into a graph metric, assuming the subset is close to a metric.

Lemma 3.4.

Consider a semimetric (X, ρ) of size |X| = n in (⊕_{min})^k TM for some k = log^{O(1)} n, where each tree metric in the product is of size n^{O(1)}. Suppose ρ is a γ-near metric (i.e., it is embeddable into a metric with distortion γ). Then we can embed X into a connected weighted graph with n · log^{O(1)} n edges with distortion γ, in n · log^{O(1)} n time.

Proof.

We consider the k separate trees, corresponding to the k dimensions of the min-product. We identify the nodes of different trees that correspond to the same point of X, and collapse them into a single node. The graph G we obtain has at most n · log^{O(1)} n edges. Denote by d_G the shortest-path metric spanned by G, and denote our embedding by η. Clearly, for each pair of points x, y ∈ X, we have d_G(η(x), η(y)) ≤ ρ(x, y). If the distance between two points shrinks after the embedding, then there is a sequence of points z₁ = x, z₂, …, z_p = y, all in X, such that the shortest path from η(x) to η(y) decomposes into segments, the i-th of which stays within a single tree and connects η(z_i) to η(z_{i+1}); hence d_G(η(x), η(y)) ≥ Σ_{i=1}^{p−1} ρ(z_i, z_{i+1}). Because ρ is a γ-near metric, there exists a metric ρ* such that ρ*(x, y) ≤ ρ(x, y) ≤ γ · ρ*(x, y) for all x, y ∈ X. Therefore,

d_G(η(x), η(y)) ≥ Σ_{i=1}^{p−1} ρ(z_i, z_{i+1}) ≥ Σ_{i=1}^{p−1} ρ*(z_i, z_{i+1}) ≥ ρ*(x, y) ≥ ρ(x, y)/γ,

and so the distortion of η is at most γ.