Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time^{1}

^{1} The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement no. 616787.
Abstract
Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor $(\log n)^{O(1/\varepsilon)}$, in time $n^{1+\varepsilon}$ for every $\varepsilon > 0$.
In this paper, we provide an algorithm with running time $\tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.
1 Introduction
Exact computation of edit distance. The edit distance (aka Levenshtein distance) [16] between strings $x, y$, denoted by $\operatorname{ed}(x,y)$, is the minimum number of character insertions, deletions, and substitutions needed to convert $x$ into $y$. It is a widely used distance measure between strings that finds applications in fields such as computational biology, pattern recognition, text processing, and information retrieval. The problems of efficiently computing $\operatorname{ed}(x,y)$, and of constructing an optimal alignment (a sequence of operations that converts $x$ to $y$), are of significant interest.
Edit distance can be evaluated exactly in quadratic time via dynamic programming (Wagner and Fischer [20]). Landau et al. [15] gave an algorithm that finds an optimal alignment in time $O(n + k^2)$, where $k$ is the edit distance, improving on a previous algorithm of Ukkonen [19]. Masek and Paterson [17] obtained the first (slightly) subquadratic time algorithm, and the current asymptotically fastest algorithm (Grabowski [14]) runs in time $O(n^2 \log\log n / \log^2 n)$. Backurs and Indyk [7] showed that a truly subquadratic algorithm ($O(n^{2-\delta})$ for some $\delta > 0$) would imply a $2^{(1-\gamma)n}$ time algorithm for CNF-satisfiability, contradicting the Strong Exponential Time Hypothesis (SETH). Abboud et al. [3] showed that even shaving an arbitrarily large polylog factor from $n^2$ would have the plausible, but apparently hard-to-prove, consequence that NEXP does not have non-uniform $\mathrm{NC}^1$ circuits. For further “barrier” results, see [2, 13].
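For concreteness, the quadratic dynamic program of Wagner and Fischer can be sketched in a few lines (Python is used for illustration throughout; this is a textbook sketch, not the paper's algorithm):

```python
def edit_distance(x: str, y: str) -> int:
    """Classic Wagner-Fischer dynamic program: O(|x|*|y|) time, O(|y|) space."""
    m, n = len(x), len(y)
    # prev[j] holds the edit distance between x[:i-1] and y[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                              # delete x[i-1]
                         cur[j - 1] + 1,                           # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))     # substitute
        prev = cur
    return prev[n]
```

The two-row formulation keeps memory linear; recovering an actual alignment requires retaining the full table (or a divide-and-conquer trick).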
Approximation algorithms. There is a long line of work on approximating edit distance. The exact $O(n + k^2)$ time algorithm (where $k$ is the edit distance of the input) of Landau et al. [15] yields a linear time $\sqrt{n}$ factor approximation. This approximation factor was improved, first to $n^{3/7}$ [8], then to $n^{1/3+o(1)}$ [10] and later to $2^{\tilde{O}(\sqrt{\log n})}$ [6], all with slightly superlinear runtime. Batu et al. [9] provided an approximation algorithm with runtime . The strongest result of this type is the $(\log n)^{O(1/\varepsilon)}$ factor approximation (for every $\varepsilon > 0$) with running time $n^{1+\varepsilon}$ of Andoni et al. [4]. Abboud and Backurs [1] showed that a truly subquadratic deterministic time $1+o(1)$ factor approximation algorithm for edit distance would imply new circuit lower bounds.
Independent of our work, Boroujeni et al. [11] obtained a truly subquadratic quantum algorithm that provides a constant factor approximation. Their latest results [12] are a $3+\varepsilon$ factor with runtime $\tilde{O}(n^{1.708})$ and a faster $\tilde{O}(n^{1.610})$ time algorithm with a larger constant approximation factor.
Andoni and Nguyen [5] found a randomized algorithm that approximates the Ulam distance of two permutations of $\{1,\dots,n\}$ (edit distance with only insertions and deletions) within a (large) constant factor in time , where $k$ is the Ulam distance of the input; this was improved by Naumovitz et al. [18] to a $(1+\varepsilon)$ factor approximation (for any $\varepsilon > 0$) with similar runtime.
Our results. We present the first truly subquadratic time classical algorithm that approximates edit distance within a constant factor.
Theorem 1.1.
There is a randomized algorithm EDUB that on input strings $x, y$ of length $n$ over any alphabet outputs, in time $\tilde{O}(n^{2-2/7})$, an upper bound on $\operatorname{ed}(x,y)$ that, with high probability, is at most a fixed constant multiple of $\operatorname{ed}(x,y)$.
If the output is $u$, then the algorithm has implicitly found an alignment of cost at most $u$. The algorithm can be modified to explicitly output such an alignment.
The approximation factor proved in this preliminary version is 1680; it can be greatly improved by tweaking parameters. We believe, but have not proved, that with sufficient care the algorithm can be modified (with no significant increase in runtime) to get a $3+\varepsilon$ approximation.
Theorem 1.1 follows from:
Theorem 1.2.
For every , there is a randomized algorithm that on input strings of length outputs such that: (1) and (2) on any input with , with probability at least . The runtime of is .
The name reflects that this is a "gap algorithm", which distinguishes inputs with (where the output is at most ) from those with (where the output is greater than ).
Theorem 1.1 follows via a routine construction of EDUB from , presented in Section 5. The rest of the paper is devoted to proving Theorem 1.2.
The framework of the algorithm. We use a standard two-dimensional representation of edit distance. Visualize $x$ as lying on a horizontal axis and $y$ as lying on a vertical axis, with horizontal coordinate $i$ corresponding to $x_i$ and vertical coordinate $j$ corresponding to $y_j$. The width of an interval $I = \{i, \dots, j\}$ is $j - i$. Also, $x_I$ denotes the substring of $x$ indexed by $I$. (Note: the left endpoint of $I$ does not contribute a character to $x_I$, e.g., $x_{\{0,\dots,n\}} = x$. This convention is motivated by Proposition 1.3.) We refer to $I$ as an $x$-interval to indicate that it indexes a substring of $x$, and to $J$ as a $y$-interval to indicate that it indexes a substring of $y$. A box is a set $I \times J$ where $I$ is an $x$-interval and $J$ is a $y$-interval; $I \times J$ corresponds to the substring pair $(x_I, y_J)$. It is a square box if $|I| = |J|$. We often abbreviate by . A decomposition of an interval $I$ is a sequence of subintervals $I_1, \dots, I_k$ with $\min I_1 = \min I$, $\max I_k = \max I$, and for $l < k$, $\max I_l = \min I_{l+1}$.
Associated to $(x, y)$ is a directed graph with edge costs, called a grid graph, with vertex set $\{0, \dots, n\} \times \{0, \dots, n\}$ and all edges of the form $(i-1, j) \to (i, j)$ (H-steps), $(i, j-1) \to (i, j)$ (V-steps) and $(i-1, j-1) \to (i, j)$ (D-steps). Every H-step or V-step costs 1, and the D-step into $(i, j)$ costs 1 if $x_i \neq y_j$ and 0 otherwise. There is a 1-1 correspondence that maps a path from $(0,0)$ to $(n,n)$ to an alignment from $x$ to $y$, i.e. a set of character deletions, insertions and substitutions that changes $x$ to $y$, where an H-step into $(i, j)$ means "delete $x_i$", a V-step into $(i, j)$ means "insert $y_j$ between $x_i$ and $x_{i+1}$" and a D-step into $(i, j)$ means "replace $x_i$ by $y_j$", unless they are already equal. We have:
Proposition 1.3.
The cost of an alignment is the sum of edge costs of its associated path, and $\operatorname{ed}(x,y)$ is equal to the minimum cost of an alignment path from $(0,0)$ to $(n,n)$.
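The correspondence of Proposition 1.3 can be made concrete: a standard DP table plus a traceback recovers one min-cost alignment path as a sequence of D/H/V steps whose edge costs sum to the edit distance (a sketch following the grid-graph convention above; the step labels are ours):

```python
def alignment_path(x: str, y: str):
    """Fill the grid-graph DP table and trace back one min-cost path from
    (0, 0) to (len(x), len(y)).  Returns (path_cost, steps), where steps is
    a list of ('D'|'H'|'V', edge_cost) pairs in order along the path."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j                       # j V-steps
            elif j == 0:
                dp[i][j] = i                       # i H-steps
            else:
                dp[i][j] = min(dp[i - 1][j] + 1,                          # H-step
                               dp[i][j - 1] + 1,                          # V-step
                               dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # D-step
    # Trace back from (m, n), preferring D-steps when they achieve the minimum.
    steps, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            steps.append(('D', int(x[i - 1] != y[j - 1]))); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            steps.append(('H', 1)); i -= 1
        else:
            steps.append(('V', 1)); j -= 1
    steps.reverse()
    return dp[m][n], steps
```

Each step taken satisfies the DP equality it checks, so the edge costs along the returned path sum exactly to the edit distance.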
For a box $I \times J$, the grid graph induced on $I \times J$ represents the substring pair $(x_I, y_J)$. The natural high-level idea of the algorithm appears (explicitly or implicitly) in previous work. The algorithm has two phases. First, the covering phase identifies a set of certified boxes, which are pairs consisting of a box and an upper bound on the normalized edit distance of its substring pair. (Normalized distance is more convenient than unnormalized distance for the covering phase.) Second, the min-cost path phase takes this set as input and uses a straightforward customized variant of dynamic programming to find an upper bound on $\operatorname{ed}(x,y)$ in time quasilinear in the size of the set. The central issue is to ensure that the covering phase outputs a set that is sufficiently informative to yield an upper bound within a constant factor of $\operatorname{ed}(x,y)$, while running in subquadratic time.
Simplifying assumptions. The input strings have equal length . (It is easy to reduce to this case: pad the shorter string to the length of the longer using a new symbol. The edit distance of the new pair is between the original edit distance and twice the original edit distance. This factor 2 increase in approximation factor can be avoided by generalizing our algorithm to the case , but we won’t do this here.) We assume is a power of 2 (by padding both strings with a new symbol, which leaves edit distance unchanged). We assume that is a (negative) integral power of 2. The algorithm involves integer parameters , all of which are chosen to be powers of 2.
Organization of the paper. Section 2 is a detailed overview of the covering phase algorithm and its analysis. Section 3 presents the pseudocode and analysis for the covering phase. Section 4 presents the min-cost path phase algorithm. Section 5 summarizes the full algorithm and discusses improvements in runtime via recursion.
2 Covering algorithm: Detailed overview
We give a detailed overview of the covering phase and its time analysis and proof of correctness, ignoring minor technical details. The pseudocode in Section 3 corresponds to the overview, with technical differences mainly to improve runtime. We will illustrate the subquadratic time analysis with the sample input parameter and algorithm parameters , and .
The covering phase outputs a set of certified boxes. The goal is that this set includes an adequate approximating sequence for some min-cost path in the grid graph, which is a sequence of certified boxes that satisfies:

1. The sequence of $x$-intervals of the boxes is a decomposition of the full $x$-interval.

2. Each box is an adequate cover of its subpath, where the subpath of a box is the minimal subpath of the min-cost path whose projection to the $x$-axis is the box's $x$-interval, and "adequate cover" means that the (vertical) distance from the start vertex (resp. final vertex) of the subpath to the lower left (resp. upper right) corner of the box is at most a constant multiple of the cost of the subpath.

3. The sequence is adequately bounded, i.e., the sum of the certified upper bounds is at most a constant multiple of the cost of the path.
This is a slight oversimplification of Definition 3 of approximation of by a sequence of certified boxes.
The intuition for the second condition is that the subpath is "almost" a path between the lower left and upper right corners of the box. Now the subpath might have a vertical extent that is much larger than its horizontal extent, in which case it is impossible to place a square box with corners close to both of its endpoints. But in that case, the subpath has a very high cost (at least the difference between its vertical and horizontal extents). The closeness required is adjusted based on the cost, with relaxed requirements if the cost is large.
The output of the min-cost path phase should satisfy the requirements of the gap algorithm. Lemma 4.1 shows that if the min-cost path phase receives a set that contains an adequate approximating sequence for some min-cost path, then it will output an upper bound on $\operatorname{ed}(x,y)$ that is at most a constant multiple of the cost of that path. Thus on an input with small edit distance, the output is correspondingly small, satisfying the requirements of the gap algorithm. This formalizes the intuition that an adequate approximating sequence captures enough information to deduce a good bound on $\operatorname{ed}(x,y)$.
Once and for all, we fix a min-cost path. Our task for the covering phase is to ensure that, with high probability, the output includes an adequate approximating sequence for this path.
A match for an $x$-interval is a $y$-interval such that the corresponding box is an adequate cover of the associated subpath. It is easy to show (Proposition 3.3) that this implies an upper bound on the edit distance of the box in terms of the cost of the subpath. A box is said to be compatible if its $y$-interval is a match for its $x$-interval, and a box sequence is compatible if every box in it is compatible. A compatible certified box sequence whose distance upper bounds are (on average) within a constant factor of the actual cost satisfies the requirements for an adequate approximating sequence. Our covering algorithm will ensure that the output contains such a sequence.
A natural decomposition splits the $x$-interval into parts of equal width (think of this width as a power of 2 that is roughly the relevant scale). The naïve approach to building the output is to include certified boxes for enough choices of $y$-interval to guarantee a match for each part. An interval of a given width is aligned if its upper and lower endpoints are both multiples of a specified granularity (which we require to be an integral power of 2). We restrict attention to the intervals of the decomposition, called $x$-candidates, and to aligned intervals of the same width, called $y$-candidates. It can be shown (see Proposition 3.4) that an $x$-interval always has a match that is aligned. (In this overview we fix a single granularity; the actual algorithm has iterations during which its value varies, giving improvements in runtime that are unimportant in this overview.) For each $x$-candidate, designate one such aligned match as the canonical match; the resulting box is the canonical compatible box for that candidate.
In the exhaustive approach, for each ($x$-candidate, $y$-candidate) pair, its edit distance is computed in time quadratic in the width, and the resulting certified box is included. With this many boxes, the time for all edit distance computations is worse than quadratic. (The quadratic per-pair cost can be avoided by standard techniques, but this is not significant to the quest for a subquadratic algorithm, so we defer this until the next section.) Note that the number of boxes is strictly subquadratic (for our sample parameters), so at least the min-cost path phase (which runs in time quasilinear in the number of boxes) is truly subquadratic.
Two natural goals that will improve the runtime are: (1) Reduce the amortized time per box needed to certify boxes significantly below and (2) Reduce the total number of certified boxes created significantly below . Neither goal is always achievable, and our covering algorithm combines them. In independent work [11, 12], versions of these two goals are combined, where the second goal is accomplished via Grover search, thus yielding a constant factor subquadratic time quantum approximation algorithm.
Reducing amortized time for certifying boxes: the dense case algorithm. We aim to reduce the amortized time per certified box to be much smaller than . We divide our search for certified boxes into iterations . For iteration , with , our goal is that for all candidate pairs with , we include the certified box for a fixed constant . If we succeed, then for each and its canonical match , and for the largest index for which , iteration will certify with , as needed.
For a string of size , let be the set of candidates with and be the set of candidates with . In iteration , for each candidate , we will specify a set of candidates that includes and is contained in . The set of certified boxes for all candidates and satisfies the goal of iteration .
Iteration proceeds in rounds. In each round we select an $x$-candidate, called the pivot, for which a covering set has not yet been specified. Compute the edit distance from the pivot to every $x$-candidate and to every $y$-candidate; these determine the set of $x$-candidates close to the pivot and the set of $y$-candidates close to the pivot. By the triangle inequality, the distance between any close $x$-candidate and any close $y$-candidate is at most twice the closeness threshold, so we can certify all of these boxes with that upper bound. Mark the close $x$-candidates as fulfilled and proceed to the next round, choosing a new pivot from among the unfulfilled candidates.
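One round of this process can be sketched with a toy metric (absolute difference on integers stands in for normalized edit distance; the interface and names here are hypothetical, not the paper's pseudocode):

```python
def dense_round(pivot, x_candidates, y_candidates, dist, theta):
    """One round of the dense-case algorithm (sketch, hypothetical interface).

    `dist` is any metric on candidates (standing in for normalized edit
    distance); `theta` is the closeness threshold of the iteration.
    Returns the x-candidates fulfilled this round and the certified boxes.
    """
    close_x = [i for i in x_candidates if dist(pivot, i) <= theta]
    close_y = [j for j in y_candidates if dist(pivot, j) <= theta]
    # Triangle inequality: dist(i, j) <= dist(i, pivot) + dist(pivot, j)
    # <= 2 * theta, so every pair below is certified with bound 2 * theta.
    certified = [(i, j, 2 * theta) for i in close_x for j in close_y]
    return close_x, certified
```

The round certifies |close_x| · |close_y| boxes from only |x_candidates| + |y_candidates| distance computations, which is where the amortized savings come from when the pivot is dense.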
The number of certified boxes produced in a round is . If this is much larger than , the number of edit distance computations, then we have significantly reduced amortized time per certified box. (For example, in the trivial case , every candidate box will be certified in a single round.) But in the worst case, there are rounds each requiring time, for an unacceptable total time .
Here is a situation where the number of rounds is much less than . Since any two pivots are necessarily greater than apart, the sets for distinct pivots are disjoint. Now for some parameter (think of ) an $x$-candidate is dense if it is close in edit distance to at least a prescribed number of $y$-candidates; it is sparse otherwise. If we manage to select a dense pivot in each round, then the number of rounds is and the overall time will be . For the sample parameters this is . But there’s no reason to expect that we’ll only choose dense pivots; indeed there need not be any dense pivot.
Let’s modify the process a bit. When choosing a potential pivot, first test whether or not it is (approximately) dense. This can be done, with high probability, by randomly sampling $y$-candidates and finding the fraction of the sample that are within the closeness threshold of the potential pivot. If this fraction is too small then the candidate is declared sparse and abandoned as a pivot; otherwise it is declared dense, and used as a pivot. With high probability, all dense intervals that are tested are declared dense, and all tested intervals that are not dense are declared sparse, so we assume this is the case. Then all pivots are processed (as above) in time (under sample parameters: ). We pay to test each potential pivot (at most of them) so the overall time to test potential pivots is (with sample parameters: ).
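The sampling-based density test can be sketched as follows (again with a toy metric; the interface, the sample size, and the `delta / 2` acceptance threshold are illustrative assumptions, not the paper's exact parameters):

```python
import random

def declare_dense(candidate, y_candidates, dist, theta, delta, samples=200):
    """Estimate, from a random sample, the fraction of y-candidates within
    distance theta of `candidate`; declare the candidate dense if the
    empirical fraction clears delta / 2 (sketch, hypothetical interface)."""
    sample = random.choices(y_candidates, k=samples)
    frac = sum(dist(candidate, j) <= theta for j in sample) / samples
    return frac >= delta / 2
```

By a Chernoff bound, a truly dense candidate (fraction at least delta) is declared dense, and a candidate far below the threshold is declared sparse, each with high probability in the number of samples.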
Each iteration (with different threshold) splits the $x$-candidates into two sets: the intervals that are declared sparse, and all of the rest, for which we have found the desired covering set. With high probability every interval declared sparse is indeed sparse, but a sparse interval need not be declared sparse, since it may belong to the covering set of some selected pivot.
For every candidate not declared sparse, we have met the goal for the iteration. If the set of intervals declared sparse is very small for all iterations, then the set of certified boxes will suffice for the min-cost path algorithm to output a good approximation. But if it is not small, another approach is needed.
Reducing the number of candidates explored: the diagonal extension algorithm. For each candidate , although it suffices to certify the single box with a good upper bound, since is unknown, the exhaustive and dense case approaches both include certified boxes for all candidates . The potential savings in the dense case approach comes from certifying many boxes simultaneously using a relatively small number of edit distance computations.
Here’s another approach: for each candidate try to quickly identify a relatively small subset of candidates that is guaranteed to include . If we succeed, then the number of boxes we certify is significantly reduced, and even paying quadratic time per certified box, we will have a subquadratic algorithm.
We need the notion of diagonal extension of a box. The main diagonal of a square box is the segment joining its lower left and upper right corners. A square box $B$ is a diagonal extension of a square subbox $B'$ if the main diagonal of $B'$ is a subsegment of the main diagonal of $B$ (see Definition 2). Given a square box $B'$ and an $x$-interval $I$ containing the $x$-interval of $B'$, the diagonal extension of $B'$ with respect to $I$ is the unique diagonal extension of $B'$ having $x$-interval $I$. The key observation (Proposition 3.5) is: if a square box is an adequate cover of a subpath, then any diagonal extension of it is an adequate cover of the correspondingly extended subpath.
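In coordinates, the diagonal extension is a one-line computation; the following sketch uses half-open integer intervals `(a, b)` as hypothetical stand-ins for the paper's intervals:

```python
def diagonal_extension(subbox, I):
    """Diagonal extension of a square box with respect to a wider x-interval.

    `subbox` is ((i1, i2), (j1, j2)) with i2 - i1 == j2 - j1, and I = (a, b)
    is an x-interval containing (i1, i2).  Returns the unique square box with
    x-interval I whose main diagonal (slope 1) contains the main diagonal of
    `subbox` (sketch; coordinate conventions are ours).
    """
    (i1, i2), (j1, j2) = subbox
    a, b = I
    assert i2 - i1 == j2 - j1 and a <= i1 and i2 <= b
    offset = j1 - i1            # the shared diagonal is the line y = x + offset
    return ((a, b), (a + offset, b + offset))
```

Since the extended box keeps the same diagonal line, a path hugging the small box's diagonal also hugs the big box's diagonal, which is the geometric content of Proposition 3.5.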
Now let there be two width parameters, one much smaller than the other. (Think of sample values as before.) We use the decomposition of the $x$-interval into intervals of the larger width. The set of $y$-candidates consists of aligned vertical intervals of the smaller width. To identify a small set of potential matches for a wide $x$-interval, we will identify a set (of size much smaller than the number of $y$-candidates) of boxes having $x$-interval in its decomposition into intervals of the smaller width. For each box in this set we determine the diagonal extension with respect to the wide interval, compute its edit distance exactly, and certify it. Our hope is that the set includes a compatible box; if so, the observation above implies that its diagonal extension provides an adequate cover for the corresponding subpath.
Here’s how to build the set: Randomly select a polylog size set of intervals from the finer decomposition. For each sampled interval, compute its edit distance to each $y$-candidate, and let its list consist of the $y$-candidates with smallest edit distance to it. (The list length is a parameter; think of it as before.) The set consists of all boxes formed by pairing a sampled interval with one of the candidates on its list.
To bound runtime: Each requires width computations, taking time . The diagonal extension step requires width computations, for time . Summing over choices for gives time (with sample parameters: ).
Why should this set include a box that is an adequate approximation? The intuition behind the choice is that an adequate cover for the wide interval should typically be among the cheapest boxes over it, and if it is cheap then for a randomly chosen narrow subinterval, its restriction should also be among the cheapest boxes for that subinterval.
Clearly this intuition is faulty: the interval may have many inexpensive matches that are far from the match we seek, and these may all be much cheaper than it. In this bad situation, there are many $y$-intervals whose distance to the interval is smaller than that of the match we are looking for, and this is reminiscent of the good situation for the dense case algorithm, where we hope that the interval has lots of close matches. This suggests combining the two approaches, and leads to our full covering algorithm.
The full covering algorithm. Given the dense case and diagonal extension algorithms, the full covering algorithm is easy to describe. The parameters are as above. We iterate over with . In iteration , we first run the dense case algorithm, and let be the set of intervals declared sparse. Then run the diagonal extension algorithm described earlier (with small modifications): For each interval , select to consist of independent random selections from . For each , find the set of vertical candidates for which . Since is (almost certainly) sparse, the number of such is at most . Proceeding as in the diagonal extension algorithm, we produce a set of certified boxes with interval . Let (resp. ) be the set of all certified boxes produced by the dense case iterations, resp. diagonal extension iterations. The output is . (See Figure 1 for an illustration of the output .)
The runtime is the sum of the runtimes of the dense case and diagonal extension algorithms, as analyzed above. Later, we will give a more precise runtime analysis for the pseudocode.
To finish this extended overview, we sketch the argument that satisfies the covering phase requirements.
Claim 2.1.
Let be an interval in the decomposition. Either (1) the output of the dense case algorithm includes a sequence of certified boxes that adequately approximates the subpath , or (2) with high probability the output of the sparse case algorithm includes a single box that adequately approximates .
(This claim is formalized in Claim 3.12.) Stitching together the subpaths for all implies that will contain a sequence of certified boxes that adequately approximates .
To prove the claim, we establish a sufficient condition for each of the two conclusions and show that if the sufficient condition for the second conclusion fails, then the sufficient condition for the first holds.
Let denote the decomposition of . Every interval has an aligned match . It will be shown (see Proposition 3.4) that . Let denote this upper bound. Consider the first alternative in the claim. During the dense case iteration , every interval is declared dense, and is in for all . To get an adequate approximation, we try to show that later iterations provide much better upper bounds on these boxes, i.e., for a small enough value of . By definition of adequate approximation, it is enough that , for some . Let be the last (largest) iteration for which and (which is well defined since ). Let . Since , the box is certified. The collection is a sequence of certified boxes that satisfies the first two conditions for an adequate approximation of . The third condition will follow if:
(1) 
so this is sufficient to imply the first condition of the claim.
Next consider what we need for the second alternative to hold. Let be the set of intervals declared sparse in iteration . An interval is a winner (for iteration ) if , and is the set of winners. In iteration of the diagonal extension algorithm, we sample elements of . If for at least one iteration our sample includes a winner then the second condition of the claim will hold: is extended diagonally to a box, and by the diagonal extension property, the extension is an adequate cover of , which we will certify with its exact edit distance.
Thus for the second alternative to fail with nonnegligible probability:
(2) 
We argue that if the failure condition (2) holds, then the success condition (1) holds. Multiply (2) by and sum on to get:
(3) 
For a given interval , consider the iterations for which and those for which . First of all if and then since we conclude . So implies that , so the inner sum of the right side of (3) is at most (by summing a geometric series).
Furthermore, for with , by the choice of . Either or . The latter implies , and then is upper bounded by the inner sum on the left of (3). Therefore:
as required for (1).
This completes the overview of the covering algorithm.
3 Covering algorithm: Pseudocode and analysis
The pseudocode consists of CoveringAlgorithm, which calls the procedures DenseStripRemoval (the dense case algorithm) and SparseStripExtensionSampling (the diagonal extension algorithm). These are abbreviated, respectively, by CA, DSR and SSES. The technical differences between the pseudocode and the informal description are mainly to improve the runtime analysis.
3.1 Pseudocode
The parameters of CA are as described in the overview: are input strings of length , comes from , and are integral powers of 2, as are the auxiliary input parameters. The output is a set of certified boxes. The algorithm uses global constants and , where the former is needed for Proposition 3.8.
We use a subroutine SMALLED which takes strings of equal length and a threshold parameter, and outputs their edit distance if it is at most the threshold, and otherwise reports that the threshold is exceeded. The algorithm of [19] implements SMALLED in time .
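A band-restricted dynamic program gives a simple (if not asymptotically optimal) realization of SMALLED; this sketch runs in O(k·m) time for length-m strings and threshold k, and returns None in place of the "distance exceeds k" output:

```python
def small_ed(x: str, y: str, k: int):
    """Banded edit-distance check: return ed(x, y) if it is at most k, else
    None.  Only DP cells with |i - j| <= k are computed; cells outside the
    band can only hold values above k, so they are capped at INF = k + 1."""
    n, m = len(x), len(y)
    if abs(n - m) > k:          # any alignment needs at least |n - m| indels
        return None
    INF = k + 1
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            step = min(prev[j] + 1,                          # delete x[i-1]
                       cur[j - 1] + 1,                       # insert y[j-1]
                       prev[j - 1] + (x[i - 1] != y[j - 1])) # substitute
            cur[j] = min(step, INF)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

Capping at k + 1 keeps every entry bounded, so a final value above k certifies that the true distance exceeds the threshold.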
One technical difference from the overview is that the pseudocode saves time by restricting the search for certified boxes to a portion of the grid close to the main diagonal. Recall that has two requirements: that the output upper bounds (which will be guaranteed by the requirement that contains no falsely certified boxes), and that if , the output is at most for some constant . We therefore design our algorithm assuming , in which case every min-cost path consists entirely of points within steps from the main diagonal, i.e. . So we restrict our search for certified boxes as follows: set , and consider the overlapping equally spaced boxes of width lying along the main diagonal. Together these boxes cover all points within of the main diagonal.
The algorithm of the overview is executed separately on each of these boxes. Within each of these executions, we iterate over (rather than as in the overview). In each iteration we apply the dense case algorithm and the diagonal extension algorithm as in the overview. The output is the union, over all boxes and all iterations, of the boxes produced.
In the procedures DSR and SSES, the input is an induced grid graph corresponding to a box , as described in the "framework" part of Section 1. The procedure DSR on input , sets to be the decomposition of (the candidates) and to be the set of aligned candidates. As in the overview, the dense case algorithm produces a set of certified boxes (called in the pseudocode) and a set of intervals declared sparse. SSES is invoked if and iterates over all intervals in the decomposition . The algorithm skips if contains no subset of , and otherwise selects a sample of subintervals of from . For each sample interval it finds the vertical candidates for which , does a diagonal extension to and certifies each box with an exact edit distance computation.
There are a few parameter changes from the overview that provide some improvement in the time analysis: During each iteration , rather than take our vertical candidates to be from a aligned grid, we can afford a coarser grid that is aligned. Also, the local parameter in DSR and SSES is set to during iteration .
There is one counterintuitive quirk in SSES: each certified box is replicated times with higher distance bounds. This is permissible (increasing the distance bound cannot decertify a box), but seems silly (why add the same box with a higher distance bound?). This is just a convenient technical device to ensure that the second phase mincost path algorithm gives a good approximation.