Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time¹

¹ The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement no. 616787.

Diptarka Chakraborty (diptarka@iuuk.mff.cuni.cz), Debarati Das (debaratix710@gmail.com), Elazar Goldenberg (elazargo@mta.ac.il; The Academic College of Tel Aviv-Yaffo, School of Computer Science, Tel Aviv-Yaffo, Israel), Michal Koucký (koucky@iuuk.mff.cuni.cz), Michael Saks (msaks30@gmail.com; Department of Mathematics, Rutgers University, Piscataway, NJ, USA)
Abstract

Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within an approximation factor polylogarithmic in the length of the strings.

In this paper, we provide an algorithm with running time $\tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.

1 Introduction

Exact computation of edit distance. The edit distance (a.k.a. Levenshtein distance) [16] between strings $x, y$, denoted by $d_{edit}(x,y)$, is the minimum number of character insertions, deletions, and substitutions needed to convert $x$ into $y$. It is a widely used distance measure between strings that finds applications in fields such as computational biology, pattern recognition, text processing, and information retrieval. The problems of efficiently computing $d_{edit}(x,y)$, and of constructing an optimal alignment (a minimum-cost sequence of operations that converts $x$ to $y$), are of significant interest.

Edit distance can be evaluated exactly in quadratic time via dynamic programming (Wagner and Fischer [20]). Landau et al. [15] gave an algorithm that finds an optimal alignment in time $O(n + k^2)$, where $k$ is the edit distance of the input, improving on a previous algorithm of Ukkonen [19]. Masek and Paterson [17] obtained the first (slightly) sub-quadratic time algorithm, and the current asymptotically fastest algorithm (Grabowski [14]) runs in time $O(n^2 \log\log n/\log^2 n)$. Backurs and Indyk [7] showed that a truly sub-quadratic algorithm (one running in time $O(n^{2-\delta})$ for some $\delta > 0$) would imply a $2^{(1-\gamma)n}$-time algorithm for CNF-satisfiability, contradicting the Strong Exponential Time Hypothesis (SETH). Abboud et al. [3] showed that even shaving an arbitrarily large polylog factor from the quadratic bound would have the plausible, but apparently hard-to-prove, consequence that NEXP does not have non-uniform $\mathrm{NC}^1$ circuits. For further "barrier" results, see [2, 13].
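For concreteness, here is the classical quadratic-time dynamic program in Python (a standard textbook rendering of the Wagner–Fischer recurrence, not code from the paper):

```python
def edit_distance(x: str, y: str) -> int:
    """Classical O(|x|*|y|) dynamic program for edit distance."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))          # row for the empty prefix of x
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                           # delete x[i]
                         cur[j - 1] + 1,                        # insert y[j]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # substitute
        prev = cur
    return prev[m]
```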

Approximation algorithms. There is a long line of work on approximating edit distance. The exact $O(n + k^2)$-time algorithm (where $k$ is the edit distance of the input) of Landau et al. [15] yields a linear time $\sqrt{n}$-factor approximation. This approximation factor was improved, first to $n^{3/7}$ [8], then to $n^{1/3+o(1)}$ [10] and later to $2^{\tilde{O}(\sqrt{\log n})}$ [6], all with slightly superlinear runtime. Batu et al. [9] provided an $O(n^{1-\alpha})$-approximation algorithm with runtime $O(n^{\max\{\alpha/2,\,2\alpha-1\}})$. The strongest result of this type is the $(\log n)^{O(1/\epsilon)}$-factor approximation (for every $\epsilon > 0$) with running time $n^{1+\epsilon}$ of Andoni et al. [4]. Abboud and Backurs [1] showed that a truly sub-quadratic deterministic time $1+o(1)$-factor approximation algorithm for edit distance would imply new circuit lower bounds.

Independent of our work, Boroujeni et al. [11] obtained a truly sub-quadratic quantum algorithm that provides a constant factor approximation. Their latest results [12] include a $(3+\epsilon)$-factor approximation, and a faster algorithm with a larger constant approximation factor.

Andoni and Nguyen [5] found a randomized algorithm that approximates the Ulam distance of two permutations of $\{1,\ldots,n\}$ (edit distance with only insertions and deletions) within a (large) constant factor in time $\tilde{O}(\sqrt{n} + n/d)$, where $d$ is the Ulam distance of the input; this was improved by Naumovitz et al. [18] to a $(1+\epsilon)$-factor approximation (for any $\epsilon > 0$) with similar runtime.

Our results. We present the first truly sub-quadratic time classical algorithm that approximates edit distance within a constant factor.

Theorem 1.1.

There is a randomized algorithm ED-UB that, on input strings $x, y$ of length $n$ over any alphabet, outputs an upper bound $U$ on $d_{edit}(x,y)$ in time $\tilde{O}(n^{2-2/7})$ that, with high probability, is at most a fixed constant multiple of $d_{edit}(x,y)$.

If the output is $U$, then the algorithm has implicitly found an alignment of cost at most $U$. The algorithm can be modified to explicitly output such an alignment.

The approximation factor proved in this preliminary version is 1680; it can be greatly improved by tweaking parameters. We believe, but have not proved, that with sufficient care the algorithm can be modified (with no significant increase in runtime) to achieve a much smaller constant approximation factor.

Theorem 1.1 follows from:

Theorem 1.2.

For every $\theta \in (0,1)$, there is a randomized algorithm $\mathrm{GAP\text{-}UB}_\theta$ that on input strings $x, y$ of length $n$ outputs $u$ such that: (1) $u \geq d_{edit}(x,y)$ and (2) on any input with $d_{edit}(x,y) \leq \theta n$, $u \leq c\,\theta n$ with high probability, for a fixed constant $c$. The runtime of $\mathrm{GAP\text{-}UB}_\theta$ is $\tilde{O}(n^{2-2/7})$.

The name reflects that this is a "gap algorithm": it distinguishes inputs with $d_{edit}(x,y) \leq \theta n$ (where the output is at most $c\,\theta n$) from those with $d_{edit}(x,y) > c\,\theta n$ (where the output, being an upper bound, is greater than $c\,\theta n$).

Theorem 1.1 follows via a routine construction of ED-UB from $\mathrm{GAP\text{-}UB}_\theta$, presented in Section 5. The rest of the paper is devoted to proving Theorem 1.2.
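To convey the flavor of that construction, here is a hedged Python sketch (our illustration, not the construction of Section 5): run the gap algorithm for geometrically decreasing values of $\theta$ and return the smallest upper bound obtained. The callable `gap_ub` is a stand-in for the algorithm of Theorem 1.2.

```python
def ed_ub(x, y, gap_ub):
    """Sketch: derive a constant-factor upper bound from a gap algorithm.

    Every value returned by gap_ub is a valid upper bound, so the minimum
    is valid; for the smallest theta with d_edit(x, y) <= theta * n, the
    guarantee of Theorem 1.2 makes that bound O(d_edit(x, y)).
    """
    n = len(x)
    best = n            # trivial upper bound for equal-length strings
    theta = 1.0
    while theta * n >= 1:
        best = min(best, gap_ub(x, y, theta))
        theta /= 2
    return best
```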

The framework of the algorithm. We use a standard two-dimensional representation of edit distance. Visualize $x$ as lying on a horizontal axis and $y$ as lying on a vertical axis, with horizontal coordinate $i$ corresponding to $x_i$ and vertical coordinate $j$ corresponding to $y_j$. Intervals are half-open: $(a,b]$ denotes $\{a+1,\ldots,b\}$, and the width of interval $I = (a,b]$ is $|I| = b-a$. Also, $x_I$ denotes the substring of $x$ indexed by $I$. (Note: $x_a$ is not part of $x_{(a,b]}$, e.g., $x_{(0,n]} = x$. This convention is motivated by Proposition 1.3.) We refer to $I$ as an $x$-interval to indicate that it indexes a substring of $x$, and to $J$ as a $y$-interval to indicate that it indexes a substring of $y$. A box is a set $I \times J$ where $I$ is an $x$-interval and $J$ is a $y$-interval; $I \times J$ corresponds to the substring pair $(x_I, y_J)$. $I \times J$ is a $w$-box if $|I| = |J| = w$. We often abbreviate $d_{edit}(x_I, y_J)$ by $d_{edit}(I \times J)$. A decomposition of an $x$-interval $I$ is a sequence of subintervals $I_1, \ldots, I_k$ with $I_1 \cup \cdots \cup I_k = I$ and, for $\ell < k$, the right endpoint of $I_\ell$ equal to the left endpoint of $I_{\ell+1}$.

Associated to $x$ and $y$ is a directed graph $G_{x,y}$ with edge costs, called a grid graph, with vertex set $\{0,\ldots,n\} \times \{0,\ldots,n\}$ and all edges of the form $(i-1,j) \rightarrow (i,j)$ (H-steps), $(i,j-1) \rightarrow (i,j)$ (V-steps) and $(i-1,j-1) \rightarrow (i,j)$ (D-steps). Every H-step or V-step costs 1, and a D-step into $(i,j)$ costs 1 if $x_i \neq y_j$ and 0 otherwise. There is a 1-1 correspondence that maps a path from $(0,0)$ to $(n,n)$ to an alignment from $x$ to $y$, i.e., a set of character deletions, insertions and substitutions that changes $x$ to $y$, where an H-step into $(i,j)$ means "delete $x_i$", a V-step into $(i,j)$ means "insert $y_j$ between $x_i$ and $x_{i+1}$", and a D-step into $(i,j)$ means "replace $x_i$ by $y_j$", unless they are already equal. We have:

Proposition 1.3.

The cost of an alignment from $x$ to $y$ is the sum of the edge costs of its associated path, and $d_{edit}(x,y)$ is equal to the min cost of a path from $(0,0)$ to $(n,n)$.
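As a small illustration of the correspondence (our code; steps are encoded as the letters 'H', 'V', 'D'), the cost of a monotone grid path can be computed directly, and by Proposition 1.3 the minimum over all paths equals the edit distance:

```python
def path_cost(x: str, y: str, steps: str) -> int:
    """Cost of a path from (0, 0) to (len(x), len(y)) in the grid graph."""
    i = j = cost = 0
    for s in steps:
        if s == 'H':      # H-step into (i+1, j): delete x[i+1]
            i += 1; cost += 1
        elif s == 'V':    # V-step into (i, j+1): insert y[j+1]
            j += 1; cost += 1
        else:             # D-step: replace x[i+1] by y[j+1] unless equal
            i += 1; j += 1
            cost += (x[i - 1] != y[j - 1])
    assert (i, j) == (len(x), len(y)), "not a complete path"
    return cost
```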

For a box $B = I \times J$, $G_B$ is the grid graph induced on $B$, and $d_{edit}(B)$ is the min cost of a path in $G_B$ from the lower left corner of $B$ to its upper right corner. The natural high-level idea of $\mathrm{GAP\text{-}UB}_\theta$ appears (explicitly or implicitly) in previous work. The algorithm has two phases. First, the covering phase identifies a set $\mathcal{R}$ of certified boxes, which are pairs $(I \times J, \kappa)$, where $\kappa$ is an upper bound on the normalized edit distance $\Delta(I \times J) = d_{edit}(I \times J)/|I|$. (Normalized edit distance is more convenient than edit distance for the covering phase.) Second, the min-cost path phase takes input $\mathcal{R}$ and uses a straightforward customized variant of dynamic programming to find an upper bound on $d_{edit}(x,y)$ in time quasilinear in $|\mathcal{R}|$. The central issue is to ensure that the covering phase outputs an $\mathcal{R}$ that is sufficiently informative that the final bound is at most $c \cdot \max(d_{edit}(x,y), \theta n)$ for a constant $c$, while running in sub-quadratic time.
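The following rough sketch conveys the flavor of the min-cost path phase; the actual algorithm and its guarantee are given in Section 4 (Lemma 4.1), and the Dijkstra-style formulation below, together with all names, is our illustrative rendering. It treats each certified box as a shortcut edge between its lower left and upper right corners and lets unit-cost H/V steps connect corners, so its output is always a valid upper bound on $d_{edit}(x,y)$:

```python
import heapq

def min_cost_path_ub(n, certified):
    """certified: list of ((i1, i2), (j1, j2), kappa), kappa an upper bound
    on the normalized edit distance of the box (i1, i2] x (j1, j2]."""
    nodes = {(0, 0), (n, n)}
    shortcut = {}
    for (i1, i2), (j1, j2), kappa in certified:
        u, v = (i1, j1), (i2, j2)
        nodes.update((u, v))
        shortcut[(u, v)] = min(shortcut.get((u, v), float('inf')),
                               kappa * (i2 - i1))
    dist = {p: float('inf') for p in nodes}
    dist[(0, 0)] = 0.0
    heap = [(0.0, (0, 0))]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v in nodes:
            if v == u or v[0] < u[0] or v[1] < u[1]:
                continue  # only monotone moves
            # Walk corner to corner with unit H/V steps, or cross a
            # certified box; both upper bound the grid-graph distance.
            c = min((v[0] - u[0]) + (v[1] - u[1]),
                    shortcut.get((u, v), float('inf')))
            if d + c < dist[v]:
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return dist[(n, n)]
```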

Simplifying assumptions. The input strings have equal length $n$. (It is easy to reduce to this case: pad the shorter string to the length of the longer using a new symbol. The edit distance of the new pair is between the original edit distance and twice the original edit distance. This factor 2 increase in approximation factor can be avoided by generalizing our algorithm to strings of unequal length, but we won't do this here.) We assume $n$ is a power of 2 (by padding both strings with a new symbol, which leaves the edit distance unchanged). We assume that $\theta$ is a (negative) integral power of 2. The algorithm involves integer parameters $w_1$, $w_2$ and $d$, all of which are chosen to be powers of 2.

Organization of the paper. Section 2 is a detailed overview of the covering phase algorithm and its analysis. Section 3 presents the pseudo-code and analysis for the covering phase. Section 4 presents the min-cost path phase algorithm. Section 5 summarizes the full algorithm and discusses improvements in runtime via recursion.

2 Covering algorithm: Detailed overview

We give a detailed overview of the covering phase, its time analysis, and its proof of correctness, ignoring minor technical details. The pseudo-code in Section 3 corresponds to this overview, with technical differences that mainly serve to improve the runtime. Throughout, the gap parameter is $\theta$, and the algorithm parameters are the widths $w_1 < w_2$, the density parameter $d$, and the alignment parameter $\delta$; the precise choices, made in Section 3, yield overall running time $\tilde{O}(n^{2-2/7})$.

The covering phase outputs a set $\mathcal{R}$ of certified boxes. The goal is that $\mathcal{R}$ includes an adequate approximating sequence for some min-cost path $P$ in $G_{x,y}$, which is a sequence of certified boxes $(I_1 \times J_1, \kappa_1), \ldots, (I_k \times J_k, \kappa_k)$ that satisfies:

  1. $I_1, \ldots, I_k$ is a decomposition of $(0,n]$.

  2. Each $I_\ell \times J_\ell$ is an adequate cover of $P_{I_\ell}$, where $P_{I_\ell}$ denotes the minimal subpath of $P$ whose projection to the $x$-axis is $I_\ell$, and adequate cover means that the (vertical) distance from the start vertex (resp. final vertex) of $P_{I_\ell}$ to the lower left (resp. upper right) corner of $I_\ell \times J_\ell$ is at most a constant multiple of $\mathrm{cost}(P_{I_\ell}) + \theta|I_\ell|$.

  3. The sequence is adequately bounded, i.e., $\sum_{\ell} \kappa_\ell |I_\ell| \leq c\,(\mathrm{cost}(P) + \theta n)$, for a constant $c$.

This is a slight oversimplification of Definition 3 (the approximation of a path by a sequence of certified boxes).

The intuition for the second condition is that $I_\ell \times J_\ell$ "almost" spans from the start of $P_{I_\ell}$ to its end. Now $P_{I_\ell}$ might have a vertical extent that is much larger than its horizontal extent $|I_\ell|$, in which case it is impossible to place a square box with corners close to both endpoints of $P_{I_\ell}$. But in that case, $P_{I_\ell}$ has a very high cost (at least its vertical extent minus $|I_\ell|$). The closeness required is therefore adjusted based on $\mathrm{cost}(P_{I_\ell})$, with relaxed requirements if the cost is large.

The output of the min-cost path phase should satisfy the requirements of $\mathrm{GAP\text{-}UB}_\theta$. Lemma 4.1 shows that if the min-cost path phase receives a set $\mathcal{R}$ that contains an adequate approximating sequence for some min-cost path $P$, then it outputs an upper bound on $d_{edit}(x,y)$ that is at most $c'(\mathrm{cost}(P) + \theta n)$ for some constant $c'$. So on input with $d_{edit}(x,y) \leq \theta n$, the output is at most $c\,\theta n$, satisfying the requirements of $\mathrm{GAP\text{-}UB}_\theta$. This formalizes the intuition that an adequate approximating sequence captures enough information to deduce a good bound on $d_{edit}(x,y)$.

Once and for all, we fix a min-cost path $P$. Our task for the covering phase is to ensure that, with high probability, $\mathcal{R}$ includes an adequate approximating sequence for $P$.

A match for an $x$-interval $I$ is a $y$-interval $J$ such that $I \times J$ is an adequate cover of $P_I$. It is easy to show (Proposition 3.3) that this implies that $\Delta(I \times J)$ is at most a constant multiple of $\Delta(P_I) + \theta$. A box $I \times J$ is said to be compatible if $J$ is a match for $I$, and a box sequence is compatible if every box in it is compatible. A compatible certified box sequence whose distance upper bounds are (on average) within a constant factor of the actual cost satisfies the requirements for an adequate approximating sequence. Our covering algorithm will ensure that $\mathcal{R}$ contains such a sequence.

A natural decomposition of $(0,n]$ is into $n/w_1$ intervals of width $w_1$, so that condition 1 holds automatically. The naive approach to building $\mathcal{R}$ is to include certified boxes for enough choices of $J$ to guarantee a match for each $I$. An interval of width $w_1$ is $\delta$-aligned if its upper and lower endpoints are both multiples of $\delta$ (which we require to be an integral power of 2). We restrict attention to the $x$-intervals of the decomposition, called $x$-candidates, and to $\delta$-aligned $y$-intervals of width $w_1$, called $y$-candidates. It can be shown (see Proposition 3.4) that an $x$-interval always has an approximate match that is $\delta$-aligned, provided $\delta$ is small enough relative to $\theta w_1$. (In this overview we fix a single value of $\delta$; the actual algorithm has iterations during which the value of $\delta$ varies, giving improvements in runtime that are unimportant in this overview.) For each $x$-candidate $I$, designate one such match $J_I$ as the canonical match; $I \times J_I$ is the canonical compatible box for $I$.
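In our notation (illustrative, not the paper's), the two candidate families are easy to enumerate:

```python
def x_candidates(n, w1):
    """The width-w1 decomposition of (0, n], as half-open intervals."""
    return [(a, a + w1) for a in range(0, n, w1)]

def y_candidates(n, w1, delta):
    """All delta-aligned y-intervals of width w1 inside (0, n]."""
    return [(b, b + w1) for b in range(0, n - w1 + 1, delta)]
```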

In the exhaustive approach, for each ($x$-candidate, $y$-candidate) pair $(I,J)$, the edit distance $d_{edit}(I \times J)$ is computed in time $O(w_1^2)$, and the certified box $(I \times J, \Delta(I \times J))$ is included. There are $(n/w_1)(n/\delta)$ such boxes, so the time for all edit distance computations is $O(n^2 w_1/\delta)$, which is worse than quadratic. (The factor $w_1/\delta$ can be avoided by standard techniques, but this is not significant to the quest for a sub-quadratic algorithm, so we defer this until the next section.) Note that the number of boxes, $(n/w_1)(n/\delta)$, is truly sub-quadratic for our choice of parameters, so at least the min-cost path phase (which runs in time quasi-linear in $|\mathcal{R}|$) is truly sub-quadratic.

Two natural goals that will improve the runtime are: (1) reduce the amortized time per box needed to certify boxes significantly below $w_1^2$, and (2) reduce the total number of certified boxes created significantly below $(n/w_1)(n/\delta)$. Neither goal is always achievable, and our covering algorithm combines them. In independent work [11, 12], versions of these two goals are combined, where the second goal is accomplished via Grover search, thus yielding a constant factor sub-quadratic time quantum approximation algorithm.

Reducing amortized time for certifying boxes: the dense case algorithm. We aim to reduce the amortized time per certified box to be much smaller than $w_1^2$. We divide our search for certified boxes into iterations, indexed by $\epsilon$ ranging over powers of 2 with $\theta \leq \epsilon \leq 2$. In iteration $\epsilon$, our goal is that for every candidate pair $(I,J)$ with $\Delta(I \times J) \leq \epsilon$, we include the certified box $(I \times J, c_1\epsilon)$ for a fixed constant $c_1$. If we succeed, then for each $I$ and its canonical match $J_I$, the iteration with the smallest $\epsilon$ satisfying $\Delta(I \times J_I) \leq \epsilon$ certifies $I \times J_I$ with an upper bound within a constant factor of $\max(\Delta(I \times J_I), \theta)$, as needed.

For a string $z$ of length $w_1$ and $\epsilon' > 0$, let $N_{\epsilon'}(z)$ be the set of $y$-candidates $J$ with $d_{edit}(z, y_J) \leq \epsilon' w_1$, and let $M_{\epsilon'}(z)$ be the set of $x$-candidates $I$ with $d_{edit}(z, x_I) \leq \epsilon' w_1$. In iteration $\epsilon$, for each $x$-candidate $I$, we will specify a set $S(I)$ of $y$-candidates that includes $N_\epsilon(x_I)$ and is contained in $N_{3\epsilon}(x_I)$. The set of certified boxes $(I \times J, 3\epsilon)$ for all $x$-candidates $I$ and $J \in S(I)$ then satisfies the goal of iteration $\epsilon$ (with $c_1 = 3$).

Iteration $\epsilon$ proceeds in rounds. In each round we select an $x$-candidate $I^*$, called the pivot, for which $S(I^*)$ has not yet been specified. Compute $d_{edit}(x_{I^*}, y_J)$ for all $y$-candidates $J$ and $d_{edit}(x_{I^*}, x_I)$ for all $x$-candidates $I$; these determine $N_{\epsilon'}(x_{I^*})$ and $M_{\epsilon'}(x_{I^*})$ for any $\epsilon'$. For all $I \in M_\epsilon(x_{I^*})$, set $S(I) = N_{2\epsilon}(x_{I^*})$. By the triangle inequality, for each such $I$, $S(I)$ includes $N_\epsilon(x_I)$ and is contained in $N_{3\epsilon}(x_I)$, so we can certify all the boxes $\{I \times J : J \in S(I)\}$ with upper bound $3\epsilon$. Mark the intervals in $M_\epsilon(x_{I^*})$ as fulfilled and proceed to the next round, choosing a new pivot from among the unfulfilled $x$-candidates.

The number of certified boxes produced in a round is $|M_\epsilon(x_{I^*})| \cdot |N_{2\epsilon}(x_{I^*})|$. If this is much larger than roughly $n/\delta$, the number of edit distance computations in the round, then we have significantly reduced the amortized time per certified box. (For example, if $x$ and $y$ each consist of a single repeated symbol, every candidate box will be certified in a single round.) But in the worst case, there are $n/w_1$ rounds, each requiring time $O((n/\delta)w_1^2)$, for an unacceptable total time of $O(n^2 w_1/\delta)$.

Here is a situation where the number of rounds is much less than $n/w_1$. Since the substrings indexed by any two pivots are necessarily more than $\epsilon w_1$ apart in edit distance, the sets $N_{\epsilon/2}(x_{I^*})$ for distinct pivots $I^*$ are disjoint. Now, for a parameter $d$, an $x$-candidate $I$ is $d$-dense for $\epsilon$ if $|N_{\epsilon/2}(x_I)| \geq d$, i.e., $x_I$ is $(\epsilon/2)$-close in normalized edit distance to at least $d$ $y$-candidates; it is $d$-sparse otherwise. If we manage to select a $d$-dense pivot in each round, then the number of rounds is at most $(n/\delta)/d$ and the overall time will be $O((n/\delta)^2 w_1^2/d)$, which is truly sub-quadratic for suitable parameters. But there's no reason to expect that we'll only choose dense pivots; indeed, there need not be any dense pivot.

Let's modify the process a bit. When choosing a potential pivot $I^*$, first test whether or not it is (approximately) $d$-dense. This can be done, with high probability, by randomly sampling roughly $(n/(\delta d))\,\mathrm{polylog}(n)$ $y$-candidates and finding the fraction of the sample that are within $\epsilon/2$ of $x_{I^*}$. If this fraction is too small, then $I^*$ is declared sparse and abandoned as a pivot; otherwise $I^*$ is declared dense, and used as a pivot. With high probability, all tested intervals that are $d$-dense are declared dense, and all tested intervals that are far from $d$-dense are declared sparse, so we assume this is the case. Then all pivots are processed (as above) within the time bound of the previous paragraph. We pay $\tilde{O}((n/(\delta d))\,w_1^2)$ to test each potential pivot (at most $n/w_1$ of them), so the overall time to test potential pivots is $\tilde{O}(n^2 w_1/(\delta d))$.
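A hedged sketch of one round, combining the sampled density test with the triangle-inequality certification (all names and thresholds here are ours and only illustrate the idea):

```python
import random

def pivot_round(pivot, unfulfilled, v_candidates, ned_xy, ned_xx,
                eps, d, num_samples):
    """One round of the dense case algorithm (illustrative).

    ned_xy(I, J): normalized edit distance between x_I and y_J.
    ned_xx(I, I2): normalized edit distance between x_I and x_I2.
    Returns None if the pivot is declared sparse, else a list of
    certified boxes ((I2, J), bound).
    """
    sample = random.sample(v_candidates, min(num_samples, len(v_candidates)))
    frac = sum(ned_xy(pivot, J) <= eps / 2 for J in sample) / len(sample)
    if frac * len(v_candidates) < d:
        return None                     # declared sparse; abandon pivot
    near_J = [J for J in v_candidates if ned_xy(pivot, J) <= 2 * eps]
    near_I = [I2 for I2 in unfulfilled if ned_xx(pivot, I2) <= eps]
    # Triangle inequality: d(x_I2, y_J) <= d(x_I2, x_pivot) + d(x_pivot, y_J),
    # so every pair below is certified with distance bound 3 * eps.
    return [((I2, J), 3 * eps) for I2 in near_I for J in near_J]
```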

Each iteration (with its own $\epsilon$) splits the $x$-candidates into two sets: the set $\mathcal{S}_\epsilon$ of intervals that are declared sparse, and all the rest, for which we have found the desired set $S(I)$. With high probability, every interval in $\mathcal{S}_\epsilon$ is indeed $d$-sparse, but a $d$-sparse interval need not belong to $\mathcal{S}_\epsilon$, since it may belong to $M_\epsilon(x_{I^*})$ for some selected pivot $I^*$.

For every $x$-candidate not in $\mathcal{S}_\epsilon$ we have met the goal for the iteration. If $\mathcal{S}_\epsilon$ is very small for all iterations, then the set of certified boxes will suffice for the min-cost path algorithm to output a good approximation. But if some $\mathcal{S}_\epsilon$ is not small, another approach is needed.

Reducing the number of candidates explored: the diagonal extension algorithm. For each $x$-candidate $I$, it would suffice to certify the single box $I \times J_I$ with a good upper bound; but since $J_I$ is unknown, the exhaustive and dense case approaches both include certified boxes $I \times J$ for all $y$-candidates $J$. The potential savings in the dense case approach comes from certifying many boxes simultaneously using a relatively small number of edit distance computations.

Here's another approach: for each $x$-candidate $I$, try to quickly identify a relatively small subset of $y$-candidates that is guaranteed to include a match, and certify boxes only for those. If we succeed, then the number of boxes we certify is significantly reduced, and even paying quadratic time (in the width) per certified box, we will have a sub-quadratic algorithm.

We need the notion of the diagonal extension of a box. The main diagonal of a square box $B$ is the segment joining its lower left and upper right corners. A square box $B'$ is a diagonal extension of a square subbox $B$ if the main diagonal of $B$ is a subsegment of the main diagonal of $B'$ (see Definition 2). Given a square box $B = I_1 \times J_1$ and an $x$-interval $I \supseteq I_1$, the diagonal extension of $B$ with respect to $I$ is the unique diagonal extension of $B$ having $x$-interval $I$. The key observation (Proposition 3.5) is: if $B$ is an adequate cover of $P_{I_1}$, then the diagonal extension of $B$ with respect to $I$ is an adequate cover of $P_I$.
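A minimal sketch of diagonal extension under the half-open interval convention (our code; boundary clipping to the grid is omitted):

```python
def diagonal_extension(I1, J1, I):
    """Extend the square box I1 x J1 along its main diagonal so that its
    horizontal interval becomes I (which must contain I1)."""
    (a1, b1), (c1, d1), (a, b) = I1, J1, I
    assert b1 - a1 == d1 - c1 and a <= a1 and b1 <= b
    # slide the diagonal down-left by (a1 - a) and up-right by (b - b1)
    return (a, b), (c1 - (a1 - a), d1 + (b - b1))
```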

Now let $w_1 < w_2$ be two widths, with $w_1$ dividing $w_2$ and both dividing $n$. We use the decomposition of $(0,n]$ into intervals of width $w_2$. The set of width-$w_2$ $y$-candidates consists of the $\delta$-aligned vertical intervals of width $w_2$ and has size about $n/\delta$. To identify a small set of potential matches for a $w_2$-interval $I$, we will identify a set $\mathcal{B}(I)$ (of size much smaller than $n/\delta$) of $w_1$-boxes whose $x$-intervals belong to the decomposition of $I$ into width-$w_1$ intervals. For each box $B \in \mathcal{B}(I)$ we determine its diagonal extension $B'$ with respect to $I$, compute $d_{edit}(B')$, and certify $(B', \Delta(B'))$. Our hope is that $\mathcal{B}(I)$ includes a compatible $w_1$-box $B$; the observation above then implies that the diagonal extension of $B$ provides an adequate cover of $P_I$.

Here's how to build $\mathcal{B}(I)$: randomly select a polylog size set $T$ of width-$w_1$ intervals from the decomposition of $I$. For each $I_1 \in T$, compute $\Delta(I_1 \times J)$ for every width-$w_1$ $y$-candidate $J$, and let $L(I_1)$ consist of the $k$ candidates with smallest edit distance to $I_1$. Here $k$ is a parameter (think of $k$ as comparable to the density parameter $d$). $\mathcal{B}(I)$ consists of all boxes $I_1 \times J$ where $I_1 \in T$ and $J \in L(I_1)$.

To bound the runtime: each $I_1 \in T$ requires $n/\delta$ width-$w_1$ edit distance computations, taking time $\tilde{O}((n/\delta)w_1^2)$. The diagonal extension step requires $\tilde{O}(k)$ width-$w_2$ computations, for time $\tilde{O}(k w_2^2)$. Summing over the $n/w_2$ choices of $I$ gives time $\tilde{O}\big((n/w_2)\big((n/\delta)w_1^2 + k w_2^2\big)\big)$.
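Putting these pieces together, a hedged sketch of the construction of $\mathcal{B}(I)$ (reusing `diagonal_extension` from the sketch above; `ned_xy`, `k` and `t` are illustrative names):

```python
import random

def candidate_boxes(I, w1_subintervals, v_candidates, ned_xy, k, t):
    """Sample t width-w1 subintervals of I; for each, keep the k cheapest
    vertical candidates and diagonally extend each resulting box to I."""
    boxes = []
    for I1 in random.sample(w1_subintervals, min(t, len(w1_subintervals))):
        cheapest = sorted(v_candidates, key=lambda J: ned_xy(I1, J))[:k]
        boxes += [diagonal_extension(I1, J, I) for J in cheapest]
    # Each extended box is then certified by an exact edit distance
    # computation on the two width-|I| substrings.
    return boxes
```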

Why should $\mathcal{B}(I)$ include a box that provides an adequate approximation of $P_I$? The intuition behind the choice of $\mathcal{B}(I)$ is that an adequate cover of $P_I$ should typically be among the cheapest boxes of the form $I \times J$, and if $I \times J$ is cheap then, for a randomly chosen $w_1$-subinterval $I_1$ of $I$, the $w_1$-subbox of $I \times J$ lying on its main diagonal above $I_1$ should also be among the cheapest boxes for $I_1$.

Clearly this intuition is faulty: $I_1$ may have many inexpensive matches that lie far from the box we are looking for, all of them much cheaper than it. But notice that in this bad situation there are many $y$-intervals $J$ for which $\Delta(I_1 \times J)$ is smaller than for the match we are looking for, and this is reminiscent of the good situation for the dense case algorithm, where we hope that $x_{I_1}$ has lots of close matches. This suggests combining the two approaches, and leads to our full covering algorithm.

The full covering algorithm. Given the dense case and diagonal extension algorithms, the full covering algorithm is easy to describe. The parameters $w_1, w_2, d, \delta$ are as above. We iterate over $\epsilon$, ranging over powers of 2 with $\theta \leq \epsilon \leq 2$. In iteration $\epsilon$, we first run the dense case algorithm on the width-$w_1$ candidates, and let $\mathcal{S}_\epsilon$ be the set of intervals declared sparse. Then we run the diagonal extension algorithm described earlier (with small modifications): for each $w_2$-interval $I$, select $T$ to consist of independent random selections from the width-$w_1$ decomposition of $I$. For each $I_1 \in T \cap \mathcal{S}_\epsilon$, find the set of vertical candidates $J$ for which $\Delta(I_1 \times J) \leq \epsilon$. Since $I_1$ is (almost certainly) $d$-sparse, the number of such $J$ is at most about $d$. Proceeding as in the diagonal extension algorithm, we produce a set of certified $w_2$-boxes with $x$-interval $I$. Let $\mathcal{R}_1$ (resp. $\mathcal{R}_2$) be the set of all certified boxes produced by the dense case iterations (resp. the diagonal extension iterations). The output is $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2$. (See Figure 1 for an illustration.)

The runtime is the sum of the runtimes of the dense case and diagonal extension algorithms, as analyzed above. Later, we will give a more precise runtime analysis for the pseudo-code.

To finish this extended overview, we sketch the argument that $\mathcal{R}$ satisfies the covering phase requirements.

Claim 2.1.

Let $I$ be an interval of the width-$w_2$ decomposition. Either (1) the output of the dense case algorithm includes a sequence of certified $w_1$-boxes that adequately approximates the subpath $P_I$, or (2) with high probability, the output of the sparse case algorithm includes a single certified $w_2$-box that adequately approximates $P_I$.

(This claim is formalized in Claim 3.12.) Stitching together the subpaths $P_I$ over all $I$ in the width-$w_2$ decomposition implies that $\mathcal{R}$ contains a sequence of certified boxes that adequately approximates $P$.

To prove the claim, we establish a sufficient condition for each of the two conclusions, and show that if the sufficient condition for the second conclusion fails, then the sufficient condition for the first holds.

Figure 1: Illustration of the covering algorithm: blue boxes are low cost boxes in dense strips, while the yellow ones are in sparse strips. The red line corresponds to the path $P$ that we are trying to cover. In each $w_2$-strip, $P$ is covered either by a collection of many $w_1$-boxes or by a diagonal extension of a low cost $w_1$-box. The various boxes might overlap vertically, which is not shown in the picture.

Let $I_1, \ldots, I_m$ denote the width-$w_1$ decomposition of $I$. Every interval $I_\ell$ has a $\delta$-aligned match $J_\ell$. It will be shown (see Proposition 3.4) that $\Delta(I_\ell \times J_\ell)$ is at most a constant multiple of $\Delta(P_{I_\ell}) + \theta$; let $u_\ell$ denote this upper bound. Consider the first alternative in the claim. During the dense case iteration with $\epsilon = 2$, every interval is declared dense (every $y$-candidate is within normalized distance 1 of every $x$-candidate), and $J_\ell \in S(I_\ell)$ for all $\ell$, so every box $I_\ell \times J_\ell$ is certified with bound $3\epsilon = 6$. To get an adequate approximation, we want later iterations to provide much better upper bounds on these boxes, i.e., certifications $(I_\ell \times J_\ell, 3\epsilon_\ell)$ for small enough values $\epsilon_\ell$. By the definition of adequate approximation, it is enough that $\sum_\ell \epsilon_\ell |I_\ell|$ be at most a constant multiple of $\mathrm{cost}(P_I) + \theta|I|$. For each $\ell$, let $\epsilon_\ell$ be the last (smallest) iteration for which $u_\ell \leq \epsilon_\ell$ and $I_\ell \notin \mathcal{S}_{\epsilon_\ell}$ (which is well defined, since at $\epsilon = 2$ no interval is declared sparse). Since $I_\ell \notin \mathcal{S}_{\epsilon_\ell}$ and $\Delta(I_\ell \times J_\ell) \leq u_\ell \leq \epsilon_\ell$, the box $(I_\ell \times J_\ell, 3\epsilon_\ell)$ is certified. The collection $\{(I_\ell \times J_\ell, 3\epsilon_\ell) : \ell \leq m\}$ is a sequence of certified boxes that satisfies the first two conditions for an adequate approximation of $P_I$. The third condition will follow if:

$$\sum_{\ell} \epsilon_\ell\,|I_\ell| \;\leq\; c_2\,\big(\mathrm{cost}(P_I) + \theta\,|I|\big) \qquad\qquad (1)$$

for a constant $c_2$, so this is sufficient to imply the first conclusion of the claim.

Next consider what we need for the second alternative to hold. Let $\mathcal{S}_\epsilon$ be the set of intervals declared sparse in iteration $\epsilon$. An interval $I_\ell$ is a winner (for iteration $\epsilon$) if $I_\ell \in \mathcal{S}_\epsilon$ and $u_\ell \leq \epsilon$; let $W_\epsilon$ be the set of winners. In iteration $\epsilon$ of the diagonal extension algorithm, we sample elements of the width-$w_1$ decomposition of $I$. If for at least one iteration our sample includes a winner $I_\ell$, then the second conclusion of the claim holds: the box $I_\ell \times J_\ell$ is found (its cost is at most $u_\ell \leq \epsilon$) and extended diagonally to a $w_2$-box, and by the diagonal extension property the extension is an adequate cover of $P_I$, which we certify with its exact edit distance.

Thus, for the second alternative to fail with nonnegligible probability, the winners must form a small fraction of the decomposition in every iteration:

$$|W_\epsilon| \;\leq\; \beta m \quad \text{for every iteration } \epsilon, \qquad\qquad (2)$$

where $\beta$ is inversely proportional to the number of sampled intervals.

We argue that if the failure condition (2) holds, then the success condition (1) holds. Multiply (2) by $\epsilon$ and sum over iterations to get:

$$\sum_{\ell \leq m} \;\sum_{\epsilon\,:\,I_\ell \in W_\epsilon} \epsilon \;=\; \sum_{\epsilon} \epsilon\,|W_\epsilon| \;\leq\; \beta m \sum_\epsilon \epsilon \;\leq\; 4\beta m. \qquad\qquad (3)$$

For a given interval $I_\ell$, compare $\epsilon_\ell$ with the iterations $\epsilon$ for which $I_\ell \in W_\epsilon$. Either $\epsilon_\ell \leq 2u_\ell$, or $\epsilon_\ell > 2u_\ell$. The latter implies that in iteration $\epsilon_\ell/2$ we have $u_\ell \leq \epsilon_\ell/2$ and, by the choice of $\epsilon_\ell$ as the smallest non-sparse iteration, $I_\ell \in \mathcal{S}_{\epsilon_\ell/2}$; hence $I_\ell \in W_{\epsilon_\ell/2}$, and then $\epsilon_\ell/2$ is upper bounded by the inner sum on the left of (3) corresponding to $I_\ell$. In either case, $\epsilon_\ell \leq 2u_\ell + 2\sum_{\epsilon : I_\ell \in W_\epsilon} \epsilon$. Summing over $\ell$, applying (3), and multiplying by $w_1$ gives:

$$\sum_\ell \epsilon_\ell\,|I_\ell| \;\leq\; 2\sum_\ell u_\ell\,|I_\ell| + 8\beta m w_1 \;=\; O\big(\mathrm{cost}(P_I) + \theta|I|\big) + 8\beta|I|,$$

and choosing the number of samples large enough that $\beta \leq \theta$ makes the right hand side $O(\mathrm{cost}(P_I) + \theta|I|)$, as required for (1).

This completes the overview of the covering algorithm.

3 Covering algorithm: Pseudo-code and analysis

The pseudo-code consists of CoveringAlgorithm, which calls the procedures DenseStripRemoval (the dense case algorithm) and SparseStripExtensionSampling (the diagonal extension algorithm). These are abbreviated, respectively, CA, DSR and SSES. The technical differences between the pseudo-code and the informal description mainly serve to improve the runtime analysis.

3.1 Pseudo-code

The parameters of CA are as described in the overview: $x, y$ are the input strings of length $n$, $\theta$ comes from $(0,1)$, and the widths $w_1, w_2$ and the density parameter $d$ are integral powers of 2, as are the auxiliary input parameters. The output is a set of certified boxes. The algorithm uses two global constants, the first of which is needed for Proposition 3.8.

We use a subroutine SMALL-ED, which takes strings $u, v$ of length $w$ and a parameter $k$, and outputs $d_{edit}(u,v)$ if $d_{edit}(u,v) \leq k$, and otherwise outputs $\infty$. The algorithm of [19] implements SMALL-ED in time $O((k+1)w)$.
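A standard banded dynamic program in the spirit of [19] implements SMALL-ED; the following sketch (ours) only fills DP cells within $k$ of the diagonal, giving $O((k+1)w)$ time:

```python
import math

def small_ed(x, y, k):
    """Return d_edit(x, y) if it is at most k, and infinity otherwise."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return math.inf
    INF = k + 1                         # any value > k acts as "too far"
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cur[j] = min(prev[j] + 1,                           # delete
                         cur[j - 1] + 1,                        # insert
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # substitute
        prev = cur
    return prev[m] if prev[m] <= k else math.inf
```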

One technical difference from the overview is that the pseudo-code saves time by restricting the search for certified boxes to a portion of the grid close to the main diagonal. Recall that $\mathrm{GAP\text{-}UB}_\theta$ has two requirements: the output must upper bound $d_{edit}(x,y)$ (which will be guaranteed by the requirement that $\mathcal{R}$ contains no falsely certified boxes), and if $d_{edit}(x,y) \leq \theta n$, the output must be at most $c\,\theta n$ for some constant $c$. We therefore design our algorithm assuming $d_{edit}(x,y) \leq \theta n$, in which case every min-cost path consists entirely of points within $\theta n$ of the main diagonal. So we restrict our search for certified boxes as follows: consider equally spaced, overlapping boxes of width $2\theta n$ lying along the main diagonal. Together these boxes cover all points within $\theta n$ of the main diagonal.
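For illustration, here is one way (ours; the constants only need to satisfy the covering property) to generate such a family of overlapping diagonal boxes:

```python
def diagonal_boxes(n, theta):
    """Square boxes of width 2*theta*n, spaced theta*n apart along the main
    diagonal; together they cover every (i, j) with |i - j| <= theta*n."""
    w = min(int(2 * theta * n), n)
    step = max(w // 2, 1)
    boxes = [((s, s + w), (s, s + w)) for s in range(0, n - w + 1, step)]
    if boxes[-1][0][1] < n:     # make sure the corner (n, n) is covered
        boxes.append(((n - w, n), (n - w, n)))
    return boxes
```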

The algorithm of the overview is executed separately on each of these boxes. Within each of these executions, we iterate over $\epsilon = 2, 1, 1/2, \ldots, \theta$ (organized slightly differently than in the overview). In each iteration we apply the dense case algorithm and the diagonal extension algorithm as in the overview. The output is the union, over all boxes and all iterations, of the sets of certified boxes produced.

In the procedures DSR and SSES, the input is an induced grid graph corresponding to a box $I \times J$, as described in the "framework" part of Section 1. The procedure DSR, on input $I \times J$, sets $\mathcal{I}$ to be the width-$w_1$ decomposition of $I$ (the $x$-candidates) and $\mathcal{J}$ to be the set of $\delta$-aligned $y$-candidates. As in the overview, the dense case algorithm produces a set of certified boxes (called $\mathcal{R}_1$ in the pseudo-code) and a set $\mathcal{S}$ of intervals declared sparse. SSES is invoked if $\mathcal{S}$ is non-empty and iterates over all $w_2$-intervals in the decomposition of $I$. The algorithm skips a $w_2$-interval if it contains no subinterval from $\mathcal{S}$, and otherwise selects a sample of its width-$w_1$ subintervals from $\mathcal{S}$. For each sample interval it finds the vertical candidates within normalized distance $\epsilon$, does a diagonal extension of each resulting box to a $w_2$-box, and certifies each extension with an exact edit distance computation.

There are a few parameter changes from the overview that provide some improvement in the time analysis: during iteration $\epsilon$, rather than take our vertical candidates from a grid aligned to a single fixed $\delta$, we can afford a coarser grid whose alignment scales with $\epsilon$. Also, the local density parameter used by DSR and SSES is rescaled accordingly during iteration $\epsilon$.

There is one counterintuitive quirk in SSES: each certified box is replicated several times with geometrically increasing distance bounds. This is permissible (increasing the distance bound cannot decertify a box), but seems silly (why add the same box with a higher distance bound?). It is simply a convenient technical device to ensure that the second phase min-cost path algorithm gives a good approximation.

0:  Strings $x, y$ of length $n$; $\theta \in (0,1)$; parameters $w_1, w_2$ and $d$. $n$, $w_1$, $w_2$, $d$ and $1/\theta$ are powers of 2.
0:  A set $\mathcal{R}$ of certified boxes in $(0,n] \times (0,n]$.
1:  Initialization: $\mathcal{R} \leftarrow \emptyset$, $\mathcal{S} \leftarrow \emptyset$.
2:  Let $\mathcal{T}$ be the set of equally spaced boxes of width $2\theta n$ along the main diagonal.
3:  for $T \in \mathcal{T}$ do
4:     Let $G_T$ be the grid graph induced on $T$.
5:     for $\epsilon = 2, 1, 1/2, \ldots, \theta$ do
6:        Set $\delta$, the alignment parameter for this iteration.
7:        Invoke DSR on $G_T$ with parameter $\epsilon$, to get $\mathcal{R}_1$ and $\mathcal{S}_1$.
8:        if $\mathcal{S}_1 \neq \emptyset$ then
9:           Invoke SSES on $G_T$ with $\epsilon$ and $\mathcal{S}_1$, to get $\mathcal{R}_2$.
10:        else
11:           $\mathcal{R}_2 \leftarrow \emptyset$.
12:        end if
13:        Add the items from $\mathcal{R}_1 \cup \mathcal{R}_2$ to $\mathcal{R}$ and those from $\mathcal{S}_1$ to $\mathcal{S}$.
14:     end for
15:  end for
16:  Output $\mathcal{R}$.
Algorithm 1 CA (CoveringAlgorithm)
0:  $G_{I \times J}$ for some box $I \times J$ with $|I| = |J|$; $\epsilon \geq \theta$; the endpoints of $I$ and $J$ are multiples of $w_1$ and $\delta$, respectively.
0:  A set $\mathcal{S}$ which is a subset of the $w_1$-decomposition of $I$, and a set $\mathcal{R}_1$ of $\delta$-aligned certified $w_1$-boxes, all with distance bound $3\epsilon$.
1:  Initialization: $\mathcal{S} \leftarrow \emptyset$. $\mathcal{R}_1 \leftarrow \emptyset$.
2:  $\mathcal{J}$, the set of $y$-candidates, is the set of width-$w_1$ $\delta$-aligned subintervals of $J$ (having endpoints a multiple of $\delta$).
3:  while $\mathcal{I}$, the set of unfulfilled $x$-candidates, is non-empty do
4:     Pick $I^* \in \mathcal{I}$.
5:     Sample intervals $J \in \mathcal{J}$ uniformly at random and for each test if $\Delta(I^* \times J) \leq \epsilon/2$.
6:     if $\Delta(I^* \times J) \leq \epsilon/2$ for at most a threshold fraction of the sampled $J$'s,