Sketching, Streaming, and Fine-Grained Complexity of (Weighted) LCS


Karl Bringmann    Bhaskar Ray Chaudhury
Abstract

We study sketching and streaming algorithms for the Longest Common Subsequence problem (LCS) on strings of small alphabet size σ. For the problem of deciding whether the LCS of strings x, y has length at least k, we obtain a sketch size and streaming space usage of O(k^{σ−1} log k). We also prove matching unconditional lower bounds.

As an application, we study a variant of LCS where each alphabet symbol is equipped with a weight that is given as input, and the task is to compute a common subsequence of maximum total weight. Using our sketching algorithm, we obtain an O(min{nm, n + m^σ})-time algorithm for this problem, on strings of length n and m, with n ≥ m. We prove optimality of this running time up to lower order factors, assuming the Strong Exponential Time Hypothesis.

algorithms, SETH, communication complexity, run-length encoding

Karl Bringmann — Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany, kbringma@mpi-inf.mpg.de
Bhaskar Ray Chaudhury — Max Planck Institute for Informatics, Saarland Informatics Campus, Graduate School of Computer Science, Saarbrücken, Germany, braycha@mpi-inf.mpg.de
Copyright Karl Bringmann and Bhaskar Ray Chaudhury. Subject classification: F.2.2 Nonnumerical Algorithms and Problems. 38th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2018), December 11–13, 2018, Ahmedabad, India. Editors: Sumit Ganguly and Paritosh Pandya. Series Volume 122, Article No. 40.

1 Introduction

1.1 Sketching and Streaming LCS

In the Longest Common Subsequence problem (LCS) we are given strings x and y, and the task is to compute a longest string that is a subsequence of both x and y. This problem has been studied extensively, since it has numerous applications in bioinformatics (e.g. comparison of DNA sequences [5]), natural language processing (e.g. spelling correction [40, 49]), file comparison (e.g. the UNIX diff utility [23, 38]), etc. Motivated by big data applications, in the first part of this paper we consider space-restricted settings as follows:

  • LCS Sketching: Alice is given x and Bob is given y. Both are also given a number k. Alice and Bob compute sketches sk(x) and sk(y) and send them to a third person, the referee, who decides whether the LCS of x and y has length at least k. The task is to minimize the size of the sketch (i.e., its number of bits) as well as the running time of Alice and Bob (encoding) and of the referee (decoding).

  • LCS Streaming: We are given k, and we scan the string x from left to right once, and then the string y from left to right once. After that, we need to decide whether the LCS of x and y has length at least k. We want to minimize the space usage as well as the running time.

Analogous problem settings for the related edit distance have found surprisingly good solutions after a long line of work [11, 29, 46, 16]. For LCS, however, strong unconditional lower bounds are known for sketching and streaming: Even for k = 1, the sketch size and streaming memory must be Ω(n) bits, since the randomized communication complexity of this problem is Ω(n) [47]. Similarly strong results hold even for approximating the LCS length [47], see also [35]. However, these impossibility results construct strings over alphabets of size Ω(n).

In contrast, in this paper we focus on strings defined over a fixed alphabet Σ (of constant size σ). This is well motivated, e.g., for binary files (σ = 2), DNA sequences (σ = 4), or English text (σ = 26 plus punctuation marks). We therefore suppress factors depending only on σ in O-notation throughout the whole paper. Surprisingly, this setting was ignored in the sketching and streaming literature so far; the only known upper bounds also work in the case of large alphabets and are thus of size Ω(n).

Before stating our first main result, we define a run in a string as a maximal (non-extendable) repetition of a single character. For example, the string aabbba has a run of the character b of length 3. Our first main result is the following deterministic sketch.

Theorem.

Given a string x of length n over alphabet Σ and an integer k, we can compute a subsequence C_k(x) of x such that (1) |C_k(x)| = O(k^σ), (2) C_k(x) consists of O(k^{σ−1}) runs of length at most k, and (3) any string of length at most k is a subsequence of x if and only if it is a subsequence of C_k(x). Moreover, C_k(x) is computed by a one-pass streaming algorithm with memory O(k^{σ−1} log k) and running time O(1) per symbol of x.

Note that we can store C_k(x) using O(k^{σ−1} log k) bits, since each run can be encoded using O(log k) bits. This directly yields a solution for LCS sketching: Alice and Bob compute the sketches C_k(x) and C_k(y), and the referee computes an LCS of C_k(x) and C_k(y). If this LCS has length at least k, then x and y also have LCS length at least k. Similarly, if x and y have a common subsequence z of length exactly k, then z is also a subsequence of C_k(x) and C_k(y), and thus their LCS length is at least k, showing correctness. The sketch size is O(k^{σ−1} log k) bits, the encoding time is O(n), and the decoding time is O(k^{2σ}), as LCS can be computed in quadratic time in the string length O(k^σ).
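As a concrete reference point for this encoding step, here is a minimal one-pass sketch (function name ours) that computes the run-length encoding of a string as (symbol, length) pairs:

```python
def run_length_encode(s):
    """Return the run-length encoding of s as a list of (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            # extend the current run
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            # start a new run
            runs.append((ch, 1))
    return runs

print(run_length_encode("aabbba"))  # [('a', 2), ('b', 3), ('a', 1)]
```

Storing each pair with O(log k) bits per run length is what gives the stated sketch size.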

We similarly obtain an algorithm for LCS streaming by computing C_k(x), then C_k(y), and finally computing an LCS of C_k(x) and C_k(y). The space usage of this streaming algorithm is O(k^{σ−1} log k), and the running time is O(1) per symbol of x and y, plus O(k^{2σ}) for the last step.
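The final step can use the textbook quadratic-time dynamic program for LCS on the two (short) decoded sketches; a minimal sketch (function name ours), keeping only two rows of the table:

```python
def lcs_length(x, y):
    """Classic O(|x|*|y|) dynamic program for the LCS length."""
    m = len(y)
    prev = [0] * (m + 1)  # row for the prefix of x processed so far
    for a in x:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if a == y[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[m]

print(lcs_length("abcba", "bacb"))  # 3, e.g. attained by "bcb"
```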

These size, space, and time bounds are surprisingly good for small alphabets, but quickly deteriorate with larger alphabet size. For very large alphabet size, this deterioration was to be expected due to the Ω(n) lower bound for alphabet size Ω(n) from [47]. We further show that this deterioration is necessary by proving optimality of our sketch in several senses:

  • We show that for any k and σ there exists a string x (of length O(k^σ)) such that no string of length o(k^σ) has the same set of subsequences of length at most k. Similarly, this string cannot be replaced by any string consisting of o(k^{σ−1}) runs without affecting the set of subsequences of length at most k. This shows optimality of Theorem 1.1 among sketches that replace x by another string (not necessarily a subsequence of x) and then compute an LCS of the two replacement strings. See Theorem 3.

  • More generally, we study the Subsequence Sketching problem: Alice is given a string x and a number k and computes sk(x). Bob is then given sk(x) and a string y of length k and decides whether y is a subsequence of x. Observe that any solution for LCS sketching or streaming with size/memory S yields a solution for subsequence sketching with sketch size S. (For LCS sketching this argument only uses that we can check whether y is a subsequence of x by testing whether the LCS length of x and y equals |y|. For LCS streaming we use the memory state right after reading x as the sketch and then use the same argument.) Hence, any lower bound for subsequence sketching yields a lower bound for LCS sketching and streaming. We show that any deterministic subsequence sketch has size Ω(k^{σ−1} log k) in the worst case over all strings x. This matches the run-length encoding of C_k(x) even up to the log k-factor. If we restrict to strings x of length n, we still obtain a sketch size lower bound of Ω(min{k^{σ−1}, n/k} log k). See Theorem 3.

  • Finally, randomization does not help either: We show that any randomized subsequence sketch, where Bob may err in deciding whether y is a subsequence of x with small constant probability, has size Ω(k^{σ−1}), even restricted to strings of length O(k^σ). See Theorem 3.

We remark that Theorem 1.1 only makes sense if k^σ = O(n), as otherwise storing x explicitly yields a smaller sketch. Although this is not the best motivated regime of LCS in practice, it corresponds to testing whether x and y are “very different” or “not very different”. This setting naturally occurs, e.g., if one string is much longer than the other, since then LCS(x, y) ≤ min{|x|, |y|} = m ≪ n. We therefore think that studying this regime is justified for the fundamental problem LCS.

1.2 WLCS: In between min-quadratic and rectangular time

As an application of our sketch, we determine the (classic, offline) time complexity of a weighted variant of LCS, which we discuss in the following.

A textbook dynamic programming algorithm computes the LCS of given strings x, y of length n in time O(n²). A major result in fine-grained complexity shows that further improvements by polynomial factors would refute the Strong Exponential Time Hypothesis (SETH) [1, 13] (see Section 5 for a definition). In case x and y have different lengths n and m, with n ≥ m, Hirschberg’s algorithm computes their LCS in time Õ(n + m²) [22], and this is again near-optimal under SETH. This running time could be described as “min-quadratic”, as it is quadratic in the minimum of the two string lengths. In contrast, many other dynamic programming type problems have “rectangular” running time Õ(nm) (by Õ-notation we ignore lower order factors), with a matching lower bound of (nm)^{1−o(1)} under SETH, e.g., Fréchet distance [4, 12], dynamic time warping [1, 13], and regular expression pattern matching [43, 10].

Part of this paper is motivated by the intriguing question whether there are problems with intermediate running time, between “min-quadratic” and “rectangular”. Natural candidates are generalizations of LCS, such as the weighted variant WLCS as defined in [1]: Here we have an additional weight function w : Σ → N, and the task is to compute a common subsequence of x and y with maximum total weight. This problem is a natural variant of LCS that, e.g., came up in a SETH-hardness proof of LCS [1]. It is not to be confused with other weighted variants of LCS that have been studied in the literature, such as a statistical distance measure where, given the probability of every symbol’s occurrence at every text location, the task is to find a long and likely subsequence [6, 18], a variant of LCS that favors consecutive matches [36], or edit distance with given operation costs [13].

Clearly, WLCS inherits the hardness of LCS and thus requires time (n + m²)^{1−o(1)}. However, the matching upper bound given by Hirschberg’s algorithm only works as long as the weight function w is fixed (then the hidden constant depends on the largest weight). Here, we focus on the variant where the weight function is part of the input. In this case, the basic O(nm)-time dynamic programming algorithm is the best known.

Our second main result is to settle the time complexity of WLCS in terms of n and m for any fixed constant alphabet Σ, up to lower order factors and assuming SETH.

Theorem. WLCS can be solved in time O(min{nm, n + m^σ}). Assuming SETH, WLCS requires time min{nm, n + m^σ}^{1−o(1)}, even restricted to n = Θ(m^γ) and |Σ| = σ, for any constants γ ≥ 1 and σ ≥ 2.

In particular, for σ ≥ 3 and appropriately related n and m, the time complexity of WLCS is indeed “intermediate”, in between “min-quadratic” and “rectangular”! To the best of our knowledge, this is the first result of fine-grained complexity establishing such an intermediate running time.

To prove Theorem 1.2 we first observe that the usual O(nm) dynamic programming algorithm also works for WLCS. For the other term O(n + m^σ), we compress x by running the sketching algorithm from Theorem 1.1 with k = m. This yields a string C_m(x) of length O(m^σ) such that WLCS has the same value on (x, y) and (C_m(x), y), since every subsequence of length at most m of x is also a subsequence of C_m(x), and vice versa. Running the O(nm)-time algorithm on (C_m(x), y) would yield total time O(n + m^{σ+1}), which is too slow by a factor m. To obtain an improved running time, we use the fact that C_m(x) consists of O(m^{σ−1}) runs. We design an algorithm for WLCS on a run-length encoded string consisting of r runs and an uncompressed string of length m with running time O(rm). This generalizes algorithms for LCS with one run-length encoded string [7, 20, 37]. Together, we obtain time O(n + m^σ). We then show a matching SETH-based lower bound by combining our construction of incompressible strings from our sketching lower bounds (Theorem 3) with the by-now classic SETH-hardness proof of LCS [1, 13].

1.3 Further Related Work

Analyzing the running time in terms of multiple parameters, such as the two string lengths and the solution length, has a long history for LCS [8, 9, 19, 22, 24, 26, 42, 44, 51]. Recently, tight SETH-based lower bounds have been shown for all these algorithms [14]. In the second part of this paper, we perform a similar complexity analysis on a weighted variant of LCS. This follows the majority of recent work on LCS, which focused on transferring the early successes and techniques to more complicated problems, such as longest common increasing subsequence [39, 33, 52, 17], tree LCS [41], and many more generalizations and variants of LCS, see, e.g., [32, 15, 48, 28, 3, 34, 30, 21, 45, 25]. For brevity, here we ignore the equally vast literature on the closely related edit distance.

1.4 Notation

For a string x of length n over alphabet Σ, we write x[i] for its i-th symbol, x[i..j] for the substring from the i-th to the j-th symbol, and |x| for its length. For n ∈ N we write [n] := {1, …, n}. For strings x, y we write xy for their concatenation, and for ℓ ∈ N we write x^ℓ for the ℓ-fold repetition xx⋯x. A subsequence of x is any string of the form z = x[i_1] x[i_2] ⋯ x[i_ℓ] with i_1 < i_2 < ⋯ < i_ℓ; in this case we write z ⪯ x. A run in x is a maximal substring x[i..j] consisting of a single alphabet letter. Recall that we suppress factors depending only on σ := |Σ| in O-notation.

2 Sketching LCS

In this section we design a sketch for LCS, proving Theorem 1.1. Consider any string x over alphabet Σ. We call x an m-permutation string if we can partition x = s_1 s_2 ⋯ s_m such that each s_i contains each symbol in Σ at least once. Observe that an m-permutation string contains any string of length at most m over the alphabet Σ as a subsequence.
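This observation can be checked mechanically; the following sketch (helper name ours) verifies it exhaustively for a small 3-permutation string over the alphabet {a, b}:

```python
from itertools import product

def is_subsequence(z, x):
    """Greedy left-to-right check whether z is a subsequence of x."""
    it = iter(x)
    return all(ch in it for ch in z)  # 'in' advances the iterator

# "ababab" is a 3-permutation string over {a, b}: partition it into the
# blocks "ab" | "ab" | "ab", each containing every symbol at least once.
perm = "ab" * 3
# Every string of length <= 3 over {a, b} is a subsequence of it:
assert all(is_subsequence("".join(z), perm)
           for L in range(4) for z in product("ab", repeat=L))
```

The greedy check is the standard two-pointer subsequence test; it also underlies the matching arguments used throughout this section.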

Claim.

Consider any string of the form u c v, where u, v are strings over alphabet Σ and c ∈ Σ. Let Σ' ⊆ Σ with c ∈ Σ'. If some suffix of u consisting only of symbols in Σ' is a k-permutation string over Σ', then for all strings z of length at most k we have z ⪯ u c v if and only if z ⪯ u v.

Proof.

The “if”-direction is immediate. To prove the “only if”, write u = u' s, where s is a suffix of u that consists only of symbols in Σ' and is a k-permutation string over Σ'. Consider any subsequence z of u c v of length at most k, and split z = z_1 z_2 z_3 such that z_1 ⪯ u', z_2 ⪯ s c, and z_3 ⪯ v. Since s and c consist only of symbols in Σ', the string z_2 is a string over Σ' of length at most k. Since s is a k-permutation string over Σ', it follows that z_2 is a subsequence of s. Hence, z = z_1 z_2 z_3 ⪯ u' s v = u v. ∎

The above claim immediately gives rise to the following one-pass streaming algorithm.

1:initialize C as the empty string
2:for all  i from 1 to n  do
3:   if for all Σ' ⊆ Σ with x[i] ∈ Σ', no suffix of C consisting only of symbols in Σ' is a k-permutation string over Σ' then
4:      set C := C x[i]
5:return C
Algorithm 1 Outline for computing C_k(x), given a string x and an integer k

By Claim 2, the string C = C_k(x) returned by this algorithm satisfies the subsequence property (3) of Theorem 1.1. Note that any run in C has length at most k, since otherwise for Σ' = {a} we would obtain a k-permutation string a^k followed by another symbol a, so that Claim 2 would apply. We now show the upper bounds on the length and the number of runs. Consider a substring y of C containing symbols only from a subset Σ' ⊆ Σ of size s. We claim that y consists of at most r(s) runs, where r(1) = 1 and r(s) = (k + 1)(r(s − 1) + 1), so that r(s) = O(k^{s−1}) for constant s. We prove our claim by induction on s. For s = 1, the claim holds trivially, as y is then a single run. For s ≥ 2, greedily partition y = b_1 b_2 ⋯ b_t b', where each b_i is the minimal prefix of the remainder that contains every symbol of Σ' at least once, and where the leftover b' misses at least one symbol of Σ'. We have t ≤ k, since otherwise the prefix b_1 ⋯ b_k of y would be a k-permutation string over Σ', in which case the algorithm would have deleted the next symbol of y. By minimality, the last symbol of b_i occurs in b_i only at its last position, so b_i without its last symbol contains symbols from at most s − 1 elements of Σ', and thus by induction hypothesis consists of at most r(s − 1) runs; the same holds for b'. We conclude that the number of runs in y is at most (t + 1)(r(s − 1) + 1) ≤ (k + 1)(r(s − 1) + 1) = r(s). Thus the number of runs of C_k(x) is at most r(σ) = O(k^{σ−1}), and since each run has length at most k we obtain |C_k(x)| = O(k^σ).

Algorithm 2 shows how to efficiently implement Algorithm 1 in time O(1) per symbol of x. We maintain a counter c_{Σ'} (initialized to 0) and a set S_{Σ'} (initialized to Σ') for every Σ' ⊆ Σ, with the following meaning. After reading x[1..i], let j be minimal such that the suffix C[j..|C|] of the current string C consists only of symbols in Σ'. Then c_{Σ'} is the maximum number p such that a prefix of C[j..|C|] is a p-permutation string over Σ'. Moreover, let j' be minimal such that C[j..j'] is such a c_{Σ'}-permutation string. Then S_{Σ'} is the set of symbols of Σ' that do not appear in C[j'+1..|C|]. In other words, in the future we only need to read the symbols in S_{Σ'} to complete a (c_{Σ'}+1)-permutation string. In particular, when reading the next symbol x[i], in order to check whether Claim 2 applies we only need to test whether c_{Σ'} ≥ k for any Σ' with x[i] ∈ Σ'. Updating the counters and sets is straightforward, and shown in Algorithm 2.

1:set c_{Σ'} := 0 and S_{Σ'} := Σ' for all Σ' ⊆ Σ
2:set C to the empty string
3:for all  i from 1 to n  do
4:   if  c_{Σ'} < k for all Σ' ⊆ Σ with x[i] ∈ Σ'  then
5:      set C := C x[i]
6:      for all  Σ' ⊆ Σ such that x[i] ∈ Σ'  do
7:         set S_{Σ'} := S_{Σ'} ∖ {x[i]}
8:         if  S_{Σ'} = ∅  then
9:            set c_{Σ'} := c_{Σ'} + 1
10:            set S_{Σ'} := Σ'
11:      for all  Σ' ⊆ Σ such that x[i] ∉ Σ'  do
12:         set c_{Σ'} := 0
13:         set S_{Σ'} := Σ'
Algorithm 2 Computing C_k(x) in time O(1) per symbol of x

Since we assume σ to be constant, there are O(1) subsets Σ' ⊆ Σ, so each iteration of the loop runs in time O(1), and thus the algorithm determines C_k(x) in time O(n). This finishes the proof of Theorem 1.1.
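To make the construction concrete, here is a direct, unoptimized transliteration of Algorithm 1 (it spends more than constant time per symbol, and all names are ours); it recomputes the deletion condition of Claim 2 from scratch for every symbol:

```python
from itertools import combinations

def sketch(x, k, alphabet):
    """Compute C_k(x): append each symbol of x unless, for some subset S of
    the alphabet containing it, a suffix of the current string that uses
    only symbols of S is a k-permutation string over S."""
    subsets = [set(c) for r in range(1, len(alphabet) + 1)
               for c in combinations(alphabet, r)]
    C = []
    for ch in x:
        redundant = False
        for S in (s for s in subsets if ch in s):
            # maximal suffix of C consisting only of symbols in S
            i = len(C)
            while i > 0 and C[i - 1] in S:
                i -= 1
            # greedily count consecutive blocks each containing all of S;
            # blocks >= k iff some suffix is a k-permutation string over S
            need, blocks = set(S), 0
            for c in C[i:]:
                need.discard(c)
                if not need:
                    blocks, need = blocks + 1, set(S)
            if blocks >= k:
                redundant = True
                break
        if not redundant:
            C.append(ch)
    return "".join(C)

print(sketch("a" * 10, 3, "ab"))  # aaa
```

On this input the sketch keeps only the first k copies of the run, matching the length-k cap on runs proved above; Algorithm 2 replaces the rescans by the counters c_{Σ'} and sets S_{Σ'}.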

3 Optimality of the Sketch

In this section we show that the sketch is optimal in many ways. First, we show that the length O(k^σ) and the number of runs O(k^{σ−1}) are optimal for any sketch that replaces x by any other string with the same set of subsequences of length at most k.

Theorem. For any k and σ there exists a string x over an alphabet of size σ such that any string y with the same set of subsequences of length at most k as x satisfies |y| = Ω(k^σ), and y consists of Ω(k^{σ−1}) runs.

Let Σ = {1, …, σ} and m := ⌊k/σ⌋. We construct a family of strings x_1, …, x_σ recursively as follows, where x_s is a string over alphabet {1, …, s}: x_1 := 1^m, and x_s := (x_{s−1} s)^m for 2 ≤ s ≤ σ.

Theorem 3 now follows from the following inductive claim, applied with s = σ.

Claim.

For any string y with the same set of subsequences of length at most s·m as x_s, we have |y| ≥ m^s, and the number of runs in y is at least m^{s−1}.

Proof.

We use induction on s. For s = 1, since x_1 = 1^m we have 1^m ⪯ y, so |y| ≥ m, and y contains no symbol other than 1, as any such symbol would be a length-1 subsequence of y but not of x_1; hence the number of runs in y is exactly 1. For any s ≥ 2, if y contained fewer than m occurrences of s then s^m ⪯ x_s but s^m ⋠ y, and similarly if y contained more than m occurrences of s then s^{m+1} ⪯ y but s^{m+1} ⋠ x_s (note that |s^{m+1}| = m + 1 ≤ s·m, and thus s^{m+1} is a valid test string). This implies that y contains exactly m occurrences of s, and thus (since y contains no symbol outside {1, …, s}) we can write y = y_1 s y_2 s ⋯ y_m s y_{m+1}, where each y_i is a string on alphabet {1, …, s−1}. Hence, for any i ∈ [m] and string z over {1, …, s−1} of length at most (s−1)·m, we have s^{i−1} z s^{m−i+1} ⪯ y if and only if z ⪯ y_i. Similarly, s^{i−1} z s^{m−i+1} ⪯ x_s holds if and only if z ⪯ x_{s−1}. Since y is equivalent to x_s by assumption, we obtain that y_i is equivalent to x_{s−1} with respect to subsequences of length at most (s−1)·m. By induction hypothesis, y_i has length at least m^{s−1} and consists of at least m^{s−2} runs. Summing over all i ∈ [m], string y has length at least m^s and consists of at least m^{s−1} runs. ∎

Note that the run-length encoding of C_k(x) has bit length O(k^{σ−1} log k), since C_k(x) consists of O(k^{σ−1}) runs, each of which can be encoded using O(log k) bits. We now show that this sketch has optimal size, even in the setting of Subsequence Sketching: Alice is given a string x of length n over alphabet Σ and a number k and computes sk(x). Bob is then given sk(x) and a string y of length at most k, and decides whether y is a subsequence of x. (In the introduction, we used a slightly different definition where Bob is given a string of length exactly k. This might seem slightly weaker, but in fact the two formulations are equivalent, up to increasing σ by 1: introduce a fresh symbol # and replace x by #^k x and y by #^{k−|y|} y. Then y ⪯ x if and only if #^{k−|y|} y ⪯ #^k x, and #^{k−|y|} y has fixed length k.)

We construct the following hard strings for this setting, similarly to the previous construction. Let Σ = {1, …, σ}, m := ⌊k/σ⌋, and d := m^{σ−1}. Consider any vector v = (v_1, …, v_d), where v_j ∈ [m]. We define the string w_v recursively, analogously to x_σ, except that the j-th innermost run 1^m of x_σ is replaced by the run 1^{v_j}; see Figure 1 for an illustration.

A straightforward induction shows that |w_v| = O(k^σ). Moreover, for any j ∈ [d] and threshold t ∈ [m], we define a string p_{j,t} that uses the symbols 2, …, σ, guided by the base-m representation of j − 1, to pin down the j-th innermost run of w_v, and that contains the run 1^t; see Figure 2 for an illustration.

Figure 1: Illustration of the construction of w_v from the vector v.

Figure 2: Illustration of the construction of p_{j,t}.

The following claim shows that testing whether p_{j,t} is a subsequence of w_v allows one to infer the entries of v.

Claim.

We have p_{j,t} ⪯ w_v if and only if v_j ≥ t.

Proof.

Figure 3: Illustration of Claim 3.

See Figure 3 for an illustration. Matching the symbols 2, …, σ of p_{j,t} greedily in w_v confines the run 1^t of p_{j,t} to the j-th innermost run 1^{v_j} of w_v, analogously to the pinning argument in the previous proof. Therefore, p_{j,t} ⪯ w_v holds if and only if 1^t ⪯ 1^{v_j}, i.e., if and only if v_j ≥ t. ∎

Theorem.

Any deterministic subsequence sketch has size Ω(k^{σ−1} log k) in the worst case. Restricted to strings of length n, the sketch size is Ω(min{k^{σ−1}, n/k} log k).

Proof.

Let m := ⌊k/σ⌋ and d := m^{σ−1}. Let v ∈ [m]^d and let w_v be as above. Alice is given w_v as input. Notice that there are m^d distinct inputs for Alice. Assume for contradiction that the sketch size is less than d·log₂ m for every input. Then the total number of distinct possible sketches is strictly less than m^d. Therefore, at least two strings, say w_v and w_{v'}, have the same sketch, for some v ≠ v'. Let j be such that v_j ≠ v'_j, and without loss of generality v_j > v'_j. Now set Bob’s input to p_{j, v_j}, which is a valid subsequence of w_v, but not of w_{v'} (by Claim 3). However, since the sketch for both w_v and w_{v'} is the same, Bob’s output will be incorrect for at least one of the strings. Finally, note that d·log₂ m = Ω(k^{σ−1} log k). Hence, we obtain a sketch size lower bound of Ω(k^{σ−1} log k).

If we instead choose v from [m]^{d'} with d' = Θ(min{d, n/m}), then the constructed string w_v has length O(n), and the same argument as above yields a sketch lower bound of Ω(min{k^{σ−1}, n/k} log k). ∎

We now discuss the complexity of randomized subsequence sketching, where Bob is allowed to err with small constant probability. To this end, we will reduce from the Index problem.

Definition.

In the Index problem, Alice is given an n-bit string v ∈ {0,1}^n and sends a message to Bob. Bob is given Alice’s message and an index j ∈ [n], and outputs v_j.

Intuitively, since the communication is one-sided, Alice cannot infer j and therefore has to send the whole string v. This intuition also holds for randomized protocols, as follows.

Fact ([31]).

The randomized one-way communication complexity of Index is Ω(n).

Claim 3 shows that subsequence sketching allows us to infer the bits of an arbitrary string v, and thus the hardness of Index carries over to subsequence sketching.

Theorem.

In a randomized subsequence sketch, Bob is allowed to err with small constant probability. Any randomized subsequence sketch has size Ω(k^{σ−1}) in the worst case. This holds even restricted to strings of length O(k^σ).

Proof.

We reduce the Index problem to subsequence sketching. Let v ∈ {0,1}^d be the input to Alice in the Index problem, where d = m^{σ−1} = Θ(k^{σ−1}). As above, we construct the corresponding input w_v to Alice in subsequence sketching. Observe that |w_v| = O(k^σ). For any input j to Bob in the Index problem, we construct the corresponding input p_{j,1} for Bob in subsequence sketching. We have v_j = 1 if and only if p_{j,1} ⪯ w_v (by Claim 3). This yields a lower bound of Ω(d) = Ω(k^{σ−1}) on the sketch size (by Fact 3). ∎

4 Weighted LCS

Definition.

In the WLCS problem we are given strings x, y of lengths n ≥ m over alphabet Σ, and a weight function w : Σ → N. A weighted longest common subsequence (WLCS) of x and y is any string z with z ⪯ x and z ⪯ y maximizing the total weight w(z) := Σ_i w(z[i]). The task is to compute this maximum weight, which we abbreviate as WLCS(x, y).

In the remainder of this section we will design an algorithm for computing WLCS(x, y) in time O(min{nm, n + m^σ}). This yields the upper bound of Theorem 1.2. Note that here we focus on computing the maximum weight WLCS(x, y); standard methods can be applied to reconstruct a subsequence attaining this value. We prove a matching conditional lower bound of min{nm, n + m^σ}^{1−o(1)} in the next section.

Let x, y, and w be given. The standard dynamic programming algorithm for determining LCS(x, y) in time O(nm) trivially generalizes to WLCS(x, y) as well. Alternatively, we can first compress x to C_m(x) in time O(n) and then compute WLCS(C_m(x), y), which is equal to WLCS(x, y) since all subsequences of length at most m of x are also subsequences of C_m(x), and vice versa. We show below in Theorem 4 how to compute the WLCS of a run-length encoded string with r runs and a string of length m in time O(rm). Since C_m(x) consists of O(m^{σ−1}) runs and the length of y is m, we can compute WLCS(C_m(x), y) in time O(m^σ). In total, we obtain time O(n + m^σ), and together with the O(nm) algorithm, time O(min{nm, n + m^σ}).
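The O(nm) branch of this pipeline is the straightforward weighted generalization of the LCS dynamic program; a minimal sketch (function name ours), where `w` maps each symbol to its nonnegative weight:

```python
def wlcs(x, y, w):
    """O(|x|*|y|) dynamic program for the maximum total weight of a
    common subsequence of x and y."""
    m = len(y)
    prev = [0] * (m + 1)
    for a in x:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            best = max(prev[j], cur[j - 1])   # skip a symbol of x or y
            if a == y[j - 1]:
                best = max(best, prev[j - 1] + w[a])  # match a, gain w(a)
            cur[j] = best
        prev = cur
    return prev[m]

# With unit weights this is exactly the LCS length; skewed weights can
# change the optimal subsequence:
print(wlcs("abcba", "bacb", {"a": 1, "b": 5, "c": 1}))  # 11, attained by "bcb"
```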

It remains to solve WLCS on a run-length encoded string with r runs and a string of length m in time O(rm). For (unweighted) LCS, a dynamic programming algorithm with this running time was presented by Liu et al. [37]. We first give a brief intuitive explanation as to why their algorithm does not generalize to WLCS. Let x̃ = a_1^{ℓ_1} ⋯ a_r^{ℓ_r} be the run-length encoded string, with a_i ∈ Σ and ℓ_i ≥ 1, and let y be uncompressed. Let L(i, j) := LCS(a_1^{ℓ_1} ⋯ a_i^{ℓ_i}, y[1..j]). Liu et al.’s algorithm relies on a recurrence expressing L(i, j) in terms of neighboring table entries; in the unweighted setting neighboring entries differ by at most 1, but in the weighted setting they can differ by complicated terms that seem hard to figure out locally. Our algorithm that we develop below instead relies on a recurrence for W(i, j) := WLCS(a_1^{ℓ_1} ⋯ a_i^{ℓ_i}, y[1..j]) in terms of the entries W(i−1, j') with j' ≤ j.

Theorem.

Given a run-length encoded string x̃ consisting of r runs, a string y of length m, and a weight function w : Σ → N, we can determine WLCS(x̃, y) in time O(rm).

Proof.

We write the run-length encoded string as x̃ = a_1^{ℓ_1} a_2^{ℓ_2} ⋯ a_r^{ℓ_r} with a_i ∈ Σ and ℓ_i ∈ N. Let W(i, j) := WLCS(a_1^{ℓ_1} ⋯ a_i^{ℓ_i}, y[1..j]). We will build a dynamic programming table T where T[i][j] stores the value W(i, j). In particular, T[r][m] = WLCS(x̃, y). We will show how to compute this table in (amortized) time O(1) per entry in the following. For j' ≤ j, let occ_i(j', j) denote the number of occurrences of a_i in y[j'+1..j]. Since we can split off the last run of x̃, we obtain the recurrence W(i, j) = max_{0 ≤ j' ≤ j} ( W(i−1, j') + w(a_i) · min{ℓ_i, occ_i(j', j)} ). Since W(i−1, j') is monotonically non-decreasing in j', we may rewrite the same recurrence as follows.

Let β(i, j) be the minimum value of j' such that occ_i(j', j) ≤ ℓ_i. Note that β(i, j) is well-defined, since for j' = j we always have occ_i(j, j) = 0 ≤ ℓ_i, and note that β(i, j) is monotonically non-decreasing in j. We define the active (i, j)-window as the interval [β(i, j), j]. Note that it is non-empty and both its left and right boundary are monotonic in j. Let occ_i(j) := occ_i(0, j) denote the number of occurrences of a_i in y[1..j]. We define D[i][j'] as W(i−1, j') − w(a_i) · occ_i(j'). With this notation, we can rewrite the above recurrence as W(i, j) = max{ W(i−1, β(i, j) − 1) + w(a_i) · ℓ_i,  w(a_i) · occ_i(j) + max_{j' ∈ [β(i, j), j]} D[i][j'] }, where the first term is omitted if β(i, j) = 0.

We can precompute all values occ_i(j) in time O(m + r), since for each symbol of Σ its prefix counts in y can be computed in one pass. Hence, in order to determine T[i][j] in amortized time O(1) it remains to compute max_{j' ∈ [β(i, j), j]} D[i][j'] in amortized time O(1). To this end, we maintain the right-to-left maximum sequence