Optimal trade-offs for pattern matching with k mismatches


Abstract

Given a pattern of length m and a text of length n, the goal in k-mismatch pattern matching is to compute, for every m-substring of the text, the exact Hamming distance to the pattern or report that it exceeds k. This can be solved in either O(n√(k log k)) time as shown by Amir et al. [J. Algorithms 2004] or Õ((m + k²)·n/m) time due to a result of Clifford et al. [SODA 2016]. We provide a smooth time trade-off between these two bounds by designing an algorithm working in time Õ((m + k√m)·n/m). We complement this with a matching conditional lower bound, showing that a significantly faster combinatorial algorithm is not possible, unless the combinatorial matrix multiplication conjecture fails.

1 Introduction

The basic question in algorithms on strings is pattern matching, which asks for reporting (or detecting) occurrences of the given pattern in the text. This fundamental question comes in multiple shapes and colors, starting from the exact version considered already in the 70s [6]. Here we are particularly interested in the approximate version, where the goal is to detect fragments of the text that are similar to the pattern. Two commonly considered variants of this question are pattern matching with errors and pattern matching with mismatches. In the former, we are looking for a fragment with edit distance at most k to the pattern, while in the latter we are interested in a fragment that differs from the pattern on up to k positions (and has the same length). The classical solution by Landau and Vishkin [7] solves pattern matching with k mismatches in O(n·k) time for a text of length n. For larger values of k, Abrahamson [1] showed how to compute the number of mismatches between every fragment of the text and the pattern of length m in total time O(n√(m log m)) with convolution. Later, Amir et al. [2] combined both approaches to achieve O(n√(k log k)) time.
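For reference, the problem defined above can be solved by a simple brute-force scan: compare the pattern against every alignment and stop counting once the threshold is exceeded. This is a minimal sketch for checking faster algorithms against, not any of the cited methods; the helper name is illustrative.

```python
def k_mismatch_naive(text, pattern, k):
    # For every m-substring of the text, report the exact Hamming distance
    # to the pattern, or None once it exceeds k. Early termination makes a
    # single alignment cost O(k) mismatch-finding steps plus matches scanned.
    m = len(pattern)
    out = []
    for i in range(len(text) - m + 1):
        dist = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                dist += 1
                if dist > k:
                    dist = None  # distance exceeds the threshold k
                    break
        out.append(dist)
    return out
```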

An obvious and intriguing question is what are the best possible time bounds for pattern matching with mismatches. An unpublished result attributed to Indyk [3] is that, if we are interested in counting mismatches for every position in the text, then this is at least as difficult as multiplying boolean matrices. In particular, it implies that one should not hope to significantly improve on the time complexity of an O(n√m) combinatorial algorithm. However, this is not sensitive to the bound k on the number of mismatches. In a recent breakthrough, Clifford et al. [4] introduced a new repertoire of tools and showed an Õ(n + k²·n/m) time algorithm. In particular, this is near linear-time for k = O(√m) and improves on the previous algorithm of Amir et al. [2], which runs in O(n√(k log k)) time, whenever k = O(m^(2/3)).

Results.

We provide a smooth transition between the O(n√(k log k)) time algorithm of Amir et al. [2] and the Õ(n + k²·n/m) time solution given by Clifford et al. [4]. The running time of our algorithm is Õ((m + k√m)·n/m). This matches the previous solutions at the extreme points k = Θ(√m) and k = Θ(m), but provides a better trade-off in-between. Furthermore, we prove that such a transition is essentially the best possible. More precisely, we complement the algorithm with a matching conditional lower bound, showing that a significantly faster combinatorial algorithm is not possible, unless the popular combinatorial matrix multiplication conjecture fails.

Figure 1: Running time on instances with n = Θ(m), for varying k. Previous algorithms are represented by dashed lines and our algorithm is represented by a solid line. For example, for k = Θ(m^(3/4)) we improve the complexity from Õ(n·m^(3/8)) to Õ(n·m^(1/4)).

Related work.

Landau and Vishkin [7] solve pattern matching with mismatches by checking every possible alignment with constant-time longest common extension queries (also known as "kangaroo jumps"), using O(k) such queries per alignment. The main idea in all the subsequent improvements is to use convolution, which essentially counts the matches generated by a particular letter with a single FFT in time close to linear. Both Abrahamson [1] and Amir et al. [2] use convolution for letters occurring often in the pattern. Convolution is also used (together with random projections that can be derandomized at the cost of an extra logarithmic factor) by Karloff [5] for approximate mismatch counting.
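The convolution trick mentioned above can be sketched as follows: the matches generated by one fixed letter, at every alignment, are the cross-correlation of the two indicator arrays, computable with a single FFT. This is a minimal illustration (using NumPy's FFT; the helper name is ours), not the exact routine of any cited paper.

```python
import numpy as np

def matches_of_letter(text, pattern, letter):
    # Indicator arrays: 1.0 where the fixed letter occurs, 0.0 elsewhere.
    t = np.array([1.0 if c == letter else 0.0 for c in text])
    p = np.array([1.0 if c == letter else 0.0 for c in pattern])
    n = len(text) + len(pattern)  # enough room for a full linear convolution
    # Convolving t with the reversed pattern indicator computes, for every
    # alignment i, the number of positions j with text[i+j] == pattern[j] == letter.
    conv = np.fft.irfft(np.fft.rfft(t, n) * np.fft.rfft(p[::-1], n), n)
    start = len(pattern) - 1  # alignment 0 sits at offset m-1 of the convolution
    return [int(round(conv[start + i]))
            for i in range(len(text) - len(pattern) + 1)]
```

Summing this over all letters of the alphabet yields the exact number of matches (hence mismatches) at every alignment, which is the basis of Abrahamson's algorithm.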

At a very high level, Clifford et al. [4] obtain the improved time complexity by partitioning both the pattern and the text into subpatterns and subtexts, such that the total number of blocks in their run-length encoding (RLE) is small. The resulting instances of RLE pattern matching with mismatches are then solved in Õ(k²) total time, leading to an Õ(n + k²·n/m) time algorithm for the original problem.

Overview of the techniques.

We observe that the reduction from [4] can be done so that, instead of many small instances, we end up with a single new instance of k-mismatch pattern matching. The resulting new pattern and text have RLE consisting of O(k) blocks, and the problem is reduced to RLE pattern matching with mismatches. Since for RLE pattern matching with mismatches there is a matching quadratic conditional lower bound (by reducing from the 3SUM problem), it might seem that no improvement here is possible without making a significant breakthrough.

We show that this is not necessarily the case, by leveraging the fact that the RLE strings are compressed versions of strings of length O(m). Thus, letters that appear in only a few blocks of the compressed pattern can be treated in a fashion similar to [2], by producing a representation of all matches generated by a block in the compressed pattern against a block in the compressed text, in constant time per pair of blocks. For letters that appear in many blocks, we can essentially "uncompress" the corresponding fragment of the pattern and run the classical convolution, taking advantage of the fact that the uncompressed versions are of length O(m). Setting the threshold appropriately, we solve the obtained instance of RLE pattern matching in Õ(m + k√m) time. All in all, we obtain an Õ((m + k√m)·n/m) time solution to the original problem.

2 Upper bound

The goal of this section is to prove the following theorem:

Theorem 2.1.

k-mismatch pattern matching can be solved in time Õ((m + k√m)·n/m).

We begin with the standard trick of reducing the problem to O(n/m) instances of matching a pattern of length m to a text of length 2m, and work with such a formulation from now on. Therefore, the goal now is to achieve Õ(m + k√m) complexity per instance.
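The standard reduction above can be sketched as follows: cover the text with overlapping chunks of length 2m at stride m, so that every m-substring lies entirely inside some chunk. A minimal illustration (helper name is ours):

```python
def split_into_chunks(text, m):
    # Overlapping chunks of length 2m, starting every m positions.
    # Every alignment s of a length-m pattern satisfies i <= s and
    # s + m <= i + len(chunk) for the chunk starting at i = m * (s // m).
    n = len(text)
    return [(i, text[i:i + 2 * m]) for i in range(0, max(n - m, 1), m)]
```

Each chunk is then processed independently, and a per-chunk time of Õ(m + k√m) multiplies into Õ((m + k√m)·n/m) overall.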

We start by highlighting the kernelization technique of Clifford et al. [4]. An integer p is an x-period of a string S if Ham(S[p+1 .. |S|], S[1 .. |S|−p]) ≤ x (cf. Definition 1 in [4]). Note that, compared to the original formulation, we drop the condition that p is minimal from the definition.
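The x-period condition above can be checked naively in linear time, which is convenient for experimenting with the definition (illustrative helper, not a subroutine of the algorithm):

```python
def is_x_period(s, p, x):
    # p is an x-period of s when s, compared against itself shifted by p,
    # has at most x mismatching positions on the overlap.
    mismatches = sum(1 for a, b in zip(s[p:], s) if a != b)
    return mismatches <= x
```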

Lemma 2.2 (Fact 3.1 in [4]).

If the minimal 4k-period of the pattern is ℓ, then the starting positions of any two occurrences with 2k mismatches of the pattern are at distance at least ℓ.

The first step of the algorithm is to approximate the minimal O(k)-period of the pattern. More specifically, we run the (1+ε)-approximation algorithm of Karloff [5] with ε = 1, matching the pattern against itself. This takes Õ(m) time and, by looking at the approximate outputs for offsets not larger than √m, allows us to distinguish between two cases:

  • every 4k-period of the pattern is at least √m, or

  • there is an 8k-period of the pattern that is at most √m.

Then we run the appropriate algorithm as described below.

No small 4k-period.

We again run Karloff's algorithm with ε = 1, but now we match the pattern against the text. We look for positions where the approximate algorithm reports at most 2k mismatches: every occurrence with at most k mismatches is certainly reported in this way, and every reported position is within true Hamming distance 2k of the pattern. By Lemma 2.2, any two such positions are at distance at least √m, so there are O(√m) of them, and we can safely discard all other positions. Then, we test every such position using the "kangaroo jumps" technique of Landau and Vishkin [7], using O(k) constant-time operations per position, in O(m + k√m) total time.
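The verification step can be sketched as follows. The paper relies on constant-time LCE queries ("kangaroo jumps"); a character scan capped at k+1 mismatches is a simple stand-in with the same interface (helper name is ours):

```python
def verify_candidates(text, pattern, k, candidates):
    # Verify each surviving alignment exactly, stopping after k+1 mismatches.
    # With O(1)-time LCE queries each alignment costs O(k); the capped scan
    # below has the same output but costs O(m) per alignment in the worst case.
    occurrences = {}
    for i in candidates:
        dist = 0
        for j, c in enumerate(pattern):
            if text[i + j] != c:
                dist += 1
                if dist > k:
                    break
        if dist <= k:
            occurrences[i] = dist
    return occurrences
```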

Small 8k-period.

Let ℓ ≤ √m be any 8k-period of the pattern. For a string S and 1 ≤ i ≤ ℓ, let S_i = S[i] S[i+ℓ] S[i+2ℓ] ⋯, up until the end of S. We denote by enc_ℓ(S) an ℓ-encoding of S, that is, the string S_1 S_2 ⋯ S_ℓ. Let R(S) be the number of runs in enc_ℓ(S). Denote r = 8k + ℓ = O(k + √m), and observe that, by the following lemma, it upperbounds the number of runs in enc_ℓ(P).
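The ℓ-encoding and the run counting can be sketched directly (hypothetical helper names; note how a string with a small approximate period compresses into few runs):

```python
def ell_encoding(s, ell):
    # Concatenate the subsequences s[i], s[i+ell], s[i+2*ell], ... for each
    # residue i, i.e. read s column by column with stride ell.
    return "".join(s[i::ell] for i in range(ell))

def count_runs(s):
    # Number of maximal blocks of equal characters (RLE length).
    return sum(1 for i in range(len(s)) if i == 0 or s[i - 1] != s[i])
```

For example, a string with exact period ℓ turns into exactly ℓ runs, and every mismatch against the ℓ-shift adds at most one extra run.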

Lemma 2.3 (Lemma 6.1 in [4]).

If S has an x-period ℓ, then enc_ℓ(S) consists of at most ℓ + x runs.

We proceed with the kernelization argument. Let T_left be the longest suffix of T[1 .. m] such that R(T_left) ≤ 2r. Similarly, let T_right be the longest prefix of T[m+1 .. 2m] such that R(T_right) ≤ 2r. Let T* = T_left · T_right. Obviously, |T*| ≤ 2m.

Lemma 2.4 (Lemma 6.2 in [4]).

Every fragment of T that is an occurrence of P with k mismatches is fully contained in T*.

Thus we see that k-mismatch pattern matching is reduced to a kernel where the ℓ-encodings of both the text and the pattern have few runs, that is, compress well with RLE.

From now on, assume that the lengths of both T* and P are divisible by ℓ. If it is not the case, we can pad them separately with at most ℓ characters each, not changing the complexity of our solution. Let u and v be integers such that |T*| = u·ℓ and |P| = v·ℓ, with u ≥ v.

Figure 2: Example of rearranging the text and the pattern, with parameter ℓ.

We rearrange both T* and P to take advantage of their regular structure. That is, we define T' = T_1 T_2 ⋯ T_{2ℓ−1}, where T_i = T*[i] T*[i+ℓ] T*[i+2ℓ] ⋯. Observe that T' is a word of length (2ℓ−1)·u − (ℓ−1), composed first of blocks of the form T_i of length u, for 1 ≤ i ≤ ℓ, and then of blocks of the form T_i of length u−1, for ℓ < i ≤ 2ℓ−1.

Similarly, we define P' = Q_1 Q_2 ⋯ Q_ℓ, where Q_i = P_i $^{u−v} and $ is a fresh character that matches no character of the text. Again we observe that P' is the word of length u·ℓ, composed of blocks of the form P_i $^{u−v} for 1 ≤ i ≤ ℓ. An example of this reduction is presented in Figure 2.

Next we show that T' and P' maintain the Hamming distance between any possible alignment of T* and P.

Lemma 2.5.

For any integer d with 0 ≤ d ≤ |T*| − |P|, let q = ⌊d/ℓ⌋ and i = d mod ℓ. Let Δ = i·u + q. Then

Ham(T'[Δ+1 .. Δ+u·ℓ], P') = Ham(T*[d+1 .. d+v·ℓ], P) + (u−v)·ℓ.

Proof.

Observe that

Ham(T'[Δ+1 .. Δ+u·ℓ], P') = Σ_{j=1..ℓ} Σ_{c=1..u} 1[T_{i+j}[q+c] ≠ Q_j[c]],   (1)

where 1[·] is the indicator of character inequality. Observe that T_{i+j}[q+c] = T*[d+j+(c−1)·ℓ], for c ≤ v there is Q_j[c] = P[j+(c−1)·ℓ], and for c > v there is Q_j[c] = $. Additionally, for c > v, the character $ always generates a mismatch with any character in T'. Thus

Ham(T'[Δ+1 .. Δ+u·ℓ], P') = Σ_{j=1..ℓ} Σ_{c=1..v} 1[T*[d+j+(c−1)·ℓ] ≠ P[j+(c−1)·ℓ]] + (u−v)·ℓ = Ham(T*[d+1 .. d+v·ℓ], P) + (u−v)·ℓ. ∎

We see that it is enough to find all occurrences of P' in T' with k + (u−v)·ℓ mismatches, where |P'| = u·ℓ = O(m) and |T'| = O(m). Additionally, both T' and P' consist of O(k + √m) runs.

Now we describe how to solve the kernelized problem exactly (where we count matches/mismatches for all possible alignments, not just detect occurrences with up to k mismatches), using the stated properties of T' and P'.

Consider a letter a. For a string w, we denote by R_a(w) the number of runs in w consisting of occurrences of a. Fix a parameter t. Call a letter a such that R_a(P') ≥ t a heavy letter, and otherwise call it light. Now we describe how to count the number of mismatches generated by each type of letters. This is reminiscent of a trick originally used by Abrahamson [1] and later refined by Amir et al. [2].
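The heavy/light classification can be sketched as follows (illustrative helper; t is the threshold from the text):

```python
from itertools import groupby

def classify_letters(x, t):
    # Count, for every letter, the number of runs in x consisting of that
    # letter, then split the alphabet: heavy = at least t runs, light = fewer.
    runs = {}
    for letter, _ in groupby(x):
        runs[letter] = runs.get(letter, 0) + 1
    heavy = {a for a, cnt in runs.items() if cnt >= t}
    light = set(runs) - heavy
    return heavy, light
```

Since the total number of runs is small, only few letters can be heavy, which is exactly what the two counting schemes below exploit.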

Heavy letters.

For every heavy letter separately we use a convolution scheme. Since both T' and P' are of size O(m), this takes O(m·log m) time per every such letter. Since the total number of runs in P' is O(k + √m), there are O((k + √m)/t) heavy letters, making the total time O((k + √m)·m·log m/t).

Light letters.

First, we preprocess P', and for every light letter we compute the list of runs consisting of occurrences of this letter. Our goal is to compute the array M, where M[i] counts the number of matches generated by light letters when P' is aligned at position i of T'.

Figure 3: On the left, a run in the pattern and a run in the text (both represented by black boxes) consisting of the same character, and a histogram of the matches they generate. On the right, the first derivatives of the indicator arrays and the second derivative of the match array, without padding zeroes.

We scan T', and for every run of a particular light letter, we iterate through the precomputed list of runs of this letter in P'. Observe that, given a run x of the same letter in P' and a run y in T', the matches generated between x and y account for a piecewise linear function of the alignment. More precisely, for every alignment i under which the runs overlap, we need to increase M[i] by the length of the overlap. To see that we can process a pair of runs in constant time, we work with discrete derivatives instead of the original arrays.

Given a sequence A, we define its discrete derivative ∂A as follows: ∂A[i] = A[i] − A[i−1]. Correspondingly, if we consider the generating function A(z) = Σ_i A[i]·z^i, then ∂A(z) = (1 − z)·A(z) (for convenience, we assume that arrays are indexed from −∞ to +∞, with finitely many non-zero entries).

Now consider the indicator sequences X and Y of the positions covered by the run x in P' and by the run y in T', respectively. To perform the update, we set M[i] += Σ_j X[j]·Y[j+i] for all i, or, simpler, using generating functions:

M(z) += X(1/z)·Y(z),   (2)

where X(z) = Σ_j X[j]·z^j and Y(z) = Σ_j Y[j]·z^j. However, we observe that ∂X and ∂Y have particularly simple forms: each has exactly two non-zero entries, +1 at the first position of the run and −1 right after its last position. Thus it is easier to maintain the second derivative ∂²M, and (2) becomes:

∂²M(z) += −z·∂X(1/z)·∂Y(z).

All in all, we can maintain ∂²M in constant time per pair of runs, or in O((k + √m)·t) total time, since every list of runs is of length at most t, and there are O(k + √m) runs in T'. Additionally, in O(m) time we can compute M[0] and M[1], allowing us to recover all other M[i]s from the formula M[i] = ∂²M[i] + 2·M[i−1] − M[i−2].
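The constant-time processing of a pair of runs can be sketched concretely: each pair contributes a trapezoid-shaped bump to the match array, recorded with four point updates to the second derivative and recovered at the end by integrating twice. This is a simplified stand-alone illustration of the derivative trick above (helper names are ours):

```python
def add_trapezoid(d2, lo, mid_lo, mid_hi, hi):
    # Record, in the second-derivative array d2, a bump that ramps up by 1
    # per step on [lo, mid_lo), stays flat on [mid_lo, mid_hi), and ramps
    # down on [mid_hi, hi): exactly the overlap profile of a pair of runs.
    d2[lo] += 1
    d2[mid_lo] -= 1
    d2[mid_hi] -= 1
    d2[hi] += 1

def integrate_twice(d2):
    # Two prefix-sum passes turn the second derivative back into the array.
    first, acc = [], 0
    for v in d2:
        acc += v
        first.append(acc)
    out, acc = [], 0
    for v in first:
        acc += v
        out.append(acc)
    return out
```

For instance, a pattern run of length 2 sliding over a text run of length 4 generates overlaps 1, 2, 2, 2, 1 over five consecutive alignments, which is one such trapezoid.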

Setting t = √(m·log m) gives the total running time O((k + √m)·√(m·log m)) = Õ(m + k√m) in both cases, as claimed.

3 Lower bound

Below we present a conditional lower bound, which expands upon an idea attributed to Indyk [3]. The main idea here is to use rectangular matrices instead of square ones, and to adjust the padding accordingly. However, we pad using the same character in both the text and the pattern, increasing the number of mismatches only by a factor of 2.

Recall the combinatorial matrix multiplication conjecture stating that, for any ε > 0, there is no combinatorial algorithm for multiplying two N × N boolean matrices working in time O(N^(3−ε)). The following formulation is equivalent to this conjecture:

Conjecture 3.1 (Combinatorial matrix multiplication).

For any ε > 0 and α ∈ (0, 1], there is no combinatorial algorithm for multiplying an N × N^α matrix with an N^α × N matrix in time O(N^(2+α−ε)).

The equivalence can be seen by simply cutting the matrices into square blocks (in one direction) or into rectangular blocks (in the other direction).

Now, consider two boolean matrices: A of dimension N × M and B of dimension M × N, for M ≤ N. We encode A as text T, by encoding the elements row by row and adding some padding. Namely:

T = t_{1,1} ⋯ t_{1,M} $^{M+1} t_{2,1} ⋯ t_{2,M} $^{M+1} ⋯ t_{N,1} ⋯ t_{N,M} $^{M+1},

where $ is the padding character, t_{i,j} = x_j when A[i][j] = 1 and t_{i,j} = a when A[i][j] = 0. Similarly, we encode B as pattern P column by column, using padding shorter by one character:

P = p_{1,1} ⋯ p_{M,1} $^M p_{1,2} ⋯ p_{M,2} $^M ⋯ p_{1,N} ⋯ p_{M,N} $^M,

where p_{j,i} = x_j when B[j][i] = 1 and p_{j,i} = b when B[j][i] = 0.

Observe that, since we encode 0s from A and B using different symbols (a and b, respectively), and the encoding of 1s is position-dependent, a character t_{i,j} and a character p_{j',i'} will generate a match only if they are perfectly aligned and j = j' is such that A[i][j] = B[j][i'] = 1, or equivalently (A·B)[i][i'] = 1. Since each block (encoded row plus the following padding) is either of length 2M+1 for rows or 2M for columns, there will be at most one aligned row-column pair for each pattern-text alignment.

The total number of mismatches, for each alignment, is at most 2·N·M (since there are at most N·M non-padding text characters that are aligned with the pattern, and at most N·M non-padding pattern characters). We can recover whether any given entry of A·B is 1, since if so the number of mismatches for the corresponding alignment is decreased by at least 1.
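The core of this encoding can be illustrated on a single row-column pair: with position-dependent symbols for 1s and side-specific symbols for 0s, the Hamming distance between the two encodings drops below the all-mismatch baseline exactly when the product entry is 1. This is a simplified sketch using hypothetical tuple symbols, not the full text/pattern construction with padding:

```python
def encode_row(row):
    # 1s become position-dependent symbols shared by both sides;
    # 0s become a junk symbol used only on the text side.
    return [("pos", j) if v else ("text_junk", j) for j, v in enumerate(row)]

def encode_col(col):
    # Same positional symbols for 1s, but a different junk symbol for 0s.
    return [("pos", j) if v else ("pat_junk", j) for j, v in enumerate(col)]

def product_entry(row, col):
    # (A*B)[i][i'] = 1 iff some position matches, i.e. the Hamming distance
    # between the encodings drops below the baseline len(row).
    ham = sum(1 for a, b in zip(encode_row(row), encode_col(col)) if a != b)
    return ham < len(row)
```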

We have n = (2M+1)·N and m = 2M·N. By setting M = N^α and k = Θ(N·M), we have the following:

Corollary 3.2.

For any positive ε and α ≤ 1, there is no combinatorial algorithm solving pattern matching with k mismatches in time O((m + k√m)^(1−ε)·n/m) for a text of length n and a pattern of length m, unless Conjecture 3.1 fails.

If we denote by ω(1, α, 1) the exponent of the fastest algorithm to multiply a matrix of dimension N × N^α with a matrix of dimension N^α × N, we have:

Corollary 3.3.

For any positive ε and α ≤ 1, there is no algorithm solving pattern matching with k mismatches in time O(N^(ω(1,α,1)−ε)) for a text of length n = Θ(N^(1+α)) and a pattern of length m = Θ(N^(1+α)).

References

  1. Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
  2. Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with mismatches. J. Algorithms, 50(2):257–275, 2004.
  3. Raphaël Clifford. Matrix multiplication and pattern matching under Hamming norm. http://www.cs.bris.ac.uk/Research/Algorithms/events/BAD09/BAD09/Talks/BAD09-Hammingnotes.pdf. Retrieved March 2017.
  4. Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. The k-mismatch problem revisited. In SODA, pages 2039–2052. SIAM, 2016.
  5. Howard J. Karloff. Fast algorithms for approximately counting mismatches. Inf. Process. Lett., 48(2):53–60, 1993.
  6. Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
  7. Gad M. Landau and Uzi Vishkin. Efficient string matching with mismatches. Theor. Comput. Sci., 43:239–249, 1986.