
A Simple Streaming Bit-parallel Algorithm
for Swap Pattern Matching

Václav Blažej Faculty of Information Technology, Czech Technical University in Prague,
Prague, Czech Republic
   Ondřej Suchý Faculty of Information Technology, Czech Technical University in Prague,
Prague, Czech Republic
   Tomáš Valla Faculty of Information Technology, Czech Technical University in Prague,
Prague, Czech Republic
Abstract

The pattern matching problem with swaps is to find all occurrences of a pattern in a text while allowing the pattern to swap adjacent symbols. The goal is to design a fast matching algorithm that takes advantage of the bit parallelism of bitwise machine instructions and has only streaming access to the input. We introduce a new approach to solve this problem based on the graph theoretic model and compare its performance to previously known algorithms. We also show that an approach using deterministic finite automata cannot achieve similarly efficient algorithms. Furthermore, we describe a fatal flaw in some of the previously published algorithms based on the same model. Finally, we provide an experimental evaluation of our algorithm on real-world data.

1 Introduction

In the Pattern Matching problem with Swaps (Swap Matching, for short), the goal is to find all occurrences of any swap version of a pattern p in a text t, where p and t are strings over an alphabet Σ of length m and n, respectively. By a swap version of a pattern p we mean a string of symbols created from p by swapping adjacent symbols while ensuring that each symbol is swapped at most once (see Section 2 for formal definitions). The solution of Swap Matching is the set of indices at which occurrences of p in t begin. Swap Matching is an intensively studied problem due to its use in practical applications such as text and music retrieval, data mining, network security and biological computing [5].

Swap Matching was introduced in 1995 as an open problem in non-standard string matching [14]. The first result was reported by Amir et al. [2] in 1997, who provided an O(nm^(1/3) log m)-time solution for alphabets of size 2, while also showing that alphabets of larger size can be reduced to size 2 with a little overhead. Amir et al. [4] also came up with a solution with improved time complexity for some very restrictive cases. Several years later Amir et al. [3] showed that Swap Matching can be solved by an algorithm for overlap matching, achieving a running time of O(n log m log σ). This algorithm, as well as all the previous ones, is based on the fast Fourier transform (FFT).

In 2008 Iliopoulos and Rahman [13] introduced a new graph theoretic approach to model the Swap Matching problem and came up with the first efficient solution to Swap Matching without using FFT (we show it to be incorrect). Their algorithm, based on bit parallelism, runs in O(n) time if the pattern length is similar to the word-size of the target machine. One year later Cantone and Faro [7] presented the Cross Sampling algorithm solving Swap Matching in O(n) time, again assuming that the pattern length is similar to the word-size of the target machine. In the same year Campanelli et al. [6] enhanced the Cross Sampling algorithm using notions from the Backward directed acyclic word graph matching algorithm and named the new algorithm Backward Cross Sampling. This algorithm also assumes short pattern length. Although Backward Cross Sampling has a worse asymptotic time complexity than Cross Sampling, it improves the real-world performance.

In 2013 Faro [10] presented a new model to solve Swap Matching using reactive automata and also presented a new algorithm with O(n) time complexity assuming short patterns. The same year Chedid [9] improved the dynamic programming solution by Cantone and Faro [7], which results in more intuitive algorithms. In 2014 a minor improvement by Fredriksson and Giaquinta [11] appeared, yielding slightly better asymptotic time complexity (and also slightly worse space complexity) for special cases of patterns. The same year Ahmed et al. [1] took ideas of the algorithm by Iliopoulos and Rahman [13] and devised two algorithms named Smalgo-I and Smalgo-II, which both run in O(n) time for short patterns, but bear the same error as the original algorithm.

Our Contribution.

We design a simple algorithm which solves the Swap Matching problem. The goal is to design a streaming algorithm, which is given one symbol per execution step until the end-of-input arrives, and thus does not need access to the whole input. This algorithm has O(⌈m/w⌉(|Σ| + n)) time and O(⌈m/w⌉|Σ|) space complexity, where w is the word-size of the machine. We would like to stress that our solution, being based on the graph theoretic approach, does not use FFT. Therefore, it yields a much simpler non-recursive algorithm allowing bit parallelism and does not suffer from the disadvantages of the convolution-based methods. While our algorithm matches the asymptotic complexity bounds of the best previous results [7, 11], we believe that its strength lies in applications where the alphabet is small and the pattern length is at most the word-size, as it can be implemented using only CPU registers and a few machine instructions. This makes it practical for tasks like DNA sequence scanning. Also, as far as we know, our algorithm is currently the only known streaming algorithm for the swap matching problem.

We continue by proving that any deterministic finite automaton that solves Swap Matching has a number of states exponential in the length of the pattern.

We also describe the Smalgo (swap matching algorithm) by Iliopoulos and Rahman [13] in detail. Unfortunately, we have discovered that Smalgo and the derived algorithms contain a flaw which causes false positives to appear. We have prepared implementations of Smalgo-I, Cross Sampling, Backward Cross Sampling and our own algorithm, measured the running times and the rate of false positives for the Smalgo-I algorithm. All of the sources are available for download at http://users.fit.cvut.cz/blazeva1/gsm.html.

This paper is organized as follows. First we introduce all the basic definitions and recall the graph theoretic model introduced in [13] and its use for matching in Section 2. In Section 3 we present our algorithm for the Swap Matching problem, and in Section 4 we prove that Swap Matching cannot be solved efficiently by deterministic finite automata. Then we describe the Smalgo algorithm in detail in Section 5 and finish with the experimental evaluation of the algorithms in Section 6.

2 Basic Definitions and the Graph Theoretic Model

In this section we state the basic definitions, present the graph theoretic model and show a basic algorithm that solves Swap Matching using the model.

2.1 Notations and Basic Definitions

We use the word-RAM as our computational model. That means we have access to memory cells of a fixed capacity w (e.g., 64 bits). The standard set of arithmetic and bitwise instructions includes And (&), Or (|), Left bitwise-shift (LShift) and Right bitwise-shift (RShift). Each of the standard operations on words takes a single unit of time. In order to compare with other existing algorithms, which are not streaming, we define the access to the input in a less restrictive way: the input is read from a read-only part of memory and the output is written to a write-only part of memory. However, it will be easy to observe that our algorithm accesses the input sequentially. We do not include the input and the output in the space complexity analysis.

A string s over an alphabet Σ is a finite sequence of symbols from Σ and |s| is its length. By s[i] we mean the i-th symbol of s, and we define the substring s[i..j] = s[i]s[i+1]…s[j] for 1 ≤ i ≤ j ≤ |s|, and the prefix s[1..i] for 1 ≤ i ≤ |s|. A string s prefix matches a string s' on k symbols at position i if s[1..k] = s'[i..i+k−1].

Next we formally introduce a swapped version of a string.

Definition 1 (Campanelli et al. [6])

A swap permutation for a string p of length m is a permutation π : {1, …, m} → {1, …, m} such that:

  1. if π(i) = j, then π(j) = i (symbols at positions i and j are swapped),

  2. π(i) ∈ {i − 1, i, i + 1} for all i (only adjacent symbols are swapped),

  3. if π(i) ≠ i, then p[π(i)] ≠ p[i] (identical symbols are not swapped).

For a string p a swapped version π(p) is the string π(p) = p[π(1)]p[π(2)]…p[π(m)], where π is a swap permutation for p.
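To make the definition concrete, the set of swapped versions of a short pattern can be enumerated directly: at each position the symbol either stays, or it swaps with a distinct right neighbour and both positions are consumed. A small Python sketch (the function name swapped_versions is ours, not from the paper):

```python
def swapped_versions(p):
    """Enumerate all swapped versions of p (Definition 1):
    swaps are adjacent, disjoint, and never exchange identical symbols."""
    results = set()

    def rec(i, cur):
        if i == len(p):
            results.add("".join(cur))
            return
        rec(i + 1, cur + [p[i]])                 # p[i] stays in place
        if i + 1 < len(p) and p[i] != p[i + 1]:  # swap two distinct neighbours
            rec(i + 2, cur + [p[i + 1], p[i]])

    rec(0, [])
    return results
```

Note that condition 3 is enforced by the `p[i] != p[i + 1]` test; dropping it would only produce duplicates of already-listed strings.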

Now we formalize the version of matching we are interested in.

Definition 2

Given a text t = t[1]t[2]…t[n] and a pattern p = p[1]p[2]…p[m], the pattern p is said to swap match t at location i if there exists a swapped version π(p) of p that matches t at location i, i.e., π(p) = t[i..i+m−1].
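This definition can be turned into a naive reference matcher that, at every position of the pattern, tries both options: keep the symbol, or swap two distinct adjacent symbols. Indices are 0-based here, unlike the 1-based convention of the paper, and the function name swap_match_positions is our choice:

```python
def swap_match_positions(t, p):
    """Return all 0-based positions where some swapped version of p matches t."""
    m, n = len(p), len(t)

    def ok(i, j):
        # Can p[j:] (allowing adjacent swaps from j onward) match t starting at i + j?
        if j == m:
            return True
        if t[i + j] == p[j] and ok(i, j + 1):
            return True
        return (j + 1 < m and p[j] != p[j + 1]
                and t[i + j] == p[j + 1] and t[i + j + 1] == p[j]
                and ok(i, j + 2))

    return [i for i in range(n - m + 1) if ok(i, 0)]
```

This runs in exponential-free O(nm) time overall (each ok(i, j) branches at most twice and advances j), and is useful as a ground-truth oracle when testing faster matchers.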

2.2 A Graph Theoretic Model

The algorithms in this paper are based on a model introduced by Iliopoulos and Rahman [13]. In this section we briefly describe this model.

Figure 1: The P-graph for the pattern

For a pattern p of length m we construct a labeled graph P_p = (V, E, σ) with vertices V, edges E, and a vertex labeling function σ : V → Σ (see Fig. 1 for an example). Let V = {v(r, c) : r ∈ {−1, 0, 1}, 1 ≤ c ≤ m} \ {v(−1, m), v(1, 1)}. For each vertex we set σ(v(−1, c)) = p[c + 1], σ(v(0, c)) = p[c], and σ(v(1, c)) = p[c − 1]. Each vertex v(r, c) is identified with the element (r, c) of a grid, where r is its row and c its column. We set E = {(v(r, c), v(r', c + 1)) ∈ V × V : either r ∈ {0, 1} and r' ∈ {0, −1}, or r = −1 and r' = 1}. We call P_p the P-graph of p. Note that P_p is a directed acyclic graph with |V| = 3m − 2 vertices and O(m) edges.

The idea behind the construction of P_p is as follows. We create vertices v(r, c) and edges which represent every swap pattern without unnecessary restrictions (equal symbols can be swapped). We then remove the vertices v(−1, m) and v(1, 1), which would represent symbols at the invalid indices m + 1 and 0.

The P-graph now represents all possible swap permutations of the pattern p in the following sense. Vertices v(0, c) represent ends of prefixes of swapped versions of the pattern which end with a non-swapped symbol. A possible swap of the symbols p[c] and p[c + 1] is represented by the vertices v(−1, c) and v(1, c + 1). Edges represent symbols which can be consecutive. Each path from column 1 to column m represents a swap pattern, and each swap pattern is represented this way.
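One plausible encoding of this construction in Python, with a vertex written as a pair (row, column) for row ∈ {−1, 0, 1} and 1-based columns (the function name p_graph and the exact tuple encoding are our choices, not the paper's):

```python
def p_graph(p):
    """Build a P-graph of pattern p: vertex labels plus an edge list.
    Rows: -1 = the next symbol swapped left, 0 = symbol in place,
    +1 = the previous symbol swapped right."""
    m = len(p)
    label = {}
    for c in range(1, m + 1):
        label[(0, c)] = p[c - 1]           # sigma(v(0,c))  = p[c]
        if c + 1 <= m:
            label[(-1, c)] = p[c]          # sigma(v(-1,c)) = p[c+1]
        if c - 1 >= 1:
            label[(1, c)] = p[c - 2]       # sigma(v(1,c))  = p[c-1]
    edges = []
    for c in range(1, m):
        for r in (0, 1):                   # rows 0/+1 feed rows 0/-1
            for r2 in (0, -1):
                if (r, c) in label and (r2, c + 1) in label:
                    edges.append(((r, c), (r2, c + 1)))
        if (-1, c) in label and (1, c + 1) in label:
            edges.append(((-1, c), (1, c + 1)))   # a swap spans two columns
    return label, edges
```

The two removed grid corners, v(−1, m) and v(1, 1), simply never enter `label`, so no edge can touch them.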

Definition 3

For a given σ-labeled directed acyclic graph G, vertices u and v, and a directed path f = (u = w_1, w_2, …, w_k = v) from u to v, we call the string σ(w_1)σ(w_2)…σ(w_k) a path string of f.

2.3 Using Graph Theoretic Model for Matching

In this section we describe an algorithm called the Basic Matching Algorithm (BMA), which can determine whether there is a match of a pattern p in a text t at a position i using any graph model which satisfies the following conditions.

  • It is a directed acyclic graph,

  • its vertex set can be divided into columns: V = V_1 ∪ V_2 ∪ … ∪ V_m,

  • edges lead only to the next column: E ⊆ (V_1 × V_2) ∪ … ∪ (V_{m−1} × V_m).

Let B ⊆ V_1 be the starting vertices and F ⊆ V_m be the accepting vertices. BMA is designed to run on any graph which satisfies these conditions. Since the P-graph satisfies these assumptions, we can use BMA on it.

1:Input: Labeled directed acyclic graph G = (V, E, σ), set B of starting vertices, set F of accepting vertices, text t, and position i.
2:Let P_1 := B.
3:for c := 1 to m do
4:     Let P'_c := {v ∈ P_c : σ(v) = t[i + c − 1]}.
5:     if P'_c = ∅ then finish.      
6:     if P'_c ∩ F ≠ ∅ then we have found a match and finish.      
7:     Define the next iteration set as the vertices which are successors of P'_c, i.e.,
8:         P_{c+1} := {w ∈ V : (v, w) ∈ E for some v ∈ P'_c}.
Algorithm 1 The basic matching algorithm (BMA)

The algorithm runs as follows (see also Algorithm 1). We initialize the algorithm by setting P_1 := B (Step 2). The set P_c holds the vertices which are ends of some path f starting in B whose path string possibly prefix matches the first c symbols of t at position i. To make sure that the path f represents a prefix match we need to check whether the label of the last vertex of the path f matches the symbol t[i + c − 1] (Step 4). If no prefix match is left, we did not find a match (Step 5). If some prefix match is left, we need to check whether we already have a complete match (Step 6). If the algorithm did not stop, it means that we have some prefix match but it is not a complete match yet. Therefore we can try to extend this prefix match by one symbol (Step 8) and check whether it is still a valid prefix match (Step 4). Since we extend the matched prefix in each step, we repeat these steps until the prefix match is as long as the pattern (Step 3).
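The steps above can be sketched directly for any layered DAG given by its labels, edges, starting and accepting vertices (the function name bma and its interface are our illustration, not the paper's code; 0-based text indices):

```python
def bma(label, edges, start, accept, t, i):
    """Basic Matching Algorithm on a column-layered DAG.
    label: dict vertex -> symbol; edges: list of (u, v) pairs;
    start/accept: vertex sets; returns True iff a match begins at t[i]."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    m = max(c for (_, c) in label)          # number of columns (vertex = (row, col))
    cur = set(start)
    for step in range(m):
        # filter signal by the current text symbol
        cur = {v for v in cur if label[v] == t[i + step]}
        if not cur:
            return False                    # no prefix match left
        if any(v in accept for v in cur):
            return True                     # complete match found
        # propagate signal along the edges
        cur = {w for v in cur for w in succ.get(v, [])}
    return False
```

For example, on a hand-built two-column P-graph of the pattern "ab", BMA accepts the text window "ba" via the swap edge and rejects "bb".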

Having vertices in sets is not handy for computing, so we present another way to describe this algorithm: we use the characteristic vectors of these sets instead.

Definition 4

A Boolean labeling function I : V → {0, 1} of the vertices of G is called a prefix match signal.

The algorithm can be easily divided into iterations according to the value of c in Step 3. We denote the value of the prefix match signal in the c-th iteration as I^c and we define the following operations:

  • propagate signal along the edges, an operation which sets I(v) := 1 if and only if there exists an edge (u, v) ∈ E with I(u) = 1,

  • filter signal by a symbol x, an operation which sets I(v) := 0 for each v where σ(v) ≠ x,

  • match check, an operation which checks whether there exists a vertex v ∈ F such that I(v) = 1 and if so reports a match.

With these definitions in hand we can describe BMA in terms of prefix match signals as Algorithm 2. See Fig. 2 for an example of using BMA to decide whether the pattern swap matches the text at a position i.

1:Let I(v) := 1 for each v ∈ B and I(v) := 0 for each v ∈ V \ B.
2:for c := 1 to m do
3:     Filter signals by the symbol t[i + c − 1].
4:     if I(v) = 0 for every v ∈ V then finish.      
5:     if I(v) = 1 for any v ∈ F then we have found a match and finish.      
6:     Propagate signals along the edges.
Algorithm 2 BMA in terms of prefix match signals

Figure 2: A run of BMA on the P-graph of the pattern. The prefix match signal propagates along the dashed edges. An index c above a vertex v represents that I^c(v) = 1; otherwise I^c(v) = 0.

2.4 Shift-And Algorithm

The following description is based on [8, Chapter 5] describing the Shift-Or algorithm.

For a pattern p and a text t of lengths m and n, respectively, let R be a bit array of size m and R^j its value after the text symbol t[j] has been processed. It contains information about all matches of prefixes of p that end at the position j in the text. For 1 ≤ i ≤ m, R^j[i] = 1 if p[1..i] = t[j − i + 1..j] and 0 otherwise. The vector R^{j+1} can be computed from R^j as follows. For each positive i we have R^{j+1}[i + 1] = 1 if R^j[i] = 1 and p[i + 1] = t[j + 1], and 0 otherwise. Furthermore, R^{j+1}[1] = 1 if p[1] = t[j + 1] and 0 otherwise. If R^{j+1}[m] = 1, then a complete match can be reported.

The transition from R^j to R^{j+1} can be computed very fast as follows. For each x ∈ Σ let D[x] be a bit array of size m such that for 1 ≤ i ≤ m, D[x][i] = 1 if and only if p[i] = x. The array D[x] denotes the positions of the symbol x in the pattern p. Each D[x] can be preprocessed before the search. The computation of R^{j+1} is then reduced to three bitwise operations, namely R^{j+1} := (LShift(R^j) | 1) & D[t[j + 1]]. When R^{j+1}[m] = 1, the algorithm reports a match on the position j − m + 2.
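A compact Python rendition of this loop, using integers as bit arrays with bit i − 1 standing for position i (0-based positions; the function name shift_and is our choice):

```python
def shift_and(t, p):
    """Classic Shift-And: all 0-based starting positions of exact matches of p in t."""
    m = len(p)
    D = {}
    for j, c in enumerate(p):          # D[c] has bit j set iff p[j] == c
        D[c] = D.get(c, 0) | (1 << j)
    R, out = 0, []
    for i, c in enumerate(t):
        R = ((R << 1) | 1) & D.get(c, 0)   # the three bitwise operations
        if R & (1 << (m - 1)):
            out.append(i - m + 1)          # match ends at i, starts m-1 earlier
    return out
```

Python integers are unbounded, so this also works for m > w; a word-RAM implementation would keep R in ⌈m/w⌉ machine words.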

3 Our Algorithm

In this section we will show an algorithm which solves Swap Matching. We call the algorithm GSM (Graph Swap Matching). GSM uses the graph theoretic model presented in Section 2.2 and is based on the Shift-And algorithm from Section 2.4.

The basic idea of the GSM algorithm is to represent the prefix match signals (see Definition 4) from the basic matching algorithm (Section 2.3) in bit vectors. The GSM algorithm represents all signals in bitmaps formed by three vectors, one for each row of the graph model. Each time GSM processes a symbol of the text, it first propagates the signal along the edges, then filters the signal and finally checks for matches. All these operations can be done very quickly thanks to bitwise parallelism.

First, we make the concept of GSM more familiar by presenting a way to interpret the Shift-And algorithm by means of the basic matching algorithm (BMA) from Section 2.3 to solve the (ordinary) Pattern Matching problem. Then we expand this idea to Swap Matching by using the graph theoretic model.

3.1 Graph Theoretic View of the Shift-And Algorithm

Let  and  be a text and a pattern of lengths  and , respectively. We create the -graph of the pattern .

Definition 5

Let  be a string. The -graph of  is a graph where , and such that .

We know that this graph is a directed acyclic graph which can be divided into columns (each of them containing one vertex) such that the edges always lead from one column to the next. This means that it satisfies all assumptions of BMA. We apply BMA to it to figure out whether p matches t at a position i. We get a correct result because in each column we check whether the vertex label matches the corresponding symbol of the text.

To find every occurrence of  in  we would have to run BMA for each position separately. This is basically the naive approach to solve the pattern matching. We can improve the algorithm significantly when we parallelize the computations of  runs of BMA in the following way.

The algorithm processes one symbol at a time starting from . We say that the algorithm is in the  step when a symbol  has been processed. BMA represents a prefix match as a prefix match signal . Its value in the  step is denoted . Since one run of the BMA uses only one column of the -graph at any time we can use other vertices to represent different runs of the BMA. We represent all prefix match indicators in one vector so that we can manipulate them easily. To do that we prepare a bit vector . Its value in  step is denoted  and defined as .

The first operation used in BMA (propagate signal along the edges) can be done easily by setting the signal of each vertex to the value of the signal of its predecessor in the previous step. In terms of bit vectors this amounts to a single LSO operation, where LSO is defined as LSO(x) = LShift(x) | 1, i.e., a left shift which also sets the lowest bit to start a fresh run of BMA.

We also need a way to set for each  for which which is another basic BMA operation (filter signal by a symbol). We can do this using the bit vector  from Section 2.4 and taking . I.e., the algorithm computes as .

The last BMA operation we have to define is the match detection. We do this by checking whether and if this is the case then a match starting at position occurred.

3.2 Our Algorithm for Swap Matching Using the Graph Theoretic Model

Now we are ready to describe the GSM algorithm.

We again let P_p be the P-graph of the pattern p, apply BMA to it to figure out whether p swap matches t at a position i, and parallelize the runs of BMA on P_p.

Again, the algorithm processes one symbol at a time and it is in the  step when a symbol  is being processed. We again denote the value of the prefix match signal  of BMA in the  step by . I.e., the semantic meaning of  is that if there exists a swap permutation  such that and . Otherwise is .

We want to represent all prefix match indicators in vectors so that we can manipulate them easily. We can do this by mapping the values of for rows of the -graph to vectors , and , respectively. We denote value of the vector in  step as . We define values of the vectors as , , and , where the value of for every .

We define the BMA propagate signal along the edges operation as setting the signal of a vertex to 1 if at least one of its predecessors has its signal set to 1. We can perform this operation on whole rows at once using the LSO operation, obtaining the propagate signal along the edges operation as one bit-vector assignment per row.

The operation filter signal by a symbol can be done by first constructing a bit vector  for each as if and otherwise. Then we use these vectors to filter signal by a symbol  by taking , , and .

The last operation we define is the match detection. We do this by checking whether or and if this is the case, then a match starting at a position occurred.

1:Input: Pattern of length and text of length over alphabet .
2:Output: Positions of all swap matches.
3:Let .
4:Let , for all .
5:for  do
6:for  do
7:     ,
8:     .
9:     , , .
10:     if  or  then report a match on position .      
Algorithm 3 The graph swap matching (GSM)

The final GSM algorithm (Algorithm 3) first prepares the D-masks D[x] for every x ∈ Σ and initializes the signal vectors (Steps 3–5). Then, for each position in the text, the algorithm computes the new values of the row vectors by first using the above formula for signal propagation (Steps 7 and 8) and then the formula for signal filtering (Step 9), and checks whether a match occurred, in which case it reports the match (Step 10).
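The whole algorithm can be sketched as a Shift-And-style loop over three row vectors. The code below is our illustration of the idea, not the authors' implementation: the names gsm, RU, RM, RD (signals for rows −1, 0, +1 of the graph model) and the exact mask layout are our assumptions.

```python
def gsm(t, p):
    """Sketch of GSM: Shift-And over the three rows of the swap graph.
    RU/RM/RD carry prefix match signals for rows -1 / 0 / +1."""
    m = len(p)
    D = {}
    for j, c in enumerate(p):
        D[c] = D.get(c, 0) | (1 << j)            # bit j iff p[j] == c (row 0)
    DU = {c: v >> 1 for c, v in D.items()}       # bit j iff p[j+1] == c (row -1)
    DD = {c: v << 1 for c, v in D.items()}       # bit j iff p[j-1] == c (row +1)
    goal = 1 << (m - 1)
    RU = RM = RD = 0
    out = []
    for i, c in enumerate(t):
        # propagate along the edges; the |1 (LSO) injects a fresh run at column 1
        nxt = ((RM | RD) << 1) | 1
        RU, RM, RD = nxt, nxt, (RU << 1)         # row +1 is fed only by row -1
        # filter by the current text symbol
        RU &= DU.get(c, 0); RM &= D.get(c, 0); RD &= DD.get(c, 0)
        # match check: accepting vertices sit in rows 0 and +1 of the last column
        if (RM | RD) & goal:
            out.append(i - m + 1)                # 0-based starting position
    return out
```

The shifted mask families DU and DD play the role of D-masks for the upper and lower rows; a production version would follow the paper's word layout with ⌈m/w⌉ machine words per vector instead of Python's unbounded integers.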

Observe that Algorithm 3 accesses the input sequentially and thus it is a streaming algorithm. We now prove correctness of our algorithm. To ease the notation let us define to be if , if , and if . We define analogously. Similarly, we define as if , if , and if . By the way the masks are computed on lines 4 and 5 of Algorithm 3, we get the following observation.

Observation 3.1

For every and every we have if and only if .

The following lemma constitutes the crucial part of the correctness proof.

Lemma 1

For every and every we have if and only if there exists a swap permutation such that and .

Proof

Let us start with the “if” part. We prove the claim by induction on . If and there is a swap permutation such that and , then the algorithm sets to on line 7 or 8 (recall the definition of LSO). As , we have by Observation 3.1 and, therefore, by line 9, also .

Now assume that and that the claim is true for every smaller . Assume that there exists a swap permutation such that and . By induction hypothesis we have that , where . Since equals if and only if equals by Definition 1, we have . Therefore the algorithm sets to on line 7 or 8. Moreover, since , we have by Observation 3.1 and the algorithm sets to on line 9.

Now we prove the “only if” part again by induction on . If and , then we must have and, by Observation 3.1, also . We obtain by setting , and for every . It is easy to verify that this is a swap permutation for and has the desired properties.

Now assume that and that the claim is true for every smaller . Assume that . Then, due to line 9 we must have and, hence, by Observation 3.1, also . Moreover, we must have and, hence, by lines 7 and 8 of the algorithm also for some with . By induction hypothesis there exists a swap permutation for such that and . If , then setting finishes the proof. Otherwise we have either or and . In the former case we let for every and in the latter case we let , and for every . In both cases we let for every . It is again easy to verify that is a swap permutation for with the desired properties. ∎

Our GSM algorithm reports a match on position if and only if or . However, by Lemma 1, this happens if and only if there is a swap match of on position in . Hence, the algorithm is correct.

Theorem 3.2

The GSM algorithm runs in O(⌈m/w⌉(|Σ| + n)) time and uses O(⌈m/w⌉|Σ|) memory cells (not counting the input and output cells), where n is the length of the input text, m the length of the input pattern, w the word-size of the machine, and |Σ| the size of the alphabet. (To simplify the analysis, we assume that log n ≤ w, i.e., the iteration counter fits into one memory cell.)

Due to space constraints, we defer the proof to the appendix.

Corollary 1

If m ≤ cw for some constant c, then the GSM algorithm runs in O(|Σ| + n) time and has O(|Σ|) space complexity. Moreover, if m ≤ w, then the GSM algorithm can be implemented using only O(|Σ|) memory cells.

Proof

The first part follows directly from Theorem 3.2. Let us show the second part. We need O(|Σ|) cells for all the D-masks, a constant number of cells for the signal vectors (reusing the space across iterations), one pointer to the text, one iteration counter, one constant for the match check and one temporary variable for the computation of the more complex parts of the algorithm. Altogether, we need only O(|Σ|) memory cells to run the GSM algorithm. ∎

From the space complexity analysis we see that for sufficiently small alphabets (e.g., DNA sequences) the GSM algorithm can be implemented in practice using solely CPU registers, with the exception of the text, which has to be loaded from RAM.

4 Limitations of the Finite Deterministic Automata Approach

Many string matching problems can be solved by finite automata. A non-deterministic finite automaton that solves Swap Matching can be constructed by a simple modification of the P-graph. An alternative approach to solve Swap Matching would thus be to determinize and execute this automaton. The drawback is that the determinization process may lead to an exponential number of states. We show that in some cases it actually does, contradicting a conjecture of Holub [12] on the number of states of this determinized automaton.

Theorem 4.1

There is an infinite family F of patterns such that any deterministic finite automaton accepting the corresponding swap matching language has a number of states exponential in the length of the pattern.

Proof

For any integer  we define the pattern . Note that the length of  is . Suppose that the automaton  recognizing the language  has  states such that . We define a set of strings where  is defined as follows. Let be the binary representation of the number . Let if and let if . Then, let . See Table 1 for an example. Since , there exist such that both  and  are accepted by the same accepting state  of the automaton . Let  be the minimum number such that . Note that and . Now we define and . Let and be the suffixes of the strings and , both of length . Note that  begins with and  begins with and that the block  or  repeats  times in both. Therefore the pattern  swap matches  and does not swap match . Since for the last symbol of both  and  the automaton is in the same state , the computation for  and  must end in the same state . However, as  should not be accepted and  should be accepted, we obtain a contradiction with the correctness of the automaton . Hence, we may define the family  as , concluding the proof. ∎

Table 1: An example of the construction from proof of Theorem 4.1 for .

This proof shows the necessity of specially designed algorithms to solve Swap Matching. We presented one in the previous section, and now we revisit the existing algorithms.

5 Smalgo Algorithm

In this section we discuss how Smalgo by Iliopoulos and Rahman [13] and Smalgo-I by Ahmed et al. [1] work. Since they are bitwise inverses of each other, we introduce them both in terms of the operations used in Smalgo-I.

Before we show how these algorithms work, we need one more definition.

Definition 6

A degenerate symbol  over an alphabet  is a nonempty set of symbols from alphabet . A degenerate string  is a string built over an alphabet of degenerate symbols. We say that a degenerate string  matches a text  at a position  if for every .

5.1 Smalgo-I

Smalgo-I [1] is a modification of the Shift-And algorithm (Section 2.4) for Swap Matching. The algorithm uses the graph theoretic model introduced in Section 2.2.

First let be a degenerate version of the pattern . The symbol in  on position  represents the set of symbols of  which can swap to that position. To accommodate the Shift-And algorithm to matching degenerate patterns, we need to change the way the  masks are defined. For each let  be the bit array of size  such that for if and only if .

While a match of the degenerate pattern  is a necessary condition for a swap match of , it is clearly not sufficient. The way the Smalgo algorithms try to fix this is by introducing P-mask which is defined as if or if there exist vertices , and  and edges in  for which for some and for , and otherwise. One -mask called is used to represent the -masks for triples which only contain 1 in the first column.

Now, whenever checking whether  prefix swap matches   symbols at position  we check for a match of  in  and we also check whether . This ensures that the symbols are able to swap to respective positions and that those three symbols of the text  are present in some .

With the P-masks completed we initialize . Then for every  to  we repeat the following. We compute  as . To check whether or not a swap match occurred we check whether . This is claimed to be sufficient because during the processing we are in fact considering not only the next symbol  but also the symbol .

5.2 The Flaw in Smalgo, Smalgo-I and Smalgo-II

We shall see that for a pattern and a text all Smalgo versions give false positives.

The concept of Smalgo is based on the assumption that we can find a path in the P-graph by searching for consecutive paths of length 3 (triplets), where each two consecutive triplets share two columns and can partially overlap. However, this only works if the consecutive triplets actually share the two vertices in the common columns. If this assumption does not hold, then the found substring of the text might not match any swap version of the pattern.

The above input gives such a configuration and therefore the assumption is false. The Smalgo-I algorithm actually reports match of pattern on a position  of text . This is obviously a false positive, as the pattern has two  symbols while the text has only one.

The reason behind the false positive match is as follows. The algorithm checks whether the first triplet of symbols matches. It can match the swap pattern . Next it checks the second triplet of symbols , which can match . We know that  is not possible since it did not appear in the previous check, but the algorithm cannot distinguish them since it only checks for triplets existence. Since each step gave us a positive match the algorithm reports a swap match of the pattern in the text.

In Fig. 3 we see the two triplets which Smalgo wrongly assumes have two vertices in common. The Smalgo-II algorithm saves space by maintaining less information; however, it simulates how Smalgo-I works and so it contains the same flaw. The appendix provides more details on the execution of the Smalgo-I algorithm on this pattern and text, and also a detailed analysis of the Smalgo-II algorithm.

Figure 3: The Smalgo flaw represented in the P-graph

6 Experiments

We implemented our Algorithm 3 (GSM), described in Section 3.2, the Bitwise Parallel Cross Sampling (BPCS) algorithm by Cantone and Faro [7], the Bitwise Parallel Backward Cross Sampling (BPBCS) algorithm by Campanelli et al. [6], and the faulty SMALGO algorithm by Iliopoulos and Rahman [13]. All these implementations are available online at http://users.fit.cvut.cz/blazeva1/gsm.html.

We tested the implementations on three real-world datasets. The first dataset (CH) is the 7th chromosome of the human genome (ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/), which consists of 159 M characters from the standard ACTG nucleobases plus N for undetermined positions. The second dataset (HS) is a partial genome of Homo sapiens from the Protein Corpus (http://www.data-compression.info/Corpora/ProteinCorpus/) with 3.3 M characters representing proteins encoded in 19 different symbols. The last dataset (BIB) is the Bible text of the Canterbury Corpus (http://corpus.canterbury.ac.nz/descriptions/large/bible.html) with 4.0 M characters containing 62 different symbols. For each pattern length from {3, 4, 5, 6, 8, 9, 10, 12, 16, 32} we randomly selected 10,000 patterns from each text and processed each of them with each implemented algorithm.

All measured algorithms were implemented in C++ and compiled with -O3 in gcc 6.3.0. Measurements were carried on an Intel Core i7-4700HQ processor with 2.4 GHz base frequency and 3.4 GHz turbo with 8 GiB of DDR3 memory at 1.6 GHz. Time was measured using std::chrono::high_resolution_clock::now() from the C++ chrono library. The resulting running times, shown in Table 2, were averaged over the 10,000 patterns of the given length.

Data (|Σ|) Algorithm Pattern length
3 4 5 6 8 9 10 12 16 32
CH (5) SMALGO 426 376 355 350 347 347 344 347 345 345
BPCS 398 353 335 332 329 329 326 328 329 327
BPBCS 824 675 555 472 366 328 297 257 199 112
GSM 394 354 338 333 332 331 329 333 331 333
HS (19) SMALGO 4.80 4.73 4.72 4.74 4.70 4.71 4.71 4.71 4.72 4.70
BPCS 4.43 4.36 4.36 4.36 4.34 4.33 4.34 4.34 4.35 4.34
BPBCS 7.16 5.80 4.79 4.05 3.03 2.70 2.44 2.06 1.62 0.95
GSM 4.42 4.38 4.41 4.46 4.45 4.45 4.45 4.44 4.53 4.48
BIB (62) SMALGO 8.60 8.38 8.29 8.34 8.32 8.33 8.30 8.35 8.35 8.33
BPCS 7.53 7.36 7.28 7.29 7.26 7.27 7.26 7.28 7.29 7.25
BPBCS 12.43 10.03 8.26 7.03 5.44 4.93 4.52 3.93 3.19 1.88
GSM 7.52 7.37 7.31 7.35 7.38 7.40 7.38 7.42 7.44 7.40
Table 2: Comparison of the running times. Each value is the average over 10,000 patterns randomly selected from the text in milliseconds.

The results show that the GSM algorithm runs slightly faster than Smalgo (ignoring the fact that Smalgo is faulty by design). Also, the performance of GSM and BPCS is almost indistinguishable; according to our experiments, it varies within a few percent depending on the exact CPU, cache, RAM and compiler settings. The seemingly superior average performance of BPBCS is caused by the heuristics BPBCS uses; however, while the worst-case performance of GSM is guaranteed, the performance of BPBCS for certain patterns is worse than that of GSM. Also note that GSM is a streaming algorithm while the others are not.

Algorithm Dataset
CH HS BIB
SMALGO 86243500784 51136419 315612770
rest 84411799892 51034766 315606151
Table 3: Found occurrences across datasets: The value is simply the sum of occurrences over all the patterns.

Table 3 shows the impact of the flaw in Smalgo-I by comparing the number of occurrences found by the respective algorithms. The ratio of false positives to true positives for Smalgo-I was approximately 2.2% for CH, 0.2% for HS and 0.002% for BIB.

References

  • [1] Ahmed, P., Iliopoulos, C.S., Islam, A.S., Rahman, M.S.: The swap matching problem revisited. Theoretical Computer Science 557, 34–49 (2014)
  • [2] Amir, A., Aumann, Y., Landau, G.M., Lewenstein, M., Lewenstein, N.: Pattern matching with swaps. Journal of Algorithms 37(2), 247–266 (2000)
  • [3] Amir, A., Cole, R., Hariharan, R., Lewenstein, M., Porat, E.: Overlap matching. Information and Computation 181(1), 57–74 (2003)
  • [4] Amir, A., Landau, G.M., Lewenstein, M., Lewenstein, N.: Efficient special cases of pattern matching with swaps. Inform. Process. Lett. 68(3), 125–132 (1998)
  • [5] Antoniou, P., Iliopoulos, C.S., Jayasekera, I., Rahman, M.S.: Implementation of a swap matching algorithm using a graph theoretic model. In: Bioinformatics Research and Development, BIRD 2008, CCIS, vol. 13, pp. 446–455. Springer (2008)
  • [6] Campanelli, M., Cantone, D., Faro, S.: A new algorithm for efficient pattern matching with swaps. In: IWOCA 2009, LNCS, vol. 5874, pp. 230–241. Springer (2009)
  • [7] Cantone, D., Faro, S.: Pattern matching with swaps for short patterns in linear time. In: SOFSEM 2009, LNCS, vol. 5404, pp. 255–266. Springer (2009)
  • [8] Charras, C., Lecroq, T.: Handbook of Exact String Matching Algorithms. King’s College Publications (2004)
  • [9] Chedid, F.: On pattern matching with swaps. In: AICCSA 2013. pp. 1–5. IEEE (2013)
  • [10] Faro, S.: Swap matching in strings by simulating reactive automata. In: Proceedings of the Prague Stringology Conference 2013, pp. 7–20. CTU in Prague (2013)
  • [11] Fredriksson, K., Giaquinta, E.: On a compact encoding of the swap automaton. Information Processing Letters 114(7), 392–396 (2014)
  • [12] Holub, J.: Personal communication (2015)
  • [13] Iliopoulos, C.S., Rahman, M.S.: A new model to solve the swap matching problem and efficient algorithms for short patterns. In: SOFSEM 2008: Theory and Practice of Computer Science, LNCS, vol. 4910, pp. 316–327. Springer (2008)
  • [14] Muthukrishnan, S.: New results and open problems related to non-standard stringology. In: CPM 95, LNCS, vol. 937, pp. 298–317. Springer (1995)

Appendix A Proof of Theorem 3.2

Proof

The initialization of the signal vectors and masks (lines 3 and 4) takes O(⌈m/w⌉|Σ|) time. The bits in the masks are then set according to the pattern in O(m) time (line 5). The main cycle of the algorithm (lines 6–10) makes n iterations. Each iteration consists of computing the new values of the vectors in a constant number of bitwise operations, i.e., in O(⌈m/w⌉) machine operations, and checking for the result in O(⌈m/w⌉) time. This gives O(⌈m/w⌉(|Σ| + n)) time in total. The algorithm stores the three signal vectors (reusing the same space in every iteration), the D-masks, and a constant number of variables for other uses (iteration counters, temporary variable, etc.). Thus, in total the GSM algorithm needs O(⌈m/w⌉|Σ|) memory cells. ∎

Appendix B The Run of Smalgo-I Resulting in the False Positive

In Tables 4 and 5 we can see the step-by-step execution of the Smalgo-I algorithm on the pattern and text. In Table 5 we see that  has  in the  row, which means that the algorithm reports a pattern match on a position . This is a false positive, because it is not possible to swap match the pattern with two  symbols in the text with only one