On Combinatorial Generation of Prefix Normal Words

Péter Burcsi Dept. of Computer Algebra, Eötvös Loránd Univ., Budapest, Hungary, bupe@compalg.inf.elte.hu    Gabriele Fici Dip. di Matematica e Informatica, University of Palermo, Italy, gabriele.fici@math.unipa.it    Zsuzsanna Lipták Dip. di Informatica, University of Verona, Italy, zsuzsanna.liptak@univr.it    Frank Ruskey Dept. of Computer Science, University of Victoria, Canada, ruskey@cs.uvic.ca    Joe Sawada School of Computer Science, University of Guelph, Canada, jsawada@uoguelph.ca
Abstract

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled string matching. In this paper we present an efficient algorithm for exhaustively listing the prefix normal words with a fixed length. The algorithm is based on the fact that the language of prefix normal words is a bubble language, a class of binary languages with the property that, for any word $w$ in the language, exchanging the first occurrence of 01 by 10 in $w$ results in another word in the language. We prove that each prefix normal word is produced in $O(n)$ amortized time, and conjecture, based on experimental evidence, that the true amortized running time is $O(\log n)$.

1 Introduction

A binary word of length $n$ is prefix normal if for all $1 \le k \le n$, no substring of length $k$ has more 1s than the prefix of length $k$. For example, 1001101 is not prefix normal because the substring 1101 has more 1s than the prefix 1001. These words were introduced in [13], where it was shown that each binary word $w$ has a canonical prefix normal form $w'$ of the same length: $w$ and $w'$ are equivalent in a certain sense.

The study of prefix normal words, and prefix normal forms, is motivated by the string problem known as binary jumbled pattern matching. In that problem, we are given a text of length $n$ over a binary alphabet, and two numbers $x$ and $y$, and ask whether the text has a substring with exactly $x$ 1s and $y$ 0s. While the online version can be solved with a simple sliding window algorithm in $O(n)$ time, the offline version, where many queries are expected, has recently attracted much interest: here an index of size $O(n)$ can be generated which then allows answering queries in constant time [9]. However, the best construction algorithms to date have running time $O(n^2/\log n)$ [6, 19]. Several recent papers have yielded better results under specific assumptions, such as word level parallelism or highly compressible strings [20, 2, 15, 10], or for constructing an approximate index [11]; but the general case has not been improved. It was demonstrated in [13, 2] that prefix normal forms of the text can be used to construct this index. See the Appendix for a brief explanation of this connection.

Jumbled Pattern Matching (JPM), over an arbitrary alphabet, is a variant of approximate pattern matching: We are given a text and a pattern, and want to answer the question whether the text has a substring which is a permutation of the pattern (existence), or find one or all occurrences of such substrings (occurrence, listing). Formally, the Parikh vector of a string $s$ over a finite ordered alphabet $\Sigma = \{a_1, \ldots, a_\sigma\}$ is the vector $p(s) = (p_1, \ldots, p_\sigma)$ such that for all $i$, $p_i$ is the number of occurrences of $a_i$ in $s$; given the text $t$ and pattern $q$, we want to find (occurrences of) substrings $u$ of $t$ such that $p(u) = p(q)$. This problem has also been studied under the terms Abelian pattern matching, Parikh vector matching, and permutation matching. A closely related problem is that of Parikh fingerprints [1]. Applications in computational biology include SNP discovery, alignment, gene clusters, pattern discovery, and mass spectrometry data interpretation [4, 3, 5, 12, 21].

For one query, the JPM problem can be solved in optimal linear time with a classical sliding window approach [8], while recently, interest has turned towards the indexing problem [9, 17]. Moreover, several variants of the original problem have recently been introduced: approximate JPM [7], JPM in the streaming model [18], JPM on trees and graphs [14, 10].

Bubble languages are an interesting new class of binary languages defined by the following property: $L$ is a bubble language if, for every word $w \in L$, replacing the first occurrence of 01 (if any) by 10 results in another word in $L$ [22, 23, 24]. A generic generating algorithm for bubble languages was given in [24], leading to Gray codes for each of these languages, while the algorithm's efficiency depends only on a language-dependent subroutine. In the best case, this leads to CAT (constant amortized time) generating algorithms. Many important languages are bubble languages, among them necklaces, binary Lyndon words, and $k$-ary Dyck words.

In this paper, we show that prefix normal words form a bubble language and present an efficient generating algorithm which runs in $O(n)$ amortized time per word, and which yields a Gray code for prefix normal words. The best previous generating algorithm for prefix normal words ran in $O(2^n \cdot n^2)$ time, and consisted simply of testing each binary word for the prefix normal property (unpublished). Based on experimental evidence, we conjecture that the running time of our algorithm is in fact $O(\log n)$ amortized. We also give a new characterization of bubble languages in terms of a closure property in the computation tree of a certain generating algorithm for binary words. We prove new properties of prefix normal words and present a linear time testing algorithm for words which have been obtained from prefix normal words via a simple operation. This could lead to a better understanding of prefix normal words and, in the long run, to faster computation of prefix normal forms, and thus contribute to the goal of strongly subquadratic algorithms for binary jumbled pattern matching.

2 Basics

A binary word (or string) $w$ over $\Sigma = \{0, 1\}$ is a finite sequence of elements from $\Sigma$. Its length is denoted by $|w|$. For any $1 \le i \le |w|$, the $i$-th symbol of a word $w$ is denoted by $w_i$. We denote by $\Sigma^n$ the words over $\Sigma$ of length $n$, and by $\Sigma^*$ the set of finite words over $\Sigma$. The empty word is denoted by $\varepsilon$. Let $w \in \Sigma^*$. If $w = uv$ for some $u, v \in \Sigma^*$, we say that $u$ is a prefix of $w$ and $v$ is a suffix of $w$. A substring of $w$ is a prefix of a suffix of $w$. A binary language is any subset $L$ of $\Sigma^*$.

In the following, we will often write binary words $w$ in a canonical form $w = 1^s 0^t \gamma$, where $s, t \ge 0$ and $\gamma \in 1\{0,1\}^* \cup \{\varepsilon\}$. In other words, $s$ is the length of the first, possibly empty, 1-run of $w$, $t$ the length of the first 0-run, and $\gamma$ the remaining, possibly empty, suffix. Note that this representation is unique. We call $1^s 0^t$ the critical prefix of $w$ and $s + t$ the critical prefix length of $w$. We denote by $|w|_c$ the number of occurrences in $w$ of character $c \in \{0, 1\}$, and by $B_d^n$ the set of all binary strings $w$ of length $n$ such that $|w|_1 = d$ (the density of $w$ is $d$).

Given a string $w$, we can obtain another string $\mathrm{swap}(w, i, j)$, the string obtained from $w$ by exchanging the characters in positions $i$ and $j$.
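The canonical form and the swap operation can be made concrete with a small sketch. The function names `canonical_form` and `swap` are illustrative choices, not taken from the paper; positions are 1-based as in the text.

```python
def canonical_form(w):
    """Split w into (s, t, gamma) with w = 1^s 0^t gamma.

    s is the length of the first (possibly empty) 1-run, t the length of
    the 0-run that follows it, and gamma the remaining suffix, which is
    either empty or starts with 1.
    """
    s = len(w) - len(w.lstrip("1"))
    rest = w[s:]
    t = len(rest) - len(rest.lstrip("0"))
    return s, t, rest[t:]


def swap(w, i, j):
    """Return w with the characters at 1-based positions i and j exchanged."""
    chars = list(w)
    chars[i - 1], chars[j - 1] = chars[j - 1], chars[i - 1]
    return "".join(chars)


s, t, gamma = canonical_form("1100101")
print(s, t, gamma, "critical prefix length:", s + t)
print(swap("11100", 3, 5))
```

Strings are used here instead of arrays purely for readability.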

2.1 Prefix Normal Words

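Let $w \in \Sigma^*$. For $i = 0, \ldots, |w|$, we set

  • $P(w, i) = |w_1 \cdots w_i|_1$, the number of 1s in the $i$-length prefix of $w$

  • $F(w, i) = \max\{|u|_1 : u \text{ is a substring of } w, |u| = i\}$, the maximum number of 1s over all substrings of length $i$.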

Definition 1

A binary word $w$ is prefix normal if, for all $1 \le i \le |w|$, $F(w, i) = P(w, i)$. In other words, a word is prefix normal if no substring contains more 1s than the prefix of the same length.

For example, 1001101 is not prefix normal because the substring 1101 has more 1s than the prefix 1001. We denote by $L_{PN}$ the language of prefix normal words. In [13] it was shown that for every word $w$ there exists a unique word $w'$, called its prefix normal form, or $\mathrm{PNF}(w)$, such that for all $1 \le i \le |w|$, $F(w, i) = F(w', i)$, and $w'$ is prefix normal. Therefore, a prefix normal word is a word coinciding with its prefix normal form.
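Definition 1 translates directly into a brute-force test. The following is a minimal sketch, not the testing method of [13]; the names `prefix_ones`, `max_ones` and `is_prefix_normal` are illustrative.

```python
def prefix_ones(w, i):
    """P(w, i): number of 1s in the length-i prefix of w."""
    return w[:i].count("1")


def max_ones(w, i):
    """F(w, i): maximum number of 1s over all substrings of w of length i."""
    return max(w[j:j + i].count("1") for j in range(len(w) - i + 1))


def is_prefix_normal(w):
    """Definition 1: F(w, i) = P(w, i) for every length i."""
    return all(max_ones(w, i) == prefix_ones(w, i)
               for i in range(1, len(w) + 1))


# 1001101: the substring 1101 has three 1s, the prefix 1001 only two.
print(max_ones("1001101", 4), prefix_ones("1001101", 4))
print(is_prefix_normal("1001101"), is_prefix_normal("1101001"))
```

This naive check recounts every window, so it is slower than the quadratic tests mentioned in the text; it is meant only as an executable restatement of the definition.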

Note: In [13], the property 'prefix normal' was defined both with respect to 1 and with respect to 0. Here we restrict ourselves to prefix normal words w.r.t. 1.

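PNF     words with this PNF                      PNF     words with this PNF
00000   {00000}                                  11001   {11001, 10011}
10000   {10000, 01000, 00100, 00010, 00001}      11010   {11010, 10110, 01101, 01011}
10001   {10001}                                  11011   {11011}
10010   {10010, 01001}                           11100   {11100, 01110, 00111}
10100   {10100, 01010, 00101}                    11101   {11101, 10111}
10101   {10101}                                  11110   {11110, 01111}
11000   {11000, 01100, 00110, 00011}             11111   {11111}
Table 1: All prefix normal words of length 5 and their equivalence classes.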

In Table 1 we list all prefix normal words $w$ of length 5, and, for each $w$, the set of binary words $v$ such that $\mathrm{PNF}(v) = w$ (i.e., its equivalence class). Several methods were presented in [13] for testing whether a word is prefix normal; however, all ran in quadratic time. One open problem given there was that of enumerating prefix normal words (counting). The number of prefix normal words of length $n$ can be computed by checking for each binary word whether it is prefix normal, i.e. altogether in $O(2^n \cdot n^2)$ time. In this paper, we present an algorithm that is far superior in that it generates only prefix normal words, rather than testing every binary word; it runs in $O(n)$ amortized time per word; and it generates prefix normal words in cool-lex order, constituting a Gray code (subsequent words differ by a constant number of swaps or flips).
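The equivalence classes of Table 1 can be recomputed by brute force. The sketch below builds the prefix normal form of a word from its $F$-function: the $i$-th letter of the form is 1 exactly when $F(w, i) > F(w, i-1)$. The helper names are illustrative and the approach is exhaustive, so it is only practical for small lengths.

```python
from collections import defaultdict
from itertools import product


def prefix_normal_form(w):
    """Return the unique prefix normal word w' with F(w', i) = F(w, i)."""
    n = len(w)
    F = [0] + [max(w[j:j + i].count("1") for j in range(n - i + 1))
               for i in range(1, n + 1)]
    # F grows by 0 or 1 per step, so its increments spell out the word.
    return "".join("1" if F[i] > F[i - 1] else "0" for i in range(1, n + 1))


def equivalence_classes(n):
    """Group all binary words of length n by their prefix normal form."""
    classes = defaultdict(list)
    for bits in product("01", repeat=n):
        w = "".join(bits)
        classes[prefix_normal_form(w)].append(w)
    return dict(classes)


classes = equivalence_classes(5)
print(len(classes))  # 14 prefix normal words of length 5
for pnf in sorted(classes):
    print(pnf, sorted(classes[pnf], reverse=True))
```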

2.2 Bubble Languages and Combinatorial Generation

Here we give a brief introduction to bubble languages, mostly summarising results from [22, 24]. We also give a new characterization of bubble languages in terms of the computation tree of a generating algorithm (Obs. 1).

Definition 2

A language $L \subseteq \{0,1\}^*$ is called a bubble language if, for every word $w \in L$, exchanging the first occurrence of 01 (if any) by 10 results in another word in $L$.

For example, the languages of Lyndon words, necklaces and pre-necklaces are bubble languages. A language is a bubble language if and only if each of its fixed-density subsets is a bubble language [22]. This implies that for generating a bubble language, it suffices to generate its fixed-density subsets.

Next we consider combinatorial generation of binary strings.

Let $w$ be a binary string of length $n$. Let $d$ be the number of 1s in $w$, and let $i_1 < i_2 < \cdots < i_d$ denote the positions of the 1s in $w$. Clearly, we can obtain $w$ from the word $1^d 0^{n-d}$ with the following algorithm: first swap the last 1 with the 0 in position $i_d$, then swap the $(d-1)$st 1 with the 0 in position $i_{d-1}$, etc. Note that every 1 is moved at most once, and in particular, once the $j$-th 1 is moved into the position $i_j$, the suffix $w_{i_j} \cdots w_n$ remains fixed for the rest of the algorithm.

These observations lead us to the following generating algorithm (Fig. 1), which we will refer to as Recursive Swap Generation Algorithm (like Alg. 1 from [24], which in addition includes a language-specific subroutine). It generates recursively all binary strings from $B_d^n$ with fixed suffix $\gamma$, where $\gamma \in 1\{0,1\}^* \cup \{\varepsilon\}$, starting from the string $1^s 0^t \gamma$. The call Generate($d$, $n-d$, $\varepsilon$) generates all binary strings of length $n$ with density $d$.

  1. if $s > 0$ and $t > 0$
  2. then for $i = 1, \ldots, t$
  3.      do Generate($s-1$, $i$, $1 0^{t-i} \gamma$)
  4. Visit()

Figure 1: The Recursive Swap Generation Algorithm Generate($s$, $t$, $\gamma$)

The algorithm swaps the last 1 of the first 1-run with each of the 0s of the first 0-run, thus generating a new string each, for which it makes a recursive call. During the execution of the algorithm, the current string resides in a global array $w$. In the subroutine Visit() we can, but do not have to, print the contents of this array; we may just want to increment a counter (for enumeration), or check some property of the current string. The main point of Visit() is that it touches every object once.
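The recursion of Fig. 1 can be sketched in a few lines. This version is a simplification under stated assumptions: it passes the current word as strings $(s, t, \gamma)$ rather than keeping it in a global array, and `visit` is a caller-supplied callback standing in for Visit().

```python
def generate(s, t, gamma, visit):
    """Visit all words of the subtree rooted at 1^s 0^t gamma in post-order."""
    if s > 0 and t > 0:
        for i in range(1, t + 1):
            # i-th child: the last 1 of the first 1-run moves i places right,
            # so the new fixed suffix is 1 0^(t-i) gamma.
            generate(s - 1, i, "1" + "0" * (t - i) + gamma, visit)
    visit("1" * s + "0" * t + gamma)


words = []
generate(2, 2, "", words.append)   # all words of length 4 with two 1s
print(words)
```

Collecting the visited words in a list makes the post-order (cool-lex) sequence easy to inspect; replacing `words.append` by a counter gives enumeration without storing the words.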

Let $T_d^n$ denote the recursive computation tree generated by a call to Generate($d$, $n-d$, $\varepsilon$). As an example, Fig. 2 illustrates one such computation tree.

Figure 2: An example computation tree $T_d^n$.

The depth of the tree $T_d^n$ equals $d$, the number of 1s; while the maximum degree is $n-d$, the number of 0s. In general, for the subtree rooted at $w = 1^s 0^t \gamma$, we have depth $s$ and maximum degree $t$; in particular, the number of children of $w$ is exactly $t$. In fact, $w$'s $i$-th child has the form $1^{s-1} 0^i 1 0^{t-i} \gamma$. Moreover, the suffix $\gamma$ remains unchanged in the entire subtree, and the computation tree is isomorphic to the computation tree of $1^s 0^t$. This is called fixed suffix [22]. Note also that the critical prefix length strictly decreases along any downward path in the tree.

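The algorithm performs a post-order traversal of the tree, yielding an enumeration of the strings of $B_d^n$ in what is referred to as cool-lex order [26, 24, 22]. A pre-order traversal of the same tree, which implies moving line 4 of the algorithm before line 1, would yield an enumeration in co-lex order. A crucial property of cool-lex order is that any two subsequent strings differ by at most two swaps (transpositions), thus yielding a Gray code [22]. This can be seen in the computation tree as follows. Note that in a post-order traversal of $T_d^n$, the node visited immediately after a node $u$ is the parent of $u$ if $u$ is a rightmost child, and otherwise the last node on the leftmost path starting at the right sibling of $u$.

Let $u, v$ both be children of $w$. This means that for some $\gamma$ and $i < j$, we have $w = 1^s 0^t \gamma$, $u = 1^{s-1} 0^i 1 0^{t-i} \gamma$, and $v = 1^{s-1} 0^j 1 0^{t-j} \gamma$. Let $v'$ be a descendant of $v$ along the leftmost path, i.e. $v' = 1^{s-1-k} 0 1^k 0^{j-1} 1 0^{t-j} \gamma$ for some $0 \le k \le s-1$. Then

$w = \mathrm{swap}(u, s, s+i)$ and $v' = \mathrm{swap}(\mathrm{swap}(u, s+i, s+j), s-k, s)$. (1)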

We now state a crucial property of bubble languages with respect to the Recursive Swap Generating Algorithm which follows immediately from the definition of bubble languages:

Observation 1

A language $L$ is a bubble language if and only if, for every $d$, its fixed-density subset $L \cap B_d^n$ is closed w.r.t. parents and left siblings in the computation tree $T_d^n$ of the Recursive Swap Generating Algorithm. In particular, if $L \cap B_d^n \neq \emptyset$, then it forms a subtree rooted in $1^d 0^{n-d}$.

Using this property, the Recursive Swap Generating Algorithm can be used to generate any fixed-density bubble language $L$, as long as we have a way of deciding, for a node $w$, already known to be in $L$, which is its rightmost child (if any) that is still in $L$. If such a child exists, and it is the $k$-th child, then the bubble property ensures that all children to its left are also in $L$. Thus, line 2 in the algorithm can simply be replaced by "for $i = 1, \ldots, k$". Moreover, the Recursive Swap Generating Algorithm, which visits the words in the language in cool-lex order, will yield a Gray code, since because of this closure property, the next word visited will again either be the parent, or a node on the leftmost path of the right sibling, both of which are reachable within two swaps, see (1).

In [24], a generic generating algorithm was given which moves the job of finding this rightmost child into a subroutine Oracle(). If Oracle() runs in time $O(1)$, then we have a CAT algorithm. In general, this will not be possible, and a generic Oracle tests for each child from left to right (or from right to left) whether it is in the language. Because of the bubble property, after the first negative (positive) test, it is guaranteed that no more children will be in the language, and the running time of the algorithm is amortized that of the membership tester. The crucial trick is that it is not necessary to have a general membership tester, since all we want to know is which of the children of a node already known to be in $L$ are in $L$; moreover, the membership tester is allowed to use other information, which it can build up iteratively while examining earlier nodes.
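The generic left-to-right oracle described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [24]: `in_language` is an arbitrary caller-supplied membership tester, the root $1^s 0^t \gamma$ is assumed to be in the language, and the brute-force prefix normal test is used only as an example of such a tester.

```python
def generate_bubble(s, t, gamma, in_language, visit):
    """Post-order listing of the words of a bubble language in one subtree."""
    if s > 0 and t > 0:
        for i in range(1, t + 1):
            child_gamma = "1" + "0" * (t - i) + gamma
            child = "1" * (s - 1) + "0" * i + child_gamma
            if not in_language(child):
                break  # bubble property: children further right also fail
            generate_bubble(s - 1, i, child_gamma, in_language, visit)
    visit("1" * s + "0" * t + gamma)


def is_prefix_normal(w):
    """Brute-force membership test used as the example oracle."""
    n = len(w)
    return all(w[j:j + k].count("1") <= w[:k].count("1")
               for k in range(1, n + 1) for j in range(n - k + 1))


listing = []
generate_bubble(3, 2, "", is_prefix_normal, listing.append)
print(listing)
```

With a general-purpose tester like this one, the cost per word is that of the tester; the point of Section 3 is to replace it by a much cheaper test that exploits what is already known about the parent.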

3 Combinatorial Generation of Prefix Normal Words

In this section we prove that the set of prefix normal words is a bubble language. Then, by providing some properties regarding membership testing, we can apply the cool-lex framework to generate all prefix normal words of a given length and density in $O(n)$-amortized time. By concatenating the lists together for all densities in increasing order, we obtain an $O(n)$-amortized time algorithm to list all prefix normal words of length $n$.

Lemma 1

The language $L_{PN}$ is a bubble language.

Proof

Let $w$ be a prefix normal word containing an occurrence of 01. Let $w'$ be the word obtained from $w$ by replacing the first occurrence of 01 with 10. Then $w = u01v$ and $w' = u10v$, for some $u, v \in \Sigma^*$. Let $z$ be a substring of $w'$. We have to show that $|z|_1 \le P(w', |z|)$.

Note that for any , . In fact, , and for every , . Now if is contained in or in , then is a substring of , and thus . If , with suffix of and prefix of , then . If , with prefix of , then , and is a substring of , thus . Else , with suffix of . We can assume that is a proper suffix of . Let be the substring of of the same length as and starting one position before (in other words, is obtained by shifting to the left by one position). Since does not contain as a substring, we have for some . If is a power of ’s, then and the claim holds. Else, , and is a substring of . Thus . ∎
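Lemma 1 can be sanity-checked exhaustively for short words. The sketch below is an empirical check under the brute-force definition, not a substitute for the proof; `bubble_step` is an illustrative name for the operation "replace the first 01 by 10".

```python
from itertools import product


def is_prefix_normal(w):
    """Brute-force test of Definition 1."""
    n = len(w)
    return all(w[j:j + k].count("1") <= w[:k].count("1")
               for k in range(1, n + 1) for j in range(n - k + 1))


def bubble_step(w):
    """Replace the first occurrence of 01 in w by 10 (no-op if there is none)."""
    k = w.find("01")
    return w if k < 0 else w[:k] + "10" + w[k + 2:]


def bubble_property_holds(max_n):
    """Check that bubble_step maps prefix normal words to prefix normal words."""
    for n in range(1, max_n + 1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            if is_prefix_normal(w) and not is_prefix_normal(bubble_step(w)):
                return False
    return True


print(bubble_step("10101"))
print(bubble_property_holds(8))
```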

In Fig. 3, we give an example computation tree $T_d^n$ and highlight the subtree corresponding to $L_{PN} \cap B_d^n$. Since $L_{PN}$ is a bubble language, by Obs. 1 it is closed w.r.t. left siblings and parents. However, we still have to find a way of deciding which is the rightmost child of a node that is still in $L_{PN}$.

Figure 3: An example computation tree $T_d^n$. Prefix normal words in bold.

The following lemma states that, given a prefix normal word $w$, in order to decide whether one of its children in the computation tree is prefix normal, it suffices to check the PN-property for one particular length only: the critical prefix length of the child node. Moreover, this check can be done w.r.t. $\gamma$ only. This fact will be crucial in the generating algorithm.

Lemma 2

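Let $w \in L_{PN}$, with $w = 1^s 0^t \gamma$, with $\gamma \in 1\{0,1\}^* \cup \{\varepsilon\}$. Let $\gamma' = \gamma 0^{s+t}$, i.e. $\gamma$ padded with 0s to length $n = |w|$. Let $w' = 1^{s-1} 0^i 1 0^{t-i} \gamma$ for some $1 \le i \le t$. Then $w' \in L_{PN}$, unless one of the following holds:

  1. $\gamma'$ has a substring of length $s+i-1$ with at least $s$ 1s, or

  2. the string $1 0^{t-i} \gamma'_1 \cdots \gamma'_{s+2i-t-2}$ has at least $s$ 1s.

Moreover, the latter is the case if and only if $P(\gamma', s+2i-t-2) \ge s-1$ (where by convention, we regard a prefix of negative length as the empty word).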

Proof

Let’s assume that . Then there is a substring of s.t. . Let be the length of .

Case 1. is a substring of . Since , therefore . Since also , this implies , because for all other arguments, and coincide. Note that must have an occurrence in which contains neither of the swapped bits, else it would not be a substring of . Thus starts at some position to the right of . Therefore we can write , with a substring of ; in particular, if is a substring of , then and ; otherwise, is a prefix of . Now set , with the substring of of length following . Then has length , and since , it contains at least many s.

Case 2. is not a substring of . Therefore it contains at least one of the two swapped bits. It cannot contain the swapped (in position ) because then it would be preceded only by s, in which case the prefix of of length could not have fewer s than . Thus, contains the swapped only (in position ). If , then the prefix of of length overlaps with , i.e. we can write and for some non-empty containing the swapped . Since , this implies that also has more s than the prefix of the same length. Since is a substring of , we are back in Case 1.

So we have . We can write , with prefix of . Now remove the starting s from and extend it to the right to get , with be the prefix of of length . Then and . Moreover, . ∎

Corollary 1

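Given $w = 1^s 0^t \gamma \in L_{PN}$ and a child $w'$ of $w$ as in Lemma 2. If we know the two quantities appearing in conditions 1 and 2 of Lemma 2, namely the maximum number of 1s in a substring of the padded suffix of length equal to the critical prefix length of $w'$, and the number of 1s in the prefix of the padded suffix spanned by the string of condition 2, then it can be decided in constant time whether $w'$ is prefix normal.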

Lemma 3

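Let $\gamma' = 1 0^r \gamma$, with $r \ge 0$. Then for all $1 \le i \le |\gamma'|$, $F(\gamma', i) = \max(P(\gamma', i), F(\gamma, i))$ if $i \le |\gamma|$, and $F(\gamma', i) = \max(P(\gamma', i), |\gamma|_1)$ otherwise.

Proof

A substring of length $i$ either uses the new 1 in the first position, or it does not. If it does, then it is a prefix of $\gamma'$ and its number of 1s is given by $P(\gamma', i)$. Else it is a substring of $0^r \gamma$, and its maximum number of 1s is given by $F(\gamma, i)$ for $i$ up to the length of $\gamma$, or by the number of 1s in $\gamma$, $|\gamma|_1$, if it spans all of $\gamma$. ∎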

Corollary 2

The $F$-function of $\gamma$ for a node $w = 1^s 0^t \gamma$ can be computed in linear time based on the $F$-function of $w$'s parent node.
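The linear-time update behind Corollary 2 can be sketched as a single pass over the new suffix. The sketch assumes the reading of Lemma 3 in which the new suffix is $1 0^r \gamma$; the function names are illustrative, and `max_ones_table` is a brute-force reference used only for comparison.

```python
def max_ones_table(g):
    """Brute force: [F(g, 0), F(g, 1), ..., F(g, len(g))]."""
    return [0] + [max(g[j:j + i].count("1") for j in range(len(g) - i + 1))
                  for i in range(1, len(g) + 1)]


def extend_table(F_gamma, gamma, r):
    """F-function of 1 0^r gamma, computed from the F-function of gamma."""
    new = "1" + "0" * r + gamma
    total = gamma.count("1")          # used when a window spans all of gamma
    out, ones = [0], 0
    for i in range(1, len(new) + 1):
        ones += 1 if new[i - 1] == "1" else 0      # P(new, i)
        inner = F_gamma[i] if i <= len(gamma) else total
        out.append(max(ones, inner))
    return out


gamma = "011"
print(extend_table(max_ones_table(gamma), gamma, 2))
print(max_ones_table("1" + "00" + gamma))
```

The two printed tables agree; only the first is computed in time linear in the length of the new suffix.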

By applying these results, the algorithm GeneratePN($d$, $n-d$, $\varepsilon$) can be used to generate $L_{PN} \cap B_d^n$ in cool-lex Gray code order. Starting from the left child and proceeding right (with respect to the computation tree $T_d^n$), the algorithm will make a recursive call until a child is no longer prefix normal. The membership test is done in the subroutine isPN, which uses the conditions of Lemma 2. The algorithm maintains an array $F$ which contains the maximum number of 1s in $i$-length substrings of $\gamma$ (the $F$-function of $\gamma$), and a variable $z$. Before testing the first child, in update($F$), it computes the current $\gamma$'s $F$-function based on the parent's (Corollary 2). Note that it is not necessary to compute all of the $F$-function, since all nodes in the subtree have critical prefix length smaller than $s+t$, thus this update is done only up to length $s+t$. After the recursive calls to the children, the array $F$ is restored to its previous state in restore($F$). The variable $z$ contains the number of 1s in the prefix of $\gamma$ which is spanned by the substring of case 2 of Lemma 2, for the first child. It is updated in constant time after each successful call to isPN, to include the number of 1s in the two following positions in $\gamma$.

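  1. if $s > 0$ and $t > 0$
  2. then update($F$)
  3.      compute $z$
  4.      $i \leftarrow 1$
  5.      while $i \le t$ and isPN()
  6.      do GeneratePN($s-1$, $i$, $1 0^{t-i} \gamma$)
  7.         update $z$
  8.         $i \leftarrow i + 1$
  9.      restore($F$)
  10. Visit()

Figure 4: Algorithm GeneratePN($s$, $t$, $\gamma$), generating all prefix normal words in the subtree rooted in $w = 1^s 0^t \gamma$.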

By concatenating the lists of prefix normal words with densities $0, 1, \ldots, n$, we obtain an exhaustive listing of $L_{PN} \cap \Sigma^n$.

  1. for $d = 0, 1, \ldots, n$
  2. do initialize $F$ of length $n$ with all 0s
  3.    GeneratePN($d$, $n-d$, $\varepsilon$)

Figure 5: Algorithm generating all prefix normal words of length $n$.

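As an example, running the algorithm of Fig. 5 with $n = 5$ produces the following list of prefix normal words of length 5:

00000, 10000, 10100, 10010, 10001, 11000, 11010, 10101, 11001, 11100, 11011, 11101, 11110, 11111.
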
These strings are also given in Sec. 2.1. Since the fixed-density listings are a cyclic Gray code (Theorem 3.1 from [22]), it follows that this complete listing is also a Gray code. In fact, if the fixed-density listings are listed by the odd densities (increasing), followed by the even densities (decreasing), the resulting listing would be a cyclic Gray code.
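The following is a minimal sketch of the generation scheme of Figs. 4 and 5, written under several simplifying assumptions that are not the authors' implementation: the suffix $\gamma$ and its $F$-array are passed as arguments instead of being updated and restored in place, the $F$-array is recomputed over its full length rather than only up to the critical prefix length, and the counter $z$ is replaced by a prefix-count table `pre`. The child test follows one reading of Lemma 2, with $\gamma$ implicitly padded by 0s; the brute-force `is_prefix_normal` is included only for cross-checking.

```python
from itertools import product


def is_prefix_normal(w):
    """Brute-force reference test (Definition 1), used only for checking."""
    n = len(w)
    return all(w[j:j + k].count("1") <= w[:k].count("1")
               for k in range(1, n + 1) for j in range(n - k + 1))


def generate_pn(n):
    """List all prefix normal words of length n, density by density."""
    out = []

    def rec(s, t, gamma, F):
        # F[k]: max number of 1s in a length-k window of gamma padded with 0s.
        if s > 0 and t > 0:
            # pre[L]: number of 1s among the first L letters of padded gamma.
            pre = [0] * (n + 1)
            for L in range(1, n + 1):
                bit = 1 if L <= len(gamma) and gamma[L - 1] == "1" else 0
                pre[L] = pre[L - 1] + bit
            for i in range(1, t + 1):
                m = s + i - 1                    # critical prefix length of child
                L = max(0, m - 1 - (t - i))      # part of gamma seen by case 2
                if F[m] >= s or 1 + pre[L] >= s:
                    break                        # child i and all later ones fail
                child_gamma = "1" + "0" * (t - i) + gamma
                child_F, ones = [0] * (n + 1), 0
                for k in range(1, n + 1):        # Lemma 3 style update
                    if k <= len(child_gamma) and child_gamma[k - 1] == "1":
                        ones += 1
                    child_F[k] = max(F[k], ones)
                rec(s - 1, i, child_gamma, child_F)
        out.append("1" * s + "0" * t + gamma)

    for d in range(n + 1):
        rec(d, n - d, "", [0] * (n + 1))
    return out


print(generate_pn(5))
```

Each child test costs constant time once `pre` and `F` are available, and building them costs time linear in $n$ per generated word, which matches the amortized bound discussed in the text even though the constant factors here are not optimized.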

Theorem 3.1

The algorithm of Fig. 5 generates all prefix normal words of length $n$ in amortized $O(n)$ time per word.

Proof

Since $1^d 0^{n-d}$ is prefix normal for every $d$, we only need to show that the correct subtrees of $T_d^n$ are generated by the algorithm. By Lemma 2, only those children will be generated that are prefix normal; on the other hand, by the bubble property (Obs. 1), as soon as a child tests negative, no further children will be prefix normal. The running time of the recursive call on $w = 1^s 0^t \gamma$ consists of (a) updating and restoring $F$ (lines 2 and 9): the number of steps equals the critical prefix length of $w$, which is $O(n)$; (b) computing $z$ (line 3): again at most $s+t$, the critical prefix length of $w$, many steps, so $O(n)$; and (c) work within the while-loop (lines 5 to 8), which, for a word $w$ with $k$ prefix normal children, consists of $k$ positive and at most one negative membership tests, of $k$ updates of $z$, and the recursive calls on the positive children. The membership tests take constant time by Corollary 1, and so does the update of $z$. Since $w$ has $k$ prefix normal children, we charge the positive membership tests and the $z$-updates to the children, and the negative test to the current word. So for one word $w$, we get $O(n)$ work. ∎

4 Experimental results

In this section we present some theoretical and numerical results about the number of prefix normal words and their structure. These have become available thanks to the algorithm presented, which allowed us to generate $L_{PN}$ up to length 50 on a home computer. Let $\mathrm{pnw}(n) = |L_{PN} \cap \Sigma^n|$. The following lemma follows from the observation that $1^{\lceil n/2 \rceil} w$ is a prefix normal word of length $n$ for all words $w$ of length $\lfloor n/2 \rfloor$.

Lemma 4

The number of prefix normal words grows exponentially in $n$. We have that $\mathrm{pnw}(n) \ge 2^{\lfloor n/2 \rfloor}$.

The first members of the sequence $\mathrm{pnw}(n)$ are listed in [25], and these values suggest that the lower bound above is not sharp. We turn our attention to the growth rate of $\mathrm{pnw}(n)$ as $n$ increases. Note that $\mathrm{pnw}(n) \le \mathrm{pnw}(n+1) \le 2\,\mathrm{pnw}(n)$. The lower bound follows from the fact that all prefix normal words can be extended by adding a 0 to the end, and the upper bound is implied by the prefix-closed property of $L_{PN}$. Fig. 6 (left) shows the growth ratio $\mathrm{pnw}(n+1)/\mathrm{pnw}(n)$ for small values of $n$. The figure shows two interesting phenomena: first, the values seem to approach 2 slowly, i.e., the number of prefix normal words almost doubles as we increase the length by 1. Second, the values show an oscillation pattern between even and odd values. We have so far been unable to establish these observations theoretically.
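The counts and growth ratios can be reproduced for small lengths with a simple level-by-level sketch. It relies on the prefix-closed property mentioned above: every prefix normal word of length $n+1$ extends a prefix normal word of length $n$ by one letter. This is not the generation algorithm of Section 3, and the brute-force test limits it to short lengths; the function names are illustrative.

```python
def is_prefix_normal(w):
    """Brute-force test of Definition 1."""
    n = len(w)
    return all(w[j:j + k].count("1") <= w[:k].count("1")
               for k in range(1, n + 1) for j in range(n - k + 1))


def count_by_extension(max_n):
    """Return [pnw(1), ..., pnw(max_n)] by extending words one letter at a time."""
    level, counts = [""], []
    for _ in range(max_n):
        level = [w + c for w in level for c in "01" if is_prefix_normal(w + c)]
        counts.append(len(level))
    return counts


counts = count_by_extension(12)
print(counts[:5])  # [2, 3, 5, 8, 14]
for n in range(1, len(counts)):
    print(n, round(counts[n] / counts[n - 1], 3))  # ratio pnw(n+1)/pnw(n)
```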

Figure 6: The value of $\mathrm{pnw}(n+1)/\mathrm{pnw}(n)$ (left), and of the expected critical prefix length for prefix normal words (right).

The structure of prefix normal words is also relevant for the generation algorithm, since the amortized running time of the algorithm is bounded above by the average value of the critical prefix length taken over all prefix normal words. This differs from the expected critical prefix length of the prefix normal form of a uniformly random word. For the latter we have the following result.

Lemma 5

Given a random word $w \in \Sigma^n$, let $w' = \mathrm{PNF}(w)$. Let $c(w')$ denote the critical prefix length of $w'$. Then for the expected value of $c(w')$ we have $E(c(w')) = \Theta(\log n)$.

Proof

Write $w'$ in the usual form, i.e. $w' = 1^s 0^t \gamma$ with $\gamma \in 1\{0,1\}^* \cup \{\varepsilon\}$, and consider the random variables $s$ and $t$. It is known that the expected maximum length of a run of 1s in a random word of length $n$ is $\Theta(\log n)$ [16]. Clearly, $s$ equals the length of the longest run of 1s of $w$, thus $E(s) = \Theta(\log n)$. To determine $E(t)$, consider a 1-run of $w$ of maximum length $s$. If $w$ has at least another occurrence of 1, then there is a substring of $w$ consisting of the maximal 1-run and one more 1; the number of 0s in this substring is an upper bound on $t$. Since these 0s form a single 0-run, their number is again $O(\log n)$ in expectation. If on the other hand, all occurrences of 1 in $w$ are in the maximal run, then $w' = 1^s 0^{n-s}$, so $t = n - s$. The number of words with at most one 1-run is $O(n^2)$. So we have:

$E(s + t) = \Theta(\log n) + O(\log n) + O(n \cdot n^2 / 2^n) = \Theta(\log n)$. ∎

The expected value of the critical prefix length for prefix normal words is shown in Fig. 6 (right), on a loglinear scale. We conjecture that it is polylogarithmic in $n$ for prefix normal words. The linear alignment of the data points together with Lemma 5 seems to support that.

5 Conclusion and Open Problems

We presented a new generating algorithm for prefix normal words, which produces all prefix normal words of length $n$ in amortized linear time per word. Notice that the number of words that are not prefix normal also grows exponentially and greatly dominates prefix normal words, so the gain of any algorithm that runs in amortized linear time per word, over brute-force testing of all binary words, is considerable. We believe, moreover, that our algorithm actually runs in $O(\log n)$ amortized time per word. This could be proved by showing that the expected critical prefix length of a prefix normal word is polylogarithmic in $n$.

In Sec. 3 we gave a linear time testing algorithm for words which are derived from a word $w$ already known to be prefix normal, via a particular operation (swapping the last 1 of the first 1-run with a 0 in the first 0-run). This testing algorithm relies both on the knowledge that $w$ is prefix normal, and on the presence of a data structure for $\gamma$ (the $F$-function). We pose as an open problem to find a strongly subquadratic time testing algorithm for arbitrary words. Another open problem is the computation of prefix normal forms. Solving this problem would lead immediately to an improvement for indexed binary jumbled pattern matching.

The observation that our language is a bubble language has opened up completely new roads. An efficient implementation of the generating algorithm led to new experimental results which were not available with our previous approach. The obtained data led to new conjectures and results. We are confident that the connection to bubble languages will also help in establishing theoretical results about the number and structure of prefix normal words, and could hopefully lead to a strongly subquadratic testing algorithm.

References

  • [1] A. Amir, A. Apostolico, G. M. Landau, and G. Satta. Efficient text fingerprinting via Parikh mapping. J. Discrete Algorithms, 1(5-6):409–421, 2003.
  • [2] G. Badkobeh, G. Fici, S. Kroon, and Zs. Lipták. Binary jumbled string matching for highly run-length compressible texts. Inf. Process. Lett., 113(17):604–608, 2013.
  • [3] G. Benson. Composition alignment. In Proc. of the 3rd International Workshop on Algorithms in Bioinformatics (WABI’03), pages 447–461, 2003.
  • [4] S. Böcker. Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry. Bioinformatics, 23(2):5–12, 2007.
  • [5] S. Böcker, K. Jahn, J. Mixtacki, and J. Stoye. Computation of median gene clusters. In Proc. of the Twelfth Annual International Conference on Computational Molecular Biology (RECOMB 2008), pages 331–345, 2008. LNBI 4955.
  • [6] P. Burcsi, F. Cicalese, G. Fici, and Zs. Lipták. On Table Arrangements, Scrabble Freaks, and Jumbled Pattern Matching. In Proc. of the 5th International Conference on Fun with Algorithms (FUN 2010), volume 6099 of LNCS, pages 89–101, 2010.
  • [7] P. Burcsi, F. Cicalese, G. Fici, and Zs. Lipták. On approximate jumbled pattern matching in strings. Theory Comput. Syst., 50(1):35–51, 2012.
  • [8] A. Butman, R. Eres, and G. M. Landau. Scaled and permuted string matching. Inf. Process. Lett., 92(6):293–297, 2004.
  • [9] F. Cicalese, G. Fici, and Zs. Lipták. Searching for jumbled patterns in strings. In Proc. of the Prague Stringology Conference 2009 (PSC 2009), pages 105–117. Czech Technical University in Prague, 2009.
  • [10] F. Cicalese, T. Gagie, E. Giaquinta, E. S. Laber, Zs. Lipták, R. Rizzi, and A. I. Tomescu. Indexes for jumbled pattern matching in strings, trees and graphs. In Proc. of the 20th String Processing and Information Retrieval Symposium (SPIRE 2013), volume 8214 of LNCS, pages 56–63, 2013.
  • [11] F. Cicalese, E. S. Laber, O. Weimann, and R. Yuster. Near linear time construction of an approximate index for all maximum consecutive sub-sums of a sequence. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012), volume 7354 of LNCS, pages 149–158, 2012.
  • [12] K. Dührkop, M. Ludwig, M. Meusel, and S. Böcker. Faster mass decomposition. In WABI, pages 45–58, 2013.
  • [13] G. Fici and Zs. Lipták. On prefix normal words. In Proc. of the 15th Intern. Conf. on Developments in Language Theory (DLT 2011), volume 6795 of LNCS, pages 228–238. Springer, 2011.
  • [14] T. Gagie, D. Hermelin, G. M. Landau, and O. Weimann. Binary jumbled pattern matching on trees and tree-like structures. In Proc. of the 21st Annual European Symposium on Algorithm (ESA 2013), pages 517–528, 2013.
  • [15] E. Giaquinta and Sz. Grabowski. New algorithms for binary jumbled pattern matching. Inf. Process. Lett., 113(14-16):538–542, 2013.
  • [16] L. J. Guibas and A. Odlyzko. Long repetitive patterns in random sequences. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 53:241–262, 1980.
  • [17] T. Kociumaka, J. Radoszewski, and W. Rytter. Efficient indexes for jumbled pattern matching with constant-sized alphabet. In Proc. of the 21st Annual European Symposium on Algorithm (ESA 2013), pages 625–636, 2013.
  • [18] L.-K. Lee, M. Lewenstein, and Q. Zhang. Parikh matching in the streaming model. In Proc. of 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012, volume 7608 of Lecture Notes in Computer Science, pages 336–341. Springer, 2012.
  • [19] T. M. Moosa and M. S. Rahman. Indexing permutations for binary strings. Inf. Process. Lett., 110:795–798, 2010.
  • [20] T. M. Moosa and M. S. Rahman. Sub-quadratic time and linear space data structures for permutation matching in binary strings. J. Discrete Algorithms, 10:5–9, 2012.
  • [21] L. Parida. Gapped permutation patterns for comparative genomics. In Proc. of the 6th International Workshop on Algorithms in Bioinformatics, (WABI 2006), pages 376–387, 2006.
  • [22] F. Ruskey, J. Sawada, and A. Williams. Binary bubble languages and cool-lex order. J. Comb. Theory, Ser. A, 119(1):155–169, 2012.
  • [23] F. Ruskey, J. Sawada, and A. Williams. De Bruijn sequences for fixed-weight binary strings. SIAM Journal on Discrete Mathematics, 26(2):605–617, 2012.
  • [24] J. Sawada and A. Williams. Efficient oracles for generating binary bubble languages. Electr. J. Comb., 19(1):P42, 2012.
  • [25] N. J. A. Sloane. The On-Line Encyclopedia of Integer Sequences. Available electronically at http://oeis.org. Sequence A194850.
  • [26] A. M. Williams. Shift Gray Codes. PhD thesis, University of Victoria, Canada, 2009.

Appendix: Connection between prefix normal forms and binary jumbled pattern matching

The linear space solutions for binary pattern matching all rely on a simple property of binary strings, which we refer to as Interval Lemma (folklore): For a binary string $w$ and any fixed length $k$, if $w$ has two substrings of length $k$, with one containing $x$ 1s, and the other $y$ 1s, where $x < y$, then, for any $x \le z \le y$, $w$ also contains a substring of length $k$ with exactly $z$ 1s. In other words, all Parikh vectors of substrings of the same length build an interval. The lemma implies that in order to be able to answer existence jumbled pattern matching queries, it suffices to store, for every length $k$, the maximum and minimum number of 1s in any substring of length $k$: When querying whether $w$ has a substring with Parikh vector $(x, y)$, i.e. with $x$ 1s and $y$ 0s, we can simply ask whether $x$ lies between the maximum and minimum number of 1s for length $x + y$. This list of minima and maxima for every length is the linear size index used. The big open question is how to compute it faster than the current $O(n^2/\log n)$ time.
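The index described above can be sketched with a straightforward quadratic-time construction. This is only an illustration of the Interval Lemma at work, not one of the faster construction algorithms cited in the introduction; `build_index` and `has_substring` are illustrative names, and a query asks for `x` 1s and `y` 0s.

```python
def build_index(text):
    """For every length k, store the min and max number of 1s in a window."""
    n = len(text)
    pre = [0]
    for c in text:
        pre.append(pre[-1] + (1 if c == "1" else 0))
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for k in range(1, n + 1):
        window_counts = [pre[j + k] - pre[j] for j in range(n - k + 1)]
        lo[k], hi[k] = min(window_counts), max(window_counts)
    return lo, hi


def has_substring(index, x, y):
    """Constant-time query: is there a substring with x 1s and y 0s?"""
    lo, hi = index
    k = x + y
    return 1 <= k < len(lo) and lo[k] <= x <= hi[k]


index = build_index("10100110")
print(has_substring(index, 2, 1), has_substring(index, 3, 0))
```

The index uses two integers per length, so its size is linear in the text length; only the construction time is quadratic in this sketch.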

Now, prefix normal forms of a word can be used to compute this index. We know that two words have the same Parikh set (Parikh vectors of substrings) if and only if they have the same prefix normal forms both w.r.t. 1 and to 0 (see [13], Thm. 2).

In Fig. 7, we present a word $w$ and its prefix normal forms in a standard representation for binary words: Draw the word in the Euclidean plane by representing each letter 1 by an upper unit diagonal and each letter 0 by a lower unit diagonal, starting from the origin $(0, 0)$. The region between the prefix normal forms w.r.t. 1 and w.r.t. 0 forms exactly the Parikh set of $w$. For example, the Parikh vectors of all substrings of one fixed length lie on a single vertical line between the two forms.

Figure 7: The word $w$ (dashed line) and its prefix normal forms (grey lines). The area between the two PNFs is the Parikh set of $w$. The vertical line shows all Parikh vectors of substrings of one fixed length.