Normal, Abby Normal, Prefix Normal
A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number of prefix normal words of length , showing that for some and . We introduce efficient algorithms for testing the prefix normal property and a “mechanical algorithm” for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes to stay normal but Bob wants to drive her “abnormal” – we discuss which parameter settings allow Alice to succeed.
Keywords: prefix normal words, binary jumbled pattern matching, normal forms, enumeration, membership testing, binary languages
Consider the binary word . Does it have a substring of length containing exactly ones? In Fig. 1 the word is represented by the black line (go up and right for a , down and right for a ), while the grid points within the area between the two lighter lines form the Parikh set of : the set of vectors s.t. some substring of contains exactly ones and zeros. Since the point lies within the area bounded by the two lighter lines, we see that the answer to our question is ‘yes’. (Don’t worry, a more detailed explanation will follow soon.) Now, this paper is about the lighter lines, called prefix normal words.
Prefix normal words: A binary word is called prefix normal (with respect to ) if no substring of has more s than the prefix of the same length. (When not specified, we mean prefix normal w.r.t. 1.) For example, is not prefix normal because it has a substring of length with ones, while the prefix of length has only ones. In  it was shown that to every word , one can assign two prefix normal words, the prefix normal forms (PNF) of (w.r.t. and w.r.t. ), and that these are precisely the lines bounding ’s Parikh set from above (w.r.t. ) resp. from below (w.r.t. ), interpreted as binary words.
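The definition translates directly into a brute-force test; the following Python sketch (our illustration, not an algorithm from the paper, and quadratic in the word length) checks every substring length against the corresponding prefix:

```python
def is_prefix_normal(w: str) -> bool:
    """Brute-force check of the prefix normal property (w.r.t. 1):
    for every length k, no length-k substring of w may contain
    more 1s than the length-k prefix."""
    n = len(w)
    for k in range(1, n + 1):
        prefix_ones = w[:k].count("1")
        max_ones = max(w[i:i + k].count("1") for i in range(n - k + 1))
        if max_ones > prefix_ones:
            return False
    return True
```

For instance, 110100 passes, while 101100 fails already at length 2: its substring 11 has more 1s than the prefix 10.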
Prefix normal games: Before we further elaborate on the connection between the initial problem and prefix normal words, let’s see how well you have understood the definition. To this end, we define a two-player game. At the start of the game Alice and Bob have free positions. Alice moves first: she picks a position and sets it to or . Then in alternating moves, they pick an empty position and set it. The game ends after moves. Alice wins if and only if the resulting binary word is prefix normal.
Here is an example run. We have . Alice sets the first bit to , then Bob sets the second bit to . Now Alice sets the th bit to , and she has won, since whichever position Bob chooses, she will set the remaining position to , thus ensuring that the word is prefix normal.
The solution to the following exercise can be found in Section 6.
Find the maximum such that Alice has a winning strategy.
Binary Jumbled Pattern Matching: The problem of deciding whether a particular pair lies within the Parikh set of a word is known as binary jumbled pattern matching. There has been much interest recently in the indexed version, where an index for the Parikh set is created in a preprocessing step, which can then be used to answer queries fast. The Parikh set can be represented in linear space due to the interval property of binary strings: If has -length substrings with resp. ones, where , then it also has a -length substring with ones, for every (folklore). Thus the Parikh set can be represented by storing, for every , the minimum and maximum number of s in a substring of length . Much recent research has focused on how to compute these numbers efficiently [10, 20, 21, 12, 2, 16, 15]. The problem has also been extended to graphs and trees [15, 11], to the streaming model , and to approximate indexes . There is also interest in the non-binary variant [9, 10, 18]. A closely related problem is that of Parikh fingerprints . Applications in computational biology include SNP discovery, alignment, gene clusters, pattern discovery, and mass spectrometry data interpretation [4, 3, 5, 13, 23].
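The interval property and the resulting linear-space representation can be made concrete with a small sketch (ours; the construction below is the naive quadratic one, whereas the cited algorithms build the same table faster):

```python
def jumbled_index(w: str):
    """For each substring length k, record the minimum and maximum
    number of 1s over all length-k substrings of w (naive O(n^2) build)."""
    n = len(w)
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for k in range(1, n + 1):
        counts = [w[i:i + k].count("1") for i in range(n - k + 1)]
        lo[k], hi[k] = min(counts), max(counts)
    return lo, hi

def has_jumbled_match(lo, hi, ones: int, zeros: int) -> bool:
    """By the interval property, w has a substring with exactly `ones` 1s
    and `zeros` 0s iff lo[k] <= ones <= hi[k], where k = ones + zeros."""
    k = ones + zeros
    if k >= len(lo):  # query longer than the word itself
        return False
    return lo[k] <= ones <= hi[k]
```

Queries then take constant time: a pair lies in the Parikh set exactly when its 1-count falls in the stored interval for its length.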
The current best construction algorithms for the linear size index for binary jumbled pattern matching run in time [7, 20], for a word of length , with some improvements for special cases (compressible strings [15, 2], bit-parallel operations [21, 16]). (Very recently, an algorithm with running time was presented .) As we will see later, computing the prefix normal forms is equivalent to creating an index for the Parikh set of . Currently, we know of no faster algorithms for computing the prefix normal forms than those that already exist for the linear-size index. However, should better algorithms be discovered, these would immediately carry over to the problem of indexed binary jumbled pattern matching.
Testing: It turns out that even testing whether a given word is prefix normal is a nontrivial task. We can of course compute ’s prefix normal form in time using one of the above algorithms: obviously is prefix normal if and only if . In , we gave a generating algorithm for prefix normal words, which exhaustively lists all prefix normal words of a fixed length. The algorithm was based on the fact that prefix normal words are a bubble language, a recently introduced class of binary languages [24, 26]. As a subroutine of our algorithm, we gave a linear time test for words which are obtained from a prefix normal word via a certain operation. In Section 7, we present an algorithm to test whether an arbitrary word is prefix normal, based on similar ideas. Our algorithm is quadratic in the worst case, but we believe it performs much better than other algorithms once some simple cases have been removed.
We further demonstrate how several simple linear time tests can be used as a filtering step, and conjecture, based on experimental evidence, that these lead to expected time algorithms. But first the reader is kindly invited to try for herself.
Decide whether the word is prefix normal.
Enumerating: Another very interesting and challenging problem is the enumeration of prefix normal words. It turns out that even though the number of prefix normal words grows exponentially, the fraction of these words within all binary words goes to as goes to infinity. In Sections 3 to 5, we present both asymptotic and exact results for prefix normal words, including generating functions for special classes and counting extensions for particular words. Some of the proofs in this part of the paper are rather technical: they will be available in the full version.
Mechanical algorithm design: We contribute to the area of mechanical algorithm design by presenting an algorithm for computing the Parikh set which uses the new sandy beach technique, a technique we believe will be useful in many other applications (Sec. 7).
We would like to point out that prefix normal words, albeit similar in name, are not to be confused with so-called Abby Normal (a.k.a. abnormal or AB normal) words, or rather, brains, introduced in . And now it is time to wish you, the reader, as much fun in reading our paper as we had in writing it!
2 Prefix normal words
A binary word (or string) over is a finite sequence of elements from . Its length is denoted by . For any , the -th symbol of a word is denoted by . We denote by the words over of length , and by the set of finite words over . The empty word is denoted by . Let . If for some , we say that is a prefix of and is a suffix of . A substring of is a prefix of a suffix of . A binary language is any subset of . We denote by the number of occurrences in of character ; is called the density of .
Let . For , we set , the number of s in the -length prefix of , and , the maximum number of s over all substrings of length .
Prefix normal words, prefix normal equivalence and prefix normal form were introduced in . A word is prefix normal (w.r.t. ) if, for all , . In other words, a word is prefix normal if no substring contains more s than the prefix of the same length.
We give all prefix normal words of length :
000000, 100000, 100001, 100010, 100100, 101000, 101001, 101010, 110000, 110001, 110010, 110011, 110100, 110101, 110110, 111000, 111001, 111010, 111011, 111100, 111101, 111110, 111111.
Two words are prefix normal equivalent (w.r.t. ) if and only if for all . Given , the prefix normal form (w.r.t. ) of , , is the unique prefix normal word which is prefix normal equivalent (w.r.t. ) to . Prefix normality w.r.t. , prefix normal equivalence w.r.t. , and are defined analogously. When not stated explicitly, we are referring to the functions w.r.t. . For example, the words and are prefix normal equivalent both w.r.t. and . See [14, 8] for more examples.
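Since the i-th bit of the prefix normal form (w.r.t. 1) records whether the maximum number of 1s in a substring grows when the substring length increases from i-1 to i, the form can be computed naively as follows (our Python sketch; no attempt is made at the faster index constructions discussed in the Introduction):

```python
def prefix_normal_form(w: str) -> str:
    """Compute PNF_1(w): bit i is 1 iff the maximum number of 1s over
    length-i substrings exceeds the maximum over length-(i-1) substrings."""
    n = len(w)
    f_prev, bits = 0, []
    for k in range(1, n + 1):
        f = max(w[i:i + k].count("1") for i in range(n - k + 1))
        bits.append("1" if f > f_prev else "0")
        f_prev = f
    return "".join(bits)
```

A word is prefix normal exactly when it equals its own form: for instance, prefix_normal_form("110100") returns "110100", while prefix_normal_form("0101") returns "1010".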
In Fig. 1, we see an example string and its prefix normal forms. The interval property (see Introduction) can be graphically interpreted as vertical lines. The vertical line through point represents length- substrings: the grid points within the enclosed area are and , so all length- substrings have between and ones. We can interpret, for each length , the intersection of the th vertical line with the top grey line as the maximum number of s, and with the bottom grey line as the minimum number of s. Now it is easy to see that, passing from to , this maximum, , can either remain the same or increase by one. This means that the top grey line allows an interpretation as a binary word. A similar interpretation applies to the bottom line and prefix normal words w.r.t. 0.
It should now be clear, also graphically, that the maximum number of s for a substring of length , , is precisely the number of s in the -length prefix of (the upper grey line); and similarly for the maximal number of s (equivalently, the minimal number of s) and (the lower grey line). Moreover, these values can be obtained in constant time with constant-time rank-operations [22, 15].
We list a few properties of prefix normal words that will be useful later.
Lemma 1 (Properties of prefix normal words )
Every prefix of a prefix normal word is also prefix normal.
If is prefix normal, then is also prefix normal.
Given of length , it can be decided in time whether is prefix normal.
We denote the language of prefix normal words by , the number of prefix normal words of length by , and the number of prefix normal words of length and density by . The first few values of the sequence are listed in .
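For small lengths these counts are easy to reproduce by exhaustive search; the following brute-force Python sketch (ours, exponential in n, for sanity-checking only) recovers the 23 words of length 6 listed above:

```python
from itertools import product

def is_prefix_normal(w: str) -> bool:
    """Quadratic brute-force test of the prefix normal property."""
    n = len(w)
    return all(
        max(w[i:i + k].count("1") for i in range(n - k + 1)) <= w[:k].count("1")
        for k in range(1, n + 1)
    )

def pnw(n: int) -> int:
    """Count prefix normal words of length n (cf. OEIS A194850)."""
    return sum(is_prefix_normal("".join(b)) for b in product("01", repeat=n))
```

For example, pnw(6) evaluates to 23, matching the list above.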
3 Asymptotic bounds on the number of prefix normal words
We give lower and upper bounds on the number of prefix normal words of length . Our lower bound on is proved in Section 6.
There exists such that
If we consider the length of the first 1-run, we obtain an upper bound.
For , we have .
Let be a number to be specified later. Partition into two classes according to the length of the first 1-run.
Case 1: If is prefix normal and the first 1-run’s length is less than , then there are no consecutive s in . Write as the concatenation of blocks of length and a final, possibly shorter block: For each block we have at most possibilities, so there can be at most words in this class. Case 2: The length of the first -run in is at least . Since the first symbols of are already fixed as s, there can only be words in this class.
If we balance the two cases by letting be the largest integer such that , then we have and
as stated. ∎
4 Exact formulas for special classes of prefix normal words
4.0.1 Words with fixed density.
We formulate an equivalent definition of the prefix normal property that will be useful in the enumeration of prefix normal words. Let be a prefix normal word of density . Denote by the distances between consecutive occurrences of in , and set so that holds. We can thus write . For , we have , , , and . The prefix normal property is equivalent to requiring that for all , one of the shortest substrings containing exactly ones is a prefix. This gives us the following lemma.
The binary word is prefix normal if and only if the following inequalities hold:
For , we have the generating functions :
Similar formulas can be derived for for small values of . Unfortunately, no clear pattern is visible for that we could use for calculating .
4.0.2 Words with a fixed prefix.
We now fix a prefix and give enumeration results on prefix normal words with prefix . Our first result indicates that we have to consider each separately.
If is a binary word, let , and . Let , and .
Let be both prefix normal. If then .
We were unable to prove that the growth rates of these two extension languages also differ.
Let be both prefix normal. If then the infinite sequences and are different.
The values seem hard to analyze. We give exact formulas for a few special cases of interest. Using Lemma 2, it is possible to give formulas similar to those in Lemma 3 for for fixed and . We only mention one such result.
For we have .
Let be an arbitrary prefix normal word of length and density with as its first symbol. Insert a before each subsequent occurrence of . It is easy to see that this operation creates a bijection between the two sets that we want to enumerate. ∎
The following lemma lists exact values for for some infinite families of words .
Let denote the th Fibonacci number: and . Then for all values of where the exponents are nonnegative, we have the following formulas:
For , , and , it is easy to count those extensions that fail to give prefix normal words. Similarly, for , and , counting the extensions that give prefix normal words gives the results in a straightforward way.
Let be even. For , note that is prefix normal if and only if avoids . The number of such words is known to equal . For odd, the argument is similar. ∎
5 Experimental results about prefix normal words
We consider extensions of prefix normal words by a single symbol to the right. It turns out that this has implications for the enumeration of prefix normal words.
We call a prefix normal word extension-critical if is not prefix normal. Let denote the number of extension-critical words in .
For we have
From this it follows that
From Theorem 3.1 we have:
For going to infinity, .
We conjecture that in fact the ratio of extension-critical words converges to . We study the behavior of for . The left plot in Fig. 2 shows the ratio of extension-critical words for . These data support the conjecture that the ratio tends to . Interestingly, the values decrease monotonically for both odd and even values, but we have for even . We were unable to find an explanation for this.
The right plot in Fig. 2 shows the ratio of extension-critical words multiplied by . Apart from a few initial data points, the values for even increase monotonically and the values for odd decrease monotonically, and the values for odd stay above those for even .
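The quantities behind these plots are straightforward to reproduce for small lengths; this brute-force Python sketch (ours; exponential, so only feasible for small n) computes the ratio of extension-critical words:

```python
from itertools import product

def is_prefix_normal(w: str) -> bool:
    """Quadratic brute-force test of the prefix normal property."""
    n = len(w)
    return all(
        max(w[i:i + k].count("1") for i in range(n - k + 1)) <= w[:k].count("1")
        for k in range(1, n + 1)
    )

def extension_critical_ratio(n: int) -> float:
    """Fraction of prefix normal words w of length n such that
    appending a 1 destroys the prefix normal property."""
    pn = [w for w in ("".join(b) for b in product("01", repeat=n))
          if is_prefix_normal(w)]
    critical = sum(not is_prefix_normal(w + "1") for w in pn)
    return critical / len(pn)
```

For example, among the five prefix normal words of length 3, exactly two (000 and 101) are extension-critical.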
Based on empirical evidence, we conjecture the following:
Note that the second estimate follows from the first one by (3).
6 Prefix Normal Games
Variant 1: Prefix normal game starting from empty positions. See Introduction.
For Bob has a winning strategy in the game starting from empty positions.
Variant 2: Prefix normal game with blocks. The game is played as follows. Now a block length of is also specified, and we require that divides . The first symbols are set to before the game starts (in order to give Alice a fair chance). Divide the remaining empty positions into blocks of length . Then Bob starts by picking a block with empty positions, and setting half of the positions of the block arbitrarily. Alice moves next and she sets the remaining positions in the same block as she wants. Now this block is completely filled. Then Bob picks another block, fills in half of it, etc. Iterate this process until every position is filled in.
Alice has a winning strategy in the game with blocks, for any .
Alice can always achieve that the current block contains exactly and s. Now consider a substring of length of the word that is obtained in the end. We have to show that the prefix of the same length has at least as many s. Clearly, only has to be considered, and we can also assume that starts after position . The substring contains some -blocks in full, and some others partially. Let , then , while the number of s in the prefix of length is at least , as claimed. ∎
As a corollary, we can prove the lower bound in Theorem 3.1.
Proof (of Theorem 3.1). There are at least as many prefix normal words of length as there are distinct words resulting after a game with blocks that Alice has won using the above strategy. Note that with this strategy, each block has exactly many s and Bob is free to choose their positions within the block. Moreover, for different choices of -positions by Bob, the resulting words will be different. So overall, Bob can achieve at least different outcomes. If we set , and note that for not dividing , we can use , then we obtain: and the statement follows. ∎
7 Construction and testing algorithms
In this section, for strings , we use the notation , with and . Note that this notation is unique. We call the critical prefix of .
7.1 A mechanical algorithm for computing the prefix normal forms
We now present a mechanical algorithm for computing the prefix normal form of a word . It uses a new algorithmic technique we refer to as the sandy beach technique, which we think will be useful for many other similar problems.
First observe that if you draw your word as in Fig. 1, then the Parikh set of will be the region spanned by drawing all the suffixes of starting from the origin. As we know, the prefix normal forms of will be the upper and the lower contour of the Parikh set, respectively. This leads to the following algorithm, which we can implement on any sandy beach—for example, Lipari’s Canneto (Fig. 3).
Take a folding ruler (see Fig. 3) and fold it in the form of your word. Now designate an origin in the sand. Put the folding ruler in the sand so that its beginning coincides with the origin. Next, move it backwards in the sand such that the position at the beginning of the -length suffix coincides with the origin; then with the next shorter suffix and so on, until the right end of the folding ruler reaches the origin. The traced area to the right of the origin is the Parikh set of , and its top and bottom boundaries, the prefix normal forms of (that you can save by taking a photo).
Analysis: The algorithm requires a quadratic amount of sand, but can outperform existing ones in running time if implemented by a very fast person.
7.2 Testing algorithm
It can be tested easily in time if a word is prefix normal, by computing its -function and comparing it to its prefixes; several other quadratic time tests were presented in . Currently, the fastest algorithms for computing run in worst-case time (references in the Introduction). Here we present another algorithm which, although quadratic in the worst case, we believe could well outperform other algorithms when iterated on prefixes of increasing length.
Given a word of length and density , . Since the cases are trivial, we assume . Notice that, then, in order for to be prefix normal, must hold. Now build a sequence of words , where and , in the following way: for every , is obtained from by swapping the positions and , where is the rightmost mismatch between and . So for example, if , we have the following sequence of words: , , , .
The following lemma follows straightforwardly from the results of :
Given with , and the sequence , we have that is prefix normal if and only if every is.
Moreover, as was shown there, it can be checked efficiently whether these strings are prefix normal. We summarize in the following lemma, and give a proof sketch and an example.
Lemma 12 (from )
Given a prefix normal word . Let , then it can be decided in linear time whether is prefix normal.
We will give an intuition via a picture, see Fig. 4. If is not prefix normal, then there must be a and a substring of length s.t. has more s than the prefix of length . It can be shown that it suffices to check this for one value of only, namely for , the critical prefix length of . The number of s in this prefix is . Now if such a exists, then it is either a substring of , in which case ; or it is a substring which contains the position of the newly swapped (both in grey in the third line). This latter case can be checked by computing the number of s in the prefix of the appropriate length of (in slightly darker grey) and checking whether it is greater than .
Thus, for , we test if is prefix normal. If at some point, we receive a negative answer, then the test returns NO, otherwise it returns YES. Additional data structures for the algorithm are the -function, which is updated to the current suffix following the critical prefix, up to the length of the next critical prefix (in linear time); and a variable containing the number of s in the appropriate length prefix of .
Example: We test whether the word is prefix normal.
At this point we have and therefore, we stop. Indeed, we can see that the next word to be generated is not prefix normal, since it has a substring of length with ones, but the prefix of length has only ones.
Analysis: The running time of the algorithm is in the worst case, where the are the positions of the s in , so in the worst case quadratic.
Iterating version. The algorithm tests a condition on the suffixes starting at the s, in increasing order of length, and compares them to a prefix where the remaining s but one are in a block at the beginning. This implies that for some which are not prefix normal, e.g. , the algorithm will stop very late, even though it is easy to see that the word is not prefix normal. This problem can be eliminated by running some linear time checks on the word first; the power of this approach will be demonstrated in the next section.
Since we know that a word is prefix normal iff every prefix of is, we have that a word which is not prefix normal has a shortest non-prefix-normal prefix. We therefore adapt the algorithm in order to test the prefix normality on the prefixes of of length powers of , in increasing order. In the worst case, we apply the algorithm times. Since the test on the prefix of length takes time, we have an overall worst case running time, so no worse than the original algorithm.
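A minimal sketch of this iterating scheme (ours), with the quadratic test used as a black box on prefixes of doubling length:

```python
def is_prefix_normal(w: str) -> bool:
    """Quadratic reference test, used here as a black box."""
    n = len(w)
    return all(
        max(w[i:i + k].count("1") for i in range(n - k + 1)) <= w[:k].count("1")
        for k in range(1, n + 1)
    )

def member_iterated(w: str) -> bool:
    """Test prefixes of length 1, 2, 4, ... in increasing order.
    Every prefix of a prefix normal word is prefix normal, so a word
    whose shortest non-prefix-normal prefix is short is rejected early."""
    k = 1
    while k < len(w):
        if not is_prefix_normal(w[:k]):
            return False
        k *= 2
    return is_prefix_normal(w)
```

The answer is always the same as that of the plain test; only the order of work changes.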
We believe that our algorithm will perform well on strings which are “close to prefix normal”, in the sense that they have long prefix normal prefixes or have passed the filters; i.e., that its expected running time will be strongly subquadratic, or even linear, on these strings.
7.3 Membership testing with linear time filters
In this section, we provide a two-phase membership tester for prefix normal words. Experimental evidence indicates that on average its running time is .
Suppose there is an test that can be used to reject of the binary strings outright (Phase I). For the remaining strings, apply the worst case algorithm (Phase II). This gives an -amortized time algorithm when taken over all strings. For such a two-phase approach, let denote the strings not rejected by the first phase. We are interested in the ratio As grows, if this ratio appears to be bounded by a constant, then we conjecture that such a membership tester runs in average case time.
First we try a trivial test: a string cannot be prefix normal if its longest run of 1s does not occur at the prefix. Applying this test as the first phase, the resulting ratios for some increasing values of are given in Table 1(a). Since the ratios increase as increases, we require a more advanced rejection test.
The next attempt uses a more compact run-length representation for . Let be represented by a series of blocks, which are maximal substrings of the form . Each block is composed of two integers representing the number of 1s and 0s, respectively. For example, the string 11100101011100110 can be represented by (3,2)(1,1)(1,1)(3,2)(2,1). Such a representation can easily be found in time. A word will not be a prefix normal word if it contains a substring of the form such that and (the substring is no longer than the critical prefix, yet has more 1s). Thus, a word will not be prefix normal if, for some :
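The two rejection tests can be sketched as follows (our Python reconstruction of the conditions described above, with r1 and s1 denoting the lengths of the leading runs of 1s and 0s; both tests are necessary conditions only, so they may reject but never accept):

```python
import re

def blocks(w: str):
    """Run-length blocks (r, s) for w = 1^r1 0^s1 1^r2 0^s2 ...
    (r1 may be 0 if w starts with 0; the last s may be 0)."""
    return [(len(m.group(1)), len(m.group(2)))
            for m in re.finditer(r"(1*)(0*)", w) if m.group(0)]

def passes_filters(w: str) -> bool:
    """Return False only if w is certainly not prefix normal."""
    b = blocks(w)
    if not b or b[0][0] == 0:
        return "1" not in w  # any 1 after a leading 0 breaks normality
    r1, s1 = b[0]
    # Test 1: the leading run of 1s must be a longest run of 1s.
    if max(r for r, _ in b) > r1:
        return False
    # Test 2: a substring 1^ri 0^si 1^rj (consecutive blocks) that is
    # no longer than the critical prefix 1^r1 0^s1 yet has more 1s.
    for (ri, si), (rj, _) in zip(b, b[1:]):
        if ri + si + rj <= r1 + s1 and ri + rj > r1:
            return False
    return True
```

For example, 110001011 passes the first test but is rejected by the second: its substring 1011 (= 1^1 0^1 1^2) is no longer than the critical prefix 11000 yet contains three 1s against its two.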
By applying this additional test in our first phase, we obtain algorithm MemberPN(), consisting of the two rejection tests, followed by any simple quadratic time algorithm.
The ratios that result from this algorithm are given in Table 1(b). Since the ratios are decreasing as increases, we make the following conjecture.
The membership tester MemberPN() for prefix normal words runs in average case -time.
We note that there are several other trivial rejection tests that run in time; however, these two were sufficient to obtain our desired experimental results.
Acknowledgements. We thank Ferdinando Cicalese who pointed us to  and thus contributed to the fun part of our paper.
-  A. Amir, A. Apostolico, G. M. Landau, and G. Satta. Efficient text fingerprinting via Parikh mapping. J. Discrete Algorithms, 1(5-6):409–421, 2003.
-  G. Badkobeh, G. Fici, S. Kroon, and Zs. Lipták. Binary jumbled string matching for highly run-length compressible texts. Inf. Process. Lett., 113(17):604–608, 2013.
-  G. Benson. Composition alignment. In Proc. of the 3rd International Workshop on Algorithms in Bioinformatics (WABI’03), pages 447–461, 2003.
-  S. Böcker. Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry. Bioinformatics, 23(2):5–12, 2007.
-  S. Böcker, K. Jahn, J. Mixtacki, and J. Stoye. Computation of median gene clusters. In Proc. of the Twelfth Annual International Conference on Computational Molecular Biology (RECOMB 2008), pages 331–345, 2008. LNBI 4955.
-  M. Brooks and G. Wilder. Young Frankenstein. http://www.imdb.com/title/tt0072431/quotes, http://www.youtube.com/watch?v=yH97lImrr0Q, 1974.
-  P. Burcsi, F. Cicalese, G. Fici, and Zs. Lipták. On Table Arrangements, Scrabble Freaks, and Jumbled Pattern Matching. In Proc. of the 5th International Conference on Fun with Algorithms (FUN 2010), volume 6099 of LNCS, pages 89–101, 2010.
-  P. Burcsi, G. Fici, Zs. Lipták, F. Ruskey, and J. Sawada. On combinatorial generation of prefix normal words. In Proc. 25th Ann. Symp. on Comb. Pattern Matching (CPM 2014), volume 8486 of LNCS, pages 60–69, 2014.
-  A. Butman, R. Eres, and G. M. Landau. Scaled and permuted string matching. Inf. Process. Lett., 92(6):293–297, 2004.
-  F. Cicalese, G. Fici, and Zs. Lipták. Searching for jumbled patterns in strings. In Proc. of the Prague Stringology Conference 2009 (PSC 2009), pages 105–117. Czech Technical University in Prague, 2009.
-  F. Cicalese, T. Gagie, E. Giaquinta, E. S. Laber, Zs. Lipták, R. Rizzi, and A. I. Tomescu. Indexes for jumbled pattern matching in strings, trees and graphs. In Proc. of the 20th String Processing and Information Retrieval Symposium (SPIRE 2013), volume 8214 of LNCS, pages 56–63, 2013.
-  F. Cicalese, E. S. Laber, O. Weimann, and R. Yuster. Near linear time construction of an approximate index for all maximum consecutive sub-sums of a sequence. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012), volume 7354 of LNCS, pages 149–158, 2012.
-  K. Dührkop, M. Ludwig, M. Meusel, and S. Böcker. Faster mass decomposition. In WABI, pages 45–58, 2013.
-  G. Fici and Zs. Lipták. On prefix normal words. In Proc. of the 15th Intern. Conf. on Developments in Language Theory (DLT 2011), volume 6795 of LNCS, pages 228–238. Springer, 2011.
-  T. Gagie, D. Hermelin, G. M. Landau, and O. Weimann. Binary jumbled pattern matching on trees and tree-like structures. In Proc. of the 21st Annual European Symposium on Algorithm (ESA 2013), pages 517–528, 2013.
-  E. Giaquinta and Sz. Grabowski. New algorithms for binary jumbled pattern matching. Inf. Process. Lett., 113(14-16):538–542, 2013.
-  D. Hermelin, G. M. Landau, Y. Rabinovich, and O. Weimann. Binary jumbled pattern matching via all-pairs shortest paths. Arxiv: 1401.2065v3, 2014.
-  T. Kociumaka, J. Radoszewski, and W. Rytter. Efficient indexes for jumbled pattern matching with constant-sized alphabet. In Proc. of the 21st Annual European Symposium on Algorithm (ESA 2013), pages 625–636, 2013.
-  L.-K. Lee, M. Lewenstein, and Q. Zhang. Parikh matching in the streaming model. In Proc. of 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012, volume 7608 of Lecture Notes in Computer Science, pages 336–341. Springer, 2012.
-  T. M. Moosa and M. S. Rahman. Indexing permutations for binary strings. Inf. Process. Lett., 110:795–798, 2010.
-  T. M. Moosa and M. S. Rahman. Sub-quadratic time and linear space data structures for permutation matching in binary strings. J. Discrete Algorithms, 10:5–9, 2012.
-  J. I. Munro. Tables. In Proc. of Foundations of Software Technology and Theoretical Computer Science (FSTTCS’96), pages 37–42, 1996.
-  L. Parida. Gapped permutation patterns for comparative genomics. In Proc. of the 6th International Workshop on Algorithms in Bioinformatics, (WABI 2006), pages 376–387, 2006.
-  F. Ruskey, J. Sawada, and A. Williams. Binary bubble languages and cool-lex order. J. Comb. Theory, Ser. A, 119(1):155–169, 2012.
-  N. J. A. Sloane. The On-Line Encyclopedia of Integer Sequences. Available electronically at http://oeis.org. Sequence A194850.
-  A. M. Williams. Shift Gray Codes. PhD thesis, University of Victoria, Canada, 2009.