Data Structure Lower Bounds on Random Access to Grammar-Compressed Strings
In this paper we investigate the problem of building a static data structure that represents a string using space close to its compressed size, and allows fast access to individual characters of . This type of structure was investigated in the recent paper of Bille et al. . Let be the size of a context-free grammar that derives a unique string of length . (Note that might be exponential in .) Bille et al. showed a data structure that uses space and supports queries for the -th character of in time . Their data structure works on a word RAM with a word size of bits.
Here we prove that for such data structures, if the space is , then the query time must be at least where is the space used, for any constant . As a function of , our lower bound is . Our proof holds in the cell-probe model with a word size of bits, so in particular it holds in the word RAM model. We show that no lower bound significantly better than can be achieved in the cell-probe model, since there is a data structure in the cell-probe model that uses space and achieves query time. The “bad” setting of parameters occurs roughly when . We also prove a lower bound for the case of not-as-compressible strings, where, say, . For this case, we prove that if the space is , then the query time must be at least .
The proof works by reduction to communication complexity, namely to the LSD (Lopsided Set Disjointness) problem, recently employed by Pǎtraşcu and others. We prove lower bounds also for the case of LZ-compression and Burrows-Wheeler (BWT) compression. All of our lower bounds hold even when the strings are over an alphabet of size and hold even for randomized data structures with 2-sided error.
In many modern databases, strings are stored in compressed form. Many compression schemes are grammar-based, in particular Lempel-Ziv [8, 14, 15] and its variants, as well as Run-Length Encoding. Another large family of text compressors consists of BWT-based (Burrows-Wheeler Transform) compressors, such as the one used by the software bzip2.
A natural desire is to store a text using space close to its compressed size, but to still allow fast access to individual characters: can we do something faster than simply extracting the whole text each time we need to access a character? This question was recently answered in the affirmative by Bille et al.  and by Claude and Navarro . These two works investigate the problem of storing a string that can be represented by a small CFG (context-free grammar) of size , while allowing some basic stringology operations, in particular random access to a character in the text. The data structure of Bille et al. [3, Theorem 1] stores the text in space linear in , while allowing access to an individual character in time , where is the text’s uncompressed size. (The result of Bille et al. also allows other query operations such as pattern matching; we do not discuss those in this paper.) But is that the best upper bound possible?
In this paper we show a lower bound on the query time whenever the space used by the data structure is , showing that the result of Bille et al. is close to optimal. Our lower bounds are proved in the cell-probe model of Yao , with word size , therefore they in particular hold for the model studied by Bille et al. , since the cell-probe model is strictly stronger than the RAM model. Our lower bound is proved by a reduction to Lopsided Set Disjointness (LSD), a problem for which Pǎtraşcu has recently proved an essentially-tight randomized lower bound . The idea is to prove that grammars are rich enough to effectively “simulate” a disjointness query: our class of grammars, presented in Section 3.1, might be of independent interest as a class of “hard” grammars for other purposes as well.
In terms of , our lower bound is . The results of Bille et al. imply an upper bound of on the query time, since , therefore in terms of there is a curious quadratic gap between our lower bound and Bille et al.’s upper bound. We show that this gap can be closed by giving a better data structure: we show a data structure which takes space and has query time , showing that no significantly better lower bound is possible. This data structure, however, comes with a big caveat – it runs in the highly-unrealistic cell-probe model, thus serving more as an impossibility proof for lower bounds than as a reasonable upper bound. The question remains open of whether such a data structure exists in the more realistic word RAM model.
Our lower bound holds for a particular, “worst-case”, dependence of on . Namely, is roughly . It might also be interesting to explicitly limit the range of allowed parameters to other regimes, for example to non-highly-compressible text; in such a regime it might be that . The above result does not imply any lower bound for this case. Thus, we show in another result that for any data structure in that regime, if the space is , then the query time must be . This lower bound holds, again, in the cell-probe model with words of size bits, and is proved by a reduction to two-dimensional range counting (which, once again, was lower bounded by a reduction to LSD ). To this end, we prove a new lower bound for two-dimensional range counting on grid, which was not previously known, and is of independent interest.
In this paper we denote . All logarithms are in base unless explicitly stated otherwise.
Our lower bounds are proved in Yao’s cell-probe model . In the cell-probe model, the memory is seen as an array of cells, where each cell consists of bits each. The query time is measured as the number of cells read from memory, and all computations are free. This model is strictly stronger than the word RAM, since in the word RAM the operations allowed on words are restricted, while in the cell-probe model we only measure the number of cells accessed. The cell-probe model is widely used in proving data structure lower bounds, especially by reduction to communication complexity problems . In this paper we prove our result by a reduction to the Blocked-LSD problem introduced by Pǎtraşcu .
An SLP (straight line program) is a collection of derivation rules, defining the symbols . Each rule is either of the form , i.e. is a terminal, which takes the value of a character from the underlying alphabet, or of the form , where and , i.e. and were already defined, and we define the nonterminal symbol to be their concatenation. The symbol is the start symbol. To derive the string we start from and follow the derivation rules until we get a sequence of characters from the alphabet. The length of the derived string is at most . W.l.o.g. we assume it is at least . As in Bille et al. , we also assume w.l.o.g. that the grammars are in fact SLPs, so on the right-hand side of each grammar rule there are either exactly two variables or one terminal symbol. In this paper SLP, CFG and grammar all mean the same thing.
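To make the definition concrete, the following is a minimal Python sketch of an SLP and of the naive random-access procedure that walks down the derivation from the start symbol. The rule encoding (a list where each entry is either a character or a pair of earlier indices) and the names `derived_lengths` and `access` are illustrative assumptions, not notation from this paper. The query time of this naive walk is proportional to the depth of the grammar, so it is only a baseline and not the data structure of Bille et al.

```python
def derived_lengths(rules):
    """Return lengths[i] = length of the string derived from symbol i.

    Each rule is either a one-character string (terminal) or a pair
    (left, right) of indices of symbols defined earlier in the list.
    """
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, symbol, pos):
    """Return the character at 0-based position pos of the string derived
    from symbol, descending one rule per step without expanding the string."""
    while True:
        rule = rules[symbol]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if pos < lengths[left]:
            symbol = left
        else:
            pos -= lengths[left]
            symbol = right


# Hypothetical example grammar: symbol 4 is the start symbol and derives "01011".
rules = ["0", "1", (0, 1), (2, 2), (3, 1)]
lengths = derived_lengths(rules)
derived = "".join(access(rules, lengths, 4, i) for i in range(lengths[4]))
```

Because only the lengths are stored, the same code works when the derived string is exponentially longer than the grammar, for instance when each rule doubles the previous symbol.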
The grammar random access problem is the following problem.
Definition 1 (Grammar Random Access Problem)
For a CFG of size representing a binary string of length , the problem is to build a data structure to support the following query: given , return the -th character (bit) in the string.
We study two other data structure problems, which are closely related to their communication-complexity counterparts.
Definition 2 (Set Disjointness, )
For a set , the problem is to build a data structure to support the following query: given a set , answer whether .
Given a universe , a set is called blocked with cardinality if when we divide the universe into equal-sized consecutive blocks, contains exactly one element from each of the blocks while could be arbitrary.
Definition 3 (Blocked Lopsided Set Disjointness, )
For a set , the problem is to build a data structure to support the following query: given a blocked set with cardinality , answer whether .
To prove lower bounds for near-linear-space data structures, we also need a variant of the range counting problem.
Definition 4 (Range Counting on Grid)
The range counting problem is a static data structure problem. We need to preprocess a set of points on a grid. A query asks to count the number of points in a dominance rectangle , modulo .
When , the above problem has been investigated under the name “range counting” in Pǎtraşcu . Note that the above problem is “easier” than the classical 2D range-counting problem, since it is a dominance problem, it is on the grid, and it is modulo 2. However, the (tight) lower bound that is known for the general problem in Pǎtraşcu , is proven for the problem we define. In this paper, we extend this result a little bit to give a lower bound on the universe of for any constant .
3 Lower Bound for Grammar Random Access
In this section we prove the main lower bound for grammar random access. In Section 3.1 we show the main reduction from SD and BLSD. In Section 3.2 we prove lower bounds for SD and BLSD, based on reductions to communication complexity (these are implicit in the work of Pǎtraşcu ). Finally, in Section 3.3 we tie these together to get our lower bounds.
3.1 Reduction from SD and LSD
In this section we show how to reduce SD and BLSD to the grammar random access problem, by considering a particular type of grammar. The reductions tie the parameters and to the parameters and of BLSD (or just to the parameter of SD). In Section 3.3 we show how to choose the relation between the various parameters in order to get our lower bounds. We remark that the particular multiplicative constants in the lemmas below will not matter, but we give them nonetheless, for concreteness.
These reductions might be confusing for the reader, but they are in fact almost entirely tautological. They just follow from the fact that the communication matrix of SD is a tensor product of the 2 by 2 communication matrices for the coordinates, i.e., it is just a -fold tensor product of the matrix . For BLSD, the communication matrix is the -fold tensor product of the communication matrix for each block (for example, for this matrix is ). We do not formulate our arguments in the language of communication matrices and tensor products, since this would hide what is really going on. To aid the reader, we give an example after each of the two constructions.
Lemma 1 (Reduction from )
For any set , there is a grammar of size deriving a binary string of length such that for any set , it holds that iff .
Note that in this lemma we have indexed the string by sets: there are possible sets , and the length of the string is also – each set serves as an index of a unique character. The indexing is done in lexicographic order: the set is identified with its characteristic vector, i.e., the vector in whose -th coordinate is ‘1’ if , and ‘0’ otherwise, and the sets are ordered according to lexicographic order of their characteristic vectors. For example, here is the ordering for the case : .
We now show how to build the grammar . The grammar has symbols for the strings , i.e., all strings consisting solely of the character ‘0’, of lengths which are all powers of 2 up to . Then, the grammar has additional symbols . The terminal is equal to the character . For any , we set to be equal to if , and to be equal to if . The start symbol of the grammar is .
We claim that the string derived by this grammar has the property that iff . This is easy to prove by induction on , where the induction claim is that for any , is the string that corresponds to the set over the universe .
Consider the universe . Let . The string is . The locations of the 1’s correspond exactly to the sets that don’t intersect , namely to the sets , , and , respectively.
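The construction of Lemma 1 can be checked mechanically on small universes. The sketch below is an illustration under stated assumptions: elements are numbered from 1, element i corresponds to bit i-1 of the index (so the last element is the most significant coordinate), and the names `sd_string` and `index_of` are hypothetical. It expands the grammar into an explicit string, which takes exponential space and is only meant for tiny examples; the grammar itself uses two new rules per element.

```python
def sd_string(n, S):
    """Expand the grammar for a set S over the universe {1, ..., n}.

    Invariant before step i: `zeros` is the all-zero string of length
    2**(i-1) and `current` is the string for S restricted to {1, ..., i-1}.
    """
    zeros = "0"
    current = "1"
    for i in range(1, n + 1):
        # If i is in S, every set containing i intersects S: append zeros.
        # Otherwise containing i changes nothing: append a copy of `current`.
        current = current + (zeros if i in S else current)
        zeros = zeros + zeros
    return current


def index_of(T):
    """Position of the set T in the string (assumed convention: element i
    contributes 2**(i-1) to the index)."""
    return sum(1 << (i - 1) for i in T)


# Hypothetical example: universe {1, 2, 3}, S = {2}.
example = sd_string(3, {2})
```

In this example the 1’s sit at the indices of the sets {}, {1}, {3} and {1, 3}, which are exactly the subsets of {1, 2, 3} that avoid the element 2.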
We now show the reduction from blocked LSD. It follows along the same general idea, but the grammar is slightly more complicated.
Lemma 2 (Reduction from )
For any set , there is a grammar of size deriving a binary string of length such that for any blocked set of cardinality , it holds that iff .
Recall that by a “blocked set of cardinality ” we mean a set such that the universe is divided into equal-sized blocks, and contains exactly one element from each of these blocks.
Note that in this lemma we have again indexed the string by sets: there are possible sets and the length of the string is . The indexing is done in lexicographic order, this time identifying a set with a length- vector whose -th coordinate is chosen according to which element it contains in block , and the sets are ordered according to lexicographic order of their characteristic vectors. For example, here is the ordering for the case : .
The construction in this reduction is similar to that in the case of , but instead of working element by element, we work block by block.
We now show how to build the grammar . The grammar has symbols for the strings , i.e., all strings consisting solely of the character ‘0’, of lengths which are all powers of up to . We cannot simply obtain the symbols directly from each other: e.g., to obtain from , we need to concatenate with itself times. Thus we use rules to derive all of these symbols. (In fact, rules can suffice but this does not matter).
Then, beyond these, the grammar has additional symbols , one for each block. The terminal is equal to the character . For any , is constructed from according to which elements of the -th block are in : we set to be a concatenation of symbols, each of which is either or . In particular, is the concatenation of , where is equal to if the -th element of the -th block is not in , and it is equal to if the -th element of the -th block is in . To construct these symbols we need at most rules, because we need concatenation operations to derive from . (Note that here we cannot get down to rules – seem to be necessary.) The start symbol of the grammar is .
We claim that the string produced by this grammar has the property that iff . This is easy to prove by induction on , where the induction claim is that for any , is the string that corresponds to the set over the universe .
Consider the values and . Let . The string is . The locations of the 1’s correspond exactly to the blocked sets that don’t intersect , namely to the sets , , and , respectively. A brief illustration for this example is in Figure 1.
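The block-by-block construction of Lemma 2 can be sketched in the same way. The encoding below is an assumption made for illustration: a universe element is written as a pair (block, offset) with blocks numbered from 1 and offsets from 0, a blocked set is a tuple giving the chosen offset in each block, and block i is the digit of weight `block_size**(i-1)` in the index. The function names are hypothetical, and the explicit expansion is exponential in the number of blocks, so it is suitable only for small checks.

```python
from itertools import product


def blsd_string(num_blocks, block_size, S):
    """Expand the grammar for S, a set of (block, offset) pairs.

    Invariant before block i: `zeros` is the all-zero string of length
    block_size**(i-1) and `current` is the string for the first i-1 blocks.
    """
    zeros = "0"
    current = "1"
    for i in range(1, num_blocks + 1):
        # One part per element of block i: zeros if that element is in S,
        # otherwise a copy of the string for the previous blocks.
        parts = [zeros if (i, j) in S else current for j in range(block_size)]
        current = "".join(parts)
        zeros = zeros * block_size
    return current


def blocked_index(choice, block_size):
    """Index of the blocked set whose offset in block i is choice[i-1]."""
    return sum(c * block_size ** k for k, c in enumerate(choice))


def is_disjoint(choice, S):
    """True iff the blocked set given by `choice` avoids every element of S."""
    return all((i + 1, c) not in S for i, c in enumerate(choice))


# Hypothetical example: 2 blocks of size 3, S = {(1, 0), (2, 2)}.
example = blsd_string(2, 3, {(1, 0), (2, 2)})
all_choices = list(product(range(3), repeat=2))
```

Each block contributes `block_size` concatenations to the grammar, which matches the observation above that the per-block cost cannot obviously be reduced to a logarithmic number of rules.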
3.2 Lower bounds for SD and BLSD
In this subsection we show lower bounds for SD and BLSD that are implicit in the work of Pǎtraşcu . Recall the notations from Section 2: in particular, in all of the bounds, , , and denote the word size (measured in bits), the size of the data structure (measured in words) and the query time (measured in number of accesses to words), respectively.
For any 2-sided-error data structure for , .
Note that this theorem does not give strong bounds when , but it is meaningful for bit-probe () bound and a warm-up for the reader.
Let be any small constant. For any 2-sided-error data structure for ,
The proofs follow by standard reductions from data structure to communication complexity, using known lower bounds for SD and BLSD (the latter is one of the main results in ).
We now cite the corresponding communication complexity lower bounds:
Consider the communication problem where Alice and Bob each receive a subset of , and they want to decide whether the sets are disjoint. Any randomized 2-sided-error protocol for this problem uses communication .
Lemma 4 (See , Lemma 3.1)
Let be any small constant. Consider the communication problem where Bob gets a subset of and Alice gets a blocked subset of of cardinality , and they want to decide whether the sets are disjoint. In any randomized 2-sided-error protocol for this problem, either Alice sends bits or Bob sends bits. (The -notation hides a multiplicative constant that depends on ).
The way to prove the data structure lower bounds from the communication lower bounds is by reductions to communication complexity: Alice and Bob execute a data structure query; Alice simulates the querier, and Bob simulates the data structure. Alice notifies Bob which cell she would like to access; Bob returns that cell, and they continue for rounds, which correspond to the probes. At the end of this process, Alice knows the answer to the query. Overall, Alice sends bits and Bob sends bits. The rest is calculations, which we include here for completeness:
We know that the players must send a total of bits, but the data structure implies a protocol where bits are communicated. Therefore so .
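The simulation argument can be illustrated with a small Python sketch that runs any cell-probe query algorithm as a two-party protocol and counts the bits exchanged. It assumes the standard accounting in which Alice pays the number of bits needed to address one of the cells for every probe and Bob pays one word for every probe; the function names and the binary-search example are hypothetical and serve only to exercise the counter.

```python
def run_as_protocol(cells, word_bits, query_algorithm):
    """Run query_algorithm(probe) and count the communication it induces.

    Alice (the querier) sends a cell address per probe; Bob (who holds the
    memory) answers with the content of that cell.
    """
    address_bits = max(1, (len(cells) - 1).bit_length())
    transcript = {"probes": 0, "alice_bits": 0, "bob_bits": 0}

    def probe(address):
        transcript["probes"] += 1
        transcript["alice_bits"] += address_bits  # Alice names the cell
        transcript["bob_bits"] += word_bits       # Bob returns one word
        return cells[address]

    answer = query_algorithm(probe)
    return answer, transcript


def make_membership_query(key, num_cells):
    """Toy query algorithm: binary search for key in a sorted array of cells."""
    def query(probe):
        lo, hi = 0, num_cells
        while lo < hi:
            mid = (lo + hi) // 2
            value = probe(mid)
            if value == key:
                return True
            if value < key:
                lo = mid + 1
            else:
                hi = mid
        return False
    return query


cells = [2, 3, 5, 7, 11, 13, 17, 19]
answer, transcript = run_as_protocol(cells, 8, make_membership_query(13, len(cells)))
```

A communication lower bound on either player's total then translates directly into a lower bound on the number of probes.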
3.3 Putting it Together
We now put the results of Section 3.1 and 3.2 together to get our lower bounds. Note that in all lower bounds below we freely set the relation of and in any way that gives the best lower bounds. Therefore, if one is interested in only a specific relation of and (say ) the lower bounds below are not guaranteed to hold. The typical “worst” dependence in our lower bounds (at least for the case where and ) is roughly .
For any 2-sided-error data structure for the grammar random access problem, . And in terms of , .
When setting and (polynomial space in the bit-probe model), we get that . And in terms of , .
Trivial, since and .
Assume . Let be any arbitrarily small constant. For any 2-sided-error data structure for the grammar random access problem, . And in terms of , .
When setting and (polynomial space in the cell-probe model with cells of size ), there is another constant such that we get that . And in terms of , .
The condition is a technical condition, which ensures that the value of we choose in the proof is at least . For one gets the best results just by reducing from SD, as in Theorem 3.3.
For the first part of the theorem, substitute , , into (1). For the second part of the theorem, substitute , and . And for the result, set .
4 Lower Bound for Less-Compressible Strings
4.1 The Range Counting Lower Bound
In the above reduction, the worst case came from strings that can be compressed superpolynomially. However, for many of the kinds of strings we expect to encounter in practice, superpolynomial compression is unrealistic. A more realistic range is polynomial compression or less. In this section we discuss the special case of strings of length . We show that for this class of strings, the Bille et al.  result is also (almost) tight by proving an lower bound on the query time, when the space used is . This is done by reduction from the range counting problem on a 2D (two-dimensional) grid. Leaving the proof to Appendix A, we have the following lower bound for the range counting problem (see Definition 4 for details).
Any data structure for the 2D range counting problem using space requires query time in the cell probe model with cell size .
Recall that the version of range counting we consider is actually dominance counting modulo 2 on the grid.
The main idea behind our reduction is to consider the length- binary string consisting of the answers to all possible range queries (in the natural order, i.e. row-by-row, and in each row from left to right); call this the answer string of the corresponding range counting instance. We prove that this string can be represented using a grammar of size . The reduction obviously follows, since a range query can then just be answered by asking for one bit of the compressed string.
For any range counting problem in 2D, the answer string can be represented by a CFG of size .
The idea behind the proof of the lemma is to simulate a sweep of the point-set from top to bottom by a dynamic one-dimensional range tree. The symbols of the CFG will correspond to the nodes of the tree. With each new point encountered, only new symbols have to be introduced.
Assume for simplicity that is a power of . It is easy to see that the answer string could be built by concatenating the answers in a row-wise order, just as illustrated in Figure 2.
We are going to build the string row by row. Think of a binary tree representing the CFG built for the first row of the input. The root of the tree derives the first row of the answer string, whose two children respectively represent the answer string for the left and the right half of the row. In this way the tree is built recursively. The leaves of the tree are terminal symbols in . Thus there are symbols in total for the whole tree. At the same time we also maintain the negations of the symbols in the tree, i.e., making a new symbol for each in the tree, where if is a terminal symbol, or if .
The next row in the answer string will be built by changing at most symbols in the old tree, where is the number of new points in the next row. The symbols for the new row are built by re-using most of the symbols in the old row, and introducing new symbols where needed. We process the new points one by one, and for each one, the modifications needed all lie in a path from a leaf to the root of the tree. Assuming the update is the path , the new tree will contain an update of . Also, all the right children of these nodes will be switched with their negations (this switching step does not actually require introducing any new symbols). An intuitive picture of the process is given in Figure 3.
It is easy to see that for each new point, additional symbols are created: of them are the new symbols (), and another of them are their negations (). In total, we use symbols to derive the whole answer string.
Using the above lemma, we obtain the lower bound for the grammar random access problem.
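The key fact behind the sweep is that each new point changes the current row of answers by negating a suffix of it, which is why keeping a negated copy of every symbol suffices. The sketch below checks this fact on a small instance. It is not the grammar construction itself, and it makes several assumptions for illustration: a square n-by-n grid with 0-based coordinates, a query (x, y) counting the points (px, py) with px <= x and py <= y, and rows processed in increasing y.

```python
def answer_rows_bruteforce(points, n):
    """Answer every dominance-parity query directly from the definition."""
    return [
        [sum(1 for (px, py) in points if px <= x and py <= y) % 2
         for x in range(n)]
        for y in range(n)
    ]


def answer_rows_sweep(points, n):
    """Build the rows one by one; each point negates a suffix of the row."""
    columns_in_row = {}
    for px, py in points:
        columns_in_row.setdefault(py, []).append(px)

    rows = []
    current = [0] * n
    for y in range(n):
        for px in columns_in_row.get(y, []):
            # Every query with x >= px gains one point, so its parity flips.
            current = current[:px] + [1 - bit for bit in current[px:]]
        rows.append(current)  # rows without new points reuse the previous row
    return rows


# Hypothetical instance on a 4-by-4 grid.
points = {(1, 0), (2, 2), (0, 3)}
brute = answer_rows_bruteforce(points, 4)
swept = answer_rows_sweep(points, 4)
```

In the grammar, the suffix negation is realized along a single leaf-to-root path of the tree, with the right siblings of the path replaced by their already available negations.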
Any data structure using space for the grammar random access problem requires query time.
For inputs of the range counting problem, we compress the answer string to a CFG of size according to Lemma 6. After that we build a data structure for the random access problem on this CFG using Lemma 7. For any query of the range counting problem, we simply pass the query result on the index on the answer string as an answer. By Bille et al.  this makes a data structure using space with query time for the range counting problem. According to Lemma 5 the lower bound for range counting is for space, thus .
Note that a natural attempt is to replace the 1D range tree that we used above by a 2D range tree and perform a similar sweep procedure, but this does not seem to work for building higher-dimensional answer strings.
In this section, we show that the upper bound in Bille et al.  is nearly optimal, for two reasons. First, it is clear that by Theorem 4.1, the upper bound in Lemma 7 is optimal, when the space used is .
Second, in the cell-probe model with words of size we also have the following lemma by Bille et al. .
There is a data structure for the grammar random access problem with space and time. This data structure works in the word RAM with words of size .
There is a data structure for the grammar random access problem with space and time.
This is a trivial bound. The number of bits to encode the grammar is since each rule needs bits. The cell size is , so in time the querier can just read all of the grammar. Since computation is free in the cell-probe model, the querier can get the answer immediately.
Assuming , there is a data structure in the cell-probe model with space and time .
6 Extensions and Variants
In this section we briefly discuss what the lower bound means for LZ-based compression, and how to extend the lower bound to BWT-based compression.
6.1 LZ-based Compression
Lemma 9 (Lemma 9 of )
The length of the LZ77 sequence for a string is a lower bound on the size of the smallest grammar for that string.
The basic idea of this lemma is to show that each rule in the grammar contributes only one entry to the LZ77 sequence. Since LZ77 compresses any string with a small grammar to at most the size of that grammar, it also compresses the string in Lemma 1 to at most that size. Thus the lower bound for the grammar random access problem also holds for LZ77.
The reader might also be curious about what happens in the LZ78  case. Unfortunately the lower bound does not hold for LZ78. This is because LZ78 is a “bad” compression scheme in the sense that even when the input consists entirely of ’s, LZ78 can only compress the string to length . But random access on an all- string trivially takes constant time with constant space. So we cannot obtain any lower bound for this case.
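The behaviour of LZ78 on a single-character input is easy to observe experimentally. The following sketch counts the phrases of a textbook LZ78 parse; the function name is hypothetical and the parser is a plain dictionary-based version rather than any particular implementation. On a string consisting of one repeated character the phrases have lengths 1, 2, 3, and so on, so k phrases cover k(k+1)/2 characters and the phrase count grows like the square root of the input length.

```python
def lz78_phrase_count(text):
    """Number of phrases in the LZ78 parse of text."""
    seen = {""}
    phrase = ""
    count = 0
    for ch in text:
        candidate = phrase + ch
        if candidate in seen:
            # Still a known phrase: keep extending it.
            phrase = candidate
        else:
            # Longest known phrase plus one new character ends a phrase.
            seen.add(candidate)
            count += 1
            phrase = ""
    if phrase:
        count += 1  # trailing phrase that was already in the dictionary
    return count


phrases_small = lz78_phrase_count("0" * 10)      # 1 + 2 + 3 + 4 = 10
phrases_large = lz78_phrase_count("0" * 5050)    # 1 + 2 + ... + 100 = 5050
```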
There are also lots of other variants of LZ-based compressions. As long as the compression is efficient like LZ77, we have the lower bound. Otherwise if it is like LZ78, we do not.
6.2 Lower Bound for BWT-based Compression Access
In the previous sections we discussed strings that can be compressed efficiently by grammar-based compression schemes. A natural question is whether the lower bound also holds for another compression approach, say, BWT-based compression. We answer this question positively here. We claim that, with a small modification, the “hard instance” used in the previous section can be compressed efficiently by BWT, so that our lower bound holds for BWT as well.
The BWT of a binary string could be obtained by the following process. We use to denote the string which is the concatenation of the substring ‘$’ where $ is the end-of-string symbol lexicographically smaller than and , and is the substring obtained by the first bits of . The BWT of the string is the string formed by the last characters of list for after sorting. This string is of length but it will have long “runs”, which means maximal consecutive ’s or ’s when omitting ‘$’. For example in Figure 4 there are runs ‘0’,‘111’ and ‘00’. We call the function defined in this process . And this function is invertible according to , that is, given , one can always recover in linear time.
However, BWT itself is not a compression algorithm. But there are several approaches to compress the text efficiently after BWT, e.g., [7, 9], first use MTF (move-to-front) encoding, and then arithmetic encoding. The compressed length is . For binary strings, there is a much easier way to bound the number of bits for storing the compressed string. We can just save the length and an extra bit indicating or for each run, and bits for storing the position of ‘$’. Thus the compressed length is .
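The transform and the run-based encoding can be sketched in a few lines of Python. This is a naive illustration, assuming the common formulation in which the rotations of the string followed by ‘$’ are sorted; it relies on ‘$’ sorting before ‘0’ and ‘1’ in ASCII, and the function names are hypothetical. The inverse shown here is the simple repeated-sorting method and is far from linear time; it is included only to demonstrate invertibility on small inputs.

```python
def bwt(s):
    """Last column of the sorted rotations of s + '$'."""
    text = s + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def runs(transformed):
    """Maximal runs of equal bits, ignoring '$', as (bit, length) pairs."""
    result = []
    for bit in transformed.replace("$", ""):
        if result and result[-1][0] == bit:
            result[-1] = (bit, result[-1][1] + 1)
        else:
            result.append((bit, 1))
    return result


def inverse_bwt(transformed):
    """Recover s by repeatedly prepending the last column and sorting."""
    size = len(transformed)
    table = [""] * size
    for _ in range(size):
        table = sorted(transformed[i] + table[i] for i in range(size))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("0101")
run_list = runs(transformed)
```

Storing one (bit, length) pair per run, plus the position of ‘$’, is the simple encoding whose size is bounded in the text above.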
For a binary string , if , then the BWT-based compressed representation of is less than bits.
In the appendix, we will prove the following lemma for a string which is quite similar to the string used in Lemma 2.
Lemma 11 (Sketched Version of Lemma 15)
There is a string constructed for blocked set and a mapping such that for any set , . And the length of is , while .
By using this lemma we prove the following main theorem for BWT, with details in appendix.
Consider random access to a bit in a binary string compressed by BWT-based methods with bits. Assume . Let be any arbitrarily small constant. For any 2-sided-error data structure, . And in terms of , . When setting and (polynomial in ), there is a constant such that we get . And in terms of , .
We thank Travis Gagie and Pawel Gawrychowski for helpful discussions.
- [1] L. Babai, P. Frankl, and J. Simon. Complexity classes in communication complexity theory. In FOCS, pages 337–347, 1985.
- [2] Z. Bar-Yossef, T.S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702–732, 2004.
- [3] P. Bille, G.M. Landau, R. Raman, S. Rao, K. Sadakane, and O. Weimann. Random access to grammar compressed strings. In SODA, 2011.
- [4] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. Digital, Systems Research Center, 1994.
- [5] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.
- [6] F. Claude and G. Navarro. Self-indexed text compression using straight-line programs. Mathematical Foundations of Computer Science 2009, pages 235–246, 2009.
- [7] S. Deorowicz. Second step algorithms in the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 32(2):99–111, 2002.
- [8] A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.
- [9] G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):430, 2001.
- [10] P.B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. In STOC, page 111. ACM, 1995.
- [11] M. Patrascu. Unifying the Landscape of Cell-Probe Lower Bounds. SIAM Journal on Computing, 40(3), 2011.
- [12] A.A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385–390, 1992.
- [13] A.C.C. Yao. Should tables be sorted? Journal of the ACM, 28(3):615–628, 1981.
- [14] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
- [15] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
Appendix A Range Counting Lower Bound
In this section we prove the lower bound for range counting (Lemma 5). The way of proving the lower bound for the range counting problem in Pǎtraşcu  is by reduction from the reachability problem on butterfly graphs. However, the version used in Pǎtraşcu’s paper can only be reduced to range counting on grid. In order to make the reduction work for grid, we need the following unbalanced butterfly graph for the reachability problem.
Definition 5 ((Unbalanced) Butterfly Graph)
A graph is called a butterfly graph iff there exist integers such that , , , and the vertices of can be labeled uniquely as where . And a directed edge iff labeled and labeled .
If we merge the vertices having different labels in the layer of the graph together, i.e., make the vertices labeled with and the same vertex, we call it an unbalanced butterfly graph. All the vertices on other layers can only have one unique label. So there are vertices in layer and vertices in other layers. For an example, the reader may refer to Figure 5.
Note here the condition is enforced to make sure that the number of different labels .
If , the unbalanced butterfly graph is just the butterfly graph. In Pǎtraşcu’s paper a relationship between reachability in the butterfly graph and the range counting problem is given [11, Section 2.1+Appendix A]. In fact, the proof can be extended to unbalanced butterfly graphs with minor modifications. The basic idea of the proof is to map an edge to a rectangle on the grid. This is because all the vertices in layer that lead to this edge are the vertices , and all the vertices in the last layer this edge leads to are the vertices , where means arbitrary values. A reachability query from layer to the last layer can then be translated to a stabbing query, which is in turn translated to a range counting query. We state the following lemma, leaving the proof to the reader.
A range counting data structure on a grid could be used to solve the reachability problem on unbalanced butterfly graphs with vertices in layer and vertices in the last layer with same space and query time. The reachability problem is to answer queries if there is a directed path from vertex to vertex where is a vertex in layer and is a vertex in the last layer.
By choosing , and , we can see that the unbalanced butterfly graph has vertices in layer and vertices in the last layer. Thus a lower bound for reachability on this graph implies a lower bound for range counting on the grid.
Lemma 13 ()
For the reachability problem on the butterfly graphs with , there exists some constant such that when , the query time .
And it is easy to observe that a reachability data structure for unbalanced butterfly graphs is also a reachability data structure for butterfly graphs.
A reachability data structure for unbalanced butterfly graphs could be used to solve reachability problem for butterfly graph using the same space and query time.
This is almost trivial to prove. For a query on vertices and in the butterfly graph, it will just map to the vertices with the same label on the corresponding unbalanced butterfly graph.
Finally, we prove Lemma 5.
Proof (Lemma 5)
We know that and for some constant . Since , we can choose so that and . This implies that . We can choose . Thus in the best case we have .
Appendix B Remaining Proofs for BWT-based Compressions
Here we bound the number of “runs” in (see Lemma 2 for the definition). We show that BWT-based compression schemes compress well the string derived from a variant of the string . As a result, by an argument similar to that of Lemma 2, randomly accessing a bit in a BWT-compressed string is also hard.
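To make the quantity being bounded concrete, the following minimal Python sketch computes the BWT of a short string by sorting its cyclic rotations and then counts the runs in the result, taking a run to be a maximal block of equal consecutive characters. The function names `bwt` and `count_runs` are illustrative and not from the paper. The use of cyclic rotations without an end-of-string sentinel is an assumption, and the convention behind Lemma 2 may differ. The quadratic-time construction is only meant for small examples.

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations.

    Assumption: no end-of-string sentinel is appended; the output is
    the last column of the lexicographically sorted rotation list.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal blocks of equal consecutive characters."""
    if not s:
        return 0
    # Every position where the character changes starts a new run.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

A string whose BWT has few runs is one that run-length-based BWT compressors handle well, which is why the appendix bounds this count.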
An important observation is that the string can also be derived by applying the following replacement rules to the string “1” sequentially for .
is a binary string. The reader can easily check that it defines the same string as in Lemma 2.
Let and let be the binary string obtained by applying the replacement rules for to . Here, applying the replacement rules means that we simply replace every in with and every with . Note that . We have the following lemma for a variant of .
For every set , let and . For , we apply the following replacement rule on to get ,
where the length of is , and is with replaced by and replaced by ; e.g., if then .
Let . Then the following properties of hold.
There is a function independent of such that for every blocked set , ;
Assuming this lemma holds, we first prove Theorem 6.1.
Proof (Theorem 6.1)
Now we prove Lemma 15.
Proof (Lemma 15)
First, the following observations are easy to verify.
Each starts with and ends with .
The number of consecutive ’s in is .
The number of consecutive ’s in is at most .
The length of is .
If starts with ’s and is not a multiple of or 1, then ends with .
If we represent a blocked set by the integer written in base , then for the integer in base we have . So this is the we want.
Second, we show that . If this is true, then , which is the third item we need to prove.
To upper bound by , we view the computation of the BWT of as first computing the BWT of and then inserting the remaining bits. More precisely, if we group the bits in into segments of length in from the start, then each segment is derived from a bit in . A simple observation is that, looking only at the start of each segment (), the last bits of the sorted list are the same as the BWT of . So there are runs in the string formed by the last bits of the sorted list.
So all the remaining runs come from the other parts of , say for . We consider two cases.
Case I: starts with ’s. We know that for any , ends in the same bit as . If we look at another string where , then by the following lemma the first of them differ.
The first bits in and () are always different for different if they do not start with more than ’s.
By the following lemma, the last character of depends only on the first bits.
If the first characters of and are the same, and they do not start with more than 0’s, then they end with the same character.
So if we group these into groups according to the first bits, then there are groups, and all strings in a group end in the same bit. Hence the number of runs added by inserting these is .
Case II: starts with ’s. Since , ends with . After sorting, all these starting with ’s and ’s will be inserted into and for some and , where one of them starts with ’s and the other starts with ’s. By the following lemma and the fact that all the strings inserted here come before the strings in Case I, at least one of them ends with , so the number of runs added is .
For all possible choices of , if there are ’s in the prefix of and ’s in the prefix of (), then at least one of them ends in .
Finally, we prove the remaining lemmas.
Proof (Lemma 16)
We argue by contradiction. Suppose there exist , , , such that and have a common prefix longer than and .
By assumption, the first bits of the two strings are the same. However, there are ’s at the start of the two strings, so segment 3 and segment 4 must correspond to two ’s in . Since starts and ends with , segment 5 must be as well. By the same argument, segment to segment are all . However, there are at most 4 consecutive ’s in , so this is impossible.
For all possible choices of and , there are only different kinds of prefixes of length in . This is because these bits are derived from at most bits in . The number of possible combinations of these bits is . Since has different choices, the number of possible prefixes is .
Proof (Lemma 17)
Suppose the first characters of and are the same, and they do not start with more than 4b 0’s.
Then by Lemma 16, we know .
It also follows easily that and are generated by the same character. That is, and , which are the (p+1)-th and (q+1)-th characters of , must be the same.
So the characters and are also the same; here is the last character of , and is the last character of
Proof (Lemma 18)
In , there are ’s in the prefix of and ’s in the prefix of . When , they must be derived from a in . However, and cannot hold simultaneously, so at least one of them must end with .
If , then starts with ’s. It must come from a in ; however, when we know that ends with .