A Lower Bound for Succinct Rank Queries
The rank problem in succinct data structures asks to preprocess an array $A[1..n]$ of bits into a data structure using as close to $n$ bits as possible, and answer queries of the form $\mathrm{rank}(k) = \sum_{i=1}^{k} A[i]$. The problem has been intensely studied, and features as a subroutine in a majority of succinct data structures.
We show that in the cell-probe model with $w$-bit cells, if rank takes $t$ time, the space of the data structure must be at least $n + n/w^{O(t)}$ bits. This redundancy/query trade-off is essentially optimal, matching our upper bound from [FOCS’08].
1.1 The Complexity of Rank
Consider an array $A[1..n]$ of bits. Can we preprocess this array into a data structure of size $n + r$ bits, for small redundancy $r$, which supports rank queries efficiently? The problem of supporting rank (and the related select queries) is the bread-and-butter of succinct data structures. It finds use in most other data structures (for representing trees, graphs, suffix trees / suffix arrays, etc.), and its redundancy / query trade-off has received considerable attention.
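To make the problem concrete, here is a minimal Python sketch of a rank structure, assuming the convention that rank(k) counts the ones among the first k bits. The class name `SampledRank` and the `block` parameter are illustrative choices, not taken from any cited paper. This is the textbook idea of sampling prefix counts, and it stores Python lists rather than packed bits, so it is not succinct in any real sense.

```python
class SampledRank:
    """Toy rank structure: the raw bits plus one prefix count per block.

    Hypothetical illustration only; `block` is an assumed tuning parameter.
    """

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        # samples[j] = number of ones strictly before block j.
        self.samples = [0]
        for start in range(0, len(self.bits), block):
            ones_in_block = sum(self.bits[start:start + block])
            self.samples.append(self.samples[-1] + ones_in_block)

    def rank(self, k):
        """Return the number of ones among the first k bits (0 <= k <= n)."""
        b, r = divmod(k, self.block)
        start = b * self.block
        # One sampled counter plus a scan of at most block - 1 bits.
        return self.samples[b] + sum(self.bits[start:start + r])


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
structure = SampledRank(bits, block=4)
ranks = [structure.rank(k) for k in range(len(bits) + 1)]
```

In this toy setting the `samples` list plays the role of the redundancy: larger blocks store fewer counters but scan more bits per query, which is the kind of trade-off the paper quantifies.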
Rank already had a central position in the seminal papers on succinct data structures. Jacobson [Jac89], in FOCS’89, and Clark and Munro [CM96], in SODA’96, gave the first data structures using space and constant query time. These results were slightly improved in [Mun96, MRR01, RRR02].
In several applications, the set of ones is not dense in the array. Thus, the problem was generalized to storing an array , containing ones and zeros. The optimal space is . Pagh [Pag01] achieved space for this sparse problem. Recently, Golynski et al. [GGG07] achieved . Subsequently, Golynski et al. [GRR08] have achieved space .
In my paper from FOCS’08 [Pǎt08], I gave a qualitative improvement to these bounds, showing an exponential dependence between the query time and the redundancy. Specifically, with query time , the achievable redundancy is . This improved the redundancy for many succinct data structures where rank/select queries were the bottleneck.
Given the surprising nature of this improvement, a natural question is whether we can do much better. In this paper, we show that we cannot, at least for the basic rank queries:
Theorem 1. In the cell-probe model with words of $w$ bits, a data structure that supports rank queries in $t$ cell probes requires at least $n + n/w^{O(t)}$ bits of space.
All succinct data structure papers assume . The lower bound matches my upper bound, except for the difference between and . This difference is inconsequential for small . If we want a polynomially small redundancy (say, less than , for some constant ), the upper bound says that is sufficient. The lower bound says that is necessary. It is unclear which bound is the optimal one in this regime.
1.2 Lower Bounds for Succinct Data Structures
Much work in lower bounds for succinct data structures has been in the so-called systematic model. In this model, the array must be represented as is, i.e. the data structure only has oracle access to it (it can read any consecutive bits at cost). In addition, the data structure may store an index of sublinear size, which the query algorithm can examine at no cost. See [GM03, Mil05, GRR08, Gol07] for increasingly tight lower bounds in this model. Note, however, that in the systematic model, the best achievable redundancy with query time is , i.e. there is a linear trade-off between redundancy and query time. This is significantly improved by my (non-systematic) upper bounds [Pǎt08], and these lower bounds qualitatively miss the nature of this improvement.
In the unrestricted cell-probe model, the first lower bounds were shown by Gál and Miltersen [GM03] in 2003. These lower bounds were strong, showing a linear dependence between the query and redundancy . However, the problem being analyzed is somewhat unnatural: the bound applies to polynomial evaluation, for which nontrivial succinct upper bounds appear unlikely. Their technique, which is based on the strong error correction implicit in their problem, remains powerless for “easier” problems. (Thus, succinct data structures are unusual for lower bounds, in that the difficult goal seems to be proving lower lower bounds for natural problems.)
A significant breakthrough occurred in SODA’09, when Golynski [Gol09] showed a lower bound of for the problem of storing a permutation and querying and . This quadratic trade-off is tight for storing a permutation and its inverse. Golynski’s technique is based on the inherent difficulty of storing a function and its inverse without doubling the space. However, due to the particular attention it pays to inverses, it is unclear how it could generalize to problems like rank.
In this paper, we make further progress on getting lower bounds for natural problems, and analyze one of the central problems in succinct data structures. It is reasonable to hope that our lower bound technique will generalize to many other problems, given the many applications of rank queries.
2 The Proof
2.1 An Entropy Bound
The structure of the rank problem is not particularly important in the lower bound proof. All that is needed is an inequality on the entropy of rank queries that we describe here. Essentially, the lower bound applies to any problem which satisfies a similar entropy condition.
The possible queries come from the universe . Imagine that this universe is divided into blocks of equal size (the remainder is ignored if doesn’t divide ). Let be the set containing the -th query (counting from zero) in each block. For a set of queries, let be the vector of answers to the queries in . We treat as a random variable, depending on the random choice of the input .
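The block decomposition above can be sketched in Python as follows. The names `offset_query_set`, `answers`, `num_blocks` and `delta` are hypothetical, and the sketch assumes 0-indexed positions with rank counted inclusively; the text's own indexing convention may differ.

```python
def offset_query_set(n, num_blocks, delta):
    """Return the delta-th query (counting from zero) of each block.

    The universe {0, ..., n-1} is cut into num_blocks blocks of equal size;
    any remainder at the end is ignored, as in the text.
    """
    block_size = n // num_blocks
    if not 0 <= delta < block_size:
        raise ValueError("delta must index a position inside a block")
    return [b * block_size + delta for b in range(num_blocks)]


def answers(bits, queries):
    """Vector of rank answers for the given queries (assumed inclusive rank)."""
    prefix = [0]
    for bit in bits:
        prefix.append(prefix[-1] + bit)
    return [prefix[q + 1] for q in queries]


example_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
example_queries = offset_query_set(len(example_bits), num_blocks=3, delta=1)
example_answers = answers(example_bits, example_queries)
```

Drawing `example_bits` uniformly at random would turn `example_answers` into the random answer vector that the entropy argument reasons about.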
Lemma 2. Let the input be chosen uniformly at random in , and let and any be arbitrary. Then, for any event with for a small enough constant , we have:
Let us ignore the conditioning on for now. The lemma says that representing the answers to the queries and (a subset of) separately loses bits of entropy per block compared to the optimal joint encoding.
Let be the entropy of the binomial distribution on unbiased trials. The entropy is exactly equal to : the answer of a query minus the answer of the previous query is exactly a binomial on random bits. In all blocks that do not contain an element of , the contribution of the block in is cancelled by its contribution in .
Blocks that contain an element from (except the first block) contribute:
- at least to . The contribution is more if the previous block did not contain an element from ;
- exactly to .
Thus, the block contributes to the sum. Using the known estimation , this quantity is minimized when , and is always at least .
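As a numerical sanity check on the entropy quantities used here, the following Python sketch computes the exact entropy of a binomial on m unbiased trials and compares it with the standard Gaussian approximation, one half of log2(pi * e * m / 2). That approximation is a well-known estimate and is assumed to be in the spirit of the one the text invokes; the exact form used in the proof may differ. The function names are illustrative.

```python
import math


def binomial_entropy(m):
    """Exact Shannon entropy, in bits, of Binomial(m, 1/2)."""
    total = 2 ** m
    entropy = 0.0
    for k in range(m + 1):
        p = math.comb(m, k) / total
        entropy -= p * math.log2(p)
    return entropy


def gaussian_estimate(m):
    """Standard approximation 0.5 * log2(pi * e * m / 2) for m >= 1."""
    return 0.5 * math.log2(math.pi * math.e * m / 2)


def split_loss(a, b):
    """Entropy lost by encoding the counts of two adjacent pieces separately
    (a and b bits long) instead of encoding only their sum."""
    return binomial_entropy(a) + binomial_entropy(b) - binomial_entropy(a + b)
```

Since the two pieces are independent, `split_loss(a, b)` is strictly positive for positive `a` and `b`, which is the per-block loss that the lemma accumulates over many blocks.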
The fact that conditioning on does not change the result comes from a standard independence trick in lower bounds. We decomposed as the sum over independent variables (essentially; see Footnote 1). Each component was with constant probability. By a Chernoff bound, the sum is with probability . Thus, even if we condition on an event of probability , the sum must remain with overwhelming probability. ∎
Footnote 1. The careful reader has probably noticed that we actually decomposed it into two sums, each of which has terms independent among themselves; however, the sums are dependent. We are subtracting the entropy of sub-blocks of size from the entropy of blocks of size in the first sum; and the entropy of sub-blocks of size from the entropy of blocks of size in the second sum. The analysis proceeds by a union bound over the two sums.
2.2 Cell-Probe Elimination
To support the induction in our proof, we augment the cell-probe model with published bits. These bits represent a memory of bounded size which the query algorithm can examine at no cost. Like the regular memory (which must be examined through cell probes), the published bits are initialized at construction time, as a function of the input . Observe that if we have published bits, the problem can be solved trivially.
Our proof will try to publish a small number of cells from the regular memory which are accessed frequently. Thus, the complexity of many queries will decrease by at least one. The argument is then applied iteratively: the cell-probe complexity decreases, as more and more bits are published. If we arrive at zero cell probes and fewer than published bits, we have a contradiction.
Let be the set of cells probed by query ; this is a random variable, since the query can be adaptive. Also let .
The main technical result in our proof is captured in the following lemma, the proof of which appears in the next section:
Lemma 3. Assume a data structure uses published bits, and at most memory bits. Break the queries into blocks, for a large enough constant . Then:
The lemma shows that are a good set of cells to publish, since a constant fraction of the queries probe at least one cell from this set.
Completing the proof is now easy. If the data structure has redundancy , begin by publishing some arbitrary bits, to satisfy the condition that there are at most bits in regular memory.
In step , we let , and publish the cells in , together with their addresses. The number of published bits increases to . The cell-probe complexity of an average query decreases by .
Since the average-case complexity cannot go below zero, the number of iterations that we are able to make must be . The only reason we may fail to make another iteration is a violation of the lemma’s condition . Thus, , that is . This is the desired trade-off.
2.3 An Encoding Argument
In this section, we prove Lemma 3. Our proof is an encoding argument: we show that, if the conclusion of the lemma failed, we could encode a uniformly random using strictly less than bits.
Let and be as in our lemma’s statement, and assume , for a small enough constant . We thus know that a random query is very likely to probe cells not in .
By averaging, there exists a such that We are only going to concentrate on the queries in .
More specifically, we are going to concentrate on the queries that probe no cell from : . Note that .
Intuitively speaking, our contradiction is found as follows. The answers to queries must be encoded in the cells . The answers to queries must be encoded in the cells , which, by definition, is disjoint from . But the answers and are highly correlated (by Lemma 2). Thus, if the two answers are written in disjoint sets of cells, a lot of entropy is being wasted, which is impossible for a succinct data structure.
We first formalize the intuitive notion of “the contents of cells .” Define the footprint of a query set by the following algorithm. We assume the published bits are known in the course of the definition. Enumerate queries in increasing order. For each query, simulate its execution one cell probe at a time. If a cell has already been included in the footprint, ignore it. Otherwise, append the contents (but not the address) of the new cell to the footprint. Observe that is a string of exactly bits.
We observe that is a function of and the published bits. Indeed, we can simulate the queries in order. At each step, we know how the query algorithm acts based on the published bits and the previously read cells. Thus, we know the address of the next cell to be read. We can check whether the cell was already in the footprint (since we also know the address of previous cells). If not, we read the next bits of the footprint, which are precisely the contents of this cell, and continue the simulation.
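The footprint construction and its replay can be sketched in Python as below. Everything here is a toy under stated assumptions: memory is a list of integers, there are no published bits, and the adaptive query algorithm `chase` is a hypothetical pointer-chasing routine standing in for a real rank query. The point is only that the addresses never need to be stored, because the replay recomputes them from the contents read so far.

```python
def footprint(memory, queries, run_query):
    """Contents of the cells probed by `queries`, in first-probe order."""
    seen = {}        # address -> contents, for cells already in the footprint
    contents = []

    def probe(addr):
        if addr not in seen:
            seen[addr] = memory[addr]
            contents.append(memory[addr])   # contents only, never the address
        return seen[addr]

    for q in sorted(queries):               # enumerate queries in increasing order
        run_query(q, probe)
    return contents


def replay(foot, queries, run_query):
    """Recover the answers and the probed addresses from the footprint alone."""
    seen = {}
    remaining = iter(foot)

    def probe(addr):
        if addr not in seen:
            seen[addr] = next(remaining)    # next unread chunk of the footprint
        return seen[addr]

    answers = {q: run_query(q, probe) for q in sorted(queries)}
    return answers, set(seen)


def chase(q, probe):
    """Hypothetical adaptive query: the second probe depends on the first."""
    return probe(probe(q))


memory = [3, 0, 4, 1, 2]
foot = footprint(memory, {0, 2}, chase)     # [3, 1, 4, 2]
answers, addresses = replay(foot, {0, 2}, chase)
```

Because `replay` never touches `memory`, the footprint together with the query algorithm determines both the answers and the set of probed cells, which is the property the encoding relies on.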
Our encoding for the array will consist of the following:
1. the published bits ( bits). Denote these bits by the random variable .
2. the identity of the set as a subset of . This uses bits. By submodularity, the average length of this component is on the order of:
3. the answers , encoded jointly. Using Huffman coding, this requires bits on average.
4. the footprint , encoded optimally given the knowledge of and the published bits. This takes bits on average.
5. the footprint , encoded optimally given the knowledge of and the published bits. This takes bits on average.
6. all cells outside , included verbatim with bits per cell. As noted above, the cell addresses and can be decoded from , respectively , and the published bits. Thus, we know exactly which cells to include in this component. This part takes bits on average.
Observe that this encoding includes the published bits and all cells in the memory (though the cells in and are included in a compressed format). Thus, all queries can be simulated. If all answers are known, the array can be decoded. Thus, this is a valid encoding of .
It remains to analyze the average size of the encoding. To bound item 4., we can write:
But , since the answers can be decoded from the footprint and the published bits. Now note that , since this is the size in bits of the footprint and the published bits. Finally, note that . Thus:
Similarly, item 5. is bounded by:
Summing up all components, our encoding has expected size:
We can now rewrite:
We can now apply Lemma 2 for any fixed and the event . Note that the density is with constant probability over the choice of . Thus, the lemma applies for small enough . We conclude that with constant probability over . Thus, the expectation is also .
Plugging our result into (1), the size of the encoding becomes . Setting for a large constant , and a small enough constant, the negative term is double the positive terms. Thus, the encoding size is , a contradiction.
- [CM96] David R. Clark and J. Ian Munro. Efficient suffix trees on secondary storage. In Proc. 7th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 383–391, 1996.
- [GGG07] Alexander Golynski, Roberto Grossi, Ankur Gupta, Rajeev Raman, and S. Srinivasa Rao. On the size of succinct indices. In Proc. 15th European Symposium on Algorithms (ESA), pages 371–382, 2007.
- [GM03] Anna Gál and Peter Bro Miltersen. The cell probe complexity of succinct data structures. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP), pages 332–344, 2003.
- [Gol07] Alexander Golynski. Optimal lower bounds for rank and select indexes. Theoretical Computer Science, 387(3):348–359, 2007. See also ICALP’06.
- [Gol09] Alexander Golynski. Cell probe lower bounds for succinct data structures. In Proc. 20th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 625–634, 2009.
- [GRR08] Alexander Golynski, Rajeev Raman, and S. Srinivasa Rao. On the redundancy of succinct data structures. In Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT), 2008.
- [Jac89] Guy Jacobson. Space-efficient static trees and graphs. In Proc. 30th IEEE Symposium on Foundations of Computer Science (FOCS), pages 549–554, 1989.
- [Mil05] Peter Bro Miltersen. Lower bounds on the size of selection and rank indexes. In Proc. 16th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 11–12, 2005.
- [MRR01] J. Ian Munro, Venkatesh Raman, and S. Srinivasa Rao. Space efficient suffix trees. Journal of Algorithms, 39(2):205–222, 2001. See also FSTTCS’98.
- [Mun96] J. Ian Munro. Tables. In Proc. 16th Conference on the Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 37–40, 1996.
- [Pag01] Rasmus Pagh. Low redundancy in static dictionaries with constant query time. SIAM Journal on Computing, 31(2):353–363, 2001. See also ICALP’99.
- [Pǎt08] Mihai Pǎtraşcu. Succincter. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 305–313, 2008.
- [RRR02] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 233–242, 2002.