Compressed Dynamic Range Majority Data Structures

Travis Gagie, Meng He, and Gonzalo Navarro
Diego Portales University and CeBiB; Dalhousie University; University of Chile. travis.gagie@mail.udp.cl, mhe@cs.dal.ca, gnavarro@dcc.uchile.cl
Abstract

In the range α-majority query problem, we preprocess a given sequence S[1..n] for a fixed threshold α ∈ (0, 1), such that given a query range [i..j], the symbols that occur more than α(j − i + 1) times in S[i..j] can be reported efficiently. We design the first compressed solution to this problem in dynamic settings. Our data structure represents S using nH_k(S) + o(n lg σ) bits for any k = o(log_σ n), where σ is the alphabet size and H_k(S) is the k-th order empirical entropy of S. It answers range α-majority queries in O(lg n/(α lg lg n)) time, and supports insertions and deletions in O(lg n/α) amortized time. The best previous solution [1] has the same query and update times, but uses O(n) words.

* Funded with basal funds FB0001, Conicyt, Chile, and by NSERC of Canada.



1 Introduction

Given a threshold 0 < α < 1, a symbol c is an α-majority in a sequence S[1..n] if c occurs more than αn times in S. Thus α-majorities are often used to represent frequent symbols and, naturally, the problem of finding α-majorities is important in data mining [2, 3]. Misra and Gries [4] proposed an optimal solution that computes all α-majorities using O(n lg(1/α)) comparisons; when implemented in the word RAM for a sequence over an alphabet of size σ, the running time becomes O(n) [3].
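The candidate-and-verify scheme of Misra and Gries can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: it keeps at most ⌊1/α⌋ candidate counters during one scan, then verifies candidates with a second counting pass.

```python
from collections import Counter

def misra_gries(seq, alpha):
    """Return the set of symbols occurring more than alpha*len(seq) times."""
    k = int(1 / alpha)  # fewer than 1/alpha true alpha-majorities can exist
    counters = {}
    for x in seq:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # decrement every counter; drop candidates that reach zero
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    # verification pass: the candidates are a superset of the true majorities
    freq = Counter(seq)
    return {x for x in counters if freq[x] > alpha * len(seq)}
```

The first pass guarantees that every α-majority survives as a candidate; the second pass discards false positives, which is exactly the verify-the-candidates pattern reused throughout Section 3.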

In the range α-majority query problem, we further preprocess S such that given a query range [i..j], the α-majorities of S[i..j], i.e., the symbols that occur more than α(j − i + 1) times in S[i..j], can be reported efficiently. Karpinski and Nekrich [5] first considered this problem and proposed a solution that uses O(n/α) words to support queries in O((lg lg n)²/α) time. Durocher et al. [6] presented the first solution that achieves optimal O(1/α) query time, and their structure occupies O(n lg(1/α + 1)) words. Subsequently, much work has been done to make the space cost independent of α [7, 8, 9], and even to achieve compression [7, 9]. For example, Gagie et al. [9] showed how to represent S using (1 + ε)nH_0(S) + o(n) bits for any constant ε > 0 to answer range α-majority queries in O(1/α) time, where H_0(S) is the 0-th order empirical entropy of S. We refer to [9] for a more thorough survey.

In dynamic settings, we wish to maintain S to support range α-majority queries under the following update operations: i) insert(c, i), which inserts symbol c so that it becomes the new S[i], shifting the symbols in positions i to n to positions i + 1 to n + 1, respectively; ii) delete(i), which deletes S[i], shifting the symbols in positions i + 1 to n to positions i to n − 1, respectively. Elmasry et al. [1] considered this problem, and they designed an O(n)-word structure that can answer range α-majority queries in O(lg n/(α lg lg n)) time, and supports insertions and deletions in O(lg n/α) amortized time.¹

¹ Karpinski and Nekrich [5] also considered the dynamic case, though they defined the data set as a set of colored points in 1D. With a reduction developed in [1], the solutions by Karpinski and Nekrich can also be used to encode dynamic sequences, though the results are inferior to those of Elmasry et al. [1].
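To pin down the semantics of these operations, here is a naive, uncompressed baseline on a plain Python list (all names are illustrative; every operation takes linear time, whereas the structures below achieve polylogarithmic times in compressed space):

```python
from collections import Counter

class NaiveDynamicSequence:
    """Naive baseline fixing the semantics of insert/delete and of
    range alpha-majority queries; positions are 1-based as in the paper."""

    def __init__(self, symbols, alpha):
        self.s = list(symbols)   # S[1..n], stored 0-based internally
        self.alpha = alpha

    def insert(self, c, i):
        # c becomes the new S[i]; S[i..n] shifts to S[i+1..n+1]
        self.s.insert(i - 1, c)

    def delete(self, i):
        # S[i] is removed; S[i+1..n] shifts to S[i..n-1]
        del self.s[i - 1]

    def range_majorities(self, i, j):
        # symbols occurring more than alpha*(j-i+1) times in S[i..j]
        window = self.s[i - 1:j]
        bound = self.alpha * len(window)
        return {c for c, f in Counter(window).items() if f > bound}
```

Note that with α = 1/2 a range can have no majority at all when the two most frequent symbols tie, since the definition requires strictly more than α(j − i + 1) occurrences.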

Previously, no succinct data structures had been designed for dynamic range α-majorities. We thus design the first compressed data structure for this problem. Our data structure represents S using nH_k(S) + o(n lg σ) bits for any k = o(log_σ n), where σ is the alphabet size and H_k(S) is the k-th order empirical entropy of S. It answers range α-majority queries in O(lg n/(α lg lg n)) time, and supports insertions and deletions in O(lg n/α) amortized time. Hence, its query and update times match those of the best previous solution by Elmasry et al. [1], while using compressed space.

2 Preliminaries

In this section, we summarize some existing data structures that will be used in our solution. One such data structure is designed for the problem of maintaining a string S[1..n] under insert and delete operations to support the following operations: access(S, i), which returns S[i]; rank_c(S, i), which returns the number of occurrences of character c in S[1..i]; and select_c(S, j), which returns the position of the j-th occurrence of c in S. The following lemma summarizes the currently best compressed solution to this problem, which also supports the extraction of an arbitrary substring in optimal time:

Lemma 1 ([10]).

A string S[1..n] over an alphabet of size σ can be represented using nH_k(S) + o(n lg σ) bits for any k = o(log_σ n) to support access, rank, select, insert, and delete in O(lg n/lg lg n) time. It also supports the extraction of a substring of length ℓ in O(lg n/lg lg n + ℓ/log_σ n) time.
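The interface of Lemma 1 can be illustrated with a plain-array stand-in (linear-time operations here; the lemma's structure achieves the stated compressed space and polylogarithmic times, and the class/method names are our own):

```python
class DynamicString:
    """Interface of Lemma 1 over a plain Python list (naive running times)."""

    def __init__(self, s):
        self.s = list(s)

    def access(self, i):          # S[i], 1-based
        return self.s[i - 1]

    def rank(self, c, i):         # occurrences of c in S[1..i]
        return self.s[:i].count(c)

    def select(self, c, j):       # position of the j-th occurrence of c
        seen = 0
        for pos, x in enumerate(self.s, 1):
            if x == c:
                seen += 1
                if seen == j:
                    return pos
        return None

    def extract(self, i, j):      # substring S[i..j]
        return "".join(self.s[i - 1:j])
```

A range count of c in S[i..j] is rank(c, j) − rank(c, i − 1), which is exactly how candidate symbols are verified in the query algorithms of Section 3.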

Raman et al. [11] considered the problem of representing a dynamic integer sequence Q to support the following operations: sum(Q, i), which computes Q[1] + ... + Q[i]; search(Q, x), which returns the smallest i with sum(Q, i) ≥ x; and update(Q, i, δ), which sets Q[i] to Q[i] + δ. One building component of their solution is a data structure for small sequences, which will also be used in our data structures:

Lemma 2 ([11]).

A sequence, Q, of O(lg^e n) nonnegative integers of O(lg n) bits each, where 0 < e < 1 is a constant, can be represented using O(lg^{1+e} n) bits to support sum, search, and update, where |δ| ≤ lg n, in O(1) time. This data structure can be constructed in O(lg^e n) time, and requires a precomputed universal table occupying O(n^ε) bits for any fixed ε > 0.
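A naive stand-in for the sum/search/update interface of Lemma 2 (the lemma achieves constant time via the precomputed table; the δ-increment convention for update is our reading of Raman et al.):

```python
class SmallSequence:
    """sum/search/update over a small list of nonnegative integers."""

    def __init__(self, values):
        self.a = list(values)

    def sum(self, i):                 # a[1] + ... + a[i]
        return sum(self.a[:i])

    def search(self, x):              # smallest i with sum(i) >= x
        total = 0
        for i, v in enumerate(self.a, 1):
            total += v
            if total >= x:
                return i
        return None

    def update(self, i, delta):       # a[i] += delta, 1-based
        self.a[i - 1] += delta
```

In Section 3.1 this is exactly what is needed at marked levels: search finds the descendant whose block contains a given position, and sum gives the starting offset of that block.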

3 Compressed Dynamic Range Majority Data Structures

In this section we design compressed dynamic data structures for range α-majority queries. We define three different types of queries as follows. Given an α-majority query with range [i..j], we compute the size, t, of the query range as t = j − i + 1. If t ≥ t₁, where t₁ denotes the minimum length of a block stored at the leaf level of the tree defined in Section 3.1, then we say that this query is a large-sized query. The query is called a medium-sized query if t₂ ≤ t < t₁, where t₂ denotes the minimum length of a mini-block at the lowest level of the decomposition defined in Section 3.2. If t < t₂, then it is a small-sized query.

We represent the input sequence S using Lemma 1. This supports small-sized queries immediately: by Lemma 1, we can compute the content of the subsequence S[i..j], where [i..j] is the query range, in O(lg n/lg lg n + t/log_σ n) time. We can then compute the α-majorities in S[i..j] in O(t) time using the algorithm of Misra and Gries [4]. Thus it suffices to construct additional data structures for large-sized and medium-sized queries.

3.1 Supporting Large-Sized Range -Majority Queries

To support large-sized queries, we construct a weight-balanced B-tree [12] T over S with branching parameter a = O(1) and leaf parameter b. We augment T by adding, for each node, a pointer to the node immediately to its left at the same level, and another pointer to the node immediately to its right. These pointers can be maintained easily under updates, and do not affect the asymptotic space cost of T. Each leaf of T represents a contiguous subsequence, or block, of S, and the entire sequence can be obtained by concatenating all the blocks represented by the leaves of T from left to right. Each internal node of T then represents a block that is the concatenation of all the blocks represented by its leaf descendants. We number the levels of T 0, 1, 2, ... from the leaf level to the root level. Thus level ℓ′ is higher than level ℓ if ℓ′ > ℓ. Let v be a node at the ℓ-th level of T, and let B(v) denote the block it represents. Then, by the properties of weight-balanced B-trees, if v is a leaf, the length of its block, denoted by |B(v)|, is at least b and at most 2b − 1. If v is an internal node, then a^ℓ b/2 < |B(v)| < 2a^ℓ b. We also have that each internal node has at least a/4 and at most 4a children.

We do not store the actual content of a block in the corresponding node of T. Instead, for each node v, we store the size of the block B(v) that it represents and, in addition, compute and store information in a structure called the candidate list, L(v), about symbols that can possibly be the α-majorities of subsequences that meet certain conditions. More precisely, let ℓ be the level of v, let u be the parent of v, and let C(v) be the concatenation of the blocks represented by the node immediately to the left of u at level ℓ + 1, the node u, and the node immediately to the right of u at level ℓ + 1. Then L(v) contains each symbol that appears more than (α/2)m_ℓ times in C(v), where m_ℓ is the minimum size of a block at level ℓ. Since the maximum length of each block at level ℓ + 1 is 2a^{ℓ+1}b, we have |C(v)| = O(a m_ℓ) = O(m_ℓ), and thus |L(v)| = O(1/α). To show the idea behind the candidate lists, we say that two subsequences touch each other if their corresponding sets of indices in S are not disjoint. We then observe that, since the size of any block at level ℓ + 1 is greater than m_{ℓ+1}, any subsequence Q touching B(v) is completely contained in C(v) if |Q| is within m_{ℓ+1}. Since each α-majority in such a Q appears more than (α/2)m_ℓ times, it is also contained in L(v). Therefore, to find the α-majorities of Q, it suffices to verify whether each element in L(v) is indeed an answer; more details are to be given in our query algorithm later.

Even though it only requires O(|C(v)|) time to construct L(v) [4], it would be costly to reconstruct it every time an update operation is performed on C(v). To make the cost of maintaining L(v) acceptable, we only rebuild it periodically, adopting a strategy of Karpinski and Nekrich [5]. More precisely, when we construct L(v), we store the symbols that occur more than (α/4)m_ℓ times in C(v). We also keep a counter K(v) that we increment whenever we perform an insertion or deletion in C(v). Only when K(v) reaches (α/4)m_ℓ do we reconstruct L(v), and then we reset K(v) to 0. Since at most (α/4)m_ℓ updates can be performed on C(v) between two consecutive reconstructions, any symbol that occurs more than (α/2)m_ℓ times in C(v) at any time during these updates must have had at least (α/4)m_ℓ occurrences in C(v) before these updates were performed. Thus we can guarantee that any symbol that appears more than (α/2)m_ℓ times in C(v) is always contained in L(v) during updates. The size of L(v) is still O(1/α), and, as will be shown later, it only requires O(lg n/α) amortized time per update to S to maintain all the candidate lists.
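The rebuild-on-a-budget strategy can be sketched as follows. This is an illustrative Python rendering: the halved threshold and the rebuild budget stand in for the paper's level-dependent parameters, and the covered region is passed in as a plain list.

```python
from collections import Counter

class CandidateList:
    """Candidate list with lazy periodic rebuilds: store symbols above a
    halved threshold, and rebuild only after enough updates touch the
    covered region, so each rebuild is amortized over many updates."""

    def __init__(self, region, alpha):
        self.alpha = alpha
        self.rebuild(region)

    def rebuild(self, region):
        # store every symbol above half the query threshold, so a symbol
        # crossing the full threshold between rebuilds is never missed
        bound = (self.alpha / 2) * len(region)
        self.candidates = {c for c, f in Counter(region).items() if f > bound}
        self.counter = 0
        self.budget = max(1, int((self.alpha / 2) * len(region)))

    def note_update(self, region):
        # called once per insertion/deletion touching the region
        self.counter += 1
        if self.counter >= self.budget:
            self.rebuild(region)
```

Between rebuilds at most `budget` updates occur, so a symbol whose count crosses the full query threshold must already have exceeded the stored (halved) threshold at the last rebuild, which is the invariant the text establishes.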

We also construct data structures to speed up a top-down traversal in T. These data structures are defined for the marked levels of T, where the r-th marked level is level rs of T, for r ≥ 0 and s = ⌈(lg lg n)/(2 lg(4a))⌉. Given a node v at the r-th marked level, the number of its descendants at the (r − 1)-st marked level is at most (4a)^s = O(√(lg n)). Thus, the sizes of the blocks represented by these descendants, when listed from left to right, form an integer sequence, D_v, of O(√(lg n)) entries. We represent D_v using Lemma 2, and store a sequence of pointers, P_v, in which the d-th pointer points to the d-th leftmost descendant of v at the (r − 1)-st marked level.

We next prove the following key lemma regarding an arbitrary subsequence Q of S of length greater than m_0/2, which will be used in our query algorithm:

Lemma 3.

If m_ℓ/2 < |Q| ≤ m_{ℓ+1}, then each α-majority element in Q is contained in L(v) for any node v at level ℓ of T whose block touches Q.

Proof.

Let u be v's parent. Then B(u) also touches Q, and u is at level ℓ + 1. Let u_l and u_r be the nodes immediately to the left and right of u at level ℓ + 1, respectively.

Let m_ℓ and m_{ℓ+1} denote the minimum size of a block represented by a node at level ℓ and at level ℓ + 1 of T, respectively. Since m_ℓ/2 < |Q|, any α-majority of Q occurs more than α|Q| > (α/2)m_ℓ times in Q.

On the other hand, |Q| ≤ m_{ℓ+1} ≤ |B(u)|. Since Q touches B(u), this inequality means that Q is entirely contained in either the concatenation of B(u_l) and B(u), or the concatenation of B(u) and B(u_r). In either case, Q is contained in C(v). Since any α-majority of Q occurs more than (α/2)m_ℓ times in Q, it also occurs more than (α/2)m_ℓ times in C(v). As L(v) includes any symbol that appears more than (α/2)m_ℓ times in C(v), any α-majority of Q is contained in L(v). ∎

We now describe our query and update algorithms, and analyze space cost.

Lemma 4.

Large-sized range α-majority queries can be supported in O(lg n/(α lg lg n)) time.

Proof.

Let [i..j] be the query range, and Q = S[i..j]. We first choose the level ℓ with m_ℓ/2 < t ≤ m_{ℓ+1}, and look for a node at level ℓ whose block touches Q. The obvious approach is to perform a top-down traversal of T to look for the node at level ℓ whose block contains position i. During the traversal, we make use of the information about the lengths of the blocks represented by the nodes of T to decide which node at the next level to descend to, and to keep track of the starting position in S of the block represented by the node that is currently being visited. More precisely, suppose that we visit node v at the current level, having determined previously that B(v) contains position i, and that the first element of B(v) is S[s]. Let v_1, v_2, ..., v_c denote the children of v, where c ≤ 4a. To decide which child of v represents a block that contains position i, we retrieve the lengths of all the blocks B(v_1), ..., B(v_c), and look for the smallest d such that s + |B(v_1)| + ... + |B(v_d)| > i. Node v_d is then the node at the level below whose block contains position i, and the starting position of its block in S is s + |B(v_1)| + ... + |B(v_{d−1})|. As a = O(1) and we store the length of the block that each node represents, these steps use constant time.

However, if we followed only the approach described in the previous paragraph, we would use O(lg n) time in total, as T has O(lg n) levels. Thus we make use of the additional data structures stored at marked levels to speed up this process. If there is no marked level between the root level and level ℓ, then the top-down traversal only descends O(lg lg n) levels, requiring O(lg lg n) time only. Otherwise, we perform the top-down traversal until we reach the highest marked level. Let v be the node that we visit at the highest marked level. As D_v stores the lengths of the blocks at the next marked level, we can perform a search operation in D_v and then follow an appropriate pointer in P_v to look for the node at the second highest marked level whose block contains position i, and perform a sum operation in D_v to determine the starting position of its block in S. These operations require constant time. We repeat this process until we reach the lowest marked level above level ℓ, and then we descend level by level until we find the desired node at level ℓ. As there are O(lg n/lg lg n) marked levels, the entire process requires O(lg n/lg lg n) time.

By Lemma 3, we know that the α-majorities of Q are contained in L(v). We then verify, for each symbol, c, in L(v), whether it is indeed an α-majority by computing its number, f, of occurrences in Q and comparing f to αt. As f = rank_c(S, j) − rank_c(S, i − 1), f can be computed in O(lg n/lg lg n) time by Lemma 1. As |L(v)| = O(1/α), it requires O(lg n/(α lg lg n)) time in total to find out which of these symbols should be included in the answer to the query. Therefore, the total query time is O(lg n/(α lg lg n)). ∎
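The size-guided descent used in the proof above can be sketched as follows (the dict-based tree encoding is purely illustrative; each node stores only its block size, as in T):

```python
def descend(node, i):
    """Walk down a tree whose nodes store only their block sizes,
    returning the leaf whose block contains position i (1-based),
    together with the starting position of that leaf's block in S."""
    start = 1
    while "children" in node:
        for child in node["children"]:
            if i < start + child["size"]:   # i falls inside child's block
                node = child
                break
            start += child["size"]          # otherwise skip past child's block
    return node, start
```

Each level costs time proportional to the node degree; the marked-level structures of Lemma 2 replace the inner loop by a constant-time search/sum pair, which is where the lg lg n speed-up comes from.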

Lemma 5.

The data structures described in Section 3.1 can be maintained in O(lg n/α) amortized time under update operations.

Proof.

We only show how to support insert(c, i); the support for delete(i) is similar.

To perform insert(c, i), we first perform a top-down traversal to look for the leaf whose block contains position i. During this traversal, we descend level by level as in Lemma 4, but we do not use the marked levels to speed up the process. For each node v that we visit, we increment the recorded length of B(v). In addition, we update the counters stored in the children of v and in the children of the two nodes that surround v. There are a constant number of these nodes, and they can all be located in constant time by following either the edges of T, or the pointers between two nodes that are next to each other at the same level, with which we have augmented T.

When incrementing the counter of each node, we check whether the candidate list of this node has to be rebuilt. To reconstruct the candidate list of a node v at level ℓ, we first compute the starting and ending positions of C(v) in S. This can be done in constant time because, during the top-down traversal, we have already computed the starting and ending positions of B(v) in S, and the three nodes whose blocks form C(v), as well as the sizes of these three blocks, can be retrieved by following a constant number of pointers starting from v. We then extract the content of C(v). As |C(v)| = O(m_ℓ) (see the discussion earlier in this section), by Lemma 1, C(v) can be extracted from S in O(lg n/lg lg n + m_ℓ/log_σ n) time. We next compute all the symbols that appear in C(v) more than (α/4)m_ℓ times in O(m_ℓ) time [4], and these are the elements of the reconstructed L(v). Since the counter has to reach (α/4)m_ℓ before L(v) has to be rebuilt, the amortized cost per update is O(1/α).

If v is at a marked level, we perform a search operation in O(1) time to locate the entry of D_v that corresponds to the node at the next lower marked level whose block contains position i, and perform an update, again in O(1) time, to increment the value stored in this entry. So far we have used O(1/α) amortized time for each node we visit during the top-down traversal. Since T has O(lg n) levels, the overall cost we have calculated up to this point is O(lg n/α) amortized time.

When a node, v, at level ℓ of T splits, we reconstruct the candidate lists affected by the split in time linear in the sizes of the corresponding blocks. If ℓ is a marked level, but it is not the lowest marked level, we also rebuild D_v and P_v in O(√(lg n)) time. By the properties of a weight-balanced B-tree, after a node at level ℓ has been split, it requires Ω(a^ℓ b) insertions before it can be split again. Therefore, we can amortize the cost of reconstructing these data structures over the insertions between reconstructions, and each insertion is thus charged with O(1/α) amortized cost per level. As each insert may cause one node at each level of T to split, the overall cost charged to an insert operation is thus O(lg n/α).

Finally, update operations may cause the value of ⌈lg n⌉ to change. For this to happen, the value of n must double or halve, and this requires Ω(n) updates. It is clear that our data structures can be constructed in O(n lg n) time, incurring O(lg n) amortized time for each update. To summarize, insert can be supported in O(lg n/α) amortized time. ∎

Lemma 6.

The data structures described in Section 3.1 occupy O((n/b)((1/α) lg σ + lg n)) bits.

Proof.

As T has O(n/b) nodes, the structure of T, the pointers between nodes at the same level, as well as the counters and block lengths stored with the nodes, occupy O((n/b) lg n) bits in total. Each candidate list can be stored in O((1/α) lg σ) bits, so the candidate lists stored in all the nodes use O((n/(αb)) lg σ) bits in total. The sizes of the structures D_v and P_v can be charged to the nodes they point to, so there are O(n/b) entries to store. As each entry of a D_v uses O(lg n) bits, all the D_v's occupy O((n/b) lg n) bits. The same analysis applies to the P_v's. Therefore, the data structures described in this section use O((n/b)((1/α) lg σ + lg n)) bits. ∎

3.2 Supporting Medium-Sized Range -Majority Queries

We could use the same structures designed in Section 3.1 to support medium-sized queries if we simply set the leaf parameter of T to be t₂ instead of b, but then the resulting data structures would not be succinct. To save space, we build a data structure F(v) for each leaf node v of T. Our idea for supporting medium-sized queries is similar to that for large-sized queries, but since the block represented by a leaf node of T is small, we are able to simplify the idea and the data structures of Section 3.1. These simplifications allow us to maintain a multi-level decomposition of B(v) in a hierarchy of lists instead of in a tree, which is further laid out in one contiguous chunk of memory for each leaf node of T, to avoid using too much space for pointers.

We now describe this multi-level decomposition of B(v), which will be used to define the data structure components of F(v). As we define one set of data structure components in F(v) for each level of this decomposition, we use F(v) to refer to both the data structure that we build for v and the decomposition of B(v). To distinguish a level of F(v) from a level of T, we number each level of F(v) using a non-positive integer. At level −ℓ, for 0 ≤ ℓ ≤ L, B(v) is partitioned into mini-blocks of length between b/2^ℓ and 2b/2^ℓ. Note that the level-0 decomposition contains simply one mini-block, which is B(v) itself, as the length of any leaf block in T is between b and 2b − 1 already. We define m_{−ℓ} = b/2^ℓ, which is the minimum length of a mini-block at level −ℓ. As L = ⌈lg(b/t₂)⌉, the minimum length of a mini-block at the lowest level, i.e., level −L, is between t₂/2 and t₂.

For each mini-block β at level −ℓ of F(v), we define its predecessor, pred(β), as follows: if β is not the leftmost mini-block at level −ℓ of F(v), then pred(β) is the mini-block immediately to its left at the same level. Otherwise, if v is not the leftmost leaf (pred(β) is null otherwise), let v_l be the leaf immediately to the left of v in T, and pred(β) is defined to be the rightmost mini-block at level −ℓ of F(v_l). Similarly, we define the successor, succ(β), of β as the mini-block immediately to the right of β at level −ℓ of F(v) if such a mini-block exists. Otherwise, succ(β) is the leftmost mini-block at level −ℓ of F(v_r), where v_r is the leaf immediately to the right of v in T if it exists, or null otherwise. Then, the candidate list, L(β), of β contains each symbol that occurs more than (α/2)m_{−ℓ} times in the concatenation of pred(β), β and succ(β). To maintain L(β) during updates, we use the same strategy from Section 3.1 that is used to maintain L(v). More specifically, we store a counter so that we can rebuild L(β) after exactly (α/4)m_{−ℓ} update operations have been performed on pred(β), β and succ(β). Whenever we perform the reconstruction, we include in L(β) each symbol that occurs more than (α/4)m_{−ℓ} times in the concatenation of pred(β), β and succ(β). Since this concatenation has length at most 6m_{−ℓ}, the number of symbols included in L(β) is O(1/α).

The precomputed information for each mini-block β includes L(β), the length |β|, and the counter. These data for mini-blocks at the same level, −ℓ, of F(v) are chained together in a doubly linked list Λ_{−ℓ}(v); F(v) then contains these L + 1 lists. We cannot, however, afford to store each list in the standard way using pointers of Θ(lg n) bits each, as this would use too much space. Instead, we lay the lists out in a contiguous chunk of memory as follows. We first observe that the number of mini-blocks at level −ℓ of F(v) is less than 2^{ℓ+1}. Thus, the total number of mini-blocks across all levels is less than 2^{L+2} = O(b/t₂). We then use an array, A(v), of fixed-size slots to store F(v), in which each slot stores the precomputed information of one mini-block.

To determine the size of a slot, we compute the maximum number of bits needed to encode the precomputed information for a mini-block β. L(β) can be stored in O((1/α) lg σ) bits. As β has less than 2b elements, its length can be encoded in O(lg b) bits. The counter can likewise be encoded in O(lg b) bits. The two pointers to the neighbours of β in its linked list can be encoded as the indices of these mini-blocks in the memory chunk. Since there are O(b/t₂) slots, each pointer can be encoded in O(lg(b/t₂)) bits. Therefore, we set the size of each slot to be Θ((1/α) lg σ + lg b) bits.

We prepend this memory chunk with a header. This header encodes the indices of the slots that store the heads of the lists Λ_{−ℓ}(v). As there are L + 1 levels and each index can be encoded in O(lg(b/t₂)) bits, the header uses O(lg² b) bits. Clearly our memory management scheme allows us to traverse each doubly linked list easily. When mini-blocks merge or split during updates, we need to perform insertions and deletions in the doubly linked lists. To facilitate these updates, we always store the precomputed information for all mini-blocks of F(v) in a prefix of A(v), and keep track of the number of used slots of A(v). When we perform an insertion into a list Λ_{−ℓ}(v), we use the first unused slot of A(v) to store the new information, and update the header if the newly inserted list element becomes the head. When we perform a deletion, we copy the content of the last used slot (let β′ be the mini-block that corresponds to it) into the slot corresponding to the deleted element. We also follow the pointers encoded in the slot for β′ to locate the neighbours of β′ in its doubly linked list, and update the pointers in these neighbours that point to β′. If β′ is the head of a doubly linked list (we can determine which list it is from the length of β′), we update the header as well. The following lemma shows that our memory management strategy indeed saves space:
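The move-the-last-slot deletion trick can be sketched as follows. This is an illustrative Python rendering with dict records rather than bit-packed slots; the caller is assumed to have already unlinked the record being deleted from its list, and all names are our own.

```python
class SlotChunk:
    """Doubly linked lists stored in a prefix of a fixed-slot array.
    Slots hold prev/next as slot indices; deleting a slot moves the last
    used slot into the hole and patches its neighbours' indices."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.used = 0

    def insert(self, record):
        idx = self.used                  # always fill the first unused slot
        self.slots[idx] = dict(record, prev=None, next=None)
        self.used += 1
        return idx

    def link(self, a, b):                # make slot b follow slot a
        self.slots[a]["next"] = b
        self.slots[b]["prev"] = a

    def delete(self, idx):               # idx must already be unlinked
        last = self.used - 1
        if idx != last:
            moved = self.slots[last]
            self.slots[idx] = moved      # keep the used slots in a prefix
            if moved["prev"] is not None:
                self.slots[moved["prev"]]["next"] = idx
            if moved["next"] is not None:
                self.slots[moved["next"]]["prev"] = idx
        self.slots[last] = None
        self.used -= 1
```

Keeping live records in a prefix is what makes the short slot indices work as pointers: no free-list bookkeeping is needed, and every slot index stays below the current number of used slots.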

Lemma 7.

The data structures described in Section 3.2 occupy O((n/t₂)((1/α) lg σ + lg b)) bits.

Proof.

We first analyze the size of the memory chunk storing F(v) for each leaf v of T. By our analysis in the previous paragraphs, the header of this chunk uses O(lg² b) bits. Each slot of A(v) uses O((1/α) lg σ + lg b) bits, and A(v) has O(b/t₂) entries. Therefore, A(v), and hence the whole memory chunk of each leaf of T, occupies O((b/t₂)((1/α) lg σ + lg b)) bits. As there are O(n/b) leaves in T, the data structures described in this section use O((n/t₂)((1/α) lg σ + lg b)) bits. ∎

We now show how to support query and update operations.

Lemma 8.

Medium-sized range α-majority queries can be supported in O(lg n/(α lg lg n)) time.

Proof.

Let [i..j] be the query range and let Q = S[i..j]. We first perform a top-down traversal in T to locate the leaf, v, that represents a block containing position i, in O(lg n/lg lg n) time, using the approach described in the proof of Lemma 4. In this process, we can also find the starting position of B(v) in S.

We next make use of F(v) to answer the query as follows. Let t = j − i + 1. As t₂ ≤ t < t₁, there is a level −ℓ of F(v) with m_{−ℓ}/2 < t ≤ m_{−ℓ}. We then scan the list Λ_{−ℓ}(v) to look for a mini-block, β, at level −ℓ that contains position i. This can be done by first locating the head of Λ_{−ℓ}(v) from the header of the memory chunk that stores F(v), and then performing a linear scan, computing the starting position of each mini-block in S along the way. As Λ_{−ℓ}(v) has O(b/t₂) entries, we can locate β in O(b/t₂) = O(lg n/lg lg n) time by our choice of parameters. Since t ≤ m_{−ℓ} ≤ |β|, Q is either entirely contained in the concatenation of pred(β) and β, or the concatenation of β and succ(β). Thus each α-majority of Q must occur more than αt > (α/2)m_{−ℓ} times in the concatenation of pred(β), β and succ(β). Therefore, each α-majority of Q is contained in L(β). We can then perform rank operations in S to verify whether each symbol in L(β) is indeed an α-majority of Q. As L(β) has O(1/α) symbols, this process requires O(lg n/(α lg lg n)) time. ∎

Lemma 9.

The data structures described in Section 3.2 can be maintained in O(lg n/α) amortized time under update operations.

Proof.

We only show how to support insert(c, i); the support for delete(i) is similar.

To perform insert(c, i), we first perform a top-down traversal in T to locate the leaf, v, that represents a block containing position i, in O(lg n/lg lg n) time. We then increment the recorded lengths of all the mini-blocks that contain position i. We also increment the counters of these mini-blocks, as well as the counters of their predecessors and successors. All the mini-blocks whose counters should be incremented are located in F(v_l), F(v) and F(v_r), where v_l and v_r are the leaves immediately to the left and right of v in T. We scan each doubly linked list of F(v_l), F(v) and F(v_r) to locate these mini-blocks. Since F(v_l), F(v) and F(v_r) have O(b/t₂) mini-blocks in total over all levels, it requires O(b/t₂) = O(lg n/lg lg n) time to find these mini-blocks and update them.

The above process finds all these mini-blocks, as well as their starting and ending positions in S. It may be necessary to reconstruct the candidate lists of some of these mini-blocks. Similarly to the analysis in the proof of Lemma 5, the candidate list of each of these mini-blocks can be maintained in O(1/α) amortized time. Since there are O(lg(b/t₂)) levels in F(v_l), F(v) and F(v_r), and only a constant number of mini-blocks at each level may need to be rebuilt, it requires O(lg(b/t₂)/α) amortized time to reconstruct all of them.

An insertion may also cause a mini-block to split. As in the proof of Lemma 5, we compute the candidate lists and other required information for the mini-blocks created as a result of the split, and amortize the cost over the insertions that led to the split. The amortized cost is again O(1/α). As there can possibly be a split at each level of F(v), it requires O(lg(b/t₂)/α) amortized time to handle them. Finally, when the value of ⌈lg n⌉ changes, we rebuild all the data structures designed in this section, incurring O(lg n/α) amortized time. Therefore, the total time required to support insert is O(lg n/α). ∎

Combining Lemma 1 and Lemmas 4-9, we have our main result:

Theorem 10.

A sequence S[1..n] over an alphabet of size σ can be represented using nH_k(S) + o(n lg σ) bits for any k = o(log_σ n) to answer range α-majority queries in O(lg n/(α lg lg n)) time, and to support insert and delete in O(lg n/α) amortized time.

4 Concluding Remarks

In this paper, we have designed the first compressed data structure for dynamic range α-majority. To achieve this result, our key strategy is to perform a multi-level decomposition of the sequence S and, for each block of the decomposition, precompute a candidate list which includes all the α-majorities of any query range of the right size that touches this block. Thus, when answering a query, we need not find a set of blocks whose union forms the query range, as is required in the solution of Elmasry et al. [1]. Instead, we only look for a single block that touches the query range. This simpler strategy allows us to achieve compressed space. Furthermore, it is possible to generalize our solution to design the first dynamic data structure that maintains S in the same space and update time, to support the computation of the β-majorities in a given query range for any β ≥ α. Note that here β is given in a query and only α is fixed and given beforehand. This type of query is more general than range α-majority queries and was only studied in the static case before [7, 9]. The details are deferred to the full version of this paper.

References


  • [1] A. Elmasry, M. He, J. I. Munro, and P. K. Nicholson, “Dynamic range majority data structures,” Theoretical Comp. Sci., vol. 647, pp. 59–73, 2016.
  • [2] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, “Computing iceberg queries efficiently,” in Proc. VLDB, 1998, pp. 299–310.
  • [3] E. D. Demaine, A. López-Ortiz, and J. I. Munro, “Frequency estimation of internet packet streams with limited space,” in Proc. ESA, 2002, pp. 348–360.
  • [4] J. Misra and D. Gries, “Finding repeated elements,” Sci. Comp. Prog., vol. 2, pp. 143–152, 1982.
  • [5] M. Karpinski and Y. Nekrich, “Searching for frequent colors in rectangles,” in Proc. CCCG, 2008, pp. 11–14.
  • [6] S. Durocher, M. He, J. I. Munro, P. K. Nicholson, and M. Skala, “Range majority in constant time and linear space,” Inf. Comp., vol. 222, pp. 169–179, 2013.
  • [7] T. Gagie, M. He, J. I. Munro, and P. K. Nicholson, “Finding frequent elements in compressed 2d arrays and strings,” in Proc. SPIRE, 2011, pp. 295–300.
  • [8] T. M. Chan, S. Durocher, M. Skala, and B. T. Wilkinson, “Linear-space data structures for range minority query in arrays,” Algorithmica, vol. 72, pp. 901–913, 2015.
  • [9] D. Belazzougui, T. Gagie, J. Ian Munro, G. Navarro, and Y. Nekrich, “Range majorities and minorities in arrays,” CoRR, vol. abs/1606.04495, 2016.
  • [10] J. I. Munro and Y. Nekrich, “Compressed data structures for dynamic sequences,” in Proc. ESA, 2015, pp. 891–902.
  • [11] R. Raman, V. Raman, and S. S. Rao, “Succinct dynamic data structures,” in Proc. WADS, 2001, pp. 426–437.
  • [12] L. Arge and J. S. Vitter, “Optimal external memory interval management,” SIAM J. Comp., vol. 32, pp. 1488–1508, 2003.