Biased Range Trees
A data structure, called a biased range tree, is presented that preprocesses a set $S$ of $n$ points in $\mathbb{R}^2$ and a query distribution $D$ for 2-sided orthogonal range counting queries. The expected query time for this data structure, when queries are drawn according to $D$, matches, to within a constant factor, that of the optimal decision tree for $S$ and $D$. The memory and preprocessing requirements of the data structure are $O(n\log n)$.
Vida Dujmović, John Howat, and Pat Morin
Let $S$ be a set of $n$ points in $\mathbb{R}^2$ and let $D$ be a probability measure over $\mathbb{R}^2$. A 2-sided orthogonal range counting query over $S$ asks, for a query point $q=(q_x,q_y)$, to report the number of points $(p_x,p_y)\in S$ such that $p_x\ge q_x$ and $p_y\ge q_y$. A 2-sided range counting query has distribution $D$ if the query point $q$ is chosen from the probability measure $D$. If $T$ is a data structure for answering 2-sided range counting queries over $S$ then we denote by $\mu_D(T)$ the expected time, using $T$, to answer a range query with distribution $D$. The current paper is concerned with preprocessing the pair $(S,D)$ to build a data structure $T$ that minimizes $\mu_D(T)$.
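As a point of reference, the query itself can be sketched by brute force. The sketch below assumes that the query range is the closed quadrant above and to the right of the query point, which is consistent with the catalogues of Section 3.3 storing points "above" and "to the right of" a region. The name `count_two_sided` is hypothetical, and the linear scan is only a baseline, not the data structure of this paper.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def count_two_sided(points: Iterable[Point], q: Point) -> int:
    """Count the points (px, py) with px >= qx and py >= qy by a linear scan."""
    qx, qy = q
    return sum(1 for (px, py) in points if px >= qx and py >= qy)


# A tiny example point set (illustrative data only).
S = [(1, 1), (2, 3), (3, 2), (0, 5)]
print(count_two_sided(S, (1, 2)))  # counts (2, 3) and (3, 2)
```

Any correct structure for the problem must agree with this scan on every query point; the interest lies in how few comparisons are needed on average.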
1.1 Previous Work
The general topic of geometric range queries is a field that has seen an enormous amount of activity in the last century. Results in this field depend heavily on the types of objects the data structure stores and on the shape of the query ranges. In this section we only mention a few data structures for orthogonal range counting and semigroup queries in 2 dimensions. The interested reader is directed to the excellent, and easily accessible, survey by Agarwal and Erickson [9].
Orthogonal range counting is a classic problem in computational geometry. The 2- (and 3- and 4-) sided range counting problem can be solved by Bentley's range trees [3]. Range trees use $O(n\log n)$ space and can be constructed in $O(n\log n)$ time. Originally, range trees answered queries in $O(\log^2 n)$ time. However, with the application of fractional cascading [6, 11] the query time can be reduced to $O(\log n)$ without increasing the space requirement by more than a constant factor. Range trees can also answer more general semigroup queries in which each point of $S$ is assigned a weight from a commutative semigroup and the goal is to report the weight of all points in the query range [10, 15].
For 2-sided orthogonal range counting queries, Chazelle [4, 5] proposes a data structure of size $O(n)$, that can be constructed in $O(n\log n)$ time, and that can answer range counting queries in $O(\log n)$ time. Unfortunately, this data structure is not capable of answering semigroup queries in the same time bound. For semigroup queries, Chazelle provides data structures with the following requirements: (1) space and query time, (2) space and query time, and (3) space and query time.
Practical linear space data structures for range counting include $k$-d trees [2], quad-trees [13], and their variants. These structures are practical in the sense that they are easy to implement and use only $O(n)$ space. Unfortunately, neither of these structures has a worst-case query time of . Thus, in terms of query time, $k$-d trees and quad-trees are nowhere near competitive with range trees.
Despite the long history of data structures for orthogonal range queries, range trees with fractional cascading are still the most effective data structure for 2-sided orthogonal range queries in the semigroup model. In particular, no data structure is currently known that uses space and can answer 2-sided orthogonal range queries in time.
1.2 New Results
In the current paper we present a data structure, the biased range tree, for 2-sided orthogonal range counting. Biased range trees fit into the comparison tree model of computation, in which all decisions made during a query are based on the result of comparing either the $x$- or $y$-coordinate of the query point to some precomputed values. Most data structures for orthogonal range searching, including range trees, $k$-d trees and quadtrees, fit into the comparison tree model. This model makes no assumptions about the $x$- or $y$-coordinates of points other than that they each come from some (possibly different) total order. This is particularly useful in practice since it avoids the precision problems usually associated with algebraic decisions and allows the mixing of different data types (one for $x$-coordinates and one for $y$-coordinates) in one data structure.
A biased range tree has size $O(n\log n)$, can be constructed in $O(n\log n)$ time, and can answer range counting (or semigroup) queries in $O(\mu_D(T^*))$ expected time, where $T^*$ is any comparison tree that answers range counting queries over $S$. In particular, $T^*$ could be a comparison tree that minimizes $\mu_D(T^*)$, implying that the expected query time of our data structure is as fast as the fastest comparison-based data structure for answering range counting queries over $S$. Moreover, the worst-case search time of biased range trees is $O(\log n)$, matching the worst-case performance of range trees.
Note that we do not place any restrictions on the comparison tree $T^*$. Biased range trees, while requiring only $O(n\log n)$ space, are competitive with any comparison-based data structure. Thus, the memory requirement of biased range trees is the same as that of range trees but their expected query time can never be any worse.
The remainder of the paper is organized as follows. In Section 2 we present background material that is used in subsequent sections. In Section 3 we define biased range trees. In Section 4 we prove that biased range trees are optimal. In Section 5 we recap, summarize, and describe directions for future work.
In this section we give definitions, notations, and background that are prerequisites for subsequent sections.
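For the purposes of the current paper, a rectangle is defined as
$$R(a,b,c,d) = \{(x,y) : a \le x \le b \text{ and } c \le y \le d\}.$$
We also allow unbounded rectangles by setting $a, c = -\infty$ and/or $b, d = \infty$. Therefore, under this definition, rectangles can have 0, 1, 2, 3, or 4 sides. For a query point $q=(q_x,q_y)$ we denote by $R(q)$ the query range $R(q_x,\infty,q_y,\infty)$. A horizontal strip is a rectangle of the form $R(-\infty,\infty,c,d)$ and a vertical strip is a rectangle of the form $R(a,b,-\infty,\infty)$.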
Classification Problems and Classification Trees.
A classification problem over a domain is a function . The special case in which is called a decision problem. A $d$-ary classification tree is a full $d$-ary tree (a rooted ordered tree in which each non-leaf node has exactly $d$ children) in which each internal node is labelled with a function and for which each leaf is labelled with a value in . The search path of an input in a classification tree starts at the root of and, at each internal node , evaluates and proceeds to the th child of . We denote by the label of the final (leaf) node in the search path for . We say that the classification tree solves the classification problem over the domain if, for every , .
The particular type of classification trees we are concerned with are comparison trees. These are binary classification trees in which the function at each node compares either $q_x$ or $q_y$ to a fixed value (that may depend on the point set $S$ and the distribution $D$). For the problem of 2-sided range counting over $S$, the leaves of $T$ are labelled with values in $\{0,\ldots,n\}$ and $T(q)$ equals the number of points of $S$ in the query range of $q$ for all $q\in\mathbb{R}^2$.
For a probability measure $D$ and an event $X$, we denote by $D_{|X}$ the distribution $D$ conditioned on $X$. That is, the distribution where the probability of an event $Y$ is $\Pr(Y \mid X) = \Pr(Y \cap X)/\Pr(X)$. The probability measures used in this paper are usually defined over $\mathbb{R}^2$. We make no assumptions about how these measures are represented, but we assume that an algorithm can, in constant time, given a rectangle $R$, determine $\Pr(R)$.
For a classification tree $T$ that solves a problem $P$ and a probability measure $D$ over the domain of $P$, the expected search time of $T$, denoted by $\mu_D(T)$, is the expected length of the search path for $x$ when $x$ is drawn at random from the domain according to $D$. Note that, for each leaf $v$ of $T$ there is a maximal subset $r(v)$ of the domain such that the search path for any $x\in r(v)$ ends at $v$. Thus, the expected search time of $T$ (under distribution $D$) can be written as
$$\mu_D(T) = \sum_{v\in L(T)} \Pr(r(v)) \cdot \operatorname{depth}(v),$$
where $L(T)$ denotes the leaves of $T$ and $\operatorname{depth}(v)$ denotes the length of the path from the root of $T$ to $v$. When the tree is obvious based on context we will sometimes use the notation to denote . Note that, for comparison trees, the closure of $r(v)$ is always a rectangle. For a node $v$ in a tree, we will use the phrases depth of $v$ and level of $v$ interchangeably and they both refer to $\operatorname{depth}(v)$.
The following theorem is a restatement of (half of) Shannon’s Fundamental Theorem for a Noiseless Channel [14, Theorem 9].
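Theorem 1. Let $P$ be a classification problem with values in $\{0,\ldots,k-1\}$ and let $x$ be selected from a distribution $D$ such that $\Pr\{P(x)=i\}=p_i$, for $0\le i<k$. Then, any $d$-ary classification tree $T$ that solves $P$ has
$$\mu_D(T) \ge \sum_{i=0}^{k-1} p_i \log_d(1/p_i).$$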
In terms of range counting, Theorem 1 immediately implies that, if $p_i$ is the probability that the query range contains $i$ points of $S$, then any binary decision tree $T$ that does range counting has $\mu_D(T) \ge \sum_{i=0}^{n} p_i\log_2(1/p_i)$. Unfortunately for us, this lower bound is too weak and, in general, there is no decision tree whose performance matches this obvious entropy lower bound.
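The entropy bound on the answer distribution is easy to evaluate for a finite query workload. The following sketch assumes that the distribution is given as a list of weighted query points (a simplification; the paper only assumes that rectangle probabilities can be evaluated) and that query ranges are quadrants above and to the right of the query point. The function names `entropy` and `count_entropy` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def count_entropy(points, weighted_queries):
    """Entropy of the answer (the count) over a weighted query workload.

    weighted_queries is a list of ((qx, qy), probability) pairs whose
    probabilities are assumed to sum to 1.
    """
    mass = Counter()
    for (qx, qy), w in weighted_queries:
        count = sum(1 for (px, py) in points if px >= qx and py >= qy)
        mass[count] += w
    return entropy(mass.values())
```

For a single point and two equally likely queries, one that contains the point and one that does not, the bound evaluates to one bit, matching the single comparison that is clearly necessary.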
A stronger lower bound on the cost of range searching can be obtained by considering the arrangement of rays obtained by drawing two rays originating at each point of $S$, one to the left and one downwards (see Figure 1.a). This arrangement partitions the plane into a set $F$ of faces. If $T$ is a comparison tree for range counting in $S$, then there is no leaf $v$ of $T$ such that the interior of $r(v)$ intersects any edge of the arrangement, since otherwise there are query points $q$ in the neighbourhood of this intersection for which $T$ returns an incorrect count. Therefore, by relabelling the leaves of $T$ with the faces of the arrangement, we obtain a data structure for determining which face contains the query point $q$. By Theorem 1, this implies that
$$\mu_D(T) \ge \sum_{f\in F} \Pr(f)\log_2(1/\Pr(f)).$$
Unfortunately, this bound is still not strong enough and, in general, there is no decision tree that matches this lower bound. To see this, consider Figure 1.b, when the query point is uniformly distributed among the shaded circles. In this case, is always in the same face of so the lower bound given above is 0. Nevertheless, it is not hard to see that the leaves of any decision tree for range searching in can be relabelled to determine which of the circles contains , so .
Biased Search Trees.
Biased search trees are a classic data structure for solving the following 1-dimensional problem: Given an increasing sequence of real numbers $-\infty = x_0 < x_1 < \cdots < x_n < x_{n+1} = \infty$ and a probability distribution $D$ over $\mathbb{R}$, construct a binary search tree $T$ so that, for any query value $x$ drawn from $D$, one can quickly find the unique interval $[x_i, x_{i+1})$ containing $x$. If $p_i$ is the probability that $x\in[x_i,x_{i+1})$ then the expected number of comparisons performed while searching for $x$ is given by
$$\mu_D(T) \le \sum_{i=0}^{n} p_i\log_2(1/p_i) + O(1)$$
and the tree can be constructed in time $O(n)$. Clearly, by Theorem 1, the query time of this binary search tree is optimal up to an additive constant term. Note that, by having each node of $T$ store the size of its subtree, a biased search tree can count the number of elements of the sequence in the interval $[x,\infty)$ without increasing the search time by more than a constant factor. Thus, biased search trees are an optimal data structure for 1-dimensional range counting.
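To make the idea concrete, the following sketch builds a search tree over the intervals between sorted keys by repeatedly choosing the key that splits the remaining probability mass as evenly as possible. This weight-bisection rule is a simple heuristic in the spirit of nearly optimal binary search trees; it is an assumption of this sketch and not necessarily the construction the paper relies on, and no attempt is made to match the construction time stated above. The names `make_tree`, `build`, and `search` are hypothetical.

```python
import bisect
from itertools import accumulate


def build(prefix, lo, hi):
    """Build a tree over intervals lo..hi (inclusive); a leaf is an int."""
    if lo == hi:
        return lo
    target = (prefix[lo] + prefix[hi + 1]) / 2.0
    # First index j in [lo + 1, hi + 1] with prefix[j] >= target.
    j = bisect.bisect_left(prefix, target, lo + 1, hi + 1)
    j = min(j, hi)  # keep at least one interval on the right side
    if j > lo + 1 and target - prefix[j - 1] <= prefix[j] - target:
        j -= 1  # the previous cut is at least as close to the half-way mass
    k = j - 1  # split key: intervals lo..k go left, k+1..hi go right
    return (k, build(prefix, lo, k), build(prefix, k + 1, hi))


def make_tree(weights):
    """weights[i] is the (possibly unnormalised) mass of interval i."""
    prefix = [0.0] + list(accumulate(weights))
    return build(prefix, 0, len(weights) - 1)


def search(tree, keys, q):
    """Return (interval index, number of comparisons) for query value q.

    Interval i contains q exactly when i of the keys are <= q.
    """
    comparisons = 0
    while not isinstance(tree, int):
        k, left, right = tree
        comparisons += 1
        tree = left if q < keys[k] else right
    return tree, comparisons


keys = [10, 20, 30, 40]
tree = make_tree([1, 1, 12, 1, 2])  # interval 2, i.e. [20, 30), is heavy
```

With these weights the heavy middle interval is reached after two comparisons, while the light leftmost intervals need three. Counting the keys at or above a query value then only requires the interval index, since the number of keys greater than $q$ is `len(keys)` minus that index.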
3 Biased Range Trees
In this section we describe the biased range tree data structure, which has three main parts: the backup tree, the primary tree, and a set of catalogues that adorn the nodes of the primary tree.
3.1 The Backup Tree
In trying to achieve optimal query time, biased range trees will try to quickly answer queries that are, in some sense, easy. In some cases, a query is difficult and it cannot be answered in $o(\log n)$ time. For these queries, a backup range tree that stores the points of $S$ and can answer any 2-sided range query in $O(\log n)$ worst-case time is used. The preprocessing time and space requirements of this backup tree are $O(n\log n)$.
3.2 The Primary Tree
Like a range tree, a biased range tree is an augmented data structure consisting of a primary tree whose nodes store secondary structures. However, in a range tree the primary tree is a binary search tree that discriminates based only on the $x$-coordinate of the query point $q$. In order to achieve optimal expected query time, this turns out to be insufficient, so instead biased range trees use a variation of a $k$-d tree as the primary tree.
The primary tree $T$ is constructed in a top-down fashion. Each node $v$ of $T$ is associated with a region $r(v)$ whose closure is a rectangle. The region associated with the root of $T$ is all of $\mathbb{R}^2$. We say that a node $v$ is bad if its depth is at least and . A node $v$ is split if its depth is less than , and . The two children of a split node $v$ are associated with the two regions obtained by removing a vertical or horizontal strip $s(v)$ from $r(v)$ depending on whether the depth of $v$ is even or odd, respectively. We call a node $v$ at even distance from the root a vertical node; otherwise we call $v$ a horizontal node.
Refer to Figure 2. For a vertical node , we denote its children by and and call them the left child and right child of , depending on which side of the vertical strip (left or right) they are. For uniformity, we will also call the children of a node that is split with a horizontal strip and . The child below the strip is denoted by and the child above the strip is denoted by . Similarly, the left and right boundaries of a strip at a horizontal node refer to the bottom and top sides of . Note that, with these conventions, if the query point is in then intersects . However, if then does not intersect . Similarly, for a query point , the query range intersects but not
All that remains is to define the strip for each node . If is a leaf then we use the convention that . If is not a leaf then is selected as a maximal strip containing no point of in its interior, that is closed on its right side and open on its left side and such that each of the at most two components of has probability at most . Suppose is a vertical node. Then let , be a partitioning of into strips, in left-to-right order, obtained by drawing a vertical line through each of the points in . We use the convention that each strip is closed on its right side and open on its left side. Then there is a unique strip such that and . For a horizontal node , the definition of is analogous except we use horizontal lines through each point of .
Note that for a node that is not a leaf, we use the convention that contains its right side but not its left side and that and are the two components of . This implies that and/or may be empty, in which case , respectively, is a leaf of . With these definitions, for any point there is exactly one vertex of such that .
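The selection of the strip at a vertical node can be sketched as a weighted-median computation. The sketch below makes several assumptions that are not stated in this form in the text: the distribution restricted to the node's region is modelled by a finite list of weighted query $x$-coordinates, the node's region has $x$-extent $(a, b]$, and "balanced" means that the mass on each side of the chosen strip is at most half the mass of the region. The function name `choose_vertical_strip` is hypothetical.

```python
import bisect


def choose_vertical_strip(xs, a, b, query_xs, query_ws):
    """Pick the strip (lo, hi] between consecutive point x-coordinates
    such that the query mass on each side is at most half the total.

    xs        x-coordinates of the points of S inside the region
    (a, b]    x-extent of the region
    query_xs  x-coordinates of sample queries inside the region
    query_ws  their weights (probabilities)
    Returns (lo, hi, mass_left_of_strip, mass_right_of_strip).
    """
    cuts = sorted(set(x for x in xs if a < x < b))
    bounds = [a] + cuts + [b]
    mass = [0.0] * (len(bounds) - 1)
    for x, w in zip(query_xs, query_ws):
        # Strip i is (bounds[i], bounds[i + 1]]: it holds x exactly when
        # i of the cuts are strictly smaller than x.
        mass[bisect.bisect_left(cuts, x)] += w
    total = sum(mass)
    running = 0.0
    for i, m in enumerate(mass):
        if running + m >= total / 2.0:
            return bounds[i], bounds[i + 1], running, total - running - m
        running += m
```

Taking the first strip whose cumulative mass reaches one half guarantees that the mass strictly to its left is below one half and the mass to its right is at most one half, so both children lose at least half of the region's probability. The horizontal case is symmetric, using $y$-coordinates.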
The following two properties are easily derived from the definition of and are necessary to prove the optimality of biased range trees:
Any node at depth in has .
For any node of , if , then the closure of contains at least one point of .
Point 1 above follows immediately from the definition of . Next we explain the logic leading to Point 2. If contains a point of then so does the closure of . If , then . Otherwise, and has no point of in its interior. Then consider the parent of . Since does not contain there must be a point of on the boundary of that is also on the boundary of . Therefore contains this point in its closure.
3.3 The Catalogues
The nodes of the tree are augmented with additional data structures called catalogues that hold subsets of $S$. Each node has two catalogues, and that store subsets of $S$ sorted by their $x$-, respectively $y$-, coordinate. Intuitively, stores points that are “above” and stores points that are “to the right of” . (Refer to Figure 3.) More precisely, if is a horizontal node, then and . If is a vertical node, then and . For any node that is the root of or a right child of its parent, .
Consider any node that is not a bad leaf and any point . If has a left child then let , otherwise, let . Let denote the path from to the root of (see Figure 4). Then the catalogues of have the following properties:
The points in the catalogues of are above or to the right of . That is, for each , all points in , respectively, have their -, respectively, -, coordinate greater than or equal to , respectively, .
All catalogues at nodes in are disjoint. That is, for each , , , , and .
The catalogues at nodes contain all points in the query range . That is,
Note that points 1, 2 and 3 above imply that determining can be done by solving a sequence of 1-sided range queries in the $x$- and $y$-catalogues of . However, performing these queries individually would take too long.
To speed up the process of navigating the catalogues of , fractional cascading [6, 11] is used. Starting at the root of and as long as is not a leaf, a fraction of the data in is cascaded into and . As well, a fraction of the data in is cascaded into both and . Note that this cascading is done only to speed up navigation between the catalogues of . Although fractional cascading introduces extra data into the catalogues of we will continue to use the notations and to denote the set of points contained in the catalogues of before fractional cascading takes place.
Finally, each catalogue and is indexed by a biased binary search tree , respectively, . If is the left child of its parent, then the weight of an interval in , respectively, is given by the probability that , respectively, , is in the interval when is drawn according to the distribution . Otherwise ( is not a left child), the weight of an interval is determined by the distribution .
3.4 Construction Time and Space Requirements
The biased range tree data structure is now completely defined. The structure consists of a backup tree, a primary tree, and the catalogues of the primary tree. We now analyze the construction time and space requirements of biased range trees.
The backup tree has size $O(n\log n)$ and can be constructed in $O(n\log n)$ time [8, Theorem 5.11]. To construct the primary tree quickly we presort the points of $S$ by their $x$- and $y$-coordinates. Since the primary tree has height $O(\log n)$, it is then easily constructed in $O(n\log n)$ time. Ignoring any copies of points created by fractional cascading, each point in $S$ occurs in at most 2 catalogues at each level of the primary tree. Thus, the total size of all catalogues (before fractional cascading) is $O(n\log n)$ and these catalogues can be constructed in $O(n\log n)$ time (because the elements of $S$ are presorted; see de Berg et al. [8, Section 5.3] for details). The fractional cascading between catalogues does not increase the size of catalogues by more than a constant factor since each catalogue is cascaded into only a constant number of other catalogues.
In summary, given the point set $S$ and access to the distribution $D$, a biased range tree for $(S,D)$ can be constructed in $O(n\log n)$ time and requires $O(n\log n)$ space.
3.5 The Query Algorithm
The algorithm to answer a 2-sided range query proceeds in three steps:
The algorithm navigates the tree from top to bottom to locate the unique node such that . This step takes time, where is the depth of the node . If is a bad leaf (so ) then the algorithm performs a range query in time using the backup range tree and the query algorithm does not execute the next two steps.
If has a left child then let , otherwise let . The algorithm uses and to locate and , respectively, in the catalogues and , respectively.
The algorithm walks back from to the root of , locating in the catalogues of all nodes on this path and computing the results of the range counting query as it goes. Thanks to fractional cascading, each step of this walk can be done in constant time, so the overall time for this step is also .
Observe that Steps 1 and 3 of the query algorithm each take time. The time needed to accomplish Step 2 of the algorithm depends on exactly what is in the catalogues and , and will be the first quantity we study in the next section.
4 Optimality of Biased Range Trees
In this section we show that the expected query time of biased range trees is as good as the expected query time of any comparison tree. The expected query time has two components. The first component is the expected depth, , of the node such that contains . The second component is the expected cost of locating in the catalogues of (recall that or if has no left child). We will show that each of these two components is a lower bound on the expected cost of any decision tree for two-sided range searching on where queries come from distribution . In order to simplify notation in this section we will use the convention is the probability that a search terminates at node of .
4.1 The Catalogue Location Step
First we show that the expected cost of locating in the two catalogues, and is a lower bound on the expected cost of any decision tree for answering 2-sided range queries in $S$. The intuition behind this proof is that, in order to correctly answer range counting queries, any decision tree for range counting must locate the $x$-coordinate of $q$ with respect to the $x$-coordinates of all points above $q$. Similarly, it must locate the $y$-coordinate of $q$ with respect to the $y$-coordinates of all points to the right of $q$. The structure of the catalogues ensures that biased range trees do this in the most efficient manner possible.
Let be a set of points and let be a probability measure over . Let be any decision tree for 2-sided range counting in and let denote the expected cost of locating in Step 2 of the biased range tree query algorithm on the biased range tree . Then
We first observe that, by definition,
Consider some node of . For a point , all of the points in are points that may or may not be in the query range depending on where exactly is located within . This implies that, if correctly answers range queries for every point then it must determine the location of the $x$-coordinate of $q$ with respect to all points in . More precisely, the leaves of could be relabelled to obtain a comparison tree that determines, for any , which interval of contains . Since is a biased search tree for the probability measure , this implies that
Similarly, the same argument applied to yields
We can now complete the proof with
4.2 The Tree Searching Step
Next we bound the expected depth of the node of such that . We do this by showing that any decision tree for range counting in must solve a set of point location problems and that the expected depth of is a lower bound on the complexity of solving these problems.
We say that a set of rectangles is HV-independent if no horizontal or vertical line intersects more than one rectangle in the set. We say that a set of nodes in is HV-independent if the set is HV-independent.
Lemma 2. Let $S$ be a set of $n$ points and let $D$ be a probability measure over $\mathbb{R}^2$. Let $T$ be the biased range tree for $(S,D)$ and label each node of $T$ white or black, such that all white nodes are at distance at most from the root of $T$. Then, if $T$ contains more than white nodes then $T$ contains an HV-independent set of white nodes of size .
Proof. Define a graph whose vertices are the white nodes of $T$ and for which two vertices are adjacent if and only if there is a horizontal or vertical line that intersects both of their regions. Note that an independent set of vertices in this graph is an HV-independent set of white nodes in $T$. Thus, it suffices to find a sufficiently large independent set in this graph.
A well-known result on $k$-d trees states that, for a $k$-d tree of height $h$, any horizontal or vertical line intersects at most $O(2^{h/2})$ rectangles of the $k$-d tree [8, Lemma 5.4]. Therefore, since $T$ is a $k$-d tree (although $T$ is not exactly a $k$-d tree as described in Reference [8], the proof found there still holds), the number of edges in the graph is at most . This implies that the graph has a vertex of degree at most and this is also true of any vertex-induced subgraph of it.
We can therefore obtain an independent set in the graph by repeatedly selecting a vertex of degree , adding it to the independent set and deleting it and its neighbours from the graph. Since, at each step, we add one vertex to the independent set and delete at most vertices from the graph, this produces an independent set of size , as required. ∎
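The greedy argument in the proof can be sketched directly. The code below assumes the conflict graph is given as an adjacency dictionary, and it always picks a vertex of minimum remaining degree, which in particular satisfies any upper bound on the degree that is guaranteed to hold in every induced subgraph. The name `greedy_independent_set` is hypothetical.

```python
def greedy_independent_set(adj):
    """Greedy independent set of an undirected graph.

    adj maps each vertex to the set of its neighbours. Repeatedly take a
    vertex of minimum remaining degree, add it to the result, and delete
    it together with its neighbours.
    """
    live = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    chosen = []
    while live:
        v = min(live, key=lambda u: len(live[u]))
        chosen.append(v)
        removed = live[v] | {v}
        for u in removed:
            live.pop(u, None)
        for ns in live.values():
            ns -= removed  # keep remaining degrees up to date
    return chosen


# A path on five vertices: 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(greedy_independent_set(path))
```

If every induced subgraph has a vertex of degree at most $d$, each round removes at most $d+1$ vertices, so a graph on $m$ vertices yields an independent set of size at least $m/(d+1)$, which is the counting step used in the proof.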
We can now provide the second piece of the lower bound.
Let be a set of points and let be a probability measure over . Let be any comparison tree that does range counting over . Let denote the expected depth of the node of the biased range tree such that . Then
Partition the nodes of into groups where contains all nodes such that . Observe that the nodes in group occur in the first levels of . Select constants and with and define . By repeatedly applying Lemma 2, each group can be partitioned into groups where, for each , is an HV-independent set with . Furthermore, . (Note that is not necessarily HV-independent.)
Consider some group for . Let be a leaf of and observe that, because the nodes in are independent and each one contains at least one point of in its closure, there are at most 4 nodes in such that intersects the closure of . (Otherwise contains a point of in its interior and therefore does not solve the range counting problem for .) Thus, by performing 2 additional comparisons, can be used to determine which node of (if any) contains the query point in . However, contains nodes and the search path for terminates at each of these with probability between and . Therefore, if we denote by the distribution conditioned on the search path for terminating in one of the nodes in then we have, by applying Theorem 1,
Putting this all together, we obtain
where the last inequality follows from the fact that . ∎
To get some idea of the constants involved in the proof of Lemma 3, we can select , so that and and the term is approximately 20. Thus, for this choice of parameters, the depth in is competitive with to within a factor of and an additive constant of 20. Alternatively, selecting gives a constant factor less than 3 and an additive term of approximately 90.
And now the main event:
Let be a set of points and let be a probability measure over . Let be the biased range tree for and and let be any decision tree that answers range counting queries for . Then
5 Summary, Discussion, and Conclusions
We have presented biased range trees, an optimal data structure for 2-sided orthogonal range counting queries when the point set and query distribution are known in advance. The expected time required to answer queries with a biased range tree, when the queries are distributed according to $D$, is within a constant factor of that of any decision tree for answering range queries over $S$. Like standard range trees, biased range trees use $O(n\log n)$ space and can also answer semigroup queries [10, 15]; this follows from Properties 1–3 of the catalogues in Section 3.3. Although the analysis of biased range trees is complicated, their implementation is not much more complicated than that of standard range trees.
As a small optimization, the backup range tree data structure can be eliminated from biased range trees. Instead, once the probability of a node drops below the node can be split by ignoring the distribution and simply splitting the points of into two sets of roughly equal size. This results in a tree of depth at most .
This work is just one of many possible results on distribution-sensitive range searching. Several open problems immediately arise.
Open Problem 1.
Are there efficient distribution-sensitive data structures for 3-sided and 4-sided orthogonal range counting queries?
Note that a 4-sided orthogonal range counting query can be reduced to four 2-sided orthogonal range counting queries using the principle of inclusion-exclusion. Unfortunately, this reduction does not produce an optimal distribution-sensitive data structure. To see this, consider 4-sided queries consisting of unit squares whose bottom left corner is uniformly distributed in the shaded region of Figure 5. All such queries contain no points in the query region and all such queries can be answered in time by simply checking that all four corners of the square are to the left of the point set. However, when we decompose these queries into four 2-sided queries we obtain 2-sided queries that require time to be answered.
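The inclusion-exclusion reduction mentioned above can be written out as follows. The sketch assumes closed query rectangles and 2-sided queries that count points above and to the right of a corner; because the rectangle's upper and right sides are closed, the subtracted quadrants must use strict inequalities, which a comparison-based structure can support. The brute-force `count_quadrant` stands in for any 2-sided structure, and both function names are hypothetical.

```python
def count_quadrant(points, qx, qy, strict_x=False, strict_y=False):
    """Count points with px >= qx (or > if strict_x) and py >= qy (or >)."""
    def beyond(p, q, strict):
        return p > q if strict else p >= q
    return sum(1 for (px, py) in points
               if beyond(px, qx, strict_x) and beyond(py, qy, strict_y))


def count_rectangle(points, x1, x2, y1, y2):
    """Count points in the closed rectangle [x1, x2] x [y1, y2] using
    four 2-sided queries and inclusion-exclusion."""
    return (count_quadrant(points, x1, y1)
            - count_quadrant(points, x2, y1, strict_x=True)
            - count_quadrant(points, x1, y2, strict_y=True)
            + count_quadrant(points, x2, y2, strict_x=True, strict_y=True))


pts = [(1, 1), (2, 3), (3, 2), (0, 5), (2, 2)]
print(count_rectangle(pts, 1, 2, 1, 3))  # (1, 1), (2, 3) and (2, 2)
```

The reduction is correct but, as the example in the text shows, each of the four 2-sided queries may be expensive even when the 4-sided query itself is easy.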
Open Problem 2.
Biased range trees require that the point set and the distribution be known in advance. Is there a self-adapting version of biased range trees that, without knowing in advance, can answer queries, each drawn independently from in expected time?
Open Problem 3.
Determine the worst-case or the average case constants associated with 2-dimensional orthogonal range searching for comparison-based data structures. By applying the result of Adamy and Seidel [1] on point location to the arrangement described in Section 2 one immediately obtains an space data structure that answers queries using at most comparisons. Is there an space structure with the same performance?
Open Problem 4.
A point is maximal with respect to if no point of has every coordinate larger than the corresponding coordinate of . For , is there a distribution-sensitive data structure for testing if a query point is maximal? For point sets in 2 dimensions, an orthogonal variant of the point-location techniques of Collette et al. [7] seems to apply.
Open Problem 5.
Are there distribution-sensitive data structures for -sided range search in point sets in ? The current fastest structures for range search in point sets in that use near-linear space have query time. Is there a structure that uses near-linear space and is optimal when the point set and the distribution are known in advance?
References
[1] U. Adamy and R. Seidel. On the exact worst case query complexity of planar point location. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 609–618, 1998.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.
[3] J. L. Bentley. Multidimensional divide-and-conquer. Communications of the ACM, 23:214–229, 1980.
[4] B. Chazelle. Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15:703–724, 1986.
[5] B. Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing, 17:427–462, 1988.
[6] B. Chazelle and L. J. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1:133–162, 1986.
[7] S. Collette, V. Dujmović, J. Iacono, S. Langerman, and P. Morin. Distribution-sensitive point location in convex subdivisions. In Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), 2008. Submitted to SIAM Journal on Computing, August 2007.
[8] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, Heidelberg, 1997.
[9] J. Erickson and P. K. Agarwal. Geometric range searching and its relatives. In B. Chazelle, J. E. Goodman, and R. Pollack, editors, Advances in Discrete and Computational Geometry, volume 223 of Contemporary Mathematics, pages 1–56. American Mathematical Society Press, 1999.
[10] M. L. Fredman. A lower bound on the complexity of orthogonal range queries. Journal of the ACM, 28:696–705, 1981.
[11] G. S. Lueker. A data structure for orthogonal range queries. In Proceedings of the 19th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 28–34, 1978.
[12] K. Mehlhorn. Nearly optimal binary search trees. Acta Informatica, 5:287–295, 1975.
[13] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
[14] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, pages 379–423 and 623–656, 1948.
[15] A. C. Yao. On the complexity of maintaining partial sums. SIAM Journal on Computing, 14:277–288, 1985.