I/O-Efficient Planar Range Skyline and Attrition Priority Queues

(This is the full version of our PODS 2013 paper with the same title.)

Casper Kejlberg-Rasmussen, Yufei Tao, Konstantinos Tsakalidis, Kostas Tsichlas, Jeonghun Yoon
MADALGO, Aarhus University; Chinese University of Hong Kong; Korea Advanced Institute of Science and Technology; Hong Kong University of Science and Technology; Aristotle University of Thessaloniki
MADALGO is the Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation.
Abstract

In the planar range skyline reporting problem, the goal is to store a set of 2D points in a structure such that, given a query rectangle , the maxima (a.k.a. skyline) of can be reported efficiently. The query is 3-sided if an edge of is grounded, giving rise to two variants: top-open and left-open (symmetrically, bottom-open and right-open) queries.

This paper presents comprehensive results in external memory under the space budget ( is the block size), covering both the static and dynamic settings:

  • For static , we give structures that answer top-open queries in , , and I/Os when the universe is , a grid, and a rank space grid , respectively (where is the number of reported points). The query complexity is optimal in all cases.

  • We show that the left-open case is harder: any linear-size structure must incur I/Os to answer a query. In fact, this case turns out to be just as difficult as general 4-sided queries, for which we provide a static structure with the optimal query cost .

  • We present a dynamic structure that supports top-open queries in I/Os, and updates in I/Os, for any satisfying . This result also leads to a dynamic structure for 4-sided queries with optimal query cost , and amortized update cost .

As a contribution of independent interest, we propose an I/O-efficient version of the fundamental structure priority queue with attrition (PQA). Our PQA supports FindMin, DeleteMin, and InsertAndAttrite all in worst case I/Os, and amortized I/Os per operation. Furthermore, it allows the additional CatenateAndAttrite operation that merges two PQAs in worst case and amortized I/Os. The last operation is a non-trivial extension to the classic PQA of Sundar, even in internal memory.

Skyline, range reporting, priority queues, external memory, data structures

1 Introduction

Given two different points and in , where denotes the real domain, we say that dominates  if and . Let be a set of points in . A point is maximal if it is not dominated by any other point in . The skyline of consists of all maximal points of . Notice that the skyline naturally forms an orthogonal staircase where increasing -coordinates imply decreasing -coordinates. Figure 1a shows an example where the maximal points are in black.
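Since the definition is purely combinatorial, a small reference implementation may help. The following sketch (not one of the structures developed in this paper) computes the skyline of an in-memory point set with a single sweep in descending x-order, assuming general position.

```python
def skyline(points):
    """Return the maximal points of a 2D point set (assumed in general
    position: no two points share an x- or y-coordinate)."""
    best_y = float("-inf")
    result = []
    # Sweep from right to left: a point is maximal iff its y-coordinate
    # exceeds that of every point lying strictly to its right.
    for x, y in sorted(points, reverse=True):
        if y > best_y:
            result.append((x, y))
            best_y = y
    result.reverse()  # report in ascending x-order (hence descending y-order)
    return result

# Example: the staircase of a small set.
print(skyline([(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]))
# -> [(1, 5), (3, 4), (5, 2)]
```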

   
Figure 1: Range skyline queries. (a) Skyline. (b) Range skyline.
Figure 2: Variations of range skyline queries (black points represent the query results): (a) top-open, (b) right-open, (c) bottom-open, (d) left-open, (e) dominance, (f) anti-dominance, (g) contour.

Given an axis-parallel rectangle , a range skyline query (also known as a range maxima query) reports the skyline of . In Figure 1b, for instance,  is the shaded rectangle, and the two black points constitute the query result. When  is a 3-sided rectangle, a range skyline query becomes a top-open, right-open, bottom-open or left-open query, as shown in Figures 2a-2d respectively. A dominance (resp. anti-dominance) query is a 2-sided rectangle with both the top and right (resp. the bottom and left) edges grounded, as shown in Figure 2e (resp. 2f). Another well-studied variation is the contour query, where is a 1-sided rectangle that is the half-plane to the left of a vertical line (Figure 2g).
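As a semantic point of reference for these variants, the brute-force sketch below filters the points inside the query rectangle and keeps the maxima; the 3-sided, 2-sided, and 1-sided variants are obtained by setting the grounded sides to plus/minus infinity. It is only a reference implementation, not an I/O-efficient structure.

```python
import math

def range_skyline(points, x1, x2, y1, y2):
    """Report the skyline of all points inside [x1, x2] x [y1, y2].
    A grounded side of the rectangle is modeled by +/- infinity."""
    inside = [(x, y) for x, y in points if x1 <= x <= x2 and y1 <= y <= y2]
    best_y, result = -math.inf, []
    for x, y in sorted(inside, reverse=True):   # right-to-left sweep
        if y > best_y:
            result.append((x, y))
            best_y = y
    return sorted(result)

pts = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]
print(range_skyline(pts, 2, 5, 1, math.inf))     # top-open query -> [(3, 4), (5, 2)]
print(range_skyline(pts, -math.inf, 4, 1, 3))    # left-open query -> [(2, 3), (4, 1)]
```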

This paper studies linear-size data structures that can answer range skyline queries efficiently, in both the static and dynamic settings. Our analysis focuses on the external memory (EM) model [1], which has become the dominant computation model for studying I/O-efficient algorithms. In this model, a machine has words of memory, and a disk of an unbounded size. The disk is divided into disjoint blocks, each of which is formed by consecutive words. An I/O loads a block of data from the disk to memory, or conversely, writes words from memory to a disk block. The space of a structure equals the number of blocks it occupies, while the cost of an algorithm equals the number of I/Os it performs. CPU time is for free.

By default, the data universe is . Given an integer , represents the set . All the above queries remain well defined in the universe . Particularly, when , the universe is called rank space. In general, for a smaller universe, it may be possible to achieve better query cost under the same space budget. We consider that is in general position, i.e., no two points in have the same - or -coordinate (datasets not in general position can be supported by standard tie breaking). When the universe is , we make the standard assumption that a machine word has at least bits.
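The rank-space reduction and the standard tie-breaking mentioned above can be sketched as follows. This is the generic technique only (the specific encoding used by our structures is developed later); note that dominance relations are preserved exactly when the input is in general position, and for tied coordinates the same perturbation must be applied consistently at query time.

```python
def to_rank_space(points):
    """Map N points to the grid [0, N-1] x [0, N-1] by replacing each
    coordinate with its rank, breaking ties by the point's index."""
    n = len(points)
    x_order = sorted(range(n), key=lambda i: (points[i][0], i))
    y_order = sorted(range(n), key=lambda i: (points[i][1], i))
    x_rank = {i: r for r, i in enumerate(x_order)}
    y_rank = {i: r for r, i in enumerate(y_order)}
    return [(x_rank[i], y_rank[i]) for i in range(n)]

print(to_rank_space([(10.5, 7), (3, 7), (8, 2)]))
# -> [(2, 1), (0, 2), (1, 0)]
```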

1.1 Motivation of 2D Range Skyline

Skylines have drawn very significant attention (see [9, 14, 15, 18, 23, 24, 26, 30, 5, 7, 13, 27, 29, 31, 33, 34, 35] and the references therein) from the research community due to their crucial importance to multi-criteria optimization, which in turn is vital to numerous applications. In particular, the rectangle of a range skyline query represents range predicates specified by a user. An effective index is essential for maximizing the efficiency of these queries in database systems [27, 31].

This paper concentrates on 2D data for several reasons. First, planar range skyline reporting (i.e., our problem) is a classic topic that has been extensively studied in theory [9, 14, 15, 18, 23, 24, 26, 30]. However, nearly all the existing results apply to internal memory (as reviewed in the next subsection), while currently there is little understanding about the characteristics of the problem in I/O environments.

variant (universe) space query insertion deletion remark
top-open in - - optimal
top-open in - - optimal
top-open in - - optimal
anti-dominance in - - lower bound (indexability)
4-sided in - - optimal (indexability)
top-open in for any constant
4-sided in update cost is amortized
Table 1: Summary of our range skyline results (all complexities are in the worst case by default).

The second, more practical, reason is that many skyline applications are inherently 2D. In fact, the special importance of 2D arises from the fact that one often faces the situation of having to strike a balance between a pair of naturally contradicting factors. A prominent example is price vs. quality in product selection. A range skyline query can be used to find the products that are not dominated by others in both aspects, when the price and quality need to fall in specific ranges. Other pairs of naturally contradicting factors include space vs. query time (in choosing data structures), privacy protection vs. disclosed information (the perpetual dilemma in privacy preservation [12]), and so on.

The last reason, and perhaps the most important, is that range skyline reporting clearly cannot become easier as the dimensionality increases, whereas even for two dimensions, we will prove a hardness result showing that the problem (unfortunately) is already difficult enough to forbid sub-polynomial query cost under the linear space budget! In other words, the “easiest” dimensionality of 2 is not so easy after all, which also implies the absence of query-efficient structures in any higher dimension when only linear space is permitted.

1.2 Previous Results

Range Skyline in Internal Memory. We first review the existing results when the dataset fits in main memory. Early research focused on dominance and contour queries, both of which can be solved in time using a structure of size, where is the number of points reported [14, 18, 23, 26, 30]. Brodal and Tsakalidis [9] were the first to discover an optimal dynamic structure for top-open queries, which capture both dominance and contour queries as special cases. Their structure occupies space, answers queries in time, and supports updates in time. The above structures belong to the pointer machine model. Utilizing features of the RAM model, Brodal and Tsakalidis [9] also presented an alternative structure in universe , which uses space, answers queries in time, and can be updated in time. In RAM, the static top-open problem can be easily settled using an RMQ (range minimum queries) structure (see, e.g., [40]), which occupies space and answers queries in time.

For general range skyline queries (i.e., 4-sided), all the known structures demand super-linear space. Specifically, Brodal and Tsakalidis [9] gave a pointer-machine structure of size, query time, and update time. Kalavagattu et al. [24] designed a static RAM-structure that occupies space and achieves query time . In rank space, Das et al. [15] proposed a static RAM-structure with space and query time.

The above results also hold directly in external memory, but they are far from being satisfactory. In particular, all of them incur I/Os to report points. An I/O-efficient structure ought to achieve I/Os for this purpose.

Range Skyline in External Memory. In contrast to internal memory where there exist a large number of results, range skyline queries have not been well studied in external memory. As a naive solution, we can first scan the entire point set to eliminate the points falling outside the query rectangle , and then find the skyline of the remaining points by the fastest skyline algorithm [35] on non-preprocessed input sets. This expensive solution can incur I/Os.

Papadias et al. [31] described a branch-and-bound algorithm when the dataset is indexed by an R-tree [20]. The algorithm is heuristic and cannot guarantee better worst case query I/Os than the naive solution mentioned earlier. Different approaches have been proposed for skyline maintenance in external memory under various assumptions on the updates [37, 39, 31, 22]. The performance of those methods, however, was again evaluated only experimentally on certain “representative” datasets. No I/O-efficient structure exists for answering range skyline queries even in sublinear I/Os under arbitrary updates.

Priority Queues with Attrition (PQAs). Let be a set of elements drawn from an ordered domain, and let be the smallest element in . A PQA on is a data structure that supports the following operations:

  • FindMin: Return .

  • DeleteMin: Remove and return .

  • InsertAndAttrite: Add a new element to the queue and remove from it all elements that are at least as large as the new one; the removed elements are said to be attrited.

In internal memory, Sundar [36] described how to implement a PQA that supports all operations in O(1) worst-case time, using space linear in the number of elements currently stored (i.e., neither deleted nor attrited).
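For intuition, a PQA can be viewed as a list of elements kept in strictly increasing order: attrition removes a suffix of the list. The toy sketch below achieves amortized constant time per operation; it only illustrates the semantics, not Sundar's worst-case-constant-time structure, and not the I/O-efficient version developed later in this paper.

```python
from collections import deque

class SimplePQA:
    """Priority queue with attrition, kept as a strictly increasing sequence."""

    def __init__(self):
        self._q = deque()          # strictly increasing from front to back

    def find_min(self):
        return self._q[0] if self._q else None

    def delete_min(self):
        return self._q.popleft() if self._q else None

    def insert_and_attrite(self, e):
        # Remove (attrite) every stored element that is >= e, then append e.
        # Each element is appended and removed at most once: amortized O(1).
        while self._q and self._q[-1] >= e:
            self._q.pop()
        self._q.append(e)

pqa = SimplePQA()
for e in [5, 3, 8, 7, 2]:
    pqa.insert_and_attrite(e)
print(list(pqa._q))   # -> [2]: inserting 2 attrited everything >= 2
```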

1.3 Our Results

This paper presents external memory structures for solving the planar range skyline reporting problem using only linear space. At the core of one of these structures is a new PQA that supports the extra functionality of catenation. This PQA is a non-trivial extension of Sundar’s version [36]. It can be implemented I/O-efficiently, and is of independent interest due to its fundamental nature. Next, we provide an overview of our results.

Static Range Skyline. When is static, we describe several linear-size structures with the optimal query cost. Our structures also separate the hard variants of the problem from the easy ones.

For top-open queries, we present a structure that answers queries in optimal I/Os (Theorem 1) when the universe is . To obtain the result, we give an elegant reduction of the problem to segment intersection, which can be settled by a partially persistent B-tree (PPB-tree) [6]. Furthermore, we show that this PPB-tree is (what we call) sort-aware build-efficient (SABE), namely, it can be constructed in linear I/Os, provided that is already sorted by -coordinate (Theorem 1). The construction algorithm exploits several intrinsic properties of top-open queries, whereas none of the known approaches [2, 17, 38] for bulkloading a PPB-tree is SABE.

The above structure is indivisible, namely, it treats each coordinate as an atom by always storing it using an entire word. As the second step, we improve the top-open query overhead beyond the logarithmic bound when the data universe is small. Specifically, when the universe is where is an integer, we give a divisible structure with optimal query I/Os (Corollary 1). In the rank space, we further reduce the query cost again optimally to (Theorem 2).

Clearly, top-open queries are equivalent to right-open queries by symmetry, and capture dominance and contour queries as special cases, so the results aforementioned are applicable to those variants immediately.

Unfortunately, fast query cost with linear space is impossible for the remaining variants under the well-known indexability model of [21] (all the structures in this paper belong to this model). Specifically, for anti-dominance queries, we establish a lower bound showing that every linear-size structure must incur I/Os in the worst case (Theorem 5), where can be an arbitrarily small constant. Furthermore, we prove that this is tight, by giving a structure to answer a 4-sided query in I/Os (Theorem 6). Since 4-sided is more general than anti-dominance, these matching lower and upper bounds imply that they, as well as left- and bottom-open queries, have exactly the same difficulty.

The above 4-sided results also reveal a somewhat unexpected fact: planar range skyline reporting has precisely the same hardness as planar range reporting (where, given an axis-parallel rectangle , we want to find all the points in , instead of just the maxima; see [3, 21] for the matching lower and upper bounds on planar range reporting). In other words, the extra skyline requirement does not alter the difficulty at all.

Dynamic Range Skyline. The aforementioned static structures cannot be updated efficiently when insertions and deletions occur in . For top-open queries, we provide an alternative structure with fast worst case update overhead, at a minor expense of query efficiency. Specifically, our structure occupies linear space, is SABE, answers queries in I/Os, and supports updates in I/Os, where can be any parameter satisfying (Theorem 4). Note that setting gives a structure with query cost and update cost .

The combination of this structure and our (static) 4-sided structure leads to a dynamic 4-sided structure that uses linear space, answers queries optimally in I/Os, and supports updates in I/Os amortized (Theorem 6). Table 1 summarizes our structures.

Catenable Priority Queues with Attrition. A central ingredient of our dynamic structures is a new PQA that is more powerful than the traditional version of Sundar [36]. Specifically, besides FindMin, DeleteMin and InsertAndAttrite (already reviewed in Section 1.2), it also supports:

  • CatenateAndAttrite: Given two PQAs on sets and respectively, the operation returns a single PQA on the union of the second set and those elements of the first set that are smaller than the minimum element of the second set. In other words, the elements of the first set that are at least this minimum are attrited.

We are not aware of any previous work that addressed the above operation, which turns out to be rather challenging even in internal memory.

Our structure, named I/O-efficient catenable priority queue with attrition (I/O-CPQA), supports all operations in worst case and amortized I/Os (the amortized bound requires that a constant number of blocks be pinned in main memory, which is a standard and compulsory assumption to achieve amortized update cost of most, if not all, known structures, e.g., the linked list). The space cost is after InsertAndAttrite and CatenateAndAttrite operations, and after DeleteMin operations.
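The catenation semantics just defined can be illustrated on the same toy list representation: the suffix of the first queue that is at least the minimum of the second queue is attrited, and the survivors are followed by the second queue. This sketch is purely a semantic illustration (the concatenation is linear-time); it does not reflect the I/O-CPQA bounds stated above.

```python
from collections import deque

def catenate_and_attrite(q1, q2):
    """Concatenate two attrition priority queues, each represented as a
    strictly increasing deque. Elements of q1 that are >= min(q2) are attrited."""
    if not q2:
        return q1
    while q1 and q1[-1] >= q2[0]:
        q1.pop()                   # attrite the suffix of q1
    q1.extend(q2)                  # survivors of q1, then all of q2
    return q1

print(catenate_and_attrite(deque([1, 4, 9]), deque([3, 5, 6])))
# -> deque([1, 3, 5, 6]): 4 and 9 were attrited by min(q2) = 3
```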

2 SABE Top-Open Structure

In this section, we describe a structure of linear size to answer a top-open query in I/Os. The structure is SABE, namely, it can be constructed in linear I/Os provided that the input set is sorted by -coordinate.

2.1 Reduction to Segment Intersection

We first describe a simple structure by converting top-open range skyline reporting to the segment intersection problem: the input is a set of horizontal segments in ; given a vertical segment , a query reports all the segments of intersecting .

Given a point in , denote by the leftmost point among all the points in dominating . If such a point does not exist, nil. We convert to a horizontal segment as follows. Let . If nil, then ; otherwise, . Define , i.e., the set of segments converted from the points of . See Figure 3a for an example.

   
Figure 3: Reduction. (a) Data conversion. (b) Converted query.

Now, consider a top-open query with rectangle . We answer it by performing segment intersection on . First, obtain as the highest -coordinate of the points in . Then, report all segments in that intersect the vertical segment . An example is shown in Figure 3b.

Lemma 1

The query algorithm is correct.

{proof}

Consider any point and a top-open query with . We show that our algorithm reports if and only if satisfies the query.

If direction: As satisfies the query, we know that , , and . The last fact suggests that (if nil, define ). Hence, intersects the vertical segment , and thus, will be reported by our algorithm.

Only-if direction: Let be a point found by our algorithm, i.e., intersects , where (if does not exist, ). It follows that and .

Next, we prove . Recall that is the -coordinate of the highest point among all the points in . If , then clearly holds. Otherwise, we know , which implies that . This is because if , then dominates , which (because ) contradicts the definition of . Now, follows from .

So far we have shown that is covered by . It remains to prove that is not dominated by any point in . This is true because suggests that the leftmost point in dominating must be outside .

We can find in I/Os with a range-max query on a B-tree indexing the -coordinates in . For retrieving the segments intersecting , we store in a partially persistent B-tree (PPB-tree) [6]. As has segments, the PPB-tree occupies space and answers a segment intersection query in I/Os. We thus have obtained a linear-size top-open structure with query I/Os.
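The reduction can be sanity-checked in memory with the brute-force sketch below. It assumes a top-open query of the form q = [x1, x2] x [y1, +infinity), and it replaces the range-max B-tree and the PPB-tree by plain scans; the leftmost-dominator computation is the naive quadratic one.

```python
import math

def leftmost_dominator(p, points):
    """The leftmost point of the set dominating p, or None if no point does."""
    doms = [q for q in points if q[0] > p[0] and q[1] > p[1]]
    return min(doms) if doms else None

def to_segments(points):
    """Convert every point p into the horizontal segment at height p.y that
    starts at p.x and ends at its leftmost dominator's x (or +infinity)."""
    segs = []
    for p in points:
        r = leftmost_dominator(p, points)
        right = r[0] if r is not None else math.inf
        segs.append((p[0], right, p[1]))       # (x_left, x_right, y)
    return segs

def top_open_via_segments(points, segs, x1, x2, y1):
    """Answer the top-open query [x1, x2] x [y1, +inf) by stabbing the
    segments with a vertical segment on the line x = x2."""
    in_q = [p for p in points if x1 <= p[0] <= x2 and p[1] >= y1]
    if not in_q:
        return []
    y_top = max(p[1] for p in in_q)            # highest y-coordinate inside q
    # A segment must extend strictly past x2, so that its point's leftmost
    # dominator (if any) lies outside q; its height must be in [y1, y_top].
    return sorted((xl, y) for xl, xr, y in segs
                  if xl <= x2 < xr and y1 <= y <= y_top)

def top_open_naive(points, x1, x2, y1):
    in_q = [p for p in points if x1 <= p[0] <= x2 and p[1] >= y1]
    return sorted(p for p in in_q
                  if not any(q[0] > p[0] and q[1] > p[1] for q in in_q))

pts = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]
segs = to_segments(pts)
assert top_open_via_segments(pts, segs, 2, 4, 1) == top_open_naive(pts, 2, 4, 1)
assert top_open_via_segments(pts, segs, 1, 5, 2) == top_open_naive(pts, 1, 5, 2)
print("reduction agrees with the naive answer")
```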

More effort, however, is needed to make the structure SABE. In particular, two challenges are to be overcome. First, we must generate in linear I/Os. Second, the PPB-tree on must be built with asymptotically the same cost (note that the range-max B-tree is already SABE). We will tackle these challenges in the rest of this section.

2.2 Computing

is not an arbitrary set of segments. We observe:

Lemma 2

has the following properties:

  • (Nesting) for any two segments and in , their -intervals are either disjoint, or such that one -interval contains the other.

  • (Monotonic) let be any vertical line, and the set of segments in intersected by . If we sort the segments of in ascending order of their -coordinates, the lengths of their -intervals are non-decreasing.

{proof}

Nesting: Let and be the points such that and . Assume without loss of generality that . Consider first the case . In this scenario, the -interval of must terminate before because dominates . In other words, and have disjoint -intervals.

We now discuss the case . If does not exist, the -interval of is , which clearly encloses that of . Consider, instead, that exists. If has -coordinate smaller than , then and have disjoint -intervals. Otherwise, also dominates , implying that the -interval of contains that of .

Monotonic: Let intersect the -axis at . Consider the contour query with rectangle , which is a special top-open query. By Lemma 1, the left endpoints of the segments in constitute the skyline of . Therefore, if we enumerate the segments of in ascending order of -coordinates, their left endpoints’ -coordinates decrease continuously. It thus follows from the nesting property that their -intervals have increasing lengths.

We are ready to present our algorithm for computing , after has been sorted by -coordinates. Conceptually, we sweep a vertical line from to . At any time, the algorithm (essentially) stores the set of segments in a stack, pushed in descending order of -coordinates (i.e., the segment at the top of the stack has the lowest y-coordinate). Whenever a segment is popped from the stack, its right endpoint is decided, and the segment is output. In general, the segments of are output in non-descending order of their right endpoints' -coordinates.

Specifically, the algorithm starts by pushing the leftmost point of onto the stack. Iteratively, let be the next point fetched from , and the point currently at the top of the stack. If , we know that . Hence, the algorithm pops off the stack, and outputs segment . Then, letting be the point that tops the stack currently, the algorithm checks again whether , and if so, repeats the above steps. This continues until either the stack is empty or . In either case, the iteration finishes by pushing onto the stack. It is clear that the algorithm generates in I/Os.
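The sweep just described can be sketched as follows, with a plain Python list as the stack and the input sorted by x-coordinate; segments whose right endpoint is never fixed are flushed with right endpoint +infinity at the end. Note that the flush pops lower points first, matching the tie-breaking on right endpoints used in the next subsection.

```python
import math

def sweep_segments(points_sorted_by_x):
    """Compute the converted segments (x_left, x_right, y) of Section 2.1 by
    a single left-to-right sweep; segments are output in non-descending order
    of their right endpoints' x-coordinates."""
    stack, out = [], []
    for x, y in points_sorted_by_x:
        # Pop every point that the new point dominates: the new point is its
        # leftmost dominator, which fixes the popped segment's right endpoint.
        while stack and stack[-1][1] < y:
            px, py = stack.pop()
            out.append((px, x, py))
        stack.append((x, y))
    # Points left on the stack are maximal in the whole set: their segments
    # extend to infinity (popped from lowest to highest).
    while stack:
        px, py = stack.pop()
        out.append((px, math.inf, py))
    return out

pts = sorted([(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)])
print(sweep_segments(pts))
# -> [(2, 3, 3), (4, 5, 1), (5, inf, 2), (3, inf, 4), (1, inf, 5)]
```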

2.3 Constructing the PPB-tree

Remember that we need a PPB-tree on . The known algorithms for PPB-tree construction require super-linear I/Os even after sorting [2, 6, 17, 38]. Next, we show that the two properties of in Lemma 2 allow building  in linear I/Os. Let us number the leaf level as level 0. In general, the parent of a level- () node is at level . We will build in a bottom-up manner, i.e., starting from the leaf level, then level , and so on.

Leaf Level. To create the leaf nodes, we need to first sort the left and right endpoints of the segments in together by -coordinate. This can be done in I/Os as follows. First, , which is sorted by -coordinates, gives a sorted list of the left endpoints. On the other hand, our algorithm of the previous subsection generates in non-descending order of the right endpoints’ -coordinates (breaking ties by favoring lower points). By merging the two lists, we obtain the desired sorted list of left and right endpoints combined.

Let us briefly review the algorithm proposed in [6] to build a PPB-tree. The algorithm conceptually moves a vertical line from to . At any moment, it maintains a B-tree on the -coordinates of the segments in . We call a snapshot B-tree. To do so, whenever hits the left (resp. right) endpoint of a segment , it inserts (resp. deletes) the -coordinate of in . The PPB-tree can be regarded as a space-efficient union of all the snapshot B-trees. The algorithm incurs I/Os because (i) there are updates, and (ii) for each update, I/Os are needed to locate the leaf node affected.

When is nesting and monotonic, the construction can be significantly accelerated. A crucial observation is that any update to happens only at the bottom of . Specifically, whenever hits the left/right endpoint of a segment , must be the lowest segment in . This implies that the leaf node of to be altered must be the leftmost one in (we adopt the convention that the leaf elements of a B-tree are ordered from left to right in ascending order). Hence, we can find this leaf without any I/Os by buffering it in memory, in contrast to the cost originally needed.

The other details are standard, and are sketched below assuming the knowledge of the classic algorithm in [6]. Whenever the leftmost leaf of is full, we version copy it to , and possibly perform a split or merge if a strong-version overflow or underflow occurs, respectively (version copy, strong-version overflow, and strong-version underflow are concepts from the terminology of [6]). A version copy, split, and merge can all be handled in I/Os, and can happen only times. Therefore, the cost of building the leaf level is .
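The "updates only at the bottom" observation can be made concrete with a small brute-force simulation of the sweep over the endpoint events. The tie-breaking assumed here (at equal x-coordinates, deletions are processed before insertions, and lower segments first) is our assumption consistent with the "favoring lower points" rule above; the code only verifies the invariant on an example.

```python
import math

def check_bottom_updates(points):
    """Simulate the construction sweep and verify that every insertion or
    deletion concerns the lowest segment currently alive."""
    # Convert points to segments (x_left, x_right, y) by brute force.
    segs = []
    for x, y in points:
        doms = [(qx, qy) for qx, qy in points if qx > x and qy > y]
        right = min(doms)[0] if doms else math.inf
        segs.append((x, right, y))
    # Endpoint events (x, kind, y): deletions (kind 0) before insertions
    # (kind 1) at equal x, and lower deletions first.
    events = [(xl, 1, y) for xl, xr, y in segs]
    events += [(xr, 0, y) for xl, xr, y in segs if xr < math.inf]
    alive = set()
    for x, kind, y in sorted(events):
        if kind == 1:                       # left endpoint: insert segment
            assert all(y < other for other in alive), "insert not at bottom"
            alive.add(y)
        else:                               # right endpoint: delete segment
            assert y == min(alive), "delete not at bottom"
            alive.remove(y)
    return True

print(check_bottom_updates([(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]))  # -> True
```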

Internal Levels. The level- nodes can be built by exactly the same algorithm, but on a different set of segments which are generated from the leaf nodes of the PPB-tree. To explain, let us first review an intuitive way [16] to visualize a node in a PPB-tree. A node can be viewed as a rectangle in , where (resp. ) is the position of when is created (resp. version copied), and represents the -range of in all the snapshot B-trees where belongs. See Figure 4.

Figure 4: A node in a PPB-tree.

For each leaf node (already created), we add the bottom edge of , namely , into . The next lemma points out a crucial fact.

Lemma 3

is both nesting and monotonic.

{proof}

We prove the lemma by induction on the position of . For this purpose, care must be taken to interpret the rectangles of the nodes currently in . As these nodes are still “alive” (i.e., they have not been version copied yet), the right edges of their rectangles rest on , and move rightwards along with . Let set include the bottom edges of the rectangles of all level-1 nodes already spawned so far, counting also the ones in . When we finish building all the level-1 nodes, becomes the final . We will show that is nesting and monotonic at all times. This is obviously true when is at .

Now, suppose that is currently nesting and monotonic. We will prove that it remains so after the next update on . This is trivial if the update does not cause any version copy, i.e., the first leaf node of is not full yet. Consider instead that is version copied to when is at . At this point, is finalized. Because is the lowest among the rectangles of the nodes in , its finalization cannot affect the nesting and monotonicity of . The version copy also creates . Note that the x-intervals of and are disjoint, because the former does not include , but the latter does. Furthermore, has the same -interval as , and a zero-length -interval . Therefore, if no split/merge follows, is still nesting and monotonic.

Next, consider that is split into and . In this case, disappears from , and is replaced by and , which are the bottom two among the rectangles of the nodes in . Furthermore, both and have zero-length -intervals. So is still nesting and monotonic.

It remains to discuss the case where needs to merge with its sibling in . When this happens, the algorithm first version copies to , which finalizes . The -interval of must contain that of , which is consistent with nesting and monotonicity because is above . The merge of and creates a node , such that has a zero-length -interval. Note that is currently the lowest of the rectangles of the nodes in . So remains nesting and monotonic.

Finally, may still need to be split one more time, but this case can be analyzed in the same way as the split scenario mentioned earlier. We thus conclude the proof.

Our algorithm (for building the leaf nodes) writes the left and right endpoints of the segments in in non-descending order of their -coordinates (breaking ties by favoring lower endpoints). This, together with Lemma 3, permits us to create the level- nodes using the same algorithm in I/Os (as ). We repeat the above process to construct the nodes of higher levels. The cost decreases by a factor of each level up. The overall construction cost is therefore .

Theorem 1

There is an indivisible linear-size structure on points in , such that top-open range skyline queries can be answered in I/Os, where is the number of reported points. If all points have been sorted by -coordinates, the structure can be built in linear I/Os. The query cost is optimal (even without assuming indivisibility).

{proof}

We focus on the query optimality because the rest of the theorem follows from our earlier discussion directly.

The term is clearly indispensable. The term , on the other hand, is also compulsory due to a reduction from predecessor search. First, it is well-known (see, e.g., [8]) that predecessor search can be reduced to top-open range reporting (note: not top-open range skyline), such that if a linear-size structure can answer a top-open range query in time, the same structure also solves a predecessor query in time. Interestingly, given a predecessor query, the converted top-open range query always returns only one point. Hence, the query can as well be interpreted as a top-open range skyline query. This indicates that the same reduction also works from predecessor search to top-open range skyline. Finally, any linear-size structure must incur I/Os answering a predecessor query in the worst case [32] (even without the indivisibility assumption). It thus follows that also lower bounds the cost of a top-open range skyline query.

3 Divisible Top-Open Structure

The structure of the previous section obeys the indivisibility assumption. This section eliminates the assumption, and unleashes the power endowed by bit manipulation. As we will see, a small universe admits linear-size structures with lower query cost.

In Section 3.1, we study a different problem called ray-dragging. Then, in Section 3.2, our ray-dragging structure is deployed to develop a “few-point structure” for answering top-open queries on a small point set. Finally, in Section 3.3, we combine our few-point structure with an existing structure [9] to obtain the final optimal top-open structure.

3.1 Ray Dragging

In the ray dragging problem, the input is a set of points in where is an integer. Given a vertical ray where , a ray dragging query reports the first point in to be hit by when moves left. The rest of the subsection serves as the proof for:

Lemma 4

For , we can store in a structure of size that can answer ray dragging queries in I/Os.

Minute Structure. Set . We first consider the scenario where has very few points: . Let us convert to a set of points in an grid. Specifically, map a point to such that (resp. ) is the rank of (resp. ) among the - (-) coordinates in .

Given a ray , we instead answer a query in using a ray , where (resp. ) is the rank of the predecessor of (resp. ) among the - (resp. -) coordinates in . Create a fusion tree [19, 28] on the - (resp. -) coordinates in so that the predecessor of  (resp. ) can be found in I/Os, which is thus also the cost of turning into . The fusion tree uses blocks.

We will ensure that the query with (in ) returns an id from 1 to that uniquely identifies a point in , if the result is non-empty. To convert the id into the coordinates of , we store in an array of blocks such that any point can be retrieved in one I/O by id.

The benefit of working with is that each coordinate in requires fewer bits to represent (than in ), that is, bits. In particular, we need bits in total to represent a point’s -, -coordinates, and id. Since , the storage of the entire demands bits. If , then . On the other hand, if , then . In other words, we can always store the entire set in blocks. Given a query with , we simply load this block into memory, and answer the query in memory with no more I/O.

We have completed the description of a structure that uses blocks, and answers queries in constant I/Os when . We refer to it as a minute structure.
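The word-packing idea behind the minute structure can be sketched as follows: after reducing a handful of points to a tiny grid, each point's x-rank, y-rank, and id fit in a few bits, so the whole set fits in O(1) machine words (here one Python integer stands in for a block), and a ray-dragging query is answered by decoding and scanning it in memory. The threshold on the number of points and the exact bit budget are parameters of the construction above and are not reproduced here; the sketch also assumes the ray is the upward ray at x-coordinate x with bottom endpoint y, dragged leftwards.

```python
def pack_points(rank_points, bits):
    """Pack points given as (x_rank, y_rank) into one integer 'block'.
    Each field uses `bits` bits; a point's position in the list is its id."""
    block = 0
    for i, (xr, yr) in enumerate(rank_points):
        word = (xr << bits) | yr
        block |= word << (i * 2 * bits)
    return block

def ray_drag_packed(block, n, bits, x, y):
    """Among the packed points with x_rank <= x and y_rank >= y, return the
    id of the one with the largest x_rank (the first point hit by the upward
    ray at x when dragged to the left), or None. Pure in-memory decoding."""
    mask = (1 << bits) - 1
    best_id, best_x = None, -1
    for i in range(n):
        word = (block >> (i * 2 * bits)) & ((1 << (2 * bits)) - 1)
        xr, yr = word >> bits, word & mask
        if xr <= x and yr >= y and xr > best_x:
            best_id, best_x = i, xr
    return best_id

pts = [(0, 3), (1, 1), (2, 2), (3, 0)]               # already in rank space
blk = pack_points(pts, bits=2)
print(ray_drag_packed(blk, len(pts), 2, x=2, y=2))   # -> 2 (point (2, 2))
print(ray_drag_packed(blk, len(pts), 2, x=3, y=1))   # -> 2 (point (2, 2))
```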

Proof of Lemma 4. We store in a B-tree that indexes the -coordinates of the points in . We set the B-tree’s leaf capacity to and internal fanout to . Note that the tree has a constant height.

Given a node in the tree, define as the highest point whose -coordinate is stored in the subtree of . Now, consider to be an internal node with child nodes . Define . We store in a minute structure. Also, for each point , we store an index indicating the child node whose subtree contains the -coordinate of . A child index requires bits, which is no more than the length of a coordinate. Hence, we can store the index along with in the minute structure without increasing its space by more than a constant factor. For a leaf node , define to be the set of points whose -coordinates are stored in .

Since there are internal nodes and each minute structure demands space, all the minute structures occupy blocks in total. Therefore, the overall structure consumes linear space.

We answer a ray-dragging query with ray as follows. First, descend a root-to-leaf path to the leaf node containing the predecessor of among the -coordinates in . Let be the lowest node on such that has a point that can be hit by when moves left. For each node , whether has such a point can be checked in I/Os by querying the minute structure over . Hence, can be identified in I/Os where is the height of the B-tree. If does not exist, we return an empty result (i.e., does not hit any point no matter how far it moves).

If exists, let be the first point in hit by when it moves left. Suppose that the -coordinate of is in the subtree of , where is a child node of . The query result must be in the subtree of , although it may not necessarily be . To find out, we descend another path from to a leaf. Specifically, we set to , and find the first point in () that is hit by when it moves left (notice that has changed). Now, letting be the child node of whose subtree is from, we repeat the above steps. This continues until becomes a leaf, in which case the algorithm returns as the final answer. The query cost is . This completes the proof of Lemma 4. We will refer to the above structure as a ray-drag tree.

3.2 Top-Open Structure on Few Points

Next, we present a structure for answering top-open queries on small , called henceforth the few-point structure. Remember that is a set of points in for some integer , and a query is a rectangle where .

Lemma 5

For , we can store in a structure of space that answers top-open range skyline queries with output size in I/Os.

{proof}

Consider a query with . Let  be the first point hit by the ray when moves left. If does not exist or is out of (i.e., ), the top-open query has an empty result. Otherwise, must be the lowest point in the skyline of .

The subsequent discussion focuses on the scenario where . We index with a PPB-tree , as in Theorem 1. Recall that the top-open query can be solved by retrieving the set of segments in intersecting the vertical segment , where is the highest -coordinate of the points in . To do so in I/Os, we utilize the next two observations:

Observation 1

All segments of intersect .

Proof: is the lowest among the segments of intersecting (recall that is the segment in converted from ). Hence, a segment of intersects if and only if it intersects . On the other hand, a segment of intersects if and only if it intersects . To explain, let be a segment in intersecting . As is higher than , the -interval of must contain that of (due to the nesting and monotonicity properties of ), implying that intersects . Similarly, one can also show that if intersects , it also intersects .

Observation 2

Let be the snapshot B-tree in when is at the position . Once we have obtained the leaf node in containing , we can retrieve in I/Os without knowing the value of .

Proof: Each leaf node in has a sibling pointer to its succeeding leaf node (due to the nesting and monotonicity properties, every leaf node in the PPB-tree needs only one sibling pointer during the entire period when it is alive). Hence, starting from the leaf node storing , we can visit the leaves of in ascending order of the -coordinates they contain. The effect is to report, in the bottom-up order, the segments of that intersect . By the nesting and monotonicity properties, a segment reported later has a left endpoint with a smaller -coordinate. We stop as soon as we reach a segment whose left endpoint falls out of . The cost is because segments are reported in each accessed leaf, except possibly the last one.

We now elaborate on the structure of Lemma 5. Besides , also create a structure of Lemma 4 on . Moreover, for every point , keep a pointer to the leaf node of that (i) is in the snapshot B-tree when is at , and (ii) contains . Call the leaf node the host leaf of . Store the pointers in an array of size to permit retrieving the pointer of any point in one I/O.

The query algorithm should have become straightforward from the above two observations. We first find in I/Os the first point hit by when moves left. Then, using , we jump to the host leaf of . Next, by Observation 2, we retrieve in I/Os. The total query cost is .

3.3 Final Top-Open Structure

We are ready to describe our top-open structure that achieves sub-logarithmic query I/Os for arbitrary . For this purpose, we externalize an internal-memory structure of [9]. The structure of [9], however, has logarithmic query overhead, which we improve with new ideas based on the few-point structure in Lemma 5.

Theorem 2

There is a linear-size structure on points in rank space such that top-open range skyline queries can be answered optimally in I/Os, where is the number of reported points.

Structure. Let be the length of each dimension. We assume, without loss of generality, that is an integer. Divide the -dimension of into consecutive intervals of length each, except possibly the last interval. Call each interval a chunk. Assign each point to the unique chunk covering . Note that some chunks may be empty.

Create a complete binary search tree on the chunks. Let be a node of . We say that a point is “in the subtree of ” if it is assigned to a chunk in the subtree of . Denote by the set of points in the subtree of . Define as the set of highest points in the skyline of ; if the skyline of has less than points, includes all of them. Furthermore, if , let be the lowest point in ; otherwise, nil. We store along with .
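The skeleton of this structure (chunking plus a complete binary tree whose nodes keep only the highest few skyline points of their subtrees) can be sketched as follows. The chunk length and the number of skyline points kept per node are left as free parameters, since their exact values above are tied to the block size; the sketch also omits the auxiliary sets defined next.

```python
def build_chunk_tree(points, chunk_len, keep):
    """Group rank-space points into x-chunks of length chunk_len, then build
    a complete binary tree over the chunks; every node stores the (at most
    `keep`) highest points of the skyline of its subtree."""
    num_chunks = 1
    while num_chunks * chunk_len <= max(x for x, _ in points):
        num_chunks *= 2                      # pad the tree to a power of two
    leaves = [[] for _ in range(num_chunks)]
    for x, y in points:
        leaves[x // chunk_len].append((x, y))

    def skyline(pts):                        # maximal points, descending x-order
        best_y, out = -1, []                 # y-ranks are non-negative
        for x, y in sorted(pts, reverse=True):
            if y > best_y:
                out.append((x, y))
                best_y = y
        return out

    # Heap-ordered complete tree: node v has children 2v and 2v+1.
    node = [None] * (2 * num_chunks)
    for i, chunk_pts in enumerate(leaves):
        node[num_chunks + i] = chunk_pts
    for v in range(num_chunks - 1, 0, -1):
        node[v] = node[2 * v] + node[2 * v + 1]
    # Replace each subtree's point set by its `keep` highest skyline points.
    return [None if pts is None else
            sorted(skyline(pts), key=lambda p: -p[1])[:keep]
            for pts in node]

tree = build_chunk_tree([(0, 3), (1, 1), (2, 2), (3, 0)], chunk_len=2, keep=2)
print(tree[1])   # root: two highest skyline points overall -> [(0, 3), (2, 2)]
```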

Let be any internal node such that is not nil. Denote by the path from the leaf (a.k.a. chunk) of covering to the child of that is an ancestor of . Define as the set of right siblings of the nodes in (if a node is the right child of its parent, it has no right sibling; similarly, if a node is a left child, it has no left sibling). Let be the skyline of the point set . We store along with , and order the points in by -coordinate (hence, also by -coordinate). In Figure 5, for example, is the skyline of .

The above completes the externalization of the structure in [9]. Next, we describe new mechanisms for achieving query cost . First, we index the points in each chunk with a few-point structure of Lemma 5. Moreover, for every and every proper ancestor of , we store two sets and defined as follows. Let be the path from to the child of that is an ancestor of . Define as the set of left siblings of the nodes on , and conversely, the set of right siblings of those nodes. Then:

  • is the skyline of

  • is the skyline of .

For instance, in Figure 5, is the skyline of , whereas is the skyline of . The points of both and are sorted by -coordinate.

Space. Let be the height of . We analyze first the space consumed by the internal nodes of . Clearly, fits in blocks, whereas occupies blocks. All the internal nodes thus demand blocks in total.

Now, let us focus on the leaf nodes of . As each few-point structure uses linear space, all the few-point structures demand blocks altogether. Regarding , has at most proper ancestors , while each requires blocks. Hence, the of all and occupy blocks in total. The case with is symmetric. The overall space consumption is therefore linear.

Query. We need the following fact:

Lemma 6

Given a node in and a value , let be the set of points in with -coordinates greater than . We can report the skyline of in I/Os where is the number of points reported.

{proof}

If is a leaf, find the skyline of by issuing a top-open query with search rectangle on the few-point structure of . The query time is by Lemma 5.

The rest of the proof adapts an argument in [9] to external memory. Given an internal node , we find the skyline of as follows. Load into memory, and report the points therein with -coordinates above . If there are less than such points, we have found the entire skyline of .

Suppose instead that the entire is reported. Let . It suffices to consider the points that

  • are in the subtrees of the nodes in , or

  • share the same chunk as , but are to the right of .

Any other point of must be either in – which is already found – or dominated by .

To find the skyline points in (i), first report the set of points in whose -coordinates are above . Then, we explore the subtrees of certain nodes in . Specifically, let be the nodes in for some integer . For each , define ; if (this can be checked efficiently because the points of are consecutive in ), the subtree of can be pruned from further consideration: either , or is dominated by a point in , and in both cases we have found all the result points from the subtree of . Otherwise (i.e., ), we recursively report the skyline of , where is the -coordinate of the point just to the right of in the staircase of ; if no such point exists, .

The skyline points in (ii) can be retrieved with a top-open query on the few-point structure of the chunk covering , where can be identified in constant I/Os by dividing by . Specifically, if , define to be the -coordinate of the highest point in ; otherwise, define . The top-open query for has rectangle .

Now we analyze the query cost. If less than points of are reported, the algorithm finishes with I/Os. Otherwise, the scan of takes I/Os. If , we charge the cost on the points in ; otherwise, we charge the cost on the points of . The top-open query on the few-point structure of requires I/Os if it returns points. If , we charge the cost on the points of ; otherwise, charge the I/Os on the points.

It remains to discuss the I/Os spent on . For each , if , there is no cost on . Otherwise, we charge on the points of the I/Os spent on reading before recursively reporting the skyline of . The rest of the I/Os performed by the recursion are charged in the same manner as explained above. In this way, every reported point is charged I/Os overall. The total query time is therefore .

To answer a top-open query with , where , we first identify the chunks and that cover and , respectively. This takes I/Os by dividing and by the chunk size , respectively. If , the query can be solved by searching the few-point structure of in I/Os (Lemma 5). The subsequent discussion considers .

Figure 5: Illustration of , , and

Let be the lowest common ancestor of and in . As is a complete binary tree, can be determined in constant I/Os. The rest of the algorithm proceeds in 4 steps:

  1. Use the few-point structure of to report the skyline of . Let be the set of points retrieved, and the -coordinate of the highest point in . If , .

  2. Report the set of points in