On the largest empty axis-parallel box amidst $n$ points
We give the first nontrivial upper and lower bounds on the maximum volume of an empty axis-parallel box inside an axis-parallel unit hypercube in $\mathbb{R}^d$ containing $n$ points. For a fixed $d$, we show that the maximum volume is of the order $\Theta(1/n)$. We then use the fact that the maximum volume is $\Omega(1/n)$ in our design of the first efficient $(1-\varepsilon)$-approximation algorithm for the following problem: Given an axis-parallel $d$-dimensional box $R$ in $\mathbb{R}^d$ containing $n$ points, compute a maximum-volume empty axis-parallel $d$-dimensional box contained in $R$. The running time of our algorithm is nearly linear in $n$, for small $d$, and increases only by an $O(\log n)$ factor when one goes up one dimension. No previous efficient exact or approximation algorithms were known for this problem for $d \geq 4$. As the problem has been recently shown to be NP-hard in arbitrarily high dimensions (i.e., when $d$ is part of the input), the existence of efficient exact algorithms is unlikely.
We also obtain tight estimates on the maximum volume of an empty axis-parallel hypercube inside an axis-parallel unit hypercube in $\mathbb{R}^d$ containing $n$ points. For a fixed $d$, this maximum volume is of the same order $\Theta(1/n)$. A faster $(1-\varepsilon)$-approximation algorithm, with a milder dependence on $d$ in the running time, is obtained in this case.
Keywords: Largest empty box, largest empty hypercube, approximation algorithm.
Given a set $S$ of $n$ points in the unit square $U = [0,1]^2$, let $A(S)$ be the maximum area of an empty axis-parallel rectangle contained in $U$, and let $A(n)$ be the minimum value of $A(S)$ over all sets $S$ of $n$ points in $U$. For any dimension $d \geq 2$, given a set $S$ of $n$ points in the unit hypercube $U_d = [0,1]^d$, let $A_d(S)$ be the maximum volume of an empty axis-parallel hyperrectangle ($d$-dimensional axis-parallel box) contained in $U_d$, and let $A_d(n)$ be the minimum value of $A_d(S)$ over all sets $S$ of $n$ points in $U_d$. For simplicity we sometimes omit the subscript $d$ in the planar case ($d = 2$).
In this paper we give the first nontrivial upper and lower bounds on . For any dimension , our estimates are within a multiplicative constant (depending on ) from each other. For a fixed , we show that the maximum volume is of the order . While the algorithmic problem of finding an empty axis-parallel box of maximum volume has been previously studied for (see below), estimating the maximum volume of such a box as a function of and seems to have not been previously considered.
We first introduce some notations and definitions. Throughout this paper, a box is an open axis-parallel hyperrectangle contained in the unit hypercube , . Given a set of points in , a box is empty if it contains no points in , i.e., . If is a box, we also refer to the side length of in the th coordinate as the extent in the th coordinate of . Throughout this paper, and are the logarithms of in base and , respectively.
Given an axis-parallel rectangle in the plane containing points, the problem of computing a maximum-area empty axis-parallel sub-rectangle contained in is one of the oldest problems studied in computational geometry. For instance, this problem arises when a rectangular shaped facility is to be located within a similar region which has a number of forbidden areas, or in cutting out a rectangular piece from a large similarly shaped metal sheet with some defective spots to be avoided . In higher dimensions, finding the largest empty axis-parallel box has applications in data mining, in finding large gaps in a multi-dimensional data set .
Several algorithms have been proposed for the planar problem over the years [1, 2, 3, 8, 11, 17, 18, 19]. For instance, an early algorithm by Chazelle, Drysdale and Lee runs in $O(n \log^3 n)$ time and $O(n \log n)$ space. The fastest known algorithm, proposed by Aggarwal and Suri in 1987, runs in $O(n \log^2 n)$ time and $O(n)$ space. A lower bound of $\Omega(n \log n)$ in the algebraic decision tree model for this problem has been shown by McKenna et al.
For any dimension , there is an obvious brute-force algorithm running in time and space. No significantly faster algorithms, i.e., with a fixed degree polynomial running time in , were known. Confirming this state of affairs, Backer and Keil [5, 6] recently proved that the problem is NP-hard in arbitrarily high dimensions (i.e., when is part of the input). They also gave an exact algorithm running in time, for any . In particular, the running time of their exact algorithm for is . Previously, Datta and Soundaralakshmi  had reported an time exact algorithm for the case, but their analysis of the running time seems incomplete. Specifically, the running time depends on an upper bound on the number of maximal empty boxes (see the discussion in the next paragraph), but they only gave an lower bound. Here we present the first efficient -approximation algorithm for finding an axis-parallel empty box of maximum volume, whose running time is nearly linear for small , and increases only by an factor when one goes up one dimension.
An empty box of maximum volume must be maximal with respect to inclusion. Following the terminology in , a maximal empty box is called restricted. Thus the maximum-volume empty box in is restricted. Naamad et al.  have shown that in the plane, the number of restricted rectangles is , and that this bound is tight. It was conjectured by Datta and Soundaralakshmi  that the maximum number of restricted boxes is for each (fixed) . The conjecture has been recently confirmed by Backer and Keil [5, 6] (for ). Here we extend (Theorem 7, Appendix D) the constructions with restricted boxes for in  and in  for arbitrary . Independently and simultaneously, Backer and Keil have also obtained this result [4, 5, 6]. Hence the maximum number of restricted boxes is for each fixed . This means that any algorithm for computing a maximum-volume empty box based on enumerating restricted boxes is inefficient in the worst case. On the other hand, at the expense of giving an -approximation, our algorithm does not enumerate all restricted boxes, and achieves efficiency by enumerating all canonical boxes (to be defined) instead.
Our results are:
In Section 2 we show that for . More precisely: , and . From the other direction we have , and for any . Here is the th prime.
In Section 3 we exploit the fact that the maximum volume is in our design of the first efficient -approximation algorithm for finding the largest empty box: Given an axis-parallel -dimensional box in containing points, there is a -approximation algorithm, running in time, for computing a maximum-volume empty axis-parallel box contained in .
In Appendix B we show that the estimate also holds for the maximum volume (or area) of an axis-aligned hypercube (or square) amidst points in . In Appendix C we present a faster -approximation algorithm for finding the largest empty hypercube: Given an axis-parallel -dimensional hypercube in containing points, there is a -approximation algorithm, running in time, for computing a maximum-volume empty axis-parallel hypercube contained in .
2 Empty rectangles and boxes
2.1 Empty rectangles in the plane
The lower bound.
We start with a very simple-minded lower bound; however, as it turns out, it is very close to optimal. One can immediately see that , by partitioning the unit square with vertical lines through each point: out of at most resulting empty rectangles, the largest rectangle has area at least . Thus we have:
The following observation is immediate from invariance under scaling with respect to any of the coordinate axes.
Assume that holds for some and . Then, given points in an axis-aligned rectangle , there is an empty rectangle contained in of area at least .
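The vertical-line partition argument above can be illustrated with a few lines of Python. This is only a sketch: the function name `widest_empty_strip` and the sample coordinates are hypothetical, and the points are assumed to lie in the unit square. With $n$ points there are at most $n+1$ strips whose widths sum to 1, so the widest one has width (and hence area) at least $1/(n+1)$.

```python
def widest_empty_strip(xs):
    """Return (left, right) of the widest open vertical strip of the unit
    square that contains none of the x-coordinates in xs."""
    # Vertical lines through the points, plus the two sides of the square.
    cuts = [0.0] + sorted(xs) + [1.0]
    # Consecutive cuts bound the empty strips; pick the widest one.
    return max(zip(cuts, cuts[1:]), key=lambda ab: ab[1] - ab[0])


# Hypothetical x-coordinates of n = 4 points (y-coordinates are irrelevant here).
sample_xs = [0.15, 0.5, 0.9, 0.62]
left, right = widest_empty_strip(sample_xs)
strip_area = (right - left) * 1.0  # the strip spans the full height of the square
```

The strip is generally far from optimal as an empty rectangle, but it already certifies the trivial lower bound used throughout this section.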
Using the next two lemmas we will slightly improve the trivial lower bound in our next Theorem 1. Let be the solution in of the quadratic equation .
Given points in the unit square, there exists an empty rectangle with area at least . This bound is tight, i.e., .
Let , and assume without loss of generality that , and . Write , and . Consider the three empty rectangles , , and . Their areas are , , and , respectively. If or , we are done, as one of the first two rectangles has area at least . So we can assume that and . Then it follows that
so the third rectangle has area at least , as required.
To see that this bound is tight, take , , and check that the largest empty rectangle has area . ∎
The proof of the next lemma appears in Appendix A.
Given points in the unit square, there exists an empty rectangle with area at least . This bound is tight, i.e., .
Given points in the unit square, there exists an empty rectangle with area at least . That is, .
The lower bound derived in the proof, , is better than for all . For , the resulting bound is . An alternative partition, yielding the same bound in Theorem 1, can be obtained by dividing into rectangles with vertical lines through every th point of the set. Slightly better lower bounds, particularly for small values of can be obtained by constructing different partitions tailored for specific values of (with a number of points other than in a few of the rectangles), and using estimates on , , etc. For instance, from Lemma 2 we can derive that . Incidentally, we remark that a suitable -point construction gives from the other direction that .
The upper bound.
Let be the van der Corput set of points [9, 10], with coordinates , , constructed as follows [7, 16]: Let . If is the binary representation of , where , then . Observe that all points in lie in the unit square .
For the van der Corput set of points, , the area of the largest empty axis-parallel rectangle is less than .
Let be any open empty axis-parallel rectangle inside the unit square. We next show that the area of is less than . (The argument we use here is similar to that used for bounding the geometric discrepancy of the van der Corput set of points.) Following the presentation in [16, p. 39], a canonical interval is an interval of the form for some positive integer and an integer .
Let be the longest canonical interval contained in the projection of the empty rectangle onto the -axis (recall that is open, so this projection is an open interval). Then the side length of along must be less than because otherwise the projection would contain a longer canonical interval of length .
Let be the binary representation of an integer , . In the van der Corput construction, a point in with -coordinate has its -coordinate in the canonical interval if and only if , which happens exactly when . In this case, is a constant . It then follows that the side length of along is at most . Therefore the area of is less than , as required. ∎
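The construction and the bound can be checked numerically on small instances. The sketch below assumes the standard van der Corput set for $n = 2^k$: the point with index $i$ has $x$-coordinate $i/n$ and a $y$-coordinate obtained by reflecting the binary digits of $i$ about the radix point. The function names are hypothetical, and the brute-force search is only meant for tiny $n$. The comparison against $4/n$ is an assumption about the elided bound, based on the canonical-interval argument above.

```python
from itertools import combinations


def van_der_corput_points(k):
    """Points (i/n, bit-reversal of i) for n = 2**k and i = 0, ..., n-1."""
    n = 1 << k
    points = []
    for i in range(n):
        y, scale, m = 0.0, 0.5, i
        while m:
            y += (m & 1) * scale  # binary digit a_j contributes a_j / 2**(j+1)
            scale /= 2
            m >>= 1
        points.append((i / n, y))
    return points


def largest_empty_rectangle_area(points):
    """Brute force: a maximal empty rectangle has every side on a point
    coordinate or on the boundary of the unit square."""
    xs = sorted({0.0, 1.0, *(p[0] for p in points)})
    ys = sorted({0.0, 1.0, *(p[1] for p in points)})
    best = 0.0
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            area = (x2 - x1) * (y2 - y1)
            # Open rectangle: only points strictly inside make it non-empty.
            if area > best and not any(
                x1 < x < x2 and y1 < y < y2 for x, y in points
            ):
                best = area
    return best


n = 16
best_area = largest_empty_rectangle_area(van_der_corput_points(4))
print(best_area, 1 / (n + 1), 4 / n)
```

All coordinates are dyadic rationals, so the floating-point arithmetic here is exact for small $k$.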
2.2 Empty boxes in higher dimensions
As in the planar case, is immediate, by partitioning the hypercube with axis-parallel hyperplanes, one through each of the points. By projecting the points on one of the faces of , and proceeding by induction on , it follows that the lower bound in Theorem 2 carries over here too. Thus we have:
. Moreover, .
We next show that, modulo a constant factor depending on , this estimate is also best possible. Let be the Halton-Hammersley set of points [14, 15], with coordinates , , constructed as follows [7, 16]: Let be the th prime number. Each integer has a unique base- representation , where . Let , and let , . Then all points in are inside the unit hypercube .
For the Halton-Hammersley set of points, , the volume of the largest empty axis-parallel box is less than , where is the th prime.
Let be any open empty box inside the unit hypercube. We next show that the volume of is less than . Generalizing the planar case, a canonical interval of the axis , , is an interval of the form for some positive integer and an integer . Note that .
First consider each axis , . Let be a longest canonical interval (there could be more than one for ) contained in the projection of the empty box onto the axis . Then the side length of along must be less than because otherwise the projection would contain a longer canonical interval of length .
Next consider the axis . Let be the base- representation of an integer , and . In the Halton-Hammersley construction, a point in with -coordinate has its -coordinate in the canonical interval if and only if , which happens exactly when . In this case, is a constant .
Note that the integers , , are relatively prime. By the Chinese remainder theorem, it follows that a point in with -coordinate has its -coordinate in the canonical interval for all if and only if for some integer . Therefore the side length of along is at most . Consequently, the volume of is less than . ∎
It is known that as .
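As an illustration of the point set used in this subsection, the following sketch generates a Halton-Hammersley set with exact rational arithmetic. It assumes the standard construction: one coordinate equals $m/n$ and each remaining coordinate is the radical inverse of $m$ in a distinct prime base. Placing the $m/n$ coordinate first, and the names `radical_inverse` and `halton_hammersley`, are choices made here for illustration only.

```python
from fractions import Fraction


def radical_inverse(m, base):
    """Reflect the base-`base` digits of m about the radix point:
    m = sum a_i * base**i  maps to  sum a_i / base**(i+1)."""
    result, scale = Fraction(0), Fraction(1, base)
    while m:
        m, digit = divmod(m, base)
        result += digit * scale
        scale /= base
    return result


def halton_hammersley(n, primes):
    """n points in [0,1)^d with d = len(primes) + 1."""
    return [
        (Fraction(m, n), *(radical_inverse(m, p) for p in primes))
        for m in range(n)
    ]


# A 3-dimensional example using the first two primes.
points = halton_hammersley(6, [2, 3])

# Indices whose base-2 radical inverse lies in the canonical interval [1/4, 1/2)
# form a single residue class modulo 4, as used in the proof above.
residues = {
    m % 4
    for m in range(32)
    if Fraction(1, 4) <= radical_inverse(m, 2) < Fraction(1, 2)
}
```

The residue-class property is what allows the Chinese remainder theorem to be applied across the pairwise coprime prime powers.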
3 A $(1-\varepsilon)$-approximation algorithm for finding the largest empty box
Let be an axis-parallel -dimensional box in containing points. In this section, we present an efficient -approximation algorithm for computing a maximum-volume empty axis-parallel box contained in .
Given an axis-parallel -dimensional box in containing points, there is a -approximation algorithm, running in
time, for computing a maximum-volume empty axis-parallel box contained in .
We first set a few parameters.
We assume that , and , which cover all cases of interest. To somewhat simplify our calculations we also assume that . Let us choose parameters
Let be the unique positive integer such that
We next derive some inequalities that follow from this setting. By assumptions and , we have , and . Then a simple calculation shows that
It is also clear that . So satisfies
Since and , it follows from the second inequality in (3) that . We now derive an upper bound on as a function of , and . First observe that
We also have
From (3) we deduce the following sequence of inequalities:
From (6), a straightforward calculation (where we use and ) gives
Overview of the algorithm.
By a direct generalization of Observation 1, we can assume w.l.o.g. that . Let be the set of points contained in . The algorithm generates a finite set of canonical boxes; to be precise, only a subset of large canonical boxes. For each large canonical box , a corresponding canonical grid is considered, and is placed with its lowest corner at each such grid position and tested for emptiness and containment in . The one with the largest volume amongst these is returned in the end.
Canonical boxes and their associated grids.
Consider the set of canonical boxes, all of whose side lengths are elements of
For a given canonical box , with sides , consider the canonical grid associated with with points of coordinates
contained in .
Let be a maximum-volume empty box in , with . By the trivial inequality of Proposition 2, we have . This lower bound is crucial in the design of our approximation algorithm, as it enables us to bound from above the number of large canonical boxes (canonical boxes of smaller volume can be safely ignored).
Consider the following set of intervals
Observe that for each , the extent in the th coordinate of is at least , since otherwise we would have , a contradiction. Let be the extent in the th coordinate of , for . By the above observation, for each , belongs to one of the last intervals in the set . That is, there exists an integer , such that
The next lemma shows that contains an (empty) canonical box with side lengths
at some position in the canonical grid associated with it. We call such a canonical box contained in a maximum-volume empty box, a large canonical box. Two key properties of large canonical boxes are proved in Lemma 4 and Lemma 5.
It is enough to prove the containment for each coordinate axis . Let and be the corresponding intervals of and , respectively. Assume for contradiction that the placement of with its left end point at the first canonical grid position in is not contained in . But then we have, by taking into account the grid cell lengths:
We reached a contradiction to the 2nd inequality in (5), and the proof is complete. ∎
We now show that the (empty) large canonical box from the previous lemma yields a -approximation for the empty box of maximum volume.
Observe that the number of canonical boxes in is exactly , and by (7) is bounded from above as follows:
We can prove however a better upper bound on the number of large canonical boxes.
The number of large canonical boxes in is at most
Recall that satisfies
for some integers . It follows immediately that
and we want an upper bound on the number of solutions of (14). Make the substitution , for . So , for . The above inequalities are equivalent to
Let be a nonnegative integer. It is well-known (see for instance ) that the number of nonnegative integer solutions of the equation equals , that is, when we ignore the upper bound on each . Summing over all values of , and using a well-known binomial identity (see for instance [21, p. 217]) yields that the number of solutions of (15), hence also of (14), is no more than
A well-known upper bound approximation for binomial coefficients
for positive integers and with , further yields that
We now check that
Recall inequality (6). A straightforward calculation (where we use , , and ), gives
as claimed. Substituting this upper bound into (16) yields
as required. This expression is an upper bound on the number of solutions of (14), hence also on the number of large canonical boxes in . ∎
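The counting step in this proof relies on two standard facts: the number of nonnegative integer solutions of $y_1 + \dots + y_d = t$ is $\binom{t+d-1}{d-1}$, and summing over $t = 0, \dots, m$ gives $\binom{m+d}{d}$. The sketch below checks both by brute force for small values; the symbols $y_j$, $t$, $m$ and the function names are placeholders chosen here, not the paper's notation.

```python
from itertools import product
from math import comb


def count_sum_equal(d, t):
    """Brute-force count of nonnegative integer d-tuples with sum exactly t."""
    return sum(1 for y in product(range(t + 1), repeat=d) if sum(y) == t)


def count_sum_at_most(d, m):
    """Brute-force count of nonnegative integer d-tuples with sum at most m."""
    return sum(1 for y in product(range(m + 1), repeat=d) if sum(y) <= m)


d, m = 3, 4
# Stars and bars for each fixed t.
per_level_ok = all(
    count_sum_equal(d, t) == comb(t + d - 1, d - 1) for t in range(m + 1)
)
# Summation over t, via the binomial ("hockey stick") identity.
total = count_sum_at_most(d, m)
closed_form = comb(m + d, d)
```

Ignoring the upper bound on each variable, as the proof does, can only increase the count, so the closed form remains a valid upper bound.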
Given a grid with cell lengths , we superimpose it so that the origin of is a grid point of the above grid. Denote the corresponding grid cells by index tuples , where . Note that some of the grid cells on the boundary of may be smaller. Given a grid superimposed on , let be the number of cells (with nonempty interior) into which is partitioned.
Consider a (fixed) canonical box, say , with side lengths as in (12). The associated canonical grid, say , has side lengths times smaller in each coordinate. We now derive an upper bound on the number of canonical grid positions where a canonical box is placed and tested for emptiness, according to (9).
The number of canonical grid positions for placing in is bounded as follows:
By substituting this bound in the product we get that
Substituting these upper bounds in (3) gives the desired bound:
Testing canonical boxes for emptiness.
Given a grid with cell lengths , denote the corresponding grid cell counts or cell numbers (i.e., the number of points) in cell by . For simplicity, we can assume w.l.o.g. that in all the grids that are generated by the algorithm, no point of lies on a grid cell boundary. Indeed the points of on the boundary of can be safely ignored, and the above condition holds with probability if instead of the given , the algorithm uses a value chosen uniformly at random from the interval ; see also the setting of the parameters in (2). Given a grid , and dimensions (array sizes) , a floating box at some position aligned with it, that is, whose lower left corner is a grid point, and with the specified dimensions is called a grid box. All the canonical boxes generated by our algorithm are in fact grid boxes.
The next four lemmas (7, 8, 9, 10) outline the method we use for efficiently computing the number of points in in a rectangular box, over a sequence of boxes. In particular these boxes can be tested for emptiness within the same specified time.
Let be a grid with cell lengths , superimposed on , with cells. Then the number of points of lying in each cell, over all cells, can be computed in time.
The number of points in each cell of is initialized to . For each point , its cell index tuple (label) is computed in time using the floor function for each coordinate, and the corresponding cell count is updated. ∎
Remark. If the floor function is not an option, the number of points in each cell can be computed using binary search for each coordinate. The resulting time complexity is .
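A minimal sketch of the floor-based cell counting, assuming the grid is anchored at the origin and that no point lies on a cell boundary. The function name `cell_counts` and the use of a `Counter` keyed by index tuples are illustrative choices, not the paper's data structure.

```python
from collections import Counter
from math import floor


def cell_counts(points, cell_lengths):
    """Count the points falling in each grid cell.

    points       -- iterable of d-tuples of coordinates
    cell_lengths -- d-tuple of cell side lengths, one per coordinate
    Returns a Counter mapping a cell index tuple to its point count;
    cells that contain no point are simply absent (count 0).
    """
    counts = Counter()
    for p in points:
        # One floor operation per coordinate gives the cell label.
        label = tuple(floor(x / w) for x, w in zip(p, cell_lengths))
        counts[label] += 1
    return counts


# Hypothetical 2-dimensional example: cells of size 0.25 x 0.5.
sample = [(0.1, 0.1), (0.3, 0.1), (0.35, 0.9)]
counts = cell_counts(sample, (0.25, 0.5))
```

Each point costs one floor operation per coordinate, which matches the per-point work described in the proof.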
Denote by the number of points in in the box with lower left cell , and upper right cell ; we refer to the numbers as corner box numbers.
Given a grid with cell lengths placed at the origin, with cells, and grid cell counts , over all cells, the corner box numbers , over all cells, can be computed in time.
Define , if for some . Let be a binary vector. Let the parity of be . By the inclusion-exclusion principle, the corner box numbers are given by the following formula with at most operations:
Since has cells, the bound follows. ∎
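The inclusion-exclusion recurrence for the corner box numbers can be sketched as a $d$-dimensional prefix sum. The version below is an assumed rendering of that recurrence rather than the paper's exact formula: each corner number is the cell's own count plus a signed sum of the corner numbers of its up to $2^d - 1$ lower neighbours, with out-of-range neighbours treated as 0. The names `corner_box_numbers`, `counts` and `shape` are hypothetical.

```python
from itertools import product


def corner_box_numbers(counts, shape):
    """Prefix sums over a d-dimensional grid.

    counts -- dict mapping cell index tuples to point counts (missing = 0)
    shape  -- d-tuple with the number of cells per coordinate
    Returns a dict N where N[cell] is the number of points in the box
    spanned by cell (0, ..., 0) and `cell`.
    """
    d = len(shape)
    shifts = [b for b in product((0, 1), repeat=d) if any(b)]
    N = {}
    # Lexicographic order guarantees every lower neighbour is already computed.
    for cell in product(*(range(s) for s in shape)):
        total = counts.get(cell, 0)
        for b in shifts:
            neighbour = tuple(c - s for c, s in zip(cell, b))
            if min(neighbour) < 0:
                continue  # corner number outside the grid is 0
            sign = 1 if sum(b) % 2 == 1 else -1  # parity of the binary vector
            total += sign * N[neighbour]
        N[cell] = total
    return N


# Hypothetical 2 x 2 grid of cell counts.
N = corner_box_numbers({(0, 0): 1, (1, 0): 2, (0, 1): 3, (1, 1): 4}, (2, 2))
```

In the plane this reduces to the familiar rule $N(i,j) = n(i,j) + N(i-1,j) + N(i,j-1) - N(i-1,j-1)$.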
Let be a grid with cell lengths placed at the origin, with cells, and corner box numbers , over all cells. Let be a (canonical) grid box with dimensions (array sizes) , and lower left cell . Then the number of points of in , denoted , can be computed in time.
Let be the upper right cell of . By the inclusion-exclusion principle, the corner box number can be computed as follows:
Hence can be extracted from the above formula with at most operations. ∎
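The extraction step can be sketched as follows, assuming a table of corner box numbers is available as a dictionary keyed by cell index tuples. The function name `box_count` and the small table used in the example are hypothetical. The count of a grid box is a signed sum of at most $2^d$ corner numbers, and the box is empty exactly when this count is 0.

```python
from itertools import product


def box_count(N, lower, sizes):
    """Number of points in the grid box with lower-left cell `lower`
    and `sizes[j]` cells along coordinate j.

    N -- dict of corner box numbers (points between cell (0,...,0) and key)
    """
    d = len(lower)
    total = 0
    for b in product((0, 1), repeat=d):
        # b[j] = 0 keeps the upper-right index, b[j] = 1 steps just below `lower`.
        corner = tuple(
            l + s - 1 - bj * s for l, s, bj in zip(lower, sizes, b)
        )
        if min(corner) < 0:
            continue  # a corner outside the grid contributes 0
        total += (-1) ** sum(b) * N[corner]
    return total


# Corner box numbers of a 2 x 2 grid with cell counts 1, 2, 3, 4
# in cells (0,0), (1,0), (0,1), (1,1) respectively.
N = {(0, 0): 1, (1, 0): 3, (0, 1): 4, (1, 1): 10}
is_empty = box_count(N, (1, 1), (1, 1)) == 0
```

Because each query touches a number of table entries that depends only on $d$, sliding a fixed canonical box over all grid positions costs time proportional to the number of cells for fixed $d$.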
Let be the number of points in in the canonical box of dimensions (array sizes) , and lower left cell .
Given a grid with cell lengths placed at the origin, with cells, and corner box numbers , over all cells, the numbers (counts) , over all cells, can be computed in time.
There are cells determined by in , and for each of them we apply the bound of Lemma 9. ∎