Computing Coverage Kernels
Under Restricted Settings
We consider the Minimum Coverage Kernel problem: given a set of d-dimensional boxes, find a subset of minimum size covering the same region as the whole set. This problem is NP-hard, but as for many NP-hard problems on graphs, the problem becomes solvable in polynomial time under restrictions on the graph induced by the boxes. We consider various classes of graphs, show that Minimum Coverage Kernel remains NP-hard even for severely restricted instances, and provide two polynomial time approximation algorithms for this problem.
Given a set of points, and a set of boxes (i.e. axis-aligned closed hyper-rectangles) in -dimensional space, the Box Cover problem consists in finding a set of minimum size such that covers . A special case is the Orthogonal Polygon Covering problem: given an orthogonal polygon with edges, find a set of boxes of minimum size whose union covers . Both problems are NP-hard [CulbersonR94, Fowler1981], but their known approximabilities in polynomial time are different: while Box Cover can be approximated up to a factor within , where is the size of an optimal solution [BronnimannG95, Clarkson2007]; Orthogonal Polygon Covering can be approximated up to a factor within [KumarR03]. In an attempt to better understand what makes these problems hard, and why there is such a gap in their approximabilities, we introduce the notion of coverage kernels and study its computational complexity.
Given a set of d-dimensional boxes, a coverage kernel of the set is a subset covering the same region as the whole set, and a minimum coverage kernel is a coverage kernel of minimum size. The computation of a minimum coverage kernel (namely, the Minimum Coverage Kernel problem) is intermediate between the Orthogonal Polygon Covering and the Box Cover problems. This problem has found applications (under distinct names, and slight variations) in the compression of access control lists in networks [DalyLT16], and in obtaining concise descriptions of structured sets in databases [LakshmananNWZJ02, PuM05]. Since Orthogonal Polygon Covering is NP-hard, the same holds for the Minimum Coverage Kernel problem. We are interested in the exact computation and approximability of Minimum Coverage Kernel in various restricted settings:
Under which restrictions is the exact computation of Minimum Coverage Kernel still NP-hard?
How precisely can one approximate a Minimum Coverage Kernel in polynomial time?
When the interactions between the boxes in a set are simple (e.g., when all the boxes are disjoint), a minimum coverage kernel of the set can be computed efficiently. A natural way to capture the complexity of these interactions is through the intersection graph. The intersection graph of a set of boxes is the undirected graph with a vertex for each box, and in which two vertices are adjacent if and only if the respective boxes intersect. When the intersection graph is a tree, for instance, each box is either completely covered by another, or present in any coverage kernel of the set, and thus a minimum coverage kernel can be computed efficiently. For problems on graphs, a common approach to understanding when an NP-hard problem becomes easy is to study distinct restricted classes of graphs, in the hope of defining some form of “boundary classes” of inputs separating “easy” from “hard” instances [AlekseevBKL07]. Based on this, we study the hardness of the problem under restricted classes of the intersection graph of the input.
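The definition of the intersection graph can be illustrated with a minimal sketch. It assumes that each box is represented as a tuple of closed (lo, hi) intervals, one per dimension; this representation and the function names are chosen here for illustration only. Since the boxes are closed, two boxes that merely touch on their boundary are considered adjacent.

```python
from itertools import combinations

def intersects(a, b):
    """True if the closed boxes a and b share at least one point.

    Each box is a tuple of (lo, hi) pairs, one per dimension
    (an illustrative representation, not taken from the paper).
    """
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def intersection_graph(boxes):
    """Adjacency sets of the intersection graph, indexed by box position."""
    adj = {i: set() for i in range(len(boxes))}
    for i, j in combinations(range(len(boxes)), 2):
        if intersects(boxes[i], boxes[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

if __name__ == "__main__":
    boxes = [((0, 2), (0, 2)), ((2, 3), (0, 1)), ((5, 6), (5, 6))]
    print(intersection_graph(boxes))
```

This quadratic pairwise test is only meant to make the definition concrete; it makes no attempt at efficiency.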
We study the Minimum Coverage Kernel problem under three restrictions of the intersection graph, commonly considered for other problems [AlekseevBKL07]: planarity of the graph, bounded clique-number, and bounded vertex-degree. We show that the problem remains NP-hard even when the intersection graph of the boxes has clique-number at most 4, and the maximum degree is at most 8. For the Box Cover problem we show that it remains NP-hard even under the severely restricted setting where the intersection graph of the boxes is planar, its clique-number is at most 2 (i.e., the graph is triangle-free), the maximum degree is at most 3, and every point is contained in at most two boxes.
We complement these hardness results with two approximation algorithms for the Minimum Coverage Kernel problem running in polynomial time. We describe a -approximation algorithm which runs in time within ; and a randomized algorithm computing a -approximation in expected time within , with high probability (at least ). Our main contribution in this matter is not the existence of polynomial time approximation algorithms (which can be inferred from results on Box Cover), but a new data structure which allows to significantly improve the running time of finding those approximations (when compared to the approximation algorithms for Box Cover). This is relevant in applications where a minimum coverage kernel needs to be computed repeatedly [Agarwal2014, DalyLT16, LakshmananNWZJ02, PuM05].
In the next section we review the reductions between the three problems we consider, and introduce some basic concepts. We then present the hardness results in Section 3, and describe in Section 4 the two approximation algorithms. We conclude in Section 5 with a discussion on the results and future work.
To better understand the relation between the Orthogonal Polygon Covering, the Box Cover and the Minimum Coverage Kernel problems, we briefly review the reductions between them. We describe them in the Cartesian plane, as the generalization to higher dimensions is straightforward.
Let be an orthogonal polygon with horizontal/vertical edges. Consider the grid formed by drawing infinitely long lines through each edge of (see Figure 1.a for an illustration), and let be the set of points of this grid lying on the intersection of two lines. Create a set of boxes as follows: for each pair of points in , if the box having those two points as opposed vertices is completely inside , then add it to (see Figure 1.b.) Let be any set of boxes covering . Note that for any box , either the vertices of are in , or can be extended horizontally and/or vertically (keeping inside ) until this property is met. Hence, there is at least one box in that covers each , respectively, and thus there is a subset covering with . Therefore, any minimum coverage kernel of is also an optimal covering of (and thus, transferring the NP-hardness of the Orthogonal Polygon Covering problem [CulbersonR94] to the Minimum Coverage Kernel problem).
Now, let be a set of boxes, and consider the grid formed by drawing infinite lines through the edges of each box in . This grid has within cells ( when generalized to dimensions). Create a point-set as follows: for each cell which is completely inside a box in we add to the middle point of (see Figure 1.c for an illustration). We call such a point-set a coverage discretization of , and denote it as . Note that a set covers if and only if covers the same region as (namely, is a coverage kernel of ). Therefore, the Minimum Coverage Kernel problem is a special case of the Box Cover problem.
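The reduction above can be sketched as follows, assuming full-dimensional boxes given as tuples of closed (lo, hi) intervals, one per dimension. The function names are hypothetical, and the exhaustive search for a minimum kernel is included only to make the definition concrete on tiny inputs; it is not one of the algorithms of this paper.

```python
from itertools import combinations, product

def contains(box, p):
    """True if point p lies in the closed box."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, p))

def discretize(boxes):
    """Coverage discretization: the middle point of every grid cell
    that lies inside at least one box."""
    d = len(boxes[0])
    # Sorted distinct box coordinates along each dimension define the grid.
    cuts = [sorted({c for b in boxes for c in b[k]}) for k in range(d)]
    # Middle coordinate of each elementary interval along each dimension.
    mids = [[(u + v) / 2 for u, v in zip(c, c[1:])] for c in cuts]
    return [p for p in product(*mids) if any(contains(b, p) for b in boxes)]

def is_coverage_kernel(subset, boxes):
    """A subset is a coverage kernel iff it covers the discretization."""
    return all(any(contains(b, p) for b in subset) for p in discretize(boxes))

def minimum_coverage_kernel(boxes):
    """Exhaustive search over subsets by increasing size (exponential time)."""
    pts = discretize(boxes)
    for k in range(1, len(boxes) + 1):
        for cand in combinations(boxes, k):
            if all(any(contains(b, p) for b in cand) for p in pts):
                return list(cand)
    return []

if __name__ == "__main__":
    A, B, C = ((0, 2), (0, 1)), ((1, 3), (0, 1)), ((2, 4), (0, 1))
    print(minimum_coverage_kernel([A, B, C]))
```

Because box boundaries lie on grid lines, the middle point of a cell is inside a box exactly when the whole cell is, which is why testing middle points suffices in this sketch.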
The relation between the Box Cover and the Minimum Coverage Kernel problems has two main implications. Firstly, hardness results for the Minimum Coverage Kernel problem can be transferred to the Box Cover problem. In fact, we do this in Section 3, where we show that Minimum Coverage Kernel remains NP-hard under severely restricted settings, and extend this result to the Box Cover problem under even more restricted settings. The other main implication is that polynomial-time approximation algorithms for the Box Cover problem can also be used for Minimum Coverage Kernel. However, in scenarios where the boxes represent high dimensional data [DalyLT16, LakshmananNWZJ02, PuM05] and coverage kernels need to be computed repeatedly [Agarwal2014], using approximation algorithms for Box Cover can be impractical. This is because constructing the coverage discretization requires time and space that grow exponentially with the dimension. We deal with this in Section 4, where we introduce a data structure to index the coverage discretization without constructing it explicitly. Then, we show how to improve two existing approximation algorithms [BronnimannG95, Lovasz75] for the Box Cover problem by using this index, making it possible to use them for the Minimum Coverage Kernel problem in the scenarios mentioned above.
3 Hardness under Restricted Settings
We prove that Minimum Coverage Kernel remains NP-hard for restricted classes of the intersection graph of the input set of boxes. We consider three main restrictions: when the graph is planar, when the size of its largest clique (namely the clique-number of the graph) is bounded by a constant, and when the degree of a vertex with maximum degree (namely the vertex-degree of the graph) is bounded by a constant.
3.1 Hardness of Minimum Coverage Kernel
Consider the k-Coverage Kernel problem: given a set of boxes, decide whether there are k boxes in the set covering the same region as the entire set. Proving that k-Coverage Kernel is NP-complete under restricted settings yields the NP-hardness of Minimum Coverage Kernel under the same conditions. To prove that k-Coverage Kernel is NP-hard under restricted settings we reduce instances of the Planar 3-SAT problem (a classical NP-complete problem [MulzerR08]) to restricted instances of k-Coverage Kernel. In the Planar 3-SAT problem, given a boolean formula in 3-CNF whose incidence graph is planar, the goal is to decide whether there is an assignment which satisfies the formula. (The incidence graph of a 3-SAT formula is a bipartite graph with a vertex for each variable and each clause, and an edge between a variable vertex and a clause vertex for each occurrence of a variable in a clause.) The (planar) incidence graph of any planar 3-SAT formula can be represented in the plane as illustrated in Figure 2, where all variables lie on a horizontal line, and all clauses are represented by non-intersecting three-legged combs [KnuthR92]. We refer to such a representation as the planar embedding of the formula.
Based on this planar embedding we prove the result of Theorem 3.1. Although our arguments are described in two dimensions, they extend trivially to higher dimensions.
Theorem 3.1. Solving k-Coverage Kernel over a set of boxes in the plane is NP-complete even if the intersection graph of the boxes has clique-number at most 4, and vertex-degree at most 8.
Given any set of boxes and any subset of it, certifying that the subset covers the same region as the whole set can be done in polynomial time using Chan’s algorithm [Chan2013] for computing the volume of the union of a set of boxes. Therefore, k-Coverage Kernel is in NP. To prove that it is NP-complete, given a planar 3-SAT formula, we construct a set of boxes which has a coverage kernel of a prescribed size (determined by the numbers of variables and clauses) if and only if there is an assignment of the variables satisfying the formula. We use the planar embedding of the formula as a starting point, and replace the components corresponding to variables and clauses, respectively, by gadgets composed of several boxes, adding a total number of boxes polynomial in the number of variables and clauses. We show that this construction can be obtained in polynomial time, and thus any polynomial time solution to k-Coverage Kernel yields a polynomial time solution for Planar 3-SAT.
Let be a variable of a planar 3-SAT formula , and let be the number of clauses of in which appears. The gadget for is composed of rectangles colored either red or blue (see Figure 4 for an illustration): horizontal rectangles (of units of size), separated into two “rows” with rectangles each, and two vertical rectangles (of units of size) connecting the rows. The rectangles in each row are enumerated from left to right, starting from one. The -th rectangle of the -th row is defined by the product of intervals , for all and . The gadget occupies a rectangular region of units. Although the gadget is defined with respect to the origin of coordinates, it is later translated to the region corresponding to in the embedding of Knuth and Raghunathan [KnuthR92], which we assume without loss of generality to be large enough to fit the gadget. Every horizontal rectangle is colored red if its numbering is odd, and blue otherwise. In addition, the vertical leftmost (resp. rightmost) rectangle is colored blue (resp. red). As we will see later, these colors are useful when connecting a clause gadget with its variables.
Observe that: (i) every red (resp. blue) rectangle intersects exactly two others, both blue (resp. red), sharing with each a square region of units (which we call redundant regions); (ii) the optimal way to cover the redundant regions is by choosing either all the red rectangles or all the blue rectangles (see Figure 4 for an example).
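The second observation can be checked on a small abstract model. The sketch below assumes that the red/blue rectangles of a variable gadget are modeled as the vertices of an even cycle (colors alternating along the cycle) and the redundant regions as its edges, so that covering every redundant region amounts to choosing a vertex cover of the cycle. The cycle length 8 and the function name are arbitrary choices made for illustration.

```python
from itertools import combinations

def min_vertex_covers(n):
    """All minimum vertex covers of the cycle 0-1-...-(n-1)-0, by brute force."""
    edges = [(i, (i + 1) % n) for i in range(n)]
    for k in range(n + 1):
        found = [set(c) for c in combinations(range(n), k)
                 if all(u in c or v in c for u, v in edges)]
        if found:            # the first non-empty size is the minimum one
            return found
    return []

if __name__ == "__main__":
    # Even positions play the role of one color, odd positions the other.
    for cover in min_vertex_covers(8):
        print(sorted(cover))
```

On an even cycle the search returns exactly two minimum covers, the even positions and the odd positions, which correspond to taking all the rectangles of one color.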
Let be a clause with variables , , and , appearing in this order from left to right in the embedding of . Assume, without loss of generality, that the component for in the embedding is above the variables. We create a gadget for composed of 9 black rectangles, located and enumerated as in Figure 5. The vertical rectangles numbered 1, 2 and 3 correspond to the legs of in the embedding, and connect with the gadgets of , and , respectively. The remaining six horizontal rectangles connect the three legs between them. The vertical rectangles have one unit of width and their height is given by the height of the respective legs in the embedding of . Similarly, the horizontal rectangles have one unit of height and their width is given by the separation between the legs in the embedding of (see Figure 3 for an example of how these rectangles are extended or stretched as needed). Note that: (i) every rectangle in the gadget intersects exactly two others (again, we call redundant regions the regions where they meet); (ii) any minimum cover of the redundant regions (edges in Figure 5) has five rectangles, one of which must be a leg; and (iii) any cover of the redundant regions which includes the three legs must have at least six rectangles (e.g., see Figure 5).
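The three properties noted for the clause gadget can be checked by brute force on an abstract model. The sketch below assumes, since Figure 5 is not reproduced here, that the nine rectangles form a single cycle in which the three legs are separated by two horizontal rectangles each; the vertex numbering (legs at positions 0, 3 and 6) is hypothetical and does not follow the numbering of the figure.

```python
from itertools import combinations

N = 9                                   # rectangles of the clause gadget
LEGS = frozenset({0, 3, 6})             # assumed positions of the three legs
EDGES = [(i, (i + 1) % N) for i in range(N)]   # one edge per redundant region

def covers_of_size(k, required=frozenset()):
    """Covers of all redundant regions using k rectangles, including `required`."""
    return [set(c) for c in combinations(range(N), k)
            if required <= set(c) and all(u in c or v in c for u, v in EDGES)]

def smallest_cover(required=frozenset()):
    """Minimum cover size and all covers of that size, by exhaustive search."""
    for k in range(N + 1):
        found = covers_of_size(k, required)
        if found:
            return k, found
    return None

if __name__ == "__main__":
    size, covers = smallest_cover()
    print(size, all(c & LEGS for c in covers))   # minimum size, every cover has a leg
    print(smallest_cover(LEGS)[0])               # minimum size when all legs are forced
```

Under this model, a minimum cover has five rectangles and always contains a leg, while forcing the three legs raises the minimum to six, matching properties (ii) and (iii).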
Connecting the gadgets.
Let be a variable of a formula and be the number of clauses in which occurs. The legs of the clause gadgets are connected with the gadget for , from left to right, in the same order they appear in the embedding of . Let be the gadget for a clause containing whose component in the embedding of is above (resp. below) that for . The clause gadget connects with the gadget for in one of the rectangles in the upper (resp. lower) row, sharing a region of units with one of the red (resp. blue) rectangles if the variable appears positive (resp. negative) in the clause (see Figure 6). We call the region where the variable and clause gadgets meet a connection region, and its color is given by the color of the respective rectangle in the variable gadget. Note that a variable gadget has enough connection regions for all the clauses in which it appears, because each row of the gadget has rectangles of each color.
Completing the instance.
Each rectangle in a variable or clause gadget, as described, has a region that no other rectangle covers (i.e., of depth 1). Thus, the coverage kernel of the instances described up to here is trivial: all the rectangles. To avoid this, we cover all the regions of depth 1 with green rectangles (as illustrated in Figure 6) which are forced to be in any coverage kernel. (For simplicity, these green rectangles were omitted in Figure 3 and Figure 7.) For every clause gadget we add 11 such green rectangles, and for each variable gadget for a variable occurring in clauses we add green rectangles.
Let be a formula with variables and clauses. The instance of k-Coverage Kernel that we create for has a total of rectangles: (i) each clause gadget has 9 rectangles for the comb, and 11 green rectangles, for a total of rectangles over all the clauses; (ii) a gadget for a variable has red and blue rectangles, and we add a green rectangle for each of those that does not connect to a clause gadget ( per variable), thus adding a total of rectangles per gadget; and (iii) over all variables, we add a total of rectangles. (Note that since exactly 3 variables occur in each clause.)
Intuition: from minimum kernels to boolean values.
Consider a gadget for a variable . Any minimum coverage kernel of the gadget is composed of all its green rectangles together with either all its blue or all its red rectangles. Thus the minimum number of rectangles needed to cover all the variable gadgets is fixed, and known. If all the red rectangles are present in the kernel, we consider that , otherwise if all the blue rectangles are present, we consider that (see Figure 7 for an example).
In the same way that choosing a value for may affect the output of a clause in which occurs, choosing a color to cover the gadget for may affect the number of rectangles required to cover the gadget for . For instance, consider that the gadget for is covered with blue rectangles (i.e., ), and that occurs unnegated in (see the gadgets for and the second clause in Figure 7). The respective leg of the gadget for meets in one of its red rectangles. Since red rectangles were not selected to cover the variable gadget, that leg is forced to cover the connection region shared with the variable gadget, and thus is forced to be in any kernel of the gadget for . This corresponds to the fact that the literal of in evaluates to 0. If the same happens for the three variables in (i.e., is not satisfied by the assignment), then to cover its gadget at least six of the black rectangles will be required (see the gadget for the second clause in Figure 7). However, if at least one of its legs can be disposed of (i.e., at least one of the literals evaluates to 1), then the clause gadget can be covered with five of its black rectangles (see the gadget for the first clause in Figure 7). The minimum number of rectangles needed to cover all the variable gadgets is fixed and known: all the green rectangles, and one half of the red/blue rectangles of each variable. Therefore, it suffices to show that there is an assignment satisfying a 3-SAT formula if and only if every clause gadget can be covered by five of its black rectangles (plus all its green rectangles).
We prove the theorem in two steps. First, we show that such an instance has a coverage kernel of size if and only if the formula is satisfiable. Therefore, answering k-Coverage Kernel over this instance with yields an answer for Planar 3-SAT on . Finally, we show that the instance described matches all the restrictions in Theorem 3.1, under minor variations.
Let be a 3-CNF formula with variables and clauses , let be a set of boxes created as described above for , and let be an assignment which satisfies . We create a coverage kernel of as follows:
For each variable gadget for such that (resp. ), add to all but its red (resp. blue) rectangles, thus covering the entire gadget minus its red (resp. blue) connection regions. These uncovered regions, which must connect with clauses in which the literal of evaluates to 0, will be covered later with the legs of the clause gadgets. Over all variables, we add to a total of rectangles.
For each clause , add to all its green rectangles and the legs that connect with connection regions of the variable gadgets left uncovered in the previous step. Note that at least one of the legs of is not added to since at least one of the literals in the clause evaluates to 1, and the connection region corresponding to that literal is already covered by the variable gadget. Thus, the redundant regions of the gadget for can be covered with five of its black rectangles (including the legs already added). So, finally add to black rectangles from the clause for as needed (up to a total of five), until all the redundant regions are covered (and with the green rectangles, the entire gadget). Over all clauses, we add a total of rectangles.
By construction, is a coverage kernel of : it covers completely every variable and clause gadget. Moreover, the size of is .
Let be a 3-CNF formula with variables and clauses , let be a set of boxes created as described above for , and let be a coverage kernel of whose size is . Any coverage kernel of must include all its green rectangles. Furthermore, to cover any clause gadget at least five of its black rectangles are required, and to cover any variable gadget, at least half of its red/blue rectangles are required. Thus, any coverage kernel of must have at least of the clause rectangles, and at least of the variable rectangles. Hence, must be a coverage kernel of minimum size.
Since the redundant regions of any two gadgets are independent, must cover each gadget optimally (in a local sense). Given that the intersection graph of the red/blue rectangles of a variable gadget is a ring (see Figure 4.d for an illustration), the only way to cover a variable gadget optimally is by choosing either all its blue or all its red rectangles (together with the green rectangles). Hence, the way in which every variable gadget is covered is consistent with an assignment for its variable as described before in the intuition. Moreover, the assignment induced by must satisfy : in each clause gadget, at least one of the legs was discarded (to cover the gadget with 5 rectangles), and at least the literal in the clause corresponding to that leg evaluates to 1.
Meeting the restrictions.
Now we prove that the instance of Minimum Coverage Kernel generated for the reduction meets the restrictions of the theorem. First, we show the bounded clique-number and vertex-degree properties for the intersection graph of a clause gadget and its three respective variable gadgets. In Figure 8 we illustrate the intersection graph for the clause . The sign of the variables in the clause does not change the maximum vertex degree or the clique-number of the graph, so the figure is general enough for our purpose. Since we consider the rectangles composing the instance to be closed rectangles, if two rectangles contain at least one point in common (in their interior or boundary), their respective vertices in the intersection graph are adjacent. Note that in Figure 8 the vertices with highest degree are the ones corresponding to legs of the clause gadget (rectangles 1, 2, and 3). There are 4-cliques in the graph, for instance the right lower corner of the green rectangle denoted is covered also by rectangles and , and hence their respective vertices form a clique. However, since there is no point that is covered by five rectangles at the same time, there are no 5-cliques in the graph.
Finally, note that, since the clause gadgets are located according to the planar embedding of the formula, they are pairwise independent. (Two clause gadgets are pairwise independent if the rectangles composing them are pairwise independent, as well as the rectangles where they connect with their respective variable gadgets.) Thus, the bounds on the clique-number and vertex-degree of the intersection graph of any clause gadget extend also to the intersection graph of an entire general instance. ∎
3.2 Extension to Box Cover
Since the Minimum Coverage Kernel problem is a special case of the Box Cover problem, the result of Theorem 3.1 also applies to the Box Cover problem. However, in Theorem 3.2 we show that this problem remains hard under even more restricted settings.
Theorem 3.2. Solving Box Cover over a set of points and a set of boxes in the plane is NP-complete even if every point is covered by at most two of the boxes, and the intersection graph of the boxes is planar, has clique-number at most 2, and vertex-degree at most 4.
We use the same reduction from Planar 3-SAT, but with three main variations in the gadgets: we drop the green rectangles of both the variable and clause gadgets, add points within the redundant and connection region of both variable and clause gadgets, and separate the rectangles numbered 5 and 6 of each clause gadget so they do not intersect (see Figure 9 for an example).
Since the interior of every connection or redundant region is covered by at most two of the rectangles in the gadgets, every point of the instance we create is contained in at most two boxes. In Figure 9.b we illustrate the intersection graph for the clause . Since the sign of the variables in the clause does not change the maximum vertex degree or the clique-number of the graph, or its planarity, the properties we mention next are also true for any clause. Note that three is the maximum vertex-degree of the intersection graph, and that there are no 3-cliques. Also note that the intersection graph can be drawn within the planar embedding of so that no two edges cross, and hence the graph is planar. Again, due to the pairwise independence of the clause gadgets, these properties extend to the entire intersection graph of a general instance. ∎
In the next section, we complement these hardness results with two approximation algorithms for the Minimum Coverage Kernel problem.
4 Efficient approximation of Minimum Coverage Kernels
Let be a set of boxes in , and let be a coverage discretization of (as defined in Section 2). A weight index for is a data structure which can perform the following operations:
Initialization: Assign an initial unitary weight to every point in the discretization;
Query: Given a box, find the total weight of the points within it;
Update: Given a box, multiply the weights of all the points within it by a given value.
We assume that the weights are small enough so that arithmetic operations over the weights can be performed in constant time. There is a trivial implementation of a weight index with initialization and update time within , and with constant query time. In this section we describe an efficient implementation of a weight index, and combine this data structure with two existing approximation algorithms for the Box Cover problem [Lovasz75, BronnimannG95] to obtain improved approximation algorithms (in the running time sense) for the Minimum Coverage Kernel problem.
4.1 An Efficient Weight Index for a Set of Boxes
We describe a weight index for which can be initialized in time within , and with query and update time within . Let us consider first the case of a set of intervals.
A weight index for a set of intervals.
A trivial weight index which explicitly saves the weights of each point in can be initialized in time within , has linear update time, and constant query time. We show that by sacrificing query time (by a factor within ) one can improve update time to within . The main idea is to maintain the weights of each point of indirectly using a tree.
Consider a balanced binary tree whose leaves are in one-to-one correspondence with the values in (from left to right in a non-decreasing order). Let denote the point corresponding to a leaf node of the tree. In order to represent the weights of the points in , we store a value at each node of the tree subject to the following invariant: for each leaf , the weight of the point equals the product of the values of all the ancestors of (including itself). The values allow us to increase the weights of many points with only a few changes. For instance, if we want to double the weights of all the points we simply multiply by 2 the value of the root of the tree. Besides the values, to allow efficient query time we also store at each node three values : the values and are the minimum and maximum , respectively, such that is a leaf of the tree rooted at ; the value is the sum of the weights of all such that is a leaf of the tree rooted at .
Initially, all the values are set to one. Besides, for every leaf of the tree is set to one, while and are set to . The , and values of every internal node with children , are initialized in a bottom-up fashion as follows: ; ; . It is simple to verify that after this initialization, the tree meets all the invariants mentioned above. We show in Theorem 4.1 that this tree can be used as a weight index for .
Let be a set of intervals in . There exists a weight index for which can be initialized in time within , and with query and update time within .
Since intervals have linear union complexity, has within points, and it can be computed in linear time after sorting, for a total time within . We store the points in the tree described above. Its initialization can be done in linear time since the tree has within nodes, and when implemented in a bottom-up fashion, the initialization of the and values, respectively, cost constant time per node.
To analyze the query time, let denote the procedure which finds the total weight of the points corresponding to leaves of the tree rooted at that are in the interval . This procedure can be implemented as follows:
1. if is disjoint from return 0;
2. if completely contains return ;
3. if both conditions fail (leaves must meet either 1. or 2.), let be the left and right child of , respectively:
(a) if return ;
(b) if return ;
(c) otherwise return .
Due to the invariants to which the and values are subjected, every leaf of corresponding to a point in has an ancestor (including itself) which is visited during the call to totalWeight and which meets the condition in step 2. For this, and because of the invariants to which the and values are subjected, the procedure totalWeight is correct. Note that the number of nodes visited is at most 4 times the height of the tree: when both children need to be visited, one of the endpoints of the interval to query is replaced by , which ensures that in subsequent calls at least one of the children is completely covered by the query interval. Since , and the operations at each node consume constant time, the running time of totalWeight is within .
Similarly, to analyze the update time, let denote the procedure which multiplies by a value the weights of the points in the interval stored in leaves descending from . This can be implemented as follows:
1. if is disjoint from , finish;
2. if completely contains set , set , and finish;
3. if both conditions fail, let be the left and right child of , respectively:
(a) if , call ;
(b) else if , call ;
(c) otherwise, call , and ;
4. finally, after the recursive calls set , and finish.
Note that, for every point in corresponding to a leaf descending from , the value of exactly one of the ancestors of changes (by a factor of ): at least one changes because of the invariants to which the and values are subjected (as analyzed for totalWeight); and no more than one can change because once is assigned for the first time to some ancestor of , the procedure finishes leaving the descendants of untouched. The analysis of the running time is analogous to that of totalWeight, and thus within . ∎
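The tree and its two procedures can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedures analyzed above: the field names (`lo`, `hi`, `mu`, `omega`) and function names are hypothetical, the sum stored at a node is taken to include the node's own multiplicative value but not those of its ancestors, and the recursion simply descends into both children and relies on the disjointness test instead of first deciding which child to visit.

```python
class Node:
    """Tree node; field names are illustrative, not the paper's notation."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi    # smallest and largest point stored below
        self.mu = 1                  # multiplicative value kept at this node
        self.omega = 1               # weight sum of the subtree (ancestors excluded)
        self.left = self.right = None

def build(points):
    """Balanced tree over the sorted distinct points; all weights start at 1."""
    pts = sorted(set(points))
    def rec(i, j):                   # builds the subtree over pts[i:j]
        node = Node(pts[i], pts[j - 1])
        if j - i > 1:
            m = (i + j) // 2
            node.left, node.right = rec(i, m), rec(m, j)
            node.omega = node.left.omega + node.right.omega
        return node
    return rec(0, len(pts))

def total_weight(v, a, b):
    """Total weight of the points below v lying in [a, b], relative to v."""
    if b < v.lo or v.hi < a:         # query disjoint from the subtree
        return 0
    if a <= v.lo and v.hi <= b:      # subtree completely inside the query
        return v.omega
    # A leaf never reaches this point: it is either disjoint or contained.
    return v.mu * (total_weight(v.left, a, b) + total_weight(v.right, a, b))

def update(v, a, b, alpha):
    """Multiply by alpha the weights of the points below v lying in [a, b]."""
    if b < v.lo or v.hi < a:
        return
    if a <= v.lo and v.hi <= b:      # one change covers the whole subtree
        v.mu *= alpha
        v.omega *= alpha
        return
    update(v.left, a, b, alpha)
    update(v.right, a, b, alpha)
    v.omega = v.mu * (v.left.omega + v.right.omega)

if __name__ == "__main__":
    root = build([1, 2, 3, 4, 5])
    update(root, 2, 3, 2)            # weights become 1, 2, 2, 1, 1
    print(total_weight(root, 1, 5))
```

Called on the root, `total_weight` returns actual weights, since the root has no ancestors whose values would still need to be multiplied in.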
The weight index for a set of intervals described in Theorem 4.1 plays an important role in obtaining an index for a higher dimensional set of boxes. In a step towards that, we first describe how to use one dimensional indexes to obtain indexes for another special case of sets of boxes, this time in high dimension.
A weight index for a set of slabs.
A box is said to be a slab within another box if covers completely in all but one dimension (see Figure 10.a for an illustration).
Let be a set of -dimensional boxes that are slabs within another -dimensional box . Let denote the set of the boxes restricted to . We describe a weight index for with initialization time within the size , and with update and query time within .
For all , let be the subset of slabs that are orthogonal to the -th dimension, and let be the set of intervals resulting from projecting and each rectangle in to the -th dimension (see Figure 10.b for an illustration). The key to obtain an efficient weight index for a set of slabs is the fact that weight indexes for can be combined without much extra computational effort into a weight index for . Let be a point and let denote the value of the -th coordinate of . Observe that for all , (see Figure 10.b for an illustration). This allows the representation of the weight of each point by means of the weights of for all . We do this by maintaining the following weight invariant: the weight of a point is equal to .
Lemma 1
Let be a set of -dimensional boxes that are equivalent to slabs when restricted to another -dimensional box . There exists a weight index for which can be initialized in time within , and with query and update time within .
Let be the subset of orthogonal to the -th dimension, and let be the set of intervals resulting from projecting and each rectangle in to the -th dimension. Initialize a weight index for as in Theorem 4.1, for all . Since the weights of all the points in the one dimensional indexes are initialized to one, the weight of every point in is also initialized to one, according to the weight invariant. This initialization can be done in total time within .
Let be a box which covers in every dimension except for the -th one, for some (i.e., ), and let be the subset of contained within the projection of to the -th dimension. The set of points of that are within can be generated by the expression . Therefore, the total weight of the points within is given by the total weight of the points for all multiplied by the total weight of the points in , for all distinct from .
To query the total weight of the points of within a box we query the weight index of to find the total weight of the points in the projection of to the -th dimension (in time within ), then query the remaining indexes to find the total weight stored in the index (stored at the value of the root of the trees), and return the product of those values. Clearly the running time is within .
The update is similar: to multiply by a value the weight of all the points of within a box we simply update the weight index of multiplying by all the weights of the points within the projection of to the -th dimension, and leave the other weight indexes untouched. The running time of this operation is also within , and the invariant remains valid after the update. ∎
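The construction in this proof can be illustrated with a minimal sketch. It assumes that the points along each dimension are simply the positions 0..n-1, and it uses a lazy segment tree as a stand-in for the one-dimensional index of Theorem 4.1; the class names `IntervalWeightIndex` and `SlabWeightIndex`, and the push-down of pending factors, are choices of this sketch rather than details of the original construction.

```python
from math import prod


class IntervalWeightIndex:
    """Weights on positions 0..n-1, all initially 1.

    Supports range multiplication and range total, both in O(log n).
    """

    def __init__(self, n):
        self.n = n
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.total = [0] * (2 * self.size)  # total weight below each node
        self.mult = [1] * (2 * self.size)   # factor still pending for the children
        for i in range(n):
            self.total[self.size + i] = 1
        for v in range(self.size - 1, 0, -1):
            self.total[v] = self.total[2 * v] + self.total[2 * v + 1]

    def _apply(self, v, alpha):
        self.total[v] *= alpha
        self.mult[v] *= alpha

    def _walk(self, v, lo, hi, l, r, alpha):
        """Total weight in [l, r) below node v; multiplied by alpha first if given."""
        if r <= lo or hi <= l:
            return 0
        if l <= lo and hi <= r:
            if alpha is not None:
                self._apply(v, alpha)
            return self.total[v]
        # Push the pending factor down before descending.
        self._apply(2 * v, self.mult[v])
        self._apply(2 * v + 1, self.mult[v])
        self.mult[v] = 1
        mid = (lo + hi) // 2
        result = (self._walk(2 * v, lo, mid, l, r, alpha)
                  + self._walk(2 * v + 1, mid, hi, l, r, alpha))
        self.total[v] = self.total[2 * v] + self.total[2 * v + 1]
        return result

    def query(self, l, r):
        return self._walk(1, 0, self.size, l, r, None)

    def update(self, l, r, alpha):
        self._walk(1, 0, self.size, l, r, alpha)


class SlabWeightIndex:
    """One 1D index per dimension; a point's weight is the product of the
    weights of its coordinates in the corresponding 1D indexes."""

    def __init__(self, sizes):
        self.axes = [IntervalWeightIndex(n) for n in sizes]

    def query(self, axis, l, r):
        """Total weight of the points inside a slab orthogonal to `axis`
        whose projection on that axis is [l, r)."""
        inside = self.axes[axis].query(l, r)
        others = prod(ix.query(0, ix.n)
                      for j, ix in enumerate(self.axes) if j != axis)
        return inside * others

    def update(self, axis, l, r, alpha):
        """Multiply by alpha the weight of every point inside the slab."""
        self.axes[axis].update(l, r, alpha)
```

Only the index of the dimension orthogonal to the slab is touched by an update, which is why the grid of points never has to be materialized.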
Lemma 1 shows that there are weight indexes for a set of slabs within another box that significantly improve on the approach of explicitly constructing , an improvement that grows exponentially with the dimension . We take advantage of this to describe a similar improvement for the general case.
A Weight Index for The General Case.
We now show how to maintain a weight index of a general set of -dimensional boxes. The main idea is to partition the space into cells such that, within each cell , any box either completely contains or is equivalent to a slab. Then, we use weight indexes for slabs (as described in Lemma 1) to index the weights within each of the cells. This approach was first introduced by Overmars and Yap [Overmars1991] in order to compute the volume of the region covered by a set of boxes, and similar variants have been used since then to compute other measures [2017-COCOON-DepthDistributionInHighDimension-BabrayPerezRojas, Chan2013, YildizHershbergerSuri11]. The following lemma summarizes the key properties of the partition we use:
Lemma 2 (Lemma 4.2 of Overmars and Yap [Overmars1991])
Let be a set of boxes in -dimensional space. There exists a binary partition tree for storing any subset of such that
It can be computed in time within , and it has nodes;
Each box is stored in leaves;
The boxes stored in a leaf are slabs within the cell corresponding to the node;
Each leaf stores no more than boxes.
Consider the tree of Lemma 2. Analogously to the case of intervals, we augment this tree with information to support the operations of a weight index efficiently. At every node we store two values : the first makes it possible to multiply all the weights of the points of that are maintained in leaves descending from (supporting updates efficiently), while stores the total weight of these points (supporting queries efficiently). To ensure that all and only the nodes that intersect a box are visited during a query or update operation, we store at each node the boundaries of the cell corresponding to that node. Furthermore, at every leaf node we implicitly represent the points of that are inside the cell corresponding to using a weight index for slabs.
To initialize this data structure, all the values are set to one. Then the weight indexes within the leaf cells are initialized. Finally, the values of every node with children , are initialized in a bottom-up fashion setting . We show in Theorem 4.2 how to implement the weight index operations over this tree and we analyze their running times.
Theorem 4.2
Let be a set of -dimensional boxes. There is a weight index for which can be initialized in time within , and with query and update time within .
The initialization of the index, when implemented as described before, runs in constant time for each internal node of the tree, and in time within for each leaf (due to Lemma 1, and to the last item of Lemma 2). Since the tree has nodes, the total running time of the initialization is within .
Since the implementations of the query and update operations are analogous to those for the intervals weight index (see the proof of Theorem 4.1), we omit the details of their correctness. While performing a query/update operation at most leaves are visited (due to the second item of Lemma 2), and since the height of the tree is within , at most internal nodes are visited in total. The cost of a query/update operation within each leaf is within (by Lemma 1), and is constant within each internal node. Thus, the total running time of a query/update operation is within . ∎
4.2 Practical approximation algorithms for Minimum Coverage Kernel.
Approximating the Minimum Coverage Kernel of a set of boxes via approximation algorithms for the Box Cover problem requires that is explicitly constructed. However, the weight index described in the proof of Theorem 4.2 can be used to significantly improve the running time of these algorithms. We describe below two examples.
The first algorithm we consider is the greedy -approximation algorithm of Lovász [Lovasz75]. The greedy strategy applies naturally to the Minimum Coverage Kernel problem: iteratively pick the box which covers the most yet uncovered points of , until there are no points of left to cover. To avoid the explicit construction of , three operations must be simulated: (i) find how many uncovered points are within a given box ; (ii) delete the points that are covered by a box ; and (iii) find whether a subset of covers all the points of .
For the first two we use the weight index described in the proof of Theorem 4.2: to delete the points within a given box we simply multiply the weights of all the points of within by ; and finding the number of uncovered points within a box is equivalent to finding the total weight of the points of within . For the last of the three operations we use the following observation:
Let be a set of -dimensional boxes, and let be a subset of . The volume of the region covered by equals that of if and only if and cover the exact same region.
Let denote the size of a minimum coverage kernel of , and let denote the size of (). The greedy algorithm of Lovász [Lovasz75], when run over the sets and works in steps; and at each stage a box is added to the solution. The size of the output is within . This algorithm can be modified to achieve the following running time, while preserving the same approximation ratio:
Theorem 4.3
Let be a set of boxes in with a minimum coverage kernel of size . Then, a Coverage Kernel of of size within can be computed in time within .
We initialize a weight index as in Theorem 4.2, which can be done in time , and compute the volume of the region covered by , which can be done in time within [Chan2013]. Let be an empty set. At each stage of the algorithm, for every box we compute the total weight of the points inside (which can be done in time within using the weight index). We add to the box with the highest total weight, and update the weights of all the points within this box to zero (by multiplying their weights by ) in time within . If the volume of the region covered by (which can be computed in -time [Chan2013]) is the same as that of , then we stop and return as the approximated solution. The total running time of each stage is within . This, and the fact that the number of stages is within yield the result of the theorem. ∎
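The control flow of this greedy procedure can be sketched as follows. This is only an illustration under loud assumptions: boxes are given as `(lo, hi)` pairs of coordinate tuples, the points are replaced by the full-dimensional grid cells induced by the box boundaries, and an explicit dictionary of cell weights stands in for the weight index of Theorem 4.2 (so the sketch has none of the running-time guarantees of the theorem). The function names are hypothetical.

```python
from itertools import product
from math import prod


def inside(cell, box):
    """True if the grid cell (one (a, b) interval per dimension) lies in the box."""
    lo, hi = box
    return all(lo[i] <= a and b <= hi[i] for i, (a, b) in enumerate(cell))


def volume(cell):
    return prod(b - a for a, b in cell)


def greedy_coverage_kernel(boxes):
    """Greedy selection: repeatedly take the box covering the most uncovered cells."""
    d = len(boxes[0][0])
    # Distinct box boundaries along each dimension define the grid.
    coords = [sorted({c for lo, hi in boxes for c in (lo[i], hi[i])})
              for i in range(d)]
    grid = product(*(zip(cs, cs[1:]) for cs in coords))
    # Naive stand-in for the weight index: weight 1 = uncovered, 0 = covered.
    weight = {cell: 1 for cell in grid if any(inside(cell, b) for b in boxes)}
    target = sum(volume(c) for c in weight)  # volume of the region covered by all boxes
    kernel, covered = [], 0
    while covered < target:
        # "Query": total weight of the cells inside each box.
        best = max(boxes,
                   key=lambda b: sum(w for c, w in weight.items() if inside(c, b)))
        kernel.append(best)
        # "Update": multiply by zero the weight of every cell inside the chosen box.
        for c in weight:
            if weight[c] and inside(c, best):
                covered += volume(c)
                weight[c] = 0
    return kernel
```

The stopping test compares volumes, mirroring the observation that a subset covers the same region exactly when its covered volume matches that of the whole set.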
Now, we show how to improve the approximation algorithm of Brönnimann and Goodrich [BronnimannG95] via a weight index. First, we describe their main idea. Let be a weight function for the points of , and for a subset let denote the total weight of the points in . A point is said to be -heavy, for a value , if , and -light otherwise. A subset is said to be an -net with respect to if for every -heavy point there is a box in which contains . Let denote the size of a minimum coverage kernel of , and let be an integer such that . The algorithm initializes the weight of each point in to 1, and repeats the following weight-doubling step until every point is -heavy: find a -light point and double the weights of all the points within every box . When this process stops, it returns a -net with respect to the final weights as the approximated solution.
Since each point in is -heavy, covers all the points of . Hence, if a -net of size can be computed efficiently, this algorithm computes a solution of size . Besides, Brönnimann and Goodrich [BronnimannG95] showed that for a given , if more than weight-doubling steps are performed, then . This makes it possible to guess the correct via exponential search, and to bound the maximum weight of any point by (so that the weights can be represented with bits). See the article of Brönnimann and Goodrich [BronnimannG95] for the complete details of their approach.
We simulate the operations over the weights of again using a weight index, this time with a minor variation to that of Theorem 4.2: in every node of the space partition tree, besides the values, we also store the minimum weight of the points within the cell corresponding to the node. During the initialization and update operations of the weight index this value can be maintained as follows: for a node with children , the minimum weight of a point in the cell of can be computed as . This value makes it possible to detect efficiently whether there are -light points and, if there are, to find one by tracing down, in the partition tree, the path from which that value comes.
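The idea of tracing a minimum down the tree can be sketched in one dimension. The class below is a hypothetical stand-in for the augmented partition tree: it keeps, at every node, the minimum weight below it and a pending multiplicative factor, and it assumes nonnegative factors so that multiplication preserves the order of the weights. The lightness threshold is passed in as a plain number because its exact definition depends on the total weight and on the parameter of the algorithm.

```python
class MinTrackingIndex:
    """Weights on positions 0..n-1, all initially 1, with range multiplication
    and retrieval of a position whose weight is below a threshold."""

    INF = float("inf")

    def __init__(self, n):
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.low = [self.INF] * (2 * self.size)  # minimum weight below each node
        self.mult = [1] * (2 * self.size)        # factor still pending for the children
        for i in range(n):
            self.low[self.size + i] = 1
        for v in range(self.size - 1, 0, -1):
            self.low[v] = min(self.low[2 * v], self.low[2 * v + 1])

    def _apply(self, v, alpha):
        if self.low[v] != self.INF:  # padding leaves carry no point
            self.low[v] *= alpha
        self.mult[v] *= alpha

    def _push(self, v):
        self._apply(2 * v, self.mult[v])
        self._apply(2 * v + 1, self.mult[v])
        self.mult[v] = 1

    def multiply(self, l, r, alpha, v=1, lo=0, hi=None):
        """Multiply by alpha (assumed >= 0) the weights of positions in [l, r)."""
        if hi is None:
            hi = self.size
        if r <= lo or hi <= l:
            return
        if l <= lo and hi <= r:
            self._apply(v, alpha)
            return
        self._push(v)
        mid = (lo + hi) // 2
        self.multiply(l, r, alpha, 2 * v, lo, mid)
        self.multiply(l, r, alpha, 2 * v + 1, mid, hi)
        # The minimum of a node is the smaller of the minima of its children.
        self.low[v] = min(self.low[2 * v], self.low[2 * v + 1])

    def find_light(self, threshold):
        """Return a position with weight < threshold, or None if there is none."""
        if self.low[1] >= threshold:
            return None
        v = 1
        while v < self.size:
            self._push(v)
            # Follow the child from which the minimum comes.
            v = 2 * v if self.low[2 * v] <= self.low[2 * v + 1] else 2 * v + 1
        return v - self.size
```

Checking the root answers the existence question in constant time, and the descent visits one node per level of the tree.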
To compute a -net, we choose a sample of by performing at least random independent draws from . We then check whether it is effectively a -net, and if not, we repeat the process, up to a maximum of times. Haussler and Welzl [HausslerW87] showed that such a sample is a -net with probability at least . Thus, the expected number of samples needed to obtain a -net is constant, and since we repeat the process up to times, the probability of effectively finding one is at least . We analyze the running time of this approach in the following theorem.
Let be a set of boxes in with a minimum coverage kernel of size . A coverage kernel of of size within can be computed in -expected time, with probability at least .
The algorithm performs several stages guessing the value of . Within each stage we initialize a weight index in time within . Finding whether there is a -light point can be done in constant time: the root of the partition tree stores both and the minimum weight of any point in the and values, respectively. For every light point, the weight-doubling steps consume time within (by Theorem 4.2). Since at each stage at most weight-doubling steps are performed, the total running time of each stage is within . Given that increases geometrically while guessing its right value, and since the running time of each stage is a polynomial function, the sum of the running times of all the stages is asymptotically dominated by that of the last stage, for which we have that . Thus the result of the theorem follows. ∎
Compared to the algorithm of Theorem 4.3, this last approach obtains a better approximation factor on instances with small Coverage Kernels ( vs. ), but the improvement comes with a sacrifice, not only in the running time, but also in the probability of finding such a good approximation. In two and three dimensions, weight indexes might also help to obtain practical approximation algorithms for the Minimum Coverage Kernel problem. We discuss this, and other future directions of research, in the next section.
Whether it is possible to close the gap between the factors of approximation of Box Cover and Orthogonal Polygon Covering has been a long-standing open question [KumarR03]. The Minimum Coverage Kernel problem, intermediate between those two, has the potential of yielding answers in that direction, and has natural applications of its own [DalyLT16, LakshmananNWZJ02, PuM05]. To understand the differences in hardness between these problems, we studied distinct restricted settings. We showed that while Minimum Coverage Kernel remains NP-hard under severely restricted settings, the same can be said for the Box Cover problem under even more extreme settings; and that while Box Cover and Minimum Coverage Kernel can be approximated by at least the same factors, the running time of obtaining some of those approximations can be significantly improved for the Minimum Coverage Kernel problem.
Another approach to understanding what makes a problem hard is Parameterized Complexity [DowneyF99], where the hardness of a problem is analyzed with respect to multiple parameters of the input, with the hope of finding measures gradually separating “easy” instances from the “hard” ones. The hardness results described in Section 3 show that for the Minimum Coverage Kernel and Box Cover problems, the vertex-degree and clique-number of the underlying graph are not good candidates for such measures, as opposed to what happens for other related problems [AlekseevBKL07].
In two and three dimensions, the Box Cover problem can be approximated up to [AronovES10]. We do not know whether the running time of this algorithm can also be improved for the case of Minimum Coverage Kernel via a weight index. We omit this analysis since the approach described in Section 4 is relevant when the dimension of the boxes is high (while still constant), as in distinct applications [DalyLT16, LakshmananNWZJ02, PuM05] of the Minimum Coverage Kernel problem.