Range-efficient consistent sampling and locality-sensitive hashing for polygons

Footnote: This work was supported under the Australian Research Council's Discovery Projects funding scheme (project number DP150101134, Gudmundsson) and the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013 / ERC grant agreement no. 614331, Pagh).


Joachim Gudmundsson University of Sydney, Australia
joachim.gudmundsson@sydney.edu.au
Rasmus Pagh IT University of Copenhagen, Denmark
pagh@itu.dk
Abstract

Locality-sensitive hashing (LSH) is a fundamental technique for similarity search and similarity estimation in high-dimensional spaces. The basic idea is that similar objects should produce hash collisions with probability significantly larger than objects with low similarity. We consider LSH for objects that can be represented as point sets in either one or two dimensions. To make the point sets finite we consider the subset of points on a grid. Directly applying LSH (e.g. min-wise hashing) to these point sets would require time proportional to the number of points. We seek to achieve time that is much lower than that of such direct approaches.

Technically, we introduce new primitives for range-efficient consistent sampling (of independent interest), and show how to turn such samples into LSH values. Another application of our technique is a data structure for quickly estimating the size of the intersection or union of a set of preprocessed polygons. Curiously, our consistent sampling method uses a transformation to a geometric problem.

Locality-sensitive hashing, probability distribution, polygon, min-wise hashing, consistent sampling

Subject classification: E.1 Data Structures; F.2.2 Nonnumerical Algorithms and Problems – Geometrical problems and computations

1 Introduction

Suppose that you would like to search a collection of polygons for a shape resembling a particular query polygon. Or that you have a collection of discrete probability distributions, and would like to search for a distribution that resembles a given query distribution. A framework for addressing this kind of question is locality-sensitive hashing (LSH), which seeks to achieve hash collisions between similar objects, while keeping the collision probability low for objects that are not very similar. Arguably the most practically important LSH method is min-wise hashing, which works on any type of data where similarity can be expressed in terms of Jaccard similarity of sets, i.e., the ratio $J(S,T) = |S \cap T|/|S \cup T|$ between the size of the intersection and the size of the union of the sets. Indeed, the seminal papers of Broder et al. introducing min-wise hashing [5, 7] have more than 1000 citations. Independently, Cohen [12] developed estimation algorithms based on similar ideas (see also [13]). The basic idea behind min-wise hashing is to map a set $S$ to $\operatorname{argmin}_{x \in S} h(x)$, which for a strong enough hash function $h$ gives collision probability equal (or close) to the Jaccard similarity (see e.g. [6] for a discussion of sufficient requirements on $h$).
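As a concrete illustration of this basic idea (plain min-wise hashing with a random permutation standing in for a strong hash function; the sets and trial count are arbitrary choices of ours, not from the paper):

```python
import random

def minhash(s, h):
    """Min-wise hashing: map a set to its element of minimum hash value."""
    return min(s, key=h)

def jaccard(s, t):
    return len(s & t) / len(s | t)

random.seed(0)
S, T = set(range(0, 60)), set(range(30, 90))   # Jaccard similarity 30/90 = 1/3
trials, collisions = 20000, 0
for _ in range(trials):
    perm = list(range(90))
    random.shuffle(perm)          # a random permutation plays the role of h
    h = perm.__getitem__
    collisions += (minhash(S, h) == minhash(T, h))
print(collisions / trials)        # should be close to jaccard(S, T) = 1/3
```

For an ideal (fully random) hash function the collision probability equals the Jaccard similarity exactly; the point of the paper is to obtain comparable guarantees with explicit, small-description hash functions and without touching every point of the set.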

If we represent discrete probability distributions by histograms there is a one-to-one relationship between the Jaccard similarity of two histograms and the statistical distance between the corresponding distributions. So a search for close distributions in terms of Jaccard similarity will translate into a search for distributions that are close in statistical distance, see Figure 1.

To make min-wise hashing well-defined on infinite point sets in the plane we may shift to an approximation by considering only those points contained in a finite grid of points. However, for a good approximation these sets must be very large, which means that computing a hash value $h(x)$ for each point $x$, in order to do min-wise hashing, is not attractive.

1.1 Our results

We consider efficient locality-sensitive hashing for objects that can be represented as point sets in either one or two dimensions, and whose similarity is measured as the Jaccard similarity of these point sets. The model of computation considered is a Word RAM with word size at least $\log p$, where $p$ is a prime number. We use integers in $[p] = \{0, \ldots, p-1\}$ (or equivalently elements in the field of size $p$) to represent coordinates of points on the grid. Our first result concerns histograms with non-negative integer values.

{theorem}

For every constant $\varepsilon > 0$ and every integer $w$ it is possible to choose an explicit hash function $H^*$ that has constant description size, can be evaluated in time $O(d \log p)$, and for which $\Pr[H^*(v) = H^*(u)] = J \pm \varepsilon$, where $J$ is the weighted Jaccard similarity of $d$-dimensional vectors $v$ and $u$ of weight $w$. Our construction gives an explicit alternative to existing results on weighted min-wise hashing (see [18, 21]) whose analysis relies on hash functions that are fully random and cannot be described in small space. It was previously shown that a form of priority sampling based on 2-independence can be used to estimate Jaccard similarity of histograms [26], but similarity estimation is less general than locality-sensitive hashing methods such as weighted min-wise hashing.

We proceed to show the generality of our technique by presenting an LSH method for geometric objects. We will use approximation to achieve high performance even for “hard” shapes, and adopt the so-called fuzzy model [1]. In a fuzzy polygon, points that are “close” to the boundary (relative to the polygon’s diameter) may or may not be included in the polygon. That is, given a polygon $P$ and real value $\varepsilon_f > 0$, define the outer range $P^+(w)$, with $w = \varepsilon_f \cdot D$, to be the locus of points whose distance from a point interior to $P$ is at most $w$, where $D$ is the diameter of $P$. The inner range $P^-(w)$ is defined symmetrically.

Using the fuzzy model a valid answer to the Jaccard similarity of two polygons $P_1$ and $P_2$ w.r.t. $\varepsilon_f$ is any value $J$ such that $J \geq A(P_1 \cap P_2)/A(P_1^+ \cup P_2^+)$ and $J \leq A(P_1^+ \cap P_2^+)/A(P_1 \cup P_2)$, where $A(\cdot)$ denotes the area of the region. To simplify the statement of the theorem we say that a polygon $P$ is $\alpha$-dense in a rectangle $I$ if for some value $\alpha > 0$ its area is at least a fraction $\alpha$ of the area of $I$. We use this to bound the time it takes to generate the sample points. {theorem} For every choice of constants $\varepsilon, \varepsilon_f, \alpha > 0$ and square $I$ it is possible to choose an explicit random hash function $H^*$ whose description size is constant, that can be evaluated in time $O((T + \log p)/\alpha)$, where $T$ is the time to test if a given point lies inside a polygon, and with the following guarantee on collision probability: Let $P_1, P_2$ be polygons such that $P_1^+$ and $P_2^+$ are $\alpha$-dense in $I$. Then $\Pr[H^*(P_1) = H^*(P_2)] = J \pm \varepsilon$, where $J$ is some valid Jaccard similarity of $P_1$ and $P_2$ in the fuzzy model with parameter $\varepsilon_f$.

It is an interesting problem whether the additive error $\varepsilon$ in the above theorems can be improved to a multiplicative error.

In Section 5 we present further applications of our technique and show how a small summary can be constructed for a set $\mathcal{P}$ of polygons such that for any subset $\mathcal{Q}$ of $\mathcal{P}$, an estimate of the area of $\bigcup \mathcal{Q}$ and $\bigcap \mathcal{Q}$ can be computed efficiently in the fuzzy model with respect to $\varepsilon_f$.

Techniques.

Our main technical contribution lies in methods for range-efficient min-wise hashing in one and two dimensions, efficiently implementing min-wise hashing for intervals and rectangles. More specifically, we consider intervals in $[p]$ and rectangles in $[p]^2$. The new technique can be related to earlier methods for sampling items with small hash values in one or more dimensions [24, 27]. (In fact, en route we obtain new hash-based sampling algorithms with improved speed, which may be of independent interest.) However, using [24, 27] to sample a single item is not likely to yield a good locality-sensitive hash function. The reason is that the hash functions used in these methods are taken from simple, 2-independent families and, as explained by Thorup [26], min-wise hashing using 2-independence does not in general yield collision probability that is close to (or even a function of) the Jaccard similarity. Instead we use a 2-phase approach: First produce a sample of the elements having the smallest hash values, and then perform standard min-wise hashing on a carefully selected subset of the sample using a different hash function.

We can combine and filter the samples to handle a variety of point sets that are not intervals or rectangles. To create a sample for a subset of a rectangle we can generate a sample of the rectangle, and then filter away those sample points that are not in the subset. This is efficient if the subset is suitably dense in the rectangle (which we ensure by working in the fuzzy model). To create a sample from the union of two sets, simply take the union of the samples. Both theorems above are obtained in this way, and it would be possible to instantiate many other applications.

At the heart of our range-efficient sampling algorithms for one and two dimensions lies a reduction to the problem of finding an integer point (or integer points) in a given interval with small vertical distance to a given line. Such a point can effectively be found by traversing the integer convex hull of the line. Using a result of Charrier and Buzer [10] this can be done in logarithmic time. Thus, geometry shows up in an unexpected way in the solution.

1.2 Comparison with related work

We are not aware of previous work dealing with range-efficient locality-sensitive hashing. The most closely related work is on range-efficient consistent (or coordinated) sampling, which is a technique for constructing summaries and sketches of large data sets. The technique comes in two flavors: bottom-$k$ (or min-wise) sampling, which fixes the sample size, and consistent sampling (or sub-sampling), which fixes the sampling probability. In both cases the idea is to choose as a sample those elements from a set $S$ that have small hash values under a random hash function $h$. If the sample size is fixed and some hash values are identical then an arbitrary tie-breaking rule can be used, e.g., selecting the minimum element. To make $\operatorname{argmin}_{x \in S} h(x)$ uniquely defined, which is convenient, we take it to be the smallest value $x$ for which $h(x)$ is minimized. To denote the set of the $k$ elements having the smallest hash values (with ties broken in the same way) we use the notation $\min_k h(S)$. We focus on settings in which $S$ is large and it is infeasible to store a table of all hash values.
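The two flavors can be sketched as follows (function names are ours; the hash is a 2-independent linear function modulo a small prime, for illustration only — the point of the paper is computing these samples without enumerating the whole set):

```python
def bottom_k(elements, h, k):
    """Bottom-k sample: the k elements with the smallest hash values,
    ties broken by taking the smaller element."""
    return sorted(elements, key=lambda x: (h(x), x))[:k]

def consistent_sample(elements, h, t):
    """Consistent sample: every element hashing below threshold t, so the
    sample size is random but the rule is coordinated across sets."""
    return [x for x in elements if h(x) < t]

p, a, b = 101, 37, 11
h = lambda x: (a * x + b) % p        # 2-independent linear hash mod p

S = range(50)
print(bottom_k(S, h, 3))
print(len(consistent_sample(S, h, 10)))
```

Because both rules depend only on the hash values, the same element is sampled consistently from every set that contains it, which is what makes intersections and unions of samples meaningful.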

In one dimension.

Pavan and Tirthapura [24] consider the 2-independent family of linear hash functions in the field of size $p$, i.e., functions of the form $h(x) = (ax + b) \bmod p$. They show how to find all hash values below a given threshold $t$, where the input $x$ is restricted to an interval $[i_1; i_2]$. (See also [2] for another application of this primitive.) The algorithm of Pavan and Tirthapura uses time $O(k \log p)$, where $k$ is the number of elements $x \in [i_1; i_2]$ with $h(x) < t$. Using this in connection with doubling search leads to an algorithm finding the minimum hash value in time $O(\log^2 p)$. In this paper we show how to improve the time complexity: {lemma} Let $h(x) = (ax + b) \bmod p$, where $p$ is prime and $a, b \in [p]$. Given $i_1 \leq i_2 < p$ consider the interval $I = [i_1; i_2]$. It is possible to compute $\operatorname{argmin}_{x \in I} h(x)$ (the min-hash of $I$) in time $O(\log p)$. We will argue in Section 2 that Lemma 1.2 can be applied repeatedly to subintervals to output the $k$ smallest hash values (and corresponding inputs) in time $O(k \log p)$. The possibility of choosing $a = 0$ is included for mathematical convenience (to ensure 2-independence), though in most applications it will be better to choose $a \neq 0$ (which in addition makes the minimum uniquely defined without a tie-breaking rule).

In more than one dimension.

Tirthapura and Woodruff [27] consider another class of 2-independent functions, namely linear transformations on vectors over the field $GF(2)$. Integers naturally correspond to such vectors, and for a dyadic interval containing all integers that share a certain prefix, the problem of finding elements in the interval that map to zero is equivalent to solving a linear system of equations. Since an arbitrary interval can be split into a logarithmic number of dyadic intervals they are able to compute all the integers that map to zero in polylogarithmic time. The sampling probability can be chosen as an arbitrary integer power of two. This method generalizes to rectangles in dimension $d \geq 2$.

In this paper we instead consider linear, 2-independent hash functions of the form $h(x, y) = (ax + by + c) \bmod p$. We do not know of a method for efficiently computing a min-hash over a rectangle for such functions, but we are able to efficiently implement consistent sampling with sampling probability $1/p$. {lemma} Let $h(x, y) = (ax + by + c) \bmod p$, where $p$ is prime and $a, b, c \in [p]$. Given $i_1 \leq i_2 < p$ and $j_1 \leq j_2 < p$ consider $I = [i_1; i_2] \times [j_1; j_2]$. It is possible to compute $I' = \{(x, y) \in I \mid h(x, y) = 0\}$ in time $O((1 + |I'|) \log p)$. For random $a$, $b$, $c$, the expected size of the sample is $|I|/p$, and because of 2-independence the distribution of $|I'|$ is concentrated around this value. Compared to the method of [27] ours is faster, but has the disadvantage that the sampling probability cannot be chosen freely. However, as we will see this restriction is not a real limitation to our applications to locality sensitive hashing and size estimation.

From consistent sampling to LSH.

Our technique for transforming a consistent sample to an LSH value is of independent interest. Thorup [26] shows that min-wise hashing using 2-independence does not in general yield collision probability that is close to (or even a function of) the Jaccard similarity. On the positive side he shows that bottom-$k$ samples of two sets made using a 2-independent hash function can be used to estimate the Jaccard similarity of the sets with arbitrarily good precision. However, this does not yield a locality-sensitive hash function with collision probability (close to) the Jaccard similarity, and obvious approaches such as min-wise hashing applied to the samples fail to have the right collision probability. Instead, we use consistent sampling (using a 2-independent family) followed by a stronger hash function for which min-wise hashing has the desired collision probability up to an additive error $\varepsilon$. This transformation yields the first LSH family for Jaccard similarity (with proven guarantees on collision probability) where the function can be:

• evaluated in time $O(n)$ on a set of size $n$, and

• described and computed in a constant number of machine words (independent of $n$).

Previous such functions have used either time per element that grows as $\varepsilon$ approaches zero [20], or required description space that is a root of the universe size (see [14]).

1.3 Preliminaries

We will make extensive use of 2-independence:

{definition}

A family of hash functions mapping $U$ to $U$ is called 2-independent if, for every pair of distinct $x_1, x_2 \in U$, every $a_1, a_2 \in U$, and $h$ chosen uniformly from the family, we have

 $$\Pr[h(x_1) = a_1 \wedge h(x_2) = a_2] = 1/|U|^2.$$

It will be convenient to use the notation $y \pm \varepsilon$ for a number in the interval $[y - \varepsilon; y + \varepsilon]$.

Carter and Wegman [8] showed that the family $\{x \mapsto (ax + b) \bmod p \mid a, b \in [p]\}$ is 2-independent on the set $[p]$ when $p$ is a prime. Finally, we make use of $\varepsilon$-minwise independent families: {definition} A family of hash functions mapping $U$ to $U$ is called $\varepsilon$-minwise independent if for every set $S \subseteq U$, every $x \in S$, and random $h$ from the family: $\Pr[h(x) = \min h(S)] = (1 \pm \varepsilon)/|S|$. Indyk [20] showed that an efficient $\varepsilon$-minwise independent family can be constructed by using an $O(\log(1/\varepsilon))$-independent family of functions (e.g. polynomial hash functions). Dahlgaard and Thorup [14] showed that the evaluation time can be made constant, independent of $\varepsilon$, by using space that is a root of the universe size. If we only care about sets of size up to some number $n$, this space usage can be improved accordingly.

2 Range-efficient bottom-k sampling in one dimension

The aim of this section is to show Lemma 1.2 and how it can be used to efficiently compute consistent as well as bottom- samples. Together with the general transformation presented in Section 4 this will lead to Theorem 1.1.

Without loss of generality suppose $a, b \in [p]$, consider $h(x) = (ax + b) \bmod p$, and let $I = [i_1; i_2]$. To show Lemma 1.2 we must prove that $\operatorname{argmin}_{x \in I} h(x)$ can be computed in time $O(\log p)$. In case $a = 0$ this is trivial (just output $i_1$), so we focus on the case $a > 0$. We will show how the problem can be reduced to the problem of finding the integer point at the smallest (vertical) distance below the line segment

 $$\ell = \{(x, (ax+b)/p) \mid x \in [i_1; i_2]\}. \qquad (1)$$

To see this observe that for $x \in [i_1; i_2]$ the vertical distance between the line and the nearest integer point below it is $(ax+b)/p - \lfloor (ax+b)/p \rfloor$. Using the equality

 $$(ax+b)/p - \lfloor (ax+b)/p \rfloor = \big((ax+b) \bmod p\big)/p$$

we see that minimizing $h(x)$ is equivalent to minimizing the vertical distance, as claimed. Therefore it suffices to search for the integer point below $\ell$ that is closest to $\ell$. Since $\ell$ is a line segment, the point must lie on the convex hull of the set of integer points that lie below $\ell$, referred to as the “integer convex hull”, see Figure 2. Clearly, the closest point will always be on the upper part of the hull, denoted $\mathcal{H}(\ell)$. Zolotykh [29] showed that $\mathcal{H}(\ell)$ consists of $O(\log p)$ line segments. To find a point on the integer convex hull with the smallest vertical distance to $\ell$ we will use a result by Charrier and Buzer [10].
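The reduction can be checked with a small brute-force sketch (no convex hull; function names are ours): minimizing $h$ over the interval and minimizing the vertical distance to $\ell$ select the same point.

```python
from fractions import Fraction

def argmin_hash(a, b, p, i1, i2):
    """argmin of h(x) = (a*x + b) mod p over [i1, i2], ties to smaller x."""
    return min(range(i1, i2 + 1), key=lambda x: ((a * x + b) % p, x))

def argmin_vertical_distance(a, b, p, i1, i2):
    """x-coordinate of the integer point below the line y = (a*x + b)/p
    with the smallest vertical distance to it, found by brute force."""
    def dist(x):
        y = Fraction(a * x + b, p)
        return y - (y.numerator // y.denominator)   # fractional part of y
    return min(range(i1, i2 + 1), key=lambda x: (dist(x), x))

# The identity (ax+b)/p - floor((ax+b)/p) = ((ax+b) mod p)/p makes the
# two minimizations equivalent:
p, a, b = 97, 31, 5
print(argmin_hash(a, b, p, 10, 60) == argmin_vertical_distance(a, b, p, 10, 60))
```

The contribution of the paper is replacing the linear scan above with a traversal of the integer convex hull, which takes only logarithmic time.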

{theorem}

(Charrier and Buzer [10]) Given a line segment $\ell$, the upper integer convex hull $\mathcal{H}(\ell)$ can be computed in $O(\log(x_2 - x_1))$ time, where $x_1$ and $x_2$ are the $x$-coordinates of the end points of $\ell$. Charrier and Buzer initially assume that $\ell$ passes through the origin. However, they note (Section 7 in [10]) that this requirement is not needed. Thus, using their result on the line $\ell$ defined in (1) we obtain Lemma 1.2.

We now discuss how to use Lemma 1.2 to output the $k$ smallest hash values (and corresponding inputs, i.e., the bottom-$k$ sample) in time $O(k \log p)$. First compute $\mathcal{H}(\ell)$ and find the point with the smallest vertical distance to $\ell$, say with $x$-coordinate $x_{\min}$. Next, split the problem into two subintervals; one for the part of $\ell$ in the $x$-interval $[i_1; x_{\min} - 1]$ and one for the part of $\ell$ in the $x$-interval $[x_{\min} + 1; i_2]$. Using a heap to find the integer point with smallest vertical distance in the intervals considered, we can repeat this process until $k$ points have been found. To compute a consistent sample rather than the bottom-$k$ sample we simply stop the procedure whenever we see an element with a hash value larger than the threshold.
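The splitting strategy can be sketched as follows (illustrative; a brute-force interval minimum stands in for the convex-hull routine of Lemma 1.2, which finds it in logarithmic time):

```python
import heapq

def interval_min(a, b, p, lo, hi):
    """Stand-in for the convex-hull routine: the element of [lo, hi]
    with the smallest hash value under h(x) = (a*x + b) mod p."""
    return min(range(lo, hi + 1), key=lambda x: ((a * x + b) % p, x))

def bottom_k_interval(a, b, p, lo, hi, k):
    """Bottom-k sample over an interval by repeatedly extracting the
    minimum and splitting the interval around it, as described above."""
    heap = []
    def push(l, r):
        if l <= r:
            x = interval_min(a, b, p, l, r)
            heapq.heappush(heap, ((a * x + b) % p, x, l, r))
    push(lo, hi)
    out = []
    while heap and len(out) < k:
        _, x, l, r = heapq.heappop(heap)
        out.append(x)
        push(l, x - 1)          # recurse on the two subintervals around x
        push(x + 1, r)
    return out

p, a, b = 101, 37, 11
print(bottom_k_interval(a, b, p, 0, 80, 5))
```

Each extraction pops the global minimum of all remaining subintervals and pushes at most two new ones, so $k$ samples cost $k$ heap operations plus $O(k)$ interval-minimum queries.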

{corollary}

Let $h(x) = (ax + b) \bmod p$, where $p$ is prime and $a, b \in [p]$. Given $i_1 \leq i_2 < p$ consider $I = [i_1; i_2]$. It is possible to compute the bottom-$k$ sample (or the consistent sample of expected size $k$) from the interval $I$ with respect to $h$ in (expected) time $O(k \log p)$. It is an interesting problem whether it is possible to improve this bound to $O(k + \log p)$.

3 Rectangle-efficient consistent sampling

The aim and structure of this section are similar to those of Section 2, but now addressing the case where we want to do hashing-based sampling in a rectangle $I = [i_1; i_2] \times [j_1; j_2]$. Specifically, we prove Lemma 1.2 and show how one can use it to perform consistent sampling. This will be used in Section 4 to prove Theorem 1.1 and in Section 5 to construct an efficient data structure for estimating the size of intersections and unions of polygons. Assume without loss of generality that $i_2, j_2 < p$ and $b \neq 0$. Consider the 2-independent family of functions $(x, y) \mapsto (ax + by + c) \bmod p$ and choose $h$ uniformly from it. To prove Lemma 1.2 we have to argue that

 $$I' = \{(x, y) \in I \mid h(x, y) = 0\} \qquad (2)$$

can be computed in time $O((1 + |I'|) \log p)$. Similar to the previous section we will show how the problem can be reduced to the problem of finding all integer points below a line segment $\ell$ with a small vertical distance to $\ell$.

To find all $(x, y) \in I$ for which $h(x, y) = 0$, as a first step we translate the function such that we can consider inputs in $[0; i_2 - i_1] \times [0; j_2 - j_1]$. Specifically, we replace $c$ with $c' = (a i_1 + b j_1 + c) \bmod p$, and consider inputs $(x', y')$ with $x' = x - i_1$, $y' = y - j_1$. This is equivalent to the original task since $h(x, y) = (a x' + b y' + c') \bmod p$. Next note that for $b \neq 0$ and $y \in [p]$:

 $$(ax + by + c') \bmod p = 0 \iff y \equiv (-b^{-1}ax - b^{-1}c') \bmod p.$$

To simplify the expression set $\alpha = (-b^{-1}a) \bmod p$ and $\beta = (-b^{-1}c') \bmod p$. Then we have a zero hash value when $y' = \alpha x' + \beta - kp$ for some non-negative integer $k$. Now we can express the original problem as finding all $x' \in [0; i_2 - i_1]$ such that $(\alpha x' + \beta) \bmod p \leq j_2 - j_1$. Consider the line segment $\ell = \{(x, (\alpha x + \beta)/p) \mid x \in [0; i_2 - i_1]\}$. An integer point below $\ell$ with $x$-coordinate $x'$ and vertical distance at most $(j_2 - j_1)/p$ to $\ell$ corresponds to a point $(x', y')$ such that $h(x' + i_1, y' + j_1) = 0$.

To find all the points that fulfill the restrictions we can apply the same technique as in Section 2. That is, compute the integer convex hull using the algorithm by Charrier and Buzer [10]. One difference from the setting of Section 2 is that we are interested in all integer points close to $\ell$, but the integer convex hull is guaranteed only to include one such point if it exists. This is handled by recursing on subintervals in which no points have been reported until we find an interval where the integer convex hull does not contain a point close to $\ell$. Recall that the integer convex hull consists of $O(\log p)$ segments by the result of Zolotykh [29] and can be computed in logarithmic time, so the cost per point reported is logarithmic. This concludes the proof of Lemma 1.2.
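The modular-inverse rewrite at the heart of this proof can be sketched directly (per-column enumeration, so linear in the width of the rectangle rather than polylogarithmic, but it produces the same set $I'$; names are ours):

```python
def zero_hash_points(a, b, c, p, x_range, y_range):
    """All (x, y) in the rectangle with (a*x + b*y + c) % p == 0, using
    y ≡ -b^{-1}(a*x + c) (mod p) to jump directly to the solutions."""
    b_inv = pow(b, -1, p)            # b must be invertible mod p
    x0, x1 = x_range
    y0, y1 = y_range
    pts = []
    for x in range(x0, x1 + 1):
        y = (-b_inv * (a * x + c)) % p       # the unique residue class of y
        first = y0 + ((y - y0) % p)          # smallest representative >= y0
        for yy in range(first, y1 + 1, p):
            pts.append((x, yy))
    return pts

p, a, b, c = 97, 31, 17, 5
pts = zero_hash_points(a, b, c, p, (0, 50), (0, 200))
print(all((a * x + b * y + c) % p == 0 for x, y in pts))
```

Each column contains at most $\lceil (j_2 - j_1 + 1)/p \rceil$ solutions, which is why the sampling probability is exactly $1/p$.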

3.1 Concentration bound

{definition}

An $(\varepsilon, \delta)$-estimator for a quantity $q$ is a randomized procedure that, given parameters $\varepsilon$ and $\delta$, computes an estimate $\hat{q}$ of $q$ such that $\Pr[|\hat{q} - q| > \varepsilon q] < \delta$.

For some $h$ as above consider an arbitrary set $S \subseteq [p]^2$, and the sample $S' = S \cap I'$ where $I'$ is defined in (2). Let $\pi = 1/p$ be the sampling probability. We now show that $|S'|/\pi$ is concentrated around its expectation $|S|$ when $1/(\pi|S|)$ is not too large.

{lemma}

For $|S| \geq p/(\varepsilon^2 \delta)$, $|S'|/\pi$ is an $(\varepsilon, \delta)$-estimator for $|S|$.

Proof.

The proof is a standard application of the second moment bound for 2-independent indicator variables. For each point $s \in S$ let $X_s$ be the indicator variable that equals $1$ if $s \in I'$ and 0 otherwise. Clearly we have $|S'| = \sum_{s \in S} X_s$ where $\mathbb{E}[X_s] = \pi$, so $\mathbb{E}[|S'|] = \pi|S|$. By definition of the hash family the variables are 2-independent, and so $\operatorname{Var}(|S'|) \leq \pi|S|$. Now Chebyshev’s inequality implies $\Pr[\,||S'| - \pi|S|| > \varepsilon \pi |S|\,] \leq 1/(\varepsilon^2 \pi |S|) \leq \delta$. ∎

To get an $(\varepsilon, \delta)$-estimator we thus need $\pi|S| \geq 1/(\varepsilon^2 \delta)$. The expected time for computing $I'$ in Lemma 1.2 is upper bounded by $O((1 + \mathbb{E}[|I'|]) \log p)$ which is $O((1 + |I|/p) \log p)$. If we choose $p = \Theta(\varepsilon^2 \delta |S|)$, to get an $(\varepsilon, \delta)$-estimator, and let $\varrho$ be the fraction of points of $I$ that are also in $S$, then the expected time simplifies to $O(\log(p)/(\varrho \varepsilon^2 \delta))$. That is, the bound is independent of the size of $S$, has logarithmic dependence on $p$, and linear dependence on $1/\varrho$, $1/\varepsilon^2$, and $1/\delta$.
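A quick numerical sanity check of the estimator $|S'|/\pi$ (fully illustrative; the set, prime, and trial count are arbitrary choices of ours):

```python
import random

def estimate_size(S, a, b, c, p):
    """Estimate |S| as |S'|/pi, where S' = {s in S : h(s) = 0} is the
    consistent sample and pi = 1/p is the sampling probability."""
    return p * sum(1 for (x, y) in S if (a * x + b * y + c) % p == 0)

random.seed(1)
p = 31
S = [(x, y) for x in range(100) for y in range(100)]   # |S| = 10000
errs = []
for _ in range(50):
    a, b = random.randrange(1, p), random.randrange(1, p)
    c = random.randrange(p)
    errs.append(abs(estimate_size(S, a, b, c, p) - len(S)) / len(S))
print(sum(errs) / len(errs))    # average relative error, typically a few percent
```

Here $\pi|S| \approx 323$, comfortably above the $1/(\varepsilon^2\delta)$ requirement for moderate $\varepsilon$ and $\delta$, which is why the observed error is small.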

4 From consistent sampling to locality-sensitive hashing

We now present a general transformation of methods for 2-independent consistent sampling to locality-sensitive hashing for Jaccard similarity. Together with the consistent sampling methods in Sections 2 and 3 this will yield the two theorems of Section 1.1.

Thorup [26] observed that min-wise hashing based on a 2-independent family does not give collision probability that is close to (or a function of) Jaccard similarity. He observes a bias for a 2-independent family of hash functions based on multiplication, similar to the ones used in this paper. Thus we take a different route: First produce a consistent sample using 2-independence, and then apply min-wise hashing to the sample using a stronger hash function. The expected time per element is constant if we make sure that the sample has expected constant size.
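The two-phase scheme can be sketched as follows (illustrative names; here the consistent sample is computed by scanning the set rather than range-efficiently, and `f` stands in for the stronger, e.g. $\varepsilon$-minwise independent, hash function):

```python
def lsh_value(S, p, abc, f):
    """Phase 1: consistent sample under the 2-independent linear hash
    (a*x + b*y + c) mod p.  Phase 2: min-wise hash the sample under f."""
    a, b, c = abc
    sample = [(x, y) for (x, y) in S if (a * x + b * y + c) % p == 0]
    if not sample:
        return None        # map the rare empty sample to a fixed constant
    return min(sample, key=f)

# Identical sets always collide; similar sets collide with probability
# close to their Jaccard similarity.
p, abc = 31, (3, 7, 5)
f = lambda q: (q[0] * 911 + q[1] * 577) % 1009
S = [(x, y) for x in range(40) for y in range(40)]
print(lsh_value(S, p, abc, f))
```

Because the expected sample size is a constant (depending on the error parameter), phase 2 costs constant expected time, so the strength of $f$ is paid for only on a few elements.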

Let constants $\varepsilon > 0$ and $w$ be given. For a point set $S$ with $|S| = w$ we produce a 2-independent sample $S' = S \cap I'$ with sampling probability $1/p$, where $p$ is a prime number. This is possible, with $p$ in any desired constant-factor range, because there exists a prime in every interval $[n; 2n]$, $n \geq 1$. Now select $f$ at random from an $(\varepsilon/8)$-minwise independent family and define the hash value

 $$H^*(S) = \operatorname*{argmin}_{x \in I' \cap S} f(x). \qquad (3)$$
{lemma}

For $S, T$ with $|S| = |T| = w$ and $J = J(S, T)$ we have $\Pr[H^*(S) = H^*(T)] = J \pm \varepsilon$, where the probability is over the choice of $h$ and $f$.

Proof.

Consider the Jaccard similarity of the samples $S' = S \cap I'$ and $T' = T \cap I'$:

 $$J' = \frac{|S' \cap T'|}{|S' \cup T'|} = \frac{|S'| + |T'| - |S' \cup T'|}{|S' \cup T'|}.$$

Conditioned on a fixed pair $(S', T')$, the collision probability of $H^*$ is $J' \pm \varepsilon/4$ by the choice of $f$. Thus it suffices to show that $J'$ differs from $J$ by at most $\varepsilon/2$ with probability at least $1 - \varepsilon/4$.

By Lemma 3.1, $|S' \cup T'|/\pi$ is an $(\varepsilon/8, \varepsilon/12)$-estimator for $|S \cup T|$ since $|S \cup T| \geq |S| = w$. Similarly, $|S'|/\pi$ is an $(\varepsilon/8, \varepsilon/12)$-estimator for $|S|$ and $|T'|/\pi$ is an $(\varepsilon/8, \varepsilon/12)$-estimator for $|T|$. The probability that all estimators are good is at least $1 - \varepsilon/4$, and in that case

 $$J - \varepsilon/2 < \frac{|S \cap T| - (3\varepsilon/8)|S \cup T|}{|S \cup T| + (\varepsilon/8)|S \cup T|} \leq J' \leq \frac{|S \cap T| + (3\varepsilon/8)|S \cup T|}{|S \cup T| - (\varepsilon/8)|S \cup T|} < J + \varepsilon/2,$$

as desired. ∎

We have not specified $f$. The most obvious choice is to use an $O(\log(1/\varepsilon))$-independent hash function [20]. Another appealing choice is twisted tabulation hashing [14] that yields constant evaluation time, independent of $\varepsilon$. The expected size of $I' \cap S$ is bounded by a function of $\varepsilon$. This means that we can combine twisted tabulation with an injective universe reduction step to reduce the domain of twisted tabulation to a (large) constant depending on $\varepsilon$.

Proof of Theorem 1.1.

Consider a vector $v$ with non-negative integer entries $v_1, \ldots, v_d$. We follow the folklore approach [18] of conceptually mapping each vector $v$ to a set $S(v)$, such that the Jaccard similarity of $S(v)$ and $S(u)$ exactly equals the weighted Jaccard similarity of $v$ and $u$. In particular, it is easy to verify that this is the case if we let $S(v) = \{(i, j) \mid 1 \leq j \leq v_i\}$. Note that $S(v)$ and $S(u)$ both have size $w$. We will use the following class of hash functions from $[p]^2$ to $[p]$:
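The folklore mapping and the claimed equality are easy to verify in a few lines (function names are ours):

```python
def to_set(v):
    """Folklore mapping of a non-negative integer vector to a point set:
    column i contributes the points (i, 1), ..., (i, v[i])."""
    return {(i, j) for i, vi in enumerate(v) for j in range(1, vi + 1)}

def weighted_jaccard(v, u):
    """Weighted Jaccard similarity: sum of mins over sum of maxes."""
    return sum(map(min, v, u)) / sum(map(max, v, u))

def jaccard(S, T):
    return len(S & T) / len(S | T)

v, u = [3, 0, 5, 2], [1, 4, 5, 0]
print(jaccard(to_set(v), to_set(u)) == weighted_jaccard(v, u))
```

Column $i$ contributes $\min(v_i, u_i)$ points to the intersection and $\max(v_i, u_i)$ points to the union, which is exactly the weighted Jaccard ratio.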

 $$\mathcal{H}_2 = \{(x, y) \mapsto (ax + by + c) \bmod p \mid a, b, c \in [p]\}. \qquad (4)$$

The 2-independence of $\mathcal{H}_2$ follows from the arguments of Carter and Wegman [8]. A proof is included in Appendix A for completeness. When restricted to points of the form $(i, y)$ for a fixed $i$, each function has a form suitable for Corollary 2 in Section 2. This means we can find the minimum of $h$ restricted to a given column in time $O(\log p)$. Using a heap to keep track of the smallest hash value from each column of $S(v)$ not (yet) reported in the sample, we can output all elements of $S(v)$ with a hash value smaller than any given threshold in time $O(\log p)$ per element. The threshold is chosen to match the desired sampling probability.

Lemma 4 then says that we get the desired collision probability up to an additive error of $\varepsilon$. The expected time to hash $S(v)$ is $O(d \log p)$ (to populate the priority queue) plus $O(\log p)$ times the expected number of samples. The expected number of samples is constant for every constant $\varepsilon$, which gives the desired time bound in expectation.

It is possible to turn the expected bound into a worst case bound by stopping the computation if the running time exceeds $t$ times the expectation, which by Markov’s inequality happens with probability at most $1/t$. If we simply output a constant in this case the collision probability changes by at most $1/t$ (which we can compensate for by decreasing $\varepsilon$).

Proof of Theorem 1.1.

The proof is similar to the proof of Theorem 1.1 but with some added geometric observations. Let $P_1$ and $P_2$ be two polygons contained in $I$. As mentioned in the introduction, a valid answer to the Jaccard similarity of polygons $P_1$ and $P_2$ with respect to $\varepsilon_f$ is any value $J$ such that $J \geq A(P_1 \cap P_2)/A(P_1^+(w_1) \cup P_2^+(w_2))$ and $J \leq A(P_1^+(w_1) \cap P_2^+(w_2))/A(P_1 \cup P_2)$, where $w_i = \varepsilon_f \cdot d_i$ and $d_i$ is the diameter of $P_i$, for $i \in \{1, 2\}$.

We now switch to considering the restrictions of $P_1$ and $P_2$ to a $p$-by-$p$ grid of points whose enclosing rectangle contains both polygons. See [19] for a survey on snapping points to a grid.

The grid points are identified in the natural way with integer coordinates in $[p]^2$. We choose $p$ such that the number of grid points inside $I$ is $p$ times the desired number of samples required for Lemma 4 to hold.

Let $I$ be the minimum bounding box of $P_1^+(w_1/2)$ and $P_2^+(w_2/2)$. The consistent sampling will be made on $P_1^+(w_1/2)$ and $P_2^+(w_2/2)$. The reason for this is that

 $$|P_1^+(w_1/2) \cap P_2^+(w_2/2)| \,/\, |P_1^+(w_1/2) \cup P_2^+(w_2/2)|$$

is a valid answer to the Jaccard similarity of $P_1$ and $P_2$ in the fuzzy model with respect to $\varepsilon_f$, which follows immediately from the two inequalities below, proven in Lemma 5 (Section 5):

 $$A(P_1 \cup P_2) \leq |(P_1^+(w_1/2) \cup P_2^+(w_2/2)) \cap I| \leq A(P_1^+(w_1/2) \cup P_2^+(w_2/2)), \text{ and}$$
 $$A(P_1 \cap P_2) \leq |(P_1^+(w_1/2) \cap P_2^+(w_2/2)) \cap I| \leq A(P_1^+(w_1/2) \cap P_2^+(w_2/2)).$$

Lemma 4 gives us the desired collision probability up to an additive error of $\varepsilon$. The expected time to hash is $O(\log p)$ plus $O(T + \log p)$ times the expected number of samples, where $T$ is the time to test if a given grid point lies inside a polygon. If we assume that $P_1^+(w_1/2)$ and $P_2^+(w_2/2)$ are $\alpha$-dense in $I$, that is, there exists an $\alpha > 0$ such that $A(P_i^+(w_i/2)) \geq \alpha \cdot A(I)$, then the expected number of samples is $O(1/\alpha)$ for any constants $\varepsilon$ and $\varepsilon_f$, which gives the desired time bound in expectation. In many natural settings $\alpha$ is a constant, which implies that the expected number of samples is also constant.

5 Estimating union and intersection of polygons

In this section we consider the question: Given a set of preprocessed polygons in the plane, how efficiently can we compute the area of the union or the intersection of a given subset ? In contrast to elementary approaches based on global, fully random sampling, our solution allows polygons to be independently preprocessed based on a small amount of shared randomness that specifies a pseudorandom sample.

Computing the area of the union of a set of geometric objects is a well-studied problem in computational geometry. One example is Klee’s Measure Problem (KMP): given $n$ axis-parallel boxes in $d$-dimensional space, the problem asks for the measure of their union. In 1977, Victor Klee [22] showed that it can be solved in $O(n \log n)$ time for $d = 1$. This was generalized to higher dimensions by Bentley [3] in the same year, and later improved by van Leeuwen and Wood [28], Overmars and Yap [23], and Chan [9]. In 2010, Bringmann and Friedrich [4] gave a Monte Carlo $(1 + \varepsilon)$-approximation algorithm for the problem.

A related question is the computation of the area of the intersection of polytopes in $d$-dimensional space. Bringmann and Friedrich [4] showed that there cannot be a (deterministic or randomized) multiplicative approximation algorithm in general, unless NP $\subseteq$ BPP. They therefore gave an additive $\varepsilon$-approximation for a large class of geometric bodies, with a running time polynomial in the input size and $1/\varepsilon$, assuming that the following three queries can be approximately answered efficiently: is a point inside the body, what is the volume of the body, and sample a point within the body.

In this section we will approach the problem slightly differently. The approach we suggest is to produce a small summary of the set $\mathcal{P}$, such that given any subset $\mathcal{Q}$ of $\mathcal{P}$ the union and intersection of $\mathcal{Q}$ can be estimated efficiently. Unfortunately, the lower bound arguments by Bringmann and Friedrich [4] defeat any reasonable hope of achieving polynomial running time for arbitrary polygons. To get around the lower bounds we again adopt the approximation model proposed by Arya and Mount [1] (stated in Section 1.1), which has been used extensively in the literature [11, 15, 16].

Similar to the approach by Bringmann and Friedrich [4] we will also use sampling of the polygons to estimate the size of the union and intersection. However, compared to earlier attempts, the main advantage of our approach is that we generate the sample points (a summary of the input) in a preprocessing step and after that we may discard the polygons. Union and intersection queries are answered using only the summary. Also, we do not impose any restrictions on the input polygons. The drawbacks are that we only consider the case $d = 2$ and that the approximation model [1] we use is somewhat more “forgiving” than previously used models.

For each polygon $P_i$ in $\mathcal{P}$, $1 \leq i \leq m$, let $w_i = \varepsilon_f \cdot d_i$, where $d_i$ is the diameter of $P_i$ and $\varepsilon_f$ is a given constant. Let $\mathcal{Q}$ be the input to a union or intersection query, that is, $\mathcal{Q}$ is a subset of $\mathcal{P}$. To simplify the notation we will write $P_i^+ = P_i^+(w_i/2)$ and $P_i^- = P_i^-(w_i/2)$. Define $\mathcal{Q}^+$ and $\mathcal{Q}^-$ symmetrically.

Following the above discussion, given $\mathcal{Q}$ a legal answer to a set intersection query is any $\tilde{A}$ such that $A(\bigcap_{P_i \in \mathcal{Q}} P_i^-(w_i)) \leq \tilde{A} \leq A(\bigcap_{P_i \in \mathcal{Q}} P_i^+(w_i))$, and for a union query a legal answer is any $\tilde{A}$ such that $A(\bigcup_{P_i \in \mathcal{Q}} P_i^-(w_i)) \leq \tilde{A} \leq A(\bigcup_{P_i \in \mathcal{Q}} P_i^+(w_i))$. It is immediate from the above definitions that for any polygon $P$ and any $0 < w' \leq w$ we have $P^-(w) \subseteq P^-(w') \subseteq P \subseteq P^+(w') \subseteq P^+(w)$. We will use the number of integer coordinates within a polygon $P$, denoted $|P|$, to estimate the area of the polygon, denoted $A(P)$. Proofs of the lemmas in this section can be found in the appendix.

{lemma}

For a polygon $P$ having integer coordinates we have $A(P^-) \leq |P| \leq A(P^+)$. To make the queries more efficient we will not estimate the number of integer coordinates in the intersection/union of a query directly; instead we will estimate an approximation of it. We show:
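A small sketch of why lattice-point counts track areas (a convex lattice polygon so that half-plane tests suffice; the functions and example are ours, not the lemma's proof):

```python
def shoelace_area(poly):
    """Polygon area via the shoelace formula (vertices in order)."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2

def lattice_point_count(poly):
    """Count integer points inside or on a convex polygon given in
    counter-clockwise order, by testing candidates against every edge."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    def inside(px, py):
        n = len(poly)
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0:
                return False          # strictly right of a CCW edge
        return True
    return sum(inside(px, py)
               for px in range(min(xs), max(xs) + 1)
               for py in range(min(ys), max(ys) + 1))

tri = [(0, 0), (60, 0), (0, 60)]      # lattice triangle, area 1800
print(lattice_point_count(tri) / shoelace_area(tri))   # ratio close to 1
```

The discrepancy comes from boundary points only, so it vanishes relative to the area as the polygon grows, which is the intuition behind the lemma above.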

{lemma}

For any polygon $P$ and $w > 0$: $A(P) \leq |P^+(w/2)| \leq A(P^+(w))$.

As an immediate consequence of Lemma 5 we can use the consistent samples in $P_i^+$, $1 \leq i \leq m$, for our estimates of the intersection and union, provided that $p$ is chosen large enough. It remains to show how a summary of $\mathcal{P}$ can be computed and how the summary can be used to answer union and intersection queries.

Constructing a summary.

For a given query $\mathcal{Q}$ containing $k$ polygons, let $C_{\mathcal{Q}} = \bigcap_{P_i \in \mathcal{Q}} P_i^+$, and let $U_{\mathcal{Q}} = \bigcup_{P_i \in \mathcal{Q}} P_i^+$. When the query is clear from the context we will write $C$ and $U$, respectively. Before giving the construction of summary and query algorithms we state two lemmas:

{lemma}

.

{lemma}

If and then . We will use the rectangle-efficient consistent sampling technique described in Section 3 to generate a summary of $\mathcal{P}$ that can be used to estimate the area of $C_{\mathcal{Q}}$ or $U_{\mathcal{Q}}$, where $\mathcal{Q}$ is a given subset of $\mathcal{P}$.

The idea of the construction algorithm for the summary is simple. Let $R = C_{\mathcal{Q}}$ or $R = U_{\mathcal{Q}}$ depending on the query, and assume that $|\mathcal{Q}| = k$. In a preprocessing step construct a summary of $\mathcal{P}$, denoted $\mathcal{S}(\mathcal{P})$. The summary will contain consistent samples for a number of different sampling rates. To answer a query, pick a minimum sampling rate that guarantees that the expected number of consistent samples in $R$ is small but sufficient to guarantee an $(\varepsilon, \delta)$-estimate of $A(R)$. If $\mathcal{S}(\mathcal{P})$ contains enough unique consistent samples then the algorithm reports an estimate of $A(R)$, otherwise it iteratively increases the sampling rate by a constant factor until $\mathcal{S}(\mathcal{P})$ contains sufficiently many unique consistent samples. From Section 3.1 we know that an $(\varepsilon, \delta)$-estimator of $A(R)$ requires the sampling rate to be approximately $1/(\varepsilon^2 \delta A(R))$.
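The query loop can be sketched as follows (a toy one-dimensional stand-in: a fully random hash replaces the 2-independent consistent-sampling hash of the text, and rates double rather than follow a sequence of primes; all names are ours):

```python
import random

random.seed(3)
M = 1 << 20
_h = {}
def hval(x):
    """Lazily materialized random hash into [0, M), a stand-in for the
    2-independent hash used for consistent sampling in the text."""
    if x not in _h:
        _h[x] = random.randrange(M)
    return _h[x]

def sample_at(region, level):
    """Consistent sample of the region at rate 2**level / M."""
    return [x for x in region if hval(x) < (1 << level)]

def query_estimate(region, need=100):
    """Double the sampling rate until `need` samples are seen, then
    rescale the sample size by the inverse rate, as described above."""
    for level in range(M.bit_length()):
        s = sample_at(region, level)
        if len(s) >= need:
            return len(s) * M / (1 << level)
    return float(len(region))

est = query_estimate(range(50000))
print(abs(est - 50000) / 50000)    # relative error, typically well under 20%
```

Because the samples at coarser rates are subsets of those at finer rates, the summary can store all levels compactly and stop at the first level that yields enough unique samples.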

From the two lemmas above we have that the smallest area that will ever be considered in a query has size at least the area of the smallest polygon in $\mathcal{P}$ and the largest area is at most the area of the bounding box of $\mathcal{P}$. To get an $(\varepsilon, \delta)$-estimate at least $\Omega(1/(\varepsilon^2 \delta))$ unique consistent samples are required to lie within $R$. As output from the above algorithm we get two data structures:

• : Returns a prime number between .

• : Returns the set of consistent samples within , i.e., points satisfying the equation . If the set is empty it returns False.
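One way to organize these two query structures per polygon is a pair of level-indexed maps. The sketch below is ours (class and field names are hypothetical), mirroring the two interfaces just described:

```python
class Summary:
    """Per-polygon summary: for each sampling level j, store the prime
    used at that level and the consistent samples drawn at that level."""

    def __init__(self):
        self.primes = {}        # level j -> prime p_j
        self.sample_sets = {}   # level j -> set of sampled grid points

    def prime(self, j):
        """Return the prime number stored for level j."""
        return self.primes[j]

    def samples(self, j):
        """Return the consistent samples at level j, or False if the
        set is empty, matching the interface described in the text."""
        s = self.sample_sets.get(j, set())
        return s if s else False
```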

Complexity.

Consider the total number of consistent samples generated for a polygon . The number of consistent samples is expected to increase by a factor of two in each iteration of the algorithm; that is, the expected total number of consistent samples forms an exponentially growing geometric series, which sums to . Summing over all the polygons, the total number of consistent samples is bounded by , which is also the expected size of the summary.

For the time complexity, we first note that the above procedure can be implemented so that iterations in which no consistent samples are expected to be generated are skipped entirely. Since at least a fraction of of all consistent samples in the minimal bounding box of is expected to lie within (which can be shown using an argument similar to the proof of Lemma 5), the total number of generated consistent samples is expected to be at most a factor of greater than the number of consistent samples in the summary. Each consistent sample requires at most time to generate, according to Theorem 1.2. If we assume that testing whether a consistent sample lies inside a polygon can be done in time, then the expected time to build a summary of is . A description of union and intersection queries can be found in the appendix. We can now summarize the results of this section:

{theorem}

Given a set of polygons and three constants and , if for all then, in the fuzzy model with respect to , there exists a summary of size such that, for any subset of containing polygons, an -estimate of can be computed in expected time and an -estimate of can be computed in expected time.

6 Conclusion and open problems

We have investigated efficient methods for consistent sampling and locality-sensitive hashing of 2-dimensional point sets. Though the methods are simple, it is not clear whether they are as useful in practice as, say, min-wise hashing. In addition to the question of practicality, some theoretical questions remain, for example whether the additive constant in our theorems can be avoided. Further, our measure of similarity between point sets is by no means the only possible one; it would be interesting to consider notions of similarity that are invariant under rotations and translations of the point set.

Acknowledgement.

We thank the anonymous reviewers for their useful comments.

References

• [1] S. Arya and D. M. Mount. Approximate range searching. Computational Geometry – Theory and Applications, 17:135–152, 2000.
• [2] Y. Bachrach and E. Porat. Sketching for big data recommender systems using fast pseudo-random fingerprints. In Proceedings of 40th International Colloquium on Automata, Languages, and Programming (ICALP), pages 459–471. Springer, 2013.
• [3] J. L. Bentley. Algorithms for Klee’s rectangle problems. Unpublished note, Computer Science Department, Carnegie Mellon University, 1977.
• [4] K. Bringmann and T. Friedrich. Approximating the volume of unions and intersections of high-dimensional geometric objects. Computational Geometry – Theory and Applications, 43(6-7):601–610, 2010.
• [5] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of International Conference on Compression and Complexity of Sequences (SEQUENCES), pages 21–29. IEEE, 1997.
• [6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
• [7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157–1166, 1997.
• [8] J. L. Carter and M. N. Wegman. Universal classes of hash functions. In Proceedings of 9th ACM Symposium on Theory of Computing (STOC), pages 106–112. ACM, 1977.
• [9] T. M. Chan. Klee’s measure problem made easy. In Proceedings of 54th IEEE Symposium on Foundations of Computer Science (FOCS), pages 410–419, 2013.
• [10] E. Charrier and L. Buzer. Approximating a real number by a rational number with a limited denominator: A geometric approach. Discrete Applied Mathematics, 157:3473–3484, 2009.
• [11] D. Z. Chen, M. H. M. Smid, and B. Xu. Geometric algorithms for density-based data clustering. International Journal of Computational Geometry and Applications, 15(3):239–260, 2005.
• [12] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci., 55(3):441–453, 1997.
• [13] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of 26th annual ACM Symposium on Principles of Distributed Computing (PODC), pages 225–234. ACM, 2007.
• [14] S. Dahlgaard and M. Thorup. Approximately minwise independence with twisted tabulation. In Proceedings of 14th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT), pages 134–145. Springer International Publishing, 2014.
• [15] T. K. Dang. Solving approximate similarity queries. Computer Systems Science and Engineering, 22(1-2):71–89, 2007.
• [16] S. A. Friedler and D. M. Mount. Spatio-temporal range searching over compressed kinetic sensor data. In Proceedings of 18th Annual European Symposium on Algorithms (ESA), pages 386–397. Springer Berlin Heidelberg, 2010.
• [17] Joachim Gudmundsson and Rasmus Pagh. Range-efficient consistent sampling and locality-sensitive hashing for polygons. CoRR, abs/1701.05290, 2017.
• [18] B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. arXiv:1410.4266, 2014.
• [19] J. Hershberger. Stable snap rounding. Computational Geometry – Theory and Applications, 46(4):403–416, 2013.
• [20] P. Indyk. A small approximately min-wise independent family of hash functions. Journal of Algorithms, 38(1):84–90, 2001.
• [21] Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In Proceedings of 10th IEEE International Conference on Data Mining (ICDM), pages 246–255, 2010.
• [22] V. Klee. Can the measure of $\bigcup_{i=1}^{n}[a_i,b_i]$ be computed in less than $O(n\log n)$ steps? American Mathematical Monthly, 84:284–285, 1977.
• [23] M. H. Overmars and C.-K. Yap. New upper bounds in Klee’s measure problem. SIAM Journal on Computing, 20(6):1034–1045, 1991.
• [24] A. Pavan and S. Tirthapura. Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing, 37(2):359–379, 2007.
• [25] G. Pick. Geometrisches zur Zahlenlehre. Sitzenber. Lotos (Prague), 19:311–319, 1899.
• [26] M. Thorup. Bottom- and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of 45th ACM Symposium on Theory of Computing (STOC), pages 371–380. ACM, 2013.
• [27] S. Tirthapura and D. Woodruff. Rectangle-efficient aggregation in spatial data streams. In Proceedings of 31st Symposium on Principles of Database Systems (PODS), pages 283–294. ACM, 2012.
• [28] J. van Leeuwen and D. Wood. The measure problem for rectangular ranges in -space. Journal of Algorithms, 2(3):282–300, 1981.
• [29] N. Y. Zolotykh. On the number of vertices in integer linear programming problems. Technical report, University of Nizhni Novgorod, 2000.

Appendix A 2-independence of H2

{lemma}

The family defined in (4) is 2-independent.

Proof.

Let us check that our hash function is in fact 2-independent. Let and such that . Let be the unique multiplicative inverse of ; note that this is guaranteed to exist if and only if is non-zero. What is the probability that and ? We have and . Since , we may assume without loss of generality that . Now fix . We get:

\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix} \begin{pmatrix} c \\ a \end{pmatrix} = \begin{pmatrix} t_1 - b\,y_1 \\ t_2 - b\,y_2 \end{pmatrix}

For every there exists exactly one pair such that the above equality holds. Since and are drawn uniformly and independently from , this probability is . ∎
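Assuming the family in (4) has the form $h(x,y) = (ax + by + c) \bmod p$ with $a, b, c$ drawn independently and uniformly from $\mathbb{Z}_p$ (our reading of the stripped formula), the counting argument above can be verified exhaustively for small primes:

```python
import itertools

def check_pairwise_independence(p):
    """Exhaustively check 2-independence of h(x, y) = (a*x + b*y + c) mod p
    over all (a, b, c) in Z_p^3, for two fixed inputs with x1 != x2."""
    (x1, y1), (x2, y2) = (0, 1), (1, 3)  # arbitrary inputs with x1 != x2
    counts = {}
    for a, b, c in itertools.product(range(p), repeat=3):
        pair = ((a * x1 + b * y1 + c) % p, (a * x2 + b * y2 + c) % p)
        counts[pair] = counts.get(pair, 0) + 1
    # 2-independence: each of the p^2 output pairs (t1, t2) occurs for
    # exactly p^3 / p^2 = p of the p^3 parameter triples.
    return len(counts) == p * p and all(v == p for v in counts.values())
```

For each fixed $b$ the $2\times 2$ system has a unique solution $(c, a)$, so each target pair is hit exactly $p$ times, which is what the brute-force count confirms.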

Appendix B Omitted material from Section 5

B.1 Proof of Lemma 5

Proof.

We use Pick’s theorem [25]. Let be the number of integer coordinates in the interior of and let be the number of integer coordinates on the boundary of . Pick’s theorem states:

 A(P) = i(P) + \frac{b(P)}{2} + h - 1,

where is the number of holes in . Since the lemma follows. ∎
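Pick's theorem can be sanity-checked numerically. The sketch below is ours and not part of the paper's algorithm; it assumes the hole-free case ($h = 0$, so $A(P) = i(P) + b(P)/2 - 1$) and compares the shoelace area against brute-force lattice-point counts:

```python
from math import gcd

def picks_theorem_holds(vertices):
    """Verify Pick's theorem A = i + b/2 - 1 on a simple lattice polygon
    without holes, via the shoelace formula and direct point counting."""
    edges = list(zip(vertices, vertices[1:] + vertices[:1]))
    # Shoelace formula: twice the absolute area (an integer here).
    twice_area = abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges))
    # Lattice points on the boundary: gcd(|dx|, |dy|) per edge.
    b = sum(gcd(abs(x2 - x1), abs(y2 - y1)) for (x1, y1), (x2, y2) in edges)

    def on_boundary(px, py):
        for (x1, y1), (x2, y2) in edges:
            cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
            if cross == 0 and min(x1, x2) <= px <= max(x1, x2) \
                    and min(y1, y2) <= py <= max(y1, y2):
                return True
        return False

    def strictly_inside(px, py):
        # Ray casting; boundary points are excluded before this is called.
        inside = False
        for (x1, y1), (x2, y2) in edges:
            if (y1 > py) != (y2 > py):
                xint = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < xint:
                    inside = not inside
        return inside

    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    i = sum(1 for px in range(min(xs), max(xs) + 1)
            for py in range(min(ys), max(ys) + 1)
            if not on_boundary(px, py) and strictly_inside(px, py))
    return twice_area == 2 * i + b - 2  # A = i + b/2 - 1, doubled
```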

B.2 Proof of Lemma 5

Proof.

The first inequality is immediate, and the second follows from Lemma 5. For the third inequality, note that any point in the plane within distance from an integer coordinate within will lie within ; hence . ∎

B.3 Proof of Lemma 5

Proof.

Since the second inequality is immediate. For the first inequality we first observe that

 A\bigl(P^{+\min(w/2)}\bigr) \le \frac{\pi}{2}\bigl((1+\phi)\cdot d_{\min}\bigr)^2. \qquad (5)

To see this, let and be two points on the boundary of with the largest inter-point distance among all points in . The distance between and is at most