Approximate Furthest Neighbor with Application to Annulus Query

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.


Rasmus Pagh (pagh@itu.dk), Francesco Silvestri (fras@itu.dk), Johan Sivertsen (jovt@itu.dk) & Matthew Skala (mska@itu.dk)

Abstract

Much recent work has been devoted to approximate nearest neighbor queries. Motivated by applications in recommender systems, we consider approximate furthest neighbor (AFN) queries and present a simple, fast, and highly practical data structure for answering AFN queries in high-dimensional Euclidean space. The method builds on the technique of Indyk (SODA 2003), storing random projections to provide sublinear query time for AFN. However, we introduce a different query algorithm, improving on Indyk’s approximation factor and reducing the running time by a logarithmic factor. We also present a variation based on a query-independent ordering of the database points; while this does not have the provable approximation factor of the query-dependent data structure, it offers significant improvement in time and space complexity. We give a theoretical analysis and experimental results. As an application, the query-dependent approach is used for deriving a data structure for the approximate annulus query problem, which is defined as follows: given an input set $S$ and two parameters $r > 0$ and $w \geq 1$, construct a data structure that returns for each query point $q$ a point $p \in S$ such that the distance between $p$ and $q$ is at least $r/w$ and at most $wr$.

Formal publication DOI: http://dx.doi.org/10.1016/j.is.2016.07.006

This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

1 Introduction

Similarity search is concerned with locating elements from a set $S$ that are close to a given query $q$. The query can be thought of as describing criteria we would like returned items to satisfy approximately. For example, if a customer has expressed interest in a product $q$, we may want to recommend other, similar products. However, we might not want to recommend products that are too similar, since that would not significantly increase the probability of a sale. Among the points that satisfy a near neighbor condition (“similar”), we would like to return those that also satisfy a furthest-point condition (“not too similar”), without explicitly computing the set of all near neighbors and then searching it. We refer to this problem as the annulus query problem. We claim that an approximate solution to the annulus query problem can be found by suitably combining Locality Sensitive Hashing (LSH), which is an approximation technique commonly used for finding the nearest neighbor of a query, with an approximation technique for furthest neighbor, which is the main topic of this paper.

The furthest neighbor problem consists of finding the point in an input set $S$ that maximizes the distance to a query point $q$. In this paper we investigate the approximate furthest neighbor problem in $d$-dimensional Euclidean space (i.e., $\ell_2^d$), with theoretical and experimental results. We then show how to cast one of our data structures to solve the annulus query problem. As shown in the opening example, the furthest neighbor problem has been used in recommender systems to create more diverse recommendations [said2013user, said2012increasing]. Moreover, the furthest neighbor is an important primitive in computational geometry that has been used for computing the minimum spanning tree and the diameter of a set of points [AgarwalMS92, Eppstein95].

Our focus is on approximate solutions because the exact version of the furthest neighbor problem would also solve exact similarity search in $d$-dimensional Hamming space, and thus is as difficult as that problem [Williams04, AhlePRS16]. The reduction follows from the fact that the complement of every sphere in Hamming space is also a sphere. That limits the hope we may have for an efficient solution to the exact version, so we consider the $c$-approximate furthest neighbor ($c$-AFN) problem where the task is to return a point $x'$ with $d(q, x') \geq \max_{x \in S} d(q, x)/c$, with $d(x, u)$ denoting the distance between two points. We will pursue randomized solutions having a small probability of not returning a $c$-AFN. The success probability can be made arbitrarily close to 1 by repetition.

We describe and analyze our data structures in Section 2. We propose two approaches, both based on random projections but differing in what candidate points are considered at query time. In the main query-dependent version the candidates will vary depending on the given query, while in the query-independent version the candidates will be a fixed set.

The query-dependent data structure is presented in Section 2.1. It returns the $c$-approximate furthest neighbor, for any $c > 1$, with probability at least $0.72$. When the number of dimensions is $O(\log n)$, our result requires $\tilde{O}(n^{1/c^2})$ time per query and $\tilde{O}(n^{2/c^2})$ total space, where $n$ denotes the input size. (The $\tilde{O}()$ notation omits polylog terms.) Theorem 4 gives bounds in the general case. This data structure is very similar to one proposed by Indyk [Indyk2003], but we use a different approach for the query algorithm.

The query-independent data structure is presented in Section 2.2. When the approximation factor is a constant strictly between $1$ and $2$, this approach requires $2^{O(d)}$ query time and space. This approach is significantly faster than the query-dependent approach when the dimensionality is small.

The space requirements of our data structures are quite high: the query-independent data structure requires space exponential in the dimension, while the query-dependent one requires more than linear space when $c < \sqrt{2}$. However, we claim that this bound cannot be significantly improved. In Section 2.3 we show that any data structure that solves the $c$-AFN by storing a suitable subset of the input points must store at least $\min\{n, 2^{\Omega(d)}\} - 1$ data points when $c < \sqrt{2}$.

Section 3 describes experiments on our data structure, and some modified versions, on real and randomly-generated data sets. In practice, we can achieve approximation factors significantly below the theoretical result, even with the query-independent version of the algorithm. We can also achieve good approximation in practice with significantly fewer projections and points examined than the worst-case bounds suggested by the theory. Our techniques are much simpler to implement than existing methods for $\sqrt{2}$-AFN, which generally require convex programming [clarkson1995vegas, matouvsek1996subexponential]. Our techniques can also be extended to general metric spaces.

Having developed an improved AFN technique we return to the annulus query problem in Section 4. We present a sublinear time solution to the approximate annulus query problem based on combining our AFN data structure with LSH techniques [Har-Peled2012].

A preliminary version of our data structures for $c$-AFN appeared in the proceedings of the 8th International Conference on Similarity Search and Applications (SISAP) [PaghSSS15].

1.1 Related work

Exact furthest neighbor

In two dimensions the furthest neighbor problem can be solved in linear space and logarithmic query time using point location in a furthest point Voronoi diagram (see, for example, de Berg et al. [CGbook08]). However, the space usage of Voronoi diagrams grows exponentially with the number of dimensions, making this approach impractical in high dimensions. More generally, an efficient data structure for the exact furthest neighbor problem in high dimension would lead to surprising algorithms for satisfiability [Williams04], so barring a breakthrough in satisfiability algorithms we must assume that such data structures are not feasible. Further evidence of the difficulty of exact furthest neighbor is the following reduction: Given a set $S \subseteq \{0,1\}^d$ and a query vector $q \in \{0,1\}^d$, a furthest neighbor (in Euclidean space) from $(1 - q)$ is a vector in $S$ of minimum Hamming distance to $q$. That is, exact furthest neighbor is at least as hard as exact nearest neighbor in $d$-dimensional Hamming space, which is also believed to be hard for large $d$ and worst-case data [Williams04].
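
The reduction can be checked by brute force on a small instance. The sketch below assumes binary vectors in $\{0,1\}^d$ and measures furthest neighbors from the coordinate-wise complement of the query; these choices and all function names are assumptions made for illustration.

```python
import itertools
import math


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def euclid(u, v):
    """Euclidean distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reduction_holds(d=5, step=3):
    """Check, for every q in {0,1}^d, that a Euclidean furthest neighbor
    from the complement of q is a Hamming nearest neighbor of q."""
    cube = list(itertools.product((0, 1), repeat=d))
    S = cube[::step]  # an arbitrary subset playing the role of the data set
    for q in cube:
        q_bar = tuple(1 - b for b in q)  # complement of the query
        furthest = max(S, key=lambda x: euclid(x, q_bar))
        if hamming(furthest, q) != min(hamming(x, q) for x in S):
            return False
    return True
```

For binary vectors the squared Euclidean distance to the complement of $q$ equals $d$ minus the Hamming distance to $q$, which is why the check succeeds.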

Approximate furthest neighbor

Agarwal et al. [AgarwalMS92] propose an algorithm for computing the $c$-AFN for all points in a set $S$ in time $O(n/(c-1)^{(d-1)/2})$ where $n = |S|$ and $1 < c < 2$. Bespamyatnikh [Bespam:Dynamic] gives a dynamic data structure for $c$-AFN. This data structure relies on fair split trees and requires $O(1/(c-1)^{d-1})$ time per query and $O(dn)$ space, with $1 < c < 2$. The query times of both results exhibit an exponential dependency on the dimension. Indyk [Indyk2003] proposes the first approach avoiding this exponential dependency, by means of multiple random projections of the data and query points to one dimension. More precisely, Indyk shows how to solve a fixed radius version of the problem where given a parameter $r$ the task is to return a point at distance at least $r/c$ given that there exist one or more points at distance at least $r$. Then, he gives a solution to the furthest neighbor problem with approximation factor $c + \delta$, where $\delta > 0$ is a sufficiently small constant, by reducing it to queries on many copies of that data structure. The overall result is space $\tilde{O}(dn^{1+1/c^2})$ and query time $\tilde{O}(dn^{1/c^2})$, which improved the previous lower bound when $d = \Omega(\log n)$. The data structure presented in this paper shows that the same basic method, multiple random projections to one dimension, can be used for solving $c$-AFN directly, avoiding the intermediate data structures for the fixed radius version. Our result is then a simpler data structure that works for all radii and, being interested in static queries, we are able to reduce the space to $\tilde{O}(dn^{2/c^2})$.

Methods based on an enclosing ball

Goel et al. [GIV01] show that a $\sqrt{2}$-approximate furthest neighbor can always be found on the surface of the minimum enclosing ball of $S$. More specifically, there is a set $S^*$ of at most $d+1$ points from $S$ whose minimum enclosing ball contains all of $S$, and returning the furthest point in $S^*$ always gives a $\sqrt{2}$-approximation to the furthest neighbor in $S$. This method is query independent in the sense that it examines the same set of points for every query. Conversely, Goel et al. [GIV01] show that for a random data set consisting of $n$ (almost) orthonormal vectors, finding a $c$-approximate furthest neighbor for a constant $c < \sqrt{2}$ gives the ability to find an $O(1)$-approximate near neighbor. Since it is not known how to do that in time $n^{o(1)}$, it is reasonable to aim for query times of the form $n^{f(c)}$ for approximation $c < \sqrt{2}$.

Applications in recommender systems

Several papers on recommender systems have investigated the use of furthest neighbor search [said2013user, said2012increasing]. The aim there was to use furthest neighbor search to create more diverse recommendations. However, these papers do not address performance issues related to furthest neighbor search, which are the main focus of our paper. The data structures presented in this paper are intended to improve performance in recommender systems relying on furthest neighbor queries. Other related works on recommender systems include those of Abbar et al. [abbar2013real] and Indyk et al. [indyk2014composable], which use core-set techniques to return a small set of recommendations no two of which are too close. In turn, core-set techniques also underpin works on approximating the minimum enclosing ball [badoiu2008optimal, KMY03].

1.2 Notation

We use the following notation throughout:

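  • $B(x, r)$ for the set of all points in a ball of radius $r$ with center $x$.

  • $A(q, r, w)$ for the annulus between two balls, that is $A(q, r, w) = B(q, rw) \setminus B(q, r/w)$. For an example, see Figure 1.

  • $[n]$ for the integers $1, \ldots, n$.

  • $\arg\max^m_{x \in S} f(x)$ for the set of $m$ elements from $S$ that have the largest values of $f(x)$, breaking ties arbitrarily.

  • $N(\mu, \sigma^2)$ for the normal distribution with mean $\mu$ and variance $\sigma^2$.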
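
As a concrete reading of this notation over a finite point set, the following brute-force helpers may be useful. This is a minimal sketch: the function names, the parameter names `r` and `w`, and the inclusive treatment of the annulus boundaries (following the wording of the abstract) are assumptions.

```python
import heapq
import math


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def ball(S, center, r):
    """Points of S inside the ball of radius r around center."""
    return [x for x in S if dist(x, center) <= r]


def annulus(S, q, r, w):
    """Points of S whose distance from q is at least r/w and at most r*w."""
    return [x for x in S if r / w <= dist(x, q) <= r * w]


def arg_top_m(S, f, m):
    """The m elements of S with the largest values of f, largest first."""
    return heapq.nlargest(m, S, key=f)
```

These linear scans serve only as a reference against which a sublinear data structure can be checked.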

Figure 1: The $(r, w)$-annulus query.

2 Algorithms and analysis

2.1 Furthest neighbor with query-dependent candidates

Our data structure works by choosing a random line and storing the order of the data points along it. Two points far apart on the line are at least as far apart in the original space. So given a query we can find the points furthest from the query on the projection line, and take those as candidates to be the furthest point in the original space. We build several such data structures and query them in parallel, merging the results.

Given a set $S \subseteq \mathbb{R}^d$ of size $n$ (the input data), let $\ell = 2n^{1/c^2}$ (the number of random lines) and $m = 1 + e^2 \ell \log^{c^2/2 - 1/3} n$ (the number of candidates to be examined at query time), where $c > 1$ is the desired approximation factor. We pick $\ell$ random vectors $a_1, \ldots, a_\ell \in \mathbb{R}^d$ with each entry of $a_i$ coming from the standard normal distribution $N(0, 1)$.

For any $1 \leq i \leq \ell$, we let $S_i = \arg\max^m_{x \in S} a_i \cdot x$ and store the elements of $S_i$ in sorted order according to the value $a_i \cdot x$. Our data structure for $c$-AFN consists of $\ell$ subsets $S_1, \ldots, S_\ell \subseteq S$, each of size $m$. Since these subsets come from independent random projections, they will not necessarily be disjoint in general; but in high dimensions, they are unlikely to overlap very much. At query time, the algorithm searches for the furthest point from the query $q$ among the $m$ points in $S_1, \ldots, S_\ell$ that maximize $a_i \cdot x - a_i \cdot q$, where $x$ is a point of $S_i$ and $a_i$ the random vector used for constructing $S_i$. The pseudocode is given in Algorithm 1. We observe that although the data structure is essentially that of Indyk [Indyk2003], our technique differs in the query procedure.

initialize a priority queue of (point, integer) pairs, indexed by real keys
for $i = 1$ to $\ell$ do
     compute and store $a_i \cdot q$
     create an iterator into $S_i$, moving in decreasing order of $a_i \cdot x$
     get the first element $x$ from $S_i$ and advance the iterator
     insert $(x, i)$ in the priority queue with key $a_i \cdot x - a_i \cdot q$
end for
$rval \leftarrow \bot$
for $j = 1$ to $m$ do
     extract highest-key element $(x, i)$ from the priority queue
     if $rval = \bot$ or $x$ is further than $rval$ from $q$ then
         $rval \leftarrow x$
     end if
     get the next element $x'$ from $S_i$ and advance the iterator
     insert $(x', i)$ in the priority queue with key $a_i \cdot x' - a_i \cdot q$
end for
return $rval$
Algorithm 1 Query-dependent approximate furthest neighbor

Note that early termination is possible if the distance $r$ from the query to its furthest neighbor is known at query time.
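
The construction and the query procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the names `build_index` and `query_afn`, the use of `heapq`, and storing (projection value, point index) pairs are assumptions, and the number of projections `ell` and the number of candidates `m` are passed in directly rather than derived from the approximation factor.

```python
import heapq
import math
import random


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def dist(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def build_index(S, ell, m, seed=0):
    """For each of ell Gaussian vectors, keep the m points of S with the
    largest projection values, sorted in decreasing order of projection."""
    rng = random.Random(seed)
    d = len(S[0])
    index = []
    for _ in range(ell):
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        top = heapq.nlargest(m, ((dot(a, x), j) for j, x in enumerate(S)))
        index.append((a, top))
    return index


def query_afn(S, index, q, m):
    """Examine up to m candidates in decreasing order of a_i.x - a_i.q."""
    heap = []  # heapq is a min-heap, so keys are negated
    for i, (a, top) in enumerate(index):
        aq = dot(a, q)
        if top:
            heapq.heappush(heap, (-(top[0][0] - aq), i, 0, aq))
    best, best_dist = None, -1.0
    for _ in range(m):
        if not heap:
            break
        _, i, pos, aq = heapq.heappop(heap)
        top = index[i][1]
        x = S[top[pos][1]]
        dx = dist(x, q)
        if dx > best_dist:
            best, best_dist = x, dx
        if pos + 1 < len(top):  # advance the iterator of list i
            heapq.heappush(heap, (-(top[pos + 1][0] - aq), i, pos + 1, aq))
    return best
```

With a single projection and `m` equal to the number of points, every point is examined and the exact furthest neighbor is returned, which gives a simple sanity check; smaller values trade accuracy for time.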

Correctness and analysis

The algorithm examines distances to a set of at most $m$ points selected from the $S_i$; we will call this set $S_q$:

$$S_q \subseteq \bigcup_{i=1}^{\ell} S_i, \qquad |S_q| \leq m.$$

We choose the name $S_q$ to emphasize that the set changes based on $q$. Our algorithm succeeds if and only if $S_q$ contains a $c$-approximate furthest neighbor. We now prove that this happens with constant probability.

We make use of the following standard lemmas that can be found, for example, in the work of Datar et al. [Datar04] and Karger, Motwani, and Sudan [KMS98].

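Lemma 1 (See Section 3.2 of Datar et al. [Datar04]).

For every choice of vectors $x, y \in \mathbb{R}^d$:

$$\frac{a_i \cdot (x - y)}{\|x - y\|_2} \sim N(0, 1).$$

Lemma 2 (See Lemma 7.4 in Karger, Motwani, and Sudan [KMS98]).

For every $t > 0$, if $X \sim N(0, 1)$ then

$$\frac{1}{\sqrt{2\pi}} \cdot \left(\frac{1}{t} - \frac{1}{t^3}\right) \cdot e^{-t^2/2} \leq \Pr[X \geq t] \leq \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{t} \cdot e^{-t^2/2}.$$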

The next lemma follows, as suggested by Indyk [Indyk2003, Claims 2-3].

Lemma 3.

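Let $p$ be a furthest neighbor from the query $q$ with $r = \|p - q\|_2$, and let $p'$ be a point such that $\|p' - q\|_2 < r/c$. Let $\Delta = rt/c$ with $t$ satisfying the equation $e^{t^2/2} t^{c^2} = n/(2\pi)^{c^2/2}$ (that is, $t = O(\sqrt{\log n})$). Then, for a sufficiently large $n$, we have

$$\Pr_a[a \cdot (p' - q) \geq \Delta] \leq \frac{\log^{c^2/2 - 1/3} n}{n}, \qquad \Pr_a[a \cdot (p - q) \geq \Delta] \geq (1 - o(1)) \frac{1}{n^{1/c^2}}.$$

Proof.

Let $X \sim N(0, 1)$. By Lemma 1 and the right part of Lemma 2, we have for a point $p'$ that

$$\Pr_a[a \cdot (p' - q) \geq \Delta] = \Pr\left[X \geq \frac{\Delta}{\|p' - q\|_2}\right] \leq \Pr[X \geq t] \leq \frac{1}{\sqrt{2\pi}} \cdot \frac{e^{-t^2/2}}{t} = \frac{\left(t \sqrt{2\pi}\right)^{c^2 - 1}}{n} \leq \frac{\log^{c^2/2 - 1/3} n}{n}.$$

The last step follows because $e^{t^2/2} t^{c^2} = n/(2\pi)^{c^2/2}$ implies that $t = O(\sqrt{\log n})$, and holds for a sufficiently large $n$. Similarly, by Lemma 1 and the left part of Lemma 2, we have for a furthest neighbor $p$ that

$$\Pr_a[a \cdot (p - q) \geq \Delta] = \Pr\left[X \geq \frac{t}{c}\right] \geq \frac{1}{\sqrt{2\pi}} \left(\frac{c}{t} - \frac{c^3}{t^3}\right) e^{-t^2/(2c^2)} \geq (1 - o(1)) \frac{1}{n^{1/c^2}}.$$

∎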

Theorem 4.

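The data structure when queried by Algorithm 1 returns a $c$-AFN of a given query with probability $0.72$ in

$$O\left(n^{1/c^2} \log^{c^2/2 - 1/3} n \, (d + \log n)\right)$$

time per query. The data structure requires $O(n^{1 + 1/c^2}(d + \log n))$ preprocessing time and total space

$$O\left(\min\left\{ d n^{2/c^2} \log^{c^2/2 - 1/3} n, \; dn + n^{2/c^2} \log^{c^2/2 - 1/3} n \right\}\right).$$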

Proof.

The space required by the data structure is the space required for storing the $\ell$ sets $S_i$. If for each set $S_i$ we store the $m$ points and the projection values, then $O(\ell m d)$ memory words are required. On the other hand, if pointers to the input points are stored, then the total required space is $O(\ell m + nd)$. The representations are equivalent, and the best one depends on the value of $n$ and $d$. The claim on space requirement follows. The preprocessing time is dominated by the computation of the $n\ell$ projection values and by the sorting for computing the sets $S_i$. Finally, the query time is dominated by the at most $2m$ insertion or deletion operations on the priority queue and the $O(md)$ cost of searching for the furthest neighbor, $O(m(\log \ell + d))$.

We now lower bound the success probability. As in the statement of Lemma 3, we let $p$ denote a furthest neighbor from $q$, $r = \|p - q\|_2$, $p'$ be a point such that $\|p' - q\|_2 < r/c$, and $\Delta = rt/c$ with $t$ such that $e^{t^2/2} t^{c^2} = n/(2\pi)^{c^2/2}$. The query succeeds if: (i) $a_i \cdot (p - q) \geq \Delta$ for at least one projection vector $a_i$, and (ii) the (multi)set $\hat{S} = \{p' \mid \exists i : a_i \cdot (p' - q) \geq \Delta, \|p' - q\|_2 < r/c\}$ contains at most $m - 1$ points (i.e., there are at most $m - 1$ near points each with a distance from the query at least $\Delta$ in some projection). If (i) and (ii) hold, then the set of candidates examined by the algorithm must contain the furthest neighbor $p$ since there are at most $m - 1$ points near to $q$ with projection values larger than the maximum projection value of $p$. Note that we do not consider points at distance larger than $r/c$ but smaller than $r$: they are $c$-approximate furthest neighbors of $q$ and can only increase the success probability of our data structure.

By Lemma 3, event (i) happens with probability $1/n^{1/c^2}$. Since there are $\ell = 2n^{1/c^2}$ independent projections, this event fails to happen with probability at most $(1 - 1/n^{1/c^2})^{2n^{1/c^2}} \leq e^{-2}$. For a point $p'$ at distance at most $r/c$ from $q$, the probability that $a_i \cdot (p' - q) \geq \Delta$ is less than $(\log^{c^2/2 - 1/3} n)/n$ by Lemma 3. Since there are $\ell$ projections of $n$ points, the expected number of such points is $\ell \log^{c^2/2 - 1/3} n$. Then, we have that $|\hat{S}|$ is greater than $m - 1$ with probability at most $1/e^2$ by the Markov inequality. Note that a Chernoff bound cannot be used since there exists a dependency among the projections onto the same random vector $a_i$. By a union bound, we can therefore conclude that the algorithm succeeds with probability at least $1 - 2/e^2 \geq 0.72$. ∎
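
The constant in this argument can be traced with a short calculation. The worked equation below is a sketch under stated assumptions: it takes the parameter values to be $\ell = 2n^{1/c^2}$ and $m = 1 + e^2 \ell \log^{c^2/2 - 1/3} n$, writes $\hat{S}$ for the multiset of near points with large projections, and treats the $1 - o(1)$ factor of Lemma 3 as $1$.

```latex
\begin{align*}
\Pr[\text{(i) fails}] &\le \left(1 - n^{-1/c^2}\right)^{2 n^{1/c^2}} \le e^{-2}, \\
\Pr[\text{(ii) fails}] &\le \frac{\mathbb{E}\,|\hat{S}|}{m - 1}
   \le \frac{\ell \log^{c^2/2 - 1/3} n}{e^{2}\, \ell \log^{c^2/2 - 1/3} n} = e^{-2}, \\
\Pr[\text{success}] &\ge 1 - 2 e^{-2} \approx 0.729 .
\end{align*}
```

The small gap between $0.729$ and the stated $0.72$ leaves room for lower-order terms.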

2.2 Furthest neighbor with query-independent candidates

Suppose instead of determining the candidates depending on the query point by means of a priority queue, we choose a fixed candidate set to be used for every query. The $\sqrt{2}$-approximation based on the minimum enclosing sphere is one example of such a query-independent algorithm. In this section we consider a query-independent variation of our projection-based algorithm.

During preprocessing, we choose $\ell$ unit vectors $a_1, a_2, \ldots, a_\ell$ independently and uniformly at random over the sphere of unit vectors in $d$ dimensions. We project the $n$ data points in $S$ onto each of these unit vectors and choose the extreme data point in each projection; that is,

$$\left\{ \arg\max_{x \in S} x \cdot a_i \;\middle|\; i \in [\ell] \right\}.$$

The data structure stores the set of all data points so chosen; there are at most $\ell$ of them, independent of $n$. At query time, we check the query point $q$ against all the points we stored, and return the furthest one.
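
A minimal Python sketch of this query-independent structure follows. The function names are hypothetical, and uniform random unit vectors are generated by normalizing Gaussian vectors, which is a standard technique assumed here rather than taken from the text.

```python
import math
import random


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def dist(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def build_extremes(S, ell, seed=0):
    """Keep the point with the largest projection on each of ell directions."""
    rng = random.Random(seed)
    d = len(S[0])
    chosen = set()
    for _ in range(ell):
        a = random_unit_vector(d, rng)
        chosen.add(max(range(len(S)), key=lambda j: dot(a, S[j])))
    return [S[j] for j in sorted(chosen)]  # at most ell distinct points


def query_extremes(candidates, q):
    """Return the stored point furthest from q."""
    return max(candidates, key=lambda x: dist(x, q))
```

Because the candidate set does not depend on the query, the query is a plain linear scan over at most `ell` stored points.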

To prove a bound on the approximation, we will use the following result of Böröczky and Wintsche [Boroczky:Covering, Corollary 1.2]. Note that their notation differs from ours in that they use $d$ for the dimensionality of the surface of the sphere, hence one less than the dimensionality of the vectors, and $c$ for the constant, conflicting with our $c$ for approximation factor. We state the result here in terms of our own variable names.

Lemma 5 (See Corollary 1.2 in Böröczky and Wintsche [Boroczky:Covering]).

For any angle with , in -dimensional Euclidean space, there exists a set of at most unit vectors such that for every unit vector , there exists some with the angle between and at most , and

(1)

where is a universal constant.

Let $\theta = \frac{1}{2} \arccos \frac{1}{c}$; that is, half the angle between two unit vectors whose dot product is $1/c$, as shown in Figure 2. Then by choosing $\ell$ unit vectors uniformly at random, we will argue that with high probability we choose a set of unit vectors such that every unit vector has dot product at least $1/c$ with at least one of them. Then the data structure achieves $c$-approximation on all queries.

Figure 2: Choosing $\theta$.

Theorem 6.

With $\ell = O(f(c)^d)$ for some function $f$ of $c$ and any $c$ such that $1 < c < 2$, with high probability over the choice of the projection vectors, the data structure returns a $d$-dimensional $c$-approximate furthest neighbor on every query.

Proof.

Let . Then, since is between and , we can apply the usual half-angle formulas as follows:

Substituting into (1) from Lemma 5 gives

Let $T$ be the set of unit vectors from Lemma 5; every unit vector on the sphere is within angle at most $\theta$ from one of them. The vectors in $T$ are the centres of a set of spherical caps that cover the sphere.

Since the caps are all of equal size and they cover the sphere, there is probability at least $1/|T|$ that a unit vector chosen uniformly at random will be inside any given cap. Let $\ell = 2|T| \ln |T|$. This $\ell$ is $O(f(c)^d)$. Then for each of the caps, the probability none of the projection vectors is within that cap is $(1 - 1/|T|)^\ell$, which approaches $\exp(-2 \ln |T|) = |T|^{-2}$. By a union bound, the probability that every cap is hit is at least $1 - 1/|T|$. Suppose this occurs.

Then for any query, the vector between the query and the true furthest neighbor will have angle at most $\theta$ with some vector in $T$, and that vector will have angle at most $\theta$ with some projection vector used in building the data structure. Figure 2 illustrates these steps: if $Q$ is the query and $P$ is the true furthest neighbor, a projection onto the unit vector in the direction from $Q$ to $P$ would give a perfect approximation. The sphere covering guarantees the existence of a unit vector $S$ within an angle $\theta$ of this perfect projection; and then we have high probability of at least one of the random projections also being within an angle $\theta$ of $S$. If that random projection returns some candidate other than the true furthest neighbor, the worst case is if it returns the point labelled $R$, which is still a $c$-approximation. We have such approximations for all queries simultaneously with high probability over the choice of the $\ell$ projection vectors. ∎

Note that we could also achieve $c$-approximation deterministically, with somewhat fewer projection vectors, by applying Lemma 5 directly with $\phi = \arccos(1/c)$ and using the centres of the covering caps as the projection vectors instead of choosing them randomly. That would require implementing an explicit construction of the covering, however. Böröczky and Wintsche [Boroczky:Covering] argue that their result is optimal to within a factor $O(\log d)$, so not much asymptotic improvement is possible.

2.3 A lower bound on the approximation factor

In this section, we show that a data structure aiming at an approximation factor less than $\sqrt{2}$ must use space $\min\{n, 2^{\Omega(d)}\} - 1$ on worst-case data. The lower bound holds for those data structures that compute the approximate furthest neighbor by storing a suitable subset of the input points.

Theorem 7.

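Consider any data structure $D$ that computes the $c$-AFN of an $n$-point input set $S \subseteq \mathbb{R}^d$ by storing a subset of the data set. If $c = \sqrt{2}(1 - \epsilon)$ with $\epsilon \in (0, 1)$, then the algorithm must store at least $\min\{n, 2^{\Omega(\epsilon^2 d)}\} - 1$ points.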

Proof.

Suppose there exists a set of size such that for any we have and , with . We will later prove that such a set exists. We now prove by contradiction that any data structure requiring less than input points cannot return a -approximation.

Assume . Consider the input set consisting of arbitrary points of and set the query to , where is an input point not in the data structure. The furthest neighbor is and it is at distance . On the other hand, for any point in the data structure, we get

Therefore, the point returned by the data structure cannot be better than a approximation with

(2)

The claim follows by setting .

Assume now that . Without loss of generality, let be a multiple of . Consider as input set the set containing copies of each vector in , each copy expanded by a factor for any ; specifically, let . By assumption, the data structure can store at most points and hence there exists a point such that is not in the data structure for every . Consider the query where . The furthest neighbor of in is and it has distance . On the other hand, for every point in the data structure, we get

We then get the same approximation factor given in equation 2, and the claim follows.

The existence of the set of size follows from the Johnson-Lindenstrauss lemma [Matousek:JL]. Specifically, consider an orthonormal basis of . Since , by the Johnson-Lindenstrauss lemma there exists a linear map such that and for any . We also have that , and hence . It then suffices to set to . ∎

The lower bound translates into the number of points that must be read by each query. However, this does not apply to query-dependent data structures.

3 Furthest neighbor experiments

We implemented several variations of furthest neighbor query in both the C and F# programming languages. This code is available online (https://github.com/johanvts/FN-Implementations). Our C implementation is structured as an alternate index type for the SISAP C library [SISAP:Library], returning the furthest neighbor instead of the nearest.

We selected five databases for experimentation: the “nasa” and “colors” vector databases from the SISAP library; two randomly generated databases of 10-dimensional vectors each, one using a multidimensional normal distribution and one uniform on the unit cube; and the MovieLens 20M dataset [Harper:MovieLens]. The 10-dimensional random distributions were intended to represent realistic data, but their intrinsic dimensionality as measured by the statistic of Chávez and Navarro [Chavez:Intrinsic] is significantly higher than what we would expect to see in real-life applications.

For each database and each choice of $\ell$ from 1 to 30 and $m$ from 1 to $4\ell$, we made 1000 approximate furthest neighbor queries. To provide a representative sample over the randomization of both the projection vectors and the queries, we used 100 different seeds for generation of the projection vectors, and did 10 queries (each uniformly selected from the database points) with each seed. We computed the approximation achieved, compared to the true furthest neighbor found by brute force, for every query. The resulting distributions are summarized in Figures 3–7.
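
The measurement loop described above can be sketched as follows. This is not the code used for the experiments: `make_query_fn` is a hypothetical callback that builds an index from a seed and returns a function mapping a query to a point, and the handling of a zero returned distance is an assumption.

```python
import math
import random


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def approximation_factor(S, q, returned):
    """True furthest distance (by brute force) over the returned distance."""
    true_furthest = max(dist(x, q) for x in S)
    got = dist(returned, q)
    return math.inf if got == 0.0 else true_furthest / got


def run_trials(S, make_query_fn, seeds, queries_per_seed):
    """Collect one approximation factor per (seed, query) pair."""
    factors = []
    for seed in seeds:
        answer = make_query_fn(S, seed)  # hypothetical: index build + query
        rng = random.Random(seed)
        for _ in range(queries_per_seed):
            q = rng.choice(S)  # queries drawn uniformly from the database
            factors.append(approximation_factor(S, q, answer(q)))
    return factors
```

Calling `run_trials(S, make_query_fn, range(100), 10)` would produce the 1000 samples per parameter setting from which ranges, quartiles, medians and means can be computed.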

[Plot: vertical axis from 1 to 1.5; horizontal axis from 0 to 30, projections and points examined ($\ell = m$); series: range/quartiles/median, sample mean, query-independent]
Figure 3: Experimental results for 10-dimensional uniform distribution

We also ran some experiments on higher-dimensional random vector databases (with 30 and 100 dimensions, in particular) and saw approximation factors very close to those achieved for 10 dimensions.

$\ell$ vs. $m$ tradeoff

The two parameters $\ell$ and $m$ both improve the approximation as they increase, and they each have a cost in the time and space bounds. The best tradeoff is not clear from the analysis. We chose $\ell = m$ as a typical value, but we also collected data on many other parameter choices.

[Plot: vertical axis from 1 to 1.5; horizontal axis from 0 to 30, projections and points examined ($\ell = m$); series: range/quartiles/median, sample mean, query-independent]
Figure 4: Experimental results for 10-dimensional normal distribution

[Plot: vertical axis from 1 to 1.5; horizontal axis from 0 to 30, projections and points examined ($\ell = m$); series: range/quartiles/median, sample mean, query-independent]
Figure 5: Experimental results for SISAP nasa database

[Plot: vertical axis from 1 to 1.5; horizontal axis from 0 to 30, projections and points examined ($\ell = m$); series: range/quartiles/median, sample mean, query-independent]
Figure 6: Experimental results for SISAP colors database

[Plot: horizontal axis: projections and points examined ($\ell = m$); series: range/quartiles/median, sample mean, query-independent]
Figure 7: Experimental results for MovieLens 20M database

[Plot: vertical axis from 1 to 1.5; horizontal axis from 0 to 14, points examined ($m$), with $\ell m = 48$; series: range/quartiles/median, sample mean, query-independent]
Figure 8: The tradeoff between $\ell$ and $m$ on 10-dimensional normal vectors

Figure 8 offers some insight into the tradeoff: since the cost of doing a query is roughly proportional to both $\ell$ and $m$, we chose a fixed value for their product, $\ell \cdot m = 48$, and plotted the approximation results in relation to $m$ given that, for the database of normally distributed vectors in 10 dimensions. As the figure shows, the approximation factor does not change much with the tradeoff between $\ell$ and $m$.

Query-independent ordering

The furthest-neighbor algorithm described in Section 2.1 examines candidates for the furthest neighbor in a query-dependent order. In order to compute the order for arbitrary queries, we must store $m$ point IDs for each of the $\ell$ projections, and use a priority queue data structure during query, incurring some costs in both time and space. It seems intuitively reasonable that the search will usually examine points in a very similar order regardless of the query: first those that are outliers, on or near the convex hull of the database, and then working its way inward.

We implemented a modified version of the algorithm in which the index stores a single ordering of the points. Given a set $S \subseteq \mathbb{R}^d$ of size $n$, for each point $x \in S$ let $key(x) = \max_{i \in 1 \ldots \ell} a_i \cdot x$. The key for each point is its greatest projection value on any of the $\ell$ randomly-selected projections. The data structure stores points (all of them, or enough to accommodate the largest $m$ we plan to use) in order of decreasing key value: $x_1, x_2, \ldots$, where $key(x_1) \geq key(x_2) \geq \cdots$. Note that this is not the same query-independent data structure discussed in Section 2.2; it differs both in the set of points stored and the order of sorting them.

The query examines the first $m$ points in the query-independent ordering and returns the one furthest from the query point. The sample mean approximation factor for this algorithm in our experiments is shown by the dotted lines in Figures 3–8.

$rval \leftarrow \bot$
for $j = 1$ to $m$ do
     if $rval = \bot$ or $x_j$ is further than $rval$ from $q$ then
         $rval \leftarrow x_j$
     end if
end for
return $rval$
Algorithm 2 Query-independent approximate furthest neighbor
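
The ordering and Algorithm 2 can be sketched in Python as follows. The function names and the use of Gaussian projection vectors are assumptions made for illustration; the key of a point is its largest projection value over all projections, as described above.

```python
import math
import random


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def dist(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def build_ordering(S, ell, seed=0):
    """Sort points by decreasing key, where the key of a point is its
    largest projection value over ell random Gaussian vectors."""
    rng = random.Random(seed)
    d = len(S[0])
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(ell)]
    return sorted(S, key=lambda x: max(dot(a, x) for a in A), reverse=True)


def query_ordering(order, q, m):
    """Scan the first m points of the fixed ordering; keep the furthest."""
    rval = None
    for x in order[:m]:
        if rval is None or dist(x, q) > dist(rval, q):
            rval = x
    return rval
```

The query needs no priority queue and no projection of the query point, and `m` can be chosen freely at query time up to the number of stored points.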
Variations on the algorithm

We have experimented with a number of practical improvements to the algorithm. The most significant is to use the rank-based depth of projections rather than the projection value. In this variation we sort the points by their projection value $a_i \cdot x$ for each $i$. The first and last point then have depth 0, the second and second-to-last have depth 1, and so on up to the middle at depth $n/2$. We find the minimum depth of each point over all projections and store the points in a query-independent order using the minimum depth as the key. This approach seems to give better results in practice. A further improvement is to break ties in the minimum depth by count of how many times that depth is achieved, giving more priority to investigating points that repeatedly project to extreme values. Although such algorithms may be difficult to analyse in general, we give some results in Section 2.2 for the case where the data structure stores exactly the one most extreme point from each projection.
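
The rank-based depth ordering with the tie-breaking rule can be sketched as below. This is one possible reading of the description, with hypothetical names; in particular, breaking ties by preferring points that achieve their minimum depth more often is implemented by sorting on the pair (minimum depth, negated count).

```python
import random


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def depth_ordering(S, ell, seed=0):
    """Order points by minimum rank-based depth over ell projections,
    breaking ties by how often that minimum depth is achieved."""
    rng = random.Random(seed)
    n, d = len(S), len(S[0])
    min_depth = [n] * n
    hits = [0] * n
    for _ in range(ell):
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        ranked = sorted(range(n), key=lambda j: dot(a, S[j]))
        for rank, j in enumerate(ranked):
            depth = min(rank, n - 1 - rank)  # 0 for first and last, etc.
            if depth < min_depth[j]:
                min_depth[j], hits[j] = depth, 1
            elif depth == min_depth[j]:
                hits[j] += 1
    order = sorted(range(n), key=lambda j: (min_depth[j], -hits[j]))
    return [S[j] for j in order]
```

The resulting list can be scanned exactly as in Algorithm 2.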

The number of points examined $m$ can be chosen per query and even during a query, allowing for interactive search. After returning the best result for some $m$, the algorithm can continue to a larger $m$ for a possibly better approximation factor on the same query. The smooth tradeoff we observed between $\ell$ and $m$ suggests that choosing an $\ell$ during preprocessing will not much constrain the eventual choice of $m$.

Discussion

The main experimental result is that the algorithm works very well for the tested datasets in terms of returning good approximations of the furthest neighbor. Even for small numbers of projections $\ell$ and points examined $m$ the algorithm returns good approximations. Another result is that the query-independent variation of the algorithm returns points only slightly worse than the query-dependent one. The query-independent algorithm is simpler to implement, it can be queried in time as opposed to and uses only storage. In many cases these advantages more than make up for the slightly worse approximation observed in these experiments. However, by Theorem 7, to guarantee approximation the query-independent ordering version would need to store and read points.

In data sets of high intrinsic dimensionality, the furthest point from a query may not be much further than any randomly selected point, and we can ask whether our results are any better than a trivial random selection from the database. The intrinsic dimensionality statistic $\rho$ of Chávez and Navarro [Chavez:Intrinsic] provides some insight into this question. Note that intrinsic dimensionality as measured by $\rho$ is not the same thing as the number of coordinates in a vector. For real data sets it is often much smaller than that. Intrinsic dimensionality also applies to data sets that are not vectors and do not have coordinates. Skala proves a formula for the value of $\rho$ on a multidimensional normal distribution [Skala:Dissertation, Theorem 2.10]; it is $\rho = 9.768\ldots$ for the 10-dimensional distribution used in Figure 4. With the definition $\rho = \mu^2 / 2\sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of the distance between two randomly selected points, this means the standard deviation of a randomly selected distance will be about 23% of the mean distance. Our experimental results come much closer than that to the true furthest distance, and so are non-trivial.
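
The statistic can be estimated from a sample of pairwise distances. The sketch below is an assumption-laden illustration rather than the procedure used for the paper: it estimates the mean and variance from randomly sampled pairs, and the function name, the number of pairs and the seed are arbitrary.

```python
import math
import random
import statistics


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def intrinsic_dimensionality(S, pairs=2000, seed=0):
    """Estimate rho = mu^2 / (2 sigma^2), where mu and sigma^2 are the mean
    and variance of the distance between two randomly chosen points."""
    rng = random.Random(seed)
    distances = [dist(*rng.sample(S, 2)) for _ in range(pairs)]
    mu = statistics.fmean(distances)
    var = statistics.pvariance(distances, mu)
    return mu * mu / (2.0 * var)
```

Since the estimate depends only on the ratio of the mean to the standard deviation of distances, rescaling the whole data set leaves it unchanged.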

The concentration of distances in data sets of high intrinsic dimensionality reduces the usefulness of approximate furthest neighbor. Thus, although we observed similar values of in higher dimensions to our 10-dimensional random vector results, random vectors of higher dimension may represent a case where -approximate furthest neighbor is not a particularly interesting problem. However, vectors in a space with many dimensions but low intrinsic dimensionality, such as the colors database, are representative of many real applications, and our algorithms performed well on such data sets.

The experimental results on the MovieLens 20M data set [Harper:MovieLens], which were not included in the conference version of the present work, show some interesting effects resulting from the very high nominal (number of coordinates) dimensionality of this data set. The data set consists of 20000263 “ratings,” representing the opinions of 138493 users on 27278 movies. We treated this as a database of 27278 points (one for each movie) in a 138493-dimensional Euclidean space, filling in zeroes for the large majority of coordinates where a given user did not rate a given movie. Because of their sparsity, vectors in this data set tend to be nearly orthogonal, so the distance between two of them is determined almost entirely by their lengths. Since the vectors’ lengths vary over a wide range (length proportional to number of users rating a movie, which varies widely), the pairwise distances also have a large variance, implying a low intrinsic dimensionality. We measured it as .
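The link between orthogonality and vector length follows from expanding the squared Euclidean distance, ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>: when the inner product is close to zero, only the two lengths remain. The following is a minimal sketch of that identity. The two toy rating vectors are made up for illustration and are not taken from MovieLens, and the function name is hypothetical.

```python
def squared_distance_terms(x, y):
    """Return (||x - y||^2, ||x||^2 + ||y||^2, 2 * <x, y>)."""
    exact = sum((a - b) ** 2 for a, b in zip(x, y))
    norms = sum(a * a for a in x) + sum(b * b for b in y)
    cross = 2 * sum(a * b for a, b in zip(x, y))
    return exact, norms, cross

# Two toy "movies" rated by disjoint sets of users (zero = not rated).
movie_a = [5, 4, 0, 0, 0, 0]
movie_b = [0, 0, 0, 3, 0, 2]

exact, norms, cross = squared_distance_terms(movie_a, movie_b)
print(exact, norms, cross)  # prints: 54 54 0
```

Because the two vectors share no rated coordinate, the cross term vanishes and the squared distance equals the sum of the squared lengths.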

The curves plotted in Figure 7 show similar behaviour to that of the random distributions in Figures 3 and 4. Approximation factor improves rapidly with more projections and points examined, in the same pattern, but to a greater degree, as in the 10-coordinate vector databases, which have higher intrinsic dimensionality. However, here there is no noticeable penalty for using the query-independent algorithm. The data set appears to be dominated (insofar as furthest neighbours are concerned) by a few extreme outliers: movies rated very differently from any others. For almost any query, it is likely that one of these will be at least a good approximation of the true furthest neighbour; so the algorithm that identifies a set of outliers in advance and then chooses among them gives essentially the same results as the more expensive query-dependent algorithm.

4 Annulus query

In this section we return to the problem of annulus query. Using the AFN data structure in combination with LSH techniques, we present a sublinear-time data structure for solving the approximate annulus query problem (AAQ) with constant failure probability in Euclidean space. We begin by defining the exact and approximate annulus query problems:

Annulus query: Consider a set of points in and . The exact -annulus query is defined as follows: Given a query point , return a point . That is, we search for such that . If no such point exists in the query returns null. An alternative definition returns all points in , but we will focus our attention on the definition above.

Approximate annulus query: For a set of points in , and . The -approximate annulus query (AAQ) is defined as follows: Given a query point , if there exists , then return a point . If no such exists we can return either null or any point within .
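As a point of comparison for the definitions above, an exact annulus query can always be answered by a linear scan. The sketch below is only that baseline, not the sublinear data structure developed in this section. The parameter names `inner` and `outer` are hypothetical stand-ins for the two radii that bound the annulus around the query point.

```python
import math

def annulus_query_linear_scan(points, q, inner, outer):
    """Return some point p with inner <= ||p - q|| <= outer, or None.

    Linear-time baseline: every point is compared against the query.
    """
    for p in points:
        if inner <= math.dist(p, q) <= outer:
            return p
    return None

points = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
query = (0.0, 0.0)
print(annulus_query_linear_scan(points, query, 4.0, 6.0))    # (3.0, 4.0)
print(annulus_query_linear_scan(points, query, 20.0, 30.0))  # None
```

In the approximate variant, the acceptance test may use a wider annulus than the one in which a point is promised to exist, which is what leaves room for a faster randomized structure.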

4.1 Solving the -AAQ

We now show how to solve the -AAQ with constant failure probability in by combining the furthest neighbor technique with locality sensitive hashing methods [Har-Peled2012]. Consider an LSH function family . We say that is -sensitive for if:

Theorem 8.

Consider a -sensitive hash family for and let . For any set of at most points there exists a data structure for -AAQ such that:

  • Queries can be answered in time .

  • The data structure takes space in addition to storing .

The failure probability is constant and can be reduced to any by increasing the space and time cost by a constant factor.

We will now give a description of such a data structure and then prove that it has the properties stated in Theorem 8.

4.2 Annulus query data structure

Let and be integer parameters to be chosen later. We construct a function family by concatenating members of . Choose functions from and pick random vectors with entries sampled independently from .

4.2.1 Preprocessing

During preprocessing, all points are hashed with each of the functions . We say that a point is in a bucket if . For every point the dot product values are calculated. These values are stored in the bucket along with a reference to . Each bucket consists of linked lists, list containing the entries sorted on , decreasing from the head of the list. See Figure 9 for an illustration where is the tuple . A bucket provides constant time access to the head of each list. Only non-empty buckets are stored.

Figure 9: Illustration of a bucket for . .
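The preprocessing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base hash function (a random projection rounded to a grid of side `width`) is one possible LSH family chosen here for concreteness, plain Python lists stand in for the linked lists, and all names (`preprocess`, `num_tables`, `num_proj`, `width`) are hypothetical.

```python
import math
import random
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_grid_hash(dim, width, rng):
    """Assumed base LSH function: h(p) = floor((<a, p> + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda p: math.floor((dot(a, p) + b) / width)

def make_concatenated_hash(dim, k, width, rng):
    """g(p) = (h_1(p), ..., h_k(p)): concatenation of k base functions."""
    hs = [make_grid_hash(dim, width, rng) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

def preprocess(points, num_tables, k, num_proj, width=4.0, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    gs = [make_concatenated_hash(dim, k, width, rng) for _ in range(num_tables)]
    projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(num_proj)]
    tables = []
    for g in gs:
        members = defaultdict(list)  # only non-empty buckets are created
        for idx, p in enumerate(points):
            members[g(p)].append(idx)
        table = {}
        for key, idxs in members.items():
            # One list per projection vector, holding (dot product, point
            # index) pairs sorted by decreasing dot product from the head.
            table[key] = [
                sorted(((dot(a, points[i]), i) for i in idxs), reverse=True)
                for a in projections
            ]
        tables.append(table)
    return gs, projections, tables

pts = [(0.0, 0.0), (1.0, 0.5), (5.0, 5.0), (-3.0, 2.0), (0.2, 0.1)]
gs, projections, tables = preprocess(pts, num_tables=3, k=2, num_proj=4)
```

Storing point indices rather than copies of the points mirrors the references kept in each bucket. This sketch keeps every entry of every list and does not truncate them.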

4.2.2 Querying

For a given query point the query procedure can be viewed as building the set of points from within with the largest values and computing the distances between and the points in . At query time is hashed using . From each bucket the top pointer is selected from each list. The selected points are then added to a priority queue with priority . This is done in time. Now we begin a cycle of adding and removing elements from the priority queue. The largest priority element is dequeued, the predecessor link is followed, and the returned pointer is added to the queue. If the pointer just visited was the last in its list, nothing is added to the queue. If the priority queue becomes empty, the algorithm fails. Since is known at query time in the -AAQ, it is possible to terminate the query procedure as soon as some point within the annulus is found. Note that this differs from the general furthest neighbor problem. For the analysis, however, we will consider the worst case where only the last element in lies in the annulus and bound to achieve constant success probability.
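The dequeue-and-advance cycle can be sketched with a binary heap. Several details here are assumptions made for illustration: the priority of an entry is taken to be the projection of the point minus the projection of the query, the annulus is given by hypothetical radii `inner` and `outer`, `max_candidates` caps the number of distinct points whose distance is computed, and Python lists with an index stand in for linked lists.

```python
import heapq
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def annulus_candidates_query(lists, list_proj, points, q,
                             inner, outer, max_candidates):
    """lists[j] holds (<a, p>, point index) pairs sorted by decreasing
    <a, p>, where a = list_proj[j]. Returns the index of the first examined
    point inside the annulus, or None if the search fails."""
    offsets = [dot(a, q) for a in list_proj]
    heap = []
    for j, lst in enumerate(lists):
        if lst:
            value, idx = lst[0]
            # heapq is a min-heap, so priorities are stored negated.
            heapq.heappush(heap, (-(value - offsets[j]), j, 0, idx))
    seen = set()
    while heap and len(seen) < max_candidates:
        _, j, pos, idx = heapq.heappop(heap)
        if idx not in seen:
            seen.add(idx)
            if inner <= math.dist(points[idx], q) <= outer:
                return idx  # early termination: a point in the annulus
        if pos + 1 < len(lists[j]):
            value, nxt = lists[j][pos + 1]
            heapq.heappush(heap, (-(value - offsets[j]), j, pos + 1, nxt))
    return None  # queue exhausted or candidate budget spent

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
a = (1.0, 0.0)
lists = [sorted(((dot(a, p), i) for i, p in enumerate(points)), reverse=True)]
print(annulus_candidates_query(lists, [a], points, (0.0, 0.0), 4.0, 6.0, 10))  # 2
```

In this toy run the point at distance 10 is examined first and rejected, and the point at distance 5 is returned on the second step.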

Proof.

Fix a query point . By the problem definition, we may assume . Define to be the set of candidate points for which the data structure described in section 4.2 calculates the distance to when queried. The correctness of the algorithm follows if .

To simplify the notation let and . Points in these two sets have useful properties. Let be the solution to the equality:

If we set , we can use the ideas from Lemma 3 to conclude that:

Also, for the lower bound gives:

By definition, , so for some function we get:

Now for large , let be the set of points that hashed to the same bucket as for at least one hash function and projected above on at least one projection vector.

Let and . Using the probability bound (4.2.2) we see that . So by Markov’s inequality. By a result of Har-Peled, Indyk, and Motwani [Har-Peled2012, Theorem 3.4], the total number of points from across all buckets is at most with probability at least . So . This bounds the number of too far and too near points expected in .

By applying [Har-Peled2012, Theorem 3.4] again, we get that for each there exists such that with probability at least . Conditioning on the existence of this hash function, the probability of a point projecting above is at least . Then it follows that . The points in will necessarily be added to before all other points in the buckets; then, if we allow for , we get

The data structure requires us to store the top points per projection vector, per bucket, for a total space cost of , in addition to storing the dataset, . The query time is . The first term is for initializing the priority queue, and the second for constructing and calculating distances. Substituting in and we get query time:

(3)

where . Depending on the parameters different terms might dominate the cost, but for large we can simplify to the version stated in the theorem. The hash buckets take space:

(4)

Depending on , we might want to bound the space by instead, which yields a bound of . ∎

5 Conclusions and future work

We have proposed a data structure for AFN with theoretical and experimental guarantees. We have introduced the approximate annulus query and given a theoretical sublinear time solution. Although we have proved that it is not possible to use less than total space for -AFN when the approximation factor is less than , it is an open problem to close the gap between this lower bound and the space requirements of our result. Another interesting problem is to apply our data structure to improve the output sensitivity of near neighbor search based on locality-sensitive hashing. By replacing each hash bucket with an AFN data structure with suitable approximation factors, it is possible to control the number of times each point in is reported.

Our data structure extends naturally to general metric spaces. Instead of computing projections with dot product, which requires a vector space, we could choose some random pivots and order the points by distance to each pivot. The query operation would be essentially unchanged. Analysis and testing of this extension is a subject for future work.

References
