Approximate -flat Nearest Neighbor Search††thanks: WM and PS were supported in part by DFG Grants MU 3501/1 and MU 3501/2. YS was supported by the Deutsche Forschungsgemeinschaft within the research training group ‘Methods for Discrete Structures’ (GRK 1408).
Let be a nonnegative integer. In the approximate -flat nearest neighbor (-ANN) problem, we are given a set of points in -dimensional space and a fixed approximation factor . Our goal is to preprocess so that we can efficiently answer approximate -flat nearest neighbor queries: given a -flat , find a point in whose distance to is within a factor of the distance between and the closest point in . The case corresponds to the well-studied approximate nearest neighbor problem, for which a plethora of results are known, both in low and high dimensions. The case is called approximate line nearest neighbor. In this case, we are aware of only one provably efficient data structure, due to Andoni, Indyk, Krauthgamer, and Nguyễn (AIKN) . For , we know of no previous results.
We present the first efficient data structure that can handle approximate nearest neighbor queries for arbitrary . We use a data structure for -ANN-queries as a black box, and the performance depends on the parameters of the -ANN solution: suppose we have an -ANN structure with query time and space requirement , for . Then we can answer -ANN queries in time and space . Here, is an arbitrary constant and the -notation hides exponential factors in , , and and polynomials in .
Our approach generalizes the techniques of AIKN for -ANN: we partition into clusters of increasing radius, and we build a low-dimensional data structure for a random projection of . Given a query flat , the query can be answered directly in clusters whose radius is “small” compared to using a grid. For the remaining points, the low dimensional approximation turns out to be precise enough. Our new data structures also give an improvement in the space requirement over the previous result for -ANN: we can achieve near-linear space and sublinear query time, a further step towards practical applications where space constitutes the bottleneck.
Nearest neighbor search is a fundamental problem in computational geometry, with countless applications in databases, information retrieval, computer vision, machine learning, signal processing, etc. . Given a set of points in -dimensional space, we would like to preprocess so that for any query point , we can quickly find the point in that is closest to .
There are efficient algorithms if the dimension is “small” [7, 18]. However, as increases, these algorithms quickly become inefficient: either the query time approaches linear or the space grows exponentially with . This phenomenon is usually called the “curse of dimensionality”. Nonetheless, if one is satisfied with just an approximate nearest neighbor whose distance to the query point lies within some factor , , of the distance between and the actual nearest neighbor, there are efficient solutions even for high dimensions. Several methods are known, offering trade-offs between the approximation factor, the space requirement, and the query time (see, e.g., [1, 3] and the references therein).
From a practical perspective, it is important to keep both the query time and the space small. Ideally, we would like algorithms with almost linear (or at least sub-quadratic) space requirement and sub-linear query time. Fortunately, there are solutions with these guarantees. These methods include locality sensitive hashing (LSH) [11, 12] and a more recent approach that improves upon LSH . Specifically, the latter algorithm achieves query time and space , where is the approximation factor.
Often, however, the query object is more complex than a single point. Here, the complexity of the problem is much less understood. Perhaps the simplest such scenario occurs when the query object is a -dimensional flat, for some small constant . This is called the approximate -flat nearest neighbor problem . It constitutes a natural generalization of approximate nearest neighbors, which corresponds to . In practice, low-dimensional flats are used to model data subject to linear variations. For example, one could capture the appearance of a physical object under different lighting conditions or under different viewpoints (see  and the references therein).
So far, the only known algorithm with worst-case guarantees is for , the approximate line nearest neighbor problem. For this case, Andoni, Indyk, Krauthgamer, and Nguyễn (AIKN) achieve sub-linear query time and space , for arbitrarily small . For the “dual” version of the problem, where the query is a point but the data set consists of -flats, three results are known [4, 15, 14]. The first algorithm is essentially a heuristic with some control of the quality of approximation . The second algorithm provides provable guarantees and a very fast query time of . The third result, due to Mahabadi, is very recent and improves the space requirement of Magen’s result . Unfortunately, these algorithms all suffer from very high space requirements, thus limiting their applicability in practice. In fact, even the basic LSH approach for is already too expensive for large datasets and additional theoretical work and heuristics are required to reduce the memory usage and make LSH suitable for this setting [19, 13]. For , we know of no results in the theory literature.
Our results. We present the first efficient data structure for general approximate -flat nearest neighbor search. Suppose we have a data structure for approximate point nearest neighbor search with query time and space , for some constants . Then our algorithm achieves query time and space , where can be made arbitrarily small. The constant factors for the query time depend on , , and . Our main result is as follows.
Fix an integer and an approximation factor . Suppose we have a data structure for approximate point nearest neighbor search with query time and space , for some constants . Let be a -dimensional -point set. For any parameter , we can construct a data structure with space that can answer the following queries in expected time : given a -flat , find a point with .
The table above gives an overview of some approximate point nearest neighbor structures that can be used in Theorem 1.1. The result by AINR gives the current best query performance for large enough values of . For smaller , an approach using locality sensitive hashing (LSH1) may be preferable. With another variant of locality sensitive hashing (LSH2), the space can be made almost linear, at the expense of a slightly higher query time. The last result (and related practical results, e.g., ) is of particular interest in applications as the memory consumption is a major bottleneck in practice. It also improves over the previous algorithm by AIKN for line queries.
Along the way towards Theorem 1.1, we present a novel data structure for -flat near neighbor reporting when the dimension is constant. The space requirement in this case is and the query time is , where is the answer set. We believe that this data structure may be of independent interest and may lead to further applications. Our results provide a vast generalization of the result in AIKN and shows for the first time that it is possible to achieve provably efficient nearest neighbor search for higher-dimensional query objects.
Our techniques. Our general strategy is similar to the approach by AIKN. The data structure consists of two main structures: the projection structure and the clusters. The projection structure works by projecting the point set to a space of constant dimension and by answering the nearest neighbor query in that space. As we will see, this suffices to obtain a rough estimate for the distance, and it can be used to obtain an exact answer if the point set is “spread out”.
Unfortunately, this does not need to be the case. Therefore, we partition the point set into a sequence of clusters. A cluster consists of points and a -flat such that all points in the cluster are “close” to , where is a parameter to be optimized. Using a rough estimate from the projection structure, we can classify the clusters as small and large. The points in the large clusters are spread out and can be handled through projection. The points in the small clusters are well behaved and can be handled directly in high dimensions using grids and discretization.
Organization. In order to provide the curious reader with quick gratification, we will give the main data structure together with the properties of the cluster and the projection structure in Section 2. Considering these structures as black boxes, this already proves Theorem 1.1.
In the remainder of the paper, we describe the details of the helper structures. The necessary tools are introduced in Section 3. Section 4 gives the approximate nearest neighbor algorithm for small clusters. In Section 5, we consider approximate near neighbor reporting for -flats in constant dimension. This data structure is then used for the projection structures in Section 6.
2 Main Data Structure and Algorithm Overview
We describe our main data structure for approximate -flat nearest neighbor search. It relies on various substructures that will be described in the following sections. Throughout, denotes a -dimensional -point set, and is the desired approximation factor.
Let be a -flat in dimensions. The flat-cluster (or cluster for short) of with radius is the set of all points with distance at most to , i.e., . A cluster is full if it contains at least points from , where is a parameter to be determined. We call -cluster-free if there is no full cluster with radius . Let be an arbitrarily small parameter. Our data structure requires the following three subqueries.
Given a query flat , find a point with .
Assume is contained in a flat-cluster with radius . Given a query flat with , return a point with .
Assume is -cluster free. Given a query flat with , find the nearest neighbor to .
Briefly, our strategy is as follows: during the preprocessing phase, we partition the point set into a set of full clusters of increasing radii. To answer a query , we first perform a query of type Q1 to obtain an -approximate estimate for . Using , we identify the “small” clusters. These clusters can be processed using a query of type Q2. The remaining point set contains no “small” full cluster, so we can process it with a query of type Q3.
We will now describe the properties of the subqueries and the organization of the data structure in more detail. The data structure for Q2-queries is called the cluster structure. It is described in Section 4, and it has the following properties.
Let be a -dimensional -point set that is contained in a flat-cluster of radius . Let be an approximation factor. Using space , we can build a data structure with the following property. Given a query -flat with and an estimate with , we can find a -approximate nearest neighbor for in in total time .
The data structures for Q1 and Q3 are very similar, and we cover them in Section 6. They are called projection structures, since they are based on projecting into a low dimensional subspace. In the projected space, we use a data structure for approximate -flat near neighbor search to be described in Section 5. The projection structures have the following properties.
Let be a -dimensional -point set, and let be a small enough constant. Using space and time , we can obtain a data structure for the following query: given a -flat , find a point with . A query needs time, and the answer is correct with high probability.
Let be a -dimensional -point set, and let be a small enough constant. Using space and time , we can obtain a data structure for the following query: given a -flat and such that and such that is -cluster-free, find an exact nearest neighbor for in . A query needs time, and the answer is correct with high probability. Here, denotes the size of a full cluster.
2.1 Constructing the Data Structure
First, we build a projection structure for Q1 queries on . This needs space, by Theorem 2.2. Then, we repeatedly find the full flat-cluster with smallest radius. The points in are removed from , and we build a cluster structure for Q2 queries on this set. By Theorem 2.1, this needs space. To find , we check all flats spanned by distinct points of . In Lemma 3.3 below, we prove that this provides a good enough approximation. In the end, we have point sets ordered by decreasing radius, i.e., the cluster for has the largest radius. The total space occupied by the cluster structures is .
Finally, we build a perfect binary tree with leaves labeled , from left to right. For a node let be the union of all assigned to leaves below . For each we build a data structure for to answer Q3 queries. Since each point is contained in data structures, the total size is , by Theorem 2.3. For pseudocode, see Algorithm 1.
2.2 Performing a Query
Suppose we are given a -flat . To find an approximate nearest neighbor for we proceed similarly as AIKN . We use Q2 queries on “small” clusters and Q3 queries on the remaining points; for pseudocode, see Algorithm 2.
First, we perform a query of type Q1 to get a -approximate nearest neighbor for in time . Let . We use as an estimate to distinguish between “small” and “large” clusters. Let be the largest integer such that the cluster assigned with has radius . For , we use as an estimate for a Q2 query on . Since and by Theorem 2.1, this needs total time .
It remains to deal with points in “large” clusters. The goal is to perform a type Q3 query on . For this, we start at the leaf of labeled and walk up to the root. Each time we encounter a new node from its right child, we perform a Q3 query on , where denotes the left child of . Let be all the left children we find in this way. Then clearly we have and . Moreover, by construction, there is no full cluster with radius less than defined by vertices of for any . We will see that this implies every to be -cluster-free, so Theorem 2.3 guarantees a total query time of for this step. Among all the points we obtained during the queries, we return the one that is closest to . A good trade-off point is achieved for , i.e., for . This gives the bounds claimed in Theorem 1.1.
Correctness. Let be a point with . First, suppose that , for some . Then, we have , where is the radius of the cluster assigned to . Since is a valid -approximate estimate for , a query of type Q2 on gives a -approximate nearest neighbor, by Theorem 2.1. Now, suppose that for . Let be the node of with . Then Theorem 2.3 guarantees that we will find when doing a Q3 query on .
Partition Trees. Fix an integer constant , and let be a -dimensional -point set. A simplicial -partition for is a sequence of pairs such that (i) the sets form a partition of with , for ; (ii) each is a relatively open simplex with , for ; and (iii) every hyperplane in crosses simplices in . Here, a hyperplane crosses a simplex if intersects , but does not contain it. In a classic result, Matoušek showed that such a simplicial partition always exists and that it can be computed efficiently [16, 6].
Theorem 3.1 (Partition theorem, Theorem 3.1 and Lemma 3.4 in ).
For any -dimensional -point set and for any constant , there exists a simplicial -partition for . Furthermore, if is bounded by a constant, such a partition can be found in time .∎
Through repeated application of Theorem 3.1, one can construct a partition tree for . A partition tree is a rooted tree in which each node is associated with a pair , such that is a subset of and is a relatively open simplex that contains . If , the children of constitute a simplicial -partition of . Otherwise, the node has children where each child corresponds to a point in . A partition tree has constant degree, linear size, and logarithmic depth.
Given a hyperplane , there is a straightforward query algorithm to find the highest nodes in whose associated simplex does not cross : start at the root and recurse on all children whose associated simplex crosses ; repeat until there are no more crossings or until a leaf is reached. The children of the traversed nodes whose simplices do not cross constitute the desired answer. A direct application of Theorem 3.1 yields a partition tree for which this query takes time , where is a constant that can be made arbitrarily small by increasing . In 2012, Chan  described a more global construction that eliminates the factor.
Theorem 3.2 (Optimal Partition Trees ).
For any -dimensional -point set , and for any large enough constant , there is a partition tree with the following properties: (i) the tree has degree and depth ; (ii) each node is of the form , where is a subset of and a relatively open simplex that contains ; (iii) for each node , the simplices of the children of are contained in and are pairwise disjoint; (iv) the point set associated with a node of depth has size at most ; (v) for any hyperplane in , the number of simplices in that intersects at level obeys the recurrence
Thus, intersects simplices in total. The tree can be build in expected time .
-flat Discretization. For our cluster structure we must find -flats that are close to many points. The following lemma shows that it suffices to check “few” -flats for this.
Let be a finite point set with , and let be a -flat. There is a -flat such that is the affine hull of points in and , where and .
This proof generalizes the proof of Lemma 2.3 by AIKN .
Let be the orthogonal projection of onto . We may assume that is the affine hull of , since otherwise we could replace by without affecting . We choose an orthonormal basis for such that is the linear subspace spanned by the first coordinates. An affine basis for is constructed as follows: first, take a point whose -coordinate is minimum. Let be the projection of onto , and translate the coordinate system such that is the origin. Next, choose additional points such that is maximum, where is the projection of onto , for . That is, we choose additional points such that the volume of the -dimensional parallelogram spanned by their projections onto is maximized. The set is a basis for , since the maximum determinant cannot be by our assumption that is spanned by .
Now fix some point and let be its projection onto . We write . Then, the point lies in . By the triangle inequality, we have
To upper-bound we first show that all coefficients lie in .
Take , and as above. Write . Then for , we have , and for at least one .
We first prove that all coefficients lie in the interval . Suppose that for some . We may assume that . Using the linearity of the determinant,
contradicting the choice of .
Furthermore, by our choice of the origin, all points in have a non-negative -coordinate. Thus, at least one coefficient , , has to be non-negative. ∎
4 Cluster Structure
A -flat cluster structure consists of a -flat and a set of points with , for all . Let be a parametrization of , with and such that the columns of constitute an orthonormal basis for and such that is orthogonal to . We are also given an approximation parameter . The cluster structure uses a data structure for approximate point nearest neighbor search as a black box. We assume that we have such a structure available that can answer -approximate point nearest neighbor queries in dimensions with query time and space requirement for some constants . As mentioned in the introduction, the literature offers several data structures for us to choose from.
The cluster structure distinguishes two cases: if the query flat is close to , we can approximate by few “patches” that are parallel to , such that a good nearest neighbor for the patches is also good for . Since the patches are parallel to , they can be handled through -ANN queries in the orthogonal space and low-dimensional queries inside . If the query flat is far from , we can approximate by its projection onto and handle the query with a low-dimensional data structure.
Let be the linear subspace of that is orthogonal to . Let be the projection of onto , and let be the projection of onto . We compute a -dimensional partition tree for . As stated in Theorem 3.2, the tree has nodes, and it can be computed in time .
For each node of , we do the following: we determine the set whose projection onto gives , and we take the projection of onto . Then, we build a dimensional -ANN data structure for , as given by the assumption, where . See Algorithm 3 for pseudocode.
The cluster structure can be constructed in total time , and it requires space.
By Theorem 3.2, the partition tree can be built in time. Thus, the preprocessing time is dominated by the time to construct the -ANN data structures at the nodes of the partition tree . Since the sets on each level of constitute a partition of , and since the sizes of the sets decrease geometrically, the bounds on the preprocessing time and space requirement follow directly from our assumption. Note that by our choice of , the space requirement and query time for the ANN data structure change only by a constant factor. ∎
4.2 Processing a Query
We set . Let be the query -flat, given as , with and such that the columns of are an orthonormal basis for and is orthogonal to . Our first task is to find bases for the flats and that provide us with information about the relative position of and . For this, we take the matrix , and we compute a singular value decomposition of [9, Chapter 7.3]. Recall that and are orthogonal matrices and that is a diagonal matrix with . We call the singular values of . The following lemma summarizes the properties of the SVD that are relevant to us.
Let , and let be a singular value decomposition for . Let be the columns of and be the columns of . Then, (i) is an orthonormal basis for (in the coordinate system induced by ); (ii) is an orthonormal basis for (in the coordinate system induced by ): and (iii) for , the projection of onto is and the projection of onto is (again in the coordinate systems induced by and ). In particular, we have .
Properties (i) and (ii) follow since and are orthogonal matrices. Property (iii) holds because describes the projection from onto (in the coordinate systems induced by and ) and because describes the projection from onto . ∎
We reparametrize according to and according to . More precisely, we set and , and we write and . The new coordinate system provides a simple representation for the distances between and . We begin with a technical lemma that is a simple corollary of Lemma 4.2.
Let be the columns of the matrix ; let be the columns of the matrix , and the columns of the matrix . Then, (i) for , the vector is the projection of onto and the vector is the projection of onto ; (ii) for , we have and ; and (iii) the vectors are pairwise orthogonal. An analogous statement holds for the matrices , , and .
Properties (i) and (ii) are an immediate consequence of the definition of and and of Lemma 4.2. The set is orthogonal by Lemma 4.2(ii). Furthermore, since for any , the vector lies in and the vector lies in , and are orthogonal. Finally, let . Then,
since we already saw that . The argument for the other matrices is completely analogous. ∎
The next lemma shows how our choice of bases gives a convenient representation of the distances between and .
Take two points and such that . Write and . Then, for any point with , we have
and for any point with , we have
We show the calculation for . The calculation for is symmetric. Let with be given. Let be the projection of onto . Then,
where the last equality is due to Pythagoras, since lies in , lies in , and is orthogonal to both and . Now, we have . Similarly, since is the projection of onto , we have . Thus,
using the definition of and . By Lemma 4.3, the columns of the matrix are pairwise orthogonal and for , we have . Pythagoras gives
We now give a brief overview of the query algorithm, refer to Algorithm 4 for pseudocode. First, we check for the special case that and are parallel, i.e., that . In this case, we need to perform only a single -ANN query in to obtain the desired result. If and are not parallel, we distinguish two scenarios: if is far from , we can approximate by its projection onto . Thus, we take the closest point in to , and we return an approximate nearest neighbor for in according to an appropriate metric derived from Lemma 4.4. Details can be found in Section 4.2.2. If is close to , we use Lemma 4.4 to discretize the relevant part of into patches, such that each patch is parallel to and such that the best nearest neighbor in for the patches provides an approximate nearest neighbor for . Each patch can then be handled essentially by an appropriate nearest neighbor query in . Details follow in Section 4.2.1. We say and are close if , and far if . Recall that we chose .
We use our reparametrization of and to split the coordinates as follows: recall that are the singular values of . Pick such that , for , and , for . For a matrix , let denote the matrix with the first columns of , and the matrix with the remaining columns of . Similarly, for a vector , let be the vector in with the first coordinates of , and the vector in with the remaining coordinates of .
The following lemma is an immediate consequence of Lemma 4.4. It tells us that we can partition the directions in into those that are almost parallel to and those that are almost orthogonal to . Along the orthogonal directions, we discretize into few lower-dimensional flats that are almost parallel to . After that, we approximate these flats by few patches that are actually parallel to . These patches are then used to perform the query.
Let be a point and with . Write and . Then, .
By Lemma 4.4 and the choice of ,