Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0)
Abstract
Approximate nearest neighbor search (ANNS) is a fundamental and essential operation in applications from many domains, such as databases, machine learning, multimedia, and computer vision. Although new algorithms are continuously proposed in the literature of these domains each year, there is no comprehensive evaluation and analysis of their performance.
In this paper, we conduct a comprehensive experimental evaluation of many state-of-the-art methods for approximate nearest neighbor search. Our study (1) is cross-disciplinary (i.e., it includes 16 algorithms from different domains as well as from practitioners) and (2) evaluates a diverse range of settings, including 20 datasets, several evaluation metrics, and different query workloads. The experimental results are carefully reported and analyzed to understand the performance results. Furthermore, we propose a new method that empirically achieves both high query efficiency and high recall on the majority of the datasets under a wide range of settings.
1 Introduction
Nearest neighbor search finds an object in a reference database which has the smallest distance to a query object. It is a fundamental and essential operation in applications from many domains, including databases, computer vision, multimedia, machine learning and recommendation systems.
Despite much research on this problem, it is commonly believed that finding the exact nearest neighbor in high dimensional Euclidean space is very costly, due to the curse of dimensionality [26]. Experiments showed that exact methods can rarely outperform the brute-force linear scan when the dimensionality is high [43] (e.g., more than 20). Nevertheless, returning sufficiently nearby objects, referred to as approximate nearest neighbor search (ANNS), can be performed efficiently and is sufficiently useful for many practical problems, thus attracting an enormous number of research efforts. Both the exact and approximate NNS problems can also be extended to their top-k versions.
1.1 Motivation
There are hundreds of papers published on algorithms for (approximate) nearest neighbor search, but there have been few systematic and comprehensive comparisons among these algorithms. In this paper, we conduct a comprehensive experimental evaluation of the state-of-the-art approximate nearest neighbor search algorithms in the literature, motivated by the following needs:
1. Coverage of Competitor Algorithms and Datasets from Different Areas. As the need for performing ANN search arises naturally in so many diverse domains, researchers have come up with many methods while being unaware of alternative methods proposed in other areas. In addition, there are practical methods proposed by practitioners and deployed in large-scale projects, such as the music recommendation system at spotify.com [7]. As a result, it is not uncommon that important algorithms from different areas are overlooked and not compared with. For example, there is no evaluation among Rank Cover Tree [23] (from Machine Learning), Product Quantization [27, 20] (from Multimedia), SRS [39] (from Databases), and KGraph [15] (from practitioners). Moreover, each domain typically has a small set of commonly used datasets to evaluate ANNS algorithms; very few datasets are used across all these domains. In contrast, we conduct comprehensive experiments using carefully selected representative or latest algorithms from different domains, and test all of them on 20 datasets, including those frequently used in prior studies in different domains. Our study confirms that there is substantial variability in the performance of all the algorithms across these datasets.
2. Overlooked Evaluation Measures/Settings. An NNS algorithm can be measured from various aspects, including (i) search time complexity, (ii) search quality, (iii) index size, (iv) scalability with respect to the number of objects and the number of dimensions, (v) robustness against datasets, query workloads, and parameter settings, (vi) updatability, and (vii) efforts required in tuning its parameters. Unfortunately, none of the prior studies evaluates these measures completely and thoroughly.
For example, most existing studies use a query workload that follows essentially the same distribution as the data. Measuring algorithms under different query workloads is an important issue, but few results are known. In this paper, we evaluate the performance of the algorithms under a wide variety of settings and measures, to gain a complete understanding of each algorithm (cf. Table 6).
3. Discrepancies in Existing Results. There are discrepancies in the experimental results reported in some of the notable papers on this topic. For example, in the ann-benchmark [8] by practitioners, FLANN was shown to perform better than KGraph, while the study in [15] indicates otherwise. While much of the discrepancy can be explained by the different settings, datasets, tuning methods, and implementations used, it is always desirable to have a maximally consistent result in order to reach up-to-date rule-of-thumb recommendations for different scenarios for researchers and practitioners.
In this paper, we try our best to make a fair comparison among all methods evaluated, and test them on all 20 datasets. For example, all search programs are implemented in C++, and all hardware-specific optimizations are disabled (e.g., SIMD-based distance computation). Finally, we will also publish the source code, datasets, and other documents so that the results can be easily reproduced.
We classify popular NN algorithms into three categories: LSH-based, space partitioning-based, and neighborhood-based. The key ideas of these categories are introduced in Sections 3-6.
1.2 Contributions
Our principal contributions are summarized as follows.

Comprehensive experimental study of state-of-the-art ANNS methods across several different research areas. Our comprehensive experimental study extends beyond past studies by: (i) comparing all the methods without adding any implementation tricks, which makes the comparison fairer; (ii) evaluating all the methods using multiple measures; and (iii) providing rule-of-thumb recommendations about how to select a method under different settings. We believe such a comprehensive experimental evaluation will be beneficial to both the scientific community and practitioners, and similar studies have been performed in other areas (e.g., classification algorithms [12]).

We group algorithms into several categories (Sections 3-6), and then perform detailed analysis of both intra- and inter-category evaluations (Section 8). Our data-based analyses provide confirmation of useful principles for solving the problem, of the strengths and weaknesses of some of the best methods, and some initial explanation and understanding of why some datasets are harder than others. The experience and insights we gained throughout the study enable us to engineer a new empirical algorithm, DPG (Section 7), that empirically achieves both high query efficiency and high recall on the majority of the datasets under a wide range of settings.
Our paper is organized as follows. Section 2 introduces the problem definition as well as some constraints of this paper. Sections 3-6 describe the state-of-the-art ANNS algorithms that we evaluated. Section 7 presents our improved ANNS approach. The comprehensive experiments and the analyses are reported in Section 8. Section 10 concludes our paper with future work. An evaluation of the parameters of the tested algorithms is reported in Appendix A. Appendix B gives some supplementary results of the second round.
2 Background
2.1 Problem Definition
In this paper, we focus on the case where data points are d-dimensional vectors in R^d and the distance metric is the Euclidean distance. Henceforth, we will use points and vectors interchangeably. Let dist(x, y) be the Euclidean distance between two data points x and y, and let X be a set of n reference data points. A kNN search for a given query point q is defined as returning the k nearest neighbors {x_1, ..., x_k} from X with respect to dist, such that dist(x_1, q) <= dist(x_2, q) <= ... <= dist(x_k, q), and dist(x_k, q) <= dist(x, q) for every x in X \ {x_1, ..., x_k}.
Due to the curse of dimensionality [26], much research effort focuses on approximate solutions to the problem of nearest neighbor search on high dimensional data. Let X* be the set of exact k nearest neighbors and R be the set of results returned by an algorithm. A common way to measure the quality of R is its recall, defined as |R ∩ X*| / k, which is also called precision in some papers.
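As a concrete illustration, the recall of an approximate answer against the exact kNN can be computed as in the following sketch (the toy dataset, query, and helper names are illustrative, not from any evaluated system):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def exact_knn(data, q, k):
    # Brute-force kNN: sort all point ids by distance to the query.
    return sorted(range(len(data)), key=lambda i: euclidean(data[i], q))[:k]

def recall(returned_ids, true_ids, k):
    # Fraction of the exact k nearest neighbors that were returned.
    return len(set(returned_ids) & set(true_ids)) / k

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.1, 0.1)
truth = exact_knn(data, q, 2)    # ids of the exact 2 nearest neighbors
print(recall([0, 3], truth, 2))  # 0.5: only one true neighbor was found
```

Note that recall over the exact kNN requires computing the ground truth once per query set, which is exactly how the evaluation in this paper is set up.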
2.2 Scope
The problem of ANNS on high dimensional data has been extensively studied in various areas such as databases, theory, computer vision, and machine learning. Hundreds of algorithms have been proposed to solve the problem from different perspectives, and this line of research remains very active in the above domains due to its importance and its huge challenges. To make a comprehensive yet focused comparison of ANNS algorithms, in this paper, we restrict the scope of the study by imposing the following constraints.
Representative and competitive ANNS algorithms. We consider the state-of-the-art algorithms in several domains, and omit other algorithms that have been dominated by them unless there is strong evidence against the previous findings.
No hardware-specific optimizations. Not all the implementations we obtained or implemented have the same level of sophistication in utilizing hardware-specific features to speed up query processing. Therefore, we modified several implementations so that no algorithm uses multiple threads, multiple CPUs, SIMD instructions, hardware prefetching, or GPUs.
Dense vectors. We mainly focus on the case where the input data vectors are dense, i.e., nonzero in most of the dimensions.
Support of the Euclidean distance. The Euclidean distance is one of the most widely used measures on high-dimensional datasets. It is also supported by most of the ANNS algorithms.
Exact NN as the ground truth. In several existing works, each data point has a label (typically in classification or clustering applications) and the labels are regarded as the ground truth when evaluating the recall of approximate NN algorithms. In this paper, we use the exact NN points as the ground truth, as this works for all datasets and the majority of applications.
Prior Benchmark Studies. There are two recent NNS benchmark studies: [34] and ann-benchmark [8]. The former considers a large number of other distance measures in addition to the Euclidean distance, and the latter does not disable general implementation tricks. In both cases, their studies are less comprehensive than ours, e.g., with respect to the number of algorithms and datasets evaluated. More discussion comparing benchmarking results is given in Section 9.3.
3 LSH-based Methods
These methods are typically based on the idea of locality-sensitive hashing (LSH), which maps a high-dimensional point to a low-dimensional point via a set of appropriately chosen random projection functions. Methods in this category enjoy sound probabilistic theoretical guarantees on query result quality, efficiency, and index size, even in the worst case. On the flip side, it has been observed that these data-independent methods are usually outperformed empirically by data-dependent methods, since the latter can exploit the data distribution.
Locality-sensitive hashing (LSH) was first introduced by Indyk and Motwani in [26]. An LSH function family H for a distance function is (r_1, r_2, p_1, p_2)-sensitive iff for any two data points x and y, there exist two distance thresholds r_1 < r_2 and two probability thresholds p_1 > p_2 that satisfy: if dist(x, y) <= r_1 then Pr[h(x) = h(y)] >= p_1, and if dist(x, y) >= r_2 then Pr[h(x) = h(y)] <= p_2, for h drawn from H. This means the chance of mapping two points x, y to the same value grows as their distance decreases. The 2-stable distribution, i.e., the Gaussian/normal distribution, can be used to construct the LSH function family for the Euclidean distance: a data point is mapped to a hash value (e.g., a bucket) based on a random projection followed by a discretization step. A certain number of hash functions are used together to ensure the performance guarantee, using different schemes such as the AND-then-OR scheme [14], the OR-then-AND scheme [45], inverted list-based schemes [18, 25], or tree-based schemes [39]. In this category, we evaluate the two most recent LSH-based methods with theoretical guarantees: SRS [39] and QALSH [25]. Both of them can work with any approximation ratio. Note that we exclude several recent works because either there is no known practical implementation (e.g., [3]) or they do not support the Euclidean distance (e.g., FALCONN [2]). There are also a few empirical LSH-based methods that lose the theoretical guarantees; some of them are included in other categories.
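A minimal sketch of a single 2-stable projection hash of the form h(x) = floor((a . x + b) / w) might look as follows (the parameter values and function names are illustrative; real implementations combine many such functions under the schemes above):

```python
import random

def make_e2lsh_hash(dim, w, seed=0):
    # Sample a Gaussian projection vector a and an offset b uniform in
    # [0, w); the hash is h(x) = floor((a . x + b) / w).
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(x):
        proj = sum(ai * xi for ai, xi in zip(a, x))
        return int((proj + b) // w)
    return h

h = make_e2lsh_hash(dim=4, w=4.0)
p = (1.0, 0.0, 0.0, 0.0)
q = (1.1, 0.0, 0.0, 0.0)  # close to p, so likely (not surely) in the same bucket
print(h(p), h(q))
```

Nearby points land in the same bucket with high probability, but not with certainty; this is why multiple hash functions must be combined.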
3.1 SRS
SRS projects the high-dimensional dataset into a low-dimensional space (of m dimensions, with m no more than 10) and performs exact NN search there. We refer to the distance between the query q and a point o in the original space as dist(o, q), and to the distance in the projected space as dist_m(o, q). The key observation of SRS is that, for any point o, the ratio dist_m(o, q)^2 / dist(o, q)^2 follows a scaled chi-squared distribution with m degrees of freedom. Hence, the basic method solves a c-ANN problem in two steps: (1) obtain an ordered set of candidates by issuing an incremental kNN query for q on the m-dimensional projections of the data points; (2) examine the distances of these candidates in order, returning the point with the smallest distance so far once it satisfies the early-termination test (i.e., once it is a c-ANN point with probability at least a given threshold) or the algorithm has exhausted a maximum number of candidate points. With appropriate parameter settings, the algorithm guarantees that the returned point is no further away than c times the nearest neighbor distance with constant probability; both the space and time complexities are linear in the number of points and independent of the original dimensionality.
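The two-step flow can be caricatured as follows. This is a deliberately simplified sketch: candidates are scanned in order of projected distance, and a fixed budget T stands in for SRS's chi-squared-based early-termination test; all names and data are illustrative.

```python
import math
import random

def gauss_project(points, m, seed=1):
    # Project d-dimensional points into m dimensions with a Gaussian matrix.
    rng = random.Random(seed)
    d = len(points[0])
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    proj = [[sum(a[j] * p[j] for j in range(d)) for a in A] for p in points]
    return proj, A

def srs_query(points, proj, A, q, T):
    # Step 1: order all candidates by their distance to q in the projection.
    qp = [sum(a[j] * q[j] for j in range(len(q))) for a in A]
    order = sorted(range(len(points)),
                   key=lambda i: sum((u - v) ** 2 for u, v in zip(proj[i], qp)))
    # Step 2: verify the first T candidates in the original space and keep
    # the closest one (the real SRS can stop early via a statistical test).
    best, best_d = None, float("inf")
    for i in order[:T]:
        d = math.dist(points[i], q)
        if d < best_d:
            best, best_d = i, d
    return best

points = [(0.0,) * 8, (1.0,) * 8, (5.0,) * 8, (0.2,) * 8]
proj, A = gauss_project(points, m=3)
print(srs_query(points, proj, A, (0.15,) * 8, T=3))  # id of the nearest point
```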
3.2 QALSH
Traditionally, LSH functions are constructed in a query-oblivious manner, in the sense that buckets are partitioned before any query arrives. However, objects closer to a query may be partitioned into different buckets, which is undesirable. QALSH introduces the concept of query-aware bucket partitioning and develops novel query-aware LSH functions accordingly. Given a pre-specified bucket width w, previous work uses a hash function of the form h(o) = floor((a . o + b) / w), while QALSH uses the function h(o) = a . o, which first projects object o along the random line a as before. When a query q arrives, its projection h(q) is computed, and the query projection (or simply the query) serves as the "anchor" for bucket partitioning. This approach to bucket partitioning is said to be query-aware. In the preprocessing step, the projections of all the data objects along the random line are calculated, and all the data projections are indexed by a B+-tree. When a query object q arrives, QALSH computes the query projection h(q) and uses the B+-tree to locate objects falling in the interval [h(q) - w/2, h(q) + w/2]. Moreover, QALSH can gradually locate data objects even farther away from the query, just like performing a B+-tree range search.
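The query-aware bucket idea can be sketched with a sorted list standing in for the B+-tree (names and parameter values are illustrative):

```python
import bisect
import random

def build_qalsh_line(points, seed=2):
    # Project every point onto one random line a; keep the projections in a
    # sorted list (a stand-in for the B+-tree used by QALSH).
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    proj = sorted((sum(ai * pi for ai, pi in zip(a, p)), i)
                  for i, p in enumerate(points))
    return a, proj

def query_bucket(a, proj, q, w):
    # Query-aware partition: the bucket is the interval
    # [h(q) - w/2, h(q) + w/2] anchored at the query's own projection.
    hq = sum(ai * qi for ai, qi in zip(a, q))
    lo = bisect.bisect_left(proj, (hq - w / 2, -1))
    hi = bisect.bisect_right(proj, (hq + w / 2, len(proj)))
    return [i for _, i in proj[lo:hi]]

points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
a, proj = build_qalsh_line(points)
print(query_bucket(a, proj, (0.05, 0.05), w=1.0))  # ids in the query's bucket
```

Widening w (or repeatedly extending the interval) corresponds to the gradual range expansion described above.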
4 Encoding-based Methods
A large body of work has been dedicated to learning hash functions from the data distribution so that the nearest neighbor search result in the hash coding space is as close to the search result in the original space as possible. Please refer to [42, 41] for a comprehensive survey.
Some example methods we evaluated in this category include Neighbor Sensitive Hashing [35], Selective Hashing [19], Anchor Graph Hashing [30], Scalable Graph Hashing [28], Neighborhood APProximation index [34], and Optimal Product Quantization [20].
We exclude a recent work [4] that optimizes the performance of product quantization-based methods, as it mainly relies on hardware-specific features.
4.1 Anchor Graph Hashing
The most critical shortcoming of existing unsupervised hashing methods is the need to specify a global distance measure. In many real-world applications, on the contrary, data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest neighbors. For such data, one can only specify local distance measures, while the global distances are automatically determined by the underlying manifold. AGH is a graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes in an unsupervised manner.
The first step of graph-hashing methods is building a neighborhood graph over all the data points. In order to compute the neighborhood graph efficiently, AGH builds an approximate neighborhood graph using anchor graphs, in which the similarity between a pair of data points is measured with respect to a small number of anchor points. The resulting graph is constructed in time linear in the number of data points and is sufficiently sparse, with performance approaching the true kNN graph as the number of anchors increases.
AGH uses the anchor graph to approximate the neighborhood graph, and accordingly uses the graph Laplacian over the anchor graph to approximate the graph Laplacian of the original graph. The eigenvectors can then be computed quickly. It also uses hierarchical hashing to address the boundary issue.
4.2 Scalable Graph Hashing
AGH and some other representative unsupervised hashing methods directly exploit the similarity (neighborhood structure) to guide the hash learning procedure; these are considered a class of graph hashing. Generally, graph hashing methods are expected to achieve better performance than non-graph-based hashing methods if the learning algorithms are effective enough. However, graph hashing methods need to compute the pairwise similarities between any two data points. Hence, for large-scale datasets, it is memory-consuming and time-consuming, or even intractable, to learn from the whole similarity graph. Existing methods have to adopt approximation or subsampling for graph hashing on large-scale datasets, but the accuracy of the approximation cannot be guaranteed.
SGH adopts a feature transformation method to effectively approximate the whole graph without explicitly computing the similarity graph matrix S, so the corresponding computation and storage costs are avoided. The aim of SGH is to approximate the similarity matrix by the learned hash codes, which results in an objective function of the form
min over the codes of the sum over all pairs (i, j) of ( S_ij - (1/c) b_i^T b_j )^2,
where b_i is the c-bit binary code of point x_i. Integrating the idea of kernelized locality-sensitive hashing, the hash function for the t-th bit of x is defined as
h_t(x) = sgn( sum over j of W_tj phi(x, x_j) ),
where W is the weight matrix and phi(., .) is a kernel function.
SGH applies a feature transformation method to use all the similarities without explicitly computing the similarity matrix S. The discrete sgn function makes the problem very difficult to solve, so SGH uses a sequential learning strategy in a bitwise manner, where the residual caused by preceding bits can be complementarily captured by the following bits.
4.3 Neighbor Sensitive Hashing
For hashing-based methods, accuracy is judged by their effectiveness in preserving the kNN relationships (among the original items) in the Hamming space. That is, the goal of optimal hash functions is that the relative distances of the original items are ideally linear in the relative Hamming distances. To preserve the original distances, existing hashing techniques tend to place the separators uniformly, which is more effective for r-NN (fixed-radius) search.
The goal of existing methods is to assign hashcodes such that the Hamming distance between each pair of items is as close to a linear function of their original distance as possible. NSH instead changes the shape of the projection function, imposing a larger slope when the original distance between a pair of items is small, and allowing the Hamming distance to remain stable beyond a certain distance.
Given a reference point p, NSH changes the shape of the projection function to impose a larger slope when the original distance to p is small, and to let the projected distance remain stable beyond a certain distance. Hence, when the traditional hash function is applied in the projected space, points close to p are more likely to be assigned different hash codes. If the distance between the query q and p is smaller than a threshold, the above property also applies to q. To handle all possible queries, NSH selects multiple pivots and limits the maximum average distance between a pivot and its closest neighboring pivot, to ensure that there will be at least one nearby pivot for any query.
4.4 NAPP
Permutation methods are dimensionality-reduction approaches which assess the similarity of objects based on their relative distances to some selected reference points (pivots), rather than on the real distance values directly. An advantage of permutation methods is that they do not rely on metric properties of the original distance and can be successfully applied to non-metric spaces. The underpinning assumption of permutation methods is that most nearest neighbors can be found by retrieving a small fraction of the data points whose pivot rankings are similar to the pivot ranking of the query.
A basic version of permutation methods selects m pivots randomly from the data points. For each data point x, the pivots are arranged in order of increasing distance from x. Such a permutation of pivots is essentially a low-dimensional integer-valued vector whose ith element is the ranking order of the ith pivot in the set of pivots sorted by their distances from x. For the pivot closest to the data point, the value of the vector element is one, while for the most distant pivot the value is m.
In the searching step, a certain number of candidate points are retrieved whose permutations are sufficiently close to the permutation of the query vector. Afterwards, the search method sorts these candidate data points based on the original distance function.
This basic method can be improved in several ways. For example, one can index permutations rather than searching them sequentially: It is possible to employ a permutation prefix tree, an inverted index, or an index designed for metric spaces, e.g., a VPtree.
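The basic permutation representation can be sketched as follows (the pivot choice and the footrule closeness measure are illustrative; NAPP's actual indexing differs, as described next):

```python
import math

def permutation_vector(x, pivots, dist):
    # ith element = rank of pivot i among all pivots sorted by distance to x
    # (1 for the closest pivot, m for the farthest).
    d = [dist(x, p) for p in pivots]
    order = sorted(range(len(pivots)), key=lambda i: d[i])
    rank = [0] * len(pivots)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return rank

def spearman_footrule(u, v):
    # A common measure of closeness between two permutations.
    return sum(abs(a - b) for a, b in zip(u, v))

dist = lambda a, b: math.dist(a, b)
pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(permutation_vector((1.0, 2.0), pivots, dist))  # [1, 3, 2]
```

Candidates whose permutation vectors have a small footrule distance to the query's vector are then re-ranked with the original distance.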
The Neighborhood APProximation index (NAPP) was implemented by Bilegsaikhan Naidan. First, the algorithm selects pivots and computes the permutations induced by the data points. For each data point, the closest pivots are indexed in an inverted file. At query time, the data points that share at least a certain number of nearest-neighbor pivots with the query are selected as candidates. Then, these candidate points are compared directly against the query based on the real distance.
4.5 Selective Hashing
LSH is designed for finding points within a fixed radius of the query point, i.e., radius search. For a kNN problem (e.g., the first page of search engine results), the corresponding radii for different query points may vary by orders of magnitude, depending on how densely the region around the query point is populated.
Recently, there has been increasing interest in designing learningbased hashing functions to alleviate the limitations of LSH. If the data distributions only have global density patterns, good choices of hashing functions may be possible. However, it is not possible to construct global hashing functions capturing diverse local patterns. Actually, kNN distance (i.e., the desired search range) depends on the local density around the query.
Selective Hashing (SH) is especially suitable for kNN search and works on top of radius search algorithms such as LSH. The main innovation of SH is to create multiple LSH indices with different granularities (i.e., radii). Every data object is then stored in only one selected index, with the granularity that is especially effective for kNN searches near it.
Traditional LSH-related approaches with multiple indexes include all the data points in every index, while selective hashing builds multiple indices with each object being placed in only one of them. Hence, there is almost no extra storage overhead compared with a single fixed-radius index. Data points in dense regions are stored in the index with small granularity, while data points in sparse regions are stored in the index with large granularity. In the querying phase, the algorithm pushes down the query and checks the cell with the suitable granularity.
4.6 Optimal Product Quantization (OPQ)
Vector quantization (VQ) [21] is a popular and successful method for ANN search, which can be used to build an inverted index for non-exhaustive search. A vector x is mapped to a codeword c_i in a codebook C, with i drawn from a finite index set. The mapping, termed a quantizer, is denoted by q(x) = c_i(x), where the functions i(.) and c_(.) are called the encoder and decoder, respectively. The k-means approach has been widely used to construct the codebook, where i(x) is the index of the nearest mean of x and c_i(x) is the corresponding mean. An inverted index can subsequently be built to return all data vectors that are mapped to the same codeword.
A product quantizer [27] is a solution to VQ when a large number of codewords is desired. The key idea is to decompose the original vector space into the Cartesian product of lower-dimensional subspaces and to quantize each subspace separately. Suppose there are two subspaces and the dimensions are evenly partitioned. Each vector x can then be represented as the concatenation of two subvectors, x = [x^1, x^2], and we have q(x) = [q^1(x^1), q^2(x^2)], where q^1 (resp. q^2) is constructed by applying the k-means algorithm on the subspace consisting of the first (resp. second) half of the dimensions. As such, if each subspace has K sub-codewords, there are K^2 codewords in total, and x's nearest codeword is the concatenation of the two nearest sub-codewords based on its subvectors x^1 and x^2, respectively. Given a query vector, the search algorithm in [6] first retrieves the closest sub-codewords in each subspace, then applies a multi-sequence algorithm based on a priority queue to traverse the set of sub-codeword pairs in increasing distance and obtain a candidate codeword set; finally, the distances of the points belonging to one of the codewords in the candidate set are examined via the inverted index.
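The encode/decode step of a two-subspace product quantizer can be sketched as follows (the tiny hand-built codebooks are illustrative; in practice each sub-codebook is learned by k-means on its subspace):

```python
import math

def pq_encode(x, codebooks):
    # Split x evenly into M sub-vectors and encode each with the nearest
    # codeword of its own sub-codebook.
    M = len(codebooks)
    ds = len(x) // M
    code = []
    for m in range(M):
        sub = x[m * ds:(m + 1) * ds]
        code.append(min(range(len(codebooks[m])),
                        key=lambda j: math.dist(sub, codebooks[m][j])))
    return tuple(code)

def pq_decode(code, codebooks):
    # Concatenate the selected sub-codewords to reconstruct x approximately.
    out = []
    for m, j in enumerate(code):
        out.extend(codebooks[m][j])
    return out

# Two subspaces with 2 sub-codewords each -> 2 * 2 = 4 codes overall.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # sub-codebook for dims 0-1
    [(0.0, 0.0), (2.0, 2.0)],   # sub-codebook for dims 2-3
]
x = (0.9, 1.1, 1.8, 2.2)
print(pq_encode(x, codebooks))       # (1, 1)
print(pq_decode((1, 1), codebooks))  # [1.0, 1.0, 2.0, 2.0]
```

The same decomposition is what allows a huge effective codebook (K^M codewords) to be stored with only M small sub-codebooks.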
Recently, the Optimal Product Quantization (OPQ) method is proposed [20], which optimizes the index by minimizing the quantization distortion with respect to the space decompositions and the quantization codewords.
5 Tree-based Space Partitioning Methods
Tree-based space partitioning has been widely used for both exact and approximate NNS. Generally, the space is partitioned in a hierarchical manner, and there are two main partitioning schemes: pivoting and compact partitioning. Pivoting methods partition the vector space based on the distances from data points to pivots, while compact partitioning methods divide the data points into clusters, approximate Voronoi partitions, or randomly divided regions. In this section, we introduce two representative compact partitioning methods, Annoy [7] and FLANN [33], as they are used in a commercial recommendation system and widely used in the Machine Learning and Computer Vision communities. In addition, we also evaluate a classical pivoting partitioning method, the Vantage-Point tree (VP-tree) [44], in the experiments.
5.1 FLANN
FLANN is an automatic nearest neighbor algorithm configuration method which selects the most suitable of three candidate algorithms: the randomized kd-tree [38], the hierarchical k-means tree [17], and brute-force linear scan.
Randomized kd-trees
The main differences between FLANN's randomized kd-trees and the traditional kd-tree are: (i) the data points are recursively split into two halves by first choosing a splitting dimension and then using the perpendicular hyperplane centered at the mean of that dimension's values over the input data points; (ii) the splitting dimension is chosen at random from the top 5 dimensions that have the largest sample variance over the input data points; and (iii) multiple randomized kd-trees are built as the index.
To answer a query q, a depth-first search prioritized by a heuristic scoring function is used to search the multiple randomized kd-trees with a shared priority queue and candidate result set. The scoring function always favors the child node closer to the query point, and a lower-bounding distance is maintained across all the randomized trees.
The hierarchical k-means tree is constructed by partitioning the data points at each level into K regions using k-means clustering. The same method is then applied recursively to the data points in each region, and the recursion stops when the number of points in a region is smaller than a threshold. The tree is searched by initially traversing from the root to the closest leaf, during which the algorithm always picks the inner node closest to the query point and adds all unexplored branches along the path to a priority queue.
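The best-bin-first flavor of this traversal can be illustrated on a toy two-level tree, where unexplored branches wait in a priority queue keyed by their center's distance to the query (the hand-built tree and all names are illustrative):

```python
import heapq
import math

# A toy two-level k-means tree: the root keeps the cluster centers and each
# leaf holds the points of one cluster (hand-built here for illustration).
centers = [(0.0, 0.0), (10.0, 10.0)]
leaves = [[(0.1, 0.2), (0.5, 0.1)], [(9.8, 10.1), (10.4, 9.9)]]

def search(q, num_leaves):
    # Best-bin-first: visit leaves in order of their center's distance to q,
    # keeping unexplored branches in a priority queue.
    pq = [(math.dist(q, c), i) for i, c in enumerate(centers)]
    heapq.heapify(pq)
    best, best_d = None, float("inf")
    for _ in range(num_leaves):
        if not pq:
            break
        _, leaf = heapq.heappop(pq)
        for p in leaves[leaf]:
            d = math.dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best

print(search((0.3, 0.2), num_leaves=1))  # (0.1, 0.2)
```

Raising num_leaves trades speed for accuracy, which is exactly the knob the auto-tuning below optimizes.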
FLANN carefully chooses one of the three candidate algorithms by minimizing a cost function that combines the search time, index build time, and memory overhead. The cost function has the form
cost(theta) = (s(theta) + w_b * b(theta)) / min over theta' of (s(theta') + w_b * b(theta')) + w_m * m(theta),
where s(theta), b(theta), and m(theta) represent the search time, tree build time, and memory overhead for the trees constructed and queried with parameters theta, and w_b and w_m are user-specified weights for the build time and memory overhead, respectively.
FLANN uses random sub-sampling cross-validation to generate the data and the query points when running the optimization. The optimization can be run on the full dataset for the most accurate results, or on just a fraction of the dataset for a faster auto-tuning process.
5.2 Annoy
Annoy is an empirically engineered algorithm that has been used in the recommendation engine in spotify.com, and has been recognized as one of the best ANN libraries.
Index Construction.
In earlier versions, Annoy constructed multiple random projection trees [13]. In the latest version (as of March 2016), it instead constructs multiple hierarchical 2-means trees. Each tree is independently constructed by recursively partitioning the data points as follows. At each iteration, two centers are formed by running a simplified clustering algorithm on a subset of samples from the input data points. The two centers define a partition hyperplane that is equidistant from them. The data points are then partitioned into two subtrees by the hyperplane, and the algorithm builds the index on each subtree recursively.
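The split step can be sketched as follows (a simplified stand-in for Annoy's sampled 2-means; the seed, iteration count, and names are illustrative):

```python
import math
import random

def two_means_split(points, seed=3):
    # Form two centers with a few refinement passes (a simplified stand-in
    # for Annoy's sampled 2-means), then split the points by the hyperplane
    # equidistant from the two centers.
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)
    for _ in range(5):
        near1 = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
        near2 = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
        if near1:
            c1 = tuple(sum(v) / len(near1) for v in zip(*near1))
        if near2:
            c2 = tuple(sum(v) / len(near2) for v in zip(*near2))
    # The hyperplane's normal is c1 - c2 and it passes through the midpoint.
    normal = tuple(a - b for a, b in zip(c1, c2))
    mid = tuple((a + b) / 2 for a, b in zip(c1, c2))
    offset = sum(n * m for n, m in zip(normal, mid))
    side = lambda p: sum(n * v for n, v in zip(normal, p)) - offset
    return ([p for p in points if side(p) >= 0],
            [p for p in points if side(p) < 0])

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
left, right = two_means_split(pts)
print(sorted(left), sorted(right))  # the two natural clusters, one per side
```

Recursing on each side until a small leaf size yields one tree; Annoy builds many such trees with different random seeds.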
Search.
The search process is carried out by traversing the nodes of the multiple trees. Initially, the roots of the trees are pushed into a maximum priority queue with key value infinity. For a given tree node n with parent node p, if the query falls into the subtree of n, the key of n is the minimum of its parent's key and the distance to the hyperplane; otherwise, the key is the minimum of its parent's key and the negated distance to the hyperplane. At each iteration, the node with the maximum key is chosen for exploration.
5.3 VP-tree
The VP-tree is a classic space decomposition tree that recursively divides the space with respect to a randomly chosen pivot point. For each partition, the median value R of the distances from the pivot to every other point in the current partition is computed. The pivot-centered ball with radius R is used to partition the space: the inner points are placed into the left subtree, while the outer points are placed into the right subtree (points that are exactly at distance R from the pivot can be placed arbitrarily). Partitioning stops when the number of points falls below a threshold.
In the classic VP-tree, the triangle inequality can be used to prune unpromising partitions as follows: imagine that r is the query radius and the query point q is inside the pivot-centered ball (i.e., in the left subtree). If R - dist(pivot, q) >= r, the right partition cannot contain an answer and the right subtree can be safely pruned. The nearest-neighbor search is simulated as a range search with a decreasing radius: each time we evaluate the distance between q and a data point, we compare this distance with r; if it is smaller, it becomes the new value of r. In [34], a simple polynomial pruner is employed instead: the right partition can be pruned if the query is in the left partition and the pruning polynomial evaluated at R - dist(pivot, q) is at least r, and the left partition can be pruned if the query is in the right partition and the polynomial evaluated at dist(pivot, q) - R is at least r.
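A compact sketch of a VP-tree with the classic triangle-inequality pruning (not the polynomial pruner of [34]); the dataset, seed, and leaf size are illustrative:

```python
import math
import random

def build_vp(points, leaf_size=2, rng=None):
    # Recursively pick a pivot and split at the median distance R: inner
    # points go to the left subtree, outer points to the right subtree.
    rng = rng or random.Random(4)
    if len(points) <= leaf_size:
        return ("leaf", points)
    pivot = rng.choice(points)
    rest = [p for p in points if p != pivot]
    R = sorted(math.dist(pivot, p) for p in rest)[len(rest) // 2]
    left = [p for p in rest if math.dist(pivot, p) <= R]
    right = [p for p in rest if math.dist(pivot, p) > R]
    return ("node", pivot, R,
            build_vp(left, leaf_size, rng), build_vp(right, leaf_size, rng))

def nn_search(tree, q, best=(float("inf"), None)):
    # Range search with a shrinking radius r = best[0]; the triangle
    # inequality prunes the far side when the query ball cannot reach it.
    if tree[0] == "leaf":
        for p in tree[1]:
            if math.dist(q, p) < best[0]:
                best = (math.dist(q, p), p)
        return best
    _, pivot, R, left, right = tree
    d = math.dist(q, pivot)
    if d < best[0]:
        best = (d, pivot)
    near, far = (left, right) if d <= R else (right, left)
    best = nn_search(near, q, best)
    if abs(d - R) < best[0]:  # the far side may still contain an answer
        best = nn_search(far, q, best)
    return best

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (4.0, 4.0), (2.0, 2.0)]
tree = build_vp(pts)
print(nn_search(tree, (0.9, 0.1)))  # distance and the true NN (1.0, 0.0)
```

With this exact pruning condition the search remains correct; approximate variants tighten the condition to prune more aggressively.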
6 Neighborhood-based Methods
In general, neighborhood-based methods build the index by retaining the neighborhood information of each individual data point towards other data points or a set of pivot points. Various greedy heuristics are then used to navigate the resulting proximity graph for a given query. In this section, we introduce the representative neighborhood-based methods, namely Hierarchical Navigable Small World (HNSW [32]) and KGraph [16]. In addition, we also evaluate two other representative methods: Small World (SW [31]) and Rank Cover Tree (RCT [23]).
6.1 KGraph
KGraph [15] is a representative technique for K-NN graph construction and nearest neighbor search.
Index Construction. The K-NN graph is a simple directed graph in which each node has K outgoing edges pointing to its K nearest data points. Since the exact computation of the K-NN graph is costly, many heuristics have been proposed in the literature [16, 5]. In [16], the construction of the K-NN graph relies on a simple principle: a neighbor's neighbor is likely also a neighbor. Starting from K randomly picked neighbors for each data point, it iteratively improves the approximation by comparing each data point against its current neighbors' neighbors, including both K-NN and reverse K-NN points, and stops when no further improvement can be made. Four optimizations (local join, incremental search, sampling, and early termination) are applied to reduce redundant computation and to terminate the indexing phase early at an acceptable accuracy.
Search. A greedy search algorithm is employed in [16] to find the nearest neighbors of a query point q. Starting from p randomly chosen nodes (i.e., data points), the search maintains a node list storing the current best p candidates sorted by their distance to the query, and repeatedly computes the distances from the query to each neighbor (by following the graph edges) of the first unexplored data point in the node list. The node list is updated with any neighbor points that are closer to the query. The greedy search stops when every data point in the node list is closer to the query than all of its neighbors.
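The greedy graph search can be sketched as follows (a simplified Python version; `pool_size` corresponds to the search pool size p above, and the graph is a plain adjacency dictionary):

```python
import numpy as np

def greedy_graph_search(graph, data, query, pool_size, seeds):
    """Greedy search on a K-NN graph: keep a sorted pool of the best
    `pool_size` nodes seen so far and repeatedly expand the closest
    unexplored one; stop when every pool entry has been explored."""
    dist = lambda i: float(np.linalg.norm(data[i] - query))
    pool = sorted((dist(s), s) for s in set(seeds))[:pool_size]
    explored, seen = set(), {s for _, s in pool}
    while True:
        node = next((v for _, v in pool if v not in explored), None)
        if node is None:                      # everything in the pool explored
            break
        explored.add(node)
        for nb in graph[node]:                # follow the out-edges
            if nb not in seen:
                seen.add(nb)
                pool.append((dist(nb), nb))
        pool = sorted(pool)[:pool_size]
    return [v for _, v in pool]
```

On a path graph over points on a line, starting from the far end, the search walks edge by edge toward the query and returns the two closest points.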
6.2 Small World
The Small World (SW) method is a variant of the navigable small world graph data structure. The small world graph contains an approximation of the Delaunay graph and has long-range links together with the small-world navigation property. SW uses the same search method as the K-NN graph (KNNG) of KGraph. The greatest difference between KNNG and SW is the connection structure of the nodes: KNNG strives to obtain a locally optimal neighborhood for each node, whereas SW derives the links of each point from the graph at the time the point is inserted, via an incremental construction strategy. Besides, SW keeps a two-way connection for each edge, whereas KNNG only preserves the K-NN neighbors of each node.
Index Construction. The construction of SW is a bottom-up procedure that inserts all the data points consecutively. For every new incoming point, the algorithm finds the set of its closest neighbors in the current graph and creates undirected edges connecting the new point to this set. As more and more elements are inserted into the graph, links that previously served as short-range links become long-range links, making the graph a navigable small world.
Search. The nearest neighbor search algorithm is a greedy search procedure that carries out several sub-searches. A sub-search starts at a random node and proceeds by expanding the set of traversed, not-yet-visited nodes along neighboring links. The sub-search stops when it cannot find points closer than the nearest points already found (the number of retained nearest points is a search parameter).
6.3 Hierarchical Navigable Small World
The key idea of the Hierarchical Navigable Small World (HNSW) algorithm is to separate the links according to their length scale. In this way, the average number of connections per element across all layers can be bounded.
Index Construction. HNSW can be seen as a multi-layer and multi-resolution variant of a proximity graph. The ground (zero-level) layer includes all data points, and each higher layer contains fewer points. Similar to SW, HNSW is constructed by incrementally inserting data points, one by one. For each data point, a maximum level l is selected randomly, and the new point is added to all layers from level l down to layer zero. The insertion process can be divided into two phases. The first phase starts from the top layer and, in each layer, greedily traverses the graph to find the closest neighbor in that layer, which is used as the entry point to continue the search in the next layer down, until level l is reached. The second phase runs from layer l down to layer zero; in each layer, the nearest neighbors of the new point are found and connected to it. The search quality is controlled by the parameter ef, the size of the dynamic candidate list, which plays a role similar to the search pool size p of KGraph. In the first phase, ef is set to 1.
Search. The search algorithm is roughly equivalent to the insertion algorithm for an item with maximum level l = 0. The closest neighbors found at the ground layer are returned as the search result.
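The two-phase search can be sketched as follows (a simplified Python version of the idea, not the authors' implementation; each layer is a plain adjacency dictionary and `layers[0]` contains every point):

```python
import numpy as np

def hnsw_search(layers, data, query, ef, K):
    """HNSW-style search: greedily descend the upper layers with a single
    entry point (ef = 1), then run a wider best-first search keeping ef
    candidates on the ground layer."""
    dist = lambda i: float(np.linalg.norm(data[i] - query))
    entry = next(iter(layers[-1]))                   # any node of the top layer
    for graph in reversed(layers[1:]):               # phase 1: greedy descent
        improved = True
        while improved:
            improved = False
            for nb in graph[entry]:
                if dist(nb) < dist(entry):
                    entry, improved = nb, True
    graph = layers[0]                                # phase 2: ef-search
    pool = [(dist(entry), entry)]
    explored, seen = set(), {entry}
    while True:
        node = next((v for _, v in pool if v not in explored), None)
        if node is None:
            break
        explored.add(node)
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                pool.append((dist(nb), nb))
        pool = sorted(pool)[:ef]
    return [v for _, v in pool[:K]]
```

With a long-range link on the upper layer, the descent jumps across the dataset in one hop before the ground-layer search refines the result locally.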
6.4 Rank Cover Tree
Rank Cover Tree (RCT) [23] is a probabilistic data structure for similarity search that entirely avoids the use of numerical constraints such as the triangle inequality. Its search algorithm is based on the ranks of the points with respect to the query, and it returns a correct result in time that depends competitively on a measure of the intrinsic dimensionality of the data set.
The structure of RCT blends some of the design features of SASH [24] and the Cover Tree [9]. It is a hierarchical tree in which each node of the bottom level (level 0) is associated with a data point, and the nodes of each higher level are randomly selected from those of the level below with a certain probability. The index construction of RCT proceeds by inserting nodes from the highest level down to the lowest. If a node of one level also appears in the level above, it is linked to its copy there; otherwise, the indexing algorithm searches for its nearest neighbor among the nodes of the level above and links the node to that neighbor.
The search in RCT starts from the root of the tree; at each level, only a small subset of nodes, those most similar to the query, is kept. The search algorithm iteratively performs this procedure until it reaches the bottom level.
7 Diversified Proximity Graph
The experience and insights we gained from this study enable us to engineer a new method, Diversified Proximity Graph (DPG), which constructs a different neighborhood graph to achieve better and more robust search performance.
7.1 Motivation
In K-NN graph construction, we only consider the distances of the neighbors of each data point. Intuitively, however, we should also consider the coverage of the neighbors. As shown in Figure 1, the two closest neighbors of the point p (call them a1 and a2) lie in the same cluster, and hence the 2-NN list of p cannot lead the search to p's other nearby point (the node b), although b is close to p. Since a1 and a2 are clustered together, it is not cost-effective to retain both of them in the K-NN list of p. This motivates us to consider the direction diversity (i.e., angular dissimilarity) of the K-NN list of p in addition to the distance, leading to the diversified K-NN graph. In this example, including a1 and b is a better choice for the 2-NN list of p.
Now assume we have replaced the edge from p to a2 with an edge from p to b (the dashed line in Figure 1); there is still another problem. As we can see, b has no incoming edge because it is relatively far from the two clusters of points (i.e., b is not a K-NN of those data points). This implies that b is isolated and the two clusters are disconnected in the example. This is not uncommon in high dimensional data due to the phenomenon of "hubness" [36], where a large portion of the data points rarely serve as K-NNs of other data points and thus have no, or only a few, incoming edges in the K-NN graph. This motivates us to also use the reverse edges in the diversified K-NN graph; that is, we keep a bidirected diversified K-NN graph as the index, which we name the Diversified Proximity Graph (DPG).
7.2 Diversified Proximity Graph
The construction of DPG is a diversification of an existing KNN graph, followed by adding reverse edges.
Given a reference data point p, the similarity of two points x and y in p's K-NN list L is measured by the angle ∠xpy: the smaller the angle, the more similar (redundant) the two directions. We aim to choose a subset S of κ data points from L so that the average angle between two points in S is maximized, or, equivalently, so that the average pairwise angular similarity of S is minimized.
The above problem is NP-hard [29]; hence, we design a simple greedy heuristic. Initially, S is set to the point of L closest to p. In each of the following iterations, a point is moved from L to S so that the average pairwise angular similarity of the points in S is minimized. Then, for each data point u in S, we include both edges (p, u) and (u, p) in the diversified proximity graph. The diversification requires O(κ²K) similarity evaluations per data point, i.e., O(κ²Kn) in total, where n is the number of data points, and there are in total at most 2κn edges in the diversified proximity graph.
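The greedy angular diversification can be sketched as follows (a minimal version that uses the cosine of the angle at p as the similarity; function and argument names are ours):

```python
import numpy as np

def diversify_angular(p, cand, kappa):
    """Greedily diversify p's K-NN list `cand` (points sorted by ascending
    distance to p). The similarity of x and y is the cosine of the angle
    x-p-y; we grow S by the candidate with the smallest average similarity
    to the points already chosen, starting from the closest point."""
    p = np.asarray(p, dtype=float)
    unit = lambda v: (v - p) / np.linalg.norm(v - p)
    dirs = [unit(np.asarray(c, dtype=float)) for c in cand]
    chosen = [0]                                  # closest point first
    rest = list(range(1, len(cand)))
    while rest and len(chosen) < kappa:
        avg_sim = lambda i: np.mean([np.dot(dirs[i], dirs[j]) for j in chosen])
        best = min(rest, key=avg_sim)             # most diverse candidate
        chosen.append(best)
        rest.remove(best)
    return chosen                                 # indices into cand
```

Given two near-duplicate neighbors and one orthogonal neighbor, the heuristic keeps the closest point and the orthogonal one, discarding the redundant direction.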
It is critical to find a proper value of κ for the diversified proximity graph, as we need a good trade-off between diversity and proximity. In our empirical study, the DPG algorithm usually achieves the best performance when κ = K/2; thus, we set κ = K/2 for diversified proximity graph construction. Although the angular-similarity-based DPG algorithm achieves very good search performance, its diversification time is still costly. In our implementation, we therefore use another heuristic to construct the diversified K-NN graph. We keep a counter for each data point in the K-NN list of a data point p. For each pair of points x and y in the list, we increase the counter of y by one if y is closer to x than to p, i.e., d(x, y) < d(p, y). Then we simply keep the κ data points with the lowest counts, because, intuitively, the count of a data point is high if many other points have a similar direction. This reduces the diversification time complexity to O(K²n). Our empirical study shows that this achieves similar search performance while significantly reducing the diversification time. We also demonstrate that both diversification and reverse edges contribute to the improvement of the search performance.
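The cheaper counting-based heuristic, together with the added reverse edges, can be sketched as follows (a simplified Python version; the symmetric pair update and the distance tie-breaking are our own choices):

```python
import numpy as np

def diversify_counting(data, knn, kappa):
    """Counting-based diversification of a K-NN graph: within each point p's
    K-NN list, a neighbor y gets one count for every other neighbor x that
    lies closer to y than p does (d(x, y) < d(p, y)), a cheap proxy for
    'x and y point in a similar direction'. The kappa lowest-count
    neighbors are kept, and reverse edges make the final graph bidirected."""
    d = lambda a, b: float(np.linalg.norm(data[a] - data[b]))
    graph = {p: set() for p in range(len(data))}
    for p, nbrs in enumerate(knn):
        counts = {y: 0 for y in nbrs}
        for i, x in enumerate(nbrs):              # O(K^2) pairs per point
            for y in nbrs[i + 1:]:
                if d(x, y) < d(p, y):
                    counts[y] += 1
                if d(x, y) < d(p, x):
                    counts[x] += 1
        kept = sorted(nbrs, key=lambda y: (counts[y], d(p, y)))[:kappa]
        for y in kept:                            # keep both (p, y) and (y, p)
            graph[p].add(y)
            graph[y].add(p)
    return graph
```

For a point with two near-duplicate neighbors and one orthogonal neighbor, the duplicate with the higher count is dropped, while the reverse edges give the kept neighbors incoming links.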
Note that the search process of the DPG is the same as that of KGraph introduced in Section 6.1.
8 Experiments
In this section, we present our experimental evaluation.
8.1 Experimental Setting
Algorithms Evaluated
We evaluate representative existing NNS algorithms from the three categories as well as our proposed Diversified Proximity Graph (DPG) method. All the source codes are publicly available. Algorithms are implemented in C++ unless otherwise specified. We carefully went through all the implementations and made necessary modifications for fair comparison; for instance, we re-implemented the search process of some algorithms in C++, and we disabled multi-threading, SIMD instructions, fast-math, and hardware prefetching. All of the modified source codes used in this paper are publicly available on GitHub [40].
(1) LSH-based Methods. We evaluate Query-Aware LSH [25] (QALSH) and SRS (see Section 8.4).
(2) Encoding-based Methods. We evaluate Scalable Graph Hashing [28] (SGH) and Selective Hashing [19] (SH), together with OPQ, NAPP, AGH, and NSH (see Section 8.4).
(3) Tree-based Space Partition Methods. We evaluate FLANN, Annoy, and VP-tree (see Section 8.4).
(4) Neighborhood-based Methods. We evaluate Small World Graph [31] (SW, IS'14), Hierarchical Navigable Small World [32] (HNSW, CoRR'16), Rank Cover Tree [23] (RCT), and KGraph [16].
Computing Environment. All C++ source codes are compiled with g++, and the MATLAB sources (used only for the index construction of some algorithms) are compiled with MATLAB. All experiments are conducted on a Linux server with a multi-core Intel Xeon CPU and a large main memory.
Datasets and Query Workload
We deploy 18 real datasets used by existing works, covering a wide range of applications including image, audio, video, and textual data, together with two synthetic datasets. Table 1 summarizes the characteristics of the datasets, including the number of data points, the dimensionality, the Relative Contrast (RC [22]), the local intrinsic dimensionality (LID [1]), and the data type, where RC and LID describe the hardness of a dataset.
Relative Contrast evaluates the influence of several crucial data characteristics, such as dimensionality, sparsity, and database size, simultaneously in arbitrary normed metric spaces.
Suppose X = {x_1, ..., x_n} ⊂ R^d and a query q ∈ R^d, where the x_i and q are i.i.d. samples from an unknown distribution. Let D_min(q) = min_i d(q, x_i) be the distance from q to its nearest database sample, and D_mean(q) = E_x[d(q, x)] be the expected distance from q to a random database sample. The relative contrast of the dataset X for the query q is defined as RC_q = D_mean(q) / D_min(q), and the relative contrast for the dataset X is RC = E_q[D_mean(q)] / E_q[D_min(q)].
Intuitively, RC captures the difficulty of NN search in X: the smaller the RC, the more difficult the search. If RC is close to 1, then on average a query has almost the same distance to its nearest neighbor as to a random point in X.
We can also define the relative contrast for the k-nearest-neighbor setting as RC_k = E_q[D_mean(q)] / E_q[D_kmin(q)], where D_kmin(q) is the distance from q to its k-th nearest neighbor. A smaller RC value implies a harder dataset.
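The empirical RC of a dataset follows directly from its definition (a small sketch; `data` and `queries` are assumed to be arrays of row vectors, with queries not contained in the data):

```python
import numpy as np

def relative_contrast(data, queries, k=1):
    """Empirical relative contrast RC_k = E_q[D_mean(q)] / E_q[D_kmin(q)],
    where D_mean(q) is the mean distance from q to the dataset and
    D_kmin(q) the distance from q to its k-th nearest point. Values close
    to 1 indicate a hard dataset for NN search."""
    d_mean, d_kmin = [], []
    for q in queries:
        dists = np.linalg.norm(data - q, axis=1)
        d_mean.append(dists.mean())
        d_kmin.append(np.sort(dists)[k - 1])
    return float(np.mean(d_mean) / np.mean(d_kmin))
```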
Local Intrinsic Dimensionality (LID) evaluates the rate at which the number of encountered objects grows as the considered range of distances expands from a reference location. We employ RVE (estimation using regularly varying functions) to compute the LID value; it applies an ad hoc estimator for the intrinsic dimensionality based on the characterization of distribution tails as regularly varying functions. A dataset with a higher LID value is harder than the others.
We mark the first four datasets in Table 1 with asterisks to indicate that they are “hard” datasets compared with others according to their RC and LID values.
Below, we describe the datasets used in the experiments.
Nusw
Gist
Random contains 1M points chosen uniformly at random in a unit hypersphere of dimensionality 100.
Glove
Cifar
Audio
Mnist
Sun397
Enron originates from a collection of emails; Yifang et al. extracted bigrams and formed feature vectors of 1,369 dimensions.
Trevi
Notre
Youtube_Faces
Msong
Sift
Deep
UKbench
ImageNet
Gauss is generated by randomly choosing cluster centers in a 512-dimensional space; each cluster follows a Gaussian distribution with a fixed deviation on each dimension.
UQ_Video is a video dataset; local features of 256 dimensions are extracted from selected keyframes.
Bann
Name  n (×1,000)  d  RC  LID  Type

Nus*  269  500  1.67  24.5  Image 
Gist*  983  960  1.94  18.9  Image 
Rand*  1,000  100  3.05  58.7  Synthetic 
Glove*  1,192  100  1.82  20.0  Text 
Cifa  50  512  1.97  9.0  Image 
Audio  53  192  2.97  5.6  Audio 
Mnist  69  784  2.38  6.5  Image 
Sun  79  512  1.94  9.9  Image 
Enron  95  1,369  6.39  11.7  Text 
Trevi  100  4,096  2.95  9.2  Image 
Notre  333  128  3.22  9.0  Image 
Yout  346  1,770  2.29  12.6  Video 
Msong  922  420  3.81  9.5  Audio 
Sift  994  128  3.50  9.3  Image 
Deep  1,000  128  1.96  12.1  Image 
Ben  1,098  128  1.96  8.3  Image 
Imag  2,340  150  2.54  11.6  Image 
Gauss  2,000  512  3.36  19.6  Synthetic 
UQV  3,038  256  8.39  7.2  Video 
BANN  10,000  128  2.60  10.3  Image 
Query Workload. Following the convention, we randomly removed a set of data points from each dataset to serve as query points. The average performance of the k-NN searches is reported. The value of k varies from 1 to 100 in the experiments. In this paper, we use the Euclidean distance for ANNS.
8.2 Evaluation Metrics
Since the exact k-NN can be found by a brute-force linear scan algorithm (denoted BF), we use its query time as the baseline and define the speedup of an algorithm A as t_BF / t_A, where t_A is the average search time of A.
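The two evaluation metrics can be stated compactly (a trivial sketch of the definitions above; names are ours):

```python
def recall_and_speedup(returned, truth, t_algo, t_bruteforce):
    """Standard ANNS evaluation metrics: recall of the returned candidate
    ids against the true k-NN ids, and speedup = t_BF / t_A."""
    recall = len(set(returned) & set(truth)) / len(truth)
    speedup = t_bruteforce / t_algo
    return recall, speedup
```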
The search quality of the returned points is measured by the standard recall against the exact k-NN points (see Section 2.1).
All reported measures are averaged over all queries in the query workload. We also evaluate other aspects such as index construction time, index size, and scalability.
Note that the same algorithm can achieve different combinations of speedup and recall, typically by varying the threshold N on the number of points verified.
8.3 Parameter Tuning
Below are the default settings of the key parameters of the algorithms in the second round evaluation in Section 8.5.

- SRS. The number of projections is set to its default value.
- OPQ. The number of subspaces is fixed, and each subspace has 256 codewords (i.e., cluster centers) by default.
- Annoy. The number of Annoy trees is set to its default value.
- FLANN. We let the algorithm tune its own parameters.
- HNSW. The number of connections per point, M, is set to its default value.
- KGraph. By default, a fixed K is used for the K-NN graph index.
- DPG. We use κ = K/2 so that, in the worst case, the index size of DPG matches that of KGraph.

More details about how to tune each algorithm, and a comparison between our counting-based DPG and the angular-based DPG, are reported in Appendix A.
8.4 Comparison with Each Category
In this subsection, we evaluate the trade-off between speedup and recall of all the algorithms in each category. Given the large number of algorithms in the space-partition-based category, we evaluate its encoding-based and tree-based subcategories separately. The goal of this round of evaluation is to select several algorithms from each category as representatives for the second round evaluation (Section 8.5).
LSH-based Methods
Figure 2 plots the trade-off between speedup and recall of the two most recent data-independent algorithms, SRS and QALSH, on Sift and Audio. As both algorithms were originally designed as external-memory approaches, we measure the speedup as the total number of pages of the dataset divided by the number of pages accessed during the search. The figure shows that SRS consistently outperforms QALSH. Table 2 shows the construction time and index size of SRS and QALSH; the index of QALSH is at least 5 times larger than that of SRS. Thus, SRS is chosen as the representative for the second round evaluation, where a cover-tree-based in-memory implementation is used.
Dataset  SRS  QALSH  

size(MB)  time(s)  size(MB)  time(s)  
Audio  2.8  26.5  14.1  27.3 
Sift  46.6  980  318  277 
Notre  15.5  253.9  98.1  95.2 
Sun  4.1  45.1  21.5  67.2 
Encoding-based Methods
We evaluate six encoding-based algorithms: OPQ, NAPP, SGH, AGH, NSH, and SH. Figure 3 demonstrates that the search performance of OPQ beats the other algorithms by a big margin on most of the datasets.
Table 3 reports the construction time (seconds) and index size (MB) of the encoding-based methods. For most datasets, NSH has the smallest index size, followed by AGH, SGH, and OPQ. Selective Hashing has the largest index size because it requires long hash tables to reduce the number of points per bucket, and multiple hash tables to achieve a high recall.
NSH and AGH take relatively little time to build their indexes. The index construction time of OPQ is strongly associated with the length of the sub-codewords and the dimensionality of the data points. Nevertheless, the index construction time of OPQ still turns out to be very competitive compared with the other algorithms in the second round evaluation. We therefore choose OPQ as the representative of the encoding-based methods.
Dataset  SH  OPQ  NAPP  NSH  AGH  SGH  
size  time  size  time  size  time  size  time  size  time  size  time  
Sift  729  320  65  788.7  119  122  34.1  35.06  65  28  65.2  221 
Trevi  185  132  158  10457  12  262  1.9  242.9  19.4  133  19.4  246 
Mnist  165  34.5  11  152.4  8.3  35  2.4  96.4  4.6  26.9  6.9  358 
Glove  850  375  43  697.7  143  119  77.3  105.9  78  20.6  41.3  89.4 
Notre  325  107.2  17  138.4  40  39  16.5  28.2  11.9  5.3  22.3  206 
Ben  792  352.3  68  844.6  131  134  37.7  38  38.2  16.8  71.9  447 
Audio  155  18.1  9  45.8  6.4  8.4  1.8  24.8  3  8.1  4.6  360 
Cifa  153  21.5  9  92.6  6  18.3  0.8  18.8  3.7  22  1.9  3 
Tree-based Space Partition Methods
We evaluate three algorithms in this category: FLANN, Annoy, and VP-tree. To better illustrate the performance of FLANN, we report the performance of both its randomized kd-trees and its hierarchical k-means tree, namely FLANN-KD and FLANN-HKM, respectively. Note that among the datasets deployed in the experiments, the randomized kd-trees method (FLANN-KD) is chosen by FLANN on five datasets: Enron, Trevi, UQV, BANN, and Gauss. The linear scan is chosen for the hardest dataset, Rand, and the hierarchical k-means tree (FLANN-HKM) is employed on the remaining datasets. Figure 4 shows that Annoy, FLANN-HKM, and FLANN-KD each obtain the highest performance on different datasets.
Note that the index size of VP-tree is the memory used during indexing. From Table 4, we can see that VP-tree almost always has the largest indexing time, because it spends a long time tuning its parameters automatically. Since the search performance of VP-tree is not competitive with FLANN and Annoy under any setting, it is excluded from the next round of evaluation.
Dataset  Annoy  FlannKD  FlannHKM  VPtree  

size(MB)  time(s)  size(MB)  time(s)  size(MB)  time(s)  size(MB)  time(s)  
Sift  572  144.6  274.0  45.5  230.0  27.4  573.1  131 
Deep  571  224  550.0  148.6  74.0  1748.1  1063  474 
Enron  56  51.1  27.0  30.0  74.0  412.9  500.7  954 
Gist  563  544.9  540.0  462.9  1222.0  775.5  4177.6  1189 
UQV  1753  801.8  1670.0  462.0  99.0  3109.2  3195.5  15.9 
Sun  46  18.4  44.0  20.6  58.0  43.0  161  124 
Audio  31  7.08  15.0  3.0  20.0  10.6  45  151 
Cifa  28  10.8  14.0  7.0  9.0  23.7  101  131 
Neighborhood-based Methods
In the category of neighborhood-based methods, we evaluate four existing techniques: KGraph, SW, HNSW, and RCT. Figure 5 shows that the search performance of KGraph and HNSW substantially outperforms that of the other two algorithms on most of the datasets.
RCT has the smallest index size, while the construction times of KGraph and HNSW are relatively large. Due to their outstanding search performance, we choose KGraph and HNSW as the representatives of the neighborhood-based methods. Note that we defer the comparison with DPG to the second round evaluation.
Dataset  HNSW  KGraph  SW  RCT  

size(MB)  time(s)  size(MB)  time(s)  size(MB)  time(s)  size(MB)  time(s)  
Sift  589  1290  160  815.9  236  525  50  483 
Nusw  541  701  44  913  64  605  22.1  813.9 
Audio  45  36.9  9  38.2  13  8.1  2.4  11.8 
Gist  3701  5090  158  6761.3  233  2994  50.3  3997 
Deep  1081  1800  161  1527  236.5  372.1  51.2  815 
Cifar  103  99.8  8  97.5  5.9  25.6  2.3  28 
Trevi  1572  1292  16  1800.8  12.2  1003  4.5  649 
Ben  651  1062  176  863  260  225  57  482 
8.5 Second Round Evaluation
In the second round evaluation, we conduct comprehensive experiments on seven representative algorithms: SRS, OPQ, FLANN, Annoy, HNSW, KGraph, and DPG.
Search Quality and Time
In the first set of experiments, Figure 6 reports the speedup of the seven algorithms when they reach approximately the same target recall on each of the 20 datasets. Note that the speedup is set to 1 if an algorithm cannot outperform the linear scan algorithm. Among the seven algorithms, DPG and HNSW have the best overall search performance, followed by KGraph. DPG clearly enhances the performance of KGraph, especially on the hard datasets Nusw, Gist, Glove, and Rand. As reported below, the improvement is even more significant at higher recall: DPG is ranked after KGraph on three datasets under this setting, but it eventually surpasses KGraph at higher recalls. Overall, DPG and HNSW have the best performance across the datasets.
OPQ also achieves a decent speedup under all settings. Not surprisingly, SRS is slower than the other competitors by a huge margin, as it does not exploit the data distribution. Similar observations are made in Figure 7, which depicts the recalls achieved by the algorithms at approximately the same speedup.
We evaluate the trade-off between search quality (recall) and search time (speedup and the percentage of data points accessed). The number of data points to be accessed (N), together with other parameters such as the number of trees (Annoy) and the search queue size (HNSW, KGraph, and DPG), is tuned to achieve various recalls at different search times (i.e., speedups). Figure 8 illustrates the speedup of the algorithms on eight datasets over a wide range of recalls. It further demonstrates the superior search performance of DPG at high recall and on hard datasets (Figure 8(a)-(d)). The overall performances of HNSW, KGraph, and Annoy are also very competitive, followed by FLANN. Both DPG and KGraph rank lower than HNSW, Annoy, FLANN, and OPQ in Figure 8(h), where the data points are clustered.
In Figure 9, we evaluate the recall of the algorithms against the percentage of data points accessed, i.e., the fraction of points whose distances to the query are computed in the data space. As the search of the proximity-graph-based methods starts from random entry points and only gradually approaches the result, while the other algorithms initiate their search from the most promising candidates, the search quality of the graph-based methods is outperformed by Annoy, FLANN, and even OPQ at the beginning stage. Nevertheless, their recall rises faster than that of the other algorithms as more data points are accessed. Thanks to its hierarchical structure, HNSW approaches the result more quickly.
Searching with Different k
We evaluate the speedup at a fixed recall with different k. As shown in Figure 12, DPG and HNSW almost always have the best search performance for k between 1 and 100, followed by KGraph and Annoy. A similar ranking is observed across the different k values.
Index Size and Construction Time
In addition to search performance, we also evaluate the index size and construction time of the seven algorithms on the 20 datasets. Figure 10 reports the ratio of the index size (excluding the data points) to the data size. Except for Annoy, the index sizes of all algorithms are smaller than the corresponding data sizes.
The index sizes of DPG, KGraph, HNSW, and SRS are independent of the dimensionality, because a fixed number of neighbor IDs (for the neighborhood-based methods DPG, KGraph, and HNSW) or projections (for SRS) is kept per data point. Consequently, they have a relatively small ratio on high-dimensional data (e.g., Trevi). Overall, OPQ and SRS have the smallest index sizes on most of the datasets, followed by HNSW, DPG, KGraph, and FLANN. The rank of FLANN's index size varies dramatically over the datasets, because FLANN may choose any of three possible index structures. Annoy needs to maintain a considerable number of trees for good search quality, and hence has the largest index size.
Figure 11 reports the index construction time of the algorithms on the 20 datasets. SRS has the best overall performance due to its simplicity, and Annoy ranks second. The construction time of OPQ depends on the dimensionality because of the calculation of the sub-codewords (e.g., on Trevi). HNSW, KGraph, and DPG have similar construction times; compared with KGraph, DPG does not spend much extra time on graph diversification. All of them can still build their indexes within one hour on most of the datasets.
Scalability Evaluation
In Figure 13, we investigate the scalability of the algorithms in terms of search time, index size, and index construction time against the growth of the number of data points (n) and of the dimensionality (d). A fixed target recall is used for the evaluation of both n and d. In particular, BANN is deployed with a growing number of data points (up to 10M), and, following the convention, we use the random data (Rand) to evaluate the impact of the dimensionality. To better illustrate the scalability of the algorithms, we report their growth ratios against the increase of n and d; for instance, the index size growth ratio of DPG at a given n is its index size divided by its index size at the smallest n.
As the number of data points (n) increases, DPG, KGraph, HNSW, and Annoy have the best search scalability, while SRS ranks last. On the other hand, OPQ has the best scalability in index size and construction time, followed by FLANN. Notably, the performance of FLANN is rather unstable, mainly because it chooses FLANN-KD for the smaller data sizes and FLANN-HKM otherwise.
Regarding the growth of the dimensionality, FLANN has the worst overall performance, falling back to brute-force linear scan at the highest dimensionalities. As expected, DPG, KGraph, HNSW, and SRS have the best index size scalability, since their index sizes are independent of d. Interestingly, SRS has the best search scalability, and its speedup even surpasses that of DPG at the highest dimensionality; this is probably credited to its theoretical worst-case performance guarantee. Note that we do not report the performance of KGraph in Figure 13(d), because it is always outperformed there by the linear scan algorithm. This implies that KGraph is rather vulnerable to high dimensionality, which justifies the importance of DPG, whose search scales much better with the dimensionality.
Harder Query Workload
We evaluate the algorithms when the distribution of the query workload differs from that of the dataset. We generate increasingly different query workloads by perturbing the default queries by a fixed length in a random direction; with increasingly large perturbations of the default queries on Sift, we obtain query workloads with decreasing RC values. Intuitively, queries with smaller RC values are harder, as the distance from the query to a random point in the dataset becomes less distinguishable from the distance to its NN point. Figure 14 shows the speedups of the algorithms at a fixed recall on these harder query workloads, characterized by their RC values. The speedups of all algorithms drop on the harder (smaller-RC) workloads. HNSW has the best performance on the easy queries, followed by DPG, KGraph, Annoy, and FLANN. Nevertheless, DPG is the least affected and still achieves more than a 100x speedup (with a recall of at least 0.8) in the hardest setting. This demonstrates the robustness of DPG against different query workloads.
8.6 Summary
Table 6 ranks the performance of the seven algorithms from various perspectives, including search performance, index size, index construction time, and scalability. We also indicate that SRS is the only algorithm with a theoretical guarantee on search quality, and that its parameters trading search quality against search time are very easy to tune. Tuning Annoy is also simple: one merely varies the number of trees. It is much more complicated to tune the parameters of FLANN; its authors therefore developed an auto-configuration algorithm to handle this.
Below are some recommendations for users according to our comprehensive evaluations.

When there are sufficient computing resources (both main memory and CPUs) for the offline index construction, and sufficient main memory to hold the resulting index, DPG and HNSW are the best choices for ANNS on high dimensional data due to their outstanding search performance in terms of robustness to the datasets, result quality, search time and search scalability.
We also recommend Annoy, due to its excellent search performance and robustness across datasets. Additionally, a nice property of Annoy is that, compared with the proximity-graph-based approaches, it provides a better trade-off between search performance and index size/construction time, because one can reduce the number of trees without substantially hurting the search performance.
Note that KGraph also provides excellent overall performance, except on a few datasets (e.g., the four hard datasets and Yout). We recommend Annoy rather than KGraph, as DPG is an improved version of KGraph with better performance, and Annoy performs best in the few cases where both DPG and KGraph do not perform as well (e.g., Yout and Gauss).

To deal with large-scale datasets (e.g., billions of data points) with moderate computing resources, OPQ and SRS are good candidates due to their small index sizes and construction times. It is worth mentioning that SRS can easily handle updates of the data points and has a theoretical guarantee, which distinguishes it from the other algorithms.
Algorithm  Search Perf.  Index Size  Index Time  Index Scal. (n)  Index Scal. (d)  Search Scal. (n)  Search Scal. (d)  Theoretical Guarantee  Tuning Difficulty
DPG  1st  4th  7th  =4th  =1st  =1st  5th  No  Medium 
HNSW  1st  3rd  5th  =4th  4th  =1st  4th  No  Medium 
KGraph  3rd  5th  6th  =4th  =1st  =1st  7th  No  Medium 
Annoy  4th  7th  2nd  7th  3rd  6th  =2nd  No  Easy 
FLANN  5th  6th  4th  =2nd  7th  =1st  6th  No  Hard 
OPQ  6th  2nd  3rd  1st  =5th  5th  =2nd  No  Medium 
SRS  7th  1st  1st  =2nd  =5th  7th  1st  Yes  Easy 
9 Further Analyses
In this section, we analyze the most competitive algorithms in our evaluations, grouped by category, in order to understand their strengths and weaknesses.
9.1 Space Partition-based Approaches
Our comprehensive experiments show that Annoy, FLANN, and OPQ have the best performance among the space partitioning-based methods. Note that FLANN chooses FLANN-HKM on most of the datasets; therefore, all three algorithms are based on k-means space partitioning.
We identify that a key factor in the effectiveness of k-means-style space partitioning is the use of a large number of clusters. Note that we cannot directly apply k-means with such a large number of clusters because (i) the index construction time of k-means is linear in the number of clusters, and (ii) identifying the partition where the query is located also takes time linear in the number of clusters. OPQ and FLANN-HKM/Annoy both achieve this goal indirectly, by using subspace partitioning and recursion, respectively.
We perform experiments to understand which idea is more effective. We consider the goal of achieving k-means-style space partitioning with approximately the same number of partitions. Specifically, we consider the following choices: (i) use k-means directly; (ii) use OPQ, where each subspace has 256 clusters, so that the number of effective (i.e., non-empty) partitions is comparable; and (iii) use FLANN-HKM with two different branching factors, with the stopping condition modified so that the resulting trees have the corresponding numbers of partitions. Figure 15(a) reports the recalls of the above choices on Audio against the percentage of data points accessed, where partitions are accessed in order of the distances of their centers to the query. We can see that OPQ-based partitioning has the worst performance, followed by the two (modified) FLANN-HKM variants; plain k-means has the best performance, although the differences among the latter three are not significant. Our analysis therefore suggests that hierarchical k-means-based partitioning is the most promising direction so far.
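The partition-probing protocol above can be sketched in a few lines. The following is a hypothetical toy stand-in (not our benchmark code): it clusters random data with a small k-means and records the recall of the true top-k neighbors as partitions are scanned in order of center-to-query distance.

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Plain Lloyd's k-means; a few iterations suffice for this sketch."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign

def recall_curve(data, query, k, topk=10):
    """Scan partitions ordered by center-to-query distance and record
    the recall of the true top-k after each partition is scanned."""
    centers, assign = kmeans(data, k)
    true_nn = set(np.argsort(np.linalg.norm(data - query, axis=1))[:topk].tolist())
    order = np.argsort(np.linalg.norm(centers - query, axis=1))
    found, recalls = set(), []
    for part in order:
        found.update(np.flatnonzero(assign == part).tolist())
        recalls.append(len(found & true_nn) / topk)
    return recalls

rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 16)).astype(np.float32)
query = rng.standard_normal(16).astype(np.float32)
recalls = recall_curve(data, query, k=32)
# recall rises quickly at first and reaches 1.0 once all partitions are read
```

The same harness can be reused for any partitioning scheme by swapping out how `assign` and the partition centers are produced.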
Our second analysis investigates whether we can further boost search performance by using multiple hierarchical k-means trees. Note that Annoy already uses multiple trees, and it significantly outperforms the single hierarchical k-means tree of FLANN-HKM on most of the datasets, so it is natural to try to enhance FLANN-HKM in a similar way. We set up an experiment to construct multiple FLANN-HKM trees; in order to build different trees, we perform k-means clustering on different sets of random samples of the input data points. Figure 15(b) shows the resulting speedup vs. recall, where we use up to 50 trees. We can see that it is not cost-effective to apply multiple trees to FLANN-HKM on Audio, mainly because the trees obtained are still similar to each other, and hence the advantage of multiple trees cannot offset the extra indexing and searching overheads. Annoy does not suffer from this problem because 2-means partitioning with a limited number of samples and iterations naturally produces diverse partitions.
9.2 Neighborhood-based Approaches
Our first analysis seeks to understand why KGraph, DPG, and HNSW work very well (esp. attaining a very high recall) on most of the datasets. Our preliminary analysis indicates that this is because (i) the NN points of a query are typically closely connected in the neighborhood graph, and (ii) most points are well connected to at least one of the NN points of a query. Point (ii) means there is a high empirical probability that one of the entry points selected randomly by the search algorithm can reach one of the NN points, and (i) ensures that most of the NN points can then be returned. By well connected, we mean there are many paths from an entry point to one of the NN points, hence there is a large probability that the “hills” on one of the paths are low enough that the search algorithm won't get stuck in a local minimum.
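The search procedure shared, in spirit, by these neighborhood-based methods is a best-first traversal of the graph. Below is a minimal illustrative sketch on a brute-force kNN graph; it is not the actual code of KGraph, SW, or HNSW, and the parameter name `ef` is borrowed from HNSW for convenience.

```python
import heapq
import numpy as np

def build_knn_graph(data, k):
    """Brute-force kNN graph (the object that KGraph/DPG approximate)."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    return [list(np.argsort(row)[1:k + 1]) for row in d]  # skip self

def greedy_search(data, graph, query, entry, ef=10):
    """Best-first search: repeatedly expand the closest unexplored node,
    keeping the ef best results; stops when no candidate can improve."""
    dist = lambda i: float(np.linalg.norm(data[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]    # min-heap by distance
    results = [(-dist(entry), entry)]      # max-heap of the ef best so far
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0] and len(results) >= ef:
            break                          # closest candidate cannot improve
        for nb in graph[c]:
            if nb in visited:
                continue
            visited.add(nb)
            d_n = dist(nb)
            if len(results) < ef or d_n < -results[0][0]:
                heapq.heappush(candidates, (d_n, nb))
                heapq.heappush(results, (-d_n, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-d, i) for d, i in results)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 8)).astype(np.float32)
graph = build_knn_graph(data, k=10)
query = rng.standard_normal(8).astype(np.float32)
found = {i for _, i in greedy_search(data, graph, query, entry=0)}
true_nn = set(np.argsort(np.linalg.norm(data - query, axis=1))[:10])
```

Observation (i) above corresponds to the final expansion steps recovering the mutually connected NN points; observation (ii) corresponds to the walk from `entry` reaching that neighborhood at all.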
We also investigate why KGraph does not work well on some datasets and why DPG and HNSW work much better. KGraph does not work on Yout and Gauss mainly because both datasets have many well-separated clusters; hence the index of KGraph has many disconnected components. Thus, unless one of the entry points used by its search algorithm is located in the same cluster as the query results, there is little or no chance for KGraph to find any near point. On the other hand, mainly due to the diversification step and the use of the reverse edges in DPG, there are edges linking points from different clusters, resulting in much improved recalls. Similarly, points in different clusters are also well linked in HNSW.
For example, when we perform the experiment using the NN of the query as the entry point of the search on Yout, KGraph achieves 100% recall. In addition, we plot the distribution of the minimum number of hops (minHops) needed for a data point to reach any of the NN points of a query:
For KGraph, a large percentage of data points cannot reach any NN point at all (i.e., points with unbounded minHops) on Yout (60.38%), while the percentage is low on Gist (0.04%).

The percentages of such unreachable points are much lower for DPG (1.28% on Yout and 0.005% on Gist).

There are no unreachable points for HNSW on either dataset.

DPG and HNSW have many more points with small minHops than KGraph, which makes it easier to reach one of the NN points. Moreover, on Yout, HNSW has the most points with small minHops among the three algorithms, which results in better performance, as shown in Figure 8(g).
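The minHops statistic can be computed by a breadth-first search over the reversed graph edges, starting from the query's NN points. A small sketch (hypothetical helper, shown on a toy two-component graph that mirrors the disconnected-cluster issue):

```python
from collections import deque

def min_hops_to_targets(graph, targets):
    """For every node, the minimum number of directed hops needed to
    reach any target, computed as a BFS over reversed edges; a value of
    -1 marks nodes that can never reach a target (unbounded minHops)."""
    n = len(graph)
    reverse = [[] for _ in range(n)]
    for u, nbrs in enumerate(graph):
        for v in nbrs:
            reverse[v].append(u)   # edge u -> v becomes v -> u
    hops = [-1] * n
    q = deque()
    for t in targets:
        hops[t] = 0
        q.append(t)
    while q:
        v = q.popleft()
        for u in reverse[v]:
            if hops[u] == -1:
                hops[u] = hops[v] + 1
                q.append(u)
    return hops

# Two disconnected components: nodes 0-2 form a cycle, 3-4 another.
graph = [[1], [2], [0], [4], [3]]
hops = min_hops_to_targets(graph, targets=[0])
# hops -> [0, 2, 1, -1, -1]: nodes 3 and 4 can never reach node 0
```

Run once per query with `targets` set to the query's true NN ids, then aggregate the resulting distributions across queries.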
9.3 Comparisons with Prior Benchmarks
We have verified that the performance results obtained in our evaluation generally match prior evaluation results, and we can satisfactorily explain most of the few discrepancies.
Comparison with ann-benchmark's Results. While the curves in both evaluations have similar shapes, the relative orders of the best performing methods are different. This is mainly because we turned off all hardware-specific optimizations in the implementations of the methods. Specifically, we disabled SIMD-based distance computation and multi-threading in KGraph, the -ffast-math compiler option in Annoy, multi-threading in FLANN, and SIMD-based distance computation, multi-threading, prefetching, and the -Ofast compiler option in the methods implemented in the NonMetricSpaceLib (i.e., SW, NAPP, VP-tree, and HNSW). In addition, we disabled the optimized search implementation in HNSW. We confirm that the results resemble ann-benchmark's more closely when these optimizations are turned on.
Disabling these hardwarespecific optimizations allows us to gain more insights into the actual power of the algorithms. In fact, the optimizations can be easily added to the implementations of all the algorithms.
KGraph and SW. KGraph was ranked very low in the ann-benchmark study [8], possibly due to an error in the test.
Annoy. We note that the performance of the latest version of Annoy (based on randomized hierarchical 2-means trees) is noticeably better than its earlier versions (based on multiple heuristic RP-trees), and this may affect prior evaluation results.
10 Conclusion and Future Work
NNS is a fundamental problem with significant theoretical value, and it empowers a diverse range of applications. It is widely believed that there is no practically competitive algorithm that answers exact NN queries in sublinear time with a linear-sized index. A natural question is whether there is an algorithm that empirically returns most of the NN points in a robust fashion by building a linear-sized index and accessing at most a small constant fraction of the data points (such as 1%).
In this paper, we evaluate many stateoftheart algorithms proposed in different research areas and by practitioners in a comprehensive manner. We analyze their performances and give practical recommendations.
Due to various constraints, the study in this paper is inevitably limited. In our future work, we will (i) use larger datasets (e.g., M+ points); (ii) consider high dimensional sparse data; (iii) use more complete methods, including exhaustive search, to tune the algorithms; and (iv) consider other distance metrics, as in [34].
Finally, our understanding of high dimensional real data is still vastly inadequate. This is manifested in the many heuristics that lack reasonable justification yet work very well on real datasets. We hope that this study opens up more questions that call for innovative solutions by the entire community.
Appendix A Parameter Setting
In this section, we carefully tune all algorithms to achieve good search performance with reasonable index space and construction time overheads. By default, unless mentioned otherwise, we use the number of data points verified (i.e., those for which the exact distance to the query is computed) to control the tradeoff between search quality and search speed.
A.1 SRS
We test two versions of SRS: an external-memory version and an in-memory version. In this paper, we do not use the early-termination test, and instead stop the search when it has accessed enough points. The approximation ratio is set to 4, and the page size for the external-memory version is fixed. For the sake of fairness, the success probability is set to the same value for all data-independent methods. SRS thus has two parameters: the maximum number of data points accessed during query processing, and the dimensionality of the projected space. We vary the number of accessed data points to trade off search quality and speed for a given projection dimensionality.
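The core idea, projecting to a low-dimensional space and then verifying exact distances for the most promising candidates, can be sketched as follows. This is a flat, illustrative stand-in: the real SRS retrieves projected-space candidates incrementally via a cover tree rather than ranking all points.

```python
import numpy as np

def projected_search(data, query, m=8, max_verify=100, topk=10, seed=0):
    """Rank all points by distance in an m-dimensional Gaussian
    projection, then verify exact distances for the first max_verify
    candidates (early-termination test omitted, as in our experiments)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((data.shape[1], m)) / np.sqrt(m)
    p_data, p_query = data @ proj, query @ proj
    # candidate order is determined purely in the projected space
    candidates = np.argsort(np.linalg.norm(p_data - p_query, axis=1))[:max_verify]
    exact = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(exact)[:topk]]

rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 32))
query = rng.standard_normal(32)
approx = projected_search(data, query)
true_nn = np.argsort(np.linalg.norm(data - query, axis=1))[:10]
```

Increasing `m` makes the projected distances track the true distances more faithfully (at higher cost per candidate), which is the tradeoff explored in Figures 17 and 18.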
External-Memory Version
Figure 17 plots how search performance changes with the projection dimensionality: as the projection dimensionality increases, SRS achieves better performance.
In-Memory Version
For the in-memory version, we measure speedup as the ratio of the brute-force search time to the search time. From Figure 18, we can see that larger projection dimensionalities yield higher speedup at high recall, while smaller values are faster when only moderate recall is required. Considering all aspects, projection dimensionalities from 8 to 10 provide a good tradeoff.
A.2 QALSH
Because the source code of the in-memory version of QALSH has not been released, we only test the external-memory version. We use the default settings for QALSH and tune the approximation ratio to obtain different search performance.
A.3 Scalable Graph Hashing
For SGH, we use the default settings recommended by the authors. For kernel feature construction, we use a Gaussian kernel with 300 randomly sampled points as kernel bases. We compare the search accuracy for different hash-code lengths. On most of the datasets, the shortest hash code always yields the worst search performance, and the best speedup is achieved with 128 bits.
A.4 Anchor Graph Hashing
To run 1-AGH and 2-AGH, we have to fix three parameters: the number of anchors, the number of nearest anchors considered for each point, and the hash-code length. We fix the number of anchors to 300 and the number of nearest anchors to 5.
We first compare the search performance of single-layer AGH and two-layer AGH with different hash-code lengths. For most datasets, longer hash codes obtain much better performance; hence, we compare the two AGH variants only at the longest hash-code lengths. We observe that 2-AGH provides superior performance to 1-AGH on the majority of datasets.
A.5 Neighbor-Sensitive Hashing
We set the number of pivots according to the hash-code length and used k-means++ to generate the pivots, as described by the authors. The remaining parameter was set to 1.9 times the average distance from a pivot to its closest pivot, and we tune it to get the best performance.
A.6 NAPP
Tuning NAPP involves selecting three parameters: the total number of pivots, the number of indexed pivots, and the number of pivots shared with the query. According to the experiments in [34], a total number of pivots between 500 and 2000 provides a good tradeoff, while larger values lead to long construction times; we therefore tune it within this range. The authors also recommend setting the number of indexed pivots to 32. We vary the number of shared pivots to get different search performance.
A.7 Selective Hashing
The experiments were performed with the default parameter settings provided by the authors. The total number of buckets per table is 9973, and the number of radii is set to 20. We vary the number of retrieved points and the approximation ratio to get different recalls.
A.8 OPQ
An optimized product quantizer is used to build the inverted multi-index. The product quantizer is optimized using the non-parametric solution initialized with the natural order. We vary the maximum number of data points evaluated to obtain different search tradeoffs. On most of the datasets, one subspace configuration achieves good performance, while another is better for datasets with few data points.
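How product quantization obtains a huge number of partitions from several small k-means runs can be illustrated with a toy sketch. This is hypothetical code for illustration only; OPQ additionally learns a rotation of the space, which is omitted here.

```python
import numpy as np

def pq_assign(data, n_sub, k_sub, iters=5, seed=0):
    """Product-quantization-style partitioning: run a small k-means
    independently in each subspace; a point's partition is the tuple of
    its per-subspace cluster ids, giving up to k_sub**n_sub cells at
    the cost of only n_sub small k-means runs."""
    rng = np.random.default_rng(seed)
    subs = np.array_split(np.arange(data.shape[1]), n_sub)
    codes = np.empty((len(data), n_sub), dtype=np.int64)
    for s, dims in enumerate(subs):
        x = data[:, dims]
        centers = x[rng.choice(len(x), k_sub, replace=False)].copy()
        for _ in range(iters):
            assign = np.linalg.norm(
                x[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
            for j in range(k_sub):
                if (assign == j).any():
                    centers[j] = x[assign == j].mean(axis=0)
        codes[:, s] = assign
    return codes

rng = np.random.default_rng(1)
data = rng.standard_normal((3000, 16)).astype(np.float32)
codes = pq_assign(data, n_sub=2, k_sub=16)
n_cells = len({tuple(c) for c in codes.tolist()})
# up to 16**2 = 256 cells from just two 16-way k-means runs
```

The inverted multi-index then stores, for each occupied cell, the list of point ids assigned to it, and probes cells in increasing order of distance to the query's per-subspace projections.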
A.9 VP-tree
In the NonMetricSpaceLibrary, VP-tree uses a simple polynomial pruner whose optimal parameters are produced by an auto-tuning procedure. We obtain different search performance by varying the inputs of this procedure: the expected recall and the maximum number of points in leaf nodes.
A.10 Annoy
Annoy involves only one parameter: the number of trees. Tables 7 and 8 show that the index size and construction time grow linearly with the number of trees. Search performance improves significantly when using multiple Annoy trees, but the improvement slows once the number of trees becomes large, for most of the datasets. Considering both search and index performance, we build the same number of trees for all the datasets.
Name  Tree  Trees  Trees  Trees  Trees 

Audio  0.04  0.5  2.3  4.5  9.4 
Yout  2.3  22.1  112.6  217  442 
Sift  1.8  17  85  128  352 
Gauss  8.2  77  384  608  1538 
Name  Tree  Trees  Trees  Trees  Trees 

Audio  40.5  45.9  70  100  160.1 
Yout  2347  2382  2539  2735  3128 
Sift  512  612  1058  1615  2730 
Gauss  3960  4161  5055  6171  8407 
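The effect of the number of trees can be illustrated with a toy random-split forest in the spirit of Annoy's 2-means trees. This is hypothetical code, not Annoy's implementation; among other simplifications, Annoy searches many leaves per tree via a shared priority queue, while this sketch visits a single leaf per tree.

```python
import numpy as np

def build_tree(data, ids, rng, leaf_size=10):
    """Recursive random split: pick two random points and split by the
    hyperplane between them (a cheap stand-in for a 2-means split)."""
    if len(ids) <= leaf_size:
        return ids                               # leaf: array of point ids
    a, b = data[rng.choice(ids, size=2, replace=False)]
    normal, offset = a - b, (a + b) / 2.0
    side = (data[ids] - offset) @ normal > 0
    if side.all() or not side.any():             # degenerate split: stop
        return ids
    return (normal, offset,
            build_tree(data, ids[side], rng, leaf_size),
            build_tree(data, ids[~side], rng, leaf_size))

def candidates(tree, q):
    """Descend one tree to the leaf containing q; a forest of trees
    unions these leaves into one candidate set."""
    while isinstance(tree, tuple):
        normal, offset, left, right = tree
        tree = left if (q - offset) @ normal > 0 else right
    return set(tree.tolist())

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 8)).astype(np.float32)
ids = np.arange(1000)
forest = [build_tree(data, ids, rng) for _ in range(10)]
q = rng.standard_normal(8).astype(np.float32)
cand = set().union(*(candidates(t, q) for t in forest))
# more trees -> a larger candidate union -> higher recall, larger index
```

Because each tree is built from independent random splits, the leaves rarely coincide, which is exactly the diversity that the multiple-FLANN-HKM-trees experiment in Section 9.1 failed to obtain.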
A.11 HKMeans
We randomly select the initial centers in the k-means clustering. Following the recommendations in the FLANN source code, we apply different combinations of iteration counts and branching factors to generate the recall vs. speedup tradeoff.
A.12 KD-tree
Apart from the number of randomized kd-trees, we use the default settings provided by the source code. Search performance improves as the number of trees grows, but the speedup does not increase considerably once the number of trees exceeds 8.
A.13 FLANN
FLANN defines cost as a combination of the search time, tree build time, and tree memory overhead.
We used the default search precision (90 percent) and several combinations of the two tradeoff factors. For the build-time weight, we used three values: 0, representing the case where we do not care about the tree build time; 1, for the case where the tree build time and search time have the same importance; and 0.01, representing the case where we care mainly about the search time but also want to avoid a large build time. Similarly, the memory weight was chosen to be 0 for the case where memory usage is not a concern, 100 for the case where memory use is the dominant concern, and 1 as a middle ground between the two.
When we give more weight to memory use, the search speedup is very low and almost declines to that of linear scan. For datasets of medium size, or for systems with enough memory, a memory weight of 0 provides good search performance. Due to the large performance gap between large and small memory weights, we select 0 for the memory weight in this paper.
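FLANN's configuration cost can be sketched as the following function (a simplified form after Muja and Lowe's auto-tuning description; the library's exact normalization may differ, so treat this as illustrative):

```python
def flann_cost(search_time, build_time, memory,
               wb, wm, best_search_time, best_build_time):
    """Sketch of FLANN-style auto-tuning cost: the time terms are
    normalized by the best configuration found so far, and memory is
    penalized by the weight wm."""
    time_term = (search_time + wb * build_time) \
        / (best_search_time + wb * best_build_time)
    return time_term + wm * memory

# with wm = 0, memory is free and only the normalized time term matters
# with wb = 0, build time is ignored and only search time is optimized
```

For example, `flann_cost(2.0, 10.0, 5.0, 0, 0, 1.0, 10.0)` evaluates a configuration that is twice as slow to search as the best one, with build time and memory ignored.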
A.14 Small World
Small World involves the parameter NN (the number of closest points connected during index construction) and a search-time parameter, the number of entry points used in the search procedure, which we tune to trade off search speed and quality. The construction algorithm is computationally expensive, and its time is roughly linear in NN. Figure 34 shows the speedup for different values of NN: a small NN provides good search performance at low recall but degrades at high recall. For most datasets, the algorithm provides a good tradeoff when NN is 10 or 20.
A.15 Hierarchical Navigable Small World
Two main parameters govern the tradeoff between index complexity and search performance: M, the size of the potential-neighbor set at each layer during indexing, and efConstruction, which controls the quality of the neighbors found during indexing. We use the default efConstruction of 200. efSearch, analogous to efConstruction, controls the search quality; we vary it to trade off search speed and quality. Figure 35 shows the search performance for different values of M. As with SW, a small M provides good search performance at low recall but degrades at high recall. We use the default value of M.
A.16 Rank Cover Tree
RCT has three parameters: the sample rate from the lower level (which determines the height of the tree), the maximum number of parents for each node, and the coverage parameter. Following the recommendations of the authors, we use their suggested values to build the rank cover tree.
A.17 KGraph
KGraph involves several parameters: K (the number of most similar objects connected to each vertex), the sample rate, the termination threshold (the loss in recall tolerable with early termination), and the number of initial entry points for the search. We use the default termination threshold of 0.002. As reported by the author, recall grows slowly beyond a certain K. Here we study the impact of K and the sample rate on performance; from Figure 37, we see that a sufficiently large K is needed for most of the datasets.
A.18 DPG
The parameter tuning of DPG is similar to that of KGraph. In the experiments, DPG uses the same settings as KGraph, except that we use half of KGraph's K so that the index size of DPG is the same as that of KGraph in the worst case.
Figure 38 shows that count-based diversification (DPG_C) achieves search performance similar to that of angular-similarity-based diversification (DPG_O).
Figure 39 compares the diversification times of counting-based and angular-based DPG: DPG_C spends much less time than DPG_O, and the improvement is especially significant for datasets with many data points.
A.19 Default Settings
Below are the default settings of the key parameters of the algorithms in the second-round evaluation in Section 8.5.

SRS. The number of projections is set to the value tuned in Section A.1.

OPQ. The number of subspaces and the number of codewords (i.e., cluster centers) per subspace are set to their defaults.

Annoy. The number of Annoy trees is set to the value chosen in Section A.10.

FLANN. We let the algorithm tune its own parameters.

HNSW. The number of connections for each point, M, is set to its default value, as discussed in Section A.15.

KGraph. By default, we use for the KNN graph index.

DPG. We use half of KGraph's K so that the index size of DPG is the same as that of KGraph in the worst case.
Appendix B Supplement for the Second Round Evaluation
Footnotes
 Also known as GNAT [11] and vocabulary tree [37].
 Though the tree structure is employed by RCT, the key idea of the RCT relies on the neighborhood information during the index construction.
 http://ss.sysu.edu.cn/~fjl/qalsh/qalsh_1.1.2.tar.gz
 https://github.com/DBWangGroupUNSW/SRS
 http://cs.nju.edu.cn/lwj
 http://www.ee.columbia.edu/ln/dvmm/downloads
 https://github.com/pyongjoo/nsh
 http://www.comp.nus.edu.sg/~dsh/index.html
 http://research.microsoft.com/enus/um/people/kahe
 https://github.com/searchivarius/nmslib
 http://arbabenko.github.io/MultiIndex/index.html
 http://www.cs.ubc.ca/research/flann
 https://github.com/spotify/annoy
 https://github.com/DBWangGroupUNSW/nns_benchmark
 https://github.com/aaalgo/kgraph
 http://lms.comp.nus.edu.sg/research/NUSWIDE.htm
 http://nlp.stanford.edu/projects/glove/
 http://www.cs.toronto.edu/~kriz/cifar.html
 http://www.cs.princeton.edu/cass/audio.tar.gz
 http://yann.lecun.com/exdb/mnist/
 http://groups.csail.mit.edu/vision/SUN/
 http://phototour.cs.washington.edu/patches/default.htm
 http://phototour.cs.washington.edu/datasets/
 http://www.cs.tau.ac.il/~wolf/ytfaces/index.html
 http://www.ifs.tuwien.ac.at/mir/msd/download.html
 http://corpustexmex.irisa.fr
 https://yadi.sk/d/I_yaFVqchJmoc
 http://vis.uky.edu/~stewe/ukbench/
 http://cloudcv.org/objdetect/
 The figures are very similar for all the tested queries.
 Our evaluation agrees with http://www.kgraph.org/index.php?n=Main.Benchmark
References
 L. Amsaleg, O. Chelly, T. Furon, S. Girard, M. E. Houle, K. Kawarabayashi, and M. Nett. Estimating local intrinsic dimensionality. In SIGKDD, 2015.
 A. Andoni, P. Indyk, T. Laarhoven, I. P. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. CoRR, abs/1509.02897, 2015.
 A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC, 2015.
 F. André, A. Kermarrec, and N. L. Scouarnec. Cache locality is not enough: High-performance nearest neighbor search with product quantization fast scan. PVLDB, 9(4):288–299, 2015.
 K. Aoyama, K. Saito, H. Sawada, and N. Ueda. Fast approximate similarity search based on degree-reduced neighborhood graphs. In SIGKDD, 2011.
 A. Babenko and V. S. Lempitsky. The inverted multi-index. In CVPR, pages 3069–3076, 2012.
 E. Bernhardsson. Annoy at github https://github.com/spotify/annoy, 2015.
 E. Bernhardsson. Benchmarking nearest neighbors https://github.com/erikbern/annbenchmarks, 2016.
 A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97–104, 2006.
 L. Boytsov and B. Naidan. Learning to prune in metric and nonmetric spaces. In NIPS, 2013.
 S. Brin. Near neighbor search in large metric spaces. In VLDB, pages 574–584, 1995.
 R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical evaluation of supervised learning in high dimensions. In ICML, pages 96–103, 2008.
 S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, 2008.
 M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, 2004.
 W. Dong. Kgraph. http://www.kgraph.org, 2014.
 W. Dong, M. Charikar, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, 2011.
 K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Computers, 24(7):750–753, 1975.
 J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, 2012.
 J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang. Selective hashing: Closing the gap between radius search and knn search. In SIGKDD, 2015.
 T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, 2014.
 R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
 J. He, S. Kumar, and S. Chang. On the difficulty of nearest neighbor search. In ICML, 2012.
 M. E. Houle and M. Nett. Rank-based similarity search: Reducing the dimensional dependence. IEEE TPAMI, 37(1):136–150, 2015.
 M. E. Houle and J. Sakuma. Fast approximate similarity search in extremely high-dimensional data sets. In ICDE, pages 619–630, 2005.
 Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2015.
 P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
 H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
 Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In IJCAI, pages 2248–2254, 2015.
 C.-C. Kuo, F. Glover, and K. S. Dhir. Analyzing and modeling the maximum diversity problem by zero-one programming. Decision Sciences, 24(6):1171–1185, 1993.
 W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
 Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Inf. Syst., 45:61–68, 2014.
 Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, 2016.
 M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2227–2240, 2014.
 B. Naidan, L. Boytsov, and E. Nyberg. Permutation search methods are efficient, yet faster search is possible. PVLDB, 8(12):1618–1629, 2015.
 Y. Park, M. J. Cafarella, and B. Mozafari. Neighborsensitive hashing. PVLDB, 9(3):144–155, 2015.
 M. Radovanovic, A. Nanopoulos, and M. Ivanovic. Hubs in space: Popular nearest neighbors in highdimensional data. Journal of Machine Learning Research, 11:2487–2531, 2010.
 G. Schindler, M. A. Brown, and R. Szeliski. Cityscale location recognition. In CVPR, 2007.
 C. Silpa-Anan and R. I. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008.
 Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. PVLDB, 8(1):1–12, 2014.
 Y. Sun, W. Wang, Y. Zhang, and W. Li. Nearest neighbor search benchmark https://github.com/DBWangGroupUNSW/nns_benchmark, 2016.
 J. Wang, W. Liu, S. Kumar, and S. Chang. Learning to hash for indexing big data  A survey. Proceedings of the IEEE, 104(1):34–57, 2016.
 J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014.
 R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In A. Gupta, O. Shmueli, and J. Widom, editors, VLDB, 1998.
 P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, pages 311–321, 1993.
 J. Zhai, Y. Lou, and J. Gehrke. Atlas: a probabilistic algorithm for high dimensional similarity search. In SIGMOD, 2011.