kNN Graph Construction:
a Generic Online Approach
Abstract
Nearest neighbor search and k-nearest neighbor graph construction are two fundamental issues that arise in many disciplines such as information retrieval, data mining, machine learning and computer vision. Despite continuous efforts over the last several decades, these two issues remain challenging. They become more and more imminent as big data emerges in various fields and expands significantly over the years. In this paper, a simple but effective solution both for k-nearest neighbor search and k-nearest neighbor graph construction is presented; namely, the two issues are addressed jointly. On the one hand, k-nearest neighbor graph construction is treated as a nearest neighbor search task: each data sample along with its k-nearest neighbors is joined into the k-nearest neighbor graph by sequentially performing nearest neighbor search on the graph under construction. On the other hand, the built k-nearest neighbor graph is used to support k-nearest neighbor search. Since the graph is built online, dynamic updating of the graph, which is not supported by most of the existing solutions, is possible. Moreover, the solution is feasible for various distance measures. Its effectiveness both as a k-nearest neighbor graph construction approach and as a k-nearest neighbor search approach is verified across various datasets of different scales and dimensions and under different metrics.
I Introduction
Given a dataset S of n samples, the kNN graph refers to the structure that keeps the top-k nearest neighbors for each sample in the dataset. It is a key data structure in manifold learning [1, 2], computer vision, machine learning and information retrieval, etc. [3]. Due to the fundamental role it plays, it has been studied for several decades. Basically, given a metric, the construction of the kNN graph amounts to finding the top-k nearest neighbors for each data sample. When it is built in a brute-force way, the time complexity is O(d·n²), where d is the dimension and n is the size of the dataset. Due to the eminence of the big data issue in various contexts, both d and n could be very large. As a result, it is computationally prohibitive to build an exact kNN graph in an exhaustive manner. For this reason, works in the literature [3, 4, 5, 6] only aim to search for an approximate but efficient solution.
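For concreteness, a minimal brute-force builder is sketched below; it performs the n−1 comparisons per sample that make the O(d·n²) cost explicit. The function and variable names (brute_force_knn_graph, dist) are illustrative only, not from any particular library.

```python
import heapq

def brute_force_knn_graph(dataset, k, dist):
    """Exact kNN graph: O(d * n^2) distance computations overall."""
    n = len(dataset)
    graph = []
    for i in range(n):
        # n - 1 distance computations of cost O(d) each for sample i.
        cand = ((dist(dataset[i], dataset[j]), j) for j in range(n) if j != i)
        graph.append(heapq.nsmallest(k, cand))  # sorted top-k (dist, id) pairs
    return graph
```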
Despite the numerous progress made in recent years, the major issues latent in kNN graph construction remain challenging. First of all, many existing approaches only perform well on low-dimensional data, and the scale of data they are assumed to cope with is usually less than one million. Moreover, most approaches are designed under a specific metric, i.e., the l2-norm. Only a few recent works [7, 3, 8] aim to address this issue in generic metric spaces. Thanks to the introduction of NN-Descent in [3], the empirical construction complexity has been reduced from the higher-order costs reported in [7] to around O(n^1.14) for data of medium dimension (e.g., 24) [3]. However, the performance of NN-Descent turns out to be unstable for data with high intrinsic dimension.
Besides the major issues aforementioned, many existing approaches face another potential issue. In practice, the dataset may change from time to time. This is especially the case for large-scale Internet applications. For instance, the photos and videos on Flickr grow on a daily basis. In visual object tracking [9], new object templates are joined into the candidate set and obsolete templates should be swapped out as the tracking continues. In these scenarios, one would expect the kNN graph that works behind the scenes to be updated from time to time. Unfortunately, for most of the existing approaches, the dataset is assumed to be fixed, and any update on the dataset invokes a complete reconstruction of the kNN graph. As a consequence, the aggregated cost is high even when the dataset is small-scale. It would be more convenient if one were allowed to simply insert/remove samples into/from the existing kNN graph. Nevertheless, it is complicated to update the kNN graph with the support of conventional indexing structures such as locality sensitive hashing [10], R-Tree [11] or k-d tree [12].
A recent study [13] shows that it is possible to build the kNN graph incrementally by invoking k-nearest neighbor search directly on an existing kNN graph. Unfortunately, only a limited speed-up (10 to 20 times) is observed in [13]. In order to support fast indexing, k-medoids is called in that approach to partition the samples already in the kNN graph, which becomes very slow when both d and n are large. Moreover, since it is built upon k-medoids, it only works in certain metric spaces.
Another problem that is closely related to kNN graph construction is k-nearest neighbor search (kNN search), which, like kNN graph construction, arises from a wide range of applications. The nearest neighbor search problem is defined as follows: given a query vector q and n candidate vectors of the same dimensionality, return the sample(s) closest to the query according to a given metric.
Traditionally, this issue has been addressed by various space partitioning strategies. However, these methods are hardly scalable to high-dimensional (say, beyond a few tens of dimensions), large-scale and dense vector spaces. In such cases, most of the traditional approaches such as the k-d tree [12], R-tree [11] and locality sensitive hashing (LSH) [10] are unable to return decent results.
Recently, there have been two major trends in the literature that aim to address this issue. In one direction, kNN search is undertaken based on vector quantization [14, 15, 16]. The primary goal of this line of work is to compress the reference set by vector quantization, such that it is possible to load the whole reference set (after compression) into memory even when the reference set is extremely large. The distance between the query and the reference set is measured in the compressed space, which is more efficient than measuring it in the original space. Due to the quantization loss, high accuracy is hardly achievable with this type of approach. Alternatively, another more promising way is to conduct kNN search based on an approximate kNN graph [17, 18, 4] or the like [19, 20] with a hill-climbing strategy [21].
In this paper, a generic kNN graph construction approach is presented. The kNN graph construction is treated as a kNN search task. The kNN graph is incrementally built by invoking each sample to query against the kNN graph under construction. After one round of kNN search, the query sample is joined into the graph with the resulting top-k nearest neighbors. The kNN lists of samples (already in the graph) that are visited during the search are updated accordingly. The kNN search basically follows the hill-climbing strategy [21]. In order to achieve high performance in terms of both efficiency and quality, two major innovations are proposed.

• The hill-climbing procedure is undertaken on both the kNN graph and its reverse kNN graph. In order to avoid the high cost of converting the intermediate kNN graph to its reverse kNN graph each time, the data structure "orthogonal list" is adopted, in which the kNN graph and the reverse kNN graph are maintained as a whole.

• To further boost the performance, a lazy graph diversification (LGD) scheme is proposed. It helps to avoid unnecessary distance computations during the hill-climbing while involving no additional computations.
The advantages of this approach are severalfold. Firstly, the online construction avoids the repetitive distance computations that most current kNN graph construction approaches suffer from. Secondly, online construction is particularly suitable for scenarios where the dataset changes dynamically. Moreover, our approach places no restriction on the distance measure; it is therefore a generic solution, which is confirmed in our experiments. Thanks to the two aforementioned innovations, the kNN search turns out to be very cost-effective. When the graph update operations are turned off, it is also an effective kNN search algorithm. Namely, the problems of kNN graph construction and kNN search are jointly addressed in our solution.
The remainder of this paper is organized as follows. In Section II, a brief review of the research on kNN graph construction and approximate kNN search is presented. Section III presents an enhanced hill-climbing algorithm upon which the kNN graph construction approach is built. Section IV presents two online kNN graph construction approaches, which also serve as NN search approaches when the graph update operations are turned off. The experimental studies of kNN graph construction and kNN search are presented in Section V. Section VI concludes the paper.
II Related Works
II-A kNN Search
The early study of the kNN search issue can be traced back to the 1970s, when the need for NN search on file systems arose. In those days, the data to be processed were of very low dimension, typically 1D. This problem is well addressed by the B-Tree [22] and its variant the B+-tree [22], based on which the NN search complexity could be as low as O(log n). The B-tree is not naturally extensible beyond the 1D case. More sophisticated indexing structures were designed to handle NN search on multi-dimensional data. Representative structures are the k-d tree [12], R-tree [11] and R*-tree [23]. For the k-d tree, a pivot dimension is selected each time to split the dataset evenly into two. By applying this bisection repeatedly, the hyperspace is partitioned into embedded hierarchical subspaces. The NN search is performed by traversing one or several branches to probe the nearest neighbors. Unlike the B-tree in the 1D case, the partition scheme does not exclude the possibility that a nearest neighbor resides outside the candidate subspaces. Therefore, extensive probing over a large number of branches in the tree becomes inevitable. For this reason, NN search with the k-d tree and the like could be very slow. The recent indexing structure FLANN [24, 25] partitions the space with hierarchical k-means and multiple k-d trees. Although efficient, it achieves suboptimal results.
For all the aforementioned tree partitioning methods, another major disadvantage lies in their heavy demand for memory. On the one hand, in order to support fast comparison, all the candidate vectors are loaded into memory. On the other hand, the tree nodes that are used for indexing also take up a considerable amount of extra memory. Overall, the memory consumption is usually several times larger than the size of the reference set.
Aiming to reduce memory consumption, quantization based approaches [26, 14, 27, 16, 28] compress the reference vectors by quantization [29]. All the quantization based methods share two things in common. Firstly, the candidate vectors are all compressed via vector (or sub-vector) quantization, which makes it easier than for previous methods to hold the whole dataset in memory. Secondly, NN search is conducted between the query and the compressed candidate vectors. The distance between the query and a candidate is approximated by the distance between the query and the vocabulary words used for quantization. Due to the heavy compression of the reference vectors, high search quality is hardly achievable. Furthermore, this type of approach is only suitable for the l2-norm.
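To illustrate how such approximation works, below is a minimal sketch of asymmetric distance computation in the spirit of product quantization [14], assuming each vector is split into m sub-vectors (d divisible by m) and each sub-vector is encoded by the index of its nearest centroid. All names are illustrative and not the API of any PQ library.

```python
import numpy as np

def adc_distances(query, codebooks, codes):
    """codebooks: (m, 256, d/m) centroids; codes: (n, m) integer codes."""
    m = codebooks.shape[0]
    sub_queries = np.split(query, m)          # m sub-vectors of the query
    # Lookup table: table[j][c] = squared distance from the j-th query
    # sub-vector to centroid c of the j-th codebook.
    table = np.stack([((codebooks[j] - sub_queries[j]) ** 2).sum(axis=1)
                      for j in range(m)])
    # Approximate distance to each database vector: sum of table entries
    # indexed by its codes; no decompression of the vectors is needed.
    return sum(table[j][codes[:, j]] for j in range(m))
```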
Apart from the above approaches, several attempts have been made to apply LSH [10, 30] to NN search. In general, two steps are involved in the search stage: step 1 collects the candidates that share the same or similar hash keys as the query; step 2 performs an exhaustive comparison between the query and all the selected candidates to find the nearest neighbor. Similar to FLANN, the computational cost remains high if one expects high search quality. Additionally, the design of hash functions that are feasible for various metrics is non-trivial.
Recently, the hill-climbing (HC) strategy that performs NN search based on a kNN graph has also been explored [21, 17, 19, 8, 4]. The hill-climbing starts from a group of random seeds (random locations in the vector space). The search traverses iteratively over an approximate kNN graph (built in advance) by best-first search. Guided by the kNN graph, the search procedure ascends closer to the true nearest neighbor in each round until no better candidate can be found. The approaches in [21, 20, 19, 4] follow a similar search procedure; the major difference lies in the strategies that are used to build the graph. According to a recent study [31], these graph based approaches demonstrate superior performance over other types of approaches across various types of data.
II-B kNN Graph Construction
The approaches for kNN graph construction can be roughly grouped into two categories. Approaches such as [6, 32] basically follow the divide-and-conquer strategy. In the first step, samples are partitioned into a number of small subsets, and exhaustive comparisons are carried out within each subset. The closeness relations (viz. edges in the kNN graph) between any two samples in one subset are established. In the second step, these closeness relations are collected to build the kNN graph. To enhance the performance, the first step is repeated several times with different partitions, and the produced closeness relations are used to update the kNN graph. Since it is hard to design a partition scheme that is feasible in generic spaces, these approaches are in general only effective in l2 space. Another category of kNN graph construction, typified by NN-Descent [3], avoids this disadvantage. The graph construction in NN-Descent starts from a random kNN graph. Based on the principle that "a neighbor's neighbor is likely to be a neighbor", the kNN graph evolves by invoking comparisons between samples in each sample's neighborhood. Better closeness relations produced in the comparisons are used to update the neighborhood of a sample. This approach turns out to be generic and efficient. Essentially, it can be viewed as performing hill-climbing [21] in batch. Recently, a mixture scheme derived from the above two categories has also appeared in the literature [4].
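The core NN-Descent update can be sketched in a few lines. The following is a simplified single pass, under the assumption that graph[u] is a sorted list of (distance, id) pairs; the real algorithm [3] adds sampling, reverse neighbors and convergence tests.

```python
def nn_descent_pass(graph, data, dist, k):
    """One pass: compare members of each neighborhood against each other."""
    updated = False
    for u in range(len(data)):
        neighbors = [v for _, v in graph[u]]
        for i, a in enumerate(neighbors):
            for b in neighbors[i + 1:]:
                d = dist(data[a], data[b])
                # "A neighbor's neighbor is likely to be a neighbor":
                # try to insert b into a's kNN list and vice versa.
                updated |= try_insert(graph[a], (d, b), k)
                updated |= try_insert(graph[b], (d, a), k)
    return updated

def try_insert(nn_list, item, k):
    """Keep nn_list a sorted, duplicate-free list of at most k entries."""
    d, v = item
    if any(v == w for _, w in nn_list):
        return False
    if len(nn_list) < k or d < nn_list[-1][0]:
        nn_list.append(item)
        nn_list.sort()
        del nn_list[k:]
        return True
    return False
```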
It is worth noting that the approaches proposed in [19, 20, 8, 31] are not kNN graph construction algorithms. Their graphs are built primarily for k-nearest neighbor search. In these approaches, samples that should be in the kNN list of a sample are deliberately omitted for efficiency, while links to remote neighbors are maintained [19, 20]. As a consequence, the graphs constructed by these approaches are not kNN graphs in the real sense. Such graphs are hardly supportive of tasks beyond kNN search.
For most of the aforementioned approaches, one potential issue is that the construction procedure has to keep records of the comparisons that have been made between sample pairs to avoid repetitive comparisons. However, the space complexity of doing so could be as high as O(n²); otherwise, repetitive comparisons are inevitable even when adopting specific sampling schemes [3]. In this paper, kNN graph construction and kNN search are addressed jointly. The kNN graph construction is undertaken by invoking each sample as a query against the kNN graph that is under construction. Since the query is new to the graph under construction each time, the issue of repetitive comparisons is overcome. More interestingly, we discover that kNN graph construction and kNN search are beneficial to each other: high quality kNN search leads to high quality intermediate kNN graphs; in turn, the efficiency and quality of the kNN search are guaranteed with the support of high quality intermediate kNN graphs.
III Baseline Search Algorithm
Since our kNN graph construction approach is based on kNN search, before we introduce the construction approach, the baseline kNN search algorithm that operates on a pre-built kNN graph is presented. It essentially follows the hill-climbing strategy [21], while considerable modifications have been made.
Algorithm 1
Enhanced Hill-Climbing NN Search (EHC)

 1: Input: q: query, G: kNN graph, Ḡ: reverse kNN graph, S: reference set
 2: Output: Q: kNN list of q
 3: Q ← ∅; Flag[1…n] ← 0;
 4: while Q is updated do
 5:   R[1…p] ← p random seeds drawn from S
 6:   for each r ∈ R do
 7:     InsertQ(r, q, Q);
 8:   end for
 9:   for r ∈ Q do
10:     if Flag[r] == 0 then
11:       for each s ∈ G[r] do
12:         InsertQ(s, q, Q);
13:       end for
14:       for each s ∈ Ḡ[r] do
15:         InsertQ(s, q, Q);
16:       end for
17:     end if
18:     Flag[r] = 1;
19:   end for
20:   it = it + 1;
21: end while
Given kNN graph G, G[r] returns the kNN list of sample r. Accordingly, Ḡ is the reverse kNN graph of G, which is nothing more than a reorganization of graph G: Ḡ[r] keeps the IDs of the samples in whose NN lists sample r appears. Notice that the size of Ḡ[r] is not necessarily k and there may be overlap between G[r] and Ḡ[r]. An illustration of graphs G and Ḡ is given in Fig. 1. With the support of kNN graph G and its reverse graph Ḡ, the baseline search algorithm is presented in Alg. 1.
As shown in Alg. 1, the hill-climbing starts from p random seeds. The compared samples are kept in Q, which is kept sorted in ascending order all the way (without loss of generality, it is assumed throughout the paper that the smaller the distance, the closer two samples are). The query is first compared to the seeds; then the neighbors and reverse neighbors of the unvisited sample r that ranks closest to the query are compared sequentially (Lines 5–19). Notice that Q is updated as soon as a closer sample is found. The distance function could be any distance metric defined on the input dataset. This search algorithm is generic in the sense that there is no specification or assumption related to the distance metric; this is also true for the kNN graph construction algorithm that will be presented afterwards. The iteration (Lines 4–21) continues until no closer sample is identified. Although the algorithm starts from random locations, only minor performance fluctuation is observed across different runs. In order to avoid repetitive comparisons, the variable Flag keeps the status of whether a sample has been expanded.
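For readers who prefer code, here is a compact sketch of Alg. 1 under stated assumptions: graph[r] and rgraph[r] return the kNN and reverse kNN lists of r as plain id lists, and all names are illustrative rather than from any library.

```python
import random

def insert_q(Q, item, k):
    """Keep Q a sorted, duplicate-free list of at most k (dist, id) pairs."""
    d, v = item
    if any(v == w for _, w in Q):
        return False
    if len(Q) < k or d < Q[-1][0]:
        Q.append(item); Q.sort(); del Q[k:]
        return True
    return False

def ehc_search(q, data, graph, rgraph, dist, k, p):
    Q, expanded = [], set()
    updated = True
    while updated:                                   # Lines 4-21 of Alg. 1
        updated = False
        for r in random.sample(range(len(data)), p): # p random seeds
            updated |= insert_q(Q, (dist(q, data[r]), r), k)
        for _, r in list(Q):                         # closest-first expansion
            if r in expanded:
                continue
            expanded.add(r)
            for s in graph[r] + rgraph[r]:           # neighbors + reverse ones
                updated |= insert_q(Q, (dist(q, data[s]), s), k)
    return Q                                         # approximate kNN of q
```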
IV Online kNN Graph Construction
The prerequisites for the kNN search algorithm in Alg. 1 are the kNN graph G and its reverse graph Ḡ. In this section, we show how a kNN graph and its reverse graph are built based on the kNN search algorithm itself. In addition, a more efficient scheme for kNN search, and in turn for kNN graph construction, is presented. Finally, we discuss how a further performance boost is achieved by integrating schemes proposed in [3].
IV-A kNN Graph Construction by Search
In Alg. 1, when the hill-climbing is conducted, the comparison to candidates starts from several random locations; moving along several trails, the search approaches the close neighborhood of the query. In the ideal case, the top-k nearest neighbors are discovered. On the one hand, the top-k nearest neighbors of the query are known after the search. On the other hand, the query can be joined into the neighborhoods of the samples visited during the search as a new neighbor. As a result, a kNN graph is augmented to include this new sample. Motivated by this observation, the online construction algorithm is conceived.
Generally, there are two major steps in the algorithm. Firstly, an initial graph is built exhaustively from a small subset S0 of S; the size of S0 is fixed to 256 across the paper. In the second step, each of the remaining samples is treated as a query against the graph, following the flow of Alg. 1. Each time, the kNN list of one sample is joined into the graph currently being searched over. The kNN lists of samples that have been compared during the search are updated accordingly. The general procedure of the construction algorithm is summarized in Alg. 2.
Algorithm 2
Online kNN Graph Construction (OLG)

 1: Input: S: dataset; k: size of NN list; p: num. of seeds
 2: Output: G: kNN graph
 3: Q ← ∅; Flag[1…n] ← 0;
 4: Extract a small subset S0 from S;
 5: Initialize G and Ḡ with S0;
 6: S ← S − S0;
 7: for each q ∈ S do
 8:   while Q is updated do
 9:     R[1…p] ← p random seeds drawn from G
10:     for each r ∈ R do
11:       InsertQ(r, q, Q);
12:     end for
13:     for r ∈ Q do
14:       if Flag[r] == 0 then
15:         for each s ∈ G[r] do
16:           InsertQ(s, q, Q);
17:           InsertG(s, q, G, Ḡ);
18:         end for
19:         for each s ∈ Ḡ[r] do
20:           InsertQ(s, q, Q);
21:           InsertG(s, q, G, Ḡ);
22:         end for
23:       end if
24:       Flag[r] = 1;
25:     end for
26:   end while
27:   for each r ∈ Q do
28:     InsertG(q, r, G, Ḡ);
29:   end for
30:   Q ← ∅; Flag[1…n] ← 0;
31: end for
In order to support efficient search and construction, an orthogonal list (as shown in Fig. 2) is adopted in the implementation to keep both G and Ḡ. As shown in Fig. 2, two linked lists are kept for each sample: one is the kNN list and the other is the reverse kNN list. For clarity, graphs G and Ḡ are still referred to separately in our description.
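A minimal sketch of this bookkeeping is given below, assuming each node keeps a sorted kNN list plus a set of reverse neighbor ids; insert_g mirrors the insertG() function described next, and all names are illustrative.

```python
class GraphNode:
    def __init__(self):
        self.knn = []     # sorted list of (distance to this sample, neighbor id)
        self.rnn = set()  # ids of samples whose kNN lists contain this sample

def insert_g(nodes, r, q, d, k):
    """Try to insert sample q with distance d into the kNN list of sample r."""
    node = nodes[r]
    if any(q == v for _, v in node.knn):
        return                          # q is already a neighbor of r
    if len(node.knn) == k:
        if d >= node.knn[-1][0]:
            return                      # q is farther than the current tail
        _, tail = node.knn.pop()        # evict the sample at the rear ...
        nodes[tail].rnn.discard(r)      # ... together with its reverse link
    node.knn.append((d, q))
    node.knn.sort()                     # distances are kept, list stays sorted
    nodes[q].rnn.add(r)                 # maintain the reverse kNN list of q
```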
In Alg. 2, the procedure of kNN graph construction is basically a repetitive call of the search algorithm in Alg. 1 (Lines 10–26). The major difference is that the function InsertG(r, q, G, Ḡ) is called after the query is compared to a sample in the graph. Function InsertG() is responsible for inserting an edge ⟨r, q⟩ into the kNN list of r in graphs G and Ḡ. The major operations inside InsertG() involve updating the kNN list of r and the reverse kNN list of q. The sample at the rear of the kNN list is removed if a closer sample joins in. The distance is kept in the kNN list of r, which allows the list to be kept sorted all the time. At the end of each loop, InsertG() is called again to join the kNN list of the query into G.
Although the size of the input dataset is fixed in Alg. 2, it is apparently feasible for an open set, where new samples are allowed to join from time to time. As will be revealed in the experiments (Section V), Alg. 2 already performs quite well. In the following, a novel scheme is presented to further boost its performance.
IV-B Lazy Graph Diversification
In Alg. 1, when expanding sample r in the rank list Q, all the samples in the neighborhood of r are compared with the query. According to recent studies [31, 20, 8], when samples in the neighborhood of r are so close to each other that the distances between them are smaller than their distances to r, there is no need to compare the query to all of them during the expansion: expanding these close samples later will likely lead the climbing process to the same local region. The phenomenon that samples in a kNN list are closer to each other than they are to r is called "occlusion" [8]. An illustration of occlusion is shown in Fig. 3, where two of the neighbors are occluded by a sample closer to r. It is easy to see that one sample can only be occluded by samples that are closer to r than it is. According to [31, 20, 8], the hill-climbing will be more efficient when such occluded samples are not considered when expanding r.
In order to know whether the samples in a kNN list are occluded by one another, pairwise comparisons between the samples in the kNN list are required, which is the practice in [31, 8, 20]. This is infeasible for an online construction procedure (i.e., Alg. 2). First of all, the kNN list changes dynamically, so the pairwise distances cannot simply be computed and kept once and for all. Secondly, it is too costly to update the pairwise occlusion relations whenever a new sample joins, as this involves a complete comparison between the new neighbor and the rest. Moreover, the occluded samples cannot simply be removed from a kNN list, since our primary goal is to build the kNN graph.
In this paper, a novel scheme called lazy graph diversification is proposed to identify the occlusions between samples during online graph construction. To achieve this, an occlusion factor, denoted as φ in the following, is introduced as an attribute attached to each sample in a kNN list. The φs of all the neighbors are initialized to 0 when the kNN list of the current query is joined into the graph. Factor φ will be updated when another new sample joins the kNN list at a later stage. Given a new sample q to be inserted into sample r's kNN list, we should know its distances to the other neighbors in the list as well as the distances of all the neighbors to r. The distances to r are known, as they are kept for sorting, while the distances between the new sample and the rest are unknown. Instead of performing a costly thorough comparison between q and the remaining neighbors, we make use of the distances that are computed during the hill-climbing iteration. To do that, another variable D is introduced to keep the distances between sample q and the samples that have been compared during the hill-climbing. It is possible that not all the samples in the kNN list of r joined in the comparisons. However, according to the principle that "a neighbor of a neighbor is likely to be a neighbor", the "old" samples in the kNN list of r are likely to have been encountered by q. With the support of D, the occlusion factors of all the samples in the kNN list are updated with the following three rules.

• Rule 1: φ is kept unchanged for samples ranked before q;

• Rule 2: φ of q is incremented by 1 for each sample ranked before q that is closer to q than q is to r;

• Rule 3: φ of a sample ranked after q is incremented by 1 if its distance to q is smaller than the distance from q to r.
The default distance of each sample to q is set to +∞. As a result, the φs of not-yet-visited neighbors will not be updated according to Rule 1 and Rule 3. This is reasonable because the not-yet-visited neighbors should be sufficiently far away from q; otherwise they would have already been visited according to the principle that "a neighbor of a neighbor is also likely to be a neighbor". Since all the possible distances (between q and samples in the graph) are only available after the hill-climbing converges, the operations of inserting q into the kNN list of r and updating the φ factors in the list are postponed until after the search. Fig. 4 illustrates a search trail formed by the hill-climbing; the LGD operations are applied in the k-neighborhood of the query.
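The three rules can be sketched as follows, assuming r's kNN list stores (distance to r, id, φ) triples sorted by distance, and D maps sample ids to their distance from q (math.inf for samples never met during the hill-climbing). The function name and layout are illustrative; truncation of the list to length k and the reverse-list maintenance are omitted for brevity.

```python
import math

def lgd_insert(knn_list, q, d_qr, D):
    """Insert q (at distance d_qr from r) into r's kNN list with LGD rules."""
    phi_q = 0
    # Position q would take in the list sorted by distance to r.
    pos = next((i for i, (d, _, _) in enumerate(knn_list) if d > d_qr),
               len(knn_list))
    for i, (d, s, phi) in enumerate(knn_list):
        d_qs = D.get(s, math.inf)       # distance between q and s, if known
        if i < pos:
            # Rule 1: phi of samples ranked before q stays unchanged.
            # Rule 2: q is occluded by each closer-ranked sample that is
            #         nearer to q than q is to r.
            if d_qs < d_qr:
                phi_q += 1
        elif d_qs < d_qr:
            # Rule 3: a sample ranked after q gets occluded by q.
            knn_list[i] = (d, s, phi + 1)
    knn_list.insert(pos, (d_qr, q, phi_q))
```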
Once the occlusion factors are available, the search algorithm (Alg. 1) and the construction algorithm (Alg. 2) are modified accordingly. When the query is compared to the neighbors in a kNN list, we only consider the samples whose φ is no greater than the average occlusion factor of that list. Notice that this is different from the way proposed in [8, 20], in which the occluded samples (those with φ > 0) are simply omitted. We find that such a rule is too restrictive in our case. In contrast, only samples that are occluded by many other neighbors are ignored. The operation of skipping samples with a high φ can be interpreted as performing diversification on the graph [31]. Different from [31], the diversification is undertaken in a lazy way in our case. This scheme is therefore called lazy graph diversification (LGD), and the three rules used to calculate the occlusion factor are called the LGD rules. The kNN graph construction algorithm with LGD is accordingly named LGD, and is summarized in Alg. 3.
Algorithm 3
Online kNN Graph Construction with LGD (LGD)

 1: Input: S: dataset; k: size of NN list; p: num. of seeds
 2: Output: G: kNN graph
 3: Q ← ∅; D[1…n] ← +∞; E[1…n] ← 0;
 4: Extract a small subset S0 from S;
 5: Initialize G and Ḡ with S0;
 6: S ← S − S0;
 7: for each q ∈ S do
 8:   while Q is updated do
 9:     R[1…p] ← p random seeds drawn from G
10:     for each r ∈ R do
11:       InsertQ(r, q, Q);
12:     end for
13:     for r ∈ Q do
14:       if E[r] == 0 then
15:         for each s ∈ G[r] && φ(s) ≤ avg. φ of G[r] do
16:           InsertQ(s, q, Q);
17:           D[s] = d(q, s);
18:         end for
19:         for each s ∈ Ḡ[r] && φ(s) ≤ avg. φ of Ḡ[r] do
20:           InsertQ(s, q, Q);
21:           D[s] = d(q, s);
22:         end for
23:       end if
24:       E[r] = 1;
25:     end for
26:   end while
27:   for each visited r do
28:     UpdateG(r, q, G, Ḡ);
29:   end for
30:   for each r ∈ Q do
31:     InsertG(q, r, G, Ḡ);
32:   end for
33:   Q ← ∅; D[1…n] ← +∞; E[1…n] ← 0;
34: end for
Compared to Alg. 2, Alg. 3 differs in three major aspects. In Alg. 3, the query sample q is compared only to samples whose occlusion factor is no greater than the average factor, in both the kNN list and the reverse kNN list (see Lines 15 and 19). After q is compared to a sample s in the kNN list, the kNN list of s is not updated immediately. Instead, the distance from q to sample s is collected into D (Lines 17 and 21) for later use. The updates of the kNN lists for all the samples encountered so far are postponed to the end of the kNN search (Lines 27–32). Function UpdateG() is basically similar to InsertG(); the additional operation inside UpdateG() is to update the φs of all the neighbors according to the LGD rules. It is easy to see that Alg. 3 becomes a fast kNN search algorithm when the UpdateG and InsertG operations are turned off.
IV-C Sample Removal from the Graph
In practice, samples should be allowed to drop out of the kNN graph. A good example is maintaining a kNN graph of product photos for an e-shopping website, where old-fashioned products should be withdrawn from sale. Dynamic removal of samples from the kNN graph is supported in our approach. If the graph is built by Alg. 2, the removal operation is as easy as deleting the sample from the kNN lists of its reverse neighbors and releasing its own kNN list. If the graph is built by Alg. 3, the occlusion factors of the samples living in the same kNN lists have to be updated before the sample is deleted. Fortunately, not all the samples in a list have to be considered: according to LGD Rule 3, only samples ranked after the current sample need to be considered. The update operation involves only a limited number of distance computations (on the order of k² in the worst case). Given that k is a small constant, the time cost is much lower than that of fulfilling a query. Notice that this kind of dynamic update is not supported by other online algorithms [13, 20], in which the deletion operation may lead to a collapse of the indexing structure.
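Under the orthogonal-list layout sketched earlier, the removal operation could look as follows, assuming the kNN lists store (distance, id, φ) triples as in the LGD sketch; remove_sample and its helpers are illustrative names.

```python
def remove_sample(nodes, x, data, dist):
    """Drop sample x from a graph built with Alg. 3 style bookkeeping."""
    for r in list(nodes[x].rnn):                 # every list that contains x
        knn = nodes[r].knn
        pos = next(i for i, (_, s, _) in enumerate(knn) if s == x)
        d_xr = knn[pos][0]
        del knn[pos]
        # LGD Rule 3: only samples ranked after x may have been occluded
        # by x; undo their increments where the occlusion condition held.
        for i in range(pos, len(knn)):
            d, s, phi = knn[i]
            if dist(data[s], data[x]) < d_xr:
                knn[i] = (d, s, phi - 1)
    for _, s, _ in nodes[x].knn:                 # release x's own kNN list
        nodes[s].rnn.discard(x)
    nodes[x].knn.clear()
    nodes[x].rnn.clear()
```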
Discussions. The two kNN graph construction approaches, namely OLG (Alg. 2) and LGD (Alg. 3), in general follow the same framework. In both processes, the construction starts from a small-scale kNN graph of 100% quality. The search process appends the kNN list of a new sample to the graph each time. At the same time, the kNN lists of the already inserted samples will possibly be updated when the new sample happens to fall in their neighborhoods. It is therefore a win-win situation for both graph construction and NN search: an effective search procedure returns a high quality kNN list, while a kNN graph of high quality gives good guidance to the hill-climbing process.
Besides the size k of the NN list, there is another parameter involved in OLG and LGD, namely the number of seeds p. Usually, the size of the NN list should be no less than the intrinsic data dimension, which is less than or equal to the data dimension d. The number of seeds p is usually set to be no bigger than k. When d is very high (i.e., several hundreds to thousands) and the intrinsic dimension is close to d, the construction process could be slow when k is set close to the intrinsic dimension. In such a situation, a good trade-off is hardly achieved between the quality of the kNN graph and the efficiency of the construction.
IV-D Refinement on kNN Graph
Although Alg. 3 already performs well in different cases, it still suffers from the scalability issue that most kNN graph construction approaches face: when the size or the dimension of a dataset increases, the quality of the kNN graph drops steadily. In order to alleviate this problem, following the scheme in NN-Descent [3], it is possible to undertake pairwise comparisons within each kNN list once the kNN graph is built. By such post-processing, missed true neighbor pairs can be recovered and the quality of the kNN graph is enhanced. It is also possible to perform this kind of refinement periodically during the online construction; for instance, the refinement routine is called after every 10 thousand insertion operations in either Alg. 2 or Alg. 3. The frequency of calling the refinement is a trade-off between efficiency and graph quality. Notice that this kind of refinement is unnecessary for the simple kNN graph construction problem.
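The refinement step is essentially one NN-Descent style pass over the finished lists (cf. the sketch in Section II-B). A minimal scheduling wrapper might look as follows, where online_insert, state and the period constant are hypothetical placeholders for the Alg. 2/Alg. 3 insertion machinery.

```python
REFINE_PERIOD = 10_000   # e.g., refine after every 10k insertions

def insert_with_refinement(state, q):
    online_insert(state, q)            # one Alg. 2 / Alg. 3 style insertion
    state.count += 1
    if state.count % REFINE_PERIOD == 0:
        # Pairwise comparisons within each kNN list, as in NN-Descent [3].
        nn_descent_pass(state.graph, state.data, state.dist, state.k)
```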
V Experiments
In this section, the performance of the proposed algorithms is studied both for kNN graph construction and for approximate nearest neighbor search. In the evaluation, the performance is reported on both synthetic random data and real-world data. It is believed that the intrinsic dimension of the synthetic data is roughly equal to the data dimension [3, 8], while for real-world datasets the intrinsic dimension is usually smaller than the data dimension [8]. Brief information about the datasets adopted in the evaluation is summarized in Table I.
In the nearest neighbor search task, the performance of the proposed search approaches is studied in comparison to representative approaches of different categories. The graph based approaches are DPG [31] and HNSW [20]. For locality sensitive hashing [10, 33], SRS [34] is considered. For quantization based approaches, the product quantizer (PQ) [14] is incorporated in the comparison. FLANN [24] and Annoy [35] are selected as representative tree partitioning approaches; both are popular NN search libraries in the literature.
V-A Evaluation Protocol
For kNN graph construction, five synthetic random datasets sized 100K are used in the evaluation. Their dimensions range from 2 to 100. The data in each dimension are independently drawn under a uniform distribution, which guarantees that the intrinsic dimension of the synthetic data largely equals the data dimension. The top-1 (recall@1) and top-10 (recall@10) recalls on each dataset are studied under the l1 and l2 metrics respectively. Given that function f(r, k) returns the number of true-positive neighbors in the top-k NN list of sample r, the recall at top-k on the whole set S is given as
    recall@k = (Σ_{r∈S} f(r, k)) / (n · k).    (1)
TABLE I: Overview of the datasets adopted in the evaluation.

| Name | # Qry | m(·) | Type |
|---|---|---|---|
| Rand100K | | | synthetic |
| Rand100K | | | synthetic |
| SIFT1M [14] | | | SIFT [36] |
| SIFT10M [14] | | | SIFT |
| GIST1M [37] | | | GIST [37] |
| GloVe1M [38] | | Cosine | Text |
| NUSW [39] | | | BoVW [40] |
| NUSW [39] | | | BoVW |
| YFCC1M [41] | | | Deep Feat. |
| Rand1M | | | synthetic |
Besides the kNN graph quality, the construction cost is also studied by measuring the scanning rate [3] of each approach. Given that C is the total number of distance computations in the construction, the scanning rate c is defined as
    c = C / (n(n − 1)/2).    (2)
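Both measures are straightforward to compute; a sketch is given below, assuming graph[r] and truth[r] hold the approximate and exact top-k neighbor ids of sample r (names are illustrative).

```python
def recall_at_k(graph, truth, k):
    """Eq. (1): fraction of true top-k neighbors recovered over the set."""
    n = len(graph)
    hits = sum(len(set(graph[r][:k]) & set(truth[r][:k])) for r in range(n))
    return hits / (n * k)

def scanning_rate(C, n):
    """Eq. (2): distance computations C relative to all n*(n-1)/2 pairs."""
    return C / (n * (n - 1) / 2)
```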
In addition, another seven datasets are adopted to evaluate the performance of both nearest neighbor search and kNN graph construction. Among them, six datasets are derived from real-world images or text data. In particular, all four datasets (namely GIST1M, GloVe1M, NUSW and Rand1M) that are marked as the most challenging datasets in [31] are adopted in the evaluation. For each dataset, another 1,000 or 10,000 queries of the same data type are prepared. Different metrics such as l1, l2 and Cosine are adopted in accordance with the data type of each set. The search quality is measured by the top-1 recall for the first nearest neighbor. In order to make our study comparable under different hardware settings, the search quality is reported along with the speed-up an approach achieves over brute-force search. All the codes of the different approaches considered in this study are compiled by g++ 5.4. In order to make the comparison fair, we disable all multi-threading, SIMD and prefetching instructions in the codes. All the experiments are carried out on a PC with a 3.6GHz CPU and 32GB of memory.
V-B Performance of the Baseline NN Search
In the first experiment, the focus is to verify the effectiveness of the baseline search algorithm upon which our kNN graph construction is built. Four different configurations are tested. Firstly, the NN search is supplied with the pre-built kNN graph from NN-Descent [3] and with the true kNN graph. In the experiment, k is fixed to 40. The experiment is carried out on the SIFT1M and GIST1M datasets. The quality of the 40-NN graph from NN-Descent is above 0.90 in terms of its top-1 and top-10 recall for both datasets. These kNN graphs are supplied to hill-climbing (HC) [7] and enhanced hill-climbing (EHC, Alg. 1). EHC differs from HC mainly in the use of the reverse kNN graph during the search. The search performance in terms of recall@1 is shown in Fig. 5. The performance is compared to the state-of-the-art approach HNSW [20].
As seen from the figure, there is a significant performance gap between EHC and HC. The NN list of each sample in EHC is usually longer than that of HC due to the incorporation of reverse nearest neighbors. On the one hand, EHC has to visit more samples during the expansion and should therefore be slower. On the other hand, the reverse nearest neighbors also provide shortcuts to remote neighbors for the hill-climbing, similar to the mechanism offered by the small-world graph [20, 19]. As a result, EHC turns out to be more efficient. Another interesting discovery is that the NN search performance based on the approximate kNN graph is very close to that based on the true kNN graph. This indicates that a minor difference in kNN graph quality does not lead to a big difference in search performance. The above observations apply to the other datasets considered in this paper.
With the support of this effective search procedure, it becomes possible to build the kNN graph from the search results. In the following, we show the quality of the kNN graph that is built based on this search algorithm.
V-C kNN Graph Construction
In the second experiment, the performance of kNN graph construction is studied when the enhanced hill-climbing is employed as a graph construction approach. In the evaluation, the performance of OLG (Alg. 2) and LGD (Alg. 3) is compared to NN-Descent [3], which is still recognized as the most effective kNN graph construction approach that works in generic metric spaces. In order to be in line with the experiments in [3], synthetic data in the same series of dimensions are used. In the test, the shared parameter k among the different approaches is tuned to be close to the data dimension and no higher than 50. Meanwhile, we ensure that the scanning rates of the different approaches are on the same level; usually, the higher the scanning rate, the better the kNN graph quality. The scanning rates of all three approaches are reported in Table II, while the top-1 and top-10 recalls of all three approaches under the l1 and l2 distance measures are shown in Fig. 6.
TABLE II: Scanning rates on the synthetic datasets across dimensions; the two three-row blocks correspond to the two distance measures studied.

| m(·) | d=2 | d=5 | d=10 | d=20 | d=50 | d=100 |
|---|---|---|---|---|---|---|
| NNDesc. | 0.0040 | 0.0057 | 0.00883 | 0.0213 | 0.1037 | 0.139 |
| OLG | 0.0041 | 0.0036 | 0.0060 | 0.0217 | 0.1133 | 0.135 |
| LGD | 0.0039 | 0.0035 | 0.0060 | 0.0209 | 0.1034 | 0.136 |
| NNDesc. | 0.0034 | 0.0047 | 0.0075 | 0.0209 | 0.1014 | 0.134 |
| OLG | 0.0038 | 0.0033 | 0.0056 | 0.0196 | 0.1054 | 0.145 |
| LGD | 0.0036 | 0.0030 | 0.0049 | 0.0194 | 0.1081 | 0.138 |
As seen from the figure, in most of the cases the quality of the kNN graphs from OLG and LGD is considerably better than that of NN-Descent when their scanning rates are similar to each other; this is particularly evident under one of the two metrics. The scanning rates of all the approaches increase steadily as the dimension of the data grows, while the reachable recall drops. As shown in the figure, when d reaches 100, the scanning rates of all the approaches are above 10%, which is too high for them to be practically useful. In general, NN-Descent, OLG and LGD are effective in a similar range of data dimensions, namely up to around 50. Within this dimensional range, OLG and LGD achieve a better trade-off between scanning cost and graph quality than NN-Descent.
V-D Nearest Neighbor Search
In our third experiment, the performance of NN search with the support of the kNN graphs built by OLG and LGD is evaluated. Six datasets derived from real-world data are adopted; among them, NUSW is tested under two distance measures. In addition, another 100-dimensional random dataset of one million samples is adopted. Brief information about all the datasets is summarized in Table I. When OLG (Alg. 2) and LGD (Alg. 3) are used to build the approximate kNN graph for search, the parameters k and p are fixed to 40 for all the datasets. Once the kNN graphs are built, algorithms OLG and LGD are used as NN search procedures with their update and insertion operations turned off. For convenience, the NN search approaches based on the graphs constructed by OLG and LGD are referred to as OLG and LGD respectively.
In the evaluation, the NN search performance is compared to three representative graph based approaches, namely NN-Descent [3], DPG [31] and HNSW [20], all of which work in generic metric spaces. All the approaches use a similar hill-climbing search procedure; the major difference lies in the graph upon which the search is undertaken. The DPG graph is basically built upon NN-Descent: the kNN graph built by NN-Descent is diversified by an offline post-process, and the diversified NN list of each sample is appended with its reverse NN list. In the experiment, NN-Descent shares the same kNN graph with DPG for each dataset. Parameter k in NN-Descent is fixed to 40 for all the datasets (to be in line with the experiments in [31]). OLG searches over a graph that is a merge of the kNN graph and its reverse kNN graph produced by OLG. NN search in LGD is undertaken on a kNN graph, merged with its reverse kNN graph, that has been diversified online by the LGD rules. For HNSW, the search is undertaken on a small-world graph, which maintains links between close neighbors as well as long-range links to remote neighbors, kept in a hierarchy. The parameter that controls the number of links per sample in HNSW is fixed to 20, while the number of edges kept for each sample in the bottom layer is 40; its NN list size is therefore on the same level as NN-Descent, DPG, OLG and LGD.
Since the search is conducted on a kNN graph for NN-Descent, DPG, OLG and LGD, the recall@1 and recall@10 of the corresponding kNN graphs from NN-Descent and our approaches are shown in Fig. 7. Accordingly, their scanning rates c on each dataset are reported in Table III. As shown in the table, the scanning rates of LGD on five out of seven real-world datasets are at least 20% lower than those of NN-Descent and OLG, while, as shown in Fig. 7, the recall of the kNN graph from LGD is at most 5% lower. The graph quality achieved by the different approaches is in general similar. One exception is NN-Descent, which shows considerably poorer performance on GloVe1M, whose intrinsic dimension is believed to be high [31].
TABLE III: Scanning rates c of NN-Descent, OLG and LGD on each dataset.

| Dataset | NN-Descent | OLG | LGD | m(·) |
|---|---|---|---|---|
| SIFT1M | 0.01085 | 0.00597 | 0.00438 | |
| SIFT10M | 0.00131 | 0.00075 | 0.00053 | |
| GIST1M | 0.01665 | 0.02140 | 0.0141 | |
| GloVe1M | 0.01393 | 0.00826 | 0.00555 | Cosine |
| NUSW | 0.05750 | 0.09686 | 0.07164 | |
| NUSW | 0.06167 | 0.09410 | 0.06867 | |
| YFCC1M | 0.01113 | 0.00631 | 0.00468 | |
| Rand1M | 0.0016 | 0.0312 | 0.0228 | |
TABLE IV: Time costs of exhaustive search on each dataset.

| Dataset | # Qry | Time |
|---|---|---|
| SIFT1M | | 892.4 |
| SIFT10M | | 8923.6 |
| GIST1M | | 748.5 |
| GloVe1M | | 79.6 |
| NUSW | | 100.8 |
| NUSW | | 805.8 |
| YFCC1M | | 870.0 |
| Rand1M | | 72.4 |
The search performance on the eight datasets is shown in Fig. 9. In order to make the search results comparable to results produced under different hardware setups, the performance is reported as the recall curve against the speed-up achieved over exhaustive search. The time costs of exhaustive search on each dataset are reported separately in Table IV; this is also convenient for readers who want to see the efficiency our approach achieves with the current setup.
As shown in the figure, the performance of NN-Descent and HNSW is unstable across different datasets. LGD performs marginally better than OLG on most of the datasets. LGD shows performance close to that of DPG on four datasets (SIFT1M, SIFT10M, GIST1M and YFCC1M) and outperforms it on the other four, all of which are marked as "most challenging" in [31]. As superior performance is observed from DPG, OLG and LGD, it is clear that the performance boost mainly owes to the use of the reverse kNN graph and the introduction of graph diversification. As shown in Fig. 7, the recalls of the kNN graph from LGD are 1%–5% lower than those of OLG and NN-Descent; however, this does not cause a search performance loss in LGD. On the contrary, the best performance is observed from LGD in most of the cases. On the one hand, this indicates the search procedure is tolerant to minor graph quality degradation; on the other hand, it shows the search becomes more cost-effective when occluded samples are skipped.
Comparing the result presented in Fig. 9(a) to the one presented in Fig. 9(b), high scalability is achieved by the proposed approaches. As seen in the figure, the size of the reference set has been increased by 10 times, while the time cost only increases from 0.201ms (per query) to 0.316ms (per query) when the search quality is maintained at the 0.9 level. Similar high scalability is observed on deep features, i.e., YFCC1M (Fig. 9(d)). This is good news given that deep features are nowadays widely adopted in different applications. In contrast, such a high speed-up is not achievable on NUSW, GloVe1M and Rand1M, although the dimensionality of GloVe1M and Rand1M is lower than that of SIFT1M and YFCC1M. We found that the speed-up graph-based approaches can achieve is partly related to the intrinsic data dimension [3, 8]. When the intrinsic data dimension is low, with the guidance of a kNN graph, the hill-climbing search is actually undertaken on the subspaces in which most of the data samples lie. Due to the low dimensionality of these subspaces, the search complexity is lower than it seemingly is. Fig. 8 illustrates this phenomenon. This is one of the major reasons that graph based approaches exhibit superior performance over other types of approaches.
V-E Comparison to the State-of-the-art kNN Search
Fig. 10 further compares our approach with the most representative approaches of different categories in the literature. Besides the aforementioned HNSW, NN-Descent and DPG, the approaches considered in the comparison include the tree partitioning approaches Annoy [35] and FLANN [24], the locality sensitive hashing approach SRS [34], and the vector quantization approach product quantizer (PQ) [14]. In the figures, the speed-ups that each approach achieves are reported when recall@1 is set to the 0.8 and 0.9 levels. For PQ, it is impossible to achieve a top-1 recall above 0.5 due to its heavy quantization loss. As an exception, its recall is measured at top-16 for SIFT1M and NUSW, and at top-128 for GIST1M and Rand1M respectively.
As shown in the figure, the best results come from the graph based approaches, and the proposed LGD performs best in most of the cases. This observation is consistent across different datasets. The speed-up of all the approaches drops as recall@1 rises from 0.8 to 0.9. The speed-up degradation is more significant for approaches such as PQ and FLANN. No considerable speed-up is observed for SRS in any of the cases, which basically indicates that SRS is not suitable for tasks requiring high NN search quality. Another interesting observation is that the performance gap between the graph based approaches and the rest is wider on the "easy" datasets than on the "hard" ones. Compared to the approaches of other categories, NN search based on a graph makes good use of the subspace structures latent in a dataset. Since the intrinsic dimension of an "easy" dataset is low [8], the hill-climbing is actually undertaken on these low-dimensional subspaces: the lower the intrinsic dimension, the higher the speed-up graph-based approaches achieve. In contrast, no specific strategy in the other types of approaches exploits such latent structures in the data.
On the one hand, high search speed-ups are observed from LGD on data types such as SIFT, GIST and deep features. With such efficiency, it is possible to realize an image search system with instant response on a billion-scale dataset with a single PC. On the other hand, it is still too early to say that NN search on high-dimensional data has been solved. As shown on the Rand1M and NUSW datasets, where both the data dimension and the intrinsic data dimension are high, the efficiency achieved by our approach is still limited. Highly efficient NN search on these types of data (i.e., intrinsic dimension above 50) remains an open issue.
VI Conclusion
We have presented our solution for both kNN graph construction and approximate nearest neighbor search. These two issues have been addressed under a unified framework; namely, the NN search and NN graph construction are designed as interdependent procedures, each built upon the other. The advantages of this design are severalfold. First of all, the kNN graph construction is an online procedure. It therefore allows samples to be inserted into or dropped out of the graph dynamically, which is not supported by most of the existing solutions. Moreover, no sophisticated indexing structure is required to support this online approach. Furthermore, the solution places no restriction on the distance measure, which makes it a generic approach both for kNN graph construction and NN search. The effectiveness of the proposed solution both as a kNN graph construction approach and as an NN search approach has been extensively studied. Superior performance is observed in both cases under different test configurations.
Acknowledgment
This work is supported by the National Natural Science Foundation of China under grant 61572408.
References
 [1] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
 [2] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
 [3] W. Dong, C. Moses, and K. Li, “Efficient knearest neighbor graph construction for generic similarity measures,” in Proceedings of the 20th International Conference on World Wide Web, WWW’11, (New York, NY, USA), pp. 577–586, ACM, 2011.
 [4] C. Fu and D. Cai, “EFANNA : An extremely fast approximate nearest neighbor search algorithm based on knn graph,” arXiv.org, 2016. arXiv:1609.07228.
 [5] J. Chen, H.-r. Fang, and Y. Saad, "Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection," Journal of Machine Learning Research, vol. 10, pp. 1989–2012, Dec. 2009.
 [6] J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li, “Scalable knn graph construction for visual descriptors,” in CVPR, pp. 1106–1113, Jun. 2012.
 [7] R. Paredes, E. Chávez, K. Figueroa, and G. Navarro, “Practical construction of knearest neighbor graphs in metric spaces,” in Proceedings of the 5th International Conference on Experimental Algorithms, WEA’06, (Berlin, Heidelberg), pp. 85–97, SpringerVerlag, 2006.
 [8] B. Harwood and T. Drummond, “FANNG: Fast approximate nearest neighbour graphs,” in CVPR, pp. 5713–5722, 2016.
 [9] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, pp. 174–188, 2002.
 [10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Localitysensitive hashing scheme based on pstable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, (New York, NY, USA), pp. 253–262, ACM, 2004.
 [11] A. Guttman, “Rtrees: A dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD international conference on Management of data, vol. 14, (New York, NY, USA), pp. 47–57, ACM, Jun. 1984.
 [12] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of ACM, vol. 18, pp. 509–517, Sep. 1975.
 [13] T. Debatty, P. Michiardi, and W. Mees, “Fast online knn graph building,” CoRR, vol. abs/1602.06819, 2016.
 [14] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” Trans. PAMI, vol. 33, pp. 117–128, Jan. 2011.
 [15] A. Babenko and V. Lempitsky, “Additive quantization for extreme vector compression,” in CVPR, pp. 931–938, 2014.
 [16] T. Zhang, C. Du, and J. Wang, “Composite quantization for approximate nearest neighbor search,” in ICML, pp. 838–846, 2014.
 [17] J. Wang and S. Li, “Querydriven iterated neighborhood graph search for large scale indexing,” in Proceedings of the 20th ACM International Conference on Multimedia, (New York, NY, USA), pp. 179–188, ACM, 2012.
 [18] W. Zhou, C. Yuan, R. Gu, and Y. Huang in International Conference on Advanced Cloud and Big Data, 2013.
 [19] Y. A. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, "Approximate nearest neighbor algorithm based on navigable small world graphs," Information Systems, vol. 45, pp. 61–68, 2014.
 [20] Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs," arXiv.org, 2016. https://arxiv.org/abs/1603.09320.
 [21] K. Hajebi, Y. AbbasiYadkor, H. Shahbazi, and H. Zhang, “Fast approximate nearestneighbor search with knearest neighbor graph,” in International Joint Conference on Artificial Intelligence, pp. 1312–1317, 2011.
 [22] D. Comer, "The ubiquitous B-tree," ACM Computing Surveys, vol. 11, pp. 121–137, Jun. 1979.
 [23] N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger, “The R*tree: an efficient and robust access method for points and rectangles,” in International Conference on Management of Data, pp. 322–331, 1990.
 [24] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” Trans. PAMI, vol. 36, pp. 2227–2240, 2014.
 [25] C. SilpaAnan and R. Hartley, “Optimised kdtrees for fast image descriptor matching,” in CVPR, 2008.
 [26] Y. Chen, T. Guan, and C. Wang, “Approximate nearest neighbor search by residual vector quantization,” Sensors, vol. 10, pp. 11259–11273, 2010.
 [27] A. Babenko and V. Lempitsky, “Efficient indexing of billionscale datasets of deep descriptors,” in CVPR, pp. 2055–2063, 2016.
 [28] J. Martinez, H. H. Hoos, and J. J. Little, “Stacked quantizers for compositional vector compression,” Arxiv.org, 2014. https://arxiv.org/abs/1411.2173.
 [29] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Transactions on Information Theory, vol. 44, pp. 2325–2383, Oct. 1998.
 [30] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proceedings of Very Large Data Bases, Sep. 2007.
 [31] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin, "Approximate nearest neighbor search on high dimensional data — experiments, analysis and improvement," arXiv.org, 2016. https://arxiv.org/abs/1610.02455.
 [32] Y.-M. Zhang, K. Huang, G. Geng, and C.-L. Liu, "Fast kNN graph construction with locality sensitive hashing," in Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, pp. 660–674, Sep. 2013.
 [33] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest neighbor methods in learning and vision theory and practice. MIT Press, 2006.
 [34] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin, "SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index," in Proceedings of the VLDB Endowment, pp. 1–12, Sep. 2014.
 [35] E. Bernhardsson, “Annoy: approximate nearest neighbors in c++/python optimized for memory usage and loading/saving to disk,” 2016.
 [36] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International Journal on Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [37] M. Douze, H. Jégou, H. Singh, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for webscale image search,” in CIVR, pp. 19:1–19:8, Jul. 2009.
 [38] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
 [39] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.T. Zheng, “Nuswide: A realworld web image database from national university of singapore,” in CIVR, (Santorini, Greece.), 2009.
 [40] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in ICCV, pp. 1470–1477, 2003.
 [41] G. Amato, F. Falchi, C. Gennaro, and F. Rabitti, “Yfcc100m hybridnet fc6 deep features for contentbased image retrieval,” in Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, pp. 11–18, 2016.