k-NN Graph Construction:
a Generic Online Approach
Nearest neighbor search and k-nearest neighbor graph construction are two fundamental issues that arise in many disciplines such as information retrieval, data mining, machine learning and computer vision. Although continuous efforts have been made over the last several decades, these two issues remain challenging. They have become increasingly pressing as big data emerges in various fields and keeps expanding over the years. In this paper, a simple but effective solution for both k-nearest neighbor search and k-nearest neighbor graph construction is presented. Namely, these two issues are addressed jointly. On the one hand, the k-nearest neighbor graph construction is treated as a nearest neighbor search task. Each data sample, along with its k nearest neighbors, is joined into the k-nearest neighbor graph by sequentially performing nearest neighbor search on the graph under construction. On the other hand, the built k-nearest neighbor graph is used to support k-nearest neighbor search. Since the graph is built online, dynamic updating of the graph, which is not available in most of the existing solutions, is supported. Moreover, this solution is feasible for various distance measures. Its effectiveness as both a k-nearest neighbor graph construction approach and a k-nearest neighbor search approach is verified across various datasets of different scales, various dimensions and under different metrics.
Given a dataset, a k-NN graph refers to the structure that keeps the top-k nearest neighbors of each sample in the dataset. It is a key data structure in manifold learning [1, 2], computer vision, machine learning, information retrieval, etc. Due to the fundamental role it plays, it has been studied for several decades. Basically, given a metric, the construction of a k-NN graph is to find the top-k nearest neighbors for each data sample. When it is built in a brute-force way, the time complexity is $O(d \cdot n^2)$, where $d$ is the dimension and $n$ is the size of the dataset. Due to the prominence of big data in various contexts, both $d$ and $n$ could be very large. As a result, it is computationally prohibitive to build an exact k-NN graph in an exhaustive manner. For this reason, works in the literature [3, 4, 5, 6] only aim to search for an approximate but efficient solution.
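To make the cost of the exhaustive approach concrete, the following is a minimal Python sketch of brute-force k-NN graph construction. The function names `l2` and `brute_force_knn_graph` and the toy points are illustrative assumptions, not part of the paper; the Euclidean distance is used only as an example metric.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance; any other metric could be plugged in instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn_graph(data, k, dist=l2):
    """Exact k-NN graph: graph[i] holds the ids of the k samples closest to i."""
    graph = []
    for i, x in enumerate(data):
        # Compare sample i against every other sample: n - 1 distances per row.
        scored = ((dist(x, y), j) for j, y in enumerate(data) if j != i)
        graph.append([j for _, j in heapq.nsmallest(k, scored)])
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = brute_force_knn_graph(points, k=2)
print(graph)  # [[1, 2], [0, 2], [0, 1], [2, 1]]
```

Each of the n rows requires n - 1 distance computations of cost proportional to the dimension, which is where the quadratic growth in the dataset size comes from.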
Although considerable progress has been made in recent years, the major issues latent in k-NN graph construction remain challenging. First of all, many existing approaches only perform well on low-dimensional data, and the scale of data they are assumed to cope with is usually less than one million. Moreover, most approaches are designed under a specific metric, typically a particular vector norm. Only a few recent works [7, 3, 8] aim to address this issue in generic metric spaces. Thanks to the introduction of NN-Descent, the construction cost has been reduced from $O(n^2)$ to around $O(n^{1.14})$ distance computations for data with medium dimensions (e.g. 24). However, the performance of NN-Descent turns out to be unstable for data with high intrinsic dimension.
Besides the major issues mentioned above, many existing approaches face another potential issue. In practice, the dataset may change from time to time. This is especially the case for large-scale Internet applications. For instance, the collection of photos and videos on Flickr grows on a daily basis. In visual object tracking, new object templates are joined into the candidate set and obsolete templates are swapped out as the tracking continues. In these scenarios, one would expect the underlying k-NN graph to be updated from time to time. Unfortunately, most of the existing approaches assume that the dataset is fixed. Any update on the dataset invokes a complete reconstruction of the k-NN graph. As a consequence, the aggregated cost is high even when the dataset is small. It would be more convenient if samples could simply be inserted into or removed from the existing k-NN graph. Nevertheless, it is complicated to update the k-NN graph with the support of conventional indexing structures such as locality sensitive hashing, R-tree or k-d tree.
A recent study shows that it is possible to build a k-NN graph incrementally by invoking k-nearest neighbor search directly on an existing k-NN graph. Unfortunately, only a limited speed-up (10 to 20 times) is observed in that study. In order to support fast indexing, k-medoids is called in that approach to partition the samples that are already in the k-NN graph, which becomes very slow when both d and n are large. Moreover, since it is built upon k-medoids, it only works in certain metric spaces.
Another problem that is closely related to k-NN graph construction is k-nearest neighbor search (k-NN search), which, like k-NN graph construction, arises in a wide range of applications. The nearest neighbor search problem is defined as follows. Given a query vector and n candidate vectors of the same dimensionality, it is required to return the sample(s) that are closest to the query according to a given metric.
Traditionally, this issue has been addressed by various space partitioning strategies. However, these methods hardly scale to high-dimensional, large-scale and dense vector spaces. In such cases, most of the traditional approaches, such as k-d tree, R-tree and locality sensitive hashing (LSH), are unable to return decent results.
Recently, two major trends in the literature aim to address this issue. In one direction, k-NN search is undertaken based on vector quantization [14, 15, 16]. The primary goal is to compress the reference set by vector quantization, such that the whole reference set (after compression) can be loaded into memory even when it is extremely large. The distance between the query and the reference set is measured in the compressed space, which is more efficient than measuring it in the original space. Due to the quantization loss, high accuracy is hardly achievable with this type of approach. Alternatively, a more promising way is to conduct the k-NN search based on an approximate k-NN graph [17, 18, 4] or the like [19, 20] with a hill-climbing strategy.
In this paper, a generic k-NN graph construction approach is presented. The k-NN graph construction is treated as a k-NN search task. The k-NN graph is built incrementally by letting each sample query against the k-NN graph under construction. After one round of k-NN search, the query sample is joined into the graph with the resulting top-k nearest neighbors. The k-NN lists of the samples (already in the graph) that are visited during the search are updated accordingly. The k-NN search basically follows the hill-climbing strategy. In order to achieve high performance in terms of both efficiency and quality, two major innovations are proposed.
First, the hill-climbing procedure is undertaken on both the k-NN graph and its reverse k-NN graph. In order to avoid the high cost of converting the intermediate k-NN graph to its reverse k-NN graph each time, the data structure “orthogonal list” is adopted, in which the k-NN graph and the reverse k-NN graph are maintained as a whole.
Second, to further boost the performance, a lazy graph diversification (LGD) scheme is proposed. It helps to avoid unnecessary distance computations during the hill-climbing while involving no additional computations.
The advantages of this approach are several-fold. Firstly, the online construction avoids the repetitive distance computations that most of the current k-NN graph construction approaches suffer from. Secondly, online construction is particularly suitable for scenarios in which the dataset changes dynamically. Moreover, our approach makes no assumption about the distance measure. It is therefore a generic solution, which is confirmed in our experiments. Thanks to the two innovations mentioned above, the k-NN search turns out to be very cost-effective. When the graph update operations are turned off, it is also an effective k-NN search algorithm. Namely, the problems of k-NN graph construction and k-NN search are jointly addressed in our solution.
The remainder of this paper is organized as follows. In Section II, a brief review of the research works on k-NN graph construction and approximate k-NN search is presented. Section III presents an enhanced hill-climbing algorithm upon which the k-NN graph construction approach is built. Section IV presents two online k-NN graph construction approaches, which also serve as NN search approaches when the graph update operations are turned off. The experimental studies on k-NN graph construction and k-NN search are presented in Section V. Section VI concludes the paper.
II Related Works
II-A k-NN Search
The early study of the k-NN search issue can be traced back to the 1970s, when the need for NN search on file systems arose. In those days, the data to be processed were of very low dimension, typically 1D. This problem is well addressed by the B-tree and its variant the B+-tree, based on which the NN search complexity could be as low as $O(\log n)$. The B-tree is not naturally extensible beyond the 1D case. More sophisticated indexing structures were designed to handle NN search in multi-dimensional data. Representative structures are the k-d tree, the R-tree and the R*-tree. For the k-d tree, a pivot vector is selected each time to split the dataset evenly into two. By applying this bisection repeatedly, the hyper-space is partitioned into embedded hierarchical sub-spaces. The NN search is performed by traversing one or several branches to probe the nearest neighbors. Unlike the B-tree in the 1D case, the partition scheme does not exclude the possibility that the nearest neighbor resides outside of these candidate sub-spaces. Therefore, extensive probing over a large number of branches in the tree becomes inevitable. For this reason, NN search with the k-d tree and the like could be very slow. The more recent indexing structure FLANN [24, 25] partitions the space with hierarchical k-means and multiple k-d trees. Although it is efficient, only sub-optimal results are achieved.
Another major disadvantage of all the aforementioned tree partitioning methods lies in their heavy demand for memory. On the one hand, in order to support fast comparison, all the candidate vectors are loaded into memory. On the other hand, the tree nodes that are used for indexing also take up a considerable amount of extra memory. Overall, the memory consumption is usually several times bigger than the size of the reference set.
Aiming to reduce the memory consumption, quantization based approaches [26, 14, 27, 16, 28] compress the reference vectors by quantization. All the quantization based methods share two things in common. Firstly, the candidate vectors are all compressed via vector (or sub-vector) quantization. This makes it easier than with the previous methods to hold the whole dataset in memory. Secondly, NN search is conducted between the query and the compressed candidate vectors. The distance between the query and the candidates is approximated by the distance between the query and the vocabulary words that are used for quantization. Due to the heavy compression of the reference vectors, high search quality is hardly achievable. Furthermore, this type of approach is only suitable for specific norm-based metrics.
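The approximation described above can be sketched in a few lines of Python. This is a deliberately simplified illustration with a single, hand-picked codebook; the names `encode` and `quantized_search`, the codebook and the toy vectors are all assumptions made for the example, and real quantization methods learn their vocabularies and usually quantize sub-vectors.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def encode(data, codebook):
    """Replace every vector by the index of its nearest vocabulary word."""
    return [min(range(len(codebook)), key=lambda c: l2(v, codebook[c]))
            for v in data]


def quantized_search(query, codes, codebook, k):
    """Rank candidates by the distance between the query and their codeword."""
    table = [l2(query, w) for w in codebook]          # one distance per word
    order = sorted(range(len(codes)), key=lambda i: table[codes[i]])
    return order[:k]


codebook = [(0.0, 0.0), (10.0, 10.0)]                 # hypothetical vocabulary
data = [(0.5, 0.2), (9.0, 11.0), (1.0, 1.0), (10.0, 9.0)]
codes = encode(data, codebook)                        # [0, 1, 0, 1]
top2 = quantized_search((9.5, 9.5), codes, codebook, k=2)
print(codes, top2)                                    # [0, 1, 0, 1] [1, 3]
```

In this toy case the true nearest neighbor of the query is sample 3, yet samples 1 and 3 share a codeword and cannot be told apart in the compressed space, which illustrates the quantization loss mentioned above.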
Apart from the above approaches, several attempts have been made to apply LSH [10, 30] to NN search. In general, there are two steps involved in the search stage. Step 1 collects the candidates that share the same or similar hash keys as the query. Step 2 performs an exhaustive comparison between the query and all the selected candidates to find the nearest neighbor. As with FLANN, the computational cost remains high if one expects high search quality. Additionally, the design of hash functions that are feasible for various metrics is non-trivial.
Recently, the hill-climbing (HC) strategy that performs NN search based on a k-NN graph has also been explored [21, 17, 19, 8, 4]. The hill-climbing starts from a group of random seeds (random locations in the vector space). The search traverses iteratively over an approximate k-NN graph (built in advance) by best-first search. Guided by the k-NN graph, the search procedure ascends closer to the true nearest neighbor in each round until no better candidates can be found. Approaches in [21, 20, 19, 4] follow a similar search procedure. The major difference lies in the strategies that are used to build the graph. According to a recent study, these graph based approaches demonstrate superior performance over other types of approaches across a variety of data types.
II-B k-NN Graph Construction
The approaches for k-NN graph construction can be roughly grouped into two categories. Approaches such as [6, 32] basically follow the divide-and-conquer strategy. In the first step, samples are partitioned into a number of small sub-sets, and exhaustive comparisons are carried out within each sub-set. The closeness relations (viz. edges in the k-NN graph) between any two samples in one sub-set are established. In the second step, these closeness relations are collected to build the k-NN graph. To enhance the performance, the first step is repeated several times with different partitions. The produced closeness relations are used to update the k-NN graph. Since it is hard to design a partition scheme that is feasible in generic spaces, these approaches are in general only effective in specific norm-induced spaces. The other category of k-NN graph construction, typified by NN-Descent, avoids this disadvantage. The graph construction in NN-Descent starts from a random k-NN graph. Based on the principle “neighbor’s neighbor is likely to be the neighbor”, the k-NN graph evolves by invoking comparisons between samples in each sample’s neighborhood. Better closeness relations that are produced in the comparisons are used to update the neighborhood of one sample. This approach turns out to be generic and efficient. Essentially, it can be viewed as performing hill-climbing in batches. Recently, a mixture scheme derived from the above two categories has also been seen in the literature.
It is worth noting that the approaches proposed in [19, 20, 8, 31] are not k-NN graph construction algorithms. The graphs are built primarily for k-nearest neighbor search. In these approaches, samples which should be in the k-NN list of one sample are deliberately omitted for efficiency, while the links to remote neighbors are maintained [19, 20]. As a consequence, the graphs constructed by these approaches are not k-NN graphs in the real sense. Such graphs hardly support tasks beyond k-NN search.
In most of the aforementioned approaches, one potential issue is that the construction procedure has to keep records of the comparisons that have been made between sample pairs in order to avoid repetitive comparisons. However, the space complexity of such records could be as high as $O(n^2)$. Otherwise, repetitive comparisons are inevitable even when specific sampling schemes are adopted. In this paper, the k-NN graph construction and k-NN search are addressed jointly. The k-NN graph construction is undertaken by letting each sample query against the k-NN graph that is under construction. Since the query is new to the graph under construction each time, the issue of repetitive comparisons is overcome. More interestingly, we discover that k-NN graph construction and k-NN search are beneficial to each other. Namely, high quality k-NN search leads to high quality intermediate k-NN graphs. In turn, the efficiency and quality of k-NN search are guaranteed with the support of high quality intermediate k-NN graphs.
III Baseline Search Algorithm
Since our k-NN graph construction approach is based on k-NN search, the baseline k-NN search algorithm, which operates on a pre-built k-NN graph, is presented before the construction approach is introduced. It essentially follows the hill-climbing strategy, although considerable modifications have been made.
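Algorithm 1: Enhanced Hill-Climbing NN search (EHC)
1: Input: $q$: query, $G$: k-NN graph, $\bar{G}$: reverse k-NN graph, $S$: reference set
2: Output: $Q$: k-NN list of $q$
3: $Q \leftarrow \emptyset$; Flag[1 $\cdots$ n] $\leftarrow$ 0;
4: while $Q$ is updated do
5:   R[1 $\cdots$ p] $\leftarrow$ p random seeds in $S$
6:   for each $r \in R$ do
7:     insertQ($r$, $m(q, r)$, $Q$);
8:   end for
9:   for $r \in Q$ do
10:    if Flag[r] == 0 then
11:      for each $n_i \in G[r]$ do
12:        insertQ($n_i$, $m(q, n_i)$, $Q$);
13:      end for
14:      for each $n_j \in \bar{G}[r]$ do
15:        insertQ($n_j$, $m(q, n_j)$, $Q$);
16:      end for
17:    end if
18:    Flag[r] = 1;
19:  end for
20:  it = it + 1;
21: end while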
Given a k-NN graph $G$, $G[i]$ returns the k-NN list of sample $i$. Accordingly, $\bar{G}$ is the reverse k-NN graph of $G$, which is nothing more than a re-organization of graph $G$. $\bar{G}[i]$ keeps the IDs of the samples in whose k-NN lists sample $i$ appears. Notice that the size of $\bar{G}[i]$ is not necessarily $k$, and there may be overlap between $G[i]$ and $\bar{G}[i]$. An illustration of graphs $G$ and $\bar{G}$ is given in Fig. 1. With the support of the k-NN graph $G$ and its reverse graph $\bar{G}$, the baseline search algorithm is presented in Alg. 1.
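The relation between a k-NN graph and its reverse graph can be illustrated with a short Python sketch. The list-of-lists representation and the name `reverse_graph` are assumptions made for this example only.

```python
def reverse_graph(graph):
    """graph[i] is the k-NN list of sample i.

    The result lists, for every sample j, all samples i whose k-NN list
    contains j. Reverse lists may be shorter or longer than k.
    """
    rev = [[] for _ in graph]
    for i, neighbors in enumerate(graph):
        for j in neighbors:
            rev[j].append(i)
    return rev


G = [[1, 2], [0, 2], [0, 1], [2, 1]]   # toy 2-NN graph over four samples
rG = reverse_graph(G)
print(rG)  # [[1, 2], [0, 2, 3], [0, 1, 3], []]
```

Sample 3 appears in no k-NN list, so its reverse list is empty, while sample 1 has three reverse neighbors although k is 2.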
As shown in Alg. 1, the hill-climbing starts from random seeds. The compared samples are kept in $Q$, which is sorted in ascending order all the way (without loss of generality, it is assumed throughout the paper that the smaller the distance, the closer two samples are). The neighbors and reverse neighbors of the sample that has not yet been visited and is ranked closest to the query are compared sequentially (Line 5–19). Notice that $Q$ is updated as soon as a closer sample is found. The distance function $m(\cdot,\cdot)$ could be any distance metric defined on the input dataset. This search algorithm is generic in the sense that there is no specification or assumption related to the distance metric. This is also true for the k-NN graph construction algorithm that will be presented afterwards. The iteration (Line 4–21) continues until no closer sample is identified. Although this algorithm starts from random locations, only minor performance fluctuation is observed across different runs. In order to avoid repetitive comparisons, the variable Flag records whether a sample has been expanded.
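The search procedure described above can be sketched in Python as follows. This is an illustrative reading of Alg. 1 under several assumptions rather than the authors' implementation: the names `insert_q` and `ehc_search` are hypothetical, a `seen` set is added so that no sample is compared twice, and the toy graph is built exhaustively for the demonstration.

```python
import bisect
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def insert_q(Q, d, j, k):
    """Insert (d, j) into the ascending list Q capped at k; True if Q changed."""
    if len(Q) == k and d >= Q[-1][0]:
        return False
    bisect.insort(Q, (d, j))
    if len(Q) > k:
        Q.pop()
    return True


def ehc_search(q, data, G, rG, k, p, dist=l2, rng=random):
    """Hill-climbing over a k-NN graph G and its reverse graph rG."""
    Q, seen, expanded = [], set(), set()

    def compare(j):
        if j in seen:                      # assumption: skip repeated comparisons
            return False
        seen.add(j)
        return insert_q(Q, dist(q, data[j]), j, k)

    updated = True
    while updated:                         # stop once a round leaves Q unchanged
        updated = False
        for r in rng.sample(range(len(data)), min(p, len(data))):
            updated |= compare(r)          # random seeds
        for _, r in list(Q):               # closest candidates first
            if r not in expanded:          # plays the role of Flag
                expanded.add(r)
                for j in list(G[r]) + list(rG[r]):
                    updated |= compare(j)
    return Q


data = [(float(i), 0.0) for i in range(20)]
G = [[j for _, j in sorted((l2(x, y), j) for j, y in enumerate(data) if j != i)[:2]]
     for i, x in enumerate(data)]
rG = [[i for i, nbrs in enumerate(G) if j in nbrs] for j in range(len(data))]
result = ehc_search((7.2, 0.0), data, G, rG, k=3, p=2, rng=random.Random(0))
print(result)
```

The returned list holds (distance, id) pairs in ascending order. Since the seeds are random, the result is approximate in general; when the number of seeds equals the dataset size, the sketch degenerates to an exhaustive scan.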
IV Online k-NN Graph Construction
The prerequisites for the k-NN search algorithm in Alg. 1 are the k-NN graph and its reverse graph. In this section, we show how a k-NN graph and its reverse graph are built based on the k-NN search algorithm itself. Additionally, a more efficient scheme for k-NN search, and in turn for k-NN graph construction, is presented. Finally, we discuss how a further performance boost is achieved by integrating schemes proposed in the literature.
IV-A k-NN Graph Construction by Search
In Alg. 1, when the hill-climbing is conducted, the comparison to the candidates starts from several random locations. Moving along several trails, the search approaches the close neighborhood of the query. In the ideal case, the top-k nearest neighbors will be discovered. On the one hand, the top-k nearest neighbors of this query are known after the search. On the other hand, this query will be joined, as a new neighbor, into the neighborhoods of the samples that are visited during the search. As a result, the k-NN graph is augmented to include this new sample. Motivated by this observation, the online construction algorithm is conceived.
Generally, there are two major steps in the algorithm. Firstly, an initial graph is built exhaustively from a small subset of the dataset. The size of this subset is fixed to 256 throughout the paper. In the second step, each of the remaining samples is treated as a query against the graph following the flow of Alg. 1. Each time, the k-NN list of one sample is joined into the graph that has just been searched. The k-NN lists of the samples that have been compared during the search are updated accordingly. The general procedure of the construction algorithm is summarized in Alg. 2.
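Algorithm 2: Online k-NN Graph Construction (OLG)
1: Input: $S$: dataset; $k$: size of NN list; $p$: num. of seeds
2: Output: $G$: k-NN graph
3: $Q \leftarrow \emptyset$; Flag[1 $\cdots$ n] $\leftarrow$ 0;
4: Extract a small subset $I$ from $S$;
5: Initialize $G$ and $\bar{G}$ with $I$;
6: $S \leftarrow S - I$;
7: for each $q \in S$ do
8:   while $Q$ is updated do
9:     R[1 $\cdots$ p] $\leftarrow$ p random seeds in $G$
10:    for each $r \in R$ do
11:      insertQ($r$, $m(q, r)$, $Q$);
12:    end for
13:    for $r \in Q$ do
14:      if Flag[r] == 0 then
15:        for each $n_i \in G[r]$ do
16:          insertQ($n_i$, $m(q, n_i)$, $Q$);
17:          insertG($n_i$, $q$, $m(q, n_i)$, $G$);
18:        end for
19:        for each $n_j \in \bar{G}[r]$ do
20:          insertQ($n_j$, $m(q, n_j)$, $Q$);
21:          insertG($n_j$, $q$, $m(q, n_j)$, $G$);
22:        end for
23:      end if
24:      Flag[r] = 1;
25:    end for
26:  end while
27:  for each $q_i \in Q$ do
28:    insertG($q$, $q_i$, $m(q, q_i)$, $G$);
29:  end for
30:  $Q \leftarrow \emptyset$; Flag[1 $\cdots$ n] $\leftarrow$ 0;
31: end for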
In order to support efficient search and construction, an orthogonal list (as shown in Fig. 2) is adopted in the implementation to keep both the k-NN graph and its reverse graph (Line 8). As shown in Fig. 2, two linked lists are kept for each sample: one is its k-NN list and the other is its reverse k-NN list. For clarity, however, the k-NN graph and the reverse k-NN graph are still referred to separately in our description.
In Alg. 2, the procedure of k-NN graph construction is basically a repetitive call of the search algorithm in Alg. 1 (Line 10–26). The major difference is that function insertG($r$, $q$, $m(q, r)$, $G$) is called after the query $q$ is compared to a sample $r$ in the graph, where $m(q, r)$ is the distance between them. Function insertG() is responsible for inserting an edge into the k-NN list of $r$ in the k-NN graph and its reverse graph. The major operations inside insertG() involve updating the k-NN list of $r$ and the reverse k-NN list of $q$. The sample at the rear of the k-NN list is removed if a closer sample is joined in. The distance $m(q, r)$ is kept in the k-NN list of $r$, which allows the list to be kept sorted all the time. At the end of each loop, insertG() is called again to join the k-NN list of query $q$ into the graph.
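The bookkeeping performed by insertG() can be sketched in Python as below. This is a simplified stand-in, not the orthogonal list itself: the k-NN lists are kept as sorted Python lists of (distance, id) pairs and the reverse lists as sets, and the name `insert_g` and its argument order are assumptions made for the example.

```python
import bisect


def insert_g(G, rG, r, q, d, k):
    """Try to put q into the k-NN list of r; keep the reverse lists in sync.

    G[r]  : ascending list of (distance, id) pairs, at most k entries.
    rG[i] : set of samples whose k-NN list currently contains i.
    Returns True if the k-NN list of r changed.
    """
    knn = G[r]
    if any(j == q for _, j in knn):          # q is already a neighbor of r
        return False
    if len(knn) == k and d >= knn[-1][0]:    # q is not close enough
        return False
    bisect.insort(knn, (d, q))
    rG[q].add(r)
    if len(knn) > k:                         # drop the sample at the rear
        _, evicted = knn.pop()
        rG[evicted].discard(r)
    return True


G = [[(1.0, 1), (2.0, 2)], [], [], []]
rG = [set(), {0}, {0}, set()]
changed = insert_g(G, rG, r=0, q=3, d=1.5, k=2)
print(changed, G[0], rG)
```

In this toy run, sample 3 replaces sample 2 in the list of sample 0, so sample 0 is added to the reverse list of 3 and removed from the reverse list of 2 in the same call.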
Although the size of the input dataset is fixed in Alg. 2, the algorithm is apparently feasible for an open set, where new samples are allowed to join in from time to time. As will be revealed in the experiments (Section V), Alg. 2 already performs well. In the following, a novel scheme is presented to further boost its performance.
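The overall flow of online construction by search can be sketched in Python as follows. The sketch follows the spirit of Alg. 2 under stated assumptions rather than reproducing the authors' code: the names `bounded_insert`, `insert_g` and `build_online` are hypothetical, the initial subset is tiny (the paper fixes it to 256), every compared sample, seeds included, has its k-NN list updated at once, and a `seen` set prevents repeated comparisons.

```python
import bisect
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bounded_insert(lst, d, j, k):
    """Insert (d, j) into ascending list lst capped at k.

    Returns (changed, evicted_id); evicted_id is None if nothing fell off.
    """
    if len(lst) == k and d >= lst[-1][0]:
        return False, None
    bisect.insort(lst, (d, j))
    evicted = lst.pop()[1] if len(lst) > k else None
    return True, evicted


def insert_g(G, rG, r, q, d, k):
    """Put q into the k-NN list of r and keep the reverse lists in sync."""
    changed, evicted = bounded_insert(G[r], d, q, k)
    if changed:
        rG[q].add(r)
        if evicted is not None:
            rG[evicted].discard(r)


def build_online(data, k, p, init=4, dist=l2, rng=random):
    n = len(data)
    G = [[] for _ in range(n)]       # G[i]: ascending list of (distance, id)
    rG = [set() for _ in range(n)]   # rG[i]: ids whose k-NN list contains i
    init = min(init, n)
    for i in range(init):            # step 1: exhaustive graph on a small subset
        for j in range(init):
            if i != j:
                insert_g(G, rG, i, j, dist(data[i], data[j]), k)
    for q in range(init, n):         # step 2: each new sample queries the graph
        Q, seen, expanded = [], set(), set()

        def compare(r):
            if r == q or r in seen:
                return False
            seen.add(r)
            d = dist(data[q], data[r])
            insert_g(G, rG, r, q, d, k)          # update r's list right away
            return bounded_insert(Q, d, r, k)[0]

        updated = True
        while updated:
            updated = False
            for r in rng.sample(range(q), min(p, q)):   # seeds already in graph
                updated |= compare(r)
            for _, r in list(Q):
                if r not in expanded:
                    expanded.add(r)
                    for j in [x for _, x in G[r]] + list(rG[r]):
                        updated |= compare(j)
        for d, r in Q:                           # finally add q's own k-NN list
            insert_g(G, rG, q, r, d, k)
    return G, rG


gen = random.Random(0)
pts = [(gen.random(), gen.random()) for _ in range(40)]
G, rG = build_online(pts, k=3, p=3, rng=random.Random(2))
```

Because each sample is compared only with samples that are already in the graph, no pair is compared twice across insertions, and new samples can keep arriving after the loop has finished.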
IV-B Lazy Graph Diversification
In Alg. 1, when expanding sample $r$ in the rank list $Q$, all the samples in the neighborhood of $r$ will be compared with the query. According to recent studies [31, 20, 8], when samples in the neighborhood of $r$ are so close to each other that the distances between them are smaller than their distances to $r$, there is no need to compare the query to all of them during the expansion. Expanding these close samples later would likely lead the climbing process to the same local region. The phenomenon that samples in the k-NN list are closer to each other than they are to $r$ is called “occlusion”. An illustration of occlusion is shown in Fig. 3, in which two samples are occluded by a third one. It is easy to see that a sample can only be occluded by samples that are closer to $r$ than it is. According to [31, 20, 8], the hill-climbing is more efficient when such occluded samples are not considered while expanding $r$.
In order to know whether samples in a k-NN list are occluded by one another, pair-wise comparisons between the samples in the k-NN list are required, which is the practice in [31, 8, 20]. This is infeasible for an online construction procedure (i.e., Alg. 2). First of all, since the k-NN list changes dynamically, the pair-wise distances cannot simply be computed once and kept for good. Secondly, it is still too costly to update the pair-wise occlusion relations whenever a new sample is joined, since this involves a complete comparison between the new neighbor and the rest. Moreover, the occluded samples cannot simply be removed from a k-NN list since our primary goal is to build the k-NN graph.
In this paper, a novel scheme called lazy graph diversification is proposed to identify the occlusions between samples during the online graph construction. To achieve that, an occlusion factor $\lambda$ is introduced as an attribute attached to each sample in a k-NN list. The factors $\lambda$ of all the neighbors are initialized to 0 when the k-NN list of the current query is joined into the graph. The factor $\lambda$ will be updated when another new sample is joined into the k-NN list at a later stage. Given a new sample q to be inserted into sample r’s k-NN list, we should know its distances to the other neighbors in the list and also the distances of all the neighbors to r. The distances to r are known as they are kept for sorting, while the distances between this new sample and the rest are unknown. Instead of performing a costly thorough comparison between q and the remaining neighbors, we make use of the distances that are computed during the hill-climbing iteration. To do that, another variable D is introduced to keep the distances between sample q and the samples that have been compared during the hill-climbing. It is possible that not all the samples in the k-NN list of r have been involved in the comparisons. However, according to the principle “neighbor of a neighbor is likely to be a neighbor”, the “old” samples in the k-NN list of r are likely to have been encountered by q. With the support of D, the occlusion factors of all the samples in the k-NN list are updated with the following three rules.
Rule 1: the occlusion factor $\lambda$ is kept unchanged for samples ranked before q;
Rule 2: $\lambda$ of sample q is incremented by 1 if a sample ranked before q is closer to q than q is to r;
Rule 3: $\lambda$ of a sample ranked after q is incremented by 1 if its distance to q is smaller than the distance from q to r.
The default distance of each sample to q is set to $\infty$. As a result, the occlusion factors of the neighbors that have not been visited will not be updated according to Rule 1 and Rule 3. This is reasonable because the unvisited neighbors should be sufficiently far away from q. Otherwise they would already have been visited according to the principle “a neighbor of a neighbor is also likely to be a neighbor”. Since all the possible distances (between q and the samples in the graph) are available only after the hill-climbing converges, the operations of inserting q into the k-NN list of r and updating the occlusion factors in the list are postponed until after the search. Fig. 4 illustrates a search trail that is formed by the hill-climbing. In the k-neighborhoods of the samples on this trail, the LGD operations are applied.
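The three rules can be illustrated with a small Python sketch. It implements one possible reading of the rules exactly as they are worded above, with the occlusion factor of q incremented once for every occluding sample; the function name `lgd_insert`, the list layout and the toy values are assumptions for the example, and eviction side effects on the remaining factors are not handled.

```python
import math


def lgd_insert(knn, q, d_qr, D, k):
    """Insert q into the sorted k-NN list of r and update occlusion factors.

    knn  : list of [dist_to_r, sample_id, occlusion_factor], ascending.
    d_qr : distance between q and r.
    D    : distances from q to samples met during hill-climbing; a missing
           entry is treated as infinity (the sample was not visited).
    """
    if len(knn) == k and d_qr >= knn[-1][0]:
        return False
    pos = 0
    while pos < len(knn) and knn[pos][0] <= d_qr:
        pos += 1                               # q will be ranked at index pos
    lam_q = 0
    for i, entry in enumerate(knn):
        d_q = D.get(entry[1], math.inf)
        if i < pos:
            # Rule 1: the factor of a sample ranked before q is untouched.
            # Rule 2: such a sample occludes q if it is closer to q than r is.
            if d_q < d_qr:
                lam_q += 1
        elif d_q < d_qr:
            # Rule 3: a sample ranked after q is occluded by q.
            entry[2] += 1
    knn.insert(pos, [d_qr, q, lam_q])
    if len(knn) > k:
        knn.pop()
    return True


knn = [[1.0, "a", 0], [3.0, "c", 0]]
lgd_insert(knn, "q", 2.0, {"a": 0.5, "c": 1.5}, k=3)
print(knn)  # [[1.0, 'a', 0], [2.0, 'q', 1], [3.0, 'c', 1]]
```

When D is empty, every lookup yields infinity and no factor changes, which matches the treatment of unvisited neighbors described above.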
Once the occlusion factor is available, the search algorithm (Alg. 1) and the construction algorithm (Alg. 2) are modified accordingly. When the query is compared to the neighbors in one k-NN list, we only consider the samples whose occlusion factor $\lambda$ is no greater than the average occlusion factor of this list. Notice that this is different from the way proposed in [8, 20], in which the occluded samples are simply omitted. We find that such a rule is too restrictive in our case. In contrast, only samples that are occluded by many other neighbors are ignored here. The operation of skipping samples with a high occlusion factor can be interpreted as performing diversification in the graph. Different from prior work, the diversification is undertaken in a lazy way in our case. This scheme is therefore called lazy graph diversification (LGD). The three rules used to calculate the occlusion factor are called LGD rules. The k-NN graph construction algorithm with LGD is accordingly named LGD, and is summarized in Alg. 3.
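The selection criterion described above amounts to a simple filter, sketched here in Python. The tuple layout (distance, id, occlusion factor) and the name `candidates_to_compare` are assumptions made for the illustration.

```python
def candidates_to_compare(knn):
    """Keep the neighbors whose occlusion factor is at most the list average."""
    if not knn:
        return []
    average = sum(lam for _, _, lam in knn) / len(knn)
    return [j for _, j, lam in knn if lam <= average]


knn = [(1.0, "a", 0), (2.0, "b", 3), (3.0, "c", 1), (4.0, "d", 0)]
print(candidates_to_compare(knn))  # ['a', 'c', 'd']
```

Here the average factor is 1.0, so the mildly occluded sample "c" is still compared while the heavily occluded sample "b" is skipped. A rule that drops every occluded sample would have skipped "c" as well.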
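Algorithm 3: Online k-NN Graph Construction with LGD (LGD)
1: Input: $S$: dataset; $k$: size of NN list; $p$: num. of seeds
2: Output: $G$: k-NN graph
3: $Q \leftarrow \emptyset$; D[1 $\cdots$ n] $\leftarrow \infty$; E[1 $\cdots$ n] $\leftarrow$ 0;
4: Extract a small subset $I$ from $S$;
5: Initialize $G$ and $\bar{G}$ with $I$;
6: $S \leftarrow S - I$;
7: for each $q \in S$ do
8:   while $Q$ is updated do
9:     R[1 $\cdots$ p] $\leftarrow$ p random seeds in $G$
10:    for each $r \in R$ do
11:      insertQ($r$, $m(q, r)$, $Q$);
12:    end for
13:    for $r \in Q$ do
14:      if E[r] == 0 then
15:        for $n_i \in G[r]$ && $\lambda_i \leq \bar{\lambda}$ do
16:          insertQ($n_i$, $m(q, n_i)$, $Q$);
17:          D[$n_i$] = $m(q, n_i)$;
18:        end for
19:        for $n_j \in \bar{G}[r]$ && $\lambda_j \leq \bar{\lambda}$ do
20:          insertQ($n_j$, $m(q, n_j)$, $Q$);
21:          D[$n_j$] = $m(q, n_j)$;
22:        end for
23:      end if
24:      E[r] = 1;
25:    end for
26:  end while
27:  for each visited $r$ do
28:    updateG($r$, $q$, $D$, $G$);
29:  end for
30:  for each $q_i \in Q$ do
31:    insertG($q$, $q_i$, $m(q, q_i)$, $G$);
32:  end for
33:  $Q \leftarrow \emptyset$; D[1 $\cdots$ n] $\leftarrow \infty$; E[1 $\cdots$ n] $\leftarrow$ 0;
34: end for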
Compared to Alg. 2, Alg. 3 is different in three major aspects. Firstly, the query sample q is compared only to the samples whose occlusion factor is no greater than the average factor, in both the k-NN list and the reverse k-NN list (see Line 15 and Line 19). Secondly, after q is compared to a sample in the k-NN list, the k-NN list of that sample is not updated immediately. Instead, the distance from q to the sample is collected into D (Line 17, Line 21) for later use. Thirdly, the update of the k-NN lists of all the samples encountered so far is postponed to the end of the k-NN search (Line 27–30). Function updateG() is basically similar to insertG(). The additional operation inside updateG() is to update the occlusion factors of all the neighbors according to the LGD rules. It is easy to see that Alg. 3 becomes a fast k-NN search algorithm when the updateG and insertG operations are turned off.
IV-C Sample Removal from the Graph
In practice, we should allow samples to be dropped from the k-NN graph. A good example is maintaining a k-NN graph of product photos for an e-shopping website, where old-fashioned products should be withdrawn from sale. The dynamic removal of samples from the k-NN graph is supported in our approach. If the graph is built by Alg. 2, the removal operation is as easy as deleting the sample from the k-NN lists of its reverse neighbors and releasing its own k-NN list. If the graph is built by Alg. 3, the occlusion factors of the samples living in the same k-NN list have to be updated before the sample is deleted. Fortunately, not all the samples in the list need to be considered. According to LGD Rule 3, only samples ranked after the current sample should be considered. On average, the number of distance computations involved in the update depends only on k. Given that k is a small constant, the time cost is much lower than that of fulfilling a query. Notice that such a dynamic updating operation is not necessarily supported by other online algorithms [13, 20], in which the deleting operation may lead to the collapse of the indexing structure.
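For a graph built without occlusion factors (the Alg. 2 case), the removal described above can be sketched in Python as follows. The dictionary-based representation and the name `remove_sample` are assumptions for the illustration; the sketch leaves the affected k-NN lists one entry shorter and does not try to refill them.

```python
def remove_sample(G, rG, s):
    """Delete sample s from a k-NN graph G and its reverse graph rG.

    G[i]  : list of (distance, id) pairs, the k-NN list of i.
    rG[i] : set of samples whose k-NN list contains i.
    """
    for r in rG[s]:                       # every list that contains s
        G[r] = [(d, j) for d, j in G[r] if j != s]
    for _, j in G[s]:                     # s no longer points at its neighbors
        rG[j].discard(s)
    G[s] = []                             # release the k-NN list of s
    rG[s] = set()


G = {0: [(1.0, 1), (2.0, 2)], 1: [(1.0, 0), (1.5, 2)], 2: [(1.5, 1), (2.0, 0)]}
rG = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
remove_sample(G, rG, 2)
print(G, rG)
```

Only the reverse neighbors of the removed sample are touched, so the cost is bounded by the sizes of its k-NN list and reverse k-NN list rather than by the size of the graph.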
Discussions. The two k-NN graph construction approaches, namely OLG (Alg. 2) and LGD (Alg. 3), in general follow the same framework. In both processes, the construction starts from a small-scale k-NN graph of 100% quality. The search process appends the k-NN list of a new sample to the graph each time. At the same time, the k-NN lists of the samples that have already been inserted may be updated when the new sample happens to be in their neighborhoods. It is therefore a win-win situation for both graph construction and NN search. An effective search procedure returns a high quality k-NN list, while a k-NN graph of high quality gives good guidance for the hill-climbing process.
Besides the size of the NN list k, there is another parameter involved in OLG and LGD, namely the number of seeds p. Usually, the size of the NN list k should be no less than the intrinsic data dimension, which is less than or equal to the data dimension d. The number of seeds p is usually set to be no bigger than k. When d is very high (i.e., several hundreds to thousands) and the intrinsic dimension is close to d, the construction process could be slow when k is set close to the intrinsic dimension. In such a situation, a good trade-off between the quality of the k-NN graph and the efficiency of the construction is hardly achieved.
IV-D Refinement on k-NN Graph
Although Alg. 3 already performs well in different cases, it still suffers from the scalability issue that most k-NN graph construction approaches face. When the size or the dimension of a dataset increases, the quality of the k-NN graph drops steadily. In order to alleviate this problem, following the scheme in NN-Descent, it is possible to undertake pair-wise comparisons within each k-NN list once the k-NN graph is built. With such post-processing, missed true neighbor pairs can be recovered and the quality of the k-NN graph is enhanced. It is also possible to perform such refinement periodically during the online construction. For instance, the refinement routine is called after every 10 thousand insertion operations in either Alg. 2 or Alg. 3. The frequency of calling the refinement is a trade-off between efficiency and graph quality. Notice that such refinement is unnecessary for simple k-NN graph construction problems.
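The pair-wise refinement within each k-NN list can be sketched in Python as below. This is only an illustration of the idea: the helper names `try_insert` and `refine` are hypothetical, only forward k-NN lists are joined, no record of previously compared pairs is kept, and the 1-D toy data with absolute difference as the metric are chosen so that the effect is easy to follow.

```python
import bisect


def try_insert(knn, d, j, k):
    """Insert (d, j) into an ascending k-NN list unless j is present or too far."""
    if any(x == j for _, x in knn):
        return False
    if len(knn) == k and d >= knn[-1][0]:
        return False
    bisect.insort(knn, (d, j))
    if len(knn) > k:
        knn.pop()
    return True


def refine(G, data, k, dist):
    """One pass of pair-wise comparisons inside every k-NN list."""
    for i in range(len(G)):
        ids = [j for _, j in G[i]]            # snapshot of the list of sample i
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                d = dist(data[a], data[b])
                try_insert(G[a], d, b, k)     # a and b may be true neighbors
                try_insert(G[b], d, a, k)


data = [0.0, 1.0, 1.1, 5.0]
G = [[(1.0, 1), (1.1, 2)],       # list of sample 0
     [(1.0, 0), (4.0, 3)],       # sample 1 misses its true neighbor 2
     [(1.1, 0), (3.9, 3)],       # sample 2 misses its true neighbor 1
     [(3.9, 2), (4.0, 1)]]
refine(G, data, k=2, dist=lambda a, b: abs(a - b))
print([[j for _, j in lst] for lst in G])  # [[1, 2], [2, 0], [1, 0], [2, 1]]
```

Samples 1 and 2 both sit in the list of sample 0, so comparing them to each other recovers the missed neighbor pair.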
V Experiments
In this section, the performance of the proposed algorithms is studied both as a k-NN graph construction approach and as an approximate nearest neighbor search approach. In the evaluation, the performance is reported on both synthetic random data and real-world data. It is believed that the intrinsic dimension of the synthetic data is roughly equal to the data dimension [3, 8], while for the real-world datasets the intrinsic dimension is usually smaller than the data dimension. Brief information about the datasets adopted in the evaluation is summarized in Table I.
In the nearest neighbor search task, the performance of the proposed search approaches is studied in comparison to representative approaches of different categories. These include graph based approaches such as DPG and HNSW. Among the typical locality sensitive hashing approaches [10, 33], SRS is considered. For the quantization based approaches, product quantizer (PQ) is incorporated in the comparison. FLANN and Annoy are selected as the representative tree partitioning approaches, both of which are popular NN search libraries in the literature.
V-A Evaluation Protocol
For k-NN graph construction, five synthetic random datasets of the same size are used in the evaluation. Their dimension ranges from 2 to 100. Data in each dimension are drawn independently from a fixed range under a uniform distribution. This guarantees that the intrinsic dimension of the synthetic data largely equals the data dimension. The top-1 (recall@1) and top-10 (recall@10) recalls on each dataset are studied under two different metrics. Given that function R(i, k) returns the number of true-positive neighbors in the top-k NN list of sample i, the recall at top-k on the whole set of n samples is given as

$$\text{recall@}k = \frac{\sum_{i=1}^{n} R(i,k)}{n \times k}.$$
Table I: Datasets adopted in the evaluation
|Dataset |Data type |
|SIFT1M |SIFT |
|GIST1M |GIST |
|NUSW |BoVW |
|YFCC1M |Deep Feat. |
Besides k-NN graph quality, the construction cost is also studied by measuring the scanning rate of each approach. Given that C is the total number of distance computations in the construction and n is the size of the dataset, the scanning rate c is defined as

$$c = \frac{C}{n(n-1)/2}.$$
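The two evaluation measures can be computed as in the sketch below. It assumes the usual definitions: recall@k as the share of true top-k neighbors recovered over all samples, and the scanning rate as the number of distance computations divided by the n(n-1)/2 comparisons of an exhaustive construction. The function names and the toy neighbor lists are hypothetical.

```python
def recall_at_k(approx, truth, k):
    """Share of true top-k neighbors found in the approximate top-k lists.

    approx[i] and truth[i] are neighbor-id lists of sample i, closest first.
    """
    n = len(truth)
    hits = sum(len(set(approx[i][:k]) & set(truth[i][:k])) for i in range(n))
    return hits / (n * k)


def scanning_rate(num_dist_computations, n):
    """Distance computations relative to the n(n-1)/2 of brute force."""
    return num_dist_computations / (n * (n - 1) / 2)


# Toy example (assumed data): three samples, two neighbors each.
truth = [[1, 2], [0, 2], [1, 0]]
approx = [[1, 2], [2, 0], [1, 0]]
r1 = recall_at_k(approx, truth, 1)  # sample 1 misses its first neighbor
r2 = recall_at_k(approx, truth, 2)  # as sets, all top-2 lists agree
rate = scanning_rate(49950, 1000)   # 49,950 of 499,500 possible pairs
```

A scanning rate of 1.0 means the construction was as costly as the exhaustive comparison of all pairs, so lower values at equal recall indicate a better trade-off.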
In addition, another seven datasets are adopted to evaluate the performance of both nearest neighbor search and k-NN graph construction. Among them, six datasets are derived from real-world image or text data. In particular, all four datasets (namely GIST1M, GloVe1M, NUSW and Rand1M) that are marked as the most challenging datasets in  are adopted in the evaluation. For each dataset, another 1,000 or 10,000 queries of the same data type are prepared. Different metrics, Cosine among them, are adopted in accordance with the data type of each set. The search quality is measured by the top-1 recall, i.e., the recall for the first nearest neighbor. In order to make our study comparable under different hardware settings, the search quality is reported along with the speed-up that an approach achieves over brute-force search. All the codes of the approaches considered in this study are compiled with g++ 5.4. To keep the comparison fair, we disable all multithreading, SIMD and pre-fetching instructions in the codes. All the experiments are carried out on a PC with a 3.6GHz CPU and 32GB of memory.
V-B Performance of Baseline NN-Search
In the first experiment, the focus is to verify the effectiveness of the baseline search algorithm upon which our k-NN graph construction is built. Four different configurations are tested: two k-NN graphs, each searched by two procedures. The NN search is supplied with either a prebuilt k-NN graph from NN-Descent or the true k-NN graph. In the experiment, k is fixed to 40. The experiment is carried out on the SIFT1M and GIST1M datasets. The quality of the 40-NN graph from NN-Descent is above 0.90 in terms of its top-1 and top-10 recall for both datasets. These k-NN graphs are supplied to hill-climbing (HC) and enhanced hill-climbing (EHC, Alg. 1). EHC differs from HC mainly in the use of the reverse k-NN graph during the search. The search performance in terms of recall@1 is shown in Fig. 5. The performance is compared to the state-of-the-art approach HNSW.
As seen from the figure, there is a significant performance gap between EHC and HC. The NN list of each sample in EHC is usually longer than that of HC due to the incorporation of reverse nearest neighbors. On one hand, EHC has to visit more samples during the expansion, and therefore should be slower. On the other hand, the reverse nearest neighbors also provide short-cuts to remote neighbors for the hill-climbing, which is similar to the mechanism offered by small-world graphs [19, 20]. As a result, EHC turns out to be more efficient. Another interesting discovery is that the NN search performance based on the approximate k-NN graph is very close to that based on the true k-NN graph. This indicates that a minor difference in k-NN graph quality does not lead to any big difference in the search performance. The above observations also apply to the other datasets considered in this paper.
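The role of the reverse neighbors can be illustrated with a small greedy search. The sketch below is not Alg. 1 itself; it is a generic best-first hill-climbing over an adjacency structure, run once on a plain k-NN graph and once on the same graph merged with its reverse edges. All names (`merge_with_reverse`, `hill_climb`) and the toy data are assumptions for illustration.

```python
import heapq


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def merge_with_reverse(knn):
    """Adjacency holding each sample's k-NN list plus its reverse neighbors."""
    adj = [set(nbrs) for nbrs in knn]
    for i, nbrs in enumerate(knn):
        for j in nbrs:
            adj[j].add(i)  # i lists j, so j gains the reverse link to i
    return adj


def hill_climb(data, adj, query, seeds, topk=1, metric=l2):
    """Best-first expansion; returns (top-k list, number of distance computations)."""
    visited = set(seeds)
    cand = [(metric(query, data[s]), s) for s in seeds]
    heapq.heapify(cand)
    best = sorted(cand)[:topk]
    comps = len(cand)
    while cand:
        d, i = heapq.heappop(cand)
        if len(best) >= topk and d > best[-1][0]:
            break  # closest open candidate cannot improve the result
        for j in adj[i]:
            if j in visited:
                continue
            visited.add(j)
            dj = metric(query, data[j])
            comps += 1
            if len(best) < topk or dj < best[-1][0]:
                heapq.heappush(cand, (dj, j))
                best.append((dj, j))
                best.sort()
                del best[topk:]
    return best, comps


# Toy 1-D data: every 1-NN edge points "left", so a climb that starts at
# sample 0 cannot move right unless reverse edges are available.
data = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
knn = [[1], [0], [1], [2], [3], [4]]
query = (5.2,)

hc_result, hc_comps = hill_climb(data, [set(n) for n in knn], query, seeds=[0])
ehc_result, ehc_comps = hill_climb(data, merge_with_reverse(knn), query, seeds=[0])
```

On the directed graph the climb stalls at sample 1, while the merged graph lets it reach sample 5. The price is a larger number of distance computations per expansion, which matches the trade-off discussed above.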
With the support of this effective search procedure, it becomes possible to build the k-NN graph from the search results. In the following, we show the quality of the k-NN graph that is built based on this search algorithm.
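The idea of building the graph out of search results can be sketched as below. This is only an illustration under several assumptions and should not be read as Alg. 2: seeds are drawn at random from the samples already inserted, every sample compared during a search may update both its own list and the list of the newcomer, and reverse lists only grow (stale reverse links are kept for brevity). The names `update`, `search` and `build_online` are hypothetical.

```python
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def update(nn, k, d, j):
    """Add (d, j) to the sorted list nn (at most k entries); True if added."""
    if any(x == j for _, x in nn) or (len(nn) >= k and d >= nn[-1][0]):
        return False
    nn.append((d, j))
    nn.sort()
    del nn[k:]
    return True


def search(data, graph, rev, q, k, seeds, metric=l2):
    """Greedy expansion from the seeds; returns {id: distance} of compared samples."""
    seen = {s: metric(q, data[s]) for s in seeds}
    frontier = list(seeds)
    while frontier:
        frontier.sort(key=lambda s: seen[s])
        i = frontier.pop(0)  # closest sample not yet expanded
        kth = sorted(seen.values())[:k][-1]
        if len(seen) >= k and seen[i] > kth:
            break  # nothing left that can improve the top-k
        for j in [x for _, x in graph[i]] + sorted(rev[i]):
            if j not in seen:
                seen[j] = metric(q, data[j])
                frontier.append(j)
    return seen


def build_online(data, k, n_seeds, rng, metric=l2):
    """Insert samples one by one; each insertion is a search on the current graph."""
    graph, rev = [], []
    for q_id, q in enumerate(data):
        found = {}
        if q_id > 0:
            seeds = rng.sample(range(q_id), min(n_seeds, q_id))
            found = search(data, graph, rev, q, k, seeds, metric)
        graph.append([])
        rev.append(set())
        for j, d in found.items():
            if update(graph[q_id], k, d, j):  # j enters the newcomer's list
                rev[j].add(q_id)
            if update(graph[j], k, d, q_id):  # the newcomer enters j's list
                rev[q_id].add(j)
    return graph


# Demo on assumed random 2-D data.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
graph = build_online(points, k=5, n_seeds=5, rng=rng)
```

Because each insertion only touches the samples visited by one search, new samples can be added at any time without rebuilding the graph. When the number of seeds covers all previously inserted samples, the sketch degenerates to an exhaustive construction and returns the exact k-NN graph.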
V-C k-NN Graph Construction
In the second experiment, k-NN graph construction is studied when the enhanced hill-climbing is employed as a graph construction approach. In the evaluation, the performance of OLG (Alg. 2) and LGD (Alg. 3) is compared to NN-Descent, which is still recognized as the most effective k-NN graph construction approach that works in generic metric spaces. In order to be in line with the experiments in , synthetic data in the same series of dimensions are used. In the test, the shared parameter k of the different approaches is tuned to be close to the data dimension and no higher than 50. Meanwhile, we ensure that the scanning rates of the different approaches are on the same level. Usually, the higher the scanning rate, the better the k-NN graph quality. The scanning rates of all three approaches are reported in Table II, while the top-1 and top-10 recalls of all three approaches under the two distance measures are shown in Fig. 6.
As seen from the figure, in most cases the quality of the k-NN graph from OLG and LGD is considerably better than that from NN-Descent when their scanning rates are similar to each other. This is particularly true under -norm. The scanning rates of all the approaches increase steadily as the data dimension goes higher. Meanwhile, the recall that an approach can reach drops. As shown in the figure, when the data dimension reaches 100, the scanning rates of all the approaches are above 10%, which is too high for them to be practically useful. In general, NN-Descent, OLG and LGD are effective in a similar range of data dimensions, namely . Within this dimensional range, OLG and LGD achieve a better trade-off between scanning cost and graph quality than NN-Descent.
V-D Nearest Neighbor Search
In our third experiment, the performance of NN search with the support of the k-NN graph built by OLG and LGD is evaluated. Six datasets derived from real-world data are adopted. Among them, NUSW is tested under two different distance measures. In addition, another 100-dimensional random dataset of one million samples is adopted. Brief information about all the datasets is summarized in Table I. When we use OLG (Alg. 2) and LGD (Alg. 3) to build the approximate k-NN graph for search, the parameters k and p are fixed to 40 for all the datasets. Once the k-NN graphs are built, algorithms OLG and LGD are used as NN search procedures by turning off their update and insertion operations. For convenience, the NN search approaches based on the graphs constructed by OLG and LGD are also referred to as OLG and LGD respectively.
In the evaluation, the NN search performance is compared to three representative graph based approaches, namely NN-Descent, DPG and HNSW, all of which work in generic metric spaces. Additionally, all the approaches use a similar hill-climbing search procedure. The major difference lies in the graph upon which the search procedure is undertaken. The DPG graph is basically built upon NN-Descent. In DPG, the k-NN graph built by NN-Descent is diversified by an off-line post-process. Additionally, the diversified NN list of each sample is appended with its reverse NN list. In the experiment, NN-Descent shares the same k-NN graph with DPG for each dataset. Parameter k in NN-Descent is fixed to 40 for all the datasets (this is to be in line with the experiments in ). OLG searches over a graph that merges the k-NN graph and its reverse k-NN graph, both produced by OLG. NN search in LGD is on a k-NN graph merged with its reverse k-NN graph, both of which have been diversified online by the LGD rules. For HNSW, the search is undertaken on a small world graph. The graph maintains links between close neighbors as well as long range links to remote neighbors, which are kept in a hierarchy. The parameter in HNSW is fixed to 20. The number of edges kept for each sample in the bottom layer is 40. Its NN list size is therefore on the same level as that of NN-Descent, DPG, OLG and LGD.
Since the search is conducted on a k-NN graph for NN-Descent, DPG, OLG and LGD, the recall@1 and recall@10 of the corresponding k-NN graphs from NN-Descent and our approaches are shown in Fig. 7. Accordingly, their scanning rate c on each dataset is reported in Table III. As shown in the table, the scanning rates of LGD on five out of seven real-world datasets are at least 20% lower than those of NN-Descent and OLG, while, as shown in Fig. 7, the recall of the k-NN graph from LGD is at most 5% lower. The graph quality achieved by the different approaches is in general similar. One exception comes from NN-Descent. It shows considerably poor performance on GloVe1M, whose intrinsic dimension is believed to be high.
The search performance on eight datasets is shown in Fig. 9. In order to make the search results comparable to results obtained under different hardware setups, the performance is reported as the recall curve against the speed-up achieved over exhaustive search. The time costs of exhaustive search on each dataset are reported separately in Table IV, which is convenient for readers who want to see the efficiency that our approach achieves with the current setup.
As shown in the figure, the performance of NN-Descent and HNSW is unstable across different datasets. LGD performs marginally better than OLG on most of the datasets. LGD performs close to DPG on four datasets (SIFT1M, SIFT10M, GIST1M and YFCC1M) and outperforms it on the other four datasets, all of which are marked as “most challenging” in . As superior performance is observed from DPG, OLG and LGD, it is clear that the performance boost is mainly owed to the use of the reverse k-NN graph and the introduction of graph diversification. As shown in Fig. 7, the recalls of the k-NN graph from LGD are 1%-5% lower than those of OLG and NN-Descent. However, this does not cause any loss of search performance in LGD. On the contrary, the best performance is observed from LGD in most of the cases. On one hand, this indicates that the search procedure is tolerant to minor graph quality degradation. On the other hand, it also shows that the search gets more cost-effective when occluded samples are skipped.
Comparing the result presented in Fig. 9(a) to the one presented in Fig. 9(b), high scalability is achieved by the proposed approaches. As seen in the figure, the size of the reference set is increased by 10 times, while the time cost only increases from 0.201ms (per query) to 0.316ms (per query) when the search quality is maintained at the 0.9 level. Similarly high scalability is observed on deep features, i.e., YFCC1M (Fig. 9(d)). This is good news given that deep features have been widely adopted in different applications nowadays. In contrast, such high speed-up is not achievable on NUSW, GloVe1M and Rand1M, although the dimensionality of GloVe1M and Rand1M is lower than that of SIFT1M and YFCC1M. We found that the speed-up that graph-based approaches could achieve is partly related to the intrinsic data dimension [3, 8]. When the intrinsic data dimension is low, with the guidance of a k-NN graph, the hill-climbing search is actually undertaken on the sub-spaces in which most of the data samples lie. Due to the low dimensionality of these sub-spaces, the search complexity is lower than it seemingly is. Fig. 8 illustrates this phenomenon. This is one of the major reasons that the graph based approaches exhibit superior performance over other types of approaches.
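As a quick check of the scalability claim, using only the two per-query timings quoted above, the cost ratio for a 10-fold larger reference set is

```latex
\frac{0.316\ \text{ms}}{0.201\ \text{ms}} \approx 1.57
```

If the cost is assumed to grow as a power law in the size of the reference set (an assumption made here only for illustration), this ratio corresponds to an exponent of about $\log_{10} 1.57 \approx 0.20$, i.e., clearly sub-linear growth.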
V-E Comparison to state-of-the-art k-NN Search
Fig. 10 further compares our approach with the most representative approaches of different categories in the literature. Besides the aforementioned HNSW, NN-Descent and DPG, the approaches considered in the comparison include the tree partitioning approaches Annoy and FLANN, the locality sensitive hashing approach SRS, and the vector quantization approach product quantizer (PQ). In the figures, the speed-ups that each approach achieves are reported when recall@1 is set to the 0.8 and 0.9 levels. For PQ, it is impossible to achieve top-1 recall above 0.5 due to its heavy quantization loss. As an exception, its recall is measured at top-16 for SIFT1M and NUSW, and at top-128 for GIST1M and Rand1M respectively.
As shown in the figure, the best results come from the graph based approaches, and the proposed LGD performs the best in most of the cases. This observation is consistent across different datasets. The speed-up of all the approaches drops as recall@1 rises from 0.8 to 0.9. The speed-up degradation is more significant for approaches such as PQ and FLANN. No considerable speed-up is observed for SRS in any of the cases. This basically indicates that SRS is not suitable for tasks that require high NN search quality. Another interesting observation is that the performance gap between graph based approaches and the rest is wider on the “easy” datasets than on the “hard” ones. Compared to the approaches of other categories, the NN search based on the graph makes good use of the sub-space structures latent in a dataset. Since the intrinsic dimension of an “easy” dataset is low, the hill-climbing is actually undertaken on these low-dimensional sub-spaces. The lower the intrinsic dimension, the higher the speed-up that graph-based approaches achieve. In contrast, the other types of approaches have no specific strategy to exploit such latent structures in the data.
On one hand, high search speed-up is observed from LGD on data types such as SIFT, GIST and deep features. With such efficiency, it is possible to realize an image search system with instant response on a billion-level dataset with a single PC. On the other hand, it is still too early to say that NN search on high-dimensional data has been solved. As shown on the Rand1M and NUSW datasets, where both the data dimension and the intrinsic data dimension are high, the efficiency achieved by our approach is still limited. Highly efficient NN search on these types of data (i.e., intrinsic dimension above 50) is still an open issue.
We have presented our solution for both k-NN graph construction and approximate nearest neighbor search. These two issues have been addressed under a unified framework. Namely, NN search and k-NN graph construction are designed as interdependent procedures, each of which is built upon the other. The advantages of this design are manifold. First of all, the k-NN graph construction is an online procedure. It therefore allows samples to be inserted into or dropped out of the graph dynamically, which most of the existing solutions do not support. Moreover, no sophisticated indexing structure is required to support this online approach. Furthermore, the solution makes no assumption about the distance measure, which makes it a generic approach both for k-NN graph construction and NN search. The effectiveness of the proposed solution both as a k-NN graph construction approach and as an NN search approach has been extensively studied. Superior performance is observed in both cases under different test configurations.
This work is supported by the National Natural Science Foundation of China under grant 61572408.
-  J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
-  S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
-  W. Dong, C. Moses, and K. Li, “Efficient k-nearest neighbor graph construction for generic similarity measures,” in Proceedings of the 20th International Conference on World Wide Web, WWW’11, (New York, NY, USA), pp. 577–586, ACM, 2011.
-  C. Fu and D. Cai, “EFANNA : An extremely fast approximate nearest neighbor search algorithm based on knn graph,” arXiv.org, 2016. arXiv:1609.07228.
-  J. Chen, H.-R. Fang, and Y. Saad, “Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection,” Journal of Machine Learning Research, vol. 10, pp. 1989–2012, Dec. 2009.
-  J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li, “Scalable k-nn graph construction for visual descriptors,” in CVPR, pp. 1106–1113, Jun. 2012.
-  R. Paredes, E. Chávez, K. Figueroa, and G. Navarro, “Practical construction of k-nearest neighbor graphs in metric spaces,” in Proceedings of the 5th International Conference on Experimental Algorithms, WEA’06, (Berlin, Heidelberg), pp. 85–97, Springer-Verlag, 2006.
-  B. Harwood and T. Drummond, “FANNG: Fast approximate nearest neighbour graphs,” in CVPR, pp. 5713–5722, 2016.
-  M. S. Arulampalam, S. Maskell, and N. Gordon, “A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, pp. 174–188, 2002.
-  M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, (New York, NY, USA), pp. 253–262, ACM, 2004.
-  A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD international conference on Management of data, vol. 14, (New York, NY, USA), pp. 47–57, ACM, Jun. 1984.
-  J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, pp. 509–517, Sep. 1975.
-  T. Debatty, P. Michiardi, and W. Mees, “Fast online k-nn graph building,” CoRR, vol. abs/1602.06819, 2016.
-  H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” Trans. PAMI, vol. 33, pp. 117–128, Jan. 2011.
-  A. Babenko and V. Lempitsky, “Additive quantization for extreme vector compression,” in CVPR, pp. 931–938, 2014.
-  T. Zhang, C. Du, and J. Wang, “Composite quantization for approximate nearest neighbor search,” in ICML, pp. 838–846, 2014.
-  J. Wang and S. Li, “Query-driven iterated neighborhood graph search for large scale indexing,” in Proceedings of the 20th ACM International Conference on Multimedia, (New York, NY, USA), pp. 179–188, ACM, 2012.
-  W. Zhou, C. Yuan, R. Gu, and Y. Huang in International Conference on Advanced Cloud and Big Data, 2013.
-  Y. A. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Approximate nearest neighbor algorithm based on navigable small world graphs,” Information Systems, 2013.
-  Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” Arxiv.org, 2016. https://arxiv.org/abs/1603.09320.
-  K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, “Fast approximate nearest-neighbor search with k-nearest neighbor graph,” in International Joint Conference on Artificial Intelligence, pp. 1312–1317, 2011.
-  D. Comer, “Ubiquitous b-tree,” ACM Computing Surveys, vol. 11, pp. 121–137, Jun. 1979.
-  N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: an efficient and robust access method for points and rectangles,” in International Conference on Management of Data, pp. 322–331, 1990.
-  M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” Trans. PAMI, vol. 36, pp. 2227–2240, 2014.
-  C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in CVPR, 2008.
-  Y. Chen, T. Guan, and C. Wang, “Approximate nearest neighbor search by residual vector quantization,” Sensors, vol. 10, pp. 11259–11273, 2010.
-  A. Babenko and V. Lempitsky, “Efficient indexing of billion-scale datasets of deep descriptors,” in CVPR, pp. 2055–2063, 2016.
-  J. Martinez, H. H. Hoos, and J. J. Little, “Stacked quantizers for compositional vector compression,” Arxiv.org, 2014. https://arxiv.org/abs/1411.2173.
-  R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, pp. 2325–2383, Oct. 1998.
-  Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: Efficient indexing for high-dimensional similarity search,” in Proceedings of Very Large Data bases, Sep. 2007.
-  W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin, “Approximate nearest neighbor search on high dimensional data-experiments, analysis and improvement,” Arxiv.org, 2016. https://arxiv.org/abs/1610.02455.
-  Y.-M. Zhang, K. Huang, G. Geng, and C.-L. Liu, “Fast knn graph construction with locality sensitive hashing,” in Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, pp. 660–674, Sep.
-  G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest neighbor methods in learning and vision theory and practice. MIT Press, 2006.
-  Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin, “Srs: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny,” in The Proceedings of the VLDB Endowment, pp. 1–12, Sep. 2014.
-  E. Bernhardsson, “Annoy: approximate nearest neighbors in c++/python optimized for memory usage and loading/saving to disk,” 2016.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
-  M. Douze, H. Jégou, H. Singh, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in CIVR, pp. 19:1–19:8, Jul. 2009.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, “Nus-wide: A real-world web image database from national university of singapore,” in CIVR, (Santorini, Greece), 2009.
-  J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in ICCV, pp. 1470–1477, 2003.
-  G. Amato, F. Falchi, C. Gennaro, and F. Rabitti, “Yfcc100m hybridnet fc6 deep features for content-based image retrieval,” in Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, pp. 11–18, 2016.