HCLAE: High Capacity Locally Aggregating Encodings for Approximate Nearest Neighbor Search
Abstract
Vector quantization-based approaches are successful in solving Approximate Nearest Neighbor (ANN) problems, which are critical to many applications. The idea is to generate effective encodings that allow fast distance approximation. We propose that quantization-based methods should partition the data space finely and exhibit the locality of the dataset to allow efficient non-exhaustive search. In this paper, we introduce the concept of High Capacity Locally Aggregating Encodings (HCLAE) to this end, and propose Dictionary Annealing (DA) to learn HCLAE by a simulated annealing procedure. The resulting quantization error is lower than that of other state-of-the-art methods. DA can be easily extended to an online learning scheme, allowing effective handling of large-scale data. Further, we propose the Aggregating Tree (ATree), a non-exhaustive search method using HCLAE to perform efficient ANN search. ATree achieves magnitudes of speedup on ANN search tasks compared to the state-of-the-art.
Liu Shicong, Shao Junru, Lu Hongtao {artheru, yz_sjr, htlu}@sjtu.edu.cn Shanghai Jiaotong University
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Approximate nearest neighbor (ANN) search is a fundamental problem in many areas of computer science, especially those involving high-dimensional and large-scale datasets, such as machine learning, pattern recognition, computer vision, and information retrieval, due to their high computational efficiency requirements. Among existing ANN techniques, quantization-based algorithms (e.g., Jegou, Douze, and Schmid 2011; Ge et al. 2013; Babenko and Lempitsky 2014) have shown state-of-the-art performance by allowing efficient distance estimation via asymmetric distance computation (ADC) (Jegou, Douze, and Schmid 2011) between a query vector and an encoded vector. One can perform an exhaustive ADC scan to retrieve the approximate nearest neighbor.
Even so, an exhaustive comparison between the query and the dataset is still prohibitive for even larger datasets (e.g., Torralba, Fergus, and Freeman 2008). IVFADC (Jegou, Douze, and Schmid 2011) provides non-exhaustive search based on a coarse quantizer and encoded residues. The idea is to obtain a candidate list that possibly contains the nearest neighbor, then perform ADC on the list. Similar methods, such as the inverted multi-index (Babenko and Lempitsky 2012), Locally Optimized Product Quantization (Kalantidis and Avrithis 2014), and Joint Inverted Indexing (Xia et al. 2013), provide various improvements.
Problems of existing quantizationbased algorithms.
One challenge in designing non-exhaustive search algorithms is that the locality of a vector is not exhibited in its encoding. Researchers therefore resort to roundabout ways to recover the locality, such as using a coarse quantizer. These methods lack efficiency because candidate listing and re-ranking are completely decoupled. In addition, we would like the encodings to have high capacity w.r.t. the data space, i.e., to distinguish more vectors, so that the data space can be represented effectively. Existing quantization methods, however, do not explicitly consider these issues.
Major Contributions
In this paper, we are interested in encodings that not only accelerate distance computation, but also "aggregate" the locality of a dataset, along with high capacities. We introduce the concept of High Capacity Locally Aggregating Encodings (HCLAE) for ANN search to address the aforementioned problems. We propose the Dictionary Annealing (DA) algorithm to generate HCLAE encodings of the dataset. Inspired by simulated annealing, the main idea of DA is to "heat up" a dictionary with the current residue, then "cool down" the dictionary to reduce the residue. Auxiliary algorithms for DA are also introduced to further increase capacity and reduce distortion. DA is naturally an online learning algorithm and is suitable for large-scale learning.
To utilize HCLAE encodings on large-scale data, we propose the Aggregating Tree (ATree) for fast non-exhaustive search. It is a radix-tree-like structure based on the encodings of the dataset, so the common prefixes of the encodings can be effectively represented with one node. ATree is memory-efficient and allows fast non-exhaustive search: we breadth-first traverse the tree with a priority queue to obtain the candidate list. The time consumption is significantly lower than that of other non-exhaustive search methods.
We have validated DA and ATree on various standard benchmarks: SIFT1M and GIST1M (Jegou, Douze, and Schmid 2011), and SIFT1B (Jégou et al. 2011). Empirical results show that DA greatly improves the quantization of the dataset, and that ATree brings magnitudes of speedup compared to existing non-exhaustive search methods. The overall performance of DA and ATree outperforms existing state-of-the-art methods. The online learning version is also of great practical interest. Applications depending on ANN search can greatly benefit from our algorithms.
Background and Motivation
The main idea of quantization-based methods is to generate encodings consisting of several parts for fast distance computation. For example, Product Quantization (PQ) (Jegou, Douze, and Schmid 2011) splits the data space into disjoint subspaces, separately learns a dictionary for each subspace, then quantizes each subspace to produce the encoding of a vector. PQ allows fast approximate distance computation between a query vector and an encoded vector via Asymmetric Distance Computation (ADC), which is discussed in detail in (Jegou, Douze, and Schmid 2011) and follow-up work.
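As a concrete illustration, the following Python sketch (not the authors' implementation; the function names and small dictionary sizes are our own choices for readability) shows per-subspace dictionary learning, encoding, and ADC computed from per-subspace squared distances:

```python
import numpy as np

def train_pq(data, n_subspaces=4, k=256, n_iter=10, seed=0):
    """Learn one k-means dictionary per disjoint subspace (plain Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    d = data.shape[1] // n_subspaces
    dictionaries = []
    for m in range(n_subspaces):
        sub = data[:, m * d:(m + 1) * d]
        centers = sub[rng.choice(len(sub), k, replace=False)]
        for _ in range(n_iter):
            # assignment step: nearest centroid per sub-vector
            assign = np.argmin(((sub[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):  # update step: move centroids to cluster means
                if np.any(assign == j):
                    centers[j] = sub[assign == j].mean(axis=0)
        dictionaries.append(centers)
    return dictionaries

def encode_pq(x, dictionaries):
    """Code of x: index of the nearest centroid in each subspace."""
    d = len(x) // len(dictionaries)
    return [int(np.argmin(((x[m * d:(m + 1) * d] - C) ** 2).sum(-1)))
            for m, C in enumerate(dictionaries)]

def adc(query, code, dictionaries):
    """Asymmetric distance: the query stays raw, the database vector is its code.
    In a real system the per-subspace terms come from lookup tables built once per query."""
    d = len(query) // len(dictionaries)
    return sum(float(((query[m * d:(m + 1) * d] - dictionaries[m][c]) ** 2).sum())
               for m, c in enumerate(code))
```

For an encoded database vector, `adc` returns the squared distance between the query and the vector's reconstruction, which is why the approximation error of ADC equals the quantization error of the encoding.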
However, in real applications involving large-scale data, exhaustively computing distances does not meet the query speed requirement, so it is practical to perform some preprocessing such as candidate listing. IVFADC (Jegou, Douze, and Schmid 2011), the inverted multi-index (Babenko and Lempitsky 2012), Locally Optimized Product Quantization (Kalantidis and Avrithis 2014), etc., are proposed to perform this task. However, these candidate listing methods are completely decoupled from the encodings of the dataset, adding extra computation and storage cost.
Locally Aggregating Encodings
A common methodology for non-exhaustive search is branch-and-bound with trees. The effectiveness of a tree structure lies in how effectively it can tell which child node contains the nearest neighbor. However, in high-dimensional space, tree structures like the KD-Tree (Friedman, Bentley, and Finkel 1977) generally degrade to linear scan, because the nearest neighbor may be contained in any node (Weber, Schek, and Blott 1998). To utilize the branch-and-bound methodology, this search scope must be able to narrow down.
Our solution is to utilize the priors of the visited node: if a node is deep in the tree, then we know which child node may contain the nearest neighbor. We name this property Locally Aggregating. Note that one can transform the encodings into a radix tree. Denote the $m$-th part encoding of a vector $x$'s local (nearby) vector as $b_m$, and define the conditional entropy
$$H_m = H(b_m \mid b_1, \dots, b_{m-1}).$$
$H_m$ directly measures to what extent we can narrow down the search scope, so a fast-descending $H_m$ is preferred. Directly computing $H_m$ is not easy; nevertheless, we present the mutual information matrices of the encodings obtained with different quantization methods in Figure 1 for visualization.
Encoding Capacity
To effectively encode a dataset, we would like the data space to be partitioned finely, so that vectors can be easily distinguished. It is straightforward to define the Encoding Capacity as the total information entropy $H(b_1, b_2, \dots, b_M)$. In practice, optimizing encoding capacity is usually relaxed into two separate objectives:

Maximize the self-information $H(b_m)$ for each part $m$;

Minimize the mutual information $I(b_m; b_{m'})$ for $m \neq m'$.
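Both objectives can be checked empirically on any set of codes. The sketch below (an illustration using standard plug-in estimators, not part of the paper's method) measures the per-part entropy and the pairwise mutual information in bits:

```python
import math
from collections import Counter

def entropy(codes):
    """Empirical (plug-in) entropy in bits of one code part over a dataset."""
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(codes).values())

def mutual_information(part_a, part_b):
    """Empirical mutual information I(a;b) = H(a) + H(b) - H(a,b), in bits.
    High-capacity encodings want H high for each part and I near zero across parts."""
    return entropy(part_a) + entropy(part_b) - entropy(list(zip(part_a, part_b)))
```

For example, two identical code parts have mutual information equal to their entropy (fully redundant), while two independent balanced parts have mutual information near zero, which is the desirable case.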
The above objectives are explicitly considered in hashing methods such as Spectral Hashing (Weiss, Torralba, and Fergus 2009) and Semi-supervised Hashing (Wang, Kumar, and Chang 2010), which are proposed to learn balanced and uncorrelated bits. For quantization methods, encoding capacity has not been addressed yet. In Figure 2, we visualize and compare the encoding capacities of different quantization methods via their mutual information matrices.
Learning High Capacity Locally Aggregating Encodings (HCLAE)
As described above, for a high-capacity encoding, $H(b_1, \dots, b_M)$ is maximized. By the chain rule, $H(b_1, \dots, b_M) = \sum_m H(b_m \mid b_1, \dots, b_{m-1})$; to lower the conditional entropies for local vectors, the residues should be minimized, i.e., local vectors should share the same prefix encoding. By Lloyd's condition (Gray 1984), we could perform Residual Vector Quantization (RVQ) (Juang and Gray Jr 1982; Chen, Guan, and Wang 2010) on the dataset. However, for high-dimensional data, the encoding capacity of RVQ is low and its encodings do not exhibit local aggregation. We introduce Dictionary Annealing to produce High Capacity Locally Aggregating Encodings.
Dictionary Annealing
Dictionary Annealing (DA) performs simulated annealing on a series of existing dictionaries, while it can also learn dictionaries from scratch. Figure 3 provides an intuitive illustration of DA. To optimize a single dictionary $D_m$, assume it is already at the local lowest-energy position, i.e., no longer improving on the previous optimization/learning. We first "heat up" the dictionary by putting the "noisy" residue into it, generating an intermediate dataset
$$\hat{X} = E + D_m[b_m],$$
where $E$ is the current residue of the full encoding and $D_m[b_m]$ is the contribution of the dictionary being optimized. Then we "cool down" the dictionary by incrementally fitting $\hat{X}$.
Why the intermediate dataset, and why use residues? We have two reasons:

The intermediate dataset is the residue plus the contribution of the currently optimized dictionary. The quantization error is reduced if the intermediate dataset is better fitted. In the whole picture, the $m$-th dictionary does a better job, so the residue left for the next dictionary is lowered, lowering the final quantization error.

The residues are independent of the other dictionaries' spaces, as they are "noise" to those dictionaries. Working with residues will not raise the mutual information between dictionaries, so we can push $H(b_m)$ higher without worry.
Given a series of dictionaries, the algorithm is performed over multiple iterations. In each iteration, we optimize one dictionary, then re-encode the dataset to obtain the new residue for the next iteration. To learn dictionaries from scratch, one can simply perform DA on "all-zeros" dictionaries.¹ To bring better performance, we propose the following two auxiliary algorithms:

¹In this case DA is quite similar to Residual Vector Quantization: the intermediate dataset of an "all-zeros" dictionary is identical to the residue.
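The heat-up/cool-down loop can be sketched as follows. This is a simplified illustration under our own assumptions (plain warm-started k-means for the "cool down" step and greedy per-dictionary re-encoding), without the auxiliary algorithms described next:

```python
import numpy as np

def kmeans_fit(data, centers, n_iter=10):
    """Plain Lloyd iterations, warm-started from the given centers."""
    for _ in range(n_iter):
        assign = np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            if np.any(assign == j):
                centers[j] = data[assign == j].mean(axis=0)
    return centers

def dictionary_annealing(data, n_dicts=4, k=16, n_sweeps=2, seed=0):
    """Sketch of the DA loop: heat up one dictionary with the current residue,
    cool it down by re-fitting, then re-encode and move to the next dictionary."""
    rng = np.random.default_rng(seed)
    dicts = [np.zeros((k, data.shape[1])) for _ in range(n_dicts)]  # "all-zeros" start
    codes = np.zeros((len(data), n_dicts), dtype=int)
    for _ in range(n_sweeps):
        for m in range(n_dicts):
            recon = sum(dicts[j][codes[:, j]] for j in range(n_dicts))
            # heat up: fold dictionary m's own contribution back into the residue
            intermediate = data - recon + dicts[m][codes[:, m]]
            if np.allclose(dicts[m], 0):  # fresh dictionary: sample initial centers
                init = intermediate[rng.choice(len(data), k, replace=False)].copy()
            else:                          # otherwise warm-start from the old dictionary
                init = dicts[m]
            # cool down: re-fit dictionary m to the intermediate dataset
            dicts[m] = kmeans_fit(intermediate, init)
            codes[:, m] = np.argmin(
                ((intermediate[:, None] - dicts[m][None]) ** 2).sum(-1), axis=1)
    return dicts, codes
```

On the first sweep this behaves like RVQ (each intermediate dataset is exactly the residue); later sweeps re-anneal each dictionary against the residue left by all the others.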
Improved Kmeans for Highdimensional Residue
Clustering in a high-dimensional space is not easy, especially on high-dimensional residues, whose randomness is increased. One approach to obtain a better clustering of high-dimensional data is to cluster in a lower-dimensional subspace (Agrawal et al. 1998), which is also done by PQ/OPQ to obtain high information entropy for each dictionary. Ding and He (2004) indicate that PCA dimension reduction is particularly beneficial for k-means clustering, as it finds the best low-rank L2 approximation. In addition, the previously learned dictionary can provide good enough initial points, which is important for k-means clustering (Bradley and Fayyad 1998).
Our idea is to preserve the clustering information from lower-dimensional subspaces for higher-dimensional subspace clustering. To optimize a dictionary $D$ for the intermediate dataset $\hat{X}$, we first designate a dimension-adding sequence $d_1 < d_2 < \dots < d_s$, then:

1. Project $D$ and $\hat{X}$ into the PCA space of $\hat{X}$, obtaining the rotated dictionary $D'$ and the rotated intermediate dataset $\hat{X}'$.

2. Optimize $D'$ by performing k-means on $\hat{X}'$, initialized with $D'$, using only the top $d_1$ dimensions, then the top $d_2$, and so on, finally on all dimensions.

3. Rotate $D'$ back to finish the optimization.

The choice of the dimension-adding sequence has a minor effect on the optimization of a dictionary; we fix one sequence in all our experiments.
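A minimal sketch of the staged optimization follows; the particular dimension-adding sequence and the helper names are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def pca_staged_kmeans(data, centers, dim_stages=(2, 4, None), n_iter=5):
    """Rotate into the PCA space of the data, run warm-started k-means on a
    growing prefix of dimensions, then rotate the dictionary back.
    dim_stages is the assumed dimension-adding sequence (None = all dims)."""
    mean = data.mean(axis=0)
    cov = np.cov((data - mean).T)
    _, vecs = np.linalg.eigh(cov)
    R = vecs[:, ::-1]                 # columns sorted by decreasing eigenvalue
    Z = (data - mean) @ R             # rotated dataset
    C = (centers - mean) @ R          # rotated (warm-start) dictionary
    for d in dim_stages:
        d = Z.shape[1] if d is None else d
        # k-means restricted to the top-d PCA dimensions, warm-started from C
        for _ in range(n_iter):
            assign = np.argmin(((Z[:, None, :d] - C[None, :, :d]) ** 2).sum(-1), axis=1)
            for j in range(len(C)):
                if np.any(assign == j):
                    C[j, :d] = Z[assign == j, :d].mean(axis=0)
    return C @ R.T + mean             # rotate the optimized dictionary back
```

Each stage refines the cluster structure found on the higher-variance directions before the noisier low-variance dimensions are allowed to influence the assignments.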
Multipath Encoding
To encode with DA dictionaries, we seek the code that minimizes the quantization error for an input vector $x$:
$$\min_{b_1, \dots, b_M} \left\| x - \sum_{m=1}^{M} D_m[b_m] \right\|^2 \quad (1)$$
The above problem is a typical fully connected MRF problem. Though the optimization of Equation 1 can be solved approximately by various existing algorithms, they are very time-consuming (Babenko and Lempitsky 2014).
Similar to the concept of Locally Aggregating Encodings: if an oracle gives the correct first $m-1$ parts of the encoding, can we effectively tell the correct encoding of the $m$-th part? Denote the correct encoding of an input vector $x$ as $(b_1^*, \dots, b_M^*)$, with $b_1^*, \dots, b_{m-1}^*$ known. We consider the quantization error as a function of $b_m$:
$$E(b_m) = \|x\|^2 - 2\langle x, P\rangle + \|P\|^2 + \|D_m[b_m]\|^2 - 2\langle x - P, D_m[b_m]\rangle + 2\langle D_m[b_m], S\rangle + \left(\|S\|^2 - 2\langle x - P, S\rangle\right) \quad (2)$$
where $P = \sum_{i<m} D_i[b_i^*]$ and $S = \sum_{j>m} D_j[b_j^*]$.
We seek the best $b_m$ among the dictionary elements to minimize $E(b_m)$. In Equation 2, terms 1/2/3/7 are constant with respect to $b_m$ and thus negligible, and terms 4/5 can be computed. Only the 6th term cannot be computed, because we do not know $S$. We want it to be small so that it does not seriously affect the final outcome.
Thus we rearrange the dictionaries in descending order of the variance of their elements. Note that DA learned from scratch naturally produces variance-descending dictionaries. We further adopt beam search to encode a vector. That is, we maintain a list of the $L$ best approximations of $x$ over the first $m-1$ dictionaries. Then we encode with the next dictionary $D_m$: we find the best $L$ combinations among the $L \times K$ candidates by minimizing the following objective function:
$$\left\|x - P - D_m[b_m]\right\|^2 = \left\|x - P\right\|^2 - 2\langle x, D_m[b_m]\rangle + \left\|D_m[b_m]\right\|^2 + 2\sum_{i<m}\langle D_i[b_i], D_m[b_m]\rangle \quad (3)$$
where $P = \sum_{i<m} D_i[b_i]$ is the reconstruction given by a candidate prefix. We enumerate the $L \times K$ combinations and select the top $L$ candidates. For each combination in Equation 3:

The first term has been computed at the previous encoding step: one table lookup.

The second term is precomputed once for each vector being encoded: one table lookup.

The third term is a query-independent constant, precomputed offline and negligible.

The last term involves $m-1$ table lookups and additions, with the inner products of all dictionary elements precomputed before the beam search procedure.
To sum up, the time complexity of encoding with one single dictionary is $O(LKm)$. Note that for fresh-start DA, we do not need to re-encode with the previously learned dictionaries after optimizing a "zero" dictionary (i.e., after learning a new dictionary). We report the distortion curve in Figure 4(c); we found that a relatively small beam width $L$ already achieves satisfactory encoding quality. We use this configuration in the rest of the experiments.
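The beam-search encoding can be sketched as follows. For clarity this version recomputes squared errors directly instead of using the lookup-table decomposition of Equation 3; the candidates it keeps are the same:

```python
import numpy as np

def beam_encode(x, dicts, beam=8):
    """Beam search over additive dictionaries: keep the `beam` best partial codes,
    extend each with every word of the next dictionary, and re-rank."""
    candidates = [([], np.zeros_like(x))]  # (code prefix, partial reconstruction)
    for D in dicts:
        scored = []
        for code, recon in candidates:
            for j, word in enumerate(D):
                r = recon + word
                scored.append((float(((x - r) ** 2).sum()), code + [j], r))
        scored.sort(key=lambda t: t[0])    # keep the `beam` best extensions
        candidates = [(c, r) for _, c, r in scored[:beam]]
    best_code, best_recon = candidates[0]
    return best_code, float(((x - best_recon) ** 2).sum())
```

With `beam=1` this degenerates to greedy residual encoding; a modest beam width recovers most of the gap to the intractable exact optimization of Equation 1.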
Online Dictionary Learning
DA can be easily extended to an online learning scheme to utilize even larger-scale datasets, where clustering on all the data could be prohibitive, or where new data is not yet available. Online learning with DA is done simply by optimizing the learned dictionaries to fit the newly arriving data. We report online learning results for the SIFT1B (Jégou et al. 2011) dataset in Figure 4(a).
Aggregating Tree
We are now able to apply the branch-and-bound methodology to high-dimensional data via the Aggregating Tree (ATree). After obtaining HCLAE encodings with DA, the ATree is constructed from the encodings like a radix tree (each node that is an only child is merged with its parent), except that we only merge leaf nodes. The ATree effectively represents the quantized dataset, with all encodings written directly on the tree. A demonstrative ATree structure is shown in Figure 6.
To perform non-exhaustive search on an ATree, the idea is to maintain a candidate list as in multi-path encoding. First we determine a candidate-list size limit for each layer: $L_1, L_2, \dots, L_M$. Given a query vector $q$, we start with an initial candidate list containing only the root node, and iteratively do the following $M$ times (the procedure is illustrated in Figure 5):

Replace the nodes in the candidate list with their children. If a node has no children, it stays in the candidate list.

If the size of the candidate list exceeds $L_l$ (where $l$ is the current iteration number), shrink it to $L_l$ by discarding the nodes most distant from the query vector.
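The two steps above can be sketched on a nested-dict radix tree, a toy stand-in for the ATree; the node layout and names are our own, and partial reconstructions are kept explicitly instead of using the incremental distances discussed next:

```python
import numpy as np

def build_trie(codes):
    """Nested-dict radix tree over the code sequences; leaves keep vector ids."""
    root = {}
    for vid, code in enumerate(codes):
        node = root
        for c in code:
            node = node.setdefault(c, {})
        node.setdefault('ids', []).append(vid)
    return root

def atree_search(query, root, dicts, widths):
    """Layer-wise traversal: expand every candidate node to its children, then
    keep the widths[l] nodes whose partial reconstruction is closest to query."""
    candidates = [(np.zeros_like(query), root)]  # (partial reconstruction, node)
    for l, D in enumerate(dicts):
        expanded = []
        for recon, node in candidates:
            for c, child in node.items():
                if c == 'ids':
                    expanded.append((recon, node))       # leaf: stays in the list
                else:
                    expanded.append((recon + D[c], child))
        # shrink: discard the nodes most distant from the query
        expanded.sort(key=lambda t: float(((query - t[0]) ** 2).sum()))
        candidates = expanded[:widths[l]]
    ids = []
    for recon, node in candidates:
        ids.extend(node.get('ids', []))
    return ids
```

Shared prefixes are expanded only once, which is where the tree saves work over scoring every code independently.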
We have to record some extra information on each node to allow fast distance computation. Let $l$ denote the depth of a node $n$, let $(b_1, \dots, b_l)$ be the path from the root to this node, and let $r(n) = \sum_{i \le l} D_i[b_i]$ be the partial reconstruction along that path. On each node we record the scalar
$$v(n) = \left\| D_l[b_l] \right\|^2 + 2\langle D_l[b_l], r(\mathrm{parent}(n))\rangle.$$
When we compute the distance between $q$ and $n$ (reconstructed as $r(n)$), we already know the distance between $q$ and $n$'s parent, and we have:
$$\left\| q - r(n) \right\|^2 = \left\| q - r(\mathrm{parent}(n)) \right\|^2 - 2\langle q, D_l[b_l]\rangle + v(n) \quad (4)$$
Thus the distance computation between a node and the query can be done efficiently in $O(1)$: one lookup into the per-query table of $\langle q, D_l[b_l]\rangle$ plus the stored scalar.²

²We can further reduce the number of additions and table lookups with a smart implementation; please refer to the supplementary material.
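As a sanity check, one can verify numerically that maintaining a parent distance, a per-query inner product, and a per-node scalar reproduces the directly computed distance (the variable names here are our own):

```python
import numpy as np

def incremental_distance(parent_dist, q_dot_word, stored_v):
    """One recurrence step: parent distance, the per-query inner product
    <q, D_l[b_l]>, and the scalar v(n) stored on the node."""
    return parent_dist - 2.0 * q_dot_word + stored_v

rng = np.random.default_rng(0)
q = rng.normal(size=4)                    # query vector
r_parent = rng.normal(size=4)             # partial reconstruction at the parent
w = rng.normal(size=4)                    # dictionary word on the edge to the node
v = float(w @ w + 2.0 * (w @ r_parent))   # scalar recorded on the node
d_parent = float(((q - r_parent) ** 2).sum())
d_incr = incremental_distance(d_parent, float(q @ w), v)
d_direct = float(((q - r_parent - w) ** 2).sum())
assert abs(d_incr - d_direct) < 1e-9      # recurrence matches the direct distance
```

The identity holds because expanding the squared norm of `q - r_parent - w` leaves exactly one query-dependent cross term, which is the per-query table entry.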
After the above steps we obtain the list of approximate nearest neighbors. The configuration of $L_1, \dots, L_M$ influences the final search quality, which is discussed in the Experiments section. Candidate listing with an ATree is highly efficient: the overall time complexity is linear in the total number of nodes traversed. Note that since the ATree is a tree structure, its performance is heavily dependent on the implementation.
Experiments
In this section we present the experimental evaluation of Dictionary Annealing and ATree. All experiments are run on an Intel Core i7 at 3.5GHz with 16GB memory, single-threaded.
Datasets
We use the following datasets, commonly used to validate the efficiency of ANN methods: SIFT1M (Jegou, Douze, and Schmid 2011), which contains one million 128-d SIFT (Lowe 2004) features and 10,000 queries; GIST1M (Jegou, Douze, and Schmid 2011), which contains one million 960-d GIST (Oliva and Torralba 2001) global descriptors and 1,000 queries; and SIFT1B (Jégou et al. 2011), which contains one billion 128-d SIFT base vectors and 10,000 queries.
Performance of Dictionary Annealing
We compare the following state-of-the-art encodings: Optimized Product Quantization (OPQ) (Ge et al. 2013), Composite Quantization (CQ) (Zhang, Du, and Wang 2014), Additive Quantization (AQ) (Babenko and Lempitsky 2014), and Tree Quantization together with its optimized version, Optimized Tree Quantization (TQ) (Babenko and Lempitsky 2015). We re-implemented AQ and OPQ ourselves, and reproduce the results of (Zhang, Du, and Wang 2014) and (Babenko and Lempitsky 2015) for the evaluation. We choose the commonly used configuration of 256 dictionary elements per dictionary, with 8-byte and 16-byte codes, for all methods.
Table 1: Quantization errors of different methods on SIFT1M and GIST1M (8-byte and 16-byte codes).

Methods      8B-SIFT1M    16B-SIFT1M   8B-GIST1M   16B-GIST1M
AQ           19196.26     9799.86      0.6785      0.5277
OPQ          22239.78     10468.39     0.6973      0.5361
PQ           23540.75     10534.82     0.7056      0.6976
TQ           (~20000)     (~9000)      -           -
offline DA   18416.55     9444.11      0.6456      0.4847
online DA    16573.20     5901.43      0.6201      0.4583
We use SIFT1M and GIST1M for evaluation: we train all methods on the training set and encode the whole dataset. We also train online DA with all the data.³ We report the training time vs. distortion graph in Figure 4(b); DA runs almost as fast as RVQ and much faster than AQ. The quantization errors are presented in Table 1: our DA has a much lower quantization error than the other state-of-the-art methods. We perform exhaustive NN-search and report the performance of the different methods in Figure 8. It can be seen that DA consistently performs better than the other state-of-the-art methods. Its online learning version pushes the performance of the encodings even higher, for example by 13.6% lower distortion and 23.07% higher recall@1 for NN-search on 8-byte SIFT1M encodings.

³We did not train the other methods on the whole dataset because they require too much memory.
Searching with Aggregating Tree
Now we evaluate the performance of the Aggregating Tree. We constructed an ATree for SIFT1B (online DA trained with 10M vectors of the dataset). We designed the ATree implementation to be computation-efficient.⁴ The resulting data structure occupies 14.53GB of memory (1,224,574,028 nodes in total: 988,853,094 leaf nodes and 235,720,934 internal nodes) for SIFT1B with 64-bit encodings, including vector IDs.

⁴Implementation details are presented in the supplementary material.
Table 2: Non-exhaustive search performance on SIFT1B (values in parentheses are the results reported in the corresponding papers).

System        Recall@1         Recall@100       Query Time
IVFADC        (0.088) 0.107    (0.733) 0.729    (74ms) 65ms
Multi-D-ADC   (0.158) 0.149    (0.706) 0.717    (6ms) 3.4ms
Multi-ADC     (~0.05) 0.064    (~0.6) 0.582     3.2ms
LOPQ          (0.199) 0.182    (0.909) 0.890    69ms
ATree         0.137            0.7451           0.63ms
The choice of $L_1, \dots, L_M$ is important for searching with an ATree. The encodings produced by DA do not always guarantee that local vectors share the exact same prefix. We let $L_l$ grow exponentially with the layer $l$ in our experiments. Figure 7(b) reports the number of nodes traversed: though $L_l$ grows exponentially, the total number of traversed nodes is limited. We also report the performance of an exhaustive ADC (7.2s per query) on the whole dataset. ATree delivers performance asymptotically approaching exhaustive ADC with magnitudes of acceleration, as shown in Figure 7(a). One can use a longer encoding for more precise search results. We finally draw the performance curve of ATree in Figure 7(c). ATree achieves a remarkable speed of 0.63ms per query with a high search quality of 74.51% Recall@100, at the elbow of the curve.
In Table 2 we compare ATree with our speed-optimized implementations of IVFADC (Jegou, Douze, and Schmid 2011), Locally Optimized Product Quantization (Kalantidis and Avrithis 2014), and Multi-D-ADC and Multi-ADC (Babenko and Lempitsky 2012). ATree achieves a 9.5x acceleration over Multi-D-ADC and over a 117x acceleration over IVFADC with comparable performance. We think this is mainly because:

ATree joins the candidate listing and re-ranking procedures to avoid excessive "pre-computation", which also makes ATree cache-friendly, whereas the other methods must recalculate their lookup tables many times and are cache-unfriendly.

ATree is based on HCLAE, so a shorter candidate list already achieves satisfying results, whereas for IVFADC a typical candidate list length is 80M on the SIFT1B dataset.

DA produces a high-quality encoded dataset, especially with online learning (Recall@100 of 0.834 at 64 bits, compared to roughly 0.7 for Composite Quantization (Zhang, Du, and Wang 2014), 0.65 for OPQ, and 0.55 for PQ).
Conclusion
In this paper, we introduced the concept of High Capacity Locally Aggregating Encodings (HCLAE) for ANN search. We proposed Dictionary Annealing to produce HCLAE, and the Aggregating Tree to perform fast non-exhaustive search. Empirical results on datasets commonly used for evaluating ANN search methods demonstrate that our proposed approach significantly outperforms existing methods.
References
 [Agrawal et al. 1998] Agrawal, R.; Gehrke, J.; Gunopulos, D.; and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications, volume 27. ACM.
 [Babenko and Lempitsky 2012] Babenko, A., and Lempitsky, V. 2012. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 3069–3076. IEEE.
 [Babenko and Lempitsky 2014] Babenko, A., and Lempitsky, V. 2014. Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 931–938. IEEE.
 [Babenko and Lempitsky 2015] Babenko, A., and Lempitsky, V. 2015. Tree quantization for large-scale similarity search and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4240–4248.
 [Bradley and Fayyad 1998] Bradley, P. S., and Fayyad, U. M. 1998. Refining initial points for k-means clustering. In ICML, volume 98, 91–99. Citeseer.
 [Chen, Guan, and Wang 2010] Chen, Y.; Guan, T.; and Wang, C. 2010. Approximate nearest neighbor search by residual vector quantization. Sensors 10(12):11259–11273.
 [Ding and He 2004] Ding, C., and He, X. 2004. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, 29. ACM.
 [Friedman, Bentley, and Finkel 1977] Friedman, J. H.; Bentley, J. L.; and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3(3):209–226.
 [Ge et al. 2013] Ge, T.; He, K.; Ke, Q.; and Sun, J. 2013. Optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2946–2953. IEEE.
 [Gray 1984] Gray, R. M. 1984. Vector quantization. ASSP Magazine, IEEE 1(2):4–29.
 [Jégou et al. 2011] Jégou, H.; Tavenard, R.; Douze, M.; and Amsaleg, L. 2011. Searching in one billion vectors: rerank with source coding. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 861–864. IEEE.
 [Jegou, Douze, and Schmid 2011] Jegou, H.; Douze, M.; and Schmid, C. 2011. Product quantization for nearest neighbor search. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33(1):117–128.
 [Juang and Gray Jr 1982] Juang, B.-H., and Gray Jr, A. 1982. Multiple stage vector quantization for speech coding. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’82., volume 7, 597–600. IEEE.
 [Kalantidis and Avrithis 2014] Kalantidis, Y., and Avrithis, Y. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 2329–2336. IEEE.
 [Lowe 2004] Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2):91–110.
 [Mairal et al. 2009] Mairal, J.; Bach, F.; Ponce, J.; and Sapiro, G. 2009. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, 689–696. ACM.
 [Norouzi and Fleet 2013] Norouzi, M., and Fleet, D. J. 2013. Cartesian k-means. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 3017–3024. IEEE.
 [Oliva and Torralba 2001] Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision 42(3):145–175.
 [Zhang, Du, and Wang 2014] Zhang, T.; Du, C.; and Wang, J. 2014. Composite quantization for approximate nearest neighbor search. Journal of Machine Learning Research: Workshop and Conference Proceedings 32(1):838–846.
 [Torralba, Fergus, and Freeman 2008] Torralba, A.; Fergus, R.; and Freeman, W. T. 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(11):1958–1970.
 [Wang, Kumar, and Chang 2010] Wang, J.; Kumar, S.; and Chang, S.-F. 2010. Semi-supervised hashing for scalable image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 3424–3431. IEEE.
 [Weber, Schek, and Blott 1998] Weber, R.; Schek, H.-J.; and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, volume 98, 194–205.
 [Weiss, Torralba, and Fergus 2009] Weiss, Y.; Torralba, A.; and Fergus, R. 2009. Spectral hashing. In Advances in neural information processing systems, 1753–1760.
 [Xia et al. 2013] Xia, Y.; He, K.; Wen, F.; and Sun, J. 2013. Joint inverted indexing. In Computer Vision (ICCV), 2013 IEEE International Conference on, 3416–3423. IEEE.