On Large-Scale Retrieval: Binary or n-ary Coding?
Abstract
The growing amount of data available in modern-day datasets creates the need to efficiently search and retrieve information. To make large-scale search feasible, Distance Estimation and Subset Indexing are the main approaches. Although binary coding has been popular for implementing both techniques, n-ary coding (known as Product Quantization) is also very effective for Distance Estimation. However, their relative performance has not been studied for Subset Indexing. We investigate whether binary or n-ary coding works better under different retrieval strategies. This leads to the design of a new n-ary coding method, "Linear Subspace Quantization (LSQ)", which, unlike other n-ary encoders, can be used as a similarity-preserving embedding. Experiments on image retrieval show that when Distance Estimation is used, n-ary LSQ outperforms other methods. However, when Subset Indexing is applied, interestingly, binary coding is more effective and binary LSQ achieves the best accuracy.
1 Introduction
Large-scale retrieval has attracted growing attention in recent years due to the need for image search based on visual content and the availability of large-scale datasets. This paper focuses on the problem of approximate nearest neighbor (ANN) search for large-scale retrieval.
Approaches for solving this problem generally fall into two categories: Fast Distance Estimation [7, 12] and Fast Subset Indexing [14, 13, 2, 10, 1]. Fast Distance Estimation methods reduce computation cost by approximating the distance function, since distance computation is very expensive in high-dimensional feature spaces. Fast Subset Indexing methods, on the other hand, reduce the cost by constraining the search for a query to a subset of the dataset instead of the whole dataset.
A general technique for ANN search (both Fast Distance Estimation and Fast Subset Indexing) is to discretize the feature space into regions. Different coding methods can be used for this purpose. One of the classic methods is one-hot encoding using k-means. k-means is a classic quantization technique that quantizes data into k regions (clusters). Data is coded using a k-bit binary code in which only one bit is one (representing the appropriate cluster). Although k-means works well for small values of k, it becomes intractable for large k.
An alternative to one-hot encoding using k-means is binary coding. One can code the k clusters with log2(k)-dimensional binary codes¹ by relaxing the one-hot encoding constraint and allowing multiple bits to be one. This is equivalent to partitioning the space into two regions log2(k) times. Many binary coding methods have been designed to address this problem [2, 5, 11, 16]. While binary coding is more scalable, it has a high reconstruction error.

¹In this paper, without loss of generality, we assume that k is selected such that log2(k) is a natural number.
Binary coding can be relaxed by allowing each dimension to be n-ary instead of binary (i.e. take on integer values between 1 and n). In this case, k clusters can be coded with m = log_n(k)-dimensional n-ary codes. We introduce a new categorization for methods that generate n-ary codes, exploring two general approaches to generate m-dimensional n-ary codes: (1) Subspace Clustering: the original feature space is divided into m subspaces and each subspace is quantized into n clusters. (2) Subspace Reduction: the dimensionality of the original feature space is reduced to m and each dimension is quantized into n bins. Figure 1 illustrates these approaches. Multi-dimensional quantization methods (e.g. PQ, CK-means) [4, 12, 7] adopt the first approach to perform n-ary coding. They solve the problem for any n and m (including n = 2, which leads to binary coding based on the first approach). On the other hand, many binary coding methods (e.g. ITQ, LSH) [5, 16, 2, 11] are instances of the second approach; however, they are limited to the case where n = 2.
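As a quick sanity check on the bookkeeping above, a short sketch (the function names are ours, not from the paper):

```python
import math

def num_regions(n, m):
    """An m-dimensional n-ary code addresses n**m distinct regions."""
    return n ** m

def code_length(k, n):
    """Code dimensions needed to address k regions with an n-ary alphabet
    (assumes k is a power of n, as in the paper's footnote for n = 2)."""
    return int(round(math.log(k, n)))

k = num_regions(n=4, m=8)     # a 4-ary code of length 8
print(k)                      # 65536 regions
print(code_length(k, n=2))    # the same k needs a 16-bit binary code
```

Note that both codes spend the same bit budget: 8 dimensions of 2 bits each versus 16 dimensions of 1 bit each.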
Most recent papers on quantization [4, 12] compared their proposed methods with binary coding methods only with respect to Distance Estimation (i.e. typically employing exhaustive search over the data, where the approximated distance mimics the ordering of images by Euclidean distance in the original feature space). This leaves unanswered the question: which of binary or n-ary coding performs better for ANN search using Subset Indexing?
The contributions of this paper are twofold. First, a new general approach for multi-dimensional n-ary coding is introduced. Based on it, Linear Subspace Quantization (LSQ) is proposed as a new multi-dimensional n-ary encoder. Unlike previously proposed n-ary encoders, in which the Euclidean distance between n-ary codes is not preserved, the distances in LSQ-coded space correlate with the Euclidean distances in the original space. As a result, the codes can be used directly for learning tasks. Furthermore, LSQ does not make the restrictive assumption of dividing the space into independent subspaces, which is common in n-ary encoders. Experiments show that LSQ outperforms such encoders. Second, it is shown that n-ary coding does not always outperform binary coding in retrieval. We show that binary coding works better when Subset Indexing is used and present an explanation based on the two approaches to coding. To the best of our knowledge, this has not been identified previously; however, it is very important for large-scale retrieval.
The rest of the paper is organized as follows. In Section 2, the general formulation for both Subspace Clustering and Subspace Reduction is presented; additionally, the LSQ coding method is described, and its relation to other methods and its properties are discussed. Section 3 describes the ways n-ary and binary coding methods can be exploited in combination with distance estimation and subset indexing to reduce search cost in retrieval. Experiments are reported in Section 4. Finally, Section 5 concludes the paper.
2 n-ary Coding
ANN search methods discretize the feature space into a set of disjoint regions. n-ary coding can be used for this purpose. An n-ary code of length m is an m-dimensional vector in {1, ..., n}^m. The goal is to transform data into m-dimensional n-ary codes that reconstruct the original data accurately.
First, consider constructing a one-dimensional n-ary code. A common objective for quantization methods is to minimize the reconstruction error, referred to as quantization error. In other words, given a set of data points X = [x_1, ..., x_N] in R^{d x N} where each column is a data point x_i, the quantization objective can be expressed as:
\min_{q} \sum_{i=1}^{N} \| x_i - q(x_i) \|_2^2 \qquad (1)
where q maps a vector (a column of X) into one element of a finite set of vectors C = {c_1, ..., c_k} in R^d referred to as a codebook. The index of the codebook vector assigned to a data point is its one-dimensional n-ary code. k-means optimizes this objective when the size of the codebook is equal to k.
Using the one-hot encoding notation, the optimization in (1) can be written as follows:
\min_{C, S} \| X - C S \|_F^2 \qquad (2)
where C in R^{d x k} is the codebook matrix and S in {0,1}^{k x N} is a binary matrix in which each column is a 1-of-k selector: all of its elements but one are zero.
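The matrix form of Eq. (2) can be made concrete with a minimal NumPy sketch (the helper names are ours; columns of X are points, C a codebook, S the 1-of-k selector matrix):

```python
import numpy as np

def assign_one_hot(X, C):
    """Build the 1-of-k selector matrix S: each column selects the codeword
    of C nearest to the corresponding column of X."""
    # squared distance from every point to every codeword, shape (k, N)
    d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
    idx = d2.argmin(axis=0)
    S = np.zeros((C.shape[1], X.shape[1]))
    S[idx, np.arange(X.shape[1])] = 1.0
    return S

def quantization_error(X, C, S):
    """Reconstruction error ||X - C S||_F^2 of Eq. (2)."""
    return np.linalg.norm(X - C @ S, ord="fro") ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))   # 100 two-dimensional points
C = rng.normal(size=(2, 4))     # codebook with k = 4 codewords
S = assign_one_hot(X, C)
err = quantization_error(X, C, S)
```

Assigning each point to its nearest codeword minimizes the objective over S for a fixed codebook, which is exactly the assignment step of k-means.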
In order to generalize one-dimensional n-ary codes to m-dimensional codes, we explore two approaches: Subspace Clustering and Subspace Reduction. Although the former has been explored in the literature, the latter has not. Without loss of generality, we assume that the data are mean-centered and scaled by mapping the data to the unit hypersphere, so that every feature lies in [-1, 1]. In [15, 5], it is shown that it is very beneficial to normalize the data to the unit hypersphere.
2.1 Subspace Clustering
Here, to generate m-dimensional n-ary codes, the original feature space is divided into m subspaces and each subspace is discretized into n regions. To this end, in (2), the codebook can be partitioned into m sub-codebooks of n codewords each, and each selector column can be allowed to include m nonzero elements (one per sub-codebook), as follows:
\min_{\{C_j\}, \{S_j\}} \Big\| X - \sum_{j=1}^{m} C_j S_j \Big\|_F^2 \qquad (3)
Here C_j in R^{d x n} and S_j in {0,1}^{n x N} are the codebook and its related one-hot encoding in the j-th subspace. In general, the optimization of (3) is intractable. As a result, Product Quantization [7] and Cartesian k-means [12] solve a constrained version of this problem where the subspaces created by the C_j's are orthogonal, i.e. C_i^T C_j = 0 for i != j. Intuitively, the original space is divided into m independent subspaces and each is clustered into n regions. We next present another approach to n-ary coding in which no such constraint is imposed.
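To make the orthogonal-subspace special case concrete, here is a minimal PQ-style encoder sketch (our own illustrative code, using the simplest choice of axis-aligned subspaces):

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode x as an m-dimensional n-ary code: split x into m subvectors and
    record, per subspace, the index of the nearest of its n codewords.
    codebooks: list of m arrays, each of shape (n, d/m)."""
    code = []
    for sub, cb in zip(np.split(x, len(codebooks)), codebooks):
        d2 = ((cb - sub) ** 2).sum(axis=1)   # distance to each of the n codewords
        code.append(int(d2.argmin()))
    return code

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(8, 4)) for _ in range(2)]   # m = 2, n = 8, d = 8
code = pq_encode(rng.normal(size=8), codebooks)           # two indices in 0..7
```

In PQ the codebooks come from running k-means independently in each subspace; CK-means additionally learns a rotation of the space before splitting.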
2.2 Subspace Reduction
Subspace Reduction maps the data into an m-dimensional space and discretizes each dimension into n bins. The goal is to perform this discretization so as to minimize the reconstruction error in the original space. Formally, the optimization problem can be written as follows:
\min_{f, g} \big\| X - g\big(Q_n(f(X))\big) \big\|_F^2 + \lambda\, r(g) \qquad (4)
where f: R^d -> R^m is the mapping function and is applied to each column of X, and g: R^m -> R^d is the reconstruction function, which projects the data back to the d-dimensional space in which the reconstruction error is computed. In order to prevent overfitting, the reconstruction function must be regularized: r is a regularizing function that limits the variations in g, and lambda is a parameter controlling the amount of regularization. Q_n is a uniform quantizer that is applied to each element of its input and is defined as:
Q_n(x) = \arg\min_{v \in V_n} | x - v | \qquad (5)
where V_n is a set of n uniformly distributed values in [-1, 1]. In other words, Q_n is a general quantizer that maps any real value in [-1, 1] into one of n uniformly distributed values in [-1, 1]. For example, Q_2 is the sign function, and Q_3 maps its input into one of the three values {-1, 0, 1}.
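A direct implementation of the quantizer in Eq. (5) (a sketch; `uniform_quantizer` is our name for Q_n):

```python
import numpy as np

def uniform_quantizer(x, n):
    """Q_n: snap each value in [-1, 1] to the nearest of n uniformly spaced
    levels in [-1, 1]. Q_2 behaves like the sign function; Q_3 maps into
    {-1, 0, 1}."""
    levels = np.linspace(-1.0, 1.0, n)
    idx = np.abs(np.asarray(x, dtype=float)[..., None] - levels).argmin(axis=-1)
    return levels[idx]

uniform_quantizer([-0.9, 0.1, 0.8], 3)   # -> [-1., 0., 1.]
```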
To summarize, optimizing (4) identifies a mapping and a reconstruction function such that the quantized data in the space generated by the mapping function can be reconstructed accurately by the reconstruction function. It should be noted that an m-dimensional n-ary code is generated by Q_n(f(x)).
2.2.1 Linear Subspace Quantization (LSQ)
LSQ is a multi-dimensional n-ary coding method based on Subspace Reduction where linear functions are used as the mapping and reconstruction functions in (4). Assume that f(x) = Ax and g(y) = By, where A in R^{m x d} and B in R^{d x m}. Employing the Frobenius norm as the regularizing function, the optimization problem in (4) becomes:
\min_{A, B} \big\| X - B\, Q_n(A X) \big\|_F^2 + \lambda \| B \|_F^2 \qquad (6)
To solve this problem, we propose a two-step iterative algorithm; its convergence is subsequently proven.
Learning LSQ:
The optimization over A and B in (6) can be solved by a two-step iterative algorithm (i.e. fixing one variable and updating the other). The steps are as follows:

Fix A and update B: For a fixed A, define Ŷ = Q_n(AX); then we have a closed-form solution for B as
B = X \hat{Y}^T \big( \hat{Y} \hat{Y}^T + \lambda I \big)^{-1} \qquad (7)
Fix B and update A: In this step, A is updated as:
A = B^+ \qquad (8)
where B^+ is the Moore-Penrose pseudoinverse of B. In the following, we prove that the pseudoinverse is an optimal solution for (6) when B is fixed.
The algorithm iterates between steps 1 and 2 until there is no progress in minimizing (6).
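A minimal NumPy sketch of the two-step algorithm, assuming the ridge-regression update for B (Eq. 7) and the pseudoinverse update for A (Eq. 8); all names are illustrative:

```python
import numpy as np

def train_lsq(X, m, n, lam=1e-3, iters=20, seed=0):
    """Alternate the two LSQ steps for Eq. (6):
    min_{A,B} ||X - B Q_n(A X)||_F^2 + lam ||B||_F^2.
    X: d x N data, mean-centered and scaled into [-1, 1]."""
    d, _ = X.shape
    levels = np.linspace(-1.0, 1.0, n)
    Qn = lambda Y: levels[np.abs(Y[..., None] - levels).argmin(axis=-1)]
    A = np.random.default_rng(seed).normal(size=(m, d))
    for _ in range(iters):
        Yhat = Qn(A @ X)                       # step 1: fix A, ridge solve for B
        B = X @ Yhat.T @ np.linalg.inv(Yhat @ Yhat.T + lam * np.eye(m))
        A = np.linalg.pinv(B)                  # step 2: fix B, set A = B^+
    return A, B, Qn

rng = np.random.default_rng(0)
X = np.tanh(rng.normal(size=(8, 200)))         # toy data in [-1, 1]
A, B, Qn = train_lsq(X, m=4, n=4)
err = np.linalg.norm(X - B @ Qn(A @ X)) ** 2   # final reconstruction error
```

With lam > 0 the matrix inverted in step 1 is always positive definite, so the update is well defined even when the quantized codes are rank-deficient.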
Convergence of LSQ:
In order to prove the convergence of the algorithm, we show that both steps reduce the objective value. The optimality of the first step follows from simple linear algebra, since (7) is the closed-form solution of a ridge regression problem. Here, we focus on proving that the second step reduces the objective value.
Given that the regularization term does not depend on A, the solution of the optimization in (6) for fixed B is equivalent to the solution of the following problem:
\min_{A} \big\| X - B\, Q_n(A X) \big\|_F^2 \qquad (9)
Defining Z = Q_n(AX), the optimal solution for Z can be formulated as:
Z^* = \arg\min_{Z \in V_n^{m \times N}} \| X - B Z \|_F^2 \qquad (10)
It should be noted that the optimal solution is not unique. Therefore, let S* denote the optimal solution set of (10). The goal is to prove that Q_n(B^+X) is in S*.
Let Ẑ = Q_n(B^+X). We first prove that Ẑ is in S*. Suppose, to the contrary, that Ẑ is not in S*. Consequently, there should be at least one Z* in S* and one entry (i, j) such that ẑ_ij != z*_ij. Since z*_ij belongs to an optimal solution of the optimization in (10), its corresponding objective value should be less than that of any other feasible point. This leads to the conclusion that |(B^+X)_ij - z*_ij| < |(B^+X)_ij - ẑ_ij| (note that even if more than one element differs between Ẑ and Z*, the inequality holds for at least one of them). However, this contradicts the definition of Q_n in (5), since Q_n should map (B^+X)_ij into its nearest value in V_n (it should be noted that z*_ij is in the range of Q_n). So for any Z* in S* and any (i, j), |(B^+X)_ij - ẑ_ij| <= |(B^+X)_ij - z*_ij|, and hence the objective value of Ẑ is no greater than that of Z*. Considering the definitions of Ẑ and S* completes the proof that Ẑ is in S*.
Finally, since both steps in our optimization reduce the objective value, LSQ converges to a local optimum of (6).
Relation to ITQ:
ITQ is a special case of LSQ when n = 2 and A, B are rotation matrices with B = A^T. Our experiments show that the binary codes generated by LSQ lead to higher accuracies than the binary codes generated by ITQ.
Geometrical Interpretation:
LSQ finds a linear transformation of a quantized hypercube that best fits the data. Figure 2 illustrates a simple 2D example in which the quantizer is fit to the data by a rotation.
2.2.2 LSQ as an n-ary Embedding
While binary encoding techniques try to minimize the reconstruction error, the resulting codes also preserve similarities between samples; in other words, the Hamming distance in the binary space approximates the Euclidean distance in the original feature space. As a consequence of this property, these binary codes can be exploited as feature vectors for learning tasks in the embedded space. Many recent approaches based on this property have been proposed to make learning more efficient [16, 17].
In subspace clustering methods (e.g. CK-means), the cluster indices generated by the quantizer cannot be viewed as a similarity-preserving embedding. This is due to the fact that there are no constraints on assigning these indices to clusters. In subspace reduction methods (e.g. LSQ), each dimension of an n-ary code has a finite (discrete) set of real values as its domain. For each dimension, the distances between these discrete values correlate with the distances between the data points in the original feature space along the direction of that dimension. Therefore, the Euclidean distance between the quantized data points correlates with the Euclidean distance in the original feature space.
One could post-process CK-means to generate similarity- (distance-) preserving codes by assigning appropriate indices to cluster centers after completion of the training stage. These indices can be obtained by finding a 1D subspace for each of the subspaces generated by CK-means; a simple model could compute PCA over the cluster centers in each subspace to reduce them to 1D real values. In Section 4.5, a classification experiment is performed in which the n-ary codes are used as features. The results show that, as an embedding, LSQ outperforms CK-means by a large margin even after refining the CK-means index assignments to clusters.
3 NN Retrieval using Data Encoding
A large source of computational cost in nearest neighbor search is the distance computation between the query and all the samples in the dataset. In order to speed up NN search, one can either speed up the computation of the distance function and/or reduce the number of distance computations by limiting the search space for a given query. We refer to the first strategy as Distance Estimation and the second as Subset Indexing. In the following subsections, we show how Subspace Clustering and Subspace Reduction coding techniques can be used for each of these strategies.
3.1 Retrieval by Distance Estimation
Data coding can reduce the cost of distance computation since the Euclidean distance can be efficiently estimated in the coding space.
Distance estimation using Subspace Clustering n-ary codes:
Once data is coded, the Euclidean distance between two points can be estimated as the sum of distances between the cluster centers assigned to those data points in each subspace [7]. This is known as the symmetric distance. The distances between the cluster centers in each subspace can be precomputed in an n x n table. Computing the symmetric distance can then be implemented efficiently by m lookups and additions of table elements, one for each subspace. More formally,
\hat{d}(x, y) = \sum_{j=1}^{m} D_j\big( i_j(x),\, i_j(y) \big) \qquad (11)
where i_j(x) is the index of the cluster to which the projection of x into the j-th subspace belongs, and D_j is the precomputed distance table for the j-th subspace. If we consider x as the query and y as a data point from the database, the rows D_j(i_j(x), .) can be precomputed. Therefore the complexity is O(Nm) for each query, where N is the total number of points in the database.
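A sketch of the table-based estimate in Eq. (11), using squared Euclidean distances and illustrative names:

```python
import numpy as np

def build_tables(codebooks):
    """Per subspace, precompute the n x n table of squared distances between
    its n codewords (the D_j of Eq. (11))."""
    tables = []
    for cb in codebooks:                      # cb: (n, d/m) codewords
        diff = cb[:, None, :] - cb[None, :, :]
        tables.append((diff ** 2).sum(axis=-1))
    return tables

def symmetric_distance(code_x, code_y, tables):
    """Estimated squared distance: m table lookups plus m - 1 additions."""
    return sum(D[i, j] for D, i, j in zip(tables, code_x, code_y))

rng = np.random.default_rng(2)
codebooks = [rng.normal(size=(4, 3)) for _ in range(2)]   # m = 2, n = 4
tables = build_tables(codebooks)
d_hat = symmetric_distance([0, 3], [1, 2], tables)
```

The tables depend only on the codebooks, so they are built once offline and shared by every query.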
Distance estimation using subspace reduction n-ary codes:
As mentioned earlier, the Euclidean distance between data quantized by subspace reduction approximates the Euclidean distance in the original feature space. Therefore we need only compute the distance between coded values. This has complexity O(Nm), the same as the complexity of subspace clustering.
Distance estimation using binary codes:
For binary codes, the Hamming distance is used as the distance metric. Computing Hamming distances between binary codes requires only XOR and popcount operations, giving O(N) very cheap bit operations for each query.
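The Hamming computation amounts to an XOR and a popcount per code (a sketch using Python ints as packed bit vectors):

```python
def hamming_distance(a, b):
    """Hamming distance between two binary codes packed into integers:
    XOR the codes, then count the set bits (popcount)."""
    return bin(a ^ b).count("1")

hamming_distance(0b101100, 0b100110)   # -> 2
```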
3.2 Retrieval by Subset Indexing
Another way to speed up nearest neighbor search is to limit the search space. Hashing techniques [2] and tree-based methods [14] limit the search space by constraining search to a subset of samples in the database. This is accomplished by indexing the data into a data structure (e.g. hash tables or a search tree) at training time. Multi-index hashing [6, 13] is one such data structure that can be used for binary and n-ary codes.
Multi-index hashing using n-ary codes:
In this approach, for m-dimensional n-ary coding, we create an index table for each dimension. Each table has n entries, one for each of the n values a code dimension can take; an entry stores the list of indices of those data points whose code has that value in that dimension. At query time, for each dimension of the code, a set of data indices is retrieved. Figure 3(a) illustrates this technique. Each index in the union of these sets is assigned a score between 1 and m indicating from how many dimensions it has been retrieved. The samples with higher scores are more likely to be similar to the query sample. By sorting the indices based on their scores, we can choose the top K samples as the NNs. If the total number of retrieved indices is less than K, we change the value in the dimension of the query code that has minimum distance to the quantized query point in the original space, retrieve a new set, and repeat the process until the total size of the retrieval set is greater than or equal to K.
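The procedure above (one table per code dimension, score = number of matched dimensions) can be sketched as follows; the names are illustrative and the K-expansion fallback is omitted for brevity:

```python
from collections import defaultdict

def build_index(codes):
    """One hash table per code dimension: tables[j][v] lists the ids of all
    database items whose j-th code symbol equals v."""
    tables = [defaultdict(list) for _ in range(len(codes[0]))]
    for item_id, code in enumerate(codes):
        for j, v in enumerate(code):
            tables[j][v].append(item_id)
    return tables

def query_index(tables, qcode):
    """Score each candidate by how many of the m dimensions it shares with
    the query code, then rank candidates by decreasing score."""
    scores = defaultdict(int)
    for j, v in enumerate(qcode):
        for item_id in tables[j].get(v, []):
            scores[item_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

tables = build_index([(0, 1), (0, 2), (3, 1)])   # three items, m = 2 dimensions
ranked = query_index(tables, (0, 1))              # item 0 matches both dimensions
```

Only candidates sharing at least one code symbol with the query are ever touched, which is what makes the search sublinear in practice.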
Multi-index hashing with binary codes:
Similar to n-ary codes, binary codes can be used for multi-index hashing. In this case, however, each set of log2(n) consecutive bits is grouped together to create the indices for accessing the tables. This partitioned binary code can be seen as an n-ary code, so the same technique can be applied for multi-index hashing as discussed previously. This case is shown in Figure 3(b).
3.2.1 Subset Indexing: Binary or n-ary Coding?
As mentioned earlier, n-ary coding does not always outperform binary coding for large-scale retrieval. More precisely, when Subset Indexing is used to reduce the search cost, binary coding achieves better search accuracy. This is due to the fact that quantization does not necessarily preserve the similarities (or distances) between data points. In other words, a good quantizer that minimizes the quantization error in (1) does not always preserve relative distances between data points. Formally:
\min_q \sum_{i} \| x_i - q(x_i) \|_2^2 \;\;\not\Rightarrow\;\; \| q(x_i) - q(x_j) \|_2 \propto \| x_i - x_j \|_2 \quad \forall i, j \qquad (12)
This is important when retrieval is carried out by subset indexing: binary codes may retrieve the nearest neighbors better than n-ary codes. Figure 4 illustrates an example of 2-dimensional 8-ary codes and their corresponding binary codes, which have 6 bits (2 x log2 8). Each bit is generated by a line, based on which side of the line the point lies. The green dots are the points in the database and the red diamond is a query point; the binary code of the query sample is shown in the figure. In the subspace clustering view, each dimension is clustered into 8 clusters. In this case, all points in the yellow region will be retrieved by multi-index hashing, and as can be seen, none of the actual nearest neighbors is retrieved. But when we use the binary codes for multi-index hashing, all the actual nearest neighbors are retrieved: this is the blue region, i.e. the union of the regions created by the first three bits and the second three bits of the query code. Our experimental evaluation confirms that when subset indexing is used for retrieval, binary codes outperform n-ary codes. Although n-ary codes are more accurate for quantization, they are less accurate for ANN with subset indexing.
4 Experiments
We report experiments on three well-known datasets, namely GIST1M [7], CIFAR-10 [9], and the subset of ImageNet [3] used for the ILSVRC2012 challenge. GIST1M contains 1M base feature vectors, 500K training samples, and 1K query samples. For CIFAR-10, we randomly selected 20K samples as our training set, 500 samples as query images, and the remaining 39,500 images as the base samples; raw pixel values are used as features for this dataset. The ImageNet ILSVRC2012 dataset consists of 500K training samples, 250K base samples, and 1K query images. We used a ConvNet as the state-of-the-art feature extractor for this dataset; the ConvNet features are extracted with Caffe [8].
Following [12], we used recall as the performance measure for retrieval. The training set is used to train the coding model, and the learned model is applied to code the base and query sets. For each point in the query set, we retrieve R items and report the recall of its true nearest neighbors at R (Recall@R). By varying R we draw the recall curves.
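The measure can be sketched as follows (our helper; `retrieved` is a ranked id list for one query, `ground_truth` its true nearest neighbors in the original feature space):

```python
def recall_at_R(retrieved, ground_truth, R):
    """Fraction of the query's true nearest neighbors that appear in the
    top-R retrieved list."""
    return len(set(retrieved[:R]) & set(ground_truth)) / len(ground_truth)

recall_at_R([7, 3, 9, 1], ground_truth=[3, 1], R=2)   # -> 0.5
```

The reported curves average this quantity over all queries at each R.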
As mentioned earlier, retrieval can be made faster using two approaches: distance estimation and subset indexing. The performance of different methods can vary with respect to which approach is used; therefore, each method is examined with respect to both, and an analysis is presented. The nearest neighbors in the original feature space are defined as the ground truth for each query image. To make the comparison fair, in each experiment the number of bits that can be used by each coding method is limited to the same fixed budget; e.g. an m-dimensional n-ary code requires m log2(n) bits of memory (log2(n) bits per code dimension).
4.1 Retrieval using Distance Estimation
Figure 5 shows the Recall@R curves on the different datasets using a budget of 256 bits. In this figure, LSQ(N) and LSQ(B) refer to the n-ary and binary versions of the LSQ method, respectively. The Recall@R curves are shown for different numbers of bits per code dimension, which controls the number of quantization steps for n-ary encoders (e.g. for LSQ(N)-5 or LSQ(B)-5 the quantizer has 2^5 = 32 levels, i.e. 5 bits). As can be seen, the performance of n-ary codes is better than that of binary codes. Also, LSQ outperforms CK-means on all three datasets.
Figure 6 explores the effect of the number of bits on the different methods. We fixed the number of bits per code dimension to 5 (e.g. the CK-means algorithm would learn 32 clusters per segment) and report the area under the Recall@R curve. Again, LSQ performs better than CK-means.
4.2 Retrieval using Subset Indexing
As discussed in Section 3.2, either binary or n-ary coding can be used to speed up search with Subset Indexing. This approach limits the search to a small number of samples by indexing subsets of the database. Here, the performance of binary and n-ary coding is compared. We compare the retrieval results of the best n-ary encoding for this task (CK-means) to the best binary coding (the binary version of LSQ).
Figure 7 shows the Recall@R curves for this experiment with varying numbers of bits per code dimension for a fixed budget of 256 bits. In Nary-k, k bits are used for quantizing each dimension and for indexing in the multi-index hashing method (i.e. 2^k quantization steps for each dimension). Similarly, in Binary-k, k consecutive bits are used for indexing in the hashing method. The effect of changing the encoder's bit budget on the retrieval task can be seen in Figure 8. These figures illustrate that the binary encoding techniques outperform the n-ary encoders when the subset indexing technique is used, as discussed in Section 3.2.
Discussion: These experiments confirm that when retrieval is performed by distance estimation, it is better to use n-ary coding based on subspace reduction (e.g. LSQ). On the other hand, when subset indexing is used for retrieval, binary coding outperforms n-ary coding.
4.3 Comparison of binary coding methods
Both CK-means and LSQ can be viewed as generalizations of binary encoding in which the number of quantization steps can be more than two. Here, the number of quantization steps is set to two, and the binary versions of LSQ and CK-means (namely LSQ(B) and OK-means, respectively) are compared with ITQ using subset indexing. Figure 9 shows the area under the recall-precision curve for these three binary coding methods under varying bit budgets. As can be seen, the binary version of LSQ outperforms ITQ and the binary version of CK-means.
4.4 Convergence of the algorithms
Figure 10 shows the convergence of the different binary coding methods; GIST1M is used for this experiment. As can be seen, LSQ converges much faster than OK-means. Also note that the final reconstruction error of LSQ is much smaller than that of ITQ and OK-means, reflecting the fact that LSQ reconstructs the data more accurately using the same memory budget.
4.5 n-ary Codes as Feature Vectors
The codes constructed by LSQ can be used as feature vectors for learning tasks. Figure 11 shows the performance on a classification task using different codings as features. As proposed in Section 2.2.2, for CK-means we refine the index assignments to clusters by mapping the cluster centers in each subspace into a one-dimensional space using PCA and converting each dimension of the code to the corresponding value in this 1D space. It can be seen that our proposed quantization method outperforms CK-means even after refining the CK-means index assignments.
5 Conclusion
We focused on the problem of large-scale retrieval using ANN. A new general approach for multi-dimensional n-ary coding, Linear Subspace Quantization (LSQ), was introduced for ANN. LSQ achieves lower reconstruction error than other n-ary coding methods. Furthermore, it preserves the similarities in the original space, which is important when it is used directly for learning tasks. Experiments show that LSQ outperforms other binary and n-ary coding methods on large-scale image retrieval. We also compared the performance of binary and n-ary coding methods for this task. We showed that n-ary coding outperforms binary coding when distance estimation is used to reduce the search computation cost. However, in combination with Subset Indexing, interestingly, binary coding works better for retrieval.
References
 [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 459–468. IEEE, 2006.
 [2] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, 2004.
 [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. 2014.
 [5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, 2011.
 [6] D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on, pages 722–731. IEEE, 1994.
 [7] H. Jégou, M. Douze, C. Schmid, et al. Searching with quantization: approximate nearest neighbor search using short codes and distance estimators. 2009.
 [8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [9] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
 [10] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2130–2137. IEEE, 2009.
 [11] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, 2011.
 [12] M. Norouzi and D. J. Fleet. Cartesian k-means. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3017–3024. IEEE, 2013.
 [13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in Hamming space with multi-index hashing. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, 2012.
 [14] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1986.
 [15] M. Rastegari, S. Fakhraei, J. Choi, D. W. Jacobs, and L. S. Davis. Comparing apples to apples in the evaluation of binary coding methods. CoRR, 2014.
 [16] M. Rastegari, A. Farhadi, and D. A. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV (6), 2012.
 [17] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In Computer Vision–ECCV 2010, pages 776–789. Springer, 2010.