Generalized Residual Vector Quantization for Large Scale Data
Abstract
Vector quantization is an essential tool for tasks involving large scale data, for example, large scale similarity search, which is crucial for content-based information retrieval and analysis. In this paper, we propose a novel vector quantization framework that iteratively minimizes the quantization error. First, we provide a detailed review of a relevant vector quantization method named residual vector quantization (RVQ). Next, we propose generalized residual vector quantization (GRVQ) to further improve over RVQ. Many vector quantization methods can be viewed as special cases of our proposed framework. We evaluate GRVQ on several large scale benchmark datasets for large scale search, classification and object retrieval, and compare GRVQ with existing methods in detail. Extensive experiments demonstrate that our GRVQ framework substantially outperforms existing methods in terms of quantization accuracy and computation efficiency.
Shicong Liu, Junru Shao, Hongtao Lu* (*Corresponding author. This paper is supported by NSFC (No. 61272247, 61533012, 61472075), the 863 National High Technology Research and Development Program of China (SS2015AA020501) and the Major Basic Research Program of Shanghai Science and Technology Committee (15JC1400103).)
{artheru, yz_sjr, htlu}@sjtu.edu.cn 
Key Laboratory of Shanghai Education Commission for 
Intelligent Interaction and Cognitive Engineering, 
Department of Computer Science and Engineering, 
Shanghai Jiao Tong University, P.R.China 
Index Terms— Vector Quantization, Large Scale Data, Similarity Search, Nearest Neighbor Search
1 Introduction
With the rapid development of data collection and mining techniques, there is an urgent need for powerful algorithms for data compression, storage and retrieval. Specifically, compressing a high-dimensional vector and performing similarity search without decompression on large scale data have become crucial in many fields, e.g., object detection [1], image and video retrieval [2], and deep neural networks [3].
Vector quantization (VQ) based methods, e.g., product quantization (PQ) [4], optimized product quantization (OPQ) [5], additive quantization (AQ) [13], and composite quantization (CQ) [6], are popular and successful methods for the tasks above. Vector quantization is essentially lossy compression of high-dimensional vectors. It compresses a vector into a short encoding representation using multiple learned codebooks, and approximately reconstructs the vector from the codewords corresponding to the encoding. Quantization-based algorithms have three major advantages: (1) memory consumption for representing high-dimensional vectors is significantly reduced; (2) they allow efficient similarity computation, e.g., one can compute the Euclidean distance or scalar product between an uncompressed vector and a large set of compressed vectors via asymmetric distance computation (ADC) [4] or its variants like optimized asymmetric distance [7], hence approximate nearest neighbor (ANN) search can be greatly accelerated; (3) the encodings are simple enough to support sophisticated data structures and heuristic non-exhaustive search schemes like the inverted file system with asymmetric distance computation (IVFADC) [4], the inverted multi-index [8] and locally optimized product quantization [9]. These are capable of storing one billion compressed vectors in memory and conducting a retrieval in a few milliseconds even on a modern laptop.
In this paper, we propose generalized residual vector quantization (GRVQ) to further improve over existing vector quantization methods. The main idea is to iteratively select a codebook and optimize it with the current residual vectors, then re-quantize the dataset to obtain the new residual vectors for the next iteration. GRVQ shares a similar motivation with traditional residual vector quantization (RVQ) ([10], [11]). RVQ uses an additive model to quantize vectors, and adopts a multi-stage residual clustering scheme to learn codebooks. However, RVQ fails to generate effective encodings for high-dimensional data [12], which manifests as a rapid drop in the information entropy of the encodings obtained at each added stage. Compared to RVQ, our GRVQ:
Overcomes the downsides of RVQ with transition clustering, which substantially improves the performance of k-means on data of high intrinsic dimensionality. We also propose a multi-path encoding scheme to further lower the quantization error.
Generalizes RVQ, in that RVQ can be viewed as a special case of GRVQ that performs codebook optimization on an "all-zero" codebook at every stage.
Compared to the existing vector quantization methods, GRVQ has the following merits:

Quantizing a vector with the additive model is NP-hard. Though many approaches have been proposed, e.g., iterated conditional modes [14] and AQ-encoding [13], they are too slow for practical applications. The codebooks obtained by GRVQ are variance-descending, enabling a much more efficient and practical beam search encoding scheme.
The quantization accuracy and computation efficiency of our method are validated on three large scale datasets commonly used for evaluating vector quantization methods. We also demonstrate the superior performance of GRVQ on classification tasks.
2 Related Methods
2.1 Vector Quantization
Vector quantization (VQ) techniques are used to perform lossy compression on a large scale dataset. Denote the database as a set of $n$ $d$-dimensional vectors $\mathcal{X}=\{x_1,\cdots,x_n\}$ for VQ to compress. VQ learns a codebook $C$, which is a list of $K$ codewords: $C=[c_1,c_2,\cdots,c_K]$. Then VQ uses a mapping function $i(\cdot):\mathbb{R}^d\rightarrow[K]$ (where $[K]$ denotes $\{1,2,\cdots,K\}$) to encode a vector: $x\mapsto i(x)$. The quantizer is defined as $q(x)=c_{i(x)}$, meaning $x$ is approximated as $q(x)$ for later use. Vector quantization minimizes the quantization error, which is defined as

$$E = \frac{1}{n}\sum_{x\in\mathcal{X}} \|x - q(x)\|^2. \qquad (1)$$
Minimizing Eqn. 1 directly leads to the classical k-means clustering algorithm [15]. VQ essentially partitions the data space into many Voronoi cells, and quantizes vectors to the centroids of the cells. The k-means model is simple and intuitive; however, the cost of training and storing the centers grows linearly with $K$, limiting the quantization accuracy.
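For concreteness, the k-means quantizer and the quantization error of Eqn. 1 can be sketched in a few lines of NumPy (an illustrative sketch; the function names are ours, not part of any library):

```python
import numpy as np

def kmeans_vq(X, K, iters=20, seed=0):
    """Plain k-means vector quantizer: learn K codewords and return them
    together with the per-vector encodings (nearest-codeword indices)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)].copy()  # init from data
    for _ in range(iters):
        # assignment step: each vector goes to its nearest codeword
        idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        # update step: move each codeword to the mean of its assigned vectors
        for k in range(K):
            if (idx == k).any():
                C[k] = X[idx == k].mean(0)
    # final reassignment so the returned encodings match the final codebook
    idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return C, idx

def quantization_error(X, C, idx):
    """Mean squared error E = (1/n) * sum ||x - c_{i(x)}||^2 (Eqn. 1)."""
    return ((X - C[idx]) ** 2).sum(1).mean()
```

Each Lloyd iteration can only decrease the objective, so more iterations never increase the quantization error.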
2.2 Residual Vector Quantization
With a compositional model, one can represent cluster centers more efficiently. A number of compositional models have been proposed, e.g., product quantization (PQ) [4], optimized product quantization (OPQ) [5], additive quantization (AQ) [13], and composite quantization (CQ) [6]. Here we focus on residual vector quantization (RVQ) ([10], [16]) for lossy compression of high-dimensional data. RVQ is a common technique to approximate the original data with several low complexity quantizers, instead of one prohibitively high complexity quantizer. The RVQ algorithm learns the quantizers iteratively, stage by stage. In the $m$-th stage, RVQ obtains the current residuals $r_{m-1}(x) = x - \sum_{j=1}^{m-1} c_{j,i_j(x)}$ with the previously learned quantizers $q_1,\cdots,q_{m-1}$. Next, it performs classical k-means to learn the $m$-th quantizer $q_m$ with codebook $C_m=[c_{m,1},\cdots,c_{m,K}]$ for the following objective:

$$\min_{C_m} \sum_{x\in\mathcal{X}} \| r_{m-1}(x) - q_m(r_{m-1}(x)) \|^2. \qquad (2)$$
The original vectors are quantized with the following additive model over the $M$ learned codebooks:

$$x \approx \hat{x} = \sum_{m=1}^{M} c_{m,i_m(x)}. \qquad (3)$$
The above additive model is also used in AQ, CQ, tree quantization [17], etc. Such a model is beneficial for applications like high-dimensional nearest neighbor retrieval; for example, asymmetric distance computation (ADC) [4][13] allows exhaustive nearest neighbor search by efficiently computing the Euclidean distance between an uncompressed query vector $y$ and the compressed dataset vectors with the following equation:

$$\|y-\hat{x}\|^2 = \|y\|^2 - 2\sum_{m=1}^{M}\langle y,\, c_{m,i_m(x)}\rangle + \|\hat{x}\|^2. \qquad (4)$$
To retrieve the nearest neighbors, we first compute and store $\langle y, c_{m,k}\rangle$ for $m=1,\cdots,M$ and $k=1,\cdots,K$ in a lookup table. As for the term $\|\hat{x}\|^2$, it can be computed during dataset compression and stored along with the compressed dataset. Then the approximate distance between $y$ and any dataset vector can be computed efficiently with $M$ table lookups and floating point additions plus the stored $\|\hat{x}\|^2$ term. CQ regularizes $\|\hat{x}\|^2$ to a fixed value to further reduce the cost of storing the compressed dataset and of computing the approximate distance; sparse composite quantization [18] proposes a method to accelerate the lookup table computation.
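The ADC procedure of Eqn. 4 can be sketched as follows (an illustrative NumPy sketch, assuming the codebooks, codes and precomputed reconstruction norms are given; the function name is ours):

```python
import numpy as np

def adc_distances(y, codebooks, codes, xhat_sqnorms):
    """Asymmetric distance computation for the additive model (Eqn. 4):
    ||y - x_hat||^2 = ||y||^2 - 2 * sum_m <y, c_{m,i_m}> + ||x_hat||^2.
    codebooks: list of M arrays of shape (K, d); codes: (n, M) int array;
    xhat_sqnorms: (n,) precomputed ||x_hat||^2 stored with the dataset."""
    # one lookup table per codebook: <y, c_{m,k}> for every codeword k
    luts = [Cm @ y for Cm in codebooks]                    # M tables of size K
    # gather and sum the M inner products for each compressed vector
    cross = sum(lut[codes[:, m]] for m, lut in enumerate(luts))
    return (y @ y) - 2.0 * cross + xhat_sqnorms
```

Because the lookup tables depend only on the query, the per-vector cost is independent of the dimensionality $d$.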
2.3 Disadvantages of Residual Vector Quantization
RVQ quantized vectors have relatively higher quantization errors than other vector quantization methods in high-dimensional space; as observed in Fig. 2, the performance gain of adding an additional stage drops quickly for RVQ.
We examine the encodings obtained by vector quantization from the viewpoint of information entropy. For an effective encoding, the information entropy of the encoding at any position should be high, and the mutual information of the encodings at different positions should be low. Note that this objective is explicitly considered in hashing methods like spectral hashing [19]. To formulate, denote the encoding at position $m$ as a discrete random variable $i_m$ with domain $[K]$, and $H(\cdot)$ as the information entropy of a random variable; we would like the following to hold:

$$H(i_m) = \log_2 K, \qquad I(i_m;\, i_{m'}) = 0 \;\; \text{for } m \neq m'. \qquad (5)$$
RVQ does not produce encodings with a high enough $H(i_m)$, as observed in Fig. 1. This is mainly because the intrinsic dimensionality of the residual vectors becomes higher with increasing stages [12], hence the traditional k-means algorithm fails to work. AQ [13] obtains a slightly higher $H(i_m)$ than RVQ, yet still much lower than other quantization methods like OPQ. Thus an improvement here should be beneficial.
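The empirical entropy $H(i_m)$ used in this analysis can be estimated directly from the codes (an illustrative sketch; the function name is ours):

```python
import numpy as np

def code_entropy(codes, K):
    """Empirical entropy H(i_m), in bits, of the encoding at each position m.
    An effective encoding over a K-word codebook should approach log2(K)."""
    out = []
    for m in range(codes.shape[1]):
        p = np.bincount(codes[:, m], minlength=K) / len(codes)
        p = p[p > 0]                       # drop unused codewords (0*log 0 = 0)
        out.append(float(-(p * np.log2(p)).sum()))
    return out
```

A uniformly used codebook achieves the maximum $\log_2 K$ bits, while a degenerate codebook that maps everything to one codeword has zero entropy.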
In addition, under an additive model, the quantization of a vector is actually a fully-connected discrete MRF problem ([6], [13]). However, RVQ does not consider this in codebook learning. Thus at each stage the codebook is not learned with the optimal input, which leads to an accumulating quantization error and impacts the overall quantization accuracy.
3 Generalized Residual Vector Quantization
We propose generalized residual vector quantization (GRVQ) to learn effective encodings with the additive model. We present the outline of GRVQ in Algorithm 1. GRVQ optimizes existing codebooks, or codebooks of zero vectors to learn from scratch. Formally, denote $i_1(x),\cdots,i_M(x)$ the encoding of $x$; the current residual of $x$ is:

$$r(x) = x - \sum_{m=1}^{M} c_{m,i_m(x)}. \qquad (6)$$
On each iteration, GRVQ randomly picks an $m$-th codebook to optimize. We first perform incremental clustering on an intermediate dataset $\mathcal{X}_m$, defined as:

$$\mathcal{X}_m = \{\, r(x) + c_{m,i_m(x)} : x \in \mathcal{X} \,\}. \qquad (7)$$
Then we optimize the codebook $C_m$ to better fit this dataset with the following objective function:

$$\min_{C_m} \sum_{\tilde{x}\in\mathcal{X}_m} \| \tilde{x} - q_m(\tilde{x}) \|^2. \qquad (8)$$
Finally, we re-encode the original dataset with the optimized codebooks, and obtain the residual vectors for the next iteration.
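One GRVQ iteration can be sketched as follows. This is an illustrative sketch, not the paper's exact implementation: the `refit` callback stands in for transition clustering (Sec. 3.1), and plain greedy sequential encoding (beam width 1) stands in for multi-path encoding (Sec. 3.2); all function names are ours.

```python
import numpy as np

def greedy_encode(X, codebooks):
    """Sequentially pick the nearest codeword in each codebook
    (multi-path encoding with beam width L = 1)."""
    R = X.copy()
    codes = np.empty((len(X), len(codebooks)), dtype=int)
    for m, Cm in enumerate(codebooks):
        d2 = ((R[:, None, :] - Cm[None, :, :]) ** 2).sum(-1)
        codes[:, m] = d2.argmin(1)
        R -= Cm[codes[:, m]]
    return codes, R                 # R holds the final residuals r(x)

def grvq_iteration(X, codebooks, codes, m, refit):
    """One GRVQ iteration: rebuild the intermediate dataset
    X_m = r(x) + c_{m, i_m(x)} (Eqn. 7), re-fit codebook m on it (Eqn. 8),
    then re-encode the whole dataset with the updated codebooks."""
    xhat = sum(Cb[codes[:, j]] for j, Cb in enumerate(codebooks))
    Xm = (X - xhat) + codebooks[m][codes[:, m]]
    codebooks[m] = refit(Xm, codebooks[m])
    return greedy_encode(X, codebooks)
```

Starting from all-zero codebooks and sweeping the codebooks once reproduces plain RVQ, which is exactly the special-case relationship described in the text.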
Traditional RVQ is a special case of GRVQ that is initialized with all-zero codebooks and sequentially optimizes each codebook once. Product quantization [4] and optimized product quantization [5] can be viewed as GRVQ with the constraint that each codebook only operates on specific dimensions.
3.1 Transition Clustering
The increased randomness of high-dimensional residual vectors leads to the failure of the traditional k-means algorithm [12]. To obtain a better clustering result, a typical approach is to cluster on a lower-dimensional subspace [20], e.g. the first $d'$ principal dimensions, with the objective function:

$$\min_{C_m} \sum_{\tilde{x}\in\mathcal{X}_m} \| P_{d'}(\tilde{x}) - q_m(P_{d'}(\tilde{x})) \|^2, \qquad (9)$$

where $P_{d'}(\cdot)$ projects a vector onto the first $d'$ PCA dimensions.
Note that the k-means algorithm on the entire $d$-dimensional space is the special case $d'=d$. Transition clustering seeks a transition from subspace clustering to full-dimensional clustering. We first use the PCA dimension-reduced subspace to initialize the clustering [21], then iteratively add more dimensions and warm-start the k-means algorithm with the clustering information obtained from the previous iteration, as it provides good starting positions [22]. To optimize a codebook for Eqn. 8, we perform the following:

1. Designate an increasing dimension sequence $d_1 < d_2 < \cdots < d_T = d$ (we fix the schedule parameters in our experiments);

2. Project $\mathcal{X}_m$ and $C_m$ into the PCA space of $\mathcal{X}_m$: $\tilde{x} \mapsto R\tilde{x}$ and $c_{m,k} \mapsto Rc_{m,k}$, where $R$ is the PCA rotation;

3. For $t = 1,\cdots,T$: perform warm-started k-means on the first $d_t$ dimensions of the projected data, initialized with the first $d_t$ dimensions of the current codewords, and update them with the resulting centroids;

4. Rotate back to finish the optimization: $c_{m,k} \mapsto R^{-1}c_{m,k}$.
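The steps above can be sketched as follows, with a generic warm-started Lloyd routine `kmeans(X, C0)` supplied by the caller. This is an illustrative sketch under the stated assumptions, not the paper's exact implementation; the function names are ours.

```python
import numpy as np

def transition_clustering(X, K, dims, kmeans):
    """Transition clustering sketch: cluster in a growing PCA subspace,
    warm-starting each k-means run with the previous centroids.
    dims is the increasing dimension schedule d_1 < ... < d_T = X.shape[1];
    kmeans(X, C0) runs Lloyd iterations from initial centroids C0."""
    # rotate the data into its PCA basis (dimensions sorted by variance)
    mu = X.mean(0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T
    rng = np.random.default_rng(0)
    C = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    for d in dims:
        # re-cluster using only the first d PCA dimensions; the remaining
        # coordinates of the centroids are left untouched until d grows
        C[:, :d] = kmeans(Z[:, :d], C[:, :d].copy())
    return C @ Vt + mu            # rotate centroids back to the input space
```

Clustering first in the high-variance directions gives k-means a good initialization before the noisier trailing dimensions are introduced.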
3.2 Multi-path Encoding
Encoding with the additive model is a fully-connected MRF problem ([13], [6]). Though it can be solved approximately by various existing algorithms, they are very time-consuming [17]. In this section we propose an efficient beam search method for GRVQ-optimized codebooks.
Denote $i_1(x),\cdots,i_M(x)$ the optimal encodings for $x$, which quantize $x$ minimizing the quantization error $\|x-\hat{x}\|^2$. Suppose we know the first $m-1$ optimal encodings $i_1(x),\cdots,i_{m-1}(x)$. To determine the $m$-th optimal encoding effectively, denote $e = x - \sum_{j<m} c_{j,i_j(x)}$ and $t = \sum_{j>m} c_{j,i_j(x)}$, and consider the quantization error as a function of $i_m(x)$:

$$E(i_m(x)) = \|e - c_{m,i_m(x)} - t\|^2 = \|e - c_{m,i_m(x)}\|^2 - 2\langle e - c_{m,i_m(x)},\, t\rangle + \|t\|^2. \qquad (10)$$

We seek the best $i_m(x)$ in $[K]$ to minimize $E$. In Eqn. 10, the cross term $-2\langle e - c_{m,i_m(x)},\, t\rangle$ cannot be computed, because $t$ is unknown to the encoding scheme; this leads to an error in estimating the best $i_m(x)$. A low variance of $t$ is required for neglecting this term. A simple way to achieve this is to rearrange the codebooks by the variance of their codewords in descending order. In fact, GRVQ naturally produces codebooks in descending order of codeword variance.
Then we adopt the idea behind the beam search algorithm to encode a vector $x$. That is, we sequentially encode $x$ with each codebook while maintaining a list of the $L$ best partial encodings of $x$. On each iteration, we enumerate all possible codewords of the next codebook, compute the distortion, and determine the new $L$ best encodings. This can be done efficiently with lookup tables: the cost of encoding with the $m$-th codebook grows linearly in $m$, $K$ and $L$, so encoding a vector with all $M$ codebooks takes $O(M^2KL)$ time.
One should notice that when GRVQ has optimized the $m$-th codebook, there is no need to re-encode the vectors with the first $m-1$ codebooks, since our method is carried out sequentially and will obtain exactly the same first $m-1$ encodings. This is very different from the encoding scheme proposed in [13], in which a change in any codebook requires re-encoding over all codebooks. Our encoding scheme is also much more efficient than that of [13], because we only need to consider one codebook at a time.
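A minimal version of the beam search follows. For clarity it recomputes the residual distortions directly instead of using the inner-product lookup tables described above, so it is an illustrative sketch rather than the paper's optimized implementation; the function name is ours.

```python
import numpy as np

def multipath_encode(x, codebooks, L=4):
    """Beam-search encoding sketch: keep the L best partial encodings while
    scanning the codebooks in variance-descending order."""
    # each beam entry: (current residual, list of chosen codeword indices)
    beams = [(x, [])]
    for Cm in codebooks:
        cand = []
        for r, idxs in beams:
            d2 = ((r[None, :] - Cm) ** 2).sum(1)
            for k in np.argsort(d2)[:L]:        # L best codewords per beam
                cand.append((r - Cm[k], idxs + [int(k)]))
        # keep the L candidates with the smallest residual norm
        cand.sort(key=lambda t: float((t[0] ** 2).sum()))
        beams = cand[:L]
    best_r, best_idxs = beams[0]
    return best_idxs, float((best_r ** 2).sum())
```

With $L=1$ this degenerates to greedy sequential (RVQ-style) encoding; larger beams trade encoding time for lower distortion.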
From the encoding time vs. quantization error curve presented in Fig. 2, we find that a small beam width already achieves good encoding quality with relatively low encoding time. We use this configuration in the rest of the experiments.
3.3 Eliminating the $\|\hat{x}\|^2$ Term for Efficient Euclidean Distance Computation
Table 1: Different ways of processing the $\|\hat{x}\|^2$ term.

                     Elimination   Quantization   Don't care
Processing Time      Long          Short          No
Quantization Error   High          Low            Low
Extra Length         No            6-8 bit        4 Byte
Computation          No            2 Flops        1 Flop
Quantization methods using the additive model, like AQ and RVQ, require the extra term $\|\hat{x}\|^2$ to perform Euclidean ADC, as mentioned in Sec. 2.2. Composite quantization is a method similar to AQ, except that it eliminates this term by imposing the regularization that $\|\hat{x}\|^2$ is a constant for all vectors.
Similarly, we can modify the objective functions of transition clustering and multi-path encoding to eliminate $\|\hat{x}\|^2$. We introduce a regularization parameter $\lambda$ indicating the penalty, and a target parameter $\epsilon$. We modify Eqn. 8 to the following:

$$\min_{C_m} \sum_{\tilde{x}\in\mathcal{X}_m} \|\tilde{x} - q_m(\tilde{x})\|^2 + \lambda\,(\|\hat{x}\|^2 - \epsilon)^2. \qquad (11)$$
The above problem can be solved via a slight modification to the k-means algorithm employed in transition clustering: on each iteration of k-means, we assign a vector to the centroid minimizing the penalized objective, i.e. the distortion plus $\lambda(\|\hat{x}\|^2 - \epsilon)^2$ for the reconstruction implied by that centroid.
Next, for encoding a vector $x$, we optimize the following to satisfy the regularization:

$$\min_{i_1(x),\cdots,i_M(x)} \|x - \hat{x}\|^2 + \lambda\,(\|\hat{x}\|^2 - \epsilon)^2. \qquad (12)$$
The above problem requires only a trivial modification to multi-path encoding: we simply account for this penalty in the beam search.
We start the elimination with a small $\lambda$; then on each iteration of GRVQ we recompute $\epsilon$, and slightly increase $\lambda$ to enforce the regularization. The regularization on $\|\hat{x}\|^2$ puts a slight loss on quantization accuracy, as observed in Fig. 2, in exchange for accelerated distance computation and lower memory consumption. Another option is to quantize $\|\hat{x}\|^2$ into a few bits [13]; as observed in Fig. 2, a long code is not needed for quantizing $\|\hat{x}\|^2$. We compare the different ways of processing $\|\hat{x}\|^2$ in Table 1.
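The modified k-means assignment step can be sketched as follows (an illustrative sketch; the names are ours, and `other_part` holds each vector's reconstruction contribution from the codebooks being held fixed):

```python
import numpy as np

def penalized_assign(Xm, C, other_part, lam, eps):
    """Norm-penalized assignment (Eqn. 11 sketch): per vector, pick the
    codeword c minimizing ||x_tilde - c||^2 + lam * (||x_hat||^2 - eps)^2,
    where x_hat = other_part + c is the full reconstruction when c is
    chosen and the other codebooks' contributions are held fixed."""
    n = len(Xm)
    idx = np.empty(n, dtype=int)
    for i in range(n):
        dist = ((Xm[i] - C) ** 2).sum(1)                    # distortion term
        norms = ((other_part[i][None, :] + C) ** 2).sum(1)  # ||x_hat||^2 per c
        idx[i] = (dist + lam * (norms - eps) ** 2).argmin()
    return idx
```

Setting `lam = 0` recovers the plain nearest-centroid assignment, while larger `lam` pulls the chosen reconstruction norms toward the target `eps`.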
3.4 Extensibility for online codebook learning
GRVQ is naturally an online learning mechanism able to deal with incrementally obtained training data, simply by optimizing the codebooks on the newly arrived data. It is also capable of handling large scale datasets where classical clustering algorithms are prohibitive due to unacceptable time complexity and memory consumption. The online learning effect on the SIFT1B [23] dataset containing one billion vectors is reported in Fig. 2.
4 Experiments
4.1 Dataset and configurations
In this section we present the experimental evaluation of GRVQ. All experiments are done on a quad-core CPU running at 3.5 GHz with 16 GB memory and one GTX 980 GPU.
We use the following datasets commonly used for evaluating vector quantization methods: SIFT1M [4], containing one million 128-d SIFT [24] features; GIST1M [4], containing one million 960-d GIST [25] global descriptors; and SIFT1B [23], containing one billion 128-d SIFT features as base vectors.
We compare GRVQ with the following state-of-the-art VQ methods: PQ [4], OPQ [5] and AQ [13], using the commonly adopted configuration for codebook learning. We train all methods on the training set and encode the base dataset; the online version of GRVQ is trained with all the data. The training time on GIST1M for all methods is presented in Fig. 2. Though PQ/OPQ train fast, their performance saturates easily, and the performance of AQ appears unstable; the proposed GRVQ achieves a balanced tradeoff between performance and training speed. We also plot the time for encoding vectors from GIST1M with GRVQ-learned codebooks in Fig. 2: our proposed multi-path encoding scheme exploits the characteristics of GRVQ codebooks and encodes efficiently. Fig. 2 further shows that GRVQ outperforms existing methods by a large margin in terms of quantization accuracy at every tested code length.
4.2 Large Scale Search
We perform exhaustive Euclidean nearest neighbor search to compare the different vector quantization methods. Fig. 3 shows the results on the large scale datasets SIFT1M and GIST1M. It can be seen that the gains obtained by our approach are significant on both datasets. The online version of GRVQ outperforms existing methods by a large margin; for example, the performance of 64-bit GRVQ encoding closely matches that of 128-bit PQ encoding. The $\|\hat{x}\|^2$-eliminated GRVQ codebooks also achieve a large improvement over other methods.
Table 2 shows the performance on the even larger dataset SIFT1B. By utilizing online learning, our method achieves the best performance, and the improvement on this large dataset is more significant than that on the smaller ones.
Table 2: Recall@100 for exhaustive search on SIFT1B.

              GRVQ (online)   CQ        OPQ       PQ
Recall@100    0.834           (0.701)   (0.646)   0.581
4.3 Image Classification and Retrieval
Another important application of vector quantization is to compress image descriptors for image classification and retrieval, where images are usually represented as the aggregation of local descriptors, resulting in vectors of thousands of dimensions. We evaluate image classification and retrieval performance over Fisher vectors [26] computed on 64-d PCA dimension-reduced SIFT descriptors extracted from the INRIA Holiday dataset [27]. We quantize all Fisher vectors and learn a linear classifier to perform the classification. With short codes one can accelerate classification and retrieval a thousandfold; in particular, our GRVQ suffers the least degradation of performance.
Table 3: Classification accuracy with quantized Fisher vectors.

          GRVQ    AQ      OPQ     RVQ     PQ      CQ
32-bit    57.1%   54.5%   53.7%   50.9%   50.3%   (55.0%)
64-bit    62.9%   62.1%   57.9%   53.8%   55.0%   (62.2%)
5 Conclusion
In this paper, we proposed generalized residual vector quantization (GRVQ) to perform vector quantization with higher quantization accuracy. We proposed an improved clustering algorithm and a multi-path encoding scheme for GRVQ codebook learning and encoding. We also proposed a $\|\hat{x}\|^2$-eliminated version of GRVQ for efficient Euclidean distance computation. Experiments against several state-of-the-art quantization methods on well-known datasets demonstrate the effectiveness of GRVQ in a number of applications.
References
 [1] Andrea Vedaldi and Andrew Zisserman, “Sparse kernel approximations for efficient classification and detection,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2320–2327.
 [2] Josef Sivic and Andrew Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 1470–1477.
 [3] James A Rodger, “A fuzzy nearest neighbor neural network statistical model for predicting demand for natural gas and energy cost savings in public buildings,” Expert Systems with Applications, vol. 41, no. 4, pp. 1813–1829, 2014.
 [4] Herve Jegou, Matthijs Douze, and Cordelia Schmid, “Product quantization for nearest neighbor search,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 1, pp. 117–128, 2011.
 [5] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun, “Optimized product quantization for approximate nearest neighbor search,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 2946–2953.
 [6] Ting Zhang, Chao Du, and Jingdong Wang, “Composite quantization for approximate nearest neighbor search,” Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 32, no. 1, pp. 838–846, 2014.
 [7] Jianfeng Wang, Heng Tao Shen, Shuicheng Yan, Nenghai Yu, Shipeng Li, and Jingdong Wang, “Optimized distances for binary code ranking,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 517–526.
 [8] Artem Babenko and Victor Lempitsky, “The inverted multi-index,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3069–3076.
 [9] Yannis Kalantidis and Yannis Avrithis, “Locally optimized product quantization for approximate nearest neighbor search,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 2329–2336.
 [10] Yongjian Chen, Tao Guan, and Cheng Wang, “Approximate nearest neighbor search by residual vector quantization,” Sensors, vol. 10, no. 12, pp. 11259–11273, 2010.
 [11] Robert M Gray, “Vector quantization,” ASSP Magazine, IEEE, vol. 1, no. 2, pp. 4–29, 1984.
 [12] Benchang Wei, Tao Guan, and Junqing Yu, “Projected residual vector quantization for ANN search,” MultiMedia, IEEE, vol. 21, no. 3, pp. 41–51, 2014.
 [13] Artem Babenko and Victor Lempitsky, “Additive quantization for extreme vector compression,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 931–938.
 [14] John Lafferty, Andrew McCallum, and Fernando CN Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
 [15] Stuart Lloyd, “Least squares quantization in PCM,” Information Theory, IEEE Transactions on, vol. 28, no. 2, pp. 129–137, 1982.
 [16] BiingHwang Juang and A Gray Jr, “Multiple stage vector quantization for speech coding,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’82. IEEE, 1982, vol. 7, pp. 597–600.
 [17] Artem Babenko and Victor Lempitsky, “Tree quantization for largescale similarity search and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4240–4248.
 [18] Ting Zhang, GuoJun Qi, Jinhui Tang, and Jingdong Wang, “Sparse composite quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4548–4556.
 [19] Yair Weiss, Antonio Torralba, and Rob Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
 [20] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, vol. 27, ACM, 1998.
 [21] Chris Ding and Xiaofeng He, “K-means clustering via principal component analysis,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 29.
 [22] Paul S Bradley and Usama M Fayyad, “Refining initial points for kmeans clustering.,” in ICML. Citeseer, 1998, vol. 98, pp. 91–99.
 [23] Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg, “Searching in one billion vectors: rerank with source coding,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 861–864.
 [24] David G Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
 [25] Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
 [26] Florent Perronnin, Jorge Sánchez, and Thomas Mensink, “Improving the fisher kernel for largescale image classification,” in Computer Vision–ECCV 2010, pp. 143–156. Springer, 2010.
 [27] Herve Jegou, Matthijs Douze, and Cordelia Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Computer Vision–ECCV 2008, pp. 304–317. Springer, 2008.