Simultaneous Compression and Quantization:
A Joint Approach for Efficient Unsupervised Hashing
Abstract
For unsupervised data-dependent hashing, the two most important requirements are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. A well-established hashing approach is Iterative Quantization (ITQ), which addresses these two requirements in separate steps. In this paper, we revisit the ITQ approach and propose novel formulations and algorithms for the problem. Specifically, we propose a novel approach, named Simultaneous Compression and Quantization (SCQ), to jointly learn to compress (reduce dimensionality) and binarize input data in a single formulation under a strict orthogonal constraint. With this approach, we introduce a loss function and its relaxed version, termed Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE) respectively, which involve challenging binary and orthogonal constraints. We propose to attack the optimization using novel algorithms based on recent advances in the cyclic coordinate descent approach. Comprehensive experiments on unsupervised image retrieval demonstrate that our proposed methods consistently outperform other state-of-the-art hashing methods. Notably, our proposed methods outperform recent deep neural network and GAN based hashing methods in accuracy, while being very computationally efficient.
1 Introduction
For decades, image hashing has been an active research field in the vision community (Andoni and Indyk (2008); Gong and Lazebnik (2011); Weiss et al. (2009); Zhang et al. (2010)) due to its advantages in storage and computation speed for similarity search/retrieval under specific conditions (Gong and Lazebnik (2011)). Firstly, the binary code should be short so that the whole hash table can fit in memory. Secondly, the binary code should preserve similarity, i.e., (dis)similar images should have (dis)similar hash codes in the Hamming distance space. Finally, the algorithm should learn its parameters quickly and, for unseen samples, produce hash codes efficiently. It is very challenging to simultaneously satisfy all three requirements, especially under the binary constraint, which leads to an NP-hard mixed-integer optimization problem. In this paper, we aim to tackle all these challenging conditions and constraints.
The hashing methods proposed in the literature can be categorized into data-independent (Gionis et al. (1999); Kulis and Grauman (2009); Raginsky and Lazebnik (2009)) and data-dependent methods; the latter have recently received more attention in both (semi-)supervised (Do et al. (2016b); Kulis and Darrell (2009); Lin et al. (2014); Liu et al. (2012); Norouzi et al. (2012); Shen et al. (2015); Chen et al. (2018); Cao et al. (2018); Jain et al. (2017); Liu et al. (2016); Lin et al. (2015); Lai et al. (2015)) and unsupervised (Carreira-Perpiñán and Raziperchikolaei (2015); Do et al. (2016a, 2017, 2019); Gong and Lazebnik (2011); He et al. (2013); Heo et al. (2012); Shen et al. (2018); Hu et al. (2018); Huang and Lin (2018); y. Duan et al. (2018); Wang et al. (2018); Duan et al. (2017); En et al. (2017); Do et al. (2019)) settings. Supervised hashing has shown superior performance over unsupervised hashing. However, in practice, labeled datasets are limited and costly; hence, in this work, we focus only on the unsupervised setting. We refer readers to recent surveys (Grauman and Fergus (2013); Wang et al. (2015, 2014, 2017)) for more detailed reviews of data-independent/dependent hashing methods.
1.1 Related works
The work most relevant to our proposal is Iterative Quantization (ITQ) (Gong and Lazebnik (2011)), which is a very fast and competitive hashing method. The foundation of ITQ is twofold. Firstly, to achieve low-dimensional features, it uses the well-known Principal Component Analysis (PCA) method. PCA maximizes the variance of the projected data and keeps the dimensions pairwise uncorrelated. Hence, the low-dimensional data, projected using the top PCA component vectors, can preserve data similarity well. Secondly, minimizing the binary quantization loss using an orthogonal rotation matrix strictly maintains the pairwise distances of the data. As a result, ITQ learns binary codes that can highly preserve the local structure of the data. However, optimizing these two steps separately, especially when no binary constraint is enforced in the first step, i.e., PCA, leads to suboptimal solutions. In contrast, we propose to jointly optimize the projection variance and the quantization loss.
Other works that are highly relevant to our proposed method are Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei (2015)), UH-BDNN (Do et al. (2016a)), DBD-MQ (Duan et al. (2017)), and Stacked convolutional AutoEncoders (SAE) (En et al. (2017)). In these methods, the authors proposed to combine data dimensionality reduction and binary quantization into a single step by using the encoder of an autoencoder, while the decoder encourages (dis)similar inputs to be mapped to (dis)similar binary codes. However, the reconstruction criterion is not a direct way of preserving similarity (Do et al. (2016a)). Additionally, although achieving very competitive performance, UH-BDNN and DBD-MQ are based on deep neural networks (DNN); hence, producing the binary code is computationally expensive. In particular, given an extracted CNN feature, these methods require a forward propagation through multiple fully-connected and activation layers to produce the binary code, whereas our proposed method requires only a single linear transformation, i.e., one BLAS operation (gemv or gemm), and a comparison operation.
Recently, many works (Liny et al. (2016); Duan et al. (2017); Song (2018)) leverage the powerful capability of Convolutional Neural Networks (CNN) to jointly learn the image representations and binary codes. However, due to the non-smooth property of the binary constraint, which causes ill-defined gradients in back-propagation, these methods resort to relaxation or approximation. As a result, even though they achieve highly discriminative image representations, these methods can only produce suboptimal binary codes. In this paper, we show that by directly considering the binary constraint, our methods can obtain much better binary codes; hence, higher retrieval performance can be achieved. This emphasizes the necessity of an effective method to preserve the discriminative power of high-dimensional CNN features in very compact binary representations, i.e., to effectively handle the challenging binary and orthogonal constraints.
Besides, several works have been proposed to handle the difficulty of training deep models with the binary constraint. Cao et al. (2017) proposed to handle the non-smooth problem of the sign function by continuation, i.e., starting the training with a smoothed approximation and gradually reducing the smoothness as the training proceeds. Chen et al. (2018) transformed the original binary optimization into a differentiable optimization problem over hash functions through a Taylor series expansion. Cao et al. (2018) introduced a pairwise cross-entropy loss based on the Cauchy distribution, which significantly penalizes similar image pairs with Hamming distance larger than a given Hamming radius threshold, e.g., greater than 2. Nevertheless, these methods require class labels for the training process (i.e., supervised hashing). This is not the focus of our methods, which aim to learn optimal binary codes from given image representations in an unsupervised manner.
1.2 Contributions
In this work, to address the problem of learning to preserve data affinity in low-dimensional binary codes, (i) we first propose a novel loss function to learn a single linear transformation under the column orthonormal constraint (please refer to Section 1.3 for our term definitions) in an unsupervised manner that compresses and binarizes the input data jointly. The approach is named Simultaneous Compression and Quantization (SCQ). Note that the idea of jointly compressing and binarizing data has been explored in Carreira-Perpiñán and Raziperchikolaei (2015); Do et al. (2016a). However, due to the difficulty of the non-convex orthogonal constraint, these works relax the orthogonal constraint and resort to the reconstruction criterion as an indirect way to handle the similarity-preserving concern. Our work is the first to tackle the similarity concern by enforcing strict orthogonal constraints.
(ii) Under the strict orthogonal constraints, we conduct analysis and experiments to show that our formulation is able to retain a high amount of variance, i.e., preserve data similarity, and achieve a small quantization loss, which are important requirements in hashing for image retrieval (Gong and Lazebnik (2011); Carreira-Perpiñán and Raziperchikolaei (2015); Do et al. (2016a)). As a result, this leads to improved accuracy, as demonstrated in our experiments.
(iii) We then propose to relax the column orthonormal constraint to a column orthogonal constraint on the transformation matrix. The relaxation not only helps to further improve retrieval performance but also significantly reduces the training time.
(iv) Our proposed loss functions, with column orthonormal and orthogonal constraints, are confronted with two main challenges. The first is the binary constraint, which is the traditional and well-known difficulty of the hashing problem (Andoni and Indyk (2008); Gong and Lazebnik (2011); Weiss et al. (2009)). The second is the non-convex nature of the orthonormal/orthogonal constraint (Wen and Yin (2013)). To tackle the binary constraint, we propose to apply alternating optimization with an auxiliary variable. Additionally, we resolve the orthonormal/orthogonal constraint by using the cyclic coordinate descent approach to learn one column of the projection matrix at a time while fixing the others. The proposed algorithms are named Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE).
(v) Comprehensive experiments on common benchmark datasets show considerable improvements in the retrieval performance of the proposed methods over other state-of-the-art hashing methods. Additionally, the computational complexity and the training/online-processing time are also discussed to show the computational efficiency of our methods.
1.3 Notations and Term definitions
We first introduce the notations. Given a zero-centered dataset X ∈ R^{n×D}, which consists of n images, each represented by a D-dimensional feature descriptor, our proposed hashing methods aim to learn a column orthonormal/orthogonal matrix V ∈ R^{D×L} which simultaneously compresses the input data X to an L-dimensional space, while retaining a high amount of variance, and quantizes it to binary codes B ∈ {−1, +1}^{n×L}.
It is important to note that, in this work, we abuse the terms column orthonormal/orthogonal matrix. Specifically, the term column orthonormal matrix indicates a matrix V such that V^⊤V = I_L, where I_L is the L×L identity matrix, while the term column orthogonal matrix indicates a matrix V such that V^⊤V is an arbitrary diagonal matrix. Note that the word "column" in these terms means that the columns of the matrix are pairwise independent.
We define {λ_1, λ_2, …, λ_D} (λ_1 ≥ λ_2 ≥ … ≥ λ_D) as the eigenvalues of the covariance matrix X^⊤X sorted in descending order. Finally, let b_i and v_i be the i-th columns of B and V respectively.
The remainder of the paper is organized as follows. Section 2 presents in detail our first proposed hashing method, Orthonormal Encoder (OnE), and provides analysis to show that our method can retain a high amount of variance and achieve a small quantization loss. Section 3 presents a relaxed version of OnE, Orthogonal Encoder (OgE). Section 4 presents experimental results to validate the effectiveness of our proposed methods. We conclude the paper in Section 5.
2 Simultaneous Compression & Quantization: Orthonormal Encoder
2.1 Problem Formulation
In order to jointly learn data dimensionality reduction and binary quantization using a single linear transformation V ∈ R^{D×L}, we propose to solve the following constrained optimization:
\min_{B, V} \|B - XV\|_F^2 \quad \text{s.t.} \quad V^\top V = I_L, \; B \in \{-1,+1\}^{n \times L}    (1)
where ‖·‖_F denotes the Frobenius norm. Additionally, the orthonormal constraint on the columns of V is necessary to ensure that no redundant information is captured in the binary codes (Wang et al. (2012)) (i.e., the projected low-dimensional features are strictly pairwise uncorrelated) and that the projection vectors do not scale up/down the projected data.
It is worthwhile to highlight the differences between our loss function Eq. (1) and the binary quantization loss function of ITQ (Gong and Lazebnik (2011)). Firstly, different from ITQ, which works on the compressed low-dimensional feature space obtained by PCA, our approach, instead, works directly on the original high-dimensional feature space X. This leads to the second main difference: the non-square column orthonormal matrix V simultaneously (i) compresses the data to a low dimension and (ii) quantizes it to binary codes. However, it is important to note that solving for a non-square projection matrix is challenging. To handle this difficulty, ITQ proposes to solve the data compression and binary quantization problems in two separate optimizations. Specifically, it applies PCA to compress the data to L dimensions, and then uses the Orthogonal Procrustes approach (Schönemann (1966)) to learn a square rotation matrix to optimize the binary quantization loss. However, there is a limitation in the ITQ approach: no consideration is given to the binary constraint in the data compression step, i.e., PCA. Consequently, the solution is suboptimal. In this paper, by adopting recent advances in the cyclic coordinate descent approach (Shen et al. (2015); Do et al. (2016a); Gurbuzbalaban et al. (2017); Yuan and Ghanem (2017)), we propose a novel and efficient algorithm that resolves the ITQ limitation by simultaneously attacking both problems in a single optimization under the strict orthogonal constraint. Hence, our optimization can lead to a better solution.
2.2 Optimization
In this section, we discuss the key details of the algorithm (Algorithm 1) for solving the optimization problem Eq. (1). In order to handle the binary constraint in Eq. (1), we propose to use alternating optimization over B and V.
2.2.1 Fix V and update B
When V is fixed, the problem becomes exactly the same as the fixed-rotation step in ITQ. To make the paper self-contained, we repeat the explanation of Gong and Lazebnik (2011). By expanding the objective function in Eq. (1), we have
\|B - XV\|_F^2 = \|B\|_F^2 - 2\,\text{tr}(B V^\top X^\top) + \|XV\|_F^2 = nL - 2\,\text{tr}(B V^\top X^\top) + \|XV\|_F^2    (2)
where P = XV. Because V is fixed, ‖XV‖_F^2 is fixed; since ‖B‖_F^2 = nL is a constant, minimizing (2) is equivalent to maximizing
\text{tr}(B V^\top X^\top) = \sum_{i=1}^{n} \sum_{j=1}^{L} B_{ij} P_{ij}    (3)
where B_{ij} and P_{ij} denote the elements of B and P respectively. To maximize this expression with respect to B, we need B_{ij} = 1 whenever P_{ij} ≥ 0 and B_{ij} = −1 otherwise. Hence, the optimal value of B can be simply achieved by
B = \text{sgn}(XV)    (4)
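The B-step above amounts to a single projection followed by a sign operation. The following is a minimal numpy sketch; the function and variable names are ours, not part of the paper's notation:

```python
import numpy as np

def update_B(X, V):
    """B-step: with V fixed, the optimal binary codes are the signs
    of the projected data. Zeros are mapped to +1 so that every
    code entry is in {-1, +1}."""
    P = X @ V                        # project the n x D data to n x L
    return np.where(P >= 0, 1.0, -1.0)
```

In practice this is one BLAS gemm plus an element-wise comparison, which is why the online encoding cost of the method stays so low.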
2.2.2 Fix B and update V
When fixing B, the optimization is no longer a mixed-integer problem. However, the problem is still non-convex and difficult to solve due to the orthonormal constraint (Wen and Yin (2013)). It is important to note that V is not a square matrix. This means that the objective function is not the classic Orthogonal Procrustes problem (Schönemann (1966)); hence, we cannot achieve the closed-form solution for V as proposed in Gong and Lazebnik (2011). To the best of our knowledge, there is no easy way to achieve a closed-form solution for a non-square V. Hence, in order to overcome this challenge, inspired by PCA and recent methods in cyclic coordinate descent (Shen et al. (2015); Do et al. (2016a); Gurbuzbalaban et al. (2017); Yuan and Ghanem (2017)), we iteratively learn one vector, i.e., one column of V, at a time. We now consider two cases: v_1 and v_k (2 ≤ k ≤ L).
• The 1st vector v_1
\min_{v_1} \|b_1 - X v_1\|_2^2 \quad \text{s.t.} \quad \|v_1\|_2^2 = 1    (5)
where ‖·‖_2 is the ℓ2-norm.
Let μ be the Lagrange multiplier; we formulate the Lagrangian L(v_1, μ):
\mathcal{L}(v_1, \mu) = \|b_1 - X v_1\|_2^2 + \mu \left( \|v_1\|_2^2 - 1 \right)    (6)
By minimizing L(v_1, μ) over v_1, we can achieve:
v_1 = (X^\top X + \mu I)^{-1} X^\top b_1    (7)
given that μ maximizes the dual function g(μ) of L(v_1, μ) (Boyd and Vandenberghe (2004)); the dual function can be simply constructed by substituting v_1 from Eq. (7) into Eq. (6). Equivalently, μ should satisfy the following conditions:
\mu > -\lambda_D, \qquad \left\| (X^\top X + \mu I)^{-1} X^\top b_1 \right\|_2^2 = 1    (8)
where λ_D is the smallest eigenvalue of X^⊤X. The detailed derivation is provided in Appendix A.
In Eq. (8), the first condition ensures that (X^⊤X + μI) is non-singular, and the second condition is obtained by setting the derivative of g(μ) with respect to μ equal to 0.
The second equation in Eq. (8) can be recognized as a 2D-th order polynomial equation in μ, which has no explicit closed-form solution when D > 2. Fortunately, since g(μ) is a concave function of μ, its derivative is monotonically decreasing. Hence, we can simply solve for μ using binary search with a small error-tolerance ε. Note that:
\lim_{\mu \to (-\lambda_D)^{+}} \left\| (X^\top X + \mu I)^{-1} X^\top b_1 \right\|_2^2 = +\infty, \qquad \lim_{\mu \to +\infty} \left\| (X^\top X + \mu I)^{-1} X^\top b_1 \right\|_2^2 = 0,    (9)
thus the second equation in Eq. (8) always has a solution.
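As a concrete illustration, the binary search for μ can be sketched as follows. This is a simplified sketch: the bracketing and interval-growing strategy are our own assumptions, not specified by the paper, and the norm condition used is the unit-norm condition of Eq. (8):

```python
import numpy as np

def solve_v1(X, b1, eps=1e-6):
    """Find mu by bisection so that ||(X^T X + mu I)^{-1} X^T b1|| = 1,
    then return the corresponding first projection vector v1.
    The norm is monotonically decreasing in mu on (-lambda_D, inf),
    going from +inf down to 0, so plain bisection applies."""
    XtX = X.T @ X
    Xtb = X.T @ b1
    D = XtX.shape[0]
    lam_min = np.linalg.eigvalsh(XtX)[0]      # smallest eigenvalue lambda_D

    def norm_v(mu):
        v = np.linalg.solve(XtX + mu * np.eye(D), Xtb)
        return np.linalg.norm(v)

    lo = -lam_min + 1e-9                      # keep X^T X + mu I nonsingular
    hi = max(1.0, lo + 1.0)
    while norm_v(hi) > 1.0:                   # grow the bracket until norm < 1
        hi *= 2.0
    while hi - lo > eps:                      # bisect on the monotone norm
        mid = 0.5 * (lo + hi)
        if norm_v(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return np.linalg.solve(XtX + mu * np.eye(D), Xtb), mu
```

The returned v_1 has unit norm up to the tolerance ε, matching the constraint of Eq. (5).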
• The k-th vector v_k (2 ≤ k ≤ L)
For the second vector onward, besides the unit-norm constraint, we also need to ensure that the current vector is independent of its previous vectors, i.e., V_{(k−1)}^⊤ v_k = 0, where V_{(k−1)} = [v_1, …, v_{k−1}]:
\min_{v_k} \|b_k - X v_k\|_2^2 \quad \text{s.t.} \quad \|v_k\|_2^2 = 1, \; V_{(k-1)}^\top v_k = 0    (10)
Let μ and ν = [ν_1, …, ν_{k−1}]^⊤ be the Lagrange multipliers; we again formulate the Lagrangian L(v_k, μ, ν):
\mathcal{L}(v_k, \mu, \nu) = \|b_k - X v_k\|_2^2 + \mu \left( \|v_k\|_2^2 - 1 \right) + 2 \nu^\top V_{(k-1)}^\top v_k    (11)
Minimizing L(v_k, μ, ν) over v_k, similar to Eq. (7), we can achieve:
v_k = (X^\top X + \mu I)^{-1} \left( X^\top b_k - V_{(k-1)} \nu \right)    (12)
given that {μ, ν} satisfy the following conditions, which maximize the corresponding dual function:
\mu > -\lambda_D, \qquad \left\| (X^\top X + \mu I)^{-1} \left( X^\top b_k - V_{(k-1)} \nu \right) \right\|_2^2 = 1    (13)
where
\nu = \left( V_{(k-1)}^\top M V_{(k-1)} \right)^{-1} V_{(k-1)}^\top M X^\top b_k    (14)
in which M = (X^⊤X + μI)^{−1}. The detailed derivation is provided in Appendix B.
There is also no straightforward solution for {μ, ν}. In order to resolve this difficulty, we propose to use alternating optimization over μ and ν. In particular, (i) given a fixed ν (initialized as 0), we find μ using binary search as discussed above; similar to Eq. (9), there is always a solution for μ. Then, (ii) with μ fixed, we can get the closed-form solution for ν as in Eq. (14). Note that since the dual function is a concave function of {μ, ν}, alternately optimizing μ and ν still guarantees that the solution approaches the global optimum.
Additionally, we note that solving for v_k requires a matrix inversion (for each value of μ), which is very computationally expensive. However, by utilizing the Singular Value Decomposition (SVD) of X^⊤X, we can efficiently compute the inversion as follows:
(X^\top X + \mu I)^{-1} = U \Sigma_\mu U^\top    (15)
where U is the matrix whose columns are the eigenvectors of X^⊤X corresponding to its eigenvalues {λ̃_1, …, λ̃_D} sorted in ascending order, and
\Sigma_\mu = \text{diag}\!\left( \left[ \tfrac{1}{\tilde{\lambda}_1 + \mu}, \dots, \tfrac{1}{\tilde{\lambda}_D + \mu} \right] \right)    (16)
where "diag(·)" is the operation that converts a vector into a square diagonal matrix. Note that, given X, both U and {λ̃_i} are fixed and can be computed in advance.
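A minimal sketch of this precomputation trick (function names are ours): the decomposition of X^⊤X is computed once, after which the inverse can be applied to any right-hand side, for any μ, in O(D²) instead of O(D³) per call:

```python
import numpy as np

def make_fast_inverse(X):
    """Precompute the eigendecomposition of X^T X once, so that
    (X^T X + mu I)^{-1} b can be formed cheaply for any mu."""
    lam, U = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order

    def apply_inverse(mu, b):
        # (X^T X + mu I)^{-1} b = U diag(1 / (lam + mu)) U^T b
        return U @ ((U.T @ b) / (lam + mu))

    return apply_inverse
```

Since X^⊤X is symmetric positive semi-definite, its SVD and its eigendecomposition coincide, so `eigh` suffices here.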
2.3 Retained variance and quantization loss
In the hashing problem for image retrieval, both the retained variance and the quantization loss are important. In this section, we provide analysis to show that, when solving Eq. (1), it is possible to retain a high amount of variance and achieve a small quantization loss. As will be discussed in more detail, this can be accomplished by applying an appropriate scale s to the input dataset. Noticeably, applying any positive scale s (for simplicity, we only discuss positive values of s; negative values should have similar effects) to the dataset strictly preserves the local structure of the data, i.e., the ranked nearest-neighbor set of every data point is always the same. Therefore, in the hashing problem for the retrieval task, it is equivalent to work on a scaled version of the dataset, i.e., X̄ = sX. We can rewrite the loss function of Eq. (1) as follows:
Q(V; s) = \|B - sXV\|_F^2 = \left\| \mathbf{1} - |sXV| \right\|_F^2    (17)
where |·| is the element-wise absolute value operation and 1 is the all-1 matrix. In what follows, we discuss how s can affect the retained variance and the quantization loss.
2.3.1 Maximizing retained variance
We recognize that by scaling the dataset by an appropriate scale s, such that all projected data points are inside the hypercube [−1, +1]^L, i.e., |sXV| ≤ 1 element-wise, the maximizing-retained-variance problem (PCA) can achieve results similar to the minimizing-quantization-loss problem, i.e., Eq. (17). Intuitively, we can interpret the former problem, i.e., PCA, as finding the projection that maximizes the distances of projected data points from the coordinate origin, while the latter problem, i.e., minimizing the binary quantization loss, tries to find the projection matrix that minimizes the distances of projected data points from −1 or +1 correspondingly. A simple 1D illustration explaining the relationship between the two problems is given in Figure 2.
Since each column of V is constrained to have unit norm, the condition |sXV| ≤ 1 can be satisfied by scaling the dataset by s = 1/r so that all data points in the original space lie inside the hyperball with unit radius, where r is the largest distance between a data point and the coordinate origin.
2.3.2 Minimizing quantization loss
Regarding the quantization loss (Eq. (17)), which is a convex function of sXV, by setting its derivative with respect to sXV equal to 0, we have the optimal solution as follows:
\mathbf{1} - |sXV| = \mathbf{0}    (18)
where 0 is the all-0 matrix.
Considering Eq. (18), there are two important findings. Firstly, there is obviously no scaling value s that can concurrently achieve the maximum retained variance and 1 − |sXV| = 0, except for an idealized case that does not occur in practice. Secondly, from Eq. (18), we can recognize that as s gets larger, minimizing the loss will produce a V that focuses on lower-variance directions, so as to keep |sXV| close to 1 and hence achieve a smaller ‖1 − |sXV|‖_F; it means that sXV gets closer to the global minimum of the loss. Consequently, the quantization loss becomes smaller. In Figure 3, we show a toy example to illustrate that, as s increases, minimizing the quantization loss diverts the projection vector from the top PCA component (Figure 3(a)) to smaller-variance directions (Figures 3(b), 3(c)), while the quantization loss (per bit) gets smaller (Figures 3(a)-3(c)). In summary, as the retained variance gets smaller, the quantization loss also gets smaller, and vice versa. However, note that continuing to increase s when V already focuses on the least-variance directions will make the quantization loss larger.
Note that the scale s is a hyperparameter in our system. In the experiment section (Section 4.2), we additionally conduct experiments to quantitatively analyze the effect of this scale hyperparameter and determine proper values using validation datasets.
3 Simultaneous Compression & Quantization: Orthogonal Encoder
3.1 Problem Reformulation: Orthonormal to Orthogonal
In Orthonormal Encoder (OnE), we work with the column orthonormal constraint on V. However, we recognize that relaxing this constraint to a column orthogonal constraint, i.e., relaxing the unit-norm constraint on each column of V by converting it into a penalty term, provides three important advantages. We now achieve the new loss function as follows:
\min_{B, V} \|B - XV\|_F^2 + \alpha \sum_{i=1}^{L} \|v_i\|_2^2 \quad \text{s.t.} \quad v_i^\top v_j = 0 \; (\forall i \neq j), \; B \in \{-1,+1\}^{n \times L}    (19)
where α is a fixed positive hyperparameter that penalizes large norms of the columns v_i. It is important to note that, in Eq. (19), we still enforce the strict pairwise-independence constraint on the projection vectors to ensure that no redundant information is captured.
Firstly, with an appropriately large α, the optimization prefers to choose large-variance components of X, since this helps to achieve projection vectors that have smaller norms. In other words, without penalizing large norms of v_i, the optimization has no incentive to focus on high-variance components of X, since it can produce projection vectors with arbitrarily large norms that scale any components appropriately to achieve the minimum binary quantization loss. Secondly, this provides the flexibility of having different scale values for different directions; consequently, relaxing the unit-norm constraint on each column of V helps to mitigate the difficulty of choosing the scale value s. However, it is important to note that a too large α, on the other hand, may distract the optimization from minimizing the binary quantization term. Finally, from the OnE optimization (Section 2.2), we observed that the unit-norm constraint on each column of V makes the OnE optimization difficult to solve efficiently, since there is no closed-form solution for v_k. By relaxing this unit-norm constraint, we can now achieve closed-form solutions for v_k; hence, it is very computationally beneficial. We discuss the computational aspect further in Section 3.3.
3.2 Optimization
Similar to Algorithm 1 for solving the Orthonormal Encoder, we apply alternating optimization over B and V, where the B step is exactly the same as Eq. (4). For the V step, we also utilize the cyclic coordinate descent approach to solve for V iteratively, i.e., column by column. The loss functions are rewritten and their corresponding closed-form solutions for v_k can be efficiently achieved as follows:
• The 1st vector v_1
\min_{v_1} \|b_1 - X v_1\|_2^2 + \alpha \|v_1\|_2^2    (20)
We can see that Eq. (20) is a regularized least-squares problem, whose closed-form solution is given by:
v_1 = (X^\top X + \alpha I)^{-1} X^\top b_1    (21)
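This closed-form step is ordinary ridge regression; a minimal sketch (function name is ours):

```python
import numpy as np

def oge_first_vector(X, b1, alpha):
    """Closed-form solution of the regularized least-squares problem:
    v1 = (X^T X + alpha I)^{-1} X^T b1."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ b1)
```

Using `solve` rather than an explicit inverse is the standard numerically preferable choice for a single right-hand side.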
• The k-th vector v_k (2 ≤ k ≤ L)
\min_{v_k} \|b_k - X v_k\|_2^2 + \alpha \|v_k\|_2^2 \quad \text{s.t.} \quad V_{(k-1)}^\top v_k = 0    (22)
Given the Lagrange multipliers ν, similar to Eq. (7) and Eq. (11), we can obtain v_k as follows:
v_k = A^{-1} \left( X^\top b_k - V_{(k-1)} \nu \right)    (23)
where
\nu = \left( V_{(k-1)}^\top A^{-1} V_{(k-1)} \right)^{-1} V_{(k-1)}^\top A^{-1} X^\top b_k    (24)
and A = X^⊤X + αI.
Note that, given a fixed α, A^{−1} is a constant matrix, and the matrix V_{(k)}^⊤ A^{−1} V_{(k)} contains the matrix V_{(k−1)}^⊤ A^{−1} V_{(k−1)} in its top-left corner. This means that, at step k, only the new row and column of this matrix need to be computed. Thus, ν can be solved even more efficiently.
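A sketch of this k-th step, assuming the multipliers ν are chosen exactly so that the ridge solution stays orthogonal to the previously learned columns (names are ours; the incremental top-left-corner reuse is omitted for clarity):

```python
import numpy as np

def oge_kth_vector(X, bk, V_prev, alpha):
    """k-th OgE step: a ridge solution corrected by multipliers nu that
    enforce orthogonality to the learned columns V_prev (D x (k-1))."""
    D = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + alpha * np.eye(D))
    Xtb = X.T @ bk
    # nu = (V^T A^{-1} V)^{-1} V^T A^{-1} X^T b_k
    G = V_prev.T @ A_inv @ V_prev
    nu = np.linalg.solve(G, V_prev.T @ (A_inv @ Xtb))
    # v_k = A^{-1} (X^T b_k - V_prev nu), orthogonal to V_prev by construction
    return A_inv @ (Xtb - V_prev @ nu)
```

Substituting the returned v_k into V_prev^⊤ v_k gives exactly zero by construction, which is the orthogonality constraint of the relaxed formulation.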
3.3 Complexity analysis
The complexity of the two algorithms, OnE and OgE, is shown in Table 1. In our empirical experiments, the number of alternating iterations is usually around 50, the binary search takes at most 10 iterations, and D = 4096 (for CNN fully-connected features (Section 4.1)). Firstly, we can observe that OgE is very efficient, as its complexity depends only linearly on the number of training samples n, the feature dimension D, and the code length L. In addition, OgE is also faster than OnE. Furthermore, as our methods aim to learn projection matrices that preserve high-variance components, it is unnecessary to work on very high-dimensional features, since the many low-variance/noisy components will be discarded eventually. More importantly, we observe no drop in retrieval performance when applying PCA to compress features to a much lower dimension, e.g., 512-D, in comparison with using the original 4096-D features, while this helps to achieve a significant speed-up in training time for both algorithms, especially for OnE, whose time complexity grows quickly with D. In addition, we conduct experiments to measure the actual running time of the algorithms and compare with other methods in Section 4.4.
Table 1: Computational complexity of the OnE and OgE algorithms.
4 Experiments
4.1 Datasets, Evaluation protocols, and Implementation notes
Dataset  CIFAR-10  LabelMe-12-50k  SUN397

L  8  16  24  32  8  16  24  32  8  16  24  32

mAP
SpH  17.09  18.77  20.19  20.96  11.68  13.24  14.39  14.97  9.13  13.53  16.63  19.07 
KMH  22.22  24.17  24.71  24.99  16.09  16.18  16.99  17.24  21.91  26.42  28.99  31.87  
BA  23.24  24.02  24.77  25.92  17.48  17.10  17.91  18.07  20.73  31.18  35.36  36.40  
ITQ  24.75  26.47  26.86  27.19  17.56  17.73  18.52  19.09  20.16  30.95  35.92  37.84  
SCQ-OnE  27.08  29.64  30.57  30.82  19.76  21.96  23.61  24.25  23.37  34.09  38.13  40.54
SCQ-OgE  26.98  29.33  30.65  31.15  20.63  23.07  23.54  24.68  23.44  34.73  39.47  41.82
prec@r2 
SpH  18.04  30.58  37.28  21.40  11.72  19.38  25.14  13.66  6.88  23.68  37.21  27.39 
KMH  21.97  36.64  42.33  27.46  15.20  26.17  32.09  18.62  9.50  36.14  51.27  39.29  
BA  23.67  38.05  42.95  23.49  16.22  25.75  31.35  13.14  10.50  37.75  50.38  41.11  
ITQ  24.38  38.41  42.96  28.63  15.86  25.46  31.43  17.66  9.78  35.15  49.85  46.34  
SCQ-OnE  24.48  36.49  41.53  43.90  16.69  27.30  34.63  33.04  8.68  30.12  43.54  50.41
SCQ-OgE  24.35  38.30  43.01  44.01  16.57  27.80  34.77  34.64  8.76  29.31  45.03  51.88
prec@1k 
SpH  22.93  26.99  29.50  31.98  14.07  16.78  18.52  19.27  10.79  15.36  18.21  20.07 
KMH  32.30  33.65  35.52  37.77  21.07  20.97  21.41  21.98  18.94  24.93  25.74  28.26  
BA  31.73  34.16  35.67  37.01  21.14  21.71  22.64  22.83  19.22  28.68  31.31  31.80  
ITQ  32.40  36.35  37.25  37.96  21.01  22.00  22.98  23.63  18.86  28.62  31.56  32.74  
SCQ-OnE  33.38  37.82  39.13  40.40  22.91  25.39  26.55  27.16  19.26  29.95  32.72  34.08
SCQ-OgE  33.41  38.33  39.54  40.70  23.94  25.94  26.99  27.46  20.10  29.95  33.43  35.00
The CIFAR-10 dataset (Krizhevsky and Hinton (2009)) contains 60,000 fully-annotated 32×32 color images from 10 object classes (6,000 images per class). The provided test set (1,000 images per class) is used as the query set. The remaining 50,000 images are used as the training set and the database.
The LabelMe-12-50k dataset (Uetz and Behnke (2009)) consists of 50,000 fully annotated color images of 12 object classes, a subset of the LabelMe dataset (Russell et al. (2008)). In this dataset, for any image having multiple label values in the range [0, 1], the object class with the largest label value is chosen as the image label. We also use the provided test set as the query set and the remaining images as the training set and the database.
The SUN397 dataset (Xiao et al. (2016)) contains approximately 108,000 fully annotated color images from 397 scene categories. We select a subset of the categories which contain more than 500 images each to construct our dataset. We then randomly sample 100 images per class to form the query set. The remaining images are used as the training set and the database.
For all the above image datasets, each image is represented by a 4096-D feature vector extracted from the fully-connected layer 7 of the pre-trained VGG network (Simonyan and Zisserman (2014)).
Evaluation protocols. As the datasets are fully annotated, we use semantic labels to define the ground truths of image queries. We apply three standard evaluation metrics, which are widely used in the literature (Carreira-Perpiñán and Raziperchikolaei (2015); Erin Liong et al. (2015); Gong and Lazebnik (2011)), to measure the retrieval performance of all methods: 1) mean Average Precision (mAP); 2) precision at Hamming radius 2 (prec@r2), which measures the precision over retrieved images having Hamming distance ≤ 2 to the query (we report zero precision for queries that return no image); 3) precision at top 1000 returned images (prec@1k), which measures the precision over the top 1000 retrieved images.
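For concreteness, the prec@r2 metric, including the zero-precision convention for queries that retrieve nothing, can be sketched as follows; the function name and the {−1, +1} code convention are ours:

```python
import numpy as np

def prec_at_r2(query_code, db_codes, relevant, r=2):
    """Precision at Hamming radius r: fraction of retrieved database
    items (Hamming distance <= r from the query) that are relevant.
    Returns 0.0 for queries that retrieve nothing, per the protocol."""
    dist = np.count_nonzero(db_codes != query_code, axis=1)
    retrieved = dist <= r
    if not retrieved.any():
        return 0.0
    return float(np.mean(relevant[retrieved]))
```

In a production system the codes would be bit-packed and the distance computed with XOR and popcount; the dense comparison above is only for clarity.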
Implementation notes. As discussed in Section 3.3, for computational efficiency, we apply PCA to reduce the feature dimension to 512-D for our proposed methods. The hyperparameter α of the OgE algorithm is set empirically and kept fixed for all experiments. Finally, for both OnE and OgE, we set all error-tolerance values ε to a small fixed constant and fix the maximum number of iterations. The implementation of our methods is available at https://github.com/hnanhtuan/SCQ.git.
For all compared methods, i.e., Spherical Hashing (SpH) (Heo et al. (2012)), K-means Hashing (KMH) (He et al. (2013)) (due to the very long training time of KMH at high dimensionality, we apply PCA to reduce the dimension from 4096-D to 512-D; additionally, we execute experiments for KMH with different settings and report the best results), Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei (2015)), and Iterative Quantization (ITQ) (Gong and Lazebnik (2011)), we use the implementations with the suggested parameters provided by the authors. Besides, to improve the statistical stability of the results, we report the average values of 5 executions.
4.2 Effects of parameters
As discussed in Section 2.3, when the scale s decreases, the projection matrix V can be learned to retain a very high amount of variance, as much as PCA can; however, this causes an undesirably large binary quantization loss, and vice versa. In this section, we additionally provide a quantitative analysis of the effect of the scale parameter s on these two factors (i.e., the amount of retained variance and the quantization loss) and, moreover, on the retrieval performance.
In this experiment, for all datasets, i.e., CIFAR-10, LabelMe-12-50k, and SUN397, we randomly select 20 images per class from the training set (as discussed in Section 4.1) to form a validation set. The remaining images are used for training. To obtain each data point, we solve the problem of Eq. (1) at various scale values s and use the OnE algorithm (Algorithm 1, Section 2.2) to tackle the optimization.
Figure 4 presents (i) the quantization loss per bit, (ii) the percentage of total variance retained by the minimizing-quantization-loss projection matrix in comparison with the total variance retained by the top-L PCA components as s varies, and (iii) the retrieval performance (mAP) on the validation sets. Firstly, we can observe that there is no scale s that simultaneously maximizes the retained variance and minimizes the quantization loss. On the one hand, as the scale value s decreases, minimizing the loss function of Eq. (17) produces a projection matrix that focuses on high-variance directions, i.e., retains more variance in comparison with PCA (red line). On the other hand, at smaller s, the quantization loss is much larger (blue dashed line). The empirical results are consistent with our discussion in Section 2.3.
Secondly, regarding the retrieval performance, unsurprisingly, the performance drops when the scale gets too small, i.e., a high amount of variance is retained but the quantization loss is too large, or when it gets too large, i.e., the quantization loss is small but only low-variance components are retained. Hence, it is necessary to balance these two factors. As the data variance varies from dataset to dataset, the scale value should be determined from the dataset itself. In particular, we leverage the eigenvalues of the data covariance matrix, which are the variances of the PCA components, to determine this hyperparameter. From the experimental results in Figure 4, we propose to set the scale parameter as:
(25)   scale = \sqrt{ L / \sum_{i=1}^{L} \lambda_i }
One advantage of this setting is that it generally achieves the best performance across multiple datasets, feature types, and hash code lengths, without resorting to multiple trainings and cross-validations. The proposed working points of the scale are shown in Figure 4. We apply this scale parameter to all datasets for both OnE and OgE algorithms in all later experiments.
Note that the numerator of the fraction in Eq. (25), i.e., L, is the hash code length, which is also the total variance of the binary codes. In addition, the denominator is the total variance of the top L PCA components, i.e., the maximum amount of variance that can be retained in an L-dimensional feature space. Hence, we can interpret the scale as a factor that makes the amounts of variance, i.e., energy, of the input and the output (i.e., the binary codes) comparable. This property is important: when the variance of the input is much larger than the variance of the output, there is obviously some information loss; on the other hand, when the variance of the output is larger than that of the input, the output contains undesirable additional information.
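To make this variance-matching interpretation concrete, the following sketch derives the scale from training data, assuming it takes the form sqrt(L / sum of the top-L eigenvalues) implied by equating the input and output energies; the function name and interface are illustrative, not the authors' code.

```python
import numpy as np

def compute_scale(X, L):
    """Illustrative sketch: derive the scale of Eq. (25) from the data.

    X: (n, d) zero-mean training features; L: hash code length.
    Binary codes in {-1, +1}^L have total variance L; the top-L PCA
    components retain sum(eigvals[:L]) of the input variance, so the
    scale makes these two energies comparable.
    """
    cov = X.T @ X / X.shape[0]               # covariance of zero-mean data
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, descending order
    return np.sqrt(L / eigvals[:L].sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
X -= X.mean(axis=0)                          # zero-mean, as in the paper
alpha = compute_scale(X, L=16)
```

For roughly isotropic data (each component with unit variance), the top-L eigenvalues sum to about L, so the computed scale is close to 1; for data dominated by a few high-variance directions, the scale shrinks accordingly.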
Table 3: Percentage of retained variance, quantization error per bit, and mAP of ITQ and SCQ-OnE on the validation sets.

Metric               | Method   | CIFAR-10 | LabelMe | SUN397
% Retained variance  | ITQ      | 100%     | 100%    | 100%
                     | SCQ-OnE  | 59.6%    | 63.0%   | 69.4%
Quantization error   | ITQ      | 0.75     | 0.71    | 0.65
                     | SCQ-OnE  | 0.29     | 0.29    | 0.24
mAP                  | ITQ      | 27.01    | 18.24   | 37.79
                     | SCQ-OnE  | 30.68    | 23.74   | 41.12
Additionally, in Table 3, we summarize the percentage of retained variance, the quantization loss per bit, and the retrieval performance (mAP) on the validation sets for ITQ and our SCQ-OnE method. Even though the projection matrix learned by our Algorithm 1 retains less variance than the optimal PCA projection matrix (i.e., the first step of ITQ), it achieves a much smaller quantization error. Hence, balancing the variance loss and the quantization error is desirable and can result in higher retrieval performance.
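The two quantities contrasted in Table 3 can be measured for any projection matrix, as in the following sketch (assuming a column-orthonormal projection and the loss conventions described above; names and normalization are illustrative, not the authors' exact code):

```python
import numpy as np

def evaluate_projection(X, W):
    """Illustrative sketch of the two Table 3 quantities for a projection W.

    X: (n, d) zero-mean features; W: (d, L) column-orthonormal projection.
    Returns (% of top-L PCA variance retained, quantization error per bit).
    """
    n, L = X.shape[0], W.shape[1]
    Z = X @ W                                     # projected data (n, L)
    retained = Z.var(axis=0).sum()                # variance kept by W
    eigvals = np.linalg.eigvalsh(X.T @ X / n)[::-1]
    pca_var = eigvals[:L].sum()                   # best achievable by PCA
    B = np.sign(Z)                                # binary codes in {-1, +1}
    quant_err = np.mean((B - Z) ** 2)             # squared loss per bit
    return 100.0 * retained / pca_var, quant_err

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
X -= X.mean(axis=0)
W, _ = np.linalg.qr(rng.normal(size=(32, 8)))    # random orthonormal columns
pct, err = evaluate_projection(X, W)
```

This makes the trade-off in Table 3 easy to reproduce on one's own features: a PCA projection gives 100% retained variance but a larger quantization error, while a projection learned to reduce the quantization error retains less variance.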
4.3 Comparison with state-of-the-art
In this section, we evaluate our proposed hashing methods, SCQ - OnE and OgE, and compare them with the state-of-the-art unsupervised hashing methods, including SpH, KMH, BA, and ITQ. The experimental results in mAP, prec@r2, and prec@1k are reported in Table 2. Our proposed methods clearly achieve significant improvements over all datasets on the majority of evaluation metrics. The improvement gaps are clearer at higher code lengths. Additionally, OgE generally achieves slightly higher performance than OnE. Moreover, it is noticeable that, for prec@r2, all compared methods suffer performance degradation at long hash code lengths, while our proposed methods still achieve good prec@r2. This shows that the binary codes produced by our methods highly preserve data similarity.
Table 4: Comparison with DH and UH-BDNN on CIFAR-10.

Methods    | mAP           | prec@r2
           | 16    | 32    | 16    | 32
DH         | 16.17 | 16.62 | 23.33 | 15.77
UH-BDNN    | 17.83 | 18.52 | 24.97 | 18.85
SCQ - OnE  | 17.97 | 18.63 | 24.57 | 23.72
SCQ - OgE  | 18.00 | 18.78 | 24.15 | 25.69
Table 5: Comparison with BGAN (mAP) on CIFAR-10 and NUS-WIDE.

Methods    | CIFAR-10                      | NUS-WIDE
           | 12    | 24    | 32    | 48    | 12    | 24    | 32    | 48
BGAN       | 40.1  | 51.2  | 53.1  | 55.8  | 67.5  | 69.0  | 71.4  | 72.8
SCQ - OnE  | 53.59 | 55.77 | 57.62 | 58.14 | 69.82 | 70.53 | 72.78 | 73.25
SCQ - OgE  | 53.83 | 55.65 | 57.74 | 58.44 | 70.17 | 71.31 | 72.49 | 72.95
Comparison with Deep Hashing (DH) (Erin Liong et al. (2015)) and Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN) (Do et al. (2016a)). Recently, several methods (Erin Liong et al. (2015); Do et al. (2016a)) have applied DNNs to learn binary hash codes and can achieve very competitive performance. Hence, in order to have a complete evaluation, following the experimental settings of Erin Liong et al. (2015); Do et al. (2016a), we conduct experiments on the CIFAR-10 dataset. In this experiment, 100 images are randomly sampled from each class as a query set; the remaining images are used for training and as the database. Each image is represented by a 512-D GIST descriptor (Oliva and Torralba (2001)). In addition, to avoid biased results due to the choice of test samples, we repeat the experiment 5 times with 5 different random training/query splits. The comparative results in terms of mAP and prec@r2 are presented in Table 4. Our proposed methods are very competitive with DH and UH-BDNN, specifically achieving higher mAP and prec@r2 than both at the longer code length.
Comparison with Binary Generative Adversarial Networks for Image Retrieval (BGAN) (Song (2018)). Recently, BGAN applied a continuous approximation of the sign function to learn binary codes that can help to generate images plausibly similar to the original images. The method has been shown to achieve outstanding performance in the unsupervised image hashing task. It is important to note that BGAN differs from our method and the compared methods in that BGAN jointly learns image feature representations and binary codes, where the binary codes are obtained by using a smooth approximation of the sign function, whereas our method and the compared methods learn the optimal binary codes given the image representations. Hence, to further validate the effectiveness of our methods and to compare with BGAN, we apply our methods to the FC7 features extracted from the feature extraction component of the pretrained BGAN model (the model is obtained after training BGAN on the CIFAR-10 and NUS-WIDE datasets accordingly; the same model is also used to obtain the BGAN binary codes) on the CIFAR-10 and NUS-WIDE (Chua et al.) datasets. In this experiment, we aim to show that, by applying our hashing methods to the pretrained features from the feature extraction component of BGAN, our methods can produce better hash codes than those obtained from the joint learning approach of BGAN.
Similar to the experimental setting in BGAN (Song (2018)), for both CIFAR-10 and NUS-WIDE, we randomly select 100 images per class as the test query set; the remaining images are used as the database for retrieval. We then randomly sample from the database 1,000 images per class as the training set. Table 5 shows that, by using the more discriminative features from the pretrained feature extraction component of BGAN (in comparison with the image features obtained from the pretrained off-the-shelf VGG network (Simonyan and Zisserman (2014))), our methods can outperform BGAN, i.e., our methods produce better binary codes than the approximate sign function in BGAN, and achieve state-of-the-art performance in the unsupervised image hashing task. Hence, the experimental results emphasize the importance of an effective method for preserving the discriminative power of high-dimensional CNN features in very compact binary representations, i.e., effectively handling the challenging binary and orthogonal constraints.
4.4 Training time and processing time
In this experiment, we empirically evaluate the training time and the online processing time of our methods. The experiments are carried out on a workstation with a 4-core i7-6700 CPU @ 3.40GHz, on the combination of the CIFAR-10, LabelMe-12-50k, and SUN397 datasets. For OnE and OgE, the training time includes the time for applying zero-mean normalization, scaling, and dimensionality reduction. We use 50 iterations for all experiments. Fig. 5 shows that our proposed methods, OnE and OgE, are very efficient. OgE is just slightly slower than ITQ. Even though OnE is slower than OgE and ITQ, it takes just over a minute for 100,000 training samples, which is still very fast and practical in comparison with several dozen minutes for KMH, BA, and UH-BDNN (for training 50,000 CIFAR-10 samples using the authors' released code and dataset (Do et al. (2016a))).
Compared with the training cost, the time to produce new hash codes is more important since it is done in real time. Similar to Semi-Supervised Hashing (SSH) (Wang et al. (2012)) and ITQ (Gong and Lazebnik (2011)), by using only a single linear transformation, our proposed methods require only one BLAS operation (gemv or gemm) and a comparison operation; hence, they take negligible time to produce binary codes for new data points.
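The online encoding step described above amounts to one matrix product and one comparison, as in the following sketch (the function name, the bit convention, and folding the scaling into the projection matrix are illustrative assumptions, not the authors' released code):

```python
import numpy as np

def hash_codes(X, mean, W):
    """Illustrative sketch of online encoding: one gemm plus one comparison.

    X: (n, d) raw features; mean: (d,) training mean; W: (d, L) learned
    projection with any scaling folded in.
    """
    Z = (X - mean) @ W                 # single BLAS gemm (or gemv for n = 1)
    return (Z > 0).astype(np.uint8)    # comparison -> bits in {0, 1}

# Tiny worked example with an identity projection (hypothetical values).
mean = np.zeros(3)
W = np.eye(3)
codes = hash_codes(np.array([[0.5, -1.0, 2.0]]), mean, W)
# codes -> [[1, 0, 1]]
```

Because the whole encoder is a single affine map followed by thresholding, batching many query images into one gemm call keeps the per-image cost negligible, which is the practical point made above.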
5 Conclusion
In this paper, we successfully addressed the problem of jointly learning to preserve data pairwise (dis)similarity in a low-dimensional space and to minimize the binary quantization loss under the strict orthogonal constraint. Additionally, we showed that as more variance is retained, the quantization loss becomes undesirably larger, and vice versa. Hence, by appropriately balancing these two factors with a scale parameter, our methods can produce better binary codes. Extensive experiments on various datasets show that our proposed methods, Simultaneous Compression and Quantization (SCQ): Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE), outperform other state-of-the-art hashing methods by clear margins under various standard evaluation metrics and benchmark datasets. Furthermore, OnE and OgE are very computationally efficient in both the training and testing steps.
References
 Andoni and Indyk (2008) Andoni, A., Indyk, P., 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51.
 Boyd and Vandenberghe (2004) Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press, New York, NY, USA.
 Cao et al. (2018) Cao, Y., Long, M., Liu, B., Wang, J., 2018. Deep Cauchy hashing for Hamming space retrieval, in: CVPR.
 Cao et al. (2017) Cao, Z., Long, M., Wang, J., Yu, P.S., 2017. Hashnet: Deep learning to hash by continuation, in: ICCV, pp. 5609–5618.
 CarreiraPerpiñán and Raziperchikolaei (2015) CarreiraPerpiñán, M.Á., Raziperchikolaei, R., 2015. Hashing with binary autoencoders, in: CVPR.
 Chen et al. (2018) Chen, Z., Yuan, X., Lu, J., Tian, Q., Zhou, J., 2018. Deep hashing via discrepancy minimization, in: CVPR, pp. 6838–6847.
 Chua et al. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.T. NUS-WIDE: A real-world web image database from National University of Singapore, in: Proc. of ACM Conf. on Image and Video Retrieval.
 Do et al. (2019) Do, T., Hoang, T.N.A., Le, K., Doan, D., Cheung, N., 2019. Compact hash code learning with binary deep neural network. IEEE Transactions on Multimedia .
 Do et al. (2016a) Do, T.T., Doan, A.D., Cheung, N.M., 2016a. Learning to hash with binary deep neural network, in: ECCV.
 Do et al. (2016b) Do, T.T., Doan, A.D., Nguyen, D.T., Cheung, N.M., 2016b. Binary hashing with semidefinite relaxation and augmented lagrangian, in: ECCV.
 Do et al. (2019) Do, T.T., Le, K., Hoang, T., Le, H., Nguyen, T.V., Cheung, N.M., 2019. Simultaneous feature aggregating and hashing for compact binary code learning. IEEE TIP .
 Do et al. (2017) Do, T.T., Tan, D.K.L., Pham, T., Cheung, N.M., 2017. Simultaneous feature aggregating and hashing for largescale image search, in: CVPR.
 Duan et al. (2018) Duan, L.Y., Wu, Y., Huang, Y., Wang, Z., Yuan, J., Gao, W., 2018. Minimizing reconstruction bias hashing via joint projection learning and quantization. IEEE TIP 27, 3127–3141.
 Duan et al. (2017) Duan, Y., Lu, J., Wang, Z., Feng, J., Zhou, J., 2017. Learning Deep Binary Descriptor with Multiquantization, in: CVPR, pp. 4857–4866.
 En et al. (2017) En, S., Crémilleux, B., Jurie, F., 2017. Unsupervised deep hashing with stacked convolutional autoencoders, in: ICIP, pp. 3420–3424.
 Erin Liong et al. (2015) Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J., 2015. Deep hashing for compact binary codes learning, in: CVPR.
 Gionis et al. (1999) Gionis, A., Indyk, P., Motwani, R., 1999. Similarity search in high dimensions via hashing, in: VLDB.
 Gong and Lazebnik (2011) Gong, Y., Lazebnik, S., 2011. Iterative quantization: A procrustean approach to learning binary codes, in: CVPR.
 Grauman and Fergus (2013) Grauman, K., Fergus, R., 2013. Learning binary hash codes for largescale image search, in: Studies in Computational Intelligence.
 Gurbuzbalaban et al. (2017) Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N., 2017. When cyclic coordinate descent outperforms randomized coordinate descent, in: NIPS.
 He et al. (2013) He, K., Wen, F., Sun, J., 2013. K-means hashing: An affinity-preserving quantization method for learning binary compact codes, in: CVPR.
 Heo et al. (2012) Heo, J.P., Lee, Y., He, J., Chang, S.F., Yoon, S.E., 2012. Spherical hashing, in: CVPR.
 Hu et al. (2018) Hu, M., Yang, Y., Shen, F., Xie, N., Shen, H.T., 2018. Hashing with angular reconstructive embeddings. IEEE TIP 27, 545–555.
 Huang and Lin (2018) Huang, Y., Lin, Z., 2018. Binary multidimensional scaling for hashing. IEEE TIP 27, 406–418.
 Jain et al. (2017) Jain, H., Zepeda, J., Pérez, P., Gribonval, R., 2017. SuBiC: A Supervised, Structured Binary Code for Image Search, in: ICCV, pp. 833–842.
 Krizhevsky and Hinton (2009) Krizhevsky, A., Hinton, G., 2009. Learning multiple layers of features from tiny images, in: Technical report, University of Toronto.
 Kulis and Darrell (2009) Kulis, B., Darrell, T., 2009. Learning to hash with binary reconstructive embeddings, in: NIPS.
 Kulis and Grauman (2009) Kulis, B., Grauman, K., 2009. Kernelized locality-sensitive hashing for scalable image search, in: ICCV.
 Lai et al. (2015) Lai, H., Pan, Y., Liu, Y., Yan, S., 2015. Simultaneous feature learning and hash coding with deep neural networks, in: CVPR, pp. 3270–3278.
 Lin et al. (2014) Lin, G., Shen, C., Shi, Q., van den Hengel, A., Suter, D., 2014. Fast supervised hashing with decision trees for high-dimensional data, in: CVPR.
 Lin et al. (2016) Lin, K., Lu, J., Chen, C.S., Zhou, J., 2016. Learning compact binary descriptors with unsupervised deep neural networks, in: CVPR.
 Lin et al. (2015) Lin, K., Yang, H., Hsiao, J., Chen, C., 2015. Deep learning of binary hash codes for fast image retrieval, in: CVPR Workshop, pp. 27–35.
 Liu et al. (2016) Liu, H., Wang, R., Shan, S., Chen, X., 2016. Deep Supervised Hashing for Fast Image Retrieval, in: CVPR, pp. 2064–2072.
 Liu et al. (2012) Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F., 2012. Supervised hashing with kernels, in: CVPR.
 Norouzi et al. (2012) Norouzi, M., Fleet, D.J., Salakhutdinov, R., 2012. Hamming distance metric learning, in: NIPS.
 Oliva and Torralba (2001) Oliva, A., Torralba, A., 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV , 145–175.
 Raginsky and Lazebnik (2009) Raginsky, M., Lazebnik, S., 2009. Locality-sensitive binary codes from shift-invariant kernels, in: NIPS.
 Russell et al. (2008) Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T., 2008. LabelMe: A database and web-based tool for image annotation. IJCV , 157–173.
 Schönemann (1966) Schönemann, P.H., 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika .
 Shen et al. (2015) Shen, F., Shen, C., Liu, W., Shen, H.T., 2015. Supervised discrete hashing, in: CVPR.
 Shen et al. (2018) Shen, F., Xu, Y., Liu, L., Yang, Y., Huang, Z., Shen, H.T., 2018. Unsupervised deep hashing with similarityadaptive and discrete optimization. IEEE TPAMI , 1–1.
 Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. CoRR .
 Song (2018) Song, J., 2018. Binary generative adversarial networks for image retrieval, in: AAAI.
 Uetz and Behnke (2009) Uetz, R., Behnke, S., 2009. Large-scale object recognition with CUDA-accelerated hierarchical neural networks, in: IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS).
 Wang et al. (2012) Wang, J., Kumar, S., Chang, S.F., 2012. Semi-supervised hashing for large-scale search. TPAMI .
 Wang et al. (2015) Wang, J., Liu, W., Kumar, S., Chang, S., 2015. Learning to hash for indexing big data  a survey, in: Proceedings of the IEEE.
 Wang et al. (2014) Wang, J., Tao Shen, H., Song, J., Ji, J., 2014. Hashing for similarity search: A survey .
 Wang et al. (2017) Wang, J., Zhang, T., Song, J., Sebe, N., Shen, H.T., 2017. A survey on learning to hash. TPAMI .
 Wang et al. (2018) Wang, M., Zhou, W., Tian, Q., Li, H., 2018. A general framework for linear distance preserving hashing. IEEE TIP 27, 907–922.
 Weiss et al. (2009) Weiss, Y., Torralba, A., Fergus, R., 2009. Spectral hashing, in: NIPS.
 Wen and Yin (2013) Wen, Z., Yin, W., 2013. A feasible method for optimization with orthogonality constraints. Math. Program. .
 Xiao et al. (2016) Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A., 2016. Sun database: Exploring a large collection of scene categories. IJCV .
 Yuan and Ghanem (2017) Yuan, G., Ghanem, B., 2017. An exact penalty method for binary optimization based on mpec formulation, in: AAAI.
 Zhang et al. (2010) Zhang, D., Wang, J., Cai, D., Lu, J., 2010. Self-taught hashing for fast similarity search, in: ACM SIGIR.
Appendix A Derivation for Eq. (8)
Firstly, the dual function can be simply constructed by substituting Eq. (7) into Eq. (6):
(26) 
Next, we note that:
(27) 
Hence,
(28) 
(29) 
Note that: .
A simpler way to obtain this result is to take the derivative of the Lagrangian first, and then substitute Eq. (7).
Appendix B Derivation for Eq. (13)
Following a similar derivation to that in Appendix A, we can obtain the second condition of Eq. (13). We now provide the detailed derivation for the third condition. Consider the i-th value of the Lagrange multiplier:
(32) 
(33) 
where .
By setting this derivative equal to zero and performing some simple manipulations, we can obtain the third condition of Eq. (13) as follows:
(34) 
where
(35) 