Fast Supervised Discrete Hashing and its Analysis
Abstract
In this paper, we propose a learning-based supervised discrete hashing method. Binary hashing is widely used for large-scale image retrieval as well as video and document searches, because the compact binary-code representation is economical for data storage and suitable for query searches using bit operations. The recently proposed Supervised Discrete Hashing (SDH) efficiently solves a mixed-integer programming problem by alternating optimization and the Discrete Cyclic Coordinate descent (DCC) method. Based on some preliminary experiments, we show that the SDH model can be simplified without performance degradation; we call this approximate model the "Fast SDH" (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution for it. In contrast to SDH, our model does not require an alternating optimization algorithm and does not depend on initial values. FSDH is also easier to implement than Iterative Quantization (ITQ). Experimental results involving large-scale databases show that FSDH outperforms conventional SDH in terms of precision, recall, and computation time.
1 Introduction
Binary hashing is an important technique for computer vision, machine learning, and large-scale image/video/document retrieval [6, 9, 17, 19, 24, 27, 28]. Through binary hashing, multidimensional feature vectors with integer or floating-point elements are transformed into short binary codes. This binary-code representation matters because large-scale databases occupy large amounts of storage. Furthermore, it is easy to compare a query in binary code with a binary code in a database, because the Hamming distance between them can be computed efficiently using bitwise operations that are part of the instruction set of any modern CPU [3, 7].
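As a toy illustration of this point (our own example, not from the paper), the Hamming distance between two codes packed into 64-bit words is an XOR followed by a population count:

```matlab
% Toy sketch: Hamming distance between two packed 64-bit binary codes.
a = uint64(12345); b = uint64(67890);    % two packed codes (placeholder values)
hd = sum(bitget(bitxor(a, b), 1:64));    % XOR the words, then count the set bits
```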
Many binary hashing methods have been proposed. Locality-sensitive hashing (LSH) [6] is one of the most popular. In LSH, binary codes are generated by applying a random projection matrix and thresholding with the sign of the projected data. Iterative quantization (ITQ) [9] is another state-of-the-art binary hashing method. In ITQ, the projection matrix of the hash function is optimized by iterating projection and thresholding procedures on the given training samples.
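To make the LSH construction concrete, here is a minimal sketch (our own illustration; the sizes are placeholders):

```matlab
% Minimal LSH: random Gaussian projection followed by sign thresholding.
D = 512; N = 1000; L = 32;    % feature dimension, #samples, code length
X = randn(D, N);              % feature vectors, one per column (placeholder data)
R = randn(L, D);              % random projection matrix
B = sign(R * X);              % binary codes in {-1, +1}
B(B == 0) = 1;                % break the rare sign(0) ties
```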
Binary hashing can be roughly classified into two types: unsupervised hashing [17, 22, 21, 27, 11, 35] and supervised hashing. Supervised hashing exploits label information when it is available. In general, supervised hashing yields better performance than unsupervised hashing, so in this study we target supervised hashing. In addition, some unsupervised methods such as LSH and ITQ can be converted into supervised methods by imposing label information on the feature vectors. For example, canonical correlation analysis (CCA) [12] can transform feature vectors to maximize inter-class variation and minimize intra-class variation according to the label information. Hereafter, we call these combinations CCA-LSH and CCA-ITQ, respectively.
Methods that impose label information directly on the hash functions, rather than on the feature vectors as in CCA, have also been proposed. Kernel-based supervised hashing (KSH) [23] uses spectral relaxation to optimize a cost function through a sign function; feature vectors are transformed by kernels during preprocessing. KSH has since been improved to kernel-based supervised discrete hashing (KSDH) [30], which relaxes the discrete hashing problem through linear relaxation. Supervised Discriminative Hashing [24] decomposes the training samples into inter- and intra-class samples. Column sampling-based discrete supervised hashing (COSDISH) [14] uses column sampling based on semantic similarity, and decomposes the problem into subproblems to simplify its solution.
The optimization of binary codes leads to a mixed-integer programming problem involving integer and non-integer variables, which is an NP-hard problem in general [28]. Therefore, many methods discard the discrete constraints, or transform the problem into a relaxed problem, i.e., a linear programming problem [26]. This relaxation significantly simplifies the problem, but is known to affect classification performance [28].
Recent research has introduced supervised discrete hashing (SDH) [28, 34], which directly learns the binary codes without relaxation. SDH is a state-of-the-art method because of its ease of implementation, reasonable learning time, and better performance than other state-of-the-art supervised hashing methods. To solve the discrete problem, SDH uses the discrete cyclic coordinate descent (DCC) method, an approximate solver for 0-1 quadratic integer programming problems.
1.1 Contributions and advantages
In this study, we first analyze the SDH model and point out, based on some preliminary experiments, that it can be simplified without performance degradation. We call the approximate model the fast SDH (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution to it. The model simplification is validated through experiments involving several large-scale datasets.
The advantages of the proposed method are as follows:

Unlike SDH, it does not require alternating optimization or hyperparameters, and it does not depend on initial values.

It is easier to implement than ITQ and is efficient in terms of computation time. FSDH can be implemented in three lines of MATLAB.

High bit scalability: its learning time and performance do not depend on the code length.

It has better precision and recall than other state-of-the-art supervised hashing methods.
1.2 Related work
As described subsequently, the SDH model poses a matrix factorization problem $Y \approx W^\top B$. The most popular form of this problem is singular value decomposition (SVD) [8]: when $W$ and $B$ are unconstrained, the Householder method is used for the computation. When the factors are constrained to be nonnegative, nonnegative matrix factorization (NMF) is used [4].
In the case of the SDH model, $B$ is constrained to $\{-1,1\}$ and $W$ is unconstrained. In a similar problem setting, Slawski et al. proposed matrix factorization with binary components [31] and showed an application to DNA analysis for cancer research: there, $B$ is constrained to $\{0,1\}$ and indicates unmethylated/methylated DNA sequences. Furthermore, a similar model has been proposed in display electronics. Koutaki proposed binary continuous decomposition for multiview displays [15]. In this model, multiple images are decomposed into binary images $B$ and a weight matrix $W$: an image projector projects binary 0-1 patterns through digital mirror devices (DMDs), and the weight matrix corresponds to the transmittance of an LCD shutter.
2 Supervised Discrete Hashing (SDH) Model
In this section, we introduce the supervised discrete hashing (SDH) model. Let $\mathbf{x} \in \mathbb{R}^D$ be a feature vector, and introduce a set of training samples $\{\mathbf{x}_i\}_{i=1}^N$. Then, consider binary label information $\mathbf{y}_i \in \{0,1\}^C$ corresponding to $\mathbf{x}_i$, where $C$ is the number of categories to classify. Setting the $c$-th element to 1 and the other elements to 0 indicates that the $i$-th vector belongs to class $c$. By concatenating the $N$ samples of $\mathbf{y}_i$ horizontally, a label matrix $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_N] \in \{0,1\}^{C \times N}$ is constructed.
2.1 Binary code assignment to each sample
For each sample $\mathbf{x}_i$, an $L$-bit binary code $\mathbf{b}_i \in \{-1,1\}^L$ is assigned. By concatenating the $N$ samples of $\mathbf{b}_i$ horizontally, a binary matrix $B = [\mathbf{b}_1, \ldots, \mathbf{b}_N] \in \{-1,1\}^{L \times N}$ is constructed. The binary code is computed as

$\mathbf{b} = \operatorname{sgn}(P\mathbf{x})$,  (1)

where $P$ is a linear transformation matrix (therefore $\mathbf{b} \in \{-1,1\}^L$) and $\operatorname{sgn}(\cdot)$ is the sign function. The major aim of SDH is to determine the matrix $P$ from the training samples. In practice, the feature vectors are transformed by preprocessing; therefore, we denote the original feature vectors by $\mathbf{x}^o$ and the transformed feature vectors by $\mathbf{x}$.
2.2 Preprocessing: Kernel transformation
The original feature vectors $\mathbf{x}^o$ of the training samples are converted into the feature vectors $\mathbf{x}$ using the following kernel transformation $\phi$:

$\mathbf{x} = \phi(\mathbf{x}^o) = \left[\exp(-\|\mathbf{x}^o - \mathbf{a}_1\|^2/\sigma^2), \ldots, \exp(-\|\mathbf{x}^o - \mathbf{a}_m\|^2/\sigma^2)\right]^\top$,  (2)

where $\mathbf{a}_j$ ($j = 1, \ldots, m$) is an anchor vector obtained by randomly sampling the original feature vectors, and $\sigma$ is a kernel parameter. Then, the transformed feature vectors are bundled into the matrix form $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{m \times N}$.
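A minimal sketch of this preprocessing step (our own code; `pdist2` comes from the Statistics and Machine Learning Toolbox, and the parameter values are placeholders):

```matlab
% RBF kernel mapping of (2): each feature becomes m anchor similarities.
m = 1000; sigma = 0.4;                   % #anchors and kernel width (placeholders)
idx = randperm(size(Xo, 2), m);          % Xo: D-by-N matrix of original features
A = Xo(:, idx);                          % m randomly sampled anchor vectors
X = exp(-pdist2(A', Xo').^2 / sigma^2);  % transformed features, m-by-N
```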
2.3 Classification model
Following the binary coding of (1), we suppose that a good binary code is one from which the class can be predicted, and formulate the following simple linear classification model:

$\tilde{\mathbf{y}} = W^\top \mathbf{b}$,  (3)

where $W \in \mathbb{R}^{L \times C}$ is a weight matrix and $\tilde{\mathbf{y}}$ is an estimated label vector. As mentioned above, the index of its maximum element indicates the assigned class of $\mathbf{x}$.
2.4 Optimization of SDH
The SDH problem is defined as the following minimization problem:

$\min_{B, W, P} \|Y - W^\top B\|^2 + \lambda \|W\|^2 + \nu \|B - PX\|^2 \quad \text{s.t.} \quad B \in \{-1,1\}^{L \times N}$,  (4)

where $\|\cdot\|$ is the Frobenius norm, and $\lambda$ and $\nu$ are balance parameters. The first term contains the classification model explained in Sec. 2.3. The second term is a regularizer on $W$ to avoid overfitting. The third term represents the fitting errors due to binary coding.
In this optimization, it is sufficient to compute $P$: if $P$ is obtained, $B$ can be obtained by (1), and $W$ can be obtained from the following simple least-squares solution:

$W = (BB^\top + \lambda I)^{-1} B Y^\top$.  (5)
However, due to the difficulty of the optimization, the problem in (4) is usually divided into three subproblems, namely the optimization of $P$, $W$, and $B$, and the following alternating optimization is performed:

(i) Initialization: $B$ is initialized, usually randomly.
(ii) F-Step: $P$ is computed by the following simple least-squares method:

$P = BX^\top (XX^\top)^{-1}$.  (6)
(iii) W-Step: $W$ is computed by (5).
(iv) B-Step: After fixing $W$ and $P$, equation (4) becomes

$\min_{B} \|W^\top B\|^2 - 2\,\mathrm{tr}(B^\top Q) \quad \text{s.t.} \quad B \in \{-1,1\}^{L \times N}$,  (7)

where

$Q = WY + \nu PX$.  (8)

Note that $\|B\|^2 = LN$ is constant. The trace can be rewritten as

$\mathrm{tr}(B^\top Q) = \sum_{i=1}^{N} \mathbf{q}_i^\top \mathbf{b}_i$,  (9)

where $\mathbf{q}_i$ is the $i$-th column vector of $Q$. The $\mathbf{b}_i$ are actually independent of one another. Therefore, (7) reduces to the following 0-1 integer quadratic programming problem for each $i$-th sample:

$\min_{\mathbf{b} \in \{-1,1\}^L} \mathbf{b}^\top W W^\top \mathbf{b} - 2 \mathbf{q}^\top \mathbf{b}$.  (10)
(v) Iterate steps (ii)-(iv) until convergence.
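Putting steps (i)-(v) together, the alternating loop has roughly the following shape (a sketch under our reading of the notation; `dcc_bstep` is a hypothetical per-sample solver for (10), such as the DCC method of Sec. 3.1):

```matlab
% Skeleton of the SDH alternating optimization. X: m-by-N kernelized features,
% Y: C-by-N one-hot labels, L: code length, lambda/nu: balance parameters.
B = sign(randn(L, N)); B(B == 0) = 1;            % (i) random initialization
for it = 1:maxIter
    P = (B * X') / (X * X');                     % (ii) F-step, eq. (6)
    W = (B * B' + lambda * eye(L)) \ (B * Y');   % (iii) W-step, eq. (5)
    Q = W * Y + nu * (P * X);                    % eq. (8)
    B = dcc_bstep(W * W', Q, B);                 % (iv) B-step: solve (10) per sample
end                                              % (v) repeat until convergence
```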
3 Discussion of the SDH Model
3.1 0-1 integer quadratic programming problem
DCC method
To solve (10), SDH uses the discrete cyclic coordinate descent (DCC) method. In this method, a one-bit element of $\mathbf{b}$ is optimized while fixing the other bits; with $A = WW^\top$, the $l$-th bit is optimized as

$b^{(l)} = \operatorname{sgn}\Big(q^{(l)} - \sum_{k \neq l} A_{lk}\, b^{(k)}\Big)$.  (11)
All the bits are optimized in turn, and this procedure is repeated several times. The DCC method is prone to fall into a local minimum because of its greediness; to improve it, Shen et al. proposed using a proximal operation from convex optimization [29].
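One DCC sweep over a single sample might look as follows (a sketch consistent with (10) and (11); the function name is ours):

```matlab
% One DCC sweep: bitwise descent on b'*A*b - 2*q'*b over b in {-1,+1}^L,
% where A = W*W' and q is the sample's column of Q.
function b = dcc_sweep(A, q, b)
    for l = 1:length(b)
        z = q(l) - A(l, :) * b + A(l, l) * b(l);  % exclude the k = l term
        if z ~= 0
            b(l) = sign(z);                       % keep the old bit when z == 0
        end
    end
end
```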
Branch-and-bound method
For a large number of bits $L$, solving (10) exactly is difficult because the problem is NP-hard. However, there exist a few efficient methods for the 0-1 integer quadratic programming problem. In [15], Koutaki used a branch-and-bound method to solve it: $\mathbf{b}$ is expanded into a binary tree of depth $L$, and the problem of (10) is divided into subproblems by splitting $\mathbf{b}$. At each node, a lower bound is computed and compared with the current best solution, so that child nodes can be excluded from the search.
The computation of the lower bound depends on the structure of $W$ and $\mathbf{q}$. To compute a lower bound in general, linear relaxation is the standard method, i.e., relaxing $\mathbf{b} \in \{-1,1\}^L$ to a continuous box. In this case, a rough lower bound of the quadratic term in (10) can be provided by the minimum eigenvalue of $WW^\top$. However, linear relaxation is useless in the SDH model: in general $L > C$, so the matrix $WW^\top$ is rank-deficient and, as a result, its minimum eigenvalue becomes zero.
Even if we can obtain an efficient algorithm, such as branch-and-bound with a good lower bound, in the application of binary hashing we still suffer from computational difficulties, because the frequently used code lengths are still too long to optimize exactly.
3.2 Alternating optimization and initial value dependence
Even if we solve the binary optimization in (10) exactly, the resulting binary codes are not always optimal, because they depend on the other fixed variables $W$ and $P$. In addition, alternating optimization is prone to a serious problem: the solution depends on the initial values and may fall into a local minimum during the iterations, even if each of the F-Step, W-Step, and B-Step provides the optimal solution.
Figure 1 shows an example of the optimization result for a simple version of the SDH model in (4) with a small number of bits. In this case, an exact solution is known, and its minimum value is shown as the green line in Fig. 1. DCC (red lines) shows the results for 10 randomized initial conditions. The full search (blue lines) shows the results of an exact full search in the B-Step, where all candidate codes are examined.

In spite of the small size of the problem, the conventional alternating solvers (DCC and full search) cannot reach the exact minimum of the cost function, and their results depend on the initial values. Interestingly, the results of the full search immediately fall into a local minimum, and are worse than those of DCC.
4 Proposed Fast SDH Model
We introduce a new hashing model by approximating the SDH model, based on the following assumptions:

A1: The number of bits of the binary code is a power of 2: $L = 2^p$.

A2: The number of bits is greater than or equal to the number of classes: $L \geq C$.

A3: The problem is a single-label problem.

A4: $\nu = 0$ in (8).

Note that assumptions A1-A3 also become limitations of the proposed model. Regarding A4, SDH recommends that the parameter $\nu$ be set to a very small value [28]. In practice, the first term of (8) is orders of magnitude larger than the second on the CIFAR-10 dataset (see Table 2). Furthermore, when $\nu = 0$, almost the same results are obtained on all datasets, as shown in the experimental results in Sec. 5. We call this approximation using $\nu = 0$ the "fast SDH (FSDH) approximation."
Using the FSDH approximation, we solve the following problem for each sample in the B-Step:

$\min_{\mathbf{b} \in \{-1,1\}^L} \mathbf{b}^\top W W^\top \mathbf{b} - 2 \mathbf{q}^\top \mathbf{b}$,  (12)

where $WW^\top$ is a constant matrix and $\mathbf{q} = W\mathbf{y}$ depends only on the label $\mathbf{y}$. By the single-label assumption A3, the number of distinct vectors $\mathbf{q}$ is limited to $C$:

$\mathbf{q}_c = W \mathbf{y}_c, \quad c = 1, \ldots, C$.  (13)

Thus, it is sufficient to solve only $C$ integer quadratic programming problems of the form (12), instead of $N$. In general, the number of samples is much larger than the number of classes, $N \gg C$. Thereby, the computational cost of the B-Step becomes $N/C$ times lower. In other words, the FSDH approximation leads to the following proposition:
Proposition 4.1

Under the FSDH approximation, the SDH model assigns one binary code to each class.
After obtaining the binary code $\mathbf{b}'_c$ of each class $c$, the binary codes of all samples can be constructed by lining up the class codes as

$B = [\mathbf{b}'_{c_1}, \mathbf{b}'_{c_2}, \ldots, \mathbf{b}'_{c_N}]$,  (14)

where $c_i$ is the label of the $i$-th sample. After constructing $B$, the projection matrix $P$ can be obtained by (6).
4.1 Analytical solutions of FSDH model
From Proposition 4.1, it is sufficient to determine the binary code of each class. Furthermore, we can choose the optimal binary codes under the FSDH approximation as follows:
Lemma 4.2

If $f$ is convex, the solution of

$\min_{x_1, \ldots, x_N} \sum_{i=1}^{N} f(x_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} x_i = S$  (15)

is given by the mean value $x_i = S/N$.

See Appendix A for the proof.
Theorem 4.3

An analytical solution of FSDH is obtained as a Hadamard matrix.

Proof: Under the FSDH approximation, it suffices to consider one sample per class, so the FSDH model reduces to

$\min_{B', W} \|I - W^\top B'\|^2 + \lambda \|W\|^2$,  (16)

where $I$ is the $C \times C$ identity matrix and $B' = [\mathbf{b}'_1, \ldots, \mathbf{b}'_C] \in \{-1,1\}^{L \times C}$ is the matrix of class codes. Using the solution of (16) with respect to $W$, i.e., $W = (B'B'^\top + \lambda I)^{-1} B'$, and the eigendecomposition of $B'^\top B'$ with eigenvalues $\sigma_1, \ldots, \sigma_C$ (whose sum is the trace $\mathrm{tr}(B'^\top B') = LC$), equation (16) can be represented simply as

$\min_{\sigma_1, \ldots, \sigma_C} \sum_{c=1}^{C} \frac{\lambda}{\sigma_c + \lambda} \quad \text{s.t.} \quad \sum_{c=1}^{C} \sigma_c = LC$.  (17)

By Lemma 4.2, $\sigma_1 = \cdots = \sigma_C = L$. This implies $B'^\top B' = LI$, i.e., $B'$ is an orthogonal matrix with binary elements; in other words, $B'$ can be given by a sub-matrix ($C$ columns) of the Hadamard matrix $H_L$.
Corollary 4.4

The following characteristics can be obtained easily:

$B'$ is independent of the regularization parameter $\lambda$ ($\lambda$-invariant).

The optimal weight matrix of FSDH is given by a scaled version of the binary matrix: $W = \frac{1}{L + \lambda} B'$.

The minimum value of (16) is given by $\frac{C\lambda}{L + \lambda}$.
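As a quick check of these statements, here is a worked calculation under our reconstruction of the notation (using $B'^\top B' = LI$ and the push-through identity):

```latex
\begin{align*}
W &= (B'B'^{\top} + \lambda I_L)^{-1} B'
   = B'(B'^{\top}B' + \lambda I_C)^{-1}
   = \tfrac{1}{L+\lambda}\, B',\\
\|I - W^{\top} B'\|^2 + \lambda\|W\|^2
  &= \Big\|\tfrac{\lambda}{L+\lambda}\, I\Big\|^2
   + \frac{\lambda\, \mathrm{tr}(B'^{\top}B')}{(L+\lambda)^2}
   = \frac{C\lambda^{2} + \lambda L C}{(L+\lambda)^{2}}
   = \frac{C\lambda}{L+\lambda}.
\end{align*}
```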
In short, we can eliminate the W-Step, the alternating procedure, and the initial-value dependence. An exact solution of the FSDH model can be obtained independently of the hyperparameters $\lambda$ and $\nu$.
4.2 Implementation of FSDH
Algorithm 1 and Figure 2 show the FSDH algorithm and sample MATLAB code, respectively; the method is simple and easy to implement. Figure 3 shows an example of $B'$ and $W$. A Hadamard matrix of size $L = 2^p$ can be constructed recursively by Sylvester's method [33] as

$H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad H_{2^p} = \begin{bmatrix} H_{2^{p-1}} & H_{2^{p-1}} \\ H_{2^{p-1}} & -H_{2^{p-1}} \end{bmatrix}$.  (18)

Furthermore, Hadamard matrices of orders 12 and 20 were constructed by Hadamard [10]. Fortunately, in applications of binary hashing, power-of-two code lengths are used frequently, so Sylvester's method suffices in most cases.
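A compact training sketch in this spirit (our own rendering, not the exact code of Fig. 2; `labels` is assumed to be a 1-by-N vector of class indices in 1..C, and `X` the m-by-N kernelized features):

```matlab
% FSDH training: class codebook from a Hadamard matrix + one least-squares solve.
Bc = hadamard(L);          % Sylvester construction, built into MATLAB
Bc = Bc(:, 1:C);           % B': one L-bit code per class (orthogonal columns)
B  = Bc(:, labels);        % eq. (14): each sample receives its class code
P  = (B * X') / (X * X');  % eq. (6): projection matrix by least squares
% A new sample x is then coded as b = sign(P * phi(x)), as in (1).
```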
4.3 Analysis of the bias term of FSDH
We have already shown that the $B'$ obtained from the Hadamard matrix minimizes the first two terms of (4), $\|Y - W^\top B\|^2 + \lambda \|W\|^2$. Here, we pay attention to how $B$ affects the bias term $\|B - PX\|^2$, and analyze its behavior. We suppose that the samples are sorted by label. Let $E_b$ be the bias term; substituting the least-squares solution (6) for $P$ gives

$E_b = \|B - PX\|^2 = LN - \mathrm{tr}(BKB^\top)$,  (19)

where $K = X^\top (XX^\top)^{-1} X$ is a projection matrix. Therefore, to reduce the bias term, it is better that $\mathrm{tr}(BKB^\top)$ has a large value. Then, using the cyclic property of the trace, we can rewrite it as

$\mathrm{tr}(BKB^\top) = \mathrm{tr}\big((B^\top B) K\big)$,  (20)

where, for FSDH, $B^\top B = LJ$ with the block-diagonal matrix

$J = \mathrm{diag}(\mathbf{1}_{N_1}, \ldots, \mathbf{1}_{N_C})$;  (21)

the $\mathbf{1}_{N_c}$ are $N_c \times N_c$ matrices with all elements equal to 1, and $N_c$ is the number of samples with label $c$. Using these values, (20) can be expressed as

$\mathrm{tr}\big((B^\top B) K\big) = L \sum_{c=1}^{C} \sum_{i,j:\, c_i = c_j = c} k_{ij}$,  (22)

where the elements $k_{ij}$ of $K$ belonging to samples with the same label are summed up. Since the definition of $K$ is $X^\top (XX^\top)^{-1} X$, $k_{ij}$ can be regarded as the normalized correlation of $\mathbf{x}_i$ and $\mathbf{x}_j$. Since samples with the same label should have similar feature vectors, $k_{ij}$ is assumed to be a large value.
Figure 4 shows visualizations of the matrices $K$ and $B^\top B$ for SDH and FSDH. High-correlation areas of $K$ are partitioned into class blocks. The $B^\top B$ of SDH includes "negative" blocks in the off-diagonal components, which reduce $\mathrm{tr}((B^\top B)K)$. On the other hand, the proposed FSDH shows clear blocks: the diagonal blocks take the value $L$ and the off-diagonal blocks 0.
5 Experiments
5.1 Datasets
We tested the proposed method on three large-scale image datasets: CIFAR-10 [16] (https://www.cs.toronto.edu/~kriz/cifar.html), SUN397 [36] (http://groups.csail.mit.edu/vision/SUN/), and MNIST [18] (http://yann.lecun.com/exdb/mnist/). The feature vectors of all datasets were normalized. The multi-labeled NUS-WIDE dataset was not included, because the proposed method can be applied only to single-label problems.
CIFAR-10 includes labeled subsets of 60,000 images. In this test, we used 512-dimensional GIST features [25] extracted from the images; 1,000 samples were held out for testing and the remainder were used for training. The number of classes was $C = 10$, including "airplane", "automobile", "bird", etc.
SUN397 is a large-scale image dataset for scene recognition with 397 categories, consisting of 108,754 labeled images. We extracted 10 categories; a total of 500 training samples per class and 1,000 test samples were used. We used 512-dimensional GIST features extracted from the images. Since we used $C = 10$ categories, we call this subset "SUN10" in this study.
MNIST is an image dataset of handwritten digits. The feature vectors we used were the normalized raw pixels (28 x 28 [pix]) of the images. The number of classes was $C = 10$, i.e., the ten digits. We used the training samples and 1,000 test samples for evaluation.
5.2 Comparative methods and settings
The proposed method was compared with four state-of-the-art supervised hashing methods: CCA-ITQ, CCA-LSH, SDH, and COSDISH [14]. Unsupervised and semi-supervised methods were not assessed. All methods were implemented in MATLAB R2012b and tested on an Intel i7-4770 @ 3.4 GHz CPU with 32 GB of DDR3 SDRAM.
CCA-ITQ and CCA-LSH: ITQ and LSH are state-of-the-art binary hashing methods. They can be converted into supervised binary hashing methods by preprocessing the feature vectors using label information: a canonical correlation analysis (CCA) transformation was performed, and the feature vectors were normalized and set to zero mean. These methods generate the projection matrix $P$, and binary codes were assigned by (1).
Table 1: Validation of the FSDH approximation: MAP and precision of SDH with its default $\nu$ and with $\nu = 0$.

Method            CIFAR-10       SUN10          MNIST
                  MAP    Pre.    MAP    Pre.    MAP    Pre.
SDH ($\nu > 0$)   0.47   0.34    0.47   0.78    0.47   0.83
SDH ($\nu = 0$)   0.47   0.36    0.47   0.79    0.47   0.82
Table 2: Magnitudes of the two terms of $Q$ in (8); the first term dominates the second on every dataset.

Dataset     first term of (8)    second term of (8)
CIFAR-10    31.53                0.0132
SUN10        9.16                0.0045
MNIST       22.96                0.0124
Table 3: Learning time [s] on CIFAR-10 for each code length $L$; the FSDH and SDH rows correspond to different numbers of anchor points.

Method       L = 16    32        64        96        128
FSDH (…)     5.05      5.10      5.62      5.86      5.78
FSDH (…)     45.28     45.18     45.09     45.29     45.22
FSDH (…)     121.53    122.80    121.86    121.87    124.77
SDH (…)      37.42     55.04     112.20    185.79    285.53
SDH (…)      27.50     41.87     60.56     148.67    199.96
SDH (…)      344.55    343.38    378.19    474.44    607.49
COSDISH      11.76     41.42     155.66    349.82    656.55
CCA-ITQ      1.29      2.73      5.61      10.12     14.25
CCA-LSH      0.00      0.00      0.00      0.00      0.01
Table 4: Bit scalability on CIFAR-10: learning time, precision, and MAP for code lengths $L = 32$ to $1024$.

                            L = 32   64       128      256      512       1024
FSDH   learning time [s]    0.64     0.70     0.85     0.98     1.16      1.48
       Precision            0.50     0.47     0.47     0.47     0.47      0.47
       MAP                  0.44     0.44     0.44     0.44     0.44      0.44
SDH    learning time [s]    6.38     14.92    47.42    284.00   1189.49   5230.43
       Precision            0.49     0.40     0.20     0.11     0.03      0.01
       MAP                  0.44     0.47     0.48     0.48     0.46      0.44
COSDISH is a recently proposed supervised hashing method. Like ITQ, COSDISH generates the projection matrix $P$. In preprocessing, the feature vectors are transformed to zero mean and normalized by their variance. We used the open-source MATLAB code (http://cs.nju.edu.cn/lwj/).
SDH is a state-of-the-art supervised hashing method. We used the recommended balance parameters with the maximum number of iterations set to 5, and the same anchor points and kernel parameter for all datasets. SDH generated the projection matrix $P$, and binary codes were assigned by the reprojection (1). Furthermore, to show the validity of the FSDH approximation, we also evaluated SDH with $\nu = 0$. We used the open-source MATLAB code (https://github.com/bd622/DiscretHashing).
FSDH: The proposed method used the same settings as SDH: the same anchor points and kernel parameter for all datasets. FSDH generated the projection matrix $P$ and assigned binary codes through the reprojection (1), as in SDH. Our code is available to the public (https://github.com/goukoutaki/FSDH), and is shown in Fig. 2.
5.3 Results and discussion
Table 5: Results for a larger number of classes on the SUN dataset.

Method              MAP      Learning time [s]
(…)                 0.025    17519
(…)                 0.113    71883
(…)                 0.030    7
(…)                 0.264    721
FSDH                0.442    3542
FSH [20]            0.142    29624
LSVMb [2]           0.042    —
Top-RSBC+SGD [32]   0.344    4663
Precision and recall were computed by calculating the Hamming distances between the training samples and the test samples and using a Hamming radius of 2. Figure 6 shows the results, in terms of precision, recall, and mean average precision (MAP) of the Hamming ranking, for all methods on the three datasets. Code lengths of 16, 32, 64, 96, and 128 bits were evaluated.
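For concreteness, the radius-2 lookup for one query can be sketched as follows (our own code; since the codes are in {-1, +1}, the Hamming distance follows from an inner product):

```matlab
% Hamming-radius-2 retrieval for one query code bq (L-by-1) against a
% database Btrain (L-by-N) with ground-truth labels trainLabels (1-by-N).
dist = (L - bq' * Btrain) / 2;        % Hamming distance via inner product
retrieved = find(dist <= 2);          % all database codes within radius 2
relevant = (trainLabels == qLabel);   % true matches for the query's label
precision = nnz(relevant(retrieved)) / max(numel(retrieved), 1);
recall    = nnz(relevant(retrieved)) / nnz(relevant);
```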
CIFAR-10: COSDISH shows the best MAP, while FSDH yielded the best precision and recall. Although COSDISH showed a satisfactory MAP, its precision was low. For SDH and FSDH, increasing the number of anchor points improved the performance. As the code length increases, the precision of SDH drops, whereas FSDH maintains high precision and recall; this is a significant advantage of the proposed method. In general, as the code length increases, precision tends to decrease under such a narrow threshold as a Hamming radius of 2.
SUN10: On this dataset, the results for FSDH were significantly better. In particular, the recall rates of FSDH remained high in spite of long code lengths. When SDH and FSDH had the same number of anchor points, FSDH was clearly superior. The MAP of COSDISH was comparable to that of SDH; however, the precision and recall of COSDISH were not as good as those on CIFAR-10.
MNIST: FSDH yielded the best results, with the same trends as on the other datasets. It retained high precision and recall even with large code lengths.
The graphs on the right of Fig. 6 show the precision-recall curves based on Hamming ranking. FSDH shows better performance than SDH with the same number of anchor points. In particular, on the SUN dataset, FSDH yielded distinctly better results than the other methods.
5.3.1 Validation of FSDH approximation

Tables 1 and 2 validate the FSDH approximation: setting $\nu = 0$ changes the MAP and precision of SDH only marginally on all three datasets, and the second term of (8) is orders of magnitude smaller than the first.
5.3.2 Computation time
Table 3 shows the computation time of each method for CIFAR-10. The learning time of FSDH was almost identical to that of CCA-ITQ. As the number of anchor points increased, the computation time increased for both SDH and FSDH. The computation time of SDH and COSDISH increased with the code length; the number of iterations of the DCC method depends on the code length.
5.3.3 Bit scalability and larger classes
Table 4 shows comparative results in terms of computation time and performance over a wide range of code lengths on the CIFAR-10 dataset; 1,000 test samples and 1,000 anchor points were used. The computation time of FSDH is almost independent of the code length, because the main computation in FSDH is the matrix multiplication and inversion of (6). In practice, the inverse matrix was not computed directly; a Cholesky decomposition was performed instead. In contrast, the computation time of SDH increased rapidly with the code length, and its precision decreased significantly. This means that the DCC method fell into local minima at large code lengths.
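A sketch of that solve (our own code; the small diagonal jitter is our assumption for numerical stability, not something the paper specifies):

```matlab
% Solving (6) as P = B*X' * inv(X*X') without forming the inverse explicitly.
G = X * X';                            % m-by-m Gram matrix
R = chol(G + 1e-6 * eye(size(G, 1)));  % Cholesky factor: G = R'*R (R upper)
P = (R \ (R' \ (X * B')))';            % two triangular solves, then transpose
```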
In general, long codes are useful for a large number of classes. Table 5 shows the results for a larger number of classes on the SUN dataset, using longer codes, more classes, and more training samples. FSDH achieves high precision, high MAP, and a lower computation time compared with SDH, and its solution can still be obtained in a realistic time; SDH was not able to finish after three days of computation in our computational environment. In the experiments, we found that a large number of anchor points can improve performance; however, this requires more computation. Therefore, compared with SDH, FSDH can use a large number of anchor points within a realistic computation time. For reference, we also list the results of fast supervised hashing (FSH), LSVMb, and Top-RSBC+SGD, reprinted from [20], [2], and [32], respectively. Although FSDH outperforms these methods, note that they use different computational environments, feature vectors, and code lengths.
6 Conclusion
In this paper, we simplified the SDH model to the FSDH model by approximating the bias term, and provided exact solutions for the proposed FSDH model. The FSDH approximation was validated by comparative experiments with the SDH model. FSDH is easy to implement and outperformed several state-of-the-art supervised hashing methods. In particular, in the case of large code lengths, FSDH maintains its performance without losing precision. In future work, we intend to apply this idea to other hashing models.
Appendix
Appendix A Proof of Lemma 4.2
The optimization problem in (15) is known as the resource allocation problem [1, 5, 13]. Here we present a simple proof for the solution.
The constraint $\sum_{i=1}^{N} x_i = S$ can be regarded as a surface (hyperplane) equation in the $N$-dimensional space of $\mathbf{x} = (x_1, \ldots, x_N)^\top$, with normal vector $\mathbf{n} = (1, \ldots, 1)^\top$. On the other hand, the gradient vector of the objective function $F(\mathbf{x}) = \sum_i f(x_i)$ is defined as

$\nabla F(\mathbf{x}) = (f'(x_1), \ldots, f'(x_N))^\top$,  (23)

where $f'$ is the derivative of $f$; the $i$-th element (the gradient in the $i$-th direction) is given by $f'(x_i)$, because $\partial x_j / \partial x_i$ becomes 1 if $i = j$ or 0 if $i \neq j$.

Then, the gradient along the surface is obtained as the projection of $\nabla F$ onto the surface, computed through the inner products of $\nabla F$ with a set of vectors $\mathbf{t}$ perpendicular to the normal vector of the surface:

$\nabla F(\mathbf{x})^\top \mathbf{t}, \quad \mathbf{t}^\top \mathbf{n} = 0$,  (24)

and the projected gradient becomes 0 at the global extremum point on the surface. This also indicates that $\nabla F$ and $\mathbf{n}$ are completely parallel there, and their inner product becomes

$\nabla F(\mathbf{x})^\top \mathbf{n} = \|\nabla F(\mathbf{x})\| \, \|\mathbf{n}\|$.  (25)

Substituting Eqs. (23) and (24) into (25), we get

$\sum_{i=1}^{N} f'(x_i) = \sqrt{N} \sqrt{\sum_{i=1}^{N} f'(x_i)^2}$.  (26)

Additionally, when we express $f'(x_i)$ as $u_i$ and square both sides, we get

$\Big(\frac{1}{N}\sum_{i=1}^{N} u_i\Big)^2 = \frac{1}{N}\sum_{i=1}^{N} u_i^2$.  (27)

The shape of this equality actually corresponds to Jensen's inequality $\varphi\big(\frac{1}{N}\sum_i u_i\big) \leq \frac{1}{N}\sum_i \varphi(u_i)$ with the convex function $\varphi(u) = u^2$, and the equality holds if and only if the $u_i$, i.e., the $f'(x_i)$, are all equal:

$f'(x_1) = f'(x_2) = \cdots = f'(x_N)$.  (28)

Additionally, when $f$ is a convex function, $f'$ becomes an injective function because $f'$ is a monotonically increasing function. (Also, if the sign of $f''$ does not change within the valid range of $x$, the case considered in this paper, $f'$ becomes injective.) Hence,

$x_1 = x_2 = \cdots = x_N$.  (29)

Finally, substituting (29) into the condition $\sum_{i=1}^{N} x_i = S$, we get

$x_i = \frac{S}{N}$.  (30)
Appendix B Complete data of the experimental results
B.1 Recall, precision, and MAP
Tables 6-8 show the recall, precision, and MAP for all datasets. These were computed by calculating the Hamming distances between the training samples and the test samples and using a Hamming radius of 2.
Table 6: CIFAR-10: precision, recall, and MAP for code lengths $L$ = 16 to 128.

Method      Precision                           Recall                              MAP
            16     32     64     96     128     16     32     64     96     128     16     32     64     96     128
FSDH (…)    0.488  0.506  0.501  0.505  0.501   0.256  0.167  0.097  0.112  0.097   0.429  0.429  0.429  0.419  0.429
FSDH (…)    0.559  0.573  0.566  0.564  0.566   0.328  0.230  0.148  0.161  0.148   0.512  0.512  0.512  0.494  0.512
FSDH (…)    0.589  0.599  0.593  0.585  0.593   0.363  0.262  0.175  0.185  0.175   0.553  0.553  0.553  0.533  0.553
SDH (…)     0.456  0.517  0.427  0.339  0.276   0.306  0.147  0.095  0.077  0.068   0.409  0.436  0.457  0.464  0.470
SDH (…)     0.447  0.511  0.453  0.346  0.292   0.298  0.154  0.099  0.082  0.074   0.399  0.440  0.445  0.459  0.468
SDH (…)     0.511  0.584  0.483  0.403  0.346   0.354  0.195  0.147  0.121  0.106   0.471  0.520  0.529  0.542  0.548
COSDISH     0.262  0.120  0.061  0.046  0.031   0.251  0.118  0.061  0.046  0.031   0.574  0.615  0.625  0.644  0.654
CCA-ITQ     0.373  0.427  0.352  0.267  0.203   0.139  0.040  0.014  0.006  0.004   0.307  0.329  0.339  0.341  0.344
CCA-LSH     0.332  0.159  0.102  0.146  0.150   0.056  0.143  0.487  0.213  0.204   0.240  0.141  0.101  0.125  0.127
Table 7: SUN10: precision, recall, and MAP for code lengths $L$ = 16 to 128.

Method      Precision                           Recall                              MAP
            16     32     64     96     128     16     32     64     96     128     16     32     64     96     128
FSDH (…)    0.754  0.782  0.804  0.784  0.804   0.539  0.406  0.272  0.303  0.272   0.740  0.740  0.740  0.723  0.740
FSDH (…)    0.865  0.857  0.819  0.837  0.819   0.781  0.724  0.653  0.666  0.653   0.878  0.878  0.878  0.866  0.878
FSDH (…)    0.842  0.819  0.789  0.794  0.789   0.842  0.819  0.789  0.794  0.789   0.899  0.899  0.899  0.889  0.899
SDH (…)     0.682  0.788  0.778  0.774  0.770   0.527  0.370  0.263  0.234  0.209   0.674  0.722  0.739  0.748  0.760
SDH (…)     0.690  0.781  0.780  0.771  0.770   0.544  0.360  0.279  0.229  0.215   0.684  0.717  0.749  0.752  0.756
SDH (…)     0.829  0.831  0.807  0.791  0.785   0.733  0.624  0.540  0.495  0.480   0.837  0.840  0.857  0.859  0.866
COSDISH     0.425  0.229  0.173  0.141  0.111   0.406  0.227  0.170  0.136  0.108   0.682  0.724  0.744  0.760  0.767
CCA-ITQ     0.524  0.699  0.744  0.746  0.744   0.153  0.044  0.017  0.012  0.009   0.411  0.454  0.472  0.479  0.484
CCA-LSH     0.486  0.148  0.161  0.175  0.171   0.087  0.271  0.161  0.061  0.043   0.345  0.133  0.133  0.162  0.144
Table 8: MNIST: precision, recall, and MAP for code lengths $L$ = 16 to 128.

Method      Precision                           Recall                              MAP
            16     32     64     96     128     16     32     64     96     128     16     32     64     96     128
FSDH (…)    0.922  0.929  0.916  0.928  0.916   0.833  0.776  0.684  0.704  0.684   0.929  0.929  0.929  0.928  0.929
FSDH (…)    0.951  0.953  0.933  0.945  0.933   0.900  0.857  0.788  0.789  0.788   0.958  0.958  0.958  0.958  0.958
FSDH (…)    0.964  0.964  0.942  0.950  0.942   0.931  0.891  0.825  0.841  0.825   0.969  0.969  0.969  0.968  0.969
SDH (…)     0.896  0.916  0.862  0.825  0.802   0.809  0.736  0.677  0.645  0.627   0.898  0.921  0.926  0.926  0.930
SDH (…)     0.888  0.906  0.878  0.822  0.809   0.840  0.736  0.686  0.651  0.636   0.898  0.911  0.920  0.930  0.933
SDH (…)     0.909  0.930  0.889  0.853  0.843   0.839  0.798  0.745  0.714  0.705   0.911  0.937  0.947  0.947  0.948
COSDISH     0.640  0.488  0.395  0.376  0.344   0.626  0.488  0.395  0.375  0.343   0.818  0.844  0.860  0.865  0.863
CCA-ITQ     0.782  0.824  0.686  0.574  0.491   0.449  0.294  0.202  0.165  0.151   0.710  0.745  0.761  0.766  0.773
CCA-LSH     0.723  0.142  0.190  0.197  0.197   0.205  0.349  0.146  0.089  0.068   0.562  0.124  0.148  0.155  0.152
B.2 ROC curves
Figure 6 shows the precision-recall curves based on Hamming distance ranking for all datasets.
Appendix C Loss comparison
We define the classification loss and the bias loss of the SDH model in (4) as follows:

$E_c = \|Y - W^\top B\|^2 + \lambda \|W\|^2, \qquad E_b = \|B - PX\|^2$.  (31)

Tables 9-11 show each loss of SDH and FSDH after optimization for the CIFAR-10, SUN10, and MNIST datasets. As described in Sec. 4.1, FSDH minimizes the classification loss $E_c$ exactly; therefore, for all datasets, FSDH results in a lower $E_c$ than SDH. Furthermore, as described in Sec. 4.3, FSDH can also reduce the bias loss $E_b$. For CIFAR-10, SDH results in a lower $E_b$ than FSDH; for SUN10 and MNIST, FSDH results in a lower $E_b$ than SDH.
Table 9: CIFAR-10: losses after optimization.

       SDH                    FSDH
L      $E_c$     $E_b$        $E_c$     $E_b$
16     0.0044    729.0250     0.0026    716.5170
32     0.0017    998.6710     0.0013    1013.3100
64     0.0008    1425.4700    0.0006    1433.0300
128    0.0006    1980.5200    0.0003    2026.6200
Table 10: SUN10: losses after optimization.

       SDH                   FSDH
L      $E_c$     $E_b$       $E_c$     $E_b$
16     0.0124    160.7050    0.0088    151.6680
32     0.0054    224.2900    0.0044    214.4910
64     0.0028    316.3380    0.0022    303.3370
128    0.0068    448.1170    0.0011    428.9830
Table 11: MNIST: losses after optimization.

       SDH                   FSDH
L      $E_c$     $E_b$       $E_c$     $E_b$
16     0.0058    271.8340    0.0036    268.3930
32     0.0037    386.6100    0.0018    379.5650
64     0.0011    569.8690    0.0009    536.7860
128    0.0008    784.7970    0.0005    759.1300
References
 [1] R. E. Bellman and S. E. Dreyfus. Applied dynamic programming. Princeton University Press, 1962.
 [2] F. Cakir and S. Sclaroff. Supervised hashing with error correcting codes. In ACM MM, pages 785–788, 2014.
 [3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: binary robust independent elementary features. In ECCV, pages 778–792, 2010.
 [4] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In NIPS, pages 283–290, 2005.
 [5] S. E. Dreyfus and A. M. Law. The art and theory of dynamic programming. Academic Press, 1977.
 [6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Int. Conf. Very Large Data Bases (VLDB), pages 518–529, 1999.
 [7] S. Gog and R. Venturini. Fast and compact hamming distance index. In Int. ACM SIGIR Conf. Research & Devel. Info. Retriev., pages 285–294, 2016.
 [8] G. H. Golub and C. F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins Univ. Press, 1996.
 [9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a procrustean approach to learning binary codes for largescale image retrieval. IEEE T. PAMI, 35(12):2916–2929, 2013.
 [10] J. Hadamard. Résolution d'une question relative aux déterminants. Bulletin Sci. Math., 17:240–246, 1893.
 [11] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon. Spherical hashing: Binary code embedding with hyperspheres. IEEE T. PAMI, 37(11):2304–2316, 2015.
 [12] H. Hotelling. Relations between two sets of variables. Biometrika, pages 312–377, 1936.
 [13] T. Ibaraki and N. Katoh. Resource allocation problems: algorithms approaches. The MIT Press, 1988.
 [14] W. C. Kang, W. J. Li, and Z. H. Zhou. Column sampling based discrete supervised hashing. In AAAI, pages 1230–1236, 2016.
 [15] G. Koutaki. Binary continuous image decomposition for multiview display. ACM TOG, 35(4):69:1–69:12, 2016.
 [16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Univ. Toronto, 2009.
 [17] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
 [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proc. IEEE, pages 2278–2324, 1998.
 [19] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick. Learning hash functions using column generation. In ICML, 2013.
 [20] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for highdimensional data. In CVPR, pages 1971–1978, 2014.
 [21] W. Liu, C. Mu, S. Kumar, and S.F. Chang. Discrete graph hashing. In NIPS, pages 3419–3427, 2014.
 [22] W. Liu, J. Wang, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
 [23] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
 [24] V. A. Nguyen, J. Lu, and M. N. Do. Supervised discriminative hashing for compact binary codes. In ACM MM, pages 989–992, 2014.
 [25] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
 [26] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, Inc., 1986.
 [27] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. Tao Shen. Learning binary codes for maximum inner product search. In ICCV, pages 4148–4156, 2015.
 [28] F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
 [29] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao. A fast optimization method for general binary code learning. IEEE T. Image Process. (TIP), 25(12):5610–5621, 2016.
 [30] X. Shi, F. Xing, J. Cai, Z. Zhang, Y. Xie, and L. Yang. Kernelbased supervised discrete hashing for image retrieval. In ECCV, pages 419–433, 2016.
 [31] M. Slawski, M. Hein, and P. Lutsik. Matrix factorization with binary components. In NIPS, pages 3210–3218, 2013.
 [32] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith. Top rank supervised binary coding for visual search. In ICCV, pages 1922–1930, 2015.
 [33] J. Sylvester. Thoughts on inverse orthogonal matrices, simultaneous sign successions, and tessellated pavements in two or more colours, with applications to Newton’s rule, ornamental tilework, and the theory of numbers. Philos. Magazine, 34:461–475, 1867.
 [34] X. Wang, T. Zhang, G.J. Qi, J. Tang, and J. Wang. Supervised quantization for similarity search. In CVPR, pages 2018–2026, 2016.
 [35] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
 [36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: largescale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.