Fast Supervised Discrete Hashing and its Analysis
In this paper, we propose a learning-based supervised discrete hashing method. Binary hashing is widely used for large-scale image retrieval as well as video and document searches because the compact representation of binary code is essential for data storage and reasonable for query searches using bit-operations. The recently proposed Supervised Discrete Hashing (SDH) efficiently solves mixed-integer programming problems by alternating optimization and the Discrete Cyclic Coordinate descent (DCC) method. We show that the SDH model can be simplified without performance degradation based on some preliminary experiments; we call the approximate model for this the “Fast SDH” (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution for it. In contrast to SDH, our model does not require an alternating optimization algorithm and does not depend on initial values. FSDH is also easier to implement than Iterative Quantization (ITQ). Experimental results involving a large-scale database showed that FSDH outperforms conventional SDH in terms of precision, recall, and computation time.
Binary hashing is an important technique for computer vision, machine learning, and large-scale image/video/document retrieval [6, 9, 17, 19, 24, 27, 28]. Through binary hashing, multi-dimensional feature vectors with integers or floating-point elements are transformed into short binary codes. This compact representation matters because large-scale databases occupy large amounts of storage. Furthermore, it is easy to compare a query in binary code with a binary code in a database because the Hamming distance between them can be computed efficiently by using bitwise operations that are part of the instruction set of any modern CPU [3, 7].
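As a minimal illustration of why binary codes are cheap to compare, the following Python sketch computes a Hamming distance with one XOR and a bit count. The helper names `pack_bits` and `hamming_distance` are illustrative and do not come from the paper; codes are assumed to be packed into plain integers.

```python
def pack_bits(bits):
    """Pack a sequence of +1/-1 (or 1/0) values into an int; positive -> bit 1."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit > 0 else 0)
    return value


def hamming_distance(a, b):
    """Number of differing bits between two codes packed into ints."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    query = pack_bits([1, -1, 1, 1])
    item = pack_bits([-1, -1, 1, -1])
    print(hamming_distance(query, item))
```

On hardware, the same two steps map to an XOR instruction followed by a population count.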
Many binary hashing methods have been proposed. Locality-sensitive hashing (LSH) [6] is one of the most popular methods. In LSH, binary codes are generated by projecting the data with a random projection matrix and thresholding the result with the sign function. Iterative quantization (ITQ) [9] is another state-of-the-art binary hashing method. In ITQ, the projection matrix of the hash function is optimized by iterating projection and thresholding procedures on the given training samples.
Binary hashing can be roughly classified into two types: unsupervised hashing [17, 22, 21, 27, 11, 35] and supervised hashing. Supervised hashing exploits label information when it is available. In general, supervised hashing yields better performance than unsupervised hashing, so in this study, we target supervised hashing. In addition, some unsupervised methods such as LSH and ITQ can be converted into supervised methods by imposing label information on feature vectors. For example, canonical correlation analysis (CCA) [12] can transform feature vectors to maximize inter-class variation and minimize intra-class variation according to label information. Hereafter, we call these processes CCA-LSH and CCA-ITQ, respectively.
Methods that impose label information directly on the hash functions, rather than on the feature vectors as in CCA, have also been proposed. Kernel-based supervised hashing (KSH) [23] uses spectral relaxation to optimize the cost function through a sign function. Feature vectors are transformed by kernels during preprocessing. KSH has also been extended to kernel-based supervised discrete hashing (KSDH) [30], which relaxes the discrete hashing problem through linear relaxation. Supervised Discriminative Hashing [24] decomposes training samples into inter and intra samples. Column sampling-based discrete supervised hashing (COSDISH) [14] uses column sampling based on semantic similarity, and decomposes the problem into sub-problems to simplify the solution.
The optimization of binary codes leads to a mixed-integer programming problem involving integer and non-integer variables, which is an NP-hard problem in general. Therefore, many methods discard the discrete constraints, or transform the problem into a relaxed problem, i.e., a linear programming problem. This relaxation significantly simplifies the problem, but is known to affect classification performance.
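The LSH construction described above (random projection followed by a sign threshold) can be sketched in a few lines of Python. This is only an illustration under assumed conventions: Gaussian projection entries, codes in {-1, +1}, and the hypothetical helper names `make_projection` and `lsh_code`.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Draw an n_bits x dim matrix with i.i.d. standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(x, projection):
    """Project x onto each random direction and keep only the sign."""
    code = []
    for row in projection:
        dot = sum(w * v for w, v in zip(row, x))
        code.append(1 if dot >= 0 else -1)
    return code


if __name__ == "__main__":
    proj = make_projection(dim=4, n_bits=8, seed=1)
    print(lsh_code([0.5, -1.0, 2.0, 0.25], proj))
```

Because only the sign is kept, rescaling a feature vector by a positive constant leaves its code unchanged.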
Recent research has introduced a type of supervised discrete hashing (SDH) [28, 34] that directly learns binary codes without relaxation. SDH is a state-of-the-art method because of its ease of implementation, reasonable computation time for learning, and better performance over other state-of-the-art supervised hashing methods. To solve discrete problems, SDH uses a discrete cyclic coordinate descent (DCC) method, which is an approximate solver of 0-1 quadratic integer programming problems.
1.1 Contributions and advantages
In this study, we first analyze the SDH model and point out that it can be simplified without performance degradation based on some preliminary experiments. We call the approximate model the fast SDH (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution to it. The model simplification is validated through experiments involving several large-scale datasets.
The advantages of the proposed method are as follows:
Unlike SDH, it does not require alternating optimization or hyper-parameters, and is not initial value-dependent.
It is easier to implement than ITQ and is efficient in terms of computation time. FSDH can be implemented in three lines of MATLAB.
High bit scalability: its learning time and performance do not depend on the code length.
It has better precision and recall than other state-of-the-art supervised hashing methods.
1.2 Related work
As described subsequently, the SDH model poses a matrix factorization problem: . The popular form of this problem is singular value decomposition (SVD) , and when and are unconstrained, the Householder method is used for computation. When , non-negative matrix factorization (NMF) is used .
In the case of the SDH model, is constrained to and is unconstrained. In a similar problem setting, Slawski et al. proposed matrix factorization with binary components  and showed an application to DNA analysis for cancer research. is constrained to , and indicates Unmethylated/Methylated DNA sequences. Furthermore, a similar model has been proposed in display electronics. Koutaki proposed binary continuous decomposition for multi-view displays . In this model, multiple images are decomposed into binary images and a weight matrix . An image projector projects binary 0-1 patterns through digital mirror devices (DMDs), and the weight matrix corresponds to the transmittance of the LCD shutter.
2 Supervised Discrete Hashing (SDH) Model
In this section, we introduce the supervised discrete hashing (SDH) model. Let be a feature vector, and introduce a set of training samples . Then, consider binary label information corresponding to , where is the number of categories to classify. Setting the -th element to 1, , and the other elements to 0 indicates that the -th vector belongs to class . By concatenating samples of horizontally, a label matrix is constructed.
2.1 Binary code assignment to each sample
For each sample , an -bit binary code is assigned. By concatenating samples of horizontally, a binary matrix is constructed. The binary code is computed as
where (therefore ) is a linear transformation matrix and is the sign function. The major aim of SDH is to determine the matrix from training samples . In practice, feature vectors are transformed by preprocessing. Therefore, we denote the original feature vectors and the transformed feature vectors .
2.2 Preprocessing: Kernel transformation
The original feature vectors of training samples are converted into the feature vectors using the following kernel transformation :
where is an anchor vector obtained by randomly sampling the original feature vectors, . Then, the transformed feature vectors are bundled into the matrix form .
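The kernel transformation maps each original feature vector to its similarities with the anchor vectors. The exact kernel expression is not legible in the text here, so the sketch below assumes a Gaussian RBF kernel of the form exp(-||x - a_j||^2 / sigma), a common choice in kernel-based hashing. The function name `rbf_features` and the parameter name `sigma` are illustrative assumptions.

```python
import math


def rbf_features(x, anchors, sigma):
    """Map x to one Gaussian similarity per anchor (assumed kernel form)."""
    features = []
    for a in anchors:
        sq_dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
        features.append(math.exp(-sq_dist / sigma))
    return features


if __name__ == "__main__":
    anchors = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    print(rbf_features([1.0, 0.0], anchors, sigma=0.5))
```

With this form, the dimension of the transformed feature vector equals the number of anchors, which is why more anchor points increase the cost of the later least-squares step.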
2.3 Classification model
Following binary coding by (1), we suppose that a good binary code classifies the class, and formulate the following simple linear classification model:
where is a weight matrix and is an estimated label vector. As mentioned above, its maximum index, , indicates the assigned class of .
2.4 Optimization of SDH
The SDH problem is defined as the following minimization problem:
where is the Frobenius norm, and and are balance parameters. The first term includes the classification model explained in Sec. 2.3. The second term is a regularizer for to avoid overfitting. The third term indicates the fitting errors due to binary coding.
In this optimization, it is sufficient to compute , i.e., if is obtained, can be obtained by (1), and can be obtained from the following simple least squares equation:
However, due to the difficulty of optimization, the optimization problem of (4) is usually divided into three sub-problems of the optimization of , and . Thus, the following alternating optimization is performed:
(i) Initialization: is initialized, usually randomly.
(ii) F-Step: is computed by the following simple least squares method:
(iii) W-Step: is computed by (5).
(iv) B-Step: After fixing and , equation (4) becomes:
Note that . The trace can be rewritten as
where is the -th column vector of . are actually independent of one another. Therefore, it reduces to the following 0-1 integer quadratic programming problem for each -th sample:
Iterate steps (ii) through (iv) until convergence.
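For reference, the two least-squares steps have standard closed forms. The symbols below are assumptions, since the notation is not recoverable from the text here: $B \in \{-1,1\}^{L \times N}$ holds the codes, $Y \in \{0,1\}^{C \times N}$ the labels, $X \in \mathbb{R}^{M \times N}$ the transformed features, $W \in \mathbb{R}^{L \times C}$ the classifier weights, $P \in \mathbb{R}^{M \times L}$ the projection, and $\lambda$ the regularization weight. Setting the gradients of $\|Y - W^\top B\|^2 + \lambda\|W\|^2$ and of $\|B - P^\top X\|^2$ to zero gives:

```latex
\begin{aligned}
  -2B\,(Y - W^\top B)^\top + 2\lambda W = 0
    \;&\Longrightarrow\; W = (BB^\top + \lambda I)^{-1} B\, Y^\top, \\
  -2X\,(B - P^\top X)^\top = 0
    \;&\Longrightarrow\; P = (XX^\top)^{-1} X B^\top
    \quad (\text{assuming } XX^\top \text{ is invertible}).
\end{aligned}
```

Both are ordinary (ridge) regression solutions, so each step is exact once the other variables are fixed; the difficulty of the alternating scheme lies entirely in the discrete B-Step.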
3 Discussion of the SDH Model
3.1 0-1 integer quadratic programming problem
To solve (10), SDH uses a discrete cyclic coordinate descent (DCC) method. In this method, a one-bit element of is optimized while fixing the other bits; the -th bit is optimized as
Then, all bits are optimized, and this procedure is repeated several times. In addition, the DCC method is prone to result in a local minimum because of its greediness. To improve it, Shen et al. proposed using a proximal operation of convex optimization .
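The one-bit update can be sketched for a generic problem of the form: minimize b^T A b - 2 q^T b over b in {-1, +1}^L, with A symmetric. Treating A and q as the quadratic and linear terms of (10) is an assumption, and the names `objective` and `dcc` are illustrative; this is not the SDH reference implementation.

```python
def objective(A, q, b):
    """Value of b^T A b - 2 q^T b for a +/-1 vector b."""
    n = len(b)
    quad = sum(b[i] * A[i][j] * b[j] for i in range(n) for j in range(n))
    lin = sum(q[i] * b[i] for i in range(n))
    return quad - 2 * lin


def dcc(A, q, b0, sweeps=5):
    """Cyclic one-bit updates; A is assumed symmetric."""
    b = list(b0)
    n = len(b)
    for _ in range(sweeps):
        changed = False
        for l in range(n):
            # With the other bits fixed, the cost depends on b[l] only through
            # 2 * b[l] * s, where s = sum_{k != l} A[l][k] * b[k] - q[l].
            s = sum(A[l][k] * b[k] for k in range(n) if k != l) - q[l]
            new_bit = b[l] if s == 0 else (-1 if s > 0 else 1)
            if new_bit != b[l]:
                b[l] = new_bit
                changed = True
        if not changed:
            break  # no single-bit flip improves the cost any further
    return b


if __name__ == "__main__":
    A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
    q = [1, -1, 2]
    b = dcc(A, q, [1, 1, 1])
    print(b, objective(A, q, b))
```

Each update can only lower (or keep) the cost, so the procedure stops at a point where no single flip helps. That point need not be the global minimum, which is the greediness issue noted above.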
In the case of a large number of bits , solving (10) exactly is difficult because this problem is NP-hard. However, there exist a few efficient methods to solve the 0-1 integer quadratic programming problem. In , Koutaki used a branch-and-bound method to solve the problem. is expanded into a binary tree of depth , and the problem of (10) is divided into a sub-problem by splitting . At each node, the lower bound is computed and compared with the given best solution; child nodes can be excluded from the search.
The computation of the lower bound depends on the structure of , and . To compute the lower bound in general, the linear relaxation method is a standard method, . In this case, the rough lower bound of the quadratic term in (10) can be provided by the minimum eigenvalues of . However, linear relaxation is useless in the SDH model because in general, so the matrix is rank deficient and, as a result, the minimum eigenvalue of becomes zero.
Even if we can obtain an efficient algorithm, such as branch-and-bound and good lower bound, in the application of binary hashing, we still suffer from computational difficulties because code lengths , or bits are still too long to optimize, and they are used frequently.
3.2 Alternating optimization and initial value dependence
Even if we optimize the binary optimization in (10), the resulting binary codes are not always optimal ones because they depend on the other fixed variables and . In addition, alternating optimization is prone to cause a serious problem: a solution depends on the initial values, and may fall in a local minimum during the iterations, even if each step of F-Step, W-Step and B-Step provides the optimal solution.
Figure 1 shows an example of the optimization result for a simple version of the SDH model in (4) with a small number of bits . In this case, an exact solution is known and its minimum value is (green line in Fig. 1). DCC (red lines) provides results for 10 randomized initial conditions. The full search (blue lines) provides the results of an exact full search in B-Step, where nodes are searched.
In spite of the small size of the problem, the cost function of conventional alternating solvers (DCC and full search) cannot find the exact value, and depends on initial values. Interestingly, the results of full search immediately fall into a local minimum, and are worse than those of DCC.
4 Proposed Fast SDH Model
We introduce a new hashing model by approximating the SDH model, which utilizes the following assumptions:
The number of bits of the binary code is a power of 2: .
The number of bits is greater than the number of classes: .
Note that assumptions A1 to A3 also become the limitations of the proposed model. In A4, SDH recommends that the parameter be set to a very small value, such as . In practice, and in the CIFAR-10 dataset. Furthermore, when , almost the same results can be obtained in all datasets as shown in the experimental results in Sec. 5. We call this approximation using the “fast SDH (FSDH) approximation.”
Using the FSDH approximation, we solve the following problem for each -sample in B-Step:
where is a constant matrix and depends on label . By using the single-label assumption in A3, the number of kinds of is limited to :
Thus, it is sufficient to solve only integer quadratic programming problems of (12) from . In general, the number of samples is larger than that of classes: , e.g., and . Thereby, the computational cost of B-Step becomes times lower. In other words, the FSDH approximation proposes the following:
The FSDH approximation defines the SDH model to assign a binary code to each class.
After obtaining the binary codes of each class , the binary codes of all samples can be constructed by lining up as
After constructing , the projection matrix can be obtained by (6).
4.1 Analytical solutions of FSDH model
From Proposition 4.1, we found that it is sufficient to determine the binary code for each class. Furthermore, we can choose the optimal binary codes under the FSDH approximation as follows:
If is convex, the solution of
is given by the mean value .
See Appendix A.
An analytical solution of FSDH is obtained as a Hadamard matrix.
The objective is $\|I - W^\top B'\|^2 + \lambda\|W\|^2$, where $I$ is an identity matrix. Using the solution of (4.1), i.e., , and the eigen-decomposition of , we denote the eigenvalues as and then get as the trace of diagonal values. Then, equation (4.1) can be represented simply as
By Lemma 4.2, . This implies that is an orthogonal matrix with binary elements ; in other words, can be given by a submatrix of the Hadamard matrix .
The following characteristics can be obtained easily:
is independent of regularization parameter (-invariant).
The optimal weight matrix of FSDH is given by the version of the scaled binary matrix .
The minimum value of (4.1) is given by .
In short, we can eliminate the W-Step, the alternating procedure, and the initial value dependence. An exact solution of the FSDH model can be obtained independent of the hyper-parameters and .
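The characteristics listed above can be checked with a short worked derivation. It assumes the objective $\|I - W^\top B'\|^2 + \lambda\|W\|^2$ with $B' \in \{-1,1\}^{L \times C}$ whose columns are mutually orthogonal, i.e. $B'^\top B' = L\,I_C$ (as for columns of a Hadamard matrix); the symbols $L$ (code length) and $C$ (number of classes) are assumed notation.

```latex
\begin{aligned}
  W &= (B'B'^\top + \lambda I_L)^{-1} B'
     = B'\,(B'^\top B' + \lambda I_C)^{-1}
     = \frac{1}{L + \lambda}\, B', \\
  W^\top B' &= \frac{L}{L + \lambda}\, I_C, \\
  \|I_C - W^\top B'\|^2 &= \frac{C\,\lambda^2}{(L + \lambda)^2}, \qquad
  \lambda \|W\|^2 = \frac{\lambda\, L\, C}{(L + \lambda)^2}, \\
  \|I_C - W^\top B'\|^2 + \lambda \|W\|^2
    &= \frac{C\,\lambda\,(\lambda + L)}{(L + \lambda)^2}
     = \frac{\lambda\, C}{L + \lambda}.
\end{aligned}
```

Under these assumptions the weight matrix is a scaled copy of the binary matrix, and the choice of $B'$ itself does not involve $\lambda$, which is consistent with the characteristics stated above.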
4.2 Implementation of FSDH
Algorithm 1 and Figure 2, respectively, show the algorithm of FSDH and sample MATLAB code, which is simple and easy to implement. Figure 3 shows an example of and . A Hadamard matrix of size can be constructed recursively by Sylvester’s method  as
Furthermore, Hadamard matrices of orders 12 and 20 were constructed by Hadamard transformation . Fortunately, in applications of binary hashing, since , and bits are used frequently, Sylvester’s method suffices in most cases.
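Sylvester's recursive construction and the per-class code assignment can be sketched as follows. This is an illustrative Python version, not the MATLAB code referred to in Figure 2. Taking the first columns of the Hadamard matrix as class codes is an assumption, and the names `sylvester_hadamard`, `class_codes`, and `assign_codes` are hypothetical.

```python
def sylvester_hadamard(n):
    """Return the n x n Sylvester Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = [[1]]
    while len(H) < n:
        # H_{2k} = [[H_k, H_k], [H_k, -H_k]]
        top = [row + row for row in H]
        bottom = [row + [-v for v in row] for row in H]
        H = top + bottom
    return H


def class_codes(n_bits, n_classes):
    """Use the first n_classes columns of H as class codes (assumed choice)."""
    if n_classes > n_bits:
        raise ValueError("need n_bits >= n_classes")
    H = sylvester_hadamard(n_bits)
    return [[H[r][k] for r in range(n_bits)] for k in range(n_classes)]


def assign_codes(codes, labels):
    """Give every sample the code of its class; labels are class indices."""
    return [codes[k] for k in labels]


if __name__ == "__main__":
    codes = class_codes(n_bits=16, n_classes=10)
    B_columns = assign_codes(codes, [0, 3, 3, 9])
    print(B_columns[1])
```

Any two distinct class codes obtained this way differ in exactly half of their bits, so classes are separated by a Hamming distance of n_bits / 2.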
4.3 Analysis of bias term of FSDH
We have already shown that obtained from the Hadamard matrix minimizes two terms: . Furthermore, we pay attention to how affects the bias term . In this subsection, we continue to analyze its behavior. We suppose that samples are sorted by label . Let be the bias term:
where is a projection matrix. Therefore, to reduce the bias term, it is better that has a large value. Then, using , we can rewrite it as
where is a block-diagonal matrix
are matrices with all elements equal to 1, and is the number of samples with label . Using these values, in (20) can be expressed as
where with the same label are summed up. Since the definition of is , can be regarded as the normalized correlation of and . Since samples with the same label must represent a similar feature vector, is assumed to be a large value.
Figure 4 shows visualizations of matrices and for SDH and FSDH. High-correlation areas of are partitioned by each class block. of SDH includes a “negative” block in the non-diagonal components, and reduces . On the other hand, the proposed FSDH shows clear blocks; the diagonal blocks take the value and the non-diagonal blocks 0.
We tested the proposed method on three large-scale image datasets: CIFAR-10 [16] (https://www.cs.toronto.edu/~kriz/cifar.html), SUN-397 [36] (http://groups.csail.mit.edu/vision/SUN/), and MNIST [18] (http://yann.lecun.com/exdb/mnist/). The feature vectors of all datasets were normalized. The multi-labeled NUS-WIDE dataset was not included because the proposed method applies only to single-label problems.
CIFAR-10 includes labeled subsets of 60,000 images. In this test, we used 512-dimensional GIST features [25] extracted from the images. training samples and 1,000 test samples were used for evaluation. The number of classes was , and included “airplane”, “automobile”, “bird”, etc.
SUN-397 is a large-scale image dataset for scene recognition with 397 categories, and consists of 108,754 labeled images. We extracted 10 categories with and training samples. A total of 500 training samples per class and 1,000 test samples were used. We used 512-dimensional GIST features extracted from the images. Since we used , we called the dataset “SUN-10” in this study.
MNIST includes an image dataset of handwritten digits. The feature vectors we used were given by [pix] of data that were normalized. The number of classes was , i.e., digits. We used training samples and 1,000 test samples for evaluation.
5.2 Comparative methods and settings
The proposed method was compared with four state-of-the-art supervised hashing methods: CCA-ITQ, CCA-LSH, SDH, and COSDISH [14]. Unsupervised or semi-supervised methods were not assessed. All methods were implemented in MATLAB R2012b and tested on an Intel CPU with 32 GB of DDR3 SDRAM.
CCA-ITQ and CCA-LSH: ITQ and LSH are state-of-the-art binary hashing methods. They can be converted into supervised binary hashing methods by pre-processing feature vectors using label information. Canonical correlation analysis (CCA) transformation was performed and feature vectors were normalized and set to zero mean. They generated the projection matrix , and binary codes were assigned by (1).
| FSDH | learning time [s] | 0.64 | 0.70 | 0.85 | 0.98 | 1.16 | 1.48 |
| SDH | learning time [s] | 6.38 | 14.92 | 47.42 | 284.00 | 1189.49 | 5230.43 |
COSDISH is a recently proposed supervised hashing method. COSDISH generates the projection matrix , as does ITQ. In pre-processing, the feature vectors are shifted to zero mean and normalized by their variance. We used open-source MATLAB code (http://cs.nju.edu.cn/lwj/).
SDH is a state-of-the-art supervised hashing method. We used and with the maximum number of iterations set to 5, anchor points () and (), and kernel parameter for all datasets. SDH generated the projection matrix , and binary codes were assigned by re-projection (1). Furthermore, to show the validity of the FSDH approximation, we evaluated the case where (). We used open-source MATLAB code (https://github.com/bd622/DiscretHashing).
FSDH: The proposed method used the same parameters as SDH: anchor points () and (), and kernel parameter for all datasets. Moreover, we used (). FSDH generated the projection matrix and assigned binary codes through re-projection (1), as in SDH. Our code will be made available to the public (https://github.com/goukoutaki/FSDH), and is shown in Fig. 2.
5.3 Results and discussion
Precision and recall were computed by calculating the Hamming distance between the training samples and the test samples with a Hamming radius of 2. Figure 6 shows the results, in terms of precision, recall, and the mean average precision (MAP), of the Hamming ranking for all methods and the three datasets. Code lengths of , and were evaluated.
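The radius-based evaluation can be sketched for a single query as follows. The sketch assumes codes packed into integers, treats database items with the same label as relevant, and returns zero precision when nothing falls within the radius; that last convention and the function name `precision_recall_at_radius` are assumptions, not taken from the paper.

```python
def precision_recall_at_radius(query_code, query_label, db_codes, db_labels, radius=2):
    """Precision and recall of the items within a Hamming ball around the query."""
    retrieved = [
        label
        for code, label in zip(db_codes, db_labels)
        if bin(code ^ query_code).count("1") <= radius
    ]
    hits = sum(1 for label in retrieved if label == query_label)
    relevant_total = sum(1 for label in db_labels if label == query_label)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / relevant_total if relevant_total else 0.0
    return precision, recall


if __name__ == "__main__":
    db_codes = [0b0000, 0b0011, 0b0111, 0b1111]
    db_labels = [0, 0, 1, 0]
    print(precision_recall_at_radius(0b0000, 0, db_codes, db_labels))
```

Averaging these per-query values over all test queries gives dataset-level precision and recall figures of the kind reported in the results.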
CIFAR-10: COSDISH shows the best MAP. yielded the best precision and recall. Although COSDISH showed a satisfactory MAP, the precision was low. In SDH and FSDH, increasing the number of anchor points improved the performance. As the code length increases, SDH reduces precision. However, FSDH maintains high precision and recall. This is a significant advantage of the proposed method. In general, by increasing the code length, precision tends to decrease with such a narrow threshold of a Hamming radius of 2.
SUN-10: In this dataset, the results for FSDH were significantly better. In particular, the recall rates of FSDH remained high in spite of long code lengths. When the SDH and FSDH had the same number of anchor points, FSDH was clearly superior. The MAP of COSDISH was comparable to that of SDH; however, the precision and recall of COSDISH were not as good as those of CIFAR-10.
MNIST: FSDH yielded the best results in all datasets with the same trends. It retained high precision and recall even with large values of code length.
The graphs on the right of Fig. 6 show the precision-recall ROC curves based on Hamming ranking. FSDH shows better performance than SDH with the same number of anchor points. In particular, the SUN dataset yielded distinct results compared with the other methods.
5.3.1 Validation of FSDH approximation
5.3.2 Computation time
Table 3 shows the computation time of each method for CIFAR-10. The time for was almost identical to that of CCA-ITQ. As the number of anchors increased, the computational time increased for SDH and FSDH. The computational time for SDH and COSDISH increased with the code length. The number of iterations of the DCC method depended on the code length.
5.3.3 Bit scalability and larger classes
Table 4 shows the comparative results in terms of computational time and performance with a wide range of code lengths for the CIFAR-10 dataset. training samples, 1,000 test samples, and 1,000 anchors were used. The computation time of FSDH was almost constant with respect to code length because the main computation in FSDH involved the matrix multiplication and inversion in (6). In practice, the inverse matrix was not computed directly; Cholesky decomposition was used instead. In contrast, the computation time for SDH increased exponentially and its precision decreased significantly. This means that the DCC method fell into local minima in cases of large code length.
In general, large bits are useful for a large number of classes. Table 5 shows the results of larger classes of the SUN dataset. -bits, classes and training samples are used. FSDH achieves high precision, high MAP and lower computational time compared with SDH. When is used, can be obtained by FSDH. Here, SDH was not able to finish after three days of computation in our computational environment. In the experiments, we found that a large number of anchor points can improve performance. However, it requires more computation. Therefore, FSDH can use a large number of anchor points in a realistic computation time compared with SDH. For reference, we refer to the results of fast supervised hashing (FSH), LSVM-b and Top-RSBC+SGD, which are reprinted from [20, 2, 32]. Although FSDH outperforms those methods, note that those methods use different computational environments, feature vectors and code lengths.
In this paper, we simplified the SDH model to an FSDH model by approximating the bias term, and provided exact solutions for the proposed FSDH model. The FSDH approximation was validated by comparative experiments with the SDH model. FSDH is easy to implement and outperformed several state-of-the-art supervised hashing methods. In particular, in the case of large code lengths, FSDH can maintain performance without losing precision. In future work, we intend to use this idea for other hashing models.
Appendix A Proof of Lemma 4.2
The constraint can be regarded as a surface equation in an N-dimensional space . On the other hand, the gradient vector of the objective function is defined as
where is the differentiated version of and the -th element (the gradient in the -th direction) is given by , and becomes 1 if or 0 if .
Then, the gradient along the surface is obtained as the projection of onto the surface, and computed as the inner-product of and a set of vectors perpendicular to the normal vector of the surface:
and the projected gradient becomes 0 at the global extremum point on the surface. This also indicates and are completely parallel and their inner-product becomes
Additionally, when we express as and as , we get
The shape of this equality actually corresponds to Jensen’s inequality: where , and the equality holds if and only if i.e. are all equal:
Additionally, when is a convex function, becomes an injective function because is a monotonically increasing function. Also, if the sign of does not change within the valid range of (the case considered in this paper), becomes injective. Hence,
Finally, substituting this (29) into the condition , we get
Appendix B Complete data of the experimental results
B.1 Recall, precision and MAP
Tables 6 to 8 show recall, precision and MAP for all datasets when and . These were computed by calculating the Hamming distance between the training samples and the test samples with a Hamming radius of 2.
B.2 ROC curves
Figure 6 shows precision-recall ROC curves based on Hamming distance ranking for all datasets when .
Appendix C Loss comparison
We define -loss and -loss of the SDH model in (4) as follows:
Tables 9 to 11 show each loss of SDH and FSDH after optimization for the CIFAR-10, SUN-10 and MNIST datasets. As described in Sec. 4.1, FSDH can minimize -loss exactly. Therefore, for all datasets, FSDH results in a lower value of -loss than SDH. Furthermore, as described in Sec. 4.3, FSDH can also reduce -loss. For CIFAR-10, SDH results in a lower value of -loss than FSDH. For SUN-10 and MNIST, FSDH results in a lower value of -loss than SDH.
[1] R. E. Bellman and S. E. Dreyfus. Applied dynamic programming. Princeton University Press, 1962.
[2] F. Cakir and S. Sclaroff. Supervised hashing with error correcting codes. In ACM MM, pages 785–788, 2014.
[3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: binary robust independent elementary features. In ECCV, pages 778–792, 2010.
[4] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In NIPS, pages 283–290, 2005.
[5] S. E. Dreyfus and A. M. Law. The art and theory of dynamic programming. Academic Press, 1977.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Int. Conf. Very Large Data Bases (VLDB), pages 518–529, 1999.
[7] S. Gog and R. Venturini. Fast and compact hamming distance index. In Int. ACM SIGIR Conf. Research & Devel. Info. Retriev., pages 285–294, 2016.
[8] G. H. Golub and C. F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins Univ. Press, 1996.
[9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE T. PAMI, 35(12):2916–2929, 2013.
[10] J. Hadamard. Résolution d'une question relative aux déterminants. Bulletin Sci. Math., 17:240–246, 1893.
[11] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon. Spherical hashing: Binary code embedding with hyperspheres. IEEE T. PAMI, 37(11):2304–2316, 2015.
[12] H. Hotelling. Relations between two sets of variables. Biometrika, pages 312–377, 1936.
[13] T. Ibaraki and N. Katoh. Resource allocation problems: algorithmic approaches. The MIT Press, 1988.
[14] W. C. Kang, W. J. Li, and Z. H. Zhou. Column sampling based discrete supervised hashing. In AAAI, pages 1230–1236, 2016.
[15] G. Koutaki. Binary continuous image decomposition for multi-view display. ACM TOG, 35(4):69:1–69:12, 2016.
[16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Univ. Toronto, 2009.
[17] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proc. IEEE, pages 2278–2324, 1998.
[19] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick. Learning hash functions using column generation. In ICML, 2013.
[20] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, pages 1971–1978, 2014.
[21] W. Liu, C. Mu, S. Kumar, and S.-F. Chang. Discrete graph hashing. In NIPS, pages 3419–3427, 2014.
[22] W. Liu, J. Wang, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
[23] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
[24] V. A. Nguyen, J. Lu, and M. N. Do. Supervised discriminative hashing for compact binary codes. In ACM MM, pages 989–992, 2014.
[25] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
[26] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, Inc., 1986.
[27] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. Tao Shen. Learning binary codes for maximum inner product search. In ICCV, pages 4148–4156, 2015.
[28] F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
[29] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao. A fast optimization method for general binary code learning. IEEE T. Image Process. (TIP), 25(12):5610–5621, 2016.
[30] X. Shi, F. Xing, J. Cai, Z. Zhang, Y. Xie, and L. Yang. Kernel-based supervised discrete hashing for image retrieval. In ECCV, pages 419–433, 2016.
[31] M. Slawski, M. Hein, and P. Lutsik. Matrix factorization with binary components. In NIPS, pages 3210–3218, 2013.
[32] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith. Top rank supervised binary coding for visual search. In ICCV, pages 1922–1930, 2015.
[33] J. Sylvester. Thoughts on inverse orthogonal matrices, simultaneous sign successions, and tessellated pavements in two or more colours, with applications to Newton’s rule, ornamental tile-work, and the theory of numbers. Philos. Magazine, 34:461–475, 1867.
[34] X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang. Supervised quantization for similarity search. In CVPR, pages 2018–2026, 2016.
[35] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
[36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.