Supervised Online Hashing via Similarity Distribution Learning
Online hashing has attracted extensive research attention for handling streaming data. Most online hashing methods learn binary codes from the pairwise similarities of training instances; they fail to capture the semantic relationship and generalize poorly in large-scale applications due to large variations. In this paper, we model the similarity distributions between the input data and the hashing codes, and on this basis propose a novel supervised online hashing method, dubbed Similarity Distribution based Online Hashing (SDOH), which keeps the intrinsic semantic relationship in the produced Hamming space. Specifically, we first transform the discrete similarity matrix into a probability matrix via a Gaussian-based normalization to address the extremely imbalanced distribution issue. We then introduce a scaling Student t-distribution to solve the challenging initialization problem and to efficiently bridge the gap between the known and unknown distributions. Lastly, we align the two distributions by minimizing the Kullback-Leibler divergence (KL-divergence) with stochastic gradient descent (SGD), which imposes an intuitive similarity constraint for updating the hashing model on new streaming data while generalizing well to past data. Extensive experiments on three widely-used benchmarks validate the superiority of the proposed SDOH over state-of-the-art methods in the online retrieval task.
1 Introduction
Hashing-based visual search has attracted extensive research attention in recent years due to the rapid growth of visual data on the Internet [7, 33, 8, 26, 12, 13, 30, 32, 25, 35, 27]. Among various scenarios, online hashing has become a hot topic owing to the need to handle streaming data; it aims to resolve an online retrieval task by updating the hash functions from sequentially arriving data instances. On one hand, online hashing inherits the advantages of traditional offline hashing methods, i.e., low storage cost and efficient pairwise distance computation in the Hamming space. On the other hand, it also offers training efficiency and scalability for large-scale applications, since the hash functions are updated instantly and solely based on the current streaming data, which is preferable to traditional hashing methods that train a hashing model entirely from scratch.
Several recent endeavors have been made toward robust and efficient online hashing, including OKH [12], SketchHash [17], AdaptHash [6], OSH [4], FROSH [2], MIHash [5], HCOH [21, 19] and BSODH [20]. Unsupervised online hashing methods, e.g., SketchHash [17] and FROSH [2], consider a sketch of the whole streaming data, which is efficient but less accurate, as the label information is ignored.
Recent advances have focused more on supervised online hashing, which yields better results in practice. As shown in Fig. 1(a), early works such as OKH [12], AdaptHash [6], MIHash [5] and BSODH [20] utilize label information to define the pairwise similarities between training instances, which guide the learning of hash functions. However, these methods generalize poorly. As demonstrated in previous work, only the pairwise relationships of the sequential data at the current stage are considered, which ignores the data variations across stages. As a result, the properties of the past data become less conspicuous as the dataset grows. In OSH [4] and HCOH [21, 19], the label information is instead used to assign a “codeword” from a pre-defined ECOC codebook, as shown in Fig. 1(b), and the hash functions map the to-be-learned hashing codes to the assigned “codeword”. This, however, depends heavily on the quality of the ECOC codebook. Though a recent work considers the Hadamard matrix [11] as the ECOC codebook, it restricts the length of hashing bits to be consistent with the size of the Hadamard matrix.
Despite the extensive progress made, supervised online hashing remains unsolved due to defects in modeling the supervised cues. Existing methods only preserve the information from the current data, and their updates do not take the distributions of previous data into account. We argue that these defects can be compensated for by aligning the distributions between the input data and the hashing space when updating, an idea that has proven informative beyond online hashing, as shown in [22, 9, 23, 10, 34]. Inspired by this, we aim to impose an intuitive constraint on similarity preservation in the Hamming space to capture not only the pairwise similarity at the current stage, but also the semantic relationship among different stages. By doing so, the learning takes into account the information from both the current streaming data and the past data.
In this paper, we propose a novel online hashing method, called Similarity Distribution based Online Hashing (SDOH), which exploits the distribution over different pairwise similarities towards optimal supervised online hashing, as shown in Fig. 1(c). To this end, we first transform the discrete similarity matrix into a probability matrix via a Gaussian-based normalization. Notably, Lin et al. [23] adopted a similar idea, which simply obtains a probability matrix via normalization. However, such a normalization may generate an extremely imbalanced distribution (as illustrated in Fig. 2(a)) when facing a highly sparse pairwise discrete similarity matrix, and the optimization risks losing the information of dissimilar pairs (see Fig. 2(c)). Therefore, we introduce a Gaussian distribution to smooth the imbalanced distribution before normalization, which bridges the gap between similar and dissimilar probabilities (as in Fig. 2(b)). Second, we develop a scaling Student t-distribution to transform pairwise distances in the Hamming space into probabilities (see Fig. 2(f)). Unlike the traditional Student t-distribution, which suffers from poor performance due to the instability of parameter initialization (see Fig. 2(e)), the proposed scaling Student t-distribution not only improves the performance but also accelerates training (see Fig. 2(d)). Lastly, to better approximate the probability, we minimize the KL-divergence between the two introduced distributions to preserve the relationships among different pairwise similarities.
Our main contributions include:
We investigate the online hashing problem by modeling the similarity distribution, instead of only exploiting the pairwise similarities, which generalize poorly. A Gaussian normalization is introduced to smooth the extremely imbalanced distribution, while a scaling Student t-distribution is proposed to solve the initialization problem and bridge the gap between the known and unknown distributions.
We propose to align the distributions via KL-divergence between the input data and the binary space, which imposes an intuitive similarity constraint to update hash functions on the new streaming data with a powerful generalizing ability to the past data.
2 Related Work
According to the learning type, online hashing can be categorized into the SGD-based methods and matrix sketch-based methods.
SGD-based online hashing employs SGD to update the learned parameters. To the best of our knowledge, Online Kernel Hashing (OKH) [12] is the first of this kind, which requires pairs of points to update the hash functions via an online passive-aggressive strategy [3]. Adaptive Hashing (AdaptHash) [6] defines a hinge-like loss, which is approximated by a differentiable Sigmoid function so that the hash functions can be updated with SGD. In [4], a more general two-step hashing was introduced, in which binary Error Correcting Output Codes (ECOC) are first assigned to labeled data, and then the hash functions are learned to fit the binary ECOC using Online Boosting. Cakir et al. [5] developed Online Hashing with Mutual Information (MIHash), which aims to optimize the mutual information between the neighbors and non-neighbors of a given query. Lin et al. [21, 19] proposed Hadamard Codebook based Online Hashing (HCOH), where a more discriminative Hadamard matrix is used as the ECOC codebook to guide the learning of hash functions.
Matrix sketch-based online hashing methods are inspired by the idea of “data sketch” [18], where a small sketch is leveraged to preserve the main properties of a large-scale dataset. To this end, Leng et al. [17] proposed Online Sketching Hashing (SketchHash), which employs an efficient variant of SVD to learn hash functions, with a PCA-based batch learning on the sketch to learn hashing weights. A faster version of Online Sketching Hashing (FROSH) was developed in [2], where an independent Subsampled Randomized Hadamard Transform (SRHT) is employed on different data chunks to make the sketch more compact and accurate, and to accelerate the sketching process.
However, existing sketch-based online hashing methods are unsupervised and thus suffer from low performance due to the lack of supervised labels. SGD-based methods [12, 6, 4, 5, 21, 20] aim to make full use of labels, but still face the practical problems discussed in Sec. 1. OKH [12], AdaptHash [6], MIHash [5] and BSODH [20] generalize poorly, since only the pairwise relationships of the current sequential data are considered. As for OSH [4] and HCOH [21, 19], a well-defined ECOC codebook has to be given in advance, and the methods fail when the size of the codebook is inconsistent with the length of hashing bits.
3 Proposed Method
3.1 Problem Definition
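Suppose there is a dataset $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ with its corresponding labels $L = [l_1, \dots, l_n]$, where $x_i \in \mathbb{R}^d$ is the $i$-th instance with its class label $l_i$. Assume there are $k$ hash functions to be learned, which map each $x_i$ into a $k$-bit binary code $b_i \in \{-1, +1\}^k$, and the $j$-th binary bit $b_{ij}$ of $x_i$ is computed as follows:
$$b_{ij} = h_j(x_i) = \mathrm{sgn}(w_j^T x_i), \qquad (1)$$
where $h_j$ is the $j$-th hash function, and $w_j \in \mathbb{R}^d$ is the projection vector of $h_j$. The function $\mathrm{sgn}(x)$ returns $+1$ if $x > 0$, and $-1$ otherwise.
Let $W = [w_1, \dots, w_k] \in \mathbb{R}^{d \times k}$ be the projection matrix. Then, the binary codes of $X$ can be computed as:
$$B = \mathrm{sgn}(W^T X). \qquad (2)$$
Online hashing aims to resolve an online retrieval task by updating hash functions from a sequence of data instances one at a time. Therefore, $X$ is not available once and for all. Without loss of generality, we denote $X^t = [x_1^t, \dots, x_{n_t}^t] \in \mathbb{R}^{d \times n_t}$ as the input streaming data at the $t$-stage, $B^t = [b_1^t, \dots, b_{n_t}^t] \in \{-1, +1\}^{k \times n_t}$ as the learned binary codes for $X^t$, and $L^t$ as the corresponding label set, where $n_t$ is the size of data at the $t$-stage. Further, we denote $S^t \in \{0, 1\}^{n_t \times n_t}$ as the pairwise similarity matrix, where $s_{ij}^t = 1$ if $x_i^t$ and $x_j^t$ share the same label, otherwise $s_{ij}^t = 0$. In an online setting, the parameter matrix $W^t$ should be updated based on the current data $X^t$ only.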
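As a minimal sketch of the setting above, the following NumPy snippet computes sign-based binary codes for one streaming chunk and builds its pairwise similarity matrix from labels. The function names, the dimensions, and the convention that dissimilar pairs are encoded as 0 are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def hash_codes(W, X):
    """Binary codes sgn(W^T X) in {-1, +1}; non-positive projections map to -1."""
    return np.where(W.T @ X > 0, 1, -1)

def pairwise_similarity(labels):
    """S[i, j] = 1 if instances i and j share a label, else 0 (assumed encoding)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

rng = np.random.default_rng(0)
d, k, n_t = 8, 4, 5                  # feature dim, code length, chunk size (illustrative)
W = rng.standard_normal((d, k))      # projection matrix, one column per hash function
X_t = rng.standard_normal((d, n_t))  # one chunk of streaming data, one column per instance
labels_t = [0, 1, 0, 2, 1]           # hypothetical class labels for the chunk

B_t = hash_codes(W, X_t)             # shape (k, n_t)
S_t = pairwise_similarity(labels_t)  # shape (n_t, n_t)
```

With few classes per chunk the similarity matrix is mostly zeros, which is the sparsity the later normalization step has to cope with.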
3.2 Proposed Framework
The framework of the proposed method can be seen in Fig. 2. Specifically, suppose that at the -th round, we have a known similarity distribution matrix and an unknown Hamming distance distribution matrix . The goal of the proposed SDOH is to align with , such that the similarity can be well preserved in the Hamming space. It is achieved by minimizing the KL-divergence as follows:
where and are the -th elements in the -th row of and , respectively. In the following, we elaborate on the details of and .
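To make the alignment objective concrete, here is a small sketch of the KL-divergence between two probability matrices, summed over all entries. The names `P` (known similarity distribution) and `Q` (Hamming distance distribution) and the `eps` guard are assumptions for this example; zero entries of `P` are skipped following the usual convention that 0 log 0 = 0.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij), skipping entries where p_ij = 0."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy 2x2 distributions (each sums to 1 over all entries).
P = np.array([[0.0, 0.4],
              [0.4, 0.2]])
Q_uniform = np.full((2, 2), 0.25)

kl_same = kl_divergence(P, P)          # identical distributions
kl_diff = kl_divergence(P, Q_uniform)  # mismatched distributions
```

The divergence vanishes only when the two distributions agree, so driving it down pulls the code-space distribution toward the label-derived one.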
3.2.1 Gaussian-based Normalization
One common approach to obtain is to normalize the similarity matrix with each element as:
However, such a probability matrix may suffer from an extremely imbalanced distribution, as shown in Fig. 2(a). For instance, when is a highly sparse matrix that is common in an online setting , is with a higher probability if and grows quickly. Similarly, if , decreases to 0 quickly.
Therefore, the learning of Eq.( 4) heavily relies on similar pairs and thus loses the information of dissimilar pairs, as shown in Fig. 2(c). To address this issue, one key novelty in our proposed SDOH is to modify Eq.( 4) as:
where is introduced to smooth the imbalanced distribution as shown in Fig. 2(b). We assume that follows a Gaussian distribution widely used in practice, i.e., , where and are the mean and variance of the pairwise similarity distribution, respectively. Therefore, we derive exp, where exp is the exponential function. Different values of the pair have different impacts on . To sum up, decides the position of the highest value of . The larger the is, the smoother the function is. Therefore, Eq.( 5) can well alleviate the imbalanced distribution problem caused by Eq.( 4) (see Fig. 2(d)).
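The effect of the smoothing can be illustrated with a short sketch. It assumes the Gaussian term has the form exp(-(s - mu)^2 / (2 sigma^2)) applied entry-wise before normalization; this exact form, the function names, and the values of `mu` and `sigma` are illustrative assumptions rather than the paper's stated settings.

```python
import numpy as np

def plain_normalization(S):
    """Divide each similarity by the total sum; dissimilar pairs (s = 0) get probability 0."""
    S = np.asarray(S, dtype=float)
    return S / S.sum()

def gaussian_normalization(S, mu=1.0, sigma=1.0):
    """Smooth similarities with a Gaussian kernel (assumed form), then normalize."""
    S = np.asarray(S, dtype=float)
    G = np.exp(-((S - mu) ** 2) / (2.0 * sigma ** 2))
    return G / G.sum()

S = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

P_plain = plain_normalization(S)
P_gauss = gaussian_normalization(S, mu=1.0, sigma=1.0)
P_smoother = gaussian_normalization(S, mu=1.0, sigma=3.0)
```

Under these assumptions, `mu` sets which similarity value receives the largest weight, and a larger `sigma` narrows the gap between similar and dissimilar probabilities, so dissimilar pairs keep a nonzero share of the distribution.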
3.2.2 Scaling Student t-distribution
For the distribution , we define it as the probability of Hamming distance. The similarity between and can be measured by the Hamming distance as:
We propose a scaling Student t-distribution based on a new with one degree of freedom to transform Hamming distances into probabilities. We start from the works in [28, 23], and each element of the original is defined as:
However, such an assigned distribution causes an unsatisfactory initialization of , which may lead to the performance degradation. Ideally, if , we need a higher value of . However, the value of in Eq.( 7) depends on the initialization of . If does not initialize well, is likely to be very small for . In such a case, in Eq.( 3) grows quickly. Similarly, when , in Eq.( 3) may be very small. Therefore, Eq.( 7) may result in an extremely poor initialization (see Fig. 2(e)).
To avoid such a poor initialization, another key novelty in our approach is to revise Eq.( 7) as follows:
where the scaling parameter if , otherwise . To analyze, scaling up will increase the value of , thus further decrease the value of when . Similar analyses can be applied to the case of (see Fig. 2(f)). Therefore, Eq.( 8) can well reduce the influence of initialization problem (see Fig. 2(d)).
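The following sketch illustrates the idea of scaling inside a Student t kernel with one degree of freedom. It assumes the kernel 1 / (1 + scale * distance), normalized over all off-diagonal pairs as in t-SNE; the scale values (`scale_sim`, `scale_dis`) and the normalization choice are hypothetical and are not the paper's settings.

```python
import numpy as np

def hamming_distance(B):
    """Pairwise Hamming distances for codes in {-1, +1}; B has shape (k, n)."""
    k = B.shape[0]
    return (k - B.T @ B) / 2.0

def student_t_probabilities(D, S=None, scale_sim=1.0, scale_dis=1.0):
    """Student-t kernel with an optional per-pair scale (assumed form)."""
    D = np.asarray(D, dtype=float)
    if S is None:
        scale = np.ones_like(D)
    else:
        scale = np.where(np.asarray(S) == 1, scale_sim, scale_dis)
    kernel = 1.0 / (1.0 + scale * D)
    np.fill_diagonal(kernel, 0.0)   # exclude self-pairs
    return kernel / kernel.sum()

# Three 4-bit codes (columns); instances 0 and 1 share a label, instance 2 does not.
B = np.array([[1,  1, -1],
              [1,  1, -1],
              [1,  1, -1],
              [1, -1,  1]])
S = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

D = hamming_distance(B)
Q_plain = student_t_probabilities(D)
Q_scaled = student_t_probabilities(D, S, scale_sim=0.5, scale_dis=2.0)
```

Shrinking the distance of similar pairs and stretching that of dissimilar pairs raises the probability assigned to the former and lowers it for the latter, which is the qualitative behavior the text attributes to the scaling parameter.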
The final objective function can be derived:
We note that the proposed SDOH clearly differs from SePH [23] in three respects: (1) Our SDOH is based on new and well-defined similarity and Hamming distance distributions, which avoid the imbalanced distribution and poor initialization problems. (2) We are the first to employ the Student t-distribution for online uni-modal retrieval, while SePH aims at solving offline cross-modal retrieval. (3) Our SDOH is implemented in an end-to-end manner, where the learning of hash functions is integrated into the KL-divergence, whereas SePH is based on a two-step framework, where the KL-divergence is first used to guide the learning of binary codes, and then hash functions are learned to approximate the learned binary codes.
3.3 The Optimization
After obtaining the distributions and , we aim to optimize the KL-divergence in Eq.( 9) to preserve the similarities in the Hamming space. Due to the discrete sign function in Eq.( 2), the above objective function is highly non-convex and difficult (usually NP-hard) to optimize. To solve it, we follow the work in  to replace the sign function with the tanh function as follows:
To solve the optimization problem in Eq.( 9), we adopt the SGD algorithm to update the hash functions at the -stage as below:
where is a positive learning rate.
We now elaborate the partial derivative of w.r.t. , i.e., . The gradient w.r.t. can be computed as follows:
Further, we denote , where is the matrix with being . Let , where stands for the Hadamard product. Let , where is the diagonal matrix, and represents a vector with all elements being . Therefore, we obtain the gradient w.r.t. as:
Besides, it is easy to obtain the derivative of w.r.t. based on Eq.( 10) as follows:
The optimization process for the proposed SDOH is summarized in the supplementary material with more details.
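The update loop can be sketched end to end on a toy chunk. This example relaxes the sign function with tanh and applies plain gradient steps, but it estimates the gradient by central finite differences instead of the closed-form derivative derived above, purely to keep the sketch short and self-checking. The Gaussian and scaled Student t forms, the scale values, the learning rate `eta`, and all function names are assumptions for illustration.

```python
import numpy as np

def target_distribution(S, mu=1.0, sigma=1.0):
    """Gaussian-smoothed, normalized similarities (assumed form); diagonal excluded."""
    G = np.exp(-((S - mu) ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(G, 0.0)
    return G / G.sum()

def code_distribution(B, S, scale_sim=0.5, scale_dis=2.0):
    """Scaled Student-t probabilities from relaxed codes B of shape (k, n)."""
    sq = np.sum(B ** 2, axis=0)
    # For exact +/-1 codes, 0.25 * squared Euclidean distance equals the Hamming distance.
    D = np.maximum(0.25 * (sq[:, None] + sq[None, :] - 2.0 * B.T @ B), 0.0)
    scale = np.where(S == 1, scale_sim, scale_dis)
    K = 1.0 / (1.0 + scale * D)
    np.fill_diagonal(K, 0.0)
    return K / K.sum()

def kl_loss(W, X, S, P):
    """KL(P || Q) with codes relaxed as tanh(W^T X)."""
    Q = code_distribution(np.tanh(W.T @ X), S)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def numerical_gradient(f, W, h=1e-5):
    """Central finite differences; a stand-in for the analytic gradient."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (f(W_plus) - f(W_minus)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
d, k = 6, 4
X_t = rng.standard_normal((d, 8))                 # one streaming chunk
labels_t = np.array([0, 0, 1, 1, 2, 2, 0, 1])     # hypothetical labels
S_t = (labels_t[:, None] == labels_t[None, :]).astype(int)
P_t = target_distribution(S_t)

W = 0.1 * rng.standard_normal((d, k))             # small init keeps tanh unsaturated
eta = 0.05                                        # learning rate (illustrative)
initial_loss = kl_loss(W, X_t, S_t, P_t)
for _ in range(20):
    grad = numerical_gradient(lambda V: kl_loss(V, X_t, S_t, P_t), W)
    W = W - eta * grad                            # gradient step on the current chunk only
final_loss = kl_loss(W, X_t, S_t, P_t)
```

In practice the analytic gradient would replace `numerical_gradient`, since finite differences cost two loss evaluations per parameter; the sketch only shows that the relaxed objective is differentiable in `W` and decreases under small gradient steps.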
4 Experiments
We conduct experiments on three benchmark datasets, i.e., CIFAR- [14], Places [36], and MNIST [16]. The proposed SDOH is compared with several state-of-the-art online hashing methods [12, 6, 17, 4, 5, 21, 20] to demonstrate its performance.
4.1 Experimental Settings
Datasets. The CIFAR- dataset consists of K instances from categories. Each instance is represented by a -dim CNN feature vector, extracted from VGG- . Similar to , we randomly split the dataset into a retrieval set with K samples, and a test set with K samples. Besides, K training images from the retrieval set are sampled to learn the hash functions.
The Places dataset is a large-scale and challenging dataset that contains more than million images with scenes. We extract CNN features from the layer of AlexNet , which are reduced into -dim features by PCA. Following , images from each scene are used to construct a test set, and the remaining images are used as the retrieval set. A random subset of K images is used to update the hash functions.
The MNIST dataset consists of K handwritten digit images with classes. Following , each image is represented by -dim normalized pixel values. The test set is constructed by sampling instances from each class, and the others are used to form the retrieval set. Besides, K images from the retrieval set are sampled to learn the hash functions.
Compared Methods. The proposed SDOH is compared with seven state-of-the-art online hashing methods, including Online Kernel Hashing (OKH) [12], Online Sketching Hashing (SketchHash) [17], Adaptive Hashing (AdaptHash) [6], Online Supervised Hashing (OSH) [4], Online Hashing with Mutual Information (MIHash) [5], Hadamard Codebook based Online Hashing (HCOH) [21] and Balanced Similarity for Online Discrete Hashing (BSODH) [20]. Our model is implemented in MATLAB. The training is done on a standard workstation with a GHz Intel Core I CPU and G RAM. The source codes of these methods are publicly available. We carefully follow the original parameter settings for each method and report their best results.
Evaluation Protocols. We use five widely-adopted evaluation metrics for performance comparisons, including mean Average Precision (denoted as mAP), precision within a Hamming ball of radius centered on each query (denoted as Precision@H), mean precision of the top R retrieved neighbors (denoted as Precision@R), mAP vs. different sizes of training instances, as well as its corresponding area under the mAP curve (denoted as AUC). Notably, due to the large scale of the Places dataset, it is very time-consuming to compute mAP. Following [5, 21], we only compute mAP on the top retrieved items (denoted as mAP@). For SketchHash [17], the batch size has to be larger than the number of hashing bits. Thus, we only report its performance for hashing bits of and .
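For readers unfamiliar with these metrics, the sketch below computes average precision, Precision@R, and precision within a Hamming ball for a single query; mAP is the mean of the average precision over all queries. The function names and toy data are hypothetical, the Hamming-ball radius is left as a required parameter, and returning 0 for an empty ball is one common convention assumed here.

```python
import numpy as np

def average_precision(relevant):
    """AP of a ranked list; relevant[i] is 1 if the i-th retrieved item matches the query label."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_rank = np.cumsum(relevant) / ranks
    return float((precision_at_rank * relevant).sum() / relevant.sum())

def precision_at_r(relevant, r):
    """Fraction of relevant items among the top r retrieved items."""
    return float(np.mean(np.asarray(relevant[:r], dtype=float)))

def precision_within_hamming_ball(query_code, db_codes, db_relevant, radius):
    """Precision over database items whose Hamming distance to the query is <= radius."""
    dist = np.count_nonzero(db_codes != query_code[:, None], axis=0)
    inside = dist <= radius
    if not inside.any():
        return 0.0  # assumed convention for an empty ball
    return float(np.mean(np.asarray(db_relevant, dtype=float)[inside]))

ranked_relevance = [1, 0, 1, 1, 0]          # toy ranking for one query
ap = average_precision(ranked_relevance)
p_at_3 = precision_at_r(ranked_relevance, 3)

query = np.array([1, 1, -1, -1])
db_codes = np.array([[ 1,  1, -1],
                     [ 1, -1, -1],
                     [-1, -1,  1],
                     [-1, -1,  1]])         # three database codes as columns
p_ball = precision_within_hamming_ball(query, db_codes, [1, 0, 1], radius=1)
```

In the toy ranking, the hits occur at ranks 1, 3 and 4, so the average precision is the mean of 1, 2/3 and 3/4.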
4.2 Results and Discussions
First, we report the mAP (mAP@) and Precision@H performance in Tabs. 1, 2 and 3. The highest values are shown in boldface, and the second best are underlined. It can be seen that the proposed SDOH is the best in almost all cases. Interestingly, as the number of hashing bits increases, the proposed SDOH outperforms the others by large margins.
Second, we analyze the Precision@R performance with R ranging from to on the three benchmarks in Figs. 3, 6 and 8. SDOH achieves superior results on all three benchmarks at all hashing bits, which demonstrates the excellent performance of the proposed method.
Third, we report the mAP (mAP@) measure w.r.t. different training sizes in Figs. 4, 7 and 9. As the size of training data increases, SDOH has consistently higher mAP (mAP@) on all three benchmarks. To quantitatively evaluate the performance of all methods, we take a deeper analysis in terms of their AUC metrics in Fig. 5. For CIFAR- and MNIST, the AUC performance of Fig. 4 and Fig. 9 is charted in Fig. 5(a) and Fig. 5(c), respectively. Obviously, in all hashing bits, SDOH outperforms other methods by a large margin.
On Places, the AUC performance in Fig. 7 is plotted in Fig. 5(b). When the hashing bit is , the proposed method also surpasses all other methods by a certain margin. Although MIHash and HCOH achieve similar results at other hashing bits, the performance of SDOH still ranks the best. Quantitatively, compared with the second best method, i.e., HCOH, SDOH achieves an average improvement of , and on CIFAR-, Places, and MNIST, respectively.
Note that Figs. 4, 7 and 9 also validate the generalization ability of the proposed SDOH. Taking the case of hashing bit on CIFAR- for instance (Fig. 4), when the size of training data is , SDOH already achieves a satisfying result of mAP compared with other methods, e.g., mAP for MIHash and mAP for HCOH. For a more in-depth analysis, MIHash only considers the pairwise similarities while HCOH restricts the length of hashing bit to be consistent with the size of ECOC codebook applied. It shows that the proposed method not only captures the pairwise similarities of the current data batch, but also the relationships among data batches at different stages, which demonstrates the usefulness of exploiting the distribution of pairwise similarities.
Furthermore, we find the advantage of KL-based solution in comparison with others, such as inner product. As shown in the above figures and tables, the inner-product-based BSODH shows advantages mainly in Precision@H. Nevertheless, the proposed KL-based SDOH still outperforms BSODH by a clear margin, which demonstrates the effectiveness and capability of the KL-based solution.
4.3 Retrieval of Unseen Classes
We further test the performance on unseen classes as in [29]. To do that, of the categories are treated as seen classes to form the training set. The remaining categories are regarded as unseen classes, which are divided into a retrieval set and a test set to evaluate the hashing model. For each query, we retrieve the nearest neighbors among the retrieval set and then compute the Precision@R. The experiments are done when the hashing bit is . The results are shown in Fig. 10. Clearly, the proposed SDOH achieves the best performance among all methods, which further demonstrates the generalization capability of the proposed framework for online hashing.
5 Conclusion
We have presented a novel online hashing method, named SDOH, which aims to align the similarity distributions between the original data space and the hashing space to preserve the semantic relationship well in the Hamming space. To achieve this goal, we first transform the discrete similarity matrix into a specified probability matrix via a Gaussian-based normalization to solve the imbalanced distribution problem. Second, to deal with the poor initialization, we have developed a scaling Student t-distribution to transform pairwise Hamming distance computation into a probability estimation problem. Finally, we align these two distributions via the KL-divergence to impose an intuitive similarity constraint for updating hash functions with a powerful generalization ability. Experimental results have shown that the proposed SDOH achieves better results than the compared state-of-the-art methods.
This work is supported by the National Key R&D Program (No. 2017YFC0113000, and No. 2016YFB1001503), Nature Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410), Post Doctoral Innovative Talent Support Program under Grant BX201600094, China Post-Doctoral Science Foundation under Grant 2017M612134, Scientific Research Project of National Language Committee of China (Grant No. YB135-49), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).
[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In Proceedings of the ICLR, 2017.
[2] X. Chen, I. King, and M. R. Lyu. Frosh: Faster online sketching hashing. In Proceedings of the UAI, 2017.
[3] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 2006.
[4] F. Cakir, S. A. Bargal, and S. Sclaroff. Online supervised hashing. CVIU, 2017.
[5] F. Cakir, K. He, S. A. Bargal, and S. Sclaroff. Mihash: Online hashing with mutual information. In Proceedings of the ICCV, 2017.
[6] F. Cakir and S. Sclaroff. Adaptive hashing for fast similarity search. In Proceedings of the ICCV, 2015.
[7] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the VLDB, 1999.
[8] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the CVPR, 2011.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the NeurIPS, 2014.
[10] Y. Hao, T. Mu, J. Y. Goulermas, J. Jiang, R. Hong, and M. Wang. Unsupervised t-distributed video hashing and its deep hashing extension. IEEE TIP, 2017.
[11] K. J. Horadam. Hadamard matrices and their applications. Princeton University Press, 2012.
[12] L. Huang, Q. Yang, and W. Zheng. Online hashing. In Proceedings of the IJCAI, 2013.
[13] Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In Proceedings of the IJCAI, 2015.
[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the NeurIPS, 2012.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[17] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu. Online sketching hashing. In Proceedings of the CVPR, 2015.
[18] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the ACM SIGKDD, 2013.
[19] M. Lin, R. Ji, H. Liu, X. Sun, S. Chen, and Q. Tian. Hadamard matrix guided online hashing. arXiv preprint arXiv:1905.04454, 2019.
[20] M. Lin, R. Ji, H. Liu, X. Sun, Y. Wu, and Y. Wu. Towards optimal discrete online hashing with balanced similarity. In Proceedings of the AAAI, 2019.
[21] M. Lin, R. Ji, H. Liu, and Y. Wu. Supervised online hashing via hadamard codebook learning. In Proceedings of the ACM MM, 2018.
[22] R.-S. Lin, D. A. Ross, and J. Yagnik. Spec hashing: Similarity preserving algorithm for entropy-based coding. In Proceedings of the CVPR, 2010.
[23] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the CVPR, 2015.
[24] H. Liu, R. Ji, Y. Wu, and F. Huang. Ordinal constrained binary code learning for nearest neighbor search. In Proceedings of the AAAI, 2017.
[25] H. Liu, M. Lin, S. Zhang, Y. Wu, F. Huang, and R. Ji. Dense auto-encoder hashing for robust cross-modality retrieval. In Proceedings of the ACM MM, 2018.
[26] W. Liu, J. Wang, R. Ji, Y. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proceedings of the CVPR, 2012.
[27] X. Liu, X. Nie, W. Zeng, C. Cui, L. Zhu, and Y. Yin. Fast discrete cross-modal hashing with regressing from semantic labels. In Proceedings of the ACM MM, 2018.
[28] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. JMLR, 2008.
[29] A. Sablayrolles, M. Douze, N. Usunier, and H. Jégou. How should we evaluate supervised hashing? In Proceedings of the ICASSP, 2017.
[30] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. Tao Shen. Learning binary codes for maximum inner product search. In Proceedings of the ICCV, 2015.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the ICLR, 2015.
[32] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data - a survey. Proceedings of the IEEE, 2016.
[33] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of the NeurIPS, 2009.
[34] L. Wu, H. Ling, P. Li, J. Chen, Y. Fang, and F. Zou. Deep supervised hashing based on stable distribution. IEEE Access, 2019.
[35] C. C. H. S. Y. Y. Xingbo Liu, Xiushan Nie. Modality-specific structure preserving hashing for cross-modal retrieval. In Proceedings of the IEEE ICASSP, 2018.
[36] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Proceedings of the NeurIPS, 2014.