Rankingbased Deep Crossmodal Hashing
Abstract
Crossmodal hashing has been receiving increasing interests for its low storage cost and fast query speed in multimodal data retrievals. However, most existing hashing methods are based on handcrafted or raw level features of objects, which may not be optimally compatible with the coding process. Besides, these hashing methods are mainly designed to handle simple pairwise similarity. The complex multilevel ranking semantic structure of instances associated with multiple labels has not been well explored yet. In this paper, we propose a rankingbased deep crossmodal hashing approach (RDCMH). RDCMH firstly uses the feature and label information of data to derive a semisupervised semantic ranking list. Next, to expand the semantic representation power of handcrafted features, RDCMH integrates the semantic ranking information into deep crossmodal hashing and jointly optimizes the compatible parameters of deep feature representations and of hashing functions. Experiments on real multimodal datasets show that RDCMH outperforms other competitive baselines and achieves the stateoftheart performance in crossmodal retrieval applications.
Rankingbased Deep Crossmodal Hashing
Xuanwu Liu, Guoxian Yu^{†}^{†}thanks: Corresponding author: gxyu@swu.edu.cn(Guoxian Yu), Carlotta Domeniconi, Jun Wang, Yazhou Ren, Maozu Guo College of Computer and Information Sciences, Southwest University, Chongqing, China Hubei Key Laboratory of Intelligent GeoInformation Processing, China University of Geosciences, Wuhan, China Department of Computer Science, George Mason University, Fairfax, USA SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China Email: {alxw1007,gxyu,kingjun}@swu.edu.cn, carlotta@cs.gmu.edu, yazhou.ren@uestc.edu.cn, guomaozu@bucea.edu.cn
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
With the explosive growth of data, how to efficiently and accurately retrieve the required information from massive data becomes a hot research topic and has various applications. For example, in information retrieval, approximate nearest neighbor (ANN) search (?) plays a fundamental role. Hashing has received increasing attention due to its low storage cost and fast retrieval speed for ANN search (?). The main idea of hashing is to convert the highdimensional data in the ambient space into binary codes in the lowdimensional Hamming space, while the proximity between data in the original space is preserved in the Hamming space(?; ?; ?). By using binary hash codes to represent the original data, the storage cost can be dramatically reduced. In addition, we can use hash codes to construct an index and achieve a constant or sublinear time complexity for ANN search. Hence, hashing has become more and more popular for ANN search on large scale datasets.
In many applications, the data can have multimodalities. For example, a web page can include not only a textual description but also images and videos to illustrate its contents. These different types (views) of data are called multimodal data. With the rapid growth of multimodal data in various applications, multimodal hashing has recently been widely studied. Existing multimodal hashing methods can be divided into two main categories: mutlisource hashing (MSH) and crossmodal hashing (CMH)(?). The goal of MSH is to learn hash codes by utilizing all the information from multiple modalities. Hence, MSH requires all the modalities observed for all data points, including query points and those in the database. In practice, it is often difficult or even infeasible to acquire all data points across all the modalities, as such the application of MSH is limited. On the contrary, the application scenarios of CMH are more flexible and practical. In CMH, the modality of a query point can be different from the modality of the points in the database. In addition, the query point typically has only one modality and the points in the database can have one or more modalities. For example, we can use text queries to retrieve images from the database, and we can also use image queries to retrieve texts from the database. Due to its wide application, CMH has attracted increasing attention (?; ?).
Many CMH methods have been proposed recently, existing CMH methods can be roughly divided into two categories: supervised and unsupervised. Unsupervised approaches seek hash coding functions by taking into account underlying data structures, distributions, or topological information. To name a few, Canonical correlation analysis (?) maps two modalities, such as visual and textual, into a common space by maximizing the correlation between the projections of the two modalities. Intermedia hashing (?) maps viewspecific features onto a common Hamming space by learning linear hash functions with intramodal and intermodal consistencies. Supervised approaches try to leverage supervised information (i.e., semantic labels) to improve the performance. Crossmodal similarity sensitive hashing (CMSSH) (?) regards the hash codes learning as binary classification problems, and efficiently learns the hash functions using a boosting method. Coregularized hashing (?) learns a group of hash functions for each bit of binary codes in every modal. Semantic correlation maximization (SCM) (?) optimizes the hashing functions by maximizing the correlation between two modalities with respect to the semantic labels. Semantics Preserving Hashing (SePH)(?) generates one unified hash code for all observed views of any instance by considering the semantic consistency between views.
Most supervised hashing are pairwise supervised methods, which leverage labels of instances and pairwise labels of instancepairs to train the coding functions, such that the label information can be preserved in the Hamming space (?). Their objectives, however, may be suboptimal for ANN search, because they do not fully explore the highorder ranking information (?). For example, a triplet rank contains a query image, a positive image, and a negative image, where the positive image is more similar to the query image than the negative image (?). Highorder ranking information carries relative similarity ordering in the triplets and provides richer supervision, it often can be more easily obtained than pairwise ranking. Some hashing methods consider the highorder ranking information for hashing learning. For example, deep semantic ranking based hashing (?) learns deep hash functions based on CNN (Convolutional neural network)(?), which preserves the semantic structure of multilabel images. Simultaneous feature learning and hash coding (?) generates bitwise hash codes for images via a carefully designed deep architecture and uses a triplet ranking loss to preserve relative similarities.
However, these semantic ranking methods just consider one modality, and cannot apply to crossmodal retrieval. Besides, the ranking lists are just simply computed by the number of shared labels, which could not preserve the integral ranking information of labels. Furthermore, the ranking lists adopted by these methods ask for sufficient labeled training data, and cannot make use of abundant unlabeled data, whose multimodal feature information can boost the crossmodal hashing performance. Semisupervised hashing methods were introduced to leverage both labeled and unlabeled samples (?), but these methods cannot directly be applied on multimodal data. Almost all these CMH methods are based on handcrafted (or rawlevel) features. One drawback of these handcrafted feature based methods is that the feature extraction procedure is isolated from the hashcode learning procedure, or the original rawlevel features can not reflect the semantic similarity between objects very well. The handcrafted features might not be optimally compatible with the hashcode learning procedure(?). As a result, these CMH methods can not achieve satisfactory performance in real applications.
Recently, deep learning has also been utilized to perform feature learning from scratches with promising performance. Deep crossmodal hashing(DCMH)(?) combines the deep feature learning with crossmodal retrieval and guides deep learning procedure with multilabels of multimodal objects. Correlation autoencoder hashing (?) adopts deep learning for unimodal hashing. Their studies show that the endtoend deep learning architecture is more compatible for hashing learning. However, they still ask for sufficient label information of training data, and treat the parameters of the hash quantization layer and those of deep feature learning layers as the same, which may reduce the discriminative power of the quantification process.
In this paper, we propose a rankingbased deep crossmodal hashing (RDCMH), for crossmodal retrieval applications. RDCMH firstly uses the feature and label information of data to derive a semisupervised semantic ranking list. Next, it integrates the semantic ranking information into deep crossmodal hashing and jointly optimizes the ranking loss and hashing codes functions to seek optimal parameters of deep feature representations and those of hashing functions. The main contributions of RDCMH are outlined as follows:

A novel crossmodal hash function learning framework (RDCMH) is proposed to integrate deep feature learning with semantic ranking to address the problem of preserving semantic similarity between multilabel objects for crossmodal hashing; and a label and feature information induced semisupervised semantic ranking metric is also introduced to leverage labeled and unlabeled data.

RDCMH jointly optimizes the deep feature extraction process and the hash quantization process to make feature learning procedure being more compatible with the hashcode learning procedure, and this joint optimization indeed significantly improves the performance.

Experiments on benchmark multimodal datasets show that RDCMH outperforms other baselines (?; ?; ?; ?; ?) and achieves the stateoftheart performance in crossmodal retrieval tasks.
The Proposed Approach
Suppose and are two data modalities, is the number of instances (data points), () is the dimensionality of the instances in the respective modality. For example, in the Wikiimage search application, is the image features of the entity , and is the tag features of this entity. stores the label information of instances in and with respect to distinct labels. , indicates that is labeled with the th label; otherwise. Without loss of generality, suppose the first samples have known labels, whereas other samples lack label information. To enable crossmodal hashing, we need to learn two hashing functions, : and : , where is the length of binary hash codes. These two hashing functions are expected to map the feature vectors in the respective modality onto a common Hamming space and to preserve the proximity of the original data.
RDCMH mainly involves with two steps. It firstly measures the semantic ranking between instances based on the label and feature information. Next, it defines an objective function to simultaneously account for semantic ranking, deep feature learning and hashing coding functions learning; and further introduces an alternative optimization procedure to jointly optimize these learning objectives. The overall workflow of RDCMH is shown in Fig. 1.
Semisupervised Semantic Ranking
To preserve the semantic structure, we can force the ranking order of neighbors computed by the Hamming distance being consistent with that derived from semantic labels in terms of ranking evaluation measures. Suppose is a query point, the semantic similarity level of a database point with respect to can be calculated based on ranking order of label information. Then we can obtain a groundtruth ranking list for by sorting the database points in decreasing order of their similarity levels (?; ?). However, this similarity level is just simply derived from the number of shared labels and these semantic rankingbased methods ignore that the labels of training data are not always readily available. Furthermore, these methods work on one modality and can not directly apply on multimodal data.
To alleviate the issue of insufficient labeled data, we introduce a semisupervised semantic measure that takes into account both the label and feature information of training data. The labels of an instance depend on the features of this instance, and the semantic similarity is positively correlated with the feature similarity of respective instances (?; ?). The semisupervised semantic measure is defined as follows:
(1) 
where is the feature similarity of and , while is the label similarity, both of them are computed by the cosine similarity. Note, is always in the interval [0,1] and other similarity metrics can also be used. Eq. (1) can account for both the labeled and unlabeled training data. Specifically, for two unlabeled data, the similarity between and is directly computed from the feature information of the respective data. For labeled data, we consider that the label similarity is a supplement to . The larger the is, the larger the is. In this way, we leverage the label and feature information of training data to account for insufficient labels.
Extending the ranking order to the crossmodal case, we should keep the sematic structure both in the intermodality and intramodality. Based on , we can obtain a ranking list for by sorting the database points in decreasing order of . Similarly, we can define the semisupervised semantic similarity for the data modality . To balance the inconsistence of ranking list between two modalities, the semisupervised semantic similarity is averaged as: . Finally, we can obtain three different ranking lists: , , for each query point.
Unified Objective Function
Deep Feature Representation
Most existing hashing methods first extract handcrafted visual features (like GIST and SIFT) from images and then learn ‘shallow’ (usually linear) hash functions upon these features (?; ?; ?). However, these handcrafted features have limited representation power and may lose key semantic information, which is important for similarity search. Here we consider designing deep hash functions using CNN (?) to jointly learn feature representations and their mappings to hash codes. This nonlinear hierarchical hash function has more powerful learning capability than the shallow one based on features crafted in advance, and thus is able to learn more suitable feature representations for multilevel semantic similarity search. Other representation learning models (i.e., AlexNet) can also be used to learn deep features of images and text for RDCMH. The feature learning part contains two deep neural networks, one for image modality and the other for text modality.
The adopted deep neural network for image modality is a CNN, which includes eight layers. The first six layers are the same as those in CNNF(?). The seventh and eighth layer is a fullyconnected layer with the outputs being the learned image features. As to the text modality, we first represent each text as a vector with bagofwords (BOW) representation. Next, the bagofwords vectors are used as the inputs for a neural network with two fullyconnected layers, denoted as “full1  full2”. The “full1” layer has 4096 neurons, and the second layer “full2” has (hashing codes) neurons, The activation function for the first layer is ReLU, and that for the second layer is the identity function.
For presentation, we represent the learnt deep feature representations of and as and . The nonlinear mapping parameters of these two representations will be discussed later.
Triplet Ranking Loss and Quantitative Loss
Directly optimizing the ranking criteria for crossmodal hashing is very hard. Because it is very difficult to compare the ranking lists and stringently comply with the lists. To circumvent this problem, we use a triplet ranking loss as the surrogate loss. Given a query and a ranking list for , we can define a ranking loss on a set of triplets of hash codes as follows:
(2) 
where is the length of the ranking list, and are the similarity between query and and , respectively. represents the learnt hash codes of , , , is the Hamming distance. This triplet ranking loss is a convex upper bound on the pairwise disagreement, it counts the number of incorrectly ranked triplets.
Eq. (2) equally treats all triplets, but two samples ( and ) of a triplet may have different similarity levels to the query . So we introduce the weighted ranking triplet loss based on the ranking list as follows:
(3) 
The larger the relevance between and than that between and is, the larger the ranking loss results in, if is ranked behind for .
As to the crossmodal case, we should balance the inconsistence of ranking lists between two modalities. To this end, we give the unified objective function that simultaneously account for triplet ranking loss and quantitative loss as follows:
(4) 
where
(5) 
(6) 
(7) 
is the set of query points, and are the deep features of images and texts, and are the coefficient matrices of two modalities, respectively. is the scalar parameter to balance the triplet ranking loss and quantitative loss. and are the binary hash codes for image and text modality, respectively. In the training process, since different modality data of the same sample share the same label set, and they actually represent the same sample from different viewpoints, we fix the binary codes of same training points from two modalities as the same, namely .
Eq. (4) simultaneously accounts for the triplet ranking loss and the quantitative loss. The first term enforces the consistency of crossmodal ranking list by minimizing the number of incorrectly ranked triplets, and the second term (weighted by ) measures the quantitative loss of hashing. and can preserve the crossmodal similarity in , and , as a result, binary hash codes and can also preserve these crossmodal similarities. This exactly coincides with the goal of crossmodal hashing.
Optimization
We can solve Eq. (4) via the Alternating Direction Method of Multipliers (ADMM) (?), which alternatively optimizes one of , , and B, while keeping the other two fixed.
Optimize with and fixed: We observe that the loss function in Eq. (4) is actually a summation of weighted triplet losses and the quantitative loss. Like most existing deep learning methods, we utilize stochastic gradient descent (SGD) to learn with the backpropagation (BP) algorithm. In order to facilitate the gradient computation, we rewrite the Hamming distance as the form of inner product: , where is the number of hash bits.
More specifically, in each iteration we sample a minibatch of points from the training set and then carry out our learning algorithm based on the triplet data. For any triplet (), the derivative of Eq. (4) with respect to coefficient matrix in the data modality is given by:
(8) 
(9) 
(10) 
We can compute with , and using the chain rule. These derivative values are used to update the coefficient matrix , which is then fed into six layers CNN to update the parameters of in each layer via the BP algorithm.
Similar to the optimization of , we optimize on the data modality with and fixed. The derivative values are similarly used to update the coefficient matrix , which is then fed into the adopted twolayer network to update the parameters of in each layer via the BP algorithm.
Optimize with and fixed: When and are optimized and fixed, and are also determined, then the minimization problem in Eq. (4) is equal to a maximization as follows:
(11) 
where . It is easy to observe that the binary code should keep the same sign as . Therefore, we have:
(12) 
Experiment
Datasets
We use three benchmark datasets: Nuswide, Pascal VOC, and Mirflicker to evaluate the performance of RDCMH. Each dataset include two modalities (image and text), but RDCMH can also be applied to other data modalities. For modalities, we just need to compute the ranking lists for each modality and optimize it by minimizing the inconsistency of each ranking list between any pairwise modality.
Nuswide^{1}^{1}1http://lms.comp.nus.edu.sg/research/NUSWIDE.htm contains 260,648 web images, and some images are associated with textual tags. It is a multilabel dataset where each point is annotated with one or several labels from 81 concept labels. The text for each point is represented as a 1000dimensional bagofwords vector. The handcrafted feature for each image is a 500dimensional bagofvisual words (BOVW) vector.
Wiki^{2}^{2}2https://www.wikidata.org/wiki/Wikidata is generated from a group of 2866 Wikipedia documents. Each document is an imagetext pair labeled with 10 semantic classes. The images are represented by 128dimensional SIFT feature vectors. The text articles are represented as probability distributions over 10 topics, which are derived from a Latent Dirichlet Allocation (LDA) model.
Mirflickr^{3}^{3}3http://press.liacs.nl/mirflickr/mirdownload.html originally contains 25,000 instances collected from Flicker. Each instance consists of an image and its associated textual tags, and is manually annotated with one or more labels, from a total of 24 semantic labels. The text for each point is represented as a 1386dimensional bagofwords vector. For the handcrafted feature based method, each image is represented by a 512dimensional GIST feature vector.
Evaluation metric and Comparing Methods
We use the widely used Mean Average Precision (MAP) to measure the retrieval performance of all crossview hashing methods. A larger MAP value corresponds to a better retrieval performance.
Seven stateoftheart and related crossmodal hashing methods are used as baselines for comparison, including Crossmodal Similarity Sensitive Hashing (CMSSH) (?), Semantic Correlation Maximization (SCMseq and SCMorth) (?), Semantics Preserving Hashing (SePH) (?), Deep Crossmodal Hashing (DCMH) (?), Correlation Hashing Network (CHN) (?) and Collective Deep Quantization(CDQ)(?). Source codes of these baselines are kindly provided by the authors and the input parameters of these baselines are specified according to the suggestion of the papers. As to RDCMH, we set the minibatch size for gradient descent to 128, and set dropout rate as 0.5 on the fully connected layers to avoid overfitting. The regularization parameter in Eq. (4) is set to 1, and the number of iterations for optimizing Eq. (4) is fixed to 500. We empirically found RDCMH generally converges in no more than 500 iterations on all these datasets. The length of the semisupervised sematic ranking list used for training is set to 5. Namely, we divide the ranking list (i.e., into 5 bins and randomly pick three points from three different bins to form a triplet for training. By doing so, we can not only capture different levels of semantic similarity, but also avoid optimizing too much triplets, whose maximum number is cubic to the number of samples. Our preliminary study shows that DRCMH holds relatively stable performance when the number of bins .
Results and Analysis
Search Accuracies
Mirflickr  Nuswide  Wiki  
Methods  16bits  32bits  64bits  128bits  16bits  32bits  64bits  128bits  16bits  32bits  64bits  128bits  

CMSSH  
SCMseq  
SCMorth  
SePH  
DCMH  
CHN  
CDQ  
RDCMH  

CMSSH  
SCMseq  
SCMorth  
SePH  
DCMH  
CHN  
CDQ  
RDCMH  
Mirflickr  Nuswide  Wiki  
Methods  16bits  32bits  64bits  128bits  16bits  32bits  64bits  128bits  16bits  32bits  64bits  128bits  

CMSSH  
SCMseq  
SCMorth  
SePH  
DCMH  
CHN  
CDQ  
RDCMH  

CMSSH  
SCMseq  
SCMorth  
SePH  
DCMH  
CHN  
CDQ  
RDCMH  
The MAP results for RDCMH and other baselines with handcrafted features on MIRFLICKR, NUSWIDE and Wiki datasets are reported in Table 1. Here, ‘Image vs. Text’ denotes the setting where the query is an image and the database is text, and ‘Text vs. Image’ denotes the setting where the query is a text and the database is image.
From Table 1, we have the following observations.
(1) RDCMH outperforms almost all the other baselines, which demonstrate the superiority of our method in crossmodal retrieval. This superiority is because RDCMH integrates the semantic ranking information into deep crossmodal hashing to preserve better semantic structure information, and jointly optimizes the triplet ranking loss and quantitative loss to obtain more compatible parameters of deep feature representations and of hashing functions. SePH achieves better results for text to image retrieval on Wiki. That is possible because its adaptability of probabilitybased strategy on small datasets.
(2) An unexpected observation is that the performance of CMSSH and SCMOrth decreases as the length of hash codes increase. This may be caused by the imbalance between bits in the hash codes learnt by singular value decomposition or eigenvalue decomposition, these two decompositions are adopted these two approaches.
(3) Deep hashing methods (DCMH, CHN, CDQ and DRCMH) have an improved performance than the others. This proves that deep feature learned from raw data is more compatible for hashing learning than handcrafted features in crossmodal retrieval. DRCMH still outperforms DCMH, CDQ and CHN. This observation corroborates the superiority of rankingbased loss and the necessity of jointly learning deep feature presentations and hashing functions.
To further verify the effectiveness of RDCMH in semisupervised situation, we randomly mask all the labels of 70% training samples. All the comparing methods then use the remaining labels to learn hash functions. Table 2 reports the results under different hash bits on three datasets. All these methods manifest sharply reduced MAP values. RDCMH have higher MAP values than all the other baselines, and also outperforms SePH on the Wiki dataset. RDCMH is less affected by the insufficient labels than other methods. For example, the average MAP value of the second best performer CHN is reduced by 81.9%, and that of RDCMH is 70.2%. This is because label integrity has a significant impact on the effectiveness of supervised hashing methods. In practice, the pairwise semantic similarity between labeled data is reduced to 9% () in this setting. As a result, RDCMH also has a sharply reduced performance. All these comparing methods ask for sufficient label information to guide the hashing code learning. Unfortunately, these comparing methods disregard unlabeled data, which contribute to more faithfully explore the structure of data and to reliable crossmodal hashing codes. This observation proves the effectiveness of the introduced semisupervised semantic measure in leveraging unlabeled data to boost the hashing code learning.
We conducted additional experiments on multilabel datasets with 30% missing labels by randomly masking the labels of training data. The recorded results show that RDCMH again outperforms the comparing methods. Specifically, the average MAP value of the second best performer (CDQ) is 4% less than that of RDCMH. Due to space limitation, the results are not reported here. Overall, we can conclude that RDCMH is effective in weaklysupervised scenarios.
Sensitivity to Parameters
We further explore the sensitivity of the scalar parameter in Eq. (4), and report the results on Mirflickr and Wiki in Fig. 2, where the code length fixed as 16 bits. We can see that RDCMH is slightly sensitive to with , and achieves the best performance when . Overweighting or underweighting the quantitative loss have a negative impact to the performance, but not so significant. In summary, an effective can be easily selected for RDCMH.
Further Analysis
To investigate the contribution components of RDCMH, we introduce four variants of RDCMH, namely RDCMHNW, RDCMHND, RDCMHNS and RDCMHNJ. RDCMHNW disregards the weight and equally treats all the triplets; RDCMHND denotes the variant without deep feature learning, it directly uses the handcrafted features to learn hashing functions during training. RDCMHNS simply obtains the ranking list by the number of shared labels, as done by (?; ?). RDCMHNJ isolates deep feature learning and hashing functions learning, it first learns deep features and then generates hash codes based on the learnt features. Fig. 3 shows the results of these variants on the Mirfilcker dataset. The results on other datasets provide similar observations and conclusions, and are omitted here for space limit.
We can see RDCMH outperforms RDCMHNW. This means the triplet ranking loss with adaptive weights can improve the crossmodal retrieval quality, since it assigns larger weights to more relevant points and smaller weights to the less relevant ones. RDCMH also outperforms RDCMHNS, which indicates that dividing the ranking lists into different levels based on the semisupervised semantic similarity is better than simply dividing by the number of shared labels, which was adopted by (?; ?). Moreover, we can find that RDCMH achieves a higher accuracy than RDCMHND and RDCMHNJ, which shows not only the superiority of deep features than handcrafted features in crossmodal retrieval, but also the advantage of simultaneous hashcode learning and deep feature learning.
Conclusion
In this paper, we proposed a novel crossmodal hash function learning formwork (RDCMH) to seamlessly integrate deep feature learning with semantic ranking based hashing. RDCMH can preserve multilevel semantic similarity between multilabel objects for crossmodal hashing, and it also introduces a label and feature information induced semisupervised semantic measure to leverage labeled and unlabeled data. Extensive experiments demonstrate that RDCMH outperforms other stateoftheart hashing methods in crossmodal retrieval. The code of RDCMH is available at mlda.swu.edu.cn/codes.php?name=RDCMH.
Acknowledgments
This work is supported by NSFC (61872300, 61741217, 61873214, and 61871020), NSF of CQ CSTC (cstc2018jcyjAX0228, cstc2016jcyjA0351, and CSTC2016SHMSZX0824), the Open Research Project of Hubei Key Laboratory of Intelligent GeoInformation Processing (KLIGIP2017A05), and the National Science and Technology Support Program (2015BAK41B04).
References
 [Andoni and Indyk 2006] Andoni, A., and Indyk, P. 2006. Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 459–468.
 [Boyd et al. 2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.
 [Bronstein et al. 2010] Bronstein, M. M.; Bronstein, A. M.; Michel, F.; and Paragios, N. 2010. Data fusion through crossmodality metric learning using similaritysensitive hashing. In CVPR, 3594–3601.
 [Cao et al. 2016] Cao, Y.; Long, M.; Wang, J.; and Zhu, H. 2016. Correlation autoencoder hashing for supervised crossmodal search. In ACM ICMR, 197–204.
 [Cao et al. 2017] Cao, Y.; Long, M.; Wang, J.; and Liu, S. 2017. Collective deep quantization for efficient crossmodal retrieval. In AAAI, 3974–3980.
 [Chang et al. 2012] Chang, S. F.; Jiang, Y. G.; Ji, R.; Wang, J.; and Liu, W. 2012. Supervised hashing with kernels. In CVPR, 2074–2081.
 [Chatfield et al. 2014] Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. BMVC 5(4):14.
 [Jiang and Li 2017] Jiang, Q. Y., and Li, W. J. 2017. Deep crossmodal hashing. In CVPR, 3270–3278.
 [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
 [Kulis and Grauman 2010] Kulis, B., and Grauman, K. 2010. Kernelized localitysensitive hashing for scalable image search. In ICCV, 2130–2137.
 [Kumar and Udupa 2011] Kumar, S., and Udupa, R. 2011. Learning hash functions for crossview similarity search. In IJCAI, 1360–1365.
 [Lai et al. 2015] Lai, H.; Pan, Y.; Liu, Y.; and Yan, S. 2015. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 3270–3278.
 [Lin et al. 2017] Lin, Z.; Ding, G.; Han, J.; and Wang, J. 2017. Crossview retrieval via probabilitybased semanticspreserving hashing. IEEE Transactions on Cybernetics 47(12):4342–4355.
 [Rasiwasia et al. 2010] Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G. R.; Levy, R.; and Vasconcelos, N. 2010. A new approach to crossmodal multimedia retrieval. In ACM MM, 251–260.
 [Shao et al. 2016] Shao, W.; He, L.; Lu, C.T.; Wei, X.; and Philip, S. Y. 2016. Online unsupervised multiview feature selection. In ICDM, 1203–1208.
 [Song et al. 2013] Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; and Shen, H. T. 2013. Intermedia hashing for largescale retrieval from heterogeneous data sources. In SIGMOD, 785–796.
 [Song et al. 2015] Song, D.; Liu, W.; Ji, R.; Meyer, D. A.; and Smith, J. R. 2015. Top rank supervised binary coding for visual search. In ICCV, 1922–1930.
 [Wang et al. 2009] Wang, C.; Yan, S.; Zhang, L.; and Zhang, H.J. 2009. Multilabel sparse coding for automatic image annotation. In CVPR, 1643–1650.
 [Wang et al. 2016] Wang, J.; Liu, W.; Kumar, S.; and Chang, S.F. 2016. Learning to hash for indexing big data – a survey. Proceedings of the IEEE 104(1):34–57.
 [Wang et al. 2018] Wang, J.; Zhang, T.; Sebe, N.; and Shen, H. T. 2018. A survey on learning to hash. TPAMI 40(4):769–790.
 [Wang, Kumar, and Chang 2012] Wang, J.; Kumar, S.; and Chang, S. F. 2012. Semisupervised hashing for largescale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12):2393–2406.
 [Yi and Yeung 2012] Yi, Z., and Yeung, D. Y. 2012. Coregularized hashing for multimodal data. In NIPS, 1376–1384.
 [Zhang and Li 2014] Zhang, D., and Li, W. J. 2014. Largescale supervised multimodal hashing with semantic correlation maximization. In AAAI, 2177–2183.
 [Zhang and Zhou 2010] Zhang, Y., and Zhou, Z.H. 2010. Multilabel dimensionality reduction via dependence maximization. TKDD 4(3):14.
 [Zhao et al. 2015] Zhao, F.; Huang, Y.; Wang, L.; and Tan, T. 2015. Deep semantic ranking based hashing for multilabel image retrieval. In CVPR, 1556–1564.
 [Zhu et al. 2013] Zhu, X.; Huang, Z.; Shen, H. T.; and Zhao, X. 2013. Linear crossmodal hashing for efficient multimedia search. In ACM MM, 143–152.