Cross-modal Deep Metric Learning with Multi-task Regularization
Abstract
DNN-based cross-modal retrieval has become a research hotspot: it enables users to search results across different modalities such as image and text. However, existing methods mainly focus on the pairwise correlation and reconstruction error of labeled data. They ignore the semantically similar and dissimilar constraints between different modalities, and cannot take advantage of unlabeled data. This paper proposes Cross-modal Deep Metric Learning with Multi-task Regularization (CDMLMR), which integrates a quadruplet ranking loss and a semi-supervised contrastive loss for modeling cross-modal semantic similarity in a unified multi-task learning architecture. The quadruplet ranking loss models the semantically similar and dissimilar constraints to preserve cross-modal relative similarity ranking information. The semi-supervised contrastive loss maximizes the semantic similarity on both labeled and unlabeled data. Compared with existing methods, CDMLMR exploits not only the similarity ranking information but also unlabeled cross-modal data, and thus boosts cross-modal retrieval accuracy.
Xin Huang and Yuxin Peng*
Peking University, Beijing 100871, China
{huangxin_14, pengyuxin}@pku.edu.cn
Index Terms: Cross-modal retrieval, metric learning, multi-task regularization
1 Introduction
Nowadays, multimedia retrieval is increasingly important for data management and utilization, and has long been a research hotspot. However, most existing methods address single-modal retrieval and can only measure the similarity between data of the same modality. Different modalities are different views of the same semantics: an image of a flying bird and a text description of a bird both convey the semantic concept "bird", so they describe the same semantics through two different views and are similar at the semantic level. Modeling the similarities among different modalities is important for better understanding multimedia data, and also for multimedia applications on the Internet.
Cross-modal similarity learning focuses on exploiting the semantic correlation among multiple modalities such as image and text. Existing methods mainly project data of different modalities into one common space to obtain shared representations, so that cross-modal similarity can be directly measured by distance computation. For example, canonical correlation analysis (CCA) [1] learns a common space that maximizes the correlation between data of two modalities, and many CCA-based methods followed [2, 3]. The Cross-modal Factor Analysis (CFA) approach [4] minimizes the Frobenius norm between pairwise data in the common space. Besides, Zhai et al. propose Joint Representation Learning (JRL) [5], which models pairwise correlation and semantic information jointly in a unified graph-based framework. Kang et al. propose Local Group based Consistent Feature Learning (LGCFL) [6], which adopts a local group-based prior to learn basis matrices of different modalities. Hua et al. [7] propose to first build a semantic hierarchy with content and ontology similarities, and then learn a set of local linear projections and probabilistic membership functions for local expert aggregation. However, the above methods mostly perform shared representation projection with linear functions, which is insufficient for the complex cross-modal correlation.
Inspired by the success of deep neural networks (DNN) in single-modal retrieval, researchers have also attempted to apply DNN to cross-modal similarity measurement. For instance, the inputs of different modalities can be converted to shared representations through a shared code layer as in [8, 9, 10]. There are also cross-modal deep architectures consisting of two linked deep encodings, such as Deep CCA [11, 12] and CorrAE [13]. The above methods mainly focus on the pairwise correlation and reconstruction error of labeled multimodal data. However, they ignore the semantically similar and dissimilar constraints between different modalities, which can provide similarity ranking information for better semantic discrimination. Unlabeled data should also be taken into consideration, as it increases the diversity of training data and boosts the accuracy of shared representation learning.
To address the above problems, this paper proposes Cross-modal Deep Metric Learning with Multi-task Regularization (CDMLMR), which integrates a quadruplet ranking loss and a semi-supervised contrastive loss for modeling cross-modal semantic similarity in a unified metric learning architecture. On the one hand, the triplet network has been proposed for single-modal metric learning [14, 15] to model the relative similarity of images. CDMLMR extends the single-modal triplet network to a cross-modal quadruplet network, which preserves the relative similarity ranking information between different modalities. On the other hand, the semi-supervised contrastive loss preserves the similarity information on both labeled and unlabeled data by maximizing the semantic correlation with an online graph construction strategy. Inspired by [16], the two loss functions are integrated into one unified multi-task network and optimized simultaneously. In this way, CDMLMR fully captures the useful intrinsic hints for cross-modal similarity measurement and improves retrieval accuracy. Experimental results show that CDMLMR outperforms 8 state-of-the-art methods on 2 datasets: Wikipedia [2] and NUS-WIDE-10k [17].
2 Cross-modal Deep Metric Learning with Multi-task Regularization
To perform cross-modal deep metric learning, a base network composed of Deep Belief Networks (DBN) and Bimodal Autoencoders (Bimodal AE) [8] is first used to model the features of the different modalities, from which we obtain shallow cross-modal shared representations. Then, as shown in Figure 1, the CDMLMR model integrates two loss functions, the semi-supervised contrastive loss and the quadruplet ranking loss, for modeling the cross-modal semantic similarity in one unified deep network. The network is trained by simultaneously optimizing these loss functions to obtain the final semantically discriminative shared representations.
Formally, the multimodal dataset contains both labeled and unlabeled data. The labeled image data is denoted as I^L = {i_p}_{p=1}^{n_1}, where the p-th image is i_p \in R^{d_I}, n_1 is the number of labeled images, d_I is the dimension of the image feature, and c^I_p is the corresponding label of i_p. The unlabeled image data is denoted as I^U = {i_p}_{p=n_1+1}^{n}, where n is the total number of images. The text data is represented analogously as T^L = {t_p}_{p=1}^{n_1} and T^U = {t_p}_{p=n_1+1}^{n}, with feature dimension d_T and labels c^T_p. It should be noted that the p-th image and text, i.e., i_p and t_p, have pairwise correspondence.
2.1 The Base Network
The base network converts data of different modalities into representations of the same dimension, and these shallow shared representations are used as input to the deep metric learning network. We first employ a separate two-layer DBN to model each modality. To model the distribution over the image feature i_p, a Gaussian Restricted Boltzmann Machine (RBM) is used, which is an undirected graphical model with visible units v connected to hidden units h. For the text feature t_p, Replicated Softmax is used to model the distribution. The probability that each DBN assigns to its input feature is defined as follows:
P(i_p) = \sum_{h^{(1)}, h^{(2)}} P(h^{(1)}, h^{(2)}) P(i_p | h^{(1)})    (1)

P(t_p) = \sum_{h^{(1)}, h^{(2)}} P(h^{(1)}, h^{(2)}) P(t_p | h^{(1)})    (2)
The outputs of the two DBNs are denoted as h_I and h_T. Then a Bimodal AE is used to obtain shallow shared representations of the same dimension. The Bimodal AE is able to reconstruct both modalities by minimizing the reconstruction error between the input and the reconstruction representations at the reconstruction layers. The shallow shared representations obtained from the middle layer of the Bimodal AE are denoted as s^I_p and s^T_p.
2.2 Multi-task Regularization
As shown in Figure 1, our CDMLMR model has two pathways of three fully-connected layers, one for each modality, taking the shallow shared representations s^I_p and s^T_p obtained from the base network as input. On top of the two-pathway network, multiple loss branches are embedded, each with a fully-connected layer using a sigmoid nonlinearity, which integrates the semi-supervised and quadruplet ranking regularization in one unified deep network. For images, the batch for each iteration consists of labeled images \hat{I}^L and unlabeled images \hat{I}^U, where m is the total number of images in a mini-batch and m_1 of them are labeled. Similarly, we have \hat{T}^L and \hat{T}^U for text. The outputs of the two pathways are denoted as f(i_p) and g(t_p), where f denotes the image mapping and g the text mapping.
Semi-supervised Contrastive Loss: In CDMLMR, the semi-supervised contrastive loss is proposed to preserve the similarity information of both labeled and unlabeled data. The basic idea is that similar image/text pairs should have similar shared representations, and vice versa. Here a pair of data are "similar" if they are close to each other in the shallow shared representation space, or belong to the same semantic class. To model this semi-supervised information, a neighborhood graph is constructed, where the vertices represent both image and text data, and the edges represent the cross-modal similarity matrix between image and text data, denoted as S^L for the labeled image/text pairs and S^U for the unlabeled image/text pairs. For a labeled image i_p and a labeled text t_q, the similarity matrix is constructed from the labels as follows:
S^L_{pq} = 1 if c^I_p = c^T_q, and S^L_{pq} = 0 otherwise.    (3)
As for unlabeled image/text pairs, i.e., pairs in which at least one of the image and the text is unlabeled, we analyze the k-nearest neighbors N_k(i_p) and N_k(t_q) of each image i_p and each text t_q. Instead of constructing the graph offline on all the data, which is very time-consuming, an online graph construction strategy is proposed to generate the cross-modal similarity matrix for the unlabeled image/text pairs within a mini-batch. The similarity matrix for the unlabeled image/text pairs is defined as follows:
S^U_{pq} = 1 if t_q \in N_k(i_p) or i_p \in N_k(t_q), and S^U_{pq} = 0 otherwise.    (4)
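The online graph construction can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `knn_similarity` and the use of Euclidean distance in the shared representation space are illustrative assumptions.

```python
import numpy as np

def knn_similarity(img_feats, txt_feats, k=3):
    """Build the unlabeled cross-modal similarity matrix for one
    mini-batch: S[p, q] = 1 if text q is among the k nearest neighbors
    of image p, or image p is among the k nearest neighbors of text q,
    in the (shallow) shared representation space. Illustrative sketch."""
    # pairwise Euclidean distances between the two modalities
    d = np.linalg.norm(img_feats[:, None, :] - txt_feats[None, :, :], axis=2)
    # indices of the k closest texts for each image, and vice versa
    img_nn = np.argsort(d, axis=1)[:, :k]   # shape (m, k)
    txt_nn = np.argsort(d, axis=0)[:k, :]   # shape (k, m)
    s = np.zeros_like(d)
    for p in range(d.shape[0]):
        s[p, img_nn[p]] = 1.0               # text q in N_k(i_p)
    for q in range(d.shape[1]):
        s[txt_nn[:, q], q] = 1.0            # image p in N_k(t_q)
    return s
```

Because the matrix is built per mini-batch, its cost is quadratic in the batch size only, which is what makes the online strategy cheaper than an offline graph over the whole dataset.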
To maximize the semantic correlation, we expect similar image/text pairs to have smaller distances and dissimilar image/text pairs to have larger distances. Thus, a contrastive loss for the labeled image/text pairs is designed to capture the similarity information as follows:
L_L = \sum_{p,q=1}^{m_1} [ S^L_{pq} d(i_p, t_q)^2 + (1 - S^L_{pq}) \max(\alpha - d(i_p, t_q), 0)^2 ]    (5)
and, to capture the neighborhood information, the contrastive loss between unlabeled image/text pairs is defined as follows:
L_U = \sum_{p,q=m_1+1}^{m} [ S^U_{pq} d(i_p, t_q)^2 + (1 - S^U_{pq}) \max(\alpha - d(i_p, t_q), 0)^2 ]    (6)
where d(i_p, t_q) = ||f(i_p) - g(t_q)||_2 is the distance between the network outputs for the input image and text data obtained from the base network, and \alpha is the margin parameter. Combining the above two loss functions, we finally obtain the semi-supervised contrastive loss:
L_{Semi} = L_L + L_U    (7)
To balance the numbers of similar and dissimilar pairs, we randomly select one similar pair and one dissimilar pair for each image or text for training. By minimizing the above loss function, we preserve the similarity information on both the labeled and the unlabeled data.
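The semi-supervised contrastive loss described above can be sketched in NumPy, assuming the standard squared-distance/hinge form of contrastive loss; function names and the simple summation over a batch are illustrative assumptions, not the authors' code.

```python
import numpy as np

def contrastive_loss(f_img, g_txt, sim, margin=1.0):
    """Contrastive loss over one batch of cross-modal pairs:
    similar pairs (sim=1) are pulled together by squared distance,
    dissimilar pairs (sim=0) are pushed beyond the margin."""
    d = np.linalg.norm(f_img[:, None, :] - g_txt[None, :, :], axis=2)
    loss = sim * d**2 + (1 - sim) * np.maximum(margin - d, 0.0)**2
    return loss.sum()

def semi_supervised_loss(f_l, g_l, sim_l, f_u, g_u, sim_u, margin=1.0):
    # total loss = labeled contrastive term + unlabeled contrastive term
    return contrastive_loss(f_l, g_l, sim_l, margin) + \
           contrastive_loss(f_u, g_u, sim_u, margin)
```

For the labeled part `sim` comes from class labels; for the unlabeled part it comes from the online k-NN graph of the mini-batch.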
Quadruplet Ranking Loss: Inspired by the idea of preserving relative similarity in the triplet network [14], the cross-modal quadruplet ranking loss is designed to further model the cross-modal relative similarity, with a sample layer that generates quadruplet samples from the output of the two-pathway network. The quadruplet samples are organized into the form (i^+, i^-, t^+, t^-) and satisfy the following two relative similarity constraints: (1) the text sample t^+ is more similar to the image sample i^+ than to the image sample i^-; (2) the image sample i^+ is more similar to the text sample t^+ than to the text sample t^-. Similarity is defined according to data labels, so the quadruplet samples are generated only from the labeled data \hat{I}^L and \hat{T}^L in a mini-batch. Based on this, the quadruplet ranking loss function is defined as follows:

L_{Quad} = \max(0, d(i^+, t^+)^2 - d(i^-, t^+)^2 + \alpha_q) + \max(0, d(i^+, t^+)^2 - d(i^+, t^-)^2 + \alpha_q)    (8)

where d(i, t) = ||f(i) - g(t)||_2 and \alpha_q is the margin parameter. By capturing both the between-class and within-class differences across modalities, the quadruplet network effectively preserves the cross-modal relative similarity and improves the semantic discriminative ability of the shared representations, boosting retrieval accuracy.
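A hinge-form implementation consistent with the two quadruplet constraints can be sketched as follows. This is a hypothetical sketch under the assumption of squared Euclidean distances and a shared margin, not the authors' exact formulation.

```python
import numpy as np

def quadruplet_loss(f_ip, f_im, g_tp, g_tm, margin=1.0):
    """Quadruplet ranking loss for one sample (i+, i-, t+, t-):
    the matched image/text pair (i+, t+) must be closer than both
    mismatched pairs (i-, t+) and (i+, t-) by at least the margin."""
    d_pp = np.sum((f_ip - g_tp) ** 2)  # matched pair distance
    d_mp = np.sum((f_im - g_tp) ** 2)  # mismatched image vs. matched text
    d_pm = np.sum((f_ip - g_tm) ** 2)  # matched image vs. mismatched text
    return max(0.0, d_pp - d_mp + margin) + max(0.0, d_pp - d_pm + margin)
```

When both mismatched pairs are already farther than the matched pair by the margin, the loss is zero and the quadruplet contributes no gradient.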
After network training, we obtain the mapping function f(·) for the image pathway and g(·) for the text pathway. For any image i and text t, we calculate f(i) and g(t) as the final semantically discriminative shared representations, which can then be used for retrieval by distance measurement.
2.3 Network Training
CDMLMR involves two loss functions: the quadruplet ranking loss and the semi-supervised contrastive loss. First, we calculate the derivatives of the two loss functions separately. For the semi-supervised contrastive loss, the derivative of the loss function in (5) is calculated for each image i_p and each text t_q as follows:
\partial L_L / \partial f(i_p) = \sum_{q=1}^{m_1} [ 2 S^L_{pq} (f(i_p) - g(t_q)) - 2 (1 - S^L_{pq}) \frac{\max(\alpha - d(i_p, t_q), 0)}{d(i_p, t_q)} (f(i_p) - g(t_q)) ]    (9)

\partial L_L / \partial g(t_q) = \sum_{p=1}^{m_1} [ 2 S^L_{pq} (g(t_q) - f(i_p)) - 2 (1 - S^L_{pq}) \frac{\max(\alpha - d(i_p, t_q), 0)}{d(i_p, t_q)} (g(t_q) - f(i_p)) ]    (10)
where d(i_p, t_q) = ||f(i_p) - g(t_q)||_2, the derivative with respect to g(t_q) takes the opposite sign of the pairwise difference, and \alpha is the margin parameter. Moreover, the derivative of L_U in (6) is calculated in the same way, and the derivative of L_{Semi} in (7) then follows; thus back-propagation can be applied to update the parameters of the network.
For the quadruplet ranking loss in (8), we calculate the gradients as follows:

\partial L_{Quad} / \partial f(i^+) = 2\lambda_1 (f(i^+) - g(t^+)) + 2\lambda_2 (g(t^-) - g(t^+))    (11)

\partial L_{Quad} / \partial f(i^-) = 2\lambda_1 (g(t^+) - f(i^-))    (12)

\partial L_{Quad} / \partial g(t^+) = 2\lambda_1 (f(i^-) - f(i^+)) + 2\lambda_2 (g(t^+) - f(i^+))    (13)

\partial L_{Quad} / \partial g(t^-) = 2\lambda_2 (f(i^+) - g(t^-))    (14)

where d(i, t) is denoted as ||f(i) - g(t)||_2. The indicator \lambda_1 is 1 if

d(i^+, t^+)^2 - d(i^-, t^+)^2 + \alpha_q > 0    (15)

and 0 otherwise; \lambda_2 is defined analogously for the second hinge term. Thus the loss function in (8) can be applied in back-propagation through the network.
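As a sanity check on the gradient derivation, a finite-difference comparison can be run against the analytic form. The hinge-style loss below is an assumed reconstruction consistent with the two quadruplet constraints (not the authors' code), and the test point is chosen so both hinge terms are active.

```python
import numpy as np

def quad_loss(fp, fm, tp, tm, margin=1.0):
    # hinge-form quadruplet loss over squared Euclidean distances
    d_pp = np.sum((fp - tp) ** 2)
    d_mp = np.sum((fm - tp) ** 2)
    d_pm = np.sum((fp - tm) ** 2)
    return max(0.0, d_pp - d_mp + margin) + max(0.0, d_pp - d_pm + margin)

def numeric_grad_fp(fp, fm, tp, tm, eps=1e-6):
    # central-difference gradient w.r.t. the matched image code f(i+)
    g = np.zeros_like(fp)
    for j in range(fp.size):
        e = np.zeros_like(fp)
        e[j] = eps
        g[j] = (quad_loss(fp + e, fm, tp, tm) - quad_loss(fp - e, fm, tp, tm)) / (2 * eps)
    return g

fp = np.array([0.2, 0.1])   # f(i+)
fm = np.array([0.5, 0.5])   # f(i-)
tp = np.array([0.0, 0.0])   # g(t+)
tm = np.array([0.4, 0.0])   # g(t-)
# with both hinge terms active, the analytic gradient reduces to
# 2*(f(i+) - g(t+)) + 2*(g(t-) - g(t+))
analytic = 2 * (fp - tp) + 2 * (tm - tp)
numeric = numeric_grad_fp(fp, fm, tp, tm)
```

If the two vectors agree to within the finite-difference tolerance, the analytic gradient is consistent with the loss definition.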
After calculating the derivatives of the above two loss functions, the gradients for each modality from the fully-connected layers of the loss branches are summed at the top of the three fully-connected layers of the two-pathway network for parameter updating.
Table 1. MAP scores for all results on the two datasets.

Dataset       Task        CCA    CFA    KCCA-Poly  KCCA-RBF  Bimodal AE  Multimodal DBN  CorrAE  JRL    CMDN   CDMLMR
Wikipedia     Image→Text  0.124  0.236  0.200      0.245     0.236       0.149           0.280   0.344  0.393  0.412
Wikipedia     Text→Image  0.120  0.211  0.185      0.219     0.208       0.150           0.242   0.277  0.325  0.341
Wikipedia     Average     0.122  0.224  0.193      0.232     0.222       0.150           0.261   0.311  0.359  0.377
NUS-WIDE-10k  Image→Text  0.120  0.211  0.150      0.232     0.159       0.158           0.223   0.324  0.391  0.405
NUS-WIDE-10k  Text→Image  0.120  0.188  0.149      0.213     0.172       0.130           0.227   0.263  0.357  0.379
NUS-WIDE-10k  Average     0.120  0.200  0.150      0.223     0.166       0.144           0.225   0.294  0.374  0.392
Table 2. MAP scores for the top 50 results on the two datasets.

Dataset       Task        CCA    CFA    KCCA-Poly  KCCA-RBF  Bimodal AE  Multimodal DBN  CorrAE  JRL    CMDN   CDMLMR
Wikipedia     Image→Text  0.186  0.315  0.245      0.275     0.282       0.189           0.335   0.310  0.360  0.388
Wikipedia     Text→Image  0.167  0.328  0.277      0.341     0.327       0.222           0.368   0.386  0.487  0.517
Wikipedia     Average     0.177  0.322  0.261      0.308     0.305       0.206           0.352   0.348  0.424  0.453
NUS-WIDE-10k  Image→Text  0.205  0.324  0.254      0.301     0.250       0.173           0.331   0.348  0.432  0.487
NUS-WIDE-10k  Text→Image  0.210  0.332  0.250      0.360     0.297       0.203           0.379   0.458  0.497  0.553
NUS-WIDE-10k  Average     0.208  0.328  0.252      0.331     0.274       0.188           0.355   0.403  0.465  0.520
2.4 Details of the Network
The network parameters need to be adjusted according to the input dimensions. Here we present the layer parameters designed for the Wikipedia dataset, which is introduced in the experiment section. In the base network, the two-layer DBN for the image input has 2,048 hidden units on the first layer and 1,024 hidden units on the second layer. For the text input, the two-layer DBN has 1,024 hidden units on both layers. On top of each DBN, a three-layer feed-forward neural network with a Softmax layer is adopted for further optimization, with 1,024 units on each layer. In the Bimodal AE, the input layer and the reconstruction layer have the same dimension, and the dimension of the middle layer is half of the input; a Softmax layer is also connected to the middle layer for further optimization. As for the two-pathway network in Figure 1, all three fully-connected layers have a dimension of 256, and the fully-connected layer with sigmoid nonlinearity on each loss branch also has a dimension of 256. For generality, the output dimension is 256 for both datasets, chosen according to the retrieval accuracy on the validation set of the Wikipedia dataset. The networks are trained by stochastic gradient descent with a base learning rate of 0.001, momentum 0.9, and weight decay 0.004. The network is easy to train and converges in fewer than 5k steps in our experiments.
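The dimension flow described above can be illustrated with a toy NumPy forward pass. Random weights stand in for the trained DBN/Bimodal AE parameters, and the helper `fc` is hypothetical; only the layer dimensions follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, d_out):
    # hypothetical fully-connected layer with sigmoid nonlinearity
    # (random weights; only the dimensions matter for this sketch)
    w = rng.standard_normal((x.shape[-1], d_out)) * 0.01
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = rng.standard_normal((4, 2296))  # batch of 4 Wikipedia image features
h = fc(fc(x, 2048), 1024)           # two-layer DBN stand-in: 2048 -> 1024
shared = fc(h, 512)                 # Bimodal AE middle layer (half of input)
for _ in range(3):                  # three 256-d fully-connected layers
    shared = fc(shared, 256)
print(shared.shape)                 # final 256-d shared representation
```

The text pathway is analogous, starting from the 3,000-d BoW feature and its 1,024-unit DBN.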
3 Experiment
3.1 Experiment Datasets
We briefly introduce the 2 datasets as follows. For fair comparison, the dataset partition and feature extraction are strictly the same as in [13] and [10]. It should be noted that in our experiments the unlabeled data comes from the test set, so we set the ratio of labeled/unlabeled data according to the ratio of training/test data.
Wikipedia dataset [2]. Built from Wikipedia's "featured articles", the Wikipedia dataset is the most widely used dataset for cross-modal retrieval. It consists of 2,866 documents in 10 categories, and each document contains an image/text pair. In our experiments, following [13] and [10], the dataset is split into 3 parts: 2,173 documents for training, 462 for testing, and 231 for validation. The image feature is a 2,296-d concatenation of a 1,000-d PHOW descriptor, a 512-d GIST descriptor, and a 784-d MPEG-7 descriptor. The text feature is a 3,000-d bag-of-words (BoW) vector.
NUS-WIDE-10k dataset [17]. The NUS-WIDE dataset consists of about 270,000 images and their tags, categorized into 81 classes. NUS-WIDE-10k is a subset of NUS-WIDE with 10,000 image/text pairs from the 10 largest classes (1,000 pairs per class). The dataset is randomly split into 3 parts: 8,000 documents for training, 1,000 for testing, and 1,000 for validation. Following [13], we use a 1,134-d concatenated image feature of a 64-d color histogram, 144-d color correlogram, 73-d edge direction histogram, 128-d wavelet texture, 225-d block-wise color moments, and 500-d SIFT-based BoVW features. The texts are represented by 1,000-d BoW vectors.
3.2 Compared Methods and Evaluation Metric
Two retrieval tasks are conducted: retrieving text by image query (Image→Text) and retrieving image by text query (Text→Image), where each image in the test set is used to retrieve all the texts in the test set and vice versa. Eight state-of-the-art cross-modal retrieval methods are used for comparison: CCA [1], CFA [4], KCCA [18] (with polynomial and RBF kernels), Bimodal AE [8], Multimodal DBN [19], CorrAE [13], JRL [5] and CMDN [10]. After obtaining cross-modal shared representations with CDMLMR and the compared methods, we compute the ranking lists by cosine distance and adopt the mean average precision (MAP) score as the evaluation metric, for both all and top-50 results.
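The MAP evaluation can be sketched as follows. The helper names are illustrative, and the cosine similarities are assumed precomputed; this is a generic MAP implementation, not the exact evaluation script used in the paper.

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked result list; `relevant` is a 0/1 array ordered
    by decreasing similarity to the query."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    cum = np.cumsum(relevant)
    # precision at each position where a relevant item was retrieved
    precision_at_hit = cum[relevant > 0] / (np.flatnonzero(relevant) + 1)
    return precision_at_hit.mean()

def mean_average_precision(sims, q_labels, db_labels):
    """MAP over all queries: rank the database by precomputed similarity
    scores `sims` (queries x database) and average the per-query AP."""
    ranks = np.argsort(-sims, axis=1)              # best match first
    rel = q_labels[:, None] == db_labels[ranks]    # relevance by class label
    return np.mean([average_precision(r) for r in rel])
```

Restricting `ranks` to its first 50 columns before computing relevance yields the top-50 variant reported in Table 2.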
3.3 Experimental Results
Tables 1 and 2 show the MAP scores on the 2 datasets for all and top-50 results. On both the Wikipedia and NUS-WIDE-10k datasets, CDMLMR achieves encouraging improvements over the state-of-the-art methods on both the Image→Text and Text→Image tasks. In general, KCCA shows a clear advantage over CCA because of its nonlinearity, and JRL achieves the highest accuracy among the methods without DNN. As for the DNN-based methods, CMDN performs best among the four DNN-based compared methods (Bimodal AE, Multimodal DBN, CorrAE and CMDN) because it models inter-modal and intra-modal information simultaneously. These results show that our CDMLMR method measures cross-modal similarity more effectively. Compared with the existing methods, CDMLMR fully captures the useful intrinsic hints for the cross-modal similarity metric by exploiting the similarity ranking information of cross-modal quadruplets, and makes full use of unlabeled data to increase the diversity of training data. Thus it learns semantically discriminative shared representations and boosts cross-modal retrieval accuracy.
Table 3 shows the comparison between our CDMLMR method and the base network baselines. We compare three baselines: Base directly performs retrieval with the output of the base network; Semi uses only the semi-supervised contrastive loss; Quad uses only the quadruplet ranking loss. Due to page limitations, we only show the MAP scores for all results here. CDMLMR clearly improves the cross-modal retrieval accuracy, which shows that the similarity ranking information of cross-modal quadruplets and the rich unlabeled cross-modal instances provide useful hints for cross-modal similarity learning, and can be effectively modeled by our CDMLMR model in a unified framework.
Table 3. MAP scores for all results of CDMLMR and the baselines.

Dataset       Task        Base   Semi   Quad   CDMLMR
Wikipedia     Image→Text  0.292  0.364  0.344  0.412
Wikipedia     Text→Image  0.240  0.308  0.280  0.341
Wikipedia     Average     0.266  0.336  0.312  0.377
NUS-WIDE-10k  Image→Text  0.264  0.351  0.326  0.405
NUS-WIDE-10k  Text→Image  0.290  0.342  0.312  0.379
NUS-WIDE-10k  Average     0.277  0.347  0.319  0.392
4 Conclusion
This paper has proposed the CDMLMR model, which integrates a quadruplet ranking loss and a semi-supervised contrastive loss for modeling cross-modal semantic similarity in a unified multi-task learning architecture. Compared with existing methods, CDMLMR not only exploits the similarity ranking information of cross-modal quadruplets to learn semantically discriminative shared representations, but also makes full use of unlabeled data to increase the diversity of training data, and thereby improves retrieval accuracy. In the future, we will continue to focus on cross-modal deep metric learning and aim to model more than two modalities simultaneously.
5 Acknowledgments
This work was supported by National Natural Science Foundation of China under Grants 61371128 and 61532005.
Footnotes
*Corresponding author.
References
[1] Harold Hotelling, "Relations between two sets of variates," Biometrika, pp. 321–377, 1936.
[2] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos, "A new approach to cross-modal multimedia retrieval," in ACM International Conference on Multimedia (ACM MM), 2010, pp. 251–260.
[3] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik, "A multi-view embedding space for modeling internet images, tags, and their semantics," International Journal of Computer Vision (IJCV), vol. 106, no. 2, pp. 210–233, 2014.
[4] Dongge Li, Nevenka Dimitrova, Mingkun Li, and Ishwar K. Sethi, "Multimedia content processing through cross-modal association," in ACM International Conference on Multimedia (ACM MM), 2003, pp. 604–611.
[5] Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao, "Learning cross-media joint representation with sparse and semi-supervised regularization," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, pp. 965–978, 2014.
[6] Cuicui Kang, Shiming Xiang, Shengcai Liao, Changsheng Xu, and Chunhong Pan, "Learning consistent feature representation for cross-modal multimedia retrieval," IEEE Transactions on Multimedia (TMM), vol. 17, no. 3, pp. 370–381, 2015.
[7] Yan Hua, Shuhui Wang, Siyuan Liu, Anni Cai, and Qingming Huang, "Cross-modal correlation learning by adaptive hierarchical semantic aggregation," IEEE Transactions on Multimedia (TMM), vol. 18, no. 6, pp. 1201–1216, 2016.
[8] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng, "Multimodal deep learning," in International Conference on Machine Learning (ICML), 2011, pp. 689–696.
[9] Nitish Srivastava and Ruslan Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Conference on Neural Information Processing Systems (NIPS), 2012, pp. 2222–2230.
[10] Yuxin Peng, Xin Huang, and Jinwei Qi, "Cross-media shared representation by hierarchical learning with multiple deep networks," in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 3846–3853.
[11] Galen Andrew, Raman Arora, Jeff A. Bilmes, and Karen Livescu, "Deep canonical correlation analysis," in International Conference on Machine Learning (ICML), 2013, pp. 1247–1255.
[12] Fei Yan and Krystian Mikolajczyk, "Deep correlation for matching images and text," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3441–3450.
[13] Fangxiang Feng, Xiaojie Wang, and Ruifan Li, "Cross-modal retrieval with correspondence autoencoder," in ACM International Conference on Multimedia (ACM MM), 2014, pp. 7–16.
[14] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu, "Learning fine-grained image similarity with deep ranking," in Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1386–1393.
[15] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan, "Simultaneous feature learning and hash coding with deep neural networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
[16] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Conference on Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
[17] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in ACM International Conference on Image and Video Retrieval (CIVR), 2009.
[18] David R. Hardoon, Sándor Szedmák, and John Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[19] Nitish Srivastava and Ruslan Salakhutdinov, "Learning representations for multimodal data with deep belief nets," in International Conference on Machine Learning (ICML) Workshop, 2012.