SoDeep: a Sorting Deep net to learn ranking loss surrogates
Abstract
Several tasks in machine learning are evaluated using nondifferentiable metrics such as mean average precision or Spearman correlation. However, their nondifferentiability prevents their use as objective functions in a learning framework. Surrogate and relaxation methods exist but tend to be specific to a given metric.
In the present work, we introduce a new method to learn approximations of such nondifferentiable objective functions. Our approach is based on a deep architecture that approximates the sorting of arbitrary sets of scores. It is trained virtually for free using synthetic data. This sorting deep (SoDeep) net can then be combined in a plug-and-play manner with existing deep architectures. We demonstrate the interest of our approach in three different tasks that require ranking: crossmodal text-image retrieval, multilabel image classification and visual memorability ranking. Our approach yields very competitive results on these three tasks, which validates the merit and the flexibility of SoDeep as a proxy for the sorting operation in ranking-based losses.
1 Introduction
Deep learning approaches have gained enormous research interest for many computer vision tasks in recent years. Deep convolutional networks are now commonly used to learn state-of-the-art models for visual recognition, including image classification [26, 18, 35] and visual semantic embedding [25, 22, 37]. One of the strengths of these deep approaches is the ability to train them in an end-to-end manner, removing the need for handcrafted features [29]. In such a paradigm, the network starts with the raw inputs, and handles feature extraction (low-level and high-level features) and prediction internally. The main requirement is to define a trainable scheme. For deep architectures, stochastic gradient descent with backpropagation is usually performed to minimize an objective function. This loss function depends on the target task but has to be at least differentiable.
Machine learning tasks are often evaluated and compared using metrics which differ from the objective function used during training. The choice of an evaluation metric is intimately related to the definition of the task at hand, even sometimes to the benchmark itself. For example, accuracy seems to be the natural choice to evaluate classification methods, whereas the choice of the objective function is also influenced by the mathematical properties that allow a proper optimization of the model. For classification, one would typically choose the cross entropy loss – a differentiable function – over the nondifferentiable accuracy. Ideally, the objective function used during training would be identical to the evaluation metric. However, standard evaluation metrics are often not suitable as training objectives for lack of differentiability to start with. This results in the use of surrogate loss functions that are better behaved (smooth, possibly convex). Unfortunately, coming up with good surrogate functions is not an easy task.
In this paper, we focus on the nondifferentiability of the evaluation metrics used in ranking-based tasks such as recall, mean average precision and Spearman correlation. Departing from prior art on building surrogate losses for such tasks, we adopt a simple, yet effective, learning approach: our main idea is to approximate the nondifferentiable part of such ranking-based metrics by an all-purpose learnable deep neural network. In effect, this architecture is designed and trained to mimic sorting operations. We call it SoDeep. SoDeep can be added in a plug-and-play manner on top of any deep network trained for tasks whose final evaluation metric is rank-based, hence not differentiable. The resulting combined architecture is end-to-end learnable with a loss that relates closely to the final metric.
Our contributions are as follows:

We propose a deep neural net that acts as a differentiable proxy for ranking, allowing one to rewrite different evaluation metrics as functions of this sorter, hence making them differentiable and suitable as training loss.

We explore two types of architectures for this trainable sorting function: convolutional and recurrent.

We combine the proposed differentiable sorting module with standard deep CNNs, train them end-to-end on three challenging tasks, and demonstrate the merit of this novel approach through extensive evaluations of the resulting models.
The rest of the paper is organized as follows. We discuss in Section 2 the related works on direct and indirect optimization of ranking-based metrics, and position our work accordingly. Section 3 is dedicated to the presentation of our approach. We show in particular how a “universal” sorting proxy suffices to tackle standard rank-based metrics, and present different architectures to this end. More details on the system and its training are reported in Section 4, along with various experiments. We first establish new state-of-the-art performance on crossmodal retrieval, then we show the benefits of our learned loss function compared to standard methods on memorability prediction and multilabel image classification.
2 Related works
Many data processing systems rely on sorting operations at some stage of their pipeline. This is also the case in machine learning, where handling such nondifferentiable, nonlocal operations can be a real challenge [32]. For example, retrieval systems must rank a set of database items according to their relevance to a query. For the sake of training, simple loss functions that are decomposable over each training sample have been proposed, as for instance in [19] for the area under the ROC curve. Recently, some more complex non-decomposable losses (such as the Average Precision (AP), Spearman coefficient, and normalized discounted cumulative gain (nDCG) [3]) that present hard computational challenges have been proposed [31].
Mean average precision optimization
Our work shares with many previous works the high-level goal of using ranking metrics as training objective functions. Several works studied the problem of optimizing average precision with support vector machines [21, 40] and other works extended these approaches to neural networks [1, 31, 8]. To learn to rank, the seminal work [21] relies on a structured hinge upper bound to the loss. Further works reduce the computational complexity [31] or rely on asymptotic methods [36]. The focus of these works is mainly on the relaxation of the mean average precision, while our focus is on learning a surrogate for the ranking operation itself such that it can be combined with multiple ranking metrics. In contrast to most ranking-based techniques, which have to face the high computational complexity of the loss-augmented inference [21, 36, 31], we propose a fast, generic, deep sorting architecture that can be used in gradient-based training for rank-based tasks.
Application of ranking based metrics
Ranking is commonly used in evaluation metrics. On retrieval tasks such as crossmodal retrieval [25, 22, 15, 12, 30], recall is the standard evaluation. Image classification [11, 9] and object recognition are evaluated with mean average precision in the multilabel case. Ordinal regression [5] is evaluated using Spearman correlation.
Existing surrogate functions
Multiple surrogates for ranking exist. Using metric learning to do retrieval is one of them. This popular approach avoids the use of the ranking function altogether. Instead, pairwise [39], triplet-wise [38, 4] and listwise [13, 2] losses are used to optimize distances in a latent space. The cross-entropy loss is typically used for multilabel and multiclass classification tasks.
3 SoDeep approach
Rank-based metrics such as recall, Spearman correlation and mean average precision can be expressed as a function of the rank of the output scores. The computation of the rank being the only nondifferentiable part of these metrics, we propose to learn a surrogate network that approximates directly this sorting operation.
3.1 Learning a sorting proxy
Let $\mathbf{y} \in \mathbb{R}^d$ be a vector of $d$ real values and $rk$ the ranking function so that $rk(\mathbf{y}) \in \{1, \dots, d\}^d$ is the vector containing the rank for each variable in $\mathbf{y}$, i.e. $rk(\mathbf{y})_i$ is the rank of $y_i$ among the $y_j$’s. We want to design a deep architecture $f_{\Theta_B}$ that is able to mimic this sorting operator. The training procedure of this DNN is summarized in Fig. 2. The aim is to learn its parameters, $\Theta_B$, so that the output of the network is as close as possible to the output of the exact sorting.
Before discussing possible architectures, let’s consider the training of this network, independent of its future use. We first generate a training set by randomly sampling $N$ input vectors $\mathbf{y}^{(n)}$ and we compute through exact sorting the associated ground-truth rank vectors $\mathbf{r}^{(n)} = rk(\mathbf{y}^{(n)})$. We then classically learn the DNN $f_{\Theta_B}$ by minimizing a loss between the predicted ranking vector $\hat{\mathbf{r}} = f_{\Theta_B}(\mathbf{y})$ and the ground-truth rank $\mathbf{r}$ over the training set:
$$\min_{\Theta_B} \sum_{n=1}^{N} \left\| \mathbf{r}^{(n)} - f_{\Theta_B}(\mathbf{y}^{(n)}) \right\|_1 \tag{1}$$
We explore in the following different network architectures and we explain how the training data is generated.
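As an illustration of the training target, the exact ranking function and an L1 discrepancy between a sorter output and the exact ranks can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, rank 1 is assigned to the largest score, and scores are assumed distinct.

```python
def rank(scores):
    """Exact rank of each score: 1 for the largest value, d for the smallest.

    Assumption of this sketch: scores are distinct (ties are broken by position).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def l1_rank_loss(predicted_ranks, scores):
    """L1 distance between a sorter's output and the exact ranks of `scores`."""
    return sum(abs(p - r) for p, r in zip(predicted_ranks, rank(scores)))


scores = [0.3, -0.8, 0.9, 0.1]
print(rank(scores))  # [2, 4, 1, 3]
# A hypothetical, slightly imperfect sorter output for the same vector:
print(l1_rank_loss([2.1, 3.8, 1.0, 3.2], scores))  # approximately 0.5
```

The exact ranks cost only a sort, which is why the supervision for the sorter can be produced without any annotation.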
3.1.1 Sorter architectures
We investigate two types of architectures for our differentiable sorter $f_{\Theta_B}$. One is a recurrent network and the other one a convolutional network, each capturing interesting aspects of standard sorting algorithms:

The recurrent architecture in Fig. 2(a) consists of a bidirectional LSTM [34] followed by a linear projection. The bidirectional recurrent network creates a connection between the output of the network and every input, which is critical for ranking computation. Knowledge about the whole sequence is needed to compute the true rank of any element.

The convolutional architecture in Fig. 2(b) consists of 8 convolutional blocks, each of these blocks being a one-dimensional convolution followed by a batch normalization layer [20] and a ReLU activation function. The sizes of the convolutional filters are chosen such that the output of the network contains as many channels as the length of the input sequence. Convolutions are used for their local property: indeed, sorting algorithms such as bubble sort [14] only rely on a sequence of local operations. The intuition is that a deep enough convolutional network, with its cascaded local operations, should be able to mimic recursive sorting algorithms and thus to provide an efficient approximation of ranks.
We will further discuss the interest of both types of SoDeep block architectures in the experiments.
3.1.2 Training data
The SoDeep module can be easily (pre)trained with supervision on synthetic data. Indeed, while being nondifferentiable, the ranking function can be computed with classic sorting algorithms. The training data consists of vectors of randomly generated scalars, associated with their ground-truth rank vectors. In our experiments, the numbers are sampled from different types of distributions:

Uniform distribution over $[-1, 1]$;

Normal distribution with fixed mean $\mu$ and standard deviation $\sigma$;

Sequence of evenly spaced numbers in a uniformly drawn random subrange of $[-1, 1]$;

Random mixtures of the previous distributions.
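The generation of synthetic (scores, ranks) pairs from the distributions listed above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the range $[-1, 1]$, the normal parameters (mean 0, standard deviation 1), the shuffling of the evenly spaced sequence and the element-wise form of the mixture are all assumptions made here, not settings confirmed by the text.

```python
import random


def sample_scores(d, rng, kind=None):
    """Draw one synthetic score vector of length d (d >= 2).

    Assumptions of this sketch: range [-1, 1], normal with mean 0 and std 1.
    """
    kind = kind or rng.choice(["uniform", "normal", "linspace", "mixture"])
    if kind == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(d)]
    if kind == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(d)]
    if kind == "linspace":
        # Evenly spaced values in a random sub-range, presented in random order.
        low, high = sorted(rng.uniform(-1.0, 1.0) for _ in range(2))
        step = (high - low) / (d - 1)
        values = [low + k * step for k in range(d)]
        rng.shuffle(values)
        return values
    # Mixture: each entry is taken from one of the three base distributions.
    parts = [sample_scores(d, rng, k) for k in ("uniform", "normal", "linspace")]
    return [rng.choice(parts)[i] for i in range(d)]


def exact_ranks(scores):
    """Ground-truth ranks by exact sorting (1 = largest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


rng = random.Random(0)
scores = sample_scores(8, rng)
pair = (scores, exact_ranks(scores))  # one (input, target) training pair
```

Because pairs are generated on demand, an arbitrarily large training set is available at negligible cost.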
While the differentiable sorter can be trained ahead of time on a variety of input distributions, as explained above, there might be a shift with the actual score distribution that the main network will output for the task at hand. This shift can reduce naturally during training, or an alignment can be explicitly enforced. For example, the main network $f_{\Theta_A}$ can be designed to output data in the interval used to learn the sorter, with the help of bounded functions such as cosine similarity.
3.2 Using SoDeep for training with rank-based loss
Rank-based metrics are used for evaluating and comparing learned models in a number of tasks. Recall is a standard metric for image and information retrieval, mean average precision (mAP) for classification and recognition, and Spearman correlation for ordinal prediction. Such rank-based metrics are nondifferentiable because they require a transition from the continuous domain (score) to the discrete domain (rank).
As presented in Fig. 1, we propose to insert a pretrained SoDeep proxy block $f_{\Theta_B}$ between the deep scoring function $f_{\Theta_A}$ and the chosen rank-based loss. We show in the following how mAP, Spearman correlation and recall can be expressed as functions of the rank and combined with SoDeep accordingly.
In the following we assume a training set of annotated pairs $(x_i, y_i)$ for the task at hand. A group $\mathcal{B}$ of $d$ training examples among them yields a prediction vector $\hat{\mathbf{y}}$, obtained by applying $f_{\Theta_A}$ to each example, and an associated ground-truth score vector $\mathbf{y}$ (Fig. 1).
3.2.1 Spearman correlation
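For two vectors $\mathbf{y}$ and $\mathbf{y}'$ of size $d$, corresponding to two sets of $d$ observations, the Spearman correlation [7] is defined as:
$$r_s = 1 - \frac{6\, \| rk(\mathbf{y}) - rk(\mathbf{y}') \|_2^2}{d(d^2 - 1)} \tag{2}$$
Maximizing w.r.t. parameters $\Theta_A$ the sum of Spearman correlations (2) between ground-truth score vectors $\mathbf{y}^{(n)}$ and predicted score vectors $\hat{\mathbf{y}}^{(n)}$ over $N$ subsets of training examples amounts to solving the minimization problem:
$$\min_{\Theta_A} \sum_{n=1}^{N} \| rk(\hat{\mathbf{y}}^{(n)}) - rk(\mathbf{y}^{(n)}) \|_2^2 \tag{3}$$
with the loss not being differentiable.
Using now our differentiable proxy $f_{\Theta_B}$ instead of the rank function, we can define the new Spearman loss for a group $\mathcal{B}$:
$$\mathcal{L}_{SPR}(\Theta_A, \Theta_B) = \| f_{\Theta_B}(\hat{\mathbf{y}}) - rk(\mathbf{y}) \|_2^2 \tag{4}$$
Training will typically minimize it over a large set of groups. Note that here the optimization is done over $\Theta_A$, knowing that the SoDeep block $f_{\Theta_B}$ has been trained independently on specific synthetic training data. Optionally, the block can be fine-tuned along the way, hence minimizing w.r.t. $\Theta_B$ as well.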
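The link between the Spearman correlation and the squared distance between rank vectors can be checked numerically. The sketch below is illustrative only: it uses exact ranks (rank 1 for the largest score, distinct scores assumed), and the function names are hypothetical. In training, the first argument of `spearman_loss` would be the output of the sorter proxy rather than hand-written values.

```python
def rank(scores):
    """Exact ranks, 1 for the largest score (distinct scores assumed)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(y, y_prime):
    """Spearman correlation computed from the squared rank differences."""
    d = len(y)
    squared = sum((a - b) ** 2 for a, b in zip(rank(y), rank(y_prime)))
    return 1.0 - 6.0 * squared / (d * (d * d - 1))


def spearman_loss(predicted_ranks, target_scores):
    """Squared L2 distance between (proxy) ranks and exact target ranks."""
    return sum((p - r) ** 2 for p, r in zip(predicted_ranks, rank(target_scores)))


print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

Since the correlation is an affine, decreasing function of the squared rank distance, minimizing that distance maximizes the correlation.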
3.2.2 Mean Average Precision (mAP)
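Multilabel image classification is often evaluated using mAP, a metric from information retrieval. To define it, each of the $C$ classes is considered as a query over the $d$ elements of the dataset. For class $c$, denoting $\mathbf{y}_c$ the $d$-dimensional ground-truth binary vector and $\hat{\mathbf{y}}_c$ the vector of scores for this class, the average precision (AP) for the class is defined as [40]:
$$AP(\hat{\mathbf{y}}_c, \mathbf{y}_c) = \frac{1}{rel_c} \sum_{j=1}^{d} y_c(j)\, Prec(j) \tag{5}$$
where $rel_c = \sum_{j} y_c(j)$ is the number of positive items for class $c$ and precision for element $j$ is defined as:
$$Prec(j) = \frac{\sum_{k \in s_j} y_c(k)}{rk(\hat{\mathbf{y}}_c)_j} \tag{6}$$
with $s_j$ the set of indices of the elements of $\hat{\mathbf{y}}_c$ larger than $\hat{y}_c(j)$.
Minimizing $rk(\hat{\mathbf{y}}_c)_j$ for all $j$ from class $c$ (i.e., those verifying $y_c(j) = 1$) will be used as a surrogate of the maximization of the AP over the predictor’s parameters $\Theta_A$.
The mAP is obtained by averaging AP over the classes. Replacing the rank function by its differentiable proxy $f_{\Theta_B}$, the proposed mAP-based loss reads:
$$\mathcal{L}_{mAP}(\Theta_A, \Theta_B) = \sum_{c=1}^{C} \langle f_{\Theta_B}(\hat{\mathbf{y}}_c), \mathbf{y}_c \rangle \tag{7}$$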
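To make the surrogate concrete, the sketch below computes the usual average precision for one class (counting each positive item itself among the retrieved positives) and the rank-based surrogate, i.e. the sum of the ranks of the positive items. It is a hedged illustration with hypothetical function names; exact ranks stand in for the output of the sorter proxy.

```python
def average_precision(scores, labels):
    """Standard AP for one class: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for position, index in enumerate(order, start=1):
        if labels[index]:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def map_rank_loss(ranks_per_class, labels_per_class):
    """Sum over classes of the inner product between ranks and binary labels."""
    return sum(
        sum(r * y for r, y in zip(ranks, labels))
        for ranks, labels in zip(ranks_per_class, labels_per_class)
    )


scores = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
print(average_precision(scores, labels))        # (1/1 + 2/3) / 2
print(map_rank_loss([[1, 4, 2, 3]], [labels]))  # 1 + 3 = 4
```

The surrogate reaches its minimum (here 1 + 2 = 3) exactly when all positives are ranked first, which is also when AP equals 1.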
3.2.3 Recall at $K$
Recall at rank $K$ is often used to evaluate retrieval tasks. In the following we assume a training set for the task at hand. A group $\mathcal{B}$ of $d$ training examples among them yields a prediction matrix $\hat{Y} \in \mathbb{R}^{d \times d}$ representing the scores of all pairwise combinations of training examples in $\mathcal{B}$. In other words, the $i$-th column of this matrix, $\hat{Y}[:, i]$, provides the relevance of the other vectors in the group w.r.t. query $i$.
This matrix being given, recall at $K$ is defined as:
$$R@K = \frac{1}{d} \sum_{i=1}^{d} \mathbb{1}\left[ rk(\hat{Y}[:, i])_p \le K \right] \tag{8}$$
with $p$ the index of the unique positive entry in $\hat{Y}[:, i]$, a single relevant item being assumed for query $i$.
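Recall at a given cutoff, with a single relevant item per query, can be sketched directly from a score matrix. The example is illustrative: the function name, the matrix layout (`score_matrix[j][i]` is the score of item `j` for query `i`) and the toy values are assumptions of this sketch.

```python
def recall_at_k(score_matrix, positives, k):
    """Fraction of queries whose single relevant item is ranked within the top k.

    score_matrix[j][i]: relevance of item j for query i (column i = query i).
    positives[i]: index of the unique relevant item for query i.
    """
    d = len(positives)
    hits = 0
    for i in range(d):
        column = [row[i] for row in score_matrix]
        p = positives[i]
        # Rank of the positive item: 1 + number of items with a larger score.
        rank_p = 1 + sum(1 for s in column if s > column[p])
        hits += rank_p <= k
    return hits / d


scores = [
    [0.9, 0.2, 0.4],
    [0.1, 0.3, 0.8],
    [0.3, 0.7, 0.5],
]
positives = [0, 1, 2]  # the relevant item for query i sits on the diagonal
print(recall_at_k(scores, positives, 1))  # 1/3: only query 0 ranks its positive first
print(recall_at_k(scores, positives, 2))  # 1.0
```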
Once again, our sorter enables a differentiable implementation of this measure. However, we could not obtain conclusive results yet, possibly due to the batch size limiting the range of the summation. We found, however, an alternative way to leverage our sorting network. It is based on the use of the “triplet loss”, a popular surrogate for recall. We propose to apply this loss on ranks instead of similarity scores, making it only dependent on the ordering of the retrieved elements. The triplet loss on the rank can be expressed as follows:
$$\mathcal{L}(\hat{Y}[:, i], p, n) = \max\left(0,\ \alpha + rk(\hat{Y}[:, i])_p - rk(\hat{Y}[:, i])_n\right) \tag{9}$$
where $p$ is defined as above (the positive example in the triplet, given anchor query $i$) and $n$ is the index of a negative (irrelevant) example for this query. The goal is to minimize the rank of the positive pair with score $\hat{Y}[p, i]$ such that its rank is lower than the rank of the negative pair with score $\hat{Y}[n, i]$ by a margin of $\alpha$.
The complete loss is then expressed over all the elements of $\mathcal{B}$ in its hard negative version, with the rank function replaced by the differentiable proxy $f_{\Theta_B}$, as:
$$\mathcal{L}_{REC}(\Theta_A, \Theta_B) = \sum_{i=1}^{d} \max_{n \neq p} \max\left(0,\ \alpha + f_{\Theta_B}(\hat{Y}[:, i])_p - f_{\Theta_B}(\hat{Y}[:, i])_n\right) \tag{10}$$
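The hard-negative triplet loss on ranks can be illustrated on small hand-written rank vectors. This is a sketch under assumptions: the function names are hypothetical, the margin value of 1.0 is chosen only for the example, and the rank vectors would come from the sorter proxy during training.

```python
def triplet_rank_loss(ranks, p, n, margin):
    """Hinge on ranks: zero once the positive is ranked `margin` ahead of the negative."""
    return max(0.0, margin + ranks[p] - ranks[n])


def hard_negative_rank_loss(rank_columns, positives, margin):
    """Sum over queries of the triplet loss for the hardest negative.

    rank_columns[i]: ranks of all items for query i (1 = most relevant).
    positives[i]: index of the relevant item for query i.
    """
    total = 0.0
    for ranks, p in zip(rank_columns, positives):
        total += max(
            triplet_rank_loss(ranks, p, n, margin)
            for n in range(len(ranks))
            if n != p
        )
    return total


# Query 0 ranks its positive first; query 1 ranks a negative ahead of its positive.
rank_columns = [[1, 3, 2], [3, 2, 1]]
positives = [0, 1]
print(hard_negative_rank_loss(rank_columns, positives, margin=1.0))  # 0.0 + 2.0 = 2.0
```

The hardest negative is the one with the lowest rank, i.e. the irrelevant item retrieved closest to the top.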
4 Experiments
We present in this section several experiments to evaluate our approach. We first detail the way we train our differentiable sorter deep block using only synthetic data. We also present a comparison between the different models based on CNNs and on LSTM recurrent nets and with our baseline inspired from pairwise comparisons. We then evaluate SoDeep combined with deep scoring functions $f_{\Theta_A}$. The loss functions expressed in (4), (7) and (10) are applied to three different tasks: memorability prediction, crossmodal retrieval, and object recognition.
4.1 SoDeep Training and Analysis
4.1.1 Training
The proposed SoDeep models based on the bidirectional LSTM and on CNNs are trained on synthetic pairs of scores and ranks generated on the fly according to the distributions defined in Section 3.1.2.
For convenience, we define an epoch as going through 100 000 pairs. The training is done using the Adam optimizer [24] with a learning rate that is halved every 100 epochs. Minibatches of size 512 are used. The model is trained until the loss values stop decreasing and are stable.
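The step schedule described above can be written as a one-line function. This is a sketch only: the function name is hypothetical, and the initial learning rate used in the example (1e-3) is a placeholder assumption, not a value taken from the text.

```python
def learning_rate(epoch, initial_lr, halve_every=100):
    """Step schedule: the learning rate is divided by 2 every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)


# Placeholder initial value, chosen only for illustration.
for epoch in (0, 99, 100, 250):
    print(epoch, learning_rate(epoch, initial_lr=1e-3))
```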
4.1.2 A handcrafted sorting baseline
We add to our trainable SoDeep blocks a baseline that does not require any training.
Inspired by the representation of the ranking problem as a matrix of pairwise ordering in [40], we build a handcrafted differentiable sorter using pairwise comparisons.
A sigmoid function $\sigma_{comp}$ parametrized with scalar $\lambda$ is used as a binary comparison function between two scalars $a$ and $b$ as:
$$\sigma_{comp}(a, b) = \frac{1}{1 + e^{-\lambda (b - a)}} \tag{11}$$
Indeed, if $a$ and $b$ are separated by a sufficient margin, $\sigma_{comp}(a, b)$ will be either $0$ or $1$. The parameter $\lambda$ is used to control the precision of the comparator.
This function may be used to approximate the relative rank of two components $y_i$ and $y_j$ in a vector $\mathbf{y}$: $\sigma_{comp}(y_i, y_j)$ will be close to $1$ if $y_i$ is (significantly) smaller than $y_j$, 0 otherwise. By summing up the result of the comparison between $y_i$ and all the other elements of the vector $\mathbf{y}$, we form our ranking function $f_h$. More precisely, the rank for the $i$-th element of $\mathbf{y}$ is expressed as follows:
$$f_h(\mathbf{y}, i) = \sum_{j \neq i} \sigma_{comp}(y_i, y_j) \tag{12}$$
The overall precision of the handcrafted sorter can be controlled by the hyperparameter $\lambda$. The value of $\lambda$ is a trade-off between the precision of the predicted rank and the efficiency when backpropagating through the sorter. Further experiments use a single fixed value of $\lambda$.
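The handcrafted sorter is simple enough to be sketched in a few lines. The code below is an illustration with hypothetical function names: it assumes the comparator is a sigmoid of the scaled difference between the two scalars, and the sharpness values 50 and 1 are arbitrary choices for the example. The sum counts the elements larger than the current one, so it approximates a zero-based rank in decreasing order.

```python
import math


def sigma_comp(a, b, lam):
    """Soft comparison: close to 1 when a is clearly smaller than b, else close to 0.

    For very large lam * (a - b) the exponential can overflow; the values used
    here stay far from that regime.
    """
    return 1.0 / (1.0 + math.exp(-lam * (b - a)))


def handcrafted_ranks(scores, lam):
    """Soft rank of each element: soft count of the other elements that are larger."""
    return [
        sum(sigma_comp(yi, yj, lam) for j, yj in enumerate(scores) if j != i)
        for i, yi in enumerate(scores)
    ]


scores = [0.3, -0.8, 0.9, 0.1]
sharp = handcrafted_ranks(scores, lam=50.0)   # close to the integers 1, 3, 0, 2
smooth = handcrafted_ranks(scores, lam=1.0)   # much smoother, less precise
print([round(r, 3) for r in sharp])
print([round(r, 3) for r in smooth])
```

A sharper comparator gives ranks closer to the exact ones but flatter gradients away from the decision boundaries, which is the trade-off mentioned above.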
4.1.3 Results
Table 1 contains the loss values of the two different trained sorters and the handcrafted one on a generated test set of 10 000 samples. The LSTM-based sorter achieves the lowest loss, outperforming the CNN and the handcrafted sorters.
Table 1: L1 loss of the sorters on the synthetic test set.

Sorter model          L1 loss
Handcrafted sorter    0.0350
CNN sorter            0.0120
LSTM sorter           0.0033
The performance of the CNN sorter, slightly below that of the LSTM-based one, can be explained by the local behaviour of CNNs, which require a more complex structure to be able to rank elements.
In Figure 4 we compare CNN sorters with respect to their number of layers. From these results, we choose to use 8 layers in our CNN sorter since the performance seems to saturate once this depth has been reached. A possible explanation of this saturation is that the relation between the depth of the network and the input dimension $d$ is logarithmic.
4.1.4 Further analysis
The ranking function, being noncontinuous, is nondifferentiable: the rank value jumps from one discrete value to another. We design an experiment to visualize how the different types of sorter behave at these discontinuities. Starting from a uniformly sampled vector of raw scores in the range $[-1, 1]$, we compute the ground-truth rank and the predicted rank of the first element while varying this element from $-1$ to $1$ in increments of 0.001. The plot of the predicted ranks can be found in Fig. 5. The blue curve corresponds to the ground-truth rank where noncontinuous steps are visible, whereas the curves for the learned sorters (orange and green) are a smooth approximation of the ground-truth curve.
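The sweep experiment can be reproduced in spirit with a few lines of code. The sketch below is only illustrative: since the learned sorters are not available here, a sigmoid-based pairwise soft rank is used as a stand-in for a smooth sorter, and the vector length (10), the sharpness (10) and the random seed are arbitrary assumptions.

```python
import math
import random


def exact_rank_first(scores):
    """Number of other elements larger than the first one (zero-based rank)."""
    return sum(1 for s in scores[1:] if s > scores[0])


def soft_rank_first(scores, lam):
    """Smooth counterpart: sum of sigmoid comparisons with the first element."""
    return sum(1.0 / (1.0 + math.exp(-lam * (s - scores[0]))) for s in scores[1:])


rng = random.Random(0)
others = [rng.uniform(-1.0, 1.0) for _ in range(9)]

# Vary the first element from -1 to 1 in increments of 0.001.
sweep = [k / 1000 for k in range(-1000, 1001)]
exact = [exact_rank_first([v] + others) for v in sweep]
soft = [soft_rank_first([v] + others, lam=10.0) for v in sweep]

largest_exact_jump = max(abs(b - a) for a, b in zip(exact, exact[1:]))
largest_soft_jump = max(abs(b - a) for a, b in zip(soft, soft[1:]))
print(largest_exact_jump, largest_soft_jump)
```

The exact rank changes by whole units at each crossing, while the smooth version changes by a small amount at every step, which is what makes gradient-based training possible.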
In Fig. 6 we compare our SoDeep against previous approaches optimizing structured hinge upper bound to the mAP loss. We followed the protocol described in [36] for their synthetic data experiments. Our sorters using the loss defined in (7) are compared to a reimplementation of the HingeAP loss proposed in [21]. The results in Fig. 6 show that our approach with the LSTM sorter (blue curve) gets mAP scores similar to [21] (purple curve) while being generic and less complex.
Among the learned sorters, the LSTM architecture is the one performing best on synthetic data (Tab. 1). In addition, its simple design and small number of hyperparameters make it straightforward to train. The CNN architecture, while not being as accurate, uses a smaller number of weights and is 1.7 times faster. Further experiments will use the LSTM sorter unless specified otherwise.
4.2 Differentiable Sorter based loss functions
Our method is benchmarked on three tasks. Each one of these tasks focuses on a different rank based loss function. Crossmodal retrieval will be used to test recall evaluation metrics, memorability prediction will be used for Spearman correlation and image classification will be used for mean average precision.
As explained in Section 3.1.2, a shift in distribution might appear when using a sorter-based loss. To prevent this, a parallel loss can be used to help domain alignment. This loss can be used only to stabilize the initialization or kept for the whole training.
4.2.1 Spearman Correlation: Predicting Media Memorability
The media memorability prediction task [5] is used to test the differentiable sorter with respect to the Spearman correlation metric. Examples of elements of the dataset can be found in Fig. 7. Given a 7-second video, the task consists in predicting the short-term memorability score. The memorability score reflects the probability of a video being remembered.
The task is originally on video memorability. However, the models used here are pretrained on images; therefore 7 frames are extracted from each video and are associated with the memorability score of the source video. The training is done on pairs of frame and memorability score. During testing, the predicted scores of the 7 frames of a video are averaged to obtain the score per video. The dataset contains 8000 videos (56000 frames) for training and 2000 videos for testing. This training set is complemented with the LaMem dataset [23], adding 60 000 (image, memorability) pairs to the training data.
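The test-time aggregation from frame scores to video scores amounts to a grouped average. A minimal sketch, with hypothetical function and identifier names and made-up score values:

```python
from collections import defaultdict


def video_scores(frame_predictions):
    """Average the per-frame predictions of each video.

    frame_predictions: iterable of (video_id, predicted_score) pairs.
    """
    grouped = defaultdict(list)
    for video_id, score in frame_predictions:
        grouped[video_id].append(score)
    return {video_id: sum(s) / len(s) for video_id, s in grouped.items()}


# Toy values for two hypothetical videos (fewer frames than in the real setup).
predictions = [("video_a", 0.5), ("video_a", 0.75), ("video_b", 0.25), ("video_b", 0.25)]
print(video_scores(predictions))  # {'video_a': 0.625, 'video_b': 0.25}
```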
Table 2: Media memorability prediction results (Spearman correlation on the test set).

Single model             Spear. cor. test
Baseline [6]             46.0
Image only [17]          48.8
R34 + MSE loss           44.2
R34 + SoDeep loss        46.6
SemEmb + MSE loss        48.6
SemEmb + SoDeep loss     49.4
Architectures and training
The regression model consists of a feature extractor combined with a two-layer MLP [33] regressing features to a single memorability score. We use two pretrained nets to extract features: the Resnet-34 [18] and the semantic embedding model of [10] (as in the next section).
We use the loss defined in (4) to learn the memorability model. The training is done in two steps. First, for 15 epochs only the MLP layers are trained while the weights of the feature extractor are kept frozen. Second, the whole model is fine-tuned. The Adam optimizer [24] is used with a learning rate that is halved every 3 epochs. To help with domain adaptation, our loss is combined with an L1 loss for the first epoch.
Results
In Tab. 2, we compare the impact of the learned loss over two architectures. For both models we define a baseline using an L2 (MSE) loss. On both architectures the proposed loss function achieves a higher Spearman correlation, by 2.4 points on the Resnet model and 0.8 points on the semantic embedding model. These are state-of-the-art results on the task, with an absolute gain of 0.6 pt. The model is almost on par (0.3 pt) with an ensemble method proposed by [17] that uses additional textual data.
Sorter comparison
The memorability prediction task is also used to compare the different types of sorters presented so far. Fixing the model and the hyperparameters, 4 models are trained with 4 different types of loss. The losses based on the LSTM sorter, the CNN sorter and the handcrafted sorter obtain a Spearman correlation of 49.4, 46.6 and 45.7 respectively, and the L1 loss gives a correlation of 46.2. These results are consistent with the results on synthetic data, with the LSTM sorter performing the best, followed by the CNN and handcrafted ones.
Table 3: Crossmodal retrieval results on MSCOCO.

                          caption retrieval             image retrieval
Model                     R@1   R@5   R@10  Med. r      R@1   R@5   R@10  Med. r
Emb. network [37]         54.9  84.0  92.2  -           43.3  76.4  87.5  -
DSVE-Loc [10]             69.8  91.9  96.6  1           55.9  86.9  94.0  1
GXN (i2t+t2i) [16]        68.5  -     97.9  1           56.6  -     94.5  1
DSVE-Loc + SoDeep loss    71.5  92.8  97.1  1           56.2  87.0  94.3  1
4.2.2 Mean Average precision: Image classification
The VOC 2007 [11] object recognition challenge is used to evaluate our sorter on a task using the mean average precision metric. We use an off-the-shelf model [9]. This model is a fully convolutional network, combining a Resnet-101 [18] with advanced spatial aggregation mechanisms.
To evaluate the loss defined in (7), two versions of the model are trained: a baseline using only the multilabel soft margin loss, and another model trained using the multilabel soft margin loss combined with the loss (7).
Rows 3 and 4 of Tab. 4 show the results obtained by the two previously described models. Both models are below the state of the art; however, the use of the rank loss is beneficial and improves the mAP by 0.8 pt compared to the model using only the soft margin loss.
Table 4: Object recognition results on VOC 2007.

Model                      mAP
VGG 16 [35]                89.3
WILDCAT [9]                95.0
WILDCAT*                   93.2
WILDCAT* + SoDeep loss     94.0
4.2.3 Recall@K: Crossmodal Retrieval
The last benchmark used to evaluate the differentiable sorter is the crossmodal retrieval. Starting from images annotated with text, we train a model producing rich features for both image and text that live in the same embedding space. Similarity in the embedding space is then used to evaluate the quality of the model on the crossmodal retrieval task.
Our approach is evaluated on the MSCOCO dataset [28] using the rVal split proposed in [22]. The dataset contains 110k images for training, 5k for validation and 5k for testing. Each image is annotated with 5 captions.
Given a query image (resp. a caption), the aim is to retrieve the corresponding captions (resp. image). Since MSCOCO contains 5 captions per image, recall at $K$ (“R@K”) for caption retrieval is computed based on whether at least one of the correct captions is among the first $K$ retrieved ones. The task is performed 5 times on 1000-image subsets of the test set and the results are averaged.
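The caption-retrieval variant of the metric, where a query counts as a hit if any of its correct captions is retrieved within the cutoff, can be sketched as follows. The function name and the data layout are assumptions of this sketch: captions of image `i` are taken to occupy the contiguous indices `i*c` to `i*c + c - 1`, and the toy example uses 2 captions per image instead of 5.

```python
def caption_recall_at_k(similarity, captions_per_image, k):
    """Recall at k for caption retrieval with several correct captions per image.

    similarity[i][j]: score between image i and caption j.
    Assumed layout: captions of image i have indices i*c .. i*c + c - 1.
    """
    c = captions_per_image
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        correct = range(i * c, (i + 1) * c)
        # A query is a hit if at least one correct caption is in the top k.
        hits += any(j in correct for j in top_k)
    return hits / len(similarity)


# Two images, two captions each (toy values).
similarity = [
    [0.1, 0.8, 0.5, 0.3],  # image 0: correct captions are 0 and 1
    [0.9, 0.2, 0.4, 0.6],  # image 1: correct captions are 2 and 3
]
print(caption_recall_at_k(similarity, captions_per_image=2, k=1))  # 0.5
print(caption_recall_at_k(similarity, captions_per_image=2, k=2))  # 1.0
```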
We use an off-the-shelf model [10]. It is a two-path multimodal embedding approach that leverages the latest neural network architecture. The visual pipeline is based on a Resnet-152 and is fully convolutional. The textual pipeline is trained from scratch and uses a Simple Recurrent Unit (SRU) [27] to encode sentences. The model is trained using the loss defined in (10) instead of the triplet-based loss.
Crossmodal retrieval results can be found in Tab. 3. The model trained using the proposed loss function (DSVE-Loc + SoDeep loss) outperforms the similar architecture DSVE-Loc trained with the triplet margin-based loss by (1.7%, 0.9%, 0.5%) on (R@1, R@5, R@10) in absolute terms for caption retrieval, and by (0.3%, 0.1%, 0.3%) for image retrieval. It obtains state-of-the-art performance on caption retrieval and is very competitive on image retrieval, being almost on par with the GXN [16] model, which has a much more complex architecture. It is important to note that the proposed loss function could be beneficial for any type of architecture.
5 Conclusion
We have presented SoDeep, a novel method that leverages the expressivity of recent architectures to learn differentiable surrogate functions. Based on a direct deep network modeling of the sorting operation, such a surrogate allows us to train, in an end-to-end manner, models on a diversity of tasks that are traditionally evaluated with rank-based metrics. Remarkably, this deep proxy to estimate the rank comes at virtually no cost since it is easily trained on purely synthetic data.
Our experiments show that the proposed approach achieves very good performance on crossmodal retrieval tasks as well as on media memorability prediction and multilabel image classification. These experiments demonstrate the potential and the versatility of SoDeep. This approach allows the design of training losses that are closer than before to metrics of interest, which opens up a wide range of other applications in the future.
References
 [1] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
 [2] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. In ICML, 2007.
 [3] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for nonsmooth ranking losses. In ACM SIGKDD, 2008.
 [4] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. J. Machine Learning Research, 11:1109–1135, 2010.
 [5] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. MediaEval 2018: Predicting media memorability task. arXiv preprint arXiv:1807.01052, 2018.
 [6] Romain Cohendet, Claire-Hélène Demarty, and Ngoc Q. K. Duong. Transfer learning for video memorability prediction. In MediaEval Workshop, 2018.
 [7] Yadolah Dodge. The concise encyclopedia of statistics. Springer Science & Business Media, 2008.
 [8] Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, and César A Hidalgo. Deep learning the city: Quantifying urban perception at a global scale. In ECCV, 2016.
 [9] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, 2017.
 [10] Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. Finding beans in burgers: Deep semantic-visual embedding with localization. In CVPR, 2018.
 [11] Mark Everingham and J Winn. The PASCAL visual object classes challenge 2007 development kit. Technical report, 2007.
 [12] Fartash Faghri, David Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.
 [13] Basura Fernando, Efstratios Gavves, Damien Muselet, and Tinne Tuytelaars. Learning to rank based on subsequences. In ICCV, 2015.
 [14] Edward H Friend. Sorting on electronic computer systems. JACM, 1956.
 [15] Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
 [16] Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
 [17] Rohit Gupta and Kush Motwani. Linear models for video memorability prediction using visual and semantic features. In MediaEval Workshop, 2018.
 [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [19] Alan Herschtal and Bhavani Raskutti. Optimising area under the ROC curve using gradient descent. In ICML, 2004.
 [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [21] Thorsten Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD, 2002.
 [22] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
 [23] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. Understanding and predicting image memorability at a large scale. In ICCV, 2015.
 [24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [25] Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
 [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [27] Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.
 [28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [29] David Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
 [30] Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. Multimodal convolutional neural networks for matching image and sentence. In ICCV, 2015.
 [31] Pritish Mohapatra, Michal Rolinek, CV Jawahar, Vladimir Kolmogorov, and M Kumar. Efficient optimization for rank-based loss functions. In CVPR, 2018.
 [32] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
 [33] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 1958.
 [34] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 1997.
 [35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [36] Yang Song, Alexander Schwing, and Raquel Urtasun. Training deep neural networks via direct loss minimization. In ICML, 2016.
 [37] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Analysis and Machine Intell., 41(2):394–407, 2018.
 [38] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. J. Machine Learning Research, 2009.
 [39] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2003.
 [40] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In ACM SIGIR, 2007.