Metric learning: cross-entropy vs. pairwise losses
Abstract
Recently, substantial research efforts in Deep Metric Learning (DML) have focused on designing complex pairwise-distance losses and convoluted sample-mining and implementation strategies to ease optimization.
The standard cross-entropy loss for classification has been largely overlooked in DML.
On the surface, the cross-entropy may seem unrelated and irrelevant to metric learning, as it does not explicitly involve pairwise distances.
However, we provide a theoretical analysis that links the cross-entropy to several well-known and recent pairwise losses.
Our connections are drawn from two different perspectives: one based on an explicit optimization insight; the other on discriminative and generative views of the mutual information between the labels and the learned features.
First, we explicitly demonstrate that the cross-entropy is an upper bound on a new pairwise loss, which has a structure similar to various pairwise losses: it minimizes intra-class distances while maximizing inter-class distances.
As a result, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss.
Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the mutual information, to which we connect several well-known pairwise losses.
These findings indicate that the cross-entropy represents a proxy for maximizing the mutual information – as pairwise losses do – without the need for complex sample-mining and optimization schemes.
Furthermore, we show that various standard pairwise losses can be explicitly related to one another via bound relationships.
Our experiments over four standard DML benchmarks strongly support our findings.
Keywords:
Metric Learning, Deep Learning, Information Theory

1 Introduction
The core task of metric learning consists in learning a metric from high-dimensional data, such that the distance between two points, as measured by this metric, reflects their semantic similarity. Such a task is of crucial importance in several applications, including image retrieval, zero-shot learning and person re-identification, among others. The first approaches to tackle this problem attempted to learn metrics directly on the input space [14]. Later, the idea of learning a suitable embedding was introduced, and substantial effort went into learning Mahalanobis distances [37, 22, 4, 34, 1]. This corresponds to learning the best linear projection of the input space onto a lower-dimensional manifold, followed by using the Euclidean distance as a metric. Building on these embedding-learning ideas, several papers proposed to learn more complex mappings, either by kernelizing existing linear algorithms [1], or simply by using more complex hypotheses, such as linear combinations of gradient-boosted regression trees [9].
The recent success of deep neural networks at learning complex, highly non-linear mappings of high-dimensional data aligns naturally with the problem of learning a suitable embedding. Similarly to work carried out in the context of Mahalanobis distance learning, most Deep Metric Learning (DML) approaches are based on pairwise distances. More specifically, the current paradigm is to learn a deep encoder that maps pairs of points with high semantic similarity close together in the embedded space, as measured by well-established distances (e.g. Euclidean or cosine). This paradigm concretely translates into pairwise losses that encourage small distances for pairs of samples from the same class and penalize small distances for pairs of samples from different classes. While the formulations seem rather intuitive, the practical implementation and optimization of such pairwise losses can become cumbersome. Randomly assembling pairs of samples results in either slow convergence or degenerate solutions [7]. Hence, most research efforts in DML have focused on finding efficient ways to reformulate, generalize and/or improve sample-mining and sample-weighting strategies over the existing pairwise losses. Popular pairwise losses include the triplet loss and its derivatives [7, 23, 24, 43, 3], the contrastive loss and its derivatives [5, 33], and Neighborhood Component Analysis and its derivatives [4, 15, 36], among others. However, such modifications are often heuristic-based, and come at the price of increased complexity and additional hyper-parameters, impeding the potential applicability of these methods to real-world problems.
Admittedly, the objective of learning a useful embedding of data points intuitively aligns with the idea of directly acting on the distances between pairs of points in the embedded space, by minimizing intra-class distances and maximizing inter-class distances. The standard cross-entropy loss, widely used in classification tasks, has therefore been largely overlooked by the DML community, most likely due to its apparent irrelevance for metric learning [35]. As a matter of fact, why would anyone use a pointwise prediction loss to enforce pairwise-distance properties on the feature space? Even though the cross-entropy was shown to be competitive for face-recognition applications [12, 30, 29], to the best of our knowledge, only one paper observed empirically competitive results for a normalized, temperature-weighted version of the cross-entropy in the more general context of deep metric learning [42]. However, the authors did not provide any theoretical insights for these results. Moreover, as we will show later, the feature and weight normalization and the bias removal in [42] directly simplify the altered cross-entropy of [42] into a loss very similar to the well-known center loss [35].
In fact, even though, on the surface, the standard cross-entropy loss may seem completely unrelated to the pairwise losses used in DML, we provide theoretical justifications that directly connect the cross-entropy to several well-known and recent pairwise losses. Our connections are drawn from two different perspectives, one based on an explicit optimization insight and the other on mutual-information arguments. We show that four of the most prominent pairwise metric-learning losses, as well as the standard cross-entropy, essentially maximize a common underlying objective: the Mutual Information (MI) between the learned embeddings and the corresponding samples' labels. As sketched in Section 2, this connection can be intuitively understood by writing this MI in two different, but equivalent, ways. Specifically, we establish tight links between pairwise losses and the generative view of this MI. We study the particular case of the contrastive loss [5], explicitly showing its relation to this MI. We further demonstrate that other DML losses have tight relations to the contrastive loss, so that the reasoning applied to this specific example generalizes to the other DML losses. In fact, various standard pairwise losses can be explicitly related to one another via bound relationships. As for the cross-entropy, we first demonstrate that it is an upper bound on a new, underlying pairwise loss with a structure similar to various existing pairwise losses, to which the previous reasoning applies. As a result, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss, thereby implicitly minimizing intra-class distances and maximizing inter-class distances. Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the discriminative view of the mutual information.
Our findings indicate that the cross-entropy represents a proxy for maximizing the mutual information, as pairwise losses do, without the need for complex sample-mining and optimization schemes. Our comprehensive experiments over four standard DML benchmarks (CUB200, Cars196, Stanford Online Products and In-Shop) strongly support our findings. We consistently obtained state-of-the-art results, outperforming many recent and complex DML methods.
Summary of contributions

- Establishing relations between several well-known pairwise DML losses and a generative view of the mutual information between the learned features and the labels.
- Proving explicitly that minimizing the standard cross-entropy corresponds to an approximate bound-optimization of an underlying pairwise loss.
- More generally, showing that minimizing the standard cross-entropy loss is equivalent to maximizing a discriminative view of the mutual information between the features and the labels.
- Demonstrating state-of-the-art results with the cross-entropy on several DML benchmark datasets.
2 On the two views of the mutual information
Table 1: Notation.

General
  D = {(x_i, y_i)}_{i=1}^N : labeled dataset
  X : input feature space
  Z : embedded feature space
  Y : label/prediction space
  D_ij = ||z_i - z_j||_2 : Euclidean distance
  D_ij^cos : cosine distance

Model
  Encoder: φ_W : X → Z, with weights W; z_i = φ_W(x_i)
  Soft classifier: softmax predictions with weights θ = {θ_k}

Random variables (RVs)
  Data: X, Y
  Embedding: Z = φ_W(X)
  Prediction: Ŷ

Information measures
  H(X) : entropy of X
  H(X|Y) : conditional entropy of X given Y
  H(X; Y) : cross entropy (CE) between X and Y
  H(X; Y|Z) : conditional CE given Z
  I(X; Y) : mutual information between X and Y
The Mutual Information (MI) is a well-known measure designed to quantify the amount of information shared by two random variables. Its formal definition is presented in Table 1. Throughout this work, we will be particularly interested in I(Z; Y), which represents the MI between the learned features Z and the labels Y. Due to its symmetry, the MI can be written in two ways, which we will refer to as the discriminative view and the generative view of the MI:
(1)  I(Z; Y) = H(Z) - H(Z|Y) = H(Y) - H(Y|Z)
While analytically equivalent, these two views offer different, complementary interpretations. In order to maximize I(Z; Y), the discriminative view conveys that the labels should be balanced (out of our control) and easily identified from the features. On the other hand, the generative view conveys that the learned features should overall spread as much as possible in the feature space, while keeping samples that share the same class close together. Hence, the discriminative view is focused on label identification, while the generative view focuses on explicitly shaping the distribution of the features learned by the model. The MI therefore allows us to draw links between classification losses (e.g. the cross-entropy) and feature-shaping losses (including all the well-known pairwise metric-learning losses).
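The equality of the two views is easy to verify numerically on a small discrete example (a hypothetical joint distribution; the entropies follow the definitions of Table 1):

```python
import numpy as np

# A small, hypothetical joint distribution p(z, y) over a discretized
# feature Z (rows) and a label Y (columns).
p_zy = np.array([[0.20, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.45]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_z = entropy(p_zy.sum(axis=1))         # H(Z)
H_y = entropy(p_zy.sum(axis=0))         # H(Y)
H_zy = entropy(p_zy.ravel())            # joint entropy H(Z, Y)

mi_generative = H_z - (H_zy - H_y)      # H(Z) - H(Z|Y)
mi_discriminative = H_y - (H_zy - H_z)  # H(Y) - H(Y|Z)

assert np.isclose(mi_generative, mi_discriminative)
assert mi_generative > 0                # Z and Y are not independent here
```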
3 Pairwise losses and the generative view of the MI
In this section, we study four pairwise losses widely used in the DML community: the center loss [35], the contrastive loss [5], the Scalable Neighbor Component Analysis (SNCA) loss [36] and the Multi-Similarity (MS) loss [33]. We show that these losses can be interpreted as proxies for maximizing the generative view of the mutual information I(Z; Y). We begin by analyzing the specific example of the contrastive loss, establishing its tight link to the MI, and then generalize our analysis to the other pairwise losses (see Table 2). Furthermore, we show that all these pairwise metric-learning losses can be explicitly related to one another via bound relationships.
3.1 The example of contrastive loss
We start by analyzing the representative example of the contrastive loss [5]. For a given margin m > 0, this loss is formulated as:

(2)  L_contrast = Σ_{i,j : y_i = y_j} D_ij² + Σ_{i,j : y_i ≠ y_j} [m - D_ij²]_+
This loss naturally breaks down into two terms: a tightness part T_contrast (the first sum) and a contrastive part C_contrast (the second sum). The tightness part encourages samples from the same class to be close to each other and form tight clusters. As for the contrastive part, it forces samples from different classes to stand far apart from one another in the embedded feature space. Let us analyze these two terms in detail from a mutual-information perspective.
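As a minimal sketch, the two parts can be computed explicitly. We use one common convention of the margin-based contrastive loss (hinge [m - D²]_+ on negative pairs); exact forms vary across papers:

```python
import numpy as np

def contrastive_parts(z, y, margin=1.0):
    """Tightness and contrastive parts of a margin-based contrastive loss
    (one common convention; exact forms vary across papers)."""
    tight, contr = 0.0, 0.0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((z[i] - z[j]) ** 2)     # squared Euclidean distance
            if y[i] == y[j]:
                tight += d2                      # pull same-class pairs together
            else:
                contr += max(margin - d2, 0.0)   # push different classes apart

    return tight, contr

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
tight, contr = contrastive_parts(z, y)

# A trivial encoder that collapses every point zeroes the tightness part
# but pays the full margin on each of the 4*4 negative pairs.
t0, c0 = contrastive_parts(np.zeros((8, 4)), y)
assert t0 == 0.0 and np.isclose(c0, 16.0)
```

The collapsed-encoder check illustrates why the contrastive part is needed: tightness alone is minimized by mapping everything to a single point.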
As shown in the next subsection, the tightness part of the contrastive loss is equivalent to the tightness part of the center loss [35]: T_contrast ≍ T_center = Σ_i ||z_i - c_{y_i}||², where c_k denotes the mean of the feature points of class k in the embedding space and the symbol ≍ denotes equality up to a multiplicative and/or additive constant. Written in this way, T_center can be interpreted as a conditional cross-entropy between Z and another random variable Z̄, whose conditional distribution given Y = k is a standard Gaussian centered at c_k, i.e. Z̄ | (Y = k) ~ N(c_k, I):

(3)  T_center ≍ H(Z; Z̄ | Y)
As such, T_center is an upper bound on the conditional entropy H(Z|Y) that appears in the mutual information:

(4)  T_center ≍ H(Z; Z̄ | Y) ≥ H(Z|Y)
This bound is tight when Z | (Y = k) is itself a standard Gaussian centered at c_k. Hence, minimizing T_center can be seen as minimizing H(Z|Y), which precisely encourages the encoder to produce low-entropy (i.e. compact) clusters in the feature space for each given class. Notice that using this term alone inevitably leads to a trivial encoder that maps all data points in X to a single point in the embedded space Z, thereby achieving a global optimum.
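The equivalence between the two tightness terms rests on a simple identity: within each class k, the sum of squared distances over ordered intra-class pairs equals 2 n_k times the center-loss sum of squared distances to the class mean. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(10, 3))
y = np.array([0] * 5 + [1] * 5)

for k in (0, 1):
    zk = z[y == k]
    n_k = len(zk)
    c_k = zk.mean(axis=0)                # class center
    # Sum of squared distances over ordered intra-class pairs.
    pairwise = sum(np.sum((zi - zj) ** 2) for zi in zk for zj in zk)
    # Center-loss-style tightness: squared distances to the class mean.
    center = np.sum((zk - c_k) ** 2)
    assert np.isclose(pairwise, 2 * n_k * center)
```

This is why, up to multiplicative constants (exact under balanced classes), the contrastive tightness and the center-loss tightness coincide.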
To prevent such a trivial solution, a second term needs to be added. This second term – which we refer to as the contrastive term – is designed to push each point away from points that have a different label. Assuming a margin large enough that m ≥ D_ij² for all pairs, we can linearize the contrastive term:

(5)  C_contrast ≍ -Σ_{i ≠ j} D_ij² + Σ_{i,j : y_i = y_j} D_ij²
While the second term in Eq. 5 is redundant with the tightness objective, the first term is close to (the negative of) the differential entropy estimator proposed in [32]:

(6)  Ĥ(Z) ∝ (1 / (N(N-1))) Σ_{i ≠ j} log D_ij
Both terms measure the spread of Z, even though they exhibit different dynamics. For instance, the presence of the log in Eq. 6 can cause high gradients for distances close to 0, but yields more robustness to outliers. All in all, minimizing the whole contrastive loss – assuming a large margin – can be seen as a proxy for maximizing the MI between the labels Y and the embedded features Z:

(7)  L_contrast ≍ H(Z|Y) - H(Z) = -I(Z; Y)
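The average-log-distance spread measure behind this entropy surrogate can be sketched as follows (constants omitted); scaling the features up strictly increases it, as expected of a spread/entropy proxy:

```python
import numpy as np

def log_distance_spread(z):
    # Average log pairwise distance: the core of a distance-based
    # differential-entropy estimator (additive constants omitted).
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += np.log(np.linalg.norm(z[i] - z[j]))
    return total / (n * (n - 1))

rng = np.random.default_rng(2)
z = rng.normal(size=(20, 5))

# Scaling the embedding up spreads it out, so the measure must increase
# (each log distance grows by exactly log 2).
assert log_distance_spread(2.0 * z) > log_distance_spread(z)
```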
3.2 Generalizing to other pairwise losses
Table 2: Tightness and contrastive parts of four representative DML losses (s_ij denotes the cosine similarity between z_i and z_j; σ, α, β, λ are the losses' hyper-parameters).

Loss | Tightness part | Contrastive part
Center [35] | Σ_i ||z_i - c_{y_i}||² | cross-entropy
Contrastive [5] | Σ_{y_i = y_j} D_ij² | Σ_{y_i ≠ y_j} [m - D_ij²]_+
SNCA [36] | -Σ_i log Σ_{j ≠ i : y_j = y_i} exp(s_ij / σ) | Σ_i log Σ_{k ≠ i} exp(s_ik / σ)
MS [33] | (1/α) Σ_i log[1 + Σ_{j : y_j = y_i} e^{-α(s_ij - λ)}] | (1/β) Σ_i log[1 + Σ_{j : y_j ≠ y_i} e^{β(s_ij - λ)}]
A similar analysis can be carried out for other, more recent metric-learning losses. More specifically, they can also be broken down into two parts: a tightness part that minimizes intra-class distances to form compact clusters, which is related to the conditional entropy H(Z|Y), and a contrastive part that prevents trivial solutions by maximizing inter-class distances, which is related to the entropy of the features H(Z). Note that, in some pairwise losses, there might be some redundancy between the two terms, i.e. the tightness term may also contain a contrastive sub-term, and vice versa. For instance, the cross-entropy loss is used as the contrastive part of the center loss but, as we will show later in subsection 4.2, the cross-entropy loss, used alone, already contains both tightness (conditional entropy) and contrastive (entropy) parts. Table 2 presents the split for four representative DML losses. The rest of the section is devoted to exhibiting the close relationships between several well-known pairwise losses and the tightness and contrastive terms, i.e. H(Z|Y) and H(Z).
Links between losses
In this section, we show that the tightness and contrastive parts of the pairwise losses in Table 2, even though different at first sight, can actually be related to one another.
Lemma 1.
Let T_A denote the tightness part of the loss from method A. Assuming that the features are L2-normalized and that the classes are balanced, the following relations between the Center [35], Contrastive [5], SNCA [36] and MS [33] losses hold:

(8)  T_SNCA ⪅ T_center ≍ T_contrast ⪅ T_MS

where ⪅ stands for lower than, up to a multiplicative and an additive constant, and ≍ stands for equal to, up to a multiplicative and an additive constant.
The detailed proof of Lemma 1 is deferred to the supplemental material. As for the contrastive parts, we show in the supplemental material that both C_SNCA and C_MS are lower bounded by a common contrastive term that is directly related to H(Z). We do not discuss the contrastive term of the center loss, as it is the cross-entropy loss, which is exhaustively studied in Section 4.
4 Cross-entropy does it all
We now completely change gear to focus on the widely used classification loss: the cross-entropy. On the surface, the cross-entropy may seem unrelated to metric-learning losses, as it does not involve pairwise distances. We show that a close relationship exists between the pairwise losses widely used in deep metric learning and the cross-entropy classification loss. This link can be drawn from two different perspectives: one based on an explicit optimization insight, and the other on a discriminative view of the mutual information. First, we explicitly demonstrate that the cross-entropy is an upper bound on a new pairwise loss, which has a structure similar to all the metric-learning losses listed in Table 2, i.e., it contains a tightness term and a contrastive term. Hence, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss. Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the mutual information, to which we connected various DML losses. These findings indicate that the cross-entropy represents a proxy for maximizing I(Z; Y), just like pairwise losses, without the need for the complex sample-mining and optimization schemes associated with the latter.
4.1 The pairwise loss behind cross-entropy
Bound optimization
Given a function f that is either intractable or hard to optimize, bound optimizers are iterative algorithms that instead optimize auxiliary functions (upper bounds on f), which are usually more tractable than the original function. Let t denote the current iteration index; a_t is an auxiliary function of f if:

a_t(W) ≥ f(W) for all W,  and  a_t(W_t) = f(W_t)

A bound optimizer follows a two-step procedure: first, an auxiliary function a_t is computed; then a_t is minimized, such that:

W_{t+1} = argmin_W a_t(W)

This iterative procedure is guaranteed to decrease the original function f:

(9)  f(W_{t+1}) ≤ a_t(W_{t+1}) ≤ a_t(W_t) = f(W_t)
Note that bound optimizers are widely used in machine learning. Well-known examples include the concave-convex procedure (CCCP) [41], expectation-maximization (EM) algorithms and submodular-supermodular procedures (SSP) [16]. Such optimizers are particularly common in clustering [27] and, more generally, in problems involving latent-variable optimization.
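As a toy illustration of the Majorize-Minimize principle (not the paper's auxiliary function), consider minimizing the non-smooth f(x) = Σ_i |x - a_i| via the quadratic majorizer |u| ≤ u²/(2|u_t|) + |u_t|/2, which is tight at the current iterate; each step minimizes the bound in closed form:

```python
import numpy as np

a = np.array([0.0, 1.0, 5.0])
f = lambda x: np.sum(np.abs(x - a))    # non-smooth objective, minimized at the median

x = 3.0
values = [f(x)]
for _ in range(10):
    # Quadratic majorizer of each |x - a_i| around the current iterate,
    # minimized in closed form as a weighted mean (the small floor on the
    # distances only guards the division once x has converged onto a_i).
    w = 1.0 / np.maximum(np.abs(x - a), 1e-12)
    x = np.sum(w * a) / np.sum(w)
    values.append(f(x))

# Each MM step is guaranteed not to increase f, and x approaches the median.
assert all(v1 >= v2 - 1e-9 for v1, v2 in zip(values, values[1:]))
assert abs(x - 1.0) < 1e-6
```

The monotone-decrease assertion is exactly the guarantee of Eq. 9, here observed numerically.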
Pairwise Cross-Entropy
We now prove that the iterative minimization of the cross-entropy can be interpreted as an approximate bound optimization of a more complex pairwise loss.
Proposition 1.
Alternately minimizing the cross-entropy loss with respect to the encoder's parameters W and the classifier's weights θ can be viewed as an approximate bound-optimization of a Pairwise Cross-Entropy (PCE) loss, which we define as follows:

(10)  L_PCE = -(1/(λN²)) Σ_{i,j : y_i = y_j} z_i·z_j + (1/N) Σ_i log Σ_k exp((1/(λN)) ĉ_k·z_i) - (1/(2λN²)) Σ_k ||ĉ_k||²

where ĉ_k = Σ_i p_ik z_i represents the soft mean of class k, p_ik represents the softmax probability of point z_i belonging to class k, and λ > 0 depends on the encoder φ_W.
The full proof of Proposition 1 is provided in the supplemental material, but we hereby provide a sketch of it.
Proof.
Considering the usual softmax parametrization for our model's predictions, the idea is to break the cross-entropy loss into two terms, f_1(θ) and f_2(θ), and to artificially add and remove the regularizer (λ/2) Σ_k ||θ_k||²:

(11)  L_CE = [f_1(θ) + (λ/2) Σ_k ||θ_k||²] + [f_2(θ) - (λ/2) Σ_k ||θ_k||²] =: g_1(θ) + g_2(θ)

By properly choosing λ in Eq. (11), both g_1 and g_2 become convex functions of θ. For any class k, we then show that the optimal values of θ_k for g_1 and g_2 are proportional, respectively, to the hard mean and the soft mean of class k. By plugging in those optimal values, we can lower bound g_1 and g_2 individually in Eq. 11 and obtain the result. ∎
Proposition 1 casts a new light on the cross-entropy loss by explicitly relating it to a new pairwise loss (PCE), following the intuition that the optimal weights of the final layer, i.e., the linear classifier, are related to the centroids of each class in the embedded feature space. Specifically, finding the optimal classifier weights θ for the cross-entropy can be interpreted as building an auxiliary function on L_PCE. Subsequently minimizing the cross-entropy w.r.t. the encoder's weights W can be interpreted as the second step of a bound optimization on L_PCE. We exhibit empirical evidence of this bound-type relationship in Section 5.
Similarly to other metric-learning losses, PCE contains a tightness part that encourages samples from the same class to align with one another. Echoing Lemma 1, this tightness term, denoted T_PCE, is equivalent, up to multiplicative and additive constants, to T_center and T_contrast when the features are assumed to be normalized.
Pairwise Cross-Entropy also contains a contrastive part, divided into two terms: the first pushes all samples away from one another, while the second forces the soft means away from the origin. Hence, minimizing the cross-entropy can be interpreted as implicitly minimizing a pairwise loss with a structure similar to the well-established metric-learning losses in Table 2.
Remark
In [42], competitive results were achieved using a normalized version of the cross-entropy, which directly optimizes cosine distances and uses an additional temperature parameter τ. Using feature and weight normalization, the loss in [42] directly simplifies to a pairwise loss that contains the tightness term of the center loss [35]:

(12)  L_[42] ≍ (1/(2τ)) Σ_i ||z̃_i - θ̃_{y_i}||² + Σ_i log Σ_k exp(-||z̃_i - θ̃_k||² / (2τ))

where z̃_i = z_i / ||z_i|| and θ̃_k = θ_k / ||θ_k||. This direct link comes from the fact that, on the unit hypersphere, squared Euclidean and cosine distances are equivalent, up to multiplicative and additive constants.
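The identity behind this remark is that, for unit vectors, ||a - b||² = 2 - 2 cos(a, b), which is easy to check:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = rng.normal(size=(2, 8))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # project onto the unit hypersphere

sq_l2 = np.sum((a - b) ** 2)
cos_sim = np.dot(a, b)
assert np.isclose(sq_l2, 2.0 - 2.0 * cos_sim)        # squared l2 = 2 - 2*cosine
```

So any monotone function of the cosine similarity is an equally monotone function of the squared Euclidean distance on normalized features.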
4.2 A discriminative view of mutual information
Lemma 2.
Minimizing the conditional cross-entropy loss, denoted H(Y; Ŷ | Z), is equivalent to maximizing the mutual information I(Z; Y).
Proof.
Using the discriminative view of the MI, we can write:

(13)  I(Z; Y) = H(Y) - H(Y|Z)

The entropy of the labels, H(Y), is a constant and can therefore be ignored. From this view of the MI, maximizing I(Z; Y) can only be achieved by minimizing H(Y|Z), which depends on our embeddings Z.
We can relate this term to our cross-entropy loss using the following relation:

(14)  H(Y; Ŷ | Z) = H(Y|Z) + D_KL(Y || Ŷ | Z)

Therefore, while minimizing the cross-entropy, we are implicitly minimizing both H(Y|Z) and the KL term. In fact, following Eq. 14, the optimization could naturally be decoupled into two steps, in a Majorize-Minimize fashion. One step would consist in fixing the encoder's weights W and minimizing Eq. 14 only w.r.t. the classifier's weights θ. At this step, H(Y|Z) would be fixed while Ŷ would be adjusted to minimize the KL term. Ideally, the KL term would vanish at the end of this step. In the following step, we would minimize Eq. 14 w.r.t. the encoder's weights W, while keeping the classifier fixed. ∎
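The decomposition in Eq. 14 (cross-entropy = conditional entropy + KL divergence) can be verified on a small discrete example with a hypothetical true conditional p(y|z) and model q(y|z):

```python
import numpy as np

# True conditional p(y|z) for two feature values z, and a model's q(y|z).
p = np.array([[0.9, 0.1],
              [0.3, 0.7]])
q = np.array([[0.6, 0.4],
              [0.5, 0.5]])
p_z = np.array([0.5, 0.5])     # marginal over the two feature values

def cond_ce(p, q):             # conditional cross-entropy, averaged over z
    return -np.sum(p_z[:, None] * p * np.log(q))

def cond_entropy(p):           # conditional entropy H(Y|Z)
    return -np.sum(p_z[:, None] * p * np.log(p))

kl = np.sum(p_z[:, None] * p * np.log(p / q))   # conditional KL(p || q)

assert np.isclose(cond_ce(p, q), cond_entropy(p) + kl)  # Eq. 14
assert cond_ce(p, q) >= cond_entropy(p)                 # CE upper-bounds H(Y|Z)
assert np.isclose(cond_ce(p, p), cond_entropy(p))       # tight when q = p
```

The last assertion mirrors the ideal end of the classifier step, where the KL term vanishes and the cross-entropy reduces to H(Y|Z).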
The result of Lemma 2 is very compelling. Using the discriminative view of the mutual information shows that minimizing the cross-entropy loss is equivalent to maximizing the mutual information I(Z; Y). This information-theoretic argument reinforces our conclusion from Proposition 1 that the cross-entropy and the previously described metric-learning losses are essentially doing the same job.
4.3 Then why would cross-entropy work better?
In the previous sections, we showed that the cross-entropy essentially optimizes the same underlying mutual information as the other DML losses. This fact alone is not enough to explain why the cross-entropy consistently achieves better results than the DML losses proposed so far, as shown in Section 6. We argue that the difference must lie in the optimization process. Pairwise losses require careful sample-mining and sample-weighting strategies to obtain the most informative pairs of samples, especially when considering mini-batches, in order to achieve convergence in a reasonable amount of time, using a reasonable amount of memory. On the other hand, optimizing the cross-entropy is substantially easier, as it only involves the minimization of unary terms. Essentially, the cross-entropy does it all without the pain of having to deal with pairwise terms. Not only does this make optimization easier, it also makes the implementation simpler, increasing its potential applicability to real-world problems.
5 Empirical support of the link between CE and PCE
Throughout this section, we provide simple empirical evidence to support the claimed relation between cross-entropy (CE) minimization and Pairwise Cross-Entropy (PCE) minimization (see subsection 4.1). The main goal is to show that minimizing the CE is implicitly equivalent to minimizing an underlying pairwise loss. Intuitively, the pairwise terms appear through the observation that the optimal weights of the classifier used in the CE are directly related to the hard means of each class in the feature space. This enabled us to link CE to PCE in Proposition 1.
5.1 Simplified PCE
PCE, as presented in Proposition 1, requires computing the parameter λ at every iteration, which in turn requires computing the eigenvalues of a matrix at every iteration (cf. subsection A.2 and Eq. 21 therein). We can remove the dependence upon λ by plugging the same θ into both terms of Eq. 21; we choose the hard means, θ_k = c_k. This yields a simplified version of PCE, which we call SPCE. SPCE and PCE are very similar (the difference is essentially that PCE was derived by plugging in the soft means instead of the hard means in the second term), and both are pairwise losses derived from the CE. Note that SPCE is nothing more than the cross-entropy in which each classifier weight θ_k is replaced by the current hard mean c_k:

L_SPCE = -(1/N) Σ_i log [ exp(c_{y_i}·z_i) / Σ_k exp(c_k·z_i) ]
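A minimal numpy sketch of SPCE along these lines (temperature and normalization details omitted; the class hard means simply play the role of the classifier weights):

```python
import numpy as np

def softmax_ce(z, y, theta):
    """Cross-entropy of a linear classifier with weights theta (K x d)."""
    logits = z @ theta.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y])

def spce(z, y, num_classes):
    """Simplified PCE: cross-entropy with each classifier weight replaced
    by the current hard mean of its class."""
    means = np.stack([z[y == k].mean(axis=0) for k in range(num_classes)])
    return softmax_ce(z, y, means)

# Two well-separated synthetic clusters.
rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(-2, 1, size=(20, 5)),
                    rng.normal(2, 1, size=(20, 5))])
y = np.array([0] * 20 + [1] * 20)
loss = spce(z, y, 2)
```

With well-separated clusters, the hard means classify every point correctly, so the loss lies well below the chance level log 2.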
5.2 MNIST experiment
In Fig. 1, we track the evolution of both objectives, CE and SPCE, when optimizing both the encoder and the classifier with CE only on the MNIST dataset. We use a small CNN composed of four convolutional layers, trained with SGD. The batch size is set to 64, momentum to 0.9, and the learning rate follows a cosine-annealing schedule.
Fig. 1 strongly supports Proposition 1: minimizing the cross-entropy indeed results in minimizing an underlying pairwise loss. Notice that, even though SPCE is not theoretically guaranteed to be a lower bound on CE (as opposed to PCE), in practice it always remains lower than CE, and both losses tend to be very close towards convergence. Given that CE and SPCE are essentially the same cross-entropy, applied with different classifier weights, this indicates that at every iteration the hard means c_k are a better option than the weights θ found with SGD, and that SGD eventually converges to this solution.
6 Experiments
6.1 Metric
The most common metric used in DML is the recall. Most methods, especially recent ones, use the cosine distance to compute the recall for evaluation. They include normalization of the features in the model [17, 15, 31, 18, 3, 40, 39, 42, 33, 20, 38], which makes the cosine and Euclidean distances equivalent. Computing cosine similarities is also more memory-efficient and typically leads to better results [21]. For these reasons, the Euclidean distance on non-normalized features has rarely been used for either training or evaluation. In our experiments, normalizing the features during training actually hindered the final performance, which might be explained by the fact that we add a classification layer on top of the feature extractor. Thus, we did not normalize the features during training, and we report the recall with both the Euclidean and cosine distances.
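A minimal Recall@1 implementation illustrates the point (synthetic features; the cosine/Euclidean equivalence on normalized features follows from the hypersphere identity mentioned in Section 4):

```python
import numpy as np

def recall_at_1(feats, labels, metric="cosine"):
    """Recall@1: fraction of queries whose nearest neighbour
    (excluding themselves) shares their label."""
    if metric == "cosine":
        f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = f @ f.T
        np.fill_diagonal(sim, -np.inf)          # exclude self-matches
        nn = sim.argmax(axis=1)
    else:  # euclidean, on the features as given
        d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn = d.argmin(axis=1)
    return np.mean(labels[nn] == labels)

# Two well-separated synthetic clusters.
rng = np.random.default_rng(5)
feats = np.concatenate([rng.normal(-3, 1, size=(10, 4)),
                        rng.normal(3, 1, size=(10, 4))])
labels = np.array([0] * 10 + [1] * 10)
r = recall_at_1(feats, labels, "euclidean")

# On normalized features, cosine and Euclidean rankings coincide.
fn = feats / np.linalg.norm(feats, axis=1, keepdims=True)
assert recall_at_1(fn, labels, "cosine") == recall_at_1(fn, labels, "euclidean")
```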
6.2 Datasets
Name  Objects  Categories  Images

Caltech-UCSD Birds-200-2011 (CUB) [28]  Birds  200  11 788
Cars Dataset [11]  Cars  196  16 185
Stanford Online Products (SOP) [24]  House furniture  22 634  120 053
In-Shop Clothes Retrieval [13]  Clothes  7 982  52 712
Four datasets are commonly used in metric learning to evaluate the performance of a given method. These datasets are summarized in Table 3. The CUB [28], Cars [11] and SOP [24] datasets are divided into train and evaluation splits. For the evaluation, the recall is computed between each sample of the evaluation set and the rest of the set. In-Shop [13] is divided into a query and a gallery set. The recall is computed between each sample of the query set and the whole gallery set.
6.3 Training specifics
Model architecture and pretraining In the metric-learning literature, several model architectures have been used, which historically correspond to the state-of-the-art image-classification architectures on the ImageNet dataset [2], with an additional constraint on model size (i.e. being able to train on one or two GPUs in a reasonable time). These architectures are GoogLeNet [25] as in [10], BN-Inception [26] as in [33] and ResNet50 [6] as in [39]. They differ widely in performance on the ImageNet dataset, but the impact on the final performance on DML benchmarks has rarely been studied in controlled experiments. As this is not the focus of our paper, we use the ResNet50 architecture in our experiments. We acknowledge that some papers may obtain better performance by modifying the architecture (e.g. reducing the model stride, performing multi-level fusion of features), but we limit our comparison to standard architectures. We implement our experiments using the PyTorch [19] library and initialize the ResNet50 model with weights pretrained on ImageNet.
Sampling To the best of our knowledge, all DML papers – including [42] – use a form of pairwise sampling ensuring that, during training, each mini-batch contains a fixed number of classes and of samples per class (e.g. a mini-batch size of 75 with 3 classes and 25 samples per class in [42]). Deviating from that, we use plain random sampling over all samples (as in most classification training schemes) and set the mini-batch size to 128 in all experiments (contrary to [33], in which the authors use a mini-batch size of 80 for CUB and 1 000 for SOP, and did not report it for Cars and In-Shop).
Data augmentation As in many training procedures for deep-learning models, data augmentation proves to be of paramount importance to the final performance of the method. In several works, the same augmentation strategies are used regardless of the dataset. In practice, however, we found that they heavily impact the results and need to be tuned on a per-dataset basis. For CUB, the images are first resized so that their smallest side has a length of 256 (i.e. keeping the aspect ratio), while for Cars, SOP and In-Shop, the images are resized to 256 × 256. Then, a patch is extracted at a random location and with a random size, and resized to 224 × 224. For CUB and Cars, random jittering of the brightness, contrast and saturation further improves the results. All implementation details can be found in the publicly available code.
Cross-entropy The focus of our experiments is to show that, with careful tuning, it is possible to obtain similar or better performance than most recent DML methods while only using the cross-entropy loss. To train with the cross-entropy loss, we add a linear classification layer (with bias) on top of the feature extractor – as in many classification models – which produces logits for all the classes present in the training set. Both the weights and the biases of this classification layer are initialized to zero. We also add dropout before this classification layer. To further reduce overfitting, we use label smoothing on the target probabilities of the cross-entropy: the probability of the true class is set to 1 − ε and the remaining mass ε is spread uniformly over the other classes, with the same ε in all our experiments.
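Label smoothing under this convention (ε spread uniformly over the K − 1 other classes; ε = 0.1 below is an illustrative value, not necessarily the one used in the experiments) can be sketched as:

```python
import numpy as np

def smoothed_targets(y, num_classes, eps=0.1):
    """Smoothed one-hot targets: the true class gets 1 - eps, and the
    remaining mass eps is spread uniformly over the other classes."""
    t = np.full((len(y), num_classes), eps / (num_classes - 1))
    t[np.arange(len(y)), y] = 1.0 - eps
    return t

t = smoothed_targets(np.array([2]), num_classes=4, eps=0.1)
```

Each row remains a valid probability distribution, so the smoothed targets plug directly into the usual cross-entropy.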
Optimizer In most DML papers, the optimizer hyper-parameters are the same for Cars, SOP and In-Shop, with typically a smaller learning rate for CUB. In our experiments, we found that the best results were obtained by tuning the learning rate on a per-dataset basis. In all experiments, the models are trained with SGD with Nesterov acceleration, with weight decay applied to the convolution and fully-connected layers' weights (but not to the biases), as in [8]. For CUB and Cars, the learning rate is set to 0.02 and 0.05 respectively, with no momentum. For both SOP and In-Shop, the learning rate is set to 0.003 with a momentum of 0.99.
Batch normalization Following [33], we freeze all the batch-normalization layers in the feature extractor. For Cars, SOP and In-Shop, we found that adding batch normalization – without scaling and bias – on top of the feature extractor improves the final performance and reduces the gap between the Euclidean and cosine distances when computing the recall. On CUB, however, we obtained the best recall without this batch normalization.
Method  Dist.  Architecture  Recall@K
Caltech-UCSD Birds-200-2011


Lifted Structure [24]  GoogLeNet  
ProxyNCA [15]  cos  BNInception  
HTL [3]  cos  GoogLeNet  
ABE [10]  cos  GoogLeNet  
HDC [40]  cos  GoogLeNet  
DREML [38]  cos  ResNet18  
EPSHN [39]  cos  ResNet50  
NormSoftmax [42]  cos  ResNet50  
MultiSimilarity [33]  cos  BNInception  
D&C [20]  cos  ResNet50  
CrossEntropy  ResNet50  
cos  
Stanford Cars 


Lifted Structure [24]  GoogLeNet  
ProxyNCA [15]  cos  BNInception  
HTL [40]  cos  GoogLeNet  
EPSHN [39]  cos  ResNet50  
HDC [40]  cos  GoogLeNet  
MultiSimilarity [33]  cos  BNInception  
D&C [20]  cos  ResNet50  
ABE [10]  cos  GoogLeNet  
DREML [38]  cos  ResNet18  
NormSoftmax [42]  cos  ResNet50  
CrossEntropy  ResNet50  
cos  
Stanford Online Product 


Lifted Structure [24]  GoogLeNet  
HDC [40]  cos  GoogLeNet  
HTL [3]  cos  GoogLeNet  
D&C [20]  cos  ResNet50  
ABE [10]  cos  GoogLeNet  
MultiSimilarity [33]  cos  BNInception  
EPSHN [39]  cos  ResNet50  
NormSoftmax [42]  cos  ResNet50  
CrossEntropy  ResNet50  
cos  
InShop Clothes Retrieval 


HDC [40]  cos  GoogLeNet  
DREML [38]  cos  ResNet18  
HTL [3]  cos  GoogLeNet  
D&C [20]  cos  ResNet50  
ABE [10]  cos  GoogLeNet  
EPSHN [39]  cos  ResNet50  
NormSoftmax [42]  cos  ResNet50  
MultiSimilarity [33]  cos  BNInception  
CrossEntropy  ResNet50  
cos 
6.4 Results
Results for the experiments are reported in Table 4. We also report the architecture used in each experiment, as well as the distance used in the evaluation to compute the recall: ℓ2 refers to the Euclidean distance on non-normalized features, while cos refers to either the cosine distance or the Euclidean distance on normalized features, which are equivalent.
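The equivalence between the cosine distance and the Euclidean distance on normalized features can be checked numerically; the dimension (128) and random features below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 128))

# Project both features onto the unit hypersphere.
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

cosine_distance = 1.0 - x @ y
squared_euclidean = np.sum((x - y) ** 2)

# On unit-norm features, ||x - y||^2 = 2 - 2 cos(x, y), i.e., twice the
# cosine distance, so ranking by either distance gives the same
# retrieval order.
assert np.isclose(squared_euclidean, 2.0 * cosine_distance)
```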
On all datasets, we report state-of-the-art results, except on Cars, where the only method achieving similar recall uses the cross-entropy for training (see subsubsection 4.1.3). We also notice that, contrary to common belief, using the Euclidean distance can actually be competitive, as it also achieves near state-of-the-art results on all four datasets. These results clearly highlight the potential of the cross-entropy for metric learning and confirm that this loss can achieve the same objective as pairwise losses.
7 Conclusion
Throughout this paper, we revealed non-obvious relations between the cross-entropy loss, widely adopted in classification tasks, and the pairwise losses commonly used in DML. These relations were drawn from two different perspectives. First, cross-entropy minimization was shown to be equivalent to an approximate bound-optimization of a pairwise loss, introduced as the Pairwise Cross-Entropy (PCE), which is similar in structure to existing DML losses. Second, adopting a more general information-theoretic view of DML, we showed that both pairwise losses and the cross-entropy are, in essence, maximizing a common mutual information between the embedded features and the labels. This connection becomes particularly apparent when writing the mutual information in both its generative and discriminative views. Hence, we argue that most of the differences in performance observed in previous works come from the optimization process during training. The cross-entropy only contains unary terms, while traditional DML losses are based on the optimization of pairwise terms, which requires substantially more tuning (e.g., mini-batch size, sampling strategy, pair weighting). While we acknowledge that some losses have better optimization properties than others, we empirically showed that the cross-entropy loss is also able to achieve state-of-the-art results when fairly tuned, highlighting the fact that most improvements have come from enhanced training schemes (e.g., data augmentation, learning-rate policies, batch-normalization freeze) rather than from intrinsic properties of pairwise losses. We strongly advocate that the cross-entropy should be carefully tuned and compared against as a baseline in future works.
Appendix A Proofs
A.1 Proposition 1
Proof.
Throughout the following proofs, we will use the fact that classes are assumed to be balanced, so that the number of samples in any class can be treated as a constant. We will also use the feature-normalization assumption to connect the cosine and Euclidean distances: on the unit hypersphere, for any two unit-norm features z_i and z_j, ‖z_i − z_j‖² = 2 − 2 z_i⊤z_j.
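The normalization identity invoked here follows from a one-line expansion of the squared norm under the unit-norm assumption ‖z_i‖ = ‖z_j‖ = 1, where z_i and z_j denote two embedded features (the notation is assumed here):

```latex
\|z_i - z_j\|^2 = \|z_i\|^2 - 2\, z_i^\top z_j + \|z_j\|^2 = 2 - 2\, z_i^\top z_j
```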
Tightness terms:
Let us start by linking the center loss to the contrastive loss. For any specific class , let denote the hard mean of its features; we can write:
Summing over all classes , we get the desired equivalence. Note that a similar result was already established in the context of K-means clustering [27], although the setting there is very different: the optimization is performed over assignment variables, whereas in DML the assignments are already known and the embedding is optimized.
Now we link the contrastive loss to the SNCA loss. For any class , we can write:
where we used the convexity of for the first inequality. The proof can be finished by summing over all classes .
Finally, we link the MS loss – applied with the value of used in the experiments of [33] – to the contrastive loss:
where we used the concavity of for the first inequality.
Contrastive terms:
In this part, we first show that the contrastive terms and represent upper bounds on :
The link between SNCA and contrastive can be established quite similarly: