Metric learning: cross-entropy vs. pairwise losses

Abstract

Recently, substantial research efforts in Deep Metric Learning (DML) have focused on designing complex pairwise-distance losses, together with convoluted sample-mining and implementation strategies to ease optimization. The standard cross-entropy loss for classification has been largely overlooked in DML. On the surface, the cross-entropy may seem unrelated and irrelevant to metric learning as it does not explicitly involve pairwise distances. However, we provide a theoretical analysis that links the cross-entropy to several well-known and recent pairwise losses. Our connections are drawn from two different perspectives: one based on an explicit optimization insight; the other on discriminative and generative views of the mutual information between the labels and the learned features. First, we explicitly demonstrate that the cross-entropy is an upper bound on a new pairwise loss, which has a structure similar to various pairwise losses: it minimizes intra-class distances while maximizing inter-class distances. As a result, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss. Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the mutual information, to which we connect several well-known pairwise losses. These findings indicate that the cross-entropy represents a proxy for maximizing the mutual information, as pairwise losses do, without the need for complex sample-mining and optimization schemes. Furthermore, we show that various standard pairwise losses can be explicitly related to one another via bound relationships. Our experiments over four standard DML benchmarks (CUB200, Cars-196, Stanford Online Product and In-Shop) strongly support our findings. We consistently obtained state-of-the-art results, outperforming many recent and complex DML methods.

Keywords:
Metric Learning, Deep Learning, Information Theory

1 Introduction

The core task of metric learning is to learn a metric from high-dimensional data such that the distance between two points, as measured by this metric, reflects their semantic similarity. This task is of crucial importance in several applications, including image retrieval, zero-shot learning and person re-identification. The first approaches to this problem attempted to learn metrics directly on the input space [14]. Later, the idea of learning a suitable embedding was introduced, and considerable effort was put into learning Mahalanobis distances [37, 22, 4, 34, 1]. This corresponds to learning the best linear projection of the input space onto a lower-dimensional manifold, followed by using the Euclidean distance as a metric. Building on the embedding-learning ideas, several papers proposed to learn more complex mappings, either by kernelization of existing linear algorithms [1], or simply by using more complex hypotheses such as linear combinations of gradient-boosted regression trees [9].

The recent success of deep neural networks at learning complex, highly nonlinear mappings of high-dimensional data aligns well with the problem of learning a suitable embedding. Similarly to works carried out in the context of Mahalanobis distance learning, most Deep Metric Learning (DML) approaches are based on pairwise distances. More specifically, the current paradigm is to learn a deep encoder that maps pairs of points with high semantic similarity close together in the embedded space, as measured by well-established distances (e.g. Euclidean or cosine). This paradigm concretely translates into pairwise losses that encourage small distances for pairs of samples from the same class and penalize small distances for pairs of samples from different classes. While the formulations seem rather intuitive, the practical implementation and optimization of such pairwise losses can become cumbersome. Randomly assembling pairs of samples results in either slow convergence or degenerate solutions [7]. Hence, most research efforts in DML have focused on reformulating or generalizing the existing pairwise losses, or on improving their sample-mining and sample-weighting strategies. Popular pairwise losses include triplet loss and its derivatives [7, 23, 24, 43, 3], contrastive loss and its derivatives [5, 33], Neighborhood Component Analysis and its derivatives [4, 15, 36], among others. However, such modifications are often heuristic-based, and come at the price of increased complexity and additional hyper-parameters, impeding the potential applicability of these methods to real-world problems.

Admittedly, the objective of learning a useful embedding of data points intuitively aligns with the idea of directly acting on the distances between pairs of points in the embedded space, via minimizing intra-class distances and maximizing inter-class distances. Therefore, the standard cross-entropy loss, widely used in classification tasks, has been largely overlooked by the DML community, most likely due to its apparent irrelevance for metric learning [35]. As a matter of fact, why would anyone use a point-wise prediction loss to enforce pairwise-distance properties on the feature space? Even though the cross-entropy was shown to be competitive for face recognition applications [12, 30, 29], to the best of our knowledge, only one paper empirically observed competitive results for a normalized, temperature-weighted version of the cross-entropy in the more general context of deep metric learning [42]. However, the authors did not provide any theoretical insights for these results. Moreover, as we will show later, the feature normalization, weight normalization and bias removal in [42] directly simplify the altered cross-entropy into a loss very similar to the well-known center loss [35].

In fact, even though, on the surface, the standard cross-entropy loss may seem completely unrelated to the pairwise losses used in DML, we provide theoretical justifications that directly connect the cross-entropy to several well-known and recent pairwise losses. Our connections are drawn from two different perspectives, one based on an explicit optimization insight and the other on mutual-information arguments. We show that four of the most prominent pairwise metric-learning losses, as well as the standard cross-entropy, are essentially maximizing a common underlying objective: the Mutual Information (MI) between the learned embeddings and the corresponding samples’ labels. As sketched in Section 2, this connection can be intuitively understood by writing this MI in two different, but equivalent ways. Specifically, we find that tight links between pairwise losses and the generative view of this MI can be established. We study the particular case of contrastive loss [5], explicitly showing its relation to this MI. We further demonstrate that other DML losses have tight relations to contrastive loss, such that the reasoning applied to this specific example generalizes to the other DML losses. In fact, various standard pairwise losses can be explicitly related to one another via bound relationships. As for the cross-entropy, we first demonstrate that it is an upper bound on a new, underlying pairwise loss, which has a structure similar to various existing pairwise losses and to which the previous reasoning can therefore be applied. As a result, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss, thereby implicitly minimizing intra-class distances and maximizing inter-class distances. Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the discriminative view of the mutual information.
Our findings indicate that the cross-entropy represents a proxy for maximizing the mutual information, as pairwise losses do, without the need for complex sample-mining and optimization schemes. Our comprehensive experiments over four standard DML benchmarks (CUB200, Cars-196, Stanford Online Product and In-Shop) strongly support our findings. We consistently obtained state-of-the-art results, outperforming many recent and complex DML methods.

Summary of contributions

1. Establishing relations between several well-known pairwise DML losses and a generative view of the mutual information between the learned features and labels.

2. Proving explicitly that optimizing the standard cross-entropy corresponds to an approximate bound-optimizer of an underlying pairwise loss.

3. More generally, showing that minimizing the standard cross-entropy loss is equivalent to maximizing a discriminative view of the mutual information between the features and labels.

4. Demonstrating state-of-the-art results with cross-entropy on several DML benchmark datasets.

2 On the two views of the mutual information

The Mutual Information (MI) is a well-known measure designed to quantify the amount of information shared by two random variables. Its formal definition is presented in Table 1. Throughout this work, we will be particularly interested in $I(\hat{Z};Y)$, which represents the MI between learned features $\hat{Z}$ and labels $Y$. Due to its symmetry property, the MI can be written in two ways, which we will refer to as the discriminative view and generative view of MI:

$$I(\hat{Z};Y) = \underbrace{H(Y) - H(Y|\hat{Z})}_{\text{discriminative view}} = \underbrace{H(\hat{Z}) - H(\hat{Z}|Y)}_{\text{generative view}} \tag{1}$$

While being analytically equivalent, these two views present two different, complementary interpretations. In order to maximize $I(\hat{Z};Y)$, the discriminative view conveys that the labels should be balanced (out of our control) and that the labels should be easily identified from the features. On the other hand, the generative view conveys that the features learned should overall spread as much as possible in the feature space, but keep samples sharing the same class close together. Hence the discriminative view is more focused on label identification, while the generative view focuses on more explicitly shaping the distribution of features learned by the model. Therefore, the MI allows us to draw links between classification losses (e.g. cross-entropy) and feature-shaping losses (including all the well-known pairwise metric learning losses).
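As a small numerical illustration of the two views, the sketch below evaluates both expressions on a toy discrete joint distribution and shows that they agree. It is only an illustration under simplifying assumptions: the features are treated as a discrete variable (the paper deals with continuous features and differential entropies), and the table `joint` and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy_of_columns(joint):
    """H(column variable | row variable) for a table joint[row][col]."""
    total = 0.0
    for row in joint:
        row_mass = sum(row)
        for p in row:
            if p > 0:
                total -= p * math.log(p / row_mass)
    return total

def mutual_information_views(joint_zy):
    """Return (discriminative, generative) MI for joint_zy[z][y] = p(z, y)."""
    p_z = [sum(row) for row in joint_zy]
    p_y = [sum(col) for col in zip(*joint_zy)]
    joint_yz = [list(col) for col in zip(*joint_zy)]  # transpose: rows index y
    discriminative = entropy(p_y) - conditional_entropy_of_columns(joint_zy)  # H(Y) - H(Y|Z)
    generative = entropy(p_z) - conditional_entropy_of_columns(joint_yz)      # H(Z) - H(Z|Y)
    return discriminative, generative

# Toy joint distribution over 3 feature values and 2 labels (hypothetical numbers).
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.05, 0.25]]
disc, gen = mutual_information_views(joint)
print(disc, gen)
```

Both numbers coincide up to floating-point error, which is exactly the symmetry used in Eq. 1.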

3 Pairwise losses and the generative view of the MI

In this section, we study four pairwise losses widely used in the DML community: center loss [35], contrastive loss [5], Scalable Neighbor Component Analysis (SNCA) loss [36] and Multi-Similarity (MS) loss [33]. We show that these losses can be interpreted as proxies for maximizing the generative view of mutual information $I(\hat{Z};Y)$. We begin by analyzing the specific example of contrastive loss, establishing its tight link to the MI, and further generalize our analysis to all the other pairwise losses (see Table 2). Furthermore, we show that all these pairwise metric-learning losses can be explicitly related to one another via bound relationships.

3.1 The example of contrastive loss

We start by analyzing the representative example of contrastive loss [5]. For a given margin $m$, this loss is formulated as:

$$L_{contrast} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j:y_j=y_i} D_{ij}^2}_{T_{contrast}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j:y_j\neq y_i} \left[(m - D_{ij})_+\right]^2}_{E_{contrast}} \tag{2}$$

This loss naturally breaks down into two terms: a tightness part $T_{contrast}$ and a contrastive part $E_{contrast}$. The tightness part encourages samples from the same class to be close to each other and form tight clusters. As for the contrastive part, it forces samples from different classes to stand far apart from one another in the embedded feature space. Let us analyze these two terms in detail from a mutual-information perspective.
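To make the two parts concrete, here is a minimal sketch that evaluates the tightness and contrastive terms of Eq. 2 on a handful of points. It assumes that the pairwise distance is the Euclidean distance between embedded features; the function names and the toy data are hypothetical, and no sample mining or weighting is included.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss_parts(features, labels, margin):
    """Return (tightness, contrastive) parts of the contrastive loss.

    Sums run over ordered pairs (i, j), matching the double sums of Eq. 2;
    the pair i == j contributes zero to the tightness part.
    """
    n = len(features)
    tightness = 0.0
    contrastive = 0.0
    for i in range(n):
        for j in range(n):
            d = euclidean(features[i], features[j])
            if labels[j] == labels[i]:
                tightness += d ** 2                      # pull same-class pairs together
            else:
                contrastive += max(margin - d, 0.0) ** 2  # push other classes beyond the margin
    return tightness / n, contrastive / n

# Two well-separated toy clusters (hypothetical data).
features = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]
print(contrastive_loss_parts(features, labels, margin=1.0))
print(contrastive_loss_parts(features, labels, margin=4.0))
```

With the small margin, every inter-class pair already lies beyond it and the contrastive part vanishes; with the larger margin the same pairs are penalized.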

As shown in the next subsection, the tightness part of contrastive loss is equivalent to the tightness part of the center loss [35]: $T_{contrast} \stackrel{c}{=} T_{center} = \frac{1}{2}\sum_{i=1}^{n}\|z_i - c_{y_i}\|^2$, where $c_k$ denotes the mean of feature points from class $k$ in embedding space and the symbol $\stackrel{c}{=}$ denotes equality up to a multiplicative and/or additive constant. Written in this way, we can interpret $T_{contrast}$ as a conditional cross entropy between $\hat{Z}$ and another random variable $\bar{Z}$, whose conditional distribution given $Y$ is a standard Gaussian centered around $c_Y$, i.e. $\bar{Z}|Y \sim \mathcal{N}(c_Y, I)$:

$$T_{contrast} \stackrel{c}{=} H(\hat{Z};\bar{Z}|Y) = H(\hat{Z}|Y) + D_{KL}(\hat{Z}\,\|\,\bar{Z}\,|\,Y) \tag{3}$$

As such, $T_{contrast}$ is an upper bound on the conditional entropy that appears in the mutual information:

$$T_{contrast} \geq H(\hat{Z}|Y) \tag{4}$$

This bound is tight when $\hat{Z}|Y \sim \mathcal{N}(c_Y, I)$. Hence, minimizing $T_{contrast}$ can be seen as minimizing $H(\hat{Z}|Y)$, which exactly encourages the encoder to produce low-entropy (i.e. compact) clusters in the feature space for each given class. Notice that using this term only will inevitably lead to a trivial encoder that maps all data points to a single point in the embedded space, hence achieving a global optimum.

To prevent such a trivial solution, a second term needs to be added. This second term, which we refer to as the contrastive term, is designed to push each point away from points that have a different label. Assuming a large margin such that $m \gg D_{ij}$ for all pairs, we can linearize the contrastive term:

$$E_{contrast} \stackrel{c}{\approx} -\frac{2m}{n}\sum_{i=1}^{n}\sum_{j:y_j\neq y_i} D_{ij} = -\frac{2m}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} D_{ij} + \frac{2m}{n}\sum_{i=1}^{n}\sum_{j:y_j=y_i} D_{ij} \tag{5}$$
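For completeness, the linearization in Eq. 5 can be spelled out per pair. The steps below are a sketch of the reasoning under the large-margin assumption stated above, not a reproduction of the authors' derivation:

```latex
\begin{align*}
\left[(m - D_{ij})_+\right]^2
  &= (m - D_{ij})^2
  && \text{the hinge is inactive because } D_{ij} \le m \\
  &= m^2 - 2 m D_{ij} + D_{ij}^2 \\
  &\stackrel{c}{\approx} -2 m D_{ij}
  && m^2 \text{ is a constant, and } D_{ij}^2 \ll 2 m D_{ij} \text{ when } m \gg D_{ij}, \\[4pt]
\sum_{j:\,y_j \neq y_i} D_{ij}
  &= \sum_{j=1}^{n} D_{ij} \;-\; \sum_{j:\,y_j = y_i} D_{ij}
  && \text{split the sum over all } j \text{ by label.}
\end{align*}
```

Averaging the first result over all inter-class pairs and then applying the split gives the two terms on the right-hand side of Eq. 5.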

While the second term in Eq. 5 is redundant with the tightness objective, the first term is close to the differential entropy estimator proposed in [32]:

$$\hat{H}(\hat{Z}) = \frac{d}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n} \log D_{ij}^2 \;\stackrel{c}{=}\; \sum_{i=1}^{n}\sum_{j=1}^{n} \log D_{ij} \tag{6}$$

Both terms measure the spread of $\hat{Z}$, even though they present different dynamics. For instance, the presence of the $\log$ in Eq. 6 could cause high gradients close to 0, but yield more robustness to outliers. All in all, minimizing the whole contrastive loss, assuming a large margin, can be seen as a proxy for maximizing the MI between the labels $Y$ and the embedded features $\hat{Z}$:

$$L_{contrast} \stackrel{c}{\approx} \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j:y_j=y_i}\left(D_{ij}^2 + 2mD_{ij}\right)}_{\propto\, H(\hat{Z}|Y)} - \frac{2m}{n}\underbrace{\sum_{i,j} D_{ij}}_{\propto\, H(\hat{Z})} \;\propto\; -I(\hat{Z};Y) \tag{7}$$

3.2 Generalizing to other pairwise losses

A similar analysis can be carried out on other, more recent metric learning losses. More specifically, they can also be broken down into two parts: a tightness part that minimizes intra-class distances to form compact clusters, which is related to the conditional entropy $H(\hat{Z}|Y)$, and a second contrastive part that prevents trivial solutions by maximizing inter-class distances, which is related to the entropy of features $H(\hat{Z})$. Note that, in some pairwise losses, there might be some redundancy between the two terms, i.e. the tightness term also contains some contrastive subterm, and vice-versa. For instance, the cross-entropy loss is used as the contrastive part of the center loss but, as we will show later in subsection 4.2, the cross-entropy loss, used alone, already contains both tightness (conditional entropy) and contrastive (entropy) parts. Table 2 presents the split for four representative DML losses. The rest of the section is devoted to exhibiting the close relationships between several well-known pairwise losses and the tightness (conditional entropy) and contrastive (entropy) terms (i.e. $H(\hat{Z}|Y)$ and $H(\hat{Z})$).

In this section, we show that the tightness and contrastive parts of the pairwise losses in Table 2, even though different at first sight, can actually be related to one another.

Lemma 1.

Let $T_A$ denote the tightness part of the loss from method A. Assuming that features are L2-normalized, and that classes are balanced, the following relations between Center [35], Contrastive [5], SNCA [36] and MS [33] losses hold:

$$T_{SNCA} \stackrel{c}{\leq} T_{Center} \stackrel{c}{=} T_{Contrastive} \stackrel{c}{\leq} T_{MS} \tag{8}$$

where $\stackrel{c}{\leq}$ stands for lower than, up to a multiplicative and an additive constant, and $\stackrel{c}{=}$ stands for equal to, up to a multiplicative and an additive constant.

The detailed proof of Lemma 1 is deferred to the supplemental material. As for the contrastive parts, we show in the supplemental material that the contrastive parts of both the SNCA and MS losses are lower bounded by a common contrastive term that is directly related to $H(\hat{Z})$. We do not mention the contrastive term of the center loss, as it is the cross-entropy loss, which is studied at length in Section 4.

4 Cross-entropy does it all

We now change gear to focus on the widely used classification loss: cross-entropy. On the surface, the cross-entropy may seem unrelated to metric-learning losses as it does not involve pairwise distances. We show that a close relationship exists between the pairwise losses widely used in deep metric learning and the cross-entropy classification loss. This link can be drawn from two different perspectives: one based on an explicit optimization insight, the other on a discriminative view of the mutual information. First, we explicitly demonstrate that the cross-entropy is an upper bound on a new pairwise loss, which has a structure similar to all the metric-learning losses listed in Table 2, i.e., it contains a tightness term and a contrastive term. Hence, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss. Second, we show that, more generally, minimization of the cross-entropy is actually equivalent to maximization of the mutual information, to which we connected various DML losses. These findings indicate that the cross-entropy represents a proxy for maximizing $I(\hat{Z};Y)$, just like pairwise losses, without the need for dealing with the complex sample mining and optimization schemes associated with the latter.

4.1 The pairwise loss behind cross-entropy

Bound optimization

Given a function $f(W)$ that is either intractable or hard to optimize, bound optimizers are iterative algorithms that instead optimize auxiliary functions (upper bounds on $f$). These auxiliary functions are usually more tractable than the original function $f$. Let $t$ be the current iteration index; then $a_t$ is an auxiliary function if:

$$f(W) \leq a_t(W), \;\forall W \qquad\qquad f(W_t) = a_t(W_t)$$

A bound optimizer follows a two-step procedure: first an auxiliary function $a_t$ is computed, then $a_t$ is minimized, such that:

$$W_{t+1} = \arg\min_{W} a_t(W)$$

This iterative procedure is guaranteed to decrease the original function $f$:

$$f(W_{t+1}) \leq a_t(W_{t+1}) \leq a_t(W_t) = f(W_t) \tag{9}$$

Note that bound optimizers are widely used in machine learning. Examples of well-known bound optimizers include the concave-convex procedure (CCCP) [41], expectation maximization (EM) algorithms or submodular-supermodular procedures (SSP) [16]. Such optimizers are particularly used in clustering [27], and more generally in problems involving latent variables optimization.
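The two-step procedure can be illustrated on a one-dimensional toy problem. The sketch below is not taken from the paper: it assumes a hypothetical objective, softplus(w) minus 0.3 w, whose second derivative never exceeds 1/4, so a quadratic with curvature 1/4 that touches the objective at the current iterate is a valid auxiliary function. Minimizing that quadratic in closed form gives the update, and the objective decreases monotonically as in Eq. 9.

```python
import math

CURVATURE_BOUND = 0.25  # upper bound on the second derivative of the toy objective

def f(w):
    """Toy objective: softplus(w) - 0.3 * w, written in a numerically stable form."""
    return math.log1p(math.exp(-abs(w))) + max(w, 0.0) - 0.3 * w

def grad_f(w):
    """Derivative of the toy objective: sigmoid(w) - 0.3."""
    return 1.0 / (1.0 + math.exp(-w)) - 0.3

def auxiliary(w, w_t):
    """Quadratic upper bound a_t(w) on f(w) that is tight at w = w_t."""
    step = w - w_t
    return f(w_t) + grad_f(w_t) * step + 0.5 * CURVATURE_BOUND * step ** 2

def bound_optimize(w0, steps):
    """Repeatedly build a_t at the current iterate and jump to its minimizer."""
    trajectory = [w0]
    for _ in range(steps):
        w_t = trajectory[-1]
        # argmin over w of a_t(w), obtained by setting its derivative to zero.
        trajectory.append(w_t - grad_f(w_t) / CURVATURE_BOUND)
    return trajectory

trajectory = bound_optimize(3.0, steps=30)
values = [f(w) for w in trajectory]
print(trajectory[-1], values[0], values[-1])
```

For this toy objective the minimizer satisfies sigmoid(w) = 0.3, and the iterates approach it while every step lowers the objective.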

Pairwise Cross-Entropy

We now prove that iterative minimization of cross-entropy can be interpreted as an approximate bound optimization of a more complex pairwise loss.

Proposition 1.

Alternately minimizing the cross-entropy loss with respect to the encoder’s parameters and the classifier’s weights $\theta$ can be viewed as an approximate bound-optimization of a Pairwise Cross-Entropy (PCE) loss, which we define as follows:

$$L_{PCE} = \underbrace{-\frac{1}{2\lambda n^2}\sum_{i=1}^{n}\sum_{j:y_j=y_i} z_i^T z_j}_{\text{tightness part}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\log\sum_{k=1}^{K} e^{\frac{1}{\lambda n}\sum_{j=1}^{n} p_{jk}\, z_i^T z_j} - \frac{1}{2\lambda}\sum_{k=1}^{K}\left\|c_k^s\right\|^2}_{\text{contrastive part}} \tag{10}$$

where $c_k^s = \frac{1}{n}\sum_{i=1}^{n} p_{ik} z_i$ represents the soft mean of class $k$, $p_{ik}$ represents the softmax probability of point $z_i$ belonging to class $k$, and $\lambda > 0$ depends on the encoder.

The full proof of Proposition 1 is provided in the supplemental material, but we hereby provide a sketch of it.

Proof.

Considering the usual softmax parametrization for our model’s predictions, $p_{ik} \propto \exp(\theta_k^T z_i)$, the idea is to break the cross-entropy loss into two terms, and artificially add and remove the regularization term $\frac{\lambda}{2}\sum_{k=1}^{K}\theta_k^T\theta_k$:

$$L_{CE} = \underbrace{-\frac{1}{n}\sum_{i=1}^{n}\theta_{y_i}^T z_i + \frac{\lambda}{2}\sum_{k=1}^{K}\theta_k^T\theta_k}_{f_1(\theta)} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\log\sum_{k=1}^{K} e^{\theta_k^T z_i} - \frac{\lambda}{2}\sum_{k=1}^{K}\theta_k^T\theta_k}_{f_2(\theta)} \tag{11}$$

By properly choosing $\lambda$ in Eq. (11), both $f_1$ and $f_2$ become convex functions of $\theta$. For any class $k$, we then show that the optimal values of $\theta_k$ for $f_1$ and $f_2$ are proportional to, respectively, the hard mean and the soft mean of class $k$. By plugging in those optimal values, we can lower bound $f_1$ and $f_2$ individually in Eq. 11 and get the result. ∎
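The first half of this argument is easy to check numerically. The sketch below assumes the bias-free softmax parametrization of Eq. 11 and a hypothetical value of the regularization weight; it minimizes the fit-plus-regularization term in closed form (each class weight becomes the sum of that class's features divided by λn) and compares the resulting minimum with the tightness part of Eq. 10. Function names and data are illustrative only.

```python
def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def f1(theta, features, labels, lam):
    """First term of Eq. 11: negative class scores plus a quadratic regularizer."""
    n = len(features)
    fit = -sum(dot(theta[y], z) for z, y in zip(features, labels)) / n
    reg = 0.5 * lam * sum(dot(t, t) for t in theta)
    return fit + reg

def optimal_theta_f1(features, labels, num_classes, lam):
    """Closed-form minimizer of f1: theta_k = (1 / (lam * n)) * sum of class-k features."""
    n, dim = len(features), len(features[0])
    theta = [[0.0] * dim for _ in range(num_classes)]
    for z, y in zip(features, labels):
        for d in range(dim):
            theta[y][d] += z[d] / (lam * n)
    return theta

def pce_tightness(features, labels, lam):
    """Tightness part of Eq. 10: same-class inner products, scaled by -1 / (2 lam n^2)."""
    n = len(features)
    total = sum(dot(z_i, z_j)
                for z_i, y_i in zip(features, labels)
                for z_j, y_j in zip(features, labels)
                if y_i == y_j)
    return -total / (2.0 * lam * n * n)

# Hypothetical toy embeddings for two classes.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-0.1, 0.9]]
labels = [0, 0, 1, 1]
lam = 0.5
theta_star = optimal_theta_f1(features, labels, num_classes=2, lam=lam)
print(f1(theta_star, features, labels, lam), pce_tightness(features, labels, lam))
```

The two printed values agree, which is how the unary fit term turns into a sum over same-class pairs once the classifier weights are replaced by their optimal values.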

Proposition 1 casts a new light on the cross-entropy loss by explicitly relating it to a new pairwise loss (PCE), following the intuition that the optimal weights of the final layer, i.e., the linear classifier, are related to the centroids of each class in the embedded feature space. Specifically, finding the optimal classifier’s weights for cross-entropy can be interpreted as building an auxiliary function on $L_{PCE}$. Subsequently minimizing cross-entropy w.r.t. the encoder’s weights can be interpreted as the second step of bound optimization on $L_{PCE}$. We exhibit empirical evidence on this bound-type relationship in Section 5.

Similarly to other metric learning losses, PCE contains a tightness part that encourages samples from the same classes to align with one another. Echoing Lemma 1, this tightness term, denoted $T_{PCE}$, is equivalent, up to multiplicative and additive constants, to $T_{center}$ and $T_{contrast}$, when the features are assumed to be normalized:

$$T_{PCE} \stackrel{c}{=} T_{center} \stackrel{c}{=} T_{contrast}$$

Pairwise Cross-Entropy also contains a contrastive part, divided into two terms. The first pushes all samples away from one another, while the second term forces soft means far from the origin. Hence, minimizing the cross-entropy can be interpreted as implicitly minimizing a pairwise loss that has a structure similar to the well-established metric-learning losses in Table 2.

Remark

In [42], competitive results were achieved using a normalized version of the cross-entropy, which directly optimizes the cosine distances and uses an additional temperature parameter $\tau$. Using feature and weight normalization, the loss in [42] simplifies directly to a pairwise loss that contains the center-loss tightness term [35]:

$$L_{norm-CE} \stackrel{c}{=} \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\tau}{2}\left\|\hat{\theta}_{y_i} - \hat{z}_i\right\|^2 + \log\sum_{k=1}^{K}\exp\left(-\frac{\tau}{2}\left\|\hat{\theta}_k - \hat{z}_i\right\|^2\right)\right] \tag{12}$$

where $\hat{\theta}_k = \theta_k / \|\theta_k\|$ and $\hat{z}_i = z_i / \|z_i\|$. This direct link comes from the fact that, on the unit hypersphere, Euclidean and cosine distances are equivalent, up to an additive constant.
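The equivalence invoked in this remark follows from expanding the squared norm of a difference of unit vectors. The short derivation below is a sketch that assumes logits of the form $\tau\,\hat{\theta}_k^T\hat{z}_i$ (normalized weights and features, no bias), which is our reading of the normalized cross-entropy and not a formula quoted from [42]:

```latex
\begin{align*}
\|\hat{\theta}_k - \hat{z}_i\|^2
  &= \|\hat{\theta}_k\|^2 + \|\hat{z}_i\|^2 - 2\,\hat{\theta}_k^T \hat{z}_i
   = 2 - 2\,\hat{\theta}_k^T \hat{z}_i
  \;\;\Longrightarrow\;\;
  \hat{\theta}_k^T \hat{z}_i = 1 - \tfrac{1}{2}\,\|\hat{\theta}_k - \hat{z}_i\|^2, \\[4pt]
-\tau\,\hat{\theta}_{y_i}^T \hat{z}_i + \log\sum_{k=1}^{K} e^{\tau\,\hat{\theta}_k^T \hat{z}_i}
  &= \frac{\tau}{2}\,\|\hat{\theta}_{y_i} - \hat{z}_i\|^2
   + \log\sum_{k=1}^{K} \exp\!\left(-\frac{\tau}{2}\,\|\hat{\theta}_k - \hat{z}_i\|^2\right).
\end{align*}
```

In the second line the additive constants $-\tau$ and $+\tau$ cancel, so the per-sample normalized cross-entropy consists of a squared distance to the weight of the true class, which is the center-loss tightness term, plus a log-sum-exp over squared distances to all class weights.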

4.2 A discriminative view of mutual information

Lemma 2.

Minimizing the conditional cross-entropy loss, denoted by $H(Y;\hat{Y}|\hat{Z})$, is equivalent to maximizing the mutual information $I(\hat{Z};Y)$.

Proof.

Using the discriminative view of MI, we can write:

$$I(\hat{Z};Y) = H(Y) - H(Y|\hat{Z}) \tag{13}$$

The entropy of labels $H(Y)$ is a constant, and can therefore be ignored. From this view of MI, maximization of $I(\hat{Z};Y)$ can only be achieved through a minimization of $H(Y|\hat{Z})$, which depends on our embeddings $\hat{Z}$.

We can relate this term to our cross-entropy loss using the following relation:

$$H(Y;\hat{Y}|\hat{Z}) = H(Y|\hat{Z}) + D_{KL}(Y\,\|\,\hat{Y}\,|\,\hat{Z}) \tag{14}$$

Therefore, while minimizing cross-entropy, we are implicitly minimizing both $H(Y|\hat{Z})$ and $D_{KL}(Y\,\|\,\hat{Y}\,|\,\hat{Z})$. In fact, following Eq. 14, optimization could naturally be decoupled into two steps, in a Majorize-Minimize fashion. One step would consist in fixing the encoder’s weights and only minimizing Eq. 14 w.r.t. the classifier’s weights $\theta$. At this step, $H(Y|\hat{Z})$ would be fixed while $\hat{Y}$ would be adjusted to minimize $D_{KL}(Y\,\|\,\hat{Y}\,|\,\hat{Z})$. Ideally, the KL term would vanish at the end of this step. In the following step, we would minimize Eq. 14 w.r.t. the encoder’s weights, while keeping the classifier fixed. ∎
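The decomposition used in this proof can be verified on a toy discrete example. The sketch below assumes a discrete feature variable with a known joint distribution and a hypothetical table of predicted class probabilities; the paper's setting is continuous, so this only illustrates the algebra of Eq. 14.

```python
import math

def conditional_quantities(joint_zy, predicted):
    """Return (cross_entropy, conditional_entropy, kl) for discrete tables.

    joint_zy[z][y] is the true joint probability p(z, y);
    predicted[z][y] is the classifier's probability q(y | z).
    """
    cross_entropy = 0.0
    cond_entropy = 0.0
    kl = 0.0
    for row, q_row in zip(joint_zy, predicted):
        p_z = sum(row)
        for p_zy, q in zip(row, q_row):
            if p_zy > 0:
                p_y_given_z = p_zy / p_z
                cross_entropy -= p_zy * math.log(q)          # conditional cross-entropy
                cond_entropy -= p_zy * math.log(p_y_given_z)  # H(Y | Z)
                kl += p_zy * math.log(p_y_given_z / q)        # KL between true and predicted
    return cross_entropy, cond_entropy, kl

# Hypothetical joint distribution and an imperfect classifier.
joint = [[0.3, 0.1],
         [0.2, 0.4]]
imperfect = [[0.6, 0.4],
             [0.5, 0.5]]
ce, h, kl = conditional_quantities(joint, imperfect)
print(ce, h + kl, kl)

# A classifier that outputs the true conditionals makes the KL term vanish.
exact = [[0.75, 0.25],
         [1.0 / 3.0, 2.0 / 3.0]]
print(conditional_quantities(joint, exact))
```

With the imperfect classifier the cross-entropy exceeds the conditional entropy by exactly the KL term; with the exact conditionals the gap closes, which corresponds to the step in which the classifier is fitted while the encoder is held fixed.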

The result of Lemma 2 is compelling. Using the discriminative view of mutual information allows us to show that minimizing the cross-entropy loss is equivalent to maximizing the mutual information $I(\hat{Z};Y)$. This information-theoretic argument reinforces our conclusion from Proposition 1 that cross-entropy and the previously described metric learning losses are essentially doing the same job.

4.3 Then why would cross-entropy work better?

In previous sections, we showed that cross-entropy essentially optimizes the same underlying mutual information as the other DML losses. This fact alone is not enough to explain why the cross-entropy consistently achieves better results than the DML losses proposed so far, as shown in Section 6. We argue that the difference must lie in the optimization process. Pairwise losses require careful sample mining and sample weighting strategies to obtain the most informative pairs of samples, especially when considering mini-batches, in order to achieve convergence in a reasonable amount of time, using a reasonable amount of memory. On the other hand, optimizing the cross-entropy is substantially easier as it only involves the minimization of unary terms. Essentially, cross-entropy does it all without the pain of having to deal with pairwise terms. Not only does this make optimization easier, it also makes the implementation simpler, thus increasing its potential applicability to real-world problems.

5 Empirical support of the link between CE and PCE

Throughout this section, we provide simple empirical evidence to support the claimed relation between cross-entropy (CE) minimization and pairwise cross-entropy (PCE) minimization (see subsection 4.1). The main goal is to show that minimizing the CE implicitly minimizes an underlying pairwise loss. Intuitively, the pairwise terms appear through the observation that the optimal weights of the classifier used in cross-entropy are directly related to the hard means of each class in the feature space. This enabled us to link CE to PCE in Proposition 1.

5.1 Simplified PCE

PCE as presented in Proposition 1 requires the computation of the parameter $\lambda$ at every iteration, which in turn requires the computation of the eigenvalues of a matrix at every iteration (cf. subsection A.2). Recall Eq. 21:

$$L_{CE} = \underbrace{-\frac{1}{n}\sum_{i=1}^{n}\theta_{y_i}^T z_i + \frac{\lambda}{2}\sum_{k=1}^{K}\theta_k^T\theta_k}_{f_1(\theta)} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\log\sum_{j=1}^{K} e^{\theta_j^T z_i} - \frac{\lambda}{2}\sum_{k=1}^{K}\theta_k^T\theta_k}_{f_2(\theta)}$$

Therefore, we can remove the dependence upon $\lambda$ by plugging in the same $\theta$ for both $f_1$ and $f_2$ in Eq. 21. We choose to use the hard means, $\theta_k = \frac{1}{n}\sum_{j:y_j=k} z_j$. This yields a simplified version of PCE, which we call SPCE. SPCE and PCE are very similar (the only difference is essentially that PCE was derived after plugging in the soft means instead of the hard means in $f_2$), and both are pairwise losses derived from CE. Note that SPCE is nothing more than the cross-entropy, where we replaced each classifier’s weight $\theta_k$ by the current hard mean of class $k$:

$$L_{SPCE} = \underbrace{-\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j:y_j=y_i} z_i^T z_j}_{\text{tightness}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\log\sum_{k=1}^{K}\exp\left(\frac{1}{n}\sum_{j:y_j=k} z_i^T z_j\right)}_{\text{contrastive}}$$
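The statement that SPCE is the cross-entropy evaluated with hard-mean weights can be checked directly. The sketch below assumes a bias-free linear classifier and takes each class weight to be the sum of that class's features divided by the total number of samples, which is what the SPCE formula implies; the function names and toy data are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def hard_mean_weights(features, labels, num_classes):
    """theta_k = (1 / n) * sum of the features whose label is k."""
    n, dim = len(features), len(features[0])
    theta = [[0.0] * dim for _ in range(num_classes)]
    for z, y in zip(features, labels):
        for d in range(dim):
            theta[y][d] += z[d] / n
    return theta

def cross_entropy(theta, features, labels):
    """Average softmax cross-entropy of a bias-free linear classifier."""
    n = len(features)
    return sum(log_sum_exp([dot(t, z) for t in theta]) - dot(theta[y], z)
               for z, y in zip(features, labels)) / n

def spce(features, labels, num_classes):
    """Simplified pairwise cross-entropy, computed from pairwise inner products only."""
    n = len(features)
    tightness, contrastive = 0.0, 0.0
    for z_i, y_i in zip(features, labels):
        class_scores = [0.0] * num_classes
        for z_j, y_j in zip(features, labels):
            similarity = dot(z_i, z_j)
            class_scores[y_j] += similarity / n
            if y_j == y_i:
                tightness -= similarity / (n * n)
        contrastive += log_sum_exp(class_scores) / n
    return tightness + contrastive

features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-0.1, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, 2]
theta = hard_mean_weights(features, labels, num_classes=3)
print(cross_entropy(theta, features, labels), spce(features, labels, num_classes=3))
```

Both values coincide up to floating-point error: the pairwise form never builds classifier weights explicitly, yet it reproduces the unary cross-entropy computed with the hard means.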

5.2 MNIST experiment

In Fig. 1, we track the evolution of both objectives, $L_{CE}$ and $L_{SPCE}$, when optimizing both the encoder and the classifier with CE only on the MNIST dataset. We use a small CNN composed of four convolutional layers. The optimizer is SGD. Batch size is set to 64, learning rate is set to with cosine annealing, momentum to 0.9 and feature dimension is set to .
Fig. 1 strongly supports Proposition 1. Minimizing cross-entropy indeed results in minimizing an underlying pairwise loss. Notice that even if SPCE is not theoretically guaranteed to be a lower bound on CE (as opposed to PCE), in practice it always remains lower than CE, and both losses tend to be very close towards convergence. Given that both CE and SPCE are essentially the same cross-entropy, applied with different classifier weights $\theta$, this indicates that at every iteration $t$, using the hard mean is a better option than the $\theta$ found with SGD, and that SGD eventually converges to this solution.

6 Experiments

6.1 Metric

The most common metric used in DML is the recall. Most methods, and especially recent ones, use cosine distance to compute the recall for the evaluation. They include $L_2$ normalization of the features in the model [17, 15, 31, 18, 3, 40, 39, 42, 33, 20, 38], which makes cosine and Euclidean distances equivalent. Computing cosine similarity is also more memory efficient and typically leads to better results [21]. For these reasons, the Euclidean distance on non-normalized features has rarely been used for both training and evaluation. In our experiments, $L_2$-normalization of the features during training actually hindered the final performance, which might be explained by the fact that we add a classification layer on top of the feature extractor. Thus, we did not $L_2$-normalize the features during training and reported the recall with both Euclidean and cosine distances.

6.2 Datasets

Four datasets are commonly used in metric learning to evaluate the performance of a given method. These datasets are summarized in Table 3. CUB [28], Cars [11] and SOP [24] datasets are divided into train and evaluation splits. For the evaluation, the recall is computed between each sample of the evaluation set and the rest of the set. In-Shop [13] is divided between a query and a gallery set. The recall is computed between each sample of the query set and the whole gallery set.
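The evaluation protocol for the single-set case (each evaluation sample queried against the rest of the set) can be sketched as follows. This is an illustrative implementation of Recall@k with a pluggable distance, not the authors' evaluation code; it assumes that a query counts as a hit when at least one of its k nearest neighbours shares its label, and all names and data are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance on the raw (non-normalized) features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity; assumes non-zero vectors."""
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - inner / (norm_a * norm_b)

def recall_at_k(features, labels, k, distance):
    """Fraction of samples with at least one same-label sample among their k nearest neighbours."""
    hits = 0
    for i, (z_i, y_i) in enumerate(zip(features, labels)):
        # Rank every other sample by its distance to the query; the query itself is excluded.
        neighbours = sorted(
            (distance(z_i, z_j), y_j)
            for j, (z_j, y_j) in enumerate(zip(features, labels))
            if j != i
        )
        if any(y_j == y_i for _, y_j in neighbours[:k]):
            hits += 1
    return hits / len(features)

# Hypothetical embeddings forming two tight groups.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
labels = [0, 0, 1, 1]
print(recall_at_k(features, labels, k=1, distance=euclidean_distance))
print(recall_at_k(features, labels, k=1, distance=cosine_distance))
```

For a query/gallery split such as In-Shop, the same ranking step would be applied with queries and gallery items drawn from two separate lists instead of excluding the query from a single set.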

6.3 Training specifics

Model architecture and pre-training In the metric learning literature, several model architectures have been used, which historically correspond to the state-of-the-art image classification architectures on the ImageNet dataset [2] with an additional constraint on model size (i.e. being able to train on one or two GPUs in reasonable time). These architectures are GoogLeNet [25] as in [10], BatchNorm-Inception [26] as in [33] and ResNet-50 [6] as in [39]. They have large differences in performance on the ImageNet dataset but the impact on the final performance on DML benchmarks has rarely been studied in controlled experiments. As it is not the focus of our paper, we use the ResNet-50 architecture for our experiments. We acknowledge that some papers may obtain better performance by modifying the architecture (e.g. reducing model stride, performing multi-level fusion of features) but limit our comparison to standard architectures. We implement our experiments using the PyTorch [19] library and initialize the ResNet-50 model with pre-trained weights on ImageNet.

Sampling To the best of our knowledge, all DML papers – including [42] – use a form of pairwise sampling to ensure that, during training, each mini-batch contains a fixed number of classes and samples per class (e.g. mini-batch size of 75 with 3 classes and 25 samples per class in [42]). Deviating from that, we use the common random sampling among all samples (as in most classification training schemes) and set the mini-batch size to 128 in all experiments (contrary to [33] in which the authors use a mini-batch size of 80 for CUB, 1 000 for SOP and did not report for Cars and In-Shop).

Data Augmentation As in many training procedure for deep learning models, data augmentation proves to be of paramount importance to the final performance of the method. In several works, the augmentation strategies used are the same independently of the dataset. However, in practice, we found that they heavily impacted the results and needed to be tuned on a per dataset basis. For CUB, the images are first resized so that their smallest side has a length of 256 (i.e. keeping the aspect ratio) while for Cars, SOP and In-Shop, the images are resized to . Then a patch is extracted at a random location and random size and then resized to . For CUB and Cars random jittering of the brightness, contrast and saturation further improved the results. All of the implementation details can be found in the publicly available code.

Cross-entropy The focus of our experiments is to show that, with careful tuning, it is possible to obtain similar or better performance than most recent DML methods while only using the cross-entropy loss. To train with the cross-entropy loss, we add a linear classification layer (with bias) on top of the feature extraction – similar to many classification models – which produces logits for all the classes present in the training set. Both the weights and biases of this classification layer are initialized to . We also add dropout with a probability of before this classification layer. To further reduce overfitting, we use label smoothing for the target probabilities of the cross-entropy. We set the probability of the true class to and the probabilities of the other classes to with in all our experiments.

Optimizer In most DML papers, the optimizer hyper-parameters are shared across Cars, SOP and In-Shop, with a smaller learning rate typically used for CUB. In our experiments, we found that the best results were obtained by tuning the learning rate on a per-dataset basis. In all experiments, the models are trained with SGD with Nesterov acceleration and a weight decay of which is applied to convolution and fully-connected layers’ weights (but not to biases) as in [8]. For CUB and Cars, the learning rate is set to 0.02 and 0.05 respectively, with a momentum of 0. For both SOP and In-Shop, the learning rate is set to 0.003 with a momentum of 0.99.
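The update rule described above can be sketched on scalar parameters as follows. This is a minimal sketch of one common formulation of SGD with Nesterov momentum, with weight decay applied only to the parameters flagged as weights. The function name, the `decay_mask` argument and the weight decay value are hypothetical; only the learning rate and momentum in the example are the CUB settings quoted in the text.

```python
def sgd_nesterov_step(params, grads, velocity, lr, momentum, weight_decay, decay_mask):
    """One in-place SGD step with Nesterov momentum.

    decay_mask[i] is True for weights (decayed) and False for biases (not decayed).
    """
    for i in range(len(params)):
        g = grads[i]
        if decay_mask[i]:
            g = g + weight_decay * params[i]  # L2 penalty on weights only
        velocity[i] = momentum * velocity[i] + g
        params[i] -= lr * (g + momentum * velocity[i])  # Nesterov look-ahead
    return params, velocity


# Toy example: one weight and one bias, CUB learning rate and momentum from the text.
params = [1.0, 0.5]
grads = [0.2, 0.2]
velocity = [0.0, 0.0]
WEIGHT_DECAY = 1e-4  # hypothetical value, chosen only for this example
sgd_nesterov_step(params, grads, velocity, lr=0.02, momentum=0.0,
                  weight_decay=WEIGHT_DECAY, decay_mask=[True, False])
```

With a momentum of 0 the look-ahead term vanishes and the step is plain SGD, which is the regime used for CUB and Cars.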

Batch normalization Following [33], we freeze all the batch normalization layers in the feature extractor. For Cars, SOP and In-Shop, we found that adding batch normalization – without scaling and bias – on top of the feature extractor improves our final performance and reduces the gap between $\ell_2$ and cosine distances when computing the recall. On CUB however, we obtained the best recall without this batch normalization.

6.4 Results

Results for the experiments are reported in Table 4. We also report the architecture used in the experiments as well as the distance used in the evaluation to compute the recall. $\ell_2$ refers to the Euclidean distance on non-normalized features while cos refers to either the cosine distance or the Euclidean distance on $\ell_2$-normalized features, both of which are equivalent.
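The equivalence mentioned above follows from the fact that, for unit-norm vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so both measures rank a gallery identically. The sketch below checks this on random toy features; the data, dimensions and helper names are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product, which is the cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
gallery = [l2_normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(20)]
query = l2_normalize([random.gauss(0, 1) for _ in range(8)])

# Rank gallery items by decreasing cosine similarity and by increasing distance.
by_cosine = sorted(range(len(gallery)), key=lambda j: -cosine_similarity(query, gallery[j]))
by_euclid = sorted(range(len(gallery)), key=lambda j: squared_euclidean(query, gallery[j]))
```

Since recall only depends on the ranking of the gallery, either measure gives the same recall on normalized features.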

On all datasets, we report state-of-the-art results except on Cars where the only method achieving similar recall uses cross-entropy for training (see subsubsection 4.1.3). We also notice that, contrary to common beliefs, using Euclidean distance can actually be competitive as it also achieves near state-of-the-art results on all four datasets. These results clearly highlight the potential of cross-entropy for metric learning and confirm that this loss can achieve the same objective as pairwise losses.

7 Conclusion

Throughout this paper, we revealed non-obvious relations between the cross-entropy loss, widely adopted in classification tasks, and pairwise losses commonly used in DML. These relations were drawn from two different perspectives. First, cross-entropy minimization was shown to be equivalent to an approximate bound-optimization of a pairwise loss, introduced as Pairwise Cross-Entropy (PCE), which is similar in structure to existing DML losses. Second, adopting a more general information-theoretic view of DML, we showed that both pairwise losses and cross-entropy, in essence, maximize a common mutual information between the embedded features and the labels. This connection becomes particularly apparent when writing the mutual information in both its generative and discriminative views. Hence, we argue that most of the differences in performance observed in previous works come from the optimization process during training. Cross-entropy only contains unary terms, while traditional DML losses rely on the optimization of pairwise terms, which requires substantially more tuning (e.g. mini-batch size, sampling strategy, pair weighting). While we acknowledge that some losses have better optimization properties than others, we empirically showed that the cross-entropy loss can also achieve state-of-the-art results when fairly tuned, highlighting that most improvements have come from enhanced training schemes (e.g. data augmentation, learning rate policies, batch normalization freeze) rather than from intrinsic properties of pairwise losses. We strongly advocate that a carefully tuned cross-entropy be included as a baseline in future works.

Appendix A Proofs

a.1 1

Proof.

Throughout the following proofs, we will use the fact that classes are assumed to be balanced in order to consider $|\mathcal{Z}_k|$, for any class $k$, as a constant. We will also use the feature normalization assumption to connect cosine and Euclidean distances. On the unit hypersphere, we will use that $D^{\cos}_{i,j} = 1 - \frac{\|z_i - z_j\|^2}{2}$.

Tightness terms:

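Let us start by linking center loss to contrastive loss. For any specific class $k$, let us denote the hard mean by $c_k = \frac{1}{|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} z_i$. We can write:

$$
\begin{aligned}
\sum_{z_i \in \mathcal{Z}_k} \|z_i - c_k\|^2
&= \sum_{z_i \in \mathcal{Z}_k} \|z_i\|^2 - 2\,\frac{1}{|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} z_i^T z_j + \frac{1}{|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} z_i^T z_j \\
&= \sum_{z_i \in \mathcal{Z}_k} \|z_i\|^2 - \frac{1}{|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} z_i^T z_j \\
&= \frac{1}{2} \Big[ \sum_{z_i \in \mathcal{Z}_k} \|z_i\|^2 + \sum_{z_j \in \mathcal{Z}_k} \|z_j\|^2 \Big] - \frac{1}{|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} z_i^T z_j \\
&= \frac{1}{2|\mathcal{Z}_k|} \Big[ \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} \|z_i\|^2 + \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} \|z_j\|^2 \Big] - \frac{1}{2|\mathcal{Z}_k|} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k} 2 z_i^T z_j \\
&= \frac{1}{2|\mathcal{Z}_k|} \sum_{z_i, z_j \in \mathcal{Z}_k} \Big( \|z_i\|^2 - 2 z_i^T z_j + \|z_j\|^2 \Big) \\
&= \frac{1}{2|\mathcal{Z}_k|} \sum_{z_i, z_j \in \mathcal{Z}_k} \|z_i - z_j\|^2 \\
&\stackrel{c}{=} \sum_{z_i, z_j \in \mathcal{Z}_k} \|z_i - z_j\|^2
\end{aligned}
$$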

Summing over all classes $k$, we get the desired equivalence. Note that a similar result was already established [27] in the context of K-means clustering, where the setting is very different: there, the optimization is performed on assignment variables, whereas in DML the assignments are already known and the embedding is optimized.
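The identity between the center-based sum and the pairwise sum can be checked numerically. The following minimal sketch uses random toy features for a single class; the data, dimensions and variable names are hypothetical. Note that this step does not require the features to be normalized.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
DIM = 5
Z_k = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(7)]  # features of one class
m = len(Z_k)

# Hard mean of the class.
c_k = [sum(z[d] for z in Z_k) / m for d in range(DIM)]

# Left-hand side: sum of squared distances to the class mean.
center_term = sum(sq_dist(z, c_k) for z in Z_k)

# Right-hand side: sum over all ordered pairs, scaled by 1 / (2 |Z_k|).
pairwise_term = sum(sq_dist(zi, zj) for zi in Z_k for zj in Z_k) / (2 * m)
```

The two quantities agree up to floating-point error, which is the equality used before the constant factor is dropped.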

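Now we link contrastive loss to SNCA loss. For any class $k$, we can write:

$$
\begin{aligned}
-\sum_{z_i \in \mathcal{Z}_k} \log \sum_{z_j \in \mathcal{Z}_k \setminus \{i\}} e^{\frac{D^{\cos}_{i,j}}{\sigma}}
&\stackrel{c}{=} -\sum_{z_i \in \mathcal{Z}_k} \log \Bigg( \frac{1}{|\mathcal{Z}_k| - 1} \sum_{z_j \in \mathcal{Z}_k \setminus \{i\}} e^{\frac{D^{\cos}_{i,j}}{\sigma}} \Bigg) \\
&\leq -\sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k \setminus \{i\}} \frac{D^{\cos}_{i,j}}{(|\mathcal{Z}_k| - 1)\sigma} \\
&\stackrel{c}{=} \sum_{z_i \in \mathcal{Z}_k} \sum_{z_j \in \mathcal{Z}_k \setminus \{i\}} \frac{\|z_i - z_j\|^2}{2\sigma(|\mathcal{Z}_k| - 1)}
\end{aligned}
$$

where we used the convexity of $x \mapsto -\log(x)$ for the first inequality. The proof can be finished by summing over all classes $k$.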
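The Jensen step above can be verified numerically on unit-norm toy features. In this minimal sketch, the temperature `SIGMA`, the data and all names are hypothetical; the check confirms that the log term is bounded by the linear term, and that the linear term equals the pairwise squared-distance term up to the additive constant $|\mathcal{Z}_k| / \sigma$.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(1)
SIGMA = 0.5  # hypothetical temperature, chosen only for this example
Z_k = [l2_normalize([random.gauss(0, 1) for _ in range(4)]) for _ in range(6)]
m = len(Z_k)

log_term = 0.0       # -sum_i log(mean_j exp(D_ij / sigma))
linear_term = 0.0    # -sum_i sum_j D_ij / ((m - 1) sigma)
pairwise_term = 0.0  # sum_i sum_j ||z_i - z_j||^2 / (2 sigma (m - 1))
for i in range(m):
    others = [Z_k[j] for j in range(m) if j != i]
    sims = [dot(Z_k[i], z) for z in others]
    log_term -= math.log(sum(math.exp(s / SIGMA) for s in sims) / (m - 1))
    linear_term -= sum(sims) / ((m - 1) * SIGMA)
    pairwise_term += sum(sq_dist(Z_k[i], z) for z in others) / (2 * SIGMA * (m - 1))
```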

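Finally, we link the MS loss, applied with a margin of 1 (the setting used in the experiments of [33]), to contrastive loss:

$$
\begin{aligned}
\sum_{z_i \in \mathcal{Z}_k} \frac{1}{\alpha} \log \Bigg( 1 + \sum_{z_j \in \mathcal{Z}_k \setminus \{i\}} e^{-\alpha (D^{\cos}_{i,j} - 1)} \Bigg)
&= \sum_{z_i \in \mathcal{Z}_k} \frac{1}{\alpha} \log \sum_{z_j \in \mathcal{Z}_k} e^{-\alpha (D^{\cos}_{i,j} - 1)} \\
&\stackrel{c}{=} \sum_{z_i \in \mathcal{Z}_k} \frac{1}{\alpha} \log \Bigg( \frac{1}{|\mathcal{Z}_k|} \sum_{z_j \in \mathcal{Z}_k} e^{-\alpha (D^{\cos}_{i,j} - 1)} \Bigg) \\
&\geq \frac{1}{|\mathcal{Z}_k|} \sum_{z_i, z_j \in \mathcal{Z}_k} -(D^{\cos}_{i,j} - 1) \\
&\stackrel{c}{=} \sum_{z_i, z_j \in \mathcal{Z}_k} \|z_i - z_j\|^2,
\end{aligned}
$$

where we used the concavity of $\log$ for the first inequality.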
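Two steps of the derivation above can be checked numerically: absorbing the constant 1 into the sum (since $D^{\cos}_{i,i} = 1$ on the unit hypersphere), and the Jensen bound obtained from the concavity of the logarithm. The following minimal sketch uses unit-norm toy features; the scale `ALPHA`, the data and all names are hypothetical.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(2)
ALPHA = 2.0  # hypothetical scale, chosen only for this example
Z_k = [l2_normalize([random.gauss(0, 1) for _ in range(4)]) for _ in range(5)]
m = len(Z_k)

ms_with_one = 0.0   # sum_i (1/alpha) log(1 + sum over j != i)
ms_full_sum = 0.0   # sum_i (1/alpha) log(sum over all j, including j = i)
linear_bound = 0.0  # (1/|Z_k|) sum over i, j of -(D_ij - 1)
for i in range(m):
    sims = [dot(Z_k[i], Z_k[j]) for j in range(m)]
    others = sum(math.exp(-ALPHA * (sims[j] - 1.0)) for j in range(m) if j != i)
    everyone = sum(math.exp(-ALPHA * (s - 1.0)) for s in sims)
    ms_with_one += math.log(1.0 + others) / ALPHA
    ms_full_sum += math.log(everyone) / ALPHA
    linear_bound += sum(1.0 - s for s in sims) / m

# Additive constant introduced when the sum is turned into an average.
jensen_constant = m * math.log(m) / ALPHA
```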

Contrastive terms:

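In this part, we first show that the contrastive terms $E_{MS}$ and $E_{SNCA}$ represent upper bounds on the contrastive term $E$ of the contrastive loss:

$$
\begin{aligned}
E_{MS} &= \frac{1}{\beta n} \sum_{i=1}^{n} \log \Bigg( 1 + \sum_{j: y_j \neq y_i} e^{\beta (D^{\cos}_{ij} - 1)} \Bigg) \\
&\geq \frac{1}{\beta n} \sum_{i=1}^{n} \log \Bigg( \sum_{j: y_j \neq y_i} e^{\beta (D^{\cos}_{ij} - 1)} \Bigg) \\
&\stackrel{c}{\geq} \frac{1}{\beta n} \sum_{i=1}^{n} \sum_{j: y_j \neq y_i} \beta (D^{\cos}_{ij} - 1) \\
&\stackrel{c}{=} -\frac{1}{n} \sum_{i=1}^{n} \sum_{j: y_j \neq y_i} D_{ij}^2 \\
&= E
\end{aligned}
$$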

The link between SNCA and contrastive can be established quite similarly:

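$$
\begin{aligned}
E_{SNCA} &= \frac{1}{n} \sum_{i=1}^{n} \log \Bigg( \sum_{j \neq i} e^{\frac{D^{\cos}_{ij}}{\sigma}} \Bigg) \\
&= \frac{1}{n} \sum_{i=1}^{n} \log \Bigg( \sum_{j \neq i: y_i = y_j} e^{\frac{D^{\cos}_{ij}}{\sigma}} + \sum_{j: y_j \neq y_i} e^{\frac{D^{\cos}_{ij}}{\sigma}} \Bigg) && (15) \\
&\geq \frac{1}{n} \sum_{i=1}^{n} \log \Bigg( \sum_{j: y_j \neq y_i} e^{\frac{D^{\cos}_{ij}}{\sigma}} \Bigg) && (16) \\
&\stackrel{c}{\geq}
\end{aligned}
$$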