Metric Learning with Adaptive Density Discrimination
Abstract
Distance metric learning (DML) approaches learn a transformation to a representation space where distance is in correspondence with a predefined notion of similarity. While such models offer a number of compelling benefits, it has been difficult for these to compete with modern classification algorithms in performance and even in feature extraction.
In this work, we propose a novel approach explicitly designed to address a number of subtle yet important issues which have stymied earlier DML algorithms. It maintains an explicit model of the distributions of the different classes in representation space. It then employs this knowledge to adaptively assess similarity, and achieve local discrimination by penalizing class distribution overlap.
We demonstrate the effectiveness of this idea on several tasks. Our approach achieves state-of-the-art classification results on a number of fine-grained visual recognition datasets, surpassing the standard softmax classifier and outperforming triplet loss by a relative margin of 30-40%. In terms of computational performance, it alleviates training inefficiencies in the traditional triplet loss, reaching the same error in 5-30 times fewer iterations. Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.
1 Introduction
The problem of classification is a mainstay task in machine learning, as it provides us with a coherent metric to gauge progress and juxtapose new ideas against existing approaches. To tackle various other tasks beyond categorization, we often require alternative representations of our inputs which provide succinct summaries of relevant characteristics. Here, classification algorithms often serve as convenient feature extractors: a very popular approach involves training a network for classification on a large dataset, and retaining the outputs of the last layer as inputs transferred to other tasks (Donahue et al., 2013; Sharif Razavian et al., 2014; Qian et al., 2015; Snoek et al., 2015).
However, this paradigm exhibits an intrinsic discrepancy: we have no guarantee that our extracted features are suitable for any task but the particular classification problem from which they were derived. On the contrary: in our classification procedure, we propagate high-dimensional inputs through a complex pipeline, and map each to a single, scalar prediction. That is, we explicitly demand our algorithm to, ultimately, dispose of all information but class label. In the process, we destroy intra- and inter-class variation that would in fact be desirable to maintain in our features.
In principle, we have no reason to compromise: we should be able to construct a representation which is amenable to classification, while still maintaining more fine-grained information. This philosophy motivates the class of distance metric learning (DML) approaches, which learn a transformation to a representation space where distance is in correspondence with a notion of similarity. Metric learning offers a number of benefits: for example, it enables zero-shot learning (Mensink et al., 2013; Chopra et al., 2005), visualization of high-dimensional data (van der Maaten & Hinton, 2008), learning invariant maps (Hadsell et al., 2006), and graceful scaling to instances with millions of classes (Schroff et al., 2015). In spite of this, it has been difficult for DML-based approaches to compete with modern classification algorithms in performance and even in feature extraction.
Admittedly, however, these are two sides of the same coin: a more salient representation should, in theory, enable improved classification performance and features for task transfer. In this work, we strive to reconcile this gap. We introduce Magnet Loss, a novel approach explicitly designed to address subtle yet important issues which have hindered the quality of learnt representations and the training efficiency of a class of DML approaches. In essence, instead of penalizing individual examples or triplets, it maintains an explicit model of the distributions of the different classes in representation space. It then employs this knowledge to adaptively assess similarity, and achieve discrimination by reducing local distribution overlap. It utilizes clustering techniques to simultaneously tackle a number of components in model design, from capturing the distributions of the different classes to hard negative mining. For a particular set of assumptions in its configuration, it reduces to the familiar triplet loss (Weinberger & Saul, 2009).
We demonstrate the effectiveness of this idea on several tasks. Using a soft nearest-cluster metric for evaluation, this approach achieves state-of-the-art classification results on a number of fine-grained visual recognition datasets, surpassing the standard softmax classifier and outperforming triplet loss by a relative margin of 30-40%. In terms of computational performance, it alleviates several training inefficiencies in traditional triplet-based approaches, reaching the same error in 5-30 times fewer iterations. Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.
2 Motivation: Challenges in Distance Metric Learning
We start by providing an overview of challenges which we believe have been impeding the success of existing distance metric learning approaches. These will motivate our work to follow.
Issue #1: predefined target neighbourhood structure
All metric learning approaches must define a relationship between similarity and distance, which prescribes neighbourhood structure. The corresponding training algorithm, then, learns a transformation to a representation space where this property is obeyed. In existing approaches, similarity has been canonically defined a priori by integrating available supervised knowledge. The most common is semantic, informed by class labels. Finer assignment of neighbourhood structure is enabled with access to additional prior information, such as similarity ranking (Wang et al., 2014) and hierarchical class taxonomy (Verma et al., 2012).
In practice, however, the only available supervision is often in the form of class labels. In this case, a ubiquitous solution is to enforce semantic similarity: examples of each class are demanded to be tightly clustered together, far from examples of other classes (for example, Schroff et al. (2015); Norouzi et al. (2012); Globerson & Roweis (2006); Chopra et al. (2005)). However, this collapses intra-class variation and does not embrace shared structure between different classes. Hence, this imposes too strong a requirement, as each class is assumed to be captured by a single mode.
This issue is well-known, and has motivated the notion of local similarity: each example is designated only a small number of target neighbours of the same class (Weinberger & Saul, 2009; Qian et al., 2015; Hadsell et al., 2006). In existing work, these target neighbours are determined prior to training: they are retrieved based on distances in the original input space, and are never updated afterwards. Ironically, this contradicts the fundamental assumption which motivated us to pursue a DML approach in the first place. Namely, we want to learn a metric because we cannot trust distances in our original input space, yet we define target similarity using this exact metric that cannot be trusted! Thus, although this approach has the good intention of encoding similarity into our representation, it harms intra-class variation and inter-class similarity by enforcing unreasonable proximity relationships. Apart from its ramifications for information preservation, achieving predefined separation requires significant effort, which results in inefficiencies during training.
Instead, we ought to define similarity as a function of distances between our representations, which lie in precisely the space sculpted for metric saliency. Since representations are adjusted continuously during training, it then follows that similarity must be defined adaptively. To that end, we must alternate between updating our representations, and refreshing our model which designates similarity as a function of these. Visualizations of representations of different DML approaches can be found in a toy example in Figure 2.
Issue #2: objective formulation
Two very popular classes of DML approaches have stemmed from Triplet Loss (Weinberger & Saul, 2009) and Contrastive Loss (Hadsell et al., 2006). The outlined issues apply to both, but for simplicity of exposition we use triplet loss as an example. During its training, triplets consisting of a seed example, a “positive” example similar to the seed and a “negative” dissimilar example are sampled. Let us denote their representations as $\mathbf{r}_m$, $\mathbf{r}_m^+$ and $\mathbf{r}_m^-$ for $m = 1, \ldots, M$. Triplet loss then demands that the difference of distances of the representation of the seed to the negative and to the positive be larger than some pre-assigned margin constant $\alpha \in \mathbb{R}$:
$$\mathcal{L}_{\text{triplet}}(\Theta) = \frac{1}{M} \sum_{m=1}^{M} \left\{ \| \mathbf{r}_m - \mathbf{r}_m^+ \|_2^2 - \| \mathbf{r}_m - \mathbf{r}_m^- \|_2^2 + \alpha \right\}_+ \tag{1}$$
where $\{\cdot\}_+$ is the hinge function and $\Theta$ the parameters of the map to representation space. The representations are often normalized to achieve scale invariance, and negative examples are mined in order to find margin violators (for example, Schroff et al. (2015); Norouzi et al. (2012)).
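As a concrete illustration of the triplet objective, the following is a minimal NumPy sketch, assuming squared Euclidean distances and an average over the sampled triplets. The function name `triplet_loss` and the toy inputs are ours and serve only as an example.

```python
import numpy as np

def triplet_loss(seed, pos, neg, alpha=1.0):
    """Mean hinge triplet loss; seed, pos and neg each have shape (M, dim)."""
    d_pos = np.sum((seed - pos) ** 2, axis=1)  # squared distance seed -> positive
    d_neg = np.sum((seed - neg) ** 2, axis=1)  # squared distance seed -> negative
    # A triplet contributes only when the negative is not at least alpha farther.
    return float(np.mean(np.maximum(d_pos - d_neg + alpha, 0.0)))

# Toy data: the first triplet satisfies the margin, the second violates it.
seed = np.array([[0.0, 0.0], [0.0, 0.0]])
pos = np.array([[1.0, 0.0], [1.0, 0.0]])
neg = np.array([[3.0, 0.0], [1.0, 0.5]])
loss = triplet_loss(seed, pos, neg, alpha=1.0)
```

Only the second triplet contributes here, which is the behaviour that makes mining of margin violators necessary in practice.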
Objectives formulated in this spirit exhibit a shortsightedness. Namely, penalizing individual pairs or triplets of examples does not employ sufficient contextual insight of neighbourhood structure, and as such different triplet terms are not necessarily consistent. This hinders both the convergence rate as well as performance of these approaches. Moreover, the cubic growth of the number of triplets renders operation on these computationally inefficient.
In contrast to this, it is desirable to instead inform the algorithm of the distributions of the different classes in representation space and their overlaps, and rather manipulate these in a way that is globally consistent. We elaborate on this in the section below.
3 Magnet Loss for Distance Metric Learning
We proceed to design a model to mitigate the identified difficulties. Let us for a moment neglect practical considerations, and envision our ideal DML approach. To start, as concluded at the start of Section 2, we are interested in characterizing similarity adaptively as a function of the current representation structure. We would then utilize this knowledge to pursue local separation as opposed to global: we seek to separate the distributions of different classes in representation space, but do not mind if they are interleaved. As such, let us assume that we have knowledge of the representation distribution of each class at any time during training. Our DML algorithm, then, would discover regions of local overlap between different classes, and penalize these to achieve discrimination.
Such an approach would liberate us from the unimodality assumption and unreasonable prior target neighbourhood assignments — resulting in a more expressive representation which maintains significantly more information. Moreover, employing a loss informed of distributions rather than individual examples would allow for a more coherent training procedure, where the distance metric is adjusted in a way that is globally consistent.
To that end, a natural approach would be to employ clustering techniques to capture these distributions in representation space. Namely, for each class, we will maintain an index of clusters, which we will update continuously throughout training. Our objective, then, would jointly manipulate entire clusters — as opposed to individual examples — in the pursuit of local discrimination. This intuition of cluster attraction and repulsion motivates us to name it Magnet Loss. A caricature illustrating the intuition behind this approach can be found in Figure 3.
In addition to its advantages from a modeling perspective, a clustering-based approach also facilitates computation by enabling efficient hard negative mining. That is, we may perform approximate nearest neighbour retrieval in a two-step process, where we first retrieve nearest clusters, after which we retrieve examples from these clusters.
Finally, as discussed, throughout training we are interested in a more complete characterization of neighbourhood structure. At each iteration, we sample entire local neighbourhoods rather than collections of independent examples (or triplets) as per usual, which significantly improves training efficiency. We elaborate on this in Section 3.2.
3.1 Model formulation
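We proceed to quantify the modeling objectives outlined above. Let us assume we have a training set consisting of $N$ input-label pairs $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$ belonging to $C$ classes. We consider a parametrized map $\mathbf{f}(\cdot\,; \Theta)$ which hashes our inputs to representation space, and denote their representations as $\mathbf{r}_n = \mathbf{f}(\mathbf{x}_n; \Theta)$, $n = 1, \ldots, N$. In this work, we select this transformation as GoogLeNet (Szegedy et al., 2015; Ioffe & Szegedy, 2015), which has been demonstrated to be a powerful CNN architecture; in Section 4 we elaborate on this choice.
We assume that, for each class $c$, we have $K$ cluster assignments $\mathcal{I}_1^c, \ldots, \mathcal{I}_K^c$ obtained via an application of the K-means algorithm. Note that $K$ may vary across classes, but for simplicity of exposition we fix it as uniform. In Section 3.2, we discuss how to maintain this index. To that end, we assume that these assignments have been chosen to minimize intra-cluster distances. Namely, for each class $c$, we have
$$\mathcal{I}_1^c, \ldots, \mathcal{I}_K^c = \arg\min_{I_1^c, \ldots, I_K^c} \sum_{k=1}^{K} \sum_{\mathbf{r} \in I_k^c} \| \mathbf{r} - \boldsymbol{\mu}_k^c \|_2^2, \tag{2}$$
$$\boldsymbol{\mu}_k^c = \frac{1}{|I_k^c|} \sum_{\mathbf{r} \in I_k^c} \mathbf{r}. \tag{3}$$
We further define $C(\mathbf{r})$ as the class of representation $\mathbf{r}$, and $\boldsymbol{\mu}(\mathbf{r})$ as its assigned cluster center.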
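The per-class cluster index described above can be sketched as follows. This is an illustrative NumPy sketch and not the implementation used in our experiments: it runs plain Lloyd iterations from a random initialization (rather than K-means++), and the helper names `nearest_center`, `kmeans` and `build_index` are hypothetical.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for each point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = nearest_center(points, centers)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, nearest_center(points, centers)

def build_index(reps, labels, k):
    """Cluster each class separately and stack the results.

    Returns the centers (C*k, dim), the class of each center (C*k,),
    and for every example the row of its assigned center (N,).
    """
    centers, center_class = [], []
    assigned = np.empty(len(reps), dtype=int)
    for c in np.unique(labels):
        mask = labels == c
        mu, local = kmeans(reps[mask], k)
        assigned[mask] = local + len(center_class)  # offset into stacked centers
        centers.append(mu)
        center_class.extend([c] * k)
    return np.vstack(centers), np.array(center_class), assigned

# Toy data: two classes, each with two well-separated modes.
reps = np.array([[0, 0], [0, 1], [10, 10], [10, 11],
                 [0, 10], [1, 10], [10, 0], [11, 0]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
centers, center_class, assigned = build_index(reps, labels, k=2)
```

Each class keeps its own set of centers, so a bimodal class is represented by two clusters rather than being collapsed to a single mode.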
We proceed to define our objective as follows:
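$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \left\{ -\log \frac{ e^{ -\frac{1}{2\sigma^2} \| \mathbf{r}_n - \boldsymbol{\mu}(\mathbf{r}_n) \|_2^2 - \alpha } }{ \sum_{c \neq C(\mathbf{r}_n)} \sum_{k=1}^{K} e^{ -\frac{1}{2\sigma^2} \| \mathbf{r}_n - \boldsymbol{\mu}_k^c \|_2^2 } } \right\}_+ \tag{4}$$
where $\{\cdot\}_+$ is the hinge function, $\alpha \in \mathbb{R}$ is a scalar, and $\sigma^2 = \frac{1}{N-1} \sum_{\mathbf{r} \in \mathcal{D}} \| \mathbf{r} - \boldsymbol{\mu}(\mathbf{r}) \|_2^2$ is the variance of all examples away from their respective centers. We note that cluster centers sufficiently far from a particular example vanish from its term in the objective. This allows accurately approximating each term with a small number of nearest clusters.
A feature of this objective not usually available in standard distance metric learning approaches is variance standardization. This renders the objective invariant to the characteristic lengthscale of the problem, and allows the model to gauge its confidence of prediction by comparison of intra- and inter-cluster distances. With this in mind, $\alpha$ is then the desired cluster separation gap, measured in units of variance. In our formulation, we may thus interpret $\alpha$ as a modulator of the probability assigned to an example of a particular class under the distribution of another.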
We remark that during model design, an alternative objective we considered is the cluster-based analogue of NCA (see Section 3.4): this objective seems to be a natural approach with a clear probabilistic interpretation. However, we found empirically that this objective does not generalize as well, since it only vanishes in the limit of extreme discrimination margins.
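To make the structure of the objective concrete, the sketch below evaluates it on a fixed set of representations. It assumes each term is a hinge applied to the negative log-ratio of an own-cluster exponential (shifted by the margin) to the sum of exponentials over clusters of other classes, with squared distances scaled by twice the variance about the assigned centers. The function name `magnet_loss` and the toy data are illustrative, and the sketch assumes every example has at least one impostor cluster and that the variance is non-zero.

```python
import numpy as np

def magnet_loss(reps, labels, centers, center_class, assigned, alpha=1.0):
    """reps (N, dim), centers (P, dim), assigned (N,) row of each example's center."""
    n = reps.shape[0]
    d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, P)
    own = d2[np.arange(n), assigned]          # distance to the assigned center
    var = own.sum() / (n - 1)                 # variance about assigned centers
    scaled = -d2 / (2.0 * var)
    # Only clusters belonging to other classes enter the denominator.
    impostor = center_class[None, :] != labels[:, None]
    masked = np.where(impostor, scaled, -np.inf)
    shift = masked.max(axis=1, keepdims=True)  # log-sum-exp shift for stability
    log_denom = shift[:, 0] + np.log(np.exp(masked - shift).sum(axis=1))
    log_numer = -own / (2.0 * var) - alpha
    return float(np.mean(np.maximum(log_denom - log_numer, 0.0)))

labels = np.array([0, 0, 1, 1])
center_class = np.array([0, 1])
assigned = np.array([0, 0, 1, 1])

# Well-separated clusters: every term falls below the hinge.
far_reps = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
far_centers = np.array([[0.0, 1.0], [10.0, 1.0]])
loss_far = magnet_loss(far_reps, labels, far_centers, center_class, assigned)

# Overlapping clusters: the same spread, but the centers are one unit apart.
close_reps = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]])
close_centers = np.array([[0.0, 1.0], [1.0, 1.0]])
loss_close = magnet_loss(close_reps, labels, close_centers, center_class, assigned)
```

In the first configuration distant impostor clusters contribute nothing, whereas in the second the overlap is penalized; both use the same within-cluster variance, so the difference comes purely from the inter-cluster gap.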
3.2 Training procedure
Component #1: neighbourhood sampling
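At each iteration, we sample entire local neighbourhoods rather than a collection of independent examples. Namely, we construct our minibatch in the following way:
1. Sample a seed cluster $I_1 \sim p_{\mathcal{I}}(\cdot)$
2. Retrieve the $M - 1$ nearest impostor clusters $I_2, \ldots, I_M$ of $I_1$
3. For each cluster $I_m$, $m = 1, \ldots, M$, sample $D$ examples $\mathbf{x}_1^m, \ldots, \mathbf{x}_D^m \sim p_{I_m}(\cdot)$
The choices of $p_{\mathcal{I}}(\cdot)$ and $p_{I_m}(\cdot)$ allow us to adapt to the current distributions of examples in representation space. Namely, in our training, these allow us to specifically target and reprimand contested neighbourhoods with large cluster overlap. During training, we cache the losses of individual examples, from which we compute the mean loss $\mathcal{L}_I$ of each cluster $I$. We then choose $p_{\mathcal{I}}(I) \propto \mathcal{L}_I$, and $p_{I_m}(\cdot)$ as a uniform distribution. We remark that these choices work well in practice, but have been made arbitrarily and perhaps can be improved.
Given our samples, we may proceed to construct a stochastic approximation of our objective:
$$\hat{\mathcal{L}}(\Theta) = \frac{1}{MD} \sum_{m=1}^{M} \sum_{d=1}^{D} \left\{ -\log \frac{ e^{ -\frac{1}{2\hat{\sigma}^2} \| \mathbf{r}_d^m - \hat{\boldsymbol{\mu}}_m \|_2^2 - \alpha } }{ \sum_{\hat{\boldsymbol{\mu}} : C(\hat{\boldsymbol{\mu}}) \neq C(\mathbf{r}_d^m)} e^{ -\frac{1}{2\hat{\sigma}^2} \| \mathbf{r}_d^m - \hat{\boldsymbol{\mu}} \|_2^2 } } \right\}_+ \tag{5}$$
where we approximate the cluster means as $\hat{\boldsymbol{\mu}}_m = \frac{1}{D} \sum_{d=1}^{D} \mathbf{r}_d^m$ and variance as $\hat{\sigma}^2 = \frac{1}{MD - 1} \sum_{m=1}^{M} \sum_{d=1}^{D} \| \mathbf{r}_d^m - \hat{\boldsymbol{\mu}}_m \|_2^2$. During training, we backpropagate through this objective, and the full CNN which gave rise to the representations.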
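The sampling steps above can be sketched as follows, assuming the seed cluster is drawn with probability proportional to its cached mean loss, impostors are the nearest clusters of a different class measured between cluster centers, and examples are drawn uniformly without replacement. The function `sample_neighbourhood` and its arguments are hypothetical names introduced for illustration.

```python
import numpy as np

def sample_neighbourhood(centers, center_class, cluster_loss, members, m, d, rng):
    """Return the ids of m clusters (seed first) and d example ids for each.

    members[i] lists the example ids currently assigned to cluster i.
    Assumes at least m - 1 clusters of other classes exist.
    """
    p = cluster_loss / cluster_loss.sum()     # seed prob. proportional to mean loss
    seed = int(rng.choice(len(centers), p=p))
    d2 = ((centers - centers[seed]) ** 2).sum(axis=1)
    d2[center_class == center_class[seed]] = np.inf  # impostors are other classes
    impostors = np.argsort(d2)[: m - 1]
    clusters = [seed] + [int(i) for i in impostors]
    examples = [rng.choice(members[c], size=d, replace=False) for c in clusters]
    return clusters, examples

# Toy index: four clusters on a line, two per class.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
center_class = np.array([0, 0, 1, 1])
cluster_loss = np.array([1.0, 0.0, 0.0, 0.0])  # only cluster 0 is contested
members = [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
rng = np.random.default_rng(0)
clusters, examples = sample_neighbourhood(
    centers, center_class, cluster_loss, members, m=2, d=2, rng=rng)
```

Cluster 1 is closer to the seed than cluster 2, but it shares the seed's class and is therefore skipped when retrieving impostors.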
Component #2: cluster index
As mentioned above, we maintain for each class a K-means index which captures its distribution in representation space during training. We initialize each index with K-means++ (Arthur & Vassilvitskii, 2007), and refresh it periodically. To obtain the representations to be indexed, we pause training and compute the forward passes of all inputs in the training set. The computational cost of refreshing the cluster index is significantly smaller than the cost of training the CNN itself: it is not done frequently, it only requires forward passes of the inputs, and the relative cost of K-means clustering is negligible.
It may seem that freezing the training is unnecessarily computationally expensive. Note that we also explored the alternative strategy of caching the representations of each minibatch on-the-fly during training. However, we found that it is critical to maintain the true neighbourhood structure, where the representations are all computed at the same stage of learning. We empirically observed that since the representation space is changing continuously during training, indexing examples whose representations were computed at different times resulted in incorrect inference of neighbourhood structure, which in turn irreparably damaged nearest impostor assessment.
Improvement of training efficiency
The proposed approach offers a number of benefits which compound to considerably enhance training efficiency, as can be seen empirically in Section 4.1. First, one of the main criticisms of triplet-based approaches is the cubic growth of the number of triplets. Manipulating entire clusters of examples, on the other hand, significantly improves this complexity, as this requires far fewer pairwise distance evaluations. Second, operating on entire cluster neighbourhoods also permits information recycling: we may jointly separate all clusters from one another at once, whereas an approach based on independent sampling would require far more repetitions of the same examples. Finally, penalizing clusters of points away from one another leads to a more coherent adjustment of each point, whereas different triplet terms may not necessarily be consistent with one another.
3.3 Evaluation procedure
The evaluation procedure is consistent with the objective formulation: we assign the label of each example as a function of its representation’s softmax similarities to its $L$ closest clusters, say $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_L$. More precisely, we choose label $c_n^*$ as
$$c_n^* = \arg\max_{c = 1, \ldots, C} \frac{ \sum_{\boldsymbol{\mu}_l : C(\boldsymbol{\mu}_l) = c} e^{ -\frac{1}{2\sigma^2} \| \mathbf{r}_n - \boldsymbol{\mu}_l \|_2^2 } }{ \sum_{l=1}^{L} e^{ -\frac{1}{2\sigma^2} \| \mathbf{r}_n - \boldsymbol{\mu}_l \|_2^2 } } \tag{6}$$
where $\sigma$ is a running average of stochastic estimates $\hat{\sigma}$ computed during training.
This can be thought of as “$k$-nearest-cluster” (kNC), a variant of a soft kNN classifier. This has the added benefit of reducing the complexity of nearest neighbour evaluation from being a function of the number of examples to the number of clusters. Here, the lengthscale $\sigma$ autonomously characterizes local neighbourhood radius, and as such implies how to sensibly choose $L$. In general, we found that performance improves monotonically with $L$, as the soft classification is able to make use of additional neighbourhood information. At some point, however, retrieving additional nearest neighbours is clearly of no further utility, since these are much farther away than the lengthscale defined by $\sigma$. In practice we use $L = 128$ for all experiments in this work.
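The nearest-cluster evaluation rule can be sketched as below, assuming each retrieved cluster votes for its class with a Gaussian weight of its squared distance to the example, scaled by the running variance estimate. The function `knc_predict` and its argument names are ours; the shared normalizer is dropped because it does not affect the argmax.

```python
import numpy as np

def knc_predict(rep, centers, center_class, var, n_nearest):
    """Soft nearest-cluster label for a single representation of shape (dim,)."""
    d2 = ((centers - rep) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:n_nearest]               # retrieve closest clusters
    # Subtracting the smallest distance rescales all weights equally.
    w = np.exp(-(d2[nearest] - d2[nearest].min()) / (2.0 * var))
    classes = np.unique(center_class[nearest])
    scores = np.array([w[center_class[nearest] == c].sum() for c in classes])
    return int(classes[scores.argmax()])

centers = np.array([[0.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
center_class = np.array([0, 1, 1])
rep = np.array([1.9, 0.0])

# Small lengthscale: the closest cluster dominates.
label_sharp = knc_predict(rep, centers, center_class, var=1.0, n_nearest=3)
# Large lengthscale: two slightly farther clusters of class 1 outvote it.
label_soft = knc_predict(rep, centers, center_class, var=100.0, n_nearest=3)
# Retrieving a single cluster reduces to a hard nearest-cluster rule.
label_hard = knc_predict(rep, centers, center_class, var=100.0, n_nearest=1)
```

The toy example shows how the variance sets the effective neighbourhood radius: clusters much farther than this lengthscale receive negligible weight, so retrieving more of them changes little.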
3.4 Relation to existing models
Triplet Loss
Our objective proposed in Equation 4 has the nice property that it reduces to the familiar triplet loss under a particular set of assumptions. Specifically, let us assume that we approximate each neighbourhood with a single impostor cluster, i.e., $M = 2$. Let us further assume that we approximate the seed cluster with merely 2 samples, and the impostor cluster with one. We further simplify by ignoring the variance normalization. Our objective then exactly reduces to triplet loss for a pair of triplets “symmetrized” for the two positive examples:
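$$\hat{\mathcal{L}}(\Theta) = \frac{1}{2} \sum_{d=1}^{2} \left\{ \| \mathbf{r}_d^1 - \hat{\boldsymbol{\mu}}_1 \|_2^2 - \| \mathbf{r}_d^1 - \hat{\boldsymbol{\mu}}_2 \|_2^2 + \alpha \right\}_+ \tag{7}$$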
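The step behind this reduction can be written out explicitly. The derivation below is a sketch under the stated simplifications, assuming each term of the objective is a hinge applied to the negative log-ratio of an own-cluster exponential (shifted by the margin) to a sum of impostor-cluster exponentials; here $\mathbf{r}$ is a seed sample, $\hat{\boldsymbol{\mu}}_1$ the seed cluster mean and $\hat{\boldsymbol{\mu}}_2$ the single impostor.

```latex
% With a single impostor cluster the denominator has one term,
% so the log-ratio collapses to a difference of squared distances:
-\log \frac{ e^{ -\| \mathbf{r} - \hat{\boldsymbol{\mu}}_1 \|_2^2 - \alpha } }
           { e^{ -\| \mathbf{r} - \hat{\boldsymbol{\mu}}_2 \|_2^2 } }
  = \| \mathbf{r} - \hat{\boldsymbol{\mu}}_1 \|_2^2
  - \| \mathbf{r} - \hat{\boldsymbol{\mu}}_2 \|_2^2 + \alpha .

% With two seed samples r_1, r_2 and \hat{\mu}_1 = (r_1 + r_2) / 2,
% the distance to the seed mean is a rescaled pairwise distance:
\| \mathbf{r}_1 - \hat{\boldsymbol{\mu}}_1 \|_2^2
  = \tfrac{1}{4} \| \mathbf{r}_1 - \mathbf{r}_2 \|_2^2 .
```

Applying the hinge to the first identity gives a triplet term in which the other seed sample plays the role of the positive (up to the constant factor in the second identity) and the impostor sample plays the role of the negative.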
Neighbourhood Components Analysis
Neighbourhood Components Analysis (NCA) and its extensions (Goldberger et al., 2004; Salakhutdinov & Hinton, 2007; Min et al., 2010) have been designed in a similar spirit to Magnet Loss. The NCA objective is given by
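$$\mathcal{L}_{\text{NCA}}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} -\log \frac{ \sum_{n' \neq n \,:\, C(\mathbf{r}_{n'}) = C(\mathbf{r}_n)} e^{ -\| \mathbf{r}_n - \mathbf{r}_{n'} \|_2^2 } }{ \sum_{n' \neq n} e^{ -\| \mathbf{r}_n - \mathbf{r}_{n'} \|_2^2 } } \tag{8}$$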
However, this formulation does not address a number of concerns both in modeling and implementation. As an example, it does not touch on minibatch sampling in large datasets. Even if we maintain a nearest neighbour index, if we naïvely retrieve the nearest neighbours for each example, they are all going to be of different classes with high probability.
Nearest Class Mean
Nearest Class Mean (Mensink et al., 2013) is cleverly designed for scalable DML. In this approach, the mean vectors of the examples in their raw input form are computed and fixed for each class. A linear transformation is then learned to maximize the softmax distance of each example to the cluster center of its class:
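$$\mathcal{L}_{\text{NCM}}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} -\log \frac{ e^{ -\| \mathbf{r}_n - \boldsymbol{\mu}_{C(\mathbf{r}_n)} \|_2^2 } }{ \sum_{c=1}^{C} e^{ -\| \mathbf{r}_n - \boldsymbol{\mu}_c \|_2^2 } } \tag{9}$$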
The authors further generalize this to Nearest Class Multiple Centroids (NCMC), where for each class, centroids are computed with K-means. Magnet shares many ideas with NCMC, but these approaches differ in a number of important ways. For NCMC, the centroids are computed on the raw inputs and are fixed prior to training, rather than updated continuously on a learnt representation. It is also not clear how to extend this to more expressive transformations (such as CNNs) to representation space, but this step is required in order to enjoy the success of deep learning approaches in a DML setting.




4 Experiments
We run all experiments on a cluster of Tesla K40M GPUs. All parametrized maps to representation space are chosen as GoogLeNet with batch normalization (Ioffe & Szegedy, 2015). We add an additional fully-connected layer to map to a representation space of dimension 1024.
We find that it is useful to warm-start any DML optimization with the weights of a partly-trained standard softmax classifier. It is important to not use weights of a net trained to completion, as this would result in information dissipation and as such defeat the purpose of pursuing DML in the first place. Hence, we initialize all models with the weights of a net trained on ImageNet (Russakovsky et al., 2015) for 3 epochs only. We augment all experiments with random input rescaling of up to 30%, followed by jittering back to the original input size of $224 \times 224$. At test-time we evaluate an input by averaging the outputs of 16 random samples drawn from this augmentation distribution.
4.1 Fine-grained classification
We validate the classification efficacy of the learnt representations on a number of popular fine-grained visual categorization tasks, including Stanford Dogs (Khosla et al., 2011), Oxford-IIIT Pet (Parkhi et al., 2012) and Oxford 102 Flowers (Nilsback & Zisserman, 2008) datasets. We also include results on ImageNet attributes, a dataset described in Section 4.2.
We seek to compare optimal performances of the different model spaces, and so perform hyperparameter search on validation error generated by 3 classes of objectives: a standard softmax classifier, triplet loss, and Magnet Loss. The hyperparameter search setup, including optimal configurations for each experiment, is specified in full detail in Appendix B. In general, for Magnet Loss we observed empirically that it is beneficial to increase the number of clusters per minibatch to around $M = 12$ at the cost of reducing the number of retrieved examples per cluster to $D = 4$. The optimal gap has in general been $\alpha \approx 1$, and the value of $K$ varied as a function of dataset cardinality.
The classification results can be found in Table 4. We use soft kNN to evaluate triplet loss error and kNC (see Section 3.3) for Magnet Loss. However, for completeness of comparison, in Figure 4(e) we present evaluations of all learnt representations under both kNN and kNC.
It can be observed that Magnet Loss outperforms the traditional triplet loss by a considerable margin. It is also able to surpass the standard softmax classifier in most cases: while the margin is not significant, note that the true win here is in terms of learning representations much more suitable for task transfer, as validated in the following subsections.
In Figure 5, it can be seen that Magnet Loss reaches the triplet loss asymptotic error rate 5-30 times faster. The prohibitively slow convergence of triplet loss has been well-known in the community. Magnet Loss achieves this speedup as it mitigates some of the training-time inefficiencies featured by triplet loss presented throughout Section 2 and the end of Section 3.2. For fairness of comparison, we remark that softmax converges faster than Magnet; however, this comes at the cost of a less informative representation.
4.2 Attribute distribution
We expect Magnet to sculpt a more expressive representation, which enables similar examples of different classes to be close together, and dissimilar examples of the same class to be far apart; this can be seen qualitatively in Figure 2. In order to explore this hypothesis quantitatively, after training is complete we examine the attributes of neighbouring examples as a proxy for assessment of similarity. We indeed find the distributions of these attributes to be more concentrated for Magnet.
We obtain attribute labels from the Object Attributes dataset (Russakovsky & Fei-Fei, 2010). This provides 25 attribute annotations for 90 classes of an updated version of ImageNet, with about 25 annotated examples per class. Attributes include visual properties such as “striped”, “brown”, “vegetation” and so on; examples of these can be found in Figure 6(a). Annotations are assigned individually for each input, which allows capturing intra-class variation and inter-class invariance.
We train softmax, triplet and Magnet Loss objectives on a curated dataset we refer to as ImageNet Attributes. This dataset contains 116,236 examples, and comprises all examples of each of the 90 ImageNet classes for which any attribute annotations are available: in Appendix C we describe it in detail. We emphasize that we do not employ any attribute information during training. At convergence, we measure attribute concentration by computing mean attribute precision as a function of neighbourhood size. Specifically, for each example and attribute, we compute over different neighbourhood cardinalities the fraction of neighbours also featuring this attribute.
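One way to compute such a measure is sketched below. This is our reading of the description rather than the exact evaluation code: for every example and every attribute it carries, we take the fraction of its nearest neighbours in representation space that carry the same attribute, and average these fractions. The function `mean_attribute_precision` and the toy data are illustrative.

```python
import numpy as np

def mean_attribute_precision(reps, attrs, n_neighbours):
    """reps (N, dim) representations; attrs (N, A) binary attribute matrix."""
    d2 = ((reps[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                      # exclude the example itself
    nn = np.argsort(d2, axis=1)[:, :n_neighbours]     # (N, n) neighbour ids
    frac = attrs[nn].mean(axis=1)                     # (N, A) neighbour fractions
    has = attrs.astype(bool)                          # attributes each example carries
    return float(frac[has].mean())

# Toy data: two tight pairs on a line, each pair sharing one attribute.
reps = np.array([[0.0], [1.0], [10.0], [11.0]])
attrs = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
precision_1 = mean_attribute_precision(reps, attrs, n_neighbours=1)
precision_2 = mean_attribute_precision(reps, attrs, n_neighbours=2)
```

Sweeping `n_neighbours` yields a curve of precision against neighbourhood size; a representation with concentrated attributes keeps this value high as the neighbourhood grows.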
This result can be found in Figure 6(d). Magnet Loss outperforms both softmax and triplet losses by a reasonable margin in terms of attribute concentration, with consistent gains of 25-50% over triplet and 10-25% over softmax across neighbourhood sizes. It may seem surprising that softmax surpasses triplet, an approach specifically crafted for distance metric learning. However, note that while the softmax classifier requires high relative projection onto the hyperplane associated with each class, it leaves some flexibility for information retention in its high-dimensional nullspace. Triplet loss, on the other hand, demands separation based on an imprecise assessment of similarity, resulting in poor proximity of similar examples of different classes.
Magnet’s attribute concentration can also be observed visually in Figures 6(b) and 6(c), presenting the t-SNE projections from Figure 2 overlaid with attribute distribution. It can be seen qualitatively that the Magnet attributes are concentrated in particular areas of space, irrespective of class.
4.3 Hierarchy recovery
In this experiment, we are interested in seeing whether each algorithm is able to recover a latent class hierarchy, provided only coarse superclasses. To test this, we randomly pair all classes of ImageNet Attributes, and collapse each pair under a single label. We then train on the corrupted labels, and check whether the finer-grained class labels may be recovered from the learnt representations.
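The label corruption step can be sketched as follows, assuming an even number of classes so that every class receives exactly one partner. The function `pair_classes` is a hypothetical helper, and the random fine labels stand in for the true dataset labels.

```python
import numpy as np

def pair_classes(n_classes, rng):
    """Randomly pair classes; returns an array mapping fine class -> superclass."""
    perm = rng.permutation(n_classes)
    coarse = np.empty(n_classes, dtype=int)
    # Consecutive entries of the permutation share one superclass label.
    coarse[perm] = np.arange(n_classes) // 2
    return coarse

rng = np.random.default_rng(0)
fine_to_coarse = pair_classes(90, rng)            # 90 classes -> 45 superclasses
fine_labels = rng.integers(0, 90, size=1000)      # placeholder fine labels
coarse_labels = fine_to_coarse[fine_labels]       # labels seen during training
```

Only `coarse_labels` would be exposed to the training objective; the fine labels are held back and used afterwards to test whether the two classes inside each superclass remain distinguishable in representation space.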
The results can be found in Table 4(f). Magnet is able to identify intra-class representation variation, an essential property for success in this task. Softmax also achieves surprisingly competitive results, suggesting that meaningful variation is nevertheless captured within the nullspace of its last layer. For triplet loss, on the other hand, target neighbourhoods are designated prior to training, and as such it is not able to adaptively discriminate finer structure within superclasses.
5 Discussion and Future Work
In this work, we highlighted a number of difficulties in a class of DML algorithms, and sought to address them. We validated the effectiveness of our approach under a variety of metrics, ranging from classification performance to convergence rate to attribute concentration.
In this paper, we anchored in place a number of parameters: we chose the number of clusters per class as uniform across classes, and refreshed our representation index at a fixed rate. We believe that adaptively varying these during training can enhance performance and facilitate computation.
Another interesting line of work would be to replace the density estimation and indexing component with an approach more sophisticated than K-means. One natural candidate would be a tree-based algorithm. This would enable more efficient and more accurate neighbourhood retrieval.
Acknowledgements
We are grateful to Laurens van der Maaten, Florent Perronnin, Rob Fergus, Yaniv Taigman and others at Facebook AI Research for meaningful discussions and input. We thank Soumith Chintala, Alexey Spiridonov, Kevin Lee and everyone else at the FAIR Engineering team for unbreaking things at light speed.
Appendix A t-SNE image maps for typical Magnet and triplet representation spaces
Appendix B Hyperparameter Tuning Specifications and Optimal Configurations
Here we describe in detail the hyperparameter search setups for the different experiments, and the optimal configuration for each.
For all models, we tune optimization hyperparameters consisting of learning rate and its annealing factor which we apply every epoch. We fix the momentum as 0.9 for all experiments. For the smaller datasets, we refresh our index every epoch, and for ImageNet Attributes every 1000 iterations.
For Magnet Loss, we additionally tune the separation margin $\alpha$, the number of nearest clusters per minibatch $M$, the number of examples per cluster $D$, and the number of clusters per class $K$, which we take to be the same for all classes (the number of examples per minibatch is upper-bounded by 48 due to memory constraints). Note that we permit choices of $M$ and $D$ which, as discussed in Section 3.4, revert this back to triplet loss: hence, we expect this choice to be discovered if triplet loss is in fact the optimal choice of distance metric learning loss of this class. For triplet loss, we tune the separation margin $\alpha$, the fraction of nearest impostors retrieved in each minibatch and the neighbourhood size retrieved for kNN evaluation.
We now specify the optimal hyperparameter configurations for the different datasets and model spaces, as found empirically via random search. The learning rate annealing factor is marked as “N/A” for smaller datasets, where we do not anneal the learning rate at all.
Appendix C Specifications for ImageNet Attributes Dataset
To curate this dataset, we first matched the annotated examples in the Object Attributes dataset (Russakovsky & Fei-Fei, 2010) to examples in the training set of ImageNet. The ImageNet Attributes training and validation sets then comprise all examples of all classes for which annotated examples exist.
Below we list these classes.
References
 Angelova & Long (2014) Angelova, Anelia and Long, Philip M. Benchmarking large-scale fine-grained categorization. In IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, March 24-26, 2014, pp. 532–539, 2014. doi: 10.1109/WACV.2014.6836056.
 Angelova & Zhu (2013) Angelova, Anelia and Zhu, Shenghuo. Efficient object detection and segmentation for fine-grained recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pp. 811–818, 2013. doi: 10.1109/CVPR.2013.110.
 Arthur & Vassilvitskii (2007) Arthur, David and Vassilvitskii, Sergei. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 978-0-898716-24-5.
 Chopra et al. (2005) Chopra, Sumit, Hadsell, Raia, and LeCun, Yann. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1, CVPR ’05, pp. 539–546, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: 10.1109/CVPR.2005.202.
 Donahue et al. (2013) Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. 2013.
 Gavves et al. (2013) Gavves, E., Fernando, B., Snoek, C. G. M., Smeulders, A. W. M., and Tuytelaars, T. Fine-grained categorization by alignments. In IEEE International Conference on Computer Vision, 2013.
 Gavves et al. (2015) Gavves, Efstratios, Fernando, Basura, Snoek, Cees G. M., Smeulders, Arnold W. M., and Tuytelaars, Tinne. Local alignments for fine-grained categorization. International Journal of Computer Vision, 111(2):191–212, 2015. ISSN 0920-5691. doi: 10.1007/s11263-014-0741-5.
 Globerson & Roweis (2006) Globerson, Amir and Roweis, Sam T. Metric learning by collapsing classes. In Weiss, Y., Schölkopf, B., and Platt, J.C. (eds.), Advances in Neural Information Processing Systems 18, pp. 451–458. MIT Press, 2006. URL http://papers.nips.cc/paper/2947-metric-learning-by-collapsing-classes.pdf.
 Goldberger et al. (2004) Goldberger, Jacob, Roweis, Sam, Hinton, Geoff, and Salakhutdinov, Ruslan. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, pp. 513–520. MIT Press, 2004.
 Hadsell et al. (2006) Hadsell, Raia, Chopra, Sumit, and LeCun, Yann. Dimensionality reduction by learning an invariant mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’06). IEEE Press, 2006.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015.
 Khosla et al. (2011) Khosla, Aditya, Jayadevaprakash, Nityananda, Yao, Bangpeng, and Fei-Fei, Li. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
 Mensink et al. (2013) Mensink, Thomas, Verbeek, Jakob, Perronnin, Florent, and Csurka, Gabriela. Distance-based image classification: Generalizing to new classes at near zero cost. Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013.
 Min et al. (2010) Min, Martin Renqiang, van der Maaten, Laurens, Yuan, Zineng, Bonner, Anthony J., and Zhang, Zhaolei. Deep supervised t-distributed embedding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 791–798, 2010.
 Murray & Perronnin (2014) Murray, Naila and Perronnin, Florent. Generalized max pooling. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 2473–2480, 2014. doi: 10.1109/CVPR.2014.317.
 Nilsback & Zisserman (2008) Nilsback, M-E. and Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
 Norouzi et al. (2012) Norouzi, Mohammad, Fleet, David, and Salakhutdinov, Ruslan R. Hamming distance metric learning. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1061–1069. Curran Associates, Inc., 2012.
 Parkhi et al. (2012) Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 Qian et al. (2015) Qian, Qi, Jin, Rong, Zhu, Shenghuo, and Lin, Yuanqing. Fine-grained visual categorization via multi-stage metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 Russakovsky & Fei-Fei (2010) Russakovsky, Olga and Fei-Fei, Li. Attribute learning in large-scale datasets. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, 2010.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 1–42, April 2015. doi: 10.1007/s11263-015-0816-y.
 Salakhutdinov & Hinton (2007) Salakhutdinov, Ruslan and Hinton, Geoffrey E. Learning a nonlinear embedding by preserving class neighbourhood structure. In Meila, Marina and Shen, Xiaotong (eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), volume 2, pp. 412–419. Journal of Machine Learning Research - Proceedings Track, 2007.
 Schroff et al. (2015) Schroff, Florian, Kalenichenko, Dmitry, and Philbin, James. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 Sharif Razavian et al. (2014) Sharif Razavian, Ali, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. CNN features off-the-shelf: An astounding baseline for recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
 Snoek et al. (2015) Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Md. Mostofa Ali, Prabhat, and Adams, Ryan P. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.
 Szegedy et al. (2015) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR 2015, 2015.
 van der Maaten & Hinton (2008) van der Maaten, L.J.P. and Hinton, G.E. Visualizing high-dimensional data using t-SNE. 2008.
 Verma et al. (2012) Verma, Nakul, Mahajan, Dhruv, Sellamanickam, Sundararajan, and Nair, Vinod. Learning hierarchical similarity metrics. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2280–2287. IEEE, 2012.
 Wang et al. (2014) Wang, Jiang, Song, Yang, Leung, Thomas, Rosenberg, Chuck, Wang, Jingbin, Philbin, James, Chen, Bo, and Wu, Ying. Learning fine-grained image similarity with deep ranking. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1386–1393. IEEE, 2014.
 Weinberger & Saul (2009) Weinberger, Kilian Q. and Saul, Lawrence K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009. ISSN 1532-4435.
 Xie et al. (2015) Xie, Saining, Yang, Tianbao, Wang, Xiaoyu, and Lin, Yuanqing. Hyper-class augmented and regularized deep learning for fine-grained image classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 2645–2654, 2015. doi: 10.1109/CVPR.2015.7298880.