Making Classification Competitive for Deep Metric Learning
Deep metric learning aims to learn a function mapping image pixels to embedding feature vectors that model the similarity between images. The majority of current approaches are non-parametric, learning the metric space directly through the supervision of similar (pairs) or relatively similar (triplets) sets of images. A difficult challenge for training these approaches is mining informative samples of images, as the metric space is learned with only the local context present within a single mini-batch. Alternative approaches use parametric metric learning to eliminate the need for sampling through the supervision of images to proxies. Although this simplifies optimization, such proxy-based approaches have lagged behind in performance. In this work, we demonstrate that a standard classification network can be transformed into a variant of proxy-based metric learning that is competitive against non-parametric approaches across a wide variety of image retrieval tasks. We address key challenges in proxy-based metric learning such as performance under extreme classification and describe techniques to stabilize and learn higher-dimensional embeddings. We evaluate our approach on the CARS-196, CUB-200-2011, Stanford Online Products, and In-Shop datasets for image retrieval and clustering. Finally, we show that our softmax classification approach can learn high-dimensional binary embeddings that achieve new state-of-the-art performance on all datasets evaluated, with a memory footprint that is the same or smaller than that of competing approaches.
Learning image representations, also known as image embeddings, is a core problem in a variety of applications including face recognition, fine-grained retrieval, clustering, and visual search. Standard deep neural network metric learning approaches learn image representations through the local relationships between images in the form of pairs or triplets. Similarity-style supervision is used to train the representation such that similar images are close in the embedding space and dissimilar images are far apart.
A core challenge with metric learning is mining informative samples for training. As described in prior work, negatives that are too hard can destabilize training, while negatives that are too easy yield triplets with near-zero loss that contribute little to learning. The widely adopted semi-hard sampling technique provides a balanced sampling strategy. Recent methods focus on addressing this sampling problem, many of which utilize the relationships of all images within the batch to form informative samples. These methods typically require a large batch size so that informative samples can be selected within the batch. In practice, batch size is constrained by hardware memory. Therefore, as the dataset size grows, one still faces the challenge of a diminishing probability that a randomly sampled batch contains any informative samples.
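The semi-hard selection rule described above can be sketched in a few lines of NumPy (a simplified illustration; the function name and the margin value are ours, not part of the original formulation):

```python
import numpy as np

def semi_hard_negatives(dists, labels, margin=0.2):
    """For each (anchor, positive) pair, pick a semi-hard negative:
    a negative that is farther than the positive but within the margin,
    i.e. d(a, p) < d(a, n) < d(a, p) + margin.

    dists:  (N, N) pairwise distance matrix for the batch
    labels: (N,) integer class labels
    Returns a list of (anchor, positive, negative) index triplets.
    """
    n = len(labels)
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = dists[a, p]
            # candidate negatives: different class, inside the semi-hard band
            neg_mask = (labels != labels[a]) & (dists[a] > d_ap) & (dists[a] < d_ap + margin)
            candidates = np.flatnonzero(neg_mask)
            if len(candidates) > 0:
                triplets.append((a, p, candidates[0]))
    return triplets
```

Note that when the batch holds no negative inside the band for a given pair, the pair simply produces no triplet, which is exactly the diminishing-probability problem described above.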
Another challenge with metric learning is the difficulty of optimization. Since each optimization step depends only on the local context within a mini-batch, training often takes a long time to converge or converges to a local optimum. In contrast to triplet approaches, the proxy-based approach approximates each semantic class using a proxy and uses all proxies to provide global context at each training iteration. Such a training setup is similar to the standard classification task and eliminates the challenge of sampling informative examples. Though proxy-based metric learning is attractive because it simplifies training by removing sampling, such approaches have scalability limitations similar to extreme classification and have been outperformed by recent advancements in metric learning.
The goal of this paper is to investigate the effectiveness of softmax classification for metric learning. Our major contributions are as follows: 1) We empirically demonstrate that embeddings trained using a classification-based approach can achieve state-of-the-art performance on public benchmarks while avoiding many of the issues that make training embeddings using metric-based approaches difficult. 2) We examine the connection between standard classification using the softmax cross-entropy loss and the proxy-based Neighborhood Component Analysis (NCA) loss. A theoretical justification is provided to show that by removing the bias term in the final linear layer and L2-normalizing the inputs and weights, the classification task is a variant of metric learning. 3) We demonstrate that adding temperature scaling of the distances and Layer Normalization improves the numerical stability of optimization, and that such simple modifications provide a strong baseline for metric learning. 4) We describe how to learn high-dimensional binary embeddings trained using our approach to achieve state-of-the-art retrieval performance with the same memory footprint as 64-dimensional float embeddings. 5) We address the extreme classification scalability limitations of proxy-based approaches through experiments demonstrating the effects of subsampling target classes for each training iteration. To the best of our knowledge, we are the first to show the effect of subsampling on proxy-based approaches empirically.
2 Related Work
Metric Learning Losses Metric learning approaches aim to learn a good embedding space such that the similarity between samples is preserved as the distance between their embedding vectors. Metric learning losses, such as the contrastive loss and triplet loss, are formulated to minimize intra-class distances and maximize inter-class distances. Recent approaches design the loss function to consider the relationships of all samples within the training batch, and achieve state-of-the-art performance on image retrieval datasets.
Training Sampling Sampling informative training examples plays an important role in metric learning. Semi-hard sampling in conjunction with the triplet loss has been widely adopted for many tasks. Distance-weighted sampling suggests that with a balanced mix of difficulties during training, image retrieval performance can be further improved. Hierarchical Triplet Loss proposed that by dynamically merging similar classes during training into a hierarchical tree, more informative samples can be drawn from the structure, and the loss also provides global context for training.
Ensembling Ensembling embeddings has been a recent focus to further improve image retrieval performance. The ensembled embeddings are trained via boosting or by attending to diverse spatial locations. However, such ensembled embeddings trade off image retrieval performance against higher dimensionality.
Global Context via Active Memory Proxy loss uses a small number of proxies in a proxy bank to represent all positive and negative samples in the training data. Scalable NCA loss further extends the idea by storing all image embeddings in memory and performing approximate updates on the stored embeddings.
3 Classification as Metric Learning
In this section, we show that removing the bias term and L2-normalizing the inputs and weights before the final softmax cross-entropy layer makes classification a variant of proxy-based metric learning. We first examine the Neighborhood Component Analysis (NCA) loss used as the surrogate loss for ranking in the proxy-based approach. With the set of dissimilar images denoted as $Z$ and the distance between the embeddings of images $x$ and $y$ as $d(x, y)$, the proxy-NCA loss is:

$$L_{NCA}(x, y, Z) = -\log\frac{\exp(-d(x, y))}{\sum_{z \in Z} \exp(-d(x, z))}$$
We argue that by excluding the positive term $\exp(-d(x, y))$ from the denominator, the proxy-NCA loss loses the probabilistic interpretation proposed in the original NCA paper. The softmax cross-entropy loss (softmax loss) used in the standard classification task, on the other hand, is closer to the original NCA loss in spirit: it also tries to make $x$ closer to $y$ than to any $z \in Z$:

$$L(x, y, Z) = -\log\frac{\exp(-d(x, y))}{\exp(-d(x, y)) + \sum_{z \in Z} \exp(-d(x, z))}$$
With the static proxy assignment, where each image $y$ is assigned the proxy $p_y$ of its class, and the choice of distance function being the cosine distance $d(x, p) = 1 - x^{\top} p / (\|x\|_2 \|p\|_2)$, the constant term of the distance cancels inside the softmax.
The proxy-classification loss can then be expressed as the standard softmax loss when the input and weights are L2-normalized:

$$L(x, y, W) = -\log\frac{\exp(\hat{x}^{\top} \hat{w}_y)}{\sum_{c=1}^{C} \exp(\hat{x}^{\top} \hat{w}_c)}, \qquad \hat{x} = \frac{x}{\|x\|_2}, \quad \hat{w}_c = \frac{w_c}{\|w_c\|_2}$$

where $\hat{w}_y$ is the normalized weight vector (proxy) of the class of image $y$ and $C$ is the number of classes.
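As a concrete illustration, this normalized softmax loss fits in a few lines of NumPy (a minimal sketch; the function name and the optional temperature parameter, discussed later in the document, are ours):

```python
import numpy as np

def normalized_softmax_loss(x, W, y, temperature=1.0):
    """Softmax cross-entropy with an L2-normalized input and weights and
    no bias term: proxy-based metric learning with cosine similarity.
    x: (D,) embedding, W: (C, D) class-proxy weights, y: true class index.
    """
    x_hat = x / np.linalg.norm(x)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    logits = W_hat @ x_hat / temperature   # cosine similarity to each proxy
    logits -= logits.max()                 # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y])
```

In a real network, `x` would be the penultimate layer's output and each row of `W` the weight vector (proxy) of one class in the final linear layer.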
By using the network's penultimate layer's normalized output as the embedding, and the normalized weights of the last fully-connected layer as the proxies, we show that standard classifier training without a bias in the last layer is just a variant of proxy-based metric learning. We refer to this modified softmax loss as the normalized softmax loss. This explains the strong baseline obtained by simply using the activations of a network pre-trained on a classification task such as ImageNet. We can also leverage the vast body of literature on tricks for optimizing classification training and performance.
4 Method
Based on the derivation in Section 3, it is very simple to convert any classification network to perform proxy-based metric learning. One only needs to remove the bias term in the last linear layer and add an L2 normalization module to the inputs and weights before the softmax loss. To achieve state-of-the-art metric learning performance using softmax, we introduce the following components without adding any parameters:
4.1 Temperature Scaling
As shown in prior work, in high-dimensional spaces the distance between two random points on the unit sphere approaches the normal distribution $N(\sqrt{2}, \frac{1}{2d})$, where $d$ is the dimensionality. This phenomenon is also known as the loss of contrast in high-dimensional spaces. Therefore, to make the network focus on smaller differences in distance as the dimensionality increases, we add a temperature term $\sigma$ to scale the cosine similarity before the softmax loss:

$$L(x, y, W) = -\log\frac{\exp(\hat{x}^{\top} \hat{w}_y / \sigma)}{\sum_{c=1}^{C} \exp(\hat{x}^{\top} \hat{w}_c / \sigma)}$$
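A small numeric illustration of why the temperature matters (the similarity values below are hypothetical, chosen to mimic the narrow band that cosine similarities occupy in high dimensions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Cosine similarities concentrate in a narrow band in high dimensions,
# so the unscaled softmax is nearly uniform and gives a weak gradient.
sims = np.array([0.30, 0.25, 0.24, 0.23])

p_raw    = softmax(sims)          # nearly uniform distribution
p_scaled = softmax(sims / 0.05)   # temperature 0.05 sharpens it

print(p_raw.max())     # ~0.26, barely above uniform (0.25)
print(p_scaled.max())  # ~0.52, a clear winner emerges
```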
4.2 Layer Normalization
Layer normalization without affine parameters is added immediately after the final pooling layer of the feature model (e.g. GoogleNet's pool5 layer) to normalize the feature dimension of our embeddings to have a distribution of values centered at zero. This allows us to easily binarize embeddings via thresholding at zero. We also show empirically through ablation experiments in Section 6.4.2 that layer normalization helps the network better initialize new parameters and reach a better optimum.
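A minimal NumPy sketch of this normalize-then-binarize step (function names and the example activations are ours):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer Normalization without affine parameters: zero-center and
    unit-scale an embedding across its feature dimension."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def binarize(x):
    """Binarize an embedding by thresholding at zero; meaningful only
    because layer-normalized features are centered at zero."""
    return (x > 0).astype(np.uint8)

# Pooled CNN activations are typically all non-negative, so thresholding
# them at zero directly would produce a degenerate all-ones code.
emb = np.array([3.0, 5.0, 4.0, 0.5])   # hypothetical pool5 activations
z = layer_norm(emb)                     # centered at zero
bits = binarize(z)                      # informative binary code
```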
4.3 Subsampling
Fitting all proxies in memory during training may not be possible with high-dimensional embeddings and a large number of classes. Furthermore, even if we could fit a large number of classes in memory, having softmax training scale linearly in complexity with the number of classes is an undesirable property for efficient training.
To address this issue, we propose a subsampling module. For each training batch, only the correct classes and a randomly sampled subset of the wrong classes are used for optimization. We show empirically that subsampling slightly degrades the performance, but we can achieve competitive performance with only a 10% subsampling ratio of the classes on the Stanford Product dataset in Section 6.5.
Although subsampling is not necessary for the existing datasets, the simple subsampling module alleviates the constraint on the number of classes when training the classifier. It also offers the flexibility to trade off the dimensionality of the embedding against the subsampling ratio.
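The subsampling module can be sketched as follows (a simplified illustration; the function signature is ours). The softmax is then computed only over the logits of the returned class indices:

```python
import numpy as np

def subsample_classes(batch_labels, num_classes, ratio=0.1, rng=None):
    """Pick the classes used in this iteration's softmax: every class that
    appears in the batch (the 'correct' classes) plus a random subset of
    the remaining classes, so the classifier layer no longer scales with
    the full number of classes."""
    if rng is None:
        rng = np.random.default_rng()
    positives = np.unique(batch_labels)
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    k = max(0, int(ratio * num_classes) - len(positives))
    sampled = rng.choice(negatives, size=k, replace=False)
    return np.sort(np.concatenate([positives, sampled]))
```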
5 Datasets
We evaluate our method on commonly used image retrieval tasks with the standard train/test split protocols. Specifically, we use the following datasets:
CARS-196 . 16,185 images distributed over 196 classes of cars. The first 98 classes with 8,054 images are used for training and the other 98 classes with 8,131 images are used for testing. The test set is both the query and index set.
CUB-200-2011 . 11,788 images distributed over 200 species of birds. The first 100 classes with 5,864 images are used for training and the other 100 classes with 5,924 images are used for testing. The test set is both the query and index set.
Stanford Online Products (SOP) . 120,053 images distributed over 22,634 classes of products. 11,318 classes with 59,551 images are used for training and the other 11,316 classes with 60,502 images are used for testing. The test set is both the query and index set.
In-shop Clothes Retrieval . 54,642 images distributed over 11,735 classes of clothing items. Following standard protocol, only 7,982 classes with 52,712 images are used for training and testing. 3,997 classes with 25,882 images are used for training and 3,985 classes with 28,760 images are used for testing. The test set is split into a 14,218 image query set and a 12,612 image index set, both containing examples from each of the 3,985 classes.
Following the same evaluation protocol as in , we compare our method using Recall@K to measure retrieval quality and NMI to measure clustering quality. To compute Recall@K, during testing we first compute the embeddings for all test images. For each test image, we then retrieve the top K images from the test set, excluding the query image itself, using cosine similarity. A recall score of 1 is assigned to every query image that has at least one image that matches the query’s label from the K retrieved images, 0 otherwise. Recall@K is then the average of the recall scores for all queries.
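The Recall@K protocol above can be written directly (a NumPy illustration; the function name is ours):

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Recall@K with cosine similarity, excluding the query itself:
    a query scores 1 if any of its top-K neighbors shares its label."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the K nearest images
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return hits.mean()
```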
6 Experiments
We first investigate how embeddings trained with softmax classification compare against embeddings trained with existing metric learning losses using the same featurizer and embedding dimension. Following our loss comparisons, we conduct ablation studies on different design choices of our approach on CUB-200-2011 (Section 6.4). One facet of interest we study in detail is the dimensionality of our embeddings (Section 6.3) and its relation to performance. Next, we investigate how softmax embeddings are affected by class subsampling (Section 6.5), addressing the key scalability concern of the softmax loss, whose training complexity is linear in the number of classes. Finally, in Section 6.6, we show that our method outperforms state-of-the-art methods on several retrieval datasets.
6.1 Implementation
All experiments were executed using PyTorch on a Tesla V100 graphics card. We compare our softmax experiments against common architectures used in metric learning, including GoogleNet, GoogleNet with Batch Normalization, and ResNet50. We initialize our networks with pre-trained ImageNet ILSVRC-2012 weights, with GoogleNet weights copied from the Caffe-trained model. We add a randomly initialized fully connected layer to the pool5 features of each architecture to learn embeddings of varying dimensionality. To reduce sensitivity to the initialization of the fully connected layer, we add a Layer Normalization without additional parameters between the pool5 and fully connected layers (see Section 6.4.2 for the ablation study). We L2-normalize both the embedding and class weights before the softmax and use a temperature of 0.05 for all experiments (see Section 6.4.3 for the temperature ablation study).
Unless otherwise stated, we first train for 1 epoch, updating only the new parameters for better initialization. We then optimize all parameters for 30 epochs with a batch size of 75 and the same learning rate. We construct each batch by sampling 25 examples per class for all datasets except Stanford Online Products, where we sample 5 due to the few examples per class in the dataset (see Section 6.4.1 for the ablation study). We alternate between base learning rates of 1e-2 and 1e-3 depending on both dataset and model, which is unsurprising given the optimization variations of our different architectures (e.g. ResNet50 has Batch Normalization, which allows higher learning rates than GoogleNet can use). We use SGD with momentum of 0.9, weight decay of 1e-4, and gamma of 0.1, which we apply to reduce the learning rate at epoch 15. During training, we apply horizontal mirroring and random crops from 256x256 images; during testing we only center crop from the 256x256 image. We crop to 227x227 for GoogleNet and 224x224 for the rest.
6.2 Loss Function Comparisons
We compare our normalized softmax loss against existing metric learning losses. To focus on contributions from the loss functions, we defer comparisons against methods that ensemble models, modify the feature extractor architecture, or propose complex activation paths between the featurizer and final embedding to Section 6.6.
We present Recall@K and NMI results on three standard retrieval datasets in Table 1, Table 2, and Table 3, comparing against the reported performance of methods trained with the model architectures GoogleNet, GoogleNet with Batch Normalization (BNInception), and ResNet50, respectively. For GoogleNet with Stanford Online Products only, we saw around a 1% Recall@1 improvement by training all parameters from the start with a 10x learning rate on new parameters, compared with models trained with our standard finetuning procedure.
As shown, our approach compares very strongly against existing baselines, achieving the best Recall@K and NMI performance for many of the tasks. When fixing dimensionality to 512, we see that the performance improvements of our softmax embeddings across architectures mirror classification performance on ImageNet ILSVRC-2012. We hope our results help disentangle performance improvements of existing metric learning methods due to advancements in methodology versus changes of base feature models.
|Method|SOP R@1|R@10|R@100|R@1000|NMI|CARS R@1|R@2|R@4|R@8|NMI|CUB R@1|R@2|R@4|R@8|NMI|
|Lift. Struc128|-|-|-|-|-|49.0|60.3|72.1|81.5|55.0|47.2|58.9|70.2|80.2|55.6|
|Lift. Struc512|62.1|79.8|91.3|97.4|-|-|-|-|-|-|-|-|-|-|-|
|Hist Loss512|63.9|81.7|92.2|97.7|-|-|-|-|-|-|50.3|61.9|72.6|82.4|-|
|Bin. Dev512|65.5|82.3|92.3|97.6|-|-|-|-|-|-|52.8|64.4|74.7|83.9|-|
6.3 Embedding Dimensionality
To study the effects of dimensionality on our softmax embeddings, we keep all optimization parameters fixed with the exception of the dimensionality of the added fully connected layer. We have consistently observed that dimensionality is directly related to retrieval performance for our softmax-trained embeddings. Two examples of this across different datasets (CARS-196 and CUB-200-2011) and model architectures (ResNet50 and GoogleNet) are shown in Figure 2. Interestingly, this is in contrast to the reported behavior of previous non-parametric metric learning methods, for which dimensionality does not significantly affect retrieval performance. This difference is seen clearly when comparing R@1 across dimensionality for CUB-200-2011 with GoogleNet in Figure 2 against the same dataset and model combination reported previously.
Higher-dimensional embeddings lead to an increase in retrieval performance. Lower-dimensional embeddings, however, are preferred for scalability, reducing storage and distance computation costs, especially in large-scale applications such as visual search. We observe, however, that as we increase the dimensionality of our softmax embeddings, the optimizer does not fully utilize the higher-dimensional metric space. Instead, the embedding starts relying less on the magnitude of each feature dimension and more on its sign. In Figure 2, we see for both datasets that the Recall@1 performance of binary features (thresholding the feature values at zero) converges with the performance of the float embeddings. This is a consistent result across datasets and model architectures. We show that training high-dimensional embeddings and binarizing them leads to the best trade-off of performance and scalability, as described in Section 6.6.
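To make the memory trade-off concrete: thresholding a 2048-dimensional float embedding at zero yields 2048 bits, i.e. 256 bytes, exactly the footprint of a 64-dimensional float32 embedding (a NumPy sketch with random placeholder values):

```python
import numpy as np

rng = np.random.default_rng(0)
float_emb = rng.standard_normal(2048).astype(np.float32)  # 2048 * 4 = 8192 bytes

# Binarize by thresholding at zero, then pack 8 bits per byte.
bits = float_emb > 0
packed = np.packbits(bits)

print(packed.nbytes)             # 256 bytes for the 2048-bit binary embedding
print(np.float32().nbytes * 64)  # 256 bytes for a 64-dim float32 embedding
```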
6.4 Ablation Studies
In this set of experiments we report the effect on Recall@1 of different design choices in our approach on CUB-200-2011. We train the ResNet50 variant with an embedding dimension of 512, as in Table 3. We fix all hyperparameters besides the one of interest.
6.4.1 Class Balancing
As seen in Table 4, class balancing is beneficial compared to random sequential iteration over the dataset. Given the fine-grained nature of the CUB-200-2011 dataset, we hypothesize that with enough examples of a particular class in one mini-batch, the learned gradient signal can better separate that class from other fine-grained negative class proxies. When too few distinct classes exist in the mini-batch, however, the bias of separating few distinct classes may introduce noise into the optimization, resulting in lowered performance.
6.4.2 Layer Normalization
We utilize Layer Normalization without parameters to standardize activation values after pooling to help the initialization of our training. With 100 training classes in CUB-200-2011, we expect a random classifier to have a loss value of roughly $-\log(1/100) \approx 4.6$. As shown in Table 3, this occurs when training with Layer Normalization, but not without. We have found that incorporating Layer Normalization in our training makes us robust against poor weight initialization of new parameters across model architectures, alleviating the need for careful weight initialization.
6.4.3 Temperature Scaling
We see in Table 5 the effects of varying temperature. As described in Section 4.1, because distances become less distinguishable at higher dimensionality, temperature scaling is necessary to improve the stability of training. This can be seen as temperature values close to one show sharp drops in performance. Similarly, temperature values that are too low also result in low performance; we hypothesize that the scaling from very low temperatures amplifies too much noise as the variance of observed distances increases. The robustness of our temperature setting of 0.05, however, is validated as we fix this setting across all experiments we run, regardless of the dimensionality of the learned embedding.
6.5 Subsampling for Classification Scalability
We apply the subsampling methodology described in Section 4.3 to the Stanford Online Products dataset, the retrieval dataset with the largest number of classes (Section 5), using ResNet50 to train a 2048-dimensional embedding. We present our findings in Figure 3, showing that with only 10% of the classes available during the forward pass of training, we can reach a R@1 performance comparable to using all classes (a 1% drop in performance). When using 1% of the classes, we reach a R@1 of 75.7 (better than most methods in Table 6). When using 0.1% of the classes, we reach a R@1 of 72.0. As we can see, subsampling classes during training is an effective method of scaling softmax embedding training with little drop in performance. We also note that our class subsampling strategy is simply random selection. An interesting line of future work may leverage improvements in negative sampling for softmax class selection to further improve scalability and even performance.
|Method|Arch|SOP R@1|R@10|R@100|R@1000|In-Shop R@1|R@10|R@20|R@30|R@40|R@50|
|Lifted Struct512|GoogleNet|62.1|79.8|91.3|97.4|-|-|-|-|-|-|
|Histogram Loss512|GoogleNet|63.9|81.7|92.2|97.7|-|-|-|-|-|-|
|Binomial Dev512|GoogleNet|65.5|82.3|92.3|97.6|-|-|-|-|-|-|
|Angular Loss512|GoogleNet|70.9|85.0|93.5|98.0|-|-|-|-|-|-|
|Method|Arch|CARS R@1|R@2|R@4|R@8|CUB R@1|R@2|R@4|R@8|
|Lifted Struct128|GoogleNet|49.0|60.3|72.1|81.5|47.2|58.9|70.2|80.2|
|Histogram Loss512|GoogleNet|-|-|-|-|50.3|61.9|72.6|82.4|
|Binomial Dev512|GoogleNet|-|-|-|-|52.8|64.4|74.7|83.9|
|Angular Loss512|GoogleNet|71.4|81.4|87.5|92.1|54.7|66.3|76.0|83.9|
|Proxy NCA64|BNInception|73.2|82.4|86.4|88.7|49.2|61.9|67.9|72.4|
6.6 Comparison against State of the Art
Finally, we compare our best-performing softmax embedding model against state-of-the-art metric learning approaches on Stanford Online Products, In-Shop, CARS-196, and CUB-200-2011. We train two variants of networks: BNInception with 1024 embedding dimensions and ResNet50 with 2048. Because the dimensionality of the embedding is equal to that of the pooled feature of the original network, we skip the fully connected layer and directly use the pooled feature for efficiency. We have verified that this difference results in negligible performance changes (Stanford Online Products Recall@1 of 79.2 versus 79.5). Full results are presented in Table 6 and Table 7.
We see impressive retrieval performance from methods using only GoogleNet in Table 6 and Table 7. Looking at these approaches, we see that HDC relies on an ensemble of three GoogleNets, with the final embedding size of 384 being the concatenation of three 128-dimensional embeddings. ABE-8 modifies the internal GoogleNet architecture with attention modules to improve performance (not unexpected given precedence in classification architectures such as SENet). Finally, A-BIER, instead of using a single fully connected layer on top of GoogleNet for the embedding, uses boosting through a cascade of smaller-dimensional embeddings to learn decorrelated embeddings that, when concatenated, better leverage the higher-dimensional space. We mark these methods as modified GoogleNet variants.
As shown in Table 6 and Table 7, our 2048-dimensional ResNet50 embedding significantly outperforms previous approaches. Considering the higher dimensionality of our embeddings, we also show that our 2048-bit binary embedding, sharing the same memory footprint as a 64-dimensional float embedding, similarly outperforms state-of-the-art baselines by a significant margin. These binary features were obtained by thresholding the float embedding features at zero, as in Figure 2. Visualizations of the retrieval results for both the binary and float embeddings on Stanford Online Products are shown in Figure 1. Considering the scalability and performance of our binary softmax features, along with the simplicity of our approach, we believe softmax embeddings should be a strong baseline for future metric learning work.
7 Conclusion
In this paper, we have proposed a simple modification to the standard classification network that makes classification an effective metric learning approach. In the metric learning community, a diverse set of base networks for training embeddings of different sizes are compared to one another. In our work, we conducted fair comparisons through extensive experimentation and establish that the softmax loss used in standard classification is a strong baseline in a wide variety of settings. Our source code will be released to make benchmarking easier for future research.
We further show that by adding temperature scaling and Layer Normalization, our approach can learn high-dimensional binary embeddings that surpass state-of-the-art performance on image retrieval tasks with the same or smaller memory footprint. The subsampling module we introduced makes our approach viable even for tasks with a very large number of classes. Our approach not only addresses the concern of extreme classification, but most importantly establishes the validity of using classification as a state-of-the-art metric learning approach.
References
-  S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. ACM Trans. on Graphics (SIGGRAPH), 34(4), 2015.
-  K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Database Theory - ICDT '99, 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings, pages 217–235, 1999.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546. IEEE, 2005.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pages 513–520, 2004.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. Technical report.
-  E. Hoffer and N. Ailon. Deep metric learning using triplet network. CoRR, abs/1412.6622, 2014.
-  H. Hu, Y. Wang, L. Yang, P. Komlev, L. Huang, X. S. Chen, J. Huang, Y. Wu, M. Merchant, and A. Sacheti. Web-scale responsive visual search at bing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 359–367, 2018.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. 2018.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin. Deep metric learning with angular loss. In International Conference on Computer Vision, 2017.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  Y. Jing, D. Liu, D. Kislyuk, A. Zhai, J. Xu, and J. Donahue. Visual search at pinterest. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2015.
-  J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
-  Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh. No fuss distance metric learning using proxies. CoRR, abs/1703.07464, 2017.
-  M. Opitz, G. Waltner, H. Possegger, and H. Bischof. Deep Metric Learning with BIER: Boosting Independent Embeddings Robustly. arXiv:cs/1801.04815, 2018.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1857–1865. Curran Associates, Inc., 2016.
-  H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
-  E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Neural Information Processing Systems, 2016.
-  W. Ge, W. Huang, D. Dong, and M. R. Scott. Deep metric learning with hierarchical triplet loss. In ECCV, 2018.
-  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
-  W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention-based ensemble for deep metric learning. In ECCV, 2018.
-  C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. CoRR, abs/1706.07567, 2017.
-  Z. Wu, A. A. Efros, and S. X. Yu. Improving generalization via scalable neighborhood component analysis. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pages 712–728, 2018.
-  Y. Yuan, K. Yang, and C. Zhang. Hard-aware deeply cascaded embedding. arXiv preprint arXiv:1611.05720, 2016.
-  Y. Jing, H. Rowley, J. Wang, D. Tsai, C. Rosenberg, and M. Covell. Google image swirl: a large-scale content-based image visualization system. In Proceedings of the 21st International Conference on World Wide Web, 2012.
-  A. Zhai, D. Kislyuk, Y. Jing, M. Feng, E. Tzeng, J. Donahue, Y. L. Du, and T. Darrell. Visual discovery at pinterest. arXiv preprint arXiv:1702.04680, 2017.
-  Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, and R. Jin. Visual search at alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 993–1001, 2018.
-  Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.