Deep Metric Learning with Density Adaptivity

Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei
T. Yao is the corresponding author. This work is partially supported by NSF of China under Grant 61672548, U1611461, 61173081, and the Guangzhou Science and Technology Program, China, under Grant 201510010165. Y. Li and H. Chao are with Sun Yat-Sen University, Guangzhou, China (e-mail: yehaoli.sysu@gmail.com; isschhy@mail.sysu.edu.cn). T. Yao, Y. Pan and T. Mei are with JD AI Research, Beijing, China (e-mail: tingyao.ustc@gmail.com; panyw.ustc@gmail.com; tmei@jd.com).
Abstract

The problem of distance metric learning is mostly considered from the perspective of learning an embedding space, where the distances between pairs of examples are in correspondence with a similarity metric. With the rise and success of Convolutional Neural Networks (CNN), deep metric learning (DML) involves training a network to learn a nonlinear transformation to the embedding space. Existing DML approaches often express the supervision through maximizing inter-class distance and minimizing intra-class variation. However, the results can suffer from the overfitting problem, especially when the training examples of each class are embedded together tightly and the density of each class is very high. In this paper, we integrate density, i.e., the measure of data concentration in the representation, into the optimization of DML frameworks to adaptively balance inter-class similarity and intra-class variation by training the architecture in an end-to-end manner. Technically, the knowledge of density is employed as a regularizer, which is pluggable to any DML architecture with different objective functions such as contrastive loss, N-pair loss and triplet loss. Extensive experiments on three public datasets consistently demonstrate clear improvements by amending three types of embedding with the density adaptivity. More remarkably, our proposal increases Recall@1 from 67.95% to 77.62%, from 52.01% to 55.64% and from 68.20% to 70.56% on the Cars196, CUB-200-2011 and Stanford Online Products datasets, respectively.

Index Terms—Deep Metric Learning, Density Adaptation, Image Retrieval.

I Introduction

Learning to assess the distance between pairs of examples, i.e., learning a good metric, is crucial in machine learning and real-world multimedia applications. One typical direction to define and learn metrics that reflect succinct characteristics of the data is from the viewpoint of classification, where a clear supervised objective, i.e., classification error, is available and can be optimized. However, there is no guarantee that classification approaches learn good and general metrics for arbitrary tasks, particularly when the data distribution at test time is quite different, not to mention that some test examples may come from previously unseen classes. More importantly, the extreme case with an enormous number of classes and only a few labeled examples per class practically stymies direct classification. Distance metric learning, in contrast, aims at learning a transformation to an embedding space, which is regarded as a full metric over the input space, by exploring not only the semantic information of each example in the training set but also their intra-class and inter-class structures. As such, the learnt metric generalizes more easily.

Fig. 1: (a)-(c): Image representation embedding visualizations of ten randomly selected training classes from the Cars196 dataset using t-SNE [38]. Each image is visualized as one point and colors denote different classes. The embedding space is learnt by a standard DML architecture with contrastive loss, N-pair loss and triplet loss, respectively. (d): Recall@1 performance on the training and testing sets when optimizing different losses and when regularizing contrastive embedding with our density adaptivity (DML-DA).

The recent attempts on metric learning are inspired by the advances of deep learning and learn an embedding representation of the data through neural networks. Deep metric learning (DML) has demonstrated high capability in a wide range of multimedia tasks, e.g., visual product search [11, 34, 37], image retrieval [18, 29, 33, 17, 45], clustering [9], zero-shot image classification [6, 51], highlight detection [13, 46], face recognition [31, 35] and person re-identification [2, 20]. The basic objective of the learning process is to preserve similar examples close in proximity and make dissimilar examples far apart from each other in the embedding space. To achieve this objective, a broad variety of losses, e.g., contrastive loss [3, 8], N-pair loss [33] and triplet loss [31, 43], are devised to explore the relationship between pairs or triplets of examples. Nonetheless, there is no clear picture of how to control the generalization error, i.e., the difference between “training error” and “test error,” when capitalizing on these losses. Take the Cars196 dataset [15] as an example: a standard DML architecture with N-pair loss fits the training set nicely and achieves a Recall@1 of 99.2%, but generalizes poorly on the testing set and only reaches 56.5% Recall@1, as shown in Figure 1(d). Similarly, the generalization error is also observed when employing contrastive loss and triplet loss. Among the three losses, utilizing contrastive loss expresses the smallest generalization error and exhibits the highest performance on the testing set. More interestingly, the embedding representations of images from each class in the training set are more concentrated by using N-pair loss and triplet loss than contrastive loss, as visualized in Figure 1(a)-1(c). In other words, optimizing contrastive loss leads to a low density of example concentration. Here density refers to the measure of data concentration in the representation. This observation motivates us to explore the fuzzy relationship between the density of examples in the embedding space and the generalization capability of DML.

By consolidating the idea of exploring density to mitigate overfitting, we integrate density adaptivity into metric learning as a regularizer, following the theory that some form of regularization is needed to ensure small generalization error [48]. The regularizer of density could be easily plugged into any existing DML framework by training the whole architecture in an end-to-end fashion. We formulate the density regularizer such that it enlarges intra-class variation while the loss in DML penalizes representation distribution overlap across different classes in the embedding space. As such, the embedding representations could be sufficiently spread out to fully utilize the expressive power of the embedding space. Moreover, considering that the inherent structure of each class should be preserved before and after representation embedding, relative relationship with respect to density between different classes is further taken into account to optimize the whole architecture. Technically, the target density of each class can be viewed as an intermediate variable in our designed regularizer. It is natural to simultaneously learn the target density of each class and the neural networks by optimizing the whole architecture through the DML loss plus density regularizer. As illustrated in Figure 1(d), contrastive embedding with our density adaptivity further decreases the generalization error and boosts up Recall@1 performance to 77.6% on Cars196 testing set.

The main contribution of this work is the proposal of density adaptivity for addressing the issue of model generalization in the context of distance metric learning. This also leads to the elegant view of what role the density should act as in a DML framework, which is a problem not yet fully understood in the literature. Through an extensive set of experiments, we demonstrate that our density adaptivity is amenable to three types of embedding with clear improvements on three different benchmarks. The remaining sections are organized as follows. Section II reviews related work. Section III presents our approach of deep metric learning with density adaptivity, while Section IV presents the experimental results for image retrieval. Finally, Section V concludes this paper.

II Related Work

The research on deep metric learning has mainly proceeded along two basic types of embedding, i.e., contrastive embedding and triplet embedding. The spirit of contrastive embedding is to make each positive pair from the same class close in proximity and meanwhile push the two samples in each negative pair far apart from each other. That is, to pursue a discriminative embedding space with pairwise supervision. [35] is one of the early works to capitalize on contrastive embedding for deep metric learning in the face verification task. The method learns the embedding space through two identical sub-networks with input pairs of samples. Next, a number of subsequent works leverage contrastive embedding in several practical applications, e.g., person re-identification [1, 16] and image retrieval [19, 44]. As an extension of contrastive embedding, triplet embedding [10, 41, 26, 28, 27] is another dimension of DML approaches, learning the embedding function with triplet/ranking supervision over a set of ranking triplets. For each input triplet consisting of one query sample, one positive sample from the same class and one negative sample from a different class, the training procedure can be interpreted as the preservation of relative similarity relations like “the query sample should be more similar to the positive sample than to the negative sample.”

Despite the promising success of both contrastive embedding and triplet embedding in the aforementioned tasks, the two embeddings rely on huge amounts of pairs or triplets for training, resulting in slow convergence and even poor local optima. This is partially due to the fact that existing methods often construct each mini-batch with randomly sampled pairs or triplets, and the loss functions are measured independently over individual pairs or triplets without any interaction among them. To alleviate the problem, a practical trick, i.e., hard sample mining [7, 32, 42, 25], is commonly leveraged to accelerate convergence with the hard pairs or triplets selected from each mini-batch. In particular, [42] devises an effective hard triplet sampling strategy by selecting more positive images with higher relevance scores and hard in-class negative images with lower relevance scores. In another work [32], the idea of hard mining is incorporated into contrastive embedding by gradually searching hard negative samples for training.

Recently, a variety of works design new loss functions for training, pursuing more effective DML. For example, [3, 49] present a simple yet effective method by combining deep metric learning with a classification constraint in a multi-task learning framework. [33] develops N-pair embedding, which improves triplet embedding by pushing away multiple negative samples simultaneously within a mini-batch. Such a design of N-pair embedding constructs each batch with N pairs of samples, leading to more efficient convergence in the training stage. Song et al. define a structured prediction objective for DML by lifting the examples within a batch into a dense pairwise matrix in [34]. Later in [24], another structured prediction-based method is designed to directly optimize the deep neural network with a clustering quality metric. Ustinova et al. propose a new Histogram loss [37] to train the deep embeddings by making the distributions of similarities of positive and negative pairs less overlapped. Huang et al. introduce a Position-Dependent Deep Metric (PDDM) unit [11] which is capable of learning a similarity metric adaptive to local feature structure. Most recently, in [47], a Hard-Aware Deeply Cascaded embedding (HDC) is devised to handle samples of different hardness levels with sub-networks of different depths in a cascaded manner. [50] presents a global orthogonal regularizer to improve DML with pairwise and triplet losses by making two randomly sampled non-matching embedding representations close to orthogonal.

In the literature, few works have been proposed to exploit the adaptation of density in deep metric learning. [29] arbitrarily splits the distributions of classes in the representation space to pursue local discrimination. Technically, the method maintains a number of clusters for each class and adaptively embraces intra-class variation and inter-class similarity by minimizing intra-cluster distances. As such, a high density of data concentration is encouraged in each cluster. Instead, our work adapts data concentration by maximizing the feature spread, i.e., seeking a low density of feature distribution for each class, while guaranteeing all the classes separable. As a result, the expressive capability of the representation space can be fully exploited to enhance model generalization, making our model potentially more effective and robust. Moreover, the relative relationship with respect to density between different classes is further taken into account to optimize the DML architecture in our framework.

Fig. 2: The intuition behind existing DML models (e.g., Contrastive Embedding [3], Triplet Embedding [43], N-pair Embedding [33]) and our proposed DML with Density Adaptivity. The three DML models are all optimized by maximizing inter-class distance and minimizing intra-class variation, often resulting in an overfitting problem as the examples of each class are enforced to be concentrated tightly, i.e., the density of each class is very high. In contrast, for DML with our proposed density regularizer, at each iteration the density of each class is estimated and adapted towards a target of low density, which encourages enlarging intra-class variation while guaranteeing all the classes separable. Meanwhile, the objective in DML penalizes representation distribution overlap across different classes. Such a balance between inter-class similarity and intra-class variation leads to better generalization capability of the DML model.

III Deep Metric Learning with Density Adaptivity

Our proposed Deep Metric Learning with Density Adaptivity (DML-DA) approach builds an embedding space in which the feature representations of images are encoded with semantic supervision over pairs or triplets of examples, under the umbrella of density adaptivity for each class. The training of DML-DA is performed by simultaneously maximizing inter-class distance and minimizing intra-class variation, and dynamically adapting the density of each class to further regularize intra-class variation, targeting better model generalization. Therefore, the objective function of DML-DA consists of two components, i.e., the standard DML loss over pairs or triplets of examples and the proposed density regularizer. In the following, we first recall basic methods of DML, followed by presenting how to estimate and adapt the density of each class as a regularizer. Then, we formulate the joint objective function of DML with density adaptivity and present the optimization strategy in one deep learning framework. Specifically, a DML loss layer with a density regularizer is elaborately devised to optimize the whole architecture.

III-A Deep Metric Learning

Suppose we have a training set $\mathcal{X}=\{(x_i, y_i)\}_{i=1}^{N}$ of $N$ image-label pairs belonging to $C$ classes, where $y_i$ is the class label of image $x_i$. With the standard setting of deep metric learning, the target is to learn an embedding function $f(x;\theta)$ for transforming each input image $x$ into a $d$-dimensional embedding space through a deep architecture, where $\theta$ represents the learnable parameters of the deep neural networks. Note that length-normalization is performed on the top of the deep architecture, making all the embedded representations $\ell_2$-normalized. Given two images $x_i$ and $x_j$, the most natural way to measure the relation between them is to calculate the Euclidean distance in the embedding space as

\[ D(x_i, x_j) = \big\| f(x_i;\theta) - f(x_j;\theta) \big\|_2 . \tag{1} \]

After taking such Euclidean distance as the similarity metric, the concrete task for DML is to learn a discriminative embedding representation by preserving the semantic relationships underlying pairs [4, 5], triplets [31, 43], or even higher-order tuples of examples (e.g., the N-pair loss [33]).
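
For concreteness, a minimal NumPy sketch of the distance computation in Eq. (1) on length-normalized embeddings is given below. The function names, array shapes and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Length-normalize each embedding vector (row) to unit L2 norm.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def pairwise_euclidean(a, b):
    # Euclidean distance between every row of a and every row of b,
    # computed from the squared norms and the inner-product matrix.
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))

# Toy usage: four embeddings of dimension 128 (the embedding size used in the experiments).
emb = l2_normalize(np.random.randn(4, 128))
dist = pairwise_euclidean(emb, emb)   # (4, 4) distance matrix with zeros on the diagonal
```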

Contrastive Embedding. Contrastive embedding is the most popular DML method, which aims to generate embedding representations that satisfy the pairwise supervision, i.e., minimizing the distance between a positive pair of examples from the same class while maximizing the distance between a negative pair from different classes. Concretely, the corresponding contrastive loss function is defined as

\[ \mathcal{L}_{con} = \sum_{(i,j)\in\mathcal{P}} D^{2}(x_i, x_j) + \sum_{(i,j)\in\mathcal{N}} \big[\alpha - D(x_i, x_j)\big]_{+}^{2} , \tag{2} \]

where $\alpha$ is the functional margin in the hinge function $[\cdot]_{+}=\max(0,\cdot)$, and $\mathcal{P}$ and $\mathcal{N}$ denote the sets of positive pairs and negative pairs, respectively.
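
As an illustration, the following is a hedged sketch of one common form of the contrastive loss in Eq. (2), assuming a squared hinge on the negative pairs and the margin of 1 used in the experiments; the exact weighting in the authors' implementation may differ.

```python
import numpy as np

def contrastive_loss(emb, pos_pairs, neg_pairs, margin=1.0):
    # emb: (n, d) length-normalized embeddings.
    # pos_pairs / neg_pairs: lists of index pairs (i, j) from the same / different classes.
    loss = 0.0
    for i, j in pos_pairs:
        d = np.linalg.norm(emb[i] - emb[j])
        loss += d ** 2                        # pull positive pairs together
    for i, j in neg_pairs:
        d = np.linalg.norm(emb[i] - emb[j])
        loss += max(0.0, margin - d) ** 2     # push negative pairs beyond the margin
    return loss
```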

Triplet Embedding. Different from pairwise embedding, which only considers the absolute values of distances between positive and negative pairs, triplet embedding focuses more on the relative distance ordering among triplets $(x_a, x_p, x_n)$, where $(x_a, x_p)$ denotes a positive pair and $(x_a, x_n)$ a negative pair. The assumption is that the distance of the negative pair $(x_a, x_n)$ should be larger than that of the positive pair $(x_a, x_p)$. Hence, the triplet loss function is measured by

\[ \mathcal{L}_{tri} = \sum_{(x_a, x_p, x_n)\in\mathcal{T}} \big[ D^{2}(x_a, x_p) - D^{2}(x_a, x_n) + \alpha \big]_{+} , \tag{3} \]

where $\alpha$ is the enforced margin in the hinge function and $\mathcal{T}$ is the triplet set generated on $\mathcal{X}$.
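
A corresponding sketch of the hinge-based triplet loss in Eq. (3) is shown below; whether the distances are squared is an assumption of this illustration.

```python
import numpy as np

def triplet_loss(emb, triplets, margin=1.0):
    # triplets: list of index triples (a, p, n), with a and p from the same class and n from another.
    loss = 0.0
    for a, p, n in triplets:
        d_ap = np.linalg.norm(emb[a] - emb[p]) ** 2
        d_an = np.linalg.norm(emb[a] - emb[n]) ** 2
        loss += max(0.0, d_ap - d_an + margin)   # only violated triplets contribute
    return loss
```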

N-pair Embedding. N-pair embedding is a recent DML model which generalizes the triplet loss by encouraging the joint distance comparison among more than one negative pair. Specifically, given an $(N{+}1)$-tuplet of training samples $(x, x^{+}, x_{1}^{-}, \dots, x_{N-1}^{-})$, where $x$ is the anchor point, $x^{+}$ is a positive sample sharing the same label with $x$, and $\{x_{k}^{-}\}_{k=1}^{N-1}$ are negative samples from the remaining classes, the N-pair loss function is then formulated as

\[ \mathcal{L}_{np} = \frac{1}{|\mathcal{Q}|} \sum_{(x, x^{+}, \{x_{k}^{-}\})\in\mathcal{Q}} \log\Big( 1 + \sum_{k=1}^{N-1} \exp\big( f(x;\theta)^{\top} f(x_{k}^{-};\theta) - f(x;\theta)^{\top} f(x^{+};\theta) \big) \Big) , \tag{4} \]

where $\mathcal{Q}$ is the $(N{+}1)$-tuplet set constructed over $\mathcal{X}$. Through minimizing this N-pair loss, the similarity of the positive pair is enforced to be larger than that of all the negative pairs, which further enhances the triplet loss in triplet embedding with more semantic supervision.
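
The sketch below illustrates the log-sum-exp form of the (N+1)-tuplet loss for a single tuplet, mirroring the formulation of [33]; it is a simplified illustration, not the authors' code.

```python
import numpy as np

def n_pair_loss(anchor, positive, negatives):
    # anchor, positive: (d,) embeddings; negatives: (N-1, d) embeddings from other classes.
    pos_sim = anchor @ positive                   # inner-product similarity to the positive
    neg_sims = negatives @ anchor                 # similarities to all negatives
    # log(1 + sum_k exp(neg_k - pos)) penalizes negatives scoring higher than the positive.
    return np.log1p(np.sum(np.exp(neg_sims - pos_sim)))
```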

III-B Density Regularizer

One key attribute that the recent DML methods reviewed in Section III-A have in common is that their objectives are predominantly designed for maximizing inter-class distance and minimizing intra-class variation. Although such optimization matches the intention of encoding semantic supervision into the learnt embedding representations, it may stymie the intrinsic intra-class variation by enforcing the examples of each class to be concentrated together tightly, which often results in an overfitting problem. To overcome this issue, we devise a novel regularizer for DML that encourages a low density of data concentration in the learnt embedding space to achieve a better balance between inter-class distance and intra-class variation. A caricature illustrating the intuition behind the devised density regularizer is shown in Figure 2.

1:  Given a tradeoff parameter $\lambda$.
2:  Forward Pass:
3:   Fetch an input batch of sampled image-label pairs.
4:   Generate the positive pairs set $\mathcal{P}$ and the negative pairs set $\mathcal{N}$.
5:   Compute the contrastive loss over $\mathcal{P}$ and $\mathcal{N}$ via Eq. (2) and the density regularizer via Eq. (8).
6:   Compute the overall loss output with tradeoff parameter $\lambda$.
7:  Backward Pass:
8:   Compute the overall gradient with respect to the target density $t_c$ of each class and update the corresponding target density value.
9:   Compute the overall gradient with respect to the input embedding representations and back-propagate it to the lower layers for updating the parameters $\theta$ of the embedding function.
Algorithm 1 The training of DML with density regularizer

Density Adaptivity. In our context, density is a measure of data concentration in the representation space. We assume that, for the image examples belonging to the same class, high density is equivalent to the fact that all the examples are in close proximity to the corresponding class centroid. Accordingly, for class $c$, one natural way to estimate its density is to measure the average intra-class distance between the examples and the class centroid in the embedding space, which is written as

\[ \mathcal{A}_{c} = \frac{1}{|S_{c}|} \sum_{x_i \in S_{c}} \big\| f(x_i;\theta) - \mu_{c} \big\|_2 , \tag{5} \]

where $S_{c}$ denotes the set of samples from class $c$ and $\mu_{c}$ is the corresponding class centroid. Here we directly obtain the class centroid by performing mean pooling over all the samples in $S_{c}$ for simplicity. The higher the density of a class, the smaller the average intra-class distance $\mathcal{A}_{c}$ between the examples of this class and its centroid in the embedding space.
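
A minimal sketch of this density estimate, i.e., the average distance of a class's embeddings to their mean-pooled centroid, is given below; the helper name and data layout are our assumptions.

```python
import numpy as np

def class_densities(emb, labels):
    # emb: (n, d) embeddings; labels: (n,) class ids.
    # Returns a dict mapping class id -> average intra-class distance A_c (smaller means denser).
    densities = {}
    for c in np.unique(labels):
        members = emb[labels == c]
        centroid = members.mean(axis=0)           # mean pooling over the class samples
        densities[c] = float(np.mean(np.linalg.norm(members - centroid, axis=1)))
    return densities
```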

Based on the observations of the fuzzy relationship between density and generalization capability of DML, we propose a density regularizer to dynamically adapt the density of data concentration in the learnt embedding space for enhancing the generalization capability. The objective function of density regularizer is defined as

(6)

where $\mathcal{A}_{c}$ represents the density measurement of class $c$ in the embedding space as defined in Eq.(5), and $t_{c}$ is a newly incorporated intermediate variable which can be interpreted as the target density of class $c$, corresponding to an appropriate target intra-class variation. The larger the value of $t_{c}$, the lower the target density of data concentration for class $c$. By minimizing this regularizer, the density of each class is enforced to adapt towards the target density via the first term. Meanwhile, minimizing the second term enlarges each $t_{c}$ (i.e., the target intra-class variation of each class is maximized), pursuing a lower density of data concentration in each class to enhance model generalization. The rationale of our devised density regularizer is to encourage the spread-out property: the regularizer adapts data concentration and maximizes the feature spread in the embedding space, while guaranteeing all the classes separable. As such, the expressive capability of the embedding space can be fully exploited. Please also note that the devised density regularizer should be jointly utilized with a basic DML model in practice, as the objective in DML is required to simultaneously prevent the intra-class variation from increasing endlessly by penalizing representation distribution overlap across different classes.
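
Since the exact functional form of the regularizer is not reproduced here, the snippet below is only one plausible realization of the two-term structure described above, assuming a squared adaptation term and an unweighted spread term; the target densities are treated as learnable parameters.

```python
import torch

def density_regularizer(avg_dist, target_density):
    # avg_dist:       (C,) per-class average intra-class distances A_c measured on the current batch.
    # target_density: (C,) learnable target densities t_c (a larger t_c means a lower target density).
    adapt = ((avg_dist - target_density) ** 2).sum()   # first term: pull each A_c toward its target t_c
    spread = -target_density.sum()                     # second term: minimizing it enlarges every t_c
    return adapt + spread
```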

Inter-class Density Correlations Preservation Constraint. Inspired by the idea of structure preservation, or manifold regularization, in [22], the inter-class density correlation is integrated into the density regularizer as a constraint to further explore the inherent density relationships between different classes. The spirit behind this constraint is that the target densities of two classes with similar inherent structures should still be similar in the embedding space. The intrinsic structure of the data in each class can be appropriately measured by the original density measurement before embedding. Specifically, our density regularizer with the constraint of inter-class density correlations is defined as

(7)

where $\hat{\mathcal{A}}_{c}$ denotes the original average intra-class distance of class $c$, corresponding to the original density, and it is calculated on the image representations before embedding, i.e., the output of the 1,024-way layer of GoogleNet [36] in our experiments. $\beta$ is utilized to control the impact of the original density and reflects to what degree the inherent density relationship between different classes is considered when measuring the density.

To make the optimization of our density regularizer easier to solve, we relax the constraint of inter-class density correlations by appending a converted soft penalty term to the objective function, and Eq.(7) is then rewritten as

(8)

By minimizing the converted soft penalty term in Eq.(8), the inherent inter-class density correlations can be preserved in the learnt embedding space.
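
As the relaxed penalty of Eq. (8) is not reproduced here, the following is only a hedged sketch of one way such a soft penalty could look, in the manifold-regularization spirit of [22]: target densities of classes whose original (pre-embedding) densities are similar are encouraged to remain similar. The Gaussian affinity and the way the control factor enters are assumptions of this illustration.

```python
import torch

def density_correlation_penalty(target_density, orig_dist, beta=0.5):
    # target_density: (C,) learnable target densities t_c.
    # orig_dist:      (C,) original average intra-class distances computed on pre-embedding features.
    # beta (> 0) controls how strongly the original density structure is taken into account.
    diff_orig = orig_dist[:, None] - orig_dist[None, :]
    affinity = torch.exp(-diff_orig ** 2 / beta)       # similar original densities -> large weight
    diff_t = target_density[:, None] - target_density[None, :]
    return (affinity * diff_t ** 2).sum()              # Laplacian-style smoothness over the targets
```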

III-C Training Procedure

Without loss of generality, we adopt the widely used contrastive embedding as the basic DML model and present how to additionally incorporate the density regularizer into it. It is also worth noting that our density regularizer is pluggable to any neural networks for deep metric learning and could be trained in an end-to-end fashion. In particular, the overall objective function of DML-DA integrates the contrastive loss in Eq.(2) and the proposed density regularizer in Eq.(8). Hence, we obtain the following optimization problem as

\[ \min_{\theta,\, \{t_c\}_{c=1}^{C}} \; \mathcal{L}_{con} + \lambda\, \mathcal{R}_{DA} , \tag{9} \]

where $\lambda$ is the tradeoff parameter and $\mathcal{R}_{DA}$ denotes the density regularizer in Eq.(8). With this overall loss objective, the crucial goal of the optimization is to learn the embedding function $f(\cdot;\theta)$ with its parameters $\theta$ and the target density $t_{c}$ of each class.

Inspired by the success of CNNs in recent DML models, we employ a deep architecture, i.e., GoogleNet [36], followed by an additional fully-connected layer (an embedding layer) to learn the embedding representations of images. In the training stage, to solve the optimization according to the overall loss objective in Eq.(9), we design a DML loss layer with the density regularizer on top of the embedding layer. The loss layer only contains the parameters of the target densities. During learning, it evaluates the model’s violation of both the basic DML supervision over pairs and the density regularizer, and back-propagates the gradients with respect to the target density of each class and the input embedding representations to update the parameters of the loss layer and the lower layers, respectively. The training process of our DML-DA is given in Algorithm 1.
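
To make the training procedure concrete, the following is a hedged PyTorch sketch of one optimization step of Eq. (9) with contrastive embedding, combining the forward and backward passes of Algorithm 1; it assumes the simplified regularizer form sketched in Section III-B, omits the inter-class density correlation penalty for brevity, and does not correspond to the authors' Caffe implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, target_density, optimizer, images, labels, lam=10.0, margin=1.0):
    # model: backbone + embedding layer; target_density: (C,) learnable tensor of t_c values
    # (requires_grad=True); labels: (B,) long tensor of class ids for the batch.
    # The optimizer must hold both model.parameters() and target_density.
    emb = F.normalize(model(images), dim=1)            # length-normalized embeddings

    # Contrastive loss over all positive / negative pairs in the batch (cf. Eq. (2)).
    dist = torch.cdist(emb, emb)
    same = labels[:, None] == labels[None, :]
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    loss_con = (dist[same & off_diag] ** 2).sum() \
             + (F.relu(margin - dist[~same & off_diag]) ** 2).sum()

    # Simplified density regularizer over the classes present in the batch (cf. Eq. (8)).
    reg = emb.new_zeros(())
    for c in labels.unique():
        members = emb[labels == c]
        a_c = (members - members.mean(dim=0)).norm(dim=1).mean()
        reg = reg + (a_c - target_density[c]) ** 2 - target_density[c]

    loss = loss_con + lam * reg                        # overall objective of Eq. (9)
    optimizer.zero_grad()
    loss.backward()                                    # gradients reach both theta and the t_c values
    optimizer.step()
    return loss.item()
```

Under these assumptions, the optimizer would be built over both parameter groups, e.g., torch.optim.Adam(list(model.parameters()) + [target_density]).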

Cars196
  Method Tri [43] LS [34] NP [33] Clu [24] Con [3] HDC [47] DML-DA_Tri DML-DA_NP DML-DA_Con
  NMI 47.23 56.88 57.29 59.04 59.09 62.17 56.59 62.07 65.17
  R@1 42.54 52.98 56.52 58.11 67.95 71.42 62.51 71.34 77.62
  R@2 53.94 65.70 68.42 70.64 78.05 81.85 73.58 81.29 86.25
  R@4 65.74 76.01 78.01 80.27 85.78 88.54 82.24 87.92 91.71
  R@8 75.06 84.27 85.70 87.81 91.60 93.40 88.56 92.74 95.35
  R@16 82.40 - 91.19 - 95.34 96.59 93.17 95.89 97.54
  R@32 88.70 - 94.81 - 97.58 98.16 95.89 97.70 98.89
  R@64 93.17 - 97.38 - 98.78 99.21 97.86 98.82 99.37
  R@128 96.42 - 98.83 - 99.51 99.67 98.98 99.53 99.73
CUB-200-2011
  Method Tri [43] LS [34] NP [33] Clu [24] Con [3] HDC [47] DML-DA_Tri DML-DA_NP DML-DA_Con
  NMI 50.99 56.50 57.41 59.23 60.07 60.78 55.53 59.67 62.32
  R@1 39.57 43.57 47.30 48.18 52.01 52.50 45.90 51.45 55.64
  R@2 51.74 56.55 59.57 61.44 65.16 65.25 57.97 63.01 66.96
  R@4 63.35 68.59 70.75 71.83 75.71 76.01 69.53 74.38 77.92
  R@8 74.14 79.63 80.98 81.92 84.25 85.03 80.23 83.78 86.23
  R@16 82.98 - 88.28 - 90.82 91.10 88.15 90.58 92.10
  R@32 89.53 - 93.50 - 95.17 95.34 94.01 95.16 95.95
  R@64 94.80 - 96.79 - 97.70 97.67 97.00 97.64 98.11
  R@128 97.65 - 98.43 - 98.99 99.09 98.63 98.89 99.21
Stanford Online Products
  Method Tri [43] LS [34] NP [33] Clu [24] Con [3] HDC [47] DML-DA_Tri DML-DA_NP DML-DA_Con
  NMI 86.20 88.65 88.77 89.48 88.57 88.75 87.25 88.93 89.50
  R@1 59.49 62.46 65.89 67.02 68.20 69.17 61.16 67.49 70.56
  R@10 76.23 80.81 81.94 83.65 82.20 82.77 78.99 82.20 84.09
  R@100 87.95 91.93 91.83 93.23 90.87 91.27 90.54 91.94 94.09
  R@1000 95.70 - 97.30 - 96.61 97.59 96.96 97.69 97.72
TABLE I: Performance comparisons with the state-of-the-art methods in terms of NMI and Recall@K (%) on the Cars196, CUB-200-2011 and Stanford Online Products datasets. DML-DA_Tri, DML-DA_NP and DML-DA_Con denote our model equipped with triplet, N-pair and contrastive loss, respectively. The performances of Triplet (Tri), N-pair (NP) and Contrastive (Con) are reported based on our implementations, and we utilize the models shared by the authors for HDC evaluation. The best performances are in bold and we also underline the performances of the best competitors. For the methods of Lifted Struct (LS) and Clustering (Clu), we directly extract the results reported in [24].

IV Experiments

We evaluate our DML-DA models by conducting two object recognition tasks (clustering and k-nearest neighbour retrieval) on three image datasets, i.e., Cars196 [15], CUB-200-2011 [40] and Stanford Online Products [34]. The first two are popular fine-grained object recognition benchmarks and the latter is a recently released object recognition dataset of online product images.

Cars196 contains 16,185 images belonging to 196 classes of cars. In our experiments, we follow the settings in [34], taking the first 98 classes (8,054 images) for training and the rest 98 classes (8,131 images) for testing.

CUB-200-2011 includes 11,788 images of 200 classes corresponding to different birds species. Following [34], we utilize the first 100 classes (5,864 images) for training and the remaining 100 classes (5,924 images) for testing.

Stanford Online Products is a recent collection of online product images from eBay.com. It is composed of 120,053 images belonging to 22,634 classes. In our experiments, we utilize the standard split in [34]: 11,318 classes (59,551 images) are used for training and 11,316 classes (60,502 images) are exploited for testing.

IV-A Implementation Details

For the network architecture, we utilize GoogleNet [36] pre-trained on the ImageNet ILSVRC12 dataset [30] plus a fully connected layer (an embedding layer), which is initialized with random weights. For the density regularizer, its parameters (i.e., the target density of each class) are all initially set to 0.5. The control factor $\beta$ in Eq.(7) is set to 0.5 and the tradeoff parameter $\lambda$ in Eq.(9) is fixed to 10. All the margin parameters (e.g., the margins $\alpha$ in Eq.(2) and Eq.(3)) are set to 1. We fix the embedding size to 128 throughout the experiments. We mainly implement DML models based on Caffe [12], one of the widely adopted deep learning frameworks. Specifically, the network weights are trained by ADAM [14] with 0.9/0.999 momentum. The initial learning rate is set separately for Cars196, CUB-200-2011 and Stanford Online Products. The mini-batch size is set to 100 and the maximum number of training iterations is set to 30,000 for all the experiments. In the experiments on Cars196 and CUB-200-2011, to compute the density of each class with sufficient images in a mini-batch, we first randomly sample 10 classes from all training classes and then randomly select 10 images for each sampled class, leading to a mini-batch of 100 training images, as sketched below. In the experiments on the Stanford Online Products dataset, since each training class contains only 5 images on average, we construct each mini-batch by accumulating all the images of randomly sampled classes until the maximum mini-batch size is reached.
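
The class-balanced mini-batch construction used on Cars196 and CUB-200-2011 (10 classes with 10 images each) can be sketched as follows; the function and argument names are ours.

```python
import random
from collections import defaultdict

def sample_balanced_batch(labels, classes_per_batch=10, images_per_class=10):
    # labels: list of class ids, one entry per training image.
    # Returns indices of a mini-batch containing `classes_per_batch` randomly chosen classes
    # with `images_per_class` randomly chosen images each (100 images in total).
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= images_per_class]
    batch = []
    for c in random.sample(eligible, classes_per_batch):
        batch.extend(random.sample(by_class[c], images_per_class))
    return batch
```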

Fig. 3: Recall@K performance comparison of the DML-DA framework with or without inter-class density correlations preservation on the Cars196 dataset, when exploiting (a) triplet loss, (b) N-pair loss and (c) contrastive loss.
Fig. 4: NMI performance gains when plugging the entropy (EN) regularizer, the global orthogonal (GOR) regularizer and our density adaptivity (DA) regularizer into the DML architecture with triplet loss, N-pair loss and contrastive loss, on (a) Cars196, (b) CUB-200-2011 and (c) Stanford Online Products.

IV-B Evaluation Metrics and Compared Methods

Evaluation Metrics. For the clustering task, we adopt the Normalised Mutual Information (NMI) [21] metric, which is defined as the ratio of the mutual information to the average entropy of clusters and labels. For the k-nearest neighbour retrieval task, Recall@K (R@K) is utilized for quantitative evaluation. Given a test image query, its Recall@K score is 1 if an image of the same class is retrieved among the K nearest neighbours and 0 otherwise. The final metric score is the average of Recall@K over all image queries in the testing set. All the metrics are computed using the evaluation code released in [34] (https://github.com/rksltnl/Deep-Metric-Learning-CVPR16/tree/master/code/evaluation).
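
For reference, a simple NumPy sketch of Recall@K over a set of test embeddings is given below; the actual numbers in this paper are produced with the official evaluation code of [34].

```python
import numpy as np

def recall_at_k(emb, labels, k=1):
    # emb: (n, d) test embeddings; labels: (n,) class ids.
    # A query scores 1 if any of its k nearest neighbours (excluding itself) shares its label.
    labels = np.asarray(labels)
    sq = np.sum(emb ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T, 0.0))
    np.fill_diagonal(dist, np.inf)                # exclude the query itself
    knn = np.argsort(dist, axis=1)[:, :k]
    hits = (labels[knn] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```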

Compared Methods. To empirically verify the merit of our proposed DML-DA models, we compare the following state-of-the-art methods:

(1) Triplet [43] adopts triplet loss to optimize the deep architecture. (2) Lifted Struct [34] devises a structured prediction objective on the lifted dense pairwise distance matrix within the batch. (3) N-pair [33] trains DML with N-pair loss. (4) Clustering [24] is a structured prediction based DML model which can be optimized with a clustering quality metric. (5) Contrastive [3] uses contrastive loss for DML training. (6) HDC [47] trains the embedding neural network in a cascaded manner by handling samples of different hardness levels with models of different complexities. (7) DML-DA is the proposal in this paper. DML-DA_Tri, DML-DA_NP and DML-DA_Con denote that the basic DML model in our DML-DA is equipped with triplet loss, N-pair loss and contrastive loss, respectively. Moreover, a slightly different setting of each of the three runs is trained without the inter-class density correlations preservation constraint (compared in Section IV-D).

IV-C Performance Comparison

Table I shows the NMI and the k-nearest neighbour retrieval performance in terms of Recall@K for the different approaches on the Cars196, CUB-200-2011 and Stanford Online Products datasets, respectively. It is worth noting that the dimension of the embedding space in Triplet, N-pair, Contrastive, HDC and our three DML-DA runs is 128, while for Lifted Struct and Clustering the performances are given with 64 as the embedding dimension. In view that the embedding size is not sensitive towards performance during the training and testing phases, as studied in [34], we directly compare with their reported results.

Overall, the results across all evaluation metrics (NMI and Recall at different depths) and the three datasets consistently indicate that our proposed DML-DA exhibits better performance than all the state-of-the-art techniques. In particular, the NMI and Recall@1 of DML-DA_Con reach 65.17% and 77.62%, making absolute improvements over the best competitor HDC of 3.0% and 6.2% on Cars196, respectively. By integrating density adaptivity, DML-DA_Tri, DML-DA_NP and DML-DA_Con make absolute improvements over Triplet, N-pair and Contrastive of 19.97%, 14.82% and 9.67% in Recall@1 on Cars196, respectively. The performance trends on the other two datasets are similar to those on Cars196. The results indicate the advantage of exploring density adaptivity in DML training to enhance model generalization. Triplet, which only compares an example with one negative example while ignoring negative examples from the rest of the classes, performs the worst among all the methods. Lifted Struct, N-pair and Clustering, which distinguish an example from all the negative classes, lead to a large performance boost over Triplet.

Fig. 5: Image representation embedding visualizations of the 98 training classes in Cars196 using t-SNE [38]. Each image is visualized as one point and colors denote different classes. The embedding space is learnt by (a) Triplet, (b) N-pair, (c) Contrastive, (d) our DML-DA_Tri, (e) DML-DA_NP and (f) DML-DA_Con, respectively.
Fig. 6: The effect of the tradeoff parameter $\lambda$ in our (a) DML-DA_Tri, (b) DML-DA_NP and (c) DML-DA_Con over NMI (%) on the Cars196 dataset.
Fig. 7: Barnes-Hut t-SNE visualization [39] of image embedding representations learnt by our DML-DA on the test split of Cars196. Best viewed on a monitor when zoomed in. By integrating density adaptivity in DML training, our DML-DA effectively balances inter-class similarity and intra-class variation, which enhances model generalization. As such, the learnt embedding representation is more discriminative and clusters semantically similar cars despite the significant variations in pose and body paint.

Contrastive outperforms Lifted Struct, N-pair and Clustering on both Cars196 and CUB-200-2011. Though the four runs all involve utilizing the relationship in both positive pairs and negative pairs, they are fundamentally different in devising the objective function: Lifted Struct, N-pair and Clustering tend to push positive pairs closer through negative pairs and encourage small intra-class variation, while Contrastive can flexibly balance inter-class distance and intra-class similarity by seeking a tradeoff of impact between positive pairs and negative pairs. As indicated by our results, advisably enlarging intra-class variation leads to better performance and makes Contrastive generalize well. This is also consistent with the motivation of our density adaptivity, which is to regularize the degree of data concentration of each class. With our density adaptivity, DML-DA successfully boosts the performance on the two datasets. In contrast, the NMI performance of Contrastive is inferior to that of Lifted Struct, N-pair and Clustering on Stanford Online Products. This is expected, as the number of classes in Stanford Online Products is very large (more than 11K test classes) and thus Lifted Struct, N-pair and Clustering benefit from the outcome of small intra-class clustering, which improves the chance of distinguishably distributing such a large number of classes in the embedding space. The improvement is also observed with DML-DA in this extreme case. Furthermore, HDC, which handles samples of different hardness levels with sub-networks of different depths, improves over Contrastive, but the performances are still lower than those of our DML-DA.

IV-D Effect of Inter-class Density Correlations Preservation

Figure 3 compares the Recall@K performance of our DML-DA framework with and without the inter-class density correlations preservation constraint on the Cars196 dataset. The results across different depths (K) of Recall consistently indicate that additionally exploring inter-class density correlations preservation yields better performance when exploiting triplet loss, N-pair loss and contrastive loss in our DML-DA framework, respectively. Though the performance gain gradually decreases when going deeper into the retrieval list, our DML-DA framework still leads to an apparent improvement, even at Recall@128. In particular, DML-DA_Tri makes absolute improvements of 0.5% and 2.56% in Recall@128 over its counterpart without the constraint and over Triplet, respectively.

IV-E Effect of Different Regularizers

Next, we compare our density regularizer with Entropy (EN) regularizer [23] and Global Orthogonal (GOR) regularizer [50] by plugging each of them into DML architecture with triplet loss, N-pair loss and contrastive loss, respectively. The entropy regularizer aims to maximize the entropy of the representation distribution in the embedding space and thus implicitly encourages large intra-class variation and small inter-class distance. The global orthogonal regularizer is to maximize the spread of embedding representations following the property that two non-matching representations are close to orthogonal with a high probability.

Fig. 8: Barnes-Hut t-SNE visualization [39] of image embedding representations learnt by our DML-DA on the test split of the CUB-200-2011 dataset. Best viewed on a monitor when zoomed in. By integrating density adaptivity in DML training, our DML-DA effectively balances inter-class similarity and intra-class variation, which enhances model generalization. As such, the learnt embedding representation is more discriminative and clusters semantically similar birds despite the significant variations in view point and background.

Figure 4 details the NMI performance gains when exploiting each of the three regularizers on the Cars196, CUB-200-2011 and Stanford Online Products datasets, respectively. The results across the DML architecture with three types of losses and three datasets consistently indicate that our DA regularizer leads to a larger performance boost than the other two regularizers. Compared to the EN regularizer, our DA regularizer is more effective and robust, since we uniquely consider the balance between enlarging intra-class variation and penalizing distribution overlap across different classes in the optimization. The GOR regularizer, targeting a uniform distribution of examples in the embedding space, improves over the EN regularizer, but its performance is still lower than that of our DA regularizer. This somewhat reveals the weakness of the GOR regularizer, which imposes a strong constraint of pushing two randomly sampled examples from different categories close to orthogonal. In addition, the improvement trends on the other evaluation metrics are similar to those of NMI.

IV-F Effect of the Tradeoff Parameter

To further clarify the effect of the tradeoff parameter $\lambda$ in Eq.(9), we illustrate the performance curves of DML-DA with the three types of losses by varying $\lambda$ from 0.5 to 25 in Figure 6. As shown in the figure, our DML-DA architecture with the three types of losses consistently attains the best NMI performance when the tradeoff parameter $\lambda$ is set to 10. More importantly, the performance curve of each DML-DA model is relatively smooth as long as $\lambda$ is larger than 7, which practically eases the selection of $\lambda$.

Fig. 9: Barnes-Hut t-SNE visualization [39] of image embedding representations learnt by our DML-DA on the test split of the Stanford Online Products dataset. Best viewed on a monitor when zoomed in. By integrating density adaptivity in DML training, our DML-DA effectively balances inter-class similarity and intra-class variation, which enhances model generalization. As such, the learnt embedding representation is more discriminative and clusters semantically similar products despite the significant variations in configuration and illumination.

IV-G Embedding Representations Visualization

Figure 5(a)-5(f) shows the t-SNE [38] visualizations of image embedding representations learnt by Triplet, N-pair, Contrastive, our DML-DA_Tri, DML-DA_NP and DML-DA_Con, respectively. Specifically, we utilize all the 98 training classes in the Cars196 dataset, and the embedding representations of all the 8,054 images are projected into a 2-dimensional space using t-SNE. It is clear that the intra-class variation of the embedding representations learnt by DML-DA_Tri is larger than that of Triplet, while all the classes remain separable. Similarly, the increase of intra-class variation is also observed in the t-SNE visualizations when integrating density adaptivity into N-pair loss and contrastive loss, respectively.

To better qualitatively evaluate the learnt embedding representations, we further show the Barnes-Hut t-SNE [39] visualizations of image embedding representations learnt by our DML-DA on the Cars196, CUB-200-2011 and Stanford Online Products datasets in Figures 7, 8 and 9, respectively. Specifically, we leverage all the images in the test split of each dataset, and the 128-dimensional embedding representations of the images are projected into a 2-dimensional space using Barnes-Hut t-SNE [39]. It is clear that the learnt embedding representations effectively cluster semantically similar cars/birds/products despite the significant variations in view point, pose and configuration.

V Conclusion

In this paper, we have investigated the problem of training deep neural networks that are capable of high generalization performance in the context of metric learning. Particularly, we propose a new principle of density adaptivity for the learning of DML, which leads to the largest possible intra-class variation in the embedding space while keeping the classes separable. More importantly, the density adaptivity can be easily integrated into any existing DML implementation by simply adding one regularizer to the original objective loss. To verify our claim, we have strengthened three types of embedding, i.e., contrastive embedding, N-pair embedding and triplet embedding, with the density regularizer. Extensive experiments conducted on three datasets validate our proposal and analysis. More remarkably, we achieve new state-of-the-art performance on all three datasets. One possible future research direction would be to generalize our density adaptivity scheme to other types of embedding or to other tasks with a large number of classes.

References

  • [1] E. Ahmed, M. Jones, and T. K. Marks (2015) An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3908–3916.
  • [2] Y. Bai, Y. Lou, F. Gao, S. Wang, Y. Wu, and L. Duan (2018) Group sensitive triplet embedding for vehicle re-identification. IEEE Transactions on Multimedia 20 (9), pp. 2385–2399.
  • [3] S. Bell and K. Bala (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics 34 (4), pp. 98.
  • [4] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah (1994) Signature verification using a ”siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
  • [5] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–546.
  • [6] P. Cui, S. Liu, and W. Zhu (2018) General knowledge embedded image representation learning. IEEE Transactions on Multimedia 20 (1), pp. 198–207.
  • [7] Y. Cui, F. Zhou, Y. Lin, and S. Belongie (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1153–1162.
  • [8] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
  • [9] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 31–35.
  • [10] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
  • [11] C. Huang, C. C. Loy, and X. Tang (2016) Local similarity-aware deep feature embedding. In Advances in Neural Information Processing Systems, pp. 1262–1270.
  • [12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678.
  • [13] H. Kim, T. Mei, H. Byun, and T. Yao (2018) Exploiting web images for video highlight detection with triplet deep ranking. IEEE Transactions on Multimedia 20 (9), pp. 2415–2426.
  • [14] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
  • [15] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561.
  • [16] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 152–159.
  • [17] Y. Li, Y. Pan, T. Yao, H. Chao, Y. Rui, and T. Mei (2019) Learning click-based deep structure-preserving embeddings with visual attention. ACM Transactions on Multimedia Computing, Communications, and Applications 15 (3), pp. 78.
  • [18] Z. Li and J. Tang (2015) Weakly supervised deep metric learning for community-contributed image retrieval. IEEE Transactions on Multimedia 17 (11), pp. 1989–1999.
  • [19] H. Liu, R. Wang, S. Shan, and X. Chen (2016) Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2064–2072.
  • [20] L. Ma, X. Yang, and D. Tao (2014) Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing 23 (8), pp. 3656–3670.
  • [21] C. D. Manning, P. Raghavan, H. Schütze, et al. (2010) Introduction to Information Retrieval. Vol. 16, Cambridge University Press.
  • [22] S. Melacci and M. Belkin (2011) Laplacian support vector machines trained in the primal. Journal of Machine Learning Research.
  • [23] G. Niu, B. Dai, M. Yamada, and M. Sugiyama (2012) Information-theoretic semi-supervised metric learning via entropy regularization. In Proceedings of the 29th International Conference on Machine Learning, pp. 1043–1050.
  • [24] H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy (2017) Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390.
  • [25] Y. Pan, Y. Li, T. Yao, T. Mei, H. Li, and Y. Rui (2016) Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In IJCAI, pp. 3832–3838.
  • [26] Y. Pan, T. Yao, H. Li, C. Ngo, and T. Mei (2015) Semi-supervised hashing with semantic confidence for large scale visual search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 53–62.
  • [27] Y. Pan, T. Yao, X. Tian, H. Li, and C. Ngo (2014) Click-through-based subspace learning for image search. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 233–236.
  • [28] Z. Qiu, Y. Pan, T. Yao, and T. Mei (2017) Deep semantic hashing with generative adversarial networks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 225–234.
  • [29] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev (2016) Metric learning with adaptive density discrimination. In International Conference on Learning Representations.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [32] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 118–126.
  • [33] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865.
  • [34] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
  • [35] Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pp. 1988–1996.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [37] E. Ustinova and V. Lempitsky (2016) Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pp. 4170–4178.
  • [38] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  • [39] L. van der Maaten (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (1), pp. 3221–3245.
  • [40] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset.
  • [41] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li (2014) Deep learning for content-based image retrieval: a comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 157–166.
  • [42] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393.
  • [43] K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006) Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 1473–1480.
  • [44] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan (2014) Supervised hashing for image retrieval via image representation learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 2156–2162.
  • [45] T. Yao, T. Mei, and C. Ngo (2015) Learning query and image similarities with ranking canonical correlation analysis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 28–36.
  • [46] T. Yao, T. Mei, and Y. Rui (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [47] Y. Yuan, K. Yang, and C. Zhang (2017) Hard-aware deeply cascaded embedding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 814–823.
  • [48] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  • [49] X. Zhang, F. Zhou, Y. Lin, and S. Zhang (2016) Embedding label structures for fine-grained feature representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1114–1123.
  • [50] X. Zhang, F. X. Yu, S. Kumar, and S. Chang (2017) Learning spread-out local feature descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4595–4603.
  • [51] Z. Zhang and V. Saligrama (2016) Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6034–6042.