
# A Probabilistic approach for Learning Embeddings without Supervision

## Abstract

For challenging machine learning problems such as zero-shot learning and fine-grained categorization, embedding learning is the machinery of choice because of its ability to learn generic notions of similarity, as opposed to class-specific concepts in standard classification models. Embedding learning aims at learning discriminative representations of data such that similar examples are pulled closer, while dissimilar ones are pushed away. Despite their exemplary performance, supervised embedding learning approaches require a huge number of annotations for training. This restricts their applicability for large datasets in new applications where obtaining labels requires extensive manual effort and domain knowledge. In this paper, we propose to learn an embedding in a completely unsupervised manner without using any class labels. Using a graph-based clustering approach to obtain pseudo-labels, we form triplet-based constraints following a metric learning paradigm. Our novel embedding learning approach uses a probabilistic notion that intuitively minimizes the chances of each triplet violating a geometric constraint. Due to the nature of the search space, we learn the parameters of our approach using Riemannian geometry. Our proposed approach performs competitively with state-of-the-art approaches.

Keywords: Embedding learning, metric learning, unsupervised learning, graph-based learning, clustering, zero-shot learning, fine-grained visual categorization, Riemannian manifolds, Riemannian optimization.

## 1 Introduction

Embedding learning is the machinery of choice in many challenging machine learning problems where standard classification models cannot be used with ease. For example, softmax-based classification networks using cross-entropy loss are hard to train when the number of classes is huge (extreme classification [1, 2]). For cross-domain tasks [3, 4, 5], classification-based models require complicated training procedures. Also, for the challenging zero-shot learning scenario [6, 7] where the test examples belong to semantic classes not seen during training, embedding learning is preferred over standard classification models. This is because of its ability to capture generic notions of similarity, rather than class-specific concepts. Furthermore, classification-based models cannot handle probabilistic class labels, but embedding learning can [8].

Learning rich and discriminative representations of data is essential in various problems, including the ones mentioned above. Embedding learning aims at learning representations or embeddings of data, while grouping similar examples and segregating dissimilar ones. As the common practice is to learn a metric in the embedding space, embedding learning is often studied interchangeably with metric learning [9, 10, 11]. Metric learning methods require only a weaker form of supervision, usually provided as pairs [12, 13, 14, 15, 16, 17], triplets [18, 19, 20, 21], batches [22] or tuples [23]. This is yet another advantage of metric learning over classification-based models.

However, in practice, pairs or triplets are constructed from class labels, thus making commonly used metric learning approaches [15, 16, 17, 10, 9] supervised in nature. This is undesirable, as obtaining manual annotations may either be infeasible, or may require domain-specific knowledge in some tasks. In particular, for newer applications involving large datasets (e.g., medical imaging requiring invasive procedures, 3D point cloud object datasets, pixel-level annotations for semantic segmentation), instances may be difficult to annotate. While obtaining labels for large-scale datasets is often infeasible, unlabeled data, on the other hand, is ubiquitous and can provide richer information if exploited well. This motivates us to learn features or embeddings in a completely unsupervised manner, i.e., without using class labels.

In this paper, we propose an unsupervised approach to learn an embedding that can achieve competitive results. Figure 1 illustrates our proposed approach. In particular, we follow a triplet-based metric learning paradigm [18, 19, 21] for providing weak supervision. We believe that our study can establish new ways of performing self-training and learning from unlabeled data. Furthermore, it leads to a new, generic way of including information from unlabeled data, which could be used for general semi-supervised learning in the future.

As we do not have class labels, we use a graph-based clustering to obtain pseudo-labels, and form the triplet constraints. However, rather than naively using the triplets obtained from pseudo-labels, we propose to use a weight function that appropriately scales the losses associated with the triplets. Having the weighted losses for the triplets, we finally use a probabilistic notion to learn an embedding. In particular, we minimize the chances of each triplet violating a geometric angular constraint.

Additionally, to prevent the model from collapsing to a singularity and overfitting the training data, we impose constraints in the form of orthogonality on the parametric matrix of the learned embedding. This requires us to exploit Riemannian manifold based optimization techniques to learn the parameters of our approach. We jointly learn the parameters of the weight function and the embedding using optimization on a Riemannian product manifold.

## 2 Background

This section provides a basic introduction to the following topics (relevant to our paper): i) Triplets-based Distance Metric Learning (DML) along with two relevant losses, ii) the graph-based Authority Ascent Shift (AAS) clustering method [25], used to obtain pseudo-labels for our proposed approach, and iii) basics of Riemannian optimization, which is used to learn the parameters of our method.

### 2.1 Triplets-based Distance Metric Learning (DML)

Let $\mathcal{X} = \{x_i\}_{i=1}^{N}$ be a dataset of cardinality $N$, with $x_i \in \mathbb{R}^d$ being the descriptor of sample $i$. For $x_i$, let $L^\top x_i \in \mathbb{R}^l$ denote its embedding, learned by a projection matrix $L \in \mathbb{R}^{d \times l}$. $L$ facilitates dimensionality reduction if $l < d$. Consider the function $\delta_L^2(x_i, x_j) = (x_i - x_j)^\top L L^\top (x_i - x_j)$, for a pair of examples $(x_i, x_j)$. Here, $\delta_L^2$ is essentially a squared Mahalanobis distance. The goal of our work is to learn the parametric matrix $L$, given a set of triplet constraints [18, 20, 21, 19]: $\mathcal{T}_{labeled} = \{(x_i, x_i^+, x_i^-)\}$. Here, $x_i^+$ is similar to (or from the same class as) $x_i$, and $x_i^-$ is dissimilar to (or from a different class than) both $x_i$ and $x_i^+$. $x_i$, $x_i^+$ and $x_i^-$ are called the anchor, the positive (or target) and the negative (or impostor) respectively. The objective function to minimize the triplet loss can be expressed as follows:

$$J_{triplet} = \sum_{i=1}^{|\mathcal{T}_{labeled}|} \left[ \delta_L^2(x_i, x_i^+) - \delta_L^2(x_i, x_i^-) + \tau \right]_+. \tag{1}$$

Here, (1) ensures that the distance between the pair $(x_i, x_i^-)$ is greater than the distance between the pair $(x_i, x_i^+)$ by a margin $\tau$. $[\cdot]_+ = \max(0, \cdot)$ denotes the hinge-loss function. On the other hand, instead of constraining with respect to a distance-based margin as in (1), we can also constrain a triplet with respect to an angle. The objective function to minimize the angular loss can be expressed as follows [26]:

$$J_{angular} = \sum_{i=1}^{|\mathcal{T}_{labeled}|} \left[ \delta_L^2(x_i, x_i^+) - 4 \tan^2\alpha \, \delta_L^2(x_i^-, x_{i-avg}) \right]_+. \tag{2}$$

Here, $x_{i-avg} = (x_i + x_i^+)/2$, and $\alpha$ is a hyperparameter that corresponds to an upper bound on the angle at $x_i^-$ of the triplet $(x_i, x_i^+, x_i^-)$. In both (1) and (2), the constraint set $\mathcal{T}_{labeled}$ is obtained using class labels. We, however, do not make use of class labels to learn $L$.
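As a minimal sketch of the two hinge terms in (1) and (2) for a single triplet, the following NumPy snippet may help. The function names and the toy 2-D points are illustrative assumptions, not part of the original method description; the projection is taken to be the identity only to keep the arithmetic easy to follow.

```python
import numpy as np

def sq_dist(L, x, y):
    """Squared distance between x and y after projecting with L (d x l)."""
    diff = L.T @ (x - y)
    return float(diff @ diff)

def triplet_hinge(L, anchor, pos, neg, tau):
    """Hinge term of (1) for one triplet."""
    return max(0.0, sq_dist(L, anchor, pos) - sq_dist(L, anchor, neg) + tau)

def angular_hinge(L, anchor, pos, neg, alpha):
    """Hinge term of (2) for one triplet; alpha is given in radians."""
    avg = 0.5 * (anchor + pos)  # average of anchor and positive
    return max(0.0, sq_dist(L, anchor, pos)
               - 4.0 * np.tan(alpha) ** 2 * sq_dist(L, neg, avg))

# Toy data (illustrative values only): identity projection in 2-D.
L = np.eye(2)
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n_far = np.array([0.5, 5.0])    # negative far from the anchor-positive pair
n_near = np.array([0.5, 0.1])   # negative almost between anchor and positive
alpha = np.deg2rad(45.0)

loss_far = angular_hinge(L, a, p, n_far, alpha)
loss_near = angular_hinge(L, a, p, n_near, alpha)
margin_near = triplet_hinge(L, a, p, n_near, tau=1.0)
```

A negative that lies far from the anchor-positive pair incurs no angular loss, whereas a negative sitting close to their average is penalized.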

### 2.2 Obtaining pseudo-labels using Authority Ascent Shift (AAS) clustering

We assume that the given dataset is unlabeled. To compensate for the lack of class labels, we suggest obtaining pseudo-labels to form a triplet set. In our work, we choose the graph-based Authority Ascent Shift (AAS) clustering [25] to obtain the pseudo-labels. Let the AAS clustering be denoted by a function that maps each example $x_i$ to a positive integer, the pseudo-label assigned to $x_i$. Briefly, AAS requires constructing a weighted graph with nodes representing the examples, and edges between nearest neighbors. Edge weights denote affinities between examples. With $\omega$ denoting the stationary probability distribution of a random walker on the graph, the node relevancy from node $i$ to node $j$ can be defined as [25]:

$$\psi(i, j) = d_i P_{ij} \exp\left(-\gamma \left(\nabla\omega(i, j)\right)^2\right). \tag{3}$$

Here, $d_i$ is the out-degree of node $i$, $P_{ij}$ is the transition probability from node $i$ to node $j$, $\exp(\cdot)$ is the exponential function, and $\gamma$ is a hyperparameter. The set of relevant neighbors of node $i$ can be defined as:

$$N_\epsilon(i) = \{ j \in V : \psi(i, j) > \epsilon \} \cup \{ i \}. \tag{4}$$

Here, $\epsilon$ is a hyperparameter and $V$ is the vertex set of the graph. Authority ascent of a node $i$ can be performed by moving towards a node $j$ selected by the criterion given in [25]. By subsequently performing authority ascent on neighboring nodes, we can associate an authority mode [25] to node $i$. Nodes sharing a common authority mode build a tree. Disjoint trees represent the distinct, arbitrary-shaped clusters present in the data [25]. Using the clustering, we can obtain the set of triplets as required.

Motivation to use AAS: AAS does not require the user to input the number of clusters present, which is crucial in the unsupervised setting. More importantly, it is able to detect arbitrary-shaped clusters in the data, while being robust to noise and outliers. As AAS relies on a notion of geometric similarity, we further conjecture that AAS groups together semantically similar examples even when they lie far apart. We believe that this is helpful in capturing intra-class variances present in the data. For example, such variances occur in visual data due to minor pose, illumination or viewpoint differences. Capturing intra-class variances is important, as also pointed out by Li et al. [27].

### 2.3 Basics of Riemannian optimization

To learn the parameters of our proposed method, we make use of Riemannian manifold based optimization, which is an extensive subject in its own right. For a more detailed, formal study of the topic, we refer the interested reader to the book by Absil et al. [28]. However, to make the current text self-contained, we briefly explain the intuitions of a few underlying concepts relevant to our paper.

As shown in Figure 2, a Riemannian manifold $\mathcal{M}$ is a non-Euclidean space that locally resembles a Euclidean space, and is equipped with an inner product $\langle \cdot, \cdot \rangle_p$ on the tangent space $T_p\mathcal{M}$ at each point $p \in \mathcal{M}$. The tangent space $T_p\mathcal{M}$ can be considered as a linearization of $\mathcal{M}$ at $p$.

Optimization methods like Riemannian Conjugate Gradient Descent (RCGD) can be performed by following a line-search algorithm on $\mathcal{M}$. Given a descent direction provided by the tangent vector $\xi_{p_t} \in T_{p_t}\mathcal{M}$, line search can be performed by moving along a geodesic, a smooth curve on the manifold (dotted curve in Figure 2). The update formula for the line-search algorithm can be given as:

$$p_{t+1} = R_{p_t}(\eta_t \xi_{p_t}). \tag{5}$$

Here, $R_{p_t}$ is called the retraction operator, and $\eta_t$ is the step size.

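Let $f : \mathcal{M} \to \mathbb{R}$ be a smooth function. To minimize $f$, a gradient-descent algorithm can be obtained when the direction of $\xi_{p_t}$ coincides with $-\mathrm{grad} f(p_t)$, where $\mathrm{grad} f(p)$ is the Riemannian gradient at $p$. The Riemannian gradient is defined as the unique element $\mathrm{grad} f(p) \in T_p\mathcal{M}$ that satisfies:

$$\langle \mathrm{grad} f(p), \xi_p \rangle_p = \mathrm{D}f(p)[\xi_p], \quad \forall \xi_p \in T_p\mathcal{M}. \tag{6}$$

Here, $\mathrm{D}f(p)[\xi_p]$ is the directional derivative of $f$ in the direction of $\xi_p$, and $\langle \cdot, \cdot \rangle_p$ is the inner product between tangent vectors in $T_p\mathcal{M}$.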

Now, given a manifold $\mathcal{M}$ as shown in Figure 2, a Lie group acts on $\mathcal{M}$ if it defines a mapping that sends each pair of a point of $\mathcal{M}$ and a group element to a point of $\mathcal{M}$. If the action of the group defines an equivalence relation $\sim$, then by the quotient manifold theorem (Theorem 9.16) in Lee et al. [29], $\mathcal{M}/\!\sim$ forms a smooth manifold, and is called a Riemannian quotient manifold. Essentially, the quotient manifold is the set of all equivalence classes, such that for a point $p \in \mathcal{M}$, the equivalence class is defined as: $[p] = \{ q \in \mathcal{M} : q \sim p \}$. Thus, $p$ and any $q \sim p$ represent the same point $[p]$ of the quotient manifold.

As we will see later, our objective function (12) displays an invariance property. In short, it means that for an objective function $f$, and two points $p, q \in \mathcal{M}$ such that $q \sim p$, we get the same value of the objective, i.e., $f(p) = f(q)$. This could be detrimental to the optimization method. Hence, considering the quotient manifold theorem is important.

However, the quotient manifold is an abstract manifold. Therefore, to provide a matrix representation of the abstract tangent space $T_{[p]}(\mathcal{M}/\!\sim)$, we make use of the horizontal space $\mathcal{H}_p$. Here, $\mathcal{H}_p$ and $\mathcal{V}_p$ are called the horizontal space and the vertical space respectively, and are two complementary parts of the tangent space $T_p\mathcal{M}$. The tangent vector $\xi_p \in \mathcal{H}_p$ is called the horizontal lift of $\xi_{[p]} \in T_{[p]}(\mathcal{M}/\!\sim)$.

The search spaces of the parameters of our approach (i.e., embedding and weight) are individual Riemannian manifolds. To jointly learn them we consider the product space of these manifolds, which is again a Riemannian manifold [28]. The operators discussed above, i.e., gradient and retraction, for this product manifold can be defined as the Cartesian product of the individual components. However, the inner product is defined as the sum of the inner products of the individual components. Lastly, we discuss the following two manifolds that are relevant to our paper:

###### Definition 1.

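The Stiefel manifold: The (orthogonal) Stiefel manifold ($St(l, d)$) [28] is formed by the set of all orthonormal matrices of order $d \times l$, as follows:

$$St(l, d) \triangleq \{ L \in \mathbb{R}^{d \times l} : L^\top L = I_l \}$$

$St(l, d)$ has a dimensionality of $dl - \frac{1}{2} l(l+1)$. Here, $I_l$ is the $l \times l$ identity matrix. The Riemannian metric is given as: $\langle \xi_L, \zeta_L \rangle_L = \mathrm{Tr}(\xi_L^\top \zeta_L)$, for $\xi_L, \zeta_L \in T_L St(l, d)$. $\mathrm{Tr}(\cdot)$ is the trace operator.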

###### Definition 2.

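The Grassmann manifold: The Grassmann manifold ($\mathcal{G}(l, d)$) [28] is the collection of $l$-dimensional subspaces spanned by the columns of orthonormal matrices of order $d \times l$, and is defined as follows:

$$\mathcal{G}(l, d) \triangleq \{ \mathrm{span}(W) : W \in \mathbb{R}^{d \times l}, W^\top W = I_l \}$$

$\mathcal{G}(l, d)$ has a dimensionality of $l(d - l)$. $\mathcal{G}(l, d)$ is essentially a quotient manifold of $St(l, d)$.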
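The relation between the two manifolds can be checked numerically. The sketch below is only illustrative: the sizes `d` and `l` and the use of a QR factorization to draw an orthonormal matrix are assumptions made for the example. It verifies that multiplying a Stiefel point on the right by an orthogonal matrix leaves the spanned subspace unchanged, and that the standard dimension counts $dl - l(l+1)/2$ and $l(d-l)$ differ by the dimension $l(l-1)/2$ of the orthogonal group.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l = 6, 3   # illustrative sizes

# A point on St(l, d): orthonormalize a random d x l matrix with a thin QR.
L, _ = np.linalg.qr(rng.standard_normal((d, l)))

# Any l x l orthogonal matrix Q gives another Stiefel point LQ ...
Q, _ = np.linalg.qr(rng.standard_normal((l, l)))
LQ = L @ Q

# ... that spans the same subspace: the orthogonal projector is unchanged.
is_orthonormal = np.allclose(L.T @ L, np.eye(l))
same_subspace = np.allclose(L @ L.T, LQ @ LQ.T)

dim_stiefel = d * l - l * (l + 1) // 2   # dimension of St(l, d)
dim_grassmann = l * (d - l)              # dimension of G(l, d)
```

Both `L` and `LQ` are therefore representatives of the same Grassmann point.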

## 3 Proposed Method

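We now discuss our proposed embedding learning approach. Using the clustering function, we can generate a set of triplets $\mathcal{T} = \{ (x_i, x_i^+, x_i^-) \}_{i=1}^{|\mathcal{T}|}$, each element of which consists of the following: i) $x_i$, an arbitrary example with an assigned pseudo-label; ii) $x_i^+$, another arbitrary example with the same pseudo-label as $x_i$; iii) $x_i^-$, such that its pseudo-label differs from that of $x_i$. The examples $x_i$, $x_i^+$ and $x_i^-$ are referred to as the anchor, positive and negative respectively. Here, $|\mathcal{T}|$ is the number of triplets generated. Using $\mathcal{T}$, our goal is to learn $L$ in $\delta_L^2$.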
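The triplet construction can be sketched as follows. This is a hedged illustration only: the text does not specify how triplets are sampled, so uniform random sampling, the function name `sample_triplets`, and the toy label list are all assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_triplets(pseudo_labels, num_triplets, seed=0):
    """Sample (anchor, positive, negative) index triplets from pseudo-labels."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        by_cluster[label].append(idx)
    # An anchor-positive pair needs a cluster with at least two members,
    # and a negative needs at least one other cluster.
    usable = [c for c, members in by_cluster.items() if len(members) >= 2]
    if not usable or len(by_cluster) < 2:
        return []
    triplets = []
    for _ in range(num_triplets):
        c = rng.choice(usable)
        anchor, positive = rng.sample(by_cluster[c], 2)
        other = rng.choice([k for k in by_cluster if k != c])
        negative = rng.choice(by_cluster[other])
        triplets.append((anchor, positive, negative))
    return triplets

labels = [1, 1, 2, 2, 2, 3]   # hypothetical pseudo-labels from a clustering
triplets = sample_triplets(labels, num_triplets=5)
```

Each returned tuple holds indices into the dataset, so the same triplet list can be reused with any feature representation.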

Having obtained the constraint set $\mathcal{T}$ of triplets, we follow a metric learning paradigm to learn $L$. Wang et al. [26] pointed out that in the standard triplet loss as in (1), the gradients w.r.t. an example in a triplet consider only two of the examples at a time, and fail to capture the third. They propose an angular loss constraint as in (2), which handles this issue by taking into account the angular geometry w.r.t. the negative example in the triplet. Thus, the gradients are computed considering all three examples of the triplet simultaneously. However, the loss in (2) is non-smooth due to the hinge-loss function. Hence, to reap the benefits of manifold-based optimization (as we will see later), we suggest using a smooth version of the angular constraint as the metric loss associated with a triplet $(x_i, x_i^+, x_i^-)$:

$$m(x_i, x_i^+, x_i^-) = \log\left(1 + \exp\left(z(x_i, x_i^+, x_i^-)\right)\right), \tag{7}$$

where,

$$z(x_i, x_i^+, x_i^-) = \delta_L^2(x_i, x_i^+) - 4 \tan^2\alpha \, \delta_L^2(x_i^-, x_{i-avg}). \tag{8}$$

Here, $x_{i-avg} = (x_i + x_i^+)/2$. $\alpha$ corresponds to an upper bound on the angle at $x_i^-$ of the triplet [26]. Apart from the benefit of computing gradients w.r.t. three examples simultaneously, (7) offers rotation and scale invariance. Furthermore, without a proper reference, it is not intuitive to tune a distance-based hyperparameter $\tau$ as in (1), whereas it is much more intuitive to tune an angle $\alpha$ as present in (8).
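To see how the smooth loss in (7) relates to the hinge in (2), the short sketch below compares $\log(1+\exp(z))$ with $\max(0, z)$ on a few values of $z$. The chosen values and the helper name `softplus` are illustrative assumptions; `np.logaddexp` is used only as a numerically stable way to evaluate the same expression.

```python
import numpy as np

def softplus(z):
    """Numerically stable evaluation of log(1 + exp(z))."""
    return float(np.logaddexp(0.0, z))

z_values = [-30.0, -1.0, 0.0, 0.96, 30.0]      # illustrative values of z in (8)
hinge = [max(0.0, z) for z in z_values]        # non-smooth term used in (2)
smooth = [softplus(z) for z in z_values]       # smooth term used in (7)
gaps = [s - h for s, h in zip(smooth, hinge)]  # how far the two differ
```

The smooth loss never falls below the hinge and exceeds it by at most $\log 2$, with the largest gap at $z = 0$, so it preserves the ordering of triplets while remaining differentiable everywhere.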

Additionally, to scale the contribution of a triplet for learning our embedding, we also assign a weight to the metric loss term defined in (7). Specifically, we propose a weighted loss as follows:

$$f(x_i, x_i^+, x_i^-) = w(x_i, x_i^+, x_i^-) \, m(x_i, x_i^+, x_i^-). \tag{9}$$

Here, a function $w(\cdot)$ is used to provide the weight given to the metric loss associated with a triplet $(x_i, x_i^+, x_i^-)$, and we define it as follows:

$$w(x_i, x_i^+, x_i^-) = \frac{1}{1 + \exp(-r^\top x_{i-cat})}, \tag{10}$$

where $x_{i-cat}$ is the concatenated representation of the triplet and $r$ represents the parameter of the weight function. Intuitively, we collectively represent the triplet as a single example, and let that representation decide the weight that should be given to the metric loss term for the triplet. Concatenating examples within a triplet as a single vector can help to capture specific relationships among them. For example, Duan et al. [10] have concatenated the examples in a triplet to generate a synthetic negative with respect to the anchor. In our case, we map the concatenation of the examples in a triplet to a confidence value.
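A small sketch of (9) and (10) is given below. Because the exact construction of the concatenated vector is not reproduced here, the code simply takes `x_cat` as a given vector of length $2d$ (matching the space of $r$ in (13)) and fills it with random numbers; that stand-in, the function names, and the metric loss value `m` are assumptions made for illustration.

```python
import numpy as np

def triplet_weight(r, x_cat):
    """Weight w in (10): a logistic function of r^T x_cat."""
    return 1.0 / (1.0 + np.exp(-(r @ x_cat)))

def weighted_loss(r, x_cat, m):
    """Weighted loss f = w * m of (9), with m the metric loss from (7)."""
    return triplet_weight(r, x_cat) * m

d = 4
rng = np.random.default_rng(0)
x_cat = rng.standard_normal(2 * d)   # stand-in for the concatenated triplet vector
m = 1.3                              # illustrative metric loss value

w_zero = triplet_weight(np.zeros(2 * d), x_cat)     # r = 0 weights every triplet equally
f_zero = weighted_loss(np.zeros(2 * d), x_cat, m)
w_rand = triplet_weight(rng.standard_normal(2 * d), x_cat)
```

With $r = 0$ every triplet receives the same weight of one half; learning $r$ then lets the model move individual weights towards 0 or 1.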

We now propose a novel probabilistic objective to learn an embedding. Given the set of parameters $r$ and $L$ of our model, we associate a probability $p_i$ to a triplet $(x_i, x_i^+, x_i^-)$, defined as follows:

$$p_i = \frac{1}{1 + \exp(f_i)}. \tag{11}$$

Here, $f_i$ denotes $f(x_i, x_i^+, x_i^-)$ in (9). We know that for the triplet to satisfy the angular loss constraint, we can minimize $f_i$. In (11), minimizing $f_i$ ensures maximizing $p_i$. Hence, $p_i$ intuitively represents the likelihood of the triplet satisfying the angular constraint. Now, let $J_i = -\log p_i$ denote the negative log likelihood associated with the triplet $(x_i, x_i^+, x_i^-)$. We propose to use Maximum Likelihood Estimation (MLE) to learn our parameters, i.e., we minimize the sum of the negative log likelihood over all the triplets, as follows:

$$\min_{r, L} \mathcal{L}(r, L) = \sum_{i=1}^{|\mathcal{T}|} J_i = \sum_{(x_i, x_i^+, x_i^-) \in \mathcal{T}} \log\left(1 + \exp(f_i)\right). \tag{12}$$
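The step from (11) to (12) can be spelled out as a short derivation. It uses only the definitions already given; the symbol $\sigma(t) = 1/(1+\exp(-t))$ is introduced here purely as shorthand for the logistic function.

$$
\begin{aligned}
p_i &= \frac{1}{1 + \exp(f_i)} = \sigma(-f_i), \\
J_i &= -\log p_i = \log\left(1 + \exp(f_i)\right), \\
\frac{\partial J_i}{\partial f_i} &= \frac{\exp(f_i)}{1 + \exp(f_i)} = \sigma(f_i) = 1 - p_i .
\end{aligned}
$$

Since $\partial J_i / \partial f_i$ is always positive, lowering the weighted loss $f_i$ always lowers $J_i$, and the factor $1 - p_i$ means that triplets which already satisfy the constraint with high probability contribute little to the gradient.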

However, without any regularizer, our model may collapse to a singularity. To avoid this, we enforce the orthogonality constraint $L^\top L = I_l$. Adding this constraint has the following additional benefits [17]: i) minimizing the difference in performance for frequent and less frequent classes, ii) avoiding overfitting, and iii) obtaining a better embedding using a small number of projection vectors.

The matrix $L$ naturally lies on an orthogonal Stiefel manifold [28]. However, the objective of (12) is invariant to the right action of the orthogonal group $\mathcal{O}(l)$, i.e., for $Q \in \mathcal{O}(l)$, we have $\mathcal{L}(r, L) = \mathcal{L}(r, LQ)$. To jointly learn the parameters $r$ and $L$, we can constrain the optimization problem in (12) on the following product manifold:

$$\mathcal{M}_p \triangleq \mathbb{R}^{2d} \times \mathcal{G}(l, d). \tag{13}$$

Here, $\mathcal{G}(l, d)$ is the Grassmann manifold [28], which is the quotient of $St(l, d)$ with the equivalence class being $[L] = \{ LQ : Q \in \mathcal{O}(l) \}$. $\mathcal{M}_p$ can be given the structure of a Riemannian manifold using product topology [28], and it has a dimensionality of $2d + l(d - l)$. We call our proposed approach defined in (12) Reweighted Probabilistic unsupervised eMbedding Learning (RPML). Figure 1 illustrates the RPML approach.
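The invariance that motivates the Grassmann manifold can be verified numerically. In (12), the matrix $L$ enters only through squared distances of the form $(x - y)^\top L L^\top (x - y)$, so it suffices to check that such a distance is unchanged when $L$ is multiplied on the right by an orthogonal matrix. The sizes and random data below are illustrative assumptions.

```python
import numpy as np

def sq_dist(L, x, y):
    """Squared distance between x and y after projecting with L (d x l)."""
    diff = L.T @ (x - y)
    return float(diff @ diff)

rng = np.random.default_rng(1)
d, l = 5, 2   # illustrative sizes
L, _ = np.linalg.qr(rng.standard_normal((d, l)))   # orthonormal columns
Q, _ = np.linalg.qr(rng.standard_normal((l, l)))   # orthogonal l x l matrix
x, y = rng.standard_normal(d), rng.standard_normal(d)

dist_L = sq_dist(L, x, y)
dist_LQ = sq_dist(L @ Q, x, y)   # same value, since Q Q^T is the identity
```

Because every representative `L @ Q` gives the same objective value, optimizing over subspaces rather than individual matrices removes this redundancy from the search space.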

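We can learn the parameters of RPML using Riemannian Conjugate Gradient Descent (RCGD), for which we require the Euclidean gradients $\nabla_r J_i$ and $\nabla_L J_i$. Here,

$$\nabla_r J_i = g_i h_i m_i \, x_{i-cat}, \tag{14}$$

$$\nabla_L J_i = 2 g_i w_i \beta_i \left[ \delta_{ap} \delta_{ap}^\top - 4 \tan^2\alpha \, \delta_{nm} \delta_{nm}^\top \right] L, \tag{15}$$

$g_i = \exp(f_i) / (1 + \exp(f_i))$, $h_i = w_i (1 - w_i)$, $m_i$ denotes $m(x_i, x_i^+, x_i^-)$ in (7). Also, $w_i$ denotes $w(x_i, x_i^+, x_i^-)$ in (10), $\beta_i = \exp(z_i) / (1 + \exp(z_i))$, $z_i$ denotes $z(x_i, x_i^+, x_i^-)$ in (8), and $\delta_{ap} = x_i - x_i^+$, $\delta_{nm} = x_i^- - x_{i-avg}$. Given the Euclidean gradients, we can use a standard toolbox like Manopt [30] to perform RCGD.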
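The gradient expressions (14) and (15) can be sanity-checked against central finite differences. The sketch below is a hedged illustration: it reads the scalar factors as $g_i = \sigma(f_i)$, $h_i = w_i(1 - w_i)$ and $\beta_i = \sigma(z_i)$, with $\sigma$ the logistic function, which is what the chain rule gives for (7) to (12). The concatenated vector `x_cat` is a random stand-in of length $2d$, and all sizes and data are assumptions made for the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll_and_grads(r, L, x, xp, xn, x_cat, alpha):
    """J_i = log(1 + exp(w_i * m_i)) with gradients following (14) and (15)."""
    c = 4.0 * np.tan(alpha) ** 2
    d_ap = x - xp                       # anchor minus positive
    d_nm = xn - 0.5 * (x + xp)          # negative minus anchor-positive average
    z = np.sum((L.T @ d_ap) ** 2) - c * np.sum((L.T @ d_nm) ** 2)   # (8)
    m = np.logaddexp(0.0, z)            # smooth angular loss (7)
    w = sigmoid(r @ x_cat)              # weight (10)
    f = w * m                           # weighted loss (9)
    J = np.logaddexp(0.0, f)            # negative log likelihood
    g, h, beta = sigmoid(f), w * (1.0 - w), sigmoid(z)
    grad_r = g * h * m * x_cat
    grad_L = 2.0 * g * w * beta * (np.outer(d_ap, d_ap) - c * np.outer(d_nm, d_nm)) @ L
    return J, grad_r, grad_L

def numeric_grad(fun, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (fun(theta + step) - fun(theta - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
d, l = 4, 2
r = rng.standard_normal(2 * d)
L = rng.standard_normal((d, l))        # unconstrained: only Euclidean gradients are checked
x, xp, xn = rng.standard_normal((3, d))
x_cat = rng.standard_normal(2 * d)     # stand-in for the concatenated triplet vector
alpha = np.deg2rad(40.0)

J, grad_r, grad_L = nll_and_grads(r, L, x, xp, xn, x_cat, alpha)
num_r = numeric_grad(lambda v: nll_and_grads(v, L, x, xp, xn, x_cat, alpha)[0], r)
num_L = numeric_grad(lambda M: nll_and_grads(r, M, x, xp, xn, x_cat, alpha)[0], L)
```

Agreement between the analytic and numerical gradients is a quick way to catch sign or factor errors before handing the gradients to a manifold optimization toolbox.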

### 3.1 An efficient algorithm for the proposed RPML method

We now propose an efficient matrix-vector based algorithm for RPML. Given , we can construct matrices and , where and . We can denote a vector , each component of which is computed as: . Denoting as a vector of all ones, we can compute the following vector: , where and denotes the component-wise division operator.

We can also construct the following matrices and , where and . Denoting as a vector of all ones, we can compute vectors . Let, , and be a vector with its -th component computed as: . Then, , where is the Hadamard product operator. We can denote , each component of which can be computed as:

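$$\tilde{f}_i = \log\left(1 + \exp(f_i)\right), \tag{16}$$

where $f_i$ denotes the $i$-th component of $f$. Then, the objective of RPML can be expressed as:

$$\mathcal{L}(r, L) = \mathrm{Tr}\left(\mathrm{diag}(\tilde{f})\right). \tag{17}$$

Here, $\mathrm{Tr}(\cdot)$ is the trace operator and $\mathrm{diag}(\tilde{f})$ is a diagonal matrix with the components of $\tilde{f}$ on its diagonal.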

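Now, let $g$, $h$ and $\beta$ be vectors, components of which are computed as: $g_i = \exp(f_i) / (1 + \exp(f_i))$, $h_i = w_i (1 - w_i)$ and $\beta_i = \exp(z_i) / (1 + \exp(z_i))$. We can construct the following matrices: $\tilde{C}$, $\tilde{P}$ and $\tilde{Q}$ as follows:

$$\tilde{c}_{ik} = [g^\top \odot h^\top \odot m^\top]_i \, c_{ik}, \tag{18}$$

$$\tilde{p}_{ik} = [2 g^\top \odot w^\top \odot \beta^\top]_i \, p_{ik}, \tag{19}$$

$$\tilde{q}_{ik} = [2 g^\top \odot w^\top \odot \beta^\top]_i \, q_{ik}. \tag{20}$$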

Here, and are the -th components of vectors and respectively. Also, and denote the -th components of the respective vectors. Now, we can compute the Euclidean gradients as follows:

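$$\nabla_r \mathcal{L}(r, L) = \tilde{C} e_{|\mathcal{T}|}, \tag{21}$$

$$\nabla_L \mathcal{L}(r, L) = \left( \tilde{P} P^\top - 4 \tan^2\alpha \, \tilde{Q} Q^\top \right) L. \tag{22}$$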

We can utilize the product topology of (13) and perform a Riemannian Conjugate Gradient Descent (RCGD) method [28] to learn the parameters of RPML. Let $\nabla_r$ and $\nabla_L$ be shorthand notations for (21) and (22). The Riemannian gradient is obtained as $\mathrm{grad}_L = \nabla_L - L L^\top \nabla_L$. For $L \in \mathcal{G}(l, d)$ and the tangent vector $\xi_L$, the retraction can be obtained as: $R_L(\xi_L) = U V^\top$, where $L + \xi_L = U \Sigma V^\top$. Hence, we can provide the following update steps for the parameters $r$ and $L$ (with the update step for $r$ being the usual Euclidean gradient update):

$$r_{t+1} = r_t - \eta \nabla_r \mathcal{L}, \tag{23}$$

$$L_{t+1} = U V^\top. \tag{24}$$

Here, $U$ and $V$ are matrices obtained using an SVD such that $L_t - \eta \, \mathrm{grad}_L = U \Sigma V^\top$. $r_t$ and $L_t$ denote the parameters at the $t$-th iteration, and $\eta$ is the learning rate. $\Sigma$ is a diagonal matrix. The proposed algorithm for RPML has been summarized in Algorithm 1.
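The update for $L$ can be sketched as a single Riemannian step. This is a simplified illustration, not the full method: it performs plain gradient descent rather than conjugate gradient, it assumes the common Grassmann choices of projecting the Euclidean gradient with $I - LL^\top$ and retracting through a thin SVD, and it uses a hypothetical toy cost (the negative trace of $L^\top A L$) in place of the RPML objective.

```python
import numpy as np

def riemannian_step(L, euclid_grad, lr):
    """One descent step on the Grassmann manifold with an SVD-based retraction."""
    # Project the Euclidean gradient onto the horizontal space at L.
    rgrad = euclid_grad - L @ (L.T @ euclid_grad)
    # Move along the negative gradient, then map back to orthonormal columns.
    U, _, Vt = np.linalg.svd(L - lr * rgrad, full_matrices=False)
    return U @ Vt

def toy_cost(L, A):
    """Hypothetical cost, invariant to L -> LQ for orthogonal Q."""
    return -float(np.trace(L.T @ A @ L))

def toy_grad(L, A):
    """Euclidean gradient of toy_cost for symmetric A."""
    return -2.0 * A @ L

rng = np.random.default_rng(0)
d, l = 6, 2
B = rng.standard_normal((d, d))
A = B @ B.T                                          # symmetric positive semi-definite
L0, _ = np.linalg.qr(rng.standard_normal((d, l)))    # starting point with orthonormal columns

L1 = riemannian_step(L0, toy_grad(L0, A), lr=1e-3)
cost_before, cost_after = toy_cost(L0, A), toy_cost(L1, A)
```

The retraction guarantees that the iterate keeps orthonormal columns after every step, so the orthogonality constraint never has to be enforced through a penalty term.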

### 3.2 RPML using an alternative weight function (RPMLv1)

We now discuss an alternative to the weight function that we defined in (10). We propose to express our alternative weight function as:

$$w(x_i, x_i^+, x_i^-) = w_i = (w_i^+ + w_i^-)/2, \tag{25}$$

where

$$w_i^+ = \frac{1}{1 + \exp(-x_i^\top R R^\top x_i^+)}, \tag{26}$$

and

$$w_i^- = 1 - \frac{1}{1 + \exp(-x_{i-avg}^\top R R^\top x_i^-)}. \tag{27}$$

Here, $R$ is a matrix that is used to project the examples to an $l$-dimensional space (we fix it equal to the dimensionality of the embedding space, but it could be different). $w_i^+$ represents the bilinear similarity of the anchor $x_i$, w.r.t. the positive $x_i^+$, in the space projected by $R$. A lower value of similarity indicates that $x_i$ and $x_i^+$ should not have been grouped together by the clustering, and hence a lower weight should be given to the loss associated with this triplet.

$w_i^-$ is derived from the bilinear similarity of the negative $x_i^-$, w.r.t. the average representation $x_{i-avg}$ of the similar pair (anchor, positive) in the triplet, in the space projected by $R$. A higher similarity indicates that $x_i^-$ is similar to the anchor (and hence to the positive), and all three of them should have been grouped together by the clustering. Hence, a lower weight should be given to the loss term associated with this triplet while learning the embedding, which explains the subtraction from 1 in (27).

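We call the version of RPML obtained using this weight function RPMLv1. Please note that RPMLv1 requires replacing the parameter $r$ in (12) by $R$. To learn the parameters in RPMLv1, the first factor of the product manifold in (13) is replaced by the search space of $R$. While the Euclidean gradient w.r.t. $L$ is similar to that in RPML, we provide the form of $\nabla_R J_i$. Here,

$$\nabla_R J_i = \frac{1}{2} \left[ g_i h_i^+ m_i \Delta_{ap} - g_i h_i^- m_i \Delta_{mn} \right] R, \tag{28}$$

$J_i$ is the negative log-likelihood for a triplet, as defined earlier. $g_i = \exp(f_i) / (1 + \exp(f_i))$, $h_i^+ = w_i^+ (1 - w_i^+)$, $h_i^- = w_i^- (1 - w_i^-)$, $m_i$ denotes $m(x_i, x_i^+, x_i^-)$ in (7), and $\Delta_{ap} = x_i x_i^{+\top} + x_i^+ x_i^\top$, $\Delta_{mn} = x_{i-avg} x_i^{-\top} + x_i^- x_{i-avg}^\top$.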
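The behaviour of the alternative weight in (25) to (27) can be illustrated on toy data. In the sketch below the projection `R` is set to the identity and the 2-D points are chosen by hand; both are assumptions made so that the two similarities are easy to read off.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def rpmlv1_weight(R, anchor, pos, neg):
    """Alternative triplet weight following (25), (26) and (27)."""
    avg = 0.5 * (anchor + pos)
    w_pos = sigmoid((R.T @ anchor) @ (R.T @ pos))       # (26): anchor-positive similarity
    w_neg = 1.0 - sigmoid((R.T @ avg) @ (R.T @ neg))    # (27): penalizes similar negatives
    return 0.5 * (w_pos + w_neg)                        # (25)

R = np.eye(2)                        # identity projection, for illustration only
anchor = np.array([1.0, 0.0])
pos = np.array([2.0, 0.0])
neg_far = np.array([-3.0, 0.0])      # points away from the pair: low similarity
neg_close = np.array([3.0, 0.0])     # aligned with the pair: high similarity

w_clean = rpmlv1_weight(R, anchor, pos, neg_far)
w_suspect = rpmlv1_weight(R, anchor, pos, neg_close)
```

The triplet whose negative looks like the anchor-positive pair receives the smaller weight, which matches the intuition that such a triplet is more likely to stem from a clustering mistake.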

## 4 Related Work

Supervised methods: For challenging settings like zero-shot learning [6, 7], extreme classification [1, 2], few-shot learning [31, 32, 33], and Fine-Grained Visual Categorization (FGVC) [34], metric learning [11, 35] is preferred over standard classification based models. This is because metric learning has the ability to learn generic notions of similarity, as opposed to class-specific concepts. This is particularly crucial in problems like zero-shot learning where the test data belong to categories not seen during training. Metric learning implicitly learns an embedding that groups similar examples, while pushing dissimilar ones apart. Conversely, an embedding commonly induces a metric in the learned space.

The literature on metric learning is rich, and we refer the interested reader to the surveys by Bellet et al. [11] and Lu et al. [35]. To form constraints for weak supervision, commonly used metric learning approaches employ pairs [16, 17, 15], triplets [21, 19], etc. The role of the quality of the constraints in training convergence has been pointed out lately [19, 23]. Schroff et al. [19] discuss the selection of semi-hard examples for metric learning. Sohn et al. [23] propose to push away multiple negative examples. Duan et al. [10] and Chen et al. [16] introduce adversarial [36] constraints for metric learning. Despite their merits, all these approaches depend on a huge number of labeled examples, which may be infeasible to obtain in many applications, as already discussed.

Unsupervised methods: Despite the availability of a plethora of supervised approaches that make use of class labels to generate constraints for metric or embedding learning, the number of unsupervised approaches is quite limited. Classically, learning of embeddings in an unsupervised manner has been studied in the context of dimensionality reduction or manifold learning [37, 38, 39, 40]. However, they either do not generalize for out-of-sample data, or are not specifically aimed at metric learning. Diffusion processes [41, 42, 43] capture the intrinsic manifold structure of the data and propagate affinities through a pairwise similarity matrix by random-walk steps.

Many existing unsupervised feature learning approaches that have been proposed recently [44, 27, 45, 46] ignore the basic semantic relationships among the examples. They are not particularly designed for metric learning. As such, they do not perform well in difficult settings like zero-shot learning, where test examples belong to unseen semantic categories. Recently, Iscen et al. [47] proposed an approach for metric learning by mining hard positives and negatives in an unsupervised manner. Ye et al. [48] recently proposed another state-of-the-art method based on softmax embeddings.

## 5 Empirical Evaluations

### 5.1 Quantitative results for fine-grained categorization

#### Datasets and Evaluation protocol:

Following the standard deep metric learning protocol in the zero-shot setting [48], we conduct experiments using a stochastic version of our RPML approach (mini-batch size of 120) for the task of fine-grained visual categorization on the Caltech-UCSD Birds 200 (CUB) [49], Cars-196 [24] and Stanford Online Products (SOP) [22] datasets. CUB consists of images of 200 bird species, with the first 100 species (5864 examples) used for training and the remaining 100 species (5924 examples) for testing. Cars-196 consists of images of cars from 196 models. We used the first 98 models (8054 images) for training and the remaining 98 models (8131 images) for testing. For the SOP dataset, which consists of 22634 classes with 120053 images of products, we used the first 11318 classes (59551 images) for training and the remaining 11316 classes (60502 images) for testing.

We used GoogLeNet [50] pretrained on ImageNet [51], as the backbone CNN, using the MatConvNet [52] tool. The initial features are formed by the Regional Maximum Activation of Convolutions (R-MAC) [53] right before the average pool layer (following the MOM approach [47]), aggregated over three input scales (, , ).

#### Hyperparameters:

For all datasets, we set (following Wang et al. [26]). For obtaining the initial clustering, we used a 50 nearest neighbor graph for the AAS clustering approach [25]. AAS has two important parameters: in (3) and in (4). As is merely a scale parameter, we set it to . Using the t-SNE embedding of training data, we observed that setting to a lower value easily merges nearby examples into the same cluster. This leads to huge clusters with large spread, which is undesirable. On the other hand, setting a higher value leads to difficulty in merging examples into clusters. Hence for all datasets, we arbitrarily set , a value that is neither too high, nor too low.

#### Evaluation metrics:

After learning the parameters of our method using the training data, we project the test data into the embedding space, where we perform clustering and retrieval on the test data. To compare the methods, we perform $k$-means clustering by setting the value of $k$ to the actual number of classes. The performance of the clustering is measured in terms of Normalized Mutual Information (NMI), F-measure (F), Precision (P) and Recall (R). NMI is defined as the ratio of the mutual information to the average of the entropy of the clusters and the entropy of the ground-truth class labels. F-measure is the harmonic mean of precision and recall. For the retrieval task, we use the Recall@K metric, which gives the percentage of test examples that have at least one example from the same class among their K nearest neighbors. More details on these performance metrics can be found in [22].
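The Recall@K metric described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance in the embedding space, exclusion of each query from its own neighbor list, and a hand-made toy set of six points; the function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class example among their k nearest neighbors."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    sq = np.sum(X ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(dists, np.inf)                     # a query never retrieves itself
    nn = np.argsort(dists, axis=1)[:, :k]               # indices of the k nearest neighbors
    hits = (y[nn] == y[:, None]).any(axis=1)
    return float(hits.mean())

# Toy embedding (illustrative values): the last point is the only one of its class.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
              [0.2, 0.1], [5.1, 5.0], [9.0, 0.0]])
y = np.array([0, 0, 1, 1, 1, 2])

r1 = recall_at_k(X, y, 1)
r3 = recall_at_k(X, y, 3)
```

Recall@K can only grow with K, and an example whose class has no other member in the test set can never count as a hit.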

#### Comparison to state-of-the-art:

We compare the performance of our method against a few recent deep unsupervised feature and metric learning methods from the literature. The results of the comparisons are shown in Table 1. For all the datasets, we also report the performance obtained using the random projection matrix that serves as the initialization for our method. As observed, despite starting with a random matrix that leads to severely poor performance, our method significantly improves upon it. In most cases, our method outperforms the baseline methods.

In Table 2, we also compare our method against a few popular supervised deep metric learning approaches from the recent literature, on the CUB dataset. We observe that our method performs competitively despite not making use of any class labels.

### 5.2 Qualitative results on Cars dataset

We also analyse our proposed method qualitatively on the Cars-196 dataset. For this, we show the retrieval results using our approach in Figures 3 and 4. For each query, we show the top four retrieved images. As observed, the images retrieved are fairly accurate. In Figure 5, we also show a quad grid layout (see footnote 2) of nearest neighbors obtained using the t-SNE embedding of a subset of Cars-196 test images, computed with our method. As observed, similar cars are placed together. To further improve the performance, we may need a few labeled examples to guide the training in a better way. This interesting semi-supervised scenario could be explored as a future direction of work.

## 6 Conclusion

In this paper, we proposed an unsupervised embedding learning approach that uses a graph-based clustering approach to obtain pseudo-labels for examples. The pseudo-labels are used to form triplets of examples, which guide the learning of an embedding. We also proposed a weight function to scale the losses associated with these triplets. We jointly learn the parameters of our approach using Riemannian optimization on a product manifold, which further ensures faster convergence. Our approach performs competitively with state-of-the-art metric learning techniques on a number of benchmark datasets. In the future, we plan to formulate a semi-supervised variant of our approach to further guide the training process and achieve better performance.

### Footnotes

1. Preprint.
2. https://cs.stanford.edu/people/karpathy/cnnembed/

### References

1. Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S Dhillon. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proc. of International Conference on Machine Learning (ICML), pages 3069–3077, 2016.
2. Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proc. of ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pages 263–272. ACM, 2014.
3. Yun-Chun Chen, Yen-Yu Lin, Ming-Hsuan Yang, and Jia-Bin Huang. Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
4. Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Universal domain adaptation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
5. Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 799–807, 2016.
6. Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
7. Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8247–8255, 2019.
8. Mengdi Huai, Chenglin Miao, Yaliang Li, Qiuling Suo, Lu Su, and Aidong Zhang. Metric learning from probabilistic labels. In Proc. of ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pages 1541–1550. ACM, 2018.
9. Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, and Jie Zhou. Hardness-aware deep metric learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
10. Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep adversarial metric learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
11. Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, San Rafael, 2015.
12. Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In Proc. of International Conference on Machine Learning (ICML), pages 209–216, 2007.
13. Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2288–2295, 2012.
14. Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In Proc. of International Conference on Machine Learning (ICML), pages 2464–2471, 2016.
15. Mukul Bhutani, Pratik Jawanpuria, Hiroyuki Kasai, and Bamdev Mishra. Low-rank geometric mean metric learning. arXiv preprint arXiv:1806.05454, 2018.
16. Shuo Chen, Chen Gong, Jian Yang, Xiang Li, Yang Wei, and Jun Li. Adversarial metric learning. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 2018.
17. Pengtao Xie, Wei Wu, Yichen Zhu, and Eric P Xing. Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis. In Proc. of International Conference on Machine Learning (ICML), 2018.
18. Kilian Q Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. In Proc. of Neural Information Processing Systems (NeurIPS), pages 1473–1480, 2006.
19. Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
20. Yuan Shi, Aurélien Bellet, and Fei Sha. Sparse compositional metric learning. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), pages 2078–2084, 2014.
21. Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, and Yuan Jiang. Learning mahalanobis distance metric: Considering instance disturbance helps. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), pages 3315–3321, 2017.
22. Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4004–4012, 2016.
23. Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Proc. of Neural Information Processing Systems (NeurIPS), pages 1857–1865, 2016.
24. Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proc. of IEEE International Conference on Computer Vision Workshops (ICCVW), pages 554–561, 2013.
25. Minsu Cho and Kyoung Mu Lee. Mode-seeking on graphs via random walks. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 606–613. IEEE, 2012.
26. Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2017.
27. Dong Li, Wei-Chih Hung, Jia-Bin Huang, Shengjin Wang, Narendra Ahuja, and Ming-Hsuan Yang. Unsupervised visual representation learning by graph-based consistent constraints. In Proc. of European Conference on Computer Vision (ECCV), pages 678–694. Springer, 2016.
28. P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
29. John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
30. Nicolas Boumal, Bamdev Mishra, P-A Absil, and Rodolphe Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014.
31. Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 403–412, 2019.
32. Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6558–6567, 2019.
33. Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. In Proc. of Neural Information Processing Systems (NeurIPS), pages 2255–2265, 2017.
34. Qi Qian, Rong Jin, Shenghuo Zhu, and Yuanqing Lin. Fine-grained visual categorization via multi-stage metric learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3716–3724, 2015.
35. Jiwen Lu, Junlin Hu, and Jie Zhou. Deep metric learning for visual understanding: An overview of recent advances. IEEE Signal Processing Magazine, 34(6):76–84, 2017.
36. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. of Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
37. Trevor F Cox and Michael AA Cox. Multidimensional scaling. CRC press, 2000.
38. Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
39. Xiaofei He and Partha Niyogi. Locality preserving projections. In Proc. of Neural Information Processing Systems (NeurIPS), pages 153–160, 2003.
40. Xiaofei He, Deng Cai, Shuicheng Yan, and Hong-Jiang Zhang. Neighborhood preserving embedding. In Proc. of IEEE International Conference on Computer Vision (ICCV), pages 1208–1213, 2005.
41. Michael Donoser and Horst Bischof. Diffusion processes for retrieval revisited. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1320–1327, 2013.
42. Song Bai, Zhichao Zhou, Jingdong Wang, Xiang Bai, Longin Jan Latecki, and Qi Tian. Ensemble diffusion for retrieval. In Proc. of IEEE International Conference on Computer Vision (ICCV), pages 774–783, 2017.
43. Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondrej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2077–2086, 2017.
44. A Dosovitskiy, P Fischer, JT Springenberg, M Riedmiller, and T Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(9):1734–1747, 2016.
45. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proc. of European Conference on Computer Vision (ECCV), pages 132–149, 2018.
46. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3733–3742, 2018.
47. Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Mining on manifolds: Metric learning without labels. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
48. Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6210–6219, 2019.
49. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
50. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
51. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
52. Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, pages 689–692. ACM, 2015.
53. Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. In Proc. of International Conference on Learning Representations (ICLR), 2016.
54. S Chopra, R Hadsell, and Y LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539–546. IEEE, 2005.
55. Jiwen Lu, Junlin Hu, and Yap-Peng Tan. Discriminative deep metric learning for face and kinship verification. IEEE Transactions on Image Processing (TIP), 26(9):4269–4282, 2017.