Large-scale Distance Metric Learning with Uncertainty
Abstract
Distance metric learning (DML) has been studied extensively in the past decades for its superior performance with distance-based algorithms. Most existing methods propose to learn a distance metric with pairwise or triplet constraints. However, the number of constraints is quadratic or even cubic in the number of original examples, which makes it challenging for DML to handle large-scale data sets. Besides, real-world data may contain various kinds of uncertainty, especially image data. The uncertainty can mislead the learning procedure and cause performance degradation. By investigating image data, we find that the original data can be observed from a small set of clean latent examples with different distortions. In this work, we propose the margin preserving metric learning framework to learn the distance metric and latent examples simultaneously. By leveraging the ideal properties of latent examples, the training efficiency can be improved significantly while the learned metric also becomes robust to the uncertainty in the original data. Furthermore, we show that although the metric is learned from latent examples only, it preserves the large margin property even for the original data. The empirical study on benchmark image data sets demonstrates the efficacy and efficiency of the proposed method.
1 Introduction
Distance metric learning (DML) aims to learn a distance metric where examples from the same class are well separated from examples of different classes. It is an essential task for distance-based algorithms, such as $k$-means clustering [18], $k$-nearest neighbor classification [17] and information retrieval [2]. Given a distance metric $M$, the squared Mahalanobis distance between examples $\mathbf{x}_i$ and $\mathbf{x}_j$ can be computed as
$$\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^\top M (\mathbf{x}_i - \mathbf{x}_j).$$
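As a concrete sketch (our own illustration, not code from the paper), the squared Mahalanobis distance can be computed as:

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    d = x_i - x_j
    return float(d @ M @ d)

# With M = I, the distance reduces to the squared Euclidean distance.
x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(mahalanobis_sq(x, y, np.eye(2)))  # 5.0
```

Learning the matrix $M$, constrained to be PSD so that the quantity is a valid squared distance, is exactly what DML optimizes.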
Most existing DML methods propose to learn the metric by minimizing the number of violations in a set of pairwise or triplet constraints. Given a set of pairwise constraints, DML tries to learn a metric such that the distances between examples from the same class are sufficiently small (e.g., smaller than a predefined threshold) while those between different ones are sufficiently large [3, 18]. Different from pairwise constraints, each triplet constraint consists of three examples $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ have the same label and $\mathbf{x}_k$ is from a different class. An ideal metric pushes $\mathbf{x}_k$ away from $\mathbf{x}_i$ and $\mathbf{x}_j$ by a large margin [17]. Learning with triplet constraints optimizes the local positions of examples and is more flexible for real-world applications, where defining the appropriate thresholds for pairwise constraints is hard. In this work, we focus on DML with triplet constraints.
Optimizing the metric with a set of triplet constraints is challenging since the number of triplet constraints can be up to $\mathcal{O}(n^3)$, where $n$ is the number of original training examples. This makes DML computationally intractable for large-scale problems. Many strategies have been developed to deal with this challenge, and most of them fall into two categories: learning by stochastic gradient descent (SGD) and learning with an active set. With the SGD strategy, DML methods can sample just one constraint or a mini-batch of constraints at each iteration to obtain an unbiased estimate of the full gradient, and thus avoid computing the gradient over the whole set [2, 10]. Other methods learn the metric with a set of active constraints (i.e., those violated by the current metric), whose size can be significantly smaller than that of the original set [17]. This is a conventional strategy applied by cutting plane methods [1]. Both of these strategies can alleviate the large-scale challenge but have inherent drawbacks. Approaches based on SGD have to search through the whole set of triplet constraints, which results in slow convergence, especially when the number of active constraints is small. On the other hand, methods relying on the active set have to identify the set at each iteration. Unfortunately, this operation requires computing pairwise distances with the current metric, whose cost is $\mathcal{O}(n^2)$ and is too expensive for large-scale problems.
Besides the challenge from the size of the data set, the uncertainty in the data is also an issue, especially for image data, where the uncertainty can come from differences between individual examples and from distortions, e.g., pose, illumination and noise. Directly learning with the original data leads to poor generalization performance since the metric tends to overfit the uncertainty in the data. By further investigating image data, we find that most original images can be observed from a much smaller set of clean latent examples with different distortions. The phenomenon is illustrated in Fig. 5. This observation inspires us to learn the metric with latent examples in lieu of the original data. The challenge is that latent examples are unknown and only images with uncertainty are available.
In this work, we propose a framework to learn the distance metric and latent examples simultaneously. It fully exploits the properties of latent examples to address the aforementioned challenges. First, due to the small number of latent examples, the strategy of identifying the active set becomes affordable when learning the metric. We adopt it to accelerate the learning procedure by avoiding effort on inactive constraints. Additionally, compared with the original data, the uncertainty in latent examples decreases significantly. Consequently, the metric directly learned from latent examples can focus on the nature of the data rather than the uncertainty in it. To further improve the robustness, we adopt the large margin property that latent examples from different classes should be pushed away with a data-dependent margin. Fig. 1 illustrates that an appropriate margin for latent examples can also preserve the large margin for the original data. We conduct an empirical study on benchmark image data sets, including the challenging ImageNet data set, to demonstrate the efficacy and efficiency of the proposed method.
The rest of the paper is organized as follows: Section 2 summarizes the related work of DML. Section 3 describes the details of the proposed method and Section 4 summarizes the theoretical analysis. Section 5 compares the proposed method to the conventional DML methods on the benchmark image data sets. Finally, Section 6 concludes this work with future directions.
2 Related Work
Many DML methods have been proposed in the past decades [3, 17, 18] and comprehensive surveys can be found in [7, 19]. The representative methods include Xing’s method [18], ITML [3] and LMNN [17]. ITML learns a metric according to pairwise constraints, where the distances between pairs from the same class should be smaller than a predefined threshold and the distances between pairs from different classes should be larger than another predefined threshold. LMNN is developed with triplet constraints and a metric is learned to make sure that pairs from the same class are separated from the examples of different classes with a large margin. Compared with pairwise constraints, triplet constraints are more flexible to depict the local geometry.
To handle the large number of constraints, some methods adopt SGD or online learning to sample one constraint or a mini-batch of constraints at each iteration [2, 10]. OASIS [2] randomly samples one triplet constraint at each iteration and computes the unbiased gradient accordingly. When the size of the active set is small, these methods require an extremely large number of iterations to improve the model. Other methods try to exploit the concept of the active set. LMNN [17] proposes to learn the metric effectively by collecting, at each iteration, an active set that consists of constraints violated by the current metric within the nearest neighbors of each example. However, it requires computing $\mathcal{O}(n^2)$ pairwise distances to obtain the appropriate active set.
Besides the research on conventional DML, deep metric learning has attracted much attention recently [9, 13, 15, 16]. These studies also indicate that sampling active triplets is essential for accelerating convergence. FaceNet [15] keeps a large mini-batch and searches for hard constraints within it. LiftedStruct [16] generates the mini-batch with randomly selected positive examples and the corresponding hard negative examples. ProxyNCA [9] adopts proxy examples to reduce the number of triplet constraints. Once an anchor example is given, the similar and dissimilar examples are searched within the set of proxies. In this work we propose to learn the metric only with latent examples, which dramatically reduces the computational cost of obtaining the active set. Besides, the triangle inequality does not hold for the squared distance, which makes our analysis significantly different from the existing work.
3 Margin Preserving Metric Learning
Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ is an example and $y_i$ is the corresponding label, DML aims to learn a good distance metric $M$ such that
$$\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_k) \geq \mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j) + 1$$
where $\mathbf{x}_i$ and $\mathbf{x}_j$ are from the same class and $\mathbf{x}_k$ is different. Given the distance metric $M$, the squared distance is defined as
$$\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^\top M (\mathbf{x}_i - \mathbf{x}_j), \quad M \in \mathcal{S}_+^d$$
where $\mathcal{S}_+^d$ denotes the set of positive semidefinite (PSD) matrices.
For the large-scale image data set, we assume that each observed example is generated from a latent example with a certain zero-mean distortion, i.e.,
$$\mathbf{x}_i = \mathbf{z}_{o(i)} + \boldsymbol{\epsilon}_i, \quad \mathbb{E}[\boldsymbol{\epsilon}_i] = \mathbf{0}$$
where $o(\cdot)$ projects the original data to its corresponding latent example.
Then, we consider the expected distance [20] between observed data, and the objective is to learn a metric such that
$$\mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_k)] \geq \mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j)] + 1 \quad (1)$$
Let $\mathbf{z}_i$, $\mathbf{z}_j$ and $\mathbf{z}_k$ denote the latent examples of $\mathbf{x}_i$, $\mathbf{x}_j$ and $\mathbf{x}_k$ respectively. For the distance between examples from the same class, we have
$$\mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j)] = \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) + \mathbb{E}[\boldsymbol{\epsilon}_i^\top M \boldsymbol{\epsilon}_i] + \mathbb{E}[\boldsymbol{\epsilon}_j^\top M \boldsymbol{\epsilon}_j] = \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) + 2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}] \quad (2)$$
The last equation is due to the fact that $\boldsymbol{\epsilon}_i$ and $\boldsymbol{\epsilon}_j$ are i.i.d., since they are from the same class.
By applying the same analysis to the dissimilar pair, we have
$$\mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_k)] = \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k) + \mathbb{E}[\boldsymbol{\epsilon}_i^\top M \boldsymbol{\epsilon}_i] + \mathbb{E}[\boldsymbol{\epsilon}_k^\top M \boldsymbol{\epsilon}_k] \geq \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k) \quad (3)$$
The inequality holds because $M$ is a PSD matrix.
Combining Eqns. 2 and 3, we find that the difference between the distances in the original triplet can be lower bounded by that in the triplet consisting of latent examples:
$$\mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_k)] - \mathbb{E}[\mathcal{D}_M(\mathbf{x}_i, \mathbf{x}_j)] \geq \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k) - \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) - 2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}]$$
Therefore, the metric can be learned with the constraints defined on latent examples such that
$$\mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k) \geq \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) + 1 + 2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}]$$
Once such a metric is obtained, the margin for the expected distances between original data (i.e., as in Eqn. 1) is also guaranteed. Compared with the original constraints, the margin between latent examples is enlarged by the term $2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}]$. This term reflects the expected distance between the original data and the corresponding latent examples. It means that the tighter a local cluster is, the less the margin needs to be enlarged. Furthermore, each class takes a different margin, which depends on the distribution of the original data and makes it more flexible than a global margin.
With the set of triplets $\mathcal{T}$, the optimization problem can be written as
$$\min_{M \in \mathcal{S}_+^d,\ \{\mathbf{z}_c\}_{c=1}^m} \ \sum_{(\mathbf{z}_i, \mathbf{z}_j, \mathbf{z}_k) \in \mathcal{T}} \ell\big(1 + 2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}] + \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) - \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k)\big) \quad \text{s.t.}\ \|M\|_F \leq \delta$$
where $m$ is the number of latent examples. We add a constraint on the Frobenius norm of the learned metric to prevent it from overfitting. $\ell(\cdot)$ is the loss function and the hinge loss $\ell(t) = \max(0, t)$ is applied in this work.
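As a small sketch of the loss on a single latent triplet (our own illustration; `margin_extra` stands in for the data-dependent term $2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}]$ of the class at hand):

```python
import numpy as np

def dist_sq(a, b, M):
    d = a - b
    return float(d @ M @ d)

def latent_triplet_hinge(z_i, z_j, z_k, M, margin_extra):
    """Hinge loss for one latent triplet: the dissimilar pair (z_i, z_k)
    must be farther than the similar pair (z_i, z_j) by 1 plus the
    data-dependent term (standing in for 2*E[eps^T M eps])."""
    return max(0.0, 1.0 + margin_extra
               + dist_sq(z_i, z_j, M) - dist_sq(z_i, z_k, M))
```

A larger `margin_extra` (a looser cluster) demands a larger gap between the dissimilar and similar pairs before the loss vanishes.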
This problem is hard to solve since both the metric and the latent examples are variables to be optimized. Therefore, we propose to solve it in an alternating manner; the detailed steps are described below.
3.1 Update $\mathbf{z}$ with Upper Bound
When fixing $M_t$, the subproblem at the $t$-th iteration becomes
$$\min_{\{\mathbf{z}_c\}} F(\mathbf{z}) = \sum_{(\mathbf{z}_i, \mathbf{z}_j, \mathbf{z}_k) \in \mathcal{T}} \ell\big(1 + 2\mathbb{E}[\boldsymbol{\epsilon}^\top M_t \boldsymbol{\epsilon}] + \mathcal{D}_{M_t}(\mathbf{z}_i, \mathbf{z}_j) - \mathcal{D}_{M_t}(\mathbf{z}_i, \mathbf{z}_k)\big) \quad (4)$$
The variable $\mathbf{z}$ appears both in the margin term $2\mathbb{E}[\boldsymbol{\epsilon}^\top M_t \boldsymbol{\epsilon}]$, through $\boldsymbol{\epsilon}_i = \mathbf{x}_i - \mathbf{z}_{o(i)}$, and in the triplet difference $\mathcal{D}_{M_t}(\mathbf{z}_i, \mathbf{z}_j) - \mathcal{D}_{M_t}(\mathbf{z}_i, \mathbf{z}_k)$, which makes it hard to optimize directly. Our strategy is to find an appropriate upper bound for the original problem and solve the simpler problem instead.
Theorem 1.
The objective of Eqn. 4 can be upper bounded by a series of class-wise functions. For the $c$-th class, the bound is a quadratic function of the latent examples, whose coefficients are constants determined by the solution from the last iteration.
The detailed proof can be found in Section 4.
After removing the constant terms and rearranging the coefficients, optimizing the upper bound is equivalent to optimizing the following problem
$$\min_{\{\mathbf{z}_c\},\ o(\cdot)} \sum_i \mathcal{D}_{M_t}(\mathbf{x}_i, \mathbf{z}_{o(i)}) \quad (5)$$
where $o(\cdot)$ denotes the membership that assigns a latent example to each original example.
So far, we have shown that the original objective can be upper bounded by the class-wise surrogates. Minimizing the upper bound is similar to $k$-means but with the distance defined by the metric $M_t$, so we can solve it with the standard EM-style algorithm.
When fixing $o(\cdot)$, the latent examples can be updated by the closed-form solution
$$\mathbf{z}_c = \frac{1}{|\{i : o(i) = c\}|} \sum_{i : o(i) = c} \mathbf{x}_i \quad (6)$$
When fixing $\{\mathbf{z}_c\}$, $o(\cdot)$ simply assigns each original example to its nearest latent example with the distance defined by the metric
$$o(i) = \arg\min_c \mathcal{D}_{M_t}(\mathbf{x}_i, \mathbf{z}_c) \quad (7)$$
Alg. 1 summarizes the method for solving this subproblem.
3.2 Update $M$ with Upper Bound
When fixing $\mathbf{z}$ at the $t$-th iteration, the subproblem becomes
$$\min_{M \in \mathcal{S}_+^d : \|M\|_F \leq \delta} \ \sum_{(\mathbf{z}_i, \mathbf{z}_j, \mathbf{z}_k) \in \mathcal{T}} \ell\big(1 + 2\mathbb{E}[\boldsymbol{\epsilon}^\top M \boldsymbol{\epsilon}] + \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_j) - \mathcal{D}_M(\mathbf{z}_i, \mathbf{z}_k)\big) \quad (8)$$
where $M$ also appears in multiple terms. Following a similar procedure, an upper bound can be found to simplify the optimization.
Theorem 2.
The objective of Eqn. 8 can be upper bounded by a function of the same form in which the data-dependent margin term is replaced by a constant determined by the solution from the last iteration.
Minimizing this upper bound is a standard DML problem. Since the number of latent examples is small, many existing DML methods can handle it well. In this work we solve the problem by SGD, but sample one epoch of active constraints at each stage. The active constraints are the triplets that incur a hinge loss with the distance defined by the current solution. This strategy enjoys the efficiency of SGD and the efficacy of learning with the active set. To further improve the efficiency, the one-projection paradigm is adopted to avoid the expensive PSD projection, which costs $\mathcal{O}(d^3)$: the PSD projection is performed only once, at the end of the learning algorithm, and this has been shown to be effective in many applications [2, 11]. Finally, since the problem is strongly convex, we apply the suffix averaging strategy, which averages the solutions over the last several iterations, to obtain the optimal convergence rate [12]. The complete approach for obtaining the metric is shown in Alg. 2.
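The metric update can be sketched as follows (our own simplified stand-in for Alg. 2; the per-class margin array `margins` and the uniform triplet sampling are assumptions, not the paper's exact procedure):

```python
import numpy as np

def update_metric(Z, z_labels, M0, margins, delta, lr=0.01, n_steps=200, seed=0):
    """SGD sketch: sample triplets among the latent examples, step only on
    active ones (those incurring hinge loss), keep ||M||_F <= delta,
    average the last half of the iterates (suffix averaging), and apply
    a single PSD projection at the end (one-projection paradigm)."""
    rng = np.random.default_rng(seed)
    M = M0.astype(float).copy()
    idx = np.arange(len(Z))
    history = []
    for _ in range(n_steps):
        i = rng.integers(len(Z))
        pos = idx[(z_labels == z_labels[i]) & (idx != i)]
        neg = idx[z_labels != z_labels[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        j, k = rng.choice(pos), rng.choice(neg)
        d_ij, d_ik = Z[i] - Z[j], Z[i] - Z[k]
        # Active iff the triplet incurs hinge loss under the current M.
        if 1.0 + margins[z_labels[i]] + d_ij @ M @ d_ij - d_ik @ M @ d_ik > 0:
            M -= lr * (np.outer(d_ij, d_ij) - np.outer(d_ik, d_ik))
            norm = np.linalg.norm(M)  # Frobenius norm constraint
            if norm > delta:
                M *= delta / norm
        history.append(M.copy())
    if history:
        M = np.mean(history[len(history) // 2:], axis=0)  # suffix averaging
    # One PSD projection at the end: clip negative eigenvalues.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T
```

Deferring the eigendecomposition to the very end is what makes the per-step cost independent of the $\mathcal{O}(d^3)$ projection.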
Alg. 3 summarizes the proposed margin preserving metric learning (MaPML) framework. Different from the standard alternating method, we only optimize an upper bound for each subproblem. However, the method converges, as shown in the following theorem.
Theorem 3.
Let $M_t$, $\mathbf{z}_t$ and $M_{t+1}$, $\mathbf{z}_{t+1}$ denote the results obtained by applying the algorithm in Alg. 3 at the $t$-th and $(t+1)$-th iterations respectively. Then the objective value at the $(t+1)$-th iteration is no larger than that at the $t$-th iteration, which means the proposed method converges.
Computational Complexity
The proposed method consists of two parts: obtaining latent examples and learning the metric. For the former, the cost is linear in the numbers of latent and original examples, i.e., $\mathcal{O}(mn)$ per assignment step, where $m$ and $n$ denote the numbers of latent and original examples respectively. For the latter, the cost of sampling an active set dominates the learning procedure; since the number of iterations is fixed, the sampling complexity depends only on the number of latent examples, i.e., $\mathcal{O}(m^2)$ distance evaluations. Therefore, the whole algorithm is linear in the number of original examples. Note that the efficiency can be further improved with distributed computing since many components of MaPML can be implemented in parallel. For example, when updating $\mathbf{z}$, each class is independent and all subproblems can be solved simultaneously.
4 Theoretical Analysis
4.1 Proof of Theorem 1
Proof.
First, for the distance of the dissimilar pair in the objective of Eqn. 4, we have
where the latent examples are those obtained at the last iteration; we drop the iteration index in this proof for simplicity. The inequality follows from the fact that $M$ is a PSD matrix and can be decomposed as $M = L^\top L$, together with the Cauchy–Schwarz inequality. With the assumptions that the margin is sufficiently large and the data are bounded by a constant, the inequality can be simplified as
(9)  
The assumption is easy to verify since
Note that each latent example lies in the convex hull of the original data, and the constant can be set accordingly.
With a similar procedure, we have the bound for the distance of the similar pair as
(10)  
Substituting Eqns. 9 and 10 back into the original function and using the property of the hinge loss, the original function can be upper bounded by
where the additional term is a constant. By investigating the structure of this problem, we find that each class is independent in the optimization, and the subproblem for the $c$-th class can be written as
where $m_c$ is the number of latent examples for the $c$-th class and the remaining coefficient is a constant given by
Next we try to upper bound the hinge loss with a linear function on an interval that is guaranteed to contain the hinge loss incurred by the optimal solution.
Consider the expected distance between the original data of the $c$-th class and the corresponding latent examples from the last iteration, and choose a sufficiently large constant. Then, for each active hinge loss (i.e., one taking a positive value), if
(11) 
we have
Fig. 2 illustrates the linear function that bounds the hinge loss, and the proof of this step is straightforward. We will show later that the condition in Eqn. 11 is satisfied throughout the algorithm.
With the upper bound of the hinge loss, the subproblem objective can be bounded by
where
and
is an indicator function as
Finally, we check the condition in Eqn. 11. Consider the latent examples obtained by optimizing the upper bound with Alg. 1. Since we use the solution from the last iteration as the starting point, it is obvious that
At the same time, we have
Combining these inequalities, it is observed that Eqn. 11 is satisfied.
∎
4.2 Proof of Theorem 2
Proof.
For the margin term in Eqn. 8, we have
where we assume that the relevant quantity is sufficiently large, and the bounding constant can be set in the same manner as in the proof of Theorem 1.
Therefore, the original function can be upper bounded by
where the constant collects the margin-related terms. ∎
4.3 Proof of Theorem 3
Proof.
When fixing the metric at the $t$-th iteration, updating the latent examples does not increase the objective. Similarly, when fixing the latent examples, updating the metric does not increase it either. Therefore, the objective is non-increasing after each iteration. Since the value of the objective is bounded, the sequence will converge. ∎
5 Experiments
We conduct the empirical study on four benchmark image data sets. A $k$-nearest neighbor classifier is applied to verify the efficacy of the metrics learned by different methods. The methods in the comparison are summarized as follows.

- Euclid: $k$-NN with the Euclidean distance.
- LMNN [17]: the state-of-the-art DML method that identifies a set of active triplets with the current metric at each iteration. The active triplets are searched within the nearest neighbors of each example.
- OASIS [2]: an online DML method that receives one random triplet at each iteration. It only updates the metric when the triplet constraint is active.
- HRSGD [10]: one of the most efficient DML methods with SGD. We adopt the version that randomly samples a mini-batch of triplets at each iteration. After sampling, a Bernoulli random variable is generated to decide whether to update the current metric. With the PSD projection, it guarantees that the learned metric is in the PSD cone at each iteration.
- MaPML: the proposed method that learns the metric and latent examples simultaneously, where a subscript denotes the ratio between the number of latent examples and the number of original ones.
Different from the other methods, $k$-NN for MaPML is implemented with latent examples as the reference points. The variant that applies $k$-NN with the original data is referred to as MaPMLO.
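The evaluation protocol with latent reference points can be sketched as follows (our own illustration, with hypothetical names):

```python
import numpy as np

def knn_predict(x, Z, z_labels, M, k=3):
    """Predict the label of x by k-NN over the small set of latent
    examples Z, with distances measured under the learned metric M."""
    diffs = Z - x
    dists = np.einsum('md,de,me->m', diffs, M, diffs)
    nearest = z_labels[np.argsort(dists)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]  # majority vote
```

Because only the latent examples are scanned instead of the full training set, prediction time drops roughly in proportion to the ratio of reference points.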
The regularization parameters of OASIS, HRSGD and MaPML are searched over a grid. The size of the mini-batch in HRSGD is set as suggested in [10]. To train the models sufficiently, the number of iterations for LMNN and the numbers of randomly sampled triplets for OASIS and HRSGD are set large enough, and the number of iterations for MaPML and the maximal number of iterations for solving the metric subproblem are chosen so that roughly the same number of triplets is processed as for OASIS and HRSGD. All experiments are implemented on a server with 2 Intel Xeon E5-2630 CPUs. Average results with standard deviations over multiple trials are reported.
5.1 MNIST
First, we evaluate the performance of different algorithms on MNIST [8]. It consists of 60,000 handwritten digit images for training and 10,000 images for test. There are 10 classes in the data set, corresponding to the digits 0-9. Each example is a $28 \times 28$ grayscale image, which leads to 784-dimensional features, and the features are normalized to the range of $[0, 1]$.
Fig. 3 (a) compares the performance of different metrics on the test set. For MaPML, we vary the ratio of latent examples over a range of values. First of all, it is obvious that the metrics learned with the active set outperform those learned from random triplets. This confirms that randomly sampling triplets cannot explore the data set sufficiently due to the extremely large number of triplets. Secondly, the performance of MaPMLO is comparable with LMNN, which shows that the proposed method can learn a good metric with only a small number of latent examples. Finally, both MaPML and MaPMLO work well with the metric obtained by MaPML, which verifies that the learned metric preserves the large margin property for both the original and latent data. Note that when the number of latent examples is small, the performance of $k$-NN with latent examples is slightly worse than that with the whole training set. However, $k$-NN with latent examples can be more robust in real-world applications.
To demonstrate the robustness, we conduct another experiment that adds zero-mean Gaussian noise to each pixel of the original training images. The standard deviation of the Gaussian noise is varied over a range of values while the ratio of latent examples is fixed. Fig. 3 (b) summarizes the results. It shows that MaPML has comparable performance to MaPMLO and LMNN when the noise level is low. However, as the noise increases, the performance of LMNN drops dramatically. This can be explained by the fact that the metric learned from the original data is misled by the noisy information. In contrast, the errors made by MaPML and MaPMLO increase mildly, which demonstrates that the learned metric is more robust than the one learned from the original data. MaPML performs best among all methods, which is due to the fact that the uncertainty in latent examples is much less than that in the original ones. It implies that $k$-NN with latent examples is more appropriate for real-world applications with large uncertainty.
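The noise-injection setup can be reproduced in a few lines (our own sketch; clipping back to the normalized intensity range is an added assumption):

```python
import numpy as np

def add_pixel_noise(images, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma to
    every pixel, then clip to the normalized intensity range [0, 1]."""
    rng = np.random.default_rng(seed)
    return np.clip(images + rng.normal(0.0, sigma, size=images.shape), 0.0, 1.0)
```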
Then, we compare the CPU time of different algorithms to evaluate the efficiency. The results can be found in Fig. 4 (a). First, as expected, all algorithms with SGD are more efficient than LMNN, which has to compute the full gradient from the re-identified active set at each iteration. Moreover, the running time of MaPML is comparable to that of HRSGD, which shows the efficiency of MaPML with the small set of latent examples. Note that OASIS has an extremely low cost, since it allows the intermediate metric to be outside the PSD cone. Fig. 4 (b) illustrates the convergence curve of MaPML and shows that the proposed method converges quickly in practice.
Finally, since we apply the proposed method to the raw pixel features directly, the learned latent examples can be recovered as images. Fig. 5 illustrates the learned latent examples and the corresponding examples in the original training set. It is obvious that the original examples are generated from latent examples with different distortions, as claimed.
5.2 CIFAR10 and CIFAR100
CIFAR10 contains 10 classes with 50,000 color images of size $32 \times 32$ for training and 10,000 images for test. CIFAR100 has the same numbers of training and test images but for 100 classes [6]. Since deep learning algorithms show overwhelming performance on these data sets, we adopt ResNet18 [4] in Caffe [5], which is pre-trained on the ImageNet ILSVRC 2012 data set [14], as the feature extractor, and each image is represented by a 512-dimensional feature vector.
Methods  CIFAR10  CIFAR100 

Euclid  
OASIS  
HRSGD  
LMNN  
MaPMLO  
MaPML 
Table 1 summarizes the error rates of the methods in the comparison. First, we have the same observation as on MNIST: the performance of methods adopting active triplets is much better than that of methods with randomly sampled triplets. Different from MNIST, MaPML outperforms LMNN on both data sets. This is because the images in these data sets depict natural objects, which contain much more uncertainty than the digits in MNIST. Finally, the performance of MaPMLO is superior to OASIS and HRSGD, which shows that the learned metric works well with the original data represented by deep features. It confirms that the large margin property is preserved even for the original data.
5.3 ImageNet
Finally, we demonstrate that the proposed method can handle a large-scale data set with ImageNet. ImageNet ILSVRC 2012 consists of about 1.28 million training images and 50,000 validation images. The same feature extraction procedure as above is applied to each image. Given the large number of training examples, we increase the number of triplets for OASIS and HRSGD accordingly. Correspondingly, the maximal number of iterations for solving the subproblem in MaPML is also raised.
Methods  Test error () 

Euclid  
OASIS  
HRSGD  
MaPMLO  
MaPML 
LMNN did not finish training within 24 hours, so its result is not reported. In contrast, MaPML obtains the metric within about one hour. The performance of the available methods can be found in Table 2. Since ResNet18 is trained on ImageNet, the extracted features are optimized for this data set and it is hard to further improve the performance. However, with latent examples, MaPML can further reduce the error rate. It indicates that latent examples with low uncertainty are more appropriate reference points for the large-scale data set. Note that the small number of reference points also accelerates the test phase. For example, it costs 0.15s to predict the label of an image with the original set, while the cost is only 0.007s when evaluating with latent examples. This makes MaPML with latent examples a potential method for real-time applications.
6 Conclusion
In this work, we propose a framework to learn the distance metric and latent examples simultaneously. By learning from a small set of clean latent examples, MaPML can sample the active triplets efficiently and the learning procedure is robust to the uncertainty in real-world data. Moreover, MaPML can preserve the large margin property for the original data when learning merely with latent examples. The empirical study confirms the efficacy and efficiency of MaPML. In the future, we plan to evaluate MaPML on different tasks (e.g., information retrieval) and different types of data. Besides, incorporating the proposed strategy into deep metric learning is also an attractive direction. It could accelerate the learning of deep embeddings, and the resulting latent examples may further improve the performance.
References
 [1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
 [2] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.
 [3] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
 [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
 [6] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 [7] B. Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
 [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [9] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh. No fuss distance metric learning using proxies. In ICCV, pages 360–368, 2017.
 [10] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu. Efficient distance metric learning by adaptive sampling and minibatch stochastic gradient descent (SGD). ML, 99(3):353–372, 2015.
 [11] Q. Qian, R. Jin, S. Zhu, and Y. Lin. Finegrained visual categorization via multistage metric learning. In CVPR, pages 3716–3724, 2015.
 [12] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
 [13] O. Rippel, M. Paluri, P. Dollár, and L. D. Bourdev. Metric learning with adaptive density discrimination. In ICLR, 2016.
 [14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [15] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
 [16] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
 [17] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
 [18] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2002.
 [19] L. Yang and R. Jin. Distance metric learning: a comprehensive survey. 2006.
 [20] H. Ye, D. Zhan, X. Si, and Y. Jiang. Learning mahalanobis distance metric: Considering instance disturbance helps. In IJCAI, pages 3315–3321, 2017.