Large-scale Distance Metric Learning with Uncertainty
Distance metric learning (DML) has been studied extensively over the past decades because of its superior performance with distance-based algorithms. Most existing methods learn a distance metric from pairwise or triplet constraints. However, the number of constraints is quadratic or even cubic in the number of original examples, which makes it challenging for DML to handle large-scale data sets. Besides, real-world data may contain various kinds of uncertainty, especially image data. This uncertainty can mislead the learning procedure and degrade performance. By investigating image data, we find that the original data can be viewed as observations of a small set of clean latent examples under different distortions. In this work, we propose the margin preserving metric learning framework to learn the distance metric and latent examples simultaneously. By leveraging the ideal properties of latent examples, the training efficiency can be improved significantly while the learned metric also becomes robust to the uncertainty in the original data. Furthermore, we show that although the metric is learned from latent examples only, it preserves the large margin property even for the original data. The empirical study on benchmark image data sets demonstrates the efficacy and efficiency of the proposed method.
1 Introduction
Distance metric learning (DML) aims to learn a distance metric where examples from the same class are well separated from examples of different classes. It is an essential task for distance-based algorithms, such as k-means clustering, k-nearest neighbor classification and information retrieval. Given a distance metric $M$, the squared Mahalanobis distance between examples $x_i$ and $x_j$ can be computed as
$$\mathrm{dis}_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$$
Most existing DML methods learn the metric by minimizing the number of violations in a set of pairwise or triplet constraints. Given a set of pairwise constraints, DML tries to learn a metric such that the distances between examples from the same class are sufficiently small (e.g., smaller than a predefined threshold) while those between examples from different classes are large enough [3, 18]. Different from pairwise constraints, each triplet constraint consists of three examples $(x_i, x_j, x_k)$, where $x_i$ and $x_j$ have the same label and $x_k$ is from a different class. An ideal metric pushes $x_k$ away from $x_i$ and $x_j$ by a large margin. Learning with triplet constraints optimizes the local positions of examples and is more flexible for real-world applications, where defining appropriate thresholds for pairwise constraints is hard. In this work, we focus on DML with triplet constraints.
Optimizing the metric with a set of triplet constraints is challenging since the number of triplet constraints can be up to $O(n^3)$, where $n$ is the number of original training examples. This makes DML computationally intractable for large-scale problems. Many strategies have been developed to deal with this challenge, and most of them fall into two categories: learning by stochastic gradient descent (SGD) and learning with the active set. With SGD, DML methods can sample just one constraint or a mini-batch of constraints at each iteration to obtain an unbiased estimate of the full gradient and avoid computing the gradient from the whole set [2, 10]. Other methods learn the metric with a set of active constraints (i.e., those violated by the current metric), whose size can be significantly smaller than that of the original set. This is a conventional strategy applied by cutting plane methods. Both strategies can alleviate the large-scale challenge but have inherent drawbacks. Approaches based on SGD have to search through the whole set of triplet constraints, which results in slow convergence, especially when the number of active constraints is small. On the other hand, methods relying on the active set have to identify the set at each iteration. Unfortunately, this operation requires computing pairwise distances with the current metric, where the cost is $O(n^2)$ and is too expensive for large-scale problems.
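The squared distance and the triplet condition described above can be illustrated with a minimal NumPy sketch. The function names and the unit margin are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def squared_mahalanobis(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ M @ diff)

def satisfies_triplet(x_i, x_j, x_k, M, margin=1.0):
    """True if x_k is farther from x_i than x_j is, by at least `margin`.

    The unit margin is an assumption made for this illustration.
    """
    gap = squared_mahalanobis(x_i, x_k, M) - squared_mahalanobis(x_i, x_j, M)
    return gap >= margin

# x_i and x_j share a label; x_k comes from a different class.
x_i = np.array([0.0, 0.0])
x_j = np.array([1.0, 0.0])
x_k = np.array([0.0, 1.2])

M_euclid = np.eye(2)              # plain squared Euclidean distance
M_learned = np.diag([0.25, 2.0])  # hypothetical metric: shrink axis 0, stretch axis 1

ok_euclid = satisfies_triplet(x_i, x_j, x_k, M_euclid)    # gap = 1.44 - 1.00
ok_learned = satisfies_triplet(x_i, x_j, x_k, M_learned)  # gap = 2.88 - 0.25
```

Under the Euclidean metric the triplet is violated, while the reweighted metric satisfies it, which is the kind of change a learned metric is meant to produce.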
Besides the challenge from the size of the data set, the uncertainty in the data is also an issue, especially for image data, where the uncertainty can come from the differences between individual examples and from distortions, e.g., pose, illumination and noise. Directly learning with the original data leads to poor generalization performance since the metric tends to overfit the uncertainty in the data. By further investigating image data, we find that most original images can be viewed as observations of a much smaller set of clean latent examples under different distortions. The phenomenon is illustrated in Fig. 5. This observation inspires us to learn the metric with latent examples in lieu of the original data. The challenge is that latent examples are unknown and only images with uncertainty are available.
In this work, we propose a framework to learn the distance metric and latent examples simultaneously. It sufficiently explores the properties of latent examples to address the mentioned challenges. First, due to the small size of latent examples, the strategy of identifying the active set becomes affordable when learning the metric. We adopt it to accelerate the learning procedure via avoiding the attempts on inactive constraints. Additionally, compared with the original data, the uncertainty in latent examples decreases significantly. Consequently, the metric directly learned from latent examples can focus on the nature of the data rather than the uncertainty in the data. To further improve the robustness, we adopt the large margin property that latent examples from different classes should be pushed away with a data dependent margin. Fig. 1 illustrates that an appropriate margin for latent examples can also preserve the large margin for the original data. We conduct the empirical study on benchmark image data sets, including the challenging ImageNet data set, to demonstrate the efficacy and efficiency of the proposed method.
The rest of the paper is organized as follows: Section 2 summarizes the related work of DML. Section 3 describes the details of the proposed method and Section 4 summarizes the theoretical analysis. Section 5 compares the proposed method to the conventional DML methods on the benchmark image data sets. Finally, Section 6 concludes this work with future directions.
2 Related Work
Many DML methods have been proposed in the past decades [3, 17, 18] and comprehensive surveys can be found in [7, 19]. The representative methods include Xing’s method [18], ITML [3] and LMNN [17]. ITML learns a metric according to pairwise constraints, where the distances between pairs from the same class should be smaller than a predefined threshold and the distances between pairs from different classes should be larger than another predefined threshold. LMNN is developed with triplet constraints, and a metric is learned to make sure that pairs from the same class are separated from the examples of different classes by a large margin. Compared with pairwise constraints, triplet constraints are more flexible in depicting the local geometry.
To handle the large number of constraints, some methods adopt SGD or online learning to sample one constraint or a mini-batch of constraints at each iteration [2, 10]. OASIS [2] randomly samples one triplet constraint at each iteration and computes the unbiased gradient accordingly. When the size of the active set is small, these methods require an extremely large number of iterations to improve the model. Other methods try to exploit the concept of the active set. LMNN [17] proposes to learn the metric effectively at each iteration by collecting an active set that consists of constraints violated by the current metric within the k-nearest neighbors of each example. However, obtaining the appropriate active set requires computing pairwise distances with the current metric.
Besides the research on conventional DML, deep metric learning has attracted much attention recently [9, 13, 15, 16]. These studies also indicate that sampling active triplets is essential for accelerating the convergence. FaceNet [15] keeps a large mini-batch and searches for hard constraints within the mini-batch. LiftedStruct [16] generates the mini-batch with randomly selected positive examples and the corresponding hard negative examples. Proxy-NCA [9] adopts proxy examples to reduce the number of triplet constraints. Once an anchor example is given, the similar and dissimilar examples are searched within the set of proxies. In this work, we propose to learn the metric only with latent examples, which can dramatically reduce the computational cost of obtaining the active set. Besides, the triangle inequality does not hold for the squared distance, which makes our analysis significantly different from the existing work.
3 Margin Preserving Metric Learning
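Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is an example and $y_i$ is the corresponding label, DML aims to learn a good distance metric such that
$$\forall x_i, x_j, x_k:\quad \mathrm{dis}_M(x_i, x_k) - \mathrm{dis}_M(x_i, x_j) \ge 1$$
where $x_i$ and $x_j$ are from the same class and $x_k$ is from a different class. Given the distance metric $M \in \mathcal{S}_+^{d\times d}$, the squared distance is defined as
$$\mathrm{dis}_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$$
where $\mathcal{S}_+^{d\times d}$ denotes the set of $d \times d$ positive semi-definite (PSD) matrices.
For the large-scale image data set, we assume that each observed example is from a latent example with certain zero mean distortions, i.e.,
$$\forall i:\quad E[x_i] = z_{f(i)}$$
where $f(\cdot)$ projects the original data to its corresponding latent example.
Then, we consider the expected distance between observed data and the objective is to learn a metric such that
$$\forall x_i, x_j, x_k:\quad E[\mathrm{dis}_M(x_i, x_k)] - E[\mathrm{dis}_M(x_i, x_j)] \ge 1 \qquad (1)$$
Let $z_i$, $z_j$ and $z_k$ denote the latent examples of $x_i$, $x_j$ and $x_k$ respectively. For the distance between examples from the same class, we have
$$E[\mathrm{dis}_M(x_i, x_j)] = E\big[\big((x_i - z_i) - (x_j - z_j) + (z_i - z_j)\big)^\top M \big((x_i - z_i) - (x_j - z_j) + (z_i - z_j)\big)\big]$$
$$= E[\mathrm{dis}_M(x_i, z_i)] + E[\mathrm{dis}_M(x_j, z_j)] + \mathrm{dis}_M(z_i, z_j) = 2E[\mathrm{dis}_M(x_i, z_i)] + \mathrm{dis}_M(z_i, z_j)$$
The cross terms vanish because the distortions are zero mean and independent. The last equation is due to the fact that $x_i$ and $x_j$ are i.i.d., since they are from the same class.
By applying the same analysis to the dissimilar pair, we have
$$E[\mathrm{dis}_M(x_i, x_k)] = E[\mathrm{dis}_M(x_i, z_i)] + E[\mathrm{dis}_M(x_k, z_k)] + \mathrm{dis}_M(z_i, z_k) \ge E[\mathrm{dis}_M(x_i, z_i)] + \mathrm{dis}_M(z_i, z_k)$$
The inequality holds because $M$ is a PSD matrix, so $E[\mathrm{dis}_M(x_k, z_k)] \ge 0$.
Therefore, the metric can be learned with the constraints defined on latent examples such that
$$\forall z_i, z_j, z_k:\quad \mathrm{dis}_M(z_i, z_k) - \mathrm{dis}_M(z_i, z_j) \ge 1 + E[\mathrm{dis}_M(x_i, z_i)]$$
Once such a metric is obtained, the margin for the expected distances between original data (i.e., as in Eqn. 1) is also guaranteed. Compared with the original constraints, the margin between latent examples is increased by the term $E[\mathrm{dis}_M(x_i, z_i)]$. This term is the expected distance between the original data and its corresponding latent example. It means that the tighter a local cluster is, the less the margin needs to be increased. Furthermore, each class takes a different margin, which depends on the distribution of the original data and makes it more flexible than a global margin.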
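The relation between the expected distance of observed examples and the distance of their latent examples can be checked numerically. The sketch below is a Monte Carlo illustration under assumptions that are not in the paper: the distortions are isotropic Gaussian with scale `sigma`, and the PSD metric and latent examples are random. In that setting the expected squared distance between an observed example and its latent example equals `sigma**2 * trace(M)`, so the expected distance of a pair should equal the latent distance plus twice that spread.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
M = A @ A.T                      # a random PSD metric (assumption for the demo)

z_i = rng.normal(size=d)         # latent example behind x_i
z_k = rng.normal(size=d)         # latent example behind x_k
sigma = 0.3                      # assumed isotropic distortion scale

def dis(a, b, M):
    """Squared distance under M; works on single vectors or batches of rows."""
    diff = a - b
    return np.einsum("...i,ij,...j->...", diff, M, diff)

n_samples = 200_000
x_i = z_i + sigma * rng.normal(size=(n_samples, d))   # zero-mean distortions
x_k = z_k + sigma * rng.normal(size=(n_samples, d))

# Monte Carlo estimate of E[dis_M(x_i, x_k)].
lhs = dis(x_i, x_k, M).mean()

# Closed form for isotropic noise: E[dis_M(x, z)] = sigma^2 * trace(M).
expected_spread = sigma**2 * np.trace(M)
rhs = dis(z_i, z_k, M) + 2 * expected_spread

relative_gap = abs(lhs - rhs) / rhs
```

The small `relative_gap` reflects that the cross terms average out when the distortions are zero mean and independent.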
With the set of triplets , the optimization problem can be written as
where is the number of latent examples. We add a constraint for the Frobenius norm of the learned metric to prevent it from overfitting. is the loss function and the hinge loss is applied in this work.
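As a rough illustration of the loss used above, the following sketch evaluates a hinge loss on a triplet of latent examples with a data-dependent margin. It assumes the margin is one plus an empirical estimate of the expected distance between the original examples and their latent example; the helper names (`expected_spread`, `latent_triplet_hinge`) are hypothetical.

```python
import numpy as np

def dis(a, b, M):
    """Squared distance (a - b)^T M (a - b)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ M @ diff)

def expected_spread(X_members, z, M):
    """Empirical mean squared distance between a latent example and its members."""
    return float(np.mean([dis(x, z, M) for x in X_members]))

def latent_triplet_hinge(z_i, z_j, z_k, M, spread_i):
    """Hinge loss on (z_i, z_j, z_k) with margin 1 + spread_i (assumed form)."""
    margin = 1.0 + spread_i
    return max(0.0, margin - (dis(z_i, z_k, M) - dis(z_i, z_j, M)))

M = np.eye(2)
z_i = np.array([0.0, 0.0])
z_j = np.array([1.0, 0.0])                        # same class as z_i
members_i = np.array([[1.0, 0.0], [-1.0, 0.0]])   # original examples assigned to z_i

spread = expected_spread(members_i, z_i, M)       # mean of 1.0 and 1.0
loss_far = latent_triplet_hinge(z_i, z_j, np.array([2.0, 0.0]), M, spread)
loss_near = latent_triplet_hinge(z_i, z_j, np.array([1.5, 0.0]), M, spread)
```

A looser cluster (larger `spread`) raises the margin, so a dissimilar latent example that would be acceptable for a tight cluster can still incur a loss.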
This problem is hard to solve since both the metric and latent examples are the variables to be optimized. Therefore, we propose to solve it in an alternating way and the detailed steps are demonstrated below.
3.1 Update Latent Examples with Upper Bound
When the metric is fixed, the subproblem at the current iteration becomes
The latent examples appear in both the margin term and the triplet difference term, which makes the subproblem hard to optimize directly. Our strategy is to find an appropriate upper bound for the original problem and solve the simpler problem instead.
Theorem 1. The function can be upper bounded by the series of functions . For the -th class, we have
where , and are constants and .
The detailed proof can be found in Section 4.
After removing the constant terms and rearranging the coefficients, optimizing the upper bound is equivalent to optimizing the following problem
where denotes the membership that assigns a latent example for each original example.
So far, we have shown that the original objective admits an upper bound. Minimizing this upper bound is similar to k-means, but with the distance defined by the current metric, so we can solve it with the standard EM algorithm.
When fixing the membership, latent examples can be updated by the closed-form solution
When fixing the latent examples, the membership simply assigns each original example to its nearest latent example, with the distance defined by the current metric.
Alg. 1 summarizes the method for solving this subproblem.
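The alternating procedure can be sketched as a k-means-style loop under a fixed metric. This is a simplified illustration, not Alg. 1 itself: it assumes the closed-form update is the plain mean of the assigned examples (as in standard k-means) and ignores the per-class structure and the coefficients of the upper bound. All function names are hypothetical.

```python
import numpy as np

def assign_to_latent(X, Z, M):
    """Assign each row of X to its nearest row of Z under the metric M."""
    diff = X[:, None, :] - Z[None, :, :]                 # shape (n, m, d)
    dist = np.einsum("nmi,ij,nmj->nm", diff, M, diff)    # squared distances
    return dist.argmin(axis=1)

def update_latent(X, labels, Z):
    """Move each latent example to the mean of its assigned examples."""
    Z_new = Z.copy()
    for c in range(Z.shape[0]):
        members = X[labels == c]
        if len(members) > 0:      # keep the old position for empty clusters
            Z_new[c] = members.mean(axis=0)
    return Z_new

def fit_latent_examples(X, Z_init, M, n_iter=10):
    """Alternate between membership assignment and latent example update."""
    Z = Z_init.copy()
    for _ in range(n_iter):
        labels = assign_to_latent(X, Z, M)
        Z = update_latent(X, labels, Z)
    return Z, assign_to_latent(X, Z, M)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
Z_init = np.array([[1.0, 0.0], [9.0, 0.0]])
Z, labels = fit_latent_examples(X, Z_init, np.eye(2))
```

In the paper's setting this loop would be run separately for each class, with the metric from the previous iteration supplying the distance.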
3.2 Update the Metric with Upper Bound
When the latent examples are fixed, the subproblem at the current iteration becomes
where the metric also appears in multiple terms. With a similar procedure, an upper bound can be found to make the optimization simpler.
Theorem 2. The function can be upper bounded by the function which is
where is a constant and .
Minimizing this upper bound is a standard DML problem. Since the number of latent examples is small, many existing DML methods can handle the problem well. In this work, we solve the problem by SGD but sample one epoch of active constraints at each stage. The active constraints are the triplets of latent examples that incur hinge loss under the current metric. This strategy enjoys both the efficiency of SGD and the efficacy of learning with the active set. To further improve the efficiency, the one-projection paradigm is adopted to avoid the expensive PSD projection, which costs $O(d^3)$. It performs the PSD projection only once at the end of the learning algorithm and has been shown to be effective in many applications [2, 11]. Finally, since the problem is strongly convex, we apply the α-suffix averaging strategy, which averages the solutions over the last several iterations, to obtain the optimal convergence rate [12]. The complete approach for obtaining the metric is shown in Alg. 2.
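The ingredients described above (active-set sampling, SGD on the hinge loss, a single PSD projection at the end, and suffix averaging) can be combined in a small sketch. This is not Alg. 2: the step size, the Frobenius radius, the number of stages, the fixed margins and all function names are assumptions chosen for illustration.

```python
import numpy as np

def dis(a, b, M):
    diff = a - b
    return float(diff @ M @ diff)

def hinge(Z, triplet, margin, M):
    i, j, k = triplet
    return max(0.0, margin - (dis(Z[i], Z[k], M) - dis(Z[i], Z[j], M)))

def project_psd(M):
    """Project onto the PSD cone by clipping negative eigenvalues (O(d^3))."""
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def project_frobenius_ball(M, radius):
    norm = np.linalg.norm(M, "fro")
    return M if norm <= radius else M * (radius / norm)

def suffix_average(iterates, alpha=0.5):
    """Average the last ceil(alpha * T) iterates."""
    T = len(iterates)
    start = T - int(np.ceil(alpha * T))
    return np.mean(iterates[start:], axis=0)

def learn_metric(Z, triplets, margins, n_stages=5, lr=0.1, radius=10.0,
                 alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    M = np.eye(Z.shape[1])
    iterates = [M.copy()]
    for _ in range(n_stages):
        # Active set: triplets that incur hinge loss under the current metric.
        active = [t for t, trip in enumerate(triplets)
                  if hinge(Z, trip, margins[t], M) > 0]
        for t in rng.permutation(active):
            i, j, k = triplets[t]
            if hinge(Z, triplets[t], margins[t], M) > 0:
                a_ij, a_ik = Z[i] - Z[j], Z[i] - Z[k]
                grad = np.outer(a_ij, a_ij) - np.outer(a_ik, a_ik)
                # Only the cheap norm projection is applied inside the loop.
                M = project_frobenius_ball(M - lr * grad, radius)
            iterates.append(M.copy())
    # One PSD projection at the very end, after suffix averaging.
    return project_psd(suffix_average(iterates, alpha))

Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.2]])  # rows 0 and 1 share a class
triplets = [(0, 1, 2)]
margins = [1.0]
M_learned = learn_metric(Z, triplets, margins)
```

Deferring the eigendecomposition to the end keeps each SGD step at the cost of two outer products, which is the motivation for the one-projection paradigm.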
Alg. 3 summarizes the proposed margin preserving metric learning framework. Different from the standard alternating method, we only optimize the upper bound for each subproblem. However, the method converges as shown in the following theorem.
Theorem 3. Let , and , denote the results obtained by applying the algorithm in Alg. 3 at -th and -th iterations respectively. Then, we have
which means the proposed method can converge.
The proposed method consists of two parts: obtaining latent examples and metric learning. For the former one, the cost is linear in the number of latent examples and original examples as . For the latter one, the cost of sampling an active set dominates the learning procedure. Since the number of iterations is fixed, the complexity of sampling becomes . Therefore, the whole algorithm can be linear in the number of latent examples. Note that the efficiency can be further improved with distributed computing since many components of MaPML can be implemented in parallel. For example, when updating , each class is independent and all subproblems can be solved simultaneously.
4 Theoretical Analysis
4.1 Proof of Theorem 1
First, for the distance of the dissimilar pair in term of Eqn. 3.1, we have
where are latent examples from the last iteration. We let denote in this proof for simplicity. The inequality is from that is a PSD matrix and can be decomposed as . Then it is obtained by applying the Cauchy-Schwarz inequality. With the assumptions that is sufficiently large and is bounded by a constant , the inequality can be simplified as
The assumption is easy to verify since
Note that and is in the convex hull of the original data, and the constant can be set as .
With the similar procedure, we have the bound for the distance of the similar pair as
where is a constant. By investigating the structure of this problem, we find that each class is independent in the optimization problem and the subproblem for the -th class can be written as
where is the number of latent examples for the -th class and is a constant as
Next we try to upper bound the hinge loss in with a linear function in the interval of , where the hinge loss incurred by the optimal solution is guaranteed to be in it.
Let , which is the expected distance between the original data of the -th class and the corresponding latent examples from the last iteration, and be a constant sufficiently large as
Then, for each active hinge loss (i.e., ), if
With the upper bound of the hinge loss, can be bounded by
is an indicator function as
At the same time, we have
It is observed that Eqn. 11 is satisfied by combining these inequalities.
4.2 Proof of Theorem 2
For the term in Eqn. 8, we have
where we assume that is sufficiently large and is a constant which has and can be set as .
Therefore, the original function can be upper bounded by
where . ∎
4.3 Proof of Theorem 3
When fixing at the -th iteration, we have
When fixing , we have
Therefore, after each iteration, we have
Since the value of is bounded, the sequence will converge after a finite number of iterations. ∎
5 Experiments
We conduct the empirical study on four benchmark image data sets. A k-nearest neighbor classifier is applied to verify the efficacy of the metrics learned by different methods. The methods in the comparison are summarized as follows.
Euclid: k-NN with Euclidean distance.
LMNN [17]: the state-of-the-art DML method that identifies a set of active triplets with the current metric at each iteration. The active triplets are searched within the k-nearest neighbors of each example.
OASIS [2]: an online DML method that receives one random triplet at each iteration. It only updates the metric when the triplet constraint is active.
HR-SGD [10]: one of the most efficient DML methods with SGD. We adopt the version that randomly samples a mini-batch of triplets at each iteration in the comparison. After sampling, a Bernoulli random variable is generated to decide whether to update the current metric. With the PSD projection, it guarantees that the learned metric is in the PSD cone at each iteration.
MaPML: the proposed method that learns the metric and latent examples simultaneously. It is parameterized by the ratio between the number of latent examples and the number of original ones.
Different from other methods, k-NN is implemented with latent examples as reference points. The method that applies k-NN with the original data is referred to as MaPML-O.
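The evaluation protocol with latent examples as reference points can be sketched as a k-NN classifier under a learned metric. The function name and the toy data are illustrative assumptions; labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(X_query, X_ref, y_ref, M, k=1):
    """Predict labels by majority vote among the k nearest reference points under M."""
    diff = X_query[:, None, :] - X_ref[None, :, :]        # shape (q, r, d)
    dist = np.einsum("qri,ij,qrj->qr", diff, M, diff)     # squared distances
    nearest = np.argsort(dist, axis=1)[:, :k]             # indices of k nearest refs
    votes = y_ref[nearest]                                # shape (q, k)
    return np.array([np.bincount(row).argmax() for row in votes])

# Reference points play the role of latent examples (MaPML) or of the
# original training data (MaPML-O); here they are toy values.
latent_examples = np.array([[0.0, 0.0], [5.0, 5.0]])
latent_labels = np.array([0, 1])
queries = np.array([[0.2, -0.1], [4.0, 6.0]])

predictions = knn_predict(queries, latent_examples, latent_labels, np.eye(2), k=1)
```

Because the number of reference points is much smaller than the training set, the distance matrix computed at test time shrinks accordingly.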
The parameters of OASIS, HR-SGD and MaPML are searched in . The size of mini-batch in HR-SGD is set to be as suggested . To train the model sufficiently, the number of iterations for LMNN is set to be while the number of randomly sampled triplets is for OASIS and HR-SGD. The number of iterations for MaPML is set as while the number of maximal iterations for solving in the subproblem is set as , which roughly has the same number of triplets as OASIS and HR-SGD. All experiments are run on a server with GB memory and 2 Intel Xeon E5-2630 CPUs. Average results with standard deviation over trials are reported.
5.1 MNIST
First, we evaluate the performance of different algorithms on MNIST [8]. It consists of 60,000 handwritten digit images for training and 10,000 images for test. There are 10 classes in the data set, which correspond to the digits 0-9. Each example is a 28×28 grayscale image, which leads to 784-dimensional features, and the features are normalized.
Fig. 3 (a) compares the performance of different metrics on the test set. For MaPML, we vary the ratio of latent examples. First of all, it is obvious that the metrics learned with the active set outperform those learned from random triplets. This confirms that sampling triplets randomly cannot explore the data set sufficiently due to the extremely large number of triplets. Secondly, the performance of MaPML-O is comparable with LMNN, which shows that the proposed method can learn a good metric with only a small number of latent examples. Finally, both MaPML and MaPML-O work well with the metric obtained by MaPML, which verifies that the learned metric can preserve the large margin property for both the original and latent data. Note that when the number of latent examples is small, the performance of k-NN with latent examples is slightly worse than that with the whole training set. However, k-NN with latent examples can be more robust in real-world applications.
To demonstrate the robustness, we conduct another experiment that adds zero-mean Gaussian noise (i.e., $\mathcal{N}(0, \sigma^2)$) to each pixel of the original training images. The standard deviation of the Gaussian noise is varied in the range of and is fixed as . Fig. 3 (b) summarizes the results. It shows that MaPML performs comparably to MaPML-O and LMNN when the noise level is low. However, as the noise increases, the performance of LMNN drops dramatically. This can be explained by the fact that the metric learned with the original data has been misled by the noisy information. In contrast, the errors made by MaPML and MaPML-O increase mildly, which demonstrates that the learned metric is more robust than the one learned from the original data. MaPML performs best among all methods because the uncertainty in latent examples is much lower than that in the original ones. It implies that k-NN with latent examples is more appropriate for real-world applications with large uncertainty.
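The noise injection used in this robustness experiment can be sketched as follows. The image array, the noise level and the function name are placeholders chosen for illustration; the paper's actual noise range is not reproduced here, and no clipping is applied because the text does not mention any.

```python
import numpy as np

def add_gaussian_noise(images, std, rng):
    """Add independent zero-mean Gaussian noise with the given std to every pixel."""
    return images + rng.normal(loc=0.0, scale=std, size=images.shape)

rng = np.random.default_rng(0)
# Placeholder data: 200 flattened 28x28 images with pixel values in [0, 1].
images = rng.uniform(0.0, 1.0, size=(200, 784))

noise_std = 0.3                       # assumed noise level for this demo
noisy_images = add_gaussian_noise(images, noise_std, rng)

residual = noisy_images - images      # recovers the injected noise
```

Because the noise is zero mean, averaging many noisy copies of the same clean example recovers that example, which is consistent with the latent example assumption used by the method.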
Then, we compare the CPU time cost by different algorithms to evaluate the efficiency. The results can be found in Fig. 4 (a). First, as expected, all algorithms with SGD are more efficient than LMNN, which has to compute the full gradient from the redefined active set at each iteration. Moreover, the running time of MaPML is comparable to that of HR-SGD, which shows the efficiency of MaPML with the small set of latent examples. Note that OASIS has the extremely low cost, since it allows the internal metric to be out of the PSD cone. Fig. 4 (b) illustrates the convergence curve of MaPML and shows that the proposed method converges fast in practice.
Finally, since we apply the proposed method to the original pixel features directly, the learned latent examples can be recovered as images. Fig. 5 illustrates the learned latent examples and the corresponding examples in the original training set. It is obvious that the original examples are from latent examples with different distortions as claimed.
5.2 CIFAR-10 and CIFAR-100
CIFAR-10 contains 10 classes with 50,000 color images of size 32×32 for training and 10,000 images for test. CIFAR-100 has the same number of images in training and test but for 100 classes [6]. Since deep learning algorithms show overwhelming performance on these data sets, we adopt ResNet18 [4] in Caffe [5], which is pre-trained on the ImageNet ILSVRC 2012 data set [14], as the feature extractor, and each image is represented by a 512-dimensional feature vector.
Table 1 summarizes the error rates of the methods in the comparison. First, we have the same observation as on MNIST: the performance of methods adopting active triplets is much better than that of methods with randomly sampled triplets. Different from MNIST, MaPML outperforms LMNN on both data sets. This is because images in these data sets describe natural objects, which contain much more uncertainty than the digits in MNIST. Finally, the performance of MaPML-O is superior to that of OASIS and HR-SGD, which shows that the learned metric works well with the original data represented by deep features. It confirms that the large margin property is preserved even for the original data.
5.3 ImageNet
Finally, we demonstrate that the proposed method can handle the large-scale data set with ImageNet. ImageNet ILSVRC 2012 consists of 1,281,167 training images and 50,000 validation images. The same feature extraction procedure as above is applied to each image. Given the large number of training data, we increase the number of triplets for OASIS and HR-SGD to . Correspondingly, the number of maximal iterations for solving the subproblem in MaPML is also raised to .
Table 2: Methods and test error (%) on ImageNet.
LMNN does not finish the training after 24 hours so the result is not reported for it. In contrast, MaPML obtains the metric within about one hour. The performance of available methods can be found in Table 2. Since ResNet18 is trained on ImageNet, the extracted features are optimized for this data set and it is hard to further improve the performance. However, with latent examples, MaPML can further reduce the error rate by . It indicates that latent examples with low uncertainty are more appropriate for the large-scale data set as the reference points. Note that the small number of reference points will also accelerate the test phase. For example, it costs 0.15s to predict the label of an image with the original set while the cost is only 0.007s if evaluating with latent examples. It makes MaPML with latent examples a potential method for real-time applications.
6 Conclusion
In this work, we propose a framework to learn the distance metric and latent examples simultaneously. By learning from a small set of clean latent examples, MaPML can sample the active triplets efficiently and the learning procedure is robust to the uncertainty in real-world data. Moreover, MaPML can preserve the large margin property for the original data when learning merely with latent examples. The empirical study confirms the efficacy and efficiency of MaPML. In the future, we plan to evaluate MaPML on different tasks (e.g., information retrieval) and different types of data. Besides, incorporating the proposed strategy into deep metric learning is also an attractive direction. It can accelerate the learning of deep embeddings, and the resulting latent examples may further improve the performance.
- [1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
- [2] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.
- [3] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
- [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
- [6] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- [7] B. Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
- [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [9] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh. No fuss distance metric learning using proxies. In ICCV, pages 360–368, 2017.
- [10] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu. Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). ML, 99(3):353–372, 2015.
- [11] Q. Qian, R. Jin, S. Zhu, and Y. Lin. Fine-grained visual categorization via multi-stage metric learning. In CVPR, pages 3716–3724, 2015.
- [12] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
- [13] O. Rippel, M. Paluri, P. Dollár, and L. D. Bourdev. Metric learning with adaptive density discrimination. In ICLR, 2016.
- [14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- [15] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
- [16] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
- [17] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
- [18] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2002.
- [19] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. 2006.
- [20] H. Ye, D. Zhan, X. Si, and Y. Jiang. Learning Mahalanobis distance metric: Considering instance disturbance helps. In IJCAI, pages 3315–3321, 2017.