Scalable Large-Margin Mahalanobis Distance Metric Learning
For many machine learning algorithms, such as $k$-Nearest Neighbor ($k$-NN) classifiers and $k$-means clustering, success depends heavily on the metric used to calculate distances between data points. An effective way to define such a metric is to learn it from a set of labeled training samples. In this work, we propose a fast and scalable algorithm to learn a Mahalanobis distance metric. The Mahalanobis metric can be viewed as the Euclidean distance metric on input data that have been linearly transformed. By employing the principle of margin maximization to achieve better generalization performance, the algorithm formulates metric learning as a convex optimization problem in which a positive semidefinite (p.s.d.) matrix is the unknown variable. Based on an important theorem that a p.s.d. trace-one matrix can always be represented as a convex combination of multiple rank-one matrices, our algorithm accommodates any differentiable loss function and solves the resulting optimization problem using a specialized gradient descent procedure. During the course of optimization, the proposed algorithm maintains the positive semidefiniteness of the matrix variable, which is essential for a Mahalanobis metric. Compared with conventional methods such as standard interior-point algorithms or the special solver used in Large Margin Nearest Neighbor (LMNN), our algorithm is much more efficient and scales better. Experiments on benchmark data sets suggest that, compared with state-of-the-art metric learning algorithms, our algorithm can achieve comparable classification accuracy with reduced computational complexity.
In many machine learning problems, the distance metric used over the input data has a critical impact on the success of a learning algorithm. For instance, $k$-Nearest Neighbor ($k$-NN) classification and clustering algorithms such as $k$-means rely on whether an appropriate distance metric is used to faithfully model the underlying relationships between the input data points. A more concrete example is visual object recognition. Many visual recognition tasks can be viewed as inferring a distance metric that is able to measure the (dis)similarity of the input visual data, ideally in a way that is consistent with human perception. Typical examples include object categorization and content-based image retrieval, in which a similarity metric is needed to discriminate between different object classes, or between images that are relevant and irrelevant to a given query. As one of the most classic and simplest classifiers, $k$-NN has been applied to a wide range of vision tasks, and it is a classifier that directly depends on a predefined distance metric. An appropriate distance metric is usually needed to achieve good accuracy. Previous work (e.g., [25, 26]) has shown that, compared to using the standard Euclidean distance, applying a well-designed distance can often significantly boost the classification accuracy of a $k$-NN classifier. In this work, we propose a scalable and fast algorithm to learn a Mahalanobis distance metric. The Mahalanobis metric removes the main limitation of the Euclidean metric in that it corrects for correlation between the different features.
Recently, much research effort has been spent on learning a Mahalanobis distance metric from labeled data [25, 26, 23, 5]. Typically, a convex cost function is defined such that a global optimum can be achieved in polynomial time. It has been shown in statistical learning theory that increasing the margin between different classes helps to reduce the generalization error. Inspired by prior large-margin work, we directly learn the Mahalanobis matrix from a set of distance comparisons, and optimize it via margin maximization. The intuition is that such a learned Mahalanobis distance metric may achieve sufficient separation at the boundaries between different classes. More importantly, we address the scalability problem of learning the Mahalanobis distance matrix in the presence of high-dimensional feature vectors, which is a critical issue in distance metric learning. According to a known theorem, a positive semidefinite trace-one matrix can always be decomposed as a convex combination of a set of rank-one matrices. This theorem has inspired us to develop a fast optimization algorithm that works in the style of gradient descent. At each iteration, it only needs to find the principal eigenvector of a matrix of size $D \times D$ ($D$ is the dimensionality of the input data) and perform a simple matrix update. This process incurs much less computational overhead than the metric learning algorithms in the literature [23, 2]. Moreover, thanks to the above theorem, this process automatically preserves the p.s.d. property of the Mahalanobis matrix. To verify its effectiveness and efficiency, the proposed algorithm is tested on a few benchmark data sets and compared with state-of-the-art distance metric learning algorithms. As experimentally demonstrated, $k$-NN with the Mahalanobis distance learned by our algorithm attains comparable (sometimes slightly better) classification accuracy. Meanwhile, in terms of computation time, the proposed algorithm scales much better with the dimensionality of the input feature vectors.
We briefly review some related work before we present our own. Given a classification task, some previous work on learning a distance metric aims to find a metric that keeps data in the same class close together and separates data in different classes as far as possible. Xing et al. proposed an approach to learn a Mahalanobis distance for supervised clustering. It minimizes the sum of the distances among data in the same class while maximizing the sum of the distances among data in different classes. Their work shows that the learned metric can improve clustering performance significantly. However, to maintain the p.s.d. property, they use projected gradient descent, and their approach has to perform a full eigen-decomposition of the Mahalanobis matrix at each iteration. Its computational cost rises rapidly as the number of features increases, which makes it less efficient in coping with high-dimensional data. Goldberger et al. developed an algorithm termed Neighborhood Component Analysis (NCA), which learns a Mahalanobis distance by minimizing the leave-one-out cross-validation error of the $k$-NN classifier on the training set. NCA needs to solve a non-convex optimization problem, which might have many local optima. Thus it is critically important to start the search from a reasonable initial point. Goldberger et al. used the result of linear discriminant analysis as the initial point. In NCA, the variable to optimize is the projection matrix.
The work closest to ours is Large Margin Nearest Neighbor (LMNN), in the sense that it also learns a Mahalanobis distance in the large margin framework. In that approach, the distances between each sample and its “target neighbors” are minimized while the distances among data with different labels are maximized. A convex objective function is obtained and the resulting problem is a semidefinite program (SDP). Since conventional interior-point based SDP solvers can only solve problems of up to a few thousand variables, LMNN adopts an alternating projection algorithm for solving the SDP problem. At each iteration, as in the method of Xing et al., a full eigen-decomposition is needed. Our approach is largely inspired by this work. Our work differs from LMNN in the following respects: (1) LMNN learns the metric from pairwise distance information. In contrast, our algorithm uses examples of proximity comparisons among triples of objects (e.g., example $i$ is closer to example $j$ than to example $k$). In some applications such as image retrieval, this type of information could be easier to obtain than the actual class label of each training image. Rosales and Fung have used similar ideas on metric learning; (2) More importantly, we design an optimization method that has a clear advantage in computational efficiency (we only need to compute the leading eigenvector at each iteration). The optimization problems in this earlier work are SDPs, which are computationally heavy. Linear programs (LPs) have also been used to approximate the SDP problem, but it remains unclear how good this approximation is.
The problem of learning a kernel from a set of labeled data shares similarities with metric learning because the optimization involved has similar formulations. Lanckriet et al. and Kulis et al. considered learning p.s.d. kernels subject to some pre-defined constraints. An appropriate kernel can often offer algorithmic improvements. It is possible to apply the proposed gradient descent optimization technique to solve kernel learning problems. We leave this topic for future study.
The rest of the paper is organized as follows. Section II presents the convex formulation of learning a Mahalanobis metric. In Section III, we show how to efficiently solve the optimization problem by a specialized gradient descent procedure, which is the main contribution of this work. The performance of our approach is experimentally demonstrated in Section IV. Finally, we conclude this work in Section V.
II Large-Margin Mahalanobis Metric Learning
In this section, we propose our distance metric learning approach as follows. The intuition is to find a particular distance metric for which the margin of separation between the classes is maximized. In particular, we are interested in learning a quadratic Mahalanobis metric.
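Let $\mathbf{a}_i \in \mathbb{R}^D$, $i = 1, \dots, n$, denote a training sample, where $n$ is the number of training samples and $D$ is the number of features. To learn a Mahalanobis distance, we create a set $\mathcal{S}$ that contains a group of training triplets, $\mathcal{S} = \{(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k)\}$, where $\mathbf{a}_i$ and $\mathbf{a}_j$ come from the same class and $\mathbf{a}_k$ belongs to a different class. A Mahalanobis distance is defined as follows. Let $\mathbf{L} \in \mathbb{R}^{D \times d}$ denote a linear transformation and $\mathrm{dist}$ be the squared Euclidean distance in the transformed space. The squared distance between the projections of $\mathbf{a}_i$ and $\mathbf{a}_j$ is

$$\mathrm{dist}_{ij} = \|\mathbf{L}^\top \mathbf{a}_i - \mathbf{L}^\top \mathbf{a}_j\|_2^2 = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{L}\mathbf{L}^\top (\mathbf{a}_i - \mathbf{a}_j). \tag{1}$$

According to the class memberships of $\mathbf{a}_i$, $\mathbf{a}_j$ and $\mathbf{a}_k$, we wish to achieve $\mathrm{dist}_{ik} \ge \mathrm{dist}_{ij}$, which can be written as

$$(\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{L}\mathbf{L}^\top (\mathbf{a}_i - \mathbf{a}_k) \ge (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{L}\mathbf{L}^\top (\mathbf{a}_i - \mathbf{a}_j). \tag{2}$$

This inequality is generally not a convex constraint in $\mathbf{L}$ because it involves a difference of quadratic terms in $\mathbf{L}$. To make the constraint convex, a new variable $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$ is introduced and used throughout the learning process. Learning a Mahalanobis distance is then essentially learning the Mahalanobis matrix $\mathbf{X}$, and (2) becomes linear in $\mathbf{X}$. This is a typical technique for convexifying a problem in convex optimization.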
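The equivalence between the distance after a linear transformation and the quadratic form in the Mahalanobis matrix can be checked numerically. The following is a minimal sketch; the sizes, the random data and the helper name `mahalanobis_sq` are illustrative assumptions and are not part of the original work.

```python
import numpy as np

def mahalanobis_sq(a_i, a_j, X):
    """Squared Mahalanobis distance (a_i - a_j)^T X (a_i - a_j)."""
    diff = a_i - a_j
    return float(diff @ X @ diff)

rng = np.random.default_rng(0)
D, d = 5, 3                         # illustrative dimensions (assumed)
L = rng.standard_normal((D, d))     # a linear transformation
X = L @ L.T                         # Mahalanobis matrix, p.s.d. by construction
a_i = rng.standard_normal(D)
a_j = rng.standard_normal(D)

# Squared Euclidean distance between the projected points ...
dist_projected = float(np.sum((L.T @ a_i - L.T @ a_j) ** 2))
# ... equals the quadratic form in X.
dist_quadratic = mahalanobis_sq(a_i, a_j, X)
```

Because the distance depends on the transformation only through its product with its own transpose, the learning problem can be posed directly over the p.s.d. matrix.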
II-A Maximization of a soft margin
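In our algorithm, a margin is defined as the difference between $\mathrm{dist}_{ik}$ and $\mathrm{dist}_{ij}$, that is,

$$\rho_r = (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j), \quad \forall (\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \in \mathcal{S}, \; r = 1, \dots, |\mathcal{S}|. \tag{3}$$

Similar to the large margin principle that has been widely used in machine learning algorithms such as support vector machines and boosting, here we maximize this margin (3) to obtain the optimal Mahalanobis matrix $\mathbf{X}$. Clearly, the larger the margin $\rho_r$, the better the metric that might be achieved. To enable some flexibility, i.e., to allow some of the inequalities (2) not to be satisfied, a soft-margin criterion is needed. Considering these factors, we define the objective function for learning $\mathbf{X}$ as

$$\max_{\rho, \mathbf{X}, \boldsymbol{\xi}} \; \rho - \frac{C}{|\mathcal{S}|} \sum_{r=1}^{|\mathcal{S}|} \xi_r \quad \text{s.t.} \quad \mathbf{X} \succeq 0, \; \operatorname{Tr}(\mathbf{X}) = 1, \; \xi_r \ge 0, \; (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j) \ge \rho - \xi_r, \; \forall (\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \in \mathcal{S}, \tag{4}$$

where $\mathbf{X} \succeq 0$ constrains $\mathbf{X}$ to be a p.s.d. matrix and $\operatorname{Tr}(\mathbf{X})$ denotes the trace of $\mathbf{X}$. The index $r$ runs over the training set $\mathcal{S}$ and $|\mathcal{S}|$ denotes the size of $\mathcal{S}$. $C$ is an algorithmic parameter that balances the violation of (2) and the margin maximization. $\xi_r \ge 0$ is a slack variable similar to that used in support vector machines, and it corresponds to the soft-margin hinge loss. Enforcing $\operatorname{Tr}(\mathbf{X}) = 1$ removes the scale ambiguity because the inequality constraints are scale invariant. To simplify exposition, we define

$$\mathbf{A}_r = (\mathbf{a}_i - \mathbf{a}_k)(\mathbf{a}_i - \mathbf{a}_k)^\top - (\mathbf{a}_i - \mathbf{a}_j)(\mathbf{a}_i - \mathbf{a}_j)^\top.$$

Therefore, with $\langle \mathbf{A}, \mathbf{X} \rangle = \operatorname{Tr}(\mathbf{A}^\top \mathbf{X})$, the last constraint in (4) can be written as

$$\langle \mathbf{A}_r, \mathbf{X} \rangle \ge \rho - \xi_r, \quad r = 1, \dots, |\mathcal{S}|. \tag{5}$$

Note that this is a linear constraint on $\mathbf{X}$. Problem (4) is thus a typical SDP problem since it has a linear objective function and linear constraints plus a p.s.d. conic constraint. One may solve it using off-the-shelf SDP solvers such as CSDP. However, directly solving problem (4) using standard interior-point SDP solvers quickly becomes computationally intractable as the dimensionality of the feature vectors increases. We show how to efficiently solve (4) in the fashion of first-order gradient descent.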
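To make the linear form of the triplet constraint concrete, the sketch below builds the matrix associated with one triplet (the difference of two outer products) and evaluates its inner product with a trace-one p.s.d. matrix. The function names and the toy points are assumptions chosen for illustration.

```python
import numpy as np

def triplet_matrix(a_i, a_j, a_k):
    """(a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T for one triplet."""
    u = a_i - a_k   # difference to the sample from another class
    v = a_i - a_j   # difference to the sample from the same class
    return np.outer(u, u) - np.outer(v, v)

def triplet_margin(A_r, X):
    """Inner product <A_r, X> = Tr(A_r^T X), which is linear in X."""
    return float(np.sum(A_r * X))

# Toy triplet: a_j is at distance 1 from a_i, a_k at distance 3.
a_i = np.array([0.0, 0.0])
a_j = np.array([1.0, 0.0])
a_k = np.array([3.0, 0.0])
X = np.eye(2) / 2.0                 # trace-one, p.s.d. starting metric

A_r = triplet_matrix(a_i, a_j, a_k)
value = triplet_margin(A_r, X)      # (9 - 1) / 2 = 4.0
```

A positive value means that, under the metric given by the matrix, the differently labeled sample is farther away than the same-class sample.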
II-B Employment of a differentiable loss function
It has been proved that a p.s.d. trace-one matrix can always be decomposed as a convex combination of a set of rank-one, trace-one matrices. In the context of our problem, this means that $\mathbf{X} = \sum_i \lambda_i \mathbf{Z}_i$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, where $\mathbf{Z}_i$ is a rank-one matrix and $\operatorname{Tr}(\mathbf{Z}_i) = 1$. This important result inspires us to develop a gradient descent based optimization algorithm. In each iteration, $\mathbf{X}$ can be updated as

$$\mathbf{X}_{i+1} = \mathbf{X}_i + \alpha(\delta\mathbf{X} - \mathbf{X}_i) = \mathbf{X}_i + \alpha \mathbf{p}_i, \quad 0 \le \alpha \le 1, \tag{6}$$

where $\delta\mathbf{X}$ is a rank-one, trace-one p.s.d. matrix and $\mathbf{p}_i = \delta\mathbf{X} - \mathbf{X}_i$ is the search direction. It is straightforward to verify that $\operatorname{Tr}(\mathbf{X}_{i+1}) = 1$ and $\mathbf{X}_{i+1} \succeq 0$ hold. This is the starting point of our gradient descent algorithm. With this update strategy, the trace-one property and positive semidefiniteness of $\mathbf{X}$ are always retained. We show how to calculate this search direction in Algorithm 2. Although it is possible to use subgradient methods to optimize non-smooth objective functions, we use a differentiable objective function instead so that the optimization procedure is simplified (standard gradient descent can be applied). So, we need to ensure that the objective function is differentiable with respect to the variables $\mathbf{X}$ and $\rho$.
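The claim that this update keeps the matrix trace-one and p.s.d. follows because the new iterate is a convex combination of two trace-one p.s.d. matrices. A small numerical check, with an arbitrary unit vector and step size chosen purely for illustration:

```python
import numpy as np

def rank_one_update(X, v, alpha):
    """Move X towards the rank-one, trace-one matrix v v^T / ||v||^2.

    The result equals (1 - alpha) * X + alpha * v v^T, a convex combination
    for 0 <= alpha <= 1, so trace-one and p.s.d. are both preserved.
    """
    v = v / np.linalg.norm(v)
    delta_X = np.outer(v, v)        # rank-one, trace-one, p.s.d.
    return X + alpha * (delta_X - X)

rng = np.random.default_rng(1)
D = 4                               # illustrative dimension (assumed)
X0 = np.eye(D) / D                  # feasible starting point
X1 = rank_one_update(X0, rng.standard_normal(D), alpha=0.3)

trace_after = float(np.trace(X1))
min_eig_after = float(np.linalg.eigvalsh(X1).min())
```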
Let $f(\cdot)$ denote the objective function and $\lambda(\cdot)$ be a loss function. With $\mathbf{A}_r$ the matrix defined before (5), our objective function can be rewritten as

$$f(\mathbf{X}, \rho) = \rho - \frac{C}{|\mathcal{S}|} \sum_{r=1}^{|\mathcal{S}|} \lambda\big(\langle \mathbf{A}_r, \mathbf{X} \rangle - \rho\big). \tag{7}$$

The above problem (4) adopts the hinge loss function, which is defined as $\lambda(z) = \max(0, -z)$. However, the hinge loss is not differentiable at the point $z = 0$, so standard gradient-based optimization cannot be applied directly. In order to make standard gradient descent methods applicable, we propose to use differentiable loss functions, for example, the squared hinge loss or the Huber loss, as discussed below.

The squared hinge loss function can be represented as

$$\lambda(z) = \begin{cases} 0, & z \ge 0, \\ z^2, & z < 0. \end{cases} \tag{8}$$

As shown in Fig. 1, this function connects the positive and zero segments smoothly and it is differentiable everywhere, including the point $z = 0$. We also consider the Huber loss function in this work:

$$\lambda(z) = \begin{cases} 0, & z \ge h, \\ \dfrac{(h - z)^2}{4h}, & -h < z < h, \\ -z, & z \le -h, \end{cases} \tag{9}$$

where $h > 0$ is a small parameter. A Huber loss function is plotted in Fig. 1. There are three different parts in the Huber loss function, and together they form a continuous and differentiable function. This loss function approaches the hinge loss curve when $h \to 0$. Although the Huber loss is more complicated than the squared hinge loss, its value increases only linearly with the size of the violation once $z \le -h$. Hence, when a training set contains outliers or samples heavily contaminated by noise, the Huber loss might give a more reasonable (milder) penalty than the squared hinge loss does. We discuss both loss functions in our experimental study. Again, we highlight that by using these two loss functions, the cost function that we are going to optimize becomes differentiable with respect to both $\mathbf{X}$ and $\rho$.
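The relationship between the three losses can be illustrated with a short sketch. The piecewise Huber form used here is a common smoothing of the hinge loss and is an assumption of this example, as are the function names and the grid of test points; the argument is positive when a triplet constraint is satisfied with room to spare and negative when it is violated.

```python
import numpy as np

def hinge(z):
    """Hinge loss max(0, -z): zero when satisfied, linear when violated."""
    return np.maximum(0.0, -np.asarray(z, dtype=float))

def squared_hinge(z):
    """Squared hinge loss: differentiable everywhere, quadratic growth."""
    return np.maximum(0.0, -np.asarray(z, dtype=float)) ** 2

def huber(z, h):
    """Smoothed hinge: zero for z >= h, quadratic on (-h, h), linear for z <= -h."""
    z = np.asarray(z, dtype=float)
    quadratic = (h - z) ** 2 / (4.0 * h)
    return np.where(z >= h, 0.0, np.where(z <= -h, -z, quadratic))

zs = np.linspace(-3.0, 3.0, 601)
gap_small_h = float(np.max(np.abs(huber(zs, 1e-3) - hinge(zs))))
penalty_outlier_sq = float(squared_hinge(-3.0))     # grows quadratically
penalty_outlier_huber = float(huber(-3.0, 0.5))     # grows linearly
```

For a heavily violated triplet the squared hinge penalty is much larger than the Huber penalty, which is the sense in which the latter is milder on outliers.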
III A scalable and fast optimization algorithm
The proposed algorithm maximizes the objective function iteratively, and in each iteration the two variables $\mathbf{X}$ and $\rho$ are optimized alternately. Note that this alternating strategy retains the global optimum because the problem is convex in both variables ($f$ is concave and is maximized) and the two variables are not coupled together. We summarize the proposed algorithm in Algorithm 1. Note that $\rho$ is a scalar, so Line 3 in Algorithm 1 can be solved directly by a simple one-dimensional maximization process. However, $\mathbf{X}$ is a p.s.d. matrix of size $D \times D$. Recall that $D$ is the dimensionality of the feature vectors. The following section presents how $\mathbf{X}$ is efficiently optimized in our algorithm.
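The one-dimensional step over the margin can be sketched as follows, assuming the squared hinge loss and a fixed matrix, so that only the per-triplet values of the linear constraint (called `m` here) matter. The bisection on the derivative and the helper names are choices made for this illustration; any one-dimensional concave maximizer would do.

```python
import numpy as np

def objective_rho(rho, m, C):
    """rho - C * mean(max(0, rho - m_r)^2) for fixed per-triplet values m_r."""
    return rho - C * np.mean(np.maximum(0.0, rho - m) ** 2)

def best_rho(m, C, n_iter=200):
    """Maximize the concave function above by bisection on its derivative."""
    m = np.asarray(m, dtype=float)

    def derivative(rho):
        return 1.0 - 2.0 * C * np.mean(np.maximum(0.0, rho - m))

    lo = m.min()                       # derivative equals 1 here
    hi = m.max() + 1.0 / (2.0 * C)     # derivative is <= 0 here
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if derivative(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho_single = best_rho(np.array([0.0]), C=1.0)            # stationary point 0.5
rho_three = best_rho(np.array([1.0, 2.0, 3.0]), C=1.0)   # stationary point 2.25
```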
III-A Optimizing for the Mahalanobis matrix
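Let $\mathcal{P} = \{\mathbf{X} : \mathbf{X} \succeq 0, \operatorname{Tr}(\mathbf{X}) = 1\}$ be the domain in which a feasible $\mathbf{X}$ lies. Note that $\mathcal{P}$ is a convex set of $\mathbf{X}$. As shown in Line 4 in Algorithm 1, we need to solve the following maximization problem:

$$\max_{\mathbf{X} \in \mathcal{P}} \; f(\mathbf{X}, \rho_i), \tag{10}$$

where $\rho_i$ is the output of Line 3 in Algorithm 1. Our algorithm offers a simple and efficient way of solving this problem by explicitly maintaining the positive semidefiniteness of the matrix $\mathbf{X}$. It only needs to compute the largest eigenvalue and the corresponding eigenvector, whereas most previous approaches require a full eigen-decomposition of $\mathbf{X}$. Their computational complexities are $O(D^2)$ and $O(D^3)$, respectively. When $D$ is large, this difference in computational complexity can be significant.

Let $\nabla f(\mathbf{X}_i)$ be the gradient matrix of $f$ with respect to $\mathbf{X}$ and $\alpha$ be the step size for updating $\mathbf{X}$. Recall that we update $\mathbf{X}$ in such a way that $\mathbf{X}_{i+1} = (1 - \alpha)\mathbf{X}_i + \alpha\,\delta\mathbf{X}$, where $\operatorname{rank}(\delta\mathbf{X}) = 1$ and $\operatorname{Tr}(\delta\mathbf{X}) = 1$. To find the $\delta\mathbf{X}$ that satisfies these constraints and at the same time best approximates the gradient matrix $\nabla f(\mathbf{X}_i)$, we need to solve the following optimization problem:

$$\delta\mathbf{X} = \arg\max_{\mathbf{Z}} \; \langle \nabla f(\mathbf{X}_i), \mathbf{Z} \rangle \quad \text{s.t.} \quad \operatorname{rank}(\mathbf{Z}) = 1, \; \operatorname{Tr}(\mathbf{Z}) = 1. \tag{11}$$

The optimal $\delta\mathbf{X}$ is exactly $\mathbf{v}\mathbf{v}^\top$, where $\mathbf{v}$ is the eigenvector of $\nabla f(\mathbf{X}_i)$ that corresponds to the largest eigenvalue. Together with symmetry and positive semidefiniteness, the constraints say that $\delta\mathbf{X}$ is the outer product of a unit vector with itself: $\delta\mathbf{X} = \mathbf{v}\mathbf{v}^\top$ with $\|\mathbf{v}\|_2 = 1$. Here $\|\cdot\|_2$ is the Euclidean norm. Problem (11) can then be written as $\max_{\mathbf{v}} \mathbf{v}^\top \nabla f(\mathbf{X}_i)\, \mathbf{v}$, subject to $\|\mathbf{v}\|_2 = 1$. It is clear now that an eigen-decomposition gives the solution to the above problem.

Hence, to solve the above optimization, we only need to compute the leading eigenvector of the matrix $\nabla f(\mathbf{X}_i)$. Note that $\mathbf{X}$ still retains the properties $\mathbf{X} \succeq 0$ and $\operatorname{Tr}(\mathbf{X}) = 1$ after applying this process.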
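The fact that the leading eigenvector gives the best rank-one, trace-one direction can be verified numerically. This sketch uses `numpy.linalg.eigh`, which computes a full decomposition and is used here only for brevity; an ARPACK-based routine such as `scipy.sparse.linalg.eigsh` would return only the leading pair. The random symmetric matrix stands in for a gradient and is an assumption of the example.

```python
import numpy as np

def best_rank_one_direction(G):
    """Return (v v^T, largest eigenvalue) for the symmetric part of G.

    v v^T maximizes <G, Z> over rank-one, trace-one p.s.d. matrices Z,
    because <G, v v^T> = v^T G v and ||v|| = 1.
    """
    G_sym = 0.5 * (G + G.T)
    eigvals, eigvecs = np.linalg.eigh(G_sym)   # eigenvalues in ascending order
    v = eigvecs[:, -1]                         # leading eigenvector, unit norm
    return np.outer(v, v), float(eigvals[-1])

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
G = 0.5 * (B + B.T)                 # stand-in for a gradient matrix
Z, lam = best_rank_one_direction(G)
attained = float(np.sum(G * Z))     # equals the largest eigenvalue
```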
Clearly, a key parameter of this optimization process is the step size $\alpha$, which implicitly decides the total number of iterations. The computational overhead of our algorithm is proportional to the number of iterations. Hence, to achieve a fast optimization process, we need to ensure that in each iteration $\alpha$ leads to a sufficient improvement in the value of $f$. This is discussed in the following part.
III-B Finding the optimal step size
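We employ the backtracking line search algorithm to identify a suitable $\alpha$. It reduces the value of $\alpha$ until the Wolfe conditions are satisfied. As shown in Algorithm 2, the search direction is $\mathbf{p}_i = \delta\mathbf{X} - \mathbf{X}_i$. Written for the maximization of $f$, the Wolfe conditions that we use are

$$f(\mathbf{X}_i + \alpha \mathbf{p}_i, \rho_i) \ge f(\mathbf{X}_i, \rho_i) + c_1 \alpha \langle \mathbf{p}_i, \nabla f(\mathbf{X}_i) \rangle, \qquad \langle \mathbf{p}_i, \nabla f(\mathbf{X}_i + \alpha \mathbf{p}_i) \rangle \le c_2 \langle \mathbf{p}_i, \nabla f(\mathbf{X}_i) \rangle,$$

where $0 < c_1 < c_2 < 1$. The result of the backtracking line search is an acceptable $\alpha$ that gives rise to a sufficient improvement in the value of $f$. We show in the experiments that with this setting our optimization algorithm can achieve higher computational efficiency than some of the existing solvers.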
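Putting the pieces together, one iteration over the matrix variable can be sketched as: compute the gradient, take the leading eigenvector, and backtrack on the step size. This is a simplified sketch under stated assumptions, not the authors' implementation: it uses the squared hinge loss, keeps the margin fixed, checks only the sufficient-increase condition, and runs on synthetic triplets; all names and constants are illustrative.

```python
import numpy as np

def objective(X, rho, A, C):
    z = np.einsum('rij,ij->r', A, X) - rho           # <A_r, X> - rho per triplet
    return rho - C * np.mean(np.maximum(0.0, -z) ** 2)

def gradient_X(X, rho, A, C):
    z = np.einsum('rij,ij->r', A, X) - rho
    weights = 2.0 * C * np.maximum(0.0, -z) / len(A)  # only violated triplets count
    return np.einsum('r,rij->ij', weights, A)

def ascent_step(X, rho, A, C, c1=1e-4, shrink=0.5, max_halvings=30):
    G = gradient_X(X, rho, A, C)
    v = np.linalg.eigh(G)[1][:, -1]                  # leading eigenvector
    p = np.outer(v, v) - X                           # search direction
    slope = float(np.sum(G * p))                     # directional derivative >= 0
    f0 = objective(X, rho, A, C)
    alpha = 1.0
    for _ in range(max_halvings):
        if objective(X + alpha * p, rho, A, C) >= f0 + c1 * alpha * slope:
            return X + alpha * p                     # sufficient increase reached
        alpha *= shrink
    return X                                         # no acceptable step found

# Synthetic two-class data: classes differ along the first coordinate.
rng = np.random.default_rng(0)
D, n = 4, 20
shift = np.array([2.0, 0.0, 0.0, 0.0])
pos = rng.standard_normal((n, D)) + shift
neg = rng.standard_normal((n, D)) - shift
A = np.stack([np.outer(pos[i] - neg[i], pos[i] - neg[i])
              - np.outer(pos[i] - pos[(i + 1) % n], pos[i] - pos[(i + 1) % n])
              for i in range(n)])

X = np.eye(D) / D
rho, C = 10.0, 10.0                                  # fixed, illustrative values
values = [objective(X, rho, A, C)]
for _ in range(20):
    X = ascent_step(X, rho, A, C)
    values.append(objective(X, rho, A, C))
```

Because every accepted step is a convex combination with a step size of at most one, the iterate stays trace-one and p.s.d. without any projection.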
The goal of these experiments is to verify the efficiency of our algorithm in achieving comparable (or sometimes even better) classification performance with a reduced computational cost. We perform experiments on the data sets described in Table I. For some data sets, PCA is performed to remove noise and reduce the dimensionality. The metric learning algorithms are then run on the data sets pre-processed by PCA. The Wine, Balance, Vehicle, Breast-Cancer and Diabetes data sets are obtained from the UCI Machine Learning Repository, and USPS, MNIST and Letter are from LibSVM. For MNIST, we only use its test data in our experiment. The ORLface data is from AT&T Research (http://www.uk.research.att.com/facedatabase.html) and Twin-Peaks is downloaded from L. van der Maaten’s website (http://ticc.uvt.nl/lvdrmaaten/). The Face and Background classes (435 and 520 images respectively) in the image retrieval experiment are obtained from the Caltech-101 object database. To enable statistical analysis, the ORLface, Twin-Peaks, Wine, Balance, Vehicle, Diabetes and Face-Background data sets are randomly split into 10 sets of train/validation/test subsets, and the experiments on those data sets are repeated 10 times on each split.
Table I: Data sets used in the experiments (columns: # training, # validation, # test, dimension, dimension after PCA, # classes, # runs, # triplets for SDPMetric).
The $k$-NN classifier with the Mahalanobis distance learned by our algorithm (termed SDPMetric for short) is compared with $k$-NN classifiers using a simple Euclidean distance (“Euclidean” for short) and the distance learned by Large Margin Nearest Neighbor (LMNN for short). In our experiment, we have used the implementation of LMNN’s authors. Note that, to be consistent with the setting of that work, LMNN here also uses the “obj=1” option and updates the projection matrix to speed up its computation. If the distance matrix were updated directly to obtain the global optimum, LMNN would be much slower due to the full eigen-decomposition at each iteration. Since Weinberger et al. have shown that LMNN obtains classification performance comparable to support vector machines on some data sets, we focus on the comparison between our algorithm and LMNN, which is considered the state of the art. To prepare the training triplet set $\mathcal{S}$, we apply a nearest-neighbor search to these data sets and generate the training triplets for our algorithm. The training sets for LMNN are generated in the same way, except that a different neighborhood size is used for Twin-Peaks and ORLface. The experiment also compares the two variants of our proposed SDPMetric, which use the squared hinge loss (denoted as SDPMetric-S) and the Huber loss (SDPMetric-H), respectively. We randomly split each data set into 70/15/15% and refer to those subsets as the training, cross-validation and test sets, except for the pre-separated data sets (Letter and USPS) and Face-Background, which was made for image retrieval. Following its original setting, LMNN uses 85/15% of the data for training and testing. The training data is also split into 70/15% in LMNN for cross validation, to be consistent with our SDPMetric. Since the USPS data set has already been split into training/test sets, only the training data are divided into 70/15% for training and validation. The Letter data set is separated according to Hsu and Lin.
Following previous work, PCA is applied to USPS, MNIST and ORLface to reduce the dimensionality of the feature vectors.
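One plausible way to generate training triplets from labeled data with a nearest-neighbor search is sketched below. The exact procedure is not spelled out in the text, so the pairing rule (each sample with its nearest same-class and nearest different-class neighbors under the Euclidean distance), the function name and the `n_neighbors` parameter are all assumptions of this example.

```python
import numpy as np

def build_triplets(data, labels, n_neighbors):
    """Return index triplets (i, j, k): j shares the class of i, k does not."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    indices = np.arange(len(data))
    triplets = []
    for i in indices:
        d2 = np.sum((data - data[i]) ** 2, axis=1)        # squared Euclidean
        same = indices[(labels == labels[i]) & (indices != i)]
        diff = indices[labels != labels[i]]
        nearest_same = same[np.argsort(d2[same])[:n_neighbors]]
        nearest_diff = diff[np.argsort(d2[diff])[:n_neighbors]]
        for j in nearest_same:
            for k in nearest_diff:
                triplets.append((int(i), int(j), int(k)))
    return triplets

# Four points on a line, two per class.
toy_data = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
toy_triplets = build_triplets(toy_data, toy_labels, n_neighbors=1)
```

With larger neighborhoods the number of triplets grows with the square of `n_neighbors` per sample, which is why the triplet count is reported separately for each data set.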
The following experimental study demonstrates that our algorithm achieves slightly better classification accuracy at a much lower computational cost than LMNN on most of the tested data sets. The detailed test error rates and timing results are reported in Tables II and III. As we can see, the test error rates of SDPMetric-S are comparable to those of LMNN. SDPMetric-H achieves lower misclassification error rates than LMNN and the Euclidean distance on most data sets, except the Face-Background data (which is treated as an image retrieval problem) and MNIST, on which SDPMetric-S achieves a lower error rate. Overall, we can conclude that the proposed SDPMetric, with either the squared hinge loss or the Huber loss, is at least comparable to (and sometimes slightly better than) the state-of-the-art LMNN method in terms of classification performance.
Before reporting the timing results on these benchmark data sets, we compare our algorithm (SDPMetric-H) with two convex optimization solvers, namely SeDuMi and SDPT3, which are used as internal solvers in the disciplined convex programming software CVX. Both SeDuMi and SDPT3 use interior-point based methods. To perform the eigen-decomposition, our SDPMetric uses ARPACK, which is designed to solve large-scale eigenvalue problems. Our SDPMetric is implemented in standard C/C++. Experiments have been conducted on a standard desktop. We randomly generated training triplets and gradually increased the dimensionality of the feature vectors. Fig. 2 illustrates the computational time of our algorithm, CVX/SeDuMi and CVX/SDPT3. As shown, the computational load of our algorithm stays almost constant as the dimensionality increases. This might be because the eigen-decomposition does not dominate the CPU time of SDPMetric over the tested range of dimensions on this data set. In contrast, the computational loads of CVX/SeDuMi and CVX/SDPT3 increase quickly over the same range, and the difference in CPU time is largest at the highest dimension tested. This shows the inefficiency and poor scalability of standard interior-point methods. Secondly, the computational times of LMNN, SDPMetric-S and SDPMetric-H on the benchmark data sets are compared in Table III. As shown, LMNN is always slower than the proposed SDPMetric, which converges very fast on these data sets. In particular, on the Letter and Twin-Peaks data sets, SDPMetric shows significantly improved computational efficiency.
The Face-Background data set consists of two object classes, Face-easy and Background-Google, and is treated as a retrieval problem. The images in the Background-Google class are randomly collected from the Internet and are used to represent the non-target class. For each image, a number of interest regions are identified by the Harris-Affine detector and the visual content in each region is characterized by the SIFT descriptor. A codebook is created by $k$-means clustering. Each image is then represented by a histogram vector, with one bin per codebook entry, containing the number of occurrences of each visual word. We evaluate retrieval accuracy using each facial image in a test subset as a query. For each compared metric, the accuracy over the top-ranked retrieved images is computed, which is defined as the ratio of the number of facial images to the total number of retrieved images. We calculate the average accuracy on each test subset and then average over all test subsets. Fig. 3 shows the retrieval accuracies of the Euclidean distance and of the Mahalanobis distances learned by LMNN and SDPMetric. We can clearly observe that SDPMetric-H and SDPMetric-S consistently yield higher retrieval accuracy, which again verifies their advantages over the LMNN method and the Euclidean distance.
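The retrieval accuracy defined above (the fraction of facial images among the top-ranked results for a query) can be computed as in the following sketch. The function name, the boolean `gallery_is_face` array and the tiny one-dimensional gallery are illustrative assumptions; in the experiment the inputs would be visual-word histograms and a learned Mahalanobis matrix.

```python
import numpy as np

def retrieval_accuracy(query, gallery, gallery_is_face, X, top_n):
    """Fraction of face images among the top_n gallery items nearest to query."""
    diffs = gallery - query
    # Squared Mahalanobis distance of every gallery item to the query.
    d2 = np.einsum('nd,de,ne->n', diffs, X, diffs)
    top = np.argsort(d2)[:top_n]
    return float(np.mean(gallery_is_face[top]))

gallery = np.array([[0.1], [0.2], [5.0], [0.3]])
gallery_is_face = np.array([True, False, True, True])
query = np.array([0.0])
X = np.array([[1.0]])               # identity metric for the toy example

acc_top3 = retrieval_accuracy(query, gallery, gallery_is_face, X, top_n=3)
```

In the toy gallery the three nearest items contain two faces, so the accuracy at the top three is two thirds.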
Table II: Test error rates (%) of the compared methods.

| Data set | Euclidean | LMNN | SDPMetric-S | SDPMetric-H |
|---|---|---|---|---|
| ORLface | 6.00 (3.46) | 5.00 (2.36) | 4.75 (2.36) | 4.25 (2.97) |
| Twin-Peaks | 1.03 (0.21) | 0.90 (0.19) | 1.17 (0.20) | 0.79 (0.19) |
| Wine | 24.62 (5.83) | 3.85 (2.72) | 3.46 (2.69) | 3.08 (2.31) |
| Balance | 19.14 (1.59) | 14.19 (4.12) | 9.78 (3.17) | 10.32 (3.44) |
| Vehicle | 28.41 (2.41) | 21.59 (2.71) | 21.67 (4.00) | 20.87 (2.97) |
| Breast-Cancer | 4.51 (1.49) | 4.71 (1.61) | 3.33 (1.40) | 2.94 (0.88) |
| Diabetes | 28.00 (2.84) | 27.65 (3.45) | 28.70 (3.67) | 27.64 (3.71) |
| Face-Background | 26.41 (2.72) | 14.71 (1.33) | 16.75 (1.72) | 15.86 (1.37) |
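Table III: Computational time of the compared methods.

| Data set | LMNN | SDPMetric-S | SDPMetric-H |
|---|---|---|---|
| Twin-Peaks | 595s | less than 1s | less than 1s |
| Balance | 7s | less than 1s | 2s |
| Diabetes | 10s | less than 1s | 2s |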
We have proposed a new algorithm that efficiently learns a Mahalanobis distance metric using the principle of margin maximization. Building on the theorem on p.s.d. matrix decomposition, we have designed a gradient descent method that updates the Mahalanobis matrix at a low computational cost while maintaining the p.s.d. property of the learned matrix throughout the optimization process. Experiments on benchmark data sets and the retrieval problem verify the competitive classification performance and the computational efficiency of the proposed distance metric learning algorithm.
The proposed algorithm may be used to solve more general SDP problems in machine learning. To look for other applications is one of the future research directions.
- [1] B. Borchers, “CSDP, a C library for semidefinite programming,” Optim. Methods and Softw., vol. 11, no. 1, pp. 613–623, 1999.
- [2] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
- [3] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” 2001. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/
- [4] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
- [5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proc. Int. Conf. Mach. Learn., Corvallis, Oregon, June 2007, pp. 209–216.
- [6] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories,” in Workshop on Generative-Model Based Vision, in conjunction with IEEE Conf. Comp. Vis. Patt. Recogn., Washington, D.C., July 2004.
- [7] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, Canada, Dec. 2005.
- [8] M. Grant, S. Boyd, and Y. Ye, “CVX user’s guide: for CVX version 1.1,” Stanford University, Tech. Rep., 2007. [Online]. Available: http://www.stanford.edu/~boyd/cvx/
- [9] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Feb. 2002.
- [10] B. Kulis, M. A. Sustik, and I. S. Dhillon, “Low-rank kernel learning with Bregman matrix divergences,” J. Mach. Learn. Res., vol. 10, pp. 341–376, 2009.
- [11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” J. Mach. Learn. Res., vol. 5, no. 1, pp. 27–72, Dec. 2004.
- [12] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE Int. Conf. Comp. Vis., vol. 2, Kerkyra, Greece, Sept. 1999, pp. 1150–1157.
- [13] K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” Int. J. Comp. Vis., vol. 60, no. 1, pp. 63–86, 2004.
- [14] D. Newman, S. Hettich, C. Blake, and C. Merz, “UCI repository of machine learning databases,” 1998. [Online]. Available: http://archive.ics.uci.edu/ml/
- [15] J. Nocedal and S. J. Wright, Numerical Optimization. New York: Springer Verlag, 1999.
- [16] R. Rosales and G. Fung, “Learning sparse metrics via linear programming,” in Proc. ACM Int. Conf. Knowledge Discovery & Data Mining, Philadelphia, PA, USA, Aug. 2006, pp. 367–373.
- [17] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, Dec. 2000.
- [18] C. Shen, A. Welsh, and L. Wang, “PSDBoost: Matrix-generation linear programming for positive semidefinite matrices learning,” in Proc. Adv. Neural Inf. Process. Syst. Vancouver, Canada: MIT Press, Dec. 2008, pp. 1473–1480.
- [19] D. C. Sorensen, “Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations,” Tech. Rep., 1996.
- [20] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones (updated for version 1.05),” Optim. Methods and Softw., vol. 11-12, pp. 625–653, 1999.
- [21] R. H. Tütüncü, K. C. Toh, and M. J. Todd, “Solving semidefinite-quadratic-linear programs using SDPT3,” Math. Program., vol. 95, no. 2, pp. 189–217, 2003.
- [22] V. Vapnik, Statistical Learning Theory. New York: John Wiley and Sons Inc., 1998.
- [23] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, Canada, Dec. 2006, pp. 1475–1482.
- [24] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in Proc. IEEE Int. Conf. Comp. Vis., vol. 2, Beijing, China, Oct. 2005, pp. 1800–1807.
- [25] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Proc. Adv. Neural Inf. Process. Syst. Vancouver, Canada: MIT Press, Dec. 2003, pp. 505–512.
- [26] L. Yang, R. Sukthankar, and S. C. H. Hoi, “A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, Jan. 2010.