Nonlinear Metric Learning for kNN and SVMs through Geometric Transformations
In recent years, research efforts to extend linear metric learning models to handle nonlinear structures have attracted great interest. In this paper, we propose a novel nonlinear solution that uses deformable geometric models to learn spatially varying metrics, and apply the strategy to boost the performance of both NN and SVM classifiers. Thin-plate splines (TPS) are chosen as the geometric model due to their remarkable versatility and representation power in accounting for high-order deformations. By transforming the input space through TPS, we can pull same-class neighbors closer while pushing different-class points farther away in NN, as well as make the input data points more linearly separable in SVMs. Improvements in the performance of NN classification are demonstrated through experiments on synthetic and real-world datasets, with comparisons made against several state-of-the-art metric learning solutions. Our SVM-based models also achieve significant improvements over traditional linear and kernel SVMs on the same datasets.
1 Introduction
Many machine learning and data mining algorithms rely on Euclidean metrics to compute pairwise dissimilarities, which assign equal weight to each feature component. Replacing the Euclidean metric with one learned from the inputs can often significantly improve the performance of the algorithms [1, 2]. Based on the form of the learned metric, metric learning (ML) algorithms can be categorized into linear and nonlinear groups. Linear models [3, 4, 5, 6, 7, 8] commonly try to estimate a “best” affine transformation to deform the input space, such that the resulting Mahalanobis distance agrees well with the supervisory information carried by the training samples. Many early works focused on linear methods because they are easy to use, convenient to optimize and less prone to overfitting. However, when handling data with nonlinear structures, linear models show inherently limited expressive power and separation capability: highly nonlinear multi-class boundaries often cannot be modeled well by a single Mahalanobis distance metric.
Generalizing linear models to nonlinear cases has gained steam in recent years, and such extensions have been pushed forward mainly along the kernelization [9, 10, 11] and localization [12, 13, 14, 15, 16] directions. The idea of kernelization [9, 10] is to embed the input features into a higher-dimensional space, with the hope that the transformed data will be more linearly separable in the new space. While kernelization may dramatically improve the performance of linear methods for many highly nonlinear problems, solutions in this group are prone to overfitting, and their use is inherently limited by the sizes of the kernel matrices. Localization approaches focus on combining multiple local metrics, which are learned based on either local neighborhoods or class memberships. The granularity levels of the neighborhoods vary from per-partition [13, 14] and per-class to per-exemplar [15, 16]. A different strategy is adopted in the GB-LMNN method [18], which learns a global nonlinear mapping by iteratively adding nonlinear components onto a linear metric. At each iteration, a regression tree of a given depth splits the input space into axis-aligned regions, and points falling into the regions are shifted in different directions. While the localization strategies are usually more powerful in accommodating nonlinear structures, generalizing these methods to fit classifiers other than NN is not trivial. To avoid non-symmetric metrics, extra care is commonly needed to ensure the smoothness of the transformed input space. In addition, estimating geodesic distances and group statistics on such metric manifolds is often computationally expensive.
Most of the existing ML solutions are designed based on pairwise distances, and are therefore best suited to improving nearest neighbor (NN) based algorithms, such as $k$-NN and $k$-means. Typically, a two-step procedure is involved: a best metric is first estimated from training samples, and the learned metric is then applied to the ensuing classification or clustering algorithm. Since learning a metric is equivalent to learning a feature transformation, metric learning can also be applied to SVM models, either as a preprocessing step or as an input space transformation [19, 20, 21]. In [19], Xu et al. studied both approaches and found that applying linear transformations to the input samples outperformed three state-of-the-art linear ML models used as preprocessing steps for SVMs. Several other transformation-based models [20, 21] have also reported improved classification accuracies over the standard linear and kernel SVMs. However, all of these models employ linear transformations, which limits their capabilities in dealing with complex data.
In light of the aforementioned limitations and drawbacks of the existing models, we propose a new nonlinear remedy in this paper. Our solution is a direct generalization of linear metric learning through the application of deformable geometric models to transform the entire input space. In this study, we choose thin-plate splines (TPS) as the transformation model, a choice that balances computational efficiency against richness of description. TPS are well known for their remarkable versatility and representation power in accounting for high-order deformations. We have designed TPS-based ML solutions for both NN and SVM classifiers, which are presented in the next two sections. To the best of our knowledge, this is the first work that uses nonlinear dense transformations, or spatially varying deformation models, in metric learning. Our experimental results on synthetic and real data demonstrate the effectiveness of the proposed methods.
2 Nonlinear Metric Learning for Nearest Neighbor
Many linear metric learning models are formulated under the nearest neighbor (NN) paradigm, with the shared goal that the estimated transformation pulls similar data points closer while pushing dissimilar points apart. Our nonlinear ML model for NN is designed with the same idea. However, instead of using a single linear transformation, we deform the input space nonlinearly through powerful radial basis functions, namely thin-plate splines (TPS). With TPS, nonlinear metrics are computed globally, with smoothness ensured across the entire data space. As in linear models, the learned pairwise distance is simply the Euclidean distance after the nonlinear projection of the data through the estimated TPS transformation.
In this section, the pioneering Mahalanobis metric learning for clustering (MMC) method proposed by Xing et al. [3] is used as the platform to formulate our nonlinear ML solution for NN. We therefore briefly review the concept of MMC first. We then describe the theoretical background of TPS in the general context of transformations, followed by the presentation of our proposed model.
2.1 Linear Metric Learning and MMC
Given a set of training data instances $\mathcal{X} = \{x_i \mid x_i \in \mathbb{R}^d,\ i = 1, \ldots, n\}$, where $n$ is the number of training samples and $d$ is the number of features of each data instance, the goal of ML is to learn a “better” metric function for the problem of interest from the information carried by the training samples. The Mahalanobis metric is one of the most popular metric functions used in existing ML algorithms [4, 5, 8, 7, 22, 13], and is defined by $D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$. The control parameter $M$ is a $d \times d$ square matrix. In order to qualify as a valid (pseudo-)metric, $M$ has to be positive semi-definite (PSD), denoted as $M \succeq 0$. As a PSD matrix, $M$ can be decomposed as $M = L^T L$, where $L \in \mathbb{R}^{k \times d}$ and $k$ is the rank of $M$. Then, $D_M(x_i, x_j)$ can be rewritten as follows:

$$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T L^T L (x_i - x_j)} = \| L (x_i - x_j) \|_2. \qquad (1)$$
Eqn. (1) explains why learning a Mahalanobis metric is equivalent to learning a linear transformation function and computing the Euclidean distance over the transformed data domain.
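The equivalence between a Mahalanobis distance and a Euclidean distance after a linear map can be checked numerically. The following is a minimal sketch in plain Python; the matrix `L`, the points `x` and `y`, and all function names are hypothetical and chosen only for illustration.

```python
import math

def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def gram(L):
    """Return M = L^T L for a matrix L given as a list of rows."""
    cols = list(zip(*L))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a PSD matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d * m for d, m in zip(diff, mat_vec(M, diff))))

def euclidean_after_transform(x, y, L):
    """Euclidean distance between the transformed points L x and L y."""
    Lx, Ly = mat_vec(L, x), mat_vec(L, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(Lx, Ly)))

L = [[2.0, 0.5], [0.0, 1.0]]      # hypothetical linear transformation
M = gram(L)                       # induced Mahalanobis matrix
x, y = [1.0, 2.0], [3.0, -1.0]    # two hypothetical samples
d_m = mahalanobis(x, y, M)
d_e = euclidean_after_transform(x, y, L)
```

Both `d_m` and `d_e` evaluate the same quantity, so they agree up to floating-point rounding.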
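With the side information embedded in the class-equivalent constraints $\mathcal{S}$ (pairs of samples from the same class) and class-nonequivalent constraints $\mathcal{D}$ (pairs of samples from different classes), the MMC model formulates the ML problem as the following convex programming problem:

$$\min_{M} \sum_{(x_i, x_j) \in \mathcal{S}} D_M^2(x_i, x_j) \quad \text{s.t.} \quad M \succeq 0, \quad \sum_{(x_i, x_j) \in \mathcal{D}} D_M(x_i, x_j) \geq 1, \qquad (2)$$

where $D_M$ denotes the Mahalanobis distance parameterized by the matrix $M$.
The objective function aims at improving the subsequent NN based algorithms by minimizing the sum of squared distances between similar training data, while keeping the sum of distances between dissimilar ones large. Note that, besides the PSD constraint on $M$, an additional constraint on the training samples in $\mathcal{D}$ is needed to avoid trivial solutions of the optimization. To solve this optimization problem, the projected gradient descent method is used, which projects the estimated matrix back onto the PSD cone whenever necessary.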
2.2 Thin-Plate Splines (TPS)
Thin-plate splines (TPS) are the high-dimensional analogs of cubic splines in one dimension, and have been widely used as an interpolation tool in data approximation, surface reconstruction, shape alignment, etc. When used to align a set of $n$ corresponding point pairs $u_i$ and $v_i$ ($i = 1, \ldots, n$), a TPS transformation is a mapping function $f$, within a suitable Hilbert space $\mathcal{H}$, that matches $u_i$ and $v_i$ while minimizing a smoothness TPS penalty functional (given in Eqn. (3)).
Typically, the problem of finding $f$ can be decomposed into $d$ interpolation problems, finding the component thin-plate splines $f_k$, $k = 1, \ldots, d$, separately. Suppose the unknown interpolation function $f_k$ belongs to the Sobolev space $H^m(\Omega)$, where $m$ is an unknown positive integer and $\Omega$ is an open subset of $\mathbb{R}^d$. TPS transformations minimize a smoothness penalty functional of the following general form:

$$J_m^d(f) = \int_{\mathbb{R}^d} \| D^m f \|^2 \, dX, \qquad (3)$$

where $D^m f$ is the matrix of $m$-th order partial derivatives of $f$, with $m$ being positive, and $dX = \prod_i dx_i$, where $x_i$ are the components of $x$. The penalty functional is the generalized form of the space integral of the squared second-order derivatives of the mapping function. We suppose that the mapping function $f \in D^{-m}L^2(\mathbb{R}^d)$, a space of functions whose partial derivatives of total order $m$ are in $L^2(\mathbb{R}^d)$. To have the evaluation functionals bounded in $D^{-m}L^2(\mathbb{R}^d)$, we need this space to be a reproducing kernel Hilbert space (r.k.h.s.), endowed with the seminorm $J_m^d(f)$. For this, it is necessary and sufficient that $m > d/2$. The null space of $J_m^d(f)$ consists of polynomial functions with maximum degree $m - 1$, so the dimension of this null space is $N = \binom{d + m - 1}{d}$.
The main problem of TPS is that $N$, the dimension of the null space, increases exponentially with $d$ due to the requirement $m > d/2$. To solve this problem, Duchon [23] proposed to replace $J_m^d(f)$ by its weighted squared norm in Fourier space. Since the Fourier transform, denoted as $\mathcal{F}(\cdot)$, is isometric, the penalty functional $J_m^d(f)$ can be replaced by its squared norm in Fourier space:

$$\int_{\mathbb{R}^d} \| D^m f(x) \|^2 \, dX = \int_{\mathbb{R}^d} \| \mathcal{F}(D^m f)(\tau) \|^2 \, d\tau. \qquad (4)$$

By adding a weighting function, Duchon introduced a new penalty functional to solve the exponential growth problem of the dimension of the TPS null space, which is defined as

$$J_{m,s}^d(f) = \int_{\mathbb{R}^d} |\tau|^{2s} \, \| \mathcal{F}(D^m f)(\tau) \|^2 \, d\tau, \qquad (5)$$

provided that $m + s > d/2$ and $s < d/2$. One can then select an appropriate $s$ to obtain a lower dimension for the null space of $J_{m,s}^d$, with the maximum degree of the polynomial functions spanned in this null space being decreased to $1$.
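The classic solution of Eqn. (5) has a representation in terms of a radial basis function (TPS interpolation function),

$$f_k(x) = \sum_{i=1}^{n} \psi_i \, G(\| x - x_i \|) + \ell^T x + c, \qquad (6)$$

where $\| \cdot \|$ denotes the Euclidean norm and $\{\psi_i\}$ are a set of weights for the nonlinear part; $\ell$ and $c$ are the weights for the linear part. The corresponding radial distance kernel of TPS, which is the Green’s function to solve Eqn. (5), is as follows:

$$G(x, x_i) = G(\| x - x_i \|) = \begin{cases} \| x - x_i \|^{4-d} \ln \| x - x_i \|, & \text{if } d = 2 \text{ or } d = 4, \\ \| x - x_i \|^{4-d}, & \text{otherwise.} \end{cases} \qquad (7)$$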
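As a hedged illustration of a TPS radial kernel, the sketch below assumes the biharmonic form $r^{4-d} \ln r$ for $d = 2$ or $d = 4$ and $r^{4-d}$ otherwise, with $r$ the Euclidean distance to an anchor point. The function name `tps_kernel` and the convention of returning 0 at $r = 0$ are assumptions made here for illustration, not taken from the paper.

```python
import math

def tps_kernel(x, anchor):
    """Radial TPS kernel G(||x - anchor||) in d = len(x) dimensions.

    Assumed form: r^(4-d) * ln(r) if d is 2 or 4, r^(4-d) otherwise.
    """
    d = len(x)
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, anchor)))
    if r == 0.0:
        # Convention (an assumption): return 0 at the anchor itself, which
        # is the limit for d = 2 and avoids singular values for d >= 4.
        return 0.0
    if d in (2, 4):
        return r ** (4 - d) * math.log(r)
    return r ** (4 - d)

# In 2-D the kernel reduces to the familiar r^2 * ln(r).
value_2d = tps_kernel([math.e, 0.0], [0.0, 0.0])
# In 3-D it reduces to r.
value_3d = tps_kernel([3.0, 4.0, 0.0], [0.0, 0.0, 0.0])
```

The value depends on `x` only through its distance to the anchor, which is the defining property of a radial basis function.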
2.3 TPS Metric Learning for Nearest Neighbor (TML-NN)
The TPS transformation for point interpolation, as specified in Eqn. (6), can be employed as the geometric model to deform the input space for nonlinear metric learning. Such a transformation ensures a certain desired smoothness, as it minimizes the bending energy in Eqn. (3). Within the metric learning setting, let $x$ be one of the training samples in the original feature space of $d$ dimensions, and $f(x)$ be the transformed destination of $x$, also of $d$ dimensions. Through straightforward mathematical manipulation, we can write $f(x)$ in matrix form:

$$f(x) = L x + \Psi \, \vec{G}(x), \qquad \vec{G}(x) = [G(x, x_1), \ldots, G(x, x_p)]^T, \qquad (8)$$

where $L$ (size $d \times d$) is a linear transformation matrix, corresponding to $L$ in the Mahalanobis metric, $\Psi$ (size $d \times p$) is the weight matrix for the nonlinear parts, and $p$ is the number of anchor points ($x_1, \ldots, x_p$) used to compute the TPS kernel. In principle, all the training data points can be used as anchor points. In practice, however, to limit the computational cost, a smaller set of anchor points describing the whole input space is extracted by a method such as k-medoids.
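A transformation with a linear part and a kernel-weighted nonlinear part can be sketched as follows. This is a minimal illustration assuming the form $f(x) = Lx + \Psi g(x)$, where $g(x)$ stacks the kernel values between $x$ and each anchor point; the kernel form, the names `tps_kernel` and `tps_transform`, and all numeric values are assumptions for illustration only.

```python
import math

def tps_kernel(x, anchor):
    """Assumed TPS kernel: r^(4-d) * ln(r) for d in {2, 4}, else r^(4-d)."""
    d = len(x)
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, anchor)))
    if r == 0.0:
        return 0.0  # convention at the anchor itself (an assumption)
    return r ** (4 - d) * math.log(r) if d in (2, 4) else r ** (4 - d)

def tps_transform(x, L, Psi, anchors):
    """Return f(x) = L x + Psi g(x), with g(x)_i = G(||x - anchors[i]||).

    L is d x d and Psi is d x p, both given as lists of rows.
    """
    g = [tps_kernel(x, a) for a in anchors]
    linear = [sum(l * xi for l, xi in zip(row, x)) for row in L]
    nonlinear = [sum(w * gi for w, gi in zip(row, g)) for row in Psi]
    return [a + b for a, b in zip(linear, nonlinear)]

identity = [[1.0, 0.0], [0.0, 1.0]]
anchors = [[0.0, 0.0], [1.0, 0.0]]        # hypothetical anchor points
x = [0.0, 1.0]

# With zero nonlinear weights the map reduces to the linear part alone.
f_linear = tps_transform(x, identity, [[0.0, 0.0], [0.0, 0.0]], anchors)
# A nonzero weight on the second anchor bends the first output coordinate.
f_bent = tps_transform(x, identity, [[0.0, 1.0], [0.0, 0.0]], anchors)
```

Setting the nonlinear weights to zero recovers a purely linear (Mahalanobis-style) transformation, which is why this model can be read as a generalization of linear metric learning.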
The goal of our ML solution is likewise to pull samples of the same class closer to each other while pushing different classes further apart, directly through a TPS nonlinear transformation as described in Eqn. (8). This can be achieved through the following constrained optimization:

$$\min_{L, \Psi} \; J = \sum_{(x_i, x_j) \in \mathcal{S}} \| f(x_i) - f(x_j) \|^2 + \lambda \| \Psi \|_F^2$$
$$\text{s.t.} \quad \sum_{(x_i, x_j) \in \mathcal{D}} \| f(x_i) - f(x_j) \| \geq 1; \qquad \sum_{i=1}^{p} \Psi_i = 0; \qquad \sum_{i=1}^{p} \Psi_i \, x_i^k = 0, \;\; \forall k = 1, \ldots, d. \qquad (9)$$

Here $f$ is in the form of Eqn. (8); $\Psi_i$ is the $i$-th column of $\Psi$; $x_i^k$ is the $k$-th component of the anchor point $x_i$. Compared with MMC, another component $\| \Psi \|_F^2$, the squared Frobenius norm of $\Psi$, is added to the objective function as a regularizer to prevent overfitting. $\lambda$ is the weighting factor that controls the relative importance of the two components. As in MMC, the inequality constraint imposes a scaling control to avoid trivial solutions. The two equality constraints with respect to $\Psi$ ensure that the elastic part of the transformation is zero at infinity.
Due to the nonlinearity of TPS, it is difficult to solve this nonlinear constrained problem analytically. Instead, we use a gradient-based constrained optimization solver to find a local minimum.
To demonstrate the ability of TML-NN to handle nonlinear cases, we conducted an experiment similar to the one used for the GB-LMNN method [18]. Fig. 1(a) shows a synthetic dataset consisting of inputs sampled from two concentric circles (blue dots and red diamonds), each of which defines a different class membership. Global linear transformations in linear metric learning are not sufficient to improve the accuracy of NN classification on this dataset. In contrast, by using TPS to model the underlying nonlinear transformation, as shown in Fig. 1(b), we can easily enlarge the separation between the outer and inner circles, leading to an improved classification rate.
3 TPS Metric Learning for Support Vector Machines (TML-SVM)
In this section, we present how to generalize our TPS metric learning model to SVMs. As in prior work, we formulate our model under the Margin-Radius-Ratio bounded SVM paradigm, which generalizes traditional SVMs by bounding the estimation error. Given a training dataset together with its class label information, our proposed TML-SVM aims to simultaneously learn the nonlinear transformation described in Eqn. (8) and an SVM classifier, which can be formulated as follows:
The objective function combines the regularizer on the nonlinear weight matrix of the TPS transformation with the traditional soft margin SVM objective, balanced by two trade-off hyper-parameters. The first two inequality constraints (I and II) are the same as those used in traditional SVMs. The third inequality constraint (III) is a unit-enclosing-ball constraint, which forces the radius of the minimum enclosing ball to be one in the transformed space and avoids trivial solutions. The ball is centered at the center of all samples. In practice, the unit-enclosing-ball constraint can be simplified through a preprocessing step that centers the input data. The last two equality constraints are used to maintain the properties of the TPS transformation at infinity, as in Eqn. (9).
To solve this optimization problem, we propose an efficient EM-like iterative minimization algorithm that alternately updates the SVM classifier parameters and the TPS transformation parameters. With the transformation fixed, the transformed samples are explicit, and Eqn. (10) can be reformulated as:
This becomes exactly the primal form of soft margin SVMs, which can be solved by off-the-shelf SVM solvers. With the SVM classifier fixed, Eqn. (10) can be reformulated as:
By using the hinge loss function, we can eliminate the slack variables and reformulate Eqn. (12) as:
As the squared hinge loss function is differentiable, it is not difficult to differentiate the objective function with respect to the transformation parameters. As in solving Eqn. (9), we can also use a gradient-based optimizer to obtain a local minimum of Eqn. (13), with the gradient computed as:
To sum up, the optimal nonlinear transformation, along with the optimal SVM classifier coefficients, can be obtained by an EM-like iterative procedure, as described in Algorithm 1.
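The differentiability of the squared hinge loss, which the transformation update relies on, can be illustrated with a small sketch. It does not reproduce the paper's objective or its gradient; it only shows the loss for one transformed sample `z` and its gradient with respect to `z`, assuming a linear classifier with hypothetical weights `w` and bias `b`. A full gradient with respect to the transformation parameters would follow from the chain rule.

```python
def decision(z, w, b):
    """Linear decision value w . z + b."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

def squared_hinge(z, y, w, b):
    """Squared hinge loss max(0, 1 - y (w . z + b))^2 for label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision(z, w, b)) ** 2

def squared_hinge_grad_z(z, y, w, b):
    """Gradient of the squared hinge loss with respect to the sample z."""
    slack = max(0.0, 1.0 - y * decision(z, w, b))
    return [-2.0 * slack * y * wi for wi in w]

w, b = [1.0, 2.0], 0.1            # hypothetical classifier coefficients
z, y = [0.2, -0.1], 1.0           # hypothetical transformed sample and label
loss = squared_hinge(z, y, w, b)
grad = squared_hinge_grad_z(z, y, w, b)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
numeric = []
for k in range(len(z)):
    zp = list(z); zp[k] += eps
    zm = list(z); zm[k] -= eps
    numeric.append((squared_hinge(zp, y, w, b) - squared_hinge(zm, y, w, b)) / (2 * eps))
```

Unlike the plain hinge loss, the squared version has a continuous derivative at the margin, so a gradient-based solver does not need subgradients.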
3.1 Kernelization of TML-SVM
TML-SVM can be kernelized through a kernel principal component analysis (KPCA) based framework, as introduced in [28, 11]. Unlike the traditional kernel trick [29], which often involves the derivation of new mathematical formulas, the KPCA-based framework provides an alternative that can directly use the original linear models. Typically, it consists of two simple stages: first, map the input data into a kernel feature space induced by KPCA; then, train the linear model in this kernel space. Proved to be equivalent to the traditional kernel trick, the KPCA-based framework also provides a convenient way to speed up a learner if a low-rank KPCA is used. Through this procedure, kernelized TML-SVM can be easily realized by directly applying Algorithm 1 in the mapped KPCA space. For more details about this KPCA-based approach, we refer readers to [28, 11].
4 Experimental Results
In this section, we present evaluation and comparison results of applying our proposed TPS-based nonlinear ML methods to seven widely used datasets from the UCI machine learning repository. The details of these datasets are summarized in the leftmost column of Table 1. The three numbers inside the brackets indicate the data size, feature dimension, and number of classes of the corresponding dataset. All datasets have been preprocessed through normalization. To demonstrate the effectiveness of our proposed nonlinear metric learning method, we first choose the NN method as the baseline classifier, and compare the improvements made by TML-NN against five state-of-the-art NN based metric learning methods; then, similar experiments are conducted to show the improvements made by our proposed TML-SVM over traditional SVMs.
4.1 Comparisons with NN based ML solutions
The first set of experiments is within the Nearest Neighbor (NN) category. We fix the number of neighbors in NN, and the five competing metric learning methods are: Large Margin Nearest Neighbor classification (LMNN) [7], Information-Theoretic Metric Learning (ITML) [6], Neighborhood Components Analysis (NCA) [4], GB-LMNN [18] and Parametric Local Metric Learning (PLML) [16]. The hyper-parameters of NCA, ITML, LMNN and GB-LMNN are set by following [4, 6, 7, 18] respectively. PLML has a number of hyper-parameters, so we follow the suggestion of [16]: use 3-fold CV to select one hyper-parameter from a candidate set, and set the other hyper-parameters to their defaults. In our TML-NN model, there are two hyper-parameters: the number of anchor points and the weighting factor. The number of anchor points is empirically set to a fixed fraction of the training samples; the weighting factor is selected through CV from a candidate set.
To better compare the classification performance, we run the experiment 100 times with different random 3-fold splits of each dataset, two folds for training and one for testing. Furthermore, we conduct a pairwise Student’s $t$-test with a $p$-value threshold of 0.05 among the seven methods for each dataset. Then, a ranking scheme is used to evaluate the relative performance of these algorithms: a method A is assigned 1 point if it has a statistically significantly better accuracy than another method B, 0.5 points if there is no significant difference, and 0 points if A performs significantly worse than B. The experimental results averaged over the 100 runs are reported in Table 1.
From Table 1, we can see that TML-NN outperforms the other six methods in a statistically significant manner in terms of total score. Furthermore, it significantly improves on the baseline method, NN, on six of the seven datasets, and performs equally well on the seventh (“Sonar”). It is also worth pointing out that the proposed nonlinear TML-NN has no losses in the 21 comparisons (three methods on seven datasets) against the linear ML solutions (LMNN, ITML, NCA); against the local nonlinear ML solutions (PLML, GB-LMNN), TML-NN has five wins and only two losses out of the 14 comparisons.
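The point-based ranking described above can be sketched in a few lines. This is an illustrative sketch only: it assumes a paired form of the t statistic and takes the critical value `t_crit` as an input rather than computing it from a p-value, and the method names and accuracies are made up.

```python
import math

def paired_t(a, b):
    """Paired t statistic for two equally long lists of per-run accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def rank_scores(results, t_crit):
    """Score each method: 1 per significant win, 0.5 per tie, 0 per loss."""
    scores = {m: 0.0 for m in results}
    for a in results:
        for b in results:
            if a == b:
                continue
            t = paired_t(results[a], results[b])
            if t > t_crit:
                scores[a] += 1.0      # a is significantly better than b
            elif t >= -t_crit:
                scores[a] += 0.5      # no significant difference
    return scores

# Made-up accuracies over four runs for three hypothetical methods.
results = {
    "A": [0.90, 0.92, 0.91, 0.93],
    "B": [0.80, 0.81, 0.79, 0.82],
    "C": [0.90, 0.93, 0.90, 0.93],
}
# 3.182 is the two-sided 5% critical value of Student's t with 3 degrees
# of freedom (four paired runs).
scores = rank_scores(results, t_crit=3.182)
```

On one dataset with seven methods, each method takes part in six comparisons, so its score on that dataset lies between 0 and 6.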
4.2 Improvements over SVMs
To verify the effectiveness of our proposed nonlinear metric learning for SVMs, we conduct another set of experiments on the same seven UCI datasets to compare the following four SVM models: linear SVM, kernel SVM, our proposed TML-SVM and kernel TML-SVM. For linear SVM, we directly use the off-the-shelf LIBSVM solver [30], for which the slackness coefficient is tuned through 3-fold CV over a candidate set. For kernel SVM, we choose the Gaussian kernel and select the kernel width through CV from a candidate set defined relative to the mean distance between an input data point and its nearest neighbor. TML-SVM has three hyper-parameters to be tuned: the number of anchor points and the two trade-off coefficients. The number of anchor points is again empirically set to a fixed fraction of the training samples; the two trade-off coefficients are selected through CV from their respective candidate sets. In kernel TML-SVM, we use the same Gaussian kernel width selected for kernel SVM on each dataset, and tune the other parameters as in TML-SVM. To deal with multi-class classification, we apply the “one-against-one” strategy on top of binary TML-SVM and kernel TML-SVM, the same as used in LIBSVM.
We adopt the same experimental setting and statistical ranking scheme as in the NN based classification, and report the results averaged over 100 runs in Table 2. It is evident that incorporating our proposed nonlinear metric learning significantly improves the performance of both linear SVM and kernel SVM. Specifically, TML-SVM outperforms linear SVM on all seven datasets; kernel TML-SVM also fares better than kernel SVM on all seven datasets. Furthermore, it is worth pointing out that TML-SVM significantly improves on linear SVM’s classification rates, performing better than or comparably to kernel SVM on five datasets (“Iris”, “Wine”, “Breast”, “Diabetes”, and “Liver”).
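The reference quantity for the kernel width, the mean distance from each input point to its nearest neighbor, can be computed as in the sketch below. The Gaussian kernel parameterization $\exp(-\|x - y\|^2 / (2\sigma^2))$ and all names here are assumptions for illustration; the candidate widths actually searched are not reproduced.

```python
import math

def mean_nn_distance(points):
    """Mean Euclidean distance from each point to its nearest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) (assumed form)."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

points = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]   # hypothetical inputs
sigma0 = mean_nn_distance(points)               # base width: (1 + 1 + 2) / 3
k_value = gaussian_kernel(points[0], points[1], sigma0)
```

Tying the search range to this data-dependent scale keeps the candidate widths meaningful across datasets with different feature ranges.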
5 Conclusions
In this paper, we present two nonlinear metric learning solutions, for NN and SVMs respectively, based on geometric transformations. The novelty of our approaches lies in the fact that they generalize the linear or piecewise linear transformations in traditional metric learning solutions to a globally smooth nonlinear deformation of the input space. The geometric model used in this paper is thin-plate splines, and it can be extended to other radial distance functions. Exploring other types of geometric models from the perspective of conditionally positive definite kernels is a direction for our future efforts. We are also interested in investigating a more efficient numerical optimization scheme (or an analytic form) for the proposed TPS-based methods.
- We use an SQP-based constrained optimizer, “fmincon”, in the Matlab Optimization Toolbox to solve Eqn. (9).
- We also use “fmincon” in Matlab to solve Eqn. (13). In practice, convergence of the second inner step is not necessary, so we use an early stopping strategy to speed up the whole algorithm.
References
[1] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
[2] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2006.
[3] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. NIPS’02, 2002.
[4] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. NIPS’04, 2004.
[5] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. 2003.
[6] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. 2007.
[7] Kilian Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems, 18:1473, 2006.
[8] Amir Globerson and Sam Roweis. Metric learning by collapsing classes. In NIPS, volume 18, pages 451–458, 2005.
[9] Lorenzo Torresani and Kuang-chih Lee. Large margin component analysis. NIPS’07, 2007.
[10] James T Kwok and Ivor W Tsang. Learning with idealized kernels. In ICML, pages 400–407, 2003.
[11] Ratthachat Chatpatanasiri, Teesid Korsrilabutr, Pasakorn Tangchanachaianan, and Boonserm Kijsirikul. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73:1570–1579, 2010.
[12] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[13] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 906–913. IEEE, 2011.
[14] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Trans. Pattern Anal. Mach. Intell., 33(4):794–806, 2011.
[15] Yung-Kyun Noh, Byoung-Tak Zhang, and Daniel D Lee. Generative local metric learning for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1822–1830, 2010.
[16] Jun Wang, Alexandros Kalousis, and Adam Woznica. Parametric local metric learning for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1601–1609, 2012.
[17] Yujie He, Wenlin Chen, Yixin Chen, and Yi Mao. Kernel density metric learning. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 271–280. IEEE, 2013.
[18] Dor Kedem, Stephen Tyree, Fei Sha, Gert R Lanckriet, and Kilian Q Weinberger. Non-linear metric learning. In Advances in Neural Information Processing Systems, pages 2573–2581, 2012.
[19] Zhixiang Xu, Kilian Q Weinberger, and Olivier Chapelle. Distance metric learning for kernel machines. arXiv preprint arXiv:1208.3422, 2012.
[20] Xiaoqiang Zhu, Pinghua Gong, Zengshun Zhao, and Changshui Zhang. Learning similarity metric with SVM. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–8. IEEE, 2012.
[21] Xiaohe Wu, Wangmeng Zuo, Yuanyuan Zhu, and Liang Lin. F-SVM: combination of feature transformation and SVM learning via convex relaxation. CoRR, abs/1504.05035, 2015.
[22] Steven CH Hoi, Wei Liu, Michael R Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In CVPR’06, 2006.
[23] Jean Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive theory of functions of several variables, pages 85–100. Springer, 1977.
[24] Grace Wahba. Spline models for observational data, volume 59. SIAM, 1990.
[25] Haili Chui and Anand Rangarajan. A new point matching algorithm for non-rigid registration. CVIU, 89(2–3):114–141, 2003.
[26] Karl Rohr, H Siegfried Stiehl, Rainer Sprengel, Thorsten M Buzug, Jürgen Weese, and MH Kuhn. Landmark-based elastic registration using approximating thin-plate splines. Medical Imaging, IEEE Transactions on, 20(6):526–534, 2001.
[27] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature selection for SVMs. In NIPS, volume 12, pages 668–674, 2000.
[28] Changshui Zhang, Feiping Nie, and Shiming Xiang. A general kernelization framework for learning algorithms based on kernel PCA. Neurocomputing, 73(4):959–967, 2010.
[29] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[30] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.