Iterated Support Vector Machines for Distance Metric Learning

Wangmeng Zuo, Faqiang Wang, David Zhang, Liang Lin, Yuchi Huang, Deyu Meng, and Lei Zhang
W. Zuo and F. Wang are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China (e-mail: cswmzuo@gmail.com; tshfqw@163.com). D. Zhang and L. Zhang are with the Department of Computing, the Hong Kong Polytechnic University, Kowloon, Hong Kong (e-mail: csdzhang@comp.polyu.edu.hk; cslzhang@comp.polyu.edu.hk). L. Lin is with the School of Super-computing, Sun Yat-Sen University, Guangzhou, 510275, China (e-mail: linliang@ieee.org). Y. Huang is with NEC Laboratories China, Beijing, 100084, China (e-mail: huang_yuchi@nec.cn). D. Meng is with the Institute of Information and System Sciences, Faculty of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, China (e-mail: dymeng@mail.xjtu.edu.cn). Manuscript received XXX; revised XXX.
Abstract

Distance metric learning aims to learn from the given training data a valid distance metric, with which the similarity between data samples can be more effectively evaluated for classification. Metric learning is often formulated as a convex or nonconvex optimization problem, but many existing metric learning algorithms become inefficient for large-scale problems. In this paper, we formulate metric learning as a kernel classification problem and solve it by iterated training of support vector machines (SVM). The new formulation is easy to implement, efficient in training, and tractable for large-scale problems. Two novel metric learning models, namely Positive-semidefinite Constrained Metric Learning (PCML) and Nonnegative-coefficient Constrained Metric Learning (NCML), are developed. Both PCML and NCML can guarantee the global optimality of their solutions. Experimental results on UCI dataset classification, handwritten digit recognition, face verification and person re-identification demonstrate that the proposed metric learning methods achieve higher classification accuracy than state-of-the-art methods while being significantly more efficient in training.

metric learning, support vector machine, kernel method, Lagrange duality, alternating optimization

1 Introduction

Distance metric learning aims to train a valid distance metric which can enlarge the distances between samples of different classes and reduce the distances between samples of the same class [1]. Metric learning is closely related to $k$-Nearest Neighbor ($k$-NN) classification [2], clustering [3], ranking [4, 5], feature extraction [6] and support vector machine (SVM) [7], and has been widely applied to face recognition [8], person re-identification [9, 10], image retrieval [11, 12], activity recognition [13], document classification [14], and link prediction [15], etc. One popular metric learning approach is Mahalanobis distance metric learning, which learns a linear transformation matrix $\mathbf{L}$ or a metric matrix $\mathbf{M}$ from the training data. Given two samples $\mathbf{x}_i$ and $\mathbf{x}_j$, the Mahalanobis distance between them is defined as:

$d_{\mathbf{M}}^2(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^{T} \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)$   (1)

To satisfy the nonnegativity property of a distance metric, $\mathbf{M}$ should be positive semidefinite (PSD). According to whether $\mathbf{L}$ or $\mathbf{M}$ is learned, Mahalanobis distance metric learning methods can be grouped into two categories. Methods that learn $\mathbf{L}$, including neighborhood components analysis (NCA) [16], large margin components analysis (LMCA) [17] and neighborhood repulsed metric learning (NRML) [18], are mostly formulated as nonconvex optimization problems, which are solved by gradient descent based optimizers. Taking the PSD constraint into account, methods that learn $\mathbf{M}$, including large margin nearest neighbor (LMNN) [19] and maximally collapsing metric learning (MCML) [20], are mostly formulated as convex semidefinite programming (SDP) problems, which can be optimized by standard SDP solvers [19], projected gradient [3], Boosting-like [21], or Frank-Wolfe [22] algorithms. Davis et al. [23] proposed an information-theoretic metric learning (ITML) model with an iterative Bregman projection algorithm, which does not need projections onto the PSD cone. Besides, the use of online solvers for metric learning has been discussed in [24, 9, 25].
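For illustration (the matrix $\mathbf{L}$ below is randomly generated purely for demonstration and is not a learned transformation), the following minimal sketch shows that the parameterization $\mathbf{M} = \mathbf{L}^{T}\mathbf{L}$ yields a PSD matrix by construction and that the two forms of the squared Mahalanobis distance coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

# Learning L: M = L^T L is PSD by construction (L is a stand-in here, not a learned matrix).
L = rng.normal(size=(d, d))
M = L.T @ L

diff = x_i - x_j
dist_sq = diff @ M @ diff             # squared Mahalanobis distance of Eq. (1)
dist_sq_L = np.sum((L @ diff) ** 2)   # equivalent form using the linear transformation L
assert np.isclose(dist_sq, dist_sq_L)
print(np.all(np.linalg.eigvalsh(M) >= -1e-10))  # M is (numerically) PSD
```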

On the other hand, kernel methods [26, 27, 28, 29, 30, 31] have been widely studied in many learning tasks, e.g., semi-supervised learning, multiple instance learning, multitask learning, etc. Kernel learning methods, such as support vector machine (SVM), exhibit good generalization performance. There are many open resources on kernel classification methods, and a variety of toolboxes and libraries have been released [32, 33, 34, 35, 36, 37, 38]. It is thus important to investigate the connections between metric learning and kernel classification and explore how to utilize the kernel classification resources in the research and development of metric learning methods.

Abbreviation Full Name
PSD Positive semidefinite (matrix)
SDP Semidefinite programming
k-NN k-nearest neighbor (classification)
KKT Karush-Kuhn-Tucker (condition)
SVM Support vector machine
LMCA[17] Large margin components analysis
LMNN[2] Large margin nearest neighbor
NCA[16] Neighborhood components analysis
MCML[20] Maximally collapsing metric learning
ITML[23] Information-theoretic metric learning
LDML[8] Logistic discriminant metric learning
DML-eig[22] Distance metric learning with eigenvalue optimization
PLML[39] Parametric local metric learning
KISSME[9] Keep it simple and straightforward metric learning
PCML Positive-semidefinite constrained metric learning
NCML Nonnegative-coefficient constrained metric learning
TABLE I: Summary of main abbreviations

In this paper, we propose a novel formulation of metric learning by casting it as a kernel classification problem, which allows us to effectively and efficiently learn distance metrics by iterated training of SVM. Off-the-shelf SVM solvers such as LibSVM [33] can be employed to solve the metric learning problem. Specifically, we propose two novel methods to bridge metric learning with the well-developed SVM techniques, and they are easy to implement. First, we propose a Positive-semidefinite Constrained Metric Learning (PCML) model, which can be solved by iterating between PSD projection and dual SVM learning. Second, by re-parameterizing the matrix $\mathbf{M}$, we transform the PSD constraint into a nonnegative coefficient constraint and consequently propose a Nonnegative-coefficient Constrained Metric Learning (NCML) model, which can be solved by iterated learning of two SVMs. Both PCML and NCML have globally optimal solutions, and our extensive experiments on UCI dataset classification, handwritten digit recognition, face verification and person re-identification clearly demonstrate their effectiveness.

The remainder of this paper is organized as follows. Section 2 reviews the related works. Section 3 presents the PCML model and the optimization algorithm. Section 4 presents the model and algorithm of NCML. Section 5 presents the experimental results, and Section 6 concludes the paper.

The main abbreviations used in this paper are summarized in Table I.

2 Related Work

Compared with nonconvex metric learning models [16, 17, 40], convex formulations of metric learning [2, 3, 20, 21, 22] have drawn increasing attention due to desirable properties such as global optimality. Most convex metric learning models can be formulated as SDP or quadratic SDP problems. Standard SDP solvers, however, are inefficient for metric learning, especially when the number of training samples is large or the feature dimension is high. Therefore, customized optimization algorithms need to be developed for each specific metric learning model. For LMNN, Weinberger et al. developed an efficient solver based on sub-gradient descent and active set techniques [41]. For ITML, Davis et al. [23] suggested an iterative Bregman projection algorithm. The iterative projected gradient descent method [3, 42] has been widely employed for metric learning, but it requires an eigenvalue decomposition in each iteration. Other algorithms such as block-coordinate descent [43], smooth optimization [44], and Frank-Wolfe [22] have also been studied for metric learning. Unlike these customized algorithms, in this work we formulate metric learning as a kernel classification problem and solve it using off-the-shelf SVM solvers, which guarantees the global optimality and the PSD property of the learned $\mathbf{M}$, and is easy to implement and efficient in training.

Another line of work aims to develop metric learning algorithms by solving the Lagrange dual problems. Shen et al. derived the Lagrange dual of an exponential-loss based metric learning model, and proposed a boosting-like approach, namely BoostMetric, where the matrix $\mathbf{M}$ is learned as a nonnegative linear combination of rank-one matrices [21, 45]. MetricBoost [46] and FrobMetric [47, 48] were further proposed to improve the performance of BoostMetric. Liu and Vemuri incorporated two regularization terms into the dual problem for robust metric learning [49]. Note that BoostMetric [21, 45], MetricBoost [46], and FrobMetric [47] are designed for metric learning with triplet constraints, whereas in many applications such as verification, only pairwise constraints are available in the training stage.

Several SVM-based metric learning approaches [50, 51, 52, 53] have also been proposed. Using SVM, Nguyen and Guo [50] formulated metric learning as a quadratic semidefinite programming problem, and suggested a projected gradient descent algorithm. The formulations of the proposed PCML and NCML in this work are different from the model in [50], and they are solved via their dual problems with off-the-shelf SVM solvers. Brunner et al. [51] proposed a pairwise SVM method to learn a dissimilarity function rather than a distance metric. Different from [51], the proposed PCML and NCML learn a distance metric in which the matrix $\mathbf{M}$ is constrained to be PSD. Do et al. [52] studied SVM from a metric learning perspective and presented an improved variant of SVM classification. Wang et al. [53] developed a kernel classification framework for metric learning and proposed two learning models which can be efficiently implemented with standard SVM solvers. However, they adopted a two-step greedy strategy to solve the models and neglected the PSD constraint in the first step. In this work, the proposed PCML and NCML models have different formulations from [53], and their solutions are globally optimal.

3 Positive-semidefinite Constrained Metric Learning (PCML)

Denote by $\{(\mathbf{x}_i, y_i) \mid i = 1, 2, \ldots, n\}$ a training set, where $\mathbf{x}_i \in \mathbb{R}^d$ is the $i$th training sample and $y_i$ is the class label of $\mathbf{x}_i$. The Mahalanobis distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ can be equivalently written as:

$d_{\mathbf{M}}^2(\mathbf{x}_i, \mathbf{x}_j) = \big\langle \mathbf{M}, (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^{T} \big\rangle = \mathrm{tr}\big( \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^{T} \big)$   (2)

where $\mathbf{M}$ is a PSD matrix, $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^{T}\mathbf{B})$ is defined as the Frobenius inner product of two matrices $\mathbf{A}$ and $\mathbf{B}$, and $\mathrm{tr}(\cdot)$ stands for the matrix trace operator. For each pair of $\mathbf{x}_i$ and $\mathbf{x}_j$, we define a matrix $\mathbf{X}_{ij} = (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^{T}$. With $\mathbf{X}_{ij}$, the Mahalanobis distance can be rewritten as $d_{\mathbf{M}}^2(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{M}, \mathbf{X}_{ij} \rangle$.
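The equivalence between the quadratic form in (1) and the Frobenius inner product form in (2) can be checked numerically; the following short sketch (with arbitrary toy data) is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x_i, x_j = rng.normal(size=d), rng.normal(size=d)
L = rng.normal(size=(d, d))
M = L.T @ L                         # any PSD matrix serves the purpose here

diff = x_i - x_j
X_ij = np.outer(diff, diff)         # the rank-1 matrix X_ij = (x_i - x_j)(x_i - x_j)^T

quad_form = diff @ M @ diff         # Eq. (1)
frobenius = np.sum(M * X_ij)        # <M, X_ij> = tr(M X_ij), Eq. (2)
assert np.isclose(quad_form, frobenius)
```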

3.1 PCML and Its Dual Problem

Let $\mathcal{S}$ be the set of similar pairs, and let $\mathcal{D}$ be the set of dissimilar pairs. By introducing an indicator variable $h_{ij}$

$h_{ij} = \begin{cases} 1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S} \\ -1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{D} \end{cases}$   (3)

the PCML model can be formulated as:

$\min_{\mathbf{M},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{M}\|_F^2 + C \sum_{i,j} \xi_{ij}$   (4)
s.t. $h_{ij}\big(b - \langle \mathbf{M}, \mathbf{X}_{ij} \rangle\big) \ge 1 - \xi_{ij}, \ \ \xi_{ij} \ge 0, \ \forall (i,j); \quad \mathbf{M} \succeq 0$

where $\xi_{ij}$ denotes the slack variables, $b$ denotes the bias, and $\|\cdot\|_F$ denotes the Frobenius norm.

The PCML model defined above is convex and can be solved using standard SDP solvers. However, the high complexity of general-purpose interior-point SDP solvers makes them suitable only for small-scale problems. In order to improve the efficiency, in the following we first analyze the Lagrange duality of the PCML model, and then propose an algorithm that iterates between SVM training and PSD projection to learn the Mahalanobis distance metric.

By introducing the Lagrange multipliers $\alpha_{ij} \ge 0$ and a PSD matrix $\boldsymbol{\Lambda}$, the Lagrange dual of the problem in (4) can be formulated as:

$\max_{\boldsymbol{\alpha},\, \boldsymbol{\Lambda}} \ \sum_{i,j} \alpha_{ij} - \frac{1}{2}\Big\| \boldsymbol{\Lambda} - \sum_{i,j} \alpha_{ij} h_{ij} \mathbf{X}_{ij} \Big\|_F^2$   (5)
s.t. $\sum_{i,j} \alpha_{ij} h_{ij} = 0, \ \ 0 \le \alpha_{ij} \le C, \ \forall (i,j); \quad \boldsymbol{\Lambda} \succeq 0$

Please refer to Appendix A for the detailed derivation of the dual problem. Based on the Karush-Kuhn-Tucker (KKT) conditions, the matrix $\mathbf{M}$ can be obtained by

$\mathbf{M} = \boldsymbol{\Lambda} - \sum_{i,j} \alpha_{ij} h_{ij} \mathbf{X}_{ij}$   (6)

The strong duality allows us to first solve the equivalent dual problem in (5) and then obtain the matrix $\mathbf{M}$ by (6). However, due to the PSD constraint $\boldsymbol{\Lambda} \succeq 0$, the problem in (5) is still difficult to optimize.

3.2 Alternating Optimization Algorithm

To solve the dual problem efficiently, we propose an optimization approach that updates $\boldsymbol{\alpha}$ and $\boldsymbol{\Lambda}$ alternately. Given $\boldsymbol{\Lambda}$, we introduce a new variable to absorb the terms involving $\boldsymbol{\Lambda}$, and the subproblem on $\boldsymbol{\alpha}$ can be formulated as:

(7)

The subproblem (7) is a QP problem. We can define a kernel function of sample pairs as follows:

$K\big((\mathbf{x}_i, \mathbf{x}_j), (\mathbf{x}_k, \mathbf{x}_l)\big) = \langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle = \big((\mathbf{x}_i - \mathbf{x}_j)^{T}(\mathbf{x}_k - \mathbf{x}_l)\big)^2$   (8)

Substituting (8) into (7), the subproblem on $\boldsymbol{\alpha}$ becomes a kernel-based classification problem, and can be efficiently solved by existing SVM solvers such as LibSVM [33]. Given $\boldsymbol{\alpha}$, the subproblem on $\boldsymbol{\Lambda}$ can be formulated as the projection of a matrix onto the convex cone of PSD matrices:

$\min_{\boldsymbol{\Lambda}} \ \big\| \boldsymbol{\Lambda} - \mathbf{A} \big\|_F^2 \quad \text{s.t.} \ \boldsymbol{\Lambda} \succeq 0$   (9)

where $\mathbf{A} = \sum_{i,j} \alpha_{ij} h_{ij} \mathbf{X}_{ij}$. Through the eigen-decomposition of $\mathbf{A}$, i.e., $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^{T}$, where $\boldsymbol{\Sigma}$ is the diagonal matrix of eigenvalues, the solution to the subproblem on $\boldsymbol{\Lambda}$ can be explicitly expressed as $\boldsymbol{\Lambda} = \mathbf{U} \boldsymbol{\Sigma}^{+} \mathbf{U}^{T}$, where $\boldsymbol{\Sigma}^{+} = \max(\boldsymbol{\Sigma}, 0)$. Finally, the PCML algorithm is summarized in Algorithm 1.
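To make the connection to off-the-shelf SVM solvers concrete, the following sketch solves the $\boldsymbol{\alpha}$-subproblem for the special case $\boldsymbol{\Lambda} = \mathbf{0}$ (e.g., a first iteration), where it reduces to a standard SVM dual with the pair kernel of (8). The toy pair data, the variable names, and the use of scikit-learn in place of LibSVM are illustrative assumptions, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_pairs, d, C = 40, 5, 1.0

# Toy pairwise data: rows play the role of the difference vectors (x_i - x_j),
# and h in {+1, -1} indicates similar / dissimilar pairs.
diffs = rng.normal(size=(n_pairs, d))
h = np.tile([1.0, -1.0], n_pairs // 2)

# Pair kernel of Eq. (8): <X_ij, X_kl> = ((x_i - x_j)^T (x_k - x_l))^2.
K = (diffs @ diffs.T) ** 2

# Solve the alpha-subproblem (with Lambda = 0) using a precomputed-kernel SVM.
svm = SVC(C=C, kernel="precomputed")
svm.fit(K, h)

alpha = np.zeros(n_pairs)
alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())  # alpha_ij, nonzero only on support pairs
```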

  Input: Pairwise constraints $\{(\mathbf{X}_{ij}, h_{ij})\}$ and the regularization parameter $C$.
  Output: The matrix $\mathbf{M}$.
  Initialize $\boldsymbol{\Lambda}^{(0)}$, $t = 0$.
  repeat
     1. Update the linear term of subproblem (7) with $\boldsymbol{\Lambda}^{(t)}$.
     2. Update $\boldsymbol{\alpha}^{(t+1)}$ by solving the subproblem (7) using an SVM solver.
     3. Update $\mathbf{A}^{(t+1)} = \sum_{i,j} \alpha_{ij}^{(t+1)} h_{ij} \mathbf{X}_{ij}$.
     4. Update $\boldsymbol{\Lambda}^{(t+1)} = \mathbf{U} \boldsymbol{\Sigma}^{+} \mathbf{U}^{T}$, where $\mathbf{A}^{(t+1)} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^{T}$ and $\boldsymbol{\Sigma}^{+} = \max(\boldsymbol{\Sigma}, 0)$.
     5. $t \leftarrow t + 1$.
  until convergence
  $\mathbf{M} = \boldsymbol{\Lambda} - \sum_{i,j} \alpha_{ij} h_{ij} \mathbf{X}_{ij}$.
  return $\mathbf{M}$
Algorithm 1 Algorithm of PCML
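Continuing the sketch above, the $\boldsymbol{\Lambda}$-step of Algorithm 1 amounts to an eigen-decomposition with clipping of negative eigenvalues, after which the metric is assembled as in (6). The quantities below (random $\boldsymbol{\alpha}$, toy difference vectors) are stand-ins for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pairs, d = 40, 5
diffs = rng.normal(size=(n_pairs, d))        # difference vectors (x_i - x_j)
h = np.tile([1.0, -1.0], n_pairs // 2)       # pair labels h_ij
alpha = rng.uniform(0.0, 1.0, size=n_pairs)  # stand-in for the SVM solution

def psd_projection(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, U = np.linalg.eigh(A)
    return (U * np.maximum(w, 0.0)) @ U.T

# A = sum_ij alpha_ij h_ij X_ij, with X_ij = diff diff^T (see the text below Eq. (9)).
A = (diffs.T * (alpha * h)) @ diffs
Lam = psd_projection(A)

# Eq. (6): M = Lambda - sum_ij alpha_ij h_ij X_ij; PSD since Lambda is the PSD projection of A.
M = Lam - A
print(np.linalg.eigvalsh(M).min())
```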

3.3 Optimality Condition

As shown in [54, 55], the general alternating minimization approach converges. By alternately updating $\boldsymbol{\alpha}$ and $\boldsymbol{\Lambda}$, the proposed algorithm can reach the global optimum of the problems in (4) and (5).

The optimality condition of the proposed algorithm can be checked by the duality gap in each iteration, which is defined as the difference between the primal and dual objective values:

(10)

where the involved primal and dual variables are feasible solutions obtained in the $t$th iteration, and the difference between the primal and dual objective values is the duality gap in the $t$th iteration. According to (6), we can derive that

(11)

As shown in Subsection 3.2, , , and hence , where . Thus, can be computed by

(12)

Substituting (11) and (12) into (10), the duality gap of PCML can be obtained as follows

(13)

Based on the KKT conditions of the PCML dual problem in (5), can be obtained by

(14)

where

(15)

Please refer to Appendix A for the detailed derivations. The duality gap is always nonnegative and approaches zero when the primal problem is convex. Thus, it can be used as the termination condition of the algorithm. Fig. 1 plots the duality gap versus the number of iterations of PCML on the PenDigits dataset. One can see that the duality gap converges to zero in fewer than 20 iterations, indicating that our algorithm reaches the global optimum. In Algorithm 1, we adopt the following termination condition:

(16)

where $\varepsilon$ is a small positive constant; its value is set empirically in our experiments.

Fig. 1: Duality gap vs. number of iterations on the PenDigits dataset for PCML.

3.4 Remarks

Warm-start: In the updating of $\boldsymbol{\alpha}$, we adopt a simple warm-start strategy: the solution from the previous iteration is used as the initialization of the next iteration. Since the previous solution serves as a good initial guess, warm-start brings a significant improvement in efficiency.

Construction of pairwise constraints: Based on the training set of $n$ samples, a quadratic number of pairwise constraints can be introduced in total. However, in practice we only need to choose a subset of pairwise constraints to reduce the computational cost. For each sample, we find its $k$ nearest neighbors to construct similar pairs and its $k$ farthest neighbors to construct dissimilar pairs. Thus, we only need $2kn$ pairwise constraints, which reduces the number of pairwise constraints from quadratic to linear in $n$. Since $k$ is usually a small constant in practice, the computational cost of metric learning is much reduced. A similar strategy for constructing pairwise or triplet constraints can be found in [2, 11].
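A rough sketch of this constraint-construction strategy is given below; the function name and the purely distance-based neighbor selection are assumptions for illustration (class labels could additionally be taken into account when forming the pairs):

```python
import numpy as np

def build_pairs(X, k=3):
    """For each sample, take its k nearest neighbors as similar pairs (h = +1)
    and its k farthest neighbors as dissimilar pairs (h = -1)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # exclude self-pairs
    pairs, labels = [], []
    for i in range(n):
        order = np.argsort(dist[i])
        for j in order[:k]:                   # nearest neighbors -> similar pairs
            pairs.append((i, j)); labels.append(1.0)
        for j in order[-(k + 1):-1]:          # farthest neighbors -> dissimilar pairs
            pairs.append((i, j)); labels.append(-1.0)
    return pairs, np.array(labels)
```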

Computational Complexity: We use the LibSVM library for SVM training. The complexity of SMO-type algorithms [34] grows mainly with the number of pairwise constraints and the cost of each kernel evaluation. For the PSD projection, the complexity of conventional SVD (eigen-decomposition) algorithms is $O(d^3)$, where $d$ is the feature dimension.

4 Nonnegative-coefficient Constrained Metric Learning (NCML)

Given a set of rank-1 PSD matrices $\{\mathbf{T}_1, \ldots, \mathbf{T}_N\}$, a linear combination of them is defined as $\mathbf{M} = \sum_{l} u_l \mathbf{T}_l$, where $u_l$ is the scalar combination coefficient. One can easily prove the following Theorem 1.

Theorem 1

If all the scalar coefficients satisfy $u_l \ge 0$, the matrix $\mathbf{M} = \sum_{l} u_l \mathbf{T}_l$ is a PSD matrix, where each $\mathbf{T}_l$ is a rank-1 PSD matrix.

Proof:

Denote by $\mathbf{z}$ an arbitrary vector. Based on the expression of $\mathbf{M}$, we have:

$\mathbf{z}^{T} \mathbf{M} \mathbf{z} = \sum_{l} u_l\, \mathbf{z}^{T} \mathbf{T}_l\, \mathbf{z}.$

Since $u_l \ge 0$ and $\mathbf{z}^{T} \mathbf{T}_l \mathbf{z} \ge 0$ for every $l$, we have $\mathbf{z}^{T} \mathbf{M} \mathbf{z} \ge 0$. Therefore, $\mathbf{M}$ is a PSD matrix. \qed
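Theorem 1 is easy to verify numerically; the following sketch (with random rank-1 matrices) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 6, 10
vs = rng.normal(size=(N, d))        # vectors v_l defining the rank-1 matrices T_l = v_l v_l^T
u = rng.uniform(0.0, 2.0, size=N)   # nonnegative combination coefficients u_l >= 0

# Nonnegative combination of rank-1 PSD matrices (Theorem 1).
M = sum(u_l * np.outer(v, v) for u_l, v in zip(u, vs))

# All eigenvalues are (numerically) nonnegative, so M is PSD.
print(np.linalg.eigvalsh(M).min() >= -1e-10)
```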

4.1 NCML and Its Dual Problem

Motivated by Theorem 1, we propose to transform the PSD constraint in (4) by re-parameterizing the distance metric $\mathbf{M}$, and develop a nonnegative-coefficient constrained metric learning (NCML) method to learn the PSD matrix $\mathbf{M}$. Given the training data, a rank-1 PSD matrix $\mathbf{X}_{ij} = (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^{T}$ can be constructed for each pair $(\mathbf{x}_i, \mathbf{x}_j)$. By assuming that the learned matrix $\mathbf{M}$ is a linear combination of the matrices $\mathbf{X}_{ij}$ under the nonnegative coefficient constraint, the NCML model can be formulated as:

$\min_{\mathbf{M},\, \mathbf{u},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{M}\|_F^2 + C \sum_{i,j} \xi_{ij}$   (17)
s.t. $h_{ij}\big(b - \langle \mathbf{M}, \mathbf{X}_{ij} \rangle\big) \ge 1 - \xi_{ij}, \ \ \xi_{ij} \ge 0, \ \forall (i,j); \quad \mathbf{M} = \sum_{k,l} u_{kl} \mathbf{X}_{kl}, \ \ u_{kl} \ge 0, \ \forall (k,l)$

By substituting $\mathbf{M}$ with $\sum_{k,l} u_{kl} \mathbf{X}_{kl}$, we reformulate the NCML model as follows:

$\min_{\mathbf{u},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\Big\|\sum_{k,l} u_{kl} \mathbf{X}_{kl}\Big\|_F^2 + C \sum_{i,j} \xi_{ij}$   (18)
s.t. $h_{ij}\Big(b - \sum_{k,l} u_{kl} \langle \mathbf{X}_{kl}, \mathbf{X}_{ij} \rangle\Big) \ge 1 - \xi_{ij}, \ \ \xi_{ij} \ge 0, \ \forall (i,j); \quad u_{kl} \ge 0, \ \forall (k,l)$

By introducing the Lagrange multipliers $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, the Lagrange dual of the primal problem in (18) can be formulated as:

(19)
s.t.

Please refer to Appendix B for the detailed derivation of the dual problem. Based on the KKT conditions, the coefficients $u_{ij}$ can be obtained by:

(20)

Thus, we can first solve the above dual problem, and then obtain the matrix $\mathbf{M}$ by

(21)

4.2 Optimization Algorithm

There are two groups of variables, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, in problem (19). We adopt an alternating optimization approach to solve it. First, given $\boldsymbol{\beta}$, the variables $\boldsymbol{\alpha}$ can be solved as follows:

(22)
s.t.

where the newly introduced variable is determined by $\boldsymbol{\beta}$. Clearly, the subproblem on $\boldsymbol{\alpha}$ is exactly the dual problem of SVM, and it can be efficiently solved by any standard SVM solver, e.g., LibSVM [33].

Given $\boldsymbol{\alpha}$, the subproblem on $\boldsymbol{\beta}$ can be formulated as follows:

(23)

where the involved quantity is determined by $\boldsymbol{\alpha}$. To simplify the subproblem on $\boldsymbol{\beta}$, we derive the Lagrange dual of (23) based on the KKT conditions:

(24)

where a Lagrange dual multiplier has been introduced. The Lagrange dual problem of (23) is formulated as follows:

(25)

Please refer to Appendix C for the detailed derivation. Clearly, problem (25) is a simpler QP problem than (23), which can be efficiently solved by the standard SVM solvers.

By alternately updating $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, we can solve the NCML dual problem (19). After obtaining the optimal solutions of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, the optimal solution of $\mathbf{u}$ in problem (18) can be obtained by

(26)

We then have $\mathbf{M} = \sum_{i,j} u_{ij} \mathbf{X}_{ij}$. The NCML algorithm is summarized in Algorithm 2.

  Input: Training set.
  Output: The matrix $\mathbf{M}$.
  Initialize $\boldsymbol{\beta}^{(0)}$ with small random values, $t = 0$.
  repeat
     1. Update the linear term of subproblem (22) with $\boldsymbol{\beta}^{(t)}$.
     2. Update $\boldsymbol{\alpha}^{(t+1)}$ by solving the subproblem (22) using an SVM solver.
     3. Update the linear term of subproblem (25) with $\boldsymbol{\alpha}^{(t+1)}$.
     4. Update $\boldsymbol{\beta}^{(t+1)}$ by solving the subproblem (25) using an SVM solver.
     5. Update $\mathbf{u}^{(t+1)}$ with $\boldsymbol{\alpha}^{(t+1)}$ and $\boldsymbol{\beta}^{(t+1)}$.
     6. $t \leftarrow t + 1$.
  until convergence
  $\mathbf{M} = \sum_{i,j} u_{ij} \mathbf{X}_{ij}$.
  return $\mathbf{M}$
Algorithm 2 Algorithm of NCML

Analogous to PCML, the updating of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ in NCML can be sped up by using the warm-start strategy. As shown in Fig. 2, the proposed NCML algorithm converges within 10 to 15 iterations.

4.3 Optimality Condition

We check the duality gap of NCML to investigate its optimality condition. From the primal and dual objectives in (18) and (19), the NCML duality gap in the $t$th iteration is

(27)

where the involved primal and dual variables are feasible solutions obtained in the $t$th iteration, and the difference between the two objective values is the duality gap in the $t$th iteration. As the solutions obtained for the primal subproblem on $\boldsymbol{\beta}$ in (23) and for its dual problem in (25) are both optimal, the duality gap of this subproblem is zero, i.e.,

(28)

As shown in (26), the two expressions should be equal. Substituting (28) into (27), we obtain:

(29)

Based on the KKT conditions of the NCML dual problem in (19), this quantity can be obtained by (30), where the involved terms can be obtained by (31):

(30)
(31)

Please refer to Appendix B for the detailed derivations.

Fig. 2 plots the duality gap versus the number of iterations of NCML on the PenDigits dataset. One can see that the duality gap converges to zero within 15 iterations, indicating that NCML reaches the global optimum. In the implementation of Algorithm 2, we adopt the following termination condition:

(32)

where $\varepsilon$ is a small positive constant; its value is set empirically in our experiments.

Fig. 2: Duality gap vs. number of iterations on the PenDigits dataset for NCML.

4.4 Remarks

Computational complexity: We use the same strategy as in PCML to construct the pairwise constraints for NCML. In each iteration, NCML calls the SVM solver twice while PCML calls it only once, so the per-iteration cost of NCML is dominated by the two SVM solves when the SMO-type algorithm [34] is adopted for SVM training. One extra advantage of NCML lies in its lower computational cost with respect to the feature dimension $d$: it only involves the computation of $\langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle$ and the construction of the matrix $\mathbf{M}$. Since $\langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle = \big((\mathbf{x}_i - \mathbf{x}_j)^{T}(\mathbf{x}_k - \mathbf{x}_l)\big)^2$, the cost of computing it is only $O(d)$. The construction of the matrix $\mathbf{M}$ is required only once, after the convergence of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$.

Nonlinear extensions: Note that $\langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle$ can be treated as an inner product of two pairs of samples: $(\mathbf{x}_i, \mathbf{x}_j)$ and $(\mathbf{x}_k, \mathbf{x}_l)$. Analogous to PCML, if we can define a kernel $K$ on $(\mathbf{x}_i, \mathbf{x}_j)$ and $(\mathbf{x}_k, \mathbf{x}_l)$, we can substitute $\langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle$ with $K$ to develop new linear or even nonlinear metric learning algorithms, and the Mahalanobis distance between any two samples $\mathbf{x}$ and $\mathbf{y}$ can be formulated as:

(33)

Another nonlinear extension strategy is to define a kernel $\kappa$ on the difference vectors $\mathbf{x}_i - \mathbf{x}_j$ and $\mathbf{x}_k - \mathbf{x}_l$. Since $\langle \mathbf{X}_{ij}, \mathbf{X}_{kl} \rangle = \big((\mathbf{x}_i - \mathbf{x}_j)^{T}(\mathbf{x}_k - \mathbf{x}_l)\big)^2$, we can substitute the inner product $(\mathbf{x}_i - \mathbf{x}_j)^{T}(\mathbf{x}_k - \mathbf{x}_l)$ with $\kappa$ and formulate the Mahalanobis distance between $\mathbf{x}$ and $\mathbf{y}$ as:

(34)

That is to say, NCML allows us to learn nonlinear metrics for histograms and structural data by designing proper kernel functions and incorporating appropriate regularization on $\mathbf{u}$. Metric learning for structural data beyond vector data has recently received considerable research interest [56, 5], and NCML can provide a new perspective on this topic.
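As a purely illustrative sketch of this idea (the specific kernel and the exact way it replaces the pair inner product in (33)/(34) are assumptions here, not the paper's prescription), a nonlinear distance of this kind could be evaluated as follows:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """A kernel on difference vectors; an illustrative choice only."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def nonlinear_distance(x, y, pair_diffs, u, gamma=0.5):
    """Hypothetical nonlinear metric: keep the nonnegative coefficients u_ij
    learned by NCML, but replace the linear pair inner product with a kernel
    between difference vectors."""
    return sum(u_ij * rbf(d_ij, x - y, gamma) for u_ij, d_ij in zip(u, pair_diffs))

rng = np.random.default_rng(5)
pair_diffs = rng.normal(size=(8, 4))      # (x_i - x_j) for the training pairs
u = rng.uniform(0.0, 1.0, size=8)         # nonnegative coefficients (stand-ins)
x, y = rng.normal(size=4), rng.normal(size=4)
print(nonlinear_distance(x, y, pair_diffs, u))
```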

SVM solvers: Although our implementation is based on LibSVM, there are a number of well-studied SVM training algorithms, e.g., core vector machines [35], LaRank [36], BMRM [37], and Pegasos [38], which can be utilized for large-scale metric learning. Moreover, we can draw on the progress in kernel methods [26, 27, 28] to develop semi-supervised, multiple-instance, and multitask metric learning approaches.

5 Experimental Results

We evaluate the proposed PCML and NCML models for $k$-NN classification using 9 UCI datasets, 4 handwritten digit datasets, 2 face verification datasets and 2 person re-identification datasets. We compare PCML and NCML with the baseline Euclidean distance metric and 7 state-of-the-art metric learning models, including NCA [16], ITML [23], MCML [20], LDML [8], LMNN [2], PLML [39], and DML-eig [22]. On each dataset, if the partition of training set and test set is not defined, we evaluate the performance of each method by 10-fold cross-validation, and the classification error rate and training time are obtained by averaging over 10 runs of 10-fold cross-validation. PCML and NCML are implemented using the LibSVM toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The source codes of NCA (http://www.cs.berkeley.edu/~fowlkes/software/nca/), ITML (http://www.cs.utexas.edu/~pjain/itml/), MCML (http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html), LDML (http://lear.inrialpes.fr/people/guillaumin/code.php), LMNN (http://www.cse.wustl.edu/~kilian/code/code.html), PLML (http://cui.unige.ch/~wangjun/), and DML-eig (http://empslocal.ex.ac.uk/people/staff/yy267/software.html) are available online, and we tune their parameters to get the best results.
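For reference, the evaluation protocol used throughout this section (optional PCA, then $k$-NN with a learned metric $\mathbf{M}$) can be summarized by the following sketch; the function name is hypothetical, and $\mathbf{M}$ is assumed to have been learned in the (possibly PCA-reduced) feature space:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def knn_error_with_metric(X_train, y_train, X_test, y_test, M, k=1, n_components=None):
    """Evaluate a learned Mahalanobis metric M with k-NN: factor M = L^T L via an
    eigen-decomposition and run k-NN on the linearly transformed features."""
    if n_components is not None:
        pca = PCA(n_components=n_components).fit(X_train)
        X_train, X_test = pca.transform(X_train), pca.transform(X_test)
    w, U = np.linalg.eigh(M)
    L = (U * np.sqrt(np.maximum(w, 0.0))) @ U.T       # symmetric square root, M = L^T L
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train @ L.T, y_train)
    return 1.0 - knn.score(X_test @ L.T, y_test)
```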

5.1 Results on the UCI Datasets

We first use 9 datasets from the UCI Machine Learning Repository [57] to evaluate the proposed models. The information of the 9 UCI datasets is summarized in Table II. On the Satellite, SPECTF Heart, and Letter datasets, the training set and test set are defined. On the other datasets, we use 10-fold cross-validation to evaluate the metric learning models.

Dataset  # of training samples  # of test samples  Feature dimension  # of classes

Breast Tissue 96 10 9 6
Cardiotocography 1,914 212 21 10
ILPD 525 58 10 2
Letter 16,000 4,000 16 26
Parkinsons 176 19 22 2
Satellite 4,435 2,000 36 6
Segmentation 2,079 231 19 7
Sonar 188 20 60 2
SPECTF Heart 80 187 44 2
TABLE II: The UCI datasets used in our experiments.

The proposed PCML and NCML methods involve only one hyper-parameter, i.e., the regularization parameter $C$. We simply adopt the cross-validation strategy to select $C$ by investigating its influence on the classification error rate. Fig. 3 shows the curves of classification error rate versus $C$ for PCML and NCML on the SPECTF Heart dataset. The curves on other datasets are similar. We can observe that when $C$ is below a certain value, the classification error rates of PCML and NCML are low and stable, while the error rates jump dramatically once $C$ exceeds that value. We therefore fix $C$ accordingly in our experiments.

Fig. 3: Classification error rate (%) versus $C$. (a) PCML; (b) NCML.

We compare the classification error rates of the competing methods in Table III. On the Cardiotocography and Segmentation datasets, PCML achieves the lowest error rates. On the Segmentation and SPECTF Heart datasets, NCML achieves the lowest error rates. The average ranks of the competing methods are listed in the last row of Table III. On each dataset, we rank the methods by their error rates, i.e., we assign rank 1 to the method with the lowest error rate, rank 2 to the method with the second lowest error rate, and so on. The average rank is defined as the mean rank of a method over the nine datasets, which provides a fair comparison of the learning methods [58]. From Table III, we can see that PCML and NCML achieve the best and the second best average ranks, respectively, demonstrating strong capability for general classification tasks.

Dataset Euclidean NCA ITML MCML LDML LMNN PLML DML-eig PCML NCML
Breast Tissue 31.00 41.27 35.82 32.09 48.00 34.37 34.13 33.13 38.00 35.37
Cardiotocography 21.40 21.16 18.67 22.29 22.26 19.21 18.54 29.31 18.50 18.69
ILPD 35.69 34.65 35.35 35.49 35.84 34.12 31.61 36.87 33.96 32.43
Letter 4.33 2.47 3.80 4.20 11.05 3.45 3.28 3.85 2.67 2.72
Parkinsons 4.08 6.63 6.13 9.84 7.15 5.26 8.84 7.82 5.68 7.26
Satellite 10.95 10.40 11.45 15.65 15.90 10.05 11.85 10.90 11.15 11.10
Segmentation 2.86 2.51 2.73 2.60 2.86 2.64 2.68 2.97 2.12 2.12
Sonar 12.98 15.40 12.07 24.29 22.86 11.57 12.07 15.07 12.71 13.29
SPECTF Heart 38.50 26.74 34.76 38.50 33.16 34.76 27.27 31.02 28.88 25.67
Average Rank 5.78 4.56 5.44 7.56 8.44 4.00 4.33 7.00 3.56 3.89
TABLE III: Classification error rate (%) on the UCI datasets.

We then compare the training time of the competing metric learning methods in Fig. 4. All the experiments are run on a PC with 4 Intel Core i5-2410 CPUs (2.30 GHz) and 16 GB RAM. Clearly, the proposed PCML and NCML are the fastest in most cases. Although DML-eig is faster than PCML on the Letter dataset, its classification error rate on this dataset is much higher than those of PCML and NCML. On average, PCML and NCML are 23 and 18 times faster than PLML, the third fastest algorithm, respectively.

Fig. 4: Training time (s) of NCA, ITML, MCML, LDML, LMNN, DML-eig, PLML, PCML and NCML. From 1 to 9, the Dataset ID represents Breast Tissue, Cardiotocography, ILPD, Letter, Parkinsons, Satellite, Segmentation, Sonar and SPECTF Heart.

5.2 Handwritten Digit Recognition

We further evaluate the proposed methods on four handwritten digit datasets: MNIST, Pen-Based Recognition of Handwritten Digits (PenDigits), Semeion and USPS. Table IV summarizes the basic information of these four handwritten digit datasets. On the MNIST, PenDigits, and USPS datasets, we use the defined training sets to train the metrics, and use the defined test sets to compute the classification error rates. On the Semeion dataset, we use 10-fold cross-validation to evaluate the metric learning methods, and the classification error rate and training time are obtained by averaging over 10 runs of 10-fold cross-validation.

Dataset  # of training samples  # of test samples  Feature dimension  PCA dimension  # of classes

MNIST 60,000 10,000 784 100 10
PenDigits 7,494 3,498 16 N/A 10
Semeion 1,434 159 256 100 10
USPS 7,291 2,007 256 100 10
TABLE IV: The handwritten digit datasets used in the experiments.

As the dimensions of the images in the MNIST, Semeion and USPS datasets are relatively high, we use principal component analysis (PCA) to reduce the feature dimension to 100, and train the metrics in the PCA subspace. Table V lists the classification error rates of the ten competing methods on the four handwritten digit datasets. The last row of Table V lists the average ranks of the competing methods. We do not report the error rate and training time of MCML on the MNIST dataset because MCML requires too much memory (more than 30 GB) on this dataset and cannot run on our PC. From Table V, we can see that PCML and NCML both achieve the best average rank. Again, the results indicate that the proposed methods have better classification performance.

Dataset Euclidean NCA ITML MCML LDML LMNN DML-eig PLML PCML NCML
MNIST 2.87 5.46 2.89 N/A 6.05 2.28 5.06 2.54 3.85 2.80
PenDigits 2.26 2.23 2.29 2.26 6.20 2.52 3.75 2.46 2.06 2.06
Semeion 8.54 8.60 5.71 11.23 11.98 6.09 5.72 7.66 4.83 5.53
USPS 5.08 5.68 6.33 5.08 8.77 5.38 11.36 6.73 5.33 5.43
Average Rank 4.00 6.25 5.25 4.67 9.50 4.50 7.50 5.75 2.75 2.75
TABLE V: Comparison of classification error rate (%) on the handwritten digit datasets.

All the experiments were executed on the same PC as used in Subsection 5.1. Fig. 5 compares the training time of NCA, ITML, MCML, LDML, LMNN, DML-eig, PLML, PCML, and NCML. Clearly, the proposed PCML and NCML methods are much faster than the other methods. On average, PCML and NCML are 61 and 27 times faster than PLML, the third fastest algorithm, respectively. One can conclude that PCML and NCML offer promising solutions to effective and efficient metric learning.

Fig. 5: Training time (s) of NCA, ITML, MCML, LDML, LMNN, DML-eig, PLML, PCML and NCML. From 1 to 4, the Dataset ID represents MNIST, PenDigits, Semeion and USPS.

Finally, we compare the running time of PCML and NCML under different feature dimensions $d$. As analyzed in Subsections 3.4 and 4.4, the time complexities of PCML and NCML scale differently with $d$. Fig. 6 shows the training time on the Semeion dataset with different PCA dimensions. We can see that when the dimension is lower than 110, the training time of NCML is longer than that of PCML. When the dimension is higher than 110, the training time of PCML increases and becomes longer than that of NCML.

Fig. 6: Training time (s) vs. PCA dimension on the Semeion dataset.

5.3 Face Verification

In this subsection, we evaluate the proposed methods for face verification using two challenging face databases: Labeled Faces in the Wild (LFW) [59] and Public Figures (PubFig) [60].

5.3.1 The LFW Database

The face images in the LFW database were collected from the Internet and demonstrate large variations of pose, illumination, expression, etc. The database consists of 13,233 face images from 5,749 persons. Under the image restricted setting, the performance of a face verification method is evaluated by 10-fold cross validation. For each of the 10 runs, the database provides 300 positive pairs and 300 negative pairs for testing, and 5,400 image pairs for training. The verification rate and Receiver Operator Characteristic (ROC) curve of each method are obtained by averaging over the 10 runs.

In our experiments, we use the SIFT [61] features and the attribute features provided by [8] and [60] to evaluate the metric learning methods. Since the dimension of the SIFT features is high (i.e., $128 \times 3 \times 9$), PCA is used to reduce the feature dimension to 150. Under the restricted setting of the LFW database, we only know whether the two images of each given pair are matched or not. In the training stage, we use the training pairs to train a Mahalanobis distance metric. In the test stage, we compare the Mahalanobis distance of a test pair with a threshold to decide whether the two images are matched or not.
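A minimal sketch of this verification rule is given below (the helper names are hypothetical, and in practice the threshold would be selected on the training pairs):

```python
import numpy as np

def pair_distances(pairs, M):
    """Squared Mahalanobis distances for a list of (feature_a, feature_b) pairs."""
    return np.array([(a - b) @ M @ (a - b) for a, b in pairs])

def verify(pairs, M, threshold):
    """Declare a pair 'matched' when its distance falls below the threshold."""
    return pair_distances(pairs, M) < threshold

def verification_accuracy(pairs, labels, M, threshold):
    # labels: True for matched pairs, False for non-matched pairs
    return np.mean(verify(pairs, M, threshold) == np.asarray(labels))
```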

We report the ROC curves of PCML, NCML, DML-eig [22], ITML [23], KISSME [9], LDML [8] and the Euclidean distance in Fig. 7. We also compare the verification accuracies of PCML, NCML and the other metric learning methods using the SIFT and the attribute features in Table VI. It can be seen that the proposed PCML and NCML methods perform much better than all the other competing methods. Using the combination of SIFT and attribute features, the verification accuracies of PCML (89.00%) and NCML (89.50%) are higher than that of the third best method, i.e., DML-eig (85.65%), by 3.35% and 3.85%, respectively. We also compare the training time of the competing methods in Table VI. The training time of PCML and NCML is shorter than that of the other methods except for KISSME, because KISSME is a one-pass training approach. Although KISSME is faster, its verification accuracy is much lower than that of PCML and NCML.

Fig. 7: The ROC curves of different metric learning methods on the LFW-funneled dataset under the image restricted setting[22, 9, 8]. (a) SIFT feature; (b) Attribute feature; (c) SIFT + Attribute feature.
Method  Verification Accuracy (%) (SIFT / Attribute / SIFT + Attribute)  Training Time (s) (SIFT / Attribute)
PCML 85.70 84.70 89.00 13.22 14.17
NCML 86.45 85.45 89.50 31.62 27.55
DML-eig[22] 81.27 80.13 85.65 1931.50 113.79
ITML[9] 82.40 82.98 85.50 3341.80 3222.40
LDML[8] 79.27 83.40 86.02 1316.60 543.08
KISSME[9] 80.50 84.60 85.39 0.22 0.05
Euclidean 68.10 75.25 76.53 0 0
TABLE VI: Verification accuracies (%) and training time (s) of competing metric learning methods on the LFW-funneled dataset under the image restricted setting.

5.3.2 The PubFig Database

The PubFig database [60] contains 58,797 face images of 200 persons with large variations in pose, lighting, expression, scene, camera, imaging conditions and parameters, etc. In this database, the face verification methods are also evaluated using 10-fold cross validation. Among the given 20,000 image pairs, we randomly select 18,000 pairs for training and use the remaining 2,000 pairs for testing in each run. The ROC curves and verification rates are obtained by averaging over the 10 runs.

We use the attribute features provided by [60] to evaluate the competing methods. Fig. 8 shows the ROC curves of PCML, NCML, KISSME [9], ITML [23], DML-eig [22], Attribute Classifiers [60] and the baseline Euclidean distance. It can be seen that the performance of PCML and NCML is similar, and is superior to that of the other methods.

We further report the verification rates of PCML, NCML and the other methods in Table VII. One can see that PCML and NCML perform better than the other methods. The accuracies of PCML (79.71%) and NCML (79.75%) are higher than the third best method, i.e., Attribute Classifiers (78.65%), by 1.06% and 1.10%, respectively. The training time of PCML, NCML and other metric learning methods is also listed in Table VII. It can be seen that PCML and NCML are much faster than ITML and DML-eig.

Fig. 8: The ROC curves of different methods on the PubFig database (the curves of PCML and NCML almost coincide).
Methods  Verification Accuracy (%)  Training Time (s)
PCML 79.71 118.55
NCML 79.75 216.38
KISSME[9] 77.60 0.09
ITML[9] 69.30 3796.50
Attribute Classifiers[60] 78.65 -
DML-eig[22] 77.36 1132.30
Euclidean 72.50 0
TABLE VII: Verification accuracies (%) and training time (s) of competing methods on the PubFig database.

5.4 Person Re-identification

In this subsection, we evaluate the performance of the proposed methods for person re-identification, i.e., recognizing a person at different locations and at different times [62]. Two challenging person re-identification databases, the Viewpoint Invariant Pedestrian Recognition (VIPeR) database [63] and the Context Aware Vision using Image-based Active Recognition for Re-Identification (CAVIAR4REID) database [64] are used to assess the performance of the proposed methods.

5.4.1 The VIPeR Database

The VIPeR database contains 1,264 pedestrian images of 632 persons from two camera viewpoints (camera A and camera B). For each person, there are two images taken from different viewpoints with a change of 90 degrees. In our experiments, we randomly select 316 persons and use their images for training, and use the images of the other 316 persons for testing. For the testing images, we use the images taken by camera B as the probe set and the images from camera A as the gallery set. Finally, 10 partitions of training and test sets are constructed, and the average accuracy over the 10 test sets is computed as the final accuracy.

Methods  Accuracy (%) (Rank 1 / Rank 25 / Rank 50 / Rank 80 / Rank 100)  Training Time (s)
PCML 19.40 80.60 93.77 97.25 98.23 4.94
NCML 21.04 82.28 93.07 97.25 98.32 9.05
KISSME[9] 19.60 80.70 91.80 96.68 97.78 0.07
LMNN[9] 16.61 72.94 88.13 94.30 96.36 437.43
ITML[9] 15.66 74.21 88.29 95.41 96.99 1199.10
DML-eig[22] 8.07 50.47 65.82 77.69 82.44 47.03
Euclidean 10.90 44.94 60.76 70.09 74.37 0
TABLE VIII: Person re-identification accuracies (%) and training time (s) on the VIPeR dataset.

We report the Cumulative Matching Characteristic (CMC) curves of the competing methods in Fig. 9. We also compare their accuracies under different ranks in Table VIII. From Fig. 9 and Table VIII, one can see that both PCML and NCML outperform LMNN, ITML and Euclidean distance significantly under all ranks. When the rank is no more than 25, PCML performs similarly to KISSME, while NCML outperforms KISSME. When the rank is between 25 and 200, both PCML and NCML perform better than KISSME. The training time of the metric learning methods is also reported in Table VIII. We can see that both PCML and NCML are much more efficient than LMNN and ITML in training.

Fig. 9: The CMC curves on the VIPeR dataset.

5.4.2 The CAVIAR4REID Database

CAVIAR4REID consists of 1,220 pedestrian images from 72 persons, where the images are extracted from the shopping center scenario of the CAVIAR database [64]. The database covers a large range of image resolutions and pose variations, and the minimum and maximum image sizes differ greatly. Following [65] and [10], we use the hierarchical Gaussian (HG) features to evaluate the metric learning methods.

According to the evaluation protocol in [10], we randomly select 36 persons and use their images for training, and use the remaining images for testing. For the testing images, we randomly select one image of each person to construct a probe set of 36 images, and use the other test images as the gallery set. Finally, 10 partitions of training and test sets are constructed, and the final results are obtained by averaging over the 10 runs.

We report the CMC curves of PCML, NCML, DML-eig [22], KISSME [9], ITML [23], LMNN [19] and the Euclidean distance in Fig. 10. One can see that PCML and NCML perform the best and the second best among all the competing methods, respectively. Table IX lists the re-identification accuracies and training time of the different methods. PCML and NCML perform better than the other metric learning methods under all ranks, and they are much faster in training than the other metric learning methods except for KISSME.

Fig. 10: The CMC curves on the CAVIAR4REID dataset.
Methods  Accuracy (%) (Rank 1 / Rank 5 / Rank 10 / Rank 15)  Training Time (s)
PCML 32.86 61.26 76.06 85.34 11.47
NCML 32.27 60.38 75.33 84.25 19.23
DML-eig[22] 30.68 57.15 73.18 82.64 829.24
LMNN[9] 28.66 56.53 71.30 81.19 95.62
ITML[9] 31.48 59.56 74.83 84.15 2819.18
KISSME[9] 29.87 54.75 71.36 82.15 1.12
Euclidean 27.98 50.67 66.25 77.54 0
TABLE IX: Person re-identification accuracies (%) and training time (s) on the CAVIAR4REID dataset.

6 Conclusion

We proposed two distance metric learning models, namely Positive-semidefinite Constrained Metric Learning (PCML) and Nonnegative-coefficient Constrained Metric Learning (NCML). The proposed models guarantee the positive semidefiniteness of the learned matrix $\mathbf{M}$ and can be solved efficiently by existing SVM solvers. Experimental results on nine UCI machine learning repository datasets and four handwritten digit datasets showed that, compared with the state-of-the-art metric learning methods, including NCA [16], ITML [23], MCML [20], LDML [8], LMNN [2], PLML [39], and DML-eig [22], the proposed PCML and NCML methods not only achieve higher classification accuracy but are also much faster in training. On average, they are 35 and 21 times faster than PLML, the third fastest metric learning method, respectively. The experimental results on the LFW, PubFig, VIPeR and CAVIAR4REID databases indicate that the proposed methods also perform very well in vision tasks such as face verification and person re-identification, leading to higher verification rates and very competitive training efficiency.

Appendix A The Dual of PCML

The original problem of PCML is formulated as

$\min_{\mathbf{M},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{M}\|_F^2 + C \sum_{i,j} \xi_{ij}$
s.t. $h_{ij}\big(b - \langle \mathbf{M}, \mathbf{X}_{ij} \rangle\big) \ge 1 - \xi_{ij}, \ \ \xi_{ij} \ge 0, \ \forall (i,j); \quad \mathbf{M} \succeq 0$   (35)

Its Lagrangian is:

(36)

where $\alpha_{ij}$, $\mu_{ij}$ and $\boldsymbol{\Lambda}$ are the Lagrange multipliers, which satisfy $\alpha_{ij} \ge 0$, $\mu_{ij} \ge 0$, and $\boldsymbol{\Lambda} \succeq 0$. Converting the original problem to its dual problem needs the following KKT conditions:

(37)
(38)
(39)
(40)
(41)
(42)

Equation (37) implies the following relationship between $\mathbf{M}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\Lambda}$:

(43)

Substituting (37)-(39) back into the Lagrangian, we get the following Lagrange dual problem of PCML:

(44)

As we can see from (43) and (44), $\mathbf{M}$ is explicitly determined by the training procedure, but the bias $b$ is not. Nevertheless, $b$ can be easily found by using the KKT complementarity conditions in (39) and (42), which show that $\xi_{ij} = 0$ if $\alpha_{ij} < C$, and that the corresponding constraint holds with equality if $\alpha_{ij} > 0$. Thus we can simply take any training pair for which $0 < \alpha_{ij} < C$ to compute $b$ by

(45)

Note that it is numerically wiser to take the average over all such training pairs to compute $b$. After $b$ is computed, we can compute