Towards Making High Dimensional Distance Metric Learning Practical
In this work, we study distance metric learning (DML) for high dimensional data. A typical approach for DML with high dimensional data is to perform dimensionality reduction before learning the distance metric. The main shortcoming of this approach is that it may result in a suboptimal solution due to the subspace removed by the dimensionality reduction method. In this work, we present a dual random projection framework for DML with high dimensional data that explicitly addresses the limitation of dimensionality reduction for DML. The key idea is to first project all the data points into a low dimensional space by random projection, and compute the dual variables using the projected vectors. It then reconstructs the distance metric in the original space using the estimated dual variables. The proposed method, on one hand, enjoys the light computation of random projection, and on the other hand, alleviates the limitation of most dimensionality reduction methods. We verify both empirically and theoretically the effectiveness of the proposed algorithm for high dimensional DML.
Keywords: Distance Metric Learning, Dual Random Projection
Distance metric learning (DML) is essential to many machine learning tasks, including ranking (Chechik et al., 2010; Lim et al., 2013), k-nearest neighbor (k-NN) classification (Weinberger and Saul, 2009) and k-means clustering (Xing et al., 2002). It finds a good metric by minimizing the distance between data pairs in the same classes and maximizing the distance between data pairs from different classes (Xing et al., 2002; Globerson and Roweis, 2005; Yang and Jin, 2006; Davis et al., 2007; Weinberger and Saul, 2009; Shaw et al., 2011). The main computational challenge of DML arises from the constraint that the learned matrix has to be positive semi-definite (PSD). It is computationally demanding even with stochastic gradient descent (SGD) because it has to project the intermediate solutions onto the PSD cone at every iteration. In a recent study (Chechik et al., 2010), the authors show empirically that it is possible to learn a good distance metric using online learning without having to perform the projection at each iteration. In fact, only one projection onto the PSD cone is performed at the end of online learning to ensure that the resulting matrix is PSD (see Footnote 1). Our study of DML follows the same paradigm, which we refer to as the one-projection paradigm.
Footnote 1: We note that this is different from the algorithms presented in (Hazan and Kale, 2012; Mahdavi et al., 2012). Although these two algorithms only need either one or no projection step, they introduce additional mechanisms to prevent the intermediate solutions from being too far away from the PSD cone, which could result in a significant overhead per iteration.
Although the one-projection paradigm resolves the computational challenge from projection onto the PSD cone, it still suffers from a high computational cost when each data point is described by a large number of features. This is because, for $d$-dimensional data points, the size of the learned matrix will be $O(d^2)$, and as a result, the cost of computing the gradient, the fundamental operation for any first order optimization method, will also be $O(d^2)$. The focus of this work is to develop an efficient first order optimization method for high dimensional DML that avoids the $O(d^2)$ computational cost per iteration.
Several approaches have been proposed to reduce the computation cost for high dimensional DML. In (Davis and Dhillon, 2008), the authors assume that the learned metric is of low rank, and write it as , where with . Instead of learning , the authors proposed to learn directly, which reduces the cost of computing the gradient from to . A similar idea was studied in (Weinberger and Saul, 2009). The main problem with this approach is that it will result in non-convex optimization. An alternative approach is to reduce the dimensionality of data using dimensionality reduction methods such as principal component analysis (PCA) (Weinberger and Saul, 2009) or random projection (RP) (Tsagkatakis and Savakis, 2010). Although RP is computationally more efficient than PCA, it often yields significantly worse performance than PCA unless the number of random projections is sufficiently large (Fradkin and Madigan, 2003). We note that although RP has been successfully applied to many machine learning tasks, e.g., classification (Rahimi and Recht, 2007), clustering (Boutsidis et al., 2010) and regression (Maillard and Munos, 2012), only a few studies examined the application of RP to DML, and most of them with limited success.
In this paper, we propose a dual random projection approach for high dimensional DML. Our approach, on one hand, enjoys the light computation of random projection, and on the other hand, significantly improves the effectiveness of random projection. The main limitation of using random projection for DML is that all the columns/rows of the learned metric will lie in the subspace spanned by the random vectors. We address this limitation of random projection by
first estimating the dual variables based on the random projected vectors and,
then reconstructing the distance metric using the estimated dual variables and data vectors in the original space.
Since the final distance metric is computed using the original vectors, not the randomly projected vectors, the column/row space of the learned metric will NOT be restricted to the subspace spanned by the random projection, thus alleviating the limitation of random projection. We verify the effectiveness of the proposed algorithms both empirically and theoretically.
We finally note that our work builds upon the recent work (Zhang et al., 2013) on random projection, where a dual random projection algorithm is developed for linear classification. Our work differs from (Zhang et al., 2013) in that we apply the theory of dual random projection to DML. More importantly, we have made important progress in advancing the theory of dual random projection. Unlike the theory in (Zhang et al., 2013), where the data matrix is assumed to be low rank or approximately low rank, our new theory of dual random projection is applicable to any data matrix even when it is NOT approximately low rank. This new analysis significantly broadens the application domains where dual random projection is applicable, which is further verified by our empirical study.
The rest of the paper is organized as follows: Section 2 introduces the methods that are related to the proposed method. Section 3 describes the proposed dual random projection approach for DML and the detailed algorithm for solving the dual problem in the subspace spanned by random projection. Section 4 summarizes the results of the empirical study, and Section 5 concludes this work with future directions.
2 Related Work
Many algorithms have been developed for DML (Xing et al., 2002; Globerson and Roweis, 2005; Davis et al., 2007; Weinberger and Saul, 2009). Exemplar DML algorithms are MCML (Globerson and Roweis, 2005), ITML (Davis et al., 2007), LMNN (Weinberger and Saul, 2009) and OASIS (Chechik et al., 2010). Besides algorithms, several studies were devoted to analyzing the generalization performance of DML (Jin et al., 2009; Bellet and Habrard, 2012). Survey papers (Yang and Jin, 2006; Kulis, 2013) provide a detailed investigation of the topic. Although numerous studies were devoted to DML, only limited progress has been made on the high dimensional challenge in DML (Davis and Dhillon, 2008; Weinberger and Saul, 2009; Qi et al., 2009; Lim et al., 2013). In (Davis and Dhillon, 2008; Weinberger and Saul, 2009), the authors address the challenge of high dimensionality by enforcing the distance metric to be a low rank matrix. (Qi et al., 2009; Lim et al., 2013) alleviate the challenge of learning a distance metric from high dimensional data by assuming the metric to be a sparse matrix. The main shortcoming of these approaches is that they have to place strong assumptions on the learned metric, significantly limiting their application. In addition, these approaches result in non-convex optimization problems that are usually difficult to solve. In contrast, the proposed DML algorithm does not have to make strong assumptions regarding the learned metric.
Random projection is widely used for dimension reduction in various learning tasks (Rahimi and Recht, 2007; Boutsidis et al., 2010; Maillard and Munos, 2012). Unfortunately, it requires a large number of random projections to achieve the desired result (Fradkin and Madigan, 2003), and this limits its application in DML, where the computational cost is proportional to the square of the dimension. Dual random projection was first introduced for linear classification (Zhang et al., 2013), and the following aspects make our work significantly different from that initial study. First, we apply dual random projection to DML, where the number of variables is quadratic in the dimension and the dimensionality problem is more serious than for a linear classifier. Second, we optimize the dual problem directly rather than the primal problem in the subspace as in the previous work. Consequently, non-smooth losses (e.g., the hinge loss) can be used with the proposed method. Last, we give a theoretical guarantee when the dataset is not low rank, whereas low rank is an important assumption in (Zhang et al., 2013). All of these efforts aim to learn a distance metric efficiently for high dimensional datasets, and our empirical study verifies the success of our method.
3 Dual Random Projection for Distance Metric Learning
Let denote the collection of training examples. Given a PSD matrix , the distance between two examples and is given as
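As an illustration, assuming the distance takes the usual squared Mahalanobis form $(x_i - x_j)^\top M (x_i - x_j)$, it can be computed as in the following minimal sketch. The function name is hypothetical and not taken from the paper.

```python
import numpy as np

def squared_metric_distance(M, x_i, x_j):
    """Squared distance (x_i - x_j)^T M (x_i - x_j) for a symmetric matrix M.

    Assumption: the standard Mahalanobis-style form; with M = I this
    reduces to the squared Euclidean distance.
    """
    M = np.asarray(M, dtype=float)
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ M @ diff)
```

With the identity matrix as `M`, the function returns the squared Euclidean distance, which is the baseline similarity measure used in the experiments.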
The proposed framework for DML will be based on triplet constraints, not pairwise constraints. This is because several previous studies have suggested that triplet constraints are more effective than pairwise constraints (Weinberger and Saul, 2009; Chechik et al., 2010; Shaw et al., 2011). Let be the set of triplet constraints used for training, where is expected to be more similar to than to . Our goal is to learn a metric function that is consistent with most of the triplet constraints in , i.e.
Following the empirical risk minimization framework, we cast the triplet constraints based DML into the following optimization problem:
where stands for the symmetric matrix of size , is the regularization parameter, is a convex loss function, , and stands for the dot product between two matrices. We note that we did not enforce in (1) to be PSD because we follow the one-projection paradigm proposed in (Chechik et al., 2010) that first learns a symmetric matrix by solving the optimization problem in (1) and then projects the learned matrix onto the PSD cone. We emphasize that unlike (Zhang et al., 2013), we did not assume to be smooth, making it possible to apply the proposed approach to the hinge loss.
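The final step of the one-projection paradigm, projecting a learned symmetric matrix onto the PSD cone, is commonly carried out by zeroing the negative eigenvalues. The sketch below shows this standard construction; the function name is hypothetical, and the paper does not prescribe a particular implementation.

```python
import numpy as np

def project_to_psd(M):
    """Nearest PSD matrix to a symmetric M in Frobenius norm."""
    M = 0.5 * (M + M.T)                     # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)    # eigendecomposition of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)   # set negative eigenvalues to zero
    return (eigvecs * eigvals) @ eigvecs.T  # V diag(eigvals) V^T
```

Because this projection requires a full eigendecomposition, performing it once at the end rather than at every iteration is what makes the one-projection paradigm attractive.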
Let be the convex conjugate of . The dual problem of (1) is given by
which is equivalent to
3.1 Dual Random Projection for Distance Metric Learning
Directly solving the primal problem in (1) or the dual problem in (2) could be computationally expensive when the data is of high dimension and the number of training triplets is very large. We address this challenge by introducing a random matrix , where and , and projecting all the data points into the low dimensional space using the random matrix, i.e., . As a result, , after random projection, becomes .
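The projection step can be sketched as follows. This is an illustration under stated assumptions: the random matrix is taken to be Gaussian with entries drawn from $N(0, 1/m)$, data points are stored as rows of `X`, and the function name is hypothetical.

```python
import numpy as np

def random_project(X, m, seed=0):
    """Project the rows of X (n x d) into m dimensions.

    Assumption: R is a d x m Gaussian matrix with N(0, 1/m) entries, so
    each projected point is R^T x.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    return X @ R, R
```

Any outer product of difference vectors computed from the projected points equals the original outer product sandwiched between `R.T` and `R`, so quantities built from such outer products can be formed directly in the low dimensional space.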
A typical approach of using random projection for DML is to obtain a matrix of size by solving the primal problem with the randomly projected vectors , i.e.
Given the learned metric , for any two data points and , their distance is measured by , where is the effective metric in the original space . The key limitation of this random projection approach is that both the column and row space of are restricted to the subspace spanned by vectors in random matrix .
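The subspace restriction can be checked numerically. In the sketch below, `M_hat` is a random symmetric stand-in for a metric learned in the projected space (not an actual learned metric), and the effective metric is assumed to take the form `R @ M_hat @ R.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5
R = rng.normal(size=(d, m)) / np.sqrt(m)   # random projection matrix
B = rng.normal(size=(m, m))
M_hat = 0.5 * (B + B.T)                    # stand-in for a metric learned in the subspace
M_eff = R @ M_hat @ R.T                    # effective d x d metric in the original space

# The effective metric has rank at most m, however large d is.
rank = np.linalg.matrix_rank(M_eff)

# Any direction orthogonal to the columns of R is invisible to M_eff.
v = rng.normal(size=d)
v_perp = v - R @ np.linalg.lstsq(R, v, rcond=None)[0]
residual = np.linalg.norm(M_eff @ v_perp)  # numerically zero
```

Differences between data points that fall along such orthogonal directions contribute nothing to the distance, which is the limitation the dual approach is designed to avoid.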
Instead of solving the primal problem, we propose to solve the dual problem using the randomly projected data points , i.e.
where . After obtaining the optimal solution for (5), we reconstruct the metric by using the dual variables and data matrix in the original space, i.e.
It is important to note that unlike the random projection approach, the recovered metric in (6) is not restricted by the subspace spanned by the random vectors, a key to the success of the proposed algorithm.
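The reconstruction step can be sketched as below. Several assumptions are made for illustration: each triplet $(i, j, k)$ contributes the matrix $A_t = (x_i - x_k)(x_i - x_k)^\top - (x_i - x_j)(x_i - x_j)^\top$, the dual variables are non-negative, and the metric is recovered as $\frac{1}{\lambda N}\sum_t \alpha_t A_t$. The sign and scaling follow the common SDCA convention and may differ from the paper's equation (6); the function name is hypothetical.

```python
import numpy as np

def reconstruct_metric(X, triplets, alpha, lam):
    """Recover a d x d metric from dual variables and ORIGINAL data points.

    Assumed form: M = (1 / (lam * N)) * sum_t alpha_t * A_t, with
    A_t = (x_i - x_k)(x_i - x_k)^T - (x_i - x_j)(x_i - x_j)^T.
    """
    T = np.asarray(triplets)
    U = X[T[:, 0]] - X[T[:, 2]]                 # rows are x_i - x_k
    V = X[T[:, 0]] - X[T[:, 1]]                 # rows are x_i - x_j
    w = np.asarray(alpha, dtype=float) / (lam * len(T))
    # Weighted sums of outer products, formed without materializing each A_t.
    return (U * w[:, None]).T @ U - (V * w[:, None]).T @ V
```

Note that `X` here holds the original vectors, not the projected ones, so the column and row space of the result is not confined to the span of the random matrix.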
Alg. 1 summarizes the key steps for the proposed dual random projection method for DML. Following the one-projection paradigm (Chechik et al., 2010), we project the learned symmetric matrix onto the PSD cone at the end of the algorithm. The key component of Alg. 1 is to solve the optimization problem in (2) at Step 4 accurately. We choose the stochastic dual coordinate ascent (SDCA) method for solving the dual problem (5) because it enjoys linear convergence when the loss function is smooth, and is shown empirically to be significantly faster than the other stochastic optimization methods (Shalev-Shwartz and Zhang, 2012). We use the combination strategy recommended in (Shalev-Shwartz and Zhang, 2012), denoted by CSDCA, which uses SGD for the first epoch and then applies SDCA for the remaining epochs.
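To make the dual solver concrete, the following is a minimal sketch of plain SDCA for the hinge loss over triplet matrices. It is not the CSDCA combination used in the paper (no initial SGD epoch), and it rests on assumptions: the primal is taken to be $\frac{\lambda}{2}\|M\|_F^2 + \frac{1}{N}\sum_t \max(0, 1 - \langle A_t, M\rangle)$, the dual variables lie in $[0, 1]$ as in the convention of Shalev-Shwartz and Zhang, and all function names are hypothetical.

```python
import numpy as np

def triplet_matrices(X, triplets):
    """Stack A_t = (x_i-x_k)(x_i-x_k)^T - (x_i-x_j)(x_i-x_j)^T for each (i, j, k)."""
    T = np.asarray(triplets)
    U = X[T[:, 0]] - X[T[:, 2]]
    V = X[T[:, 0]] - X[T[:, 1]]
    return np.einsum("ta,tb->tab", U, U) - np.einsum("ta,tb->tab", V, V)

def sdca_hinge(A, lam, epochs=10, seed=0):
    """Plain SDCA: exact coordinate maximization of the dual, one triplet at a time."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    alpha = np.zeros(N)
    M = np.zeros(A.shape[1:])                   # maintained as sum_t alpha_t A_t / (lam N)
    sq_norms = np.einsum("tab,tab->t", A, A)    # squared Frobenius norms of A_t
    for _ in range(epochs):
        for t in rng.permutation(N):
            if sq_norms[t] == 0.0:
                continue
            margin = np.sum(A[t] * M)           # <A_t, M>
            step = (1.0 - margin) * lam * N / sq_norms[t]
            new_alpha = np.clip(alpha[t] + step, 0.0, 1.0)
            M += (new_alpha - alpha[t]) / (lam * N) * A[t]
            alpha[t] = new_alpha
    return alpha, M

def primal_objective(A, M, lam):
    margins = np.einsum("tab,ab->t", A, M)
    return 0.5 * lam * np.sum(M * M) + np.mean(np.maximum(0.0, 1.0 - margins))

def dual_objective(A, alpha, lam):
    M = np.einsum("t,tab->ab", alpha, A) / (lam * len(alpha))
    return np.mean(alpha) - 0.5 * lam * np.sum(M * M)
```

In the setting of Alg. 1, the input would be built from the projected points, e.g. `triplet_matrices(X @ R, triplets)`, so that each coordinate update touches only a small matrix. The returned `alpha` would then be combined with the original data to form the metric.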
3.2 Main Theoretical Results
First, similar to (Zhang et al., 2013), we consider the case when the data matrix is of low rank. The theorem below shows that under the low rank assumption, with a high probability, the distance metric recovered by Algorithm 1 is nearly optimal.
The proof of Theorem 1 can be found in appendix. Theorem 1 indicates that if the number of random projections is sufficiently large (i.e. ), we can recover the optimal solution in the original space with a small error. It is important to note that our analysis, unlike (Zhang et al., 2013), can be applied to non-smooth loss such as the hinge loss.
In the second case, we assume the loss function is -smooth (i.e., ). The theorem below shows that the dual variables obtained by solving the optimization problem in (5) can be close to the optimal dual variables, even when the data matrix is NOT low rank or approximately low rank. For the presentation of theorem, we first define a few important quantities. Define matrices , , , and as
Define the maximum of the spectral norm of the four matrices, i.e.
where stands for the spectral norm of matrices.
The proof of Theorem 2 can be found in the appendix. Unlike Theorem 1 where the data matrix is assumed to be low rank, Theorem 2 holds without any prior assumption about the data matrix. It shows that despite the random projection, the dual solution can be recovered approximately using the randomly projected vectors, provided that the number of random projections is sufficiently large, is small, and the approximately optimal solution is sufficiently accurate. In the case when most of the training examples are not linearly dependent, we could have , which could be a modest number when is very large. The result in Theorem 2 essentially justifies the key idea of our approach, i.e. computing the dual variables first and recovering the distance metric later. Finally, since , the approximation error in the recovered dual variables, is proportional to the square root of the suboptimality , an accurate solution for (5) is needed to ensure a small approximation error. We note that given Theorem 2, it is straightforward to bound using the relationship between the dual variables and the primal variables in (3).
We will first describe the experimental setting, and then present our empirical study for ranking and classification tasks on various datasets.
4.1 Experimental Setting
Table 1: Statistics of the datasets used in the empirical study. Columns: # C, # F, #Train, #Test.
Six datasets are used to validate the effectiveness of the proposed algorithm for DML. Table 1 summarizes the information of these datasets. caltech30 is a subset of the Caltech256 image dataset (Griffin et al., 2007) and we use the version pre-processed by (Chechik et al., 2010). tdt30 is a subset of the tdt2 dataset (Cai et al., 2009). Both caltech30 and tdt30 are comprised of the examples from the most popular categories. All the other datasets are downloaded from LIBSVM (Chang and Lin, 2011), where rcv30 is a subset of the original dataset consisting of documents from the most popular categories. The datasets tdt30, 20news and rcv30 are comprised of documents represented by vectors of dimensions. Since it is expensive to compute and maintain a matrix of , for these three datasets, we follow the procedure in (Chechik et al., 2010) that maps all documents to a space of dimension. More specifically, we first keep the top most popular words for each collection, and then reduce their dimensionality to by using PCA. We emphasize that for several datasets in our test beds, the data matrices cannot be well approximated by low rank matrices. Fig. 2 summarizes the eigenvalue distribution of the six datasets used in our experiment. We observe that four of these datasets (i.e., caltech30, tdt30, 20news, rcv30) have a flat eigenvalue distribution, indicating that the associated data matrices cannot be well approximated by a low rank matrix. This justifies the importance of removing the low rank assumption from the theory of dual random projection, an important contribution of this work.
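One way to inspect whether a data matrix is close to low rank is to look at its normalized eigenvalue spectrum. The sketch below assumes that the spectrum of the sample covariance matrix is the quantity of interest, which the text does not state explicitly; the function name is hypothetical.

```python
import numpy as np

def eigenvalue_spectrum(X):
    """Eigenvalues of the sample covariance of X (n x d), largest first,
    normalized to sum to one."""
    Xc = X - X.mean(axis=0)                   # center the features
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    eig = s ** 2 / (len(X) - 1)               # covariance eigenvalues
    return eig / eig.sum()
```

A spectrum whose mass is concentrated in the first few entries suggests an approximately low rank data matrix, whereas a slowly decaying (flat) spectrum suggests the opposite.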
For most datasets used in this study, we use the standard training/testing split provided by the original datasets, except for datasets tdt30, caltech30 and rcv30. For tdt30 and caltech30, we randomly select of the data for training and use the remaining for testing; for rcv30, we switch the training and test sets defined by the original package to ensure that the number of training examples is sufficiently large.
To measure the quality of learned distance metrics, two types of evaluations are adopted in our study. First, we follow the evaluation protocol in (Chechik et al., 2010) and evaluate the learned metric by its ranking performance. More specifically, we treat each test instance as a query, and rank the other test instances in ascending order of their distance to the query using the learned metric. The mean average precision (mAP) given below is used to evaluate the quality of the ranking list
where is the size of query set, is the number of relevant instances for -th query and is the precision for the first ranked instances when the instance ranked at the -th position is relevant to the query . Here, an instance is relevant to a query if they belong to the same class. Second, we evaluate the learned metric by its classification performance with a k-nearest neighbor classifier. More specifically, for each test instance , we apply the learned metric to find the first k training examples with the shortest distance, and predict the class assignment for by taking the majority vote among the k nearest neighbors. Finally, we also evaluate the computational efficiency of the proposed algorithm for DML by its CPU time.
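The two evaluation measures can be sketched as follows. The average precision follows the standard definition (mean of the precision values at the ranks of relevant items), which is assumed to match the formula intended here; the metric is assumed to act as $(q - r)^\top M (q - r)$, and all function names are hypothetical.

```python
from collections import Counter

import numpy as np

def average_precision(relevance):
    """AP of one ranked list; relevance[j] is True if the item at rank j+1 is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(relevance_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

def metric_distances(M, X_query, X_ref):
    """Pairwise squared distances (q - r)^T M (q - r)."""
    diff = X_query[:, None, :] - X_ref[None, :, :]
    return np.einsum("qrd,de,qre->qr", diff, M, diff)

def knn_predict(M, X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points under the metric M."""
    D = metric_distances(M, X_test, X_train)
    nearest = np.argsort(D, axis=1)[:, :k]
    return [Counter(y_train[idx].tolist()).most_common(1)[0][0] for idx in nearest]
```

For ranking, the relevance list of a query would be obtained by sorting the other test instances by `metric_distances` and marking those that share the query's class.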
Besides the Euclidean distance that is used as a baseline similarity measure, six state-of-the-art DML methods are compared in our empirical study:
DuOri: This baseline solves the dual problem in the original space, without random projection, and then computes the distance metric from the learned dual variables.
DuRP: This is the proposed algorithm for DML (i.e. Algorithm 1).
SRP: This algorithm applies random projection to project data into low dimensional space, and then it employs CSDCA to learn the distance metric in this subspace.
SPCA: This algorithm uses PCA as the initial step to reduce the dimensionality, and then applies CSDCA to learn the distance metric in the subspace generated by PCA.
OASIS (Chechik et al., 2010): A state-of-the-art online learning algorithm for DML that learns the optimal distance metric directly from the original space without any dimensionality reduction.
LMNN (Weinberger and Saul, 2009): A state-of-the-art batch learning algorithm for DML. It performs the dimensionality reduction using PCA before starting DML.
We randomly select active triplets (i.e., triplets that incur a positive hinge loss under the Euclidean distance) and set the number of epochs to be for all stochastic methods (i.e., DuOri, DuRP, SRP, SPCA and OASIS), which yields sufficiently accurate solutions in our experiments and is also consistent with the observation in (Shalev-Shwartz and Zhang, 2012). We search in and fix it as since the performance is insensitive to it. The step size of CSDCA is set according to the analysis in (Shalev-Shwartz and Zhang, 2012). For all stochastic optimization methods, we follow the one-projection paradigm by projecting the learned metric onto the PSD cone. The hinge loss is used in the implementation of the proposed algorithm. Both OASIS and LMNN use the implementation provided by the original authors, and parameters are tuned based on the recommendation by the original authors. All methods are implemented in Matlab, except for LMNN, whose core part is implemented in C, which is shown to be more efficient than our Matlab implementation. All stochastic optimization methods are repeated five times and the average result over five trials is reported. All experiments are run on a Linux server with 64GB memory and GHz CPUs, and only a single thread is permitted for each experiment.
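The selection of active triplets can be sketched as follows. This is a simple rejection sampler written for illustration, not the authors' procedure: it assumes a triplet $(i, j, k)$ has $x_j$ in the same class as $x_i$ and $x_k$ in a different class, and that the hinge loss uses a unit margin. The function name and the `max_tries` parameter are hypothetical.

```python
import numpy as np

def sample_active_triplets(X, y, n_triplets, seed=0, max_tries=100000):
    """Randomly draw triplets (i, j, k) that incur a positive hinge loss
    under the Euclidean distance (assumed unit margin)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    indices = np.arange(len(y))
    triplets = []
    for _ in range(max_tries):
        if len(triplets) == n_triplets:
            break
        i = rng.integers(len(y))
        same = np.flatnonzero((y == y[i]) & (indices != i))   # candidates for j
        other = np.flatnonzero(y != y[i])                     # candidates for k
        if len(same) == 0 or len(other) == 0:
            continue
        j, k = rng.choice(same), rng.choice(other)
        d_ij = np.sum((X[i] - X[j]) ** 2)
        d_ik = np.sum((X[i] - X[k]) ** 2)
        if 1.0 - (d_ik - d_ij) > 0.0:        # hinge loss is positive: triplet is active
            triplets.append((int(i), int(j), int(k)))
    return triplets
```

The sampler may return fewer than `n_triplets` triplets if `max_tries` is exhausted, for example when the classes are already well separated under the Euclidean distance.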
4.2 Efficiency of the Proposed Method
Table 2: CPU time (in minutes) of different methods for DML. Column groups: Metric in Original Space, Metric in Subspace.
In this experiment, we set the number of random projections to be , which according to the experimental results in Sections 4.3 and 4.4, yields almost the optimal performance for the proposed algorithm. For fair comparison, the number of reduced dimensions is also set to be for LMNN.
Table 2 compares the CPU time (in minutes) of different methods. Notice that the time for sampling triplets is not taken into account since it is shared by all the methods, while all the other operations (e.g., random projection and PCA) are included. It is not surprising to observe that DuRP, SRP and SPCA have similar CPU times, and are significantly more efficient than the other methods due to the effect of dimensionality reduction. Since DuRP and SRP share the same procedure for computing the dual variables in the subspace, the only difference between them lies in the procedure for reconstructing the distance metric from the estimated dual variables, a computational overhead that makes DuRP slightly slower than SRP. For all datasets, we observe that DuRP is at least 200 times faster than DuOri and 20 times faster than OASIS. Compared to the stochastic optimization methods, LMNN is the least efficient on three datasets (i.e., protein, caltech30 and 20news), mostly due to the fact that it is a batch learning algorithm.
4.3 Evaluation by Ranking
Table 3: Ranking performance (mAP) of different methods for DML. Column groups: Metric in Original Space, Metric in Subspace.
In the first experiment, we set the number of random projections used by SRP, SPCA and the proposed DuRP algorithm to be , which is roughly of the dimensionality of the original space. For fair comparison, the number of reduced dimensions for LMNN is also set to be . We measure the quality of the learned metrics by their ranking performance in terms of mAP.
Table 3 summarizes the performance of different methods for DML. First, we observe that DuRP significantly outperforms SRP and SPCA for all datasets. In fact, SRP is worse than the Euclidean distance, which computes the distance in the original space. SPCA is only able to perform better than the Euclidean distance, and is outperformed by all the other DML algorithms. Second, we observe that for all the datasets, DuRP yields similar performance to DuOri. The only difference between DuRP and DuOri is that DuOri solves the dual problem without using random projection. The comparison between DuRP and DuOri indicates that the random projection step has minimal impact on the learned distance metric, justifying the design of the proposed algorithm. Third, compared to OASIS, we observe that DuRP performs significantly better on two datasets (i.e., tdt30 and 20news) and has comparable performance on the other datasets. Finally, we observe that for all datasets, the proposed DuRP method significantly outperforms LMNN, a state-of-the-art batch learning algorithm for DML. We also note that because of limited memory, we are unable to run LMNN on the rcv30 dataset.
In the second experiment, we vary the number of random projections from to . All stochastic methods are run for five trials and Fig. 2 reports the average results with standard deviation. Note that the performance of OASIS and DuOri remains unchanged as the number of projections varies because they do not use projection. It is surprising to observe that DuRP almost achieves its best performance with only 10 projections for all datasets. This is in contrast to SRP and SPCA, whose performance usually improves with an increasing number of projections, except for the dataset usps, where the performance of SPCA declines when the number of projections is increased from to . A detailed examination shows that the strange behavior of SPCA is due to the extremely low rank of the learned matrix at 30 projections after it is projected onto the PSD cone. More investigation is needed for this case. We also observe that DuRP outperforms DuOri for several datasets (i.e. protein, caltech30, tdt30 and 20news). We suspect that the better performance of DuRP is due to the implicit regularization introduced by the random projection. We plan to investigate the regularization capability of random projection further in the future. We finally point out that with a sufficiently large number of projections, SPCA is able to outperform OASIS on 3 datasets (i.e., protein, tdt30 and 20news), indicating that the comparison result may be sensitive to the number of projections.
4.4 Evaluation by Classification
In this experiment, we evaluate the learned metric by its classification accuracy with a k-NN () classifier. We emphasize that the purpose of this experiment is to evaluate the metrics learned by different DML algorithms, not to demonstrate that the learned metric will result in state-of-the-art classification performance (see Footnote 2). Similar to the evaluation by ranking, all experiments are run five times and the results averaged over five trials with standard deviation are reported in Fig. 3. We essentially have the same observations as for the ranking experiments reported in Section 4.3, except that for most datasets, the three methods DuRP, DuOri, and OASIS yield very similar performance.
Footnote 2: Many studies (e.g., (Weinberger and Saul, 2009; Xu et al., 2012)) have shown that metric learning does not yield better classification accuracy than the standard classification algorithms (e.g., SVM) given a sufficiently large number of training data.
Note that the main concern of this paper is time efficiency, and the size of the learned metric is $O(d^2)$. It is straightforward to store the learned metric efficiently by keeping a low-rank approximation of it.
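A low-rank approximation for storage can be obtained from a truncated eigendecomposition. The sketch below assumes the learned metric has already been projected onto the PSD cone; the function name and the rank parameter `r` are hypothetical.

```python
import numpy as np

def low_rank_factor(M, r):
    """Return L (d x r) such that L @ L.T is the rank-r truncation of a PSD matrix M.

    Storing L requires d * r numbers instead of d * d.
    """
    eigvals, eigvecs = np.linalg.eigh(0.5 * (M + M.T))
    top = np.argsort(eigvals)[::-1][:r]              # indices of the r largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale                   # scale each kept eigenvector
```

With the factor `L`, the squared distance between two points can be evaluated as the squared Euclidean norm of `L.T @ (x_i - x_j)`, without ever forming the full matrix.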
In this paper, we propose a dual random projection method to learn the distance metric for large-scale high-dimensional datasets. The main idea is to solve the dual problem in the subspace spanned by random projection, and then recover the distance metric in the original space using the estimated dual variables. We develop the theoretical guarantee that, with a high probability, the proposed method can accurately recover the optimal solution with small error when the data matrix is of low rank, and the optimal dual variables even when the data matrix cannot be well approximated by a low rank matrix. Our empirical study confirms both the effectiveness and efficiency of the proposed algorithm for DML by comparing it to the state-of-the-art algorithms for DML. In the future, we plan to further improve the efficiency of our method by exploiting the scenario when the optimal distance metric can be well approximated by a low rank matrix.
A Proof of Theorem 1
First, we want to prove that is a good estimation for . We rewrite by Kronecker product:
where . Define , we have .
Under the low rank assumption that all training examples lie in the subspace of -dimension, the dataset can be decomposed as:
where is the -th singular value of , and and are the corresponding left and right singular vectors of . Given the property of Kronecker product that , we have:
where . Define , where , we have:
where equals to the identity operator of .
With the random projection approximation, we have:
In order to bound the difference between and , we need the following corollary:
(Zhang et al., 2013) Let be a standard Gaussian random matrix. Then, for any , with a probability , we have
where constant is at least .
Define . Using Corollary 3, with a probability , we have . Using the notation , we have the following expression for
where . Using the fact that the eigenvalues of are given by , it is easy to verify that,
Using the fact that and taking which results in , with a probability , we have
Define and as
We are now ready to give the proof for Theorem 1. The basic logic is straightforward. Since is close to , we would expect , the optimal solution to , to be close to , the optimal solution to . Since both and are linear in the dual variables and , we would expect to be close to .
Since maximizes over its domain, which means , we have
Using the concavity of and the fact that maximizes over its domain, we have:
B Proof of Theorem 2
Our analysis is based on the following two theorems.
(Theorem 2 (Blum, 2005)) Let , and , where is a random matrix whose entries are chosen independently from . Then:
(Lemma B-1 (Karoui, 2010)) Suppose is a real symmetric matrix with non-negative entries, and is a real symmetric matrix such that . Then, , where stands for the spectral norm of matrix and is the element-wise product between matrices and .
Define and as
Since is -smooth, we have that is -strongly convex. Using the fact that approximately maximizes with -suboptimality and is -strongly convex, we have
Using the concavity of and the fact that maximizes over its domain, we have:
when we set . Using the fact , we have
To bound , we need to bound . To this end, we write the as