Fast Low-rank Metric Learning for Large-scale and High-dimensional Data
Low-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets that are hard to load and decompose in limited memory. Experimental results show that our method achieves high accuracy and is much faster than the state-of-the-art methods on several benchmarks with large numbers of high-dimensional data. Code has been made available at https://github.com/highan911/FLRML.
1 Introduction
Metric learning aims to learn a distance (or similarity) metric from supervised or semi-supervised information, which provides better discrimination between samples. Metric learning has been widely used in various areas, such as dimensionality reduction Liu et al. (2015b); Mu (2016); Harandi et al. (2017), robust feature extraction Ding and Fu (2017); Luo and Huang (2018) and information retrieval Chechik et al. (2009); Shalit et al. (2012). For existing metric learning methods, the huge time cost and memory usage are major challenges when dealing with high-dimensional datasets with large numbers of samples. To resolve this issue, low-rank metric learning (LRML) methods optimize a metric matrix subject to low-rank constraints. These methods tend to keep the intrinsic low-rank structure of the dataset, and also reduce the time cost and memory usage in the learning process. Reducing the matrix size in optimization is an important way to reduce time and memory usage. However, the size of the matrix to be optimized still increases linearly or quadratically with either the dimensions, the number of samples, or the number of pairwise/triplet constraints. As a result, metric learning on datasets with both high dimensions and large numbers of samples remains a research challenge Bellet et al. (2013); Wang and Sun (2015).
To address this issue, we present a Fast Low-Rank Metric Learning (FLRML) method.
In contrast to state-of-the-art methods, FLRML introduces a novel formulation that better exploits the low-rank constraints to further reduce the complexity and the size of the involved matrices, which enables FLRML to achieve high accuracy and faster speed on large numbers of high-dimensional data.
Our main contributions are listed as follows.
- Modeling the constrained metric learning problem as an unconstrained optimization that can be efficiently solved on the Stiefel manifold, which makes our method scalable to large numbers of samples and constraints.
- Reducing the matrix size and complexity in optimization as much as possible while preserving accuracy, which makes our method scalable to both large numbers of samples and high dimensions.
- Proposing a mini-batch version of FLRML to make the method scalable to larger datasets that are hard to load fully into memory.
2 Related Work
In metric learning tasks, the training dataset can be represented as a matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{D \times n}$, where $n$ is the number of training samples and each sample $x_i$ has $D$ dimensions. Metric learning methods aim to learn a metric matrix $M \in \mathbb{R}^{D \times D}$ from the training set in order to obtain better discrimination between samples. Some low-rank metric learning (LRML) methods have been proposed to obtain a robust metric of the data and to reduce the computational cost of high-dimensional metric learning tasks. Since optimization with a fixed low-rank constraint is nonconvex, naive gradient descent methods easily fall into poor local optima Mu (2016); Wen and Yin (2013). According to the strategy used to remedy this issue, the existing LRML methods can be roughly divided into the following two categories.
One type of method Liu et al. (2015b); Schultz and Joachims (2004); Kwok and Tsang (2003); Hegde et al. (2015); Mason et al. (2017) introduces low-rankness-encouraging norms (such as the nuclear norm) as regularization, which relaxes the nonconvex low-rank constrained problems to convex problems. The two disadvantages of such methods are: (1) the norm regularization can only encourage low-rankness, but cannot bound the rank from above; (2) the matrix to be optimized is still the size of either or .
Another type of method Mu (2016); Harandi et al. (2017); Cheng (2013); Huang et al. (2015); Shukla and Anand (2015); Zhang and Zhang (2017) considers the low-rank constrained space as a Riemannian manifold. This type of method can obtain high-quality solutions of the nonconvex low-rank constrained problems. However, for these methods, the matrices to be optimized are at least a linear size of either or . The performance of these methods still suffers on large-scale and high-dimensional datasets.
Besides low-rank metric learning methods, there are some other types of methods for speeding up metric learning on large and high-dimensional datasets. Online metric learning Chechik et al. (2009); Shalit et al. (2012); Gillen et al. (2018); Li et al. (2018) randomly takes one sample at a time. Sparse metric learning Qi et al. (2009); Bellet et al. (2012); Shi et al. (2014); Liu et al. (2015a); Zhang et al. (2016); Liu et al. (2010) represents the metric matrix as a sparse combination of some pre-generated rank-1 bases. Non-iterative metric learning Koestinger et al. (2012); Zhu et al. (2018) avoids iterative calculation by providing explicit optimal solutions. In the experiments, some state-of-the-art methods of these types are also included for comparison. Compared with these methods, our method also has an advantage in time and memory usage on large-scale and high-dimensional datasets. A literature review of the many available metric learning methods is beyond the scope of this paper. The reader may consult Refs. Bellet et al. (2013); Wang and Sun (2015); Kulis (2013) for detailed expositions.
3 Fast Low-Rank Metric Learning (FLRML)
The metric matrix $M$ is usually semidefinite, which guarantees non-negative distances and non-negative self-similarities. A semidefinite $M$ can be represented as the transpose multiplication of two identical matrices, $M = L^\top L$, where $L \in \mathbb{R}^{d \times D}$ is a row-full-rank matrix, and $\mathrm{rank}(M) = d$. Using the matrix $L$ as a linear transformation, the training set $X$ can be mapped into $\mathbb{R}^{d \times n}$, which is denoted by $Y = LX$. Each column vector $y_i$ in $Y$ is the corresponding $d$-dimensional vector of the column vector $x_i$ in $X$.
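In this paper, we present a fast low-rank metric learning (FLRML) method, which typically learns the low-rank cosine similarity metric from triplet constraints $\mathcal{T}$. The cosine similarity between a pair of vectors $(x_i, x_j)$ is measured by their corresponding low-dimensional vectors $(y_i, y_j)$ as $\mathrm{sim}(y_i, y_j) = \frac{y_i^\top y_j}{\|y_i\| \, \|y_j\|}$. Each constraint $(i, j, k) \in \mathcal{T}$ refers to the comparison of a pair of similarities $\mathrm{sim}(y_i, y_j) > \mathrm{sim}(y_i, y_k)$.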
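As a minimal sketch of the similarity described above, the following NumPy snippet maps samples through a linear transformation and checks one triplet constraint. The names `L`, `X`, `Y`, `cosine_sim`, and `triplet_satisfied`, as well as the toy data, are illustrative assumptions and are not taken from the FLRML code.

```python
import numpy as np

def cosine_sim(y_a, y_b):
    """Cosine similarity between two low-dimensional vectors."""
    return float(y_a @ y_b / (np.linalg.norm(y_a) * np.linalg.norm(y_b)))

def triplet_satisfied(Y, i, j, k):
    """True if sample i is more similar to j (positive) than to k (negative)."""
    return cosine_sim(Y[:, i], Y[:, j]) > cosine_sim(Y[:, i], Y[:, k])

# Toy data (assumed): 3 dimensions, 3 samples stored as columns.
X = np.array([[1.0, 0.9, 0.0],
              [0.0, 0.1, 1.0],
              [0.0, 0.0, 0.5]])
# Hypothetical 2 x 3 transformation that keeps the first two coordinates.
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Y = L @ X  # mapped samples, one column per sample

print(triplet_satisfied(Y, 0, 1, 2))
```

Because only angles matter for the cosine similarity, rescaling any column of `Y` leaves the outcome of the check unchanged.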
To solve the problem of metric learning on large-scale and high-dimensional datasets, our motivation is to reduce matrix size and complexity in optimization as much as possible while preserving accuracy. To address this issue, our idea is to embed the triplet constraints into a matrix , so that the constrained metric learning problem can be cast into an unconstrained optimization in the “form” of , where is a low-rank semidefinite matrix to be optimized (Section 3.1). By reducing the sizes of and to (), the complexity and memory usage are greatly reduced. An unconstrained optimization in this form can be efficiently solved on the Stiefel manifold (Section 3.2). In addition, a mini-batch version of FLRML is proposed, which makes our method scalable to larger datasets that are hard to load and decompose fully in memory (Section 3.3).
3.1 Forming the Objective Function
Using margin loss, each triplet in corresponds to a loss function . A naive idea is to sum the loss functions, but when and are very large, the evaluation of loss functions will be time consuming. Our idea is to embed the evaluation of loss functions into matrices to speed up their calculation.
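For each triplet $t = (i, j, k)$ in $\mathcal{T}$, a matrix $C^{(t)}$ with the size $n \times n$ is generated, which is a sparse matrix with $c^{(t)}_{ji} = -1$ and $c^{(t)}_{ki} = 1$. The summation of all $C^{(t)}$ matrices is represented as $C = \sum_{t \in \mathcal{T}} C^{(t)}$. The matrix $YC$ is with the size $d \times n$. Let $\mathcal{T}_i$ be the subset of triplets with $i$ as the first item, and $\tilde{y}_i$ be the $i$-th column of $YC$, then $\tilde{y}_i$ can be written as $\tilde{y}_i = \sum_{(i,j,k) \in \mathcal{T}_i} (y_k - y_j)$. This is the sum of negative samples minus the sum of positive samples for the set $\mathcal{T}_i$. By multiplying $1/(|\mathcal{T}_i| + 1)$ on both sides, we can get $\tilde{y}_i / (|\mathcal{T}_i| + 1)$ as the mean of negative samples minus the mean of positive samples (in which “$+1$” is to avoid zero on the denominator).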
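The construction above can be illustrated with a small sketch that accumulates the per-triplet matrices into one matrix and checks the "negatives minus positives" property of a column of the product. The sign convention (−1 for the positive sample, +1 for the negative sample), the dense storage, and the names `build_triplet_matrix`, `C`, `counts`, and `scaled` are assumptions made for illustration; a practical implementation would likely use a sparse matrix.

```python
import numpy as np

def build_triplet_matrix(triplets, n):
    """Sum the per-triplet matrices into one n x n matrix C.

    Assumed convention: for a triplet (i, j, k) with positive j and
    negative k, column i receives -1 in row j and +1 in row k.
    """
    C = np.zeros((n, n))
    counts = np.zeros(n)  # number of triplets that have i as the first item
    for i, j, k in triplets:
        C[j, i] -= 1.0
        C[k, i] += 1.0
        counts[i] += 1
    return C, counts

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 5))            # toy low-dimensional samples (columns)
triplets = [(0, 1, 2), (0, 3, 4), (1, 0, 2)]
C, counts = build_triplet_matrix(triplets, n=5)

YC = Y @ C                                 # column i: sum of negatives minus sum of positives
scaled = YC / (counts + 1.0)               # "+1" avoids a zero denominator
```

Columns of samples that never appear as the first item of a triplet stay zero, which is why the denominator needs the extra one.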
Let , then by minimizing , the vector tends to be closer to the positive samples than the negative samples. Let be a diagonal matrix with , then is the -th diagonal element of . The loss function can be constructed by putting into the margin loss as . A binary function is defined as: if , then ; otherwise, . By introducing the function , the loss function can be written as , which can be further represented in matrices:
where is a diagonal matrix with , and is the sum of constant terms in the margin loss.
In Eq.(1), is the optimization variable. For any value of , the corresponding value of can be obtained by solving the linear equation group . A minimum norm least squares solution of is , where is the SVD of . Based on this, the size of the optimization variable can be reduced from to , as shown in Theorem 1.
is a subset of that covers all the possible minimum norm least squares solutions of .
By substituting into ,
Since and are constants, then covers all the possible minimum norm least squares solutions of . ∎
By substituting into Eq.(1), the sizes of and can be reduced to , which are represented as
This function is in the form of , where and . The sizes of and are reduced to , and . In addition, is a constant matrix that will not change in the process of optimization. So this model has low complexity.
It should be noted that the purpose of SVD here is not approximation. If all the ranks of are kept, i.e. , the solutions are supposed to be exact. In practice, it is also reasonable to neglect the smallest singular values of to speed up the calculation. In the experiments, an upper bound is set as , since most computers can easily handle a matrix of size, and the information in most of the datasets can be preserved well.
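The minimum norm least squares solution used in this section can be checked numerically. The sketch below assumes that the linear system has the form (transformation) × (data) = (low-dimensional vectors) and that the data matrix has full column rank; the names `X`, `Y`, `L`, and the toy sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, d = 6, 4, 2                  # toy sizes (assumed)
X = rng.standard_normal((D, n))    # data, one sample per column
Y = rng.standard_normal((d, n))    # target low-dimensional vectors

# Thin SVD of the data: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Minimum-norm least-squares solution of L @ X = Y, built from the SVD factors.
L = Y @ Vt.T @ np.diag(1.0 / s) @ U.T

# The same solution via the Moore-Penrose pseudoinverse, for comparison.
L_pinv = Y @ np.linalg.pinv(X)
```

Dropping the smallest singular values would correspond to slicing `U`, `s`, and `Vt` to the leading components before forming `L`, at the cost of an approximate rather than exact solution.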
3.2 Optimizing on the Stiefel Manifold
The matrix is the low-rank semidefinite matrix to be optimized. Due to the non-convexity of low-rank semidefinite optimization, directly optimizing in the linear space often falls into poor local optima Mu (2016); Wen and Yin (2013). The mainstream strategy for low-rank semidefinite problems is to perform the optimization on manifolds.
The Stiefel manifold is defined as the set of column-orthogonal matrices, i.e., . Any semidefinite with can be represented as , where and . Since is already restricted to the Stiefel manifold, we only need to add a regularization term for . We want to guarantee the existence of a dense, finite optimal solution of , so the L2-norm of is used as a regularization term. By adding into Eq.(3), we get
where is the -th column of .
Let . Since is a quadratic function for each , for any value of , the only corresponding optimal solution of is
In order to make the gradient of continuous, and to keep dense and positive, we adopt function as the smoothing of Mu (2016), where is the sigmoid function. Function satisfies and . The derivative of is . Figure 1 displays the sample plot of and . Using this smoothed function, the loss function is redefined as
The initialization of needs to satisfy the condition
When is fixed, is a linear function of (see Eq.(5)), is a linear function of (see Eq.(3)), and the 0-1 values in depend on (see Eq.(3)). So Eq.(8) is a nonlinear equation in a form that can be easily solved iteratively by updating with . Since , , and , this iterative process has a superlinear convergence rate.
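To illustrate the kind of sigmoid-based smoothing discussed above, the sketch below uses the softplus form, -log(sigmoid(-x)), as a smooth surrogate for the hinge max(0, x). Whether this is exactly the function used in FLRML is an assumption, since the formula is not reproduced in this text; the names `sigmoid`, `smooth_hinge`, and `smooth_hinge_grad` are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def smooth_hinge(x):
    """-log(sigmoid(-x)), a smooth and strictly positive surrogate for max(0, x)."""
    # Equivalent to log(1 + exp(x)), written in a form that avoids overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def smooth_hinge_grad(x):
    """Derivative of smooth_hinge, which equals sigmoid(x)."""
    return sigmoid(x)

for x in (-5.0, 0.0, 5.0):
    print(x, smooth_hinge(x), smooth_hinge_grad(x))
```

This surrogate approaches x for large positive inputs and 0 for large negative inputs, and its derivative is continuous everywhere, unlike the 0-1 step of the plain hinge.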
To solve the model of this paper, for a matrix , we need to get its gradient .
the gradient can be derived from
The can be easily obtained since . ∎
For solving optimizations on manifolds, a commonly used method is “projection and retraction”, which first projects the gradient onto the tangent space of the manifold as , and then retracts back to the manifold. For the Stiefel manifold, the projection of on the tangent space is Wen and Yin (2013). The retraction of the matrix to the Stiefel manifold can be represented as , which is obtained by setting all the singular values of to 1 Absil and Malick (2012).
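A minimal sketch of one "projection and retraction" step is given below. The retraction follows the description above (all singular values set to 1). The tangent-space map `G - P @ G.T @ P` is an assumed form in the style of Wen and Yin (2013), and the names `project_tangent`, `retract`, the step size, and the random stand-in gradient are illustrative only.

```python
import numpy as np

def project_tangent(P, G):
    """Map a Euclidean gradient G to a tangent vector at P (assumed form G - P G^T P)."""
    return G - P @ G.T @ P

def retract(Z):
    """Retract Z onto the Stiefel manifold by setting all its singular values to 1."""
    U, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
r, d = 6, 3                                  # toy sizes (assumed)
P = retract(rng.standard_normal((r, d)))     # random point with P^T P = I
G = rng.standard_normal((r, d))              # stand-in for a gradient
Z = project_tangent(P, G)                    # tangent direction at P
P_next = retract(P - 0.1 * Z)                # one step of size 0.1, then back to the manifold
```

A vector `Z` is tangent at `P` when `P.T @ Z` is skew-symmetric, which is what the assumed projection guarantees for any `G`.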
For Stiefel manifolds, we adopt a more efficient algorithm Wen and Yin (2013), which performs a non-monotonic line search with Barzilai-Borwein step length Zhang and Hager (2004); Barzilai and Borwein (1988) on a descent curve of the Stiefel manifold. A descent curve with parameter is defined as
where . The optimization is performed by searching for the optimal along the descent curve. The Barzilai-Borwein method predicts a step length according to the step lengths in previous iterations, which makes the method converge faster than “projection and retraction”.
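As an illustration of searching along a curve that stays on the manifold, the sketch below evaluates the Cayley-transform curve of Wen and Yin (2013). That the descent curve used here has exactly this form is an assumption, and the toy objective, the fixed step value, and the names `cayley_curve`, `target`, and `objective` are hypothetical; no Barzilai-Borwein step prediction or line search is included.

```python
import numpy as np

def cayley_curve(P, G, tau):
    """Point at step tau on the Cayley-transform curve through P with gradient G."""
    r = P.shape[0]
    A = G @ P.T - P @ G.T                    # skew-symmetric r x r matrix
    I = np.eye(r)
    # Solve (I + tau/2 A) Q = (I - tau/2 A) P instead of forming an inverse.
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ P)

rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # start point with orthonormal columns
target = rng.standard_normal((6, 3))               # toy data for a toy objective

def objective(Q):
    """Toy objective: half the squared Frobenius distance to `target`."""
    return 0.5 * np.sum((Q - target) ** 2)

G = P - target                                     # Euclidean gradient of the toy objective
Q = cayley_curve(P, G, tau=0.01)
print(objective(P), objective(Q))
```

Every point on this curve keeps orthonormal columns, so no separate retraction is needed. For tall matrices, Wen and Yin (2013) also describe a cheaper low-rank form of the same update, which this sketch omits.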
The outline of the FLRML algorithm is shown in Algorithm 1. It can be mainly divided into four stages: SVD preprocessing (line 2), constant initializing (line 3), variable initializing (lines 4 and 5), and the iterative optimization (lines 6 to 11). In one iteration, the complexity of each step is: (a) updating and : ; (b) updating : ; (c) updating : ; (d) optimizing and : .
3.3 Mini-batch FLRML
In FLRML, the maximum size of constant matrices in the iterations is only ( and ), and the maximum size of variable matrices is only . Smaller matrix size theoretically means the ability to process larger datasets on the same size of memory. However, in practice, we find that the bottleneck is not the optimization process of FLRML. On large-scale and high-dimensional datasets, SVD preprocessing may take more time and memory than the FLRML optimization process. And for very large datasets, it will be difficult to load all data into memory. In order to break the bottleneck, and make our method scalable to larger numbers of high-dimensional data in limited memory, we further propose Mini-batch FLRML (M-FLRML).
Inspired by the stochastic gradient descent method Zhang and Zhang (2017); Qian et al. (2015), M-FLRML calculates a descent direction from each mini-batch of data, and updates at a decreasing ratio. For the -th mini-batch, we randomly select triplets from the triplet set, and use the union of the samples to form a mini-batch with samples. Considering that the Stiefel manifold requires , if the number of samples in the union of triplets is less than , we randomly add some other samples to make . The matrix is composed of the extracted columns from , and is composed of the corresponding columns and rows in .
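The mini-batch selection described above can be sketched in plain Python. The function name `sample_mini_batch` and the parameters `n_triplets` and `min_size` (the lower bound on the number of samples required by the manifold) are hypothetical names introduced for illustration.

```python
import random

def sample_mini_batch(triplets, n_samples, n_triplets, min_size, seed=None):
    """Pick triplets at random and return them with the sorted sample indices they use.

    If the chosen triplets touch fewer than `min_size` distinct samples,
    additional samples are drawn at random to reach `min_size`.
    """
    rng = random.Random(seed)
    chosen = rng.sample(triplets, n_triplets)
    indices = {idx for triplet in chosen for idx in triplet}
    if len(indices) < min_size:
        remaining = [i for i in range(n_samples) if i not in indices]
        indices.update(rng.sample(remaining, min_size - len(indices)))
    return chosen, sorted(indices)

all_triplets = [(0, 1, 2), (0, 1, 3), (4, 5, 6), (7, 8, 9)]
chosen, batch_indices = sample_mini_batch(
    all_triplets, n_samples=20, n_triplets=2, min_size=8, seed=0)
print(chosen, batch_indices)
```

With a NumPy data matrix `X` stored column-wise, the mini-batch data would then be `X[:, batch_indices]`, and the triplet indices would be remapped to positions within `batch_indices`.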
The objective in Eq.(4) consists of small matrices with size and . Our idea is to first find the descent direction for the small matrices, and then map it back to get the descent direction of the large matrix . Matrix can be decomposed as , and the complexity of decomposition is significantly reduced from to on this mini-batch. According to Eq.(2), a matrix can be represented as . Using SVD, matrix can be decomposed as , and then the variable for objective can be represented as
In FLRML, in order to convert into , the initial value of satisfies the condition . But in M-FLRML, is generated from , so generally this condition is not satisfied. So instead, we take and as two variables, and find the descent direction of them separately. In Mini-batch FLRML, when a different mini-batch is taken in the next iteration, the predicted Barzilai-Borwein step length tends to be improper, so we use “projection and retraction” instead. The updated matrix is obtained as
For , we use Eq.(5) to get an updated vector . Then the updated matrix for can be obtained as . By mapping back to the high-dimensional space, the descent direction of can be obtained as
For the -th mini-batch, is updated at a decreasing ratio as . For the theoretical analysis of the stochastic strategy that updates in step sizes by , see Zhang and Zhang (2017). The outline of M-FLRML is shown in Algorithm 2.
4 Experiments
4.1 Experiment Setup
In the experiments, our FLRML and M-FLRML are compared with 5 state-of-the-art low-rank metric learning methods, including LRSML Liu et al. (2015b), FRML Mu (2016), LMNN Weinberger and Saul (2009), SGDINC Zhang and Zhang (2017), and DRML Harandi et al. (2017). For these methods, the complexities, maximum variable size and maximum constant size in one iteration are compared in Table 4.1. Considering that and , the relatively small items in the table are omitted.
In addition, four state-of-the-art metric learning methods of other types are also compared, including one sparse method (SCML Shi et al. (2014)), one online method (OASIS Chechik et al. (2009)), and two non-iterative methods (KISSME Koestinger et al. (2012), RMML Zhu et al. (2018)).
The methods are evaluated on eight datasets with high dimensions or large numbers of samples: three datasets NG20, RCV1-4 and TDT2-30 derived from three text collections respectively Lang (1995); Cai and He (2012); one handwritten characters dataset MNIST LeCun et al. (1998); four voxel datasets of 3D models M10-16, M10-100, M40-16, and M40-100 with different resolutions in and dimensions, respectively, generated from “ModelNet10” and “ModelNet40” Wu et al. (2015) which are widely used in 3D shape understanding Han et al. (2019d, a, h); Liu et al. (2019a); Han et al. (2019g, c, e); Liu et al. (2019b); Han et al. (2019f, 2017, b, 2018). To measure the similarity, the data vectors are normalized to the unit length. The dimensions , the number of training samples , the number of test samples , and the number of categories of all the datasets are listed in Table 4.1.
Different methods have different requirements for SVD preprocessing. In our experiments, a fast SVD algorithm Li et al. (2017) is adopted. The time in SVD preprocessing is listed at the top of Table 4.1. Using the same decomposed matrices as input, seven methods are compared: three methods (LRSML, LMNN, and our FLRML) require SVD preprocessing; four methods (FRML, KISSME, RMML, OASIS) do not mention SVD preprocessing, but since they need to optimize large dense matrices, SVD has to be performed to prevent out-of-memory errors on high-dimensional datasets. For all these methods, the rank for SVD is set as . The remaining four methods (SCML, DRML, SGDINC, and our M-FLRML) claim that SVD preprocessing is not needed, so they are compared using the original data matrices as input. Specifically, since the SVD calculation for datasets M10-100 and M40-100 exceeded the memory limit of common PCs, only these four methods are tested on these two datasets.
Most tested methods use either pairwise or triplet constraints, except for LMNN and FRML, whose implementations require the labels as direct input. For the other methods, 5 triplets are randomly generated for each sample, which are also used as 5 positive pairs and 5 negative pairs for the methods using pairwise constraints. The accuracy is evaluated by a 5-NN classifier using the output metric of each method. For each low-rank metric learning method, the rank constraint for is set as . All the experiments are performed on the Matlab R2015a platform on a PC with 3.60GHz processor and 16GB of physical memory. The code is available at https://github.com/highan911/FLRML.
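The triplet generation described above can be sketched as follows. The text states only that triplets are generated randomly, so drawing the positive from the same class and the negative from a different class is an assumption, as are the names `generate_triplets` and `per_sample`.

```python
import random

def generate_triplets(labels, per_sample=5, seed=0):
    """Draw `per_sample` random (anchor, positive, negative) index triplets per sample.

    Assumed scheme: the positive shares the anchor's label and the
    negative has a different label.
    """
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    triplets = []
    for i, label in enumerate(labels):
        positives = [p for p in by_label[label] if p != i]
        negatives = [q for q in range(len(labels)) if labels[q] != label]
        if not positives or not negatives:
            continue  # no valid triplet exists for this sample
        for _ in range(per_sample):
            triplets.append((i, rng.choice(positives), rng.choice(negatives)))
    return triplets

labels = [0, 0, 0, 1, 1, 1]
triplets = generate_triplets(labels)
print(len(triplets), triplets[:3])
```

Each triplet `(i, j, k)` can also be read as one positive pair `(i, j)` and one negative pair `(i, k)` for methods that take pairwise constraints.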
4.2 Experimental Results
Table 4.1 and Table 4.1 list the classification accuracy (left) and training time (right, in seconds) of all the compared metric learning methods on all the datasets. The symbol “E” indicates that the objective fails to converge to a finite non-zero solution, and “M” indicates that the computation was aborted due to an out-of-memory error. The maximum accuracy and minimum time usage for each dataset are shown in bold.
Comparing the results with the analysis of complexity in Table 4.1, we find that for many tested methods, if the complexity or matrix is a polynomial of , or , the efficiency on datasets with large numbers of samples is still limited. As shown in Table 4.1 and Table 4.1, FLRML and M-FLRML are faster than the state-of-the-art methods on all datasets. Our methods can achieve comparable accuracy with the state-of-the-art methods on all datasets, and obtain the highest accuracy on several datasets with both high dimensions and large numbers of samples.
Both our M-FLRML and SGDINC use mini-batches to improve efficiency. The theoretical complexity of these two methods is close, but in the experiment M-FLRML is faster. Generally, M-FLRML is less accurate than FLRML, but it significantly reduces the time and memory usage on large datasets. In the experiments, the largest dataset “M40-100” has size . If there is a dense matrix of such size, it will take up 880 GB of memory. When using M-FLRML to process this dataset, the recorded maximum memory usage of Matlab is only 6.20 GB (Matlab takes up 0.95 GB of memory on startup). The experiment shows that M-FLRML is suitable for metric learning of large-scale high-dimensional data on devices with limited memory.
In the experiments, we find that the initialization of usually converges within 3 iterations. The optimization on the Stiefel manifold usually converges in fewer than 15 iterations. Figure 4 shows samples of the convergence behavior of FLRML in optimization on each dataset. The plots are drawn in relative values, in which the values of the first iteration are scaled to 1.
In FLRML, one parameter is the margin in the margin loss. An experiment is performed to study the effect of the margin parameter on accuracy. Let be the mean of values, i.e. . We test the change in accuracy of FLRML when the ratio varies between 0.1 and 2. The mean values and standard deviations of 5 repeated runs are plotted in Figure 4, which shows that FLRML works well on most datasets when is around 1. So we use in the experiments in Table 4.1 and Table 4.1.
In M-FLRML, another parameter is the number of triplets used to generate a mini-batch. We test the effect of on the accuracy of M-FLRML with the increasing number of mini-batches. The mean values and standard deviations of 5 repeated runs are plotted in Figure 4, which shows that a larger makes the accuracy increase faster, and usually M-FLRML is able to get good results within 20 mini-batches. So in Table 4.1, all the results are obtained with and .
5 Conclusion and Future Work
In this paper, FLRML and M-FLRML are proposed for efficient low-rank similarity metric learning on high-dimensional datasets with large numbers of samples. With a novel formulation, FLRML and M-FLRML can better employ low-rank constraints to further reduce the complexity and matrix size, based on which optimization is efficiently conducted on the Stiefel manifold. This enables FLRML and M-FLRML to achieve good accuracy and faster speed on large numbers of high-dimensional data. One limitation of our current implementation of FLRML and M-FLRML is that the algorithm still runs on a single processor. Recently, there has been a trend toward distributed metric learning for big data Su et al. (2016); Xing et al. (2015). In future work, we plan to implement M-FLRML on a distributed architecture to scale to larger datasets.
This research is sponsored in part by the National Key R&D Program of China (No. 2018YFB0505400, 2016QY07X1402), the National Science and Technology Major Project of China (No. 2016ZX 01038101), and the NSFC Program (No. 61527812).
- Projection-like retractions on matrix manifolds. SIAM Journal on Optimization 22 (1), pp. 135–158.
- Two-point step size gradient methods. IMA Journal of Numerical Analysis 8 (1), pp. 141–148.
- A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709.
- Similarity learning for provably accurate sparse linear classification. In ICML, pp. 1871–1878.
- Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering 24 (4), pp. 707–719.
- An online algorithm for large scale image similarity learning. In Advances in Neural Information Processing Systems, pp. 306–314.
- Riemannian similarity learning. In ICML, pp. 540–548.
- Robust transfer metric learning for image classification. IEEE Transactions on Image Processing PP (99), pp. 1–1.
- Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, pp. 2600–2609.
- Parts4Feature: learning 3D global features from generally semantic parts in multiple views. In IJCAI.
- Unsupervised learning of 3D local features from raw voxels based on a novel permutation voxelization strategy. IEEE Transactions on Cybernetics 49 (2), pp. 481–494.
- Deep Spatiality: unsupervised learning of spatially-enhanced global and local 3D features by deep neural network with coupled softmax. IEEE Transactions on Image Processing 27 (6), pp. 3049–3063.
- BoSCC: bag of spatial context correlations for spatially enhanced 3D shape representation. IEEE Transactions on Image Processing 26 (8), pp. 3707–3720.
- 3D2SeqViews: aggregating sequential views for 3d global feature learning by cnn with hierarchical attention aggregation. IEEE Transactions on Image Processing 28 (8), pp. 3986–3999.
- View Inter-Prediction GAN: unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions. In AAAI.
- SeqViews2SeqLabels: learning 3D global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing 28 (2), pp. 658–672.
- Y2Seq2Seq: cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In AAAI.
- Multi-angle point cloud-vae: unsupervised feature learning for 3D point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In ICCV.
- 3DViewGraph: learning global features for 3d shapes from a graph of unordered views with attention. In IJCAI.
- Joint dimensionality reduction and metric learning: a geometric take. In ICML.
- NuMax: a convex approach for learning near-isometric linear embeddings. IEEE Transactions on Signal Processing 63 (22), pp. 6109–6121.
- Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR, pp. 140–149.
- Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), pp. 2288–2295.
- Metric learning: a survey. Foundations and Trends® in Machine Learning 5 (4), pp. 287–364.
- Learning with idealized kernels. In ICML, pp. 400–407.
- Newsweeder: learning to filter netnews. In ICML, pp. 331–339.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- Algorithm 971: an implementation of a randomized algorithm for principal component analysis. ACM Transactions on Mathematical Software 43 (3), pp. 28:1–28:14.
- OPML: a one-pass closed-form solution for online metric learning. Pattern Recognition 75, pp. 302–314.
- Similarity learning for high-dimensional sparse data. In AISTATS.
- Low-rank similarity metric learning in high dimensions. In AAAI, pp. 2792–2799.
- Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 679–686.
- Point2Sequence: learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. In AAAI.
- L2G auto-encoder: understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In ACMMM.
- Matrix variate gaussian mixture distribution steered robust metric learning. In AAAI.
- Learning low-dimensional metrics. In Advances in Neural Information Processing Systems, pp. 4139–4147.
- Fixed-rank supervised metric learning on Riemannian manifold. In AAAI, pp. 1941–1947.
- An efficient sparse metric learning in high-dimensional space via l 1-penalized log-determinant regularization. In ICML, pp. 841–848.
- Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). Machine Learning 99 (3), pp. 353–372.
- Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems, pp. 41–48.
- Online learning in the embedded manifold of low-rank matrices. Journal of Machine Learning Research 13 (Feb), pp. 429–458.
- Sparse compositional metric learning. In AAAI.
- Distance metric learning by optimization on the Stiefel manifold. In International Workshop on Differential Geometry in Computer Vision for Analysis of Shapes, Images and Trajectories.
- Distributed information-theoretic metric learning in apache spark. In IJCNN, pp. 3306–3313.
- Survey on distance metric learning and dimensionality reduction in data mining. Data Mining and Knowledge Discovery 29 (2), pp. 534–564.
- Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
- A feasible method for optimization with orthogonality constraints. Mathematical Programming 142 (1-2), pp. 397–434.
- 3D ShapeNets: a deep representation for volumetric shapes. In CVPR, pp. 1912–1920.
- Petuum: a new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1 (2), pp. 49–67.
- A nonmonotone line search technique and its application to unconstrained optimization. SIAM journal on Optimization 14 (4), pp. 1043–1056.
- Efficient stochastic optimization for low-rank distance metric learning. In AAAI.
- Sparse learning for large-scale and high-dimensional data: a randomized convex-concave optimization approach. In ALT, pp. 83–97.
- Towards generalized and efficient metric learning on riemannian manifold. In IJCAI, pp. 3235–3241.