Efficient Distance Metric Learning by Adaptive Sampling and MiniBatch Stochastic Gradient Descent (SGD)
Abstract
Distance metric learning (DML) is an important task that has found applications in many domains. The high computational cost of DML arises from the large number of variables to be determined and the constraint that a distance metric has to be a positive semidefinite (PSD) matrix. Although stochastic gradient descent (SGD) has been successfully applied to improve the efficiency of DML, it can still be computationally expensive because in order to ensure that the solution is a PSD matrix, it has to, at every iteration, project the updated distance metric onto the PSD cone, an expensive operation. We address this challenge by developing two strategies within SGD, i.e. minibatch and adaptive sampling, to effectively reduce the number of updates (i.e., projections onto the PSD cone) in SGD. We also develop hybrid approaches that combine the strength of adaptive sampling with that of minibatch online learning techniques to further improve the computational efficiency of SGD for DML. We prove the theoretical guarantees for both adaptive sampling and minibatch based approaches for DML. We also conduct an extensive empirical study to verify the effectiveness of the proposed algorithms for DML.
Qi Qian, Rong Jin, Jinfeng Yi, Lijun Zhang and Shenghuo Zhu 
Department of Computer Science and Engineering 
Michigan State University, East Lansing, MI, 48824, USA 
NEC Laboratories America, Cupertino, CA, 95014, USA 
{qianqi, rongjin, yijinfen, zhanglij}@cse.msu.edu, zsh@neclabs.com 
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning

General Terms: Algorithms, Experimentation

Keywords: Distance Metric Learning, Stochastic Gradient Descent, MiniBatch, Adaptive Sampling
Distance metric learning (DML) is an important subject, and has found applications in many domains, including information retrieval [?], supervised classification [?], clustering [?], and semi-supervised clustering [?]. The objective of DML is to learn a distance metric consistent with a given set of constraints, namely minimizing the distances between pairs of data points from the same class and maximizing the distances between pairs of data points from different classes. The constraints are often specified in the form of must-links, where data points belong to the same class, and cannot-links, where data points belong to different classes. The constraints can also be specified in the form of triplets $(x_i, x_j, x_k)$ [?], in which $x_i$ and $x_j$ belong to the same class and $x_k$ belongs to a different class, and therefore $x_i$ and $x_j$ should be separated by a distance smaller than that between $x_i$ and $x_k$. In this work, we focus on DML using triplet constraints due to its encouraging performance [?, ?, ?].
The main computational challenge in DML arises from the restriction that the learned distance metric must be a positive semidefinite (PSD) matrix, which is often referred to as the PSD constraint. An early approach [?] addressed the PSD constraint by exploring the technique of semidefinite programming (SDP) [?], which unfortunately does not scale to large and high-dimensional datasets. More recent approaches [?, ?] addressed this challenge by exploiting the techniques of online learning and stochastic optimization, particularly stochastic gradient descent (SGD), which only needs to deal with one constraint at each iteration. Although these approaches are significantly more efficient than the early approach, they share one common drawback: in order to ensure that the learned distance metric is PSD, they require, at each iteration, projecting the updated distance metric onto the PSD cone. The projection step requires performing an eigendecomposition of a given matrix, and is therefore computationally expensive^1. As a result, the key challenge in developing efficient SGD algorithms for DML is how to reduce the number of projections without affecting the performance of DML.

^1 The computational cost is $O(kd^2)$ if we only need to compute the top $k$ eigenvectors of the distance metric, and becomes $O(d^3)$ if all the eigenvalues and eigenvectors have to be computed for the projection step, where $d$ is the dimensionality of the data.
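To make the cost of the projection step concrete, the following minimal numpy sketch projects a symmetric matrix onto the PSD cone via a full eigendecomposition; the `eigh` call is the expensive $O(d^3)$ operation discussed above (the function name and implementation are illustrative, not the authors' code).

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing out
    negative eigenvalues (full eigendecomposition: O(d^3))."""
    M = (M + M.T) / 2.0               # symmetrize against round-off
    vals, vecs = np.linalg.eigh(M)    # the expensive step
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T

# diag(2, -1) has one negative eigenvalue; projection zeroes it out.
P = project_psd(np.diag([2.0, -1.0]))
```

Performing this eigendecomposition once per constraint is exactly what the methods below try to avoid.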
A common approach for reducing the number of updates and projections in DML is to use a non-smooth loss function. A popular choice is the hinge loss, whose derivative becomes zero when the input value exceeds a certain threshold. Many online learning algorithms for DML [?, ?, ?] take advantage of a non-smooth loss function to reduce the number of updates and projections. In [?], the authors proposed a structure preserving metric learning algorithm (SPML) that combines a minibatch strategy with the hinge loss to further reduce the number of updates for DML. It groups multiple constraints into a minibatch and performs only one update of the distance metric for each minibatch. However, according to our empirical study, although SPML reduces the running time of the standard SGD algorithm, it yields significantly worse performance on several datasets, due to the deployment of the minibatch strategy.
In this work, we first develop a new minibatch based SGD algorithm for DML, termed MiniSGD. Unlike SPML, which relies on the hinge loss, the proposed MiniSGD algorithm uses a smooth loss function for DML. We show theoretically that by using a smooth loss function, MiniSGD is able to achieve a convergence rate similar to the standard SGD algorithm but with a significantly smaller number of updates. The second contribution of this work is to develop a new strategy, termed adaptive sampling, for reducing the number of projections in DML. The key idea of adaptive sampling is to first measure the "difficulty" in classifying a constraint using the learned distance metric, and then perform stochastic updating based on the classification difficulty. More specifically, given the distance metric $M$ and a triplet $(x_i, x_j, x_k)$, we first measure the difficulty in classifying the triplet by the magnitude of the derivative of the loss function $\ell(\cdot)$ that measures the classification error. We then sample a binary variable $Z$ whose probability of being one equals this magnitude, and only update the distance metric when $Z = 1$. We refer to the proposed approach for DML as ASSGD for short. Finally, we develop two hybrid approaches, termed HASGD and HRSGD, that combine adaptive sampling with minibatch to further improve the computational efficiency of SGD for DML. We conduct an extensive empirical study to verify the effectiveness and efficiency of the proposed algorithms for DML.
The rest of the paper is organized as follows: Section 2 reviews the related work on distance metric learning and on stochastic gradient descent with a reduced number of projection steps. Section 3 describes the proposed SGD algorithms for DML based on minibatch and adaptive sampling, presents two hybrid approaches that combine minibatch and adaptive sampling, and gives the theoretical guarantees for both the minibatch based and the adaptive sampling based SGD. Section 4 summarizes the results of the empirical study, and Section 5 concludes this work with future directions.
Many algorithms have been developed to learn a linear distance metric from pairwise constraints, where must-links include pairs of data points from the same class and cannot-links include pairs of data points from different classes (see [?] and references therein). Besides pairwise constraints, an alternative strategy is to learn a distance metric from a set of triplet constraints $(x_i, x_j, x_k)$, where $x_i$ is expected to be closer to $x_j$ than to $x_k$. Previous studies [?, ?, ?] showed that triplet constraints could be more effective for DML than pairwise constraints.
Several online algorithms have been developed to reduce the computational cost of DML [?, ?, ?, ?]. Most of these methods are based on stochastic gradient descent: at each iteration, they randomly sample one constraint and update the distance metric based on the sampled constraint. The updated distance metric is further projected onto the PSD cone to ensure that it is PSD. Although these approaches are significantly more scalable than the batch learning algorithms for DML [?], they suffer from the high computational cost of the projection step that has to be performed at every iteration. A common approach for reducing the number of projections is to use a non-smooth loss function, such as the hinge loss. In addition, in [?], the authors proposed a structure preserving metric learning algorithm (SPML) that combines minibatch with the hinge loss to further reduce the number of projections. The main problem with the approach proposed in [?] is that, according to the theory of minibatch, it only works well with a smooth loss. Since the hinge loss is non-smooth, combining minibatch with the hinge loss may result in suboptimal performance. This is verified by our empirical study, in which we observed that the distance metric learned by SPML performs significantly worse than that learned by the standard stochastic gradient descent method. We resolve this problem by presenting a new SGD algorithm for DML that combines minibatch with a smooth loss instead of the hinge loss.
Finally, it is worthwhile mentioning several recent studies that proposed to avoid projections in SGD. In [?], the authors developed a projection-free SGD algorithm that replaces the projection step with a constrained linear programming problem. In [?], the authors proposed an SGD algorithm with only one projection that is performed at the end of the iterations. Unfortunately, the improvement of the two algorithms in computational efficiency is limited, because they require computing, at each iteration, the minimum eigenvalue and the corresponding eigenvector of the updated distance metric, an operation with $O(d^2)$ cost, where $d$ is the dimensionality of the data.
We first review the basic framework of DML with triplet constraints. We then present two strategies to improve the computational efficiency of SGD for DML, one based on minibatch and one based on adaptive sampling. We present the theoretical guarantees for both strategies, and defer the more detailed analysis to the appendix. At the end of this section, we present two hybrid approaches that combine minibatch with adaptive sampling for more efficient DML.
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the domain for input patterns, where $d$ is the dimensionality. For the convenience of analysis, we assume all the input patterns have bounded norm, i.e., $\|x\|_2 \le r$ for every $x \in \mathcal{X}$. Given a distance metric $M$, the squared distance between $x$ and $x'$, denoted by $\mathrm{dist}^2_M(x, x')$, is measured by
$$\mathrm{dist}^2_M(x, x') = (x - x')^\top M (x - x').$$
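The squared distance above is a one-line computation; the following numpy sketch (illustrative names, not the authors' code) evaluates it and shows how the metric re-weights coordinates.

```python
import numpy as np

def dist_sq(M, x, y):
    """Squared distance (x - y)^T M (x - y) under the metric M."""
    diff = x - y
    return float(diff @ M @ diff)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d_euc = dist_sq(np.eye(2), x, y)             # identity metric: squared Euclidean
d_wgt = dist_sq(np.diag([4.0, 1.0]), x, y)   # metric that stretches axis 0
```

With the identity metric the value reduces to the squared Euclidean distance; the PSD requirement on $M$ is exactly what keeps this quantity non-negative for all pairs.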
Let $\Omega = \{M : M \succeq 0,\ \|M\|_F \le R\}$ be the domain for the distance metric $M$, where $R$ specifies the domain size. Let $\mathcal{D} = \{(x_i^t, x_j^t, x_k^t) : t = 1, \ldots, N\}$ be the set of triplet constraints used for DML, where $x_i^t$ is expected to be closer to $x_j^t$ than to $x_k^t$. Let $\ell(\cdot)$ be a convex loss function. Define $\mathcal{L}(M)$ as
$$\mathcal{L}(M) = \frac{1}{N}\sum_{t=1}^{N} \ell\big(\langle M, A_t\rangle\big),$$
where
$$A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top,$$
so that $\langle M, A_t\rangle = \mathrm{dist}^2_M(x_i^t, x_k^t) - \mathrm{dist}^2_M(x_i^t, x_j^t)$ is the margin by which the $t$-th triplet is classified correctly.
Given the triplet constraints in $\mathcal{D}$ and the domain $\Omega$, we learn an optimal distance metric $M_*$ by solving the following optimization problem
$$\min_{M \in \Omega}\ \mathcal{L}(M). \qquad (1)$$
The key idea of online DML is to update the distance metric based on one sampled constraint at each iteration. More specifically, at iteration $t$, it samples a triplet constraint $(x_i^t, x_j^t, x_k^t)$, and updates the distance metric from $M_t$ to $M_{t+1}$ by
$$M_{t+1} = \Pi_\Omega\big(M_t - \eta\,\ell'(\langle M_t, A_t\rangle)\, A_t\big),$$
where $\eta > 0$ is the step size, $\ell'(\cdot)$ is the derivative of the loss, and $\Pi_\Omega(\cdot)$ projects a matrix onto the domain $\Omega$. The following proposition shows that $\Pi_\Omega(M)$ can be computed in two steps, i.e., first projecting $M$ onto the PSD cone, and then scaling the projected matrix to fit the constraint $\|M\|_F \le R$.
Proposition 1
[?] We have
$$\Pi_\Omega(M) = \min\left(1,\ \frac{R}{\|\Pi_{\mathrm{PSD}}(M)\|_F}\right)\Pi_{\mathrm{PSD}}(M),$$
where $R$ is the domain size and $\Pi_{\mathrm{PSD}}(\cdot)$ projects a matrix onto the PSD cone.
As indicated by Proposition 1, computing $\Pi_\Omega(M)$ requires projecting the distance metric onto the PSD cone, an expensive operation that requires an eigendecomposition of $M$.
Finally, to bound both the regret and the number of updates, in this study we approximate the hinge loss by a smooth loss function
$$\ell(z) = \frac{1}{\gamma}\log\big(1 + \exp(\gamma(1 - z))\big), \qquad (2)$$
where $\gamma > 0$ is a parameter that controls the approximation error: the larger the $\gamma$, the closer $\ell(z)$ is to the hinge loss $\max(0, 1 - z)$. Note that the smooth approximation of the hinge loss was first suggested in [?] for classification and was later verified by an empirical study in [?]. The key properties of the loss function in (2) are given in the following proposition.
Proposition 2
For the loss function defined in (2), we have $|\ell'(z)| \le 1$ and
$$|\ell'(z)| = 1 - \exp\big(-\gamma\,\ell(z)\big),$$
so the magnitude of the derivative increases monotonically with the loss.
Compared to the hinge loss, the main advantage of the loss function in (2) is that it is smooth. As will be revealed by our analysis, it is the smoothness of the loss function that allows us to effectively explore both the minibatch and adaptive sampling strategies for more efficient DML without sacrificing prediction performance.
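The smoothed hinge form in (2) and its derivative can be sketched as follows; the value of the approximation parameter here is hypothetical, not the one used in the experiments.

```python
import numpy as np

GAMMA = 5.0  # hypothetical value of the approximation parameter

def smooth_loss(z, gamma=GAMMA):
    """(1/gamma) * log(1 + exp(gamma * (1 - z))): a smooth surrogate that
    approaches the hinge loss max(0, 1 - z) as gamma grows."""
    return np.logaddexp(0.0, gamma * (1.0 - z)) / gamma

def smooth_loss_grad(z, gamma=GAMMA):
    """Derivative of smooth_loss w.r.t. z; its magnitude never exceeds 1."""
    return -1.0 / (1.0 + np.exp(gamma * (z - 1.0)))
```

Note that $|\ell'(z)| = 1 - \exp(-\gamma\,\ell(z))$ holds exactly for this form, which is the property the adaptive sampling strategy below exploits.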
Minibatch SGD improves the computational efficiency of online DML by grouping multiple constraints into a minibatch and only updating the distance metric once for each minibatch. For brevity, we will refer to this algorithm as MiniSGD in the rest of the paper.
Let $b$ be the batch size. At iteration $t$, it samples $b$ triplet constraints, denoted by
$$\big\{(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}),\ s = 1, \ldots, b\big\},$$
and defines the minibatch loss at iteration $t$ as
$$\widehat{\mathcal{L}}_t(M) = \frac{1}{b}\sum_{s=1}^{b} \ell\big(\langle M, A_t^s\rangle\big),$$
where $A_t^s$ is the triplet matrix for the $s$-th sampled constraint. Minibatch DML updates the distance metric from $M_t$ to $M_{t+1}$ using the gradient of the minibatch loss function $\widehat{\mathcal{L}}_t(M)$, i.e.,
$$M_{t+1} = \Pi_\Omega\big(M_t - \eta \nabla \widehat{\mathcal{L}}_t(M_t)\big).$$
Algorithm 1 gives the detailed steps of MiniSGD for DML, where step 5 uses Proposition 1 to compute the projection $\Pi_\Omega(\cdot)$.
Algorithm 1 Minibatch Stochastic Gradient Descent (MiniSGD) for DML
1: Input: triplet constraints $\mathcal{D}$, step size $\eta$, minibatch size $b$, and domain size $R$
2: Initialize $M_1 = I$
3: for $t = 1, \ldots, T$ do
4:   Sample $b$ triplet constraints
5:   Update the distance metric by $M_{t+1} = \Pi_\Omega\big(M_t - \eta\nabla\widehat{\mathcal{L}}_t(M_t)\big)$
6: end for
7: return the learned distance metric

The theorem below provides the theoretical guarantee for the MiniSGD algorithm for DML using the smooth loss function defined in (2).
Theorem 1
Figure 1 shows the reduction in the training error with the number of triplet constraints processed by the MiniSGD algorithm on three datasets^2. Compared to the standard SGD algorithm, we observe that MiniSGD converges to a similar value of the training error, thus validating our theorem empirically.

^2 The information of these datasets can be found in the experimental section.
Remark 1. We observe that the second term in the upper bound in (3) has a linear dependence on the minibatch size $b$, implying that the larger the $b$, the less accurate the distance metric learned by Algorithm 1. Hence, by adjusting the minibatch size $b$, we are able to make an appropriate tradeoff between prediction accuracy and computational efficiency: the smaller the $b$, the more accurate the distance metric, but with more updates and consequently a higher computational cost. Finally, it is worthwhile comparing Theorem 1 to the theoretical result for a general minibatch SGD algorithm given in [?], i.e.,
$$\mathrm{E}\big[\mathcal{L}(\widehat{M})\big] - \mathcal{L}(M_*) \le O\!\left(\frac{1}{\sqrt{T}}\right). \qquad (4)$$
It is clear that Theorem 1 gives a significantly better result when the optimal loss $\mathcal{L}(M_*)$ is small (i.e., when the triplet constraints can be well classified by the optimal distance metric $M_*$). In particular, when $\mathcal{L}(M_*) = 0$, the convergence rate given in Theorem 1 is on the order of $O(1/T)$, while the convergence rate in (4) is only $O(1/\sqrt{T})$.
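The MiniSGD procedure above can be sketched end-to-end in numpy as follows. This is a self-contained illustration under the notation of this section, not the authors' implementation; the step size, batch size, domain size, and loss parameter are illustrative defaults.

```python
import numpy as np

def project_domain(M, R):
    """Project onto {M : M PSD, ||M||_F <= R}: PSD projection, then scaling
    (the two-step computation of Proposition 1)."""
    M = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(M)
    M = (vecs * np.maximum(vals, 0.0)) @ vecs.T
    nrm = np.linalg.norm(M)
    return M if nrm <= R else M * (R / nrm)

def triplet_matrix(xi, xj, xk):
    """A = (xi - xk)(xi - xk)^T - (xi - xj)(xi - xj)^T, so that <M, A> is
    the margin by which xj is closer to xi than xk under the metric M."""
    u, v = xi - xk, xi - xj
    return np.outer(u, u) - np.outer(v, v)

def mini_sgd(triplets, d, eta=0.1, b=10, R=10.0, gamma=5.0):
    """One pass of minibatch SGD: a single projection per batch of b triplets."""
    M = np.eye(d)
    for start in range(0, len(triplets) - b + 1, b):
        G = np.zeros((d, d))
        for xi, xj, xk in triplets[start:start + b]:
            A = triplet_matrix(xi, xj, xk)
            z = float(np.sum(M * A))                           # margin <M, A>
            G += -1.0 / (1.0 + np.exp(gamma * (z - 1.0))) * A  # l'(z) * A
        M = project_domain(M - (eta / b) * G, R)               # one projection
    return M
```

The point of the sketch is the placement of `project_domain`: it runs once per batch of $b$ constraints rather than once per constraint, which is where the factor-$b$ reduction in projections comes from.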
Algorithm 2 Adaptive Sampling Stochastic Gradient Descent (ASSGD) for DML
1: Input: triplet constraints $\mathcal{D}$, step size $\eta$, and domain size $R$
2: Initialize $M_1 = I$
3: for $t = 1, \ldots, T$ do
4:   Sample a binary random variable $Z_t$ with $\Pr(Z_t = 1) = p_t = |\ell'(\langle M_t, A_t\rangle)|$
5:   if $Z_t = 1$ then
6:     Update the distance metric by $M_{t+1} = \Pi_\Omega\big(M_t - \frac{\eta}{p_t}\,\ell'(\langle M_t, A_t\rangle)\, A_t\big)$
7:   end if
8: end for
9: return the learned distance metric

We now develop a new approach for reducing the number of updates in SGD in order to improve the computational efficiency of DML. Instead of updating the distance metric at each iteration, the proposed strategy introduces a random binary variable $Z_t$ to decide if the distance metric will be updated given a triplet constraint. More specifically, it computes the derivative $\ell'(\langle M_t, A_t\rangle)$, and samples a random variable $Z_t$ with probability
$$\Pr(Z_t = 1) = \big|\ell'(\langle M_t, A_t\rangle)\big|.$$
The distance metric will be updated only when $Z_t = 1$. According to Proposition 2, the magnitude of the derivative increases monotonically with the loss for the smooth loss function given in (2), implying that a triplet constraint has a high chance of being used to update the distance metric if it has a large loss. Therefore, the essential idea of the proposed adaptive sampling strategy is to give a large chance of updating the distance metric when the triplet is difficult to classify, and a low chance when the triplet can be classified correctly with a large margin. We note that an alternative strategy is to sample a triplet constraint based on its loss. We did not choose the loss as the basis for updating because it is the derivative, not the loss, that is used by SGD for updating the distance metric. The detailed steps of adaptive sampling based SGD for DML are given in Algorithm 2. We refer to this algorithm as ASSGD for short in the rest of this paper.
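A single adaptive-sampling step can be sketched as follows, assuming the smoothed hinge loss of this section; the $1/p$ importance weighting of the accepted update keeps the expected step unbiased, which is one standard way to realize this scheme (an assumption of the sketch, not a verbatim transcription of Algorithm 2).

```python
import numpy as np

def assgd_step(M, A, eta, gamma, rng):
    """One adaptive-sampling step (illustrative): draw Z with
    P(Z = 1) = |l'(<M, A>)| and update only when Z = 1, so easy triplets
    (small |l'|) rarely trigger an update and hence a projection."""
    z = float(np.sum(M * A))                       # margin <M, A>
    p = 1.0 / (1.0 + np.exp(gamma * (z - 1.0)))    # |l'(z)|, in (0, 1)
    if rng.random() >= p:
        return M, False                            # skipped: no projection needed
    # l'(z) = -p here, so the 1/p importance-weighted step is simply M + eta*A;
    # a projection onto the domain would follow in the full algorithm.
    return M + eta * A, True
```

A badly violated triplet (large loss) yields $p \approx 1$ and is almost always used, while a well-classified one yields a vanishingly small $p$ and is almost always skipped.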
The theorem below provides the performance guarantee for ASSGD. It also bounds the number of updates for ASSGD.
Theorem 2
Remark 2. The bound given in (5) shares a similar structure with that given in (3), except that it does not have the minibatch size $b$ that can be used to trade off between the number of updates and the classification accuracy. The number of updates performed by Algorithm 2 is bounded by (6). The dominant term in (6) is the cumulative loss over the iterations, implying that Algorithm 2 will perform a small number of updates if the learned distance metric can classify the triplet constraints correctly at most iterations. In other words, the fewer classification mistakes made by the learned distance metric, the fewer updates will be performed by Algorithm 2. We validate the theorem by running the ASSGD algorithm on three datasets. Figure 2 shows the reduction in the training error with the number of triplet constraints for ASSGD and the standard SGD algorithm. We observe that ASSGD converges to a similar value of the training error as the full SGD algorithm.
Algorithm 3 A Framework of Hybrid Stochastic Gradient Descent (HybridSGD) for DML
1: Input: triplet constraints $\mathcal{D}$, step size $\eta$, minibatch size $b$, and domain size $R$
2: Initialize $M_1 = I$
3: for $t = 1, \ldots, T$ do
4:   Sample $b$ triplets
5:   Compute the sampling probability $p_t$
6:   Sample a binary random variable $Z_t$ with $\Pr(Z_t = 1) = p_t$
7:   if $Z_t = 1$ then
8:     Update the distance metric by $M_{t+1} = \Pi_\Omega\big(M_t - \frac{\eta}{p_t}\nabla\widehat{\mathcal{L}}_t(M_t)\big)$
9:   end if
10: end for
11: return the learned distance metric

Since minibatch and adaptive sampling improve the computational efficiency of SGD from different aspects, it is natural to combine them for more efficient DML. Similar to the MiniSGD algorithm, the hybrid approaches group multiple triplet constraints into a minibatch. But, unlike MiniSGD, which updates the distance metric for every minibatch of constraints, the hybrid approaches follow the idea of adaptive sampling and introduce a binary random variable to decide if the distance metric will be updated for each minibatch of constraints. By combining the strengths of minibatch and adaptive sampling, the hybrid approaches are able to make a further improvement in the computational efficiency of DML. Algorithm 3 highlights the key steps of the hybrid approaches.
One of the key steps in the hybrid approaches (step 5 in Algorithm 3) is to choose an appropriate sampling probability $p_t$ for each minibatch of constraints. In this work, we study two different choices for the sampling probability $p_t$:

The first approach chooses $p_t$ based on a triplet constraint randomly sampled from the minibatch. More specifically, given a minibatch of $b$ triplet constraints, it randomly samples an index $s$ in the range $[1, b]$. It then sets the sampling probability to the magnitude of the derivative for the randomly sampled triplet, i.e.,
$$p_t = \big|\ell'(\langle M_t, A_t^s\rangle)\big|.$$
We refer to this approach as HRSGD.

The second approach is based on an average case analysis. It sets the sampling probability according to the average derivative, measured by the norm of the minibatch gradient $\nabla\widehat{\mathcal{L}}_t(M_t)$, i.e.,
$$p_t = \min\left(1,\ \frac{\|\nabla\widehat{\mathcal{L}}_t(M_t)\|_F}{\delta}\right),$$
where $\delta$ is an upper bound on the gradient norm and is estimated by sampling. We refer to this approach as HASGD.
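The two choices of $p_t$ can be sketched side by side; the function names and the scaling bound `delta` are illustrative assumptions of this sketch, under the smoothed hinge loss of this section.

```python
import numpy as np

def smooth_grad_coeff(M, A, gamma=5.0):
    """l'(<M, A>) for the smoothed hinge loss."""
    z = float(np.sum(M * A))
    return -1.0 / (1.0 + np.exp(gamma * (z - 1.0)))

def p_hrsgd(M, batch_As, rng, gamma=5.0):
    """HRSGD-style probability: |l'| at one triplet sampled from the batch."""
    A = batch_As[rng.integers(len(batch_As))]
    return abs(smooth_grad_coeff(M, A, gamma))

def p_hasgd(M, batch_As, delta, gamma=5.0):
    """HASGD-style probability: norm of the averaged minibatch gradient,
    scaled by an estimated upper bound delta so the result stays in [0, 1]."""
    G = sum(smooth_grad_coeff(M, A, gamma) * A for A in batch_As) / len(batch_As)
    return min(1.0, np.linalg.norm(G) / delta)
```

The cost difference reported in the experiments is visible here: `p_hrsgd` touches a single triplet matrix per batch, while `p_hasgd` accumulates the gradient over the whole batch before taking its norm.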
Table 1: Statistics for the ten datasets used in our empirical study.

  dataset  | # class | # feature | # train   | # test
  semeion  |      10 |       256 |     1,115 |    478
  dna      |       3 |       180 |     2,000 |  1,186
  isolet   |      26 |       617 |     6,238 |  1,559
  tdt30    |      30 |       200 |     6,575 |  2,819
  letter   |      26 |        16 |    15,000 |  5,000
  protein  |       3 |       357 |    17,766 |  6,621
  connect4 |       3 |        42 |    47,289 | 20,268
  sensit   |       3 |       100 |    78,823 | 19,705
  rcv20    |      20 |       200 |   477,141 | 14,185
  poker    |      10 |        10 | 1,000,000 | 25,010

Ten datasets are used to validate the effectiveness of the proposed algorithms. Table 1 summarizes the information of these datasets. Datasets dna, letter [?], protein and sensit [?] are downloaded from LIBSVM [?]. Datasets tdt30 and rcv20 are document corpora: tdt30 is the subset of the tdt2 data [?] comprised of the documents from the 30 most popular categories, and rcv20 is the subset of the large rcv1 dataset [?] consisting of documents from the 20 most popular categories. We reduce the dimensionality of these document datasets to 200 by principal component analysis (PCA). All the other datasets are downloaded directly from the UCI repository [?]. For most datasets used in this study, we use the standard training/testing split provided by the original dataset, except for datasets semeion, connect4 and tdt30. For these three datasets, we randomly select 70% of the data for training and use the remaining 30% for testing; experiments related to these three datasets are repeated ten times, and the prediction result averaged over the ten trials is reported. All experiments are conducted on a laptop with 8GB memory and two 2.50GHz Intel Core i5-2520M CPUs.
The parameter $\gamma$ in the loss function (2) is set according to the suggestion in [?]. We set the number of iterations (i.e., the number of triplet constraints) to 100,000. To construct a triplet constraint at each iteration $t$, we first randomly sample an example $x_i^t$ from the training data; we then find two of its nearest neighbors, $x_j^t$ and $x_k^t$, measured by the Euclidean distance, among the training examples, with $x_j^t$ sharing the same class label as $x_i^t$ and $x_k^t$ belonging to a different class. For MiniSGD and the hybrid approaches, we set the minibatch size to $b = 10$ as in [?], leading to a total of 10,000 iterations for these approaches. We evaluate the learned distance metric by the classification error of a $k$-NN classifier on the test data, where the number of nearest neighbors $k$ is set based on our experience.
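The triplet construction described above can be sketched as follows; the function returns indices rather than vectors, which is an illustrative choice, not the authors' code.

```python
import numpy as np

def sample_triplet(X, y, rng):
    """Sample indices (i, j, k): j is i's nearest same-class neighbor and
    k its nearest different-class neighbor, under the Euclidean distance."""
    i = int(rng.integers(len(X)))
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                              # exclude the anchor itself
    same = np.flatnonzero(y == y[i])
    diff = np.flatnonzero(y != y[i])
    j = int(same[np.argmin(d[same])])
    k = int(diff[np.argmin(d[diff])])
    return i, j, k
```

By construction, $(x_i, x_j, x_k)$ is a triplet in which the same-class neighbor should end up closer to the anchor than the different-class neighbor under the learned metric.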
The parameter $R$ in the proposed algorithms determines the domain size for the distance metric to be learned. We observe that the classification error of $k$-NN remains almost unchanged when $R$ is varied over a wide range. We thus fix a single value of $R$ for all the experiments. Another important parameter used by the proposed algorithms is the step size $\eta$. We evaluate the impact of the step size by measuring the classification error of a $k$-NN algorithm that uses the distance metric learned by the MiniSGD algorithm under different choices of $\eta$. By cross-validation, we observe that a single choice of $\eta$ yields a low classification error for almost all datasets. We thus fix this value of $\eta$ for the proposed algorithms in all the experiments.
In this experiment, we compare the performance of the proposed SGD algorithms for DML, i.e., MiniSGD, ASSGD and the two hybrid approaches (HRSGD and HASGD), to the full version of SGD for DML (SGD). We also include the Euclidean distance as a reference method in our comparison. Table 2 shows the classification error of $k$-NN using the distance metric learned by the different DML algorithms. First, it is not surprising to observe that all the distance metric learning algorithms improve the classification performance of $k$-NN compared to the Euclidean distance. Second, for almost all datasets, we observe that all the proposed DML algorithms (i.e., MiniSGD, ASSGD, HRSGD, and HASGD) yield classification performance similar to SGD, the full version of the SGD algorithm for DML. This result confirms that the proposed SGD algorithms are effective for DML despite the modifications we made to the SGD algorithm.
Table 3 summarizes the running time for the proposed DML algorithms and the SGD method. We note that the running time in Table 3 does not take into account the time for constructing triplet constraints, since it is shared by all the methods in comparison.
It is not surprising to observe that all the proposed SGD algorithms, including MiniSGD, ASSGD, HASGD and HRSGD, significantly reduce the running time of SGD. For instance, on the dataset isolet, it takes SGD more than 32,000 seconds to learn a distance metric, while the running time is reduced to less than 3,500 seconds when applying the proposed SGD algorithms, roughly a factor of 10 reduction in running time. Comparing the running time of ASSGD to that of MiniSGD, we observe that each method has its own advantage: ASSGD is more efficient on datasets semeion, dna, isolet, and tdt30, while MiniSGD is more efficient on the other six datasets. This is because different mechanisms are employed by ASSGD and MiniSGD to reduce the computational cost: ASSGD improves the computational efficiency of DML by skipping the constraints that are easy to classify, while MiniSGD improves the computational efficiency of SGD by performing the update of the distance metric once for multiple triplet constraints. Finally, we observe that the two hybrid approaches, which combine the strengths of both adaptive sampling and minibatch SGD, are computationally the most efficient on almost all datasets. We also observe that HRSGD appears to be more efficient than HASGD on six datasets and only loses on datasets protein, sensit and rcv20. This is because HRSGD computes the sampling probability based on one randomly sampled triplet, while HASGD needs to compute the average derivative over each minibatch of triplet constraints for the sampling probability.
To further examine the computational efficiency of the proposed SGD algorithms for DML, we summarize in Table 4 the number of updates performed by the different SGD algorithms. We observe that all the proposed SGD algorithms for DML are able to significantly reduce the number of updates compared to SGD. Comparing MiniSGD to ASSGD, we observe that on some datasets (e.g., semeion, dna, isolet, and tdt30), the number of updates performed by ASSGD is significantly smaller than that of MiniSGD, while it is the other way around on the other datasets. This is again due to the fact that ASSGD and MiniSGD deploy different mechanisms for reducing computational costs. As we expected, the two hybrid approaches are able to further reduce the number of updates performed by ASSGD and MiniSGD, making them the most efficient algorithms for DML.
By comparing the results in Table 3 to the results in Table 4, we observe that a small number of updates does NOT always guarantee a short running time. This is exhibited by the comparison between the two hybrid approaches: although HASGD performs a similar number of updates as HRSGD on datasets dna and isolet, it takes HASGD significantly longer to finish the computation than HRSGD. This is also exhibited by comparing the results across different datasets for a fixed method. For example, for the HASGD method, the number of updates for the protein dataset is nearly the same as that for the poker dataset, but the running time for the protein dataset is about 40 times longer than that for the poker dataset. This result may sound counterintuitive at first glance. But a more careful analysis reveals that, in addition to the number of updates, the running time of DML is also affected by the computational cost per iteration, which explains the consistency between Tables 3 and 4. In the case of the comparison between the two hybrid approaches, HASGD is subject to a higher computational cost per iteration than HRSGD because HASGD has to compute the norm of the average gradient over each minibatch, while HRSGD only needs to compute the derivative of one randomly sampled triplet constraint for each minibatch.
In the case of comparing the running time across different datasets, the protein dataset has a significantly higher dimensionality than the poker dataset (357 vs. 10), and is therefore subject to a higher computational cost per iteration, because the computational cost of projecting an updated distance metric onto the PSD cone increases at least quadratically with the dimensionality.
Table 2: Classification error (%) of $k$-NN using the distance metrics learned by different SGD methods, online learning algorithms and the batch learning approach for DML.

  dataset  | Euclidean | LMNN | LEGO | OASIS | SPML | SGD  | MiniSGD | ASSGD | HRSGD | HASGD
  semeion  |  8.7 |  9.0 | 11.9 |  8.3 |  6.3 |  6.3 |  6.5 |  6.3 |  6.4 |  6.2
  dna      | 20.7 |  6.2 |  9.3 | 16.6 |  9.1 |  8.6 |  9.4 |  8.4 |  8.1 |  8.1
  isolet   |  9.0 |  5.4 |  8.3 |  6.5 |  6.6 |  6.3 |  6.2 |  6.0 |  6.4 |  6.1
  tdt30    |  5.3 |  3.0 | 14.6 |  4.0 |  3.7 |  3.8 |  3.7 |  3.7 |  3.8 |  3.6
  letter   |  4.4 |  3.2 |  4.0 |  2.2 |  3.1 |  2.1 |  2.5 |  2.1 |  2.5 |  2.3
  protein  | 50.0 | 40.1 | 42.4 | 40.1 | 41.9 | 40.7 | 38.9 | 40.7 | 41.0 | 40.9
  connect4 | 29.5 | 21.1 | 25.8 | 22.1 | 24.5 | 20.1 | 20.1 | 20.1 | 22.2 | 20.4
  sensit   | 27.3 | 24.3 | 25.4 | 24.1 | 23.7 | 24.0 | 24.0 | 24.0 | 24.4 | 24.6
  rcv20    |  9.1 |  N/A |  8.9 |  8.6 |  8.9 |  8.5 |  8.7 |  8.4 |  8.4 |  8.6
  poker    | 38.0 |  N/A | 39.2 | 36.1 | 37.8 | 35.0 | 33.8 | 35.0 | 34.3 | 34.4

Table 3: Running time (seconds) for different SGD methods, online learning algorithms and the batch learning approach for DML. Note that LMNN, a batch DML algorithm, is mainly implemented in C, while the other algorithms in comparison are implemented in Matlab, which is usually less efficient than C.

  dataset  |    LMNN |    LEGO |   OASIS |    SPML |      SGD | MiniSGD |   ASSGD | HRSGD | HASGD
  semeion  |   112.7 |   355.8 |    29.1 |   206.6 |  2,172.4 |   263.2 |    45.2 |   7.4 |  42.4
  dna      |   255.9 |   330.2 |    39.1 |   122.1 |  1,165.3 |   121.0 |    30.6 |   7.1 |  28.0
  isolet   | 2,454.3 | 3,454.2 |   515.7 | 3,017.2 | 32,762.7 | 3,440.7 |   908.4 | 127.6 | 246.3
  tdt30    |   264.5 |   372.6 |    51.2 |   145.1 |  1,351.0 |   148.0 |   108.8 |  11.6 |  41.6
  letter   |   251.6 |    15.0 |    10.8 |     5.6 |     27.3 |     5.3 |    10.9 |   1.8 |   3.2
  protein  | 3,906.4 | 1,318.9 | 3,825.9 |   573.8 |  5,448.9 |   580.6 | 1,335.8 | 184.5 | 145.6
  connect4 |   540.2 |    23.1 |    79.0 |    16.4 |    109.6 |    15.9 |    60.5 |   8.0 |  6.97
  sensit   |10,481.2 |    93.3 |   303.9 |    44.3 |    365.4 |    41.3 |   243.9 |  26.2 |  17.9
  rcv20    |     N/A |   443.6 | 1,313.7 |   154.4 |  1,542.1 |   158.4 |   932.9 | 101.4 |  45.8
  poker    |     N/A |    17.3 |    17.6 |     5.8 |     21.0 |     4.5 |    13.5 |   2.8 |   3.4
Table 4: The number of updates for different SGD methods and online learning algorithms for DML.

  dataset  |      LEGO |  OASIS |   SPML |     SGD | MiniSGD |    ASSGD |   HRSGD |   HASGD
  semeion  |  71,142.4 |  432.7 | 10,000 | 100,000 |  10,000 |    142.2 |   101.4 |   162.8
  dna      |   140,027 |  2,042 | 10,000 | 100,000 |  10,000 |      707 |     351 |     372
  isolet   |   110,175 |  1,426 | 10,000 | 100,000 |  10,000 |    1,893 |     353 |     378
  tdt30    | 131,997.6 |2,284.6 | 10,000 | 100,000 |  10,000 |  5,563.7 |   567.6 |   784.6
  letter   |   130,794 | 28,063 | 10,000 | 100,000 |  10,000 |   12,931 |   1,398 |     457
  protein  |   166,384 | 64,804 | 10,000 | 100,000 |  10,000 |   22,127 |   3,064 |   1,623
  connect4 | 153,311.6 | 69,865 | 10,000 | 100,000 |  10,000 | 44,510.8 | 4,161.2 | 2,134.3
  sensit   |   162,869 | 78,223 | 10,000 | 100,000 |  10,000 |   60,028 |   5,675 |   1,281
  rcv20    |   137,246 | 88,476 | 10,000 | 100,000 |  10,000 |   60,708 |   6,095 |     779
  poker    |   179,714 | 71,620 | 10,000 | 100,000 |  10,000 |   43,259 |   4,111 |   1,635

We compare the proposed SGD algorithms to three state-of-the-art online algorithms and one batch method for DML:

SPML [?]: an online learning algorithm for DML that is based on minibatch SGD and the hinge loss,

OASIS [?]: a state-of-the-art online DML algorithm,

LEGO [?]: an online version of the information-theoretic DML algorithm of [?].
Finally, for sanity checking, we also compare the proposed SGD algorithms to LMNN [?], a state-of-the-art batch learning algorithm for DML.
Both SPML and OASIS use the same set of triplet constraints as the proposed SGD algorithms to learn a distance metric. Unlike SPML and OASIS, LEGO uses pairwise constraints for DML. For a fair comparison, we generate the pairwise constraints for LEGO by splitting each triplet constraint into two pairwise constraints: a must-link constraint and a cannot-link constraint. This splitting yields twice as many pairwise constraints for LEGO as there are triplet constraints. Finally, we note that since LMNN is a batch learning method, it is allowed to utilize any triplet constraint derived from the data, and is not restricted to the set of triplet constraints we generate for the SGD methods. All the baseline DML algorithms are implemented using code from the original authors, except for SPML, for which we made appropriate changes to the original code to avoid large matrix multiplications and improve computational efficiency. SPML, OASIS and LEGO are implemented in Matlab, while the core parts of LMNN are implemented in C, which is usually deemed more efficient than Matlab. The default parameters suggested by the original authors are used in the baseline algorithms. The step size of LEGO is set to a fixed value, as it was observed in [?] that the prediction performance of LEGO is in general insensitive to the step size. In all experiments, all the baseline methods set the initial distance metric to the identity matrix.
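The triplet-to-pairwise splitting used to feed LEGO can be sketched as follows; the function name and tuple layout are our own illustration, not the authors' code. Each triplet (i, j, k), meaning "i should be closer to j than to k", becomes one must-link pair and one cannot-link pair:

```python
def split_triplets(triplets):
    """Split each triplet constraint (i, j, k) into two pairwise
    constraints: a must-link pair (i, j, +1) and a cannot-link pair
    (i, k, -1), giving twice as many pairwise constraints in total."""
    pairs = []
    for i, j, k in triplets:
        pairs.append((i, j, +1))  # must-link: i and j should be close
        pairs.append((i, k, -1))  # cannot-link: i and k should be far apart
    return pairs
```

Applied to the constraint set generated for the SGD methods, this produces exactly the doubled pairwise constraint set given to LEGO.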
The classification error table summarizes the results of NN classification using the distance metrics learned by the four baseline algorithms. First, we observe that LEGO performs significantly worse than the proposed DML algorithms on five datasets: semeion, isolet, tdt30, connect4, and poker. This can be explained by the fact that LEGO uses pairwise constraints for DML while the other methods in comparison use triplet constraints. According to [?, ?, ?], triplet constraints are in general more effective than pairwise constraints. Second, although both SPML and MiniSGD are based on the minibatch strategy, SPML performs significantly worse than MiniSGD on three datasets, i.e. protein, connect4, and poker. The performance difference between SPML and MiniSGD can be explained by the fact that MiniSGD uses a smooth loss function while SPML uses the hinge loss. According to our analysis and the analysis in [?], using a smooth loss function is critical for the success of the minibatch strategy. Third, OASIS yields performance similar to the proposed algorithms on almost all datasets except semeion, dna and poker, on which it performs significantly worse. Overall, we conclude that the proposed DML algorithms yield performance similar to, if not better than, the state-of-the-art online learning algorithms for DML.
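To make the smooth-versus-hinge distinction concrete, the sketch below contrasts the (non-smooth) hinge loss with one common smooth surrogate, a Huber-style smoothed hinge; this particular surrogate is our illustrative choice, not necessarily the exact smooth loss used by MiniSGD:

```python
import numpy as np

def hinge(z):
    """Standard hinge loss: non-differentiable at z = 1."""
    return np.maximum(0.0, 1.0 - z)

def smoothed_hinge(z, gamma=1.0):
    """Huber-style smoothed hinge (illustrative): quadratic within
    gamma of the kink, linear below it, zero above it. Its gradient
    is Lipschitz-continuous, which the minibatch analysis relies on."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - gamma,
                    1.0 - z - gamma / 2.0,
                    (1.0 - z) ** 2 / (2.0 * gamma)))
```

Both losses agree far from the margin, but only the smooth version has a bounded-curvature gradient everywhere.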
Compared to LMNN, a state-of-the-art batch learning algorithm for DML, the proposed SGD algorithms yield similar performance on three datasets. They perform significantly better than LMNN on semeion and letter, and significantly worse on dna, isolet and tdt30. We attribute the difference in classification error to the fact that the proposed DML algorithms are restricted to randomly sampled triplet constraints, while LMNN is allowed to use all the triplet constraints that can be derived from the data. This restriction can sometimes limit classification performance, but at other times helps avoid overfitting. We also observe that LMNN is unable to run on the two large datasets rcv20 and poker, indicating that LMNN does not scale well to large datasets.
The running time and the number of updates of the baseline online DML algorithms are reported in the running time table and the update count table, respectively. It is not surprising that the three online DML algorithms are significantly more efficient than SGD in terms of both running time and number of updates. We also observe that MiniSGD and SPML share the same number of updates and similar running time on all datasets because they use the same minibatch strategy. Furthermore, compared to the three online DML algorithms, the two hybrid approaches are significantly more efficient in both running time and number of updates. Finally, since LMNN is implemented in C, it is not surprising that its running time is similar to that of the online DML algorithms on relatively small datasets. It is, however, significantly less efficient than the online learning algorithms on datasets of modest size (e.g. connect4 and sensit), and becomes computationally infeasible on the two large datasets rcv20 and poker. Overall, the two hybrid approaches are significantly more efficient than the other DML algorithms in comparison.
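The reason the number of updates tracks running time so closely is that every update ends with a projection onto the PSD cone, which requires a full eigendecomposition. A minimal sketch of that projection (standard eigenvalue clipping, not code from the paper):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues. The eigendecomposition costs O(d^3) per
    update, which is why reducing the update count matters."""
    M = (M + M.T) / 2.0              # symmetrize against numerical drift
    w, V = np.linalg.eigh(M)         # eigenvalues in ascending order
    w = np.maximum(w, 0.0)           # drop the negative part of the spectrum
    return (V * w) @ V.T
```

For example, projecting [[1, 2], [2, 1]] (eigenvalues 3 and -1) removes the negative eigenvalue and returns [[1.5, 1.5], [1.5, 1.5]].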
In this paper, we propose two strategies to improve the computational efficiency of SGD for DML: minibatch and adaptive sampling. The key idea of minibatch is to group multiple triplet constraints into a minibatch and update the distance metric only once per minibatch; the key idea of adaptive sampling is to perform stochastic updating, giving a difficult triplet constraint a greater chance of being used to update the distance metric than an easy one. We develop theoretical guarantees for both strategies. We also develop two hybrid variants that combine minibatch with adaptive sampling for more efficient DML. Our empirical study confirms that the proposed algorithms yield prediction performance similar to, if not better than, state-of-the-art online learning algorithms for DML, with significantly less running time. Since our empirical study is currently limited to datasets with a relatively small number of features, we plan to examine the effectiveness of the proposed algorithms for DML with high-dimensional data.
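The adaptive sampling idea above can be sketched as follows. This is an illustrative simplification under our own assumptions (hinge-style triplet loss, update probability clipped from the violation amount), not the paper's exact algorithm or step-size schedule; the point is that easy triplets trigger no update and hence no PSD projection:

```python
import numpy as np

def adaptive_sampling_sgd(X, triplets, eta=0.1, margin=1.0, seed=0):
    """Illustrative adaptive-sampling SGD for DML: for each triplet
    (i, j, k), update the metric M only with a probability that grows
    with how badly the triplet is violated, then project onto the
    PSD cone. Easy triplets are skipped entirely."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    M = np.eye(d)                                  # start from the identity
    for i, j, k in triplets:
        dij = (X[i] - X[j]) @ M @ (X[i] - X[j])    # metric distance to same-class point
        dik = (X[i] - X[k]) @ M @ (X[i] - X[k])    # metric distance to other-class point
        violation = margin + dij - dik             # > 0 means the triplet is violated
        p = min(1.0, max(0.0, violation))          # assumed sampling probability
        if rng.random() < p:
            # gradient of the triplet loss with respect to M
            g = (np.outer(X[i] - X[j], X[i] - X[j])
                 - np.outer(X[i] - X[k], X[i] - X[k]))
            M -= eta * g
            # project the updated metric back onto the PSD cone
            w, V = np.linalg.eigh((M + M.T) / 2.0)
            M = (V * np.maximum(w, 0.0)) @ V.T
    return M
```

A well-satisfied triplet yields p = 0 and the metric is returned untouched, so the expensive eigendecomposition is paid only for informative constraints.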
 [1] R. Bekkerman and M. Scholz. Data weaving: scaling up the state-of-the-art in data clustering. In CIKM, pages 1083–1092, 2008.
 [2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [3] D. Cai, X. Wang, and X. He. Probabilistic dyadic data analysis with local and global consistency. In ICML, pages 105–112, 2009.
 [4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 [5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011.
 [6] H. Chang and D.-Y. Yeung. Locally linear metric adaptation for semi-supervised clustering. In ICML, pages 153–160, 2004.
 [7] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.
 [8] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better minibatch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011.
 [9] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
 [10] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. J. Parallel Distrib. Comput., 64(7):826–838, 2004.
 [11] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
 [12] A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, page 451, 2005.
 [13] E. Hazan and S. Kale. Projection-free online learning. In ICML, 2012.
 [14] X. He, W.-Y. Ma, and H. Zhang. Learning an image manifold for retrieval. In ACM Multimedia, pages 17–23, 2004.
 [15] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Trans. on Neural Netw., 13(2):415–425, 2002.
 [16] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In NIPS, pages 761–768, 2008.
 [17] M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi. Stochastic gradient descent with only one projection. In NIPS, pages 503–511, 2012.
 [18] B. Shaw, B. C. Huang, and T. Jebara. Learning a distance metric from a network. In NIPS, pages 1899–1907, 2011.
 [19] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
 [20] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with sideinformation. In NIPS, pages 505–512, 2002.
 [21] L. Yang and R. Jin. Distance metric learning: a comprehensive survey. 2006.
 [22] J. Zhang, R. Jin, Y. Yang, and A. G. Hauptmann. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In ICML, pages 888–895, 2003.
 [23] T. Zhang and F. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5–31, 2001.
APPENDIX
The analysis for Theorem 1 is given in the supplementary document (https://sites.google.com/site/zljzju/Supplymentary.pdf); here we give the proof of Theorem 2. Define:
Using the Bernstein inequality for martingales [?], we have:
where . By setting , with a probability , the number of updates can be bounded as:
(7) Then, we give the regret bound. Using the standard analysis for online learning [?], we have:
Taking the sum from to , we have:
According to (7), with a probability , the second term can be bounded as:
(8) where .
Applying the Bernstein inequality for martingales [?] to the last term, we have, with a probability :
(9) 
