# Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)

Qi Qian, Rong Jin, Jinfeng Yi, Lijun Zhang and Shenghuo Zhu
Department of Computer Science and Engineering
Michigan State University, East Lansing, MI, 48824, USA
NEC Laboratories America, Cupertino, CA, 95014, USA
{qianqi, rongjin, yijinfen, zhanglij}@cse.msu.edu, zsh@nec-labs.com
###### Abstract

Distance metric learning (DML) is an important task that has found applications in many domains. The high computational cost of DML arises from the large number of variables to be determined and the constraint that a distance metric has to be a positive semi-definite (PSD) matrix. Although stochastic gradient descent (SGD) has been successfully applied to improve the efficiency of DML, it can still be computationally expensive because in order to ensure that the solution is a PSD matrix, it has to, at every iteration, project the updated distance metric onto the PSD cone, an expensive operation. We address this challenge by developing two strategies within SGD, i.e. mini-batch and adaptive sampling, to effectively reduce the number of updates (i.e., projections onto the PSD cone) in SGD. We also develop hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML. We prove the theoretical guarantees for both adaptive sampling and mini-batch based approaches for DML. We also conduct an extensive empirical study to verify the effectiveness of the proposed algorithms for DML.


Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning

General Terms: Algorithms, Experimentation

Distance metric learning (DML) is an important subject, and has found applications in many domains, including information retrieval [?], supervised classification [?], clustering [?], and semi-supervised clustering [?]. The objective of DML is to learn a distance metric consistent with a given set of constraints, namely minimizing the distances between pairs of data points from the same class and maximizing the distances between pairs of data points from different classes. The constraints are often specified in the form of must-links, where data points belong to the same class, and cannot-links, where data points belong to different classes. The constraints can also be specified in the form of triplets $(x_i, x_j, x_k)$ [?], in which $x_i$ and $x_j$ belong to a class different from that of $x_k$, and therefore $x_i$ and $x_j$ should be separated by a distance smaller than that between $x_i$ and $x_k$. In this work, we focus on DML using triplet constraints due to its encouraging performance [?, ?, ?].

The main computational challenge in DML arises from the restriction that the learned distance metric must be a positive semi-definite (PSD) matrix, which is often referred to as the PSD constraint. An early approach [?] addressed the PSD constraint by exploring the technique of semi-definite programming (SDP) [?], which unfortunately does not scale to large and high dimensional datasets. More recent approaches [?, ?] addressed this challenge by exploiting the techniques of online learning and stochastic optimization, particularly stochastic gradient descent (SGD), which only needs to deal with one constraint at each iteration. Although these approaches are significantly more efficient than the early approach, they share one common drawback: in order to ensure that the learned distance metric is PSD, these approaches require, at each iteration, projecting the updated distance metric onto the PSD cone. The projection step requires performing an eigen-decomposition of a given matrix, and is therefore computationally expensive (the cost is $O(d^2)$ if we only need to compute the top eigenvectors of the distance metric, and becomes $O(d^3)$ if all the eigenvalues and eigenvectors have to be computed for the projection step, where $d$ is the dimensionality of the data). As a result, the key challenge in developing efficient SGD algorithms for DML is how to reduce the number of projections without affecting the performance of DML.

A common approach for reducing the number of updates and projections in DML is to use a non-smooth loss function. A popular choice of non-smooth loss function is the hinge loss, whose derivative becomes zero when the input value exceeds a certain threshold. Many online learning algorithms for DML [?, ?, ?] take advantage of the non-smooth loss function to reduce the number of updates and projections. In [?], the authors proposed a structure preserving metric learning algorithm (SPML) that combines a mini-batch strategy with the hinge loss to further reduce the number of updates for DML. It groups multiple constraints into a mini-batch and performs only one update of the distance metric for each mini-batch. However, according to our empirical study, although SPML reduces the running time of the standard SGD algorithm, it yields significantly worse performance on several datasets, due to the deployment of the mini-batch strategy.

In this work, we first develop a new mini-batch based SGD algorithm for DML, termed Mini-SGD. Unlike SPML, which relies on the hinge loss, the proposed Mini-SGD algorithm uses a smooth loss function for DML. We show theoretically that by using a smooth loss function, Mini-SGD is able to achieve a convergence rate similar to that of the standard SGD algorithm, but with significantly fewer updates. The second contribution of this work is to develop a new strategy, termed adaptive sampling, for reducing the number of projections in DML. The key idea of adaptive sampling is to first measure the “difficulty” in classifying a constraint using the learned distance metric, and then perform stochastic updating based on the classification difficulty. More specifically, given the distance metric $M$ and triplet $(x_i, x_j, x_k)$, we first measure the difficulty in classifying the triplet by $|\ell'(\Delta(x_i, x_j, x_k; M))|$, where $\ell(\cdot)$ is the loss function that measures the classification error. We then sample a binary variable $Z$ with $\Pr(Z = 1) = |\ell'(\Delta(x_i, x_j, x_k; M))|$, and only update the distance metric when $Z = 1$. We refer to the proposed approach for DML as AS-SGD for short. Finally, we develop two hybrid approaches, termed HA-SGD and HR-SGD, that combine adaptive sampling with mini-batch to further improve the computational efficiency of SGD for DML. We conduct an extensive empirical study to verify the effectiveness and efficiency of the proposed algorithms for DML.

Many algorithms have been developed to learn a linear distance metric from pairwise constraints, where must-links include pairs of data points from the same class and cannot-links include pairs of data points from different classes (see [?] and references therein). Besides pairwise constraints, an alternative strategy is to learn a distance metric from a set of triplet constraints $(x_i, x_j, x_k)$, where $x_i$ is expected to be closer to $x_j$ than to $x_k$. Previous studies [?, ?, ?] showed that triplet constraints could be more effective for DML than pairwise constraints.

Several online algorithms have been developed to reduce the computational cost of DML [?, ?, ?, ?]. Most of these methods are based on stochastic gradient descent. At each iteration, they randomly sample one constraint, and update the distance metric based on the sampled constraint. The updated distance metric is further projected onto the PSD cone to ensure that it is PSD. Although these approaches are significantly more scalable than the batch learning algorithms for DML [?], they suffer from the high computational cost of the projection step that has to be performed at every iteration. A common approach for reducing the number of projections is to use a non-smooth loss function, such as the hinge loss. In addition, in [?], the authors proposed a structure preserving metric learning algorithm (SPML) that combines mini-batch with the hinge loss to further reduce the number of projections. The main problem with the approach proposed in [?] is that, according to the theory of mini-batch learning, it only works well with a smooth loss. Since the hinge loss is a non-smooth loss function, combining mini-batch with the hinge loss may result in suboptimal performance. This is verified by our empirical study, in which we observed that the distance metric learned by SPML performs significantly worse than that learned by the standard stochastic gradient descent method. We resolve this problem by presenting a new SGD algorithm for DML that combines mini-batch with a smooth loss, instead of the hinge loss.

Finally, it is worthwhile mentioning several recent studies that proposed to avoid projections in SGD. In [?], the authors developed a projection-free SGD algorithm that replaces the projection step with a constrained linear programming problem. In [?], the authors proposed an SGD algorithm with only one projection, performed at the end of the iterations. Unfortunately, the improvement of the two algorithms in computational efficiency is limited, because they require computing, at each iteration, the minimum eigenvalue and eigenvector of the updated distance metric, an operation with $O(d^2)$ cost, where $d$ is the dimensionality of the data.

We first review the basic framework of DML with triplet constraints. We then present two strategies to improve the computational efficiency of SGD for DML, one by mini-batch and one by adaptive sampling. We present the theoretical guarantees for both strategies, and defer more detailed analysis to the appendix. At the end of this section, we present two hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML.

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the domain for input patterns, where $d$ is the dimensionality. For the convenience of analysis, we assume all the input patterns have bounded norm. Given a distance metric $M$, the squared distance between $x_a$ and $x_b$, denoted by $\|x_a - x_b\|_M^2$, is measured by

$$\|x_a - x_b\|_M^2 = (x_a - x_b)^\top M (x_a - x_b)$$

Let $\Omega = \{M \succeq 0 : \|M\|_F \le R\}$ be the domain for the distance metric $M$, where $R$ specifies the domain size. Let $\mathcal{D} = \{(x_i^t, x_j^t, x_k^t),\ t = 1, \ldots, N\}$ be the set of triplet constraints used for DML, where $x_i^t$ is expected to be closer to $x_j^t$ than to $x_k^t$. Let $\ell(\cdot)$ be the convex loss function. Define $\Delta(x_i^t, x_j^t, x_k^t; M)$ as

$$\Delta(x_i^t, x_j^t, x_k^t; M) = \|x_i^t - x_k^t\|_M^2 - \|x_i^t - x_j^t\|_M^2 = \left\langle M,\ (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top \right\rangle = \langle M, A_t \rangle$$

where

$$A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top$$
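As a concrete illustration, $A_t$ and $\Delta$ take only a few lines of NumPy; this is a hypothetical helper sketch (the names `triplet_matrix` and `delta` are ours, not from the paper):

```python
import numpy as np

def triplet_matrix(xi, xj, xk):
    """A_t = (xi - xk)(xi - xk)^T - (xi - xj)(xi - xj)^T."""
    dk, dj = xi - xk, xi - xj
    return np.outer(dk, dk) - np.outer(dj, dj)

def delta(xi, xj, xk, M):
    """Delta(xi, xj, xk; M) = <M, A_t>, the margin of the triplet under M."""
    return np.sum(M * triplet_matrix(xi, xj, xk))  # Frobenius inner product
```

For a symmetric $M$, `delta` coincides with the direct computation $\|x_i - x_k\|_M^2 - \|x_i - x_j\|_M^2$.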

Given the triplet constraints in $\mathcal{D}$ and the domain $\Omega$, we learn an optimal distance metric $M_*$ by solving the following optimization problem

$$\min_{M \in \Omega}\ \mathcal{L}(M) = \frac{1}{N}\sum_{t=1}^N \ell\left(\Delta(x_i^t, x_j^t, x_k^t; M)\right) \tag{1}$$

The key idea of online DML is to update the distance metric based on one sampled constraint at each iteration. More specifically, at iteration $t$, it samples a triplet constraint $(x_i^t, x_j^t, x_k^t)$, and updates the distance metric $M_t$ to $M_{t+1}$ by

$$M_{t+1} = \Pi_\Omega\left(M_t - \eta\, \ell'\left(\Delta(x_i^t, x_j^t, x_k^t; M_t)\right) A_t\right)$$

where $\eta > 0$ is the step size, $\ell'(\cdot)$ is the derivative of the loss, and $\Pi_\Omega(\cdot)$ projects a matrix onto the domain $\Omega$. The following proposition shows that $\Pi_\Omega(M)$ can be computed in two steps, i.e., first projecting $M$ onto the PSD cone, and then scaling the projected matrix to fit the constraint $\|M\|_F \le R$.

###### Proposition 1

[?] We have

$$\Pi_\Omega(M) = \frac{1}{\max\left(\|M'\|_F / R,\ 1\right)}\, M'$$

where $M' = \Pi_{\mathrm{PSD}}(M)$ and $\Pi_{\mathrm{PSD}}(\cdot)$ projects a matrix onto the PSD cone.

As indicated by Proposition 1, $\Pi_\Omega(M)$ requires projecting the distance metric onto the PSD cone, an expensive operation that requires the eigen-decomposition of $M$.
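To make the two-step projection concrete, here is a minimal NumPy sketch of $\Pi_\Omega$ (our own illustration, not the authors' code): clip the negative eigenvalues, then rescale into the Frobenius ball of radius $R$.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing out
    negative eigenvalues; the eigen-decomposition is the costly step."""
    w, V = np.linalg.eigh((M + M.T) / 2)  # symmetrize for numerical safety
    return (V * np.maximum(w, 0.0)) @ V.T

def project_domain(M, R):
    """Pi_Omega(M) as in Proposition 1: PSD projection, then scaling
    so that the Frobenius norm does not exceed R."""
    Mp = project_psd(M)
    return Mp / max(np.linalg.norm(Mp, 'fro') / R, 1.0)
```

The scaling step is cheap; it is the eigen-decomposition inside `project_psd` that dominates the cost, which is exactly what the proposed algorithms try to avoid performing at every iteration.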

Finally, to bound both the regret and the number of updates, in this study, we approximate the hinge loss by a smooth loss function

$$\ell(z) = \frac{1}{L}\log\left(1 + \exp\left(-L(z - 1)\right)\right) \tag{2}$$

where $L > 0$ is a parameter that controls the approximation error: the larger the $L$, the closer $\ell(z)$ is to the hinge loss. Note that the smooth approximation of the hinge loss was first suggested in [?] for classification and was later verified by an empirical study in [?]. The key properties of the loss function in (2) are given in the following proposition.

###### Proposition 2

For the loss function defined in (2), we have

$$\forall z \in \mathbb{R}: \quad |\ell'(z)| \le 1, \qquad |\ell'(z)| \le L\,\ell(z)$$

Compared to the hinge loss function, the main advantage of the loss function in (2) is that it is a smooth loss function. As will be revealed by our analysis, it is the smoothness of the loss function that allows us to effectively explore both the mini-batch and adaptive sampling strategies for more efficient DML without having to sacrifice the prediction performance.
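The loss in (2) and its derivative are straightforward to implement, and the two bounds in Proposition 2 can be checked numerically; the sketch below is ours (function names are assumptions, not from the paper):

```python
import numpy as np

def smooth_loss(z, L):
    """ell(z) = (1/L) log(1 + exp(-L(z - 1))); logaddexp avoids overflow."""
    return np.logaddexp(0.0, -L * (z - 1.0)) / L

def smooth_loss_grad(z, L):
    """ell'(z) = -1 / (1 + exp(L(z - 1))), always in (-1, 0)."""
    return -1.0 / (1.0 + np.exp(L * (z - 1.0)))
```

For any $z$, $|\ell'(z)| \le 1$ and $|\ell'(z)| \le L\,\ell(z)$; as $L$ grows, $\ell(z)$ approaches the hinge loss $\max(0, 1 - z)$.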

Mini-batch SGD improves the computational efficiency of online DML by grouping multiple constraints into a mini-batch and only updating the distance metric once for each mini-batch. For brevity, we will refer to this algorithm as Mini-SGD in the rest of the paper.

Let $b$ be the batch size. At iteration $t$, it samples $b$ triplet constraints, denoted by

$$(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), \quad s = 1, \ldots, b,$$

and defines the mini-batch loss at iteration $t$ as

$$\ell_t(M_t) = \frac{1}{b}\sum_{s=1}^b \ell\left(\Delta(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t)\right)$$

Mini-batch DML updates the distance metric $M_t$ to $M_{t+1}$ using the gradient of the mini-batch loss function $\ell_t(M_t)$, i.e.,

$$M_{t+1} = \Pi_\Omega\left(M_t - \eta \nabla \ell_t(M_t)\right)$$

Algorithm 1 gives the detailed steps of Mini-SGD for DML, where step 5 uses Proposition 1 to compute the projection $\Pi_\Omega(\cdot)$.
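Putting the pieces together, the following NumPy sketch illustrates Algorithm 1 with the smooth loss (2). It is our own illustration, not the authors' implementation, and the parameter defaults are arbitrary:

```python
import numpy as np

def mini_sgd(triplets, d, b=10, eta=0.1, R=10.0, L=3.0):
    """Mini-SGD sketch: average the gradient over each mini-batch of b
    triplets, take one gradient step, then project onto Omega."""
    M = np.eye(d)
    for start in range(0, len(triplets) - b + 1, b):
        G = np.zeros((d, d))
        for xi, xj, xk in triplets[start:start + b]:
            A = np.outer(xi - xk, xi - xk) - np.outer(xi - xj, xi - xj)
            z = np.sum(M * A)                               # Delta = <M, A_t>
            G += -1.0 / (1.0 + np.exp(L * (z - 1.0))) * A   # ell'(z) * A_t
        M -= (eta / b) * G                                  # gradient step
        w, V = np.linalg.eigh((M + M.T) / 2)                # PSD projection
        M = (V * np.maximum(w, 0.0)) @ V.T
        M /= max(np.linalg.norm(M, 'fro') / R, 1.0)         # scale into ball
    return M
```

Note that only one projection (one eigen-decomposition) is performed per mini-batch of $b$ constraints, which is the source of the speedup over the standard SGD.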

The theorem below provides the theoretical guarantee for the Mini-SGD algorithm for DML using the smooth loss function defined in (2).

###### Theorem 1

Let $\bar{M}$ be the solution output by Algorithm 1 using the loss function defined in (2). Let $M_*$ be the optimal solution to (1). Assume $\|A_t\|_F \le A$ for any triplet constraint. For a fixed $\delta \in (0, 1)$, we have, with probability $1 - \delta$:

$$\mathcal{L}(\bar{M}) \le \frac{\mathcal{L}(M_*)}{1 - 3\eta L A^2} + \frac{b R^2}{2(1 - 3\eta L A^2)\eta N} + \frac{C_1 A^2 \eta}{(1 - 3\eta L A^2) N}\left[\log_2 \frac{N}{\delta b}\right]^2 \log\frac{m}{\delta} \tag{3}$$

where $m = \lceil \log_2(N^2) \rceil$ and $C_1$ is a universal constant.

Figure 1 shows the reduction in the training error over the number of triplet constraints by the Mini-SGD algorithm on three datasets (the information on these datasets can be found in the experimental section). Compared to the standard SGD algorithm, we observe that Mini-SGD converges to a similar value of training error, thus validating our theorem empirically.

Remark 1 We observe that the second term in the upper bound in (3) has a linear dependence on the mini-batch size $b$, implying that the larger the $b$, the less accurate the distance metric learned by Algorithm 1. Hence, by adjusting the mini-batch size $b$, we are able to make an appropriate tradeoff between prediction accuracy and computational efficiency: the smaller the $b$, the more accurate the distance metric, but with more updates and consequently higher computational cost. Finally, it is worthwhile comparing Theorem 1 to the theoretical result for a general mini-batch SGD algorithm given in [?], i.e.

$$\mathcal{L}(\bar{M}) \le \mathcal{L}(M_*) + O\left(\frac{1}{\sqrt{N}} + \frac{b^2}{N^2}\right) \tag{4}$$

It is clear that Theorem 1 gives a significantly better result when the optimal loss is small (i.e., when the triplet constraints can be well classified by the optimal distance metric $M_*$). In particular, when $\mathcal{L}(M_*) = 0$, the convergence rate given in Theorem 1 is on the order of $O(b/N)$, while the convergence rate in (4) is only $O(1/\sqrt{N})$.

We now develop a new approach for reducing the number of updates in SGD in order to improve the computational efficiency of DML. Instead of updating the distance metric at each iteration, the proposed strategy introduces a random binary variable $Z_t$ to decide if the distance metric will be updated given a triplet constraint $(x_i^t, x_j^t, x_k^t)$. More specifically, it computes the derivative $\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))$, and samples a random variable $Z_t \in \{0, 1\}$ with probability

$$\Pr(Z_t = 1) = \left|\ell'\left(\Delta(x_i^t, x_j^t, x_k^t; M_t)\right)\right|$$

The distance metric will be updated only when $Z_t = 1$. According to Proposition 2, we have $|\ell'(z)| \le \min(1, L\ell(z))$ for the smooth loss function given in (2), implying that a triplet constraint has a high chance of being used for updating the distance metric if it has a large loss. Therefore, the essential idea of the proposed adaptive sampling strategy is to give a large chance of updating the distance metric when the triplet is difficult to classify, and a low chance when the triplet can be classified correctly with a large margin. We note that an alternative strategy is to sample a triplet constraint based on its loss $\ell(\Delta(x_i^t, x_j^t, x_k^t; M_t))$. We did not choose the loss as the basis for updating because it is the derivative, not the loss, that is used by SGD for updating the distance metric. The detailed steps of adaptive sampling based SGD for DML are given in Algorithm 2. We refer to this algorithm as AS-SGD for short in the rest of this paper.
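A single adaptive-sampling step can be sketched as follows. This is our own illustration, with two assumptions stated up front: when an update fires, we use $\mathrm{sign}(\ell') A_t$ so that the expected step matches the full gradient step, and the projection onto $\Omega$ is left to the caller for brevity:

```python
import numpy as np

def as_sgd_step(M, xi, xj, xk, eta=0.1, L=3.0, rng=None):
    """One adaptive-sampling step (sketch): update only with probability
    |ell'(Delta)|; when updating, step along sign(ell') * A_t so the
    expected step is unbiased (projection onto Omega omitted here)."""
    rng = np.random.default_rng(rng)
    A = np.outer(xi - xk, xi - xk) - np.outer(xi - xj, xi - xj)
    z = np.sum(M * A)
    g = -1.0 / (1.0 + np.exp(L * (z - 1.0)))  # ell'(z), in (-1, 0)
    if rng.random() < abs(g):                 # Pr(Z_t = 1) = |ell'(z)|
        return M - eta * np.sign(g) * A, True
    return M, False                           # easy triplet: skip update
```

A well-separated (easy) triplet has $|\ell'| \approx 0$ and is almost never used for an update, while a badly violated triplet has $|\ell'| \approx 1$ and almost always triggers one.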

The theorem below provides the performance guarantee for AS-SGD. It also bounds the number of updates for AS-SGD.

###### Theorem 2

Let $\bar{M}$ be the solution output by Algorithm 2 using the loss function defined in (2). Let $M_*$ be the optimal solution to (1). Assume $\|A_t\|_F \le A$ for any triplet constraint. For a fixed $\delta \in (0, 1)$, we have, with probability $1 - \delta$:

$$\mathcal{L}(\bar{M}) \le \frac{\mathcal{L}(M_*)}{1 - 3\eta L A^2} + \frac{C_2}{(1 - 3\eta L A^2) N}\left(\frac{R^2}{\eta} + \eta + 1\right) \tag{5}$$

and

$$\sum_{t=1}^N Z_t \le \frac{3}{2} L \sum_{t=1}^N \ell(M_t) + \frac{5}{2}\ln\frac{m}{\delta} \tag{6}$$

where

$$C_2 = \max\left\{12 + 16\ln\frac{m}{\delta},\ \frac{5}{4}A^2\ln\frac{m}{\delta},\ R A \ln\frac{2m}{\delta}\right\}, \qquad m = \left\lceil \log_2(N^2) \right\rceil$$

Remark 2 The bound given in (5) shares a similar structure with that given in (3), except that it does not have a mini-batch size $b$ that can be used to trade off between the number of updates and the classification accuracy. The number of updates performed by Algorithm 2 is bounded by (6). The dominant term in (6) is $\frac{3}{2}L\sum_{t=1}^N \ell(M_t)$, implying that Algorithm 2 will have a small number of updates if the learned distance metrics can classify the triplet constraints correctly at most iterations. In other words, the fewer classification mistakes made by the learned distance metrics, the fewer updates will be performed by Algorithm 2. We validate the theorem by running the AS-SGD algorithm on three datasets. Figure 2 shows the reduction in the training error over the number of triplet constraints by AS-SGD and the standard SGD algorithm. We observe that AS-SGD converges to a similar value of training error as the full SGD algorithm.

Since mini-batch and adaptive sampling improve the computational efficiency of SGD from different aspects, it is natural to combine them for more efficient DML. Similar to the Mini-SGD algorithm, the hybrid approaches group multiple triplet constraints into a mini-batch. But, unlike Mini-SGD, which updates the distance metric for every mini-batch of constraints, the hybrid approaches follow the idea of adaptive sampling, and introduce a binary random variable $Z_t$ to decide whether the distance metric will be updated for a given mini-batch of constraints. By combining the strengths of mini-batch and adaptive sampling, the hybrid approaches are able to further improve the computational efficiency of DML. Algorithm 3 highlights the key steps of the hybrid approaches.

One of the key steps in the hybrid approaches (step 5 in Algorithm 3) is to choose an appropriate sampling probability $\gamma_t$ for each mini-batch of constraints. In this work, we study two different choices for the sampling probability $\gamma_t$:

• The first approach chooses $\gamma_t$ based on a triplet constraint randomly sampled from the mini-batch. More specifically, given a mini-batch of triplet constraints $(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}),\ s = 1, \ldots, b$, it randomly samples an index $s'$ in the range $[1, b]$. It then sets the sampling probability to be the absolute derivative for the randomly sampled triplet, i.e.,

$$\gamma_t = \left|\ell'\left(\Delta(x_i^{t,s'}, x_j^{t,s'}, x_k^{t,s'}; M_t)\right)\right|$$

We refer to this approach as HR-SGD.

• The second approach is based on an average case analysis. It sets the sampling probability as the average derivative, measured by the norm of the gradient $\nabla \ell_t(M_t)$, i.e.,

$$\gamma_t = \frac{1}{W}\left\|\nabla \ell_t(M_t)\right\|_F$$

where $W$ is a normalization factor ensuring $\gamma_t \in [0, 1]$ and is estimated by sampling. We refer to this approach as HA-SGD.
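The two choices of $\gamma_t$ can be sketched as follows (our own illustration; `W` in the HA-SGD sketch is the normalization constant, assumed to have been estimated beforehand by sampling):

```python
import numpy as np

def _grad_term(M, xi, xj, xk, L):
    """ell'(Delta) and A_t for a single triplet (shared helper)."""
    A = np.outer(xi - xk, xi - xk) - np.outer(xi - xj, xi - xj)
    z = np.sum(M * A)
    return -1.0 / (1.0 + np.exp(L * (z - 1.0))), A

def gamma_hr(batch, M, L=3.0, rng=None):
    """HR-SGD: gamma_t = |ell'| of one randomly sampled triplet."""
    rng = np.random.default_rng(rng)
    xi, xj, xk = batch[rng.integers(len(batch))]
    g, _ = _grad_term(M, xi, xj, xk, L)
    return abs(g)

def gamma_ha(batch, M, W, L=3.0):
    """HA-SGD: gamma_t = ||grad ell_t(M)||_F / W, capped at 1."""
    G = np.zeros_like(M)
    for xi, xj, xk in batch:
        g, A = _grad_term(M, xi, xj, xk, L)
        G += g * A
    return min(np.linalg.norm(G / len(batch), 'fro') / W, 1.0)
```

HR-SGD touches a single triplet per mini-batch and is therefore cheaper per decision, while HA-SGD evaluates the full mini-batch gradient; this mirrors the running-time difference between the two hybrids observed in the experiments.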

Ten datasets are used to validate the effectiveness of the proposed algorithms. Table 1 summarizes the information on these datasets. Datasets dna, letter [?], protein and sensit [?] are downloaded from LIBSVM [?]. Datasets tdt30 and rcv20 are document corpora: tdt30 is the subset of the tdt2 data [?] comprised of the documents from the 30 most popular categories, and rcv20 is the subset of the large rcv1 dataset [?] consisting of documents from the 20 most popular categories. We reduce the dimensionality of these document datasets by principal component analysis (PCA). All the other datasets are downloaded directly from the UCI repository [?]. For most datasets used in this study, we use the standard training/testing split provided by the original dataset, except for datasets semeion, connect4 and tdt30. For these three datasets, we randomly select part of the data for training and use the remainder for testing; experiments on these three datasets are repeated ten times, and the prediction result averaged over ten trials is reported. All experiments are run on a laptop with 8GB memory and two 2.50GHz Intel Core i5-2520M CPUs.

The parameter $L$ in the loss function (2) is set according to the suggestion in [?]. We fix the number of iterations $N$ (i.e., the number of triplet constraints) in advance. To construct a triplet constraint at iteration $t$, we first randomly sample an example $x_i^t$ from the training data; we then find two of its nearest neighbors $x_j^t$ and $x_k^t$, measured by the Euclidean distance, among the training examples, with $x_j^t$ sharing the same class label as $x_i^t$ and $x_k^t$ belonging to a class different from $x_i^t$. For Mini-SGD and the hybrid approaches, we set the mini-batch size $b$ as in [?], leading to a total of $N/b$ iterations for these approaches. We evaluate the learned distance metric by the classification error of $k$-NN on the test data, where the number of nearest neighbors $k$ is set based on our experience.
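The triplet construction just described can be sketched as follows (a hypothetical helper of ours; it assumes every class has at least two training examples):

```python
import numpy as np

def sample_triplet(X, y, rng):
    """Pick a random anchor, then its nearest same-class neighbor and
    nearest different-class neighbor under the Euclidean distance."""
    i = rng.integers(len(X))
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                         # exclude the anchor itself
    same = np.where(y == y[i])[0]
    diff = np.where(y != y[i])[0]
    j = same[np.argmin(d[same])]          # x_j: nearest, same class
    k = diff[np.argmin(d[diff])]          # x_k: nearest, other class
    return X[i], X[j], X[k]
```

By construction, the triplet asks the learned metric to keep the same-class neighbor closer to the anchor than the different-class neighbor.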

The parameter $R$ in the proposed algorithms determines the domain size for the distance metric to be learned. We observe that the classification error of $k$-NN remains almost unchanged when $R$ is varied over a wide range, and thus fix $R$ for all the experiments. Another important parameter used by the proposed algorithms is the step size $\eta$. We evaluate the impact of the step size by measuring the classification error of a $k$-NN algorithm that uses the distance metric learned by the Mini-SGD algorithm with different values of $\eta$. By cross-validation, we identify a step size that yields a low classification error for almost all datasets, and fix this $\eta$ for the proposed algorithms in all the experiments.

In this experiment, we compare the performance of the proposed SGD algorithms for DML, i.e., Mini-SGD, AS-SGD and the two hybrid approaches (HR-SGD and HA-SGD), to the full version of SGD for DML (SGD). We also include the Euclidean distance as a reference method in our comparison. Table 2 shows the classification error of $k$-NN using the distance metric learned by the different DML algorithms. First, it is not surprising to observe that all the distance metric learning algorithms improve the classification performance of $k$-NN compared to the Euclidean distance. Second, for almost all datasets, we observe that all the proposed DML algorithms (i.e., Mini-SGD, AS-SGD, HR-SGD, and HA-SGD) yield similar classification performance as SGD, the full version of the SGD algorithm for DML. This result confirms that the proposed SGD algorithms are effective for DML despite the modifications we made to the SGD algorithm.

Table 3 summarizes the running time for the proposed DML algorithms and the SGD method. We note that the running time in Table 3 does not take into account the time for constructing triplet constraints, since it is shared by all the methods in comparison.

It is not surprising to observe that all the proposed SGD algorithms, including Mini-SGD, AS-SGD, HA-SGD and HR-SGD, significantly reduce the running time of SGD. For instance, for the dataset isolet, the proposed SGD algorithms reduce the running time of SGD by a large factor. Comparing the running time of AS-SGD to that of Mini-SGD, we observe that each method has its own advantage: AS-SGD is more efficient on datasets semeion, dna, isolet, and tdt30, while Mini-SGD is more efficient on the other six datasets. This is because different mechanisms are employed by AS-SGD and Mini-SGD to reduce the computational cost: AS-SGD improves the computational efficiency of DML by skipping the constraints that are easy to classify, while Mini-SGD improves the computational efficiency of SGD by performing one update of the distance metric for multiple triplet constraints. Finally, we observe that the two hybrid approaches, which combine the strength of both adaptive sampling and mini-batch SGD, are computationally the most efficient for almost all datasets. We also observe that HR-SGD appears to be more efficient than HA-SGD on six datasets and only loses on datasets protein, sensit and rcv20. This is because HR-SGD computes the sampling probability based on one randomly sampled triplet, while HA-SGD needs to compute the average derivative over each mini-batch of triplet constraints for the sampling probability.

To further examine the computational efficiency of the proposed SGD algorithms for DML, we summarize in Table 4 the number of updates performed by the different SGD algorithms. We observe that all the proposed SGD algorithms for DML are able to reduce the number of updates significantly compared to SGD. Comparing Mini-SGD to AS-SGD, we observe that for some datasets (e.g., semeion, dna, isolet, and tdt30), the number of updates performed by AS-SGD is significantly less than that of Mini-SGD, while it is the other way around for the other datasets. This is again due to the fact that AS-SGD and Mini-SGD deploy different mechanisms for reducing computational costs. As we expect, the two hybrid approaches are able to further reduce the number of updates performed by AS-SGD and Mini-SGD, making them more efficient algorithms for DML.

We compare the proposed SGD algorithms to three state-of-the-art online algorithms and one batch method for DML:

• SPML [?]: an online learning algorithm for DML that is based on mini-batch SGD and the hinge loss,

• OASIS [?]: a state-of-the-art online DML algorithm,

• LEGO [?]: an online version of the information-theoretic DML algorithm [?].

Finally, for sanity checking, we also compare the proposed SGD algorithms to LMNN [?], a state-of-the-art batch learning algorithm for DML.

Both SPML and OASIS use the same set of triplet constraints to learn a distance metric as the proposed SGD algorithms. However, unlike SPML and OASIS, pairwise constraints are used by LEGO for DML. For a fair comparison, we generate the pairwise constraints for LEGO by splitting each triplet constraint $(x_i^t, x_j^t, x_k^t)$ into two pairwise constraints: a must-link constraint $(x_i^t, x_j^t)$ and a cannot-link constraint $(x_i^t, x_k^t)$. This splitting operation results in a total of $2N$ pairwise constraints for LEGO. Finally, we note that since LMNN is a batch learning method, it is allowed to utilize any triplet constraint derived from the data, and is not restricted to the set of triplet constraints we generate for the SGD methods. All the baseline DML algorithms are implemented using the code from the original authors, except for SPML, for which we made appropriate changes to the original code in order to avoid large matrix multiplications and improve the computational efficiency. SPML, OASIS and LEGO are implemented in Matlab, while the core parts of LMNN are implemented in C, which is usually deemed to be more efficient than Matlab. The default parameters suggested by the original authors are used in the baseline algorithms. The step size of LEGO is set following [?], where it was observed that the prediction performance of LEGO is in general insensitive to the step size. In all experiments, all the baseline methods set the initial distance metric to be an identity matrix.

Table 5 summarizes the classification results of $k$-NN using the distance metrics learned by the four baseline algorithms. First, we observe that LEGO performs significantly worse than the proposed DML algorithms on five datasets, including semeion, isolet, tdt30, connect4, and poker. This can be explained by the fact that LEGO uses pairwise constraints for DML, while the other methods in comparison use triplet constraints. According to [?, ?, ?], triplet constraints are in general more effective than pairwise constraints. Second, although both SPML and Mini-SGD are based on the mini-batch strategy, SPML performs significantly worse than Mini-SGD on three datasets, i.e., protein, connect4, and poker. The performance difference between SPML and Mini-SGD can be explained by the fact that Mini-SGD uses a smooth loss function while a hinge loss is used by SPML. According to our analysis and the analysis in [?], using a smooth loss function is critical for the success of the mini-batch strategy. Third, OASIS yields similar performance as the proposed algorithms for almost all datasets except for semeion, dna and poker, on which OASIS performs significantly worse. Overall, we conclude that the proposed DML algorithms yield similar, if not better, performance as the state-of-the-art online learning algorithms for DML.

Compared to LMNN, a state-of-the-art batch learning algorithm for DML, we observe that the proposed SGD algorithms yield similar performance on three datasets. They however perform significantly better than LMNN on datasets semeion and letter, and significantly worse on datasets dna, isolet and tdt30. We attribute the difference in classification error to the fact that the proposed DML algorithms are restricted to randomly sampled triplet constraints, while LMNN is allowed to use all the triplet constraints that can be derived from the data. The restriction in triplet constraints could sometimes limit the classification performance, but at other times help avoid the overfitting problem. We also observe that LMNN is unable to run on the two large datasets rcv20 and poker, indicating that LMNN does not scale well with the size of the dataset.

The running time and the number of updates of the baseline online DML algorithms can be found in Table 3 and Table 4, respectively. It is not surprising to observe that the three online DML algorithms are significantly more efficient than SGD in terms of both running time and the number of updates. We also observe that Mini-SGD and SPML share the same number of updates and similar running time for all datasets, because they use the same mini-batch strategy. Furthermore, compared to the three online DML algorithms, the two hybrid approaches are significantly more efficient in both running time and the number of updates. Finally, since LMNN is implemented in C, it is not surprising to observe that LMNN shares similar running time as the other online DML algorithms for relatively small datasets. It is however significantly less efficient than the online learning algorithms for datasets of modest size (e.g., connect4 and sensit), and becomes computationally infeasible for the two large datasets rcv20 and poker. Overall, we observe that the two hybrid approaches are significantly more efficient than the other DML algorithms in comparison.

In this paper, we propose two strategies to improve the computational efficiency of SGD for DML, i.e. mini-batch and adaptive sampling. The key idea of mini-batch is to group multiple triplet constraints into a mini-batch and update the distance metric only once per mini-batch; the key idea of adaptive sampling is to perform stochastic updating by giving a difficult triplet constraint a greater chance of being used to update the distance metric than an easy one. We develop theoretical guarantees for both strategies. We also develop two variants of hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML. Our empirical study confirms that the proposed algorithms yield prediction performance similar to, if not better than, state-of-the-art online learning algorithms for DML, but with significantly less running time. Since our empirical study is currently limited to datasets with a relatively small number of features, we plan to examine the effectiveness of the proposed algorithms for DML with high dimensional data.
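To make the two strategies concrete, the sketch below (not from the paper; the function names, the smoothed-hinge sampling probability, and all hyperparameters are illustrative assumptions) combines them in the hybrid fashion described above: each triplet is used with probability proportional to the magnitude of its loss derivative, and the expensive PSD projection is performed once per mini-batch rather than once per triplet.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_psd(M):
    # Projection onto the PSD cone: symmetrize and clip negative eigenvalues.
    # This eigendecomposition is the expensive step the paper seeks to avoid.
    M = (M + M.T) / 2.0
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def triplet_margin(M, xi, xj, xk, margin=1.0):
    # z > 0 means the constraint d(xi, xk) >= d(xi, xj) + margin is violated.
    d_ij = (xi - xj) @ M @ (xi - xj)
    d_ik = (xi - xk) @ M @ (xi - xk)
    return margin + d_ij - d_ik

def hybrid_sgd(triplets, d, eta=0.05, batch=5):
    # Mini-batch SGD with adaptive sampling: a triplet is used with probability
    # equal to the derivative magnitude of a smoothed hinge loss, and the PSD
    # projection is applied once per mini-batch instead of once per triplet.
    M = np.eye(d)
    grad, filled, n_proj = np.zeros((d, d)), 0, 0
    for xi, xj, xk in triplets:
        z = triplet_margin(M, xi, xj, xk)
        p = 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))  # |l'(M_t)| in (0, 1)
        if rng.random() >= p:
            continue  # easy triplet: skipped with high probability
        grad += np.outer(xi - xj, xi - xj) - np.outer(xi - xk, xi - xk)
        filled += 1
        if filled == batch:
            M = project_psd(M - eta * grad / batch)
            grad, filled, n_proj = np.zeros((d, d)), 0, n_proj + 1
    return M, n_proj
```

With a stream of N triplets, the number of projections is at most N / batch, and adaptive sampling typically reduces it further because well-satisfied triplets rarely trigger an update.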

• [1] R. Bekkerman and M. Scholz. Data weaving: scaling up the state-of-the-art in data clustering. In CIKM, pages 1083–1092, 2008.
• [2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• [3] D. Cai, X. Wang, and X. He. Probabilistic dyadic data analysis with local and global consistency. In ICML, pages 105–112, 2009.
• [4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• [5] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines. ACM TIST, 2(3):27, 2011.
• [6] H. Chang and D.-Y. Yeung. Locally linear metric adaptation for semi-supervised clustering. In ICML, pages 153–160, 2004.
• [7] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.
• [8] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011.
• [9] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
• [10] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. J. Parallel Distrib. Comput., 64(7):826–838, 2004.
• [11] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
• [12] A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, page 451, 2005.
• [13] E. Hazan and S. Kale. Projection-free online learning. In ICML, 2012.
• [14] X. He, W.-Y. Ma, and H. Zhang. Learning an image manifold for retrieval. In ACM Multimedia, pages 17–23, 2004.
• [15] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Trans. on Neural Netw., 13(2):415–425, 2002.
• [16] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In NIPS, pages 761–768, 2008.
• [17] M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi. Stochastic gradient descent with only one projection. In NIPS, pages 503–511, 2012.
• [18] B. Shaw, B. C. Huang, and T. Jebara. Learning a distance metric from a network. In NIPS, pages 1899–1907, 2011.
• [19] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
• [20] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2002.
• [21] L. Yang and R. Jin. Distance metric learning: a comprehensive survey. Technical report, Michigan State University, 2006.
• [22] J. Zhang, R. Jin, Y. Yang, and A. G. Hauptmann. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In ICML, pages 888–895, 2003.
• [23] T. Zhang and F. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5–31, 2001.

APPENDIX

The analysis for Theorem 1 is given in the supplementary document; here we prove Theorem 2. Define:

\[
C_N=\sum_{t=1}^{N}|\ell'(M_t)|,\quad X_t=Z_t-|\ell'(M_t)|,\quad \Lambda_N=\sum_{1\le t\le N}X_t,\quad K=\max_{1\le t\le N}X_t\le 1,
\]
\[
\sigma_N^2=\sum_{t=1}^{N}\mathrm{E}\big[(Z_t-|\ell'(M_t)|)^2\big]\le \sum_{t=1}^{N}|\ell'(M_t)|=C_N
\]
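As a numerical sanity check (not part of the paper; the function name and parameters below are illustrative), the quantities above can be simulated: if each $Z_t$ is a Bernoulli variable with mean $|\ell'(M_t)|$, as in the adaptive-sampling scheme, then $\Lambda_N$ should only rarely exceed the Bernstein-style threshold $2\sqrt{C_N\tau}+\sqrt{2}K\tau/3$.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_martingale(p, tau):
    """One run of the martingale above.

    p  : array of sampling probabilities |l'(M_t)| in [0, 1]
    tau: confidence parameter (tau = ln(m/delta) in the proof)
    Returns (Lambda_N, Bernstein-style threshold), using K <= 1.
    """
    Z = (rng.random(p.shape) < p).astype(float)   # Z_t ~ Bernoulli(p_t)
    C_N = p.sum()                                  # C_N = sum_t |l'(M_t)|
    Lam = (Z - p).sum()                            # Lambda_N = sum_t X_t
    thresh = 2.0 * np.sqrt(C_N * tau) + np.sqrt(2.0) * tau / 3.0
    return Lam, thresh
```

Over repeated runs, the fraction of runs with $\Lambda_N$ above the threshold should be far below $e^{-\tau}$.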

Using the Bernstein inequality for martingales [?], we have:

\[
\begin{aligned}
\Pr\left(\Lambda_N\ge 2\sqrt{C_N\tau}+\frac{\sqrt{2}K\tau}{3}\right)
&=\Pr\left(\Lambda_N\ge 2\sqrt{C_N\tau}+\frac{\sqrt{2}K\tau}{3},\ \sigma_N^2\le C_N,\ C_N\le N\right)\\
&\le \Pr\left(\Lambda_N\ge 2\sqrt{C_N\tau}+\frac{\sqrt{2}K\tau}{3},\ \sigma_N^2\le C_N,\ C_N\le \frac{1}{N}\right)\\
&\quad+\sum_{i=1}^{m}\Pr\left(\Lambda_N\ge 2\sqrt{C_N\tau}+\frac{\sqrt{2}K\tau}{3},\ \sigma_N^2\le C_N,\ \frac{2^{i-1}}{N}<C_N\le \frac{2^{i}}{N}\right)
\end{aligned}
\]

where $m=\lceil 2\log_2 N\rceil$, so that the intervals $(2^{i-1}/N,\,2^{i}/N]$ cover $(1/N,\,N]$. Bounding each term on the right-hand side by the Bernstein inequality and setting $\tau=\ln(m/\delta)$, with a probability $1-\delta$, the number of updates can be bounded as:

\[
\sum_{t=1}^{N}Z_t \le C_N+\frac{1}{2}C_N+2\ln\frac{m}{\delta}+\frac{\sqrt{2}}{3}K\ln\frac{m}{\delta}
\le \frac{3}{2}L\sum_{t=1}^{N}\ell(M_t)+\frac{5}{2}\ln\frac{m}{\delta} \qquad (7)
\]

Then, we give the regret bound. Using the standard analysis for online learning [?], we have:

\[
\begin{aligned}
\ell(M_t)-\ell(M^*) &\le \langle \ell'(M_t)A_t,\ M_t-M^*\rangle\\
&= \tau_t Z_t\langle A_t,\ M_t-M^*\rangle + (\ell'(M_t)-\tau_t Z_t)\langle A_t,\ M_t-M^*\rangle\\
&\le \frac{\|M_t-M^*\|_F^2-\|M_{t+1}-M^*\|_F^2}{2\eta}+\frac{\eta A^2 Z_t}{2} + \tau_t(|\ell'(M_t)|-Z_t)\langle A_t,\ M_t-M^*\rangle
\end{aligned}
\]

Taking the sum from $t=1$ to $N$, we have:

\[
\sum_{t=1}^{N}\ell(M_t)-\ell(M^*)\le \frac{\|M_1-M^*\|_F^2}{2\eta}+\frac{\eta A^2}{2}\sum_{t=1}^{N}Z_t+\sum_{t=1}^{N}2\tau_t(|\ell'(M_t)|-Z_t)RA
\]

According to (7), with a probability $1-\delta$, the second term can be bounded as:

\[
\frac{\eta A^2}{2}\sum_{t=1}^{N}Z_t \le \eta A^2\left(\frac{3}{4}L\sum_{t=1}^{N}\ell(M_t)+\frac{5}{4}\ln\frac{m}{\delta}\right)
\le \frac{3}{4}\gamma\sum_{t=1}^{N}\ell(M_t)+\frac{5}{4}\eta A^2\ln\frac{m}{\delta} \qquad (8)
\]

where $\gamma=\eta A^2 L$.

Applying the Bernstein inequality for martingales [?] to the last term, we have, with a probability $1-\delta$:

\[
\sum_{t=1}^{N}2\tau_t(|\ell'(M_t)|-Z_t)RA \le 4RA\sqrt{C_N\ln\frac{m}{\delta}}+\frac{2\sqrt{2}}{3}RA\ln\frac{m}{\delta}
\le \frac{\gamma}{4}\sum_{t=1}^{N}\ell(M_t)+\frac{16R^2}{\eta}\ln\frac{m}{\delta}+RA\ln\frac{m}{\delta} \qquad (9)
\]

Combining the bounds in (8) and (9), we have, with a probability $1-2\delta$:

\[
\sum_{t=1}^{N}\ell(M_t)-\ell(M^*)\le \frac{1}{2\eta}\left(R^2+32R^2\ln\frac{m}{\delta}\right)+\gamma\sum_{t=1}^{N}\ell(M_t)+\frac{5}{4}\eta A^2\ln\frac{m}{\delta}+RA\ln\frac{m}{\delta}
\]

Dividing both sides by $N$ and rearranging terms, we obtain:

\[
L(\bar{M})\le \frac{1}{1-\gamma}\left(L(M^*)+\frac{R^2 c}{\eta N}+\frac{\eta c}{N}+\frac{c}{N}\right)
\]

where

\[
c=\max\left\{\frac{1}{2}+16\ln\frac{m}{\delta},\ \frac{5}{4}A^2\ln\frac{m}{\delta},\ RA\ln\frac{m}{\delta}\right\}
\]

The proof is completed by setting the step size $\eta$ accordingly.
