Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

An extended abstract (9 pages) of an early version of this manuscript (arXiv:1409.3257) appeared in the Proceedings of The 32nd International Conference on Machine Learning (ICML), Lille, France, July 2015.

Abstract

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variable. An extrapolation step on the primal variable is performed to obtain an accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.

1 Introduction

We consider a generic convex optimization problem that arises often in machine learning: regularized empirical risk minimization (ERM) of linear predictors. More specifically, let $a_1, \ldots, a_n \in \mathbb{R}^d$ be the feature vectors of $n$ data samples, $\phi_i : \mathbb{R} \to \mathbb{R}$ be a convex loss function associated with the linear prediction $a_i^T x$, for $i = 1, \ldots, n$, and $g : \mathbb{R}^d \to \mathbb{R}$ be a convex regularization function for the predictor $x \in \mathbb{R}^d$. Our goal is to solve the following optimization problem:

$$\min_{x \in \mathbb{R}^d} \; \Big\{ P(x) \;\overset{\mathrm{def}}{=}\; \frac{1}{n} \sum_{i=1}^n \phi_i(a_i^T x) + g(x) \Big\}. \tag{1}$$

Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector $a_i$ is associated with a label $b_i \in \{+1, -1\}$. We obtain the linear SVM (support vector machine) by setting $\phi_i(z) = \max\{0,\, 1 - b_i z\}$ (the hinge loss) and $g(x) = (\lambda/2)\|x\|_2^2$, where $\lambda > 0$ is a regularization parameter. Regularized logistic regression is obtained by setting $\phi_i(z) = \log(1 + \exp(-b_i z))$. For linear regression problems, each feature vector $a_i$ is associated with a dependent variable $b_i \in \mathbb{R}$, and $\phi_i(z) = (1/2)(z - b_i)^2$. Then we get ridge regression with $g(x) = (\lambda/2)\|x\|_2^2$, and the Lasso with $g(x) = \lambda\|x\|_1$. Further background on regularized ERM in machine learning and statistics can be found, e.g., in the book [14].

We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples $n$ is very large. In this case, evaluating the full gradient or subgradient of the function $P(x)$ is very expensive, thus incremental methods that operate on a single component function $\phi_i$ at each iteration can be very attractive. There has been extensive research on incremental (sub)gradient methods (e.g. [44]) as well as variants of the stochastic gradient method (e.g., [52]). While the computational cost per iteration of these methods is only a small fraction, say $1/n$, of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notions of condition number and batch complexity.

1.1 Condition number and batch complexity

Let $\gamma$ and $\lambda$ be two positive real parameters. We make the following assumption: each $\phi_i$ is $(1/\gamma)$-smooth (i.e., its derivative is $(1/\gamma)$-Lipschitz continuous), and the function $g$ is $\lambda$-strongly convex.

For example, the logistic loss $\phi_i(z) = \log(1 + \exp(-b_i z))$ is $(1/4)$-smooth, the squared error $\phi_i(z) = (1/2)(z - b_i)^2$ is $1$-smooth, and the squared $\ell_2$-norm $g(x) = (\lambda/2)\|x\|_2^2$ is $\lambda$-strongly convex. The hinge loss and the $\ell_1$-regularization do not satisfy Assumption ?. Nevertheless, we can treat them using smoothing and strongly convex perturbations, respectively, so that our algorithm and theoretical framework still apply (see Section 3).

Under Assumption ?, the gradient of each component function, $\nabla \phi_i(a_i^T x)$, is also Lipschitz continuous, with Lipschitz constant $R^2/\gamma$, where $R = \max_i \|a_i\|_2$. In other words, each $\phi_i(a_i^T x)$ is $(R^2/\gamma)$-smooth. We define a condition number

$$\kappa \;=\; \frac{R^2}{\lambda\gamma},$$

and focus on ill-conditioned problems where $\kappa \gg 1$. In the statistical learning context, the regularization parameter $\lambda$ is usually on the order of $1/\sqrt{n}$ or $1/n$ (e.g., [7]), thus $\kappa$ is on the order of $\sqrt{n}$ or $n$. It can be even larger if the strong convexity in $g$ is added purely for numerical regularization purposes (see Section 3). We note that the actual conditioning of problem (1) may be better than $\kappa$, if the empirical loss function by itself is strongly convex. In those cases, our complexity estimates in terms of $\kappa$ can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.

Let $P^\star$ be the optimal value of problem (1), i.e., $P^\star = \min_x P(x)$. In order to find an approximate solution $\hat{x}$ satisfying $P(\hat{x}) - P^\star \le \epsilon$, the classical full gradient method and its proximal variants require $O((1+\kappa)\log(1/\epsilon))$ iterations (e.g., [25]). Accelerated full gradient (AFG) methods [25] enjoy the improved iteration complexity $O((1+\sqrt{\kappa})\log(1/\epsilon))$. However, each iteration of these batch methods requires a full pass over the dataset, computing the gradient of each component function and forming their average, which costs $O(nd)$ operations (assuming the feature vectors are dense). In contrast, the stochastic gradient method and its proximal variants operate on one single component $\phi_i$ (chosen randomly) at each iteration, which only costs $O(d)$. But their iteration complexities are far worse. Under Assumption ?, it takes them $O(1/(\lambda\epsilon))$ iterations to find an $\hat{x}$ such that $\mathbb{E}[P(\hat{x}) - P^\star] \le \epsilon$, where the expectation is with respect to the random choices made at all the iterations (see, e.g., [32]).

To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision $\epsilon$. We call this measure the batch complexity, which is usually obtained by dividing the iteration complexity by $n$. For example, the batch complexity of the stochastic gradient method is $O(1/(n\lambda\epsilon))$. The batch complexities of full gradient methods are the same as their iteration complexities.

By carefully exploiting the finite average structure in (1) and other similar problems, several recent works [36] proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity $O((n+\kappa)\log(1/\epsilon))$. Since their computational cost per iteration is $O(d)$, the equivalent batch complexity is $1/n$ of their iteration complexity, i.e., $O((1+\kappa/n)\log(1/\epsilon))$. This complexity has much weaker dependence on $n$ than the full gradient methods, and also much weaker dependence on $\epsilon$ than the stochastic gradient methods.

In this paper, we propose a stochastic primal-dual coordinate (SPDC) method, which has the iteration complexity

$$O\big(\big(n + \sqrt{\kappa n}\,\big)\log(1/\epsilon)\big),$$

or equivalently, the batch complexity

$$O\big(\big(1 + \sqrt{\kappa/n}\,\big)\log(1/\epsilon)\big).$$

When $\kappa > n$, this is lower than the $O((1+\kappa/n)\log(1/\epsilon))$ batch complexity mentioned above. Indeed, it is very close to a lower bound for minimizing finite sums recently established in [1].

1.2 Outline of the paper

Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function $\phi_i(a_i^T x)$ through convex conjugation, i.e.,

$$\phi_i(a_i^T x) \;=\; \sup_{y_i \in \mathbb{R}} \big\{ y_i \langle a_i, x\rangle - \phi_i^*(y_i) \big\},$$

where $\phi_i^*(y_i) = \sup_\alpha \{\alpha y_i - \phi_i(\alpha)\}$ is the convex conjugate of $\phi_i$, and $\langle a_i, x\rangle$ denotes the inner product of $a_i$ and $x$ (which is the same as $a_i^T x$, but is more convenient for later presentation). This leads to a convex-concave saddle point problem

$$\min_{x \in \mathbb{R}^d} \, \max_{y \in \mathbb{R}^n} \; \Big\{ f(x, y) \;\overset{\mathrm{def}}{=}\; \frac{1}{n}\sum_{i=1}^n \big( y_i \langle a_i, x\rangle - \phi_i^*(y_i) \big) + g(x) \Big\}. \tag{2}$$

Under Assumption ?, each $\phi_i^*$ is $\gamma$-strongly convex (since $\phi_i$ is $(1/\gamma)$-smooth; see, e.g., [15]) and $g$ is $\lambda$-strongly convex. As a consequence, the saddle point problem (2) has a unique solution, which we denote by $(x^\star, y^\star)$.
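As a concrete illustration (a worked example of our own, not part of the original development), the conjugate of the squared loss from Section 1 can be computed in closed form:

$$\phi_i(z) = \tfrac{1}{2}(z - b_i)^2 \quad\Longrightarrow\quad \phi_i^*(y_i) = \sup_{z}\big\{ y_i z - \tfrac{1}{2}(z - b_i)^2 \big\} = \tfrac{1}{2} y_i^2 + b_i y_i,$$

with the supremum attained at $z = b_i + y_i$. Note that $\phi_i^*$ is $1$-strongly convex, matching the $1$-smoothness of $\phi_i$, as predicted by the conjugacy correspondence.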

In Section 2, we present the SPDC method as well as its convergence analysis. It alternates between maximizing $f$ over a randomly chosen dual coordinate $y_i$ and minimizing $f$ over the primal variable $x$. In order to accelerate the convergence, an extrapolation step is applied in updating the primal variable $x$. We also give a mini-batch SPDC algorithm which is well suited for parallel computing.

In Section 3 and Section 4, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption ? does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is a SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as $O((1+\sqrt{\kappa/n})\log(1/\epsilon))$, but with the maximum feature norm $R$ replaced by the average norm $(1/n)\sum_{i=1}^n \|a_i\|_2$ in the definition of the condition number, which can make it much smaller when there is considerable variation in the norms $\|a_i\|_2$.

In Section 5, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock [9]. We also discuss two very recent works [38, 20] which achieve the same batch complexity $O((1+\sqrt{\kappa/n})\log(1/\epsilon))$.

In Section 6, we discuss efficient implementation of the SPDC method when the feature vectors are sparse. We focus on two popular cases: when $g$ is a squared $\ell_2$-norm penalty and when $g$ is an $\ell_1 + \ell_2$ penalty. We show that the computational cost per iteration of SPDC only depends on the number of non-zero elements in the feature vectors.

In Section 7, we present experimental results comparing SPDC with several state-of-the-art optimization methods, including both batch algorithms and randomized incremental and coordinate gradient methods. In all scenarios we tested, SPDC has comparable or better performance.

2 The SPDC method

In this section, we describe and analyze the Stochastic Primal-Dual Coordinate (SPDC) method. The basic idea of SPDC is quite simple: to approach the saddle point of $f(x,y)$ defined in (2), we alternately maximize $f$ with respect to $y$, and minimize $f$ with respect to $x$. Since the dual vector $y$ has $n$ coordinates and each coordinate is associated with a feature vector $a_i \in \mathbb{R}^d$, maximizing $f$ with respect to $y$ takes $O(nd)$ computation, which can be very expensive if $n$ is large. We reduce the computational cost by randomly picking a single coordinate of $y$ at a time, and maximizing $f$ only with respect to this coordinate. Consequently, the computational cost of each iteration is $O(d)$.

We give the details of the SPDC method in Algorithm ?. The dual coordinate update and primal vector update are given in the two proximal steps of the algorithm. Instead of maximizing over $y_i$ and minimizing over $x$ directly, we add two quadratic regularization terms to penalize $y_i^{(t+1)}$ and $x^{(t+1)}$ from deviating from $y_i^{(t)}$ and $x^{(t)}$. The parameters $\sigma$ and $\tau$ control their regularization strength, which we will specify in the convergence analysis (Theorem ?). Moreover, we introduce two auxiliary variables $u^{(t)}$ and $\bar{x}^{(t)}$. From the initialization $u^{(0)} = (1/n)\sum_{i=1}^n a_i y_i^{(0)}$ and the update rules, we have

$$u^{(t)} \;=\; \frac{1}{n}\sum_{i=1}^n a_i\, y_i^{(t)} \quad \text{for all } t \ge 0.$$

The extrapolation step $\bar{x}^{(t+1)} = x^{(t+1)} + \theta\,(x^{(t+1)} - x^{(t)})$ is similar to Nesterov's acceleration technique [25], and yields a faster convergence rate.
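To make the iteration concrete, the following is a minimal sketch (our own illustration, not the paper's reference implementation) of basic SPDC with $m = 1$ for the ridge-regularized squared loss, for which both proximal steps have closed forms. The step-size constants follow the theory informally and should be treated as assumptions.

    import numpy as np

    def spdc_ridge(A, b, lam, n_iters=100000, seed=0):
        # Sketch of basic SPDC (m = 1) for
        #   min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||_2^2.
        # The squared loss is 1-smooth (gamma = 1), and its conjugate is
        # phi_i*(y) = y^2/2 + b_i*y, so both proximal steps are closed-form.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        R = np.linalg.norm(A, axis=1).max()
        gamma = 1.0
        tau = np.sqrt(gamma / (n * lam)) / (2 * R)             # primal step size
        sigma = np.sqrt(n * lam / gamma) / (2 * R)             # dual step size
        theta = 1 - 1 / (n + R * np.sqrt(n / (lam * gamma)))   # extrapolation weight

        x = np.zeros(d)
        x_bar = np.zeros(d)
        y = np.zeros(n)
        u = np.zeros(d)                  # maintains u = (1/n) * sum_i y_i * a_i
        for _ in range(n_iters):
            i = rng.integers(n)
            a_i = A[i]
            # dual coordinate step: argmax_beta  beta*<a_i, x_bar> - phi_i*(beta)
            #                        - (beta - y_i)^2 / (2*sigma)
            y_new = (y[i] + sigma * (a_i @ x_bar - b[i])) / (1 + sigma)
            # primal step: prox of the ridge penalty along the adjusted direction
            v = u + (y_new - y[i]) * a_i
            x_new = (x - tau * v) / (1 + lam * tau)
            # bookkeeping: dual average, extrapolation, state
            u += (y_new - y[i]) * a_i / n
            x_bar = x_new + theta * (x_new - x)
            x, y[i] = x_new, y_new
        return x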

The Mini-Batch SPDC method in Algorithm ? is a natural extension of SPDC in Algorithm ?. The difference between these two algorithms is that the Mini-Batch SPDC method may simultaneously select more than one dual coordinate to update. Let $m$ be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices $K \subseteq \{1,\ldots,n\}$ of size $m$, such that the probability of each index being picked is equal to $m/n$. The following is a simple procedure to achieve this (see the sketch after this paragraph). First, partition the set of indices $\{1,\ldots,n\}$ into $m$ disjoint subsets, so that the cardinality of each subset is equal to $n/m$ (assuming $m$ divides $n$). Then, during each iteration, randomly select a single index from each subset and add it to $K$. Other approaches for mini-batch selection are also possible; see the discussions in [34].
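A minimal sketch of this partition-based selection, under our own naming conventions:

    import numpy as np

    def minibatch_indices(n, m, rng):
        # Split {0, ..., n-1} into m consecutive blocks of size n/m (m must
        # divide n) and draw one index uniformly from each block, so every
        # index is selected with probability m/n. A random permutation could
        # be applied beforehand if a non-consecutive partition is preferred.
        assert n % m == 0, "m must divide n"
        size = n // m
        return np.array([k * size + rng.integers(size) for k in range(m)])

    # example: K = minibatch_indices(n=1000, m=8, rng=np.random.default_rng(0))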

In Algorithm ?, we also switch the order of the primal update and the extrapolation step (compared with Algorithm ?), to better illustrate that the extrapolated point is obtained from the two most recent primal iterates. However, this form is not recommended in implementation, because the primal iterates are usually dense vectors even if the feature vectors are sparse. Details on efficient implementation of SPDC are given in Section 6. In the following discussion, we do not make sparseness assumptions.

With a single processor, each iteration of Algorithm ? takes $O(md)$ time to accomplish. Since the updates of the coordinates $y_k$ for $k \in K$ are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use $m$ processors to update the $m$ coordinates in the subset $K$ in parallel, then aggregate them to update $x$. In terms of wall-clock time, each iteration takes $O(d)$ time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.

2.1 Convergence analysis

Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with $m = 1$, we only present a convergence theorem for the mini-batch version. The expectations in the following results are taken with respect to the random variables $K_1, \ldots, K_t$, where $K_t$ denotes the random subset picked at the $t$-th iteration of the SPDC method.

The proof of Theorem ? is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an -accurate solution.

By Theorem ?, for each , we have and . To obtain , it suffices to ensure that , which is equivalent to

Applying the inequality to the denominator above completes the proof.

Recall the definition of the condition number $\kappa = R^2/(\lambda\gamma)$. Corollary ? establishes that the iteration complexity of the Mini-Batch SPDC method for achieving an expected precision $\epsilon$ is

$$O\Big(\big(\tfrac{n}{m} + \sqrt{\tfrac{\kappa n}{m}}\,\big)\log(1/\epsilon)\Big).$$

So a larger batch size $m$ leads to a smaller number of iterations. In the extreme case of $m = n$, we obtain a full batch algorithm, which has iteration or batch complexity $O((1+\sqrt{\kappa})\log(1/\epsilon))$. This complexity is also shared by the AFG methods [25] (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock [9] (see discussions on related work in Section 5).

Since an equivalent pass over the dataset corresponds to $n/m$ iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is

$$O\Big(\big(1 + \sqrt{\tfrac{\kappa m}{n}}\,\big)\log(1/\epsilon)\Big).$$

The above expression implies that a smaller batch size $m$ leads to a smaller number of passes through the data. In this sense, the basic SPDC method with $m = 1$ is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size $m$ that matches the number of parallel processors available.

2.2 Convergence rate of primal-dual gap

In the previous subsection, we established the iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (2), more specifically, to meet the requirement in Theorem ?. Next we show that it has the same order of complexity in reducing the primal-dual objective gap $P(x) - D(y)$, where $P(x)$ is defined in (1) and

$$D(y) \;\overset{\mathrm{def}}{=}\; \min_{x \in \mathbb{R}^d} f(x, y) \;=\; -\frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i) \;-\; g^*\Big(-\frac{1}{n}\sum_{i=1}^n y_i a_i\Big),$$

where $g^*$ is the conjugate function of $g$.
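For instance, here is a minimal sketch (our own, under the ridge/squared-loss instantiation used earlier, where $\phi_i^*(y_i) = y_i^2/2 + b_i y_i$ and $g^*(z) = \|z\|_2^2/(2\lambda)$) of evaluating this gap to monitor convergence:

    import numpy as np

    def primal_dual_gap(A, b, lam, x, y):
        # Evaluate P(x) - D(y) for the ridge / squared-loss instance.
        n = A.shape[0]
        P = 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * (x @ x)
        z = -(A.T @ y) / n                    # argument of g* in D(y)
        D = -np.mean(0.5 * y ** 2 + b * y) - (z @ z) / (2 * lam)
        return P - D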

Under Assumption ?, the function $f(x,y)$ defined in (2) has a unique saddle point $(x^\star, y^\star)$, and

$$P(x^\star) \;=\; f(x^\star, y^\star) \;=\; D(y^\star).$$

However, in general, for any point $(x, y)$, we only have

$$P(x) \;\ge\; f(x, y^\star) \qquad \text{and} \qquad D(y) \;\le\; f(x^\star, y).$$

Thus the result in Theorem ? does not translate directly into a convergence bound on the primal-dual gap. We need to bound the primal and dual suboptimality by the distances $\|x^{(t)} - x^\star\|_2$ and $\|y^{(t)} - y^\star\|_2$, respectively, in the opposite directions. For this purpose, we need the following lemma, which we extracted from [51]. We provide the proof in Appendix B for completeness.

The function $f(x, y^\star)$ is strongly convex in $x$ with parameter $\lambda$, and $x^\star$ is the minimizer. Similarly, $-f(x^\star, y)$ is strongly convex in $y$ with parameter $\gamma/n$, and is minimized by $y^\star$. Therefore,

We bound the following weighted primal-dual gap

The first inequality above is due to Lemma ?, the second and fourth inequalities are due to the definition of the primal-dual gap, and the third inequality is due to the saddle-point property of $(x^\star, y^\star)$. Taking expectations on both sides of the above inequality, then applying Theorem ?, we obtain

Since and , this implies the desired result.

3 Extensions to non-smooth or non-strongly convex functions

The complexity bounds established in Section 2 require each $\phi_i$ be $(1/\gamma)$-smooth, and the function $g$ be $\lambda$-strongly convex. For general loss functions where either or both of these conditions fail (e.g., the hinge loss and $\ell_1$-regularization), we can slightly perturb the saddle-point function $f(x,y)$ so that the SPDC method can still be applied.

To be concise, we only consider the case where neither $\phi_i$ is smooth nor $g$ is strongly convex. Formally, we assume that each $\phi_i$ and $g$ are convex and Lipschitz continuous, and $f(x,y)$ has a saddle point $(x^\star, y^\star)$. We choose a scalar $\delta > 0$ and consider the modified saddle-point function:

$$f_\delta(x, y) \;\overset{\mathrm{def}}{=}\; \frac{1}{n}\sum_{i=1}^n \Big( y_i\langle a_i, x\rangle - \phi_i^*(y_i) - \frac{\delta}{2} y_i^2 \Big) \;+\; g(x) \;+\; \frac{\delta}{2}\|x\|_2^2.$$

Denote by $(x_\delta^\star, y_\delta^\star)$ the saddle point of $f_\delta$. We employ the Mini-Batch SPDC method (Algorithm ?) to approximate $(x_\delta^\star, y_\delta^\star)$, treating $\phi_i^*(y_i) + (\delta/2)y_i^2$ as $\phi_i^*$ and $g(x) + (\delta/2)\|x\|_2^2$ as $g$, which are all $\delta$-strongly convex. We note that adding a strongly convex perturbation on $\phi_i^*$ is equivalent to smoothing $\phi_i$, which becomes $(1/\delta)$-smooth (see, e.g., [26]). Letting $\gamma = \lambda = \delta$, the parameters $\tau$, $\sigma$ and $\theta$ in Theorem ? become functions of $\delta$.

Although $(x_\delta^\star, y_\delta^\star)$ is not exactly the saddle point of $f$, the following corollary shows that applying the SPDC method to the perturbed function $f_\delta$ effectively minimizes the original loss function $P$. Similar results for the convergence of the primal-dual gap can also be established.

Let be a shorthand notation. We have

Here, equations (i) and (vii) use the definition of the perturbed function $f_\delta$, inequalities (ii) and (v) use the definition of the function $f$, inequalities (iii) and (iv) use the fact that $(x^\star, y^\star)$ is the saddle point of $f$, and inequality (vi) is due to the fact that $(x_\delta^\star, y_\delta^\star)$ is the saddle point of $f_\delta$.

Since each $\phi_i$ is Lipschitz continuous, the domain of $\phi_i^*$ is contained in a bounded interval, which yields a uniform bound on the dual variables (see, e.g., [38]). Thus, we have

On the other hand, since is -Lipschitz continuous, Theorem ? implies

Combining and , in order to obtain , it suffices to have and

The corollary is established by finding the smallest that satisfies inequality .

Table 1: Iteration complexities of the SPDC method under different assumptions on the functions $\phi_i$ and $g$. For the last three cases, we solve the perturbed saddle-point problem with an appropriately chosen $\delta$; the $\widetilde{O}(\cdot)$ notation hides logarithmic factors.

$\phi_i$ | $g$ | iteration complexity
$(1/\gamma)$-smooth | $\lambda$-strongly convex | $O\big((n + \sqrt{\kappa n})\log(1/\epsilon)\big)$
$(1/\gamma)$-smooth | non-strongly convex | $\widetilde{O}\big(n + \sqrt{n/(\gamma\epsilon)}\big)$
non-smooth | $\lambda$-strongly convex | $\widetilde{O}\big(n + \sqrt{n/(\lambda\epsilon)}\big)$
non-smooth | non-strongly convex | $\widetilde{O}\big(n + \sqrt{n}/\epsilon\big)$

There are two other cases that can be considered: when $\phi_i$ is not smooth but $g$ is strongly convex, and when $\phi_i$ is smooth but $g$ is not strongly convex. They can be handled with the same technique described above, and we omit the details here. In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an $\epsilon$-optimal solution of problem (1) under various assumptions. Similar results are also obtained in [38].
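To make the smoothing step concrete, here is a worked example of our own (not from the original text): perturbing the conjugate of the hinge loss. For $\phi(z) = \max\{0,\, 1-z\}$ (taking $b_i = 1$ for simplicity), the conjugate is $\phi^*(y) = y$ for $y \in [-1, 0]$ and $+\infty$ otherwise. Adding the perturbation $(\delta/2)y^2$ and conjugating back gives the familiar smoothed (Huberized) hinge:

$$\phi_\delta(z) \;=\; \max_{y \in [-1,0]} \Big\{ yz - y - \frac{\delta}{2}y^2 \Big\} \;=\; \begin{cases} 0 & z \ge 1, \\ \frac{(1-z)^2}{2\delta} & 1-\delta \le z < 1, \\ 1 - z - \frac{\delta}{2} & z < 1-\delta, \end{cases}$$

which is $(1/\delta)$-smooth, as claimed above.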

4 SPDC with non-uniform sampling

One potential drawback of the SPDC algorithm is that its convergence rate depends on a problem-specific constant $R$, which is the largest $\ell_2$-norm of the feature vectors $a_i$. As a consequence, the algorithm may perform badly on unnormalized data, especially if the $\ell_2$-norms of some feature vectors are substantially larger than others. In this section, we propose an extension of the SPDC method to mitigate this problem, which is given in Algorithm ?.

The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm ?, we pick coordinate $i$ with the probability

$$p_i \;=\; \frac{\alpha}{n} \;+\; (1-\alpha)\,\frac{\|a_i\|_2}{\sum_{k=1}^n \|a_k\|_2}, \qquad i = 1,\ldots,n,$$

where $\alpha \in (0,1)$ is a parameter. In other words, this distribution is a (strict) convex combination of the uniform distribution and the distribution that is proportional to the feature norms. Therefore, instances with large feature norms are sampled more frequently, controlled by $\alpha$. Simultaneously, we adopt an adaptive regularization in the dual step, imposing stronger regularization on such instances. In addition, we adjust the weights in the primal update accordingly. As a consequence, the convergence rate of Algorithm ? depends on the average norm of the feature vectors, as well as the parameter $\alpha$. This is summarized in the following theorem.

With an appropriate choice of $\alpha$, and comparing with Theorem ?, the parameters $\sigma$, $\tau$, and $\theta$ in Theorem ? are determined by the average norm of the features, $\bar{R} = (1/n)\sum_{i=1}^n \|a_i\|_2$, instead of the largest one $R = \max_i \|a_i\|_2$. This difference makes Algorithm ? more robust to unnormalized feature vectors. For example, if the $a_i$'s are sampled i.i.d. from a multivariate normal distribution, then $\max_i \|a_i\|_2$ almost surely goes to infinity as $n \to \infty$, but the average norm $\bar{R}$ converges to $\mathbb{E}\,\|a_i\|_2$.
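As a small illustration of the sampling distribution above (our own sketch; the mixing weight $\alpha = 0.5$ is an arbitrary illustrative choice, not a recommendation from the analysis):

    import numpy as np

    def weighted_sampling_probs(A, alpha=0.5):
        # Convex combination of the uniform distribution and the distribution
        # proportional to the feature norms ||a_i||_2; probabilities sum to 1.
        n = A.shape[0]
        norms = np.linalg.norm(A, axis=1)
        return alpha / n + (1 - alpha) * norms / norms.sum()

    # sampling one coordinate:
    #   p = weighted_sampling_probs(A)
    #   i = np.random.default_rng(0).choice(len(p), p=p)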

Since $\theta$ is a bound on the convergence factor, we would like to make it as small as possible. Given its expression in Theorem ?, it can be minimized by choosing $\alpha$ optimally as a function of an average condition number $\bar{\kappa} = \bar{R}^2/(\lambda\gamma)$. The optimal value of $\alpha$ decreases slowly to zero as the ratio $\bar{\kappa}/n$ grows, and increases to one as the ratio drops. Thus, we may choose a relatively uniform distribution for well conditioned problems, but a more aggressively weighted distribution for ill-conditioned problems.

For simplicity of presentation, we described in Algorithm ? a weighted sampling SPDC method with a single dual coordinate update, i.e., the case of $m = 1$. It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with $m > 1$. Here, we omit the technical details.

5 Related Work

Chambolle and Pock [9] considered a class of convex optimization problems with the following saddle-point structure:

$$\min_{x} \max_{y} \; \big\{ \langle K x, y \rangle + G(x) - F^*(y) \big\},$$

where $K$ is a matrix, and $G$ and $F^*$ are proper closed convex functions, with $F^*$ itself being the conjugate of a convex function $F$. They developed the following first-order primal-dual algorithm:

$$\begin{aligned} y^{(t+1)} &= \operatorname{prox}_{\sigma F^*}\big(y^{(t)} + \sigma K \bar{x}^{(t)}\big), \\ x^{(t+1)} &= \operatorname{prox}_{\tau G}\big(x^{(t)} - \tau K^T y^{(t+1)}\big), \\ \bar{x}^{(t+1)} &= x^{(t+1)} + \theta\big(x^{(t+1)} - x^{(t)}\big). \end{aligned}$$

When both $F^*$ and $G$ are strongly convex and the parameters $\tau$, $\sigma$ and $\theta$ are chosen appropriately, this algorithm obtains an accelerated linear convergence rate [9].

We can map the saddle-point problem (2) into the form above by letting $A = [a_1, \ldots, a_n]^T$ and

$$K = \frac{1}{n} A, \qquad G(x) = g(x), \qquad F^*(y) = \frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i).$$

The SPDC method developed in this paper can be viewed as an extension of this batch method, where the dual update step is replaced by a single coordinate update or a mini-batch update. However, in order to obtain an accelerated convergence rate, more subtle changes are necessary in the primal update step. More specifically, we introduced the auxiliary variable $u^{(t)} = (1/n)\sum_{i=1}^n a_i y_i^{(t)}$, and replaced the primal update step by a proximal step along an adjusted direction. The primal extrapolation step stays the same.

To compare the batch complexity of SPDC with that of the Chambolle–Pock algorithm, we use the following facts implied by Assumption ? and the relations above: $G$ is $\lambda$-strongly convex, $F^*$ is $(\gamma/n)$-strongly convex, and $\|K\| = \|A\|_2/n$, where $\|A\|_2$ denotes the spectral norm of $A$.

Based on these conditions, we list in Table 2 the batch complexity obtained in [9] under the parameter choices recommended there, and compare it with SPDC.

The batch complexity of the Chambolle–Pock algorithm is $\widetilde{O}\big(1 + \|A\|_2/\sqrt{n\lambda\gamma}\big)$, where the $\widetilde{O}(\cdot)$ notation hides the $\log(1/\epsilon)$ factor. We can bound the spectral norm $\|A\|_2$ by the Frobenius norm $\|A\|_F$ and obtain

$$\|A\|_2 \;\le\; \|A\|_F \;\le\; \sqrt{n}\,R.$$

(Note that the second inequality above would be an equality if the feature vectors all had the same norm $R$.) So in the worst case, the batch complexity of the Chambolle–Pock algorithm becomes

$$\widetilde{O}\big(1 + \sqrt{R^2/(\lambda\gamma)}\big) \;=\; \widetilde{O}\big(1 + \sqrt{\kappa}\big),$$

which matches the worst-case complexity of the AFG methods [25] (see Section 1.1 and also the discussions in [20]). This is also of the same order as the complexity of SPDC with $m = n$ (see Section 2.1). When the condition number $\kappa \gg 1$, these can be worse than the batch complexity of SPDC with $m = 1$, which is $O((1+\sqrt{\kappa/n})\log(1/\epsilon))$.

Table 2: Comparing SPDC with Chambolle and Pock [9].

algorithm | batch complexity
Chambolle–Pock | $\widetilde{O}\big(1 + \|A\|_2/\sqrt{n\lambda\gamma}\big)$
SPDC with $m = 1$ | $\widetilde{O}\big(1 + \sqrt{\kappa/n}\big)$
SPDC with $m = n$ | $\widetilde{O}\big(1 + \sqrt{\kappa}\big)$

If either $F^*$ or $G$ in the saddle-point problem is not strongly convex, Chambolle and Pock proposed variants of the primal-dual batch algorithm to achieve accelerated sublinear convergence rates [9]. It is also possible to extend them to coordinate update methods for solving problem (1) when either $\phi_i^*$ or $g$ is not strongly convex. Their complexities would be similar to those in Table 1.

Our algorithms and theory can be readily generalized to solve problems of the form

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^n \phi_i(A_i x) + g(x),$$

where each $A_i$ is a matrix (so each loss depends on several linear predictions), and each $\phi_i$ is a smooth convex function. This more general formulation is used, e.g., in [38]. Most recently, Lan [18] considered a special case of this formulation and recognized that the dual coordinate proximal mapping used in SPDC is equivalent to computing the primal gradients $\nabla\phi_i$ at a particular sequence of points. Based on this observation, he derived a similar randomized incremental gradient algorithm which shares the same order of iteration complexity as we presented in this paper.

5.1 Dual coordinate ascent methods

We can also solve the primal problem (1) via its dual:

$$\max_{y \in \mathbb{R}^n} \; \Big\{ D(y) \;\overset{\mathrm{def}}{=}\; -\frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i) - g^*\Big(-\frac{1}{n}\sum_{i=1}^n y_i a_i\Big) \Big\}.$$

Because of the problem structure, coordinate ascent methods (e.g., [31]) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method [40], a dual coordinate $y_i$ is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang [40] showed that the iteration complexity of SDCA is $O((n+\kappa)\log(1/\epsilon))$, which corresponds to the batch complexity $O((1+\kappa/n)\log(1/\epsilon))$.

For more general convex optimization problems, there is a vast literature on coordinate descent methods; see, e.g., the recent overview by Wright [47]. In particular, Nesterov's work on randomized coordinate descent [27] sparked a lot of recent activity on this topic. Richtárik and Takáč [35] extended the algorithm and analysis to composite convex optimization. When applied to the dual problem above, it becomes one variant of SDCA studied in [40]. Mini-batch and distributed versions of SDCA have been proposed and analyzed in [43] and [50], respectively. Non-uniform sampling schemes have been studied for both stochastic gradient and SDCA methods (e.g., [23]).

Shalev-Shwartz and Zhang [39] proposed an accelerated mini-batch SDCA method which incorporates additional primal updates beyond SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size $m$. In particular, for $m = n$, it matches that of the AFG methods (as SPDC does). But for $m = 1$, the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.

In addition, Shalev-Shwartz and Zhang [38] developed an accelerated proximal SDCA method which achieves the same batch complexity $O((1+\sqrt{\kappa/n})\log(1/\epsilon))$ as SPDC. Their method is an inner-outer iteration procedure, where the outer loop is a full-dimensional accelerated gradient method in the primal space. At each iteration of the outer loop, the SDCA method [40] is called to solve the dual problem with customized regularization parameter and precision. In contrast, SPDC is a straightforward single-loop coordinate optimization method.

More recently, Lin et al. [20] developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem above, APCG enjoys the same batch complexity $O((1+\sqrt{\kappa/n})\log(1/\epsilon))$ as SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of the primal-dual gap [20]. The computational cost of this additional step is equivalent to one pass over the dataset, thus it does not affect the overall complexity.

5.2 Other related work

Another way to approach problem (1) is to reformulate it as a constrained optimization problem

$$\min_{x,\,z} \; \frac{1}{n}\sum_{i=1}^n \phi_i(z_i) + g(x) \qquad \text{subject to} \quad z = A x,$$

and solve it by ADMM-type operator-splitting methods (e.g., [21]). In fact, as shown in [9], the batch primal-dual algorithm above is equivalent to a pre-conditioned ADMM (or inexact Uzawa method; see, e.g., [53]). Several authors [46] have considered a more general formulation, where each $\phi_i$ is a function of the whole vector $x$. They proposed online or stochastic versions of ADMM which operate on only one $\phi_i$ in each iteration, and obtained sublinear convergence rates. However, their cost per iteration is higher than the $O(d)$ cost of SPDC.

Suzuki [42] considered a problem similar to (1), but with a more complex regularization function $g$, meaning that $g$ does not have a simple proximal mapping. Thus primal updates such as the proximal step in SPDC and similar steps in SDCA cannot be computed efficiently. He proposed an algorithm that combines SDCA [40] and ADMM (e.g., [8]), and showed that it has a linear rate of convergence under conditions similar to Assumption ?. It would be interesting to see if the SPDC method can be extended to their setting to obtain an accelerated linear convergence rate.

6 Efficient Implementation with Sparse Data

During each iteration of the SPDC methods, the updates of the primal variables (i.e., computing $x^{(t+1)}$ and $\bar{x}^{(t+1)}$) require full $d$-dimensional vector operations; see the primal update steps of Algorithms ?, ? and ?. So the computational cost per iteration is $O(d)$, and this can be too expensive if the dimension $d$ is very high. In this section, we show how to exploit problem structure to avoid high-dimensional vector operations when the feature vectors $a_i$ are sparse. We illustrate the efficient implementation for two popular cases: when $g$ is a squared $\ell_2$-norm penalty and when $g$ is an $\ell_1 + \ell_2$ penalty. For both cases, we show that the computational cost per iteration only depends on the number of non-zero components of the feature vector.

6.1 Squared $\ell_2$-norm penalty

Suppose that $g(x) = \frac{\lambda}{2}\|x\|_2^2$. In this case, the updates for each coordinate of $x$ are independent of each other. More specifically, $x^{(t+1)}$ can be computed coordinate-wise in closed form:

$$x_j^{(t+1)} \;=\; \frac{x_j^{(t)} - \tau v_j}{1 + \lambda\tau}, \qquad j = 1,\ldots,d,$$

where $v$ denotes the adjusted dual-average direction used in the primal step of Algorithm ?, Algorithm ? or Algorithm ? (for basic SPDC it is $u^{(t)} + (y_i^{(t+1)} - y_i^{(t)})\,a_i$), and $v_j$ represents the $j$-th coordinate of $v$.

Although the dimension $d$ can be very large, we assume that each feature vector $a_i$ is sparse. We denote by $J_t$ the set of non-zero coordinates at iteration $t$: that is, if for some index $i$ picked at iteration $t$ we have $a_{ij} \neq 0$, then $j \in J_t$. If $j \notin J_t$, then the SPDC algorithm (and its variants) updates $x_j^{(t+1)}$ without using the value of $x_j^{(t)}$ or $\bar{x}_j^{(t)}$. This can be seen from the updates above, where the value of the inner product $\langle a_i, \bar{x}^{(t)}\rangle$ does not depend on the value of $\bar{x}_j^{(t)}$. As a consequence, we can delay the updates on $x_j$ and $\bar{x}_j$ whenever $j \notin J_t$ without affecting the updates on $y^{(t)}$, and process all the missing updates at the next time when $j \in J_t$.

Such a delayed update can be carried out very efficiently. Suppose that $s$ is the last iteration at which $j \in J_s$, and $t$ is the current iteration at which we want to update $x_j$ and $\bar{x}_j$. For every iteration $r \in \{s+1, \ldots, t-1\}$ we have $j \notin J_r$, so the $j$-th coordinate of $u$ is not touched and $v_j = u_j^{(s+1)}$ stays constant over those iterations. Substituting this constant value into the coordinate-wise recursion above, each of the skipped iterations applies the same affine map

$$x_j^{(r+1)} \;=\; \frac{x_j^{(r)} - \tau\, u_j^{(s+1)}}{1 + \lambda\tau},$$

whose repeated composition has a closed form (a geometric series). The collapsed update therefore takes only a constant number of arithmetic operations (one scalar power) to compute. Using the same formula, we can compute $x_j^{(t-1)}$ and $x_j^{(t-2)}$, and subsequently compute $\bar{x}_j^{(t-1)} = x_j^{(t-1)} + \theta(x_j^{(t-1)} - x_j^{(t-2)})$. Thus, the computational complexity of a single iteration in SPDC is proportional to the number of non-zero components of the sampled feature vector, independent of the dimension $d$.
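A minimal sketch of this collapsed update (our own illustration; the analogous bookkeeping for the extrapolated variable $\bar{x}_j$ via the same affine-recursion argument is omitted for brevity):

    def lazy_ridge_update(x_j, u_j, tau, lam, steps):
        # Collapse `steps` applications of the affine recursion
        #   x_j <- (x_j - tau * u_j) / (1 + lam * tau)
        # (valid while u_j stays constant) into one closed-form evaluation
        # via the geometric-series formula for x <- a*x + c.
        a = 1.0 / (1.0 + lam * tau)           # contraction factor, 0 < a < 1
        c = -tau * u_j / (1.0 + lam * tau)    # constant term
        a_pow = a ** steps
        return a_pow * x_j + c * (1.0 - a_pow) / (1.0 - a)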

6.2 $\ell_1 + \ell_2$-norm penalty

Suppose that $g(x) = \lambda_1\|x\|_1 + \frac{\lambda}{2}\|x\|_2^2$. Since both the $\ell_1$-norm and the squared $\ell_2$-norm are decomposable, the updates for each coordinate of $x$ are independent. More specifically,

$$x_j^{(t+1)} \;=\; \operatorname{sign}(w_j)\,\frac{\max\{|w_j| - \tau\lambda_1,\; 0\}}{1 + \lambda\tau}, \qquad w_j = x_j^{(t)} - \tau v_j,$$

where $v$ follows the definition in Section 6.1. If $a_{ij} = 0$ for the sampled index $i$, then $v_j = u_j^{(t)}$ and the equation above simplifies accordingly, since no new information enters coordinate $j$.
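For reference, a one-line sketch of this coordinate update (our own derivation of the proximal step under the stated $\ell_1 + \ell_2$ penalty):

    import numpy as np

    def elastic_net_coord_update(x_j, v_j, tau, lam1, lam):
        # Coordinate update for g(x) = lam1*||x||_1 + (lam/2)*||x||_2^2:
        # soft-thresholding at level tau*lam1, then ridge shrinkage.
        w = x_j - tau * v_j
        return np.sign(w) * max(abs(w) - tau * lam1, 0.0) / (1.0 + tau * lam)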

Similar to the approach of Section 6.1, we delay the update of $x_j$ until the next iteration $t$ with $j \in J_t$. We assume