Stochastic Dual Coordinate Ascent
with Alternating Direction Multiplier Method
Abstract
We propose a new stochastic dual coordinate ascent technique that can be applied to a wide range of regularized learning problems. Our method is based on the Alternating Direction Multiplier Method (ADMM) and can deal with complex regularization functions such as structured regularizations. Although the original ADMM is a batch method, the proposed method offers a stochastic update rule in which each iteration requires only one or a few sample observations. Moreover, our method naturally supports mini-batch updates, which speed up convergence. We show that, under mild assumptions, our method converges exponentially. Numerical experiments show that our method indeed performs efficiently.
Keywords: Stochastic Dual Coordinate Ascent, Alternating Direction Multiplier Method, Exponential Convergence, Structured Sparsity.
1 Introduction
This paper proposes a new stochastic optimization method that shows exponential convergence and can be applied to a wide range of regularization functions, by combining stochastic dual coordinate ascent with the alternating direction multiplier method. Recently, it has become increasingly important to develop efficient optimization methods that can handle a large number of samples. One of the most successful approaches is stochastic optimization, and indeed many stochastic methods have been proposed to deal with large sample sizes. Among them, the (online) stochastic gradient method is the most basic and successful one. It can be naturally applied to the regularized learning framework. Such methods go by several names, including online proximal gradient descent, forward-backward splitting, and online mirror descent (Duchi and Singer, 2009). Basically, these methods are intended to process sequentially arriving data: they update the parameter using one new observation and then discard the observed sample. Therefore, they do not need a large memory space to store the whole observed data. The convergence rate of these methods is $O(1/\sqrt{T})$ for general settings and $O(1/T)$ for strongly convex losses, which are minimax optimal (Nemirovskii and Yudin, 1983).
On the other hand, it was recently shown that, if one is allowed to reuse the observed data several times, it is possible to develop a stochastic method with an exponential convergence rate for a strongly convex objective (Le Roux et al., 2013; Shalev-Shwartz and Zhang, 2013c; Shalev-Shwartz and Zhang, 2013a). These methods are still stochastic in the sense that one sample or a small mini-batch is randomly picked to be used for each update. The main difference from the stochastic gradient method is that these methods are intended to process data with a fixed number of training samples. The Stochastic Average Gradient (SAG) method (Le Roux et al., 2013) utilizes an averaged gradient to achieve exponential convergence. The Stochastic Dual Coordinate Ascent (SDCA) method solves the dual problem using a stochastic coordinate ascent technique (Shalev-Shwartz and Zhang, 2013c; Shalev-Shwartz and Zhang, 2013a). These methods have the favorable properties of both the online-stochastic approach and the batch approach: they show a fast decrease of the objective function in the early stage of the optimization, like online-stochastic approaches, and exponential convergence after a “burn-in” period, like batch approaches. However, these methods have some drawbacks. SAG needs to keep all gradients computed on each training sample in memory, which amounts to the dimension times the sample size. The SDCA method can be applied only to a simple regularization function for which the dual function is easily computed; thus it is hard to apply the method to a complex regularization function such as a structured regularization.
In this paper, we propose a Stochastic Dual Coordinate Ascent method for the Alternating Direction Multiplier Method (SDCA-ADMM). Our method is similar to SDCA, but inherits a favorable property of ADMM. By combining SDCA and ADMM, our method can be applied to a wide range of regularized learning problems. ADMM is an effective optimization method for composite optimization problems with a linear constraint, such as Eq. (2) below (Gabay and Mercier, 1976; Boyd et al., 2010; Qin and Goldfarb, 2012). This formulation is quite flexible and fits a wide range of applications such as structured regularization, dictionary learning, convex tensor decomposition, and so on (Qin and Goldfarb, 2012; Jacob et al., 2009; Tomioka et al., 2011; Rakotomamonjy, 2013). However, ADMM is a batch optimization method. Our approach transforms ADMM into a stochastic method by utilizing a stochastic coordinate ascent technique. Our method, SDCA-ADMM, does not require a large amount of memory because it observes only one or a few samples in each iteration. SDCA-ADMM can also be naturally adapted to a sub-batch situation where a block of a few samples is utilized in each iteration. Moreover, it is shown that our method achieves exponential convergence for risk functions with certain strong convexity and smoothness properties. The convergence rate is affected by the size of the sub-batch: if the samples are not strongly correlated, a sub-batch update gives a better convergence rate than a one-sample update.
2 Structured Regularization and its Dual Formulation
In this section, we give the problem formulation of structured regularization and its dual formulation. The standard regularized risk minimization is described as follows:
(1) $\displaystyle\min_{w \in \mathbb{R}^d} \Big\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(w) \Big\},$
where $z_1,\dots,z_n$ are vectors in $\mathbb{R}^d$, $w \in \mathbb{R}^d$ is the weight vector that we want to learn, $f_i$ is a loss function for the $i$th sample, and $\psi$ is the regularization function which is used to avoid overfitting. For example, the loss function can be taken as a classification surrogate loss $f_i(u) = \ell(y_i u)$, where $y_i \in \{\pm 1\}$ is the training label of the $i$th sample. With regard to $\psi$, we are interested in sparsity-inducing regularizations, e.g., the $\ell_1$ regularization, the group lasso regularization, the trace-norm regularization, and so on. Our motivation in this paper is to deal with a “complex” regularization for which it is not easy to directly minimize the regularization function (more precisely, the proximal operation determined by $\psi$ is not easily computed; see Eq. (5)). This kind of regularization appears in, for example, structured sparsity such as the overlapped group lasso and graph regularization (Jacob et al., 2009; Signoretto et al., 2010). In many cases, such a “complex” regularization function can be decomposed into a “simple” regularization $\tilde{\psi}$ and a linear transformation $B$, that is, $\psi(w) = \tilde{\psi}(Bw)$ where $B \in \mathbb{R}^{\ell \times d}$. Using this formulation, the optimization problem (Eq. (1)) is equivalent to
(2) $\displaystyle\min_{w \in \mathbb{R}^d,\, y \in \mathbb{R}^{\ell}} \Big\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \tilde{\psi}(y) \Big\} \quad \text{s.t.} \quad y = Bw.$
The purpose of this paper is to give an efficient stochastic optimization method to solve this problem (2). For this purpose, we employ the dual formulation. Using Fenchel's duality theorem, we obtain the following dual formulation.
Lemma 1.
(3) $\displaystyle\min_{w} \Big\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \tilde{\psi}(Bw) \Big\} = \max_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^{\ell}} \Big\{ -\frac{1}{n}\sum_{i=1}^n f_i^*(-x_i) - \tilde{\psi}^*(y) \;:\; \tfrac{1}{n}Zx = B^\top y \Big\},$
where $f_i^*$ and $\tilde{\psi}^*$ are the convex conjugates of $f_i$ and $\tilde{\psi}$ respectively (Rockafellar, 1970)^{1}^{1}1The convex conjugate function of $f$ is defined by $f^*(x) := \sup_{u}\{\langle x, u\rangle - f(u)\}$., and $Z := (z_1,\dots,z_n) \in \mathbb{R}^{d \times n}$. Moreover, $w$ and $(x, y)$ are optimal solutions of the two sides if and only if $-x_i \in \partial f_i(z_i^\top w)$ $(i = 1,\dots,n)$, $\frac{1}{n}Zx = B^\top y$, and $Bw \in \partial \tilde{\psi}^*(y)$.
Proof.
By Fenchel’s duality theorem (Corollary 31.2.1 of Rockafellar (1970)), we have that
(4) $\displaystyle\min_{w} \Big\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(w) \Big\} = \max_{x \in \mathbb{R}^n} \Big\{ -\frac{1}{n}\sum_{i=1}^n f_i^*(-x_i) - \psi^*\Big(\frac{1}{n}Zx\Big) \Big\}.$
Moreover, $w$ and $x$ are optimal for the respective sides if and only if $-x_i \in \partial f_i(z_i^\top w)$ $(i = 1,\dots,n)$ and $w \in \partial \psi^*(\frac{1}{n}Zx)$ (Corollary 31.3 of Rockafellar (1970)). Now, Theorem 16.3 of Rockafellar (1970) gives that the conjugate of the composition $\psi = \tilde{\psi}(B\,\cdot)$ satisfies $\psi^*(v) = \min_{y:\, B^\top y = v} \tilde{\psi}^*(y)$.
Thus $\psi^*(\frac{1}{n}Zx) = \min_{y:\, B^\top y = \frac{1}{n}Zx} \tilde{\psi}^*(y)$, and substituting this into the RHS of Eq. (4), we obtain Eq. (3). Now, a $y$ satisfying $B^\top y = \frac{1}{n}Zx$ is optimal if and only if $\tilde{\psi}^*(y) = \psi^*(\frac{1}{n}Zx)$ for the optimal $x$. Thus, if $(w, x, y)$ is optimal, then we have $w \in \partial \psi^*(\frac{1}{n}Zx)$ and $\tilde{\psi}^*(y) = \psi^*(\frac{1}{n}Zx)$, so that $\tilde{\psi}(Bw) + \tilde{\psi}^*(y) = \psi(w) + \psi^*(\frac{1}{n}Zx) = \langle w, \frac{1}{n}Zx \rangle = \langle Bw, y \rangle$, which implies $Bw \in \partial \tilde{\psi}^*(y)$. Conversely, if $Bw \in \partial \tilde{\psi}^*(y)$ and $B^\top y = \frac{1}{n}Zx$, then $\psi(w) + \psi^*(\frac{1}{n}Zx) \le \tilde{\psi}(Bw) + \tilde{\psi}^*(y) = \langle Bw, y \rangle = \langle w, \frac{1}{n}Zx \rangle$, and since the reverse inequality always holds by the Fenchel-Young inequality, we obtain $w \in \partial \psi^*(\frac{1}{n}Zx)$. Therefore, we obtain the optimality conditions.
∎
The dual problem is a composite objective optimization with the linear constraint $\frac{1}{n}Zx = B^\top y$. In the next section, we give an efficient stochastic method to solve this dual problem. A nice property of the dual formulation is that, in many machine learning applications, the dual loss function becomes strongly convex. For example, for the logistic loss $f(u) = \log(1 + \exp(-u))$, the dual function is $f^*(-x) = x\log x + (1-x)\log(1-x)$ $(x \in [0, 1])$, whose modulus of strong convexity (at least $4$) is much better than that of the primal loss, which is not strongly convex at all. More importantly, the $i$th sample directly affects only the $i$th coordinate $x_i$ of the dual variable. In other words, if $x_i$ is fixed, the $i$th sample has no influence on the objective value. This enables us to utilize the stochastic coordinate ascent technique in the dual problem, because the update of the single coordinate $x_i$ requires only the information of the $i$th sample $z_i$.
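As a quick sanity check of the logistic-loss example above, the conjugate value $f^*(-x)$ can be evaluated numerically and compared against the closed form $x\log x + (1-x)\log(1-x)$. This is an illustrative sketch; the function names and the ternary-search procedure are our own, not from the paper:

```python
import math

def logistic(u):
    """Logistic loss f(u) = log(1 + exp(-u))."""
    return math.log(1.0 + math.exp(-u))

def logistic_conj_neg(x, lo=-30.0, hi=30.0, iters=80):
    """Numerically evaluate f*(-x) = sup_u { -x*u - f(u) } by ternary search;
    the objective is concave in u, so ternary search finds the maximizer."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if -x * m1 - logistic(m1) < -x * m2 - logistic(m2):
            lo = m1
        else:
            hi = m2
    u = 0.5 * (lo + hi)
    return -x * u - logistic(u)

def entropy_form(x):
    """Closed form of the conjugate: x log x + (1-x) log(1-x), for x in (0,1)."""
    return x * math.log(x) + (1.0 - x) * math.log(1.0 - x)
```

The second derivative of the closed form is $1/x + 1/(1-x) \ge 4$, which is the modulus of strong convexity mentioned above.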
Finally, we give the precise notion of the “complex” and “simple” regularizations. This notion is defined through the computational complexity of the proximal operation corresponding to the regularization function (Rockafellar, 1970). The proximal operation corresponding to a convex function $\psi$ is defined by
(5) $\displaystyle\mathrm{prox}(q|\psi) := \arg\min_{x} \Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.$
For example, the proximal operation corresponding to the $\ell_1$ regularization $\psi(x) = C\|x\|_1$ is easily computed coordinate-wise as $\mathrm{prox}(q|C\|\cdot\|_1)_j = \mathrm{sign}(q_j)\max(|q_j| - C, 0)$, which is the so-called soft-thresholding operation. More generally, the proximal operation for the group lasso regularization with non-overlapped groups can also be computed analytically. On the other hand, for an overlapped group regularization, the proximal operation is no longer analytically obtained. However, by choosing $B$ appropriately, we can split the overlaps and obtain a $\tilde{\psi}$ for which the proximal operation is easily computed (see Section 6 for concrete examples).
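As a small illustration of Eq. (5), the soft-thresholding operation can be implemented in a few lines (a sketch; the function name is our own):

```python
import numpy as np

def prox_l1(q, C):
    """Proximal operation for psi(x) = C * ||x||_1:
    argmin_x { C*||x||_1 + 0.5*||x - q||^2 }, i.e. soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)
```

Each coordinate is shrunk toward zero by $C$ and clipped at zero, so coordinates with $|q_j| \le C$ are set exactly to zero.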
3 Proposed Method: Stochastic Dual Coordinate Ascent with ADMM
In this section, we present our proposal, a stochastic dual coordinate ascent type of ADMM. For a positive semidefinite matrix $Q$, we denote $\|x\|_Q := \sqrt{x^\top Q x}$. $z_i$ denotes the $i$th column of $Z$, and $Z_{\backslash i}$ is the matrix obtained by removing the $i$th column from $Z$. Similarly, for a vector $x$, $x_{\backslash i}$ is the vector obtained by removing the $i$th component from $x$.
3.1 One Sample Update of SDCA for ADMM
The basic update rule of our proposed method in the $t$th step is given as follows: in each update step, choose an index $i \in \{1,\dots,n\}$ uniformly at random, and update $(y^{(t)}, x_i^{(t)}, w^{(t)})$ as
(6a)  
(6b)  
(6c) 
where $w^{(t)}$ is the primal variable at the $t$th step, $Q$ and $G_i$ are arbitrary positive semidefinite matrices, and the remaining quantities are parameters we give beforehand.
The optimization procedure looks a bit complicated. To simplify the procedure, we set $G_i$ as
(7) 
where the parameters are chosen so that $G_i$ is positive semidefinite. Then, by carrying out simple calculations, the update rules of $y$ and $x_i$ are rewritten as
(8a)  
(8b) 
Note that the update (8b) of $x_i$ is just a one-dimensional optimization, and thus it is quite easily computed. Moreover, for some loss functions, such as the smoothed hinge loss used in Section 6, we have an analytic form of the update.
The update rule (8a) of $y$ can be rewritten using the proximal operation corresponding to the primal function $\tilde{\psi}$, while the rule (8a) as written is given by the proximal operation corresponding to the dual function $\tilde{\psi}^*$. Indeed, there is a clear relation between the primal and dual proximal operations (Theorem 31.5 of Rockafellar (1970)), known as Moreau's decomposition: $q = \mathrm{prox}(q|\tilde{\psi}) + \mathrm{prox}(q|\tilde{\psi}^*)$ for all $q$.
Using this, for , we have that
(9) 
because $\mathrm{prox}(q|\tilde{\psi}^*) = q - \mathrm{prox}(q|\tilde{\psi})$ for a convex function $\tilde{\psi}$. This is efficiently computed because we assumed that the proximal operation corresponding to $\tilde{\psi}$ can be efficiently computed.
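Moreau's decomposition can be checked numerically for the $\ell_1$ norm, whose convex conjugate is the indicator of the $\ell_\infty$-ball of radius $C$, so its prox is a clipping (projection) operation. This is an illustrative sketch with our own function names:

```python
import numpy as np

def prox_l1(q, C):
    """Soft-thresholding: prox of C * ||.||_1."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

def prox_l1_conj(q, C):
    """The conjugate of C * ||.||_1 is the indicator of the l-infinity ball
    of radius C; its prox is the Euclidean projection onto that ball."""
    return np.clip(q, -C, C)
```

For any $q$, the two proxes add up to $q$, which is exactly the primal-dual relation used above.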
During the update, we need the quantity $Zx^{(t-1)} = \sum_{j=1}^n x_j^{(t-1)} z_j$, which seems to require $O(nd)$ computation at first glance. However, it can be incrementally updated as $Zx^{(t)} = Zx^{(t-1)} + (x_i^{(t)} - x_i^{(t-1)}) z_i$, because only the $i$th coordinate changes. Thus we do not need to load all the samples to compute $Zx^{(t)}$ at each iteration.
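The incremental bookkeeping of $Zx$ can be sketched as follows (an illustration with hypothetical dimensions, not the paper's code): after a single dual coordinate changes, the maintained vector is refreshed by one rank-one update of cost $O(d)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
Z = rng.standard_normal((d, n))   # columns are the samples z_i
x = rng.standard_normal(n)        # dual variable
u = Z @ x                         # maintained quantity Z x

# one SDCA-style coordinate update: change x_i, refresh u in O(d)
i = 7
x_new_i = 0.3
u += (x_new_i - x[i]) * Z[:, i]
x[i] = x_new_i
```

After the update, `u` still equals `Z @ x` without ever recomputing the full matrix-vector product.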
In the above, the update rule of our algorithm is based on a one-sample observation. Next, we give a mini-batch extension of the algorithm in which more than one sample can be used in each iteration.
3.2 Mini-Batch Extension
Here, we generalize our method to the mini-batch situation where, at each iteration, we observe a small number of samples instead of a single sample. At each iteration, we randomly choose an index set $I \subseteq \{1,\dots,n\}$ so that each index is included in $I$ with equal probability: $P(i \in I) = P(j \in I)$ for all $i, j$. To do so, we suggest the following procedure. We split the index set $\{1,\dots,n\}$ into $K$ groups $I_1,\dots,I_K$ beforehand, and then, for each iteration, pick $k \in \{1,\dots,K\}$ uniformly at random and set $I = I_k$. Each sub-batch can have a different cardinality from the others, but the probability $P(i \in I) = 1/K$ is uniform over all $i$. The update rule using the sub-batch $I$ is given as follows: update $y^{(t)}$ as before (6a), and update $x^{(t)}$ and $w^{(t)}$ by
(10a)  
(10b) 
Using the matrices given in Eq. (7), the update rule of $y$ can be replaced by Eq. (9) as in the one-sample update situation. The update rule of $x_I$ can also be simplified by choosing the matrix appropriately. Because the sub-batches have no overlap with each other, we can construct a positive semidefinite matrix whose block-diagonal elements have the form
(11) 
where the scalar is a positive real chosen so that the matrix is positive semidefinite. The reason why we split the index set into $K$ sets is to construct this kind of matrix, which “diagonalizes” the quadratic function in (10a). These choices could be replaced with others for which the update can be computed efficiently, as long as $P(i \in I)$ is uniform over all $i$. Using the matrix given in (11), the update rule (10a) of $x_I$ is rewritten as
(12) 
where $x_I$ is the vector consisting of the components $x_i$ with indices $i \in I$, and $Z_I$ is the submatrix of $Z$ consisting of the columns $z_i$ with indices $i \in I$. Note that, since the dual loss is a sum of the single-variable convex functions $f_i^*$, the proximal operation in Eq. (12) can be split into proximal operations with respect to each single variable $x_i$. This is advantageous not only for the simplicity of the computation but also for parallel computation. That is, the update rule (12) reduces to a one-dimensional update for each $i \in I$, which is easily parallelizable. In summary, our proposed algorithm is given in Algorithm 1.
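The sub-batch sampling scheme described above (split the indices into $K$ disjoint groups once, then pick one group uniformly at each iteration) can be sketched as follows; the function names and the round-robin splitting rule are our own illustrative choices:

```python
import random

def make_subbatches(n, K, seed=0):
    """Split the indices {0, ..., n-1} into K disjoint groups, once, beforehand."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]

def sample_subbatch(groups, rng):
    """Pick one group uniformly; each index i then satisfies P(i in I) = 1/K."""
    return groups[rng.randrange(len(groups))]
```

Because the groups are disjoint and each is chosen with probability $1/K$, the inclusion probability is uniform over all indices, as the analysis requires.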
Finally, we would like to highlight the connection between our method and the original batch ADMM (Hestenes, 1969; Powell, 1969; Rockafellar, 1976). The batch ADMM utilizes the following update rule:
(13a)  
(13b)  
(13c) 
One can see that the update rule of our algorithm reduces to that of the batch ADMM (13) if the sub-batch is taken to be the whole index set $I = \{1,\dots,n\}$, except for the terms involving the auxiliary matrices. Such terms are used also in the batch situation to eliminate cross terms in the quadratic penalty. This technique is called linearization. The linearization technique makes the update rule simple and parallelizable, and in some situations makes it possible to obtain an analytic form of the update.
4 Linear Convergence of SDCA-ADMM
In this section, the convergence rate of our proposed algorithm is given; indeed, the convergence rate is exponential (R-linear). To show the convergence rate, we assume some conditions. First, we assume that there exists a unique optimal solution $w^*$ and that an injectivity condition holds for one of the linear maps in the problem (the other is not necessarily injective). Moreover, we assume the uniqueness of the dual solution $x^*$, but we do not assume the uniqueness of $y^*$; we denote the set of dual optimal $y$ by $Y^*$ and assume that $Y^*$ is compact. Then, by Lemma 1, we have that
(14) $\displaystyle\frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w^*) + \tilde{\psi}(Bw^*) = -\frac{1}{n}\sum_{i=1}^n f_i^*(-x_i^*) - \tilde{\psi}^*(y^*).$
By convex duality arguments, this implies that the optimality conditions of Lemma 1 hold at the optimal solutions.
Moreover, we suppose that each (dual) loss function $f_i^*$ is locally strongly convex and smooth around the optimal solution, and that $\tilde{\psi}^*$ is also locally strongly convex in a weak sense, as follows.
Assumption 1.
There exists a constant such that, for all $i$,
There exist constants such that, for all such points, there exists an optimal point such that
(15) 
and for all we have
(16) 
where is the projection matrix to the kernel of .
Note that these conditions need to be satisfied only around the optimal solutions $x^*$ and $Y^*$. They do not need to hold at every point, and thus they are much weaker than ordinary strong convexity. Moreover, the inequalities need to be satisfied only along the solution sequence of our algorithm. The condition (15) is satisfied, for example, by the $\ell_1$ regularization, because the dual of the $\ell_1$ regularization is an indicator function with a compact support and, outside the optimal solution set $Y^*$, the indicator function is lower bounded by a quadratic function. In addition, the quadratic term on the right hand side of condition (15) is restricted to a subspace, which makes it possible to include several types of regularization functions; if the restriction is trivial, the condition is always satisfied. The assumption (16) is the strongest assumption. It is satisfied by the elastic-net regularization. If one wants to obtain a solution for a non-strongly convex regularization such as the plain $\ell_1$ regularization, one can add a small squared term to obtain an approximated solution that is sufficiently close to the true one within a prescribed precision.
Define the primal and dual objectives as
Note that, by Eq. (14), these quantities are always nonnegative. Define the block diagonal matrix as for all and for . Let . We define as
For a symmetric matrix , we define and as the maximum and minimum singular value respectively.
Theorem 2.
Suppose that , for all and is injective. Then, under Assumption 1, the dual objective function converges R-linearly: we have that, for ,
where
In particular, we have that
If we further assume (), then this implies that
The proof is deferred to the appendix. This theorem shows that the primal and dual objective values converge R-linearly. Moreover, the primal variable also converges R-linearly to the optimum. The number of sub-batches $K$ controls the convergence rate. If all samples are nearly orthogonal to each other, the relevant spectral quantity is bounded by a constant for all $K$, and thus the convergence rate improves as $K$ decreases (that is, as the size of each sub-batch grows). On the other hand, if the samples are strongly correlated with each other, this quantity grows linearly with the sub-batch size, and then the convergence rate is not improved by decreasing $K$. As for the batch setting, the linear convergence of batch ADMM has been shown by Deng and Yin (2012). However, their proof cannot be directly applied to our stochastic setting; our proof requires techniques specialized to stochastic coordinate ascent. We would also like to point out that exponential convergence is not guaranteed if the choice of the index set at each update is cyclic; thus the index should be chosen randomly. This is reported also in the paper (Shalev-Shwartz and Zhang, 2013a).
The statement can also be described in terms of the number of iterations required to achieve a precision $\epsilon$, i.e., the smallest $t$ at which the expected suboptimality falls below $\epsilon$:
where the omitted quantities are absolute constants. This says that the dependency on $\epsilon$ is of log order. An interesting point is the influence of the modulus of local strong convexity of the regularization term. Usually the regularization function is made weaker as the number of samples increases; in that situation, this modulus decreases as $n$ goes up. However, even if the modulus is taken arbitrarily small, the iteration complexity degrades only mildly, so the convergence rate is hardly affected by this setting. This point is the same as for the ordinary SDCA algorithm (Shalev-Shwartz and Zhang, 2013a).
5 Related Works
In this section, we present some related works and discuss differences from our method.
The most closely related work is the recent study by Shalev-Shwartz and Zhang (2013c), in which the Stochastic Dual Coordinate Ascent (SDCA) method for regularized risk minimization was proposed. Their method also deals with the dual problem (3), corresponding to $B = I$ in our setting, and applies a stochastic coordinate ascent technique. This method converges linearly. At each iteration, the method solves the following one-dimensional optimization problem,
and then updates the dual and primal variables. The most important difference from our method is the computation involving the regularization function. For a “simple” regularization function, it is often easy to compute the (sub)gradient of $\psi$. However, for a “complex” regularization such as a structured regularization, this computation cannot be carried out efficiently. To overcome this difficulty, our method utilizes the linearly transformed variable $y = Bw$ and splits the optimization with respect to the two blocks of variables by applying the ADMM technique. Thus, our method is applicable to much more general regularization functions. Recently, mini-batch extensions of SDCA have become a hot topic (Takáč et al., 2013; Shalev-Shwartz and Zhang, 2013b). Our approach realizes the mini-batch extension using the linearization technique, which is naturally derived in the framework of ADMM. Although the proof technique is quite different, the convergence analysis of the standard mini-batch SDCA given by Shalev-Shwartz and Zhang (2013b) parallels our theorem.
The second method related to ours is the Stochastic Average Gradient (SAG) method (Le Roux et al., 2013). The method is a modification of stochastic gradient descent that utilizes an averaged gradient. A good point of their method is that one only needs to deal with the primal problem; thus the computation is easy, and one does not need to work with the convex conjugate functions. Moreover, their method also converges linearly. However, its biggest drawback is that, to compute the averaged gradient, all gradients of the loss functions computed at each sample must be stored in memory. The required memory size is usually proportional to the dimension times the sample size, which is hard to store in big-data situations. On the other hand, our method is free from such a memory problem: it only needs to maintain the $n$-dimensional dual variable in addition to the primal variables.
The third line of related methods consists of online versions of ADMM. Recently, some online variants of ADMM have been proposed (Wang and Banerjee, 2012; Suzuki, 2013; Ouyang et al., 2013). These methods are effective for complex regularizations, as discussed in this paper, and are thus applicable to a wide range of situations. However, they are basically online methods: they discard each sample once it has been observed, and they are not adapted to a situation where the training samples are visited several times. Therefore, their convergence rate is $O(1/\sqrt{T})$ in general and $O(1/T)$ for a strongly convex loss (possibly with some modification). On the other hand, our method converges linearly.
6 Numerical Experiments
In this section, we give numerical experiments on artificial and real data to demonstrate the effectiveness of our proposed algorithm^{2}^{2}2All the experiments were carried out on an Intel Core i7 2.93GHz with 8GB RAM.. We compare our SDCA-ADMM with existing stochastic optimization methods: Regularized Dual Averaging (RDA) (Duchi and Singer, 2009; Xiao, 2009), Online ADMM (OL-ADMM) (Wang and Banerjee, 2012), Online Proximal Gradient descent ADMM (OPG-ADMM) (Ouyang et al., 2013; Suzuki, 2013), and RDA-ADMM (Suzuki, 2013). We also compared our method with batch ADMM (Batch-ADMM) on the artificial data sets. We used sub-batches of size 50 for all the methods including ours (the actual size could be less than 50 for some blocks). We employed the parameter settings and . As for and , we used and . All of the experiments are classification problems with structured sparsity. We employed the smoothed hinge loss:
Then the proximal operation with respect to the dual function of the smoothed hinge loss is analytically given by
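For concreteness, the smoothed hinge loss in its standard form from the SDCA literature (Shalev-Shwartz and Zhang), with smoothing parameter 1, can be written as follows, together with the closed form of its convex conjugate; the exact smoothing parameter used in the paper's experiments may differ:

```python
def smoothed_hinge(u):
    """Standard smoothed hinge loss (quadratically smoothed on [0, 1])."""
    if u >= 1.0:
        return 0.0
    if u <= 0.0:
        return 0.5 - u
    return 0.5 * (1.0 - u) ** 2

def smoothed_hinge_conj_neg(a):
    """Closed form of the conjugate evaluated at -a:
    l*(-a) = -a + a^2/2, valid for a in [0, 1] (and +infinity outside)."""
    return -a + 0.5 * a * a
```

The quadratic conjugate is what makes the one-dimensional dual update (8b) analytic for this loss.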
6.1 Artificial Data
Here we execute numerical experiments on artificial data sets. The problem is a classification problem with an overlapped group regularization, as in (Suzuki, 2013). We generated input feature vectors whose features are drawn from the i.i.d. standard normal distribution. The true weight vector is then generated as follows: first we generate a random matrix which has nonzero elements in its first column (drawn from the i.i.d. standard normal distribution) and zeros in the other columns, and we vectorize this matrix to obtain the true weight vector $w^*$. The training label is given by $y_i = \mathrm{sign}(z_i^\top w^* + \epsilon_i)$, where $\epsilon_i$ is distributed from the normal distribution with mean 0 and a fixed standard deviation.
The group regularization is given by the sum of the $\ell_2$ norms over (overlapping) groups of entries of the matrix obtained by reshaping $w$, plus a small quadratic term. The quadratic term is added to make the regularization function strongly convex^{3}^{3}3Even if there is no quadratic term, our method converged with almost the same speed.. Since there exist overlaps between the groups, the proximal operation cannot be straightforwardly computed (Jacob et al., 2009). To deal with this regularization function in our framework, we let $B$ duplicate the overlapped coordinates so that the groups of $y = Bw$ no longer overlap; then the proximal operation with respect to $\tilde{\psi}$ is analytically obtained by the group-wise soft-thresholding operation.
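The overlap-splitting construction can be sketched as follows: a duplication matrix $B$ copies each group's coordinates into its own block, so the groups of $y = Bw$ are disjoint and the prox of the resulting non-overlapped group lasso is a group-wise soft-thresholding. The group structure below is a toy example, not the one used in the experiments:

```python
import numpy as np

def duplication_matrix(groups, d):
    """B stacks the coordinates of each group; overlapping coordinates are
    duplicated, so the groups of y = B w no longer overlap."""
    rows = sum(len(g) for g in groups)
    B = np.zeros((rows, d))
    r = 0
    for g in groups:
        for j in g:
            B[r, j] = 1.0
            r += 1
    return B

def prox_group_lasso(q, groups_y, C):
    """Group-wise soft-thresholding for the non-overlapped groups of y."""
    out = q.copy()
    for g in groups_y:
        nrm = np.linalg.norm(q[g])
        out[g] = 0.0 if nrm <= C else (1.0 - C / nrm) * q[g]
    return out
```

For instance, with groups $\{1,2,3\}$ and $\{3,4,5\}$ overlapping at the third coordinate, $B$ maps $\mathbb{R}^5$ to $\mathbb{R}^6$ and the two groups of $y$ become disjoint, so each block is shrunk independently.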
The original RDA requires a direct computation of the proximal operation for the overlapped group penalty. To compute that, we employed the dual formulation proposed by (Yuan et al., 2011).
We independently repeated the experiments 10 times and averaged the excess empirical risk, the expected loss on the test data, and the classification error. Figure 1 shows these three values against CPU time, with standard deviations, for the two sample-size settings considered.


We observe that the excess empirical risk of our method, SDCA-ADMM, indeed converges linearly, while the other stochastic methods do not show linear convergence. Although Batch-ADMM also shows linear convergence, and its convergence speed is comparable to SDCA-ADMM in the small-sample setting, SDCA-ADMM is much faster than Batch-ADMM when the number of samples is large. As for the classification error, the existing stochastic methods also show good performance despite the poor convergence of the empirical risk. On the other hand, SDCA-ADMM rapidly converges to a stable state and shows comparable or better classification accuracy than the existing methods.
6.2 Real Data
Here we execute numerical experiments on real data sets: ‘20 Newsgroups’^{4}^{4}4Available at http://www.cs.nyu.edu/~roweis/data.html. We converted the four-class classification task into binary classification by grouping categories 1, 2 and categories 3, 4 respectively. and ‘a9a’^{5}^{5}5Available at ‘LIBSVM data sets’: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.. ‘20 Newsgroups’ contains 12,995 training samples and 3,247 test samples of dimension 100. ‘a9a’ contains 32,561 training samples and 16,281 test samples of dimension 123. We constructed a similarity graph between features using graph Lasso and applied a graph-guided regularization as in Ouyang et al. (2013). That is, we applied graph Lasso to the training samples to obtain a sparse inverse variance-covariance matrix. Based on this similarity matrix, we connect by edges all index pairs corresponding to its nonzero off-diagonal entries. We denote the set of edges by $E$. Then we impose the following graph-guided regularization: