
Stochastic Dual Coordinate Ascent
with Alternating Direction Multiplier Method

Taiji Suzuki
Department of Mathematical and Computing Sciences,
Tokyo Institute of Technology,
Tokyo 152-8552, JAPAN
s-taiji@is.titech.ac.jp
Abstract

We propose a new stochastic dual coordinate ascent technique that can be applied to a wide range of regularized learning problems. Our method is based on the Alternating Direction Multiplier Method (ADMM) and can deal with complex regularization functions such as structured regularizations. Although the original ADMM is a batch method, the proposed method offers a stochastic update rule where each iteration requires only one or a few sample observations. Moreover, our method naturally affords a mini-batch update, which speeds up the convergence. We show that, under mild assumptions, our method converges exponentially. Numerical experiments show that our method indeed performs efficiently.

Keywords: Stochastic Dual Coordinate Ascent, Alternating Direction Multiplier Method, Exponential Convergence, Structured Sparsity.

1 Introduction

This paper proposes a new stochastic optimization method that shows exponential convergence and can be applied to a wide range of regularization functions by combining the techniques of stochastic dual coordinate ascent and the alternating direction multiplier method. Recently, it has become more and more important to develop efficient optimization methods that can handle a large number of samples. One of the most successful approaches is the stochastic optimization approach. Indeed, a lot of stochastic methods have been proposed to deal with large data sets. Among them, the (online) stochastic gradient method is the most basic and successful one. It can be naturally applied to the regularized learning framework. Such methods go by several different names, including online proximal gradient descent, forward-backward splitting and online mirror descent (Duchi and Singer, 2009). Basically, these methods are intended to process sequentially arriving data. They update the parameter using one new observation and then discard the observed sample. Therefore, they do not need a large memory space to store the whole observed data. The convergence rate of those methods is $O(1/\sqrt{T})$ for general settings and $O(1/T)$ for strongly convex losses, which are minimax optimal (Nemirovskii and Yudin, 1983).

On the other hand, it was recently shown that, if one is allowed to reuse the observed data several times, it is possible to develop a stochastic method with an exponential convergence rate for a strongly convex objective (Le Roux et al., 2013; Shalev-Shwartz and Zhang, 2013c; Shalev-Shwartz and Zhang, 2013a). These methods are still stochastic in the sense that one sample or a small mini-batch is randomly picked for each update. The main difference from the stochastic gradient method is that these methods are intended to process data with a fixed number of training samples. The Stochastic Average Gradient (SAG) method (Le Roux et al., 2013) utilizes an averaged gradient to achieve exponential convergence. The Stochastic Dual Coordinate Ascent (SDCA) method solves the dual problem using a stochastic coordinate ascent technique (Shalev-Shwartz and Zhang, 2013c; Shalev-Shwartz and Zhang, 2013a). These methods combine favorable properties of both the online-stochastic approach and the batch approach. That is, they show a fast decrease of the objective function in the early stage of the optimization, like online-stochastic approaches, and exponential convergence after a "burn-in" period, like batch approaches. However, these methods have some drawbacks. SAG needs to maintain in memory the gradients computed on all training samples, which amounts to the dimension times the sample size. The SDCA method can be applied only to a simple regularization function for which the dual function is easily computed; thus it is hard to apply the method to a complex regularization function such as a structured regularization.

In this paper, we propose a Stochastic Dual Coordinate Ascent method for the Alternating Direction Multiplier Method (SDCA-ADMM). Our method is similar to SDCA, but inherits a favorable property of ADMM. By combining SDCA and ADMM, our method can be applied to a wide range of regularized learning problems. ADMM is an effective optimization method to solve a composite optimization problem of the form $\min_{x,y} \{f(x) + g(y)\}$ subject to a linear constraint $Ax + By = c$ (Gabay and Mercier, 1976; Boyd et al., 2010; Qin and Goldfarb, 2012). This formulation is quite flexible and fits a wide range of applications such as structured regularization, dictionary learning, convex tensor decomposition and so on (Qin and Goldfarb, 2012; Jacob et al., 2009; Tomioka et al., 2011; Rakotomamonjy, 2013). However, ADMM is a batch optimization method. Our approach transforms ADMM into a stochastic method by utilizing a stochastic coordinate ascent technique. Our method, SDCA-ADMM, does not require a large amount of memory because it observes only one or a few samples per iteration. SDCA-ADMM is naturally adapted to a sub-batch situation where a block of a few samples is utilized at each iteration. Moreover, we show that our method converges exponentially for risk functions with a certain strong convexity and smoothness property. The convergence rate is affected by the size of the sub-batch. If the samples are not strongly correlated, the sub-batch update gives a better convergence rate than the one-sample update.

2 Structured Regularization and its Dual Formulation

In this section, we give the problem formulation of structured regularization and its dual formulation. The standard regularized risk minimization is described as follows:

$\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \tilde{\psi}(w) \right\},$  (1)

where $z_i$ are vectors in $\mathbb{R}^p$, $w \in \mathbb{R}^p$ is the weight vector that we want to learn, $f_i$ is a loss function for the $i$-th sample, and $\tilde{\psi}$ is the regularization function which is used to avoid over-fitting. For example, the loss function can be taken as a classification surrogate loss $f_i(z_i^\top w) = \ell(y_i z_i^\top w)$ where $y_i$ is the training label of the $i$-th sample. With regard to $\tilde{\psi}$, we are interested in sparsity inducing regularizations, e.g., $\ell_1$-regularization, group lasso regularization, trace-norm regularization, and so on. Our motivation in this paper is to deal with a "complex" regularization for which it is not easy to directly minimize the regularization function (more precisely, the proximal operation determined by $\tilde{\psi}$ is not easily computed; see Eq. (5)). This kind of regularization appears in, for example, structured sparsity such as overlapped group lasso and graph regularization (Jacob et al., 2009; Signoretto et al., 2010). In many cases, such a "complex" regularization function can be decomposed into a "simple" regularization $\psi$ and a linear transformation $B$, that is, $\tilde{\psi}(w) = \psi(B^\top w)$ where $B \in \mathbb{R}^{p \times d}$. Using this formulation, the optimization problem (Eq. (1)) is equivalent to

$\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \right\}.$  (2)

The purpose of this paper is to give an efficient stochastic optimization method to solve this problem (2). For this purpose, we employ the dual formulation. Using Fenchel's duality theorem, we obtain the following dual formulation.

Lemma 1.

$\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \right\} = -\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*(y) ~\text{s.t.}~ \frac{1}{n}Zx + By = 0 \right\},$  (3)

where $f_i^*$ and $\psi^*$ are the convex conjugates of $f_i$ and $\psi$ respectively (Rockafellar, 1970; the convex conjugate of a function $f$ is defined by $f^*(x) := \sup_u \{\langle x, u \rangle - f(u)\}$), and $Z = [z_1, \dots, z_n] \in \mathbb{R}^{p \times n}$. Moreover, $w^*$, $x^*$ and $y^*$ are optimal solutions of the two sides if and only if $x_i^* \in \partial f_i(z_i^\top w^*)$ for all $i$ and $B^\top w^* \in \partial \psi^*(y^*)$.

Proof.

By Fenchel's duality theorem (Corollary 31.2.1 of Rockafellar (1970)), we have that

$\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \tilde{\psi}(w) \right\} = -\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \tilde{\psi}^*(-Zx/n) \right\}.$  (4)

Moreover, $(w, x)$ are optimal solutions of each side if and only if $x_i \in \partial f_i(z_i^\top w)$ for all $i$ and $w \in \partial \tilde{\psi}^*(-Zx/n)$ (Corollary 31.3 of Rockafellar (1970)). Now, Theorem 16.3 of Rockafellar (1970) gives that

$\tilde{\psi}^*(v) = (\psi \circ B^\top)^*(v) = \min_{y \in \mathbb{R}^d} \{ \psi^*(y) ~\text{s.t.}~ By = v \}.$

Thus $\tilde{\psi}^*(-Zx/n) = \min_{y: By = -Zx/n} \psi^*(y)$, and substituting this into the RHS of Eq. (4) we obtain Eq. (3). Now, $y$ satisfying $By = -Zx/n$ attains this minimum if and only if $\psi^*(y) = \tilde{\psi}^*(-Zx/n)$ for the optimal $x$. Thus, if $(w, x, y)$ is optimal, then we have $w \in \partial \tilde{\psi}^*(-Zx/n)$, so that $\psi^*(y) + \psi(B^\top w) = \tilde{\psi}^*(-Zx/n) + \tilde{\psi}(w) = \langle -Zx/n, w \rangle = \langle y, B^\top w \rangle$, which implies $B^\top w \in \partial \psi^*(y)$. Conversely, if $B^\top w \in \partial \psi^*(y)$, then it is obvious that $w \in \partial \tilde{\psi}^*(-Zx/n)$ because $\tilde{\psi}^*(-Zx/n) \le \psi^*(y)$ while the Fenchel-Young inequality forces equality. Therefore, we obtain the optimality conditions.

The dual problem is a composite objective function optimization with a linear constraint $\frac{1}{n}Zx + By = 0$. In the next section, we give an efficient stochastic method to solve this dual problem. A nice property of the dual formulation is that, in many machine learning applications, the dual loss function $f_i^*$ becomes strongly convex. For example, for the logistic loss $f_i(u) = \log(1 + \exp(-u))$, the dual function is $f_i^*(x) = (-x)\log(-x) + (1 + x)\log(1 + x)$ for $x \in [-1, 0]$, and its modulus of strong convexity is much better than that of the primal loss. More importantly, each sample $z_i$ directly affects only the corresponding coordinate $x_i$ of the dual variable. In other words, if $x_i$ is fixed, the $i$-th sample has no influence on the objective value. This enables us to utilize the stochastic coordinate ascent technique in the dual problem, because the update of a single coordinate $x_i$ requires only the information of the $i$-th sample $z_i$.
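As a sanity check of this conjugate pair, the closed form above can be compared against a direct numerical evaluation of $f^*(x) = \sup_u \{xu - f(u)\}$. The following sketch is our own illustration (not part of the paper); it assumes NumPy and SciPy:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def logistic(u):
        # primal logistic loss f(u) = log(1 + exp(-u))
        return np.log1p(np.exp(-u))

    def logistic_conj(x):
        # closed form of the convex conjugate on the domain (-1, 0)
        return (-x) * np.log(-x) + (1 + x) * np.log(1 + x)

    for x in [-0.9, -0.5, -0.1]:
        # numerically evaluate f*(x) = sup_u { x*u - f(u) } = -min_u { f(u) - x*u }
        res = minimize_scalar(lambda u: logistic(u) - x * u,
                              bounds=(-50.0, 50.0), method="bounded")
        assert abs(-res.fun - logistic_conj(x)) < 1e-6

For instance, at $x = -1/2$ both expressions give $-\log 2$, and the numerical maximizer is $u = 0$.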

Finally, we give the precise notion of "complex" and "simple" regularizations. This notion is defined by the computational complexity of the proximal operation corresponding to the regularization function (Rockafellar, 1970). The proximal operation corresponding to a convex function $\psi$ is defined by

$\mathrm{prox}(q|\psi) = \mathop{\mathrm{argmin}}_{x} \left\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \right\}.$  (5)

For example, the proximal operation corresponding to the $\ell_1$-regularization $\psi(x) = C\|x\|_1$ is easily computed as $\mathrm{prox}(q|C\|\cdot\|_1)_j = \mathrm{sign}(q_j)\max\{|q_j| - C, 0\}$, which is the so-called soft-thresholding operation. More generally, the proximal operation for the group lasso regularization with non-overlapping groups can also be computed analytically. On the other hand, for the overlapped group regularization, the proximal operation is no longer obtained analytically. However, by choosing $B$ appropriately, we can split the overlap and obtain a "simple" $\psi$ for which the proximal operation is easily computed (see Section 6 for concrete examples).
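For concreteness, here is a minimal sketch of the two analytic proximal operations just mentioned, soft-thresholding and the block-wise prox for non-overlapping groups. This is standard material, written in NumPy purely for illustration:

    import numpy as np

    def prox_l1(q, c):
        # soft-thresholding: prox of c * ||.||_1
        return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

    def prox_group_lasso(q, c, groups):
        # block-wise soft-thresholding: prox of c * sum_g ||x_g||
        # for NON-overlapping index groups; each block shrinks toward zero
        out = q.copy()
        for g in groups:
            norm = np.linalg.norm(q[g])
            out[g] = 0.0 if norm <= c else (1.0 - c / norm) * q[g]
        return out

    q = np.array([3.0, -0.5, 1.0, -2.0])
    print(prox_l1(q, 1.0))                            # [ 2. -0.  0. -1.]
    print(prox_group_lasso(q, 1.0, [[0, 1], [2, 3]]))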

3 Proposed Method: Stochastic Dual Coordinate Ascent with ADMM

In this section, we present our proposal, a stochastic dual coordinate ascent type of ADMM. For a positive semidefinite matrix $G$, we denote $\|x\|_G := \sqrt{x^\top G x}$. $Z_i$ denotes the $i$-th column of $Z$, which is $z_i$, and $Z_{\setminus i}$ is the matrix obtained by removing the $i$-th column from $Z$. Similarly, for a vector $x$, $x_{\setminus i}$ is the vector obtained by removing the $i$-th component from $x$.

3.1 One Sample Update of SDCA for ADMM

The basic update rule of our proposed method at the $t$-th step is given as follows: at each update step, choose an index $i \in \{1, \dots, n\}$ uniformly at random, and update $y^{(t)}$, $x_i^{(t)}$ and $w^{(t)}$ as

(6a)
(6b)
(6c)

where $w^{(t)}$ is the primal variable at the $t$-th step, $Q$ and $G_i$ are arbitrary positive semidefinite matrices defining the proximity terms $\frac{1}{2}\|y - y^{(t-1)}\|_Q^2$ and $\frac{1}{2}\|x_i - x_i^{(t-1)}\|_{G_i}^2$, and $\rho, \gamma > 0$ are parameters we give beforehand.

The optimization procedure looks a bit complicated. To simplify it, we set $Q$ as

$Q = \rho\,(\eta_B I_d - B^\top B),$  (7)

where $\eta_B > 0$ is chosen so that $Q$ is positive semidefinite, i.e., $\eta_B \ge \sigma_{\max}(B^\top B)$. Then, by carrying out simple calculations and denoting by $q^{(t)}$ the resulting shifted point, the update rules of $y$ and $x_i$ are rewritten as

(8a)
(8b)

Note that the update (8b) of $x_i$ is just a one-dimensional optimization; thus it is computed quite easily. Moreover, for some loss functions, such as the smoothed hinge loss used in Section 6, we have an analytic form of the update.

The update rule (8a) of $y$ is given by the proximal operation corresponding to the dual function $\psi^*$, but it can be rewritten in terms of the proximal operation corresponding to the primal function $\psi$. Indeed, there is a clear relation between the primal and dual proximal operations (Theorem 31.5 of Rockafellar (1970)):

$\mathrm{prox}(q|\psi) + \mathrm{prox}(q|\psi^*) = q.$

Using this, for the update (8a), we have that

$y^{(t)} = q^{(t)} - \mathrm{prox}(q^{(t)}|\psi)$  (9)

(up to the scaling constants determined by $\rho$ and $\eta_B$), because $\mathrm{prox}(q|\psi^*) = q - \mathrm{prox}(q|\psi)$ for a convex function $\psi$. This is computed efficiently because we assumed that the proximal operation corresponding to $\psi$ can be computed efficiently.

During the update, we need $Zx^{(t-1)}$, which seems to require $O(np)$ computation at first glance. However, it can be incrementally updated as $Zx^{(t)} = Zx^{(t-1)} + z_i (x_i^{(t)} - x_i^{(t-1)})$, which costs only $O(p)$. Thus we do not need to load all the samples to compute $Zx^{(t)}$ at each iteration.
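In code, this bookkeeping is a one-line update. The sketch below is our own illustration with hypothetical variable names: it keeps a running vector v = Z @ x and touches only the $i$-th column per step.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 5, 100
    Z = rng.standard_normal((p, n))
    x = np.zeros(n)
    v = Z @ x                  # v = Z x, computed once at initialization: O(n p)

    def update_coordinate(v, x, i, x_i_new):
        # refresh v = Z x after changing a single coordinate: O(p), not O(n p)
        v += Z[:, i] * (x_i_new - x[i])
        x[i] = x_i_new
        return v, x

    v, x = update_coordinate(v, x, 3, 0.7)
    assert np.allclose(v, Z @ x)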

In the above, the update rule of our algorithm is based on a single sample observation. Next, we give a mini-batch extension of the algorithm in which more than one sample can be used at each iteration.

3.2 Mini-Batch Extension

Here, we generalize our method to the mini-batch situation where, at each iteration, we observe a small number of samples instead of a single sample. At each iteration, we randomly choose an index set $I \subseteq \{1, \dots, n\}$ so that each index is included in $I$ with equal probability: $P(i \in I) = 1/K$ for all $i$. To do so, we suggest the following procedure. We split the index set $\{1, \dots, n\}$ into $K$ groups $I_1, \dots, I_K$ beforehand, and then, at each iteration, pick $k \in \{1, \dots, K\}$ uniformly at random and set $I = I_k$. Each sub-batch can have a cardinality different from the others, but the probability $P(i \in I)$ should be uniform over all $i$. The update rule using the sub-batch $I$ is given as follows: update $y^{(t)}$ as before (6a), and update $x^{(t)}$ and $w^{(t)}$ by

(10a)
(10b)

Using $Q$ given in Eq. (7), the update rule of $y$ can be replaced by Eq. (9) as in the one-sample update situation. The update rule of $x$ can also be simplified by choosing $G_I$ appropriately. Because the sub-batches have no overlap with each other, we can construct a positive semidefinite matrix $G$ whose block-diagonal element corresponding to the sub-batch $I_k$ has the form

$G_{I_k} = \eta_k I_{|I_k|} - \frac{\rho}{n^2} Z_{I_k}^\top Z_{I_k},$  (11)

where $\eta_k$ is a positive real satisfying $\eta_k \ge \frac{\rho}{n^2}\sigma_{\max}(Z_{I_k}^\top Z_{I_k})$ so that $G_{I_k}$ is positive semidefinite. The reason why we split the index set into $K$ sets is to construct this kind of $G$, which "diagonalizes" the quadratic function in (10a). The choice of $Q$ and $G$ could be replaced with another one for which we could compute the update efficiently, as long as $P(i \in I)$ is uniform over all $i$. Using $G_{I_k}$ given in (11), the update rule (10a) of $x$ is rewritten as

(12)

where $x_I$ is a vector consisting of the components $x_i$ with indices $i \in I$, and $Z_I$ is a sub-matrix of $Z$ consisting of the columns with indices $i \in I$. Note that, since $\sum_{i \in I} f_i^*(x_i)$ is a sum of single-variable convex functions, the proximal operation in Eq. (12) can be split into proximal operations with respect to each single variable $x_i$. This is advantageous not only for the simplicity of the computation but also for parallel computation. That is, the update rule (12) reduces to a one-dimensional update for each $i \in I$, which is easily parallelizable. In summary, our proposed algorithm is given in Algorithm 1.

  Input: the parameters $\rho, \gamma > 0$, the sub-batches $I_1, \dots, I_K$, and the number of iterations $T$.
  Initialize $x^{(0)} = 0$, $y^{(0)} = 0$ and $w^{(0)} = 0$.
  for $t = 1$ to $T$ do
     Choose $k \in \{1, \dots, K\}$ uniformly at random, set $I = I_k$, and observe the training samples $\{z_i\}_{i \in I}$.
     Set $q^{(t)}$ (maintained incrementally as in Section 3.1).
     Update $y^{(t)}$ by Eq. (9).
     Update $x_I^{(t)}$ by Eq. (12).
     Update $w^{(t)}$ by Eq. (10b).
  end for
  Output: $w^{(T)}$.
Algorithm 1 SDCA-ADMM
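To make the control flow of Algorithm 1 concrete, the following Python sketch mirrors its structure. It is only a skeleton under our assumptions: the three callbacks stand in for the exact update formulas (Eqs. (9), (12) and (10b)), and all variable names are ours, not the paper's.

    import numpy as np

    def sdca_admm(Z, B, update_y, update_x_block, update_w, K=10, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        p, n = Z.shape
        d = B.shape[1]
        x, y, w = np.zeros(n), np.zeros(d), np.zeros(p)
        u = np.zeros(p)                                # maintains u = Z x / n incrementally
        blocks = np.array_split(rng.permutation(n), K) # fixed split into K sub-batches
        for t in range(T):
            I = blocks[rng.integers(K)]                # each index lands in I with prob. 1/K
            y = update_y(y, u, w)                      # Eq. (9): prox step w.r.t. psi
            x_new = update_x_block(x[I], Z[:, I], u, y, w)   # Eq. (12): coordinate-wise prox
            u = u + Z[:, I] @ (x_new - x[I]) / n             # incremental update of Z x / n
            x[I] = x_new
            w = update_w(w, u, y)                      # Eq. (10b): multiplier-style update
        return w

Note how only the columns $Z_I$ and the running vector $u$ are touched per iteration, which is exactly the source of the memory and time savings discussed above.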

Finally, we would like to highlight the connection between our method and the original batch ADMM (Hestenes, 1969; Powell, 1969; Rockafellar, 1976). The batch ADMM utilizes the following update rule:

$y^{(t)} = \mathop{\mathrm{argmin}}_{y \in \mathbb{R}^d} \left\{ \psi^*(y) - \langle w^{(t-1)}, Zx^{(t-1)}/n + By \rangle + \frac{\rho}{2}\| Zx^{(t-1)}/n + By \|^2 \right\},$  (13a)
$x^{(t)} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \left\{ \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) - \langle w^{(t-1)}, Zx/n + By^{(t)} \rangle + \frac{\rho}{2}\| Zx/n + By^{(t)} \|^2 \right\},$  (13b)
$w^{(t)} = w^{(t-1)} - \gamma\rho\,(Zx^{(t)}/n + By^{(t)}).$  (13c)

One can see that the update rule of our algorithm reduces to that of the batch ADMM (13) if we set $K = 1$ (that is, $I = \{1, \dots, n\}$), except for the terms related to $Q$ and $G$ (the proximity terms $\frac{1}{2}\|y - y^{(t-1)}\|_Q^2$ and $\frac{1}{2}\|x_I - x_I^{(t-1)}\|_{G_I}^2$). The terms related to $Q$ and $G$ are used also in the batch situation to eliminate the cross terms in $B^\top B$ and $Z^\top Z$. This technique is called linearization. The linearization technique makes the update rule simple and parallelizable, and in some situations makes it possible to obtain an analytic form of the update.
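As a self-contained illustration of the batch update (13), here is a textbook scaled-form ADMM on a deliberately tiny toy problem, $\min_{x,y} \frac{1}{2}\|x - a\|^2 + c\|y\|_1$ subject to $x - y = 0$. This is our own example, not the dual problem of this paper; its fixed point is the soft-thresholding of $a$, which the final assertion checks:

    import numpy as np

    def soft(q, c):
        return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

    a = np.array([2.0, -0.3, 0.8])
    c, rho = 0.5, 1.0
    x = np.zeros(3)
    y = np.zeros(3)
    u = np.zeros(3)                              # scaled dual (multiplier) variable
    for _ in range(200):
        x = (a + rho * (y - u)) / (1.0 + rho)    # quadratic x-update
        y = soft(x + u, c / rho)                 # l1 y-update (soft-thresholding)
        u = u + x - y                            # dual (multiplier) update
    assert np.allclose(y, soft(a, c), atol=1e-6)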

4 Linear Convergence of SDCA-ADMM

In this section, the convergence rate of our proposed algorithm is given. Indeed, the convergence rate is exponential (R-linear). To show the convergence rate, we assume some conditions. First, we assume that there exists a unique optimal solution $w^*$ and that $B^\top$ is injective (on the other hand, $B$ is not necessarily injective). Moreover, we assume the uniqueness of the dual solution $x^*$, but do not assume the uniqueness of $y^*$. We denote by $Y^*$ the set of dual optima in $y$ and assume that $Y^*$ is compact. Then, by Lemma 1, we have that

(14)

By the convex duality arguments, this implies that the optimal objective values of the primal and dual problems coincide.

Moreover, we suppose that each (dual) loss function $f_i^*$ is locally strongly convex and smooth around the optimal solution, and that $\psi^*$ is also locally strongly convex in a weak sense, as follows.

Assumption 1.

There exists $\sigma > 0$ such that, for all $i$ and all points in a neighborhood of $x_i^*$, $f_i^*$ is $\sigma$-strongly convex and smooth.

There exist $v_\psi > 0$ and $\epsilon > 0$ such that, for all $y$ in an $\epsilon$-neighborhood of $Y^*$, there exists $y^* \in Y^*$ such that

(15)

and for all such $y$ we have

(16)

where $\Pi$ is the projection matrix onto the kernel of $B$.

Note that these conditions need to be satisfied only around the optimal solutions $x^*$ and $Y^*$. They do not need to hold at every point, and thus are much weaker than the ordinary strong convexity. Moreover, the inequalities need to be satisfied only along the solution sequence of our algorithm. The condition (15) is satisfied, for example, by the $\ell_1$-regularization, because the dual of the $\ell_1$-regularization is an indicator function with a compact support and, outside the optimal solution set $Y^*$, the indicator function is lower bounded by a quadratic function. In addition, the quadratic term on the right hand side of condition (15) is restricted to the kernel of $B$. This makes it possible to include several types of regularization functions. Indeed, if $B$ is injective, this condition is always satisfied. The assumption (16) is the strongest assumption. It is satisfied by the elastic-net regularization. If one wants to obtain a solution for a non-strongly convex regularization such as the $\ell_1$-regularization, then by just adding a small squared term $\frac{\epsilon'}{2}\|\cdot\|^2$ we obtain an approximate solution which is sufficiently close to the true one within a precision proportional to $\epsilon'$.
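To see why the small added square term is harmless, a one-line bound suffices (our own back-of-the-envelope argument, not taken from the paper). Writing $P_{\epsilon'}(w) = P(w) + \frac{\epsilon'}{2}\|w\|^2$ with minimizers $w_{\epsilon'}$ and $w^*$ of $P_{\epsilon'}$ and $P$ respectively:

    \[
    P(w_{\epsilon'}) \;\le\; P_{\epsilon'}(w_{\epsilon'}) \;\le\; P_{\epsilon'}(w^*)
      \;=\; P(w^*) + \tfrac{\epsilon'}{2}\|w^*\|^2,
    \qquad\text{hence}\qquad
    P(w_{\epsilon'}) - P(w^*) \;\le\; \tfrac{\epsilon'}{2}\|w^*\|^2 .
    \]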

Define the primal and dual objectives as

$P(w) := \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w), \qquad D(x, y) := \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*(y).$

Note that, by Eq. (14), $P(w) - P(w^*)$ and $D(x, y) - D(x^*, y^*)$ (on the constraint set) are always non-negative. Define the block diagonal matrix $G$ by setting its block corresponding to $I_k$ to $G_{I_k}$ for all $k = 1, \dots, K$ and setting the off-diagonal blocks to zero. From these quantities we also define the constants appearing in Theorem 2 below. For a symmetric matrix $S$, we denote by $\sigma_{\max}(S)$ and $\sigma_{\min}(S)$ the maximum and minimum singular values of $S$ respectively.

Theorem 2.

Suppose that $G_{I_k} \succeq O$ for all $k = 1, \dots, K$ and $B^\top$ is injective. Then, under Assumption 1, the dual objective function converges R-linearly: we have that, for $t \ge 1$, the expected dual objective residual $\mathbb{E}\big[D(x^{(t)}, y^{(t)})\big] - D(x^*, y^*)$ is bounded by a constant multiple of $(1 - 1/C)^t$, where the constant $C$ depends on $\rho$, $\gamma$, $K$, the moduli in Assumption 1 and the singular values defined above. In particular, we have that the expected primal objective residual $\mathbb{E}\big[P(w^{(t)})\big] - P(w^*)$ also converges R-linearly. If we further assume the local strong convexity (16) with a positive modulus, then this implies that $w^{(t)}$ converges R-linearly to the optimum $w^*$.

The proof is deferred to the appendix. This theorem shows that the primal and dual objective values converge R-linearly. Moreover, the primal variable also converges R-linearly to the optimum. The number $K$ of sub-batches controls the convergence rate. If all samples are nearly orthogonal to each other, the maximum singular value of each sub-batch Gram matrix $Z_{I_k}^\top Z_{I_k}$ is bounded by a constant for all $k$, and thus the convergence rate becomes faster and faster as $K$ decreases (that is, as the size of each sub-batch grows). On the other hand, if the samples are strongly correlated with each other, this quantity grows linearly with the sub-batch size, and then the convergence rate is not improved by decreasing $K$. As for the batch setting, the linear convergence of batch ADMM has been shown by Deng and Yin (2012). However, their proof cannot be directly applied to our stochastic setting; our proof requires a technique specialized to the stochastic coordinate ascent update. We would like to point out that the exponential convergence is not guaranteed if the choice of the index set at each update is cyclic; thus the indices should be chosen randomly. This is also reported in Shalev-Shwartz and Zhang (2013a).

The statement can also be described in terms of the number of iterations required to achieve a precision $\epsilon$, i.e., the smallest $t$ satisfying $\mathbb{E}[D(x^{(t)}, y^{(t)})] - D(x^*, y^*) \le \epsilon$: this number is of the order $C_1 + C_2 \log(1/\epsilon)$, where $C_1$ and $C_2$ are constants depending on the quantities above but not on $\epsilon$. This says that the dependency on $\epsilon$ is of log order. An interesting point is the influence of $v_\psi$, the modulus of local strong convexity of $\psi^*$. Usually the regularization function is made weaker as the number of samples increases; in that situation, $v_\psi$ decreases as $n$ goes up. However, even if we set $v_\psi$ as small as $O(1/n)$, it enters the iteration complexity only through a logarithmic factor instead of a linear one. Thus, the convergence rate is hardly affected by the setting of $v_\psi$. This point is the same as in the ordinary SDCA algorithm (Shalev-Shwartz and Zhang, 2013a).

5 Related Works

In this section, we present some related works and discuss differences from our method.

The most closely related work is a recent study by Shalev-Shwartz and Zhang (2013c), in which the Stochastic Dual Coordinate Ascent (SDCA) method for regularized risk minimization is proposed. Their method also deals with the dual problem (3), with $B = I$ in our setting, and applies a stochastic coordinate ascent technique. This method converges linearly. At each iteration, the method solves a one-dimensional optimization problem with respect to a single randomly chosen dual coordinate, and updates that coordinate and the primal variable $w$ accordingly. The most important difference from our method is the computation involving the regularization function. For a "simple" regularization function, it is often easy to compute the (sub-)gradient or proximal operation of $\tilde{\psi}$ and its conjugate. However, for a "complex" regularization such as a structured regularization, this computation cannot be carried out efficiently. To overcome this difficulty, our method utilizes the linearly transformed variable $B^\top w$ and splits the optimization with respect to $x$ and $y$ by applying the ADMM technique. Thus, our method is applicable to much more general regularization functions. Recently, mini-batch extensions of SDCA have become a hot topic (Takáč et al., 2013; Shalev-Shwartz and Zhang, 2013b). Our approach realizes the mini-batch extension using the linearization technique, which is naturally derived in the framework of ADMM. Although the proof technique is quite different, the convergence analysis of the normal mini-batch SDCA given by Shalev-Shwartz and Zhang (2013b) parallels our theorem.

The second related method is the Stochastic Average Gradient (SAG) method (Le Roux et al., 2013). The method is a modification of the stochastic gradient descent method, but utilizes an averaged gradient. A good point of their method is that we only need to deal with the primal problem; thus the computation is easy, and we do not need to look at the convex conjugate function. Moreover, their method also converges linearly. However, the biggest drawback is that, to compute the averaged gradient, the gradients of the loss functions computed at all samples should be stored in memory. The required memory size is usually proportional to the dimension times the sample size, which is hard to store in a big-data situation. On the other hand, our method is free from such a memory problem: it only needs to keep the current primal and dual variables.

The third class of related methods consists of online versions of ADMM. Recently, some online variants of ADMM have been proposed by Wang and Banerjee (2012), Suzuki (2013) and Ouyang et al. (2013). These methods are effective for the complex regularizations discussed in this paper; thus they are applicable to a wide range of situations. However, these methods are basically online methods, and they discard the samples once observed. They are not adapted to a situation where the training samples can be observed several times. Therefore, the convergence rate is $O(1/\sqrt{T})$ in general and $O(\log(T)/T)$ for a strongly convex loss (possibly with some modification). On the other hand, our method converges linearly.

6 Numerical Experiments

In this section, we give numerical experiments on artificial and real data to demonstrate the effectiveness of our proposed algorithm (all the experiments were carried out on an Intel Core i7 2.93GHz with 8GB of RAM). We compare our SDCA-ADMM with existing stochastic optimization methods, namely Regularized Dual Averaging (RDA) (Duchi and Singer, 2009; Xiao, 2009), Online ADMM (OL-ADMM) (Wang and Banerjee, 2012), Online Proximal Gradient descent ADMM (OPG-ADMM) (Ouyang et al., 2013; Suzuki, 2013) and RDA-ADMM (Suzuki, 2013). We also compared our method with the batch ADMM (Batch-ADMM) on the artificial data sets. We used sub-batches of size 50 for all the methods including ours (the size of the last sub-batch could be less than 50). We employed the parameter settings described in Section 3; in particular, $Q$ and $G$ were set by the linearization rules (7) and (11). All of the experiments are classification problems with structured sparsity. We employed the smoothed hinge loss:

$f_i(u) = \begin{cases} 0 & (y_i u \ge 1), \\ \frac{1}{2} - y_i u & (y_i u \le 0), \\ \frac{1}{2}(1 - y_i u)^2 & (\text{otherwise}). \end{cases}$

Then the proximal operation with respect to the dual function of the smoothed hinge loss is given analytically; with unit step scaling it reads

$\mathrm{prox}(q|f_i^*) = y_i \min\{0, \max\{-1, \, y_i(q - y_i)/2\}\},$

and the general case in Eq. (8b) only changes the constants.
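A quick numerical check of this closed form is given below. This is our own sketch (NumPy/SciPy assumed), fixing the label $y_i = 1$ and unit scaling; the conjugate of the smoothed hinge loss for $y_i = 1$ is $f^*(x) = x + x^2/2$ on the domain $x \in [-1, 0]$:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def conj(x):
        # conjugate of the smoothed hinge loss for y_i = 1 (domain x in [-1, 0])
        return x + 0.5 * x * x

    def prox_conj(q):
        # closed-form prox (unit scaling): clip (q - 1)/2 to [-1, 0]
        return min(0.0, max(-1.0, (q - 1.0) / 2.0))

    for q in np.linspace(-3.0, 3.0, 13):
        res = minimize_scalar(lambda x: conj(x) + 0.5 * (x - q) ** 2,
                              bounds=(-1.0, 0.0), method="bounded")
        assert abs(res.x - prox_conj(q)) < 1e-4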

6.1 Artificial Data

Here we execute numerical experiments on artificial data sets. The problem is a classification problem with an overlapped group regularization, as performed in Suzuki (2013). We generated input feature vectors $z_i$ whose features are drawn from the i.i.d. standard normal distribution. The true weight vector is generated as follows: first we generate a random matrix $W^*$ which has non-zero elements in its first column (drawn from the i.i.d. standard normal distribution) and zeros in the other columns, and we vectorize the matrix to obtain $w^* = \mathrm{vec}(W^*)$. The training label is given by $y_i = \mathrm{sign}(z_i^\top w^* + \xi_i)$ where $\xi_i$ is drawn from the normal distribution with mean 0 and a fixed standard deviation.

The group regularization is given as

$\tilde{\psi}(w) = C\Big(\sum_{g \in \mathcal{G}_{\mathrm{row}}} \|W_{g,:}\| + \sum_{h \in \mathcal{G}_{\mathrm{col}}} \|W_{:,h}\|\Big) + \frac{\mu}{2}\|w\|^2,$

where $W$ is the matrix obtained by reshaping $w$ and the groups run over the rows and the columns of $W$. The quadratic term is added to make the regularization function strongly convex (even without the quadratic term, our method converged with almost the same speed). Since there exist overlaps between the row groups and the column groups, the proximal operation cannot be straightforwardly computed (Jacob et al., 2009). To deal with this regularization function in our framework, we let $B = [I_p, I_p]$ (so that $B^\top w$ stacks two copies of $w$) and let $\psi$ be the sum of a row-group term acting on the first copy and a column-group term acting on the second copy, each with half of the quadratic term. Then we can see that $\tilde{\psi}(w) = \psi(B^\top w)$ and the proximal operation with respect to $\psi$ is analytically obtained; indeed, it is easily checked that, for each group $g$,

$\mathrm{prox}(q|\psi)_g = \frac{1}{1 + \mu/2}\Big(1 - \frac{C}{\|q_g\|}\Big)_+ q_g,$

where $(a)_+ = \max\{a, 0\}$.
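The following sketch (our own construction, with the reshaping dimensions chosen arbitrarily) builds the duplication matrix B and the group-wise prox for the split regularizer described above:

    import numpy as np

    d1, d2 = 4, 5
    p = d1 * d2
    B = np.hstack([np.eye(p), np.eye(p)])    # B^T w stacks two copies of w

    def prox_psi(v, C, mu):
        # group-wise prox: rows of the first copy, columns of the second copy;
        # each group is soft-thresholded and then scaled by 1/(1 + mu/2)
        v1 = v[:p].reshape(d1, d2).copy()
        v2 = v[p:].reshape(d1, d2).copy()
        for M, axis in ((v1, 1), (v2, 0)):   # axis=1: row norms, axis=0: column norms
            norms = np.linalg.norm(M, axis=axis, keepdims=True)
            scale = np.maximum(0.0, 1.0 - C / np.maximum(norms, 1e-12)) / (1.0 + mu / 2.0)
            M *= scale
        return np.concatenate([v1.ravel(), v2.ravel()])

    w = np.arange(p, dtype=float)
    print(prox_psi(B.T @ w, C=1.0, mu=0.1).shape)   # (40,)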

The original RDA requires a direct computation of the proximal operation for the overlapped group penalty. To compute it, we employed the dual formulation proposed by Yuan et al. (2011).

We independently repeated the experiments 10 times and averaged the excess empirical risk, the expected loss on the test data and the classification error. Figure 1 shows these three values against CPU time, with the standard deviation, for a smaller sample size (a) and a larger sample size (b).

Figure 1: Artificial data: excess empirical risk, expected loss on the test data and test classification error, averaged over 10 independent runs, plotted against CPU time, for the two sample-size settings (a) and (b). The error bars indicate the standard deviation.

We observe that the excess empirical risk of our method, SDCA-ADMM, indeed converges linearly, while the other stochastic methods do not show linear convergence. Although Batch-ADMM also shows linear convergence and its convergence speed is comparable to that of SDCA-ADMM in the small sample setting (a), SDCA-ADMM is much faster than Batch-ADMM when the number of samples is large (b). As for the classification error, the existing stochastic methods also show nice performance despite the poor convergence of the empirical risk. On the other hand, SDCA-ADMM rapidly converges to a stable state and shows comparable or better classification accuracy than the existing methods.

6.2 Real Data

Here we execute numerical experiments on real data sets: '20 Newsgroups' (available at http://www.cs.nyu.edu/~roweis/data.html; we converted the four-class classification task into a binary classification by grouping categories 1, 2 and categories 3, 4 respectively) and 'a9a' (available in the 'LIBSVM data sets' at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). '20 Newsgroups' contains 12,995 100-dimensional training samples and 3,247 test samples. 'a9a' contains 32,561 123-dimensional training samples and 16,281 test samples. We constructed a similarity graph between the features using the graph Lasso and applied a graph guided regularization as in Ouyang et al. (2013). That is, we applied the graph Lasso to the training samples and obtained a sparse inverse variance-covariance matrix $S$. Based on the similarity matrix $S$, we connect every index pair $(a, b)$ with a non-zero entry $S_{a,b} \neq 0$ by an edge. We denote by $E$ the set of edges. Then we impose the following graph guided regularization:

$\tilde{\psi}(w) = C \sum_{(a,b) \in E} |w_a - w_b|.$
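In our framework, this penalty is handled by choosing B so that B^T w collects the edge differences, after which psi is a plain l1 norm whose proximal operation is soft-thresholding. A minimal sketch of this construction (our own illustration; the function and variable names are hypothetical):

    import numpy as np

    def edge_difference_matrix(S, tol=0.0):
        # build B with one column per edge (a, b), so (B^T w)_e = w_a - w_b;
        # the edges are the non-zero off-diagonal entries of the sparse matrix S
        p = S.shape[0]
        edges = [(a, b) for a in range(p) for b in range(a + 1, p)
                 if abs(S[a, b]) > tol]
        B = np.zeros((p, len(edges)))
        for e, (a, b) in enumerate(edges):
            B[a, e] = 1.0
            B[b, e] = -1.0
        return B, edges

    S = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])
    B, E = edge_difference_matrix(S)
    w = np.array([1.0, 3.0, -2.0])
    print(E, B.T @ w)    # [(0, 1), (1, 2)] [-2.  5.]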