Accelerated Block Coordinate Proximal Gradients with Applications in High Dimensional Statistics


Tsz Kit Lau Yuan Yao
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong
tklauag@connect.ust.hk, yuany@ust.hk
Abstract

Nonconvex optimization problems arise in many research fields and have attracted much attention in signal processing, statistics and machine learning. In this work, we explore the accelerated proximal gradient (APG) method and some of its variants which have recently been shown to converge in the nonconvex setting. We show that a novel variant proposed here, which exploits adaptive momentum and block coordinate update with specific update rules, further improves the performance on a broad class of nonconvex problems. In applications to sparse linear regression with regularizations such as Lasso, grouped Lasso, capped $\ell_1$ and SCAD, the proposed scheme enjoys provable local linear convergence, with experimental justification.

 


31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Many problems in machine learning amount to solving the following minimization problem

$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + g(x), \qquad (1)$$

where $f$ is differentiable and $g$ can possibly be nonsmooth. Convexity is not assumed for $f$ and $g$. When (1) is convex, it can be solved efficiently by the accelerated proximal gradient (APG) method (sometimes referred to as FISTA (Beck and Teboulle, 2009)), as in Algorithm 1. APG has a convergence rate of $O(1/k^2)$, which meets the theoretical lower bound of first-order gradient methods for minimizing smooth convex functions.

Input : $x^1 = x^0$, $t_1 = 1$, $t_0 = 0$, step size $\alpha \leq 1/L$.
for $k = 1, 2, \ldots$ do
       $y^k = x^k + \frac{t_{k-1} - 1}{t_k}\big(x^k - x^{k-1}\big)$, $x^{k+1} = \mathbf{prox}_{\alpha g}\big(y^k - \alpha \nabla f(y^k)\big)$, $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$.
Algorithm 1 APG
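To make the iteration concrete, here is a minimal Python sketch of the APG/FISTA scheme in Algorithm 1, assuming the gradient of $f$ and the proximal map of $g$ are supplied as callables; the function and variable names are ours, not the paper's.

import numpy as np

def apg(grad_f, prox_g, x0, alpha, n_iter=500):
    """Minimal APG/FISTA sketch: minimize f(x) + g(x).

    grad_f : callable returning the gradient of the smooth part f
    prox_g : callable (v, alpha) -> prox_{alpha*g}(v)
    alpha  : step size, e.g. 1/L with L the Lipschitz constant of grad_f
    """
    x_prev = x0.copy()
    x = x0.copy()
    t_prev, t = 0.0, 1.0
    for _ in range(n_iter):
        # extrapolation (momentum) step
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)
        # proximal gradient step at the extrapolated point
        x_next = prox_g(y - alpha * grad_f(y), alpha)
        # update the momentum parameter
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        x_prev, x = x, x_next
        t_prev, t = t, t_next
    return x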

For the nonconvex version of (1), APG was first introduced and analyzed by Li and Lin (2015), who propose monotone APG (mAPG) and nonmonotone APG (nmAPG) by exploiting the Kurdyka-Łojasiewicz (KL) property. The main deficiency of these two algorithms is that they require two proximal steps in each iteration. In view of this, Yao et al. (2017) propose nonconvex inexact APG (niAPG), which is equivalent to APG for nonconvex problems (APGnc) in Li et al. (2017) if the proximal step is exact. This algorithm has numerical performance comparable to other state-of-the-art algorithms.

Li et al. (2017) analyze the convergence rates of mAPG and APGnc by exploiting the KL property. They further propose an APGnc+ algorithm, which improves APGnc by introducing an adaptive momentum (Algorithm 3, see Appendix). APGnc+ has the same theoretical convergence rate as APGnc but has better numerical performance. However, the aforementioned APG-like algorithms do not leverage the special structure of the objective function (particularly the structure of ), which is what we want to explore in this paper.

In practice, many machine learning and statistics problems of the form (1) have a separable or block-separable regularizer $g$, so we can rewrite (1) as

$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + \sum_{i=1}^{s} g_i(x_i), \qquad (2)$$

where the variable $x$ has $s$ blocks $x_1, \dots, x_s$, and again the functions $f$ and $g_i$, $i = 1, \dots, s$, can be nonconvex, with the $g_i$ possibly nonsmooth. Block coordinate update is widely applied to solve convex and nonconvex problems of the form (2). Since each iteration has low computational cost and a small memory requirement, the approach lends itself to parallel and distributed implementations and is thus regarded as more feasible for large-scale problems. Xu and Yin (2017) propose a block prox-linear (BPL) method (Algorithm 4, see Appendix), which can be viewed as a block coordinate version of APG: at iteration $k$, only one block $i_k$ is selected and updated. They establish convergence of the whole BPL sequence to a critical point, first obtaining subsequence convergence and then exploiting the KL property again. In their numerical tests, they mainly resort to random shuffling of the blocks and show that it leads to better numerical performance than cyclic (Gauss-Seidel iteration scheme) updates (Attouch et al., 2010, 2013) and randomized block selection (Lin et al., 2015).
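The following Python sketch illustrates a single extrapolated block update of the prox-linear type used by BPL; the block representation (index arrays) and the way the extrapolation weight is passed are our own choices for illustration.

def bpl_block_update(x, x_old, blocks, k, grad_f_block, prox_g_block, alpha, omega):
    """One block prox-linear step (sketch, NumPy arrays assumed).

    x, x_old     : current and previous iterates (1-D arrays)
    blocks       : list of index arrays, one per block
    k            : index of the block chosen at this iteration
    grad_f_block : callable (x, blk) -> gradient of f restricted to the coordinates in blk
    prox_g_block : callable (v, alpha, k) -> prox of alpha * g_k at v
    alpha, omega : step size and extrapolation weight for this update
    """
    blk = blocks[k]
    # extrapolate only the selected block
    x_hat = x.copy()
    x_hat[blk] = x[blk] + omega * (x[blk] - x_old[blk])
    # prox-gradient step on the selected block, other blocks unchanged
    x_new = x.copy()
    x_new[blk] = prox_g_block(x_hat[blk] - alpha * grad_f_block(x_hat, blk), alpha, k)
    return x_new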

Contribution.

Our main contribution is to bring adaptive momentum and the block prox-linear method together in a new algorithm, for a further numerical speed-up sharing the same theoretical convergence guarantee under the KL property. Moreover, a new block update rule based on the Gauss-Southwell rule is shown to beat randomized or cyclic updates, as it selects the “best” block, maximizing the magnitude of the step at each iteration. In applications to high dimensional statistics, we show in particular that for sparse linear regressions with regularizations including the (grouped) Lasso, capped $\ell_1$, and SCAD, the proposed algorithm has provable local linear convergence. §2 presents the algorithm and §3 discusses its applications in these sparse linear regressions together with empirical justification.

2 The BCoAPGnc+ Algorithm

We present our proposed algorithm in Algorithm 2, Block-Coordinate APGnc with adaptive momentum (BCoAPGnc+), which combines the acceleration tools of Algorithms 3 and 4 (see Appendix).

Input : no. of blocks $s$, initial points, step sizes and initial momentum.
for $k = 1, 2, \ldots$ do
       Pick $i_k$ in a deterministic (e.g., GS-r rule) or random manner; extrapolate the selected block with the current momentum and take a block proximal gradient step as in Algorithm 4; if the new iterate attains a lower objective value than the extrapolated point then
             amplify the momentum of that block (as in Algorithm 3)
       else if the objective value does not decrease then
             diminish the momentum of that block.
       
Algorithm 2 Block-Coordinate APGnc with adaptive momentum (BCoAPGnc+)

In each iteration of this algorithm, the block extrapolation and proximal updates follow those of Algorithm 4. The extrapolation step aims to further exploit the opportunity for acceleration by magnifying the momentum of the selected block when the update achieves an even lower objective value; if this does not hold, the momentum is diminished. This intuitive yet efficient step follows the main idea of Algorithm 3.
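To illustrate how the pieces fit together, here is a hedged Python sketch of a BCoAPGnc+-style loop; the momentum amplification/diminishing factors (up, down, beta_max) are placeholders of our own, since Algorithm 2 prescribes its own update rules.

import numpy as np

def bcoapgnc_plus(F, grad_f_block, prox_g_block, blocks, x0, alpha,
                  beta0=0.5, up=1.1, down=0.8, beta_max=0.999,
                  n_iter=1000, pick_block=None):
    """Sketch of a block-coordinate APGnc-type loop with adaptive momentum.

    The rule used here (multiply beta by `up` after a successful extrapolation,
    by `down` otherwise, capped at beta_max) is an illustrative assumption.
    """
    x_old = x0.copy()
    x = x0.copy()
    beta = beta0
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        k = pick_block(x) if pick_block is not None else rng.integers(len(blocks))
        blk = blocks[k]
        # extrapolate the selected block with the current momentum
        x_hat = x.copy()
        x_hat[blk] = x[blk] + beta * (x[blk] - x_old[blk])
        # block proximal gradient step
        x_new = x.copy()
        x_new[blk] = prox_g_block(x_hat[blk] - alpha * grad_f_block(x_hat, blk), alpha, k)
        # adaptive momentum: amplify if the extrapolated step helped, shrink otherwise
        if F(x_new) <= F(x_hat):
            beta = min(beta * up, beta_max)
        else:
            beta = beta * down
        x_old, x = x, x_new
    return x

Wiring pick_block to a Gauss-Southwell selector (see the sketch in the next paragraph) gives the deterministic variant, while leaving it unset gives the randomized one.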

Gauss-Southwell rules.

In Nutini et al. (2015), three proximal-gradient Gauss-Southwell rules are presented, namely the GS-s, GS-r and GS-q rules (see Appendix for details). In particular, in §3, we test with the GS-r rule, which can speed up convergence since it selects the block that maximizes the magnitude of the step at each iteration.
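A possible implementation of the GS-r block selection is sketched below; the single shared step size alpha is a simplification (per-block step sizes, as used in §3, are equally natural).

import numpy as np

def gs_r_pick(x, blocks, grad_f_block, prox_g_block, alpha):
    """GS-r rule sketch: pick the block whose prox-gradient step is largest."""
    best_k, best_len = 0, -1.0
    for k, blk in enumerate(blocks):
        step = prox_g_block(x[blk] - alpha * grad_f_block(x, blk), alpha, k) - x[blk]
        step_len = np.linalg.norm(step)
        if step_len > best_len:
            best_k, best_len = k, step_len
    return best_k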

We need certain assumptions on Problem (2) in order to establish the convergence of Algorithm 2.

Assumption

We make several assumptions on the functions $f$ and $g_i$: (i) $F$ is proper and bounded below in $\mathrm{dom}\, F$, $f$ is continuously differentiable, and $g_i$ is proper lower semicontinuous for every $i$; Problem (2) has a critical point $x^*$, i.e., $0 \in \partial F(x^*)$; (ii) let $i_k$ be the block selected at iteration $k$; then $\nabla_{i_k} f$ has a Lipschitz constant $L_k$ with respect to $x_{i_k}$, which is bounded, finite and positive for all $k$; (iii) in Algorithm 2, every block is updated at least once within any $T$ consecutive iterations (see Appendix for definitions and notation).

Theorem 1 (Whole sequence convergence)

Suppose that the Assumption in §2 holds. Let $\{x^k\}$ be generated by Algorithm 2. Assume that (i) $\{x^k\}$ has a finite limit point $\bar{x}$; (ii) $F$ satisfies the KL property (Definition 6, see Appendix) around $\bar{x}$ with the parameters appearing there; (iii) for each block $i$, $\nabla_i f$ is Lipschitz continuous within any bounded set with respect to $x_i$. Then we have $x^k \to \bar{x}$.

Proof

This theorem mainly follows from Theorem 2 of Xu and Yin (2017): its proof places no strict requirement on the momentum, as long as it lies between 0 and 1, nor on the block update rule used at each iteration. The step size can be set to fulfill the assumptions of that theorem, which depend on the choice of the Lipschitz constants of the block gradients of $f$.

3 Applications in High Dimensional Statistics

Many optimization problems in machine learning and statistics can be formulated in the form of (1), for instance, sparse learning (Bach et al., 2012), regression with nonconvex regularizers (Fan and Li, 2001; Zhang, 2010a), the capped $\ell_1$-norm (Zhang, 2010b) and the log-sum penalty (Candès et al., 2008). We consider the general class of regularized least squares or regression problems of the form

$$\min_{x \in \mathbb{R}^n} F(x) := \tfrac{1}{2}\|Ax - b\|_2^2 + g(x), \qquad (3)$$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $g$ is a separable or block-separable regularizer. We test our algorithm on two convex and two nonconvex problems, mainly with sparse instances, since GS rules can be calculated efficiently in such cases. These strategies can be applied to nonconvex problems since the calculation does not depend on convexity (Nutini et al., 2015). All objective functions in this section are KL functions since all of them are semialgebraic (Attouch et al., 2010; Xu and Yin, 2013).
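For reference, the smooth part of (3) and the per-block quantities used by the block coordinate methods can be set up as follows; the helper name and the block representation are our own.

import numpy as np

def make_least_squares(A, b, blocks):
    """Return f, its block gradient, and per-block Lipschitz constants for
    f(x) = 0.5 * ||A x - b||^2 (sketch; `blocks` is a list of index arrays,
    e.g. np.array_split(np.arange(A.shape[1]), s))."""
    def f(x):
        r = A @ x - b
        return 0.5 * r @ r

    def grad_f_block(x, blk):
        # gradient of f restricted to the coordinates in `blk`
        return A[:, blk].T @ (A @ x - b)

    # a standard choice of block Lipschitz constant: the squared spectral
    # norm of the corresponding column sub-matrix
    lips = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]
    return f, grad_f_block, lips

A per-block step size of 1/lips[k] then matches the block-wise step size choice described in the first example below.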

$\ell_1$-regularized Underdetermined Sparse Least Squares.

In this case, $g(x) = \lambda\|x\|_1$, which is a nonsmooth regularizer promoting sparsity. We generate $A$ and $b$ in the same way as in §9 of Nutini et al. (2015). In BPL and BCoAPGnc+, for illustrative purposes, we separate $x$ into blocks of equal size, and the step size at each iteration for the selected block $i_k$ is chosen to be $1/\|A_{i_k}\|_2^2$, where $A_{i_k}$ denotes the columns of $A$ corresponding to the block $i_k$. The remaining parameters for APGnc+ and BCoAPGnc+ are set to the same values.
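The proximal map of the $\ell_1$ regularizer in this example is the usual soft-thresholding operator; a short sketch (with $\lambda$ passed explicitly) is:

import numpy as np

def prox_l1(v, alpha, lam):
    """Soft-thresholding: prox of alpha * lam * ||.||_1 at v (elementwise)."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)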

$\ell_{1,2}$-regularized Sparse Least Squares.

In this case, $g(x) = \lambda \sum_{G \in \mathcal{G}} \|x_G\|_2$, where $\mathcal{G}$ is a partition of $\{1, \dots, n\}$ into groups. We use the same data set and parameters as in the previous example, up to the choice of the group structure and the regularization weight. Only block coordinate methods are used since $g$ is not completely separable.
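The corresponding group (block) soft-thresholding map, applied to one group at a time, can be sketched as:

import numpy as np

def prox_group_l2(v, alpha, lam):
    """Block soft-thresholding: prox of alpha * lam * ||.||_2 applied to one group v."""
    nrm = np.linalg.norm(v)
    if nrm <= alpha * lam:
        return np.zeros_like(v)
    return (1.0 - alpha * lam / nrm) * v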

Capped $\ell_1$-regularized Sparse Least Squares.

We consider the nonconvex capped $\ell_1$ penalty (Zhang, 2010b), $g(x) = \lambda \sum_{i=1}^{n} \min(|x_i|, \theta)$ with threshold $\theta > 0$. In this case we specify the data and the parameters $\lambda$ and $\theta$ analogously to the previous examples.
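Since the capped $\ell_1$ penalty is piecewise, its (separable) proximal map can be computed by comparing the best point on each piece; the following sketch does exactly that and is our own illustration rather than the closed form quoted in the Appendix.

import numpy as np

def prox_capped_l1(v, alpha, lam, theta):
    """Prox of alpha * lam * min(|.|, theta), elementwise (sketch).

    For each coordinate, 0.5*(x - |v|)^2 + alpha*lam*min(x, theta) is minimized
    over x >= 0 by comparing the best point on each piece of the penalty.
    """
    v = np.asarray(v, dtype=float)
    s, a = np.sign(v), np.abs(v)
    # candidate on the linear piece |x| <= theta: soft-thresholding clipped to [0, theta]
    c1 = np.clip(a - alpha * lam, 0.0, theta)
    # candidate on the flat piece |x| >= theta: no shrinkage, clipped to [theta, inf)
    c2 = np.maximum(a, theta)

    def obj(x):
        return 0.5 * (x - a) ** 2 + alpha * lam * np.minimum(x, theta)

    return s * np.where(obj(c1) <= obj(c2), c1, c2)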

Least Squares with SCAD penalty.

We consider another nonconvex penalty term, the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), $g(x) = \sum_{i=1}^{n} p_{\lambda}(x_i)$, where $p_{\lambda}$ is defined in the Appendix (Definition 8). Both $A$ and $b$ are randomly sampled and then standardized such that $b$ and each column of $A$ have zero mean, and each column of $A$ has unit variance. The remaining parameters are set analogously to the previous examples.
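Similarly, a sketch of the SCAD penalty and of its proximal map by candidate comparison over the three pieces is given below; the default a = 3.7 follows the common recommendation of Fan and Li (2001), and the implementation is our own illustration.

import numpy as np

def scad_penalty(x, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied elementwise; a > 2."""
    t = np.abs(x)
    p1 = lam * t
    p2 = (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    p3 = lam ** 2 * (a + 1.0) / 2.0
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))

def prox_scad(v, alpha, lam, a=3.7):
    """Prox of alpha * SCAD, elementwise, via candidate comparison (sketch)."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    s, r = np.sign(v), np.abs(v)

    def obj(x):
        return 0.5 * (x - r) ** 2 + alpha * scad_penalty(x, lam, a)

    # piece |x| <= lam: soft-thresholding, clipped to the interval
    c1 = np.clip(r - alpha * lam, 0.0, lam)
    # piece lam < |x| <= a*lam: stationary point of the quadratic (convex if a - 1 > alpha), clipped
    denom = (a - 1.0) - alpha
    stat = ((a - 1.0) * r - alpha * a * lam) / denom if denom > 0 else np.full_like(r, lam)
    c2 = np.clip(stat, lam, a * lam)
    # piece |x| > a*lam: penalty is constant, so no shrinkage; clipped from below
    c3 = np.maximum(r, a * lam)

    cands = np.stack([c1, c2, c3])
    objs = np.stack([obj(c1), obj(c2), obj(c3)])
    best = cands[np.argmin(objs, axis=0), np.arange(r.size)]
    return s * best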

Theorem 2 (Convergence rate)

Suppose $F$ is of the form (3), where $g$ is chosen as in one of the above four examples. Under the assumptions of Theorem 1, we have $\|x^k - \bar{x}\| \leq C\tau^k$ for all sufficiently large $k$, for certain $C > 0$ and $\tau \in (0, 1)$. Thus, $\{x^k\}$ converges locally linearly to a stationary point of $F$.

Proof

According to Propositions 4.1, 4.2 and 5.2 of Li and Pong (2017), all four above examples of $F$ are KL functions with an exponent of $1/2$ (for the $\ell_{1,2}$-regularized least squares problem, a mild additional assumption is also needed). The desired result then follows immediately from Theorem 3 of Xu and Yin (2017). The convergence rate theorem for a general KL exponent is the same as that Theorem 3, for the same reason as in the proof of Theorem 1.
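To see why a KL exponent of $1/2$ yields a linear rate, the following is a standard one-line argument, sketched under generic sufficient-decrease and relative-error conditions with constants $a, b > 0$, namely $F(x^k) - F(x^{k+1}) \geq a\|x^{k+1} - x^k\|^2$ and $\mathrm{dist}(0, \partial F(x^{k+1})) \leq b\|x^{k+1} - x^k\|$, together with the KL inequality with exponent $1/2$ near $\bar{x}$; the paper's actual proof goes through Theorem 3 of Xu and Yin (2017):

$$F(x^k) - F(x^{k+1}) \;\geq\; a\,\|x^{k+1} - x^k\|^2 \;\geq\; \frac{a}{b^2}\,\mathrm{dist}\big(0, \partial F(x^{k+1})\big)^2 \;\geq\; \frac{a c^2}{b^2}\,\big(F(x^{k+1}) - F(\bar{x})\big),$$

so that

$$F(x^{k+1}) - F(\bar{x}) \;\leq\; \frac{1}{1 + a c^2 / b^2}\,\big(F(x^{k}) - F(\bar{x})\big).$$

Iterating the last inequality gives Q-linear decay of $F(x^k) - F(\bar{x})$, which together with the sufficient decrease yields R-linear convergence of $\{x^k\}$.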

We plot the objective value in each experiment, for the proposed algorithm and some existing APG-like algorithms mentioned in §1. For a fair comparison, since block coordinate methods update only one block per iteration, we count one pass over all $s$ blocks as one iteration in the plots.

Figure 1: Comparison of APG-like algorithms. (a) $\ell_1$ sparse LS; (b) $\ell_{1,2}$ sparse LS; (c) capped $\ell_1$ sparse LS; (d) LS with SCAD.

We observe in Figure 1 that our proposed algorithm BCoAPGnc+ (both the randomized and GS-r versions) provides the greatest initial acceleration, in both the convex and nonconvex examples. It dominates most existing methods, especially during the first 20 iterations. We also see that, in general, BCoAPGnc+ with GS-r updates outperforms BCoAPGnc+ with randomized updates, justifying the use of the GS-r rule for further acceleration. BCoAPGnc+ shows superior performance on the nonconvex problems in Figures 1(c) and 1(d), where its original counterpart BPL (whose momentum is chosen according to that of APG) does not give a monotone decline in objective value. Overall, both versions of our proposed BCoAPGnc+ speed up convergence compared with the current state-of-the-art APGnc+ and BPL.

4 Conclusion and Future Work

In this paper, we propose a new algorithm which combines three main acceleration techniques, namely adaptive momentum, block coordinate update and the GS-r update rule. We also implement adaptive step sizes in our experiments. We show that the algorithm shares the same convergence guarantee and convergence rate as BPL. It is noteworthy that all objective functions of the sparse linear regressions considered here, including the (grouped) Lasso, capped $\ell_1$ and SCAD regularizations, are KL functions with an exponent of $1/2$, and thus enjoy local linear convergence to their stationary points. Experiments show impressive results: our proposed method outperforms the current state-of-the-art.

We focus on experiments with convex losses and (block-)separable regularizers. Other applications with block-separable regularizers but nonconvex losses, such as matrix factorization and completion (e.g., in Xu and Yin (2013)), deserve further treatment by applying the proposed algorithm. Further acceleration could come from the use of variable metrics (Chouzenoux et al., 2016), which make use of specific preconditioning matrices. For more general nonconvex optimization problems, it is of interest to find the KL exponents of the objective functions in order to determine their local convergence rates. Extra theoretical and empirical work in these directions is expected in the future.

Acknowledgments

The authors would like to thank Jian-Feng Cai and Jinshan Zeng for useful discussions and comments.

References

  • Attouch et al. [2010] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
  • Attouch et al. [2013] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
  • Bach et al. [2012] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
  • Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • Bolte et al. [2014] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
  • Candès et al. [2008] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
  • Chouzenoux et al. [2016] E. Chouzenoux, J.-C. Pesquet, and A. Repetti. A block coordinate variable metric forward–backward algorithm. Journal of Global Optimization, 66(3):457–485, 2016.
  • Combettes and Pesquet [2011] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer New York, 2011.
  • Fan and Li [2001] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Gong et al. [2013] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 37–45, 2013.
  • Li and Pong [2017] G. Li and T. K. Pong. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Foundations of Computational Mathematics, 2017.
  • Li and Lin [2015] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems 28, pages 379–387. 2015.
  • Li et al. [2017] Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2111–2119, 2017.
  • Lin et al. [2015] Q. Lin, Z. Lu, and L. Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015.
  • Nutini et al. [2015] J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 1632–1641, 2015.
  • Rockafellar and Wets [1998] R. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer Verlag, 1998.
  • Xu and Yin [2013] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
  • Xu and Yin [2017] Y. Xu and W. Yin. A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing, 72(2):700–734, 2017.
  • Yao et al. [2017] Q. Yao, J. T. Kwok, F. Gao, W. Chen, and T.-Y. Liu. Efficient inexact proximal gradient algorithm for nonconvex problems. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3308–3314, 2017.
  • Zhang [2010a] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010a.
  • Zhang [2010b] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b.

Appendix

In this Appendix, we provide details omitted from the main text due to space constraints.

Input : .
for  do
       , , if  then
            , ,
      else if  then
            , .
      
Algorithm 3 APGnc with adaptive momentum (APGnc+)
Input : $x^{-1} = x^0$, step sizes $\{\alpha_k\}$, extrapolation weights $\{\omega_k\}$.
for $k = 1, 2, \ldots$ do
       Pick $i_k$ in a deterministic or random manner. Set $\hat{x}_{i_k}^{k-1} = x_{i_k}^{k-1} + \omega_k \big(x_{i_k}^{k-1} - x_{i_k}^{k-2}\big)$ and $x_{i_k}^{k} \in \mathbf{prox}_{\alpha_k g_{i_k}}\big(\hat{x}_{i_k}^{k-1} - \alpha_k \nabla_{i_k} f(x_{\neq i_k}^{k-1}, \hat{x}_{i_k}^{k-1})\big)$, keeping $x_{j}^{k} = x_{j}^{k-1}$ for $j \neq i_k$.
Algorithm 4 Block prox-linear (BPL)
Notation

is the shorthand notation for , is the short-hand notation for , means . So means .

Definition 1 (Proximity operator [Combettes and Pesquet, 2011])

Let $\alpha$ be a positive parameter. The proximity operator $\mathbf{prox}_{\alpha g}$ is defined through

$$\mathbf{prox}_{\alpha g}(v) := \operatorname*{argmin}_{x}\ \Big\{ g(x) + \frac{1}{2\alpha}\|x - v\|^2 \Big\}.$$

If $g$ is convex, proper and lower semicontinuous, $\mathbf{prox}_{\alpha g}(v)$ admits a unique solution. If $g$ is nonconvex, then it is generally set-valued.

Definition 2 (Domain)

The domain of $g$ is defined by $\mathrm{dom}\, g := \{x \in \mathbb{R}^n : g(x) < +\infty\}$.

Definition 3 (Subdifferential [Rockafellar and Wets, 1998])
  1. For a given $x \in \mathrm{dom}\, g$, the Fréchet subdifferential of $g$ at $x$, written $\hat{\partial} g(x)$, is the set of all vectors $u$ which satisfy

     $$\liminf_{y \to x,\; y \neq x}\ \frac{g(y) - g(x) - \langle u, y - x \rangle}{\|y - x\|} \geq 0.$$

     When $x \notin \mathrm{dom}\, g$, we set $\hat{\partial} g(x) = \emptyset$.

  2. The limiting-subdifferential, or simply the subdifferential, of $g$ at $x$, written $\partial g(x)$, is defined through the following closure process

     $$\partial g(x) := \big\{ u : \exists\, x^k \to x,\ g(x^k) \to g(x),\ u^k \in \hat{\partial} g(x^k),\ u^k \to u \big\}.$$
Definition 4 (Sublevel set)

Being given real numbers $a$ and $b$, we set

$$[a \leq F \leq b] := \{x \in \mathbb{R}^n : a \leq F(x) \leq b\}.$$

$[a < F < b]$ is defined similarly.

Definition 5 (Distance)

The distance of a point $x$ to a closed set $\mathcal{S} \subseteq \mathbb{R}^n$ is defined as

$$\mathrm{dist}(x, \mathcal{S}) := \inf_{y \in \mathcal{S}} \|y - x\|.$$

If $\mathcal{S} = \emptyset$, we have that $\mathrm{dist}(x, \mathcal{S}) = +\infty$ for all $x$.

Definition 6 (Kurdyka-Łojasiewicz property and KL function [Bolte et al., 2014])
  1. The function $F$ is said to have the Kurdyka-Łojasiewicz property at $\bar{x} \in \mathrm{dom}\, \partial F$ if there exist $\eta \in (0, +\infty]$, a neighbourhood $U$ of $\bar{x}$ and a continuous concave function $\varphi(t) = c\, t^{1-\theta}$ for some $c > 0$ and $\theta \in [0, 1)$ such that for all $x \in U \cap [F(\bar{x}) < F < F(\bar{x}) + \eta]$, the Kurdyka-Łojasiewicz inequality holds:

     $$\varphi'\big(F(x) - F(\bar{x})\big)\, \mathrm{dist}\big(0, \partial F(x)\big) \geq 1.$$

  2. Proper lower semicontinuous functions which satisfy the Kurdyka-Łojasiewicz inequality at each point of $\mathrm{dom}\, \partial F$ are called KL functions.

Definition 7 (KL exponent [Li and Pong, 2017])

For a proper closed function $f$ satisfying the KL property at $\bar{x}$, if the corresponding function $\varphi$ can be chosen as $\varphi(t) = c\, t^{1-\alpha}$ with $c > 0$ and $\alpha \in [0, 1)$ as in Definition 6, the KL inequality can be written as

$$\mathrm{dist}\big(0, \partial f(x)\big) \geq \tilde{c}\, \big(f(x) - f(\bar{x})\big)^{\alpha}$$

for some $\tilde{c} > 0$. We say $f$ has the KL property at $\bar{x}$ with an exponent $\alpha$. If $f$ is a KL function and has the same exponent $\alpha$ at any $\bar{x} \in \mathrm{dom}\, \partial f$, then we say that $f$ is a KL function with an exponent of $\alpha$.

Definition 8 (SCAD penalty [Fan and Li, 2001])

The SCAD penalty is defined, for parameters $\lambda > 0$ and $a > 2$, as

$$p_{\lambda}(x) = \begin{cases} \lambda |x|, & |x| \leq \lambda, \\ \dfrac{2a\lambda|x| - x^2 - \lambda^2}{2(a-1)}, & \lambda < |x| \leq a\lambda, \\ \dfrac{\lambda^2(a+1)}{2}, & |x| > a\lambda. \end{cases}$$

Proposition 1 (Proximal-Gradient Gauss-Southwell rules [Gong et al., 2013])
  1. In coordinate descent methods, the GS-s rule chooses the coordinate with the most negative directional derivative, given by

     $$i_k \in \operatorname*{argmax}_{i}\ \min_{u \in \partial g_i(x_i^k)} \big|\nabla_i f(x^k) + u\big|.$$

     We generalize it to the block coordinate scenario, which has the form

     $$i_k \in \operatorname*{argmax}_{i}\ \min_{u \in \partial g_i(x_i^k)} \big\|\nabla_i f(x^k) + u\big\|.$$

  2. The GS-r rule selects the coordinate which maximizes the length of the step,

     $$i_k \in \operatorname*{argmax}_{i}\ \Big| x_i^k - \mathbf{prox}_{\frac{1}{L} g_i}\big(x_i^k - \tfrac{1}{L} \nabla_i f(x^k)\big) \Big|,$$

     which is generalized to the block coordinate version

     $$i_k \in \operatorname*{argmax}_{i}\ \Big\| x_i^k - \mathbf{prox}_{\alpha_i g_i}\big(x_i^k - \alpha_i \nabla_i f(x^k)\big) \Big\|.$$

  3. The GS-q rule maximizes the progress assuming a quadratic upper bound on $f$,

     $$i_k \in \operatorname*{argmin}_{i}\ \min_{d}\ \Big\{ f(x^k) + \nabla_i f(x^k)\, d + \tfrac{L}{2} d^2 + g_i(x_i^k + d) - g_i(x_i^k) \Big\}.$$

     We do not apply this rule to the block coordinate scenario.

Proposition 2 (Proximity operators of regularizers in §3)

Let $\alpha, \lambda > 0$. Then

  1. for $g = \lambda\|\cdot\|_1$ [Bach et al., 2012],

     $$\big[\mathbf{prox}_{\alpha\lambda\|\cdot\|_1}(v)\big]_i = \mathrm{sign}(v_i)\, \max\big(|v_i| - \alpha\lambda,\ 0\big),$$

     and the above proximity operator is often referred to as soft-thresholding.

  2. for the $\ell_2$-norm $g = \lambda\|\cdot\|_2$ [Bach et al., 2012],

     $$\mathbf{prox}_{\alpha\lambda\|\cdot\|_2}(v) = \Big(1 - \frac{\alpha\lambda}{\max\big(\|v\|_2,\ \alpha\lambda\big)}\Big)\, v.$$

     Further, if $\mathcal{G}$ is a partition of $\{1, \dots, n\}$, for the $\ell_{1,2}$-norm $g(x) = \lambda \sum_{G \in \mathcal{G}} \|x_G\|_2$ we have

     $$\big[\mathbf{prox}_{\alpha g}(v)\big]_G = \Big(1 - \frac{\alpha\lambda}{\max\big(\|v_G\|_2,\ \alpha\lambda\big)}\Big)\, v_G, \qquad G \in \mathcal{G},$$

     which is often referred to as group soft-thresholding. The problem being solved is called the group Lasso when this regularizer is used with a least squares loss in Equation (3).

  3. for the SCAD penalty $p_{\lambda}$ defined in Definition 8 [Gong et al., 2013], the proximity operator admits a closed form: the one-dimensional objective is minimized separately over each of the three intervals of the piecewise penalty and the candidate with the smallest objective value is selected; see Gong et al. (2013) for the explicit expressions.

  4. for the capped $\ell_1$ penalty $g(x) = \lambda \sum_i \min(|x_i|, \theta)$ [Gong et al., 2013], the proximity operator is likewise obtained coordinatewise by comparing the minimizer over $\{|x_i| \leq \theta\}$ (soft-thresholding clipped at $\theta$) with the minimizer over $\{|x_i| \geq \theta\}$ (no shrinkage) and taking whichever attains the smaller objective value.
