Relax, and Accelerate: A Continuous Perspective on ADMM


Guilherme França (guifranca@jhu.edu), Daniel Robinson (daniel.p.robinson@jhu.edu), René Vidal (rvidal@jhu.edu)
Mathematical Institute for Data Science, Johns Hopkins University, Baltimore MD 21218, USA

Abstract

The acceleration technique first introduced by Nesterov for gradient descent is widely used in many machine learning applications; however, it is not yet well understood. Recently, significant progress has been made to close this understanding gap by using a continuous-time dynamical system perspective associated with gradient-based methods. In this paper, we extend this perspective by considering the continuous limit of the family of relaxed Alternating Direction Method of Multipliers (ADMM). We also introduce two new families of relaxed and accelerated ADMM algorithms, one following Nesterov’s acceleration approach and the other inspired by Polyak’s Heavy Ball method, and derive the continuous limit of these families of relaxed and accelerated algorithms as differential equations. Then, using a Lyapunov stability analysis of the dynamical systems, we obtain rate-of-convergence results for convex and strongly convex objective functions.

I Introduction

Accelerated gradient-based methods

A popular method to accelerate the convergence of gradient descent was proposed by Nesterov in the seminal paper Nesterov:1983. In the convex case, accelerated gradient descent attains a convergence rate of $O(1/k^2)$ in terms of the error in the objective function value, with $k$ denoting the iteration time, which is known to be optimal in the sense of worst-case complexity Nesterov:2004. Another accelerated variant of gradient descent, called the Heavy Ball method, was introduced by Polyak Polyak:1964; it is known to have a convergence rate of $O(1/k)$ for convex functions and linear convergence for strongly convex functions Ghadimi:2014; Polyak:2017. Nonetheless, momentum as a mechanism for accelerating optimization methods is still not well understood.

Recently, there has been progress in understanding acceleration by using a differential equation to model the continuous limit of Nesterov’s method Candes:2016. Additional follow-up work has brought an even larger class of accelerated methods into a Hamiltonian formalism Wibisono:2016, thus opening opportunities for analysis through the lens of continuous dynamical systems. For example, analyses based on Lyapunov’s theory were explored for both continuous and discrete settings Krichene:2015; Wilson:2016. However, such connections have thus far been limited to gradient descent methods for unconstrained optimization.

Perhaps the simplest example illustrating the interplay between discrete and continuous approaches is the gradient descent method. In this case, one can make a simple correspondence between the following discrete update and a continuous dynamical system:

$$x_{k+1} - x_k = -\epsilon \nabla f(x_k) \quad\longrightarrow\quad \dot{X} = -\nabla f(X), \tag{1}$$

where $\epsilon > 0$ is the stepsize, $X = X(t)$ is a continuous function of time such that $x_k = X(t)$ with $t = k\epsilon$, and $\dot{X} \equiv dX/dt$. Interestingly, the differential equation in (1) was used to solve an optimization problem by Cauchy Cauchy:1847 before gradient descent was invented. It is not hard to show that the differential equation (1) shares the convergence rate with gradient descent. Analogously, for the Heavy Ball method Polyak:1964, one has the correspondence

(2)

for some constants and , and where and . For Nesterov’s accelerated gradient descent method Nesterov:1983, only recently has its continuous limit been obtained as Candes:2016

$$\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0. \tag{3}$$

The differential equation in (3) has a convergence rate of $O(1/t^2)$ for convex functions $f$, which matches the optimal rate of its discrete counterpart.
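The correspondence between gradient descent and its continuous limit can be checked numerically. The following is a minimal sketch, assuming the update $x_{k+1} = x_k - \epsilon \nabla f(x_k)$ and the gradient flow $\dot{X} = -\nabla f(X)$; the quadratic test function, the constant `lam`, and all function names are hypothetical choices made only for illustration. The gap between $x_k$ and $X(k\epsilon)$ should shrink as the stepsize decreases.

```python
import math

def gradient_descent(grad, x0, eps, steps):
    """Iterate x_{k+1} = x_k - eps * grad(x_k) and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eps * grad(xs[-1]))
    return xs

# Hypothetical test problem: f(x) = 0.5 * lam * x**2, so grad f(x) = lam * x.
lam = 2.0

def grad(x):
    return lam * x

def flow(t, x0=1.0):
    """Exact solution of the gradient flow dX/dt = -lam * X with X(0) = x0."""
    return x0 * math.exp(-lam * t)

def max_gap(eps, t_final=1.0):
    """Largest distance between x_k and X(k * eps) for k * eps <= t_final."""
    steps = round(t_final / eps)
    xs = gradient_descent(grad, 1.0, eps, steps)
    return max(abs(x - flow(k * eps)) for k, x in enumerate(xs))

# The gap decreases roughly in proportion to the stepsize.
gaps = [max_gap(eps) for eps in (0.1, 0.01, 0.001)]
print(gaps)
```

For this explicit Euler discretization the gap is of first order in $\epsilon$, which is what makes the identification $x_k = X(k\epsilon)$ meaningful in the limit of small stepsize.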

Accelerated ADMM

A separate important algorithm is the Alternating Direction Method of Multipliers (ADMM) Gabay:1976; Glowinsky:1975; Boyd:2011. ADMM is well-known for its ease of implementation, scalability, and applicability in many important areas of machine learning and statistics. In the convex case, it converges at a rate of $O(1/k)$ He:2012; Eckstein:2015, while in the strongly convex case it converges linearly Deng:2016. Many variants of ADMM exist. A popular example incorporates a relaxation strategy, which is known empirically to improve convergence Eckstein:1994; Eckstein:1998. However, compared to standard ADMM, few theoretical results for relaxed ADMM are available; for strongly convex functions, relaxed ADMM has been shown to be linearly convergent Damek:2017; Nishihara:2015; Giselsson:2014; FrancaBento:2016.

An accelerated version of ADMM (here called A-ADMM) was proposed in Goldstein:2014. For composite objective functions of the form $f(x) + g(z)$ with $f$ and $g$ strongly convex and $g$ quadratic, it was shown that A-ADMM attains a convergence rate of $O(1/k^2)$ Goldstein:2014. To the best of our knowledge, no other convergence rates are known. Numerical experiments Goldstein:2014 show that A-ADMM outperforms other accelerated methods, such as Nesterov acceleration. Recently, a differential equation modeling A-ADMM was proposed Franca:2018 that generalizes previous results such as (3) and shows that the continuous dynamical system has an $O(1/t^2)$ convergence rate under merely a convexity assumption on the objective. In this paper, we analyze more general frameworks and provide analyses that go beyond these results in several aspects that we describe next.

Paper contributions

We propose new variants of ADMM for solving the following problem (see footnote 1):

(4)

where , with , , and . We first consider the known family of relaxed ADMM algorithms. Then, we introduce two accelerated variants of the relaxed ADMM scheme: one follows Nesterov’s approach, which we refer to as relaxed and accelerated ADMM (relaxed A-ADMM), and the other is closer to Polyak’s Heavy Ball method, which we call relaxed Heavy Ball ADMM (see Algorithms 2 and 3). To the best of our knowledge, this is the first time that acceleration and relaxation are considered jointly in an optimization framework.

Footnote 1: Our results can be extended to the formulation provided is invertible, which is not an uncommon assumption Nishihara:2015; FrancaBento:2016; Eckstein:2015. Since , one can easily redefine to cast the problem into a similar form as (4).

Having introduced these new families of algorithms, we turn our attention to deriving their continuous limit as differential equations that model their behavior. We then obtain rates of convergence for the dynamical systems by constructing appropriate Lyapunov functions in both the convex and strongly convex cases. Our results are more general than those in Franca:2018, with their results being one particular instance of our framework. For a summary of the results obtained in this paper, we refer the reader to Table 1. We can see that by incorporating relaxation into A-ADMM (the relaxation parameter is ) an improved constant in the complexity bound is obtained. Also, the proposed relaxed Heavy Ball ADMM recovers linear convergence in the strongly convex case, which contrasts with (relaxed) A-ADMM.

Convex Strongly Convex
ADMM
A-ADMM
Relaxed ADMM
Relaxed A-ADMM
Relaxed Heavy Ball ADMM
Table 1: Convergence rates of the dynamical systems related to relaxed and accelerated variants of ADMM proposed in this paper. The relaxation parameter is , is the strong convexity constant, and is a “damping” constant; see Definition 3, Algorithm 2 and Algorithm 3. Algorithms marked with are known, but with previously unknown convergence rates under at least one of the columns (convex versus strongly convex). Those marked with indicate a new family of algorithms.

Preliminaries and notation

Given , we let denote the Euclidean norm and denote the inner product. Given a matrix , the corresponding induced matrix norm is denoted by , where the respective largest and smallest singular values are and . The condition number of is denoted as . For future reference, we formalize our general assumptions as follows.

Assumption 1.

The functions and in (4) are continuously differentiable and the matrix has full column rank. Moreover, the function is Lipschitz continuous.

We stress that differentiability of and is a natural assumption when drawing connections to differential equations. One may be able to relax these conditions by using subdifferential calculus and differential inclusions, but this is beyond the scope of this work. Moreover, Lipschitz continuity of is to ensure the existence and uniqueness of global solutions to the differential equations.

Definition 2 (Convex function).

We say that is convex if and only if

(5)
Definition 3 (Strongly convex function).

We say that is strongly convex if and only if there exists a constant such that

(6)

Let , where denotes the continuous-time variable. The corresponding state of an algorithm at discrete-time will be denoted by . We assume that , at instant , for some small enough . One simple, but important, relation when taking the continuous limit is . Also, using the Mean Value Theorem and Taylor’s Theorem on the components of and , one may show that (see Appendix A)

(7)
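The limits behind relations of this kind can be checked numerically. The sketch below assumes that (7) refers to the usual difference-quotient limits, namely that the first difference divided by $\epsilon$ tends to $\dot{X}(t)$ and the centered second difference divided by $\epsilon^2$ tends to $\ddot{X}(t)$; the test trajectory $X(t) = \sin t$ and the function name are hypothetical.

```python
import math

def difference_quotients(X, t, eps):
    """Forward first difference and centered second difference
    built from the samples X(t - eps), X(t), X(t + eps)."""
    x_prev, x_cur, x_next = X(t - eps), X(t), X(t + eps)
    first = (x_next - x_cur) / eps
    second = (x_next - 2.0 * x_cur + x_prev) / eps**2
    return first, second

# Hypothetical smooth trajectory X(t) = sin(t): X' = cos(t), X'' = -sin(t).
t = 0.7
errors = []
for eps in (1e-1, 1e-2, 1e-3):
    first, second = difference_quotients(math.sin, t, eps)
    errors.append((eps, abs(first - math.cos(t)), abs(second + math.sin(t))))

for eps, err_first, err_second in errors:
    print(eps, err_first, err_second)
```

Both errors shrink as $\epsilon$ decreases, the first linearly and the second quadratically, in line with the Mean Value Theorem and Taylor arguments used in Appendix A.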

II Variants of ADMM as Continuous Dynamical Systems

In this section, we first consider a family of relaxed ADMM algorithms, and then introduce two new accelerated variants. Moreover, we present differential equations modelling their behavior in the continuous-time limit. The family of ADMM algorithms is developed for problem (4), after adding an auxiliary variable and considering the (scaled) augmented Lagrangian , where is the Lagrange multiplier vector and is the penalty parameter.

II.1 Relaxed ADMM

Let us start with a relaxed ADMM framework Boyd:2011 that is presented in Algorithm 1. The relaxation parameter is introduced to speed up convergence. We obtain the continuous limit of these updates in Theorem 4. The derivation of such a differential equation is considerably easier compared to its accelerated variant (see Theorem 5), and is therefore relegated to the Appendix.


0:  functions and , matrix , and parameters and
1:  initialize , ,
2:  for  do
3:     
4:     
5:     
6:  end for
Algorithm 1 Family of relaxed ADMM algorithms for problem (4). The penalty parameter is and the relaxation parameter is . Standard ADMM is recovered with .
Theorem 4.

Consider the relaxed ADMM framework in Algorithm 1 with . Let Assumption 1 hold, and furthermore assume that and are convex. Then, the continuous limit for the updates in Algorithm 1, with , is given by the initial value problem

(8)

with .

Note that (8) reduces to the differential equation (1) in the special case and .
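To make the relaxed updates concrete, here is a minimal scalar sketch. It assumes the standard over-relaxed ADMM updates in scaled form, as presented in Boyd:2011, applied to an objective of the form $f(x) + g(ax)$ with hypothetical quadratics $f(x) = \tfrac{1}{2}px^2 - qx$ and $g(z) = \tfrac{1}{2}cz^2$; the symbols `rho` (penalty) and `alpha` (relaxation), the function name, and the test values are assumptions of this sketch, chosen so that both proximal steps have closed forms.

```python
def relaxed_admm(p, q, c, a, rho=1.0, alpha=1.5, iters=200):
    """Over-relaxed ADMM for min_x f(x) + g(a*x) with
    f(x) = 0.5*p*x**2 - q*x and g(z) = 0.5*c*z**2 (scalars, p, c > 0)."""
    x, z, u = 0.0, 0.0, 0.0
    for _ in range(iters):
        # x-update: argmin_x f(x) + (rho/2) * (a*x - z + u)**2
        x = (q + rho * a * (z - u)) / (p + rho * a * a)
        # relaxation step: blend a*x with the previous z
        h = alpha * a * x + (1.0 - alpha) * z
        # z-update: argmin_z g(z) + (rho/2) * (h - z + u)**2
        z_new = rho * (h + u) / (c + rho)
        # scaled dual update
        u = u + h - z_new
        z = z_new
    return x

p, q, c, a = 1.0, 1.0, 3.0, 2.0
x_star = q / (p + c * a * a)  # stationary point of f(x) + g(a*x)
for alpha in (1.0, 1.5):
    print(alpha, abs(relaxed_admm(p, q, c, a, alpha=alpha) - x_star))
```

Setting `alpha=1.0` gives standard ADMM in this sketch, so the two printed errors compare the non-relaxed and over-relaxed iterations on the same problem.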

II.2 Relaxed and Accelerated ADMM

Motivated by Nesterov’s approach Nesterov:1983, we introduce the new variables and to obtain an accelerated version of Algorithm 1. The resulting family of algorithms is shown in Algorithm 2.

0:  functions and , matrix , and parameters , , and
1:  initialize , , , ,
2:  for  do
3:     
4:     
5:     
6:     
7:     
8:     
9:  end for
Algorithm 2 Family of relaxed A-ADMM algorithms for problem (4). The penalty parameter is and the relaxation parameter is . The damping constant is .

Algorithm 2 has not been considered in the literature, and is a relaxation of A-ADMM Goldstein:2014 (see footnote 2) that may be recovered by setting and . It is also worth noting that even for relaxed ADMM (without acceleration) the existing theoretical results are sparse compared to standard ADMM. Our next result derives the continuous limit of Algorithm 2.

Footnote 2: Strictly speaking, Goldstein:2014 uses the parametrization with the recursion and . However, asymptotically this is the same as with .

Theorem 5.

Consider the relaxed A-ADMM framework in Algorithm 2 with . Let Assumption 1 hold, and furthermore assume that and are convex. Then, the continuous limit for the updates in Algorithm 2, with , is given by the initial value problem

(9)

with and .

Proof.

Since and are convex, and has full column rank, the optimization problems associated with the proximal operators in steps 3 and 4 of Algorithm 2 are strongly convex, so that they have unique solutions. From the optimality conditions we thus have

(10a)
(10b)
(10c)
(10d)
(10e)

The first two equations in (10) can be further combined into

(11)

We now consider each term separately. From (10e) we have By adding to the right hand side and then reorganizing, we obtain

(12)

Let at . Furthermore, let us choose . According to (7), times the first term on the right hand side of (12) gives as , while times the second term on the right hand side of (12) satisfies

(13)

Therefore, we have shown that

(14)

We now focus on the third term of (11). By adding and then reorganizing, we obtain

(15)

Considering (10d) and noting that , it now follows that

(16)

Taking the limit , the above equation implies that . Combining this with (10c), we conclude that , which in turn implies that and . Therefore, equation (15) gives

(17)

Combining these results with (11), in the limit , we have

(18)

Using the definition of in (4), noting that , and since has full column rank by assumption, we finally obtain (9).

For the initial conditions, one can choose , where is the initial estimate of a solution to (4). Next, using the Mean Value Theorem, we have for some and all . Combining this with (9) yields

(19)

for all . Letting , which also forces , we have . Since , it follows that , which completes the proof. ∎

Note that (9) reduces to the differential equation (3) in the particular case , and . It will be shown that the generality in (9) allows for refined convergence results; see Section III.
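The accelerated scheme can be sketched on a scalar problem as well. The code below is an illustration only: it assumes that the extrapolated ("hatted") variables are formed with the weight $\gamma_k = k/(k+r)$ and that they replace $z_k$ and $u_k$ in the standard over-relaxed ADMM updates. The quadratic test problem, the function name, and the symbols `rho`, `alpha`, and `r` are hypothetical.

```python
def relaxed_a_admm(p, q, c, a, rho=1.0, alpha=1.5, r=3.0, iters=200):
    """Sketch of relaxed ADMM with Nesterov-type extrapolation for
    min_x 0.5*p*x**2 - q*x + 0.5*c*(a*x)**2 (scalars, p, c > 0).
    The weight gamma_k = k / (k + r) is an assumption of this sketch."""
    x = 0.0
    z = u = 0.0            # current iterates
    z_hat = u_hat = 0.0    # extrapolated iterates
    for k in range(1, iters + 1):
        # proximal steps use the extrapolated variables
        x = (q + rho * a * (z_hat - u_hat)) / (p + rho * a * a)
        h = alpha * a * x + (1.0 - alpha) * z_hat
        z_new = rho * (h + u_hat) / (c + rho)
        u_new = u_hat + h - z_new
        # momentum step with vanishing damping r / (k + r)
        gamma = k / (k + r)
        z_hat = z_new + gamma * (z_new - z)
        u_hat = u_new + gamma * (u_new - u)
        z, u = z_new, u_new
    return x

p, q, c, a = 1.0, 1.0, 3.0, 2.0
x_star = q / (p + c * a * a)  # stationary point of the objective
print(abs(relaxed_a_admm(p, q, c, a) - x_star))
```

On this well-conditioned toy problem the momentum step is not needed for fast convergence; the sketch is meant only to show where the extrapolated variables enter the iteration.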

II.3 Relaxed Heavy Ball ADMM

Another acceleration scheme for gradient descent is the Heavy Ball Method introduced by Polyak Polyak:1964. Motivated by his work, we now introduce another accelerated variant of relaxed ADMM that we call relaxed Heavy Ball ADMM, which is presented in Algorithm 3. Note that the key difference when compared with Algorithm 2 is the choice , which is now a constant that depends on the penalty parameter . This choice is inspired by the continuous limit but is otherwise not obvious. The proof of the next result is similar to the proof of Theorem 5 and is deferred to the Appendix.

0:  functions and , matrix , and parameters , and
1:  initialize , , , ,
2:  
3:  for  do
4:     
5:     
6:     
7:     
8:     
9:  end for
Algorithm 3 Family of Relaxed Heavy Ball ADMM algorithms for problem (4). The penalty parameter is and the relaxation parameter is . The damping constant is .
Theorem 6.

Consider the relaxed Heavy Ball ADMM framework in Algorithm 3 with . Let Assumption 1 hold, and furthermore assume that and are convex. Then, the continuous limit for the updates in Algorithm 3, with , is given by the following initial value problem:

(20)

with and .

The differential equation (20) is closely related to (9). The key difference is the damping strategy, which gives the two dynamical systems different stability properties; we believe this is reflected in the different empirical behavior observed between Algorithm 2 and Algorithm 3 (see also Table 1).

III Convergence Rates of the Dynamical Systems

We now provide the convergence rates of the dynamical systems derived in Section II when is either convex or strongly convex. The overall strategy for second-order systems consists of constructing a Lyapunov function such that and over trajectories of the system (see footnote 3). If one is able to find such an , then the convergence rate usually follows from . In what follows, we state the results and, for reference, write down the corresponding Lyapunov functions. The actual proofs are deferred to the Appendix.

Footnote 3: Strictly speaking, for non-autonomous systems, has to dominate a positive-definite function in order to be called a Lyapunov function LaSalle. However, this is only important when considering the stability of the dynamical system. To obtain convergence rates, and are sufficient, which explains our abuse of terminology.
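As a worked illustration of this strategy in the simplest setting, consider the gradient flow $\dot{X} = -\nabla f(X)$ with $f$ convex and differentiable and $x^\star$ a minimizer of $f$. The function $\mathcal{E}$ below is the standard choice for this flow and is an assumption of this sketch; it is not claimed to coincide with the Lyapunov functions used in the theorems of this section.

```latex
% Assumed setting: \dot{X} = -\nabla f(X), f convex, x^\star a minimizer of f.
\begin{align*}
\mathcal{E}(t) &= t\,\bigl(f(X) - f(x^\star)\bigr) + \tfrac{1}{2}\|X - x^\star\|^2, \\
\dot{\mathcal{E}}(t) &= f(X) - f(x^\star)
   + t\,\langle \nabla f(X), \dot{X}\rangle
   + \langle X - x^\star, \dot{X}\rangle \\
 &= \bigl[f(X) - f(x^\star) - \langle \nabla f(X), X - x^\star\rangle\bigr]
   - t\,\|\nabla f(X)\|^2 \;\le\; 0, \\
f(X(t)) - f(x^\star) &\le \frac{\mathcal{E}(t)}{t} \le \frac{\mathcal{E}(0)}{t}
   = \frac{\|X(0) - x^\star\|^2}{2t}.
\end{align*}
```

The bracketed term is nonpositive by convexity of $f$, so $\mathcal{E}$ is nonincreasing along trajectories, and dropping the nonnegative quadratic term in $\mathcal{E}(t)$ yields an $O(1/t)$ bound on the objective error.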

III.1 Convergence of Relaxed ADMM

We first consider the dynamical system (8) associated with relaxed ADMM and described in Algorithm 1. The proof of the following theorem uses the Lyapunov function

(21)

for the convex case, where is a minimizer of , and the Lyapunov function

(22)

for the strongly convex case. We also use the shorthand notation in several parts.

Theorem 7.

Consider the dynamical system (8) under Assumption 1. Let be a minimizer of , and a trajectory of the system with initial condition . The following then holds:

  • If is convex, then

    (23)
  • If is -strongly convex, then with it follows that

    (24)

The following remarks concerning Theorem 7 are appropriate:

  • The rate (23) matches the rate of non-relaxed ADMM in the convex case Eckstein:2015; He:2012. We believe that the analogous result for relaxed ADMM presented in (23) is new.

  • The exponential rate in (24) is consistent with the linear convergence of relaxed ADMM algorithms in the strongly convex case Nishihara:2015; FrancaBento:2016.

  • It is interesting, although not surprising, that the relaxation parameter appears in the convergence rates. Moreover, it suggests that improved performance may be obtained by using over-relaxation, i.e., choosing . This is especially prominent in the strongly convex case since it appears inside the exponential in (24).

  • It may seem desirable to choose . However, one must be careful to avoid divergence in (21). In the extreme case, i.e., when , there is no dynamics according to (8). These observations are consistent with the empirical guideline that as suggested in Eckstein:1994; Eckstein:1998. The choice should be avoided because the system (8) then follows the gradient ascent direction; this is consistent with existing results for the discrete case FrancaBento:2016.

III.2 Convergence of Relaxed and Accelerated Variants of ADMM

Consider the dynamical system (9) associated with the relaxed A-ADMM in Algorithm 2. To obtain a convergence rate in the convex case, inspired by Candes:2016; Wibisono:2016, we use the Lyapunov function

(25)

while for the strongly convex case we use

(26)

with . This choice is motivated by Candes:2016; Attouch:2016, which considered the differential equation related to Nesterov’s method. We now have our main result for the dynamical system (9).

Theorem 8.

Consider the dynamical system (9) under Assumption 1. Let be a minimizer of , and a trajectory of the system with and . The following then holds:

  • If is convex and , then

    (27)
  • If is strongly convex, then there exists , independent of parameters, such that

    (28)

    where ; see (78).

III.3 Convergence of Relaxed Heavy Ball Variants of ADMM

We now turn to the dynamical system (20) associated with the relaxed Heavy Ball ADMM method in Algorithm 3. For the convex case, we define the Lyapunov function

(29)

while for the strongly convex case we use

(30)

to prove the following convergence rates associated to the dynamical system (20).

Theorem 9.

Consider the dynamical system (20) under Assumption 1. Let be a minimizer of and be a trajectory of the system with initial conditions and . Then, there exists a constant , that is independent of parameters, such that the following hold:

  • If is convex, then

    (31)
  • If is -strongly convex, then with we have

    (32)

The following remarks for Theorem 8 and Theorem 9 are appropriate.

  • Convergence of the objective in the strongly convex case (see (28) and (32)) automatically implies convergence of the state variables since .

  • Over-relaxation, i.e., choosing , seems to improve convergence in some cases more than others. For instance, in the convex cases (27) and (31) the improvement is linear in , while in the strongly convex cases  (28) and (32) the parameter appears raised to a power and inside an exponential, respectively.

  • Although the term in (28) seems to indicate faster convergence for larger values of , one must remember that the constant grows as .

  • Although the relaxed Heavy Ball ADMM method recovers the exponential convergence rate in (32), compared to (28) for relaxed A-ADMM, must be restricted below a threshold since it is a constant in (20). This is in contrast to (9) where vanishes asymptotically.

  • Theorem 9 applies to the Heavy Ball method (2) as a particular case, for which an rate was obtained only in a Cesàro average sense Ghadimi:2014, in contrast to (31). Linear convergence of the Heavy Ball method is known Ghadimi:2014; Polyak:2017, but not in the same form as (32).

IV A Numerical Example

Figure 1: Comparison between variants of ADMM (Algorithms 1, 2 and 3; R stands for relaxed and HB for Heavy Ball) and associated dynamical systems (respective flow equations (8), (9) and (20)). We choose a quadratic problem described in the text following (33). The parameters are and where the shaded areas are obtained with the discrete algorithms. The upper and lower boundaries are given by and , respectively. Solid lines correspond to the discrete algorithms and dashed lines to the continuous dynamical systems, both with the same and . The numerical integrations of the flow equations are described in the text after (35); see also Appendix B for details. (a) Convex case with the choice (34). (b) Strongly convex case with the choice (35). Note the different behaviour between the two cases, as predicted in Theorems 8 and 9.

Here we provide a simple numerical experiment illustrating our theoretical results. Consider the quadratic problem

(33)

where . The matrix is obtained as follows. Let be a random matrix with singular value decomposition . Then, . Thus, is full column rank with condition number . The matrix is a random symmetric matrix chosen to be (a) positive semi-definite for the convex case and (b) positive definite for the strongly convex case, as follows. First, let be a random matrix drawn from the compact orthogonal group (with Haar measure). Second, define where with the following choice:

(a) convex: (34)
(b) strongly convex: (35)

With these two setups we solve problem (33) using the variants of ADMM and numerical integrations of the corresponding differential equations. For the 1st-order differential equation (8) we use a standard 4th-order Runge-Kutta method. The integration of the 2nd-order differential equations (9) and (20) is more challenging due to strong oscillations of the trajectories. We thus write these dynamical systems in Hamiltonian form and use a symplectic integrator, which is a numerical scheme designed to preserve properties of the continuous dynamical system in discrete time. We refer the reader to Appendix B for more details. We use the simplest of such methods, the symplectic Euler method given by updates (127).
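The idea of a symplectic-type Euler step for a damped second-order system can be sketched as follows. This is a generic semi-implicit Euler scheme for $\dot{x} = p$, $\dot{p} = -d(t)\,p - \nabla f(x)$, written under our own assumptions; it is not claimed to reproduce updates (127), and the function name, the damping callback, and the scalar test problem are hypothetical.

```python
def symplectic_euler(grad, x0, p0, h, steps, damping, t0=0.0):
    """Semi-implicit (symplectic-type) Euler for the first-order system
        dx/dt = p,   dp/dt = -damping(t) * p - grad(x).
    The momentum is updated first; the new momentum then moves x."""
    x, p, t = x0, p0, t0
    traj = [x]
    for _ in range(steps):
        p = p - h * (damping(t) * p + grad(x))
        x = x + h * p
        t += h
        traj.append(x)
    return traj

# Hypothetical test: f(x) = 0.5 * x**2 with constant damping, as in a
# Heavy Ball type equation; the trajectory oscillates while decaying to 0.
traj = symplectic_euler(lambda x: x, x0=1.0, p0=0.0, h=0.01,
                        steps=5000, damping=lambda t: 1.0)
print(abs(traj[-1]))
```

For a time-dependent damping of the form $r/t$, one would pass `damping=lambda t: r / t` together with a starting time `t0 > 0` to avoid the singularity at the origin.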

The results for both cases, convex and strongly convex, are shown in Figure 1 where we plot the objective function error (in scale) versus the iteration time. For the sake of visualization we vary (shaded area) only for the discrete algorithms and compare the results of the algorithms (solid lines) with the differential equations (dashed lines) only with . Notice that the curves for the algorithms and associated differential equations are close. Note also the different convergence behaviours between the convex and strongly convex cases, as predicted in Theorem 8 and Theorem 9. In particular, relaxed A-ADMM (and dynamical system (9)) converges faster than relaxed Heavy Ball ADMM (and dynamical system (20)) in the convex case. The behaviour is reversed in the strongly convex case, since relaxed Heavy Ball ADMM has linear convergence while relaxed A-ADMM does not. Also, the convergence rates with are improved, as predicted in our previous theorems. The improvement is more prominent in the strongly convex case, as seen by comparing the wider shaded areas of Figure 1b with those of Figure 1a.

V Conclusion

We introduced two new families of relaxed and accelerated ADMM algorithms. The first follows Nesterov’s acceleration approach (see Algorithm 2), while the second was inspired by Polyak’s Heavy Ball method (see Algorithm 3). Moreover, we presented a new perspective for understanding these variants of ADMM by deriving differential equations that model them in a continuous-time limit (see Theorems 5 and 6). Such an approach allowed for a simple complexity analysis built upon Lyapunov stability that led to rate-of-convergence results for convex and strongly convex objective functions for the associated continuous dynamical systems (see Theorems 8 and 9 whose proofs are in Appendix A). Most of the complexity results in this paper (see Table 1) are new to the best of our knowledge. A numerical verification of these convergence rates comparing variants of ADMM algorithms with the corresponding continuous dynamical systems, through a Hamiltonian symplectic integrator, was provided in Figure 1. Although these results were derived in the continuous-time limit, they suggest that the same rates hold for the corresponding discrete algorithms, for which the proofs are more difficult and currently unknown. Hopefully, our approach can provide valuable insight into tackling the discrete case.

Acknowledgements.
This work was supported by grants ARO MURI W911NF-17-1-0304 and NSF 1447822.

Appendix A Proofs of the Main Results

In this section we provide the proofs of the theorems stated in the main part of this paper.

Proof of relations (7).

Let be a twice continuously differentiable function of time . We can obtain a discrete sample from by computing its values at intervals of time. Therefore, define where and . Denote by and the th component of these vectors, where . From the Mean Value Theorem we have

(36)

for some . Hence, as . Since this holds for each component , we obtain

(37)

Analogously, we also have that

(38)

To consider second derivatives we use Taylor’s theorem

(39)

for some . Thus,

(40)

for some . Thus, as . Since this holds for each component , we conclude that

(41)

This concludes the proof of relations (7). ∎

Proof of Theorem 4.

Since and are convex, and has full column rank, the optimization problems in the proximal operators of Algorithm 1 are strongly convex, so that is unique. It follows from the optimality conditions that

(42a)
(42b)
(42c)

The equations (42a) and (42b) can be combined to obtain

(43)

Let where . Choosing , from (43) we have