Relax, and Accelerate:
A Continuous Perspective on ADMM
The acceleration technique first introduced by Nesterov for gradient descent is widely used in many machine learning applications; however, it is not yet well understood. Recently, significant progress has been made to close this understanding gap by using a continuous-time dynamical system perspective associated with gradient-based methods. In this paper, we extend this perspective by considering the continuous limit of the family of relaxed Alternating Direction Method of Multipliers (ADMM). We also introduce two new families of relaxed and accelerated ADMM algorithms, one that follows Nesterov’s acceleration approach and another that is inspired by Polyak’s Heavy Ball method, and derive the continuous limit of these families of relaxed and accelerated algorithms as differential equations. Then, using a Lyapunov stability analysis of the dynamical systems, we obtain rate-of-convergence results for convex and strongly convex objective functions.
Accelerated gradient-based methods
A popular method to accelerate the convergence of gradient descent was proposed by Nesterov in the seminal paper Nesterov:1983. In the convex case, accelerated gradient descent attains a convergence rate of $O(1/k^2)$ in terms of the error in the objective function value, with $k$ denoting the iteration time, which is known to be optimal in the sense of worst-case complexity Nesterov:2004. Another accelerated variant of gradient descent was introduced by Polyak Polyak:1964, called the Heavy Ball method, which is known to have a convergence rate of $O(1/k)$ for convex functions, and linear convergence for strongly convex functions Ghadimi:2014; Polyak:2017. Nonetheless, momentum as a mechanism for accelerating optimization methods is still not considered well understood.
Recently, there has been progress in understanding acceleration by using a differential equation to model the continuous limit of Nesterov’s method Candes:2016. Additional follow-up work has brought an even larger class of accelerated methods into a Hamiltonian formalism Wibisono:2016, thus opening opportunities for analysis through the lens of continuous dynamical systems. For example, analyses based on Lyapunov’s theory were explored for both continuous and discrete settings Krichene:2015; Wilson:2016. However, such connections have thus far been limited to gradient descent methods for unconstrained optimization.
Perhaps the simplest example to illustrate the interplay between discrete and continuous approaches is the gradient descent method. In this case, one can make a simple correspondence between the following discrete update and a continuous dynamical system:
$$x_{k+1} = x_k - \epsilon \nabla f(x_k) \quad \longleftrightarrow \quad \dot{X} = -\nabla f(X), \tag{1}$$
where $\epsilon > 0$ is the stepsize, $X = X(t)$ is a continuous function of time such that $x_k = X(t)$ with $t = k\epsilon$, and $\dot{X} \equiv dX/dt$. Interestingly, the differential equation in (1) was used to solve an optimization problem by Cauchy Cauchy:1847 before gradient descent was invented. It is not hard to show that, for convex $f$, the differential equation (1) shares the $O(1/t)$ convergence rate with gradient descent. Analogously, for the Heavy Ball method Polyak:1964, one has the correspondence
for some constants and , and where and . For Nesterov’s accelerated gradient descent method Nesterov:1983, only recently has its continuous limit been obtained as Candes:2016
$$\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0. \tag{3}$$
The differential equation in (3) has a convergence rate of $O(1/t^2)$ for convex functions $f$, which matches the optimal rate of its discrete counterpart.
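As a rough numerical illustration of this rate, the sketch below integrates the differential equation $\ddot{X} + (3/t)\dot{X} + \nabla f(X) = 0$ with a classical 4th-order Runge-Kutta scheme. The quadratic test function, the start time `t0`, and the step size `h` are assumptions chosen for illustration only. The check relies on the standard Lyapunov function $t^2 (f(X) - f^\star) + 2\|X + (t/2)\dot{X} - x^\star\|^2$ from Candes:2016, which is nonincreasing along trajectories, so $t^2 (f(X(t)) - f^\star)$ should stay below its initial value.

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5 * sum(q_i * x_i**2), minimized at x* = 0.
q = np.array([1.0, 10.0])
x_init = np.array([1.0, 1.0])

def f(x):
    return 0.5 * np.sum(q * x * x)

def grad_f(x):
    return q * x

def rhs(t, state):
    """First-order form of X'' + (3/t) X' + grad f(X) = 0, with state = (X, X')."""
    n = state.size // 2
    x, v = state[:n], state[n:]
    return np.concatenate([v, -(3.0 / t) * v - grad_f(x)])

def rk4_step(t, state, h):
    """One classical Runge-Kutta step of size h."""
    k1 = rhs(t, state)
    k2 = rhs(t + 0.5 * h, state + 0.5 * h * k1)
    k3 = rhs(t + 0.5 * h, state + 0.5 * h * k2)
    k4 = rhs(t + h, state + h * k3)
    return state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(t0=0.1, h=1e-3, n_steps=20000):
    """Integrate from t0 (the ODE is singular at t = 0) with zero initial velocity."""
    state = np.concatenate([x_init, np.zeros_like(x_init)])
    t = t0
    times, gaps = [t], [f(x_init)]
    for _ in range(n_steps):
        state = rk4_step(t, state, h)
        t += h
        times.append(t)
        gaps.append(f(state[: x_init.size]))
    return np.array(times), np.array(gaps)

times, gaps = simulate()
# Value of the Lyapunov function at the start time (x* = 0, f* = 0, zero velocity).
energy_0 = times[0] ** 2 * f(x_init) + 2.0 * np.dot(x_init, x_init)
```

Along the computed trajectory, `times**2 * gaps` stays below `energy_0`, which is the numerical counterpart of the $O(1/t^2)$ bound.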
A separate important algorithm is the Alternating Direction Method of Multipliers (ADMM) Gabay:1976; Glowinsky:1975; Boyd:2011. ADMM is well known for its ease of implementation, scalability, and applicability in many important areas of machine learning and statistics. In the convex case, it converges at a rate of $O(1/k)$ He:2012; Eckstein:2015, while in the strongly convex case, it converges linearly Deng:2016. Many variants of ADMM exist. A popular example is one that incorporates a relaxation strategy, which is known empirically to improve convergence Eckstein:1994; Eckstein:1998. However, compared to standard ADMM, few theoretical results for relaxed ADMM are available; for strongly convex functions, relaxed ADMM has been shown to be linearly convergent Damek:2017; Nishihara:2015; Giselsson:2014; FrancaBento:2016.
An accelerated version of ADMM (here called A-ADMM) was proposed in Goldstein:2014. For composite objective functions of the form $f(x) + g(z)$ with $f$ and $g$ strongly convex and $g$ quadratic, it was shown that A-ADMM attains a convergence rate of $O(1/k^2)$ Goldstein:2014. To the best of our knowledge, no other convergence rates are known. Numerical experiments Goldstein:2014 show that A-ADMM performs better than other accelerated methods, such as Nesterov acceleration. Recently, a differential equation modeling A-ADMM was proposed Franca:2018; it generalizes previous results such as (3) and shows that the continuous dynamical system has an $O(1/t^2)$ convergence rate under merely a convexity assumption on the objective. In this paper, we analyze more general frameworks and provide analyses that go beyond these results in several aspects that we describe next.
We propose new variants of ADMM for solving the following problem:
$$\min_{x \in \mathbb{R}^n} V(x) = f(x) + g(Ax), \tag{4}$$
where $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^m \to \mathbb{R}$, and $A \in \mathbb{R}^{m \times n}$. We first consider the known family of relaxed ADMM algorithms. Then, we introduce two accelerated variants to the relaxed ADMM scheme: one follows Nesterov’s approach, which we refer to as relaxed and accelerated ADMM (relaxed A-ADMM), and the other is closer to Polyak’s Heavy Ball method, which we call relaxed Heavy Ball ADMM (see Algorithms 2 and 3). To the best of our knowledge, this is the first time that acceleration and relaxation are considered jointly in an optimization framework.
[Footnote 1: Our results can be extended to the formulation provided is invertible, which is not an uncommon assumption Nishihara:2015; FrancaBento:2016; Eckstein:2015. Since , one can easily redefine to cast the problem into a similar form as (4).]
Having introduced these new families of algorithms, we turn our attention to deriving their continuous limit as differential equations that model their behavior. We then obtain rates of convergence for the dynamical systems by constructing appropriate Lyapunov functions in both the convex and strongly convex cases. Our results are more general than those in Franca:2018, with their results being one particular instance of our framework. For a summary of the results obtained in this paper, we refer the reader to Table 1. We can see that by incorporating relaxation into A-ADMM (the relaxation parameter is $\alpha$) an improved constant in the complexity bound is obtained. Also, the proposed relaxed Heavy Ball ADMM recovers linear convergence in the strongly convex case, which contrasts with (relaxed) A-ADMM.
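[Table 1: Summary of the convergence rates obtained in this paper for the convex and strongly convex cases; the last row is Relaxed Heavy Ball ADMM.]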
Preliminaries and notation
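Given $x, y \in \mathbb{R}^n$, we let $\|x\|$ denote the Euclidean norm and $\langle x, y \rangle$ denote the inner product. Given a matrix $A$, the corresponding induced matrix norm is denoted by $\|A\|$, and its largest and smallest singular values are denoted by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$, respectively. The condition number of $A$ is denoted as $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. For future reference, we formalize our general assumptions as follows.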
Assumption 1.
The functions $f$ and $g$ in (4) are continuously differentiable and the matrix $A$ has full column rank. Moreover, the function $\nabla V$ is Lipschitz continuous.
We stress that differentiability of $f$ and $g$ is a natural assumption when drawing connections to differential equations. One may be able to relax these conditions by using subdifferential calculus and differential inclusions, but this is beyond the scope of this work. Moreover, Lipschitz continuity of $\nabla V$ ensures the existence and uniqueness of global solutions to the differential equations.
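Definition 2 (Convex function).
We say that a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \quad \text{for all } x, y \in \mathbb{R}^n. \tag{5}$$
Definition 3 (Strongly convex function).
We say that a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex if and only if there exists a constant $\mu > 0$ such that
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^n. \tag{6}$$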
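Let $X = X(t) \in \mathbb{R}^n$, where $t$ denotes the continuous-time variable. The corresponding state of an algorithm at discrete time $k = 0, 1, \dots$ will be denoted by $x_k \in \mathbb{R}^n$. We assume that $x_k = X(t)$, at instant $t = k\epsilon$, for some small enough $\epsilon > 0$. One simple, but important, relation when taking the continuous limit is $x_{k \pm 1} = X(t \pm \epsilon)$. Also, using the Mean Value Theorem and Taylor’s Theorem on the components of $x_{k+1}$ and $x_{k-1}$, one may show that (see Appendix A)
$$\frac{x_{k \pm 1} - x_k}{\epsilon} \to \pm \dot{X}(t), \qquad \frac{x_{k+1} - 2x_k + x_{k-1}}{\epsilon^2} \to \ddot{X}(t) \qquad \text{as } \epsilon \to 0. \tag{7}$$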
II Variants of ADMM as Continuous Dynamical Systems
In this section, we first consider a family of relaxed ADMM algorithms, and then introduce two new accelerated variants. Moreover, we present differential equations modelling their behavior in the continuous-time limit. The family of ADMM algorithms is developed for problem (4), after adding an auxiliary variable $z = Ax$ and considering the (scaled) augmented Lagrangian $L_\rho(x, z, u) = f(x) + g(z) + \rho\, u^T (Ax - z) + \frac{\rho}{2}\|Ax - z\|^2$, where $u$ is the Lagrange multiplier vector and $\rho > 0$ is the penalty parameter.
II.1 Relaxed ADMM
Let us start with a relaxed ADMM framework Boyd:2011 that is presented in Algorithm 1. The relaxation parameter $\alpha$ is introduced to speed up convergence. We obtain the continuous limit of these updates in Theorem 4. The derivation of such a differential equation is considerably easier compared to its accelerated variant (see Theorem 5), and is therefore relegated to the Appendix.
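Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of relaxed ADMM, written from the standard over-relaxed updates in Boyd:2011 for the splitting $z = Ax$ with scaled multiplier $u$. It may differ in details from Algorithm 1. The quadratic instance ($f(x) = \frac{1}{2} x^T Q x - b^T x$, $g(z) = \frac{1}{2}\|z\|^2$), the function name `relaxed_admm`, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def relaxed_admm(Q, b, A, rho=2.0, alpha=1.5, n_iter=200):
    """Relaxed ADMM sketch for min_x 0.5 x^T Q x - b^T x + 0.5 ||A x||^2, split as z = A x.

    rho is the penalty parameter and alpha the relaxation parameter
    (alpha = 1 gives standard ADMM, alpha > 1 is over-relaxation).
    """
    m, _ = A.shape
    z = np.zeros(m)
    u = np.zeros(m)  # scaled Lagrange multiplier
    H = Q + rho * A.T @ A  # Hessian of the x-subproblem
    for _ in range(n_iter):
        # x-update: argmin_x f(x) + (rho/2) ||A x - z + u||^2
        x = np.linalg.solve(H, b + rho * A.T @ (z - u))
        # relaxation: blend A x with the previous z
        v = alpha * (A @ x) + (1.0 - alpha) * z
        # z-update: argmin_z 0.5 ||z||^2 + (rho/2) ||v - z + u||^2
        z_new = rho * (v + u) / (1.0 + rho)
        # multiplier update
        u = u + v - z_new
        z = z_new
    return x

# Hypothetical problem instance with a known minimizer.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # full column rank with probability one
Q = np.diag(np.arange(1.0, 6.0))     # positive definite, so f is strongly convex
b = rng.standard_normal(5)
x_star = np.linalg.solve(Q + A.T @ A, b)  # minimizer of f(x) + g(A x)
```

Calling `relaxed_admm(Q, b, A, alpha=a)` for several values of `a` in $(0, 2)$ and comparing against `x_star` gives a quick way to observe the effect of relaxation on this toy instance.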
II.2 Relaxed and Accelerated ADMM
Algorithm 2 has not been considered in the literature, and is a relaxation of A-ADMM Goldstein:2014 that may be recovered by setting and . It is also worth noting that even for relaxed ADMM (without acceleration) the existing theoretical results are sparse compared to standard ADMM. Our next result derives the continuous limit of Algorithm 2.
[Footnote 2: Strictly speaking, Goldstein:2014 uses the parametrization with the recursion and . However, asymptotically this is the same as with .]
Since and are convex, and has full column rank, the optimization problems associated with the proximal operators in steps 3 and 4 of Algorithm 2 are strongly convex, so that they have unique solutions. From the optimality conditions we thus have
The first two equations in (10) can be further combined into
We now consider each term separately. From (10e) we have By adding to the right hand side and then reorganizing, we obtain
Therefore, we have shown that
We now focus on the third term of (11). By adding and then reorganizing, we obtain
Considering (10d) and noting that , it now follows that
Combining these results with (11), in the limit , we have
for all . Letting , which also forces , we have . Since , it follows that , which completes the proof. ∎
II.3 Relaxed Heavy Ball ADMM
Another acceleration scheme for gradient descent is the Heavy Ball Method introduced by Polyak Polyak:1964. Motivated by his work, we now introduce another accelerated variant of relaxed ADMM that we call relaxed Heavy Ball ADMM, which is presented in Algorithm 3. Note that the key difference when compared with Algorithm 2 is the choice , which is now a constant that depends on the penalty parameter . This choice is inspired by the continuous limit but is otherwise not obvious. The proof of the next result is similar to the proof of Theorem 5 and is deferred to the Appendix.
The differential equation (20) is closely related to (9). The key difference is that the different damping strategies render different stabilities in the dynamical systems, which we believe reflects upon the different empirical behavior observed between Algorithm 2 and Algorithm 3; see also Table 1.
III Convergence Rates of the Dynamical Systems
We now provide the convergence rates of the dynamical systems derived in Section II when $V$ is either convex or strongly convex. The overall strategy for second-order systems consists of constructing a Lyapunov function $\mathcal{E}$ such that $\mathcal{E} \ge 0$ and $\dot{\mathcal{E}} \le 0$ over trajectories of the system. If one is able to find such an $\mathcal{E}$, then the convergence rate usually follows from $\mathcal{E}(t) \le \mathcal{E}(t_0)$. In what follows, we state the results and, for reference, write down the corresponding Lyapunov functions. The actual proofs are deferred to the Appendix.
[Footnote 3: Strictly speaking, for non-autonomous systems, $\mathcal{E}$ has to dominate a positive-definite function in order to be called a Lyapunov function LaSalle. However, this is only important when considering the stability of the dynamical system. To obtain convergence rates, $\mathcal{E} \ge 0$ and $\dot{\mathcal{E}} \le 0$ are sufficient, which explains our abuse of terminology.]
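The Lyapunov strategy can be illustrated on the simplest case, the gradient flow $\dot{X} = -\nabla f(X)$ in (1), rather than on the ADMM systems of this paper. For convex $f$ with minimizer $x^\star$, a textbook choice is $t\,(f(X) - f(x^\star)) + \frac{1}{2}\|X - x^\star\|^2$, which is nonincreasing along trajectories and therefore gives $f(X(t)) - f(x^\star) \le \|x_0 - x^\star\|^2 / (2t)$. The sketch below checks both facts numerically; the quadratic objective and the initial point are assumptions, and the exact solution of the flow is used so that no integrator error enters.

```python
import numpy as np

# Hypothetical convex quadratic f(x) = 0.5 * sum(q_i * x_i**2), minimized at x* = 0.
q = np.array([0.01, 1.0, 10.0])
x0 = np.array([1.0, -2.0, 0.5])

def flow(t):
    """Exact solution of the gradient flow X' = -grad f(X) with X(0) = x0."""
    return np.exp(-q * t) * x0

def objective_gap(x):
    """f(x) - f(x*), with f(x*) = 0 for this quadratic."""
    return 0.5 * np.sum(q * x * x)

def lyapunov(t):
    """Textbook Lyapunov function t * (f(X) - f*) + 0.5 * ||X - x*||^2 for the gradient flow."""
    x = flow(t)
    return t * objective_gap(x) + 0.5 * np.dot(x, x)

ts = np.linspace(0.0, 50.0, 2001)
energies = np.array([lyapunov(t) for t in ts])
gaps = np.array([objective_gap(flow(t)) for t in ts])
```

Here `energies` is monotonically nonincreasing, and the $O(1/t)$ bound follows by dropping the nonnegative squared-distance term from `lyapunov(t) <= lyapunov(0)`.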
III.1 Convergence of Relaxed ADMM
for the convex case, where is a minimizer of , and the Lyapunov function
for the strongly convex case. We also use the shorthand notation in several parts.
The following remarks concerning Theorem 7 are appropriate:
The exponential rate in (24) is consistent with the linear convergence of relaxed ADMM algorithms in the strongly convex case Nishihara:2015; FrancaBento:2016.
It is interesting, although not surprising, that the relaxation parameter $\alpha$ appears in the convergence rates. Moreover, it suggests that improved performance may be obtained by using over-relaxation, i.e., choosing $\alpha > 1$. This is especially prominent in the strongly convex case since it appears inside the exponential in (24).
It may seem desirable to choose $\alpha$ close to $2$. However, one must be careful to avoid divergence in (21). In the extreme case, i.e., when $\alpha = 2$, there is no dynamics according to (8). These observations are consistent with the empirical guideline that $\alpha \in (1.5, 1.8)$ as suggested in Eckstein:1994; Eckstein:1998. The choice $\alpha > 2$ should be avoided because the system (8) then follows the gradient ascent direction; this is consistent with existing results for the discrete case FrancaBento:2016.
III.2 Convergence of Relaxed and Accelerated Variants of ADMM
Consider the dynamical system (9) associated with the relaxed A-ADMM in Algorithm 2. To obtain a convergence rate in the convex case, inspired by Candes:2016; Wibisono:2016, we use the Lyapunov function
while for the strongly convex case we use
with . This choice is motivated by Candes:2016; Attouch:2016, which considered the differential equation related to Nesterov’s method. We now have our main result for the dynamical system (9).
III.3 Convergence of Relaxed Heavy Ball Variants of ADMM
while for the strongly convex case we use
to prove the following convergence rates associated to the dynamical system (20).
Consider the dynamical system (20) under Assumption 1. Let be a minimizer of and be a trajectory of the system with initial conditions and . Then, there exists a constant , that is independent of parameters, such that the following hold:
If is convex, then
If is -strongly convex, then with we have
Over-relaxation, i.e., choosing , seems to improve convergence in some cases more than others. For instance, in the convex cases (27) and (31) the improvement is linear in , while in the strongly convex cases (28) and (32) the parameter appears raised to a power and inside an exponential, respectively.
Although the term in (28) seems to indicate faster convergence for larger values of , one must remember that the constant grows as .
IV A Numerical Example
Here we provide a simple numerical experiment illustrating our theoretical results. Consider the quadratic problem
where . The matrix is obtained as follows. Let be a random matrix with singular value decomposition . Then, . Thus, is full column rank with condition number . The matrix is a random symmetric matrix chosen to be (a) positive semi-definite for the convex case and (b) positive definite for the strongly convex case, as follows. First, let be a random matrix drawn from the compact orthogonal group (with Haar measure). Second, define where with the following choice:
|(b) strongly convex:||(35)|
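The construction of the test matrices can be sketched as follows. The exact singular values and eigenvalues used in the experiment are not recoverable from this text, so the linearly spaced singular values, the example eigenvalues, and all function names below are assumptions. The orthogonal factor is drawn from the Haar measure by the usual QR-with-sign-correction recipe.

```python
import numpy as np

def matrix_with_condition_number(m, n, kappa, rng):
    """Random m x n matrix (m >= n) with full column rank and condition number kappa.

    The singular vectors come from the SVD of a Gaussian matrix; the singular
    values are replaced by an assumed profile spaced linearly in [1, kappa].
    """
    G = rng.standard_normal((m, n))
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    sigma = np.linspace(1.0, kappa, n)
    return U @ np.diag(sigma) @ Vt

def haar_orthogonal(n, rng):
    """Orthogonal matrix drawn from the Haar measure on the orthogonal group."""
    G = rng.standard_normal((n, n))
    Qm, R = np.linalg.qr(G)
    # Fixing the signs of the diagonal of R makes the QR factor Haar distributed.
    return Qm * np.sign(np.diag(R))

def random_symmetric(eigenvalues, rng):
    """Symmetric matrix with prescribed eigenvalues and Haar-distributed eigenvectors."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    Qm = haar_orthogonal(eigenvalues.size, rng)
    return Qm @ np.diag(eigenvalues) @ Qm.T

rng = np.random.default_rng(1)
A = matrix_with_condition_number(12, 6, kappa=10.0, rng=rng)
# (a) convex: at least one zero eigenvalue; (b) strongly convex: all eigenvalues positive.
M_convex = random_symmetric([0.0, 0.0, 1.0, 2.0, 3.0, 4.0], rng)
M_strongly_convex = random_symmetric([0.5, 1.0, 1.5, 2.0, 3.0, 4.0], rng)
```

With eigenvalues that include zeros, the quadratic form defined by `M_convex` is convex but not strongly convex, while strictly positive eigenvalues give the strongly convex case.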
With these two setups we solve problem (33) using the variants of ADMM and numerical integrations of the corresponding differential equations. For the 1st-order differential equation (8) we use a standard 4th-order Runge-Kutta method. The integration of the 2nd-order differential equations (9) and (20) is more challenging due to strong oscillations of the trajectories. We thus write these dynamical systems in Hamiltonian form and use a symplectic integrator, which is a numerical scheme designed to preserve properties of the continuous dynamical system in discrete time. We refer the reader to Appendix B for more details. We use the simplest of such methods, the symplectic Euler method given by updates (127).
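Updates (127) are not reproduced in this text, so the following is only a generic sketch of the idea, not the scheme of Appendix B. It applies a conformal variant of the symplectic Euler method to a damped system $\ddot{X} + \eta \dot{X} + \nabla f(X) = 0$ with constant damping $\eta$, written as $\dot{X} = P$, $\dot{P} = -\eta P - \nabla f(X)$. The constant damping, the quadratic objective, and the function name are assumptions. For $\eta = 0$ the scheme reduces to the standard symplectic Euler method.

```python
import numpy as np

def conformal_symplectic_euler(grad_f, x0, p0, eta, h, n_steps):
    """Integrate X' = P, P' = -eta * P - grad f(X) with a symplectic-Euler-type scheme.

    The momentum is first damped exactly and kicked by the gradient at the
    current position, then the position drifts with the updated momentum.
    """
    x = np.array(x0, dtype=float)
    p = np.array(p0, dtype=float)
    decay = np.exp(-eta * h)  # exact solution of P' = -eta * P over one step
    for _ in range(n_steps):
        p = decay * p - h * grad_f(x)
        x = x + h * p
    return x, p

# Hypothetical quadratic objective f(x) = 0.5 * sum(q_i * x_i**2).
q = np.array([1.0, 10.0])

def f(x):
    return 0.5 * np.sum(q * x * x)

def grad_f(x):
    return q * x

x_start = np.array([1.0, 1.0])
p_start = np.zeros(2)

# Damped run: the trajectory should approach the minimizer x* = 0.
x_damped, p_damped = conformal_symplectic_euler(grad_f, x_start, p_start, eta=1.0, h=0.01, n_steps=5000)

# Undamped run: the energy 0.5 * ||p||^2 + f(x) should stay close to its initial value.
x_free, p_free = conformal_symplectic_euler(grad_f, x_start, p_start, eta=0.0, h=0.01, n_steps=5000)
energy_start = f(x_start)
energy_end = 0.5 * np.dot(p_free, p_free) + f(x_free)
```

The undamped run illustrates why symplectic schemes are attractive for oscillatory trajectories: the energy error remains bounded over long times instead of drifting.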
The results for both cases, convex and strongly convex, are shown in Figure 1, where we plot the objective function error (in logarithmic scale) versus the iteration time. For the sake of visualization we vary $\alpha$ (shaded area) only for the discrete algorithms and compare the results of the algorithms (solid lines) with the differential equations (dashed lines) only for a single value of $\alpha$. Notice that the curves for the algorithms and associated differential equations are close. Note also the different convergence behaviours between the convex and strongly convex cases, as predicted in Theorem 8 and Theorem 9. In particular, relaxed A-ADMM (and dynamical system (9)) converges faster than relaxed Heavy Ball ADMM (and dynamical system (20)) in the convex case. However, the behaviour shifts in the strongly convex case, since relaxed Heavy Ball ADMM has linear convergence while relaxed A-ADMM does not. Also, the convergence rates with $\alpha > 1$ are improved, as predicted in our previous theorems. The improvement is more prominent in the strongly convex case, as seen by comparing the wider shaded areas of Figure 1b versus Figure 1a.
We introduced two new families of relaxed and accelerated ADMM algorithms. The first follows Nesterov’s acceleration approach (see Algorithm 2), while the second was inspired by Polyak’s Heavy Ball method (see Algorithm 3). Moreover, we presented a new perspective for understanding these variants of ADMM by deriving differential equations that model them in a continuous-time limit (see Theorems 5 and 6). Such an approach allowed for a simple complexity analysis built upon Lyapunov stability that led to rate-of-convergence results for convex and strongly convex objective functions for the associated continuous dynamical systems (see Theorems 8 and 9 whose proofs are in Appendix A). Most of the complexity results in this paper (see Table 1) are new to the best of our knowledge. A numerical verification of these convergence rates comparing variants of ADMM algorithms with the corresponding continuous dynamical systems, through a Hamiltonian symplectic integrator, was provided in Figure 1. Although these results were derived in the continuous-time limit, they suggest that the same rates hold for the corresponding discrete algorithms, for which the proofs are more difficult and currently unknown. Hopefully, our approach can provide valuable insight into tackling the discrete case.
Acknowledgements. This work was supported by grants ARO MURI W911NF-17-1-0304 and NSF 1447822.
Appendix A Proofs of the Main Results
In this section we provide the proofs of the theorems stated in the main part of this paper.
Proof of relations (7).
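Let $X = X(t) \in \mathbb{R}^n$ be a twice continuously differentiable function of time $t$. We can obtain a discrete sample from $X$ by computing its values at intervals of time $\epsilon > 0$. Therefore, define $x_{k \pm 1} = X(t \pm \epsilon)$ where $x_k = X(t)$ and $t = k\epsilon$. Denote by $X_i(t)$ and $x_{k,i}$ the $i$th component of these vectors, where $i = 1, \dots, n$. From the Mean Value Theorem we have
$$x_{k+1,i} - x_{k,i} = \epsilon \, \dot{X}_i(t + \lambda_i \epsilon)$$
for some $\lambda_i \in [0, 1]$. Hence, $(x_{k+1,i} - x_{k,i})/\epsilon \to \dot{X}_i(t)$ as $\epsilon \to 0$. Since this holds for each component $i$, we obtain
$$\frac{x_{k+1} - x_k}{\epsilon} \to \dot{X}(t).$$
Analogously, we also have that
$$\frac{x_k - x_{k-1}}{\epsilon} \to \dot{X}(t).$$
To consider second derivatives we use Taylor’s theorem
$$x_{k \pm 1, i} = x_{k,i} \pm \epsilon \, \dot{X}_i(t) + \tfrac{1}{2}\epsilon^2 \, \ddot{X}_i(t \pm \lambda_i^{\pm} \epsilon)$$
for some $\lambda_i^{\pm} \in [0, 1]$. Thus,
$$x_{k+1,i} - 2 x_{k,i} + x_{k-1,i} = \tfrac{1}{2}\epsilon^2 \left[ \ddot{X}_i(t + \lambda_i^{+}\epsilon) + \ddot{X}_i(t - \lambda_i^{-}\epsilon) \right] = \epsilon^2 \, \ddot{X}_i(t + \xi_i \epsilon)$$
for some $\xi_i \in [-1, 1]$, where the last equality uses the continuity of $\ddot{X}_i$. Thus, $(x_{k+1,i} - 2x_{k,i} + x_{k-1,i})/\epsilon^2 \to \ddot{X}_i(t)$ as $\epsilon \to 0$. Since this holds for each component $i$, we conclude that
$$\frac{x_{k+1} - 2x_k + x_{k-1}}{\epsilon^2} \to \ddot{X}(t).$$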
This concludes the proof of relations (7). ∎
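The finite-difference limits established in this proof can also be checked numerically. The sketch below samples a smooth trajectory at $t - \epsilon$, $t$, and $t + \epsilon$ and measures how far the difference quotients are from the exact derivatives as $\epsilon$ shrinks. The particular trajectory, the evaluation time, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def X(t):
    """Hypothetical smooth trajectory in R^2."""
    return np.array([np.sin(t), np.exp(-t)])

def X_dot(t):
    return np.array([np.cos(t), -np.exp(-t)])

def X_ddot(t):
    return np.array([-np.sin(t), np.exp(-t)])

def difference_errors(t, eps):
    """Errors of the forward, backward, and second-order difference quotients at time t."""
    x_prev, x_k, x_next = X(t - eps), X(t), X(t + eps)
    forward = (x_next - x_k) / eps
    backward = (x_k - x_prev) / eps
    second = (x_next - 2.0 * x_k + x_prev) / eps ** 2
    return (
        np.linalg.norm(forward - X_dot(t)),
        np.linalg.norm(backward - X_dot(t)),
        np.linalg.norm(second - X_ddot(t)),
    )

errors_coarse = difference_errors(1.0, 1e-1)
errors_fine = difference_errors(1.0, 1e-3)
```

All three entries of `errors_fine` are smaller than the corresponding entries of `errors_coarse`, in line with the limits $\epsilon \to 0$ used in the proof.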
Proof of Theorem 4.
Since $f$ and $g$ are convex, and $A$ has full column rank, the optimization problems in the proximal operators of Algorithm 1 are strongly convex, so that their solutions are unique. It follows from the optimality conditions that
Let where . Choosing , from (43) we have