Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints
This manuscript develops a new framework to analyze and design iterative optimization algorithms built on the notion of Integral Quadratic Constraints (IQC) from robust control theory. IQCs provide sufficient conditions for the stability of complicated interconnected systems, and these conditions can be checked by semidefinite programming. We discuss how to adapt IQC theory to study optimization algorithms, proving new inequalities about convex functions and providing a version of IQC theory adapted for use by optimization researchers. Using these inequalities, we derive numerical upper bounds on convergence rates for the Gradient method, the Heavy-ball method, Nesterov’s accelerated method, and related variants by solving small, simple semidefinite programming problems. We also briefly show how these techniques can be used to search for optimization algorithms with desired performance characteristics, establishing a new methodology for algorithm design.
Laurent Lessard, Benjamin Recht, and Andrew Packard
Convex optimization provides a powerful toolkit for building robust, efficient, large-scale optimization algorithms. These algorithms are not only effective tools for solving optimization problems, but are guaranteed to converge to accurate solutions within prescribed time budgets [23, 26], are robust to errors and time delays [41, 22], and are amenable to declarative modeling that decouples the algorithm design from the problem formulation [2, 8, 16]. However, as we push up against the boundaries of the convex analysis framework, try to build more complicated models, and aim to deploy optimization systems in highly complex environments, the mathematical guarantees of convexity start to break down. The standard proof techniques for analyzing convex optimization rely on deep insights by experts and are devised on an algorithm-by-algorithm basis. It is thus not clear how to extend the toolkit to more diverse scenarios where multiple objectives—such as robustness, accuracy, and speed—need to be delicately balanced.
This paper marks an attempt at providing a systematized approach to the design and analysis of optimization algorithms using techniques from control theory. Our strategy is to adapt the notion of an integral quadratic constraint from robust control theory. These constraints link sequences of inputs and outputs of operators, and are ideally suited to proving algorithmic convergence. We will see that for convex functions, we can derive these constraints using only the standard first-order characterization of convex functions, and that these inequalities suffice to reduce the analysis of first-order methods to the solution of a very small semidefinite program. Our IQC framework puts the analysis of algorithms in a unified proof framework, and enables new analyses of algorithms by minor perturbations of existing proofs. This new system aims to simplify and automate the analysis of optimization algorithms, and perhaps to open new directions for algorithm design.
Our methods are inspired by the recent work of Drori and Teboulle. In their manuscript, the authors propose writing down the first-order convexity inequality for all steps of an algorithmic procedure. They then derive a semidefinite program that analytically verifies tight bounds on the convergence rate of the Gradient method, and numerically precise bounds for the convergence of Nesterov’s method and other first-order methods. The main drawback of the Drori and Teboulle approach is that the size of the semidefinite program scales with the number of time steps desired. Thus, it becomes computationally laborious to analyze algorithms that require more than a few hundred iterations.
Integral quadratic constraints will allow us to circumvent this issue. A typical example of one of our semidefinite programs might have a positive semidefinite decision variable, 3 scalar variables, a semidefinite cone constraint, and 4 scalar constraints. Such a problem can be solved in less than 10 milliseconds on a laptop with standard solvers.
We are able to analyze a variety of methods in our framework. We show that our framework recovers the standard rates of convergence for the Gradient method applied to strongly convex functions. We show that we can numerically estimate the performance of Nesterov’s method. Indeed, our analysis provides slightly sharper bounds than Nesterov’s proof. We show how our system fails to certify the stability of the popular Heavy-ball method of Polyak for strongly convex functions whose condition ratio is larger than 18. Based on this analysis, we are able to construct a one-dimensional strongly convex function whose condition ratio is 25 and prove analytically that the Heavy-ball method fails to find the global minimum of this function. This suggests that our tools can also be used as a way to guide the construction of counterexamples.
We show that our methods extend immediately to the projected and proximal variants of all the first order methods we analyze. We also show how to extend our analysis to functions that are convex but not strongly convex, and provide bounds on convergence that are within a logarithmic factor of the best upper bounds. We also demonstrate that our methods can bound convergence rates when the gradient is perturbed by relative deterministic noise. We show how different parameter settings lead to very different degradations in performance bounds as the noise increases.
Finally, we turn to algorithm design. Since our semidefinite program takes as input the parameters of our iterative scheme, we can search over these parameters. For simple two-step methods, our algorithms are parameterized by 3 parameters, and we show how we can derive first-order methods that achieve nearly the same rate of convergence as Nesterov’s accelerated method but are more robust to noise.
The manuscript is organized as follows. We begin with a discussion of discrete-time dynamical systems and how common optimization algorithms can be viewed as feedback interconnections between a known linear system and an uncertain nonlinear component. We then show how quadratic Lyapunov functions can be used to certify rates of convergence for optimization problems and can be found by semidefinite programming. This immediately leads to the notion of an integral quadratic constraint. Another contribution of this work is a new form of IQC analysis geared specifically toward rate-of-convergence conclusions and accessible to optimization researchers. We also discuss the history of IQCs in robust control theory and how they can be derived. With these basic IQCs in hand, we then turn to analyzing the Gradient method and Nesterov’s method, their projected and proximal variants, and their robustness to noise. We discuss one possible brute-force technique for designing new algorithms and show how we can outperform existing methods. Finally, we conclude with many directions for future work.
1.1 Notation and conventions
The identity matrix and zero matrix are denoted $I$ and $0$, respectively. Subscripts indicating dimension are omitted when they are to be inferred by context.
Norms and sequences.
We define $\ell_{2e}^n$ to be the set of all one-sided sequences $x := (x_0, x_1, x_2, \dots)$ with each $x_k \in \mathbb{R}^n$. We sometimes omit $n$ and simply write $\ell_{2e}$ when the superscript is clear from context. The notation $\|\cdot\|$ denotes the standard 2-norm. The subset $\ell_2 \subset \ell_{2e}$ consists of all square-summable sequences. In other words, $x \in \ell_2$ if and only if $\sum_{k=0}^{\infty} \|x_k\|^2$ is convergent.
For a given $0 < m \le L$, we define $S(m, L)$ to be the set of functions $f : \mathbb{R}^n \to \mathbb{R}$ that are continuously differentiable, strongly convex with parameter $m$, and have Lipschitz gradients with parameter $L$. In other words, $f \in S(m, L)$ satisfies
$$m \|x - y\|^2 \le (\nabla f(x) - \nabla f(y))^{\mathsf T}(x - y) \quad\text{and}\quad \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad\text{for all } x, y \in \mathbb{R}^n.$$
We call $\kappa := L/m$ the condition ratio of $f \in S(m, L)$. We adopt this terminology to distinguish the condition ratio of a function from the related concept of condition number of a matrix. The connection is that if $f$ is twice differentiable, we have the bound $\operatorname{cond}(\nabla^2 f(x)) \le \kappa$ for all $x$, where $\operatorname{cond}(\cdot)$ is the condition number.
The Kronecker product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B$ is denoted $A \otimes B$ and given by:
$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}$$
Two useful properties of the Kronecker product are that $(A \otimes B)^{\mathsf T} = A^{\mathsf T} \otimes B^{\mathsf T}$ and that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ whenever the matrix dimensions are such that the products $AC$ and $BD$ make sense.
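These two identities are easy to sanity-check numerically; the sketch below uses NumPy's `np.kron` with arbitrary random matrices of compatible sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 5))   # A @ C is defined
D = rng.standard_normal((2, 3))   # B @ D is defined

# Transpose identity: (A ⊗ B)^T = A^T ⊗ B^T
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Mixed-product identity: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
```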
2 Optimization algorithms as dynamical systems
A linear dynamical system is a set of recursive linear equations of the form
$$\xi_{k+1} = A \xi_k + B u_k, \qquad y_k = C \xi_k + D u_k. \qquad (2.1)$$
At each timestep $k$, $u_k \in \mathbb{R}^m$ is the input, $y_k \in \mathbb{R}^p$ is the output, and $\xi_k \in \mathbb{R}^n$ is the state. We can write the dynamical system (2.1) compactly by stacking the matrices into a block using the notation
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}.$$
We can connect this linear system in feedback with a nonlinearity $\phi$ by defining the rule
$$u_k = \phi(y_k). \qquad (2.2)$$
In this case, the output $y_k$ is transformed by the map $\phi$ and is then used as the input $u_k$ to the linear system.
In this paper, we will be interested in the case when the interconnected nonlinearity has the form $u = \nabla f(y)$ where $f \in S(m, L)$. In particular, we will consider algorithms designed to solve the optimization problem
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; f(x) \qquad (2.3)$$
as dynamical systems and see how this new viewpoint can give us insights into convergence analysis. Section 5.3 considers variants of (2.3) where the decision variable is constrained or the objective is non-smooth.
Standard first-order methods such as the Gradient method, Heavy-ball method, and Nesterov’s accelerated method can all be cast in the form (2.2). In all cases, the nonlinearity is the mapping $u = \nabla f(y)$. The state-transition matrices $(A, B, C, D)$ differ for each algorithm. The Gradient method with stepsize $\alpha$ can be expressed as
$$\xi_{k+1} = \xi_k - \alpha u_k, \qquad y_k = \xi_k, \qquad u_k = \nabla f(y_k). \qquad (2.4)$$
Eliminating $u_k$ and $y_k$ and renaming $\xi_k$ to $x_k$ yields
$$x_{k+1} = x_k - \alpha \nabla f(x_k),$$
which is the familiar Gradient method with constant stepsize. Nesterov’s accelerated method for strongly convex functions is given by the dynamical system
$$\xi_{k+1} = \begin{bmatrix} (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix} \xi_k + \begin{bmatrix} -\alpha I \\ 0 \end{bmatrix} u_k, \qquad y_k = \begin{bmatrix} (1+\beta) I & -\beta I \end{bmatrix} \xi_k, \qquad u_k = \nabla f(y_k). \qquad (2.5)$$
Eliminating $\xi_k$ and renaming appropriately yields the common form of Nesterov’s method
$$x_{k+1} = y_k - \alpha \nabla f(y_k), \qquad y_k = (1+\beta) x_k - \beta x_{k-1}. \qquad (2.6)$$
Note that other variants of this algorithm exist for which the $\alpha$ and $\beta$ parameters are updated at each iteration. In this paper, we restrict our analysis to the constant-parameter version above. The Heavy-ball method is given by
$$\xi_{k+1} = \begin{bmatrix} (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix} \xi_k + \begin{bmatrix} -\alpha I \\ 0 \end{bmatrix} u_k, \qquad y_k = \begin{bmatrix} I & 0 \end{bmatrix} \xi_k, \qquad u_k = \nabla f(y_k). \qquad (2.7)$$
One can check by similar analysis that (2.7) is equivalent to the update rule
$$x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1}).$$
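To make the three update rules concrete, here is a minimal sketch that runs each of them on a small strongly convex quadratic. The stepsize and momentum choices ($\alpha = 1/L$, $\beta$ based on $\sqrt{\kappa}$) are standard textbook tunings used for illustration, not prescriptions from the text.

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T Q x - p^T x with mI ⪯ Q ⪯ LI
Q = np.diag([1.0, 10.0])          # m = 1, L = 10
p = np.array([1.0, -2.0])
grad = lambda x: Q @ x - p
x_star = np.linalg.solve(Q, p)    # unique minimizer
m, L = 1.0, 10.0
kappa = L / m

def gradient(x0, alpha, iters):
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def nesterov(x0, alpha, beta, iters):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = (1 + beta) * x - beta * x_prev     # y_k = (1+β)x_k − βx_{k−1}
        x_prev, x = x, y - alpha * grad(y)     # x_{k+1} = y_k − α∇f(y_k)
    return x

def heavy_ball(x0, alpha, beta, iters):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x_prev, x = x, x - alpha * grad(x) + beta * (x - x_prev)
    return x

x0 = np.array([5.0, 5.0])
beta_n = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
sols = [
    gradient(x0, 1 / L, 300),
    nesterov(x0, 1 / L, beta_n, 300),
    heavy_ball(x0, 4 / (np.sqrt(L) + np.sqrt(m)) ** 2, beta_n ** 2, 300),
]
for x in sols:
    assert np.linalg.norm(x - x_star) < 1e-6  # all three converge on this quadratic
```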
2.1 Proving algorithm convergence
Convergence analysis of convex optimization algorithms typically follows a two-step procedure. First, one must show that the algorithm has a fixed point that solves the optimization problem in question. Then, one must verify that from a reasonable starting point, the algorithm converges to this optimal solution at a specified rate.
In dynamical systems, such proofs are called stability analysis. By writing common first-order methods as dynamical systems, we can unify their stability analysis. For a general problem with minimum occurring at $x_\star$, a necessary condition for optimality is that $\nabla f(x_\star) = 0$. Substituting $u_\star = \nabla f(y_\star) = 0$ into (2.1), the fixed point satisfies
$$\xi_\star = A \xi_\star, \qquad y_\star = C \xi_\star.$$
In particular, $A$ must have an eigenvalue of $1$. If the blocks of $A$ are diagonal, as in the Gradient, Heavy-ball, and Nesterov methods shown above, then the eigenvalue $1$ of $A$ will have a geometric multiplicity of at least $n$.
Proving that all paths lead to the optimal solution requires more effort and constitutes the bulk of what is studied herein. Before we proceed with the analysis for general convex $f$, it is instructive to study what happens for quadratic $f$.
2.2 Quadratic problems
Suppose $f$ is a convex quadratic function $f(x) = \tfrac{1}{2} x^{\mathsf T} Q x - p^{\mathsf T} x + r$, where $m I \preceq Q \preceq L I$ in the positive definite ordering. The gradient of $f$ is simply $\nabla f(x) = Q x - p$ and the optimal solution is $x_\star = Q^{-1} p$.
What happens when we run a first-order method on a quadratic problem? Assume throughout this section that $D = 0$. Substituting $u_k = \nabla f(y_k) = Q y_k - p$ and $y_k = C \xi_k$ into (2.2), we obtain the system of equations:
$$\xi_{k+1} = (A + BQC) \xi_k - Bp.$$
Now make use of the fixed-point equations $\xi_\star = A \xi_\star$ and $Q y_\star - p = 0$, and we obtain $\xi_\star = (A + BQC) \xi_\star - Bp$. Eliminating $p$ from the above equations, we obtain
$$\xi_{k+1} - \xi_\star = (A + BQC)(\xi_k - \xi_\star). \qquad (2.8)$$
Let $T := A + BQC$ denote the closed-loop state transition matrix. A necessary and sufficient condition for $\xi_k$ to converge to $\xi_\star$ from every initial condition is that the spectral radius of $T$ is strictly less than $1$. Recall that the spectral radius of a matrix $T$ is defined as the largest magnitude of the eigenvalues of $T$. We denote the spectral radius by $\rho(T)$. It is a fact that
$$\rho(T) = \lim_{k \to \infty} \|T^k\|^{1/k},$$
where $\|\cdot\|$ is the induced 2-norm. Therefore, for any $\epsilon > 0$, we have $\|T^k\| \le (\rho(T) + \epsilon)^k$ for all sufficiently large $k$. Hence, we can bound the convergence rate:
$$\|\xi_k - \xi_\star\| \le (\rho(T) + \epsilon)^k \, \|\xi_0 - \xi_\star\|.$$
So the spectral radius also determines the rate of convergence of the algorithm. With only bounds on the eigenvalues of $Q$, we can provide conditions under which the algorithms above converge for quadratic $f$.
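The bound above can be checked directly for the Gradient method on a quadratic, where the closed-loop matrix is $T = I - \alpha Q$; the specific values of $m$, $L$, and $\alpha$ below are illustrative choices.

```python
import numpy as np

# Gradient method on f(x) = 0.5 x^T Q x - p^T x: closed loop is T = I - alpha*Q
m, L = 1.0, 10.0
alpha = 2 / (L + m)                       # stepsize minimizing the spectral radius
Q = np.diag([m, L])
p = np.array([1.0, 1.0])
T = np.eye(2) - alpha * Q

rho = max(abs(np.linalg.eigvals(T)))      # spectral radius
assert np.isclose(rho, (L - m) / (L + m)) # = (kappa-1)/(kappa+1)

# Simulate and compare with the rate bound (T is symmetric, so ||T^k|| = rho^k)
x_star = np.linalg.solve(Q, p)
x = np.array([4.0, -3.0])
e0 = np.linalg.norm(x - x_star)
for k in range(50):
    x = x - alpha * (Q @ x - p)
assert np.linalg.norm(x - x_star) <= rho ** 50 * e0 + 1e-12
```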
The following table gives worst-case rates for different algorithms and parameter choices when applied to a class of convex quadratic functions. We assume here that $f(x) = \tfrac{1}{2} x^{\mathsf T} Q x - p^{\mathsf T} x + r$ where $p \in \mathbb{R}^n$ and $Q$ is any symmetric matrix that satisfies $m I \preceq Q \preceq L I$. We also define $\kappa := L/m$.
| Method | Parameter choice | Rate bound | Comment |
All of these results are proven by elementary linear algebra and the bounds are tight. In other words, for each algorithm there exists a quadratic function that achieves the worst-case rate $\rho$. See Appendix A for more detail.
Unfortunately, the proof technique used in Proposition 1 does not extend to the case where is a more general strongly convex function. However, a different characterization of stability does generalize and will be described in Section 3. It turns out that for linear systems, stability is equivalent to the feasibility of a particular semidefinite program. We will see in the sequel that similar semidefinite programs can be used to certify stability of nonlinear systems.
Suppose $T \in \mathbb{R}^{n \times n}$. Then $\rho(T) < \rho$ if and only if there exists a $P \succ 0$ satisfying $T^{\mathsf T} P T - \rho^2 P \prec 0$.
The proof of Proposition 2 is elementary so we omit it. The use of Linear Matrix Inequalities (LMIs) to characterize stability of a linear time-invariant system dates back to Lyapunov, and we give a more detailed account of this history in Section 3.4. Now suppose we are studying a dynamical system of the form $\xi_{k+1} - \xi_\star = T (\xi_k - \xi_\star)$ as in (2.8). Then, if there exists a $P \succ 0$ satisfying $T^{\mathsf T} P T - \rho^2 P \preceq 0$,
$$(\xi_{k+1} - \xi_\star)^{\mathsf T} P (\xi_{k+1} - \xi_\star) \le \rho^2 (\xi_k - \xi_\star)^{\mathsf T} P (\xi_k - \xi_\star) \qquad (2.9)$$
along all trajectories. If $\rho < 1$, then the sequence $\xi_k$ converges linearly to $\xi_\star$. Iterating (2.9) down to $k = 0$, we see that
$$(\xi_k - \xi_\star)^{\mathsf T} P (\xi_k - \xi_\star) \le \rho^{2k} (\xi_0 - \xi_\star)^{\mathsf T} P (\xi_0 - \xi_\star),$$
which implies that
$$\|\xi_k - \xi_\star\| \le \sqrt{\operatorname{cond}(P)} \, \rho^k \, \|\xi_0 - \xi_\star\|,$$
where $\operatorname{cond}(P)$ is the condition number of $P$. In what follows, we will generalize this semidefinite programming approach to yield feasibility problems that are sufficient to characterize when the closed-loop system (2.2) converges and which provide bounds on the distance to optimality as well. The function
$$V(\xi) = (\xi - \xi_\star)^{\mathsf T} P (\xi - \xi_\star)$$
is called a Lyapunov function for the dynamical system. This function strictly decreases over all trajectories and hence certifies that the algorithm is stable, i.e., converges to nominal values. The conventional method for proving stability of an electromechanical system is to show that some notion of total energy always decreases over time. Lyapunov functions provide a convenient mathematical formulation of this notion of total energy.
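Proposition 2 can be exercised numerically: given $T$ with $\rho(T) < \rho$, a suitable $P$ solves a discrete Lyapunov equation. The sketch below uses SciPy's `solve_discrete_lyapunov`; the matrix $T$ is an arbitrary example with spectral radius below the target rate $\rho = 0.8$.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# A heavy-ball-style companion matrix; its spectral radius is sqrt(0.6) ≈ 0.775
T = np.array([[1.5, -0.6],
              [1.0,  0.0]])
rho = 0.8
assert max(abs(np.linalg.eigvals(T))) < rho

# Solve (T/rho)^T P (T/rho) - P = -I, i.e. T^T P T - rho^2 P = -rho^2 I ⪯ 0
P = solve_discrete_lyapunov((T / rho).T, np.eye(2))

assert np.all(np.linalg.eigvalsh((P + P.T) / 2) > 0)      # P ≻ 0
lmi = T.T @ P @ T - rho ** 2 * P
assert np.all(np.linalg.eigvalsh((lmi + lmi.T) / 2) < 0)  # LMI strictly satisfied
```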
The question for the remainder of the paper is how we can search for Lyapunov-like functions that guarantee algorithmic convergence when $f$ is not quadratic.
3 Proving convergence using integral quadratic constraints
When the function being minimized is quadratic, as explored in Section 2.2, its gradient is affine and the interconnected dynamical system is a simple linear difference equation whose stability and convergence rate are analyzed solely in terms of the eigenvalues of the closed-loop system. When the cost function is not quadratic, the gradient update is not an affine function and hence a different analysis technique is required.
A popular technique in the control theory literature is to use integral quadratic constraints (IQCs) to capture features of the behavior of partially-known components. The term IQC was introduced in the seminal paper by Megretski and Rantzer. In that work, the authors analyzed continuous-time dynamical systems and the constraints involved integrals of quadratic functions, hence the name IQC.
In the development that follows, we repurpose the classical IQC theory for use in algorithm analysis. This requires using discrete time dynamical systems so our constraints will involve sums of quadratics rather than integrals. We also adapt the theory in a way that allows us to certify a specific convergence rate in addition to stability.
3.1 An introduction to IQCs
IQCs provide a convenient framework for analyzing interconnected dynamical systems that contain components that are noisy, uncertain, or otherwise difficult to model. The idea is to replace this troublesome component by a quadratic constraint on its inputs and outputs that is known to be satisfied by all possible instances of the component. If we can certify that the newly constrained system performs as desired, then the original system must do so as well.
Suppose $\phi$ is the troublesome function we wish to analyze. The equation $u = \phi(y)$ can be represented using a block diagram, as in Figure 1.
Although we do not know $\phi$ exactly, we assume that we have some knowledge of the constraints it imposes on the pair $(y, u)$. For example, suppose it is known that $\phi$ satisfies the following properties:
(i) $\phi$ is static and memoryless: $u_k = \phi(y_k)$ for some $\phi : \mathbb{R}^n \to \mathbb{R}^n$.
(ii) $\phi$ is $L$-Lipschitz: $\|\phi(y_1) - \phi(y_2)\| \le L \|y_1 - y_2\|$ for all $y_1, y_2 \in \mathbb{R}^n$.
Now suppose that $y := (y_0, y_1, \dots)$ is an arbitrary sequence of vectors in $\mathbb{R}^n$, and $u := (u_0, u_1, \dots)$ is the output of the unknown function $\phi$ applied to $y$. Property (ii) implies that $\|u_k - u_\star\| \le L \|y_k - y_\star\|$ for all $k$, where $(y_\star, u_\star)$ is any pair of vectors satisfying $u_\star = \phi(y_\star)$ that will serve as a reference point. In matrix form, this is
$$\begin{bmatrix} y_k - y_\star \\ u_k - u_\star \end{bmatrix}^{\mathsf T} \begin{bmatrix} L^2 I & 0 \\ 0 & -I \end{bmatrix} \begin{bmatrix} y_k - y_\star \\ u_k - u_\star \end{bmatrix} \ge 0 \quad \text{for all } k. \qquad (3.1)$$
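As a quick numerical illustration of (3.1), take $\phi$ to be the elementwise $\tanh$ map, which is static, memoryless, and $1$-Lipschitz; the reference point and test inputs below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 1.0
phi = np.tanh                      # elementwise tanh is 1-Lipschitz

y_star = rng.standard_normal(3)    # arbitrary reference point
u_star = phi(y_star)

# Check the pointwise quadratic constraint (3.1) for many random inputs
for _ in range(1000):
    y = rng.standard_normal(3)
    u = phi(y)
    dy, du = y - y_star, u - u_star
    quad = L ** 2 * dy @ dy - du @ du    # (3.1) with M = diag(L^2 I, -I)
    assert quad >= 0
```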
Core idea behind IQC.
Instead of analyzing a system that contains $\phi$, we analyze the system where $\phi$ is removed, but we enforce the constraints (3.1) on the signals $(y, u)$. Since (3.1) is true for all admissible choices of $\phi$, any properties we can prove for the constrained system must hold for the original system as well.
Note that (3.1) is rather special in that the quadratic coupling of $(y, u)$ is pointwise; it only manifests itself as separate quadratic constraints on each pair $(y_k, u_k)$. It is possible to specify more general quadratic constraints that couple different $k$ values, and the key insight above still holds. To do this, introduce auxiliary sequences $\zeta := (\zeta_0, \zeta_1, \dots)$ and $z := (z_0, z_1, \dots)$ together with a map $\Psi$ characterized by the matrices $(A_\Psi, B_\Psi^y, B_\Psi^u, C_\Psi, D_\Psi^y, D_\Psi^u)$ and the recursion
$$\zeta_{k+1} = A_\Psi \zeta_k + B_\Psi^y y_k + B_\Psi^u u_k, \qquad z_k = C_\Psi \zeta_k + D_\Psi^y y_k + D_\Psi^u u_k, \qquad (3.2)$$
where we will define the initial condition $\zeta_0$ shortly. The equations (3.2) define an affine map $z = \Psi(y, u)$. Assuming a reference point $(y_\star, u_\star)$ as before, we can define the associated reference $(\zeta_\star, z_\star)$ that is a fixed point of (3.2). In other words,
$$\zeta_\star = A_\Psi \zeta_\star + B_\Psi^y y_\star + B_\Psi^u u_\star, \qquad z_\star = C_\Psi \zeta_\star + D_\Psi^y y_\star + D_\Psi^u u_\star. \qquad (3.3)$$
We then consider the quadratic forms $(z_k - z_\star)^{\mathsf T} M (z_k - z_\star)$ for a given symmetric matrix $M$ (typically indefinite). Note that each such quadratic form is a function of $(y_0, \dots, y_k, u_0, \dots, u_k)$ that is determined by our choice of $(\Psi, M)$. In our previous example (3.1), $\Psi$ has no dynamics and the corresponding $\Psi$ and $M$ are
$$\Psi = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}, \qquad M = \begin{bmatrix} L^2 I & 0 \\ 0 & -I \end{bmatrix}. \qquad (3.4)$$
In other words, if we use the definitions (3.4), then the constraint $(z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \ge 0$ is the same as (3.1). In general, these sorts of quadratic constraints are called IQCs. We consider four different types of IQCs, which we now define.
Suppose $\phi$ is an unknown map and $\Psi$ is a given linear map of the form (3.2) with initial condition $\zeta_0 = \zeta_\star$. Suppose $(y_\star, u_\star)$ with $u_\star = \phi(y_\star)$ is a given reference point and let $(\zeta_\star, z_\star)$ be the unique solution of (3.3). Suppose $y \in \ell_2$ is an arbitrary square-summable sequence. Let $u := \phi(y)$ and let $z := \Psi(y, u)$ as in (3.2). We say that $\phi$ satisfies the
Pointwise IQC defined by $(\Psi, M)$ if for all $y$ and all $k \ge 0$,
$$(z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \ge 0.$$
Hard IQC defined by $(\Psi, M)$ if for all $y$ and all $K \ge 0$,
$$\sum_{k=0}^{K} (z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \ge 0.$$
$\rho$-Hard IQC defined by $(\Psi, M, \rho)$ if for all $y$ and all $K \ge 0$,
$$\sum_{k=0}^{K} \rho^{-2k} (z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \ge 0.$$
Soft IQC defined by $(\Psi, M)$ if for all $y$,
$$\sum_{k=0}^{\infty} (z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \ge 0.$$
Note that the example (3.1) is a pointwise IQC. Examples of the other types of IQCs will be described in Section 3.3. Note that the sets of maps $\phi$ satisfying the various IQCs defined above are nested as follows:
$$\text{pointwise} \;\subseteq\; \rho\text{-hard} \;\subseteq\; \text{hard} \;\subseteq\; \text{soft}.$$
For example, if $\phi$ satisfies a pointwise IQC defined by $(\Psi, M)$, then it must also satisfy the hard IQC defined by the same $(\Psi, M)$. The notions of hard IQC and the more general soft IQC (sometimes simply called IQC) were introduced in the original robust control literature on IQCs, where their relationship is also discussed. These concepts are useful in proving that a dynamical system is stable, but do not directly allow for the derivation of useful bounds on convergence rates. The definitions of pointwise and $\rho$-hard IQCs are new, and were created for the purpose of better characterizing convergence rates, as we will see in Section 3.2.
Finally, note that $y_\star$ and $u_\star$ are nominal inputs and outputs for the unknown $\phi$, and they can be tuned to certify different fixed points of the interconnected system. We will see in Section 3.2 that certifying a particular convergence rate to some fixed point does not require prior knowledge of the fixed point; only knowledge that the fixed point exists.
3.2 Stability and performance results
In this section, we show how IQCs can be used to prove that iterative algorithms converge and to bound the rate of convergence. In both cases, the certification requires solving a tractable convex program. We note that the original work on IQCs only proved stability (boundedness). Some other works have addressed exponential stability [12, 34, 35], but the emphasis of these works is on proving the existence of an exponential decay rate, and so the rates constructed are very conservative. We require rates that are less conservative, and this is reflected in the inclusion of the rate $\rho$ in the LMI of our main result, Theorem 4.
Consider the interconnection of the algorithm with the nonlinearity and the auxiliary map $\Psi$:
$$\xi_{k+1} = A \xi_k + B u_k, \qquad y_k = C \xi_k, \qquad \zeta_{k+1} = A_\Psi \zeta_k + B_\Psi^y y_k + B_\Psi^u u_k, \qquad z_k = C_\Psi \zeta_k + D_\Psi^y y_k + D_\Psi^u u_k, \qquad u_k = \phi(y_k), \qquad (3.6)$$
where the various $A$, $B$, $C$, $D$ matrices are of appropriate dimensions. The map $\Psi$ is affine rather than linear because of the initial condition $\zeta_0$. As in Section 2, the linear system is the iterative algorithm we wish to analyze, and using the general formalism of Section 3.1, $\phi$ is the nonlinear map that characterizes the feedback. Of course, this framework subsumes the special case of interest in which $u_k = \nabla f(y_k)$ for each $k$. We assume that $\phi$ satisfies an IQC, and this IQC is characterized by a map $\Psi$ and matrix $M$. We can interpret $z$ as a filtered version of the signals $y$ and $u$. These equations can be represented using a block diagram as in Figure 1(a).
More succinctly, (3.6) can be written in terms of the augmented state $\hat\xi_k := (\xi_k, \zeta_k)$ as
$$\hat\xi_{k+1} = \hat A \hat\xi_k + \hat B u_k, \qquad z_k = \hat C \hat\xi_k + \hat D u_k, \qquad u_k = \phi(y_k). \qquad (3.7)$$
Theorem 4 (Main result)
Suppose $\phi$ satisfies the $\rho$-hard IQC defined by $(\Psi, M, \rho)$ where $\rho \in (0, 1]$. Consider the following LMI:
$$\begin{bmatrix} \hat A^{\mathsf T} P \hat A - \rho^2 P & \hat A^{\mathsf T} P \hat B \\ \hat B^{\mathsf T} P \hat A & \hat B^{\mathsf T} P \hat B \end{bmatrix} + \lambda \begin{bmatrix} \hat C & \hat D \end{bmatrix}^{\mathsf T} M \begin{bmatrix} \hat C & \hat D \end{bmatrix} \preceq 0. \qquad (3.9)$$
If (3.9) is feasible for some $P \succ 0$ and $\lambda \ge 0$, then for any initial condition $\xi_0$, we have
$$\|\xi_k - \xi_\star\| \le \sqrt{\operatorname{cond}(P)} \, \rho^k \, \|\xi_0 - \xi_\star\| \quad \text{for all } k,$$
where $\operatorname{cond}(P)$ is the condition number of $P$.
Multiplying (3.9) on the left and right by $(\hat\xi_k - \hat\xi_\star, u_k - u_\star)$ and its transpose yields the per-step inequality (3.10). Multiply (3.10) by $\rho^{-2(k+1)}$ for each $k$ and sum over $k = 0, \dots, K-1$. The first two terms yield a telescoping sum and we obtain
$$\rho^{-2K} (\hat\xi_K - \hat\xi_\star)^{\mathsf T} P (\hat\xi_K - \hat\xi_\star) - (\hat\xi_0 - \hat\xi_\star)^{\mathsf T} P (\hat\xi_0 - \hat\xi_\star) + \lambda \rho^{-2} \sum_{k=0}^{K-1} \rho^{-2k} (z_k - z_\star)^{\mathsf T} M (z_k - z_\star) \le 0.$$
Because $\phi$ satisfies the $\rho$-hard IQC defined by $(\Psi, M, \rho)$, the summation part of the inequality is nonnegative for all $K$. Therefore,
$$(\hat\xi_K - \hat\xi_\star)^{\mathsf T} P (\hat\xi_K - \hat\xi_\star) \le \rho^{2K} (\hat\xi_0 - \hat\xi_\star)^{\mathsf T} P (\hat\xi_0 - \hat\xi_\star),$$
and this completes the proof.
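As an illustration of how small these certification problems are, the sketch below specializes a Theorem-4-style LMI to the one-dimensional Gradient method with the static sector constraint for $S(m, L)$. The reduction to a $2 \times 2$ matrix with scalar $P$ (fixed to $1$ by homogeneity) and a grid search over $\lambda$ is our own simplification for illustration; a real implementation would use an SDP solver.

```python
import numpy as np

# LMI (3.9)-style certificate, specialized (as an illustration) to the scalar
# Gradient method x_{k+1} = x_k - alpha*f'(x_k) with f in S(m, L).
# Here A = 1, B = -alpha, and the static sector constraint contributes
# lambda * [[-2mL, L+m], [L+m, -2]] to the 2x2 LMI. With P = 1 fixed by
# homogeneity, feasibility is a search over lambda >= 0 only.
m, L = 1.0, 10.0
alpha = 2 / (L + m)

def lmi_feasible(rho, lambdas=np.linspace(0, 0.05, 2001), tol=1e-9):
    for lam in lambdas:
        W = np.array([
            [1 - rho ** 2 - 2 * lam * m * L, -alpha + lam * (L + m)],
            [-alpha + lam * (L + m),          alpha ** 2 - 2 * lam ],
        ])
        if np.max(np.linalg.eigvalsh(W)) <= tol:
            return True
    return False

rho_star = (L - m) / (L + m)              # known tight rate for this stepsize
assert lmi_feasible(rho_star + 0.001)     # slightly slower rate: certified
assert not lmi_feasible(rho_star - 0.01)  # faster rate: not certified
```

The bisection point where feasibility is lost matches the worst-case quadratic rate from Section 2.2, which is the behavior one expects from a tight certificate.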
We now make several comments regarding Theorem 4.
Pointwise and hard IQCs
Theorem 4 can easily be adapted to other types of IQCs.
Theorem 4 can also be generalized to the case where $\phi$ satisfies multiple IQCs. Suppose $\phi$ satisfies the $\rho$-hard IQCs defined by $(\Psi_j, M_j, \rho)$ for $j = 1, \dots, r$. Simply redefine the matrices in a manner analogous to (3.7), but where the output is now $z_k = (z_k^{(1)}, \dots, z_k^{(r)})$. Instead of (3.9), use
$$\begin{bmatrix} \hat A^{\mathsf T} P \hat A - \rho^2 P & \hat A^{\mathsf T} P \hat B \\ \hat B^{\mathsf T} P \hat A & \hat B^{\mathsf T} P \hat B \end{bmatrix} + \sum_{j=1}^{r} \lambda_j \begin{bmatrix} \hat C_j & \hat D_j \end{bmatrix}^{\mathsf T} M_j \begin{bmatrix} \hat C_j & \hat D_j \end{bmatrix} \preceq 0 \quad\text{with } \lambda_j \ge 0,$$
and the rest of the proof proceeds as in Theorem 4.
Remark on Lyapunov functions
In the quadratic case treated in Section 2.2, a quadratic Lyapunov function $V(\xi) = (\xi - \xi_\star)^{\mathsf T} P (\xi - \xi_\star)$ is constructed from the solution $P$ in (2.12). In the case of IQCs, such a quadratic function cannot serve as a Lyapunov function because it does not strictly decrease over all trajectories. Nevertheless, Theorem 4 shows how $\rho$-hard IQCs can be used to certify a convergence rate even though no Lyapunov function is explicitly constructed. We can explain this difference more explicitly. If $V$ is a Lyapunov function certifying rate $\rho$, then by definition it satisfies the properties:
(i) $V(\xi) > 0$ for all $\xi \ne \xi_\star$, and $V(\xi_\star) = 0$.
(ii) $V(\xi_{k+1}) \le \rho^2 V(\xi_k)$ for all system trajectories.
Property (ii) implies that
$$V(\xi_k) \le \rho^{2k} V(\xi_0),$$
which, combined with Property (i), implies that $\xi_k \to \xi_\star$. In Theorem 4, we use $V(\hat\xi) = (\hat\xi - \hat\xi_\star)^{\mathsf T} P (\hat\xi - \hat\xi_\star)$, which satisfies (i) but not (ii). So $V$ is not a Lyapunov function in the technical sense. Nevertheless, we prove directly that (3.12) holds, and so the desired result still holds. That is, $V$ serves the same purpose as a Lyapunov function.
3.3 IQCs for convex functions
We will derive three IQCs that are useful for describing gradients of strongly convex functions: the sector (pointwise) IQC, the off-by-one (hard) IQC, and the weighted off-by-one ($\rho$-hard) IQC. In general, gradients of strongly convex functions satisfy an infinite family of IQCs, originally characterized by Zames and Falb for the single-input-single-output case; a generalization of the Zames–Falb IQCs to multidimensional functions has also been derived. Both the sector and off-by-one IQCs are special cases of Zames–Falb IQCs, while the weighted off-by-one IQC is a convex combination of the sector and off-by-one IQCs. While the Zames–Falb family is infinite, the three simple IQCs mentioned above are the only ones used in this paper. IQCs can be used to describe many other types of functions as well, and further examples are available in the literature. We begin with some fundamental inequalities that describe strongly convex functions.
Proposition 5 (basic properties)
Suppose $f \in S(m, L)$. Then the following properties hold for all $x, y \in \mathbb{R}^n$.
Proof. Property (3.13a) follows from the definition of Lipschitz gradients. Properties (3.13b) and (3.13c) are commonly known as co-coercivity. To prove (3.13d), define $g(x) := f(x) - \tfrac{m}{2} \|x\|^2$ and note that $g \in S(0, L - m)$. Applying (3.13b) to $g$ and rearranging, we obtain (3.13d).
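The inequality obtained at the end of this proof can be spot-checked numerically. The sketch below assumes the standard combined co-coercivity bound for $f \in S(m, L)$ (stated in the comment); the quadratic test function and constants are arbitrary choices.

```python
import numpy as np

# Spot-check of the standard combined co-coercivity bound for f in S(m, L)
# (the inequality obtained by applying co-coercivity to g := f - (m/2)||x||^2):
#   (grad f(x) - grad f(y))^T (x - y)
#     >= mL/(m+L) * ||x - y||^2 + 1/(m+L) * ||grad f(x) - grad f(y)||^2
rng = np.random.default_rng(2)
m, L = 1.0, 10.0
Q = np.diag([1.0, 4.0, 10.0])          # quadratic with mI ⪯ Q ⪯ LI
grad = lambda x: Q @ x

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    dg, dx = grad(x) - grad(y), x - y
    lhs = dg @ dx
    rhs = (m * L / (m + L)) * (dx @ dx) + (1 / (m + L)) * (dg @ dg)
    assert lhs >= rhs - 1e-9
```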
Lemma 6 (sector IQC)
Suppose $f_k \in S(m, L)$ for each $k$, and $y_\star \in \mathbb{R}^n$ is a common reference point for the gradients of $f_k$. In other words, $\nabla f_k(y_\star) = u_\star$ for all $k$. Let $u_k := \nabla f_k(y_k)$. Then the map $\phi$ defined by $(\phi(y))_k := \nabla f_k(y_k)$ satisfies the pointwise IQC defined by
$$\Psi = \begin{bmatrix} L I & -I \\ -m I & I \end{bmatrix}, \qquad M = \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix}.$$
The corresponding quadratic inequality is that for all $k$ and all $y_k$, we have
$$(z_k - z_\star)^{\mathsf T} M (z_k - z_\star) = 2 \big(L (y_k - y_\star) - (u_k - u_\star)\big)^{\mathsf T} \big((u_k - u_\star) - m (y_k - y_\star)\big) \ge 0, \qquad (3.14)$$
and therefore (3.14) is equivalent to (3.13d), as required.
In Lemma 6, we use a slight abuse of notation in representing the map $\Psi$. In writing $\Psi$ as a matrix, we mean that $\Psi$ is a static map that operates pointwise on the pairs $(y_k, u_k)$. In other words,
$$z_k = \begin{bmatrix} L I & -I \\ -m I & I \end{bmatrix} \begin{bmatrix} y_k \\ u_k \end{bmatrix} \quad \text{for all } k.$$
Lemma 8 (off-by-one IQC)
Suppose $f \in S(m, L)$ and $y_\star \in \mathbb{R}^n$ is a reference point for the gradient of $f$. In other words, $u_\star = \nabla f(y_\star)$. Let