Design of First-Order Optimization Algorithms via
Sum-of-Squares Programming
Abstract
In this paper, we propose a framework based on sum-of-squares programming to design iterative first-order optimization algorithms for smooth and strongly convex problems. Our starting point is to develop a polynomial matrix inequality as a sufficient condition for exponential convergence of the algorithm. The entries of this matrix are polynomial functions of the unknown parameters (exponential decay rate, step size, momentum coefficient, etc.). We then formulate a polynomial optimization problem in which the objective is to optimize the exponential decay rate over the parameters of the algorithm. Finally, we use sum-of-squares programming as a tractable relaxation of the proposed polynomial optimization problem. We illustrate the utility of the proposed framework by designing a first-order algorithm that shares the same structure as Nesterov's accelerated gradient method.
I Introduction
Many applications in science and engineering involve solving convex optimization problems of the form

$\min_{x \in \mathbb{R}^n} f(x),$

where $x \in \mathbb{R}^n$ is the vector of decision variables and $f \colon \mathbb{R}^n \to \mathbb{R}$ is a convex objective function. In most cases of practical relevance, obtaining a closed-form solution to a convex optimization problem is not possible, and iterative algorithms are designed to find an approximate solution using a finite amount of computational resources. In the development of iterative optimization algorithms, there are several objectives that algorithm designers should consider [1]:


Robust performance guarantees: Algorithms should perform well for a wide range of optimization problems in their class.

Time and space efficiency: Algorithms should be efficient in terms of both time and storage, although these two may conflict.

Accuracy: Algorithms should be able to provide arbitrarily accurate solutions to the problem at a reasonable computational cost.
In general, these goals may conflict. For example, a rapidly convergent method (e.g., Newton’s method) may require too much computer storage and/or computation. In contrast, a robust method, resilient to noise and uncertainties, may also be too slow in finding an optimal solution. Tradeoffs between, for example, convergence rate and storage requirements, or between robustness and speed, are central issues in numerical optimization [1].
In recent years, there has been an increasing interest in using tools from control theory to study the convergence properties of iterative optimization algorithms. The connection between control and optimization is made by interpreting these iterative algorithms as discrete-time dynamical systems with feedback. This interpretation provides many insights and new directions of research. In particular, we can use control tools to study disturbance rejection properties of optimization algorithms [2, 3, 4], study robustness to parameter and model uncertainty [5], and analyze tracking and adaptation capabilities [6]. This interpretation also opens the door to the use of control tools for algorithm design [5, 7].
An important step in tuning and designing an algorithm is the analysis of its theoretical performance (e.g., convergence rate). This step typically amounts to characterizing the worst-case performance of a given iterative algorithm for a specified class of problems. It is not possible to design an efficient algorithm without a firm grasp of the supporting convergence analysis. However, a theoretical analysis of the convergence properties is a mathematically involved process, which is typically done on a case-by-case basis and does not generalize well. These drawbacks make algorithm design even more challenging.
The main aim of this paper is to develop an optimization framework, based on tools from robust control theory and polynomial optimization, to design first-order optimization algorithms for the class of smooth and strongly convex problems, in which the convergence rate is exponential ($\rho^k$ with $0 \le \rho < 1$). To this end, we first represent first-order optimization algorithms as dynamical systems in state-space form, where the matrices representing the system dynamics depend on unknown design parameters (e.g., step size, momentum coefficient, etc.). Next, we utilize a family of nonquadratic Lyapunov functions to derive nonconservative bounds on the exponential convergence rate $\rho$. Based on the results in [5] and [8], we derive a sufficient condition for exponential convergence of the algorithm in terms of a polynomial matrix inequality. The entries of this matrix are polynomial functions of (i) the parameters of the Lyapunov function, (ii) the parameters of the algorithm, and (iii) the unknown convergence rate. We then map this polynomial matrix inequality into a set of polynomial inequalities. This allows us to write down the algorithm design problem as an optimization program with polynomial constraints where the objective is to optimize the upper bound on the exponential decay rate over the space of design variables. Finally, we use the sum-of-squares machinery to find the parameter choice that optimizes the convergence rate of the algorithm. We illustrate our approach by designing a first-order method sharing the same structure as Nesterov's accelerated method.
The rest of the paper is organized as follows. In Section II, we use a Lyapunov analysis framework, in which we cast the problem of finding the worst-case exponential convergence rate of a first-order optimization algorithm as a semidefinite program. In Section III, we turn to algorithm synthesis and use the results of Section II to formulate the algorithm design problem as a polynomial optimization problem, which we then solve using the sum-of-squares machinery. Finally, we illustrate the performance of our design framework via numerical simulations.
II Algorithm Analysis
II-A Preliminaries
We denote the set of real numbers by $\mathbb{R}$, the set of real $n$-dimensional vectors by $\mathbb{R}^n$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $n$-dimensional identity matrix by $I_n$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ the sets of $n \times n$ symmetric, positive semidefinite, and positive definite matrices, respectively. We denote by $\mathrm{vec}(\cdot)$ the linear transformation which converts a matrix into a column vector. For a function $f$, we denote by $\mathrm{dom}\, f$ the effective domain of $f$. The Euclidean norm is denoted by $\|\cdot\|$. For two matrices $A$ and $B$ of arbitrary dimensions, we denote their Kronecker product by $A \otimes B$. We define $\mathbb{R}[x]$ as the polynomial ring in the variables $x = (x_1, \dots, x_n)$ and $\mathbb{R}_d[x]$ as the set of polynomials in these variables of degree at most $d$.
Definition 1 (Smoothness)
A differentiable function $f \colon \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth on $\mathbb{R}^n$ if the inequality

(1a)  $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$

holds for some $L > 0$ and all $x, y \in \mathbb{R}^n$. An equivalent definition is that

(1b)  $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2} \|y - x\|^2$

for all $x, y \in \mathbb{R}^n$.
Definition 2 (Strong convexity)
A differentiable function $f \colon \mathbb{R}^n \to \mathbb{R}$ is $m$-strongly convex on $\mathbb{R}^n$ if the inequality

(2a)  $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{m}{2} \|y - x\|^2$

holds for some $m > 0$ and all $x, y \in \mathbb{R}^n$. An equivalent definition is that

(2b)  $(\nabla f(x) - \nabla f(y))^\top (x - y) \ge m \|x - y\|^2$

for all $x, y \in \mathbb{R}^n$.
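Both definitions are easy to sanity-check numerically on a strongly convex quadratic, where $m$ and $L$ are the extreme eigenvalues of the Hessian. A minimal sketch (the quadratic and sample points are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative objective f(x) = 0.5 * x^T Q x with Q symmetric positive definite.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
grad = lambda x: Q @ x
f = lambda x: 0.5 * x @ Q @ x

# For a quadratic, m and L are the smallest/largest eigenvalues of Q.
eigs = np.linalg.eigvalsh(Q)
m, L = eigs[0], eigs[-1]

for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    # (1a): Lipschitz gradient  ||grad f(x) - grad f(y)|| <= L ||x - y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # (2a): strong convexity  f(y) >= f(x) + grad f(x)^T (y - x) + (m/2)||y - x||^2
    assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * m * np.linalg.norm(y - x) ** 2 - 1e-9

print("both inequalities hold on all samples")
```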
II-B Algorithm Representation
Consider the following optimization problem

(4)  $\min_{x \in \mathbb{R}^n} f(x),$

where $f$ is $L$-smooth and $m$-strongly convex. Under this assumption, the well-known optimality condition for a point $x^\star$ is that $\nabla f(x^\star) = 0$.
Consider a first-order algorithm that generates a sequence of points $\{x_k\}$ that solves (4); in other words, $x_k \to x^\star$ as $k \to \infty$, where $x^\star = \arg\min_x f(x)$. We can represent the algorithm in the following state-space form [5, 8]:

(5)  $\xi_{k+1} = A(\theta)\, \xi_k + B(\theta)\, u_k, \quad y_k = C(\theta)\, \xi_k, \quad x_k = D(\theta)\, \xi_k, \quad u_k = \nabla f(y_k),$

where $\xi_k$ is the state of the algorithm, $y_k$ is the output at which the gradient is evaluated, and $x_k$ is another output, which we assume converges to the optimal point, i.e., $x_k \to x^\star$ with $\nabla f(x^\star) = 0$. Therefore, the fixed points of the algorithm must naturally satisfy

(6)  $\xi^\star = A(\theta)\, \xi^\star, \quad y^\star = C(\theta)\, \xi^\star, \quad x^\star = D(\theta)\, \xi^\star, \quad \nabla f(y^\star) = 0.$

Note that the matrices $A$, $B$, $C$, and $D$ in (5) are parameterized by the vector $\theta$, which is the concatenation of the parameters of the algorithm (e.g., step size, momentum coefficient, etc.). We give two examples below.
Example 1
The gradient method is given by the recursion

(7)  $x_{k+1} = x_k - \alpha \nabla f(x_k),$

where $\alpha > 0$ is the step size. For this algorithm, $\theta = \alpha$ is the only parameter. With the state $\xi_k = x_k$, the matrices $A$, $B$, $C$, and $D$ are given by

(8)  $A = I_n, \quad B = -\alpha I_n, \quad C = D = I_n.$
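As a quick illustration of (7)-(8), the sketch below runs the gradient method in the state-space form (5) on an illustrative quadratic (for which $C = D = I_n$ and the state is $x_k$ itself); the problem data and step size are example choices:

```python
import numpy as np

# Gradient method x_{k+1} = x_k - alpha * grad f(x_k) on an illustrative
# strongly convex quadratic f(x) = 0.5 x^T Q x (so the minimizer is x* = 0).
Q = np.diag([1.0, 10.0])          # m = 1, L = 10
alpha = 2.0 / (1.0 + 10.0)        # classical step size 2/(m + L)

# State-space data of (5) for the gradient method: the state is x_k itself.
n = 2
A, B = np.eye(n), -alpha * np.eye(n)   # and C = D = I_n

x = np.array([1.0, 1.0])
for _ in range(200):
    u = Q @ x                 # u_k = grad f(y_k), with y_k = x_k here
    x = A @ x + B @ u         # xi_{k+1} = A xi_k + B u_k

print(np.linalg.norm(x))      # distance to the minimizer x* = 0
```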
Example 2
Consider the following recursion defined on the two sequences $\{x_k\}$ and $\{y_k\}$,

(9)  $x_{k+1} = x_k + \beta (x_k - x_{k-1}) - \alpha \nabla f(y_k), \quad y_k = x_k + \eta (x_k - x_{k-1}),$

where $\alpha$, $\beta$, and $\eta$ are nonnegative scalars. By defining the state vector $\xi_k = (x_k^\top, x_{k-1}^\top)^\top$ and the parameter vector $\theta = (\alpha, \beta, \eta)$, we can represent (9) in the canonical form (5), as follows,

(10)  $\xi_{k+1} = \begin{bmatrix} (1+\beta) I_n & -\beta I_n \\ I_n & 0 \end{bmatrix} \xi_k + \begin{bmatrix} -\alpha I_n \\ 0 \end{bmatrix} u_k, \quad y_k = \begin{bmatrix} (1+\eta) I_n & -\eta I_n \end{bmatrix} \xi_k, \quad x_k = \begin{bmatrix} I_n & 0 \end{bmatrix} \xi_k.$

Therefore, the matrices are given by

(11)  $A = \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix} \otimes I_n, \quad B = \begin{bmatrix} -\alpha \\ 0 \end{bmatrix} \otimes I_n, \quad C = \begin{bmatrix} 1+\eta & -\eta \end{bmatrix} \otimes I_n, \quad D = \begin{bmatrix} 1 & 0 \end{bmatrix} \otimes I_n.$

Notice that depending on the selection of $(\alpha, \beta, \eta)$, (9) describes various existing algorithms. For example, the gradient method corresponds to the case $\beta = \eta = 0$ (see Example 1). In Nesterov's accelerated method, we have $\eta = \beta$ [10]. Finally, we recover the heavy-ball method by setting $\eta = 0$ [11]. In Table I, we provide various parameter selections and convergence rates for the gradient method, the heavy-ball method, and Nesterov's accelerated method [10, 11, 5].
Algorithm   | Parameters                                                                        | Decay Rate $\rho$
Gradient    | $\alpha = \frac{2}{L+m}$                                                          | $\frac{\kappa-1}{\kappa+1}$
Nesterov    | $\alpha = \frac{1}{L}$, $\beta = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$         | $\sqrt{1 - \frac{1}{\sqrt{\kappa}}}$
Heavy-ball  | $\alpha = \frac{4}{(\sqrt{L}+\sqrt{m})^2}$, $\beta = (\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})^2$ | $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ (quadratics only)

Table I: Standard parameter selections and the corresponding convergence rates ($\kappa = L/m$) for the gradient, heavy-ball, and Nesterov's accelerated methods [10, 11, 5].

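The three members of the family can be compared empirically; the sketch below runs (9) on an illustrative quadratic with the standard parameter choices of Table I (the quadratic and iteration count are arbitrary):

```python
import numpy as np

# Multi-step family (9): x_{k+1} = x_k + beta (x_k - x_{k-1}) - alpha grad f(y_k),
# y_k = x_k + eta (x_k - x_{k-1}), run on a quadratic f(x) = 0.5 x^T Q x.
Q = np.diag([1.0, 100.0])
m, L = 1.0, 100.0
kappa = L / m
grad = lambda x: Q @ x

def run(alpha, beta, eta, iters=300):
    x_prev = x = np.array([1.0, 1.0])
    for _ in range(iters):
        y = x + eta * (x - x_prev)
        x_next = x + beta * (x - x_prev) - alpha * grad(y)
        x_prev, x = x, x_next
    return np.linalg.norm(x)          # distance to x* = 0

sq = np.sqrt(kappa)
err_grad = run(2 / (L + m), 0.0, 0.0)                                     # gradient
err_nest = run(1 / L, (sq - 1) / (sq + 1), (sq - 1) / (sq + 1))           # Nesterov (eta = beta)
err_hb   = run(4 / (np.sqrt(L) + np.sqrt(m))**2, ((sq - 1) / (sq + 1))**2, 0.0)  # heavy-ball (eta = 0)
print(err_grad, err_nest, err_hb)
```

On this quadratic, both momentum methods reach a far smaller error than the plain gradient method after the same number of iterations, matching the ordering of the rates in Table I.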
II-C Exponential Convergence via SDPs
To measure the progress of the algorithm in (5) towards optimality, we make use of the following Lyapunov function [8]:

(12)  $V(\xi_k) = f(x_k) - f(x^\star) + (\xi_k - \xi^\star)^\top P\, (\xi_k - \xi^\star),$

where $x^\star$ is the minimizer of $f$, and $P$ is a positive semidefinite matrix that does not depend on $k$ and is to be determined. The first term in (12) is the suboptimality of $x_k$ and the second term is a weighted "distance" between the state $\xi_k$ and the fixed point $\xi^\star$. Notice that by this definition, $V$ is positive everywhere and zero at optimality. Suppose we select $P$ such that the Lyapunov function satisfies the inequality

(13)  $V(\xi_{k+1}) \le \rho^2\, V(\xi_k)$

for some $0 \le \rho < 1$. By iterating (13) down to $k = 0$, we can conclude $V(\xi_k) \le \rho^{2k} V(\xi_0)$ for all $k$. This implies that

(14)  $f(x_k) - f(x^\star) \le \rho^{2k}\, V(\xi_0).$

In other words, the algorithm exhibits an $\mathcal{O}(\rho^{2k})$ convergence rate if the Lyapunov function satisfies the decrease condition (13). In the following result, developed in [8], we present a matrix inequality whose feasibility implies (13) and, hence, the exponential convergence of the algorithm. For notational convenience, we drop the argument $\theta$ wherever the dependence of the matrices $A$, $B$, $C$, and $D$ on $\theta$ is clear from the context.
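The decrease condition (13) can be checked directly in a simple special case: for gradient descent on a quadratic with $\alpha = 1/L$, the degenerate choice $P = 0$ (so $V_k$ is just the suboptimality) already certifies $\rho = 1 - m/L$. The sketch below assumes this quadratic setting; the data are illustrative:

```python
import numpy as np

# Check (13) for gradient descent on a quadratic, using the simplest
# admissible Lyapunov function: P = 0, so that V_k = f(x_k) - f(x*).
# On quadratics, alpha = 1/L certifies rho = 1 - m/L.
Q = np.diag([1.0, 10.0])
m, L = 1.0, 10.0
alpha, rho = 1.0 / L, 1.0 - m / L

f = lambda x: 0.5 * x @ Q @ x      # minimizer x* = 0, with f(x*) = 0
x = np.array([1.0, -2.0])
for _ in range(50):
    x_next = x - alpha * Q @ x
    # Inequality (13): V_{k+1} <= rho^2 V_k
    assert f(x_next) <= rho**2 * f(x) + 1e-12
    x = x_next
print("V_k decreases at rate rho^2 =", rho**2)
```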
Theorem 1
Proof.
See [8]. ∎
According to Theorem 1, any pair $(\rho, P)$ that satisfies the matrix inequality in (16) certifies an $\mathcal{O}(\rho^{2k})$ convergence rate for the algorithm. In particular, the fastest certified convergence rate can be found by solving the following optimization problem:

(18)  minimize  $\rho^2$
      subject to  the matrix inequality (16) and $P \succeq 0$,

where the problem data ($A$, $B$, $C$, $D$, $m$, and $L$) are given, and the decision variables are $\rho$ and $P$, together with any multipliers appearing in (16). Note that (18) is a quasiconvex program, since the constraint is affine in the decision variables for a fixed $\rho$. We can therefore use a bisection search to find the smallest possible value of the decay rate $\rho$. This approach has been pursued in [5] using a different Lyapunov function.
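The bisection search can be sketched for the gradient method. As a simplification, the code below uses the quadratic-Lyapunov/sector-constraint LMI of [5] (a special case with no function term in the Lyapunov function, not the exact matrix inequality (16) of this paper), normalizes the scalar Lyapunov weight to one, and replaces the inner SDP by a grid search over the single multiplier; all numeric choices are illustrative:

```python
import numpy as np

# Bisection on rho for x_{k+1} = x_k - alpha grad f(x_k), f being m-strongly
# convex and L-smooth. After the reduction of Remark 1 (A = 1, B = -alpha,
# C = 1) and normalizing the scalar Lyapunov weight to p = 1, the certificate
# of [5] asks for a multiplier lam >= 0 making the 2x2 matrix
#   [[1 - rho^2 - 2*m*L*lam,  -alpha + (m+L)*lam],
#    [-alpha + (m+L)*lam,      alpha^2 - 2*lam  ]]
# negative semidefinite.
m, L = 1.0, 10.0
alpha = 1.0 / L

def feasible(rho, lams=np.linspace(0.0, 0.05, 2001)):
    f1 = 1.0 - rho**2 - 2.0 * m * L * lams        # (1,1) entry
    f2 = alpha**2 - 2.0 * lams                    # (2,2) entry
    off = -alpha + (m + L) * lams                 # off-diagonal entry
    # A 2x2 symmetric M satisfies M <= 0 iff both diagonals <= 0 and det(M) >= 0.
    return bool(np.any((f1 <= 0) & (f2 <= 0) & (f1 * f2 - off**2 >= 0)))

lo, hi = 0.0, 1.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if feasible(mid) else (mid, hi)

print("certified rate:", hi)   # for alpha = 1/L this approaches 1 - m/L = 0.9
```

The bisection converges to the smallest $\rho$ for which the certificate is feasible; for $\alpha = 1/L$ and $\kappa = 10$ this is $1 - 1/\kappa = 0.9$.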
Remark 1
We can often exploit special structure in the matrices $A$, $B$, $C$, and $D$ to reduce the dimension of the LMI (16). For many algorithms, these matrices are of the form
$A = \hat{A} \otimes I_n$, $B = \hat{B} \otimes I_n$, $C = \hat{C} \otimes I_n$, $D = \hat{D} \otimes I_n$,
where $\hat{A}$, $\hat{B}$, $\hat{C}$, and $\hat{D}$ are lower-dimensional matrices independent of $n$ [5, §4.2]. By selecting $P = \hat{P} \otimes I_n$, where $\hat{P}$ is a lower-dimensional matrix, we can factor out all the Kronecker products from the matrices and make the dimension of the corresponding LMI (16) independent of $n$. In particular, a multistep method with $d$ steps yields a $(d+1) \times (d+1)$ LMI. For instance, the gradient method ($d = 1$) and Nesterov's accelerated method ($d = 2$) yield $2 \times 2$ and $3 \times 3$ LMIs, respectively.
III Algorithm Synthesis
We saw in the previous section that the exponential stability of a first-order algorithm can be certified by solving an SDP feasibility problem. More precisely, given an algorithm of the form (5) with a pre-specified value of $\theta$, we can search for a suitable Lyapunov function and establish a rate bound for the algorithm by solving an LMI problem. A natural question to ask is whether we can leverage the same framework to design new algorithms and provide a convergence guarantee simultaneously. We formalize this problem as follows.
Problem 1
Let $f$ be $L$-smooth and $m$-strongly convex, where $0 < m \le L$. Given a parameterized family of first-order methods given by (5), tune the parameters $\theta$ of the algorithm, within a compact set $\Theta$, such that the resulting algorithm converges at an $\mathcal{O}(\rho^{2k})$ rate to $x^\star$ with a minimal $\rho$.
Using the result of Theorem 1, Problem 1 can be formally written as the following nonconvex optimization problem:

(19)  minimize  $\rho^2$
      subject to  the matrix inequality (16), $P \succeq 0$, and $\theta \in \Theta$,

where the decision variables are now $\rho$, $P$, and $\theta$. We recall from (15) that the parameter $\theta$ appears in the optimization problem through the matrices $A(\theta)$, $B(\theta)$, $C(\theta)$, and $D(\theta)$. Intuitively, (19) searches over the parameters of the algorithm for optimal performance (minimal $\rho$) while respecting the stability condition (13), which is imposed by the matrix inequality constraint in (19).
Note that if we fix $\theta$, which means the algorithm parameters are given, then (19) reduces to the quasiconvex program in (18). This suggests a natural but inefficient way to solve (19): we perform an exhaustive search over the parameter space $\Theta$ and solve (19), with $\theta$ fixed, to find the optimal $\rho$ for each value of $\theta$. We therefore need to solve a sequence of quasiconvex programs in order to find the optimal tuning. As an illustration, we implement this approach to tune the parameters of Nesterov's accelerated method (the algorithm in (9) with $\eta = \beta$), where the tuning parameters are the step size $\alpha$ and the momentum coefficient $\beta$. In Figure 1, we plot the level curves of the convergence factor $\rho$ as a function of $(\alpha, \beta)$.
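One cheap surrogate for this sequence of quasiconvex programs, valid only for quadratic objectives, is to compute the worst-case spectral radius of the $2 \times 2$ mode matrix in closed form and grid over $(\alpha, \beta)$. Everything below (the grids, $m$, and $L$) is an illustrative choice:

```python
import numpy as np

# Exhaustive (alpha, beta) search for the Nesterov structure (eta = beta),
# restricted to quadratics so that the worst-case rate is the largest
# spectral radius of the 2x2 mode dynamics over curvatures mu in [m, L].
m, L = 1.0, 10.0
mus = np.linspace(m, L, 101)

def worst_rate(alpha, beta):
    # Each eigen-mode obeys z^2 - a z + b = 0 with a = (1+beta)(1-alpha*mu)
    # and b = beta*(1-alpha*mu); the rate is the largest root modulus.
    a = (1 + beta) * (1 - alpha * mus)
    b = beta * (1 - alpha * mus)
    disc = np.asarray(a**2 - 4 * b, dtype=complex)
    r = np.sqrt(disc)
    return np.max(np.maximum(np.abs((a + r) / 2), np.abs((a - r) / 2)))

best = (np.inf, None, None)
for alpha in np.linspace(0.01, 2.0 / L, 40):
    for beta in np.linspace(0.0, 0.99, 40):
        rho = worst_rate(alpha, beta)
        if rho < best[0]:
            best = (rho, alpha, beta)

print("best rate %.3f at alpha=%.3f, beta=%.3f" % best)
```

Even this coarse grid finds a tuning that beats the best certified gradient-method rate $(\kappa-1)/(\kappa+1) \approx 0.818$ for $\kappa = 10$, illustrating the benefit of momentum.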
Note that, in general, the exhaustive search approach described above becomes prohibitively costly as the dimension and/or the granularity of the search space increase. We therefore need an efficient way to solve (19). Although this problem is nonconvex, the special structure of the constraint set makes the problem tractable. To see this, we note that all the entries of the constraint matrix in (19) are polynomial functions of the decision variables. This matrix inequality constraint can be alternatively “scalarized” and rewritten in terms of scalar polynomial inequalities (e.g., by considering minors, or coefficients of the characteristic polynomial). The resulting optimization problem is of the form
minimize  $p_0(z)$
subject to  $p_i(z) \ge 0, \quad i = 1, \dots, N,$
where $z$ is the vector of decision variables, and $p_0, p_1, \dots, p_N$ are all polynomials. It is here that we can draw a direct connection from the algorithm design problem in (19) to polynomial optimization, for which tractable relaxations are available in many cases [12]. We briefly introduce polynomial optimization next.
III-A Sum-of-Squares Programs and Polynomial Optimization
The main difficulty in solving problems involving polynomial constraints, such as the ones in (19), is the lack of efficient numerical methods able to handle multivariate nonnegativity conditions. A computationally efficient approach is to use sumofsquares (SOS) relaxations [13, 14]. In what follows, we introduce the basic results used in our derivations.
Definition 3
A multivariate polynomial $p(x)$ of degree $2d$ in the variables $x = (x_1, \dots, x_n)$ with real coefficients is a sum-of-squares (SOS) if there exist polynomials $q_1(x), \dots, q_r(x)$ such that

(20)  $p(x) = \sum_{i=1}^{r} q_i(x)^2.$
We will denote the set of SOS polynomials in $n$ variables of degree at most $2d$ by $\Sigma_{n,2d}$. A polynomial being an SOS is sufficient to certify its global nonnegativity, since any $p \in \Sigma_{n,2d}$ satisfies $p(x) \ge 0$ for all $x \in \mathbb{R}^n$. Hence, $\Sigma_{n,2d} \subseteq P_{n,2d}$, where $P_{n,2d}$ is the set of nonnegative polynomials in $n$ variables of degree at most $2d$. Given a polynomial $p \in \mathbb{R}_{2d}[x]$, the existence of an SOS decomposition of the form (20) is equivalent to the existence of a positive semidefinite matrix $Q$, called the Gram matrix, such that

(21)  $p(x) = z(x)^\top Q\, z(x),$

where $z(x)$ is the vector of all monomials in $x$ of degree at most $d$. Notice that the equality constraint in (21) is affine in the matrix $Q$, since the expansion of the right-hand side results in a polynomial whose coefficients depend affinely on the entries of $Q$ and must equal the corresponding coefficients of the given polynomial $p$. Hence, finding an SOS decomposition is computationally equivalent to finding a positive semidefinite matrix $Q$ subject to the affine constraint in (21), which is a semidefinite program [13, 15].
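As a toy instance of (20)-(21), consider $p(x) = x^4 + 4x^3 + 6x^2 + 4x + 1 = ((x+1)^2)^2$: with the monomial vector $z(x) = (1, x, x^2)$, the rank-one matrix $Q = qq^\top$ with $q = (1, 2, 1)$ is a valid Gram matrix. The sketch below only verifies this certificate numerically; finding $Q$ in general is the semidefinite program just described:

```python
import numpy as np

# SOS certificate for p(x) = x^4 + 4x^3 + 6x^2 + 4x + 1 = ((x+1)^2)^2.
# With z(x) = [1, x, x^2], the Gram matrix Q = q q^T, q = (1, 2, 1),
# satisfies z^T Q z = (1 + 2x + x^2)^2 = p(x).
p = np.poly1d([1, 4, 6, 4, 1])
q = np.array([1.0, 2.0, 1.0])
Q = np.outer(q, q)

# Q is positive semidefinite (it is rank-one with nonnegative eigenvalue) ...
assert np.min(np.linalg.eigvalsh(Q)) >= -1e-12
# ... and z(x)^T Q z(x) reproduces p(x) at sample points.
for x in np.linspace(-3, 3, 13):
    z = np.array([1.0, x, x**2])
    assert abs(z @ Q @ z - p(x)) < 1e-9
print("Gram certificate verified")
```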
Using the notion of SOS polynomials, we can now define the class of sum-of-squares programs (SOSP). An SOSP is an optimization program in which we maximize a linear function over a feasible set given by the intersection of an affine family of polynomials and the set of SOS polynomials, as described below [16]:

maximize  $b^\top u$
subject to  $p_0(x) + \sum_{i=1}^{N} u_i\, p_i(x) \in \Sigma_{n,2d},$

where $b \in \mathbb{R}^N$, and $p_0, p_1, \dots, p_N$ are given multivariate polynomials in $x$. Despite their generality, it can be proved that SOSPs are equivalent to SDPs; hence, they are convex programs and can be solved in polynomial time [15]. In recent years, SOSPs have been used as convex relaxations for various computationally hard optimization and control problems (see, for example, [13, 17, 14, 18] and the volume [19]).
Finding an SOS decomposition of a multivariate polynomial provides an explicit certificate of its nonnegativity for all values of the indeterminates $x$. In order to design an algorithmic procedure to search for the optimal convergence rate, we need a similar idea to ensure that a polynomial matrix is positive semidefinite for every value of the indeterminates. A natural definition is as follows [20]:
Definition 4
Consider a symmetric matrix $S(x)$ with polynomial entries $S_{ij}(x) \in \mathbb{R}[x]$, and let $y = (y_1, \dots, y_r)$ be a vector of new indeterminates. Then $S(x)$ is a sum-of-squares matrix (SOSM) if $y^\top S(x)\, y$ is an SOS polynomial in $(x, y)$.
Obviously, a polynomial matrix $S(x)$ being an SOSM provides an explicit certificate for $S(x)$ being positive semidefinite for all $x$. In our particular application, it is computationally convenient to use an alternative certificate based on Descartes' rule of signs [21]. In particular, we will use the rule of signs to certify that a given polynomial matrix is positive semidefinite by certifying that three multivariate polynomials are SOS. As a result of our analysis, we will design an optimization algorithm with the fastest convergence rate by solving a polynomial optimization problem over a feasible set defined by polynomial inequalities. In general, the resulting problem will be of the form
minimize  $p_0(z)$
subject to  $p_i(z) \ge 0, \quad i = 1, \dots, N,$
where $p_i \in \mathbb{R}[z]$ for all $i = 0, 1, \dots, N$. Under mild technical conditions, this problem can be solved using the following SOSP [14]:

sup  $\gamma$
s.t.  $p_0(z) - \gamma = \sigma_0(z) + \sum_{i=1}^{N} \sigma_i(z)\, p_i(z), \quad \sigma_0, \dots, \sigma_N \ \text{SOS}.$

In practice, a relaxation of this problem is solved numerically, where the relaxation results from truncating the degrees of the SOS multipliers $\sigma_i$, as described in [14]. In what follows, we show how to use this methodology to design optimization algorithms with optimal exponential convergence rates.
IV Numerical Simulations
In this section, we validate the proposed approach on two examples: the gradient method and Nesterov's accelerated method. There are several software packages that convert polynomial optimization problems into a hierarchy of SDP relaxations and then call an external solver to solve them. In this paper, we use the software GloptiPoly 3 [22], which is oriented toward global polynomial optimization problems.
IV-A The Gradient Method
Consider the gradient method of Example 1. By choosing $P = \hat{P} \otimes I_n$ with a scalar $\hat{P}$, we can apply the dimensionality reduction outlined in Remark 1 and reduce the dimension of the LMI. After dimensionality reduction, the matrices in the LMI (16) read as $\hat{A} = 1$, $\hat{B} = -\alpha$, and $\hat{C} = 1$. By substituting these matrices back in (16), we obtain a $2 \times 2$ matrix inequality constraint whose entries are all polynomial functions of $\rho$, $\alpha$, and the Lyapunov parameters. Therefore, the design problem in (19) for the gradient method is equivalent to the following polynomial optimization problem:
(23)  minimize  $\rho^2$
      subject to  the polynomial inequalities obtained by scalarizing the matrix inequality in (19),
with decision variables $\rho$, $\alpha$, and the Lyapunov parameters. In Figure 2, we plot the optimal convergence rate $\rho$, obtained by solving (23), for various values of the condition number $\kappa = L/m$. We also plot the theoretical decay rate from Table I. We observe that we obtain a slightly faster convergence rate (in terms of the suboptimality $f(x_k) - f(x^\star)$) than the theoretical bound.
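A sketch of this synthesis step: the scalarized feasibility conditions for a $2 \times 2$ matrix inequality (non-positive diagonal entries and non-negative determinant), wrapped in an outer search over the step size. As before, this uses a simplified quadratic-Lyapunov certificate from [5] rather than the paper's exact inequality, with the multiplier handled by a grid; for $m = 1$, $L = 10$ the classical optimum is $\alpha = 2/(L+m)$ with $\rho = (\kappa-1)/(\kappa+1) = 9/11$:

```python
import numpy as np

# Joint design of (alpha, rho) for the gradient method via the scalarized
# 2x2 conditions: diagonal entries <= 0 and determinant >= 0.
m, L = 1.0, 10.0

def feasible(rho, alpha, lams=np.linspace(0.0, 0.1, 4001)):
    f1 = 1.0 - rho**2 - 2.0 * m * L * lams
    f2 = alpha**2 - 2.0 * lams
    off = -alpha + (m + L) * lams
    return bool(np.any((f1 <= 0) & (f2 <= 0) & (f1 * f2 - off**2 >= 0)))

def best_rate(alpha):
    # Bisection on rho for a fixed step size.
    if not feasible(1.0, alpha):
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(30):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if feasible(mid, alpha) else (mid, hi)
    return hi

alphas = np.linspace(0.02, 0.2, 91)
rates = [best_rate(a) for a in alphas]
i = int(np.argmin(rates))
print("alpha* = %.3f, rho* = %.4f" % (alphas[i], rates[i]))
```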
IV-B Nesterov's Accelerated Method
Consider the algorithm of Example 2 with $\eta = \beta$, which corresponds to Nesterov's accelerated method. The state-space matrices of this algorithm are given in (11). By applying the dimensionality reduction of Remark 1, we arrive at the following matrices that appear in the LMI (15):

(24)  $\hat{A} = \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix}, \quad \hat{B} = \begin{bmatrix} -\alpha \\ 0 \end{bmatrix}, \quad \hat{C} = \begin{bmatrix} 1+\beta & -\beta \end{bmatrix},$

where the remaining factor is given in (3b) and $\hat{P}$ is now a $2 \times 2$ positive semidefinite matrix. The corresponding optimization problem can be written as

(25)  minimize  $\rho^2$  subject to  the resulting polynomial matrix inequality, with $\hat{P} \succeq 0$ and $(\alpha, \beta) \in \Theta$.
We then replace the polynomial matrix inequality in (25) with three polynomial inequalities by applying Descartes' rule of signs, as described in Section III-A. In Figure 3, we plot the optimal decay rate $\rho$, obtained by solving (25), for various values of the condition number $\kappa$. We observe that the obtained rate is better than Nesterov's rate and worse than the theoretical lower bound, which is achieved by the heavy-ball method on quadratics (see Table I).
V Conclusions
We have considered the problem of designing first-order iterative algorithms for solving smooth and strongly convex optimization problems. By using a family of parameterized nonquadratic Lyapunov functions, we presented a polynomial matrix inequality as a sufficient condition for exponential stability of the algorithm. All the entries of this matrix have a polynomial dependence on the unknown parameters, namely the parameters of the algorithm, the parameters of the Lyapunov function, and the convergence rate. We then formulated a polynomial optimization problem to search for the optimal convergence rate subject to the polynomial matrix inequality. Finally, we proposed a sum-of-squares relaxation to solve the resulting design problem. We illustrated the proposed approach via numerical simulations.
References
 [1] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 1999.
 [2] J. Wang and N. Elia, “Distributed agreement in the presence of noise,” in Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pp. 1575–1581, IEEE, 2009.
 [3] J. Wang and N. Elia, “A control perspective for centralized and distributed convex optimization,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 3800–3805, IEEE, 2011.
 [4] J. Wang and N. Elia, “Control approach to distributed optimization,” in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 557–561, IEEE, 2010.
 [5] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
 [6] M. Fazlyab, S. Paternain, V. M. Preciado, and A. Ribeiro, “Predictioncorrection interiorpoint method for timevarying convex optimization,” IEEE Transactions on Automatic Control, 2017.
 [7] S. Cyrus, B. Hu, B. V. Scoy, and L. Lessard, “A robust accelerated optimization algorithm for strongly convex functions,” arXiv eprint, 2017.
 [8] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado, “Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems,” arXiv preprint arXiv:1705.03615, 2017.
 [9] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
 [10] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” in Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
 [11] B. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1 – 17, 1964.
 [12] G. Blekherman, P. A. Parrilo, and R. R. Thomas, Semidefinite optimization and convex algebraic geometry. SIAM, 2012.
 [13] P. A. Parrilo, Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
 [14] J. B. Lasserre, “Global optimization with polynomials and the problem of moments,” SIAM Journal on Optimization, vol. 11, no. 3, pp. 796–817, 2001.
 [15] S. Prajna, A. Papachristodoulou, P. Seiler, and P. A. Parrilo, “SOSTOOLS: Sum of squares optimization toolbox for MATLAB, user’s guide,” 2004.
 [16] G. Blekherman, P. A. Parrilo, and R. R. Thomas, Semidefinite optimization and convex algebraic geometry. SIAM, 2012.
 [17] P. A. Parrilo, “Semidefinite programming relaxations for semialgebraic problems,” Mathematical programming, vol. 96, no. 2, pp. 293–320, 2003.
 [18] S. Prajna, A. Papachristodoulou, P. Seiler, and P. A. Parrilo, “Sostools: Control applications and new developments,” in Computer Aided Control Systems Design, 2004 IEEE International Symposium on, pp. 315–320, IEEE, 2004.
 [19] D. Henrion and A. Garulli, Positive polynomials in control, vol. 312. Springer Science & Business Media, 2005.
 [20] K. Gatermann and P. A. Parrilo, “Symmetry groups, semidefinite programs, and sums of squares,” Journal of Pure and Applied Algebra, vol. 192, no. 1–3, pp. 95–128, 2004.
 [21] D. Curtiss, “Recent extensions of Descartes’ rule of signs,” Annals of Mathematics, pp. 251–278, 1918.
 [22] D. Henrion, J.-B. Lasserre, and J. Löfberg, “GloptiPoly 3: moments, optimization and semidefinite programming,” Optimization Methods & Software, vol. 24, no. 4–5, pp. 761–779, 2009.