Design of First-Order Optimization Algorithms via Sum-of-Squares Programming

Mahyar Fazlyab, Manfred Morari, Victor M. Preciado. The authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania. Email: {mahyarfa,morari,preciado}@seas.upenn.edu. Work supported by the NSF under grants CAREER-ECCS-1651433 and IIS-1447470.
Abstract

In this paper, we propose a framework based on sum-of-squares programming to design iterative first-order optimization algorithms for smooth and strongly convex problems. Our starting point is to develop a polynomial matrix inequality as a sufficient condition for exponential convergence of the algorithm. The entries of this matrix are polynomial functions of the unknown parameters (exponential decay rate, stepsize, momentum coefficient, etc.). We then formulate a polynomial optimization problem in which the objective is to optimize the exponential decay rate over the parameters of the algorithm. Finally, we use sum-of-squares programming as a tractable relaxation of the proposed polynomial optimization problem. We illustrate the utility of the proposed framework by designing a first-order algorithm that shares the same structure as Nesterov's accelerated gradient method.

I Introduction

Many applications in science and engineering involve solving convex optimization problems of the form

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x),$$

where $x \in \mathbb{R}^d$ is the vector of decision variables and $f \colon \mathbb{R}^d \to \mathbb{R}$ is a convex objective function. In most cases of practical relevance, obtaining a closed-form solution to a convex optimization problem is not possible, and iterative algorithms are designed to find an approximate solution using a finite amount of computational resources. In the development of iterative optimization algorithms, there are several objectives that algorithm designers should consider [1]:


  • Robust performance guarantees: Algorithms should perform well for a wide range of optimization problems in their class.

  • Time and space efficiency: Algorithms should be efficient in terms of both time and storage, although these two may conflict.

  • Accuracy: Algorithms should be able to provide arbitrarily accurate solutions to the problem at a reasonable computational cost.

In general, these goals may conflict. For example, a rapidly convergent method (e.g., Newton’s method) may require too much computer storage and/or computation. In contrast, a robust method, resilient to noise and uncertainties, may also be too slow in finding an optimal solution. Trade-offs between, for example, convergence rate and storage requirements, or between robustness and speed, are central issues in numerical optimization [1].

In recent years, there has been an increasing interest in using tools from control theory to study the convergence properties of iterative optimization algorithms. The connection between control and optimization is made by interpreting these iterative algorithms as discrete-time dynamical systems with feedback. This interpretation provides many insights and new directions of research. In particular, we can use control tools to study disturbance rejection properties of optimization algorithms [2, 3, 4], study robustness to parameter and model uncertainty [5], and analyze tracking and adaptation capabilities [6]. This interpretation also opens the door to the use of control tools for algorithm design [5, 7].

An important step in algorithm tuning and design is the analysis of its theoretical performance (e.g., convergence rate). This step typically amounts to characterizing the worst-case performance of a given iterative algorithm for a specified class of problems. It is not possible to design an efficient algorithm without a firm grasp of the supporting convergence analysis. However, a theoretical analysis of the convergence properties is a mathematically involved process, which is typically done on a case-by-case basis, and does not generalize well. These drawbacks make algorithm design even more challenging.

The main aim of this paper is to develop an optimization framework, based on tools from robust control theory and polynomial optimization, to design first-order optimization algorithms for the class of smooth and strongly convex problems, in which the convergence rate is exponential ($f(x_k) - f(x_\star) = O(\rho^{2k})$ with $0 < \rho < 1$). To this end, we first represent first-order optimization algorithms as dynamical systems in state-space form, where the matrices representing the system dynamics depend on unknown design parameters (e.g., stepsize, momentum coefficient, etc.). Next, we utilize a family of nonquadratic Lyapunov functions to derive nonconservative bounds on the exponential convergence rate $\rho$. Based on the results in [5] and [8], we derive a sufficient condition for exponential convergence of the algorithm in terms of a polynomial matrix inequality. The entries of this matrix are polynomial functions of (i) the parameters of the Lyapunov function, (ii) the parameters of the algorithm, and (iii) the unknown convergence rate. We then map this polynomial matrix inequality into a set of polynomial inequalities. This allows us to write the algorithm design problem as an optimization program with polynomial constraints, where the objective is to optimize the upper bound on the exponential decay rate over the space of design variables. Finally, we use sum-of-squares machinery to find the parameter choice that optimizes the convergence rate. We illustrate our approach by designing a first-order method sharing the same structure as Nesterov's accelerated method.

The rest of the paper is organized as follows. In Section II, we use a Lyapunov analysis framework in which we cast the problem of finding the worst-case exponential convergence rate of a first-order optimization algorithm as a semidefinite program. In Section III, we turn to algorithm synthesis: we use the results of Section II to formulate the algorithm design problem as a polynomial optimization program and apply sum-of-squares machinery to solve it. Finally, we illustrate the performance of our design framework via numerical simulations.

II Algorithm Analysis

II-A Preliminaries

We denote the set of real numbers by $\mathbb{R}$, the set of real $d$-dimensional vectors by $\mathbb{R}^d$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $d$-dimensional identity matrix by $I_d$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. We denote by $\mathrm{vec}(\cdot)$ the linear transformation that converts a matrix into a column vector. For a function $f$, we denote by $\mathrm{dom}\, f$ the effective domain of $f$. The $p$-norm ($p \geq 1$) is displayed by $\|\cdot\|_p$. For two matrices $A$ and $B$ of arbitrary dimensions, we denote their Kronecker product by $A \otimes B$. We define $\mathbb{R}[x]$ as the polynomial ring in the variables $x = (x_1, \ldots, x_n)$ and $\mathbb{R}_{2d}[x]$ as the set of polynomials in these variables of degree at most $2d$.

Definition 1 (Smoothness)

A differentiable function $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth on $S \subseteq \mathbb{R}^d$ if the inequality

$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2 \tag{1a}$$

holds for some $L > 0$ and all $x, y \in S$. An equivalent definition is that

$$|f(y) - f(x) - \nabla f(x)^\top (y - x)| \leq \tfrac{L}{2} \|y - x\|_2^2 \tag{1b}$$

for all $x, y \in S$.

Definition 2 (Strong convexity)

A differentiable function $f \colon \mathbb{R}^d \to \mathbb{R}$ is $m$-strongly convex on $S \subseteq \mathbb{R}^d$ if the inequality

$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \tfrac{m}{2} \|y - x\|_2^2 \tag{2a}$$

holds for some $m > 0$ and all $x, y \in S$. An equivalent definition is that

$$(\nabla f(x) - \nabla f(y))^\top (x - y) \geq m \|x - y\|_2^2 \tag{2b}$$

for all $x, y \in S$.

We denote the class of functions satisfying (1a) and (2a) by $\mathcal{F}(m, L)$. A differentiable function $f$ belongs to the class $\mathcal{F}(m, L)$ on $S$ if and only if the inequality

$$\begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix}^\top M \begin{bmatrix} x - y \\ \nabla f(x) - \nabla f(y) \end{bmatrix} \geq 0 \tag{3a}$$

holds for all $x, y \in S$ [9, 5], where the indefinite matrix $M$ is given by

$$M = \begin{bmatrix} -2mL I_d & (m+L) I_d \\ (m+L) I_d & -2 I_d \end{bmatrix}. \tag{3b}$$
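To make the inequality concrete, the following numerical check (a minimal sketch, assuming the block form of $M$ reconstructed in (3b) above) verifies (3a) on a strongly convex quadratic, for which membership in $\mathcal{F}(m, L)$ is immediate.

```python
import numpy as np

# Sanity check of inequality (3a) for f(x) = 0.5 * x' H x with
# spectrum(H) in [m, L], so that f belongs to F(m, L).
rng = np.random.default_rng(0)
d, m, L = 5, 1.0, 10.0

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(rng.uniform(m, L, d)) @ Q.T          # eigenvalues in [m, L]

I = np.eye(d)
M = np.block([[-2 * m * L * I, (m + L) * I],
              [(m + L) * I,    -2 * I]])             # indefinite matrix (3b)

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    v = np.concatenate([x - y, H @ (x - y)])         # grad f(x) - grad f(y) = H(x - y)
    assert v @ M @ v >= -1e-8                        # quadratic form (3a) is nonnegative
print("inequality (3a) held on all sampled pairs")
```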

II-B Algorithm Representation

Consider the following optimization problem

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x), \tag{4}$$

where $f \in \mathcal{F}(m, L)$ is smooth and strongly convex. Under this assumption, the well-known optimality condition of a point $x_\star \in \mathbb{R}^d$ is that $\nabla f(x_\star) = 0$.

Consider a first-order algorithm that generates a sequence of points $\{x_k\}$ that solves (4). In other words, we have $x_k \to x_\star$ as $k \to \infty$, where $x_\star = \arg\min_x f(x)$. We can represent the algorithm in the following state-space form [5, 8]:

$$\begin{aligned} \xi_{k+1} &= A(\theta)\, \xi_k + B(\theta)\, u_k, \\ y_k &= C(\theta)\, \xi_k, \\ x_k &= D(\theta)\, \xi_k, \\ u_k &= \nabla f(y_k), \end{aligned} \tag{5}$$

where $\xi_k \in \mathbb{R}^{nd}$ ($n \geq 1$) is the state of the algorithm, $y_k$ is the output at which the gradient is evaluated, and $x_k$ is another output, which we assume converges to the optimal point, i.e., $x_k \to x_\star$ with $\nabla f(x_\star) = 0$. Therefore, the fixed points of the algorithm must naturally satisfy

$$\xi_\star = A(\theta)\, \xi_\star, \qquad y_\star = C(\theta)\, \xi_\star = D(\theta)\, \xi_\star = x_\star, \qquad \nabla f(y_\star) = 0. \tag{6}$$

Note that the matrices $A(\theta)$, $B(\theta)$, $C(\theta)$, and $D(\theta)$ in (5) are parameterized by the vector $\theta$, which is the concatenation of the parameters of the algorithm (e.g., stepsize, momentum coefficient, etc.). We give two examples below.

Example 1

The gradient method is given by the recursion

$$x_{k+1} = x_k - \alpha \nabla f(x_k), \tag{7}$$

where $\alpha > 0$ is the stepsize. For this algorithm, $\theta = \alpha$ is the only parameter, the state is $\xi_k = x_k$, and the matrices $A$, $B$, $C$, and $D$ are given by

$$A = I_d, \quad B = -\alpha I_d, \quad C = I_d, \quad D = I_d. \tag{8}$$
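A minimal simulation of the recursion (7) on a quadratic in $\mathcal{F}(m, L)$ follows; the stepsize $\alpha = 1/L$ is an illustrative choice, not a value prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, L = 4, 1.0, 10.0
H = np.diag(np.linspace(m, L, d))        # f(x) = 0.5 x' H x, minimizer x* = 0
alpha = 1.0 / L                          # illustrative stepsize

x = rng.standard_normal(d)
for k in range(200):
    x = x - alpha * (H @ x)              # x_{k+1} = x_k - alpha * grad f(x_k)
print(np.linalg.norm(x))                 # ~ (1 - m/L)^200 decay
```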
Example 2

Consider the following recursion defined on the two sequences $\{x_k\}$ and $\{y_k\}$,

$$\begin{aligned} x_{k+1} &= x_k + \beta (x_k - x_{k-1}) - \alpha \nabla f(y_k), \\ y_k &= x_k + \gamma (x_k - x_{k-1}), \end{aligned} \tag{9}$$

where $\alpha > 0$ is the stepsize and $\beta, \gamma$ are nonnegative scalars. By defining the state vector $\xi_k = (x_k, x_{k-1})$ and the parameter vector $\theta = (\alpha, \beta, \gamma)$, we can represent (9) in the canonical form (5), as follows,

$$\begin{aligned} \xi_{k+1} &= \begin{bmatrix} (1+\beta) I_d & -\beta I_d \\ I_d & 0 \end{bmatrix} \xi_k + \begin{bmatrix} -\alpha I_d \\ 0 \end{bmatrix} \nabla f(y_k), \\ y_k &= \begin{bmatrix} (1+\gamma) I_d & -\gamma I_d \end{bmatrix} \xi_k, \qquad x_k = \begin{bmatrix} I_d & 0 \end{bmatrix} \xi_k. \end{aligned} \tag{10}$$

Therefore, the matrices are given by

$$A = \begin{bmatrix} (1+\beta) I_d & -\beta I_d \\ I_d & 0 \end{bmatrix}, \quad B = \begin{bmatrix} -\alpha I_d \\ 0 \end{bmatrix}, \quad C = \begin{bmatrix} (1+\gamma) I_d & -\gamma I_d \end{bmatrix}, \quad D = \begin{bmatrix} I_d & 0 \end{bmatrix}. \tag{11}$$

Notice that depending on the selection of $\beta$ and $\gamma$, (9) describes various existing algorithms. For example, the gradient method corresponds to the case $\beta = \gamma = 0$ (see Example 1). In Nesterov's accelerated method, we have $\gamma = \beta$ [10]. Finally, we recover the Heavy-ball method by setting $\gamma = 0$ [11]. In Table I, we provide various parameter selections and convergence rates for the gradient method, the Heavy-ball method, and Nesterov's accelerated method [10, 11, 5].

Algorithm (function class) | Parameters | Decay rate $\rho$
Gradient (strongly convex) | $\alpha = 1/L$, $\beta = \gamma = 0$ | $1 - 1/\kappa$
Gradient (strongly convex) | $\alpha = 2/(L+m)$, $\beta = \gamma = 0$ | $(\kappa-1)/(\kappa+1)$
Nesterov (strongly convex) | $\alpha = 1/L$, $\beta = \gamma = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ | $\sqrt{1 - 1/\sqrt{\kappa}}$
Nesterov (quadratics) | $\alpha = 4/(3L+m)$, $\beta = \gamma = (\sqrt{3\kappa+1}-2)/(\sqrt{3\kappa+1}+2)$ | $1 - 2/\sqrt{3\kappa+1}$
Heavy-ball (quadratics) | $\alpha = 4/(\sqrt{L}+\sqrt{m})^2$, $\beta = \big((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\big)^2$, $\gamma = 0$ | $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$
TABLE I: Decay rate of the algorithm in (9) for various parameter selections ($\kappa = L/m$).
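The following sketch assembles the matrices in (11) for the parameter selections discussed above; the block layout follows the reconstruction in Example 2 and should be read as an assumption consistent with [5], not a verbatim transcription of the paper.

```python
import numpy as np

def family_matrices(alpha, beta, gamma, d=1):
    """State-space matrices (11) for the recursion (9) with state
    xi_k = (x_k, x_{k-1}); layout is an assumption consistent with [5]."""
    Id = np.eye(d)
    A = np.block([[(1 + beta) * Id, -beta * Id],
                  [Id,              np.zeros((d, d))]])
    B = np.vstack([-alpha * Id, np.zeros((d, d))])
    C = np.hstack([(1 + gamma) * Id, -gamma * Id])   # y_k = C xi_k
    D = np.hstack([Id, np.zeros((d, d))])            # x_k = D xi_k
    return A, B, C, D

# Special cases discussed in the text:
#   gradient method: beta = gamma = 0
#   Nesterov:        gamma = beta
#   Heavy-ball:      gamma = 0
A, B, C, D = family_matrices(alpha=0.1, beta=0.5, gamma=0.5)
```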

II-C Exponential Convergence via SDPs

To measure the progress of the algorithm in (5) towards optimality, we make use of the following Lyapunov function [8]:

$$V_k = f(x_k) - f(x_\star) + (\xi_k - \xi_\star)^\top P\, (\xi_k - \xi_\star), \tag{12}$$

where $x_\star$ is the minimizer of $f$, and $P$ is a positive semidefinite matrix that does not depend on $k$ and is to be determined. The first term in (12) is the suboptimality of $x_k$ and the second term is a weighted "distance" between the state $\xi_k$ and the fixed point $\xi_\star$. Notice that by this definition, $V_k$ is positive everywhere and zero at optimality. Suppose we select $P$ such that the Lyapunov function satisfies the inequality

$$V_{k+1} \leq \rho^2 V_k \tag{13}$$

for some $0 < \rho < 1$. By iterating (13) down to $k = 0$, we can conclude $V_k \leq \rho^{2k} V_0$ for all $k$. Since $P \succeq 0$, this implies that

$$f(x_k) - f(x_\star) \leq \rho^{2k} V_0. \tag{14}$$

In other words, the algorithm exhibits an $O(\rho^{2k})$ convergence rate if the Lyapunov function satisfies the decrease condition (13). In the following result, developed in [8], we present a matrix inequality whose feasibility implies (13) and, hence, the exponential convergence of the algorithm. For notational convenience, we drop the argument $\theta$ wherever the dependence of the matrices on $\theta$ is clear from the context.

Theorem 1

Let $f \in \mathcal{F}(m, L)$, where $0 < m \leq L$. Consider the dynamical system in (5), whose fixed point satisfies (6). Define the matrices

(15a)
(15b)
(15c)

Suppose there exist a positive semidefinite matrix $P$ and nonnegative scalars $\sigma_1, \sigma_2$ that satisfy the following LMI:

(16)

Then the sequence $\{x_k\}$ generated by (5) satisfies

$$f(x_k) - f(x_\star) \leq \rho^{2k} \left( f(x_0) - f(x_\star) + (\xi_0 - \xi_\star)^\top P\, (\xi_0 - \xi_\star) \right). \tag{17}$$
Proof.

See [8]. ∎

According to Theorem 1, any tuple $(\rho, P, \sigma_1, \sigma_2)$ that satisfies the matrix inequality in (16) certifies an $O(\rho^{2k})$ convergence rate for the algorithm. In particular, the fastest certified rate can be found by solving the following optimization problem:

$$\begin{aligned} & \text{minimize} \quad \rho^2 \\ & \text{subject to} \quad \text{(16)}, \quad P \succeq 0, \quad \sigma_1, \sigma_2 \geq 0, \end{aligned} \tag{18}$$

where the problem data $(m, L, \theta)$ are given, and the decision variables are $(\rho, P, \sigma_1, \sigma_2)$. Note that (18) is a quasiconvex program, since the constraint is affine in $(P, \sigma_1, \sigma_2)$ for a fixed $\rho$. We can therefore use a bisection search to find the smallest possible value of the decay rate $\rho$. This approach has been pursued in [5] using a different Lyapunov function.
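To illustrate the bisection idea, the sketch below certifies a rate for the gradient method using a purely quadratic Lyapunov function $V_k = p\,\|x_k - x_\star\|^2$ (i.e., dropping the function-value term in (12)) together with one S-procedure multiplier for inequality (3a) evaluated at the pair $(x_k, x_\star)$. The resulting $2 \times 2$ LMI is a simplified stand-in for (16), not the paper's exact matrices (15).

```python
import cvxpy as cp

m, L = 1.0, 10.0
alpha = 1.0 / L

def certifiable(rho):
    """Feasibility of the reduced 2x2 LMI for a fixed rho."""
    p, lam = cp.Variable(nonneg=True), cp.Variable(nonneg=True)
    S = cp.Variable((2, 2), symmetric=True)
    constraints = [
        S[0, 0] == p * (1 - rho**2) - 2 * m * L * lam,
        S[0, 1] == -alpha * p + (m + L) * lam,
        S[1, 1] == alpha**2 * p - 2 * lam,
        S << 0,      # the 2x2 LMI (negative semidefinite)
        p >= 1,      # normalization: the LMI is homogeneous in (p, lam)
    ]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve(solver=cp.SCS)
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

lo, hi = 0.0, 1.0                    # bisection on rho, as suggested for (18)
for _ in range(30):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if certifiable(mid) else (mid, hi)
print("certified rate:", hi)         # ~ 1 - m/L = 0.9 for alpha = 1/L
```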

Remark 1

We can often exploit special structure in the matrices $A$, $B$, $C$, $D$, and $M$ to reduce the dimension of the LMI (16). For many algorithms, these matrices are of the form

$$A = \hat{A} \otimes I_d, \quad B = \hat{B} \otimes I_d, \quad C = \hat{C} \otimes I_d, \quad D = \hat{D} \otimes I_d,$$

where $\hat{A}$, $\hat{B}$, $\hat{C}$, and $\hat{D}$ are lower dimensional matrices independent of $d$ [5, Section 4.2]. By selecting $P = \hat{P} \otimes I_d$, where $\hat{P}$ is a lower dimensional matrix, we can factor out all the Kronecker products from the matrices and make the dimension of the corresponding LMI (16) independent of $d$. In particular, a multi-step method with $n$ steps yields an $(n+1) \times (n+1)$ LMI. For instance, the gradient method ($n = 1$) and Nesterov's accelerated method ($n = 2$) yield $2 \times 2$ and $3 \times 3$ LMIs, respectively.

Fig. 1: Level curves of the convergence rate $\rho$ of Nesterov's accelerated method as a function of the stepsize $\alpha$ and the momentum parameter $\beta$, for a fixed pair $(m, L)$.

III Algorithm Synthesis

We saw in the previous section that the exponential stability of a first-order algorithm can be certified by solving an SDP feasibility problem. More precisely, given an algorithm of the form (5) with a prespecified parameter vector $\theta$, we can search for a suitable Lyapunov function and establish a rate bound for the algorithm by solving an LMI problem. A natural question to ask is whether we can leverage the same framework to design new algorithms and provide a convergence guarantee simultaneously. We formalize this problem as follows.

Problem 1

Let $f \in \mathcal{F}(m, L)$, where $0 < m \leq L$. Given a parameterized family of first-order methods of the form (5), tune the parameters of the algorithm, within a compact set $\Theta$, such that the resulting algorithm converges at an $O(\rho^{2k})$ rate to $x_\star$ with a minimal $\rho$.

Using the result of Theorem 1, Problem 1 can be formally written as the following nonconvex optimization problem:

$$\begin{aligned} & \text{minimize} \quad \rho^2 \\ & \text{subject to} \quad \text{(16)}, \quad P \succeq 0, \quad \sigma_1, \sigma_2 \geq 0, \quad \theta \in \Theta, \end{aligned} \tag{19}$$

where the decision variables are now $(\rho, P, \sigma_1, \sigma_2)$ and $\theta$. We recall from (15) that the parameter $\theta$ appears in the optimization problem through the matrices $A(\theta)$, $B(\theta)$, $C(\theta)$, and $D(\theta)$. Intuitively, (19) searches for the parameters of the algorithm with optimal performance (minimal $\rho$) while respecting the stability condition (13), which is imposed by the matrix inequality constraint in (19).

Note that if we fix $\theta \in \Theta$, which means the algorithm parameters are given, then (19) reduces to the quasiconvex program in (18). This suggests a natural but inefficient way to solve (19): we perform an exhaustive search over the parameter space $\Theta$, and solve (19) with $\theta$ fixed (i.e., an instance of (18)) to find the optimal $\rho$ for each value of $\theta$. We therefore need to solve a sequence of quasiconvex programs in order to find the optimal tuning. As an illustration, we implement this approach to tune the parameters of Nesterov's accelerated method (the algorithm in (9) with $\gamma = \beta$), where the tuning parameters are the stepsize $\alpha$ and the momentum coefficient $\beta$. In Figure 1, we plot the level curves of the convergence factor $\rho$ as a function of $(\alpha, \beta)$.
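A runnable sketch of the grid search follows. As a cheap stand-in for solving (18) at every grid point, it computes the worst-case asymptotic rate of (9) with $\gamma = \beta$ on quadratic objectives with curvature in $[m, L]$ (a spectral-radius computation rather than an SDP), which lower-bounds the rate certifiable over the full class $\mathcal{F}(m, L)$.

```python
import numpy as np

m, L = 1.0, 10.0

def quadratic_rate(alpha, beta, ngrid=200):
    """Worst-case rate of (9) with gamma = beta on f(x) = 0.5 * lam * x^2,
    lam in [m, L]: spectral radius of the 2x2 iteration matrix."""
    rho = 0.0
    for lam in np.linspace(m, L, ngrid):
        T = np.array([[(1 + beta) * (1 - alpha * lam), -beta * (1 - alpha * lam)],
                      [1.0, 0.0]])
        rho = max(rho, max(abs(np.linalg.eigvals(T))))
    return rho

grid_a = np.linspace(0.01, 2.0 / L, 30)
grid_b = np.linspace(0.0, 0.9, 30)
best = min(((quadratic_rate(a, b), a, b) for a in grid_a for b in grid_b),
           key=lambda t: t[0])
print("best quadratic rate %.4f at alpha = %.4f, beta = %.4f" % best)
```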

Note that, in general, the exhaustive search approach described above becomes prohibitively costly as the dimension and/or the granularity of the search space increases. We therefore need a more efficient way to solve (19). Although this problem is nonconvex, the special structure of the constraint set makes the problem tractable. To see this, note that all the entries of the constraint matrix in (19) are polynomial functions of the decision variables. This matrix inequality constraint can alternatively be "scalarized" and rewritten in terms of scalar polynomial inequalities (e.g., by considering minors, or coefficients of the characteristic polynomial). The resulting optimization problem is of the form

$$\begin{aligned} & \text{minimize} \quad p(z) \\ & \text{subject to} \quad g_i(z) \geq 0, \quad i = 1, \ldots, m, \end{aligned}$$

where $z$ is the vector of decision variables, and $p$ and $g_1, \ldots, g_m$ are all polynomials. It is here that we can draw a direct connection from the algorithm design problem in (19) to polynomial optimization, which admits tractable convex relaxations [12]. We briefly introduce polynomial optimization next.

III-A Sum-of-Squares Programs and Polynomial Optimization

The main difficulty in solving problems involving polynomial constraints, such as the ones in (19), is the lack of efficient numerical methods able to handle multivariate nonnegativity conditions. A computationally efficient approach is to use sum-of-squares (SOS) relaxations [13, 14]. In what follows, we introduce the basic results used in our derivations.

Definition 3

A multivariate polynomial $p \in \mathbb{R}[x]$ of degree $2d$ in the variables $x = (x_1, \ldots, x_n)$ is a sum-of-squares (SOS) if there exist polynomials $q_1, \ldots, q_r \in \mathbb{R}[x]$ such that

$$p(x) = \sum_{i=1}^{r} q_i(x)^2. \tag{20}$$

We will denote the set of SOS polynomials in $n$ variables of degree at most $2d$ by $\Sigma_{2d}[x]$. A polynomial being an SOS is sufficient to certify its global nonnegativity, since any $p \in \Sigma_{2d}[x]$ satisfies $p(x) \geq 0$ for all $x \in \mathbb{R}^n$. Hence, $\Sigma_{2d}[x] \subseteq P_{2d}[x]$, where $P_{2d}[x]$ is the set of nonnegative polynomials in $\mathbb{R}_{2d}[x]$. Given a polynomial $p \in \mathbb{R}_{2d}[x]$, the existence of an SOS decomposition of the form (20) is equivalent to the existence of a positive semidefinite matrix $Q$, called the Gram matrix, such that

$$p(x) = z(x)^\top Q\, z(x), \tag{21}$$

where $z(x)$ is the vector of all monomials in $x$ of degree at most $d$. Notice that the equality constraint in (21) is affine in the matrix $Q$, since expanding the right-hand side yields a polynomial whose coefficients depend affinely on the entries of $Q$ and must equal the corresponding coefficients of the given polynomial $p$. Hence, finding an SOS decomposition is computationally equivalent to finding a positive semidefinite matrix $Q$ subject to the affine constraint in (21), which is a semidefinite program [13, 15].
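As an illustration of the Gram-matrix formulation (21), the snippet below certifies that the toy polynomial $p(x) = x^4 + 2x^3 + 3x^2 + 2x + 1 = (x^2 + x + 1)^2$ is SOS with $z(x) = (1, x, x^2)$; the polynomial and the solver choice are ours, not the paper's.

```python
import cvxpy as cp
import numpy as np

Q = cp.Variable((3, 3), symmetric=True)   # Gram matrix for z(x) = (1, x, x^2)

# Match coefficients of z(x)' Q z(x) with those of p(x):
constraints = [
    Q[0, 0] == 1,                  # constant term
    2 * Q[0, 1] == 2,              # x
    2 * Q[0, 2] + Q[1, 1] == 3,    # x^2
    2 * Q[1, 2] == 2,              # x^3
    Q[2, 2] == 1,                  # x^4
    Q >> 0,
]
cp.Problem(cp.Minimize(0), constraints).solve(solver=cp.SCS)
print(np.round(Q.value, 3))        # any PSD solution certifies that p is SOS
```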

Using the notion of SOS polynomials, we can now define the class of sum-of-squares programs (SOSPs). An SOSP is an optimization program in which we maximize a linear function over a feasible set given by the intersection of an affine family of polynomials and the set of SOS polynomials, as described below [16]:

$$\begin{aligned} & \text{maximize} \quad b^\top \lambda \\ & \text{subject to} \quad a_{i0}(x) + \lambda_1 a_{i1}(x) + \cdots + \lambda_k a_{ik}(x) \in \Sigma_{2d}[x], \quad i = 1, \ldots, m, \end{aligned}$$

where $\lambda = (\lambda_1, \ldots, \lambda_k)$ is the decision variable and the $a_{ij}$ are given multivariate polynomials in $x$. Despite their generality, it can be shown that SOSPs are equivalent to SDPs; hence, they are convex programs and can be solved in polynomial time [15]. In recent years, SOSPs have been used as convex relaxations for various computationally hard optimization and control problems (see, for example, [13, 17, 14, 18] and the volume [19]).

Finding an SOS decomposition of a multivariate polynomial $p(x)$ provides an explicit certificate of its nonnegativity for all values of the indeterminates $x$. In order to design an algorithmic procedure to search for the optimal convergence rate, we need a similar idea to ensure that a polynomial matrix is positive semidefinite for every value of the indeterminates. A natural definition is as follows [20]:

Definition 4

Consider a symmetric matrix $S(x)$ with polynomial entries $S_{ij} \in \mathbb{R}[x]$, and let $y$ be a vector of new indeterminates. Then $S(x)$ is a sum-of-squares matrix (SOSM) if $y^\top S(x)\, y$ is an SOS polynomial in $(x, y)$.

Obviously, a polynomial matrix being an SOSM provides an explicit certificate that it is positive semidefinite for all values of the indeterminates. In our particular application, it is computationally convenient to use an alternative certificate based on Descartes' rule of signs [21]. In particular, we will use the rule of signs to certify that a given polynomial matrix is positive semidefinite by certifying that three multivariate polynomials are SOS; we make this explicit for the $3 \times 3$ case in Section IV-B. As a result of our analysis, we will design an optimization algorithm with the fastest certified convergence rate by solving a polynomial optimization problem over a feasible set defined by polynomial inequalities. In general, the resulting problem will be of the form

$$\begin{aligned} & \text{minimize} \quad p(z) \\ & \text{subject to} \quad g_i(z) \geq 0, \quad i = 1, \ldots, m, \end{aligned} \tag{22}$$

where $p, g_i \in \mathbb{R}[z]$ for all $i$. Under mild technical conditions, this problem can be bounded from below using the following SOSP [14]:

$$\begin{aligned} & \text{sup} \quad \gamma \\ & \text{s.t.} \quad p(z) - \gamma = \sigma_0(z) + \sum_{i=1}^{m} \sigma_i(z)\, g_i(z), \quad \sigma_0, \ldots, \sigma_m \ \text{SOS}. \end{aligned}$$

In practice, a relaxation of this problem is solved numerically, where the relaxation results from truncating the degrees of the polynomials $\sigma_i$, as described in [14]. In what follows, we show how to use this methodology to design optimization algorithms with optimal exponential convergence rates.
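As a toy instance of this SOSP, the snippet below computes an SOS lower bound for the unconstrained problem of minimizing $p(x) = x^4 - 3x^2 + 1$ over $\mathbb{R}$ (no constraints $g_i$, so only $\sigma_0$ appears); for univariate polynomials the bound is tight, and the program returns the global minimum $-5/4$. The example is ours, chosen for illustration only.

```python
import cvxpy as cp

# Maximize gamma subject to p(x) - gamma being SOS with z(x) = (1, x, x^2).
gamma = cp.Variable()
Q = cp.Variable((3, 3), symmetric=True)

constraints = [
    Q[0, 0] == 1 - gamma,           # constant term of p - gamma
    2 * Q[0, 1] == 0,               # x
    2 * Q[0, 2] + Q[1, 1] == -3,    # x^2
    2 * Q[1, 2] == 0,               # x^3
    Q[2, 2] == 1,                   # x^4
    Q >> 0,
]
prob = cp.Problem(cp.Maximize(gamma), constraints)
prob.solve(solver=cp.SCS)
print(gamma.value)                  # ~ -1.25, the true global minimum
```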

IV Numerical Simulations

In this section, we validate the proposed approach on two examples: the gradient method and Nesterov's accelerated method. There are several software packages that convert polynomial optimization problems into a hierarchy of SDP relaxations and then call an external solver. In this paper, we use the software GloptiPoly 3 [22], which is oriented toward global polynomial optimization problems.

IV-A The Gradient Method

Consider the gradient method of Example 1. By choosing $P = \hat{P} \otimes I_d$ (with $\hat{P}$ scalar), we can apply the dimensionality reduction outlined in Remark 1 and reduce the LMI (16) to a $2 \times 2$ matrix inequality. Substituting the reduced matrices back in (16) yields a matrix inequality constraint whose entries are all polynomial functions of the stepsize $\alpha$, the rate $\rho$, the Lyapunov parameter, and the multipliers. Therefore, the design problem in (19) for the gradient method is equivalent to the following polynomial optimization problem:

$$\begin{aligned} & \text{minimize} \quad \rho^2 \\ & \text{subject to} \quad \text{the scalarized polynomial inequalities obtained from the reduced LMI (16)}, \end{aligned} \tag{23}$$

with decision variables the rate $\rho$, the stepsize $\alpha$, the Lyapunov parameter, and the multipliers. In Figure 2, we plot the optimal convergence rate $\rho$, obtained by solving (23), for various values of the condition number $\kappa = L/m$. We also plot the theoretical decay rate from Table I. We observe that we obtain a slightly faster convergence rate (in terms of the suboptimality $f(x_k) - f(x_\star)$) than the theoretical bound.
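To make the scalarization step concrete, the snippet below symbolically extracts the polynomial inequalities (diagonal entries and determinant) from the reduced $2 \times 2$ gradient-method matrix used in the earlier bisection sketch; again, this matrix is a simplified stand-in, not the paper's exact constraint in (23).

```python
import sympy as sp

alpha, rho, p, lam, m, L = sp.symbols("alpha rho p lambda m L", positive=True)

# Simplified 2x2 gradient-method LMI (we need S <= 0); same stand-in
# as in the bisection sketch, not the paper's exact matrix in (23).
S = sp.Matrix([[p * (1 - rho**2) - 2 * m * L * lam, -alpha * p + (m + L) * lam],
               [-alpha * p + (m + L) * lam,          alpha**2 * p - 2 * lam]])

# S <= 0 iff -S >= 0 iff its diagonal entries and determinant are >= 0;
# these are the scalar polynomial constraints of the design problem.
N = -S
g1, g2, g3 = sp.expand(N[0, 0]), sp.expand(N[1, 1]), sp.expand(N.det())
for g in (g1, g2, g3):
    print(g, ">= 0")
```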

Fig. 2: Convergence rate of the gradient method obtained by the SOS technique, compared with the analytical rate from Table I.

Iv-B The Nesterov’s Accelerated Method

Consider the algorithm of Example 2 with $\gamma = \beta$, which corresponds to Nesterov's accelerated method. The state-space matrices of this algorithm are given in (11). By applying the dimensionality reduction of Remark 1, we arrive at the following reduced versions of the matrices (15) that enter the LMI (16):

(24)

where $M$ is given in (3b) and $\hat{P}$ is now a $2 \times 2$ positive semidefinite matrix. The corresponding optimization problem can then be written as

$$\begin{aligned} & \text{minimize} \quad \rho^2 \\ & \text{subject to} \quad \text{the } 3 \times 3 \text{ polynomial matrix inequality obtained from (16) and (24)}, \end{aligned} \tag{25}$$

We then replace the polynomial matrix inequality in (25) with three polynomial inequalities by applying Descartes' rule of signs, as described in Section III-A. In Figure 3, we plot the optimal decay rate $\rho$, obtained by solving (25), for various values of the condition number $\kappa$. We observe that the obtained rate is better than Nesterov's rate and worse than the theoretical lower bound, which is achieved by the Heavy-ball method on quadratics (see Table I).
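To make the "three polynomials" explicit, here is the underlying argument for the $3 \times 3$ case (our reconstruction of the reasoning, not the paper's verbatim statement). For a symmetric polynomial matrix $S(z)$, write the characteristic polynomial

$$\det\big(t I_3 - S(z)\big) = t^3 - c_2(z)\, t^2 + c_1(z)\, t - c_0(z),$$

where $c_2 = \mathrm{trace}(S)$, $c_1$ is the sum of the $2 \times 2$ principal minors of $S$, and $c_0 = \det(S)$. Since $S(z)$ is symmetric, all roots of this polynomial (the eigenvalues of $S(z)$) are real; by Descartes' rule of signs, they are all nonnegative if and only if $c_0(z) \geq 0$, $c_1(z) \geq 0$, and $c_2(z) \geq 0$. Hence, requiring $c_0$, $c_1$, and $c_2$ to be SOS certifies $S(z) \succeq 0$ for all $z$.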

Fig. 3: Convergence rate of Nesterov's accelerated method obtained by the SOS technique versus the analytical rate (see Table I).

V Conclusions

We have considered the problem of designing first-order iterative algorithms for solving smooth and strongly convex optimization problems. By using a family of parameterized nonquadratic Lyapunov functions, we presented a polynomial matrix inequality as a sufficient condition for exponential stability of the algorithm. All the entries of this matrix have a polynomial dependence on the unknown parameters, namely the parameters of the algorithm, the parameters of the Lyapunov function, and the convergence rate. We then formulated a polynomial optimization problem to search for the optimal convergence rate subject to the polynomial matrix inequality. Finally, we proposed a sum-of-squares relaxation to solve the resulting design problem. We illustrated the proposed approach via numerical simulations.

References

  • [1] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 1999.
  • [2] J. Wang and N. Elia, "Distributed agreement in the presence of noise," in 47th Annual Allerton Conference on Communication, Control, and Computing, pp. 1575–1581, IEEE, 2009.
  • [3] J. Wang and N. Elia, "A control perspective for centralized and distributed convex optimization," in 50th IEEE Conference on Decision and Control and European Control Conference, pp. 3800–3805, IEEE, 2011.
  • [4] J. Wang and N. Elia, "Control approach to distributed optimization," in 48th Annual Allerton Conference on Communication, Control, and Computing, pp. 557–561, IEEE, 2010.
  • [5] L. Lessard, B. Recht, and A. Packard, "Analysis and design of optimization algorithms via integral quadratic constraints," SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
  • [6] M. Fazlyab, S. Paternain, V. M. Preciado, and A. Ribeiro, "Prediction-correction interior-point method for time-varying convex optimization," IEEE Transactions on Automatic Control, 2017.
  • [7] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard, "A robust accelerated optimization algorithm for strongly convex functions," arXiv e-print, 2017.
  • [8] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado, "Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems," arXiv preprint arXiv:1705.03615, 2017.
  • [9] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2013.
  • [10] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
  • [11] B. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [12] G. Blekherman, P. A. Parrilo, and R. R. Thomas, Semidefinite Optimization and Convex Algebraic Geometry. SIAM, 2012.
  • [13] P. A. Parrilo, Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.
  • [14] J. B. Lasserre, "Global optimization with polynomials and the problem of moments," SIAM Journal on Optimization, vol. 11, no. 3, pp. 796–817, 2001.
  • [15] S. Prajna, A. Papachristodoulou, P. Seiler, and P. A. Parrilo, "SOSTOOLS: Sum of squares optimization toolbox for MATLAB, user's guide," 2004.
  • [16] G. Blekherman, P. A. Parrilo, and R. R. Thomas, Semidefinite Optimization and Convex Algebraic Geometry. SIAM, 2012.
  • [17] P. A. Parrilo, "Semidefinite programming relaxations for semialgebraic problems," Mathematical Programming, vol. 96, no. 2, pp. 293–320, 2003.
  • [18] S. Prajna, A. Papachristodoulou, P. Seiler, and P. A. Parrilo, "SOSTOOLS: Control applications and new developments," in IEEE International Symposium on Computer Aided Control Systems Design, pp. 315–320, IEEE, 2004.
  • [19] D. Henrion and A. Garulli, Positive Polynomials in Control, vol. 312. Springer Science & Business Media, 2005.
  • [20] K. Gatermann and P. A. Parrilo, "Symmetry groups, semidefinite programs, and sums of squares," Journal of Pure and Applied Algebra, vol. 192, no. 1–3, pp. 95–128, 2004.
  • [21] D. Curtiss, "Recent extensions of Descartes' rule of signs," Annals of Mathematics, pp. 251–278, 1918.
  • [22] D. Henrion, J.-B. Lasserre, and J. Löfberg, "GloptiPoly 3: Moments, optimization and semidefinite programming," Optimization Methods & Software, vol. 24, no. 4–5, pp. 761–779, 2009.