OSGA: A fast subgradient algorithm

with optimal complexity

Arnold Neumaier

Fakultät für Mathematik, Universität Wien

Oskar-Morgenstern-Platz 1, A-1090 Wien, Austria

email: Arnold.Neumaier@univie.ac.at

WWW: http://www.mat.univie.ac.at/neum/

July 5, 2019

osga.tex

Abstract. This paper presents an algorithm for approximately minimizing a convex function in simple, not necessarily bounded convex domains, assuming only that function values and subgradients are available. No global information about the objective function is needed apart from a strong convexity parameter (which can be put to zero if only convexity is known). The worst case number of iterations needed to achieve a given accuracy is independent of the dimension (which may be infinite) and – apart from a constant factor – best possible under a variety of smoothness assumptions on the objective function.

Keywords: complexity bound, convex optimization, optimal subgradient method, large-scale optimization, Nesterov’s optimal method, nonsmooth optimization, optimal first-order method, smooth optimization, strongly convex

2010 MSC Classification: primary 90C25; secondary 90C60, 49M37, 65K05, 68Q25

## 1 Introduction

In the recent years, first order methods for convex optimization have become prominent again as they are able to solve large-scale problems in millions of variables (often arising from applications to image processing, compressed sensing, or machine learning), where matrix-based interior point methods cannot even perform a single iteration. (However, a matrix-free interior point method by Fountoulakis et al. [12] works well in some large compressed sensing problems.)

In 1983, Nemirovsky & Yudin [20] proved lower bounds on the complexity of first order methods (measured in the number of subgradient calls needed to achieve a given accuracy) for convex optimization under various regularity assumptions for the objective functions. (See Nesterov [22, Sections 2.1.2 and 3.2.1] for a simplified account.) They constructed convex, piecewise linear functions in dimensions , where no first order method can have function values more accurate than after subgradient evaluations. This implies the need for at least subgradient evaluations in the worst case if is a nondifferentiable but Lipschitz continuous convex function. They also constructed convex quadratic functions in dimensions where no first order method can have function values more accurate than after gradient evaluations. This implies the need for at least gradient evaluations in the worst case if is an arbitrarily often differentiable convex function. However in case of strongly convex functions with Lipschitz continuous gradients, the known lower bounds on the complexity allow a dimension-independent linear rate of convergence with .

Algorithms by Nesterov [22, 23, 26] (dating back in the unconstrained, not strongly convex case to 1983 [21]), achieve the optimal complexity order in all three cases. These algorithms need as input the knowledge of global parameters – a global Lipschitz constant for the objective functions in the nonsmooth case, a global Lipschitz constant for the gradient in the smooth case, and an explicit constant of strong convexity in the strongly convex case. Later many variants were described (see, e.g., Auslender & Teboulle [5], Lan et al. [18]), some of which are adaptive in the sense that they estimate all required constants during the execution of the algorithm. Beck & Teboulle [8] developed an adaptive proximal point algorithm called FISTA, popular in image restauration applications. Like all proximal point based methods, the algorithm needs more information about the objective function than just subgradients, but delivers in return a higher speed of convergence. Tseng [28] gives a common uniform derivation of several variants of fast first order algorithms based on proximal points. Becker et al. [9] (among other thinds) add adaptive features to Tseng’s class of algorithms, making them virtually independent of global (and hence often pessimistic) Lipschitz information. Other such adaptive algorithms include Gonzaga et al. [13, 14] and Meng & Chen [19]. Devolder et al. [11] show that both the nonsmooth case and the smooth case can be understood in a common way in terms of inexact gradient methods.

If the Lipschitz constant is very large, the methods with optimal complexity for the smooth case are initially much slower than the methods that have an optimal complexity for the nonsmooth case. This counterintuitive situation was remedied by Lan [17], who provides an algorithm that needs (expensive auxiliary computations but) no knowledge about the function except convexity and has the optimal complexity, both in the nonsmooth case and in the smooth case, without having to know whether or not the function is smooth. However, its worst case behavior on strongly convex problem is unknown. Similarly, if the constant of strong convexity is very tiny, the methods with optimal complexity for the strongly convex case are initially much slower than the methods that do not rely on strong convexity. Prior to the present work, no algorithm was known with optimal complexity both for the general nonsmooth case and for the strongly convex case.

Content. In this paper, we derive an algorithm for approximating a solution of the convex optimization problem

(1) |

using first order information (function values and subgradients ) only. Here is a convex function defined on a nonempty, convex subset of a vector space with bilinear pairing defined for and in the dual space . The minimum in (1) exists if there is a point such that the level set is bounded.

Our method is based on monotonically reducing bounds on the error of the function value of the currently best point . These bounds are derived from suitable linear relaxations and inequalities obtained with the help of a prox function. The solvability of an auxiliary optimization subproblem involving the prox function is assumed. In many cases, this auxiliary subproblem has a cheap, closed form solution; this is shown here for the unconstrained case with a quadratic prox function, and in Ahookhosh & Neumaier [3, 4] for more general cases involving simple, practically important convex sets .

The OSGA algorithm presented here provides a fully adaptive alternative to current optimal first order methods. If no strong convexity is assumed, it shares the uniformity, the freeness of global parameters, and the optimal complexity properties of Lan’s method, but has a far simpler structure and derivation. Beyond that, it also gives the optimal complexity in the strongly convex case, though it needs in this case – like all other known methods with provable optimal linear convergence rate – the knowledge of an explicit constant of strong convexity. Furthermore – like Nesterov’s algorithm from [23] for the smooth case, but unlike his linearly convergent algorithm for the strongly convex case, scheme (2.2.19) in Nesterov [22], the algorithm derived here does not evaluate and outside their domain. The method for analyzing the complexity of OSGA is also new; neither Tseng’s complexity analysis nor Nesterov’s estimating sequences are applicable to OSGA.

The OSGA algorithm can be used in place of Nesterov’s optimal algorithms for smooth convex optimization and its variants whenever the latter are traditionally employed. Thus it may be used as the smooth solver with methods for solving nonsmooth convex problems via smoothing (Nesterov [23]), and for solving very large linear programs (see, e.g., Aybat & Iyengar [7], Chen & Burer [10], Bu et al. [15], Nesterov [24, 25], Richtarik [27])

Acknowledgment. I’d like to thank Masoud Ahookhosh for numerous useful remarks on an earlier version of the manuscript.

## 2 The OSGA algorithm

In this section we motivate and formulate the new algorithm.

In the following, denotes a Banach space with norm , and is the dual Banach space with the dual norm . is a closed, conves subset of . The objective function is assumed to be convex, and denotes a particular computable subgradient of at .

The basic idea. The method is based on monotonically reducing bounds on the error of the function value of the currently best point . These bounds are derived from suitable linear relaxations

(2) |

(where and ) with the help of a continuously differentiable prox function satisfying

(3) |

(4) |

where denotes the gradient of at . (Thus is strongly convex with strong convexity parameter . Choosing simplified the formulas, and is no restriction of generality, as we may always rescale a prox function to enforce .) We require that, for each and ,

(5) |

is attained at some . This requirement implies that, for arbitrary and ,

From (2) for and (3), we find that

(6) |

Typically,

(7) |

are computed together; clearly one may update a given by , thus improving the bound.

Note that the form of the auxiliary optimization problem (5) is forced by this argument. Although this is a nonconvex optimization problem, it is shown in Ahookhosh & Neumaier [3, 4] that there are many important cases where and are cheap to compute. In particular, we shall show in Section 5 that this is the case when and the prox function is quadratic.

If an upper bound for is known or assumed, the bound (6) translates into a computable error estimate for the minimal function value. But even in the absence of such an upper bound, we can solve the optimization problem (1) to a target accuracy

(8) |

if we manage to decrease the error factor from its initial value until for some target tolerance . This will be achieved by Algorithm 2.4 defined below. We shall prove for this algorithm complexity bounds on the number of iterations that are independent of the dimension of (which may be infinite), and – apart from a constant factor – best possible under a variety of assumptions on the objective function.

Constructing linear relaxations. The convexity of implies for the bound

(9) |

where denotes a subgradient of at . Therefore (2) always holds with

We can find more general relaxations of the form (2) by accumulating past information. Indeed, if (2) holds, , and then (2) remains valid when we substitute

in place of , as by (9),

For appropriate choices of and , this may give much improved error bounds. We discuss suitable choices for later.

Step size selection. The step size parameter controls the fraction of the new information (9) incorporated into the new relaxation. It is chosen with the hope for a reduction factor of approximately in the current error factor , and must therefore be adapted to the actual progress made.

First we note that in practice, is unknown; hence the numerical value of is meaningless in itself. However, quotients of at different iterations have a meaning, quantifying the amount of progress made.

In the following, we use bars to denote quantities tentatively modified in the current iteration, but they replace the current values of these quantities only if an acceptance criterion is met that we now motivate. We measure progress in terms of the quantity

(10) |

where is a fixed number. A value indicates that we made sufficient progress in that

(11) |

was reduced at least by a fraction of the designed improvement of by ; thus the step size is acceptable or may even be increased if . On the other hand, if , the step size must be reduced significantly to improve the chance of reaching the design goal. Introducing a maximal step size and two parameters with to control the amount of increase or decrease in , we update the step size according to

(12) |

Since updating the linear relaxation and makes sense only when was improved, we obtain the following update scheme.

###### 2.1 Algorithm.

(Update scheme)

global tuning parameters: ;
;
;

input: ;

output: ;

;

if , ;

else ;

end;

;

if ,

; ;
; ;

end;

If denotes the smallest actually occurring step size (which is not known in advance), we have global linear convergence with a convergence factor of . However, and hence this global rate of convergence may depend on the target tolerance ; thus the convergence speed in the limit may be linear or sublinear depending on the properties of the specific function minimized.

Strongly convex relaxations. If is strongly convex, we may know a number such that is still convex. In this case, we have in place of (9) the stronger inequality

(13) |

In the following, we only assume that , thus covering the case of linear relaxations, too.

(13) allows us to construct strongly convex relaxations of the form

(14) |

For example, (14) always holds with

Again more general relaxations of the form (14) are found by accumulating past information.

###### 2.2 Proposition.

###### Proof.

By (13) and the assumptions,

The relaxations (14) lead to the following error bound.

###### 2.3 Proposition.

Note that for , we simply recover the previous results for general convex functions.

An optimal subgradient algorithm. For a nonsmooth convex function, the subgradient at a point does not always determine a direction of descent. However, we may hope to find better points by moving from the best point into the direction of the point (7) used to determine our error bound. We formulate on this basis the following algorithm, for which optimal complexity bounds will be proved in Section 4.

###### 2.4 Algorithm.

(Optimal subgradient algorithm, OSGA)

global tuning parameters: ;
;

input parameters: ; ; ;

output: ;

assumptions: is convex;

begin

choose ; stop if ;

; ;

;
; ;

;

while 1,

; ;

;
;

;

;
; ;

choose with ;

;
;
;

;

stop if some user-defined test is passed;

update , , , ,
by Algorithm 2.1;

end;

end;

Note that the strong convexity parameter needs to be specified to use the algorithm. If is unknown, one may always put (ignoring possible strong convexity), at the cost of possibly slower worst case asymptotic convergence. (Techniques like those used in Juditsky & Nesterov [16] or Gonzaga & Karas [13] for choosing adaptively can probably be applied to the above algorithm to remove the dependence on having to know . However, [16] requires an explicit knowledge of a Lipschitz constant for the gradient, while [13] proves only sublinear convergence. It is not yet clear how to avoid both problems.)

The analysis of the algorithm will be independent of the choice of allowed in Algorithm 2.4. The simplest choice is . If the best function value is stored and updated, each iteration then requires the computation of two function values and and one subgradient .

However, the algorithm allows the incorporation of heuristics to look for improved function values before deciding on the choice of . This may involve additional function evaluations at points selected by a line search procedure (see, e.g., Beck & Teboulle [8]), a bundle optimization (see, e.g., Lan [17]), or a local quadratic approximation (see, e.g., Yu et al. [29]).

## 3 Inequalities for the error factor

The possibility to get worst case complexity bounds rests on the establishment of a strong upper bound on the error factor . This bound depends on global information about the function ; while not necessary for executing the algorithm itself, it is needed for the analysis. Depending on the properties of , global information of different strength can be used, resulting in inequalities of corresponding strength. The key steps in the analysis rely on the following lower bound for the term .

###### 3.1 Proposition.

Let . Then

(16) |

Moreover, if then for all ,

(17) |

(18) |

###### Proof.

###### 3.2 Theorem.

###### Proof.

We first establish some inequalities needed for the later estimation. By convexity of and the definition of ,

By definition of , we have

Hence (13) (with ) implies

By definition of , we conclude from these two inequalities that

Using this, (16) (with in place of and in place of ), and now gives

(20) |

Using (17) (with in place of ) and , we find

(21) |

where

(22) |

If then (19) holds trivially. Thus we assume that . Then

(23) |

Since , we conclude again that (19) holds. Thus (19) holds generally.

Note that the arguments used in this proof did not make use of ; thus (19) even holds when one sets in the algorithm, saving some work.

###### 3.3 Theorem.

If has Lipschitz continuous gradients with Lipschitz constant then, in Algorithm 2.4,

(24) |

###### Proof.

The proof follows the general line of the preceding proof, but now we must consider the information provided by .

Since is monotone decreasing in its first argument and , the hypothesis of (24) implies that

By convexity of and the definition of ,

By definition of , we have

Hence (13) (with ) implies

By definition of , we conclude from the last two inequalities that

Using this, (16) (with in place of and in place of ), and now gives

(25) |

Using (17) (with in place of ) and , we find

(26) |

where

giving

Now

(27) |

so that under the hypothesis of (24)

Thus , and the conclusion of (24) holds.

## 4 Bounds for the number of iterations

We now use the inequalities from Theorem 3.2 and Theorem 3.3 to derive bounds for the number of iterations. The weakest global assumption, mere convexity, leads to the weakest bounds and guarantees sublinear convergence only, while the strongest global assumption, strong convexity and Lipschitz continuous gradients, leads to the strongest bounds guaranteeing -linear convergence. Our main result shows that, asymptotically as , the number of iterations needed by the OSGA algorithm matches the lower bounds on the complexity derived by Nemirovski & Yudin [20], apart from constant factors:

###### 4.1 Theorem.

Suppose that is convex. Then:

(i) (Nonsmooth complexity bound)

If the points generated by Algorithm 2.4 stay in a bounded
region of the interior of , or if is Lipschitz continuous in ,
the total number of iterations needed to reach a point with
is at most .
Thus the asymptotic worst case complexity is when
and when .

(ii) (Smooth complexity bound)

If has Lipschitz continuous gradients with Lipschitz constant ,
the total number of iterations needed by Algorithm 2.4 to
reach a point with is at most
if , and at most if .

In particular, if is strongly convex and differentiable with Lipschitz continuous gradients, holds with arbitrary quadratic prox functions, and we get a complexity bound similar to that achieved by the preconditioned conjugate gradient method for linear systems; cf. Axelsson & Lindskog [6].

Note that (24) generalizes to other situations by replacing (27) with a weaker smoothness property of the form

(28) |

with convex and monotone increasing. For example, this holds with if has subgradients with bounded variation, and with if has Hölder continuous gradients with exponent , and with linear combinations thereof in the composite case considered by Lan [17]. Imitating the analysis below of the two cases stated in the theorem then gives corresponding complexity bounds matching those obtained by Lan.

Theorem 4.1 follows from the two propositions below covering the different cases, giving in each case explicit upper bounds on the number of further iterations needed to complete the algorithm form a point where the values of and given as arguments of were achieved. We write and for the initial values of and . Only the dependence on , , and is made explicit.

###### 4.2 Proposition.

Suppose that the dual norm of the subgradients encountered during the iteration remains bounded by the constant . Let , and define

(i) In each iteration,

(29) |

(ii) The algorithm stops after at most

(30) |

further iterations.

In particular, (i) and (ii) hold when the iterates stay in a bounded region of the interior of , or when is Lipschitz continuous in .

Note that any convex function is Lipschitz continuous in any closed and bounded domain inside its support. Hence if the iterates stay in a bounded region of the interior of , is bounded by the Lipschitz constant of in the closure of the region .

###### Proof.

(i) Condition (29) holds initially, and is preserved in each update unless is reduced. But then , hence . Thus Theorem 3.2 implies

This implies

and since ,

Thus (29) holds again after the reduction, and hence always.

(ii) As the algorithm stops once , (29) implies that in each iteration . As is reduced only when , and then by a fixed factor , this cannot happen more than times in turn. Thus after some number of -reductions we must always have another step with . By (11), this gives a reduction of by a factor of at least . But this implies that the stopping criterion is eventually reached. Therefore the algorithm stops eventually. Since , (12) implies . Therefore

(31) |

Now (31), (11), and (29) imply