On Lower Complexity Bounds for Large-Scale Smooth Convex Optimization [footnote 1]
Footnote 1: This research was supported by the NSF grant CMMI-1232623.
We derive lower bounds on the black-box oracle complexity of large-scale smooth convex minimization problems, with emphasis on minimizing smooth convex functions (those whose gradient is Hölder continuous with a given exponent and constant) over high-dimensional $\|\cdot\|_p$-balls, $1 \le p \le \infty$. Our bounds turn out to be tight (up to factors logarithmic in the design dimension), and can be viewed as a substantial extension of the existing lower complexity bounds for large-scale convex minimization covering the nonsmooth case and the “Euclidean” smooth case (minimization of convex functions with Lipschitz continuous gradients over Euclidean balls). As a byproduct of our results, we demonstrate that the classical Conditional Gradient algorithm is near-optimal, in the sense of Information-Based Complexity Theory, when minimizing smooth convex functions over high-dimensional $\|\cdot\|_\infty$-balls and their matrix analogues, the spectral norm balls in the spaces of square matrices.
Keywords: Smooth Convex Optimization, Lower Complexity Bounds, Optimal Algorithms.
MSC: 90C25, 90C06, 68Q25
The huge sizes of convex optimization problems arising in some modern applications (primarily, in big-data-oriented signal processing and machine learning) are beyond the “practical grasp” of the state-of-the-art Interior Point Polynomial Time methods with their computationally demanding iterations. Indeed, aside from rare cases of problems with “extremely favourable” structure, the arithmetic cost of an interior point iteration is at least cubic in the design dimension $n$ of the instance; for the values of $n$ typical of the outlined applications, this makes a single iteration “last forever.” The standard techniques for handling large-scale convex problems (those beyond the practical grasp of Interior Point methods) are First Order methods (FOM’s). Under favorable circumstances, iterations of FOM’s are much cheaper than those of interior point methods, and the convergence rate, although just sublinear, is fully or nearly dimension-independent, which makes FOM’s the methods of choice when medium-accuracy solutions to large-scale convex programs are sought. Now, as a matter of fact, all known FOM’s are “black-box-oriented”: they “learn” the problem being solved solely via the local information (values and (sub)gradients of the objective and the constraints) accumulated along the search points generated by the algorithm. As a result, the “limits of performance” of FOM’s are governed by Information-Based Complexity Theory. Some basic results in this direction have been established in the literature Nemirovski:1983 (); in particular, the Information-Based Complexity of natural families of convex minimization problems with nonsmooth Lipschitz continuous objectives is well understood, including how the complexity depends on the geometry and the dimension of the domain $X$.
In the smooth case, our understanding is somewhat limited; essentially, tight lower complexity bounds are known only in the case when $X$ is a Euclidean ball and $f$ is a convex function with Lipschitz continuous gradient. Lower bounds here come from least-squares problems Nemirovski:1991 (); Nemirovski:1992 (), and the underlying techniques for generating “hard instances” heavily utilize the rotational invariance of a Euclidean ball.
In this paper, we derive tight lower bounds on the information-based complexity of families of convex minimization problems $\min_{x \in X} f(x)$, where $X$ is an $n$-dimensional $\|\cdot\|_p$-ball, $1 \le p \le \infty$, and $f$ ranges over the family of all continuously differentiable convex objectives with given smoothness parameters (Hölder exponent and constant). We believe that these bounds could be of interest in some modern applications, like $\ell_1$ and nuclear norm minimization in Compressed Sensing, where one seeks to minimize a smooth, most notably quadratic, convex function over a high-dimensional $\|\cdot\|_1$-ball in $\mathbb{R}^n$ or a nuclear norm ball in the space of matrices. Another instructive application of our results is establishing the near-optimality, in the sense of information-based complexity, of the Conditional Gradient (a.k.a. Frank-Wolfe) algorithm as applied to minimizing smooth convex functions over large-scale boxes (or unit balls of the spectral norm on the space of matrices) [footnote 2].
Footnote 2: Originating from Frank:1956 (), the Conditional Gradient algorithm was intensively studied in the 1970’s (see Dem:Rub:1970 (); Pshe:1994 () and references therein); recently, there has been a significant burst of interest in this technique, due to its ability to handle smooth large-scale convex programs on “difficult geometry” domains, see Hazan:2008 (); Jaggi:2011 (); Jaggi:2013 (); HJN:2013 (); CJN:2013 () and references therein.
Our first contribution is a unified framework to prove lower bounds for a variety of domains and different smoothness parameters of the objective with respect to a norm (for consistency we use the norm induced by the domain). In order to construct hard instances for lower bounds we need the normed space under consideration to satisfy a “smoothing property.” Namely, we need the existence of a “smoothing kernel”: a convex function with Lipschitz continuous gradient and “fast growth.” These properties guarantee that the inf-convolution HiriartUrruty:2001 () of a Lipschitz continuous convex function $f$ and the smoothing kernel is smooth, and that its local behaviour depends only on the local behaviour of $f$. A novelty here, if any, stems from the fact that we need Lipschitz continuity of the gradient w.r.t. a given, not necessarily Euclidean, norm, while the standard Moreau envelope technique is adjusted to the case of the Euclidean norm [footnote 3].
Footnote 3: It may well happen that the extensions of the classical Moreau results which we present in Section 2 are known, so that the material in this section does not pretend to be novel. This being said, at this point in time we do not have at our disposal references to the results on smoothing we need, and therefore we decided to augment these simple results with their proofs, in order to make our presentation self-contained.
We establish lower bounds on complexity of smooth convex minimization for general spaces satisfying the smoothing property. Our proof mimics the construction of hard instances for nonsmooth convex minimization Nemirovski:1983 (), which now are properly smoothed by the inf-convolution.
With this general result, we are able to provide a unified analysis of lower bounds for smooth convex minimization over $n$-dimensional $\|\cdot\|_p$-balls, $1 \le p \le \infty$. We show that in the large-scale case, our lower complexity bounds match, within at worst a factor logarithmic in $n$, the upper complexity bounds associated with Nesterov’s fast gradient algorithms Nemirovski:1985 (); Elster:1993 (). When $p = \infty$, this result implies near-optimality of the Conditional Gradient algorithm.
As a final application, we point out how our lower bounds extend to matrix optimization under Schatten norm constraints.
1.2 Related work
Oracle Complexity: The analysis of convex optimization algorithms via oracle complexity and lower complexity bounds was initiated in Nemirovski:1983 (). Other standard references are Nemirovski:1994 (); Nesterov:2004 (). The oracle complexity of smooth convex optimization over Euclidean domains was studied in Nemirovski:1983 (); Nemirovski:1991 (); Nemirovski:1992 ().
For optimal methods under non-Euclidean domains for smooth spaces and -norms, where , we refer to Elster:1993 () (for the case there is an interesting new algorithm that adapts itself to the smoothness parameter in the objective Nesterov:2013 ()).
It should be mentioned that for the case $2 \le p \le \infty$ the lower bounds in this paper were announced in Nemirovski:1985 (); Elster:1993 () (and proved by the second author of this paper); however, aside from the very special case of $p = \infty$, the highly technical original proofs of the bounds were never published. For this reason, we recently revisited the original proofs and were able to simplify them dramatically, thus making them publishable.
The Conditional Gradient algorithm and complexity under Linear Optimization oracles: The recent body of work on the Conditional Gradient algorithm is enormous. For upper bounds on its complexity we refer to Clarkson:2008 (); Hazan:2008 (); Jaggi:2013 (); Lan:2013 (); Garber:2013 (). Interestingly, the last two references include results on linear convergence of the Conditional Gradient method for the strongly convex case, accelerated methods based on Linear Optimization oracles, and applications to stochastic and online convex programming.
Besides these accuracy upper bounds, there are some interesting lower bounds for algorithms based on a Linear Optimization oracle (whose only assumption is that the Linear Optimization oracle returns a solution that is a vertex of the domain): some of these contributions can be found in Jaggi:2013 (); Lan:2013 (). Observe that a Linear Optimization oracle is in general less powerful than an arbitrary local oracle (in particular the first-order one) considered in our paper, and thus their lower bounds do not imply ours. However, our result for $p = \infty$ improves on their lower bounds (disregarding logarithmic factors).
1.3 Notation and preliminaries
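Algorithms and Complexity: In the black-box oracle complexity model for convex optimization we are interested in solving problems of the form
$$\mathrm{Opt}(f) = \min_{x \in X} f(x), \qquad (P_{f,X})$$
where $X$ is a given convex compact subset of a normed space $(E, \|\cdot\|)$, and $f$ is known to belong to a given family $\mathcal{F}$ of continuous convex functions on $E$. This defines the family $\mathcal{P}(\mathcal{F}, X)$ of problems comprised of problems $(P_{f,X})$ with $f \in \mathcal{F}$. We assume that the family $\mathcal{F}$ is equipped with an oracle $\mathcal{O}$ which, formally, is a function $\mathcal{O}(f, x)$ of $f \in \mathcal{F}$ and $x \in E$ taking values in some information space $\mathcal{I}$; when solving $(P_{f,X})$, an algorithm at every step can sequentially call the oracle at a query point $x \in E$, obtaining the value $\mathcal{O}(f, x)$. In the sequel, we always assume the oracle to be local, meaning that for all $x \in E$ and $f, g \in \mathcal{F}$ such that $f = g$ in a neighbourhood of $x$, we have $\mathcal{O}(f, x) = \mathcal{O}(g, x)$. The most common example of an oracle is the first-order oracle, which returns the value $f(x)$ and a subgradient $f'(x)$ of $f$ at $x$. However, observe that when the subdifferential is not a singleton, not every such oracle satisfies the local property, and we need to further restrict it to satisfy locality.
A $T$-step algorithm $\mathcal{M}$, utilizing oracle $\mathcal{O}$, for the family $\mathcal{P}(\mathcal{F}, X)$ is a procedure as follows. As applied to a problem $(P_{f,X})$ with $f \in \mathcal{F}$, $\mathcal{M}$ generates a sequence $x_t$, $t = 1, \dots, T$, of search points according to the recurrence
$$x_t = X_t\left(\{x_\tau, \mathcal{O}(f, x_\tau)\}_{\tau=1}^{t-1}\right),$$
where the search rules $X_t(\cdot)$ are deterministic functions of their arguments; we can identify $\mathcal{M}$ with the collection of these rules. Thus, $x_1$ is specified by $\mathcal{M}$ and is independent of $f$, and all subsequent search points are deterministic functions of the preceding search points and the information on $f$ provided by $\mathcal{O}$ when queried at these points. We treat $x_T$ as the approximate solution generated by the $T$-step solution method $\mathcal{M}$ applied to $(P_{f,X})$, and define the minimax risk associated with the family $\mathcal{P}(\mathcal{F}, X)$ and oracle $\mathcal{O}$ as the function of $T$ defined by
$$\mathrm{Risk}_{\mathcal{F},X,\mathcal{O}}(T) = \inf_{\mathcal{M}} \sup_{f \in \mathcal{F}} \left[ f(x_T) - \mathrm{Opt}(f) \right],$$
where the right hand side infimum is taken over all $T$-step solution algorithms $\mathcal{M}$ utilizing oracle $\mathcal{O}$ and such that $x_T \in X$ for all $f \in \mathcal{F}$. The inverse to the risk function,
$$\mathrm{Compl}_{\mathcal{F},X,\mathcal{O}}(\varepsilon) = \min\{T : \mathrm{Risk}_{\mathcal{F},X,\mathcal{O}}(T) \le \varepsilon\}$$
for $\varepsilon > 0$, is called the information-based (or oracle) complexity of the family $\mathcal{P}(\mathcal{F}, X)$ with respect to oracle $\mathcal{O}$.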
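The black-box model described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names `run_method`, `quadratic_oracle` and `projected_gradient_rule`, as well as the one-dimensional test problem, are hypothetical and do not come from the paper. The point is that the method sees the objective only through oracle answers at its own search points, and that its first point cannot depend on the objective.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch (not notation from the paper).
Point = List[float]
Info = Tuple[float, Point]               # (value, gradient) returned at a query point
Oracle = Callable[[Point], Info]         # first-order oracle of one fixed objective
SearchRule = Callable[[List[Tuple[Point, Info]]], Point]


def run_method(oracle: Oracle, rule: SearchRule, steps: int) -> Point:
    """Run a deterministic `steps`-step method and return its last search point."""
    history: List[Tuple[Point, Info]] = []
    x = rule(history)                    # first point: depends on the method only
    for _ in range(steps - 1):
        history.append((x, oracle(x)))   # query the oracle at the current point
        x = rule(history)                # next point: function of past points and answers
    return x


def quadratic_oracle(x: Point) -> Info:
    """First-order oracle for the illustrative objective f(x) = (x - 0.25)^2."""
    return ((x[0] - 0.25) ** 2, [2.0 * (x[0] - 0.25)])


def projected_gradient_rule(history: List[Tuple[Point, Info]]) -> Point:
    """Search rule of projected gradient descent on the segment [-1, 1]."""
    if not history:
        return [1.0]
    x, (_, g) = history[-1]
    y = x[0] - 0.25 * g[0]
    return [max(-1.0, min(1.0, y))]


x_first = run_method(quadratic_oracle, projected_gradient_rule, steps=1)
x_last = run_method(quadratic_oracle, projected_gradient_rule, steps=20)
```

A lower bound in this model amounts to showing that, whatever `rule` is, some objective in the family leaves `x_last` far from optimal.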
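Geometry and Smoothness: Let $E$ be an $n$-dimensional Euclidean space, and $\|\cdot\|$ be a norm on $E$ (not necessarily the Euclidean one). Let, further, $X$ be a nonempty closed and bounded convex set in $E$. Given a positive real $L$ and $\kappa \in (1, 2]$, consider the family $\mathcal{F}_{\|\cdot\|}(\kappa, L)$ of all continuously differentiable convex functions $f : E \to \mathbb{R}$ which are $(\kappa, L)$-smooth w.r.t. $\|\cdot\|$, i.e. satisfy the relation
$$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|^{\kappa - 1} \quad \forall x, y \in E,$$
where $\|\cdot\|_*$ is the norm conjugate to $\|\cdot\|$. We associate with $\mathcal{F}_{\|\cdot\|}(\kappa, L)$ the family of convex optimization problems $\mathcal{P} = \mathcal{P}(\mathcal{F}_{\|\cdot\|}(\kappa, L), X)$.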
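To make the smoothness condition concrete, here is a small numerical check, assuming the relation has the usual Hölder form $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|^{\kappa - 1}$ with $\kappa \in (1, 2]$. The one-dimensional test function $f(t) = |t|^\kappa / \kappa$, the grid, and the helper names are our own illustrative choices. In one dimension the norm and its conjugate are both the absolute value, and the derivative $\mathrm{sign}(t)|t|^{\kappa - 1}$ satisfies the condition with constant $L = 2^{2 - \kappa}$, attained on symmetric pairs $(-t, t)$.

```python
import itertools
from typing import List


def grad(t: float, kappa: float) -> float:
    """Derivative of f(t) = |t|**kappa / kappa, i.e. sign(t) * |t|**(kappa - 1)."""
    if t == 0.0:
        return 0.0
    sign = 1.0 if t > 0.0 else -1.0
    return sign * abs(t) ** (kappa - 1.0)


def holder_ratio(kappa: float, points: List[float]) -> float:
    """Largest observed |f'(s) - f'(t)| / |s - t|**(kappa - 1) over distinct pairs."""
    worst = 0.0
    for s, t in itertools.combinations(points, 2):
        ratio = abs(grad(s, kappa) - grad(t, kappa)) / abs(s - t) ** (kappa - 1.0)
        worst = max(worst, ratio)
    return worst


grid = [i / 50.0 for i in range(-100, 101)]   # 201 distinct points in [-2, 2]
kappa = 1.5
observed = holder_ratio(kappa, grid)          # empirical Hoelder constant on the grid
bound = 2.0 ** (2.0 - kappa)                  # known constant for this test function
```

With `kappa = 2.0` the same check reduces to the familiar Lipschitz-gradient case with constant 1.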
We assume the family $\mathcal{F}_{\|\cdot\|}(\kappa, L)$ is equipped with a local oracle $\mathcal{O}$. To avoid extra words, we assume that this oracle is at least as powerful as the First Order oracle, meaning that $f'(x)$ is a component of $\mathcal{O}(f, x)$.
Our goal is to establish lower bounds on the risk , taken w.r.t. the oracle , of the just defined family of problems . In the sequel, we focus solely on the ‘large-scale’ case , and the reason is as follows: it is known Nemirovski:1983 () that when , “basically forgets the details specifying ” and is upper-bounded by , where is an absolute constant, with the data of affecting only the hidden factor in the outer and thus irrelevant when . In contrast to this, in the large-scale regime , is (at least in the cases we are about to consider) nearly independent of and goes to 0 sublinearly as grows, and its behavior in this range heavily depends on . In what follows, we focus solely on the large-scale regime.
2 Local Smoothing
In this section we introduce the main component of our technique, a Moreau-type approximation of a nonsmooth convex function $f$ by a smooth one. The main feature of this smoothing, instrumental for our ultimate goals, is that it is local: the local behaviour of the approximation at a point depends solely on the restriction of $f$ onto a neighbourhood of the point, the size of the neighbourhood being under our full control.
2.1 Smoothing Kernel
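Let $E$ be a finite-dimensional Euclidean space, $\|\cdot\|$ be a norm on $E$ (not necessarily induced by the inner product $\langle \cdot, \cdot \rangle$), and $\mathrm{Lip}$ be the set of all convex functions on $E$ that are Lipschitz continuous with constant 1 w.r.t. $\|\cdot\|$. Let also $\phi(\cdot)$ (“smoothing kernel”) be a twice continuously differentiable convex function defined on an open convex set $\mathrm{Dom}\,\phi \subset E$ with the following properties:
A. $0 \in \mathrm{Dom}\,\phi$ and $\phi(0) = 0$, $\phi'(0) = 0$;
B. There exists a compact convex set $G \subset \mathrm{Dom}\,\phi$ such that $0 \in \mathrm{int}\,G$ and $\phi(x) > \|x\|$ for all $x \in \partial G$.
C. For some $M_\phi < \infty$ we have
$$\langle e, \nabla^2 \phi(h) e \rangle \le M_\phi \|e\|^2 \quad \forall (e \in E, h \in G).$$
Note that A and B imply that for all $f \in \mathrm{Lip}$, the function $f(x) + \phi(x)$ attains its minimum on the set $\mathrm{int}\,G$. Indeed, for every $x \in \partial G$ we have $f(x) + \phi(x) > f(x) + \|x\| \ge f(0) = f(0) + \phi(0)$, so that the (clearly existing) minimizer of $f + \phi$ on $G$ is a point from $\mathrm{int}\,G$. As a result, for every $f \in \mathrm{Lip}$ and $x \in E$ one has
$$\min_{h \in \mathrm{Dom}\,\phi} \left[ f(x + h) + \phi(h) \right] = \min_{h \in G} \left[ f(x + h) + \phi(h) \right],$$
and the right hand side minimum is achieved.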
2.2 Approximating a function by smoothing
For and , let
Observe that can be obtained as follows:
We associate with the function ; observe that this function belongs to along with ;
We pass from to its smoothing
It follows that
The latter relation combines with (5) to imply that
As a bottom line, if we can find a function $\phi$ as described above, then for any convex function $f$ with Lipschitz constant 1 w.r.t. $\|\cdot\|$ and every $\chi > 0$ there exists a smooth (i.e., with Lipschitz continuous gradient) approximation of $f$ that satisfies:
is convex and Lipschitz continuous with constant 1 w.r.t. and has a Lipschitz continuous gradient, with constant , w.r.t. :
. Moreover, .
depends on in a local fashion: the value and the derivative of at depends only on the restriction of onto the set .
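The three properties above can be seen at work in the classical Euclidean special case, which we sketch here purely as an illustration. The quadratic kernel $h^2/(2\mu)$, the parameter name `mu`, the grid search, and the test function $f(x) = |x|$ are our own choices and are not the kernel constructed in this paper. For this case the Moreau envelope is known in closed form (the Huber function), its minimizing shift always lies in $[-\mu, \mu]$, and so the smoothed value at $x$ depends only on $f$ restricted to $[x - \mu, x + \mu]$.

```python
def moreau_envelope_abs(x: float, mu: float, grid_size: int = 20001) -> float:
    """Grid approximation of min over h in [-mu, mu] of |x + h| + h**2 / (2 * mu).

    Only values of f(t) = |t| on [x - mu, x + mu] are used, which is the
    locality property in this toy setting.
    """
    best = float("inf")
    for k in range(grid_size):
        h = -mu + 2.0 * mu * k / (grid_size - 1)
        best = min(best, abs(x + h) + h * h / (2.0 * mu))
    return best


def huber(x: float, mu: float) -> float:
    """Closed form of the Moreau envelope of |x| with parameter mu."""
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0


mu = 0.5
check_points = [-2.0, -0.3, 0.0, 0.2, 0.5, 1.7]
smoothed = [moreau_envelope_abs(x, mu) for x in check_points]
# Largest deviation between the grid search and the closed form.
max_error = max(abs(s - huber(x, mu)) for s, x in zip(smoothed, check_points))
# Approximation quality: the envelope stays within mu / 2 below |x|.
max_gap = max(abs(x) - s for s, x in zip(smoothed, check_points))
```

The smoothed function has a gradient that is Lipschitz with constant $1/\mu$, so a smaller `mu` gives a closer but less smooth approximation.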
2.3 Example: -norm smoothing
Let and , and consider the case of , endowed with the standard inner product, and . Assume for a moment that , and let be a real such that . Let also be such that . Let us set
Observe that is twice continuously differentiable on function satisfying A. Besides this, ensures that whenever , so that when , which implies B. Besides, by choosing and selecting close enough to 1, C is satisfied for (for a proof we refer to Appendix B).
For the case of , we can set and, as above, , clearly ensuring A, B, and the validity of with .
Applying the results of the previous section, we get
Proposition 1. Let and be a Lipschitz continuous, with constant w.r.t. the norm , convex function. For every , there exists a convex continuously differentiable function with the following properties:
(i) , for all ;
(ii) for all ;
(iii) For every , the restriction of on a small enough neighbourhood of depends solely on the restriction of on the set
3 Lower Complexity Bounds for Smooth Convex Minimization
In this section we utilize Proposition 1 to prove our main result, namely, a general lower bound on the oracle complexity of smooth convex minimization, and then specify this result for the case of minimization over balls, where .
Proposition 2. Let
I. be a norm on and be a nonempty convex set containing the unit ball of ;
II. be a positive integer and be a positive real with the following property:
One can point out linear forms on , , such that
(a) for , and
(b) for every collection with , it holds
III. and be positive reals such that for properly selected convex twice continuously differentiable on an open convex set function and a convex compact subset the triple satisfies properties A, B, C from Section 2.1 and .
Then for every , , every local oracle and every -step method associated with this oracle there exists a problem with such that
Proof. 1. Let us set
2. Given a permutation of and a collection , we associate with these data the functions
Observe that all these functions belong to due to , for , so that the smoothed functions
whence for all , it holds
Recalling the definition of , we conclude that .
3. Given a local oracle and an associated -step method , let us define a sequence of points in , a permutation of and a collection by the following -step recurrence:
Step 1: is the first point of the trajectory of (this point depends solely on the method and is independent of the problem the method is applied to). We define as the index , , that maximizes , and specify in such a way that . We set
Step , : At the beginning of this step, we have at our disposal the already built points , distinct from each other integers and quantities , for . At step , we build , , , as follows. We set
thus getting a function from , and define its smoothing which, same as above, belongs to . We further define
as the -th point of the trajectory of as applied to ,
as the index that maximizes , over distinct from ,
thus completing step .
After steps of this recurrence, we get at our disposal a sequence of points from , a permutation of indexes and a collection ; these entities define the functions
4. We claim that is the trajectory of as applied to . By construction, indeed is the first point of the trajectory of as applied to . In view of this fact, taking into account the definition of and the locality of the oracle , all we need to support our claim is to verify that for every , , the functions and coincide in some neighbourhood of . By construction, we have that for
Invoking (11), we get
Since both and belong to , it follows that in the -ball of radius centered at , whence, by (12),
From we have that and coincide on the set , whence, as we know from item S.3 in Section 2.2, and coincide in a neighbourhood of , as claimed.
5. We have
whence, by item S.2 in Section 2.2, , implying that
On the other hand, by (7) there exists such that , whence and thus . Since, as we have seen, is the trajectory of as applied to , is the approximate solution generated by as applied to , and we see that the inaccuracy of this solution, in terms of the objective, is at least as required. Besides this, is of the form , and we have seen that all these functions belong to . ∎
Note that the previous result immediately implies the lower bound
on the complexity of the family , provided contains the unit -ball. Note that this bound is independent of the local oracle .
The case where contains a -ball of radius instead of the unit ball can be reduced to the latter case by scaling instances , which corresponds to the transformation of the smoothness parameters. Thus, assuming that contains -ball of radius , we have
4 Case of
In this section we provide lower complexity bounds for smooth convex optimization over -balls for the case when . In section 4.1 we show that Proposition 2 implies nearly tight optimal complexity bounds for the range ; moreover, for fixed and finite , the bound is tight within a factor depending solely on . For the case , our lower bound matches the approximation guarantees of the Conditional Gradient algorithm, up to a logarithmic factor, proving near-optimality of the algorithm.
In section 4.2 we study the range . Here we prove nearly optimal complexity bounds by using nearly-Euclidean sections of the -ball, together with the lower bound.
4.1 Smooth Convex Minimization over -balls,
Consider the case when is the norm on , . Given a positive integer , let us specify , , as the first standard basis vectors (orths), so that for every collection one clearly has
Invoking the results from Section 2.3 (cf. Proposition 1), we see that when is a convex set containing the unit -ball, Assumptions II and III in Proposition 2 are satisfied with , and . Applying Proposition 2, we arrive at
Corollary 2. Let , , , and let be a convex set containing the unit ball w.r.t. . Then, for every and every local oracle , the minimax risk of the family of problems with admits the lower bound
independent of the local oracle in use.
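The role of the standard basis vectors in Section 4.1 can be checked numerically. The sketch below assumes that the property being used is the identity $\min_{\|x\|_p \le 1} \max_{1 \le i \le T} \xi_i x_i = -T^{-1/p}$ for every sign pattern $\xi \in \{-1, 1\}^T$; this is our reading of the stripped display, and the values $T = 4$, $p = 3$, the sign pattern, and all function names are hypothetical. The candidate minimizer is $x_i = -\xi_i T^{-1/p}$, and random sampling of the unit ball should find no better point.

```python
import random
from typing import List


def p_norm(x: List[float], p: float) -> float:
    """The ell_p norm of a vector, for finite p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def worst_sign_value(x: List[float], signs: List[int]) -> float:
    """Value max_i signs[i] * x[i] of the piecewise-linear function at x."""
    return max(s * xi for s, xi in zip(signs, x))


def sample_unit_ball(dim: int, p: float, rng: random.Random) -> List[float]:
    """Draw a point of the unit ell_p ball by rescaling a random direction."""
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = p_norm(v, p)
    if norm == 0.0:
        return v
    radius = rng.random()
    return [radius * vi / norm for vi in v]


rng = random.Random(0)                        # fixed seed for reproducibility
T, p = 4, 3.0
signs = [1, -1, -1, 1]
candidate = [-s * T ** (-1.0 / p) for s in signs]
candidate_value = worst_sign_value(candidate, signs)
sampled_min = min(
    worst_sign_value(sample_unit_ball(T, p, rng), signs) for _ in range(20000)
)
```

Random sampling is only a sanity check, not a proof; the inequality itself follows from comparing the maximum with the average of the terms $\xi_i x_i$ and applying Hölder's inequality.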
Let us discuss some interesting consequences of the above result.
A. Complexity of smooth minimization over the box: Corollary 2 implies that when is the unit -ball in , the -step minimax risk of minimizing over of objectives from the family in the range is lower-bounded by . On the other hand, from the standard efficiency estimate of Conditional Gradient algorithm (see, e.g., Dem:Rub:1970 (); Pshe:1994 (); CJN:2013 ()) it follows that when applying the method to minimizing over a function over a convex compact domain of -diameter , the inaccuracy after steps does not exceed
We see that when is in-between two -balls with ratio of sizes , the lower complexity bound coincides with the upper one within the factor . In particular, when minimizing functions over the -dimensional unit box , the performance of the Conditional Gradient algorithm, as expressed by its minimax risk, cannot be improved by more than a logarithmic factor, for any local oracle in use. In fact, the same conclusion remains true when and the unit box are replaced with and the unit -ball with “large” , specifically, .
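For readers unfamiliar with the Conditional Gradient method discussed above, the following is a minimal sketch of it on the unit box $[-1, 1]^n$. It is not the authors' implementation: the open-loop step size $2/(t + 2)$, the starting point, the quadratic test objective, and the names `conditional_gradient_box`, `objective` and `gradient` are our own assumptions. The only access to the domain is a linear minimization step, which over the box simply picks the vertex whose signs oppose those of the gradient.

```python
from typing import Callable, List

Vector = List[float]


def conditional_gradient_box(
    grad: Callable[[Vector], Vector], n: int, steps: int
) -> Vector:
    """Frank-Wolfe iterations over the box [-1, 1]^n with step size 2 / (t + 2)."""
    x = [0.0] * n                            # any feasible starting point works
    for t in range(steps):
        g = grad(x)
        # Linear minimization over the box: the vertex opposing the gradient signs.
        s = [-1.0 if gi > 0.0 else 1.0 for gi in g]
        gamma = 2.0 / (t + 2.0)
        # Convex combination keeps the iterate inside the box.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x


center = [0.3, -0.5, 0.9]                    # minimizer of the test objective, inside the box


def objective(x: Vector) -> float:
    """Illustrative smooth convex objective 0.5 * ||x - center||_2^2, with minimum 0."""
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, center))


def gradient(x: Vector) -> Vector:
    return [xi - ci for xi, ci in zip(x, center)]


steps = 200
x_final = conditional_gradient_box(gradient, n=3, steps=steps)
gap = objective(x_final)                     # optimality gap, since the minimum is 0
```

Each iteration costs one gradient evaluation and one sign computation, which is what makes the method attractive on domains where projections are expensive.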
B. Tightness: In fact, in the case of the lower complexity bounds for smooth convex minimization over -balls established in Corollary 2 are tight: it is shown in Nemirovski:1985 (), see also (Elster:1993, Section 2.3), that a properly modified Nesterov’s algorithm for smooth convex optimization via the first-order oracle, as applied to problems of minimizing functions from over the -dimensional unit -ball , for any number of steps ensures that
with depending solely on , which is in full accordance with (15).
4.2 Smooth Convex Minimization over -balls,
We have obtained lower complexity bounds for smooth convex minimization over -balls, where . Now we consider the case . We will build nearly tight bounds by reducing to the case of .
Proposition 3. Let , , , and let be a convex set containing the unit -ball. For properly selected absolute constant and for every , the minimax risk of the family of problems with admits the lower bound
independent of the local oracle in use.
Proof. 1. By Dvoretzky’s Theorem for the -ball (Pisier:1989, Theorem 4.15), there exists an absolute constant , such that for any positive integer there is a subspace of dimension , and an ellipsoid centered at the origin, such that
Let be linear forms on such that . By the second inclusion in (17), for every , the maximum of the linear form over does not exceed , whence, by the Hahn-Banach Theorem, the form can be extended from to a linear form on the entire to have the maximum over not exceeding . In other words, we can point out vectors , such that for every and , for all . Now consider the linear mapping
By the above, the operator norm of this mapping induced by the norms on the argument and on the image spaces does not exceed 1. As a result, when belongs to , the function defined by , for , belongs to the family [footnote 4]. Setting , we get a convex compact set in .
Footnote 4: To avoid abuse of notation, we have added to our usual notation for families of smooth convex functions a superscript indicating the argument dimension of the functions in question.
2. Observe that an optimization problem of the form
can be naturally reduced to the problem
and when the objective of the former problem belongs to , the objective of the latter problem belongs to . It is intuitively clear that the outlined reducibility implies that the complexity of solving problems from the family cannot be smaller than the complexity of solving problems from the family . Taking this claim for granted (for a proof, see Appendix C), let us derive from it the desired result. To this end, observe that from the first inclusion in (17) it follows that contains the -ball of radius centered at the origin (indeed, by construction this ball is already contained in the image of ). By Corollary 2 as applied to and to in the role of , the worst-case inaccuracy, over problems from the family , of any -step method based on a local oracle is at least
Finally, we remark that the lower complexity bound stated in Proposition 3 in the smooth case is, to the best of our knowledge, new (the nonsmooth case was considered already in Nemirovski:1983 ()). This lower bound matches, up to factors logarithmic in $n$, the upper complexity bound for the family in question, see Elster:1993 ().
4.3 Matrix case
We have proved lower bounds for smooth optimization over $\|\cdot\|_p$-balls for all $1 \le p \le \infty$. Now we show how these bounds can be used for proving lower complexity bounds on smooth convex minimization over Schatten norm balls in the spaces of matrices. Recall that the Schatten $p$-norm of an $n \times n$ matrix $x$ is, by definition, the $\|\cdot\|_p$-norm of the vector of singular values of $x$. The problems we are interested in now are of the form
Observe that Corollary 2 remains true when replacing in it the embedding space of with the space of matrices, the norm on with the Schatten norm , and the requirement “ is a convex set containing the unit ball of ” with the requirement “ is a convex set containing the unit ball of .” This claim is an immediate consequence of the fact that when restricting an matrix onto its diagonal, we get a linear mapping of onto , and the factor norm on induced, via this mapping, by is nothing but the usual -norm. Consequently, minimizing a function from