Relative to the large literature on upper bounds on the complexity of convex optimization, less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.
Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization
Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, Martin J. Wainwright
Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA
Department of Statistics, UC Berkeley, Berkeley, CA
Mathematical Sciences, QUT, Brisbane, Australia
Department of Computer Sciences, UT Austin, Austin, TX
July 13, 2019
Convex optimization forms the backbone of many algorithms for statistical learning and estimation. Given that many statistical estimation problems are large-scale in nature—with the problem dimension and/or sample size being large—it is essential to make efficient use of computational resources. Stochastic optimization algorithms are an attractive class of methods, known to yield moderately accurate solutions in a relatively short time . Given the popularity of such stochastic optimization methods, understanding the fundamental computational complexity of stochastic convex optimization is thus a key issue for large-scale learning. A large body of literature is devoted to obtaining rates of convergence of specific procedures for various classes of convex optimization problems. A typical outcome of such analysis is an upper bound on the error—for instance, gap to the optimal cost—as a function of the number of iterations. Such analyses have been performed for many standard optimization algorithms, among them gradient descent, mirror descent, interior point programming, and stochastic gradient descent, to name a few. We refer the reader to various standard texts on optimization (e.g., [2, 3, 4]) for further details on such results.
On the other hand, there has been relatively little study of the inherent complexity of convex optimization problems. To the best of our knowledge, the first formal study in this area was undertaken in the seminal work of Nemirovski and Yudin , hereafter referred to as NY. One obstacle to a classical complexity-theoretic analysis, as these authors observed, is that of casting convex optimization problems in a Turing Machine model. They avoided this problem by instead considering a natural oracle model of complexity, in which at every round the optimization procedure queries an oracle for certain information on the function being optimized. This information can be either noiseless or noisy, depending on whether the goal is to lower bound the oracle complexity of deterministic or stochastic optimization algorithms. Working within this framework, the authors obtained a series of lower bounds on the computational complexity of convex optimization problems, both in deterministic and stochastic settings. In addition to the original text NY , we refer the interested reader to the book by Nesterov , and the lecture notes by Nemirovski  for further background.
In this paper, we consider the computational complexity of stochastic convex optimization within this oracle model. In particular, we improve upon the work of NY for stochastic convex optimization in three ways. First, our lower bounds have an improved dependence on the dimension of the space. In the context of statistical estimation, these bounds show how the difficulty of the estimation problem increases with the number of parameters. Second, our techniques naturally extend to give sharper results for optimization over simpler function classes. We show that the complexity of optimization for strongly convex losses is smaller than that for convex, Lipschitz losses. Third, we show that for a fixed function class, if the set of optimizers is assumed to have special structure such as sparsity, then the fundamental complexity of optimization can be significantly smaller. All of our proofs exploit a new notion of the discrepancy between two functions that appears to be natural for optimization problems. They involve a reduction from a statistical parameter estimation problem to the stochastic optimization problem, and an application of information-theoretic lower bounds for the estimation problem. We note that special cases of the first two results in this paper appeared in the extended abstract, and that a related study was independently undertaken by Raginsky and Rakhlin.
The remainder of this paper is organized as follows. We begin in
Section 2 with background on oracle complexity, and a
precise formulation of the problems addressed in this paper.
Section 3 is devoted to the statement of our main
results, and discussion of their consequences. In
Section 4, we provide the proofs of our main results,
which all exploit a common framework of four steps. More technical
aspects of these proofs are deferred to the appendices.
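For the convenience of the reader, we collect here some notation used throughout the paper. For $p \in [1, \infty]$, we use $\|x\|_p$ to denote the $\ell_p$-norm of a vector $x \in \mathbb{R}^d$, and we let $q$ denote the conjugate exponent, satisfying $\frac{1}{p} + \frac{1}{q} = 1$. For two distributions $\mathbb{P}$ and $\mathbb{Q}$, we use $D(\mathbb{P} \,\|\, \mathbb{Q})$ to denote the Kullback-Leibler (KL) divergence between the distributions. The notation $\mathbb{I}(A)$ refers to the 0-1 valued indicator random variable of the set $A$. For two vectors $\alpha, \beta \in \{-1, +1\}^d$, we define the Hamming distance $\Delta_H(\alpha, \beta) := \sum_{i=1}^d \mathbb{I}[\alpha_i \neq \beta_i]$. Given a convex function $f : \mathbb{R}^d \to \mathbb{R}$, the subdifferential of $f$ at $x$ is the set $\partial f(x) = \{ z \in \mathbb{R}^d \mid f(y) \ge f(x) + \langle z, y - x \rangle \text{ for all } y \in \mathbb{R}^d \}$.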
2 Background and problem formulation
We begin by introducing background on the oracle model of convex optimization, and then turn to a precise specification of the problem to be studied.
2.1 Convex optimization in the oracle model
Convex optimization is the task of minimizing a convex function over a convex set . Assuming that the minimum is achieved, it corresponds to computing an element that achieves the minimum—that is, an element . An optimization method is any procedure that solves this task, typically by repeatedly selecting values from . For a given class of optimization problems, our primary focus in this paper is to determine lower bounds on the computational cost, as measured in terms of the number of (noisy) function and subgradient evaluations, required to obtain an -optimal solution to any optimization problem within the class.
More specifically, we follow the approach of Nemirovski and Yudin , and measure computational cost based on the oracle model of optimization. The main components of this model are an oracle and an information set. An oracle is a (possibly random) function that answers any query by returning an element in an information set . The information set varies depending on the oracle; for instance, for an exact oracle of order, the answer to a query consists of and the first derivatives of at . For the case of stochastic oracles studied in this paper, these values are corrupted with zero-mean noise with bounded variance. We then measure the computational labor of any optimization method as the number of queries it poses to the oracle.
In particular, given a positive integer $T$ corresponding to the number of iterations, an optimization method $\mathcal{M}$ designed to approximately minimize the convex function $f$ over the convex set $S$ proceeds as follows. At any given iteration $t = 1, \ldots, T$, the method $\mathcal{M}$ queries at $x_t \in S$, and the oracle reveals the information $\phi(x_t, f)$. The method then uses the information $\{\phi(x_1, f), \ldots, \phi(x_t, f)\}$ to decide at which point $x_{t+1}$ the next query should be made. For a given oracle function $\phi$, let $\mathbb{M}_T$ denote the class of all optimization methods $\mathcal{M}$ that make $T$ queries according to the procedure outlined above. For any method $\mathcal{M} \in \mathbb{M}_T$, we define its error on function $f$ after $T$ steps as
$$\epsilon_T(\mathcal{M}, f, S, \phi) := f(x_T) - \min_{x \in S} f(x) = f(x_T) - f(x^*_f), \qquad (1)$$
where $x_T$ is the method's query at time $T$. Note that by definition of $x^*_f$ as a minimizing argument, this error is a non-negative quantity.
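When the oracle is stochastic, the method's query $x_T$ at time $T$ is itself random, since it depends on the random answers provided by the oracle. In this case, the optimization error $\epsilon_T(\mathcal{M}, f, S, \phi)$ is also a random variable. Accordingly, for the case of stochastic oracles, we measure the accuracy in terms of the expected value $\mathbb{E}_\phi[\epsilon_T(\mathcal{M}, f, S, \phi)]$, where the expectation is taken over the oracle randomness. Given a class of functions $\mathcal{F}$ defined over a convex set $S$ and a class $\mathbb{M}_T$ of all optimization methods based on $T$ oracle queries, we define the minimax error
$$\epsilon^*_T(\mathcal{F}, S; \phi) := \inf_{\mathcal{M} \in \mathbb{M}_T} \, \sup_{f \in \mathcal{F}} \, \mathbb{E}_\phi\big[\epsilon_T(\mathcal{M}, f, S, \phi)\big]. \qquad (2)$$
In the sequel, we provide results for particular classes of oracles. So as to ease the notation, when the oracle is clear from the context, we simply write $\epsilon^*_T(\mathcal{F}, S)$.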
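To make the query protocol concrete, the following minimal Python sketch runs a method against a stochastic first-order oracle for $T$ rounds and estimates the expected error at the final query by Monte Carlo. The objective $f(x) = |x|$ on $S = [-1, 1]$, the Gaussian noise, the projected subgradient update, and all function names are illustrative assumptions rather than constructions from this paper.

```python
import random

def run_method(oracle, update, x0, T):
    """Query the oracle T times and return the query points x_1, ..., x_T."""
    x, history, queries = x0, [], []
    for _ in range(T):
        queries.append(x)
        history.append(oracle(x))   # noisy (value, subgradient) answer
        x = update(x, history)      # next query depends only on past answers
    return queries

def noisy_abs_oracle(x, rng, noise=1.0):
    """Hypothetical stochastic first-order oracle for f(x) = |x| (zero-mean noise)."""
    sub = 1.0 if x >= 0 else -1.0
    return abs(x) + rng.gauss(0, noise), sub + rng.gauss(0, noise)

def projected_sgd_step(x, history, radius=1.0):
    """Projected stochastic subgradient step with step size 1/sqrt(t)."""
    t = len(history)
    _, g = history[-1]
    return max(-radius, min(radius, x - g / t ** 0.5))

def expected_error(T, trials=200, seed=0):
    """Monte Carlo estimate of E[f(x_T) - min_S f] for f(x) = |x| on [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        queries = run_method(lambda x: noisy_abs_oracle(x, rng),
                             projected_sgd_step, 1.0, T)
        total += abs(queries[-1])   # the minimum of |x| over [-1, 1] is 0
    return total / trials
```

The error is evaluated at the last query $x_T$, matching the definition above; the minimax error would additionally take a supremum over the function class and an infimum over methods, which this sketch does not attempt.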
2.2 Stochastic first-order oracles
In this paper, we study stochastic oracles for which the information set consists of pairs of noisy function and subgradient evaluations. More precisely, we have:
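Definition 1. For a given set $S$ and function class $\mathcal{F}$, the class of first-order stochastic oracles consists of random mappings $\phi : S \times \mathcal{F} \to \mathcal{I}$ of the form $\phi(x, f) = (\hat{f}(x), \hat{v}(x))$ such that
$$\mathbb{E}[\hat{f}(x)] = f(x), \qquad \mathbb{E}[\hat{v}(x)] \in \partial f(x), \qquad \text{and} \qquad \mathbb{E}\big[\|\hat{v}(x)\|_p^2\big] \le \sigma^2. \qquad (3)$$
We use $\mathbb{O}_{p,\sigma}$ to denote the class of all stochastic first-order oracles with parameters $(p, \sigma)$. Note that the first two conditions imply that $\hat{f}(x)$ is an unbiased estimate of the function value $f(x)$, and that $\hat{v}(x)$ is an unbiased estimate of a subgradient $z \in \partial f(x)$. When $f$ is actually differentiable, then $\hat{v}(x)$ is an unbiased estimate of the gradient $\nabla f(x)$. The third condition in equation (3) controls the "noisiness" of the subgradient estimates in terms of the $\ell_p$-norm.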
Stochastic gradient methods are a widely used class of algorithms that can be understood as operating based on information provided by a stochastic first-order oracle. As a particular example, consider a function $f \in \mathcal{F}$ of the separable form $f(x) = \frac{1}{n} \sum_{i=1}^n h_i(x)$, where each $h_i$ is differentiable. Functions of this form arise very frequently in statistical problems, where each term $i$ corresponds to a different sample and the overall cost function is some type of statistical loss (e.g., maximum likelihood, support vector machines, boosting, etc.). The natural stochastic gradient method for this problem is to choose an index $i \in \{1, \ldots, n\}$ uniformly at random, and then to return the pair $(h_i(x), \nabla h_i(x))$. Taking averages over the randomly chosen index $i$ yields $\frac{1}{n} \sum_{i=1}^n h_i(x) = f(x)$, so that $h_i(x)$ is an unbiased estimate of $f(x)$, with an analogous unbiased property holding for the gradient of $h_i(x)$.
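As a small illustration of this sampling oracle, the sketch below builds a separable objective from least-squares terms and verifies, by averaging over all indices, that the returned value and gradient are unbiased. The quadratic terms $\frac{1}{2}(x - a_i)^2$, the data vector `a`, and the function names are hypothetical choices made only for this example.

```python
import random

def make_sample_oracle(a):
    """Build f(x) = (1/n) * sum_i 0.5 * (x - a[i])**2 and a sampling oracle for it."""
    n = len(a)

    def value(x):
        return sum(0.5 * (x - ai) ** 2 for ai in a) / n

    def grad(x):
        return sum(x - ai for ai in a) / n

    def oracle(x, rng):
        i = rng.randrange(n)                    # index chosen uniformly at random
        return 0.5 * (x - a[i]) ** 2, x - a[i]  # value and gradient of the i-th term

    def oracle_mean(x):
        """Exact expectation of the oracle answer: each index has probability 1/n."""
        answers = [(0.5 * (x - ai) ** 2, x - ai) for ai in a]
        mean_value = sum(v for v, _ in answers) / n
        mean_grad = sum(g for _, g in answers) / n
        return mean_value, mean_grad

    return value, grad, oracle, oracle_mean

value, grad, oracle, oracle_mean = make_sample_oracle([1.0, 2.0, 4.0])
mean_value, mean_grad = oracle_mean(0.5)
```

Here `mean_value` and `mean_grad` coincide with `value(0.5)` and `grad(0.5)` up to rounding, which is the unbiasedness property required of a stochastic first-order oracle.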
2.3 Function classes of interest
We now turn to the classes of convex functions for which we study oracle complexity. In all cases, we consider real-valued convex functions defined over some convex set . We assume without loss of generality that contains an open set around , and many of our lower bounds involve the maximum radius such that
Our first class consists of convex Lipschitz functions:
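Definition 2. For a given convex set $S \subseteq \mathbb{R}^d$ and parameter $p \in [1, \infty]$, the class $\mathcal{F}_{cv}(S, L, p)$ consists of all convex functions $f : S \to \mathbb{R}$ such that
$$|f(x) - f(y)| \le L \, \|x - y\|_q \qquad \text{for all } x, y \in S. \qquad (5)$$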
We have defined the Lipschitz condition (5) in terms of the conjugate exponent , defined by the relation . To be clear, our motivation in doing so is to maintain consistency with our definition of the stochastic first-order oracle, in which we assumed that . We note that the Lipschitz condition (5) is equivalent to the condition
If we consider the case of a differentiable function $f$, the unbiasedness condition in Definition 1 implies that
$$\|\nabla f(x)\|_p = \big\|\mathbb{E}[\hat{v}(x)]\big\|_p \overset{(a)}{\le} \mathbb{E}\,\|\hat{v}(x)\|_p \overset{(b)}{\le} \sqrt{\mathbb{E}\,\|\hat{v}(x)\|_p^2} \le \sigma,$$
where inequality (a) follows from the convexity of the $\ell_p$-norm and Jensen's inequality, and inequality (b) is a result of Jensen's inequality applied to the concave function $\sqrt{x}$. This bound implies that $f$ must be Lipschitz with constant at most $\sigma$ with respect to the dual $\ell_q$-norm. Therefore, we necessarily must have $L \le \sigma$, in order for the function class from Definition 2 to be consistent with the stochastic first-order oracle.
A second function class consists of strongly convex functions, defined as follows:
Definition 3. For a given convex set and parameter , the class consists of all convex functions such that the Lipschitz condition (5) holds, and such that satisfies the -strong convexity condition
In this paper, we restrict our attention to the case of strong convexity with respect to the -norm. (Similar results on the oracle complexity for strong convexity with respect to different norms can be obtained by straightforward modifications of the arguments given here). For future reference, it should be noted that the Lipschitz constant and strong convexity constant interact with one another. In particular, whenever contains the -ball of radius , the Lipschitz and strong convexity constants must satisfy the inequality
In order to establish this inequality, we note that the strong convexity condition with implies that
We now choose the pair such that and . Such a choice is possible whenever contains the ball of radius . Since we have , this choice yields , which establishes the claim.
As a third example, we study the oracle complexity of optimization over the class of convex functions that have sparse minimizers. This class of functions is well-motivated, since a large body of statistical work has studied the estimation of vectors, matrices and functions under various types of sparsity constraints. A common theme in this line of work is that the ambient dimension enters only logarithmically, and so has a mild effect. Consequently, it is natural to investigate whether the complexity of optimization methods also enjoys such a mild dependence on ambient dimension under sparsity assumptions.
For a vector , we use to denote the number of non-zero elements in . Recalling the set from Definition 2, we now define a class of Lipschitz functions with sparse minimizers.
Definition 4. For a convex set and positive integer , let
be the class of all convex functions that are -Lipschitz in the -norm, and have at least one -sparse optimizer.
We frequently use the shorthand notation when the set and parameter are clear from context.
3 Main results and their consequences
With the setup of stochastic convex optimization in place, we are now in a position to state the main results of this paper, and to discuss some of their consequences. As previously mentioned, a subset of our results assume that the set contains an ball of radius . Our bounds scale with , thereby reflecting the natural dependence on the size of the set . Also, we set the oracle second moment bound to be the same as the Lipschitz constant in our results.
3.1 Oracle complexity for convex Lipschitz functions
We begin by analyzing the minimax oracle complexity of optimization for the class of bounded and convex Lipschitz functions from Definition 2.
Theorem 1. Let be a convex set such that for some . Then for a universal constant , the minimax oracle complexity over the class satisfies the following lower bounds:
Nemirovski and Yudin proved the lower bound for the function class , in the special case that is the unit ball of a given norm, and the functions are Lipschitz in the corresponding dual norm. For , they established the minimax optimality of this dimension-independent result by appealing to a matching upper bound achieved by the method of mirror descent. In contrast, here we do not require the two norms (namely, that constraining the set and that for the Lipschitz constraint) to be dual to one another; instead, we give lower bounds in terms of the largest ball contained within the constraint set . As discussed below, our bounds do include the results for the dual setting of past work as a special case, but more generally, by examining the relative geometry of an arbitrary set with respect to the ball, we obtain results for arbitrary sets. (We note that the constraint is natural in many optimization problems arising in machine learning settings, in which upper and lower bounds on variables are often imposed.) Thus, in contrast to the past work of NY on stochastic optimization, our analysis gives sharper dimension dependence under more general settings. It also highlights the role of the geometry of the set in determining the oracle complexity.
In general, our lower bounds cannot be improved, and hence specify the optimal minimax oracle complexity. We consider here some examples to illustrate their sharpness. Throughout we assume that is large enough to ensure that the term attains the lower bound and not the term. (This condition is reasonable given our goal of understanding the rate as increases, as opposed to the transient behavior over the first few iterations.)
We start from the special case that has been primarily considered in past works. We consider the class with and the stochastic first-order oracles for this class. Then the radius of the largest ball inscribed within the scales as . By inspection of the lower bounds (9) and (10), we see that
As mentioned previously, the dimension-independent lower bound for the case was demonstrated in Chapter 5 of NY, and shown to be optimal since it is achieved using mirror descent with the prox-function . (There is an additional logarithmic factor in the upper bounds for .) For the case of , the lower bounds are also unimprovable, since they are again achieved (up to constant factors) by stochastic gradient descent. See Appendix C for further details on these matching upper bounds.
Let us now consider how our bounds can also make sharp predictions for non-dual geometries, using the special case . For this choice, we have , and hence Theorem 1 implies that for all , the minimax oracle complexity is lower bounded as
This lower bound is sharp for all . Indeed, for any convex set , stochastic gradient descent achieves a matching upper bound (see Section 5.2.4, p. 196 of NY , as well as Appendix C in this paper for further discussion).
As another example, suppose that . Observe that this -norm unit ball satisfies the relation , so that we have . Consequently, for this choice, the lower bound (9) takes the form
which is a dimension-independent lower bound. This lower bound for is indeed tight for , and as before, this rate is achieved by stochastic gradient descent .
Turning to the case of , when , the lower bound (10) can be achieved (up to constant factors) using mirror descent with the dual norm ; for further discussion, we again refer the reader to Section 5.2.1, p. 190 of NY , as well as to Appendix C of this paper. Also, even though this lower bound requires the oracle to have only bounded variance, our proof actually uses a stochastic oracle based on Bernoulli random variables, for which all moments exist. Consequently, at least in general, our results show that there is no hope of achieving faster rates by restricting to oracles with bounds on higher-order moments. This is an interesting contrast to the case of having less than two moments, in which the rates are slower. For instance, as shown in Section 5.3.1 of NY , suppose that the gradient estimates in a stochastic oracle satisfy the moment bound for some . In this setting, the oracle complexity is lower bounded by . Since for all , there is a significant penalty in convergence rates for having less than two bounded moments.
Even though the results have been stated in a first-order stochastic oracle model, they actually hold in a stronger sense. Let denote the -order derivative of evaluated at , when it exists. With this notation, our results apply to an oracle that responds with a random function such that
along with appropriately bounded second moments of all the derivatives. Consequently, higher-order gradient information cannot improve convergence rates in a worst-case setting. Indeed, the result continues to hold even for the significantly stronger oracle that responds with a random function that is a noisy realization of the true function. In this sense, our result is close in spirit to a statistical sample complexity lower bound. Our proof technique is based on constructing a “packing set” of functions, and thus has some similarity to techniques used in statistical minimax analysis (e.g., [9, 10, 11, 12]) and learning theory (e.g., [13, 14, 15]). A significant difference, as will be shown shortly, is that the metric of interest for optimization is very different than those typically studied in statistical minimax theory.
3.2 Oracle complexity for strongly convex Lipschitz functions
We now turn to the statement of lower bounds over the class of Lipschitz and strongly convex functions from Definition 3. In all these statements, we assume that , as is required for the definition of to be sensible.
Theorem 2. Let . Then there exist universal constants such that the minimax oracle complexity over the class satisfies the following lower bounds:
For , we have
For , we have:
As with Theorem 1, these lower bounds are sharp. In particular, for , stochastic gradient descent achieves the rate (12) up to logarithmic factors, and closely related algorithms proposed in very recent works [17, 18] match the lower bound exactly up to constant factors. It should be noted that Theorem 2 exhibits an interesting phase transition between two regimes. On one hand, suppose that the strong convexity parameter is large: then as long as is sufficiently large, the first term determines the minimax rate, which corresponds to the fast rate possible under strong convexity. In contrast, if we consider a poorly conditioned objective with , then the term involving is dominant, corresponding to the rate for a convex objective. This behavior is natural, since Theorem 2 recovers (as a special case) the convex result with . However, it should be noted that Theorem 2 applies only to the set , and not to arbitrary sets like Theorem 1. Consequently, the generalization of Theorem 2 to arbitrary convex, compact sets remains an interesting open question.
3.3 Oracle complexity for convex Lipschitz functions with sparse optima
Finally, we turn to the oracle complexity of optimization over the class from Definition 4.
Theorem 3. Let be the class of all convex functions that are -Lipschitz with respect to the norm and that have a -sparse optimizer. Let be a convex set with . Then there exists a universal constant such that for all , we have
If for some (so that ), then this bound is sharp up to constant factors. In particular, suppose that we use mirror descent based on the norm with . As we discuss in more detail in Appendix C, it can be shown that this technique will achieve a solution accurate to within iterations; this achievable result matches our lower bound (14) up to constant factors under the assumed scaling . To the best of our knowledge, Theorem 3 provides the first tight lower bound on the oracle complexity of sparse optimization.
4 Proofs of results
We now turn to the proofs of our main results. We begin in Section 4.1 by outlining the framework and establishing some basic results on which our proofs are based. Sections 4.2 through 4.4 are devoted to the proofs of Theorems 1 through 3 respectively.
4.1 Framework and basic results
We begin by establishing a basic set of results that are exploited in the proofs of the main results. At a high level, our main idea is to show that the problem of convex optimization is at least as hard as estimating the parameters of Bernoulli variables, that is, the biases of independent coins. In order to perform this embedding, for a given error tolerance , we start with an appropriately chosen subset of the vertices of a -dimensional hypercube, each of which corresponds to some values of the Bernoulli parameters. For a given function class, we then construct a "difficult" subclass of functions that are indexed by these vertices of the hypercube. We then show that being able to optimize any function in this subclass to -accuracy requires identifying the hypercube vertex. This is a multiway hypothesis test based on the observations provided by queries to the stochastic oracle, and we apply Fano's inequality or Le Cam's bound [20, 12] to lower bound the probability of error. In the remainder of this section, we provide more detail on each of the steps involved in this embedding.
4.1.1 Constructing a difficult subclass of functions
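Our first step is to construct a subclass of functions $\mathcal{G} \subseteq \mathcal{F}$ that we use to derive lower bounds. Any such subclass is parametrized by a subset $\mathcal{V} \subseteq \{-1, +1\}^d$ of the hypercube, chosen as follows. Recalling that $\Delta_H$ denotes the Hamming metric, we let $\mathcal{V}$ be a subset of the vertices of the hypercube such that
$$\Delta_H(\alpha, \beta) \ge \frac{d}{4} \qquad \text{for all } \alpha \neq \beta \in \mathcal{V}, \qquad (15)$$
meaning that $\mathcal{V}$ is a $\frac{d}{4}$-packing in the Hamming norm. It is a classical fact that one can construct such a set with cardinality $|\mathcal{V}| \ge (2/\sqrt{e})^{d/2}$.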
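For small dimensions, a packing of this kind can be built by brute force. The sketch below is one simple greedy construction, not the construction behind the classical cardinality bound: it scans the vertices of $\{-1,+1\}^d$ and keeps each vertex whose Hamming distance to every vertex kept so far is at least $d/4$. The separation $d/4$, the reference value $(2/\sqrt{e})^{d/2}$, and the choice $d = 8$ are assumptions of this example.

```python
import math
from itertools import product

def hamming(a, b):
    """Number of coordinates in which the two vertices differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def greedy_packing(d, min_dist):
    """Greedily collect vertices of {-1,+1}^d that are pairwise >= min_dist apart."""
    packing = []
    for v in product((-1, 1), repeat=d):
        if all(hamming(v, u) >= min_dist for u in packing):
            packing.append(v)
    return packing

d = 8
V = greedy_packing(d, d / 4)
# Reference value for comparison: (2 / sqrt(e))^(d/2).
reference_size = (2 / math.sqrt(math.e)) ** (d / 2)
```

The enumeration is exponential in $d$, so this is only a sanity check for small dimensions; it confirms that the greedy set is a valid packing whose size is at least `reference_size` in this instance.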
Now let denote some base set of functions defined on the convex set , to be chosen appropriately depending on the problem at hand. For a given tolerance , we define, for each vertex , the function
Depending on the result to be proven, our choice of the base functions and the pre-factor will ensure that each satisfies the appropriate Lipschitz and/or strong convexity properties over . Moreover, we will ensure that all minimizers of each are contained within .
Based on these functions and the packing set , we define the function class
Note that contains a total of functions by construction, and as mentioned previously, our choices of the base functions etc. will ensure that . We demonstrate specific choices of the class in the proofs of Theorems 1 through 3 to follow.
4.1.2 Optimizing well is equivalent to function identification
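We now claim that if a method can optimize over the subclass $\mathcal{G}(\delta)$ up to a certain tolerance, then it must be capable of identifying which function $g_\alpha \in \mathcal{G}(\delta)$ was chosen. We first require a measure for the closeness of functions in terms of their behavior near each other's minima. Recall that we use $x^*_f$ to denote a minimizing point of the function $f$. Given a convex set $S \subseteq \mathbb{R}^d$ and two functions $f, g$, we define
$$\rho(f, g) := \inf_{x \in S} \big[ f(x) + g(x) - f(x^*_f) - g(x^*_g) \big]. \qquad (18)$$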
This discrepancy measure is non-negative, symmetric in its arguments, and satisfies if and only if , so that we may refer to it as a premetric. (It does not satisfy the triangle inequality nor the condition that if and only if , both of which are required for to be a metric.)
Given the subclass , we quantify how densely it is packed with respect to the premetric using the quantity
We denote this quantity by when the class is clear from the context. We now state a simple result that demonstrates the utility of maintaining a separation under among functions in .
Lemma 1. For any , there can be at most one function such that
Thus, if we have an element that approximately minimizes one function in the set up to tolerance , then it cannot approximately minimize any other function in the set.
For a given , suppose that there exists an such that . From the definition of in (19), for any , we have
Re-arranging yields the inequality , from which the claim (20) follows.
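The separation argument can be checked numerically on a toy class. The sketch below takes the discrepancy to be $\rho(f, g) = \inf_{x \in S} [f(x) + g(x)] - \min_S f - \min_S g$, evaluates it on a grid over $S = [-1, 1]$ for three shifted absolute-value functions, and counts how many of them any single point can optimize to within one third of the minimum pairwise discrepancy. The function class, the grid, and the tolerance $\psi/3$ are assumptions of this example; the same argument goes through for any tolerance below $\psi/2$.

```python
def discrepancy(f, g, grid):
    """rho(f, g) = inf_x [f(x) + g(x)] - min f - min g, evaluated on a finite grid."""
    return min(f(x) + g(x) for x in grid) - min(map(f, grid)) - min(map(g, grid))

# Grid over S = [-1, 1] and a toy class of shifted absolute-value functions.
grid = [i / 100 for i in range(-100, 101)]
centers = [-0.5, 0.0, 0.5]
funcs = [lambda x, a=a: abs(x - a) for a in centers]

# Minimum pairwise discrepancy over the class (the packing quantity psi).
psi = min(discrepancy(f, g, grid)
          for i, f in enumerate(funcs) for g in funcs[i + 1:])

def near_optimal_count(x, tol):
    """Number of functions in the class that x minimizes to within tol."""
    return sum(f(x) - min(map(f, grid)) <= tol for f in funcs)

max_count = max(near_optimal_count(x, psi / 3) for x in grid)
```

In this instance `psi` equals the smallest gap between centers, and `max_count` is 1: no grid point is simultaneously near-optimal for two different functions, so a near-optimal point identifies the function.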
Suppose that for some fixed but unknown function , some method is allowed to make queries to an oracle with information function , thereby obtaining the information sequence
Our next lemma shows that if the method achieves a low minimax error over the class , then one can use its output to construct a hypothesis test that returns the true parameter at least of the time. (In this statement, we recall the definition (2) of the minimax error in optimization.)
Lemma 2. Suppose that based on the data , there exists a method that achieves a minimax error satisfying
Based on such a method , one can construct a hypothesis test such that .
Given a method that satisfies the bound (21), we construct an estimator of the true vertex as follows. If there exists some such that then we set equal to . If no such exists, then we choose uniformly at random from . From Lemma 1, there can exist only one such that satisfies this inequality. Consequently, using Markov’s inequality, we have . Maximizing over completes the proof. ∎
We have thus shown that having a low minimax optimization error over implies that the vertex can be identified most of the time.
4.1.3 Oracle answers and coin tosses
We now describe stochastic first order oracles for which the samples can be related to coin tosses. In particular, we associate a coin with each dimension , and consider the set of coin bias vectors lying in the set
Given a particular function —or
equivalently, vertex —we consider two
different types of stochastic first-order oracles , defined as
Oracle A: 1-dimensional unbiased gradients
Pick an index uniformly at random. Draw according to a Bernoulli distribution with parameter . For the given input , return the value and a sub-gradient of the function
By construction, the function value and gradients returned by Oracle A are unbiased estimates of those of . In particular, since each co-ordinate is chosen with probability , we have
with a similar relation for the gradient. Furthermore, as long as the
base functions and have gradients bounded by , we
have for all .
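A coin-based oracle of this type can be sketched as follows. The specific forms used here are assumptions made for illustration: the base functions are taken to be $f_i^+(x) = |x_i + \tfrac12|$ and $f_i^-(x) = |x_i - \tfrac12|$, the coin for coordinate $i$ has bias $\tfrac12 + \alpha_i \delta$, the oracle returns $c\,[b_i f_i^+(x) + (1 - b_i) f_i^-(x)]$, and the target function is $g_\alpha(x) = \frac{c}{d} \sum_i [(\tfrac12 + \alpha_i\delta) f_i^+(x) + (\tfrac12 - \alpha_i\delta) f_i^-(x)]$. All function names are hypothetical.

```python
import random

def f_plus(t):
    return abs(t + 0.5)

def f_minus(t):
    return abs(t - 0.5)

def g_alpha(x, alpha, delta, c=1.0):
    """Assumed target function indexed by the hypercube vertex alpha."""
    d = len(x)
    return (c / d) * sum((0.5 + a * delta) * f_plus(xi) + (0.5 - a * delta) * f_minus(xi)
                         for xi, a in zip(x, alpha))

def oracle_a_value(x, i, b, c=1.0):
    """Value returned when coordinate i is picked and the coin shows b in {0, 1}."""
    return c * (b * f_plus(x[i]) + (1 - b) * f_minus(x[i]))

def oracle_a_sample(x, alpha, delta, rng, c=1.0):
    """One oracle answer: uniform index, then a coin with bias 1/2 + alpha_i * delta."""
    i = rng.randrange(len(x))
    b = 1 if rng.random() < 0.5 + alpha[i] * delta else 0
    return oracle_a_value(x, i, b, c)

def oracle_a_mean(x, alpha, delta, c=1.0):
    """Exact expectation over the uniform index and the coin flip."""
    d = len(x)
    total = 0.0
    for i in range(d):
        p = 0.5 + alpha[i] * delta
        total += (p * oracle_a_value(x, i, 1, c) + (1 - p) * oracle_a_value(x, i, 0, c)) / d
    return total
```

Under these assumptions `oracle_a_mean` agrees with `g_alpha` at every point, which is the unbiasedness property claimed for the function values; only one coin outcome is revealed per query.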
Parts of our proofs are based on an oracle which responds with function values and gradients that are $d$-dimensional in nature.
Oracle B: $d$-dimensional unbiased gradients
For , draw according to a Bernoulli distribution with parameter . For the given input , return the value and a sub-gradient of the function
As with Oracle A, this oracle returns unbiased estimates of the function values and gradients. We frequently work with functions that depend only on the coordinate . In such cases, under the assumptions and , we have
In our later uses of Oracles A and B, we choose the pre-factor appropriately so as to produce the desired Lipschitz constants.
4.1.4 Lower bounds on coin-tossing
Finally, we use information-theoretic methods to lower bound the probability of correctly estimating the true parameter in our model. At each round of either Oracle A or Oracle B, we can consider a set of coin tosses, with an associated vector of parameters. At any round, the output of Oracle A can (at most) reveal the instantiation of a randomly chosen index, whereas Oracle B can at most reveal the entire vector . Our goal is to lower bound the probability of estimating the true parameter , based on a sequence of length . As noted previously in remarks following Theorem 1, this part of our proof exploits classical techniques from statistical minimax theory, including the use of Fano’s inequality (e.g., [9, 10, 11, 12]) and Le Cam’s bound (e.g., [20, 12]).
Suppose that the Bernoulli parameter vector is chosen uniformly at random from the packing set , and suppose that the outcome of coins chosen uniformly at random is revealed at each round . Then for any , any hypothesis test satisfies
where the probability is taken over both randomness in the oracle and the choice of .
Note that we will apply the lower bound (24) with in the case of Oracle A, and in the case of Oracle B.
For each time , let denote the randomly chosen subset of size , be the outcome of oracle’s coin toss at time for coordinate and let be a random vector with entries
By Fano’s inequality , we have the lower bound
where denotes the mutual information between the sequence and the random parameter vector . As discussed earlier, we are guaranteed that . Consequently, in order to prove the lower bound (24), it suffices to establish the upper bound .
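For reference, the standard form of Fano's inequality for $M$ equally likely hypotheses states that any test has error probability at least $1 - (I + \log 2)/\log M$, where $I$ is the mutual information between the observations and the hypothesis index. The helper below evaluates this bound; the function name and the use of natural logarithms are choices made for this sketch.

```python
import math

def fano_error_lower_bound(mutual_information, num_hypotheses):
    """Fano lower bound on the error probability of any test (natural logarithms)."""
    return 1.0 - (mutual_information + math.log(2)) / math.log(num_hypotheses)

# With no information at all (I = 0) and four hypotheses, the bound is 1/2.
no_info_bound = fano_error_lower_bound(0.0, 4)
```

The bound becomes vacuous (zero or negative) once the mutual information approaches $\log M - \log 2$, which is why the argument needs an upper bound on the information revealed by the oracle over all rounds.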
By the independent and identically distributed nature of the sampling model, we have
so that it suffices to upper bound the mutual information for a single round. To simplify notation, from here onwards we write to mean the pair . With this notation, the remainder of our proof is devoted to establishing that .
By the chain rule for mutual information, we have
Since the subset is chosen independently of , we have , and so it suffices to upper bound the first term. By definition of conditional mutual information , we have
Since has a uniform distribution over , we have , and convexity of the Kullback-Leibler (KL) divergence yields the upper bound
Now for any pair , the KL divergence can be at most the KL divergence between independent pairs of Bernoulli variates with parameters and . Letting denote the Kullback-Leibler divergence between a single pair of Bernoulli variables with parameters and , a little calculation yields
Consequently, as long as , we have . Returning to the bound (26), we conclude that . Taking averages over , we obtain the bound , and applying the decomposition (25) yields , thereby completing the proof. ∎
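The single-coin calculation can be verified numerically. The sketch below assumes the two coins have parameters $\tfrac12 + \delta$ and $\tfrac12 - \delta$, computes their KL divergence in closed form, and compares it with $16\delta^2$ on the range $\delta \le \tfrac14$; both the parametrization and the comparison constant are assumptions of this example.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_pair(delta):
    """KL divergence between Bernoulli(1/2 + delta) and Bernoulli(1/2 - delta)."""
    return kl_bernoulli(0.5 + delta, 0.5 - delta)

# (delta, exact KL, quadratic bound) for a few values with delta <= 1/4.
checks = [(delta, kl_pair(delta), 16 * delta ** 2)
          for delta in (0.01, 0.05, 0.1, 0.2, 0.25)]
```

For these parameters the divergence simplifies to $2\delta \log\frac{1/2 + \delta}{1/2 - \delta}$, which is quadratic in $\delta$ for small $\delta$; this is what makes the per-round information small when the coin biases are close to $\tfrac12$.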
The reader might have observed that Fano’s inequality yields a non-trivial lower bound only when is large enough. Since depends on the dimension for our construction, we can apply the Fano lower bound only for large enough. Smaller values of can be lower bounded by reduction to the case ; here we state a simple lower bound for estimating the bias of a single coin, which is a straightforward application of Le Cam’s bounding technique [20, 12]. In this special case, we have , and we recall that the estimator takes values in .
Given a sample size and a parameter , let be i.i.d. Bernoulli variables with parameter . Let be any test function based on these samples and returning an element of . Then for any , we have the lower bound
We observe first that for , , so that it suffices to lower bound the expected error. To ease notation, let and denote the probability distributions indexed by and respectively. By Lemma 1 of Yu , we have