Information-Based Complexity, Feedback
and Dynamics in Convex Programming
We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an oracle for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a controlled manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of “information” in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.
Many problems arising in areas such as communications and signal processing, control, machine learning, and economics require solving mathematical programs of the form
where f is a convex objective function and the feasible set is a compact, convex subset of Euclidean space. Therefore, it is important to have a clear understanding of the fundamental limits on the efficiency of convex programming methods.
A systematic study of these fundamental limits was initiated in the 1970’s by Nemirovski and Yudin . In their framework, an optimization algorithm is a sequential procedure that repeatedly queries a black-box oracle for information about the function being optimized, each query depending on the past information. The oracle may be deterministic (for example, giving the value of the function and its derivatives up to some order at any point) or stochastic. This leads to the notion of information-based complexity, i.e., the smallest number of oracle calls needed to minimize any function in a given class to a desired accuracy. The results in  are very wide in scope and cover a variety of convex programming problems in Banach spaces; finite-dimensional versions are covered in  and .
For deterministic oracles, Nemirovski and Yudin derived lower bounds on the information complexity of convex programming using a “counterfactual” argument: given any algorithm that purports to optimize all functions in some class to accuracy ε using at most T oracle calls, one explicitly constructs, for a particular history of queries and oracle responses, a function in the class which is consistent with this history, and yet cannot be ε-minimized by the algorithm using fewer than T oracle calls (see also ). A similar approach was also used for stochastic oracles.
Proper application of this method of resisting oracles requires a lot of ingenuity. In particular, the stochastic case involves fairly contrived noise models, unlikely to be encountered in practice. In this paper, which expands upon our preliminary work , we will show that the same (and many other) lower bounds can be derived using a much simpler information-theoretic technique reminiscent of the way one proves minimax lower bounds in statistics [5, 6, 7]. Namely, we reduce optimization to hypothesis testing with controlled observations and then relate the resulting probability of error to information complexity using Fano’s inequality and a series of mutual information bounds. These bounds highlight the role of feedback in choosing the next query based on past observations. One notable feature of our approach is that it does not require constructing particularly “strange” functions or noise models. Moreover, we derive a “law of diminishing returns” for a wide class of convex optimization schemes, which says that the decay of optimization error is offset by the decay of the rate at which the algorithm can reduce its uncertainty about the objective function.
The idea of relating optimization to hypothesis testing is not new. For instance, Shapiro and Nemirovski  derive a lower bound on the information complexity of a certain class of one-dimensional linear optimization problems by reducing optimization to a binary hypothesis testing problem pertaining to the parameter of a Bernoulli random variable (the outcome of a coin toss). The reduction consists in showing that any good optimization algorithm can be converted into an accurate estimator of the coin bias based on repeated independent trials; then one can derive the lower bound on the information complexity (equivalently, the minimum necessary number of coin tosses) from the data processing inequality for divergence (or Fano’s inequality). This approach was recently extended to multidimensional optimization problems by Agarwal et al. [9, 10]. Like the present paper, their work uses information-theoretic methods to derive lower bounds on the oracle complexity of convex optimization, and their results are qualitatively similar to some of ours. However, what sets our work apart from [8, 9, 10] is that we explicitly account for the controlled manner in which the algorithm interacts with the oracle. This, in turn, allows us to derive tight lower bounds on the rate of error decay for certain types of infinite-step descent algorithms, which is not possible with the reduction to coin tossing.
Sequential procedures have become increasingly popular in the field of machine learning, mostly due to the abundance of data and the resulting need to perform computation on-line. Convex optimization is not the only sequential setting being studied: recent research in machine learning has also focused on such scenarios as active learning, multi-armed bandits, and experimental design, to name a few. In all these settings, one element is common: each additional “action” should provide additional “information” about some unknown quantity. Translating this intuitive notion of “information” into precise information-theoretic statements is often difficult. Our contribution consists in offering such a translation for convex optimization and closely related problems.
Given a continuous function on a compact domain , we denote by its minimum value over :
We will use several basic notions from nonsmooth convex analysis . The subdifferential of at , denoted by , is the set of all , such that
Any such is a subgradient of at . For a convex , the subdifferential is always nonempty. When , its only element is precisely the gradient . By we denote the norm of ; the norm will also be denoted by . By we denote the unit ball in in the norm. The -diameter of is defined as
The identity matrix will be denoted by .
All abstract spaces are assumed to be standard Borel (i.e., Borel subsets of a complete separable metric space), and will be equipped with their Borel -fields. If is such a space, then will denote the corresponding -field. All functions between such spaces are assumed to be measurable. If and are two such spaces, then a Markov kernel [12, 13] from to is a mapping , such that for any is a probability measure on and for any is a measurable function on . We will use the standard notation for such a kernel.
We will work with the usual information-theoretic quantities, which are well-defined in standard Borel spaces . Given two (Borel) probability measures and on , their divergence is
where the notation means that is absolutely continuous w.r.t. , i.e., for any implies that as well. If is a product space, , then the conditional divergence between two probability distributions and on given (the -marginal of ) is
where and are any versions of the regular conditional probability distributions of given under and , respectively. This definition extends in the obvious way to situations when or are themselves product spaces. Thus, if and are two probability distributions for a random triple taking values in a product space , such that, under , and are conditionally independent given , i.e., -a.s., then we will write
Given a random couple with probability distribution , the mutual information between and is
Given a random triple , the conditional mutual information between and given is
where (4) follows from Bayes’ rule and from (I-A). In other words, the conditional mutual information is given by the conditional divergence between the joint distribution of and the distribution under which and are conditionally independent given .
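For finite alphabets, these information quantities can be computed directly from the divergence definition above. The following sketch is not part of the paper's development; the binary-symmetric example and the function name are ours, for illustration only.

```python
import math

def mutual_information(p_xy):
    """I(X; Y) = D(P_XY || P_X x P_Y) for a joint pmf given as a 2-D table."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    info = 0.0
    for i, row in enumerate(p_xy):
        for j, p in enumerate(row):
            if p > 0:  # terms with p = 0 contribute nothing to the divergence
                info += p * math.log2(p / (p_x[i] * p_y[j]))
    return info

# X ~ Bernoulli(1/2) observed through a binary symmetric channel with
# crossover probability 0.1; here I(X; Y) = 1 - h2(0.1), about 0.531 bits.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
```

The same routine evaluates the divergence form of conditional mutual information by applying it to each conditional slice and averaging.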
II Sequential optimization algorithms and their information-based complexity
The work of Nemirovski and Yudin  deals with fundamental limitations of sequential optimization algorithms in the real-number model of computation. The basic setting is as follows. We have a class of convex functions on some compact convex domain . We seek an “optimal” algorithm that would solve the optimization problem (1) with a given guarantee of accuracy regardless of which were to be optimized. The algorithms of interest operate by repeatedly querying an oracle for information about the unknown objective at appropriately selected points in and then combining the accumulated information to form a solution. The notion of optimality of an algorithm pertains to the number of queries it makes before producing a solution, without regard to the combinatorial complexity of computing each query. In other words, we are interested in the information-based complexity (IBC) [15, 16] of convex optimization problems.
The theory of IBC is concerned with the intrinsic difficulty of computational problems in terms of the minimum amount of information needed to solve every problem in a given class with a given guarantee of accuracy. The word “information” here does not refer to information in the sense of Shannon, but rather to what is known a priori about the problem being solved, as well as what an algorithm is allowed to learn during its operation. There are three aspects inherent in this notion of information — it is partial, noisy, and priced. Let us explain informally what these three terms mean in the context of optimization by means of a simple example.
Let , and consider the function class
We wish to design an algorithm that minimizes every to a given accuracy . At the outset, the only a priori information available to the algorithm consists of the problem domain , the function class , and the desired accuracy . The algorithm is allowed to query the value and the derivative of at any finite set of points before arriving at a solution, which we denote by . The queries are answered by an oracle, i.e., a (possibly stochastic) device that knows the function (or, equivalently, the parameter ) and responds to any query with , where is a random element from some probability space that represents oracle noise. The random variable is assumed to be a noisy observation of the pair . For concreteness, let us suppose that
where and are an i.i.d. pair of random variables.
The interaction of the algorithm and the oracle takes place as follows. Let be an i.i.d. sequence. At time , the algorithm computes the query as a function of the past queries and the corresponding oracle responses . At time the algorithm knows only that ; this represents the a priori information. At time , the algorithm acquires additional data , and so can refine its a priori information. At every time step, the information is partial in the sense that there are (potentially infinitely) many functions consistent with it, and it is also noisy due to the presence of the additive disturbances .
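The protocol just described can be made concrete with a minimal simulation. The specific instance f_θ(x) = (x − θ)² on [0, 1], the Gaussian disturbances, and the gradient-descent rule below are purely illustrative placeholders (not the paper's construction); what the sketch shows is the feedback structure: each query is a function of all past queries and responses.

```python
import random

def make_oracle(theta, sigma=0.1, seed=0):
    """Noisy first-order oracle for the (assumed) instance
    f_theta(x) = (x - theta)**2 on the domain [0, 1].

    A query x is answered with noisy observations of the pair
    (f(x), f'(x)); the additive disturbances are i.i.d. Gaussian.
    """
    rng = random.Random(seed)
    def oracle(x):
        value = (x - theta) ** 2 + rng.gauss(0.0, sigma)
        grad = 2.0 * (x - theta) + rng.gauss(0.0, sigma)
        return value, grad
    return oracle

def run(algorithm_step, oracle, T):
    """Sequential interaction: the query at time t depends on past data."""
    history = []
    x = 0.5  # a priori information: only the domain [0, 1] is known
    for _ in range(T):
        y = oracle(x)                 # oracle response to the current query
        history.append((x, y))
        x = algorithm_step(history)   # feedback: next query from past data
    return x                          # candidate minimizer after T queries

def sgd_step(history):
    """Placeholder algorithm: projected gradient step of size 1/(t+1)."""
    t = len(history)
    x, (_, g) = history[-1]
    return min(1.0, max(0.0, x - g / (t + 1)))

x_hat = run(sgd_step, make_oracle(theta=0.3), T=200)
```

After 200 noisy queries the candidate minimizer lands close to θ = 0.3, illustrating that the partial, noisy information accumulates over time.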
Formally, for the example outlined above, an algorithm that makes T queries (or a T-step algorithm) is a tuple , where , so that, for , is the query at time , and is the solution. We assume that information is priced in the sense that the algorithm is charged some fixed cost for every query it makes. Thus, it is desired to keep the number of queries to a minimum. With this in mind, we can define the IBC for a given accuracy ε as
where the expectation is taken w.r.t. the noise process . For this particular problem it can be shown that
The first entry follows because the algorithm can just query , obtain the response , and immediately compute ; the last entry follows because the maximum value of any on is at most . The intermediate regime is more involved. The main contribution of the present paper is a unified information-theoretic framework for deriving lower bounds on the IBC of arbitrary sequential algorithms for solving convex programming problems.
II-A Formal definitions
The above discussion can be formalized as follows:
A problem class is a triple consisting of the following objects:
A compact, convex problem domain ;
An instance space , which is a class of convex functions ;
An oracle , where is the oracle information space and , is a Markov kernel. (Footnote: Recall that the instance space is a subset of the space of all continuous real-valued functions on the domain. Equipped with the usual sup norm, this is a separable Banach space, so a Markov kernel from it is well-defined.)
Some restrictions must be imposed in order to exclude oracles that are “too informative,” an extreme example being and . One way to rule this out is to require the oracle in question to be local:
We say that an oracle is local if for every and every pair such that in some open neighborhood of , we have
It is easy to see that the oracle described right before the definition is not local. Indeed, fix a point and consider any two functions that agree on some open neighborhood of , but are not equal outside this neighborhood. Then , but , which violates locality. Most oracles encountered in practice are local (see, for instance, the examples in Section III).
To gain more insight into stochastic oracles, we can appeal to the basic structural result for Markov kernels: If and are standard Borel spaces, then any Markov kernel from to can be realized in the form , where is a random variable uniformly distributed on and is a measurable mapping [13, Lemma 3.22]. Thus, for any stochastic oracle we can find a deterministic oracle with some information space and a measurable mapping , such that can be realized as
with as above. Thus, will be local in the sense of Definition 2 whenever its “deterministic part” is local.
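The structural result can be illustrated concretely. The toy kernel below (our own example, not from the paper) realizes a Bernoulli(p(x)) kernel as a deterministic map of the query and a single Uniform[0, 1) variable, exactly in the spirit of the cited lemma.

```python
import math
import random

def Phi(x, u):
    """Deterministic map realizing the kernel K(dy | x).

    Assumed toy kernel: K(. | x) = Bernoulli(p(x)) with
    p(x) = 1 / (1 + exp(-x)), driven by a single u ~ Uniform[0, 1).
    """
    p = 1.0 / (1.0 + math.exp(-x))
    return 1 if u < p else 0

rng = random.Random(1)
# Empirically, Phi(x, U) has the correct conditional law:
samples = [Phi(0.0, rng.random()) for _ in range(10000)]
freq = sum(samples) / len(samples)  # should be close to p(0) = 0.5
```

The same inverse-CDF construction works for any standard Borel output space, which is why locality of a stochastic oracle reduces to locality of its deterministic part.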
Next, we make the notion of an optimization algorithm precise. In this paper, we deal only with deterministic algorithms, although all the results can be easily extended to cover randomized algorithms as well (cf.  for details):
A T-step algorithm for a given is a sequence of mappings . The set of all T-step algorithms for will be denoted by .
The interaction of any with , shown in Figure 1, is described recursively as follows:
At time , a problem instance is selected by Nature and revealed to , but not to .
At each time :
queries with , where is the algorithm’s query and the oracle’s response at time .
responds with a random element according to .
At time , outputs the candidate minimizer .
We can view the set-up of Figure 1 as a discrete-time stochastic dynamical system with an unknown “parameter” , input sequence , and output sequence . The objective is to drive the system as quickly as possible to an ε-minimizing state, i.e., any such that , for every . We are interested in the fundamental limits on the speed with which this can be done. Defining the error of on by
we introduce the following definition:
Fix a problem class . For any , , and , we define the th-order -complexity and the -complexity of , respectively, as
When the underlying problem class is clear from context, we will write simply and . Moreover, when we will simply write or .
The following is immediate from definitions (the proof is in Appendix B):
For any ,
The complexities and capture the intrinsic difficulty of sequential optimization over the problem class using any finite-step algorithm. However, most iterative optimization algorithms used in practice (such as stochastic gradient descent) are not run for a prescribed finite number of steps. Instead, they are run for however many steps are necessary until a desired accuracy is reached. Moreover, the error of the successive candidate minimizers produced by such an algorithm should decay monotonically with time. This observation motivates the following definitions:
A weak infinite-step algorithm for is a sequence of mappings . The set of all weak infinite-step algorithms for will be denoted by .
Given a problem class and some , an algorithm is -anytime if
We can now ask about fundamental limits on the rate of convergence in (9):
For any problem class , we define the -anytime exponent as
According to the above definitions, the candidate minimizer produced by a weak infinite-step algorithm after queries is simultaneously the query at time . Many algorithms used in practice, such as stochastic gradient descent, are weak infinite-step algorithms. A more general class of algorithms, which we may call strong infinite-step algorithms, would also include strategies in which the process of issuing queries (i.e., gathering information about the objective) is separated from the process of generating candidate minimizers. Stochastic gradient descent with trajectory averaging [17, 3] is an example of such a strong algorithm. We do not consider strong infinite-step algorithms in this paper (except for a brief discussion in Appendix A), although their study is an interesting and important avenue for further research.
III Examples of problem classes and preview of selected results
The following six examples show the variety of settings captured by our framework, ranging from “standard” optimization problems to such scenarios as parameter estimation, sequential experimental design, and active learning.
Given , let be the set of all convex functions that are -Lipschitz, i.e.,
Let and let be a point mass concentrated at , where, for each , is an arbitrary subgradient in . This oracle provides noiseless first-order information. When , we will write instead of .
Take as above, but now suppose that the oracle responds with
where and are zero-mean random variables with finite second moments. Thus, any algorithm receives noisy first-order information, and the oracle is local.
Given , let be the set of all differentiable functions that are -strongly convex, i.e.,
As in the previous example, the oracle responds with
where and are zero-mean random variables with finite second moments. When , we will write instead of .
Fix a compact convex set and a family of probability measures on . Consider the class of convex functions
such that for every . Consider also the oracle , defined by
This oracle ignores the query and simply outputs a random element . The problem class thus describes the statistical problem of estimating the parameter of a probability distribution. More generally, we can consider the function class
where we assume that:
For each fixed , the function is convex
The second condition says that is a contrast function . Most classical problems in statistical inference, such as estimating the mean, the median, or the variance of a distribution, can be cast as minimizing a convex contrast function of the form (12). For instance, if , , for each , and , then
with , so we recover the problem of estimating the mean.
As we have just seen, the queries are of no use in statistical estimation since the samples the statistician obtains depend only on the unknown parameter . By contrast, the setting in which the statistician’s queries do affect the observations is known as sequential experimental design [19, 20, 21]. Consider the case when is compact and convex, as in the above example. Suppose also that we have two families of probability measures on , and . The function class is as in (12) but with replacing , while the oracle now is defined by
Thus, the role of is to provide a measure of performance (or goodness-of-fit) of the final estimate of , while describes the experimental model (i.e., the relationship between the input and the response given the parameter ).
Our last example is at the intersection of statistical learning theory and sequential experimental design. Let
To define the oracle, suppose that there exist some and , such that
where the first inequality holds for all in a sufficiently small neighborhood of . This oracle provides a noisy subgradient of at , and the amount of noise depends on the distance between and . This problem class is related to active learning of a threshold function on the unit interval , and will be treated in detail in Section VI.
We obtain lower bounds on the number of oracle calls required to ε-minimize every function in a given class, where the exponent depends on the geometry of the problem domain and on the complexity of the instance space. For convex Lipschitz functions and noiseless first-order oracles (Example 1), or more generally for stochastic oracles that are sufficiently “informative” in a sense we make precise, this lower bound holds with (cf. the discussion right after Theorem 1). This lower bound is known to be optimal in the noiseless case and in certain noisy scenarios when ; however, our techniques lead to a much more transparent proof of the bound.
For the noisy first-order oracle with zero-mean Gaussian noise of variance , we obtain lower bounds of the form
where the exponent depends, as before, on the geometry of , on the complexity of , as well as on whether the oracle supplies full first-order information (function value and subgradient) or just the subgradient. The exponent depends on the details of the function class . More specifically:
The corresponding result for convex Lipschitz functions in can be found in [1, 8], yet we obtain the optimal dependence on for higher dimensions. Our lower bound for strongly convex functions seems to be new; in particular, Nemirovski and Yudin  only consider the noiseless case, while Agarwal et al. [9, 10] consider noisy first-order oracles, but with a different oracle model, which does not allow additive noise due to a coin-tossing construction. Ignoring the dependence on the dimension, we also obtain the error decay rate for when we restrict ourselves to anytime infinite-step algorithms (Theorem 5 in Section VI). To the best of our knowledge, such analysis does not appear anywhere else in the literature. The bounds of Eq. (7) essentially capture the fundamental limits of strongly convex programming in one dimension and can be easily deduced using our techniques (a sketch of the derivation is given in Section IV-A). We also derive new (and tighter) lower bounds on anytime algorithms for minimizing higher-order polynomials under a second-moment error criterion (Theorems 6 and 7 in Section VI).
Apart from “standard” optimization problems, our framework seamlessly captures several statistical problems with an optimization flavor. In particular, in Section V-D we look at information-based complexity of statistical estimation and sequential experimental design (Examples 4 and 5, respectively). Here we do not aim at obtaining tight rates for specific settings of interest, but rather show the connections to the techniques employed in statistics. Finally, we show in Section VI-C that our methodology leads to a particularly easy derivation of a lower bound for the active learning problem of Example 6. This bound was previously obtained in  using a much more involved argument relying on a careful construction of a “difficult” subset of functions.
Overall, our main contributions are the development of a general framework that captures many diverse settings with optimization flavor, as well as a novel analysis that takes into account the effect of feedback upon the dynamics of the interaction between the algorithm and the oracle.
IV Setting the stage: optimization vs. hypothesis testing with feedback
We now lay down the foundations of our information-theoretic method for determining lower bounds on the information complexity of convex programming. The basic strategy is to show that the minimum number of oracle queries is constrained by the average rate at which each new query can reduce the algorithm’s uncertainty about the function being optimized.
Conceptually, our techniques are akin to the ones used in the statistical literature to obtain minimax lower bounds on the risks of estimation procedures [5, 6, 7]. The main idea is this. Given a problem class , we construct a “difficult” finite subclass , such that the functions in it are nearly indistinguishable from one another based on the information supplied by the oracle in response to any possible query, and yet they are sufficiently far apart from one another, so that a candidate approximate minimizer for any one of them fails to minimize all the remaining functions to the same accuracy. Once such a class is constructed, we consider a fictitious situation in which Nature selects an element of uniformly at random. Then for every T-step algorithm we can construct a probability space with the following random variables defined on it:
, which encodes the random choice of a problem instance in
, where are the queries issued by and is the candidate minimizer
are the responses of to the queries issued by .
These variables describe the interaction between Nature, the algorithm, and the oracle, and thus have the causal ordering
where, -almost surely,
for all . In other words, and are Markov chains for every .
The reason for such punctilious bookkeeping is that now we can relate the problem faced by to sequential hypothesis testing with feedback, as defined by Burnashev . We can think of as encoding the choice of one of equiprobable hypotheses. At each time , the algorithm issues a query and receives an observation which is stochastically related to and via the kernel . The current query may depend only on the past queries and observations. At time , the algorithm produces a candidate minimizer, . As we will shortly demonstrate, we can use the information available to at time to construct an estimate of the true hypothesis. (Footnote: It is important to keep in mind that the hypothesis testing set-up is purely fictitious — indeed, the algorithm may or may not know that the problem instances are drawn at random among , rather than arbitrarily from the entire instance space . The point is, though, that the average performance of on cannot be better than its worst-case performance on . In statistical terms, the minimax risk of over is bounded below by the Bayes risk over any subset of .) Once this is done, we can analyze the mutual information , which is well-defined because we have specified . In particular, the analysis hinges on the following observations. Suppose that is such that for some , , and we have
where the probability is w.r.t. the randomness in the oracle’s responses. Then, first of all,
We will use this fact, together with the “geometric” distinguishability of the functions , to show that and, as a consequence, that there exists some , such that
In other words, a good algorithm should be able to obtain a nontrivial amount of information about the hypothesis . On the other hand, by the data processing inequality, , and we will use statistical indistinguishability of , as well as the structure of the oracle, to obtain an upper bound of the form
with some . The two bounds are then combined to yield
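The combination step is an instance of Fano's inequality. In shorthand that is ours (N for the size of the finite subclass, F for the randomly drawn instance, and Y^T for the oracle responses), the template reads:

```latex
% Fano's inequality for N equiprobable hypotheses F ~ Unif{f_1, ..., f_N}:
% for any estimator \hat{F} = \psi(Y^T) of F from the oracle responses,
% the probability of error P_e = \Pr[\hat{F} \neq F] satisfies
P_e \;\geq\; 1 - \frac{I(F; Y^T) + \log 2}{\log N}.
% Rearranging, any scheme with P_e \le 1/3 must satisfy
% I(F; Y^T) \ge \tfrac{2}{3}\log N - \log 2,
% while the oracle-dependent upper bound on I(F; Y^T) caps how fast this
% much information can be accumulated over T queries.
```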
IV-A An illustrative example
To illustrate our method in action, we will sketch the derivation of the nontrivial part of the lower bound in (7), i.e., when . Let
It is easy to see that . Consider two functions
A simple calculation shows that for any such that
we must have , and the same holds with the roles of the two functions reversed. Thus, any ε-minimizer of one function fails to ε-minimize the other, and vice versa.
On the other hand, the probability distribution of the output of the first-order Gaussian oracle (6) for any query when is very close to its counterpart. Indeed, letting denote the output of the oracle, we have
Then it is not hard to show that, for ,
In other words, the functions and are nearly indistinguishable from one another based on the outcome of a single query.
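This near-indistinguishability can be quantified. Assuming, as in the Gaussian oracle model, that the response to a query is Gaussian with variance σ² about the true first-order information, the divergence between the two response distributions is |μ₁ − μ₂|²/(2σ²); the numbers below are illustrative, not the paper's constants.

```python
def kl_gauss(mu1, mu2, sigma):
    """D(N(mu1, sigma^2 I) || N(mu2, sigma^2 I)) = |mu1 - mu2|^2 / (2 sigma^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    return sq / (2.0 * sigma ** 2)

# Response means differing by delta in each of the two coordinates
# (value and derivative): the divergence scales as delta^2 / sigma^2,
# so a single noisy query barely distinguishes the two functions.
delta, sigma = 0.01, 1.0
div = kl_gauss((0.0, 0.0), (delta, delta), sigma)  # = delta**2 / sigma**2
```

By the chain rule, T such queries can accumulate at most T times this per-query divergence, which is the source of the lower bound sketched next.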
Now suppose that Nature selects an index uniformly at random. Consider a T-step algorithm that ε-minimizes every function in the class defined in (5) with probability at least , where . Let . Then Lemma 1 in Section IV-C can be used to show that the lower bound (14) holds with
IV-B Reduction to hypothesis testing with feedback
We now develop our information-theoretic methodology in the general setting of Section II-A.
Let us fix a problem class . To set up our analysis, we first endow the instance space with a “distance” that has the following property: for any and any ,
In other words, an ε-minimizer of a function cannot simultaneously be an ε-minimizer of a distant function. It is easy to construct a satisfying (18) for any particular class of continuous functions, although such a need not be a metric. For example, if we consider the class
for some , then satisfies (18). Indeed, and imply by the triangle inequality. For a general , we can also define
the distance-like function introduced in [9, 10]. This definition coincides with for the parametric set ; however, (18) is the most general requirement. Note that we will often implicitly restrict our consideration to a subclass of and define an appropriate on that subclass.
Let us fix the exponent and consider any finite , such that any two distinct are at least apart in . Given any and an algorithm , we can now construct the probability space , as described in the introduction to this section. Given , the output of , we can define the “estimator”
which simply selects that function in for which the error of is the smallest. Since is -measurable, the estimator is indeed a function only of the information available to after time .
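In code, this estimator is a nearest-hypothesis rule. The two quadratics below are a hypothetical stand-in for the finite subclass (they are not the paper's construction); the rule itself matches (19): report the hypothesis whose optimization error at the candidate minimizer is smallest.

```python
def estimator(x_hat, funcs, minima):
    """Map the candidate minimizer x_hat to the index j minimizing the
    optimization error f_j(x_hat) - min f_j (ties broken by lowest j)."""
    return min(range(len(funcs)),
               key=lambda j: funcs[j](x_hat) - minima[j])

# Hypothetical two-element subclass of well-separated quadratics on [0, 1]:
f0 = lambda x: (x - 0.25) ** 2
f1 = lambda x: (x - 0.75) ** 2
j = estimator(0.3, [f0, f1], [0.0, 0.0])  # 0.3 nearly minimizes f0, so j = 0
```

Because the estimate depends only on the candidate minimizer, any error bound for the algorithm transfers directly to an error bound for this fictitious test.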
IV-C Information bounds
The main object of interest will be the mutual information . We first show that any “good” T-step algorithm obtains a nonzero amount of information about at the end of its operation:
Fix some , , and . Suppose attains
Let be a finite set of functions, such that
Let be uniformly distributed on , and suppose that is fed with the random problem instance . If , then the estimator defined in (19) satisfies the bound
If , then
where is the binary entropy function.
In the sequel, we will consider only the cases when the set is either “rich”, so that , or has only two elements, so .
Consider an algorithm with the claimed properties. Define, for each