Adaptive Online Learning
We propose a general framework for studying adaptive regret bounds in the online learning setting, including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound, we ask, "Does there exist some algorithm achieving this bound?" We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular, each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.
Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem that holds for countably infinite sets.
Some of the recent progress on the theoretical foundations of online learning has been motivated by the parallel developments in the realm of statistical learning. In particular, this motivation has led to martingale extensions of empirical process theory, which were shown to be the “right” notions for online learnability. Two topics, however, have remained elusive thus far: obtaining data-dependent bounds and establishing model selection (or, oracle-type) inequalities for online learning problems. In this paper we develop new techniques for addressing both these topics.
Oracle inequalities and model selection have been topics of intense research in statistics in the last two decades [1, 2, 3]. Given a sequence of models $\mathcal{F}_1, \mathcal{F}_2, \ldots$ whose union is $\mathcal{F}$, one aims to derive a procedure that selects, given an i.i.d. sample of size $n$, an estimator $\hat{f}$ from a model $\mathcal{F}_m$ that trades off bias and variance. Roughly speaking, the desired oracle bound takes the form
$$\mathbb{E}\,\mathrm{loss}(\hat{f}) \le \min_{m} \left\{ \inf_{f \in \mathcal{F}_m} \mathbb{E}\,\mathrm{loss}(f) + \mathrm{pen}_n(m) \right\},$$
where $\mathrm{pen}_n(m)$ is a penalty for the model $\mathcal{F}_m$. Such oracle inequalities are attractive because they can be shown to hold even if the overall model $\mathcal{F}$ is too large. A central idea in the proofs of such statements (and an idea that will appear throughout the present paper) is that $\mathrm{pen}_n(m)$ should be "slightly larger" than the fluctuations of the empirical process for the model $\mathcal{F}_m$. It is therefore not surprising that concentration inequalities, and particularly Talagrand's celebrated inequality for the supremum of the empirical process, have played an important role in attaining oracle bounds. In order to select a good model in a data-driven manner, one first establishes non-asymptotic data-dependent bounds on the fluctuations of an empirical process indexed by elements in each model.
Lifting the ideas of oracle inequalities and data-dependent bounds from statistical to online learning is not an obvious task. For one, there is no concentration inequality available, even for the simple case of sequential Rademacher complexity. (For the reader already familiar with this complexity: a change of the value of one Rademacher variable results in a change of the remaining path, and hence an attempt to use a version of a bounded difference inequality grossly fails.) Luckily, as we show in this paper, the concentration machinery is not needed: one only requires a one-sided tail inequality. This realization is motivated by the recent work of [5, 6, 7]. At a high level, our approach is to develop one-sided inequalities for the suprema of certain offset processes, with an offset chosen to be "slightly larger" than the complexity of the corresponding model. We then show that these offset processes also determine which data-dependent adaptive rates are achievable for a given online learning problem, drawing strong connections to the ideas of statistical learning described earlier.
Let $\mathcal{X}$ be the set of observations, $\mathcal{D}$ the space of decisions, and $\mathcal{Y}$ the set of outcomes. Let $\Delta(\mathcal{S})$ denote the set of distributions on a set $\mathcal{S}$. Let $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. The online learning framework is defined by the following process: for $t = 1, \ldots, n$, Nature provides an input instance $x_t \in \mathcal{X}$; the Learner selects a prediction distribution $q_t \in \Delta(\mathcal{D})$; Nature provides a label $y_t \in \mathcal{Y}$, while the learner draws a prediction $\hat{y}_t \sim q_t$ and suffers the loss $\ell(\hat{y}_t, y_t)$.
Two specific scenarios of interest are supervised learning ($\mathcal{D} = \mathcal{Y}$) and online linear (or convex) optimization ($\mathcal{X}$ is a singleton set, while $\mathcal{D}$ and $\mathcal{Y}$ are unit balls in dual Banach spaces and $\ell(\hat{y}, y) = \langle \hat{y}, y \rangle$).
For a class $\mathcal{F} \subseteq \mathcal{D}^{\mathcal{X}}$, we define the learner's cumulative regret to $f \in \mathcal{F}$ as
$$\mathrm{Reg}_n(f) \triangleq \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t).$$
A uniform regret bound $B_n$ is achievable if there exists a randomized algorithm for selecting $\hat{y}_t$ such that
$$\sup_{f \in \mathcal{F}} \mathbb{E}\left[\mathrm{Reg}_n(f)\right] \le B_n, \tag{1}$$
where $\mathbb{E}$ stands for the expectation over the learner's randomized predictions $\hat{y}_1, \ldots, \hat{y}_n$. Achievable rates depend on the complexity of the function class $\mathcal{F}$. For example, the sequential Rademacher complexity of $\mathcal{F}$ is one of the tightest achievable uniform rates for a variety of loss functions [8, 7].
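For intuition, a uniform rate of this kind can be checked empirically in the finite-experts case: the classical exponential weights algorithm (a standard construction, not specific to this paper) guarantees regret at most $\sqrt{(n/2)\log N}$ for losses in $[0,1]$. The sketch below runs the algorithm on hypothetical random losses and checks the guarantee; all parameter choices here are ours for illustration.

```python
import math
import random

def exponential_weights(loss_rows, eta):
    """Run exponential weights over N experts on a sequence of loss vectors.

    Returns (algorithm_loss, best_expert_loss), where algorithm_loss is the
    expected loss of the randomized predictions.
    """
    n_experts = len(loss_rows[0])
    log_w = [0.0] * n_experts           # log-weights, for numerical stability
    algo_loss = 0.0
    cumulative = [0.0] * n_experts
    for losses in loss_rows:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        p = [wi / z for wi in w]        # current prediction distribution
        algo_loss += sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            cumulative[i] += li
            log_w[i] -= eta * li        # multiplicative update
    return algo_loss, min(cumulative)

random.seed(0)
n, N = 500, 10
rows = [[random.random() for _ in range(N)] for _ in range(n)]
eta = math.sqrt(8.0 * math.log(N) / n)      # tuned learning rate
algo, best = exponential_weights(rows, eta)
regret = algo - best
bound = math.sqrt(0.5 * n * math.log(N))    # uniform rate sqrt((n/2) log N)
```

The assertion `regret <= bound` holds deterministically here: it is the classical exponential weights guarantee, not a statistical accident of the seed.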
An adaptive regret bound has the form $B_n(f; y_1, \ldots, y_n)$ and is said to be achievable if there exists a randomized algorithm for selecting $\hat{y}_t$ such that
$$\forall f \in \mathcal{F}: \quad \mathbb{E}\left[\mathrm{Reg}_n(f)\right] \le B_n(f; y_1, \ldots, y_n). \tag{2}$$
We distinguish three types of adaptive bounds, according to whether $B_n$ depends only on the data $y_1, \ldots, y_n$, only on the comparator $f$, or on both quantities. Whenever $B_n$ depends on $f$, an adaptive regret bound can be viewed as an oracle inequality which penalizes each $f$ according to a measure of its complexity (e.g., the complexity of the smallest model to which it belongs). As in statistical learning, an oracle inequality (2) may be proved for certain functions $B_n$ even if a uniform bound (1) cannot hold for any nontrivial rate.
1.2 Related Work
The case when $B_n$ does not depend on $f$ has received most of the attention in the literature. The focus is on bounds that can be tighter for "nice sequences," yet maintain near-optimal worst-case guarantees. An incomplete list of prior work includes [9, 10, 11, 12], couched in the setting of online linear/convex optimization, as well as work in the experts setting.
The present paper was partly motivated by work on an algorithm that competes with all experts simultaneously, but with varied regret with respect to each of them, depending on the quantile of the expert. This is a bound of the type $B_n(f)$, dependent only on the comparator (through the quantile of the expert we compete against), for the finite experts setting. Another line of work considers online linear optimization with an unbounded set and provides oracle inequalities with an appropriately chosen penalty function.
Finally, the third category of adaptive bounds are those that depend on both the hypothesis and the data. The bounds that depend on the loss of the best function (so-called "small-loss" bounds, [16, Sec. 2.4], [17, 13]) fall into this category trivially, since one may overbound the loss of the best function by the loss of the comparator itself. We would also like to draw attention to a recent result showing an adaptive bound in terms of both the loss of the comparator and the KL divergence between the comparator and some pre-fixed prior distribution over experts. An MDL-style bound in terms of the variance of the loss of the comparator (under the distribution induced by the algorithm) was also given recently.
Our study was also partly inspired by Cover, who characterized necessary and sufficient conditions for achievable bounds in prediction of binary sequences. Those methods, however, rely on the structure of the binary prediction problem and do not readily generalize to other settings.
The framework we propose recovers the vast majority of known adaptive rates in the literature, including variance bounds, quantile bounds, localization-based bounds, and fast rates for small losses. It should be noted that while the existing literature on adaptive online learning has focused on simple hypothesis classes such as finite experts and finite-dimensional norm balls, our results extend to general hypothesis classes, including large nonparametric ones.
2 Adaptive Rates and Achievability: General Setup
The first step in building a general theory for adaptive online learning is to identify which adaptive regret bounds are possible to achieve. Recall that an adaptive regret bound $B_n(f; y_1, \ldots, y_n)$ is said to be achievable if there exists an online learning algorithm that produces predictions/decisions $\hat{y}_t$ such that (2) holds.
In the rest of this work, we use the notation $\left\langle\!\left\langle\, \cdots \,\right\rangle\!\right\rangle_{t=1}^{n}$ to denote the interleaved application of the operators inside the brackets, repeated over rounds $t = 1, \ldots, n$. Achievability of an adaptive rate can be formalized by the following minimax quantity.
Given an adaptive rate $B_n$, we define the offset minimax value:
$$\mathcal{A}_n(\mathcal{F}, B_n) \triangleq \left\langle\!\left\langle\, \sup_{x_t \in \mathcal{X}}\, \inf_{q_t \in \Delta(\mathcal{D})}\, \sup_{y_t \in \mathcal{Y}}\, \mathbb{E}_{\hat{y}_t \sim q_t} \right\rangle\!\right\rangle_{t=1}^{n} \left[\, \sup_{f \in \mathcal{F}} \left( \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) - B_n(f; y_1, \ldots, y_n) \right) \right].$$
$\mathcal{A}_n(\mathcal{F}, B_n)$ quantifies how the regret, offset by the rate $B_n$, behaves when the optimal learning algorithm that minimizes this difference is used against Nature trying to maximize it. Directly from this definition, the rate $B_n$ is achievable if and only if $\mathcal{A}_n(\mathcal{F}, B_n) \le 0$.
If $B_n$ is a uniform rate, i.e., $B_n(f; y_1, \ldots, y_n) \equiv B_n$, achievability reduces to the standard minimax analysis of online learning. The uniform rate is achievable if and only if $\mathcal{V}_n(\mathcal{F}) \le B_n$, where $\mathcal{V}_n(\mathcal{F})$ is the minimax value of the online learning game.
We now focus on understanding the minimax value $\mathcal{A}_n(\mathcal{F}, B_n)$ for general adaptive rates. We first show that this minimax value is bounded by an offset version of the sequential Rademacher complexity. The symmetrization Lemma 1 below provides us with the first step towards a probabilistic analysis of achievable rates. Before stating the lemma, we need to define the notion of a tree and the notion of sequential Rademacher complexity.
Given a set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_n)$ of functions $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. One may view $\mathbf{z}$ as a complete binary tree decorated by elements of $\mathcal{Z}$. Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ be a sequence of independent Rademacher random variables. Then $(\mathbf{z}_t(\epsilon_1, \ldots, \epsilon_{t-1}))_{t=1}^{n}$ may be viewed as a predictable process with respect to the filtration generated by $\epsilon$. For a tree $\mathbf{z}$, the sequential Rademacher complexity of a function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ on $\mathbf{z}$ is defined as
$$\mathfrak{R}_n(\mathcal{G}; \mathbf{z}) \triangleq \mathbb{E}_{\epsilon} \sup_{g \in \mathcal{G}} \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)),$$
and we denote $\mathfrak{R}_n(\mathcal{G}) \triangleq \sup_{\mathbf{z}} \mathfrak{R}_n(\mathcal{G}; \mathbf{z})$. We write $\mathbf{z}_t(\epsilon)$ for the labels of the tree along the path given by $\epsilon$.
For any lower semi-continuous loss $\ell$, and any adaptive rate $B_n$ that depends only on the outcomes (i.e., $B_n(f; y_1, \ldots, y_n) = B_n(y_1, \ldots, y_n)$), we have that
Further, for any general adaptive rate $B_n(f; y_1, \ldots, y_n)$,
Finally, if one considers the supervised learning problem where $\mathcal{D} = \mathcal{Y}$, and $\ell$ is a loss that is convex and $L$-Lipschitz in its first argument, then for any adaptive rate $B_n$,
The above lemma tells us that to check whether an adaptive rate $B_n$ is achievable, it suffices to check that the corresponding adaptive sequential complexity measures are non-positive. We remark that if the above complexities are instead bounded by some positive quantity of smaller order, one can form a new achievable rate by adding that positive quantity to $B_n$.
3 Probabilistic Tools
As mentioned in the introduction, our technique rests on certain one-sided probabilistic inequalities. We now state the first building block: a rather straightforward maximal inequality.
Let $\mathcal{I}$ be a countable set of indices, and let $(X_i)_{i \in \mathcal{I}}$ be a collection of random variables satisfying the following tail condition: for any $u \ge 0$ and every $i \in \mathcal{I}$,
for some positive sequence $(s_i)_{i \in \mathcal{I}}$, nonnegative sequence $(B_i)_{i \in \mathcal{I}}$, and nonnegative sequence of numbers $(p_i)_{i \in \mathcal{I}}$, and for nonnegative constants. Then for any $u \ge 0$, any $i \in \mathcal{I}$, and
it holds that
We remark that $B_i$ need not be the expected value of $X_i$, as we are not interested in two-sided deviations around the mean.
One of the approaches to obtaining oracle-type inequalities is to split a large class into smaller ones according to a "complexity radius" and to control a certain stochastic process separately on each subset (also known as the peeling technique). In the applications below, $X_i$ will often stand for the (random) supremum of this process on the $i$th subset, and $B_i$ will be an upper bound on its typical size. Given deviation bounds for $X_i$ above $B_i$, the dilated size $B_i + p_i$ then allows one to pass to the maximal inequality (7) and thus verify achievability via Lemma 1. The same strategy works for obtaining data-dependent bounds, where we first prove tail bounds for the given size of the data-dependent quantity, and then appeal to (7).
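A toy Gaussian instance of this maximal-inequality pattern (our illustration, with hypothetical per-shell scales $\sigma_k$): inflating each shell's deviation threshold by a penalty of order $\sigma_k \sqrt{\log k}$ makes the per-index tail bounds summable, so a single union bound controls the supremum over all shells simultaneously.

```python
import math
import random

random.seed(1)
K = 50          # number of "complexity shells" (hypothetical)
trials = 2000
u = 2.0
sigmas = [1.0 + 0.1 * k for k in range(1, K + 1)]   # hypothetical scale per shell

# Penalty p_k chosen "slightly larger" than the scale: the extra
# sqrt(2 log(2 k^2)) factor makes the union bound over shells summable.
penalties = [s * math.sqrt(2.0 * math.log(2.0 * k * k))
             for k, s in zip(range(1, K + 1), sigmas)]

violations = 0
for _ in range(trials):
    xs = [s * random.gauss(0.0, 1.0) for s in sigmas]
    # Does any shell exceed its penalty by more than sigma_k * u?
    if any(x > p + s * u for x, p, s in zip(xs, penalties, sigmas)):
        violations += 1

freq = violations / trials
# Union bound: sum_k (2 k^2)^{-1} exp(-u^2/2) <= exp(-u^2/2)
union_bound = math.exp(-u * u / 2.0)
```

With the penalties in place, the empirical violation frequency falls far below the analytic union bound; without them, the largest of $K$ independent Gaussians would routinely exceed any fixed threshold.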
A simple yet powerful example of control of the supremum of a stochastic process is an inequality due to Pinelis for the norm (which can be written as a supremum over the dual ball) of a martingale in a 2-smooth Banach space. Here we state a version of this result that can be found in [23, Appendix A].
Let $\mathcal{D}$ be the unit ball in a separable 2-smooth Banach space $\mathcal{H}$. Then for any $\mathcal{H}$-valued tree $\mathbf{z}$,
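To see why $\sqrt{\sum_t \|\mathbf{z}_t\|^2}$ is the natural scale in the Hilbert case (2-smooth with the best constant), note the exact identity $\mathbb{E}\,\|\sum_t \epsilon_t x_t\|^2 = \sum_t \|x_t\|^2$: the cross terms cancel in expectation. The sketch below verifies this by exhaustive enumeration of sign patterns; it illustrates the scale appearing in Pinelis's bound, not the inequality itself, and the vectors are hypothetical.

```python
import itertools
import random

random.seed(2)
n, d = 10, 5
xs = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

def sq_norm(v):
    return sum(c * c for c in v)

# Average ||sum_t eps_t x_t||^2 over all 2^n Rademacher sign patterns.
total = 0.0
for eps in itertools.product((-1.0, 1.0), repeat=n):
    s = [sum(e * x[j] for e, x in zip(eps, xs)) for j in range(d)]
    total += sq_norm(s)
avg = total / 2.0 ** n

# Cross terms cancel in expectation, leaving exactly sum_t ||x_t||^2.
exact = sum(sq_norm(x) for x in xs)
```

Since the enumeration is exhaustive rather than sampled, `avg` matches `exact` up to floating-point error.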
When the class of functions is not linear, we may no longer appeal to the above lemma. Instead, we make use of the following result, which extends Lemma 3 at the price of a poly-logarithmic factor. Before stating the lemma, we briefly define the relevant complexity measures. First, a set $V$ of $\mathbb{R}$-valued trees is called an $\alpha$-cover of $\mathcal{G}$ on $\mathbf{z}$ with respect to the $\ell_2$ norm if
The size of the smallest such $\alpha$-cover is denoted by $\mathcal{N}_2(\mathcal{G}, \alpha, \mathbf{z})$, and we set $\mathcal{N}_2(\mathcal{G}, \alpha, n) \triangleq \sup_{\mathbf{z}} \mathcal{N}_2(\mathcal{G}, \alpha, \mathbf{z})$.
The set $V$ is an $\alpha$-cover of $\mathcal{G}$ on $\mathbf{z}$ with respect to the $\ell_\infty$ norm if
We let $\mathcal{N}_\infty(\mathcal{G}, \alpha, \mathbf{z})$ be the size of the smallest such cover and set $\mathcal{N}_\infty(\mathcal{G}, \alpha, n) \triangleq \sup_{\mathbf{z}} \mathcal{N}_\infty(\mathcal{G}, \alpha, \mathbf{z})$.
Lemma 4.
Let $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ be a function class satisfying mild boundedness and covering-number growth assumptions. Then for any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$,
The above lemma yields a one-sided control on the size of the supremum of the sequential Rademacher process, as required for our oracle-type inequalities.
Next, we turn our attention to an offset Rademacher process, where the supremum is taken over a collection of negative-mean random variables. The behavior of this offset process was shown to govern the optimal rates of convergence for online nonparametric regression. Such a one-sided control of the supremum will be necessary for some of the data-dependent upper bounds we develop.
Let $\mathbf{z}$ be a $\mathcal{Z}$-valued tree of depth $n$, and let $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$. For any offset parameter $c > 0$ and any $u > 0$, the supremum of the offset process
$$\sup_{g \in \mathcal{G}} \left[ \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)) - c \sum_{t=1}^{n} g(\mathbf{z}_t(\epsilon))^2 \right]$$
satisfies a one-sided tail bound whose constants are determined by $c$ and the complexity of $\mathcal{G}$.
We observe that the probability of deviation has both subgaussian and subexponential components.
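A one-dimensional sketch shows why the negative quadratic offset tames the supremum. For linear functions $g(z) = a z$ with $a \in [-1, 1]$ on a constant tree, $\sup_a \left[a \sum_t \epsilon_t - c\, a^2 n\right] = (\sum_t \epsilon_t)^2 / (4cn)$ whenever the maximizer is interior, so its expectation is exactly $1/(4c)$, a constant, whereas without the offset the supremum grows like $\sqrt{n}$. The parameters below are ours for illustration, and the averages are computed exactly by enumeration.

```python
import itertools

n, c = 16, 1.0

total_offset = 0.0
total_plain = 0.0
for eps in itertools.product((-1, 1), repeat=n):
    s = sum(eps)
    # sup over a in [-1, 1] of a*s - c*a^2*n; the maximizer a = s/(2cn) is
    # interior since |s| <= n <= 2*c*n, giving the value s^2 / (4*c*n).
    total_offset += s * s / (4.0 * c * n)
    # Without the offset, sup over a in [-1, 1] of a*s is just |s|.
    total_plain += abs(s)

avg_offset = total_offset / 2.0 ** n   # exactly 1/(4c), independent of n
avg_plain = total_plain / 2.0 ** n     # grows like sqrt(n)
```

Since $\mathbb{E}[(\sum_t \epsilon_t)^2] = n$ exactly, `avg_offset` equals $1/(4c) = 0.25$ regardless of $n$, while `avg_plain` is of order $\sqrt{n}$.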
Using the above result and Proposition 2 leads to useful bounds on the quantities in Lemma 1 for specific types of adaptive rates. Given a tree $\mathbf{z}$, we obtain a bound on the expected size of the sequential Rademacher process when we subtract off the data-dependent empirical norm of the function on the tree $\mathbf{z}$, adjusted by logarithmic terms.
Suppose $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ is bounded, and let $\mathbf{z}$ be any $\mathcal{Z}$-valued tree of depth $n$. Under a mild growth assumption on the sequential covering numbers of $\mathcal{G}$, the expected supremum of the offset process is bounded by a quantity that is logarithmic in $n$.
The next corollary yields slightly faster rates than Corollary 6 when the class $\mathcal{G}$ is finite.
Suppose $\mathcal{G}$ is a finite class, and let $\mathbf{z}$ be any $\mathcal{Z}$-valued tree of depth $n$. Then
4 Achievable Bounds
In this section we use Lemma 1 along with the probabilistic tools from the previous section to obtain an array of achievable adaptive bounds for various online learning problems. We subdivide the section into one subsection for each category of adaptive bound described in Section 1.1.
4.1 Adapting to Data
Here we consider adaptive rates of the form $B_n(y_1, \ldots, y_n)$ or $B_n(x_1, \ldots, x_n; y_1, \ldots, y_n)$, uniform over $f \in \mathcal{F}$. We show the power of the developed tools on the following example.
Example 4.1 (Online Linear Optimization).
Consider the problem of online linear optimization where $\mathcal{X}$ is a singleton and the decision and outcome sets are unit balls in dual normed spaces. The following adaptive rate is achievable:
where $\|\cdot\|_\sigma$ denotes the spectral norm. Let us deduce this result from Corollary 6. First, observe that
The linear function class can be covered point-wise at any scale $\alpha$ with exponentially many (in the dimension) balls, and thus
for any tree $\mathbf{z}$. We apply Corollary 6 (the entropy integral vanishes) to conclude the claimed statement.
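The data-dependent quantity in this rate is the spectral norm of an outcome-dependent matrix. A minimal sketch of how one might evaluate such a quantity, using power iteration on a symmetric positive semidefinite second-moment matrix, is below; this is our illustrative implementation, not part of the paper's algorithm, and the outcome vectors are hypothetical.

```python
def top_eigenvalue(mat, iters=300):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration.

    For a symmetric PSD matrix, the largest eigenvalue equals the spectral norm.
    """
    d = len(mat)
    v = [1.0] + [0.0] * (d - 1)        # arbitrary starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = max(abs(c) for c in w)   # Rayleigh-type scale estimate
        v = [c / lam for c in w]       # renormalize in the sup-norm
    return lam

# Second-moment matrix sum_t y_t y_t^T for a short outcome sequence in R^2.
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = 2
M = [[sum(y[i] * y[j] for y in ys) for j in range(d)] for i in range(d)]
# Here M = [[2, 1], [1, 2]], whose eigenvalues are 3 and 1.
spec = top_eigenvalue(M)
```

Power iteration converges geometrically when the top eigenvalue is separated from the rest, which suffices for evaluating a spectral-norm bound after the fact.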
4.2 Model Adaptation
In this subsection we focus on achievable rates for oracle inequalities and model selection, but without dependence on data. The form of the rate is therefore $B_n(f)$. Assume we have a nested family of classes $(\mathcal{F}_r)_{r \ge 1}$ with the property that $\mathcal{F}_r \subseteq \mathcal{F}_s$ for any $r \le s$. If we are told by an oracle that regret will be measured with respect to those hypotheses in $\mathcal{F}_r$, then using the minimax algorithm one can guarantee a regret bound of at most the sequential Rademacher complexity $\mathfrak{R}_n(\mathcal{F}_r)$. On the other hand, given the optimality of the sequential Rademacher complexity for online learning problems with commonly encountered losses, we can argue that for any $r$ chosen in hindsight, one cannot expect a regret better than order $\mathfrak{R}_n(\mathcal{F}_r)$. In this section we show that simultaneously for all $r$, one can attain an adaptive upper bound of order $\mathfrak{R}_n(\mathcal{F}_r)$ times a logarithmic factor. That is, we may predict as if we knew the optimal radius, at the price of a logarithmic factor. This is the price of adaptation.
For any nested family of classes of predictors with $\mathcal{F}_1$ non-empty, if one considers the supervised learning problem with an $L$-Lipschitz loss $\ell$, the following rate is achievable:
for absolute constants and for the quantities defined in Lemma 4.
In fact, this statement is true more generally with the sequential Rademacher complexity replaced by any valid upper bound on it.
It is tempting to attempt to prove the above statement with the exponential weights algorithm running as an aggregation procedure over the solutions for each radius $r$. In general, this approach will fail for two reasons. First, if function values grow with $r$, the exponential weights bound will scale linearly with this value. Second, an experts bound introduces an extra term of order $\sqrt{n}$, which spoils any faster rates one may obtain using offset Rademacher complexities.
As a special case of the above lemma, we obtain an online PAC-Bayesian theorem for infinite classes of experts. However, we postpone this example to the next subsection, where we derive a data-dependent version of this result. Neither of these bounds appears to be available in the literature, to the best of our knowledge.
We now provide a bound for online linear optimization in 2-smooth Banach spaces that automatically adapts to the norm of the comparator. To prove it, we use the concentration bound of Pinelis (Lemma 3) within the proof of the above corollary to remove the extra logarithmic factors.
Example 4.2 (Unconstrained Linear Optimization).
Consider linear optimization where the outcome set is the unit ball of some reflexive Banach space with norm $\|\cdot\|$, the comparator lives in the dual space, and the loss is $\ell(f, y) = \langle f, y \rangle$ (where $\langle f, y \rangle$ denotes the application of the linear functional $f$ to $y$). Let $\|\cdot\|_*$ denote the norm dual to $\|\cdot\|$. If the unit ball of the space is 2-smooth, then the following rate is achievable for all $f$ with $\|f\|_* < \infty$:
For the case of a Hilbert space, the above bound recovers a previously known result.
4.3 Adapting to Data and Model Simultaneously
We now study achievable bounds that perform online model selection in a data-adaptive way. Of specific interest is the example of an online optimistic PAC-Bayesian bound which, in contrast to earlier results, does not depend on the number of experts, and so holds for countably infinite sets of experts. The bound simultaneously adapts to the loss of the mixture of experts. This example subsumes and improves upon the recent results from [18, 14] and provides an exact analogue of the PAC-Bayesian theorem from statistical learning. Further, quantile experts bounds can be easily recovered from the result.
Example 4.3 (Generalized Predictable Sequences (Supervised Learning)).
Consider an online supervised learning problem with a convex $L$-Lipschitz loss. Let $(\tilde{f}_t)_{t \ge 1}$ be any predictable sequence that the learner can compute at round $t$ based on the information provided so far, including $x_t$. (One can think of the predictable sequence as a prior guess for the hypothesis we would compare with in hindsight.) Then the following adaptive rate is achievable:
for constants from Corollary 6. The achievability is a direct consequence of Eq. (5) in Lemma 1, followed by Corollary 6 (one can include any predictable sequence in the Rademacher average part because the Rademacher process is zero mean). In particular, if we assume that the sequential covering numbers of the class grow polynomially with exponent $p$, we get that
As $p$ gets closer to $0$, we get full adaptivity and replace the uniform complexity by its data-dependent analogue. On the other hand, as $p$ gets closer to $2$ (i.e., more complex function classes), we do not adapt and get a uniform bound. For intermediate values of $p$, we attain a natural interpolation.
Example 4.4 (Regret to Fixed Vs Regret to Best (Supervised Learning)).
Consider an online supervised learning problem with a convex $L$-Lipschitz loss and a class $\mathcal{F}$ of predictors. Let $\tilde{f} \in \mathcal{F}$ be a fixed expert chosen in advance. The following bound is achievable:
In particular, against $\tilde{f}$ we have
and against an arbitrary expert $f \in \mathcal{F}$ we have
Example 4.5 (Optimistic PAC-Bayes).
Assume that we have a countable set of experts and that the loss of each expert on any round is non-negative and bounded by $1$. The function class $\mathcal{F}$ is the set of all distributions over these experts. This setting can be formulated as online linear optimization where the loss of a mixture $f$ over experts, given the vector of experts' losses at round $t$, is the expected loss under the mixture. The following adaptive bound is achievable:
This adaptive bound is an online PAC-Bayesian bound. The rate adapts not only to the KL divergence of the comparator $f$ with a fixed prior $\pi$, but also replaces the horizon $n$ with the cumulative loss of the mixture $f$. Note that the cumulative loss of the mixture is at most $n$, so we recover the small-loss type bound described earlier. This is an improvement over prior bounds in that it is independent of the number of experts, and thus holds even for countably infinite sets of experts. The KL term in our bound may be compared to the MDL-style term in the bound mentioned earlier. If we have a large (but finite) number of experts and take the uniform prior $\pi$, the above bound provides an improvement for quantile bounds for experts. Specifically, if we want quantile bounds simultaneously for every quantile, then for any given quantile $\epsilon$ we can use the uniform distribution over the top $\epsilon$-fraction of experts, and hence the KL term is replaced by $\log(1/\epsilon)$.
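The quantile claim can be checked directly: for a uniform prior $\pi$ over $N$ experts and a comparator distribution uniform over the top $\epsilon N$ experts, $\mathrm{KL}(f \,\|\, \pi) = \log(1/\epsilon)$ exactly, independent of $N$. A quick check with hypothetical values of $N$ and $\epsilon$:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for distributions given as lists, with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

N = 1000
prior = [1.0 / N] * N                    # uniform prior over N experts
kls = []
for eps in (0.5, 0.1, 0.01):
    k = int(eps * N)                     # top eps-fraction of experts
    comparator = [1.0 / k] * k + [0.0] * (N - k)
    kls.append((kl_divergence(comparator, prior), math.log(1.0 / eps)))
```

Each pair in `kls` matches because $\mathrm{KL} = \sum_{i \le k} \frac{1}{k} \log\frac{N}{k} = \log(N/k) = \log(1/\epsilon)$, which is exactly the replacement of the KL term described above.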
Evaluating the above bound with a distribution that places all its weight on any one expert appears to address an open question posed in prior work: obtaining algorithm-independent oracle-type variance bounds for experts.
The proof of achievability of the above rate is shown in the appendix because it requires a slight variation on the symmetrization lemma specific to the problem.
5 Relaxations for Adaptive Learning
To design algorithms for achievable rates, we extend the framework of online relaxations. A relaxation $\mathbf{Rel}_n$ is admissible for an adaptive rate $B_n$ if it satisfies the initial condition
and the recursive condition
The corresponding strategy enjoys the adaptive bound
It follows immediately that the strategy achieves the rate $B_n$. Our goal is then to find relaxations for which the strategy is computationally tractable and whose value is at most of the same (or smaller) order as $B_n$. As in the non-adaptive case, conditional versions of the offset minimax values yield admissible relaxations, but solving these relaxations may not be computationally tractable.
Example 5.1 (Online PAC-Bayes).
Consider the experts setting described in Example 4.5 and an adaptive bound,
Let $\pi$ be a prior over the complexity radii, and let $q_t^k$ denote the exponential weights distribution with learning rate $\eta_k$ given the losses observed so far (wherein $e_i$ is the $i$th standard basis vector). The following is an admissible relaxation achieving the bound:
To implement this strategy, we maintain a distribution over complexity radii. We predict by first drawing a radius according to this distribution, and then drawing an expert according to the corresponding exponential weights distribution $q_t^k$.
This algorithm can be interpreted as running a “low-level” instance of the exponential weights algorithm for each complexity radius , then combining the predictions of these algorithms with a “high-level” instance. The high-level distribution differs slightly from the usual exponential weights distribution in that it incorporates a prior whose weight decreases as the complexity radius increases. The prior distribution prevents the strategy from incurring a penalty that depends on the range of values the complexity radii take on, which would happen if the standard exponential weights distribution were used.
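The role of the decaying prior can be sketched with the standard exponential-weights-with-prior guarantee (a classical fact, not the paper's full two-level relaxation): with prior $\pi$ over meta-experts and losses in $[0, 1]$, the regret to meta-expert $k$ is at most $\log(1/\pi_k)/\eta + \eta n / 8$, so a prior $\pi_k \propto 2^{-k}$ charges radius $k$ only an additive penalty linear in $k$, not in the range of radii. The setup below, including all parameters, is our illustration.

```python
import math
import random

def exp_weights_with_prior(loss_rows, prior, eta):
    """Exponential weights over K meta-experts with a non-uniform prior.

    Returns (algo_loss, cumulative_losses) for per-round losses in [0, 1].
    """
    K = len(prior)
    log_w = [math.log(p) for p in prior]   # initialize weights at the prior
    algo_loss = 0.0
    cumulative = [0.0] * K
    for losses in loss_rows:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        p = [wi / z for wi in w]
        algo_loss += sum(pi * li for pi, li in zip(p, losses))
        for k, lk in enumerate(losses):
            cumulative[k] += lk
            log_w[k] -= eta * lk
    return algo_loss, cumulative

random.seed(3)
n, K = 400, 8
raw = [2.0 ** -(k + 1) for k in range(K)]
prior = [r / sum(raw) for r in raw]        # pi_k proportional to 2^{-k}
rows = [[random.random() for _ in range(K)] for _ in range(n)]
eta = math.sqrt(8.0 * math.log(K) / n)
algo, cum = exp_weights_with_prior(rows, prior, eta)
# Classical guarantee: regret to meta-expert k <= log(1/pi_k)/eta + eta*n/8.
bounds = [math.log(1.0 / prior[k]) / eta + eta * n / 8.0 for k in range(K)]
```

The per-index bounds hold simultaneously for every meta-expert $k$, which is exactly the property the high-level distribution with a decaying prior exploits.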
While in general the problem of obtaining an efficient adaptive relaxation might be hard, one can ask the question, "If an efficient relaxation is available for each $\mathcal{F}_r$, can one obtain an adaptive model selection algorithm for all of $\mathcal{F}$?" To this end, for the supervised learning problem with a convex Lipschitz loss, we delineate a meta approach that utilizes the existing relaxations for each $\mathcal{F}_r$ to obtain an algorithm for general adaptation.
Let $q_t^r$ be the randomized strategy corresponding to the relaxation for $\mathcal{F}_r$, obtained after observing outcomes $y_1, \ldots, y_{t-1}$, and let the rate be nonnegative. The following relaxation is admissible for this rate:
Playing according to the corresponding strategy guarantees a regret bound of the stated form, and the residual term can be bounded using Proposition 2 when the tail condition takes the form given in that proposition.
We remark that the above strategy is not necessarily obtained by running a high-level experts algorithm over the discretized values of $r$. It is an interesting question to determine the cases in which such a strategy is optimal. More generally, whenever the adaptive rate depends on the data, it is not possible to obtain the rates we establish non-constructively in this paper via some form of exponential weights over meta-experts, because the required weighting over experts would be data-dependent (and hence is not a prior over experts). Further, the bounds obtained from exponential-weights-type algorithms are more akin to the sub-exponential tails in Proposition 2, whereas for many problems we have sub-gaussian tails.
Obtaining computationally efficient methods from the proposed framework is an interesting research direction. Proposition 2 provides a useful non-constructive tool for establishing achievable adaptive bounds, and a natural question is whether one can obtain a constructive counterpart to the proposition.