On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities
We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of martingale type for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the existence of regret minimization strategies for certain online prediction problems.
Let be a martingale difference sequence taking values in a separable -smooth Banach space . A result due to Pinelis  asserts that for any
where is a constant satisfying . Writing the norm as the supremum over the dual ball, we may re-interpret as a one-sided tail control for the supremum of a stochastic process . In this paper, we consider several extensions of , motivated by the following questions:
(a) Can be strengthened by replacing with a “path-dependent” version of variation?
(b) Does a version of hold when we move away from the linear structure of the Banach space?
Positive answers to these questions constitute the first contribution of our paper. The second contribution involves the actual technique. The cornerstone of our analysis is a certain equivalence of martingale inequalities and deterministic pathwise statements. The latter inequalities are studied in the field of online learning (or, sequential prediction), and are referred to as regret bounds. We show that the existence (which can be certified via the minimax theorem) of prediction strategies that minimize regret yields predictable processes that help in answering (a) and (b). The equivalence is exploited in both directions, whereby stronger regret bounds are derived from the corresponding probabilistic bounds, and vice versa. To obtain one of the main results in the paper, we sharpen the bound by passing several times between the deterministic statements and probabilistic tail bounds. The equivalence asserts a strong connection between probabilistic inequalities for martingales and online learning algorithms.
In the remainder of this section, we present a simple example of the equivalence based on the gradient descent method, arguably the most popular convex optimization procedure. The example captures, loosely speaking, a correspondence between deterministic optimization methods and probabilistic bounds. Consider the unit Euclidean ball in . Let and define, recursively, the Euclidean projections
for each , with the initial value . Elementary algebra
Applying the deterministic inequality to a -valued martingale difference sequence ,
The latter upper bound is an application of the Azuma-Hoeffding inequality. Indeed, the process is predictable with respect to , and thus is a -valued martingale difference sequence. It is worth emphasizing the conclusion: one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic inequality. Next, integrating the tail bound in yields a seemingly weaker in-expectation statement
for an appropriate constant . The twist in this uncomplicated story comes next: with the help of the minimax theorem,  established existence of strategies such that
with the supremum taken over all martingale difference sequences with respect to a dyadic filtration. In view of , this bound is .
What have we achieved? Let us summarize. The deterministic inequality , which holds for all sequences, implies a tail bound . The latter, in turn, implies an in-expectation bound , which implies (with a worse constant) through a minimax argument, thus closing the loop. The equivalence—studied in depth in this paper—is informally stated below:
Informal: The following bounds imply each other: (a) an inequality that holds for all sequences; (b) a deviation tail probability for the size of a martingale; (c) an in-expectation bound on the size of a martingale.
The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds.
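The first link of this chain, from the deterministic inequality to a comparison of a norm with a real-valued sum, can be checked numerically. The following is a minimal sketch under stated assumptions: the step size 1/sqrt(n), the initial point at the origin, and the resulting bound sqrt(n) are the choices from the standard analysis of projected gradient descent, and the function names are hypothetical. For any sequence z_1, ..., z_n in the unit ball, the sketch returns the norm of the sum and the quantity sum_t <y_t, -z_t>, where y_t is the gradient descent iterate; the standard regret analysis gives norm - sqrt(n) <= sum_t <y_t, -z_t>.

```python
import math
import random


def project_to_unit_ball(v):
    """Euclidean projection of v onto the unit ball."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def gradient_descent_certificate(zs):
    """Run y_{t+1} = Proj(y_t - z_t / sqrt(n)) starting from y_1 = 0.

    Returns the pair (||sum_t z_t||, sum_t <y_t, -z_t>).
    The step size 1/sqrt(n) is an assumption of this sketch.
    """
    n, d = len(zs), len(zs[0])
    eta = 1.0 / math.sqrt(n)
    y = [0.0] * d
    total = [0.0] * d
    payoff = 0.0
    for z in zs:
        # y was computed before z is revealed, so it is "predictable".
        payoff += sum(-yi * zi for yi, zi in zip(y, z))
        total = [s + zi for s, zi in zip(total, z)]
        y = project_to_unit_ball([yi - eta * zi for yi, zi in zip(y, z)])
    return math.sqrt(sum(s * s for s in total)), payoff


def random_unit_ball_sequence(n, d, rng):
    """Hypothetical test input: n points of the unit ball in dimension d."""
    return [
        project_to_unit_ball([rng.uniform(-1.0, 1.0) for _ in range(d)])
        for _ in range(n)
    ]
```

Because the inequality is pathwise, it holds for adversarial inputs as well as random ones; when the z_t are martingale differences, the right-hand side is a real-valued martingale with increments bounded by one, which is where a scalar tail bound enters.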
As already mentioned, the pathwise inequalities, such as , are extensively studied in the field of online learning. In this paper, we employ some of the recently developed data-dependent (adaptive) regret inequalities to prove tail bounds for martingales. In turn, in view of the above equivalence, martingale inequalities shall give rise to novel deterministic regret bounds.
While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob’s maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts . The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.
This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces, and also improve it by replacing with the square root of variation. We define a notion of martingale type for general classes of functions in Section 3, and exhibit a tight connection to the growth of sequential Rademacher complexity. Section 4 presents sequential symmetrization; here we prove that statements for the dyadic filtration automatically yield corresponding tail bounds for general discrete-time stochastic processes. In Section 5, we introduce the machinery for obtaining regret inequalities, and show how these inequalities allow one to amplify certain in-expectation bounds into high-probability statements (Section 6). The last two sections contain some of the main results: in Section 7 we prove a high-probability bound for the notion of martingale type, and present a finer analysis of adaptivity of the variation term in Section 8.
2 Results in Banach spaces
For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the bound of can be improved to a distribution-dependent quantity . Given the equivalence sketched earlier, one may wonder whether this implies existence of a gradient-descent-like method with a sequence-dependent variation governing the rate of convergence of this optimization procedure. Below, we indeed present such a method for -smooth Banach spaces.
Let be a separable Banach space, and let denote its dual. is of martingale type (for ) if there exists a constant such that
for any -valued martingale difference sequence. The best possible constant in this inequality (as well as its finiteness) is known to depend on the geometry of the Banach space. For instance, for a Hilbert space holds for with constant . On the other hand, the triangle inequality implies that any space has the trivial type .
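The Hilbert space case can be illustrated by exact enumeration. The sketch below assumes the standard fact that a Hilbert space has martingale type 2 with constant 1, which follows from orthogonality of martingale differences: the expected squared norm of the sum equals the expected sum of squared norms. The predictable process used here is a hypothetical example in the plane, and the martingale is dyadic, with increments eps_t * x_t where x_t depends only on the earlier signs.

```python
import itertools


def predictable_vector(prefix):
    """Hypothetical predictable process: a vector in R^2 built from past signs only."""
    s = sum(prefix)
    return (1.0 + 0.5 * s, 0.25 * len(prefix) - s)


def second_moments(n):
    """Exact (E||sum_t eps_t x_t||^2, E sum_t ||x_t||^2) over all 2^n sign paths."""
    lhs = 0.0
    rhs = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total = [0.0, 0.0]
        variation = 0.0
        for t in range(n):
            x = predictable_vector(eps[:t])  # uses eps_1..eps_{t-1} only
            total[0] += eps[t] * x[0]
            total[1] += eps[t] * x[1]
            variation += x[0] ** 2 + x[1] ** 2
        lhs += total[0] ** 2 + total[1] ** 2
        rhs += variation
    return lhs / 2 ** n, rhs / 2 ** n
```

The two returned values coincide because every cross term has conditional mean zero. In a general Banach space no such identity is available, which is why the constant and the exponent depend on the geometry of the space.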
An equivalent way to define martingale type is to ask that there exist a constant such that
We now show that the strengthening to a sequence-dependent variation holds for any -smooth Banach space. Based on the equivalence mentioned earlier, we immediately obtain tail bounds.
Assume is -smooth. Let be the Bregman divergence with respect to a convex function , which is assumed to be -strongly convex on the unit ball of . Denote . We extend and improve as follows.
In addition to extending the Euclidean result of the previous section to Banach spaces, offers several advantages. First, it is -independent. Second, deviations are self-normalized (that is, scaled by root-variation terms). We refer to Lemma ? for other forms of probabilistic bounds.
To prove the theorem, we start with a deterministic inequality from . For completeness, the proof is provided in the Appendix.
We take to be the unit ball in , ensuring . For any martingale difference sequence with values in , the above lemma implies, by definition of the norm,
for all sample paths. Dividing both sides by , we conclude that the left-hand side in is upper bounded by
To control this probability, we recall the following result :
To apply this theorem, we verify assumption :
The proof of the Lemma, as well as most of the proofs in this paper, is postponed to the Appendix. This concludes the proof of Theorem ?.
Let us make several remarks. First,  proves a more general deterministic inequality: for any collection of functions , there exists a strategy such that
Second, the reader will notice that the pathwise inequality does not depend on and the construction of is also oblivious to this value. A simple argument (Lemma ? in the Appendix) then allows us to lift the real-valued Burkholder-Davis-Gundy inequality (with the constant from ) to Banach space valued martingales:
Notably, the constant in the resulting BDG inequality is proportional to .
We also remark that Theorem ? can be naturally extended to -smooth Banach spaces . This is accomplished in a straightforward manner by extending Lemma ?.
In conclusion, we were able to replace the distribution-independent bound with a sequence-dependent quantity . One may ask whether this phenomenon is general; that is, whether a sequence-dependent variation bound necessarily holds whenever the corresponding distribution-independent bound does. We prove in Theorem ? below that this is indeed the case (up to a logarithmic factor), a result that holds for general classes of functions.
3 Martingale Type for a General Class of Functions
We now define the analogue of a martingale type for a class of real-valued measurable functions on some abstract measurable space . To this end, we assume that is a discrete time process on a probability space . Let denote the expectation on this probability space, and let denote the conditional (given ) expectation with respect to . For any ,
is a sum of martingale differences . We let be a tangent sequence; that is, and are independent and identically distributed conditionally on . Let denote the conditional (given ) expectation with respect to .
In proving , we shall work with a dyadic filtration. Let generated by independent Rademacher (symmetric -valued) random variables . Let be a predictable process with respect to this filtration (that is, is -measurable) with values in some set . Sequential Rademacher complexity
We will work with a particular class of functions defined on . It is immediate that exhibits whenever does, and vice versa, with at most doubling of the constant .
Using a sequential symmetrization technique, it holds (see ) that
Therefore, the statement “ has martingale type whenever exhibits an growth” corresponds to the phenomenon that, loosely speaking, “one may replace the distribution-independent bound with a sequence-dependent variation.”
The next theorem shows a tight connection between the complexity growth and martingale type.
The proof relies on the development in the next few sections, and especially on Lemma ?. The technique is partly inspired by the work of Burkholder  and Pisier . In particular, a key tool is the reverse Hölder principle .
where is expectation with respect to the tangent sequence, conditionally on . Then Theorem ? states that with high probability controlled by ,
whenever exhibits growth of sequential Rademacher complexity. The theorem in Section 8 addresses the case of martingale type and states that with high probability controlled by ,
whenever sequential entropy (defined below) at scale behaves as .
3.1 Other complexity measures
We see that the martingale type of is described by the behavior of sequential Rademacher complexity. The latter behavior can, in turn, be quantified in terms of geometric quantities, such as sequential covering numbers and the sequential scale-sensitive dimension. We present the following two definitions from , both stated in terms of a predictable process with respect to the dyadic filtration. It may be beneficial (at least it was for the authors of ) to think of as a complete binary tree of depth , decorated by elements of , and specifying a path in this tree.
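The tree picture can be made concrete with a small sketch. Here a predictable process is represented as a function from the tuple of past signs to a point, and the expected supremum along a random path is computed by exact enumeration. The unnormalized form of the quantity, the class of thresholds on the unit interval, and the two trees are assumptions chosen for illustration; sequential Rademacher complexity would additionally take a supremum over all trees of the given depth.

```python
import itertools


def tree_value(tree, function_class, n):
    """Exact E_eps max_f sum_t eps_t * f(x_t(eps_1..eps_{t-1})) for one fixed tree."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        path = [tree(eps[:t]) for t in range(n)]  # x_t depends only on the past
        total += max(sum(e * f(x) for e, x in zip(eps, path)) for f in function_class)
    return total / 2 ** n


def make_threshold(theta):
    """Threshold function x -> +1 if x >= theta else -1 (hypothetical class)."""
    return lambda x: 1.0 if x >= theta else -1.0


def bisection_tree(prefix):
    """Tree on [0, 1] that halves the current interval according to each past sign."""
    lo, hi = 0.0, 1.0
    for e in prefix:
        mid = (lo + hi) / 2
        if e == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def constant_tree(prefix):
    """Tree that ignores the past and always returns the same point."""
    return 0.5


DEPTH = 4
THRESHOLDS = [make_threshold((2 * k + 1) / 2 ** (DEPTH + 1)) for k in range(2 ** DEPTH)]
```

On the bisection tree every sign pattern along a path is matched by some threshold, so the value equals the depth, 4. On the constant tree the class collapses to two behaviors and the value drops to 1.5, the expected absolute value of a sum of four signs. The gap between the two is what distinguishes sequential complexities from their classical counterparts for this class.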
The sequential covering numbers and the fat-shattering dimension are natural extensions of the classical notions, as shown in . In particular, a Dudley-type entropy integral upper bound in terms of sequential covering numbers holds for sequential Rademacher complexity. The sequential covering numbers, in turn, are upper bounded in terms of the fat-shattering dimensions, in a parallel to the way classical empirical covering numbers are controlled by the scale-sensitive version of the Vapnik-Chervonenkis dimension. We summarize the implications of these relationships in the following corollary:
We have established a relation between the martingale type of a function class and several sequential complexities of the class. However, unlike our starting point and Theorem ?, our results so far do not quantify the tail behavior for the difference between the supremum of the martingale process and the corresponding variation. A natural idea is to mimic the “equivalence” argument used in Section 2 to conclude the exponential tail bounds. Unfortunately, the deviation inequalities of the previous section rest on pathwise regret bounds that, in turn, rely on the linear structure of the associated Banach space, as well as on properties such as smoothness and uniform convexity. Without the linear structure, it is not clear whether the analogous pathwise statements hold. The goal of the rest of the paper is to bring forth some of the tools recently developed within the online learning literature, and to apply these pathwise regret bounds to conclude high probability tail bounds associated to martingale type. In addition to this goal, we will seek a version of Theorem ?(i) for bounded functions, where the growth of sequential Rademacher complexity implies martingale type (rather than any ), but with an additional factor. Our third goal will be to establish per-function variation bounds (similar to the notion of a weak variance ). We show that this latter bound is a finer version of the variation term, possible for classes that are “not too large”.
Our plan is as follows. First, we reduce the problem to one based on the dyadic filtration. After that, we shall introduce certain deterministic inequalities from the online learning literature that are already stated for the dyadic filtration.
4 Symmetrization: dyadic filtration is enough
The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization. Consider the martingale
indexed by . If is adapted to a dyadic filtration , each increment takes on the value
or its negation, where is a predictable process with values in and defined by . In the rest of the paper, we work directly with martingales of the form , indexed by an abstract class and an abstract -valued predictable process .
We extend the symmetrization approach of Panchenko  to sequential symmetrization for the case of martingales. In contrast to the more frequently used Giné-Zinn symmetrization proof (via Chebyshev’s inequality)  that allows a direct tail comparison of the symmetrized and the original processes, Panchenko’s approach allows for an “indirect” comparison. The following immediate extension of  will imply that any type tail behavior of the symmetrized process yields the same behavior for the original process.
As in , the lemma will be used with and as functions of a single sample and the double sample, respectively. The expression for the double sample will be symmetrized in order to pass to the dyadic filtration. However, unlike , we are dealing with a dependent sequence , and the meaning ascribed to the “second sample” is that of a tangent sequence. That is, are independent and have the same distribution conditionally on . Let stand for the conditional expectation given .
We conclude that it is enough to prove tail bounds for a supremum
of a martingale with respect to the dyadic filtration, offset by a function . This will be achieved with the help of deterministic regret inequalities.
5 Deterministic regret inequalities
We let and for some abstract measurable set . Let be a class of -valued functions on . Fix a cost function , convex in the first argument. For a given function , we aim to construct such that
We may view as a prediction of the next value having observed and all the data thus far. In this paper, we focus on the linear loss (equivalently, absolute loss when ) and . We equivalently write for the linear cost function as
while for the square loss it becomes
Given a function and a class , there are two goals we may consider: (a) certify the existence of satisfying the pathwise inequality for all sequences ; or (b) give an explicit construction of . Both questions have been studied in the online learning literature, but the non-constructive approach will play an especially important role. Indeed, explicit constructions—such as the simple gradient descent update — might not be available in more complex situations, yet it is the existence of that yields the sought-after tail bounds.
5.2 Existence of strategies
To certify the existence of a strategy , consider the following object:
where the notation stands for the repeated application of the operators (the outer operators corresponding to ). The variable ranges over , is in the set , and ranges in . It follows that
is a necessary and sufficient condition for the existence of such that holds.
Indeed, the optimal choice for is made given ; the optimal choice for is made given , and so on. This choice defines the optimal strategy .
Suppose we can find an upper bound on and then prove that this upper bound is non-positive. This would serve as a sufficient condition for the existence of . Next, we present such an upper bound for the case when the cost function is linear. More general results for convex Lipschitz cost functions can be found in .
As before, let be a sequence of independent Rademacher random variables. Let and be predictable processes with respect to the dyadic filtration , with values in and , respectively. In other words, and for each .
Condition in the previous lemma implies the existence of a strategy for . However, there might be situations when can be verified for a function of the predictable process that does not have a corresponding representation in the sense of . The next lemma provides a variant of Lemma ?.
6 Amplification and equivalence
We now describe an interesting amplification phenomenon, already presented in the Introduction for the simple Euclidean case. Whenever holds, the deterministic inequality holds, and, therefore, we may apply it to a particular martingale difference sequence to obtain high-probability bounds. Below, we detail this amplification for both linear and square loss functions.
Take any -valued predictable process with respect to the dyadic filtration. The deterministic inequality applied to and becomes
for any , and thus we have the comparison of tails
Given the boundedness of the increments , the tail bounds follow immediately from the Azuma-Hoeffding inequality or from Freedman’s inequality . More precisely, we use the fact that the martingale differences are bounded by , and conclude
Let us emphasize the conclusion of the above lemma: the non-positivity of the expected supremum of a collection of martingales, offset by a function , implies existence of a regret-minimization strategy, which implies a high-probability tail bound. To close the loop, we integrate out the tails, obtaining an in-expectation bound of the form , but possibly with a larger function. This is a more general form of the equivalence promised in the introduction.
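The step from a non-positive expected supremum to a strategy can be made explicit in a simplified setting. The sketch below is an illustration under assumptions, not the general construction: there is no side information, outcomes are signs, the class is a finite set of fixed sequences with values in [-1, 1], and performance is measured by the linear gain sum_t yhat_t * y_t. With the offset B taken to be the expected maximum over random signs, the expected supremum offset by B is zero, and the strategy that predicts half the difference of the two conditional expectations attains regret exactly B on every sequence. All function names are hypothetical.

```python
import itertools


def best_gain(experts, y):
    """Largest cumulative gain sum_t f_t * y_t over the finite class."""
    return max(sum(f_t * y_t for f_t, y_t in zip(f, y)) for f in experts)


def conditional_mean(phi, prefix, n):
    """Average of phi over all sign completions of the given prefix."""
    k = n - len(prefix)
    completions = itertools.product((-1, 1), repeat=k)
    return sum(phi(prefix + rest) for rest in completions) / 2 ** k


def regret_of_strategy(experts):
    """Return (B, regrets): B = E max_f sum_t eps_t f_t, and regrets maps each
    sign sequence y to max_f sum_t f_t y_t - sum_t yhat_t y_t."""
    n = len(experts[0])
    gain = lambda y: best_gain(experts, y)
    bound = conditional_mean(gain, (), n)
    regrets = {}
    for y in itertools.product((-1, 1), repeat=n):
        earned = 0.0
        for t in range(n):
            up = conditional_mean(gain, y[:t] + (1,), n)
            down = conditional_mean(gain, y[:t] + (-1,), n)
            y_hat = (up - down) / 2  # depends only on y_1..y_{t-1}
            assert abs(y_hat) <= 1.0 + 1e-12  # a legal prediction in [-1, 1]
            earned += y_hat * y[t]
        regrets[y] = gain(y) - earned
    return bound, regrets
```

The identity behind the sketch is the martingale decomposition of a function on the hypercube: the function minus its mean equals the sum of the outcomes multiplied by these half-differences. Applying the resulting pathwise inequality to a random sign sequence, or to any martingale difference sequence in place of y, is the amplification step described above.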
The next goal is to find nontrivial functions such that holds. The most basic is a constant that depends on the complexity of , but not on or the data. Define the worst-case sequential Rademacher averages as
Clearly, satisfies . The following is immediate.
Superficially, looks like a one-sided version of the concentration bound for classical (i.i.d.) Rademacher averages . However, sequential Rademacher averages are not Lipschitz with respect to a flip of a sign, as the whole remaining path may change after a flip.
As for the case of the linear loss function, take any -valued predictable process with respect to the dyadic filtration. Fix . The deterministic inequality for and becomes
As in the proof of , we obtain a tail comparison
Once again, the most basic choice for is the constant that depends on the complexity of the class. We recall the following result from .
We conclude that is satisfied with the data-independent constant . Hence, the following analogue of Corollary ? holds:
To summarize, in Section 5 we presented the machinery of regret inequalities, as well as sufficient conditions for existence of strategies. In the present section we used the pathwise statements, along with real-valued deviation inequalities, to conclude tail bounds, which, in turn, certify existence of regret-minimization strategies. In the next two sections we put these techniques to use.
7 Uniform variation and tail bounds for general martingale type
We now make extensive use of the amplification technique to prove in-probability versions of the “martingale type” definition. We start by working with dyadic martingales of the form where is a predictable process (with respect to the dyadic filtration) with values in . Once the results for these objects are established, we conclude the corresponding statements for general processes of the form via the sequential symmetrization technique summarized in Corollary ?.
As in Section 3, we assume a growth condition on sequential Rademacher complexity.
The second part of the proof of Lemma ? uses the amplification idea of the previous section.
Using Lemma ?, we can now conclude the existence of prediction strategies whose regret is controlled by sequence-dependent variance. This greatly extends the scope of available variance-type bounds in the online learning literature, where results in this direction have been obtained for either finite or linear classes.
In addition to being a novel result in the online learning domain, the above corollary serves as an amplification step to boost the in-expectation bound of Lemma ? to a high-probability statement. We then invoke Corollary ? and Lemma ? to prove the following theorem.
We remark that the tail bound can be viewed as a ratio inequality (see ) of the form , where the deviations are scaled by the square root of the variance.
8 Finer control via per-function variation
From the point of view of the previous section (and Theorem ?), all classes with sequential Rademacher complexity growth are treated equally. However, classes with such a growth can be as simple as a set consisting of two functions, or as complex as a set of linear functions indexed by a ball in an infinite-dimensional Hilbert space. In this section, a different complexity measure will be used for the regime when the growth hides the difference in complexity. This measure will be given by sequential covering numbers (and, as a consequence, by the offset Rademacher complexity). In the regime , , for the growth of sequential entropy, we exhibit a finer analysis of the variation term that allows part of the variance to be adapted to the function.
Let . We say that a class has the growth (as decreases) of sequential entropy if there is a constant such that for all ,
As for sequential Rademacher complexity, it is easy to check that the class and the derived class of functions have the same growth of sequential entropy. Moreover, this growth controls the rate of growth of the offset Rademacher complexity, as shown in . In particular, for the finite function class,
while for a parametric class of “dimension” (such that for some ),
and for a class with sequential entropy growth ,
for some absolute constant (the bound gains an extra logarithmic factor at ). In this last nonparametric regime, Corollary ? implies that for any ,
and the analogous statements hold for the finite and parametric cases. As the next theorem shows, the offset Rademacher complexity brings out (for smaller classes) the finer complexity control obscured by the sequential Rademacher complexity, which only provides bounds.
The finite and parametric cases can be thought of as a “” regime. Here, we have a bound that depends on at most logarithmically. On the other hand, for the term is replaced with , without any per-function adaptivity (as studied in the previous section). Between these two regimes, we obtain an interpolation, whereby the power is split into a non-adaptive part and the adaptive part . This constitutes a finer analysis of classes with martingale type 2.
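The finite case can be examined directly. The sketch below assumes the offset Rademacher quantity has the form E max_f sum_t [eps_t * f(x_t) - c * f(x_t)^2] along a tree, and compares it with the bound log|F| / (2c). That bound comes from a standard exponential-moment argument with parameter 2c and is not necessarily the constant used in the text; the tree and the finite class are hypothetical examples.

```python
import itertools
import math


def offset_tree_value(tree, function_class, n, c):
    """Exact E_eps max_f sum_t [eps_t * f(x_t) - c * f(x_t)**2] for one fixed tree."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        path = [tree(eps[:t]) for t in range(n)]  # x_t depends only on the past
        total += max(
            sum(e * f(x) - c * f(x) ** 2 for e, x in zip(eps, path))
            for f in function_class
        )
    return total / 2 ** n


def finite_class_bound(num_functions, c):
    """log|F| / (2c), from the exponential-moment argument (an assumption here)."""
    return math.log(num_functions) / (2 * c)


def walk_tree(prefix):
    """Hypothetical tree: the point is the running sum of the past signs."""
    return float(sum(prefix))


# Hypothetical finite class of bounded functions on the real line.
FINITE_CLASS = [lambda x, k=k: math.cos(k * x) for k in range(1, 9)]
```

The bound does not grow with the depth of the tree, in contrast to the square-root growth of the non-offset quantity; the quadratic penalty is what allows the variation term to attach to the individual function.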
We may compare the bound of Theorem ? in the finite case to the in-expectation bound of  in terms of “weak variance” for i.i.d. zero mean random variables :
In contrast to this bound, Theorem ? matches the coordinate on the left-hand side to the variance of the th coordinate on the right-hand side. Further, our bound holds for martingale difference sequences rather than i.i.d. random vectors. Finally, Theorem ? holds well beyond the finite case.
9 Some Open Questions
The following are a few open-ended questions raised by this work:
In the definition of martingale type, can we replace with and reach the same conclusions? The latter version of variation is closer to the generalization of the martingale type for Banach spaces.
If for some , sequential Rademacher complexity exhibits growth rate, then does have martingale type ? Currently, we only prove martingale type for any . For the case of Banach spaces (linear ), the above question is answered in the positive in the work of Pisier . However, the result of  relies on the notions of uniform convexity or uniform smoothness which are specific to linear functionals and Banach spaces.
Is it possible to get a mix of uniform and per-function variance for general function classes with martingale type ? In Section 8, for martingale type we prove a finer control through per-function variance. A natural question is whether one can replace the -dependent part by uniform variance terms, thus giving a mix of per-function and uniform variance in the same bound.
The following two-line proof is standard. By the property of a projection,
Summing over yields the desired statement.
Because of the “anytime” property of the regret bound and the strategy definition, we can write as
simply because the right-hand side is largest for . Sub-additivity of implies
By the Burkholder-Davis-Gundy inequality (with the constant from ),
In view of , we conclude the statement.
Because of the update form,
Summing over ,
where we used strong convexity of and the fact that is nonincreasing. Next, we write
and upper bound the second term by noting that
Combining the bounds,
Now observe that
and thus the second term in is upper bounded as
For the first term, we use and
Since is a convex function,