A non-exponential extension of Sanov’s theorem via convex duality
This work is devoted to a vast extension of Sanov’s theorem, in Laplace principle form, based on alternatives to the classical convex dual pair of relative entropy and cumulant generating functional. The abstract results give rise to a number of probabilistic limit theorems and asymptotics. For instance, widely applicable non-exponential large deviation upper bounds are derived for empirical distributions and averages of i.i.d. samples under minimal integrability assumptions, notably accommodating heavy-tailed distributions. Other interesting manifestations of the abstract results include uniform large deviation bounds, variational problems involving optimal transport costs, and constrained super-hedging problems, as well as an application to error estimates for approximate solutions of stochastic optimization problems. The proofs build on the Dupuis-Ellis weak convergence approach to large deviations as well as the duality theory for convex risk measures.
An original goal of this paper was to extend the weak convergence methodology of Dupuis and Ellis  to the context of non-exponential (e.g., heavy-tailed) large deviations. While we claim only modest success in this regard, we do find some general-purpose large deviation upper bounds which can be seen as polynomial-rate analogs of the upper bounds in the classical theorems of Sanov and Cramér. At least as interesting, however, are the abstract principles behind these bounds, which have broad implications beyond the realm of large deviations. Let us first describe these abstract principles before specializing them in various ways.
Let be a Polish space, and let denote the set of Borel probability measures on endowed with the topology of weak convergence. Let (resp. ) denote the set of measurable (resp. continuous) and bounded real-valued functions on . For and , define and measurable maps for via the disintegration
In other words, if is an -valued random variable with law , then is the law of , and is the conditional law of given . Of course, are uniquely defined up to -almost sure equality.
The protagonist of the paper is a proper (i.e., not identically ) convex function with compact sub-level sets; that is, is compact for every . For define by
and note that . Define the convex conjugate by
Our main interest is in evaluating at functions of the empirical measure defined by
The guiding example is the relative entropy, , where is a fixed reference measure, and is defined by
It turns out that , by the so-called chain rule of relative entropy [17, Theorem B.2.1]. The dual is well known to be , and the formula relating and is often known as the Gibbs variational principle or the Donsker-Varadhan formula. In this case Theorem 1.1 reduces to the Laplace principle form of Sanov’s theorem:
Well known theorems of Varadhan and Dupuis-Ellis (see [17, Theorem 1.2.1 and 1.2.3]) assert the equivalence of this form of Sanov’s theorem with the more common form: for every Borel set with closure and interior ,
To derive this heuristically, apply Theorem 1.1 to the function
For general , Theorem 1.1 does not permit an analogous equivalent formulation in terms of deviation probabilities. In fact, for many , Theorem 1.1 has nothing to do with large deviations (see Sections 1.3 and 1.4 below). Nonetheless, for certain , Theorem 1.1 implies interesting large deviations upper bounds, which we prove by formalizing the aforementioned heuristic. While many admit fairly explicit known formulas for the dual , the recurring challenge in applying Theorem 1.1 is finding a useful expression for , and herein lies but one of many instances of the wonderful tractability of relative entropy. The examples to follow do admit good expressions for , or at least workable one-sided bounds, but we also catalog in Section 1.5 some natural alternative choices of for which we did not find useful bounds or expressions for .
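For the classical pair discussed above (relative entropy and the cumulant generating functional), the Gibbs variational principle can be checked numerically on a finite state space. The following is a minimal sketch under that assumption; the three-point space, the measure `mu`, the function `f`, and all function names are illustrative choices and do not come from the paper.

```python
import math

def log_mgf(f, mu):
    """log of the integral of e^f with respect to mu on a finite state space."""
    return math.log(sum(m * math.exp(v) for v, m in zip(f, mu)))

def relative_entropy(nu, mu):
    """H(nu | mu) = sum_i nu_i log(nu_i / mu_i), with the convention 0 log 0 = 0."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n > 0:
            if m == 0:
                return math.inf  # nu is not absolutely continuous w.r.t. mu
            total += n * math.log(n / m)
    return total

def gibbs_objective(f, nu, mu):
    """(integral of f d nu) - H(nu | mu), the quantity maximized over nu."""
    return sum(n * v for n, v in zip(nu, f)) - relative_entropy(nu, mu)

def tilted_measure(f, mu):
    """The maximizer in the Gibbs principle: nu*(i) proportional to mu(i) e^{f(i)}."""
    weights = [m * math.exp(v) for v, m in zip(f, mu)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical data for illustration only.
mu = [0.5, 0.3, 0.2]
f = [1.0, -2.0, 0.5]

nu_star = tilted_measure(f, mu)
dual_value = log_mgf(f, mu)                     # log of the integral of e^f d mu
primal_value = gibbs_objective(f, nu_star, mu)  # objective at the tilted measure
```

The two values agree up to rounding, and any other probability vector `nu` gives an objective no larger than `dual_value`.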
The functional is (up to a sign change) a convex risk measure, in the language of Föllmer and Schied . A rich duality theory for convex risk measures emerged over the past two decades, primarily geared toward applications in financial mathematics and optimization. We take advantage of this theory in Section 2 to demonstrate how can be reconstructed from in many cases, which shows that could be taken as the starting point instead of . Additionally, the theory of risk measures provides insight on how to deal with the subtleties that arise in extending the domain of (and Theorem 1.1) to accommodate unbounded functions or stronger topologies on . Section 1.7 briefly reinterprets Theorem 1.1 in a language more consistent with the risk measure literature. The reader familiar with risk measures may notice a time consistent dynamic risk measure (see  for definitions and survey) hidden in the definition of above.
We will make no use of the interpretation in terms of dynamic risk measures, but it did inspire a recursive formula for (similar to a result of ). To state it loosely, if then we may write
To make rigorous sense of this, we must note that is merely upper semianalytic and not Borel measurable in general, and that is well defined for such functions. We make this precise in Proposition A.1. This recursive formula is not essential for any of the arguments but is convenient for calculations.
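In the classical entropic case, the recursive formula reduces to iterated integration, and this can be verified by brute force on a finite state space. The sketch below assumes that case only (the functional is the log of an exponential integral against a product measure); the state space, the measure `mu`, and the test function `f` are hypothetical, and the code says nothing about the measurability subtleties handled in Proposition A.1.

```python
import itertools
import math

def rho(values, mu):
    """Entropic functional on a finite space: log sum_i mu_i exp(values_i)."""
    return math.log(sum(m * math.exp(v) for v, m in zip(values, mu)))

def rho_n_direct(f, mu, n):
    """log of the integral of e^f against the n-fold product of mu, by enumeration."""
    total = 0.0
    for xs in itertools.product(range(len(mu)), repeat=n):
        weight = 1.0
        for x in xs:
            weight *= mu[x]
        total += weight * math.exp(f(xs))
    return math.log(total)

def rho_n_recursive(f, mu, n, prefix=()):
    """Backward recursion: apply rho to the last free coordinate, one at a time."""
    if len(prefix) == n:
        return f(prefix)
    # Value of the remaining problem for each choice of the next coordinate.
    inner = [rho_n_recursive(f, mu, n, prefix + (x,)) for x in range(len(mu))]
    return rho(inner, mu)

# Hypothetical example: three states with real "locations", n = 3 samples.
mu = [0.2, 0.5, 0.3]
points = [0.0, 1.0, 3.0]
n = 3

def f(xs):
    """An arbitrary bounded function of the sample (x_1, ..., x_n)."""
    return sum(points[x] for x in xs) ** 2 / len(xs)

direct = rho_n_direct(f, mu, n)
recursive = rho_n_recursive(f, mu, n)
```

Both computations return the same number up to rounding; the recursive form evaluates one coordinate at a time, in the spirit of a dynamic programming principle.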
1.1. Nonexponential large deviations
A first application of Theorem 1.1 is to derive large deviation upper bounds in the absence of exponential rates or finite moment generating functions. While Cramér’s theorem in full generality does not require any finite moments, the upper bound is often vacuous when the underlying random variables have heavy tails. This simple observation has driven a large and growing literature on large deviation asymptotics for sums of i.i.d. random variables, to be reviewed shortly. Our approach is well suited not to precise asymptotics but rather to widely applicable upper bounds. In Section 4.1 we derive alternatives to the upper bounds of Sanov’s and Cramér’s theorems by applying (an extension of) Theorem 1.1 with
where is fixed. We state the results here: For a continuous function , let denote the set of satisfying , and equip with the topology induced by the linear maps , where is continuous and .
Let , and let denote the conjugate exponent. Let , and suppose for some continuous . Then, for every closed set ,
Let and . Let be a separable Banach space. Let be i.i.d. -valued random variables with . Define by (in the following, denotes the continuous dual of )
and define for . Then, for every closed set ,
In analogy with the classical Cramér’s theorem, the function in Corollary 1.3 plays the role of the cumulant generating function. In both Theorem 1.2 and Corollary 1.3, notice that as soon as the constant on the right-hand side is finite, we may conclude that the probabilities in question are . This is consistent with some now-standard results on one-dimensional heavy-tailed sums, for events of the form , for . For instance, it is known [44, Chapter IX, Theorem 28] that if are i.i.d. real-valued random variables with mean zero and , then . For , the well-known inequality of Fuk-Nagaev provides a related non-asymptotic bound; see [40, Corollary 1.8], or  for a Banach space version.
If instead a stronger assumption is made on , such as regular variation, then there are corresponding lower bounds for certain sets . Refer to the books [10, 23] and the survey of Mikosch and Nagaev  for detailed reviews of such results, as well as the more recent  and references therein. Indeed, precise asymptotics require detailed assumptions on the shape of the tails of , and this is especially true in multivariate and infinite-dimensional contexts. A recent line of interesting work extends the theory of regular variation to metric spaces [13, 28, 27, 36], but again the typical assumptions on the underlying law are substantially stronger than mere existence of a finite moment.
The main advantage of our results is their broad applicability, requiring only finite moments, but two other strengths are worth emphasizing. First, our bounds apply to arbitrary closed sets , which enables a natural contraction principle (i.e., continuous mapping). Section 4.4 illustrates this by using Theorem 1.2 to find error bounds for Monte Carlo schemes in stochastic optimization, essentially providing a heavy-tailed analog of the results of . Lastly, while this discussion has focused on literature related to our analog of Cramér’s upper bound (Corollary 1.3), our analog of Sanov’s upper bound (Theorem 1.2) seems even more novel. No other results are known to the author on empirical measure large deviations in heavy-tailed contexts. Of course, Sanov’s theorem applies without any moment assumptions, but the upper bound provides no information in many heavy-tailed applications, such as in Section 4.4.
1.2. Uniform upper bounds and martingales
Certain classes of dependent sequences admit uniform upper bounds, which we derive from Theorem 1.1 by working with
for a given convex weakly compact set . The conjugate , unsurprisingly, is , and turns out to be tractable as well:
where is defined as the set of laws with for each , -almost surely; in other words, is the set of laws of -valued random variables , when the law of belongs to and so does the conditional law of given , almost surely, for each . Theorem 1.1 becomes
From this we derive a uniform large deviation upper bound, for closed sets :
With a prudent choice of , this specializes to an asymptotic relative of the Azuma-Hoeffding inequality. The surprising feature here is that we can work with arbitrary closed sets and in multiple dimensions:
Let , and define to be the set of -valued martingales , defined on a common but arbitrary probability space, satisfying and
Suppose the effective domain spans . Then, for closed sets , we have
1.3. Laws of large numbers
Some specializations of Theorem 1.1 appear to have nothing to do with large deviations. For example, suppose is convex and compact, and let
When is a singleton, so is , and this simply expresses the weak convergence . The general case can be interpreted as a robust law of large numbers, where “robust” refers to perturbations of the joint law of an i.i.d. sequence. This is closely related to laws of large numbers under nonlinear expectations .
1.4. Optimal transport costs
Another interesting consequence of Theorem 1.1 comes from choosing as an optimal transport cost. Fix and a lower semicontinuous function , and define
where is the set of probability measures on with first marginal and second marginal . Under a modest additional assumption on (stated shortly in Corollary 1.5, proven later in Lemma 6.2), satisfies our standing assumptions.
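To make the notion of an optimal transport cost concrete, here is a minimal sketch in a very special case that the paper does not single out: two equal-weight empirical measures on the real line with cost |x - y|^p for p >= 1. In that case the monotone (sorted) matching is optimal, which the brute-force search over permutations confirms for small samples. The sample points and function names are hypothetical.

```python
import itertools

def transport_cost_sorted(xs, ys, p=2.0):
    """Cost of the monotone coupling between two equal-size samples on the line."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) ** p for x, y in pairs) / len(xs)

def transport_cost_bruteforce(xs, ys, p=2.0):
    """Minimum average cost over all one-to-one matchings (small samples only)."""
    n = len(xs)
    return min(
        sum(abs(x - ys[j]) ** p for x, j in zip(xs, perm)) / n
        for perm in itertools.permutations(range(n))
    )

# Hypothetical samples for illustration.
xs = [0.3, -1.2, 2.5, 0.9]
ys = [1.0, 0.0, -0.5, 4.0]

cost_sorted = transport_cost_sorted(xs, ys, p=2.0)
cost_brute = transport_cost_bruteforce(xs, ys, p=2.0)
```

For general spaces and cost functions no such shortcut is available, and the infimum over couplings is a genuine linear program.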
The dual can be identified using Kantorovich duality, and turns out to be the value of a stochastic optimal control problem. To illustrate this, it is convenient to work with probabilistic notation: Suppose is a sequence of i.i.d. -valued random variables with common law , defined on some fixed probability space. For each , let denote the set of -valued random variables where is -measurable for each . We think of elements of as adapted control processes. For each and each , we show in Proposition 6.3 that
Suppose that for each compact set , the function has pre-compact sub-level sets (that is, the closure of is compact for each ; this assumption holds, for example, if is a subset of Euclidean space and there exists such that as , uniformly for in compacts). For each , we have
This can be seen as a long-time limit of the optimal value of the control problems. However, the renormalization in is a bit peculiar in that it enters inside of the terminal cost , and there does not seem to be a direct connection with ergodic control. A direct proof of (1.12) is possible but seems to be no simpler and potentially narrower in scope.
The limiting object of Corollary 1.5 encapsulates a wide variety of interesting variational problems involving optimal transport costs. Variational problems of this form are surely more widespread than the author is aware, but two notable recent examples can be found in the study of Cournot-Nash equilibria in large-population games  and in the theory of Wasserstein barycenters .
1.5. Alternative choices of
There are many other natural choices of for which the implications of Theorem 1.1 remain unclear. For example, consider the -divergence
where and is convex and satisfies as . This has weakly compact (actually -compact) sub-level sets, according to [14, Lemma 6.2.16], and it is clearly convex. The dual, known in the risk literature as the optimized certainty equivalent, was computed by Ben-Tal and Teboulle [6, 7] to be
where is the convex conjugate. Unfortunately, we did not find any good expressions or estimates for or , so the interpretation of the main Theorem 1.1 eludes us in this case.
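As a sanity check on the optimized certainty equivalent, the sketch below evaluates it numerically in the one case where the answer is classical. It assumes the standard form of the functional, namely the infimum over real m of m plus the integral of the conjugate evaluated at f - m, and it assumes the entropy-type generator x log x - x + 1, whose conjugate is e^y - 1. Under these assumptions the value should coincide with the log of the exponential integral of f. The finite state space, `mu`, and `f` are hypothetical.

```python
import math

def oce_entropic(f, mu, lo=-50.0, hi=50.0, iters=200):
    """Minimize m + sum_i mu_i * (exp(f_i - m) - 1) over m by ternary search.

    The objective is convex in m, so ternary search on [lo, hi] converges
    provided the minimizer lies in that interval (an assumption of this sketch).
    """
    def objective(m):
        return m + sum(w * (math.exp(v - m) - 1.0) for v, w in zip(f, mu))

    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a) < objective(b):
            hi = b
        else:
            lo = a
    return objective(0.5 * (lo + hi))

# Hypothetical data for illustration.
mu = [0.5, 0.3, 0.2]
f = [1.0, -2.0, 0.5]

oce_value = oce_entropic(f, mu)
log_mgf_value = math.log(sum(w * math.exp(v) for v, w in zip(f, mu)))
```

For other generators the same one-dimensional minimization applies, with `exp(v - m) - 1.0` replaced by the corresponding conjugate.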
A related choice is the so-called shortfall risk measure introduced by Föllmer and Schied :
This choice of and the corresponding (tractable!) are discussed briefly in Section 4.1. The choice of corresponds to (1.7), and we make extensive use of this in Section 4, as was discussed in Section 1.1. The choice of recovers the classical case . Aside from these two examples, for general , we found no useful expressions or estimates for or . In connection with tails of random variables, shortfall risk measures have an intuitive appeal stemming from the following simple analog of Chernoff’s bound, observed in [34, Proposition 3.3]: If for all , where is some given measurable function, then for all , where .
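The shortfall risk measure can be computed by bisection on a finite state space. The sketch below assumes the normalization in which the risk of f is the smallest m such that the integral of the loss function evaluated at f - m is at most 1, with a nondecreasing loss function; the threshold 1, the bracket `[lo, hi]`, and the example data are assumptions of this sketch. With the exponential loss, the result should match the log of the exponential integral of f.

```python
import math

def shortfall_risk(f, mu, loss, lo=-100.0, hi=100.0, iters=200):
    """Smallest m with sum_i mu_i * loss(f_i - m) <= 1, found by bisection.

    Assumes loss is nondecreasing, that hi is acceptable and lo is not.
    """
    def acceptable(m):
        return sum(w * loss(v - m) for v, w in zip(f, mu)) <= 1.0

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptable(mid):
            hi = mid
        else:
            lo = mid
    return hi

def polynomial_loss(x, q=2.0):
    """A polynomial-type loss, ((1 + x)^+)^q; the exponent q is illustrative."""
    return max(1.0 + x, 0.0) ** q

# Hypothetical data for illustration.
mu = [0.5, 0.3, 0.2]
f = [1.0, -2.0, 0.5]

risk_exp = shortfall_risk(f, mu, math.exp)
log_mgf_value = math.log(sum(w * math.exp(v) for v, w in zip(f, mu)))
risk_poly_const = shortfall_risk([2.0, 2.0], [0.5, 0.5], polynomial_loss)
```

For a constant function the polynomial loss returns that constant, as expected of a functional that commutes with adding constants.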
It is worth pointing out the natural but ultimately fruitless idea of working with , where is increasing. Such functionals were first studied, it seems, by Hardy, Littlewood, and Pólya [25, Chapter 3], who provided necessary and sufficient conditions for to be convex (rediscovered in ). Using the formula (1.6) to compute , this choice would lead to the exceptionally pleasant formula , which we observed already in the classical case . Unfortunately, however, such a cannot come from a functional on , in the sense that (1.1) cannot hold unless is affine or exponential. Another way of seeing this is that the convex conjugate of (with respect to the dual pairing of with the space of bounded signed measures) fails to be infinite outside of the set . The problem, as is known in the risk measure literature, is that the additivity property for all and fails unless is affine or exponential (cf. [22, Proposition 2.46]).
The consequences of Theorem 1.1 remain unexplored for several other potentially interesting choices of with well understood duals: To name just a few, we mention the Schrödinger problem surveyed in  and related functionals arising from stochastic optimal control problems , martingale optimal transport costs , and functionals related to Orlicz norms studied in .
1.6. Connection to superhedging
Again, the challenge in working with Theorem 1.1 is in computing or estimating or . With this in mind, we present an alternative expression for as the value of a particular type of optimal control problem, more specifically a superhedging problem (see, e.g., [22, Chapter 7]). To a given dual pair we may associate the acceptance set
As is well known in the risk measure literature, we may express in terms of by
Indeed, this follows easily from the fact that for constants . In fact, can also be reconstructed from , and this provides a third possible entry point to the duality. To elaborate on this would take us too far afield, but see  for details.
Now, let us compute in terms of the acceptance set. For , define to be the set of , where is a measurable function from to satisfying
where the inequality is understood pointwise. Moreover, the infimum is attained.
To interpret this as a control problem, consider the partial sum process as a state process, which we must “steer” to be larger than pointwise at the final time . The control at each time must be admissible in the sense of (1.16), and notice that the dependence of on only is an expression of adaptedness or non-anticipativity. We seek the minimal starting point for which this steering can be done. The iterative form of in (1.6) (more precisely stated in Proposition A.1) can be seen as an expression of the dynamic programming principle for the control problem of Theorem 1.6.
For a concrete example, if is the shortfall risk measure (1.13), and if denote i.i.d. -valued random variables with common law , then Theorem 1.6 expresses as the infimum over all for which there exists an -adapted process (we say is -adapted if is -measurable for each ) satisfying and a.s., for each .
1.7. Interpreting Theorem 1.1 in terms of risk measures
It is straightforward to rewrite Theorem 1.1 in a language more in line with the literature on convex risk measures, for which we again defer to  for background. Let be a measurable space, and suppose is a convex risk measure on the set of bounded measurable functions. That is, is convex, for all and , and whenever pointwise. Suppose we are given a sequence of -valued random variables , i.e., measurable maps . Assume have the following independence property, identical to Peng’s notion of independence under nonlinear expectations : for and
In particular, for all . Define by
Indeed, in our previous notation, for .
In the risk measure literature, one thinks of as the risk associated to an uncertain financial loss . With this in mind, and with , the quantity appearing in (1.18) is the risk-per-unit of an investment in units of . One might interpret as capturing the composition of the investment, while the multiplicative factor represents the size of the investment. As increases, say to , the investment is “rebalanced” in the sense that one additional independent component, , is incorporated and the size of the total investment is increased by one unit. The limit in (1.18) is then an asymptotic evaluation of the risk-per-unit of this rebalancing scheme.
Broadly speaking, the book of Dupuis and Ellis  and numerous subsequent works illustrate how the classical convex duality between relative entropy and cumulant-generating functions can serve as a foundation from which to derive an impressive range of large deviation principles. Similarly, each alternative dual pair should provide an alternative foundation for a potentially equally wide range of limit theorems. From this perspective, our work raises far more questions than it answers by restricting attention to analogs of the two large deviation principles of Sanov and Cramér. It is likely, for instance, that an analog of Mogulskii’s theorem (see  or [17, Section 3]) holds in our context. Moreover, our framework is not as restricted to i.i.d. samples as it may appear. While the definition of reflects our focus on i.i.d. samples, we might accommodate Markov chains by redefining . For instance, we may try
where is an initial law of a Markov chain, is its transition kernel, and plays the role of . This again simplifies in the classical case , leading to , where is the law of the path of the Markov chain described above. These speculations are meant simply to convey the versatility of our framework but are pursued no further, with the paper instead focusing on exploring the implications of various choices of in our analog of Sanov’s theorem.
1.9. Outline of the paper
The remainder of the paper is organized as follows. Section 2 begins by clarifying the duality, explaining some useful properties of and and extending their definitions to unbounded functions. Section 3 is devoted to the statement and proof of Theorem 3.1, which contains Theorem 1.1 as a special case but is extended to stronger topologies and unbounded functions . See also Section 3.3 for abstract analogs of the contraction principle and Cramér’s theorem. These extensions are put to use in Section 4, which proves and elaborates on the non-exponential forms of Sanov’s and Cramér’s theorems discussed in Section 1.1. Section 4.4 applies these results to obtain error estimates for a common Monte Carlo approach to stochastic optimization. Sections 5 and 6 respectively elaborate on the examples of Sections 1.2 and 1.4. Appendix A proves two different representations of , namely those of (1.6) and Theorem 1.6. The short Appendix B describes a natural but largely unsuccessful attempt to derive tractable large deviation upper bounds from Theorem 1.1 by working with a class of functionals of not one but two measures, such as -divergences. Finally, two minor technical results are relegated to Appendix C.
2. Convex duality preliminaries
This section outlines the key features of the duality. The first three theorems, stated in this subsection, are borrowed from the literature on convex risk measures, for which an excellent reference is the book of Föllmer and Schied . While we will make use of some of the properties listed in Theorem 2.1, the goal of the first two theorems is more to illustrate how one can make the starting point rather than . In particular, Theorem 2.2 will not be needed in the sequel. For proofs of Theorems 2.1 and 2.2, refer to Bartl [4, Theorem 2.4].
Suppose is convex and has weakly compact sub-level sets. Define as in (1.1). Then the following hold:
If pointwise then .
If and , then .
If with pointwise, then .
If and with pointwise, then .
Moreover, for we have
We state also a useful theorem of Föllmer and Schied  which allows us to verify tightness of the sub-level sets of by checking a property of .
Theorem 2.3 (Proposition 4.30 of ).
Suppose a functional admits the representation
for some function . Suppose also that there is a sequence of compact subsets of such that
Then has tight sub-level sets.
The goal of the rest of the section is to extend the domain of to unbounded functions and study the compactness of the sub-level sets of with respect to stronger topologies. From now on, we work at all times with the standing assumptions on described in the introduction:
The function is convex, has weakly compact sub-level sets, and is not identically equal to . Lastly, is defined as in (1.1).
2.1. Extending and to unbounded functions
This section extends the domain of to unbounded functions. Let . We adopt the convention that , although this will have few consequences aside from streamlined definitions. In particular, if and a measurable function and satisfies , we define .
For and measurable , define
As usual, abbreviate .
It is worth emphasizing that while is finite for bounded , it can be either or when is unbounded. The following simple lemma will aid in some computations in Section 4.
If is measurable and bounded from below, then
Define . Monotone convergence yields
One checks easily that this is consistent with the convention . ∎
3. An extension of Theorem 1.1
In this section we state and prove a useful generalization of Theorem 1.1 for stronger topologies and unbounded functions, taking advantage of the results of the previous section. At all times in this section, the standing assumptions on (stated just before Section 2.1) are in force.
We prepare by defining a well known class of topologies on subsets of . Given a continuous function , define
Endow with the (Polish) topology generated by the maps , where is continuous and ; we call this the -weak topology. A useful fact about this topology is that a set is pre-compact if and only if for every there exists a compact set such that
This is easily proven directly, or refer to [22, Corollary A.47].
In the following theorem, the extension of the upper bound to the -weak topology requires the assumption that the sub-level sets of are pre-compact in . This rather opaque assumption is explored in more detail in the subsequent Section 3.1.
Let be continuous. If is lower semicontinuous (with respect to the -weak topology) and bounded from below, then
Suppose also that the sub-level sets of are pre-compact subsets of . If is upper semicontinuous and bounded from above, then
Lower bound: Let us prove first the lower bound. It is immediate from the definition that for each , where denotes the -fold product measure. Thus
For , the law of large numbers implies weakly, i.e. in . For , the convergence takes place in . Lower semicontinuity of on then implies, for each ,
Take the supremum over to complete the proof of the lower bound. It is worth noting that if is a compatible metric on and for some fixed and , then the -weak topology is nothing but the -Wasserstein topology.
Upper bound, bounded: The upper bound is more involved. First we prove it in four steps under the assumption that is bounded.
Step 1: First we simplify the expression somewhat. For each the definition of and convexity of imply
Combine this with (3.1) to get
It is convenient to switch now to a probabilistic notation: On some sufficiently rich probability space, find an