A nonexponential extension of Sanov’s theorem via convex duality
Abstract.
This work is devoted to a vast extension of Sanov’s theorem, in Laplace principle form, based on alternatives to the classical convex dual pair of relative entropy and cumulant generating functional. The abstract results give rise to a number of probabilistic limit theorems and asymptotics. For instance, widely applicable nonexponential large deviation upper bounds are derived for empirical distributions and averages of i.i.d. samples under minimal integrability assumptions, notably accommodating heavy-tailed distributions. Other interesting manifestations of the abstract results include uniform large deviation bounds, variational problems involving optimal transport costs, and constrained superhedging problems, as well as an application to error estimates for approximate solutions of stochastic optimization problems. The proofs build on the Dupuis-Ellis weak convergence approach to large deviations as well as the duality theory for convex risk measures.
1. Introduction
An original goal of this paper was to extend the weak convergence methodology of Dupuis and Ellis [17] to the context of nonexponential (e.g., heavy-tailed) large deviations. While we claim only modest success in this regard, we do find some general-purpose large deviation upper bounds which can be seen as polynomial-rate analogs of the upper bounds in the classical theorems of Sanov and Cramér. At least as interesting, however, are the abstract principles behind these bounds, which have broad implications beyond the realm of large deviations. Let us first describe these abstract principles before specializing them in various ways.
Let $\mathcal{X}$ be a Polish space, and let $\mathcal{P}(\mathcal{X})$ denote the set of Borel probability measures on $\mathcal{X}$, endowed with the topology of weak convergence. Let $B(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) denote the set of measurable (resp. continuous) and bounded real-valued functions on $\mathcal{X}$. For and , define and measurable maps for via the disintegration
In other words, if is an valued random variable with law , then is the law of , and is the conditional law of given . Of course, are uniquely defined up to almost sure equality.
The protagonist of the paper is a proper (i.e., not identically $+\infty$) convex function $\alpha$ on $\mathcal{P}(\mathcal{X})$ with compact sublevel sets; that is, $\{\nu \in \mathcal{P}(\mathcal{X}) : \alpha(\nu) \le c\}$ is compact for every $c \in \mathbb{R}$. For define by
and note that . Define the convex conjugate by
(1.1) $\rho(f) = \sup_{\nu \in \mathcal{P}(\mathcal{X})} \left( \int_{\mathcal{X}} f \, d\nu - \alpha(\nu) \right).$
Our main interest is in evaluating $\rho_n$ at functions of the empirical measure $L_n$, defined by $L_n(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}.$
The main abstract result of the paper is the following extension of Sanov’s theorem, proven in a more general form in Section 3 by adapting the weak convergence techniques of Dupuis and Ellis [17].
Theorem 1.1.
For $F : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ continuous and bounded,
$$\lim_{n \to \infty} \frac{1}{n} \rho_n\big( n F(L_n) \big) = \sup_{\nu \in \mathcal{P}(\mathcal{X})} \big( F(\nu) - \alpha(\nu) \big).$$
The guiding example is the relative entropy, $\alpha(\nu) = H(\nu \, | \, \mu)$, where $\mu \in \mathcal{P}(\mathcal{X})$ is a fixed reference measure, and $H$ is defined by
(1.2) $H(\nu \, | \, \mu) = \int_{\mathcal{X}} \frac{d\nu}{d\mu} \log \frac{d\nu}{d\mu} \, d\mu$ if $\nu \ll \mu$, and $H(\nu \, | \, \mu) = +\infty$ otherwise.
It turns out that $\alpha_n(\nu) = H(\nu \, | \, \mu^n)$, where $\mu^n$ denotes the $n$-fold product measure, by the so-called chain rule of relative entropy [17, Theorem B.2.1]. The dual is well known to be $\rho(f) = \log \int_{\mathcal{X}} e^f \, d\mu$, and the formula relating $\rho$ and $H(\cdot \, | \, \mu)$ is often known as the Gibbs variational principle or the Donsker-Varadhan formula. In this case Theorem 1.1 reduces to the Laplace principle form of Sanov’s theorem:
$$\lim_{n \to \infty} \frac{1}{n} \log E\big[ \exp\big( n F(L_n) \big) \big] = \sup_{\nu \in \mathcal{P}(\mathcal{X})} \big( F(\nu) - H(\nu \, | \, \mu) \big),$$
where the expectation is with respect to i.i.d. samples from $\mu$.
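In a finite state space, the Gibbs variational principle can be checked directly: $\log \int e^f \, d\mu$ coincides with $\sup_\nu \big( \int f \, d\nu - H(\nu \, | \, \mu) \big)$, with the supremum attained by the Gibbs measure $\nu^* \propto e^f \, \mu$. The following Python sketch (names and setup illustrative, not from the paper) verifies this numerically.

```python
import numpy as np

# Finite-state check of the Gibbs variational principle:
#   log E_mu[exp(f)] = sup_nu ( E_nu[f] - H(nu | mu) ),
# with the supremum attained by the Gibbs measure nu* proportional to exp(f) mu.
rng = np.random.default_rng(0)
mu = rng.random(5)
mu /= mu.sum()                                  # reference measure on 5 points
f = rng.normal(size=5)                          # a bounded "test function"

lhs = np.log(np.sum(np.exp(f) * mu))            # the dual functional rho(f)

nu_star = np.exp(f) * mu
nu_star /= nu_star.sum()                        # Gibbs measure
H = np.sum(nu_star * np.log(nu_star / mu))      # relative entropy H(nu*|mu)
rhs = np.sum(nu_star * f) - H

assert abs(lhs - rhs) < 1e-12                   # supremum attained at nu*

nu = np.full(5, 0.2)                            # any other measure does no better
assert np.sum(nu * f) - np.sum(nu * np.log(nu / mu)) <= lhs + 1e-12
```

The second assertion illustrates that an arbitrary competitor (here the uniform measure) achieves at most the value of the supremum.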
Well-known theorems of Varadhan and Dupuis-Ellis (see [17, Theorems 1.2.1 and 1.2.3]) assert the equivalence of this form of Sanov’s theorem with the more common form: for every Borel set $A \subset \mathcal{P}(\mathcal{X})$ with closure $\overline{A}$ and interior $A^\circ$,
(1.3) $\limsup_{n \to \infty} \frac{1}{n} \log P(L_n \in A) \le - \inf_{\nu \in \overline{A}} H(\nu \, | \, \mu),$
(1.4) $\liminf_{n \to \infty} \frac{1}{n} \log P(L_n \in A) \ge - \inf_{\nu \in A^\circ} H(\nu \, | \, \mu).$
To derive this heuristically, apply Theorem 1.1 to the function
(1.5) $F(\nu) = 0$ for $\nu \in A$, and $F(\nu) = -\infty$ for $\nu \notin A$.
For general , Theorem 1.1 does not permit an analogous equivalent formulation in terms of deviation probabilities. In fact, for many , Theorem 1.1 has nothing to do with large deviations (see Sections 1.3 and 1.4 below). Nonetheless, for certain , Theorem 1.1 implies interesting large deviations upper bounds, which we prove by formalizing the aforementioned heuristic. While many admit fairly explicit known formulas for the dual , the recurring challenge in applying Theorem 1.1 is finding a useful expression for , and herein lies but one of many instances of the wonderful tractability of relative entropy. The examples to follow do admit good expressions for , or at least workable onesided bounds, but we also catalog in Section 1.5 some natural alternative choices of for which we did not find useful bounds or expressions for .
The functional is (up to a sign change) a convex risk measure, in the language of Föllmer and Schied [22]. A rich duality theory for convex risk measures has emerged over the past two decades, primarily geared toward applications in financial mathematics and optimization. We take advantage of this theory in Section 2 to demonstrate how can be reconstructed from in many cases, which shows that could be taken as the starting point instead of . Additionally, the theory of risk measures provides insight into how to deal with the subtleties that arise in extending the domain of (and Theorem 1.1) to accommodate unbounded functions or stronger topologies on . Section 1.7 briefly reinterprets Theorem 1.1 in a language more consistent with the risk measure literature. The reader familiar with risk measures may notice a time-consistent dynamic risk measure (see [1] for definitions and a survey) hidden in the definition of above.
We will make no use of the interpretation in terms of dynamic risk measures, but it did inspire a recursive formula for (similar to a result of [11]). To state it loosely, if then we may write
(1.6) 
To make rigorous sense of this, we must note that is merely upper semianalytic and not Borel measurable in general, and that is well defined for such functions. We make this precise in Proposition A.1. This recursive formula is not essential for any of the arguments but is convenient for calculations.
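In the classical entropic case, the recursive formula amounts to the tower property of conditional expectation: with $\rho(g) = \log \int e^g \, d\mu$, one has $\rho_2(f) = \rho\big( x \mapsto \rho(f(x, \cdot)) \big)$. A finite-state Python check (the notation is ours, not the paper's):

```python
import numpy as np

# Two-step recursion for the entropic functional rho(g) = log E_mu[exp(g)]:
# evaluating rho_2 on f(x1, x2) directly agrees with the nested evaluation.
rng = np.random.default_rng(0)
mu = rng.random(4)
mu /= mu.sum()                                  # common one-step law on 4 points
F = rng.normal(size=(4, 4))                     # f(x1, x2) stored as a 4x4 matrix
rho = lambda g: np.log(np.sum(np.exp(g) * mu))  # one-step entropic functional

lhs = np.log(np.sum(np.exp(F) * np.outer(mu, mu)))   # rho_2(f) computed directly
rhs = rho(np.array([rho(F[i]) for i in range(4)]))   # inner step, then outer step
assert abs(lhs - rhs) < 1e-12
```

For general dual pairs the inner evaluation need not simplify so pleasantly, which is exactly the measurability subtlety addressed in Proposition A.1.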
1.1. Nonexponential large deviations
A first application of Theorem 1.1 is to derive large deviation upper bounds in the absence of exponential rates or finite moment generating functions. While Cramér’s theorem in full generality does not require any finite moments, the upper bound is often vacuous when the underlying random variables have heavy tails. This simple observation has driven a large and growing literature on large deviation asymptotics for sums of i.i.d. random variables, to be reviewed shortly. Our approach is well suited not to precise asymptotics but rather to widely applicable upper bounds. In Section 4.1 we derive alternatives to the upper bounds of Sanov’s and Cramér’s theorems by applying (an extension of) Theorem 1.1 with
(1.7) 
where is fixed. We state the results here: For a continuous function , let denote the set of satisfying , and equip with the topology induced by the linear maps , where is continuous and .
Theorem 1.2.
Let , and let denote the conjugate exponent. Let , and suppose for some continuous . Then, for every closed set ,
Corollary 1.3.
Let and . Let be a separable Banach space. Let be i.i.d. valued random variables with . Define by the following, where denotes the continuous dual of :
and define for . Then, for every closed set ,
In analogy with the classical Cramér’s theorem, the function in Corollary 1.3 plays the role of the cumulant generating function. In both Theorem 1.2 and Corollary 1.3, notice that as soon as the constant on the right-hand side is finite, we may conclude that the probabilities in question are . This is consistent with some now-standard results on one-dimensional heavy-tailed sums, for events of the form , for . For instance, it is known [44, Chapter IX, Theorem 28] that if are i.i.d. real-valued random variables with mean zero and , then . For , the well-known inequality of Fuk-Nagaev provides a related nonasymptotic bound; see [40, Corollary 1.8], or [19] for a Banach space version.
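The polynomial decay is easy to see in simulation. The following Monte Carlo sketch (parameters hypothetical, not from the paper) uses i.i.d. Pareto samples with tail index $p = 2.5$: the moment generating function is infinite, so Chernoff-type exponential bounds are vacuous, yet the order-two polynomial (Chebyshev) bound on the deviation probability is already informative.

```python
import numpy as np

# Heavy-tailed illustration: X is standard Pareto with P(X > t) = t**(-p),
# t >= 1, tail index p = 2.5, so E|X|^q < infinity only for q < 2.5 and the
# MGF is infinite. The deviation probability P(mean - m >= c) decays
# polynomially in n and respects the Chebyshev bound Var(X) / (n c^2).
rng = np.random.default_rng(0)
p, n, c, reps = 2.5, 50, 1.0, 20_000
m = p / (p - 1.0)                          # mean of a standard Pareto(p)
var = p / ((p - 1.0) ** 2 * (p - 2.0))     # variance, finite since p > 2
x = rng.pareto(p, size=(reps, n)) + 1.0    # numpy's pareto + 1 is standard Pareto
est = np.mean(x.mean(axis=1) - m >= c)     # estimated deviation probability
assert 0.0 < est <= var / (n * c**2)       # polynomial bound holds; no exp bound exists
```

The estimated probability here is on the order of $n^{1-p}$, far larger than any exponentially decaying bound would permit.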
If instead a stronger assumption is made on , such as regular variation, then there are corresponding lower bounds for certain sets . Refer to the books [10, 23] and the survey of Mikosch and Nagaev [38] for detailed reviews of such results, as well as the more recent [15] and references therein. Indeed, precise asymptotics require detailed assumptions on the shape of the tails of , and this is especially true in multivariate and infinite-dimensional contexts. A recent line of interesting work extends the theory of regular variation to metric spaces [13, 28, 27, 36], but again the typical assumptions on the underlying law are substantially stronger than the mere existence of a finite moment.
The main advantage of our results is their broad applicability, requiring only finite moments, but two other strengths are worth emphasizing. First, our bounds apply to arbitrary closed sets , which enables a natural contraction principle (i.e., continuous mapping). Section 4.4 illustrates this by using Theorem 1.2 to find error bounds for Monte Carlo schemes in stochastic optimization, essentially providing a heavy-tailed analog of the results of [30]. Second, while this discussion has focused on literature related to our analog of Cramér’s upper bound (Corollary 1.3), our analog of Sanov’s upper bound (Theorem 1.2) seems even more novel. The author is not aware of any other results on empirical measure large deviations in heavy-tailed contexts. Of course, Sanov’s theorem applies without any moment assumptions, but its upper bound provides no information in many heavy-tailed applications, such as in Section 4.4.
1.2. Uniform upper bounds and martingales
Certain classes of dependent sequences admit uniform upper bounds, which we derive from Theorem 1.1 by working with
(1.8) 
for a given convex weakly compact set . The conjugate , unsurprisingly, is , and turns out to be tractable as well:
(1.9) 
where is defined as the set of laws with for each , almost surely; in other words, is the set of laws of valued random variables , when the law of belongs to and so does the conditional law of given , almost surely, for each . Theorem 1.1 becomes
From this we derive a uniform large deviation upper bound, for closed sets :
(1.10) 
With a prudent choice of , this specializes to an asymptotic relative of the Azuma-Hoeffding inequality. The surprising feature here is that we can work with arbitrary closed sets and in multiple dimensions:
Theorem 1.4.
Let , and define to be the set of valued martingales , defined on a common but arbitrary probability space, satisfying and
Suppose the effective domain spans . Then, for closed sets , we have
where .
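For comparison, the classical Azuma-Hoeffding inequality, of which Theorem 1.4 is an asymptotic relative, can be checked by simulation. The sketch below (parameters hypothetical) uses the extreme case of $\pm 1$ martingale increments.

```python
import numpy as np

# Azuma-Hoeffding: if (M_k) is a martingale with M_0 = 0 and increments
# bounded by 1, then P(M_n >= n t) <= exp(-n t^2 / 2). A +-1 random walk
# has the largest increments allowed, so it is a natural stress test.
rng = np.random.default_rng(0)
n, t, reps = 200, 0.2, 20_000
steps = rng.choice([-1.0, 1.0], size=(reps, n))   # bounded martingale increments
est = np.mean(steps.sum(axis=1) >= n * t)         # estimate of P(M_n >= n t)
assert 0.0 < est <= np.exp(-n * t**2 / 2.0)       # exponential bound holds
```

Theorem 1.4 trades the single bounded-increment sequence for a whole class of martingales and arbitrary closed sets, at the price of an asymptotic rather than finite-$n$ statement.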
1.3. Laws of large numbers
Some specializations of Theorem 1.1 appear to have nothing to do with large deviations. For example, suppose is convex and compact, and let
It can be shown that , where is defined as in Section 1.2, for instance by a direct computation using (1.6). Theorem 1.1 then becomes
When is a singleton, so is , and this simply expresses the weak convergence . The general case can be interpreted as a robust law of large numbers, where “robust” refers to perturbations of the joint law of an i.i.d. sequence. This is closely related to laws of large numbers under nonlinear expectations [42].
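The weak convergence of empirical measures underlying this law of large numbers is easy to visualize on the real line, where it amounts to Glivenko-Cantelli convergence of the empirical distribution function. A small Python sketch (illustrative only):

```python
import numpy as np
from math import erf

# Empirical measure of an i.i.d. standard normal sample converges weakly to
# the common law; on R this is uniform convergence of the empirical CDF.
rng = np.random.default_rng(0)
n = 10_000
x = np.sort(rng.normal(size=n))
ecdf = np.arange(1, n + 1) / n                          # empirical CDF at sample points
true_cdf = np.array([0.5 * (1.0 + erf(v / 2**0.5)) for v in x])
ks = np.max(np.abs(ecdf - true_cdf))                    # sup-distance along the sample
assert ks < 0.05   # DKW: P(ks > 0.05) <= 2 exp(-2 n (0.05)^2), astronomically small
```

The robust version replaces the single limit point by the compact set of laws attainable under the allowed perturbations.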
1.4. Optimal transport costs
Another interesting consequence of Theorem 1.1 comes from choosing as an optimal transport cost. Fix and a lower semicontinuous function , and define
where is the set of probability measures on with first marginal and second marginal . Under a modest additional assumption on (stated shortly in Corollary 1.5, proven later in Lemma 6.2), satisfies our standing assumptions.
The dual can be identified using Kantorovich duality, and turns out to be the value of a stochastic optimal control problem. To illustrate this, it is convenient to work with probabilistic notation: Suppose is a sequence of i.i.d. valued random variables with common law , defined on some fixed probability space. For each , let denote the set of valued random variables where is measurable for each . We think of elements of as adapted control processes. For each and each , we show in Proposition 6.3 that
(1.11) 
The expression (1.11) yields the following corollary of Theorem 1.1:
Corollary 1.5.
Suppose that for each compact set , the function has precompact sublevel sets (that is, the closure of is compact for each ). This assumption holds, for example, if is a subset of Euclidean space and there exists such that as , uniformly for in compacts. For each , we have
(1.12) 
where .
This can be seen as a long-time limit of the optimal value of the control problems. However, the renormalization in is a bit peculiar in that it enters inside of the terminal cost , and there does not seem to be a direct connection with ergodic control. A direct proof of (1.12) is possible but seems to be no simpler and potentially narrower in scope.
The limiting object of Corollary 1.5 encapsulates a wide variety of interesting variational problems involving optimal transport costs. Variational problems of this form surely arise more widely than the author is aware, but two notable recent examples can be found in the study of Cournot-Nash equilibria in large-population games [9] and in the theory of Wasserstein barycenters [2].
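For intuition about the transport cost itself, the discrete case is simple to compute: between two uniform measures on $n$ points each, optimal transport reduces to an assignment problem, and for quadratic cost on the real line the optimum is the monotone (sorted) matching. A sketch (relying on scipy, with illustrative data):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Discrete optimal transport between two uniform measures on n points each,
# with quadratic cost c(x, y) = |x - y|^2 on the real line. The optimal
# coupling is an assignment, attained by matching sorted samples.
rng = np.random.default_rng(0)
n = 8
x, y = rng.normal(size=n), rng.normal(size=n)
cost = (x[:, None] - y[None, :]) ** 2        # cost matrix c(x_i, y_j)
rows, cols = linear_sum_assignment(cost)     # exact optimal assignment
ot = cost[rows, cols].mean()                 # transport cost per unit mass
monotone = np.mean((np.sort(x) - np.sort(y)) ** 2)
assert abs(ot - monotone) < 1e-12            # monotone matching is optimal in 1D
```

The monotone-matching identity is special to convex costs on the line; in higher dimensions the assignment (or a linear program over couplings) must be solved directly.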
1.5. Alternative choices of
There are many other natural choices of for which the implications of Theorem 1.1 remain unclear. For example, consider the divergence
where and is convex and satisfies as . This has weakly compact (actually compact) sublevel sets, according to [14, Lemma 6.2.16], and it is clearly convex. The dual, known in the risk literature as the optimized certainty equivalent, was computed by Ben-Tal and Teboulle [6, 7] to be
where is the convex conjugate. Unfortunately, we did not find any good expressions or estimates for or , so the interpretation of the main Theorem 1.1 eludes us in this case.
A related choice is the so-called shortfall risk measure introduced by Föllmer and Schied [21]:
(1.13) 
This choice of and the corresponding (tractable!) are discussed briefly in Section 4.1. The choice of corresponds to (1.7), and we make extensive use of this in Section 4, as was discussed in Section 1.1. The choice of recovers the classical case . Aside from these two examples, for general , we found no useful expressions or estimates for or . In connection with tails of random variables, shortfall risk measures have an intuitive appeal stemming from the following simple analog of Chernoff’s bound, observed in [34, Proposition 3.3]: If for all , where is some given measurable function, then for all , where .
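The classical Chernoff bound that this observation generalizes is easy to verify numerically: $P(X \ge c) \le e^{-\lambda c} \, E[e^{\lambda X}]$ for every $\lambda \ge 0$, and for a standard normal the bound optimized over $\lambda$ is $e^{-c^2/2}$. A sketch (parameters hypothetical):

```python
import math

# Chernoff bound for a standard normal: E[exp(lambda X)] = exp(lambda^2 / 2),
# so P(X >= c) <= exp(-lambda c + lambda^2 / 2), optimized at lambda = c.
c = 2.0
true_tail = 0.5 * math.erfc(c / math.sqrt(2.0))      # exact P(N(0,1) >= c)
chernoff = min(math.exp(-lam * c + lam**2 / 2.0)
               for lam in [0.1 * k for k in range(1, 60)])   # grid over lambda
assert true_tail <= chernoff                          # bound is valid
assert abs(chernoff - math.exp(-c**2 / 2.0)) < 1e-9   # optimum at lambda = c
```

The shortfall analog replaces the exponential moment by a general loss function, yielding polynomial-type tail bounds when only polynomial moments are available.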
It is worth pointing out the natural but ultimately fruitless idea of working with , where is increasing. Such functionals were first studied, it seems, by Hardy, Littlewood, and Pólya [25, Chapter 3], who provided necessary and sufficient conditions for to be convex (rediscovered in [6]). Using the formula (1.6) to compute , this choice would lead to the exceptionally pleasant formula , which we observed already in the classical case . Unfortunately, however, such a cannot come from a functional on , in the sense that (1.1) cannot hold unless is affine or exponential. Another way of seeing this is that the convex conjugate of (with respect to the dual pairing of with the space of bounded signed measures) fails to be infinite outside of the set . The problem, as is known in the risk measure literature, is that the additivity property for all and fails unless is affine or exponential (cf. [22, Proposition 2.46]).
The consequences of Theorem 1.1 remain unexplored for several other potentially interesting choices of with well understood duals: To name just a few, we mention the Schrödinger problem surveyed in [35] and related functionals arising from stochastic optimal control problems [37], martingale optimal transport costs [5], and functionals related to Orlicz norms studied in [12].
1.6. Connection to superhedging
Again, the challenge in working with Theorem 1.1 is in computing or estimating or . With this in mind, we present an alternative expression for as the value of a particular type of optimal control problem, more specifically a superhedging problem (see, e.g., [22, Chapter 7]). To a given dual pair we may associate the acceptance set
(1.14) 
As is well known in the risk measure literature, we may express in terms of by
(1.15) 
Indeed, this follows easily from the fact that for constants . In fact, can also be reconstructed from , and this provides a third possible entry point to the duality. To elaborate on this would take us too far afield, but see [22] for details.
Now, let us compute in terms of the acceptance set. For , define to be the set of , where is a measurable function from to satisfying
(1.16) 
Theorem 1.6.
For ,
where the inequality is understood pointwise. Moreover, the infimum is attained.
To interpret this as a control problem, consider the partial sum process as a state process, which we must “steer” to be larger than pointwise at the final time . The control at each time must be admissible in the sense of (1.16), and notice that the dependence of on only is an expression of adaptedness or nonanticipativity. We seek the minimal starting point for which this steering can be done. The iterative form of in (1.6) (more precisely stated in Proposition A.1) can be seen as an expression of the dynamic programming principle for the control problem of Theorem 1.6.
For a concrete example, if is the shortfall risk measure (1.13), and if denote i.i.d. valued random variables with common law , then Theorem 1.6 expresses as the infimum over all for which there exists an adapted process (we say is adapted if is measurable for each ) satisfying and a.s., for each .
1.7. Interpreting Theorem 1.1 in terms of risk measures
It is straightforward to rewrite Theorem 1.1 in a language more in line with the literature on convex risk measures, for which we again defer to [22] for background. Let be a measurable space, and suppose is a convex risk measure on the set of bounded measurable functions. That is, is convex, for all and , and whenever pointwise. Suppose we are given a sequence of valued random variables , i.e., measurable maps . Assume have the following independence property, identical to Peng’s notion of independence under nonlinear expectations [43]: for and
(1.17) 
In particular, for all . Define by
Additional assumptions on (see, e.g., Theorem 2.2 below) can ensure that has weakly compact sublevel sets, so that Theorem 1.1 applies. Then, for ,
(1.18) 
Indeed, in our previous notation, for .
In the risk measure literature, one thinks of as the risk associated to an uncertain financial loss . With this in mind, and with , the quantity appearing in (1.18) is the risk-per-unit of an investment in units of . One might interpret as capturing the composition of the investment, while the multiplicative factor represents the size of the investment. As increases, say to , the investment is “rebalanced” in the sense that one additional independent component, , is incorporated and the size of the total investment is increased by one unit. The limit in (1.18) is then an asymptotic evaluation of the risk-per-unit of this rebalancing scheme.
1.8. Extensions
Broadly speaking, the book of Dupuis and Ellis [17] and numerous subsequent works illustrate how the classical convex duality between relative entropy and cumulant-generating functions can serve as a foundation from which to derive an impressive range of large deviation principles. Similarly, each alternative dual pair should provide an alternative foundation for a potentially equally wide range of limit theorems. From this perspective, our work raises far more questions than it answers by restricting attention to analogs of the two large deviation principles of Sanov and Cramér. It is likely, for instance, that an analog of Mogulskii’s theorem (see [39] or [17, Section 3]) holds in our context. Moreover, our framework is not as restricted to i.i.d. samples as it may appear. While the definition of reflects our focus on i.i.d. samples, we might accommodate Markov chains by redefining . For instance, we may try
where is an initial law of a Markov chain, is its transition kernel, and plays the role of . This again simplifies in the classical case , leading to , where is the law of the path of the Markov chain described above. These speculations are meant simply to convey the versatility of our framework but are pursued no further, with the paper instead focusing on exploring the implications of various choices of in our analog of Sanov’s theorem.
1.9. Outline of the paper
The remainder of the paper is organized as follows. Section 2 begins by clarifying the duality, explaining some useful properties of and and extending their definitions to unbounded functions. Section 3 is devoted to the statement and proof of Theorem 3.1, which contains Theorem 1.1 as a special case but is extended to stronger topologies and unbounded functions . See also Section 3.3 for abstract analogs of the contraction principle and Cramér’s theorem. These extensions are put to use in Section 4, which proves and elaborates on the nonexponential forms of Sanov’s and Cramér’s theorems discussed in Section 1.1. Section 4.4 applies these results to obtain error estimates for a common Monte Carlo approach to stochastic optimization. Sections 5 and 6 elaborate on the examples of Sections 1.2 and 1.4, respectively. Appendix A proves two different representations of , namely those of (1.6) and Theorem 1.6. The short Appendix B describes a natural but largely unsuccessful attempt to derive tractable large deviation upper bounds from Theorem 1.1 by working with a class of functionals of not one but two measures, such as divergences. Finally, two minor technical results are relegated to Appendix C.
2. Convex duality preliminaries
This section outlines the key features of the duality. The first three theorems, stated in this subsection, are borrowed from the literature on convex risk measures, for which an excellent reference is the book of Föllmer and Schied [22]. While we will make use of some of the properties listed in Theorem 2.1, the goal of the first two theorems is more to illustrate how one can make the starting point rather than . In particular, Theorem 2.2 will not be needed in the sequel. For proofs of Theorems 2.1 and 2.2, refer to Bartl [4, Theorem 2.4].
Theorem 2.1.
Suppose is convex and has weakly compact sublevel sets. Define as in (1.1). Then the following hold:

If pointwise then .

If and , then .

If with pointwise, then .

If and with pointwise, then .
Moreover, for we have
(2.1) 
Theorem 2.2.
We state also a useful theorem of Föllmer and Schied [22] which allows us to verify tightness of the sublevel sets of by checking a property of .
Theorem 2.3 (Proposition 4.30 of [22]).
Suppose a functional admits the representation
for some function . Suppose also that there is a sequence of compact subsets of such that
Then has tight sublevel sets.
The goal of the rest of the section is to extend the domain of to unbounded functions and study the compactness of the sublevel sets of with respect to stronger topologies. From now on, we work at all times with the standing assumptions on described in the introduction:
Standing assumptions.
The function is convex, has weakly compact sublevel sets, and is not identically equal to . Lastly, is defined as in (1.1).
2.1. Extending and to unbounded functions
This section extends the domain of to unbounded functions. Let . We adopt the convention that , although this will have few consequences aside from streamlined definitions. In particular, if a measurable function satisfies , we define .
Definition 2.4.
For and measurable , define
As usual, abbreviate .
It is worth emphasizing that while is finite for bounded , it can be either or when is unbounded. The following simple lemma will aid in some computations in Section 4.
Lemma 2.5.
If is measurable and bounded from below, then
Proof.
Define . Monotone convergence yields
One checks easily that this is consistent with the convention . ∎
3. An extension of Theorem 1.1
In this section we state and prove a useful generalization of Theorem 1.1 for stronger topologies and unbounded functions, taking advantage of the results of the previous section. At all times in this section, the standing assumptions on (stated just before Section 2.1) are in force.
We prepare by defining a well known class of topologies on subsets of . Given a continuous function , define
Endow with the (Polish) topology generated by the maps , where is continuous and ; we call this the weak topology. A useful fact about this topology is that a set is precompact if and only if for every there exists a compact set such that
This is easily proven directly, or refer to [22, Corollary A.47].
In the following theorem, the extension of the upper bound to the weak topology requires the assumption that the sublevel sets of are precompact in . This rather opaque assumption is explored in more detail in the subsequent Section 3.1.
Theorem 3.1.
Let be continuous. If is lower semicontinuous (with respect to the weak topology) and bounded from below, then
Suppose also that the sublevel sets of are precompact subsets of . If is upper semicontinuous and bounded from above, then
Proof.
Lower bound: Let us first prove the lower bound. It is immediate from the definition that for each , where denotes the $n$-fold product measure. Thus
(3.1)  
For , the law of large numbers implies weakly, i.e. in . For , the convergence takes place in . Lower semicontinuity of on then implies, for each ,
Take the supremum over to complete the proof of the lower bound. It is worth noting that if is a compatible metric on and for some fixed and , then the weak topology is nothing but the Wasserstein topology.
Upper bound, bounded: The upper bound is more involved. First we prove it in four steps under the assumption that is bounded.
Step 1: First we simplify the expression somewhat. For each the definition of and convexity of imply
Combine this with (3.1) to get
(3.2) 
Now choose arbitrarily some such that . The choice and boundedness of show that the supremum in (3.2) is bounded below by , where . For each , choose attaining the supremum in (3.2) to within . Then
(3.3) 
It is convenient to switch now to a probabilistic notation: On some sufficiently rich probability space, find an