# A non-exponential extension of Sanov’s theorem via convex duality

## Abstract

This work is devoted to a vast extension of Sanov’s theorem, in Laplace principle form, based on alternatives to the classical convex dual pair of relative entropy and cumulant generating functional. The abstract results give rise to a number of probabilistic limit theorems and asymptotics. For instance, widely applicable non-exponential large deviation upper bounds are derived for empirical distributions and averages of i.i.d. samples under minimal integrability assumptions, notably accommodating heavy-tailed distributions. Other interesting manifestations of the abstract results include uniform large deviation bounds, variational problems involving optimal transport costs, and constrained super-hedging problems, as well as an application to error estimates for approximate solutions of stochastic optimization problems. The proofs build on the Dupuis-Ellis weak convergence approach to large deviations as well as the duality theory for convex risk measures.

## 1 Introduction

An original goal of this paper was to extend the weak convergence methodology of Dupuis and Ellis [17] to the context of non-exponential (e.g., heavy-tailed) large deviations. While we claim only modest success in this regard, we do find some general-purpose large deviation upper bounds which can be seen as polynomial-rate analogs of the upper bounds in the classical theorems of Sanov and Cramér. At least as interesting, however, are the abstract principles behind these bounds, which have broad implications beyond the realm of large deviations. Let us first describe these abstract principles before specializing them in various ways.

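Let $E$ be a Polish space, and let $\mathcal{P}(E)$ denote the set of Borel probability measures on $E$ endowed with the topology of weak convergence. Let $B(E)$ (resp. $C_b(E)$) denote the set of measurable (resp. continuous) and bounded real-valued functions on $E$. For $n \ge 1$ and $\nu \in \mathcal{P}(E^n)$, define $\nu_{0,1} \in \mathcal{P}(E)$ and measurable maps $\nu_{k-1,k} : E^{k-1} \to \mathcal{P}(E)$ for $k = 2,\ldots,n$ via the disintegration

$$\nu(dx_1,\ldots,dx_n) = \nu_{0,1}(dx_1)\prod_{k=2}^n \nu_{k-1,k}(x_1,\ldots,x_{k-1})(dx_k).$$

In other words, if $(X_1,\ldots,X_n)$ is an $E^n$-valued random variable with law $\nu$, then $\nu_{0,1}$ is the law of $X_1$, and $\nu_{k-1,k}(X_1,\ldots,X_{k-1})$ is the conditional law of $X_k$ given $(X_1,\ldots,X_{k-1})$. Of course, the maps $\nu_{k-1,k}$ are uniquely defined up to $\nu$-almost sure equality.

The protagonist of the paper is a proper (i.e., not identically $\infty$) convex function $\alpha : \mathcal{P}(E) \to (-\infty,\infty]$ with compact sub-level sets; that is, $\{\nu \in \mathcal{P}(E) : \alpha(\nu) \le c\}$ is compact for every $c \in \mathbb{R}$. For $n \ge 1$ define $\alpha_n : \mathcal{P}(E^n) \to (-\infty,\infty]$ by

$$\alpha_n(\nu) = \int_{E^n} \sum_{k=1}^n \alpha\big(\nu_{k-1,k}(x_1,\ldots,x_{k-1})\big)\,\nu(dx_1,\ldots,dx_n),$$

and note that $\alpha_1 \equiv \alpha$. Define the convex conjugate $\rho_n : B(E^n) \to \mathbb{R}$ by

$$\rho_n(f) = \sup_{\nu \in \mathcal{P}(E^n)} \left( \int_{E^n} f\,d\nu - \alpha_n(\nu) \right), \qquad \rho \equiv \rho_1.$$

Our main interest is in evaluating $\rho_n$ at functions of the empirical measure $L_n : E^n \to \mathcal{P}(E)$ defined by

$$L_n(x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}.$$

The main abstract result of the paper is the following extension of Sanov’s theorem, proven in a more general form in Section 3 by adapting the weak convergence techniques of Dupuis-Ellis [17].

Theorem 1.1. For $F \in C_b(\mathcal{P}(E))$,

$$\lim_{n\to\infty} \frac{1}{n}\rho_n(nF \circ L_n) = \sup_{\nu \in \mathcal{P}(E)} \big(F(\nu) - \alpha(\nu)\big).$$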

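The guiding example is the relative entropy, $\alpha(\cdot) = H(\cdot\,|\,\mu)$, where $\mu \in \mathcal{P}(E)$ is a fixed reference measure, and $H$ is defined by

$$H(\nu\,|\,\mu) = \int_E \log\frac{d\nu}{d\mu}\,d\nu \ \text{ for } \nu \ll \mu, \qquad H(\nu\,|\,\mu) = \infty \ \text{ otherwise}.$$

It turns out that $\alpha_n(\cdot) = H(\cdot\,|\,\mu^n)$, by the so-called chain rule of relative entropy [17]. The dual $\rho_n$ is well known to be $\rho_n(f) = \log\int_{E^n} e^f\,d\mu^n$, and the formula relating $\rho_n$ and $\alpha_n$ is often known as the Gibbs variational principle or the Donsker-Varadhan formula. In this case Theorem ? reduces to the Laplace principle form of Sanov’s theorem:

$$\lim_{n\to\infty}\frac{1}{n}\log\int_{E^n} e^{nF \circ L_n}\,d\mu^n = \sup_{\nu \in \mathcal{P}(E)}\big(F(\nu) - H(\nu\,|\,\mu)\big).$$

Well known theorems of Varadhan and Dupuis-Ellis (see [17]) assert the equivalence of this form of Sanov’s theorem with the more common form: for every Borel set $A \subset \mathcal{P}(E)$ with closure $\overline{A}$ and interior $A^\circ$,

$$-\inf_{\nu \in A^\circ} H(\nu\,|\,\mu) \le \liminf_{n\to\infty}\frac{1}{n}\log\mu^n(L_n \in A) \le \limsup_{n\to\infty}\frac{1}{n}\log\mu^n(L_n \in A) \le -\inf_{\nu \in \overline{A}} H(\nu\,|\,\mu).$$

To derive this heuristically, apply Theorem ? to the function

$$F(\nu) = \begin{cases} 0 & \text{if } \nu \in A, \\ -\infty & \text{otherwise.} \end{cases}$$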

For general , Theorem ? does not permit an analogous equivalent formulation in terms of deviation probabilities. In fact, for many , Theorem ? has nothing to do with large deviations (see Sections 1.3 and 1.4 below). Nonetheless, for certain , Theorem 1.1 implies interesting large deviations upper bounds, which we prove by formalizing the aforementioned heuristic. While many admit fairly explicit known formulas for the dual , the recurring challenge in applying Theorem ? is finding a useful expression for , and herein lies but one of many instances of the wonderful tractability of relative entropy. The examples to follow do admit good expressions for , or at least workable one-sided bounds, but we also catalog in Section 1.5 some natural alternative choices of for which we did not find useful bounds or expressions for .
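As a numerical sanity check of the classical relative entropy case, the following minimal sketch works on the two-point space $E = \{0,1\}$ with a Bernoulli($p$) reference law. There the empirical measure is determined by the fraction of ones, so $\frac{1}{n}\log \mathbb{E}[e^{nF(L_n)}]$ can be computed exactly and compared with $\sup_q (F(q) - H(q\,|\,p))$. The test function `F`, the value of `p`, and all function names are hypothetical choices made for illustration and do not come from the paper.

```python
import math

def rel_entropy(q, p):
    """Relative entropy H(Bernoulli(q) | Bernoulli(p)), with 0 * log(0) = 0."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def laplace_lhs(F, p, n):
    """(1/n) log E[exp(n F(L_n))] for n i.i.d. Bernoulli(p) samples.

    L_n is identified with the fraction k/n of ones. The sum over k is
    evaluated with a log-sum-exp to avoid overflow and underflow.
    """
    logs = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        logs.append(log_binom + k * math.log(p) + (n - k) * math.log(1.0 - p)
                    + n * F(k / n))
    m = max(logs)
    return (m + math.log(sum(math.exp(v - m) for v in logs))) / n

def variational_rhs(F, p, grid=100_000):
    """sup over q in [0, 1] of F(q) - H(q | p), approximated on a uniform grid."""
    return max(F(i / grid) - rel_entropy(i / grid, p) for i in range(grid + 1))

def F(q):
    """A bounded continuous test function of the fraction of ones (hypothetical)."""
    return -abs(q - 0.8)

p = 0.3
rhs = variational_rhs(F, p)
for n in (10, 100, 1000):
    # The first printed value should approach the second as n grows.
    print(n, laplace_lhs(F, p, n), rhs)
```

The gap between the two sides is at most of order $\log(n)/n$ in this finite setting, which is why moderate values of $n$ already give close agreement.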

The functional is (up to a sign change) a convex risk measure, in the language of Föllmer and Schied [22]. A rich duality theory for convex risk measures emerged over the past two decades, primarily geared toward applications in financial mathematics and optimization. We take advantage of this theory in Section 2 to demonstrate how can be reconstructed from in many cases, which shows that could be taken as the starting point instead of . Additionally, the theory of risk measures provides insight on how to deal with the subtleties that arise in extending the domain of (and Theorem ?) to accommodate unbounded functions or stronger topologies on . Section 1.7 briefly reinterprets Theorem ? in a language more consistent with the risk measure literature. The reader familiar with risk measures may notice a time consistent dynamic risk measure (see [1] for definitions and survey) hidden in the definition of above.

We will make no use of the interpretation in terms of dynamic risk measures, but it did inspire a recursive formula for (similar to a result of [11]). To state it loosely, if then we may write

To make rigorous sense of this, we must note that is merely upper semianalytic and not Borel measurable in general, and that is well defined for such functions. We make this precise in Proposition ?. This recursive formula is not essential for any of the arguments but is convenient for calculations.
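To illustrate how such a recursion can be used in calculations, here is a minimal sketch in the classical entropic case on a finite space, where measurability issues disappear. It assumes the recursion takes the form "apply $\rho$ to the last coordinate with the earlier coordinates frozen, then recurse", and it checks that iterating this procedure reproduces the direct formula $\log\int e^f\,d\mu^n$. The reference law `mu`, the test function `f`, and the function names are hypothetical.

```python
import itertools
import math

mu = [0.2, 0.5, 0.3]   # reference law on E = {0, 1, 2} (hypothetical)
n = 3

def rho(g):
    """Entropic functional rho(g) = log sum_i mu_i exp(g_i) on the finite space E."""
    return math.log(sum(m * math.exp(v) for m, v in zip(mu, g)))

def f(x):
    """An arbitrary bounded test function on E^n (hypothetical)."""
    return math.sin(sum((i + 1) * xi for i, xi in enumerate(x)))

def rho_n_direct(f, n):
    """log of the integral of exp(f) against the n-fold product of mu."""
    total = 0.0
    for x in itertools.product(range(len(mu)), repeat=n):
        total += math.prod(mu[xi] for xi in x) * math.exp(f(x))
    return math.log(total)

def rho_n_recursive(f, n, prefix=()):
    """Apply rho one coordinate at a time, starting from the last coordinate."""
    if len(prefix) == n:
        return f(prefix)
    return rho([rho_n_recursive(f, n, prefix + (y,)) for y in range(len(mu))])

print(rho_n_direct(f, n), rho_n_recursive(f, n))
```

In the entropic case the agreement is just the tower property of conditional expectation; for other choices of the functional the recursive form is the one that remains available.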

### 1.1 Nonexponential large deviations

A first application of Theorem ? is to derive large deviation upper bounds in the absence of exponential rates or finite moment generating functions. While Cramér’s theorem in full generality does not require any finite moments, the upper bound is often vacuous when the underlying random variables have heavy tails. This simple observation has driven a large and growing literature on large deviation asymptotics for sums of i.i.d. random variables, to be reviewed shortly. Our approach is well suited not to precise asymptotics but rather to widely applicable upper bounds. In Section 4.1 we derive alternatives to the upper bounds of Sanov’s and Cramér’s theorems by applying (an extension of) Theorem ? with

where is fixed. We state the results here: For a continuous function , let denote the set of satisfying , and equip with the topology induced by the linear maps , where is continuous and .

In analogy with the classical Cramér’s theorem, the function in Corollary ? plays the role of the cumulant generating function. In both Theorem ? and Corollary ?, notice that as soon as the constant on the right-hand side is finite, we may conclude that the probabilities in question are . This is consistent with some now-standard results on one-dimensional heavy tailed sums, for events of the form , for . For instance, it is known [44] that if are i.i.d. real-valued random variables with mean zero and , then . For , the well known inequality of Fuk-Nagaev provides a related non-asymptotic bound; see [40], or [19] for a Banach space version.

If instead a stronger assumption is made on , such as regular variation, then there are corresponding lower bounds for certain sets . Refer to the books [10] and the survey of Mikosch and Nagaev [38] for detailed reviews of such results, as well as the more recent [15] and references therein. Indeed, precise asymptotics require detailed assumptions on the shape of the tails of , and this is especially true in multivariate and infinite-dimensional contexts. A recent line of interesting work extends the theory of regular variation to metric spaces [13], but again the typical assumptions on the underlying law are substantially stronger than mere existence of a finite moment.

The main advantage of our results is their broad applicability, requiring only finite moments, but two other strengths are worth emphasizing. First, our bounds apply to arbitrary closed sets , which enables a natural contraction principle (i.e., continuous mapping). Section 4.6 illustrates this by using Theorem ? to find error bounds for Monte Carlo schemes in stochastic optimization, essentially providing a heavy-tailed analog of the results of [30]. Lastly, while this discussion has focused on literature related to our analog of Cramér’s upper bound (Corollary ?), our analog of Sanov’s upper bound (Theorem ?) seems even more novel. No other results are known to the author on empirical measure large deviations in heavy-tailed contexts. Of course, Sanov’s theorem applies without any moment assumptions, but the upper bound provides no information in many heavy-tailed applications, such as in Section 4.6.

### 1.2 Uniform upper bounds and martingales

Certain classes of dependent sequences admit uniform upper bounds, which we derive from Theorem ? by working with

for a given convex weakly compact set . The conjugate , unsurprisingly, is , and turns out to be tractable as well:

where is defined as the set of laws with for each , -almost surely; in other words, is the set of laws of -valued random variables , when the law of belongs to and so does the conditional law of given , almost surely, for each . Theorem ? becomes

From this we derive a uniform large deviation upper bound, for closed sets :

With a prudent choice of , this specializes to an asymptotic relative of the Azuma-Hoeffding inequality. The surprising feature here is that we can work with arbitrary closed sets and in multiple dimensions:

Föllmer and Knispel [20] found some results which loosely resemble (see Corollary 5.3 therein), based on an analysis of the same risk measure . See also [26] for somewhat related results on large deviations for capacities.
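For orientation, recall the classical non-asymptotic Azuma-Hoeffding inequality: a martingale $S_n$ with increments bounded by $c$ satisfies $\mathbb{P}(S_n \ge s) \le \exp(-s^2/(2nc^2))$. The minimal sketch below compares this classical bound (not the asymptotic result of this paper) with the exact tail of a sum of independent fair signs, the simplest bounded martingale. The function names and the choice $s = n/5$ are hypothetical.

```python
import math

def exact_tail(n, s):
    """P(S_n >= s) for S_n a sum of n independent fair +/-1 signs and integer s."""
    # S_n = 2 B - n with B ~ Binomial(n, 1/2), so S_n >= s iff B >= ceil((n + s) / 2).
    k_min = max((n + s + 1) // 2, 0)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def azuma_bound(n, s, c=1.0):
    """Classical Azuma-Hoeffding bound exp(-s^2 / (2 n c^2)) for increments bounded by c."""
    return math.exp(-s * s / (2.0 * n * c * c))

for n in (10, 100, 1000):
    s = n // 5                      # deviation level t = s / n = 0.2
    tail = exact_tail(n, s)
    # Columns: n, exact tail, bound, empirical decay rate, Azuma rate t^2 / 2.
    print(n, tail, azuma_bound(n, s), -math.log(tail) / n, (s / n) ** 2 / 2.0)
```

The last two columns compare decay rates on the exponential scale, which is the scale at which an asymptotic statement of this type operates.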

### 1.3 Laws of large numbers

Some specializations of Theorem ? appear to have nothing to do with large deviations. For example, suppose is convex and compact, and let

It can be shown that , where is defined as in Section 1.2, for instance by a direct computation using . Theorem ? then becomes

When is a singleton, so is , and this simply expresses the weak convergence . The general case can be interpreted as a robust law of large numbers, where “robust” refers to perturbations of the joint law of an i.i.d. sequence. This is closely related to laws of large numbers under nonlinear expectations [42].

### 1.4 Optimal transport costs

Another interesting consequence of Theorem ? comes from choosing as an optimal transport cost. Fix and a lower semicontinuous function , and define

where is the set of probability measures on with first marginal and second marginal . Under a modest additional assumption on (stated shortly in Corollary ?, proven later in Lemma ?), satisfies our standing assumptions.
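To make the notion of an optimal transport cost concrete, the following minimal sketch computes it between two uniform empirical measures on the real line with the same number of atoms. In that case an optimal coupling can be taken to be a permutation (by Birkhoff's theorem), and for a convex cost of the difference the sorted pairing is optimal. The sample points, the quadratic cost, and the function names are hypothetical.

```python
import itertools

def transport_cost_bruteforce(xs, ys, cost):
    """Minimum over permutations of the average pairing cost.

    For two uniform empirical measures with equally many atoms, this equals
    the optimal transport cost between them.
    """
    n = len(xs)
    return min(sum(cost(x, ys[j]) for x, j in zip(xs, perm)) / n
               for perm in itertools.permutations(range(n)))

def transport_cost_sorted(xs, ys, cost):
    """Average cost of the monotone pairing of sorted atoms."""
    return sum(cost(x, y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def quadratic_cost(x, y):
    """Convex cost of the difference x - y (hypothetical choice)."""
    return (x - y) ** 2

xs = [0.3, -1.2, 2.5, 0.9, 4.0]
ys = [1.1, 0.0, -0.7, 3.3, 2.2]
print(transport_cost_bruteforce(xs, ys, quadratic_cost),
      transport_cost_sorted(xs, ys, quadratic_cost))
```

The brute-force search is only feasible for a handful of atoms; it is included to confirm the sorted shortcut rather than as a practical solver.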

The dual can be identified using Kantorovich duality, and turns out to be the value of a stochastic optimal control problem. To illustrate this, it is convenient to work with probabilistic notation: Suppose is a sequence of i.i.d. -valued random variables with common law , defined on some fixed probability space. For each , let denote the set of -valued random variables where is -measurable for each . We think of elements of as adapted control processes. For each and each , we show in Proposition ? that

The expression yields the following corollary of Theorem ?:

This can be seen as a long-time limit of the optimal value of the control problems. However, the renormalization in is a bit peculiar in that it enters inside of the terminal cost , and there does not seem to be a direct connection with ergodic control. A direct proof of is possible but seems to be no simpler and potentially narrower in scope.

The limiting object of Corollary ? encapsulates a wide variety of interesting variational problems involving optimal transport costs. Variational problems of this form are surely more widespread than the author is aware, but two notable recent examples can be found in the study of Cournot-Nash equilibria in large-population games [9] and in the theory of Wasserstein barycenters [2].

### 1.5 Alternative choices of α

There are many other natural choices of for which the implications of Theorem ? remain unclear. For example, consider the -divergence

where and is convex and satisfies as . This has weakly compact (actually -compact) sub-level sets, according to [14], and it is clearly convex. The dual, known in the risk literature as the optimized certainty equivalent, was computed by Ben-Tal and Teboulle [6] to be

where is the convex conjugate. Unfortunately, we did not find any good expressions or estimates for or , so the interpretation of the main Theorem ? eludes us in this case.
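For a concrete feel for the optimized certainty equivalent, the sketch below evaluates it on a finite space, assuming it takes its usual form $\inf_{m \in \mathbb{R}} \big(m + \int \phi^*(f - m)\,d\mu\big)$. With $\phi(x) = x\log x - x + 1$, whose conjugate is $\phi^*(y) = e^y - 1$, the divergence is the relative entropy and the value should reduce to $\log\int e^f\,d\mu$. The data `mu` and `f`, the search bracket, and the function names are hypothetical.

```python
import math

def oce(f, mu, phi_star, lo, hi, iters=200):
    """Optimized certainty equivalent inf_m { m + sum_i mu_i phi_star(f_i - m) }.

    The objective is convex in m, so a ternary search on [lo, hi] suffices,
    provided the bracket contains a minimizer.
    """
    def objective(m):
        return m + sum(w * phi_star(v - m) for w, v in zip(mu, f))

    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a) < objective(b):
            hi = b      # a minimizer lies in [lo, b]
        else:
            lo = a      # a minimizer lies in [a, hi]
    return objective(0.5 * (lo + hi))

mu = [0.2, 0.5, 0.3]            # reference law on a three-point space (hypothetical)
f = [1.0, -0.5, 2.0]            # bounded function on that space (hypothetical)

def phi_star(y):
    """Conjugate of phi(x) = x log x - x + 1."""
    return math.exp(y) - 1.0

entropic = math.log(sum(w * math.exp(v) for w, v in zip(mu, f)))
print(oce(f, mu, phi_star, min(f) - 1.0, max(f) + 1.0), entropic)
```

In this entropic case the minimizing $m$ is $\log\int e^f\,d\mu$ itself, which lies between the minimum and maximum of $f$ and so inside the bracket used above.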

A related choice is the so-called shortfall risk measure introduced by Föllmer and Schied [21]:

This choice of and the corresponding (tractable!) are discussed briefly in Section 4.1. The choice of corresponds to , and we make extensive use of this in Section 4, as was discussed in Section 1.1. The choice of recovers the classical case . Aside from these two examples, for general , we found no useful expressions or estimates for or . In connection with tails of random variables, shortfall risk measures have an intuitive appeal stemming from the following simple analog of Chernoff’s bound, observed in [34]: If for all , where is some given measurable function, then for all , where .
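The following minimal sketch computes a shortfall risk measure on a finite space by bisection, assuming the normalization $\inf\{m \in \mathbb{R} : \int \ell(f - m)\,d\mu \le 1\}$ for a nondecreasing continuous loss $\ell$. With $\ell(x) = e^x$ it should recover the entropic value $\log\int e^f\,d\mu$. The polynomial loss in the second call is a hypothetical choice included only to show that nothing in the routine depends on exponential growth. All data and function names are likewise hypothetical.

```python
import math

def shortfall_risk(f, mu, loss, lo, hi, iters=200):
    """inf { m : sum_i mu_i loss(f_i - m) <= 1 } computed by bisection.

    Assumes loss is nondecreasing and continuous, that the constraint fails
    at m = lo, and that it holds at m = hi.
    """
    def expected_loss(m):
        return sum(w * loss(v - m) for w, v in zip(mu, f))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_loss(mid) <= 1.0:
            hi = mid    # constraint holds, so try a smaller m
        else:
            lo = mid    # constraint fails, so m must increase
    return hi

mu = [0.2, 0.5, 0.3]            # reference law on a three-point space (hypothetical)
f = [1.0, -0.5, 2.0]            # bounded function on that space (hypothetical)

# Exponential loss: the result should be log of the integral of exp(f).
entropic = math.log(sum(w * math.exp(v) for w, v in zip(mu, f)))
print(shortfall_risk(f, mu, math.exp, min(f) - 1.0, max(f) + 1.0), entropic)

def poly_loss(x, q=2.0):
    """Polynomial loss max(1 + x, 0)^q, normalized so that poly_loss(0) = 1."""
    return max(1.0 + x, 0.0) ** q

# Because poly_loss(0) = 1, a constant function is assigned its own value.
const = [0.7, 0.7, 0.7]
print(shortfall_risk(const, mu, poly_loss, -0.3, 1.7))
```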

It is worth pointing out the natural but ultimately fruitless idea of working with , where is increasing. Such functionals were first studied, it seems, by Hardy, Littlewood, and Pólya [25], who provided necessary and sufficient conditions for to be convex (rediscovered in [6]). Using the formula to compute , this choice would lead to the exceptionally pleasant formula , which we observed already in the classical case . Unfortunately, however, such a cannot come from a functional on , in the sense that cannot hold unless is affine or exponential. Another way of seeing this is that the convex conjugate of (with respect to the dual pairing of with the space of bounded signed measures) fails to be infinite outside of the set . The problem, as is known in the risk measure literature, is that the additivity property for all and fails unless is affine or exponential (cf. [22]).

The consequences of Theorem ? remain unexplored for several other potentially interesting choices of with well understood duals: To name just a few, we mention the Schrödinger problem surveyed in [35] and related functionals arising from stochastic optimal control problems [37], martingale optimal transport costs [5], and functionals related to Orlicz norms studied in [12].

### 1.6 Connection to superhedging

Again, the challenge in working with Theorem ? is in computing or estimating or . With this in mind, we present an alternative expression for as the value of a particular type of optimal control problem, more specifically a superhedging problem (see, e.g., [22]). To a given dual pair we may associate the acceptance set

As is well known in the risk measure literature, we may express in terms of by

Indeed, this follows easily from the fact that for constants . In fact, can also be reconstructed from , and this provides a third possible entry point to the duality. To elaborate on this would take us too far afield, but see [22] for details.
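The short computation behind this representation can be written out as follows. This is a sketch under the assumption that the acceptance set is the zero sub-level set $\mathcal{A} = \{f : \rho(f) \le 0\}$ of the functional $\rho(f) = \sup_{\nu}\big(\int f\,d\nu - \alpha(\nu)\big)$; the symbol $\mathcal{A}$ is our notation for this illustration. Since the supremum runs over probability measures, adding a constant $c$ to $f$ shifts $\rho$ by $c$, and the representation follows.

```latex
\begin{align*}
\rho(f + c) &= \sup_{\nu \in \mathcal{P}(E)} \Big( \int_E (f + c)\,d\nu - \alpha(\nu) \Big)
             = \rho(f) + c, \qquad c \in \mathbb{R}, \\
\inf\{ c \in \mathbb{R} : f - c \in \mathcal{A} \}
            &= \inf\{ c \in \mathbb{R} : \rho(f) - c \le 0 \}
             = \rho(f).
\end{align*}
```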

Now, let us compute in terms of the acceptance set. For , define to be the set of , where is a measurable function from to satisfying

To interpret this as a control problem, consider the partial sum process as a state process, which we must “steer” to be larger than pointwise at the final time . The control at each time must be admissible in the sense of , and notice that the dependence of on only is an expression of adaptedness or non-anticipativity. We seek the minimal starting point for which this steering can be done. The iterative form of in (more precisely stated in Proposition ?) can be seen as an expression of the dynamic programming principle for the control problem of Theorem ?.

For a concrete example, if is the shortfall risk measure , and if denote i.i.d. -valued random variables with common law , then Theorem ? expresses as the infimum over all for which there exists an -adapted process satisfying and a.s., for each .

### 1.7 Interpreting Theorem 1.1 in terms of risk measures

It is straightforward to rewrite Theorem ? in a language more in line with the literature on convex risk measures, for which we again defer to [22] for background. Let be a measurable space, and suppose is a convex risk measure on the set of bounded measurable functions. That is, is convex, for all and , and whenever pointwise. Suppose we are given a sequence of -valued random variables , i.e., measurable maps . Assume have the following independence property, identical to Peng’s notion of independence under nonlinear expectations [43]: for and

In particular, for all . Define by

Additional assumptions on (see, e.g., Theorem ? below) can ensure that has weakly compact sub-level sets, so that Theorem ? applies. Then, for ,

Indeed, in our previous notation, for .

In the risk measure literature, one thinks of as the risk associated to an uncertain financial loss . With this in mind, and with , the quantity appearing in is the risk-per-unit of an investment in units of . One might interpret as capturing the composition of the investment, while the multiplicative factor represents the size of the investment. As increases, say to , the investment is “rebalanced” in the sense that one additional independent component, , is incorporated and the size of the total investment is increased by one unit. The limit in is then an asymptotic evaluation of the risk-per-unit of this rebalancing scheme.

### 1.8 Extensions

Broadly speaking, the book of Dupuis and Ellis [17] and numerous subsequent works illustrate how the classical convex duality between relative entropy and cumulant-generating functions can serve as a foundation from which to derive an impressive range of large deviation principles. Similarly, each alternative dual pair should provide an alternative foundation for a potentially equally wide range of limit theorems. From this perspective, our work raises far more questions than it answers by restricting attention to analogs of the two large deviation principles of Sanov and Cramér. It is likely, for instance, that an analog of Mogulskii’s theorem (see [39] or [17]) holds in our context. Moreover, our framework is not as restricted to i.i.d. samples as it may appear. While the definition of reflects our focus on i.i.d. samples, we might accommodate Markov chains by redefining . For instance, we may try

where is an initial law of a Markov chain, is its transition kernel, and plays the role of . This again simplifies in the classical case , leading to , where is the law of the path of the Markov chain described above. These speculations are meant simply to convey the versatility of our framework but are pursued no further, with the paper instead focusing on exploring the implications of various choices of in our analog of Sanov’s theorem.

### 1.9 Outline of the paper

The remainder of the paper is organized as follows. Section 2 begins by clarifying the duality, explaining some useful properties of and and extending their definitions to unbounded functions. Section 3 is devoted to the statement and proof of Theorem ?, which contains Theorem ? as a special case but is extended to stronger topologies and unbounded functions . See also Section 3.3 for abstract analogs of the contraction principle and Cramér’s theorem. These extensions are put to use in Section 4, which proves and elaborates on the non-exponential forms of Sanov’s and Cramér’s theorems discussed in Section 1.1. Section 4.6 applies these results to obtain error estimates for a common Monte Carlo approach to stochastic optimization. Sections 5 and 6 respectively elaborate on the examples of Sections 1.2 and 1.4. Appendix A proves two different representations of , namely those of and Theorem ?. The short Appendix B describes a natural but largely unsuccessful attempt to derive tractable large deviation upper bounds from Theorem ? by working with a class of functionals of not one but two measures, such as -divergences. Finally, two minor technical results are relegated to Appendix C.

## 2 Convex duality preliminaries

This section outlines the key features of the duality. The first three theorems, stated in this subsection, are borrowed from the literature on convex risk measures, for which an excellent reference is the book of Föllmer and Schied [22]. While we will make use of some of the properties listed in Theorem ?, the goal of the first two theorems is more to illustrate how one can make the starting point rather than . In particular, Theorem ? will not be needed in the sequel. For proofs of Theorems ? and ?, refer to Bartl [4].

We state also a useful theorem of Föllmer and Schied [22] which allows us to verify tightness of the sub-level sets of by checking a property of .

The goal of the rest of the section is to extend the domain of to unbounded functions and study the compactness of the sub-level sets of with respect to stronger topologies. From now on, we work at all times with the standing assumptions on described in the introduction:

### 2.1 Extending $\rho$ and $\rho_n$ to unbounded functions

This section extends the domain of to unbounded functions. Let . We adopt the convention that , although this will have few consequences aside from streamlined definitions. In particular, if and a measurable function and satisfies , we define .

It is worth emphasizing that while is finite for bounded , it can be either or when is unbounded. The following simple lemma will aid in some computations in Section 4.

Define . Monotone convergence yields

One checks easily that this is consistent with the convention .

## 3 An extension of Theorem 1.1

In this section we state and prove a useful generalization of Theorem ? for stronger topologies and unbounded functions, taking advantage of the results of the previous section. At all times in this section, the standing assumptions on (stated just before Section 2.1) are in force.

We prepare by defining a well known class of topologies on subsets of . Given a continuous function , define

Endow with the (Polish) topology generated by the maps , where is continuous and ; we call this the -weak topology. A useful fact about this topology is that a set is pre-compact if and only if for every there exists a compact set such that

This is easily proven directly, or refer to [22].

In the following theorem, the extension of the upper bound to the -weak topology requires the assumption that the sub-level sets of are pre-compact in . This rather opaque assumption is explored in more detail in the subsequent Section 3.1.

Lower bound:

Let us prove first the lower bound. It is immediate from the definition that for each , where denotes the -fold product measure. Thus

For , the law of large numbers implies weakly, i.e. in . For , the convergence takes place in . Lower semicontinuity of on then implies, for each ,

Take the supremum over to complete the proof of the lower bound. It is worth noting that if is a compatible metric on and for some fixed and , then the -weak topology is nothing but the -Wasserstein topology.

Upper bound, bounded:

The upper bound is more involved. First we prove it in four steps under the assumption that is bounded.

Step 1:

First we simplify the expression somewhat. For each the definition of and convexity of imply

Combine this with to get

Now choose arbitrarily some such that . The choice and boundedness of show that the supremum in is bounded below by , where . For each , choose attaining the supremum in to within . Then

It is convenient to switch now to a probabilistic notation: On some sufficiently rich probability space, find an -valued random variable with law . Define the random measures

Use and unwrap the definitions to find

Moreover, implies

Step 2:

We next show that the sequence is tight, viewed as -valued random variables. Here we use the assumption that the sub-level sets of are -weakly compact subsets of . It then follows from that is tight (see, e.g., [17]).

To see that the pair is tight, it remains to check that is tight. To this end, we first notice that and have the same mean measure for each , in the sense that for every we have

To prove is tight, it suffices (by Prohorov’s theorem) to show that for all there exists a -weakly compact set such that . We will look for of the form , where is a sequence of compact subsets of to be specified later; indeed, sets of this form are pre-compact in according to a form of Prohorov’s theorem discussed at the beginning of this section (see also [22]). For such a set , use Markov’s inequality and to compute

By a form of Jensen’s inequality (see Lemma ?),

where is the probability measure on defined by . Hence, the sequence is pre-compact in , thanks to the assumption that sub-level sets of are pre-compact subsets of . It follows that for every there exists a compact set such that . With this in mind, we may choose to make arbitrarily small, uniformly in . This shows that is tight, completing Step 2.

Step 3:

We next show that every limit in distribution of is concentrated on the diagonal . By definition of , we have

for every . That is, the terms inside the expectation form a martingale difference sequence. Thus, for , we have

where . It is straightforward to check that implies that every weak limit of is concentrated on (i.e., almost surely belongs to) the diagonal (cf. [17]).
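The mechanism behind Step 3 is the orthogonality of martingale differences, which can be spelled out as follows. The notation here is ours and is only a sketch: $\phi$ denotes a bounded measurable test function, $(Y_k)$ an adapted sequence with filtration $(\mathcal{F}_k)$, and $D_k = \phi(Y_k) - \mathbb{E}[\phi(Y_k)\,|\,\mathcal{F}_{k-1}]$, so that $|D_k| \le 2\|\phi\|_\infty$. For $j < k$ the cross terms vanish, since $\mathbb{E}[D_j D_k] = \mathbb{E}\big[D_j\,\mathbb{E}[D_k\,|\,\mathcal{F}_{k-1}]\big] = 0$, and therefore

```latex
\[
\mathbb{E}\bigg[\Big(\frac{1}{n}\sum_{k=1}^n D_k\Big)^{2}\bigg]
  = \frac{1}{n^2}\sum_{k=1}^n \mathbb{E}\big[D_k^2\big]
  \le \frac{4\|\phi\|_\infty^2}{n}
  \longrightarrow 0 \quad \text{as } n \to \infty.
\]
```

Applied to a countable convergence-determining family of test functions, a bound of this type forces the two random measures to share the same limit.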

Step 4:

We can now complete the proof of the upper bound. With Step 3 in mind, fix a subsequence and a -valued random variable such that in distribution (where we relabeled the subsequence). Recall that is bounded from below and -weakly lower semicontinuous, whereas is upper semicontinuous and bounded. Returning to , we conclude now that

Of course, we abused notation by relabeling the subsequences, but we have argued that for every subsequence there exists a further subsequence for which this bound holds, which proves the upper bound for bounded.

Upper bound, unbounded :

With the proof complete for bounded , we now remove the boundedness assumption using a natural truncation procedure. Let be upper semicontinuous and bounded from above. For let . Since is bounded and upper semicontinuous, the previous step yields

for each . Since , we have

for each , and it remains only to show that

Clearly , since . Note that , as and are bounded from above and from below, respectively. If , then whenever , and we conclude that, as ,

Now suppose instead that is finite. Fix . For each , find such that

Since is bounded from above and , it follows that . The sub-level sets of are -weakly compact, and thus the sequence has a limit point (in ). Let denote any limit point, and suppose . Then

where the second inequality follows from upper semicontinuity of and lower semicontinuity of . This holds for any limit point of the pre-compact sequence , and it follows from that

Since was arbitrary, this proves .

### 3.1 Pre-compactness in $\mathcal{P}_\psi(E)$ and Cramér’s condition

This section identifies an important sufficient condition for the sub-level sets of to be pre-compact subsets of