Dynamics of Bayesian Updating with Dependent Data and Misspecified Models
Abstract
Much is now known about the consistency of Bayesian updating on infinite-dimensional parameter spaces with independent or Markovian data. Necessary conditions for consistency include the prior putting enough weight on the correct neighborhoods of the data-generating distribution; various sufficient conditions further restrict the prior in ways analogous to capacity control in frequentist nonparametrics. The asymptotics of Bayesian updating with misspecified models or priors, or non-Markovian data, are far less well explored. Here I establish sufficient conditions for posterior convergence when all hypotheses are wrong, and the data have complex dependencies. The main dynamical assumption is the asymptotic equipartition (Shannon-McMillan-Breiman) property of information theory. This, along with Egorov's Theorem on uniform convergence, lets me build a sieve-like structure for the prior. The main statistical assumption, also a form of capacity control, concerns the compatibility of the prior and the data-generating process, controlling the fluctuations in the log-likelihood when averaged over the sieve-like sets. In addition to posterior convergence, I derive a kind of large deviations principle for the posterior measure, extending in some cases to rates of convergence, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between these results and the replicator dynamics of evolutionary theory.
arXiv:0901.1342
AMS subject classifications: Primary 62C10, 62G20, 62M09; secondary 60F10, 62M05, 92D15, 94A17.
Keywords: asymptotic equipartition, Bayesian consistency, Bayesian nonparametrics, Egorov's theorem, large deviations, posterior convergence, replicator dynamics, sofic systems.
1 Introduction
The problem of the convergence and frequentist consistency of Bayesian learning goes as follows. We encounter observations $X_1, X_2, \ldots$, which we would like to predict by means of a family of models or hypotheses $F_\theta$ (indexed by $\theta \in \Theta$). We begin with a prior probability distribution $\Pi_0$ over $\Theta$, and update this using Bayes's rule, so that our distribution after seeing $x_1, \ldots, x_n$ is $\Pi_n$. If the observations come from a stochastic process with infinite-dimensional distribution $P$, when does $\Pi_n$ converge almost surely? What is the rate of convergence? Under what conditions will Bayesian learning be consistent, so that $\Pi_n$ doesn't just converge but its limit concentrates on the truth $P$?
Since the Bayesian estimate is really the whole posterior probability distribution $\Pi_n$, rather than a point or set in $\Theta$, consistency becomes concentration of $\Pi_n$ around $P$. One defines some sufficiently strong set of neighborhoods of $P$ in the space of process distributions, and says that Bayesian learning is consistent when, for each such neighborhood $N$, $\Pi_n(N) \to 1$ almost surely. When this holds, the posterior increasingly approximates a degenerate distribution centered at the truth.
The greatest importance of these problems, perhaps, is their bearing on the objectivity and reliability of Bayesian inference; consistency proofs and convergence rates are, as it were, frequentist licenses for Bayesian practices. Moreover, if Bayesian learners starting from different priors converge rapidly on the same posterior distribution, there is less reason to worry about the subjective or arbitrary element in the choice of priors. (Such "merger of opinion" results [BlackwellDubinsmerging] are also important in economics and game theory [Chamleyherds].) Recent years have seen considerable work on these problems, especially in the nonparametric setting where the model space is infinite-dimensional [GhoshRamamoorthi].
Pioneering work by Doob [Doobbayesianconsistency], using elegant martingale arguments, established that when any consistent estimator exists, and the true distribution lies in the support of the prior, the set of sample paths on which the Bayesian learner fails to converge to the truth has prior probability zero. (See [ChoiRamamoorthiposteriorconsistency] and [Bayesianconsistencyforstationarymodels] for extensions of this result to non-IID settings, and also the discussion in [Schervishtheoryofstats, EarmanonBayes].) This is not, however, totally reassuring, since the truth generally also has prior probability zero, and it would be unfortunate if these two measure-zero sets should happen to coincide. Indeed, Diaconis and Freedman established that the consistency of Bayesian inference depends crucially on the choice of prior, and that even very natural priors can lead to inconsistency (see [DiaconisFreedman1986] and references therein).
Subsequent work, following a path established by Schwartz [SchwartzonBayesprocedures], has shown that, no matter what the true data-generating distribution $P$, Bayesian updating converges along almost-all sample paths, provided that (a) $P$ is contained in the support of the prior, (b) every Kullback-Leibler neighborhood of $P$ has some positive prior probability (the "Kullback-Leibler property"), and (c) certain restrictions hold on the prior, amounting to versions of capacity control, as in the method of sieves or structural risk minimization. These contributions also make (d) certain dynamical assumptions about the data-generating process, most often that it is IID [BarronSchervishWasserman, Ghosaletalconsistencyissues, Walkernewapproachestobayesianconsistency] (in this setting, [GhosalGhoshvanderVaart] and [ShenWassermanrates] in particular consider convergence rates), independent non-identically distributed [ChoiRamamoorthiposteriorconsistency, GhosalvanderVaartposteriorconvergencenoniid], or, in some cases, Markovian [GhosalTangbayesianconsistencyforMarkov, GhosalvanderVaartposteriorconvergencenoniid]; [ChoudhuriGhosalRoybayesianspectra] and [RoyGhosalRosenbergersequentialbayes] work with spectral density estimation and sequential analysis, respectively, again exploiting specific dynamical properties.
For misspecified models, that is, settings where (a) above fails, important early results were obtained by Berk [Berklimitingbehaviorofposterior, Berkconsistency] for IID data, albeit under rather strong restrictions on likelihood functions and parameter spaces, showing that the posterior distribution concentrates on an "asymptotic carrier", consisting of the hypotheses which are the best available approximations, in the Kullback-Leibler sense, to $P$ within the support of the prior. More recently, [KleijnvanderVaart], [ZhangfromepsilontoKL] and [Lianratesundermisspecification] have dealt with the convergence of nonparametric Bayesian estimation for IID data when $P$ is not in the support of the prior, obtaining results similar to Berk's in far more general settings, extending in some situations to rates of convergence. All of this work, however, relies on the dynamical assumption of an IID data source.
This paper gives sufficient conditions for the convergence of the posterior without assuming (a), and substantially weakening (c) and (d). Even if one uses nonparametric models, cases where one knows that the true data-generating process is exactly represented by one of the hypotheses in the model class are scarce. Moreover, while IID data can be produced, with some trouble and expense, in the laboratory or in a well-conducted survey, in many applications the data are not just heterogeneous and dependent, but their heterogeneity and dependence is precisely what is of interest. This raises the question of what Bayesian updating does when the truth is not contained in the support of the prior, and observations have complicated dependencies.
To answer this question, I first weaken the dynamical assumptions to the asymptotic equipartition property (Shannon-McMillan-Breiman theorem) of information theory, i.e., for each hypothesis $\theta$, the log-likelihood per unit time converges almost surely. The limiting log-likelihood per unit time is basically the negative of the growth rate of the Kullback-Leibler divergence between $P$ and $F_\theta$, written $h(\theta)$ below. As observations accumulate, areas of $\Theta$ where $h(\theta)$ exceeds its essential infimum tend to lose posterior probability, which concentrates in divergence-minimizing regions. Some additional conditions on the prior distribution are needed to prevent it from putting too much weight initially on hypotheses with high divergence rates but slow convergence of the log-likelihood. As the latter assumptions are strengthened, more and more can be said about the convergence of the posterior.
Using the weakest set of conditions (Assumptions 1–3), the long-run exponential growth rate of the posterior density at $\theta$ cannot exceed the difference between the lowest divergence rate attainable within the support of the prior and the divergence rate at $\theta$ (Theorem 1). Adding Assumptions 4–6 to provide better control over the integrated or marginal likelihood establishes (Theorem 2) that the long-run growth rate of the posterior density in fact attains this difference. One more assumption (7) then lets us conclude (Theorem 3) that the posterior distribution converges, in the sense that, for any set of hypotheses $A$, the posterior probability $\Pi_n(A) \to 0$ unless the essential infimum of the divergence rate over $A$ equals that over all of $\Theta$. In fact, we then have a kind of large deviations principle for the posterior measure (Theorem 4), as well as a bound on the generalization ability of the posterior predictive distribution (Theorem 5). Convergence rates for the posterior (Theorem 6) follow from the combination of the large deviations result with an extra condition related to Assumption 6. Importantly, Assumptions 4–7, and so the results following from them, involve both the prior distribution and the data-generating process, and require the former to be adapted to the latter. Under misspecification, it does not seem to be possible to guarantee posterior convergence by conditions on the prior alone, at least not with the techniques used here.
For the convenience of the reader, the development uses the usual statistical vocabulary and machinery. In addition to the asymptotic equipartition property, the main technical tools are, on the one hand, Egorov's theorem from basic measure theory, which is used to construct a sieve-like sequence of sets on which log-likelihood ratios converge uniformly, and, on the other, Assumption 6, bounding how long averages over these sets can remain far from their long-run limits. The latter assumption is crucial, novel, and, in its present form, awkward to check; I take up its relation to more familiar assumptions in the discussion. It may be of interest, however, that the results were first found via an apparently novel analogy between Bayesian updating and the "replicator equation" of evolutionary dynamics, which is a formalization of the Darwinian idea of natural selection. Individual hypotheses play the role of distinct replicators in a population, the posterior distribution being the population distribution over replicators and fitness being proportional to likelihood. Appendix A gives details.
2 Preliminaries and Notation
Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $X_1, X_2, \ldots$, for short $X_1^\infty$, be a sequence of random variables, taking values in the measurable space $(\Xi, \mathcal{X})$, whose infinite-dimensional distribution is $P$. The natural filtration of this process is $\sigma(X_1^n)$. The only dynamical properties are those required for the Shannon-McMillan-Breiman theorem (Assumption 3); more specific assumptions, such as $P$ being a product measure, Markovian, exchangeable, etc., are not required. Unless otherwise noted, all probabilities are taken with respect to $P$, and $E[\cdot]$ always means expectation under that distribution.
Statistical hypotheses, i.e., distributions of processes adapted to $\sigma(X_1^n)$, are denoted by $F_\theta$, the index $\theta$ taking values in the hypothesis space, a measurable space $(\Theta, \mathcal{T})$, generally infinite-dimensional. For convenience, assume that $P$ and all the $F_\theta$ are dominated by a common reference measure, with respective densities $p$ and $f_\theta$. I do not assume that $P$ coincides with any $F_\theta$, still less that it lies in the support of the prior — i.e., quite possibly all of the available hypotheses are false.
We will study the evolution of a sequence of probability measures $\Pi_n$ on $(\Theta, \mathcal{T})$, starting with a non-random prior measure $\Pi_0$. (A filtration on $\Theta$ is not needed; the measures change but not the $\sigma$-field $\mathcal{T}$.) Assume all $\Pi_n$ are absolutely continuous with respect to a common reference measure, with densities $\pi_n$. Expectations with respect to these measures will be written either as explicit integrals or de Finetti style, $\Pi_n(f) \equiv \int_\Theta f(\theta)\, d\Pi_n(\theta)$; when $A$ is a set, $\Pi_n(A) \equiv \Pi_n(\mathbf{1}_A)$.
Let $f_\theta(x_n \mid x_1^{n-1})$ be the conditional likelihood of $x_n$ under $\theta$, i.e., $f_\theta(x_n \mid x_1^{n-1}) = f_\theta(x_1^n)/f_\theta(x_1^{n-1})$, with $f_\theta(x_1 \mid x_1^0) = f_\theta(x_1)$. The integrated conditional likelihood is $\int_\Theta f_\theta(x_n \mid x_1^{n-1})\, d\Pi_{n-1}(\theta)$. Bayesian updating of course means that, for any $A \in \mathcal{T}$,

$$\Pi_n(A) = \frac{\int_A f_\theta(x_n \mid x_1^{n-1})\, d\Pi_{n-1}(\theta)}{\int_\Theta f_\theta(x_n \mid x_1^{n-1})\, d\Pi_{n-1}(\theta)}$$

or, in terms of the density,

$$\pi_n(\theta) = \pi_{n-1}(\theta)\, \frac{f_\theta(x_n \mid x_1^{n-1})}{\int_\Theta f_{\theta'}(x_n \mid x_1^{n-1})\, d\Pi_{n-1}(\theta')}.$$

It will also be convenient to express Bayesian updating in terms of the prior and the total likelihood:

$$\Pi_n(A) = \frac{\int_A R_n(\theta)\, d\Pi_0(\theta)}{\int_\Theta R_n(\theta)\, d\Pi_0(\theta)}$$

where $R_n(\theta) \equiv f_\theta(x_1^n)/p(x_1^n)$ is the ratio of model likelihood to true likelihood. (Note that $p(x_1^n) > 0$ for all $n$, $P$-a.s.) Similarly,

$$\pi_n(\theta) = \pi_0(\theta)\, \frac{R_n(\theta)}{\int_\Theta R_n(\theta')\, d\Pi_0(\theta')}.$$
The one-step-ahead predictive distribution of the hypothesis $\theta$ is given by $F_\theta^n(B) \equiv F_\theta(X_{n+1} \in B \mid X_1^n = x_1^n)$, with the convention that $F_\theta^0$ gives the marginal distribution of the first observation. Similarly, let $P^n(B) \equiv P(X_{n+1} \in B \mid X_1^n = x_1^n)$; this is the best probabilistic prediction we could make, did we but know $P$ [Knightpredictiveview]. The posterior predictive distribution is given by mixing the individual predictive distributions with weights given by the posterior:

$$F_{\Pi_n}^n(B) \equiv \int_\Theta F_\theta^n(B)\, d\Pi_n(\theta).$$
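To make the total-likelihood form of Bayesian updating concrete, here is a minimal numerical sketch (my illustration, not part of the paper): a finite grid of Bernoulli hypotheses under a uniform prior, with the posterior computed by accumulating log-likelihoods and renormalizing. The grid, the data source, and all names are assumptions of the example.

```python
import math
import random

def log_posterior_update(log_w, log_lik_step):
    """One Bayes step in log space: add per-observation log-likelihoods
    to the current log-weights, then renormalize with log-sum-exp."""
    w = [lw + ll for lw, ll in zip(log_w, log_lik_step)]
    m = max(w)
    z = m + math.log(sum(math.exp(v - m) for v in w))
    return [v - z for v in w]

random.seed(0)
thetas = [0.1 * k for k in range(1, 10)]             # Bernoulli(q) hypotheses
log_w = [math.log(1.0 / len(thetas))] * len(thetas)  # uniform prior
for _ in range(500):
    x = 1 if random.random() < 0.5 else 0            # data from Bernoulli(0.5)
    step = [math.log(q if x == 1 else 1 - q) for q in thetas]
    log_w = log_posterior_update(log_w, step)

post = [math.exp(v) for v in log_w]
print(round(sum(post), 6))   # → 1.0
```

Because only likelihood ratios between hypotheses matter, the true density $p(x_1^n)$ cancels in the normalization, and the same code covers dependent data: replace the per-step Bernoulli term with any conditional likelihood $f_\theta(x_n \mid x_1^{n-1})$.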
Remark on the topology of $\Theta$ and of process distributions
The hope in studying posterior convergence is to show that, as $n$ grows, with higher and higher probability, $\Pi_n$ concentrates more and more on sets which come closer and closer to $P$. The tricky part here is "closer and closer": points in $\Theta$ represent infinite-dimensional stochastic process distributions, and the topology of such spaces is somewhat odd, and irritatingly abrupt, at least under the more common distances. Any two ergodic measures are either equal or have completely disjoint supports [Grayergodicproperties], so that the Kullback-Leibler divergence between distinct ergodic processes is always infinity (in both directions), and the total variation and Hellinger distances are likewise maximal. Most previous work on posterior consistency has restricted itself to models where the infinite-dimensional process distributions are formed by products of fixed-dimensional base distributions (IID, Markov, etc.), and in effect transferred the usual metrics' topologies from these finite-dimensional distributions to the processes. It is possible to define metrics for general stochastic processes [Grayergodicproperties], and if readers like they may imagine that $\mathcal{T}$ is the Borel $\sigma$-field under some such metric. This is not necessary for the results presented here, however.
2.1 Example
The following example will be used to illustrate the assumptions (§2.2.1 and Appendix B), and, later, the conclusions (§3.6).
The data-generating process $P$ is a stationary and ergodic measure on the space of one-sided infinite binary sequences, i.e., $\Xi = \{0, 1\}$, with $\mathcal{X}$ the usual product $\sigma$-field. The measure is naturally represented as a function of a two-state Markov chain $S_t$, with $S_t \in \{1, 2\}$. The transition matrix is

$$T = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}$$

so that the invariant distribution puts probability $2/3$ on state 1 and probability $1/3$ on state 2; take $S_1$ to be distributed accordingly. The observed process is a binary function of the latent state transitions, $X_t = 0$ if $S_t = S_{t+1} = 1$ and $X_t = 1$ otherwise. Figure 1 depicts the transition and observation structure. Qualitatively, $X$ consists of blocks of 1s of even length, separated by blocks of 0s of arbitrary length. Since the joint process $(S_t, X_t)$ is a stationary and ergodic Markov chain, $X$ is also stationary, ergodic and mixing.
This stochastic process comes from symbolic dynamics [LindMarcus, Kitchens], where it is known as the "even process", and serves as a basic example of the class of sofic processes [Weiss1973], which have finite Markovian representations, as in Figure 1, but are not Markov at any finite order. (If the last $k$ observations are all 1s, for any finite $k$, the corresponding hidden states must have alternated between one and two, but whether the current state is one or two, and thus the distribution of the next observation, cannot be determined from the length-$k$ history alone.) More exactly [KitchensTuncel], sofic systems or "finitary measures" are ones which are images of Markov chains under factor maps, and strictly sofic systems, such as the even process, are sofic systems which are not themselves Markov chains of any order. Despite their simplicity, these models arise naturally when studying the time series of chaotic dynamical systems [BadiiPoliti, JPCsemantics, CMPPSS, DawFinneyTracysymbolicdynamics], as well as problems in statistical mechanics [PerryBinderfinitestatcompl] and crystallography [Varninfiniterangeorder].
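As a concrete check on the description above, the following sketch (my illustration; the state-1 self-loop probability of 1/2 and the non-stationary start in state 1 are simplifying assumptions of the sketch) simulates the hidden chain and verifies that every completed block of 1s has even length.

```python
import random

def simulate_even_process(n, p_stay=0.5, seed=1):
    """Sample X_1..X_n from the even process: hidden chain on {1, 2};
    from state 1 stay (emit 0) w.p. p_stay, else go to 2 (emit 1);
    from state 2 return to 1 (emit 1) with probability 1."""
    rng = random.Random(seed)
    s, xs = 1, []
    for _ in range(n):
        if s == 1:
            if rng.random() < p_stay:
                xs.append(0)            # 1 -> 1 emits 0
            else:
                xs.append(1); s = 2     # 1 -> 2 emits 1
        else:
            xs.append(1); s = 1         # 2 -> 1 emits 1
    return xs

x = simulate_even_process(10000)
# collect completed maximal blocks of 1s (a possibly truncated final
# block is excluded, since the sample may end mid-block)
runs, cur = [], 0
for b in x:
    if b:
        cur += 1
    elif cur:
        runs.append(cur); cur = 0
print(all(r % 2 == 0 for r in runs))   # → True
```

Each visit to the hidden state 2 contributes exactly two 1s (one on entry, one on exit), which is why completed blocks of 1s always have even length.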
Let $\Theta_k$ be the space of all binary Markov chains of order $k$ with strictly positive transition probabilities and their respective stationary distributions; each $\Theta_k$ has dimension $2^k$. (Allowing some transition probabilities to be zero creates uninteresting technical difficulties.) Since each hypothesis is equivalent to a function from the $2^k$ possible length-$k$ histories to conditional probabilities, we can give $\Theta_k$ the topology of pointwise convergence of functions, and the corresponding Borel $\sigma$-field. We will take $\Theta = \bigcup_{k=1}^{\infty} \Theta_k$, identifying each $\Theta_k$ with the appropriate subset of $\Theta_{k+1}$. Thus $\Theta$ consists of all strictly-positive stationary binary Markov chains, of whatever order, and is infinite-dimensional.
As for the prior $\Pi_0$, it will be specified in more detail below (§2.2.1). At the very least, however, it needs to have the "Kullback-Leibler rate property", i.e., to give positive probability to every "$\epsilon$-neighborhood" around every $\theta \in \Theta$, i.e., the set of hypotheses whose Kullback-Leibler divergence from $F_\theta$ grows no faster than $n\epsilon$:

$$N_\epsilon(\theta) \equiv \left\{ \theta' : \lim_{n\to\infty} \frac{1}{n} E_{F_\theta}\!\left[ \log \frac{f_\theta(X_1^n)}{f_{\theta'}(X_1^n)} \right] \le \epsilon \right\}.$$

(The limit exists for all combinations of $\theta$ and $\theta'$ [Grayentropy].)
This example is simple, but it is also beyond the scope of existing work on Bayesian convergence in several ways. First, the data-generating process is not even Markov. Second, $P \notin \Theta$, so all the hypotheses are wrong, and the truth is certainly not in the support of the prior. ($P$ can however be approximated arbitrarily closely, in various process metrics, by distributions from $\Theta$.) Third, because $P$ is ergodic, and ergodic distributions are extreme points in the space of stationary distributions [Dynkinsuffstatsandextremepoints], it cannot be represented as a mixture of distributions in $\Theta$. This means that the Doob-style theorem of Ref. [Bayesianconsistencyforstationarymodels] does not apply, and even the subjective certainty of convergence is not assured. The results of Refs. [KleijnvanderVaart, ZhangfromepsilontoKL, Berklimitingbehaviorofposterior, Berkconsistency] on misspecified models do not hold because the data are dependent. To be as concrete and explicit as possible, the analysis here will focus on the even process, but only the constants would change if $P$ were any other strictly sofic process. Much of it would apply even if $P$ were a stochastic context-free language or pushdown automaton [Charniakstatisticallanguagelearning], where in effect the number of hidden states is infinite, though some of the details in Appendix B would change.
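The misspecification is quantifiable. The sketch below (again my illustration, under the same even-process parameterization assumptions as before, with a maximum-likelihood first-order chain standing in for a generic hypothesis) computes the exact log-likelihood of a long sample under the truth by the forward algorithm on the two-state representation, the log-likelihood under the fitted chain, and reports the per-symbol difference, an estimate of the Kullback-Leibler divergence rate of the fitted chain from $P$, which is strictly positive.

```python
import math
import random

# Sample the even process: hidden chain on {1, 2}; from state 1 stay
# (emit 0) w.p. 1/2, else go to 2 (emit 1); from state 2 return (emit 1).
rng = random.Random(42)
s, x = 1, []
for _ in range(100000):
    if s == 1 and rng.random() < 0.5:
        x.append(0)
    elif s == 1:
        x.append(1); s = 2
    else:
        x.append(1); s = 1

# Exact log-likelihood of x under the truth: forward algorithm on the
# two-state representation, started from the invariant law (2/3, 1/3).
alpha, logp = {1: 2 / 3, 2: 1 / 3}, 0.0
for b in x:
    new = {1: 0.0, 2: 0.0}
    if b == 0:
        new[1] = alpha[1] * 0.5            # 1 -> 1 emits 0
    else:
        new[2] = alpha[1] * 0.5            # 1 -> 2 emits 1
        new[1] = alpha[2] * 1.0            # 2 -> 1 emits 1
    z = new[1] + new[2]
    logp += math.log(z)
    alpha = {k: v / z for k, v in new.items()}

# Log-likelihood under the maximum-likelihood first-order Markov chain.
counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
for a, b in zip(x, x[1:]):
    counts[(a, b)] += 1
logf = sum(
    c * math.log(c / (counts[(a, 0)] + counts[(a, 1)]))
    for (a, _), c in counts.items() if c > 0
)

h_est = (logp - logf) / len(x)  # ≈ divergence rate of the fitted chain
print(h_est > 0)                # → True
```

Even the best first-order chain pays a strictly positive per-symbol log-likelihood penalty relative to the truth, and the same holds for any fixed finite order, which is the sense in which all hypotheses in $\Theta$ are wrong here.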
Ref. [OrnsteinWeisshowsamplingrevealsaprocess] describes a nonparametric procedure which will adaptively learn to predict a class of discrete stochastic processes which includes the even process. Ref. [CSSRforUAI] introduces a frequentist algorithm which consistently reconstructs the hidden-state representation of sofic processes, including the even process. Ref. [StrelioffJPCHublerMarkovBayes] considers Bayesian estimation of the even process, using Dirichlet priors for finite-order Markov chains, and employing Bayes factors to decide which order of chain to use for prediction.
2.2 Assumptions
The needed assumptions have to do with the dynamical properties of the data generating process , and with how well the dynamics meshes both with the class of hypotheses and with the prior distribution over those hypotheses.
Assumption 1
The likelihood ratio $R_n(\theta)$ is jointly measurable in $x_1^n$ and $\theta$, for all $n$.
The next two assumptions actually need only hold for $\Pi_0$-almost-all $\theta$. But this adds more measure-0 caveats to the results, and it is hard to find a natural example where the weakening would help.
Assumption 2
For every $\theta \in \Theta$, the Kullback-Leibler divergence rate from $P$,

$$h(\theta) \equiv \lim_{n\to\infty} \frac{1}{n} E\!\left[ \log \frac{p(X_1^n)}{f_\theta(X_1^n)} \right],$$

exists (possibly being infinite) and is $\mathcal{T}$-measurable.
As mentioned, any two distinct ergodic measures are mutually singular, so there is a consistent test which can separate them. ([RyabkoRyabkotestingergodic] constructs an explicit but not necessarily optimal test.) One interpretation of the divergence rate [Grayentropy] is that it measures the maximum exponential rate at which the power of such tests approaches 1, with $h(\theta) = 0$ and $h(\theta) = \infty$ indicating sub- and supra-exponential convergence, respectively.
Assumption 3
For each $\theta \in \Theta$, the generalized or relative asymptotic equipartition property holds, and so

$$\lim_{n\to\infty} \frac{1}{n} \log R_n(\theta) = -h(\theta) \tag{1}$$

with probability 1.
Refs. [AlgoetandCoveronAEP, Grayentropy] give sufficient, but not necessary, conditions for Assumption 3 to hold for a given $\theta$. The ordinary, non-relative asymptotic equipartition property, also known as the Shannon-McMillan-Breiman theorem, is that $-\frac{1}{n}\log p(X_1^n) \to h_P$ a.s., where $h_P$ is the entropy rate of the data-generating process. (See [Grayentropy].) If this holds and $h_P$ is finite, one could rephrase Assumption 3 as $\frac{1}{n}\log f_\theta(X_1^n) \to -(h_P + h(\theta))$ a.s., and state results in terms of the likelihood rather than the likelihood ratio. (Cf. [AndyFraseronHMMs, ch. 5].) However, there are otherwise-well-behaved processes for which $h_P$ is infinite, at least in the usual choice of reference measure, so I will restrict myself to likelihood ratios.
The meaning of Assumption 3 is that, relative to the true distribution, the likelihood of each $\theta$ goes to zero exponentially, the rate being the Kullback-Leibler divergence rate $h(\theta)$. Roughly speaking, an integral of exponentially-shrinking quantities will tend to be dominated by the integrand with the slowest rate of decay. This suggests that the posterior probability of a set $A$ depends on the smallest divergence rate which can be attained at a point of prior support within $A$. Thus, adapting notation from large deviations theory, define

$$h(A) \equiv \operatorname*{ess\,inf}_{\theta \in A} h(\theta)$$

where here and throughout $\operatorname{ess\,inf}$ is the essential infimum with respect to $\Pi_0$, i.e., the greatest lower bound which holds with $\Pi_0$-probability 1.
Our further assumptions are those needed for the "roughly speaking" statements of the previous paragraph to be true, so that, for reasonable sets $A$,

$$\frac{1}{n} \log \Pi_n(A) \to -(h(A) - h(\Theta)).$$

Let $J(\theta) \equiv h(\theta) - h(\Theta)$, and, for sets, $J(A) \equiv h(A) - h(\Theta)$.
Assumption 4
$h(\Theta) < \infty$, i.e., the prior gives positive weight to sets of hypotheses with finite divergence rates.
If this assumption fails, then every hypothesis in the support of the prior doesn’t just diverge from the true datagenerating distribution, it diverges so rapidly that the error rate of a test against the latter distribution goes to zero faster than any exponential. (One way this can happen is if every hypothesis has a finitedimensional distribution assigning probability zero to some event of positive probability.) The methods of this paper seem to be of no use in the face of such extreme misspecification.
Our first substantial assumption is that the prior distribution does not give too much weight to parts of $\Theta$ where the log-likelihood converges badly.
Assumption 5
There exists a sequence of sets $\mathcal{G}_n \subseteq \Theta$ such that

1. $\Pi_0(\mathcal{G}_n) \ge 1 - \alpha e^{-\beta n}$, for some $\alpha > 0$, $\beta > 2h(\Theta)$;

2. the convergence of Eq. 1 is uniform in $\theta$ over $\mathcal{G}_n \setminus \{\theta : h(\theta) = \infty\}$;

3. $h(\mathcal{G}_n) \to h(\Theta)$.
Comment 1: An analogy with the method of sieves [GemanandHwangonmethodofsieves] may clarify the meaning of the assumption. If we were constrained to some fixed set $\mathcal{G}$, the uniform convergence in the second part of the assumption would make the convergence of the posterior distribution fairly straightforward. Now imagine that the constraint set is gradually relaxed, so that at time $n$ the posterior is confined to $\mathcal{G}_n$, which grows so slowly that convergence is preserved. (Assumption 6 below is, in essence, about the relaxation being sufficiently slow.) The theorems work by showing that the behavior of the posterior distribution on the full space $\Theta$ is dominated by its behavior on this "sieve".
Comment 2: Recall that by Egorov's theorem [Kallenbergmodprob, Lemma 1.36, p. 18], if a sequence of finite, measurable functions converges pointwise to a finite, measurable function for almost-all $\theta$, then for each $\epsilon > 0$, there is a (possibly empty) set $B_\epsilon$ with $\Pi_0(B_\epsilon) \le \epsilon$ such that the convergence is uniform on the complement of $B_\epsilon$. Thus the first two parts of the assumption really follow for free from the measurability in $\theta$ of likelihoods and divergence rates. (That $\beta$ needs to be at least $2h(\Theta)$ becomes apparent in the proof of Lemma 5, but that could always be arranged.) The extra content comes in the third part of the assumption, which could fail if the lowest-divergence hypotheses were also the ones where the convergence was slowest, consistently falling into the bad sets allowed by Egorov's theorem.
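A toy numerical illustration of the Egorov phenomenon invoked here (my example, not the paper's): $f_n(x) = x^n \to 0$ pointwise on $[0, 1)$, but not uniformly, since the supremum stays near 1 for every $n$; removing a set of Lebesgue measure $\delta$ next to 1 makes the convergence uniform on what remains.

```python
delta = 0.01
grid = [i / 10000 for i in range(10000)]       # points of [0, 1)
good = [p for p in grid if p <= 1 - delta]     # complement of the bad set

def sup_fn(points, n):
    """Supremum of f_n(x) = x**n over the given points."""
    return max(p ** n for p in points)

print(sup_fn(grid, 2000) > 0.8)    # → True: no uniform convergence on [0, 1)
print(sup_fn(good, 2000) < 1e-8)   # → True: uniform on [0, 1 - delta]
```

The "bad set" $(1 - \delta, 1)$ plays the role of $B_\epsilon$ above: convergence is arbitrarily slow inside it, but discarding it leaves a set on which the convergence is uniform.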
For each measurable $A \subseteq \Theta$, for every $\delta > 0$, there exists a random natural number $\tau(A, \delta)$ such that

$$\frac{1}{n} \log \int_A R_n(\theta)\, d\Pi_0(\theta) \le \delta - h(A)$$

for all $n > \tau(A, \delta)$, provided the essential infimum $h(A)$ is finite. We need this random last-entry time to state the next assumption.
Assumption 6
The sets $\mathcal{G}_n$ of the previous assumption can be chosen so that, for every $\delta > 0$, the inequality $n \ge \tau(\mathcal{G}_n, \delta)$ holds a.s. for all sufficiently large $n$.
The fraction of the prior probability mass outside of $\mathcal{G}_n$ is exponentially small in $n$, with the decay rate large enough that (Lemma 5) the posterior probability mass outside $\mathcal{G}_n$ also goes to zero. Using the analogy to the sieve again, the meaning of the assumption is that the convergence of the log-likelihood ratio is sufficiently fast, and the relaxation of the sieve is sufficiently slow, that, at least eventually, every set $\mathcal{G}_n$ has converged by time $n$, when we start using it.
To show convergence of the posterior measure, we need to be able to control the convergence of the log-likelihood on sets smaller than the whole parameter space.
Assumption 7
The sets $\mathcal{G}_n$ of the previous two assumptions can be chosen so that, for any set $A$ with $\Pi_0(A) > 0$, $h(\mathcal{G}_n \cap A) \to h(A)$.
Assumption 7 could be replaced by the logically-weaker assumption that for each set $A$ with $\Pi_0(A) > 0$, there exists a sequence of sets $\mathcal{G}_n^A$ satisfying the equivalents of Assumptions 5 and 6 for the prior measure restricted to $A$. Since the most straightforward way to check such an assumption would be to verify Assumption 7 as stated, the extra generality does not seem worth it.
2.2.1 Verification of Assumptions for the Example
Since every $F_\theta$ is a finite-order Markov chain, and $P$ is stationary and ergodic, Assumption 1 is unproblematic, while Assumptions 2 and 3 hold by virtue of [AlgoetandCoveronAEP].
It is easy to check that $h(\Theta_k) > 0$ for each $k$. (The infimum is not in general attained by any $\theta \in \Theta_k$, though it could be if the chains were allowed to have some transition probabilities equal to zero.) The infimum over $\Theta$ as a whole, however, is zero. Also, $h(\theta) < \infty$ everywhere (because none of the hypotheses' transition probabilities are zero), so the set of $\theta$ with infinite divergence rates is empty, disposing of Assumption 4.
Verifying the remaining assumptions means building a sequence of increasing subsets of $\Theta$ on which the convergence of $\frac{1}{n}\log R_n(\theta)$ is uniform and sufficiently rapid, and ensuring that the prior probability of these sets grows fast enough. This will be done by exploiting some finite-sample deviation bounds for the even process, which in turn rest on its mixing properties and the method of types. Details are deferred to Appendix B. The upshot is that the sets $\mathcal{G}_n$ consist of chains whose order is at most $k(n)$, growing slowly with $n$, and whose transition probabilities are bounded away from zero and one, the bounds relaxing as $n$ grows; the precise constants are given in Appendix B. (With a different strictly sofic process $P$, only the constants would change.) The exponential rate $\beta$ for the prior probability of the complements $\mathcal{G}_n^c$ can be chosen to be arbitrarily small, which suffices here, since $h(\Theta) = 0$.
3 Results
I first give the theorems here, without proof. The proofs, in §§3.1–3.5, are accompanied by restatements of the theorems, for the reader’s convenience.
There are six theorems. The first upper-bounds the growth rate of the posterior density at a given point in $\Theta$. The second matches the upper bound on the posterior density with a lower bound, together providing the growth rate for the posterior density. The third is that $\Pi_n(A) \to 0$ for any set $A$ with $h(A) > h(\Theta)$, showing that the posterior concentrates on the divergence-minimizing part of the hypothesis space. The fourth is a kind of large deviations principle for the posterior measure. The fifth bounds the asymptotic Hellinger and total variation distances between the posterior predictive distribution and the actual conditional distribution of the next observation. Finally, the sixth theorem establishes rates of convergence.
The first result uses only Assumptions 1–3. (It is not very interesting, however, unless 4 is also true.) The remaining results, however, all depend on finer control of the integrated likelihood, and so finer control of the prior, as embodied in Assumptions 5–6. More exactly, those additional assumptions concern the interplay between the prior and the data-generating process, restricting the amount of prior probability which can be given to hypotheses whose log-likelihoods converge excessively slowly under $P$. I build to the first result in the next subsection, then turn to the control of the integrated likelihood and its consequences in the next three subsections, and then consider how these results apply to the example.
Theorem 4
Under the conditions of Theorem 3, if $A$ is such that

$$h(\mathcal{G}_n \cap A) \to h(A),$$

then

$$\lim_{n\to\infty} \frac{1}{n} \log \Pi_n(A) = -(h(A) - h(\Theta)).$$

In particular, this holds whenever $\Pi_0(A) > 0$, or when $A \subseteq \mathcal{G}_k$ for some $k$.
Theorem 5
Theorem 6
3.1 Upper Bound on the Posterior Density
The primary result of this section is a pointwise upper bound on the growth rate of the posterior density. To establish it, I use some subsidiary lemmas, which also recur in later proofs. Lemma 2 extends the almost-sure convergence of the likelihood (Assumption 3) from holding pointwise in $\theta$ to holding simultaneously for almost all $\theta$ on a (possibly random) set of $P$-measure 1. Lemma 3 shows that the prior-weighted likelihood ratio, $\int_\Theta R_n(\theta)\, d\Pi_0(\theta)$, tends to be at least $e^{-nh(\Theta)}$, to within exponentially small factors. (Both assertions are made more precise in the lemmas themselves.)
I begin with a proposition about exchanging the order of universal quantifiers (with almostsure caveats).
Lemma 1
Let $Q \subseteq \Omega \times \Theta$ be jointly measurable, with sections $Q_\omega \equiv \{\theta : (\omega, \theta) \in Q\}$ and $Q_\theta \equiv \{\omega : (\omega, \theta) \in Q\}$. If, for some probability measure $P$ on $\Omega$,

$$P(Q_\theta) = 1 \text{ for all } \theta \in \Theta, \tag{2}$$

then for any probability measure $\Pi$ on $\Theta$

$$P\left(\{\omega : \Pi(Q_\omega) = 1\}\right) = 1. \tag{3}$$

In words, if, for all $\theta$, some property holds a.s., then a.s. the property holds simultaneously for almost all $\theta$.
Proof: Since $Q$ is measurable, for all $\omega$ and $\theta$, the sections are measurable, and the measures of the sections, $P(Q_\theta)$ and $\Pi(Q_\omega)$, are measurable functions of $\theta$ and $\omega$, respectively. Using Fubini's theorem,

$$\int_\Theta P(Q_\theta)\, d\Pi(\theta) = \int_\Omega \Pi(Q_\omega)\, dP(\omega).$$

By hypothesis, however, $P(Q_\theta) = 1$ for all $\theta$, so the left-hand side is 1. Hence it must be the case that $\Pi(Q_\omega) = 1$ for $P$-almost-all $\omega$. (In fact, the set of $\omega$ for which this is true must be a measurable set.)
Lemma 2
With probability 1, Eq. 1 holds simultaneously for $\Pi_0$-almost-all $\theta$: there is a set $C \subseteq \Omega$ with $P(C) = 1$ such that, for every $\omega \in C$, the convergence $\frac{1}{n}\log R_n(\theta) \to -h(\theta)$ holds for all $\theta$ outside a (possibly $\omega$-dependent) set of $\Pi_0$-measure 0.
Proof: Let the set $Q$ consist of the pairs $(\omega, \theta)$ where Eq. 1 holds, i.e., for which

$$\lim_{n\to\infty} \frac{1}{n} \log R_n(\theta, \omega) = -h(\theta),$$

being explicit about the dependence of the likelihood ratio on $\omega$. Assumption 3 states that $P(Q_\theta) = 1$ for every $\theta$, so applying Lemma 1 just needs the verification that $Q$ is jointly measurable. But, by Assumptions 1 and 2, $R_n(\theta, \omega)$ is jointly measurable and $h(\theta)$ is measurable, so the set of pairs where the convergence holds is measurable. Everything then follows from the preceding lemma.
Remark: Lemma 2 generalizes Lemma 3 in [BarronSchervishWasserman]. Lemma 1 is a specialization of the quantifier-reversal lemma used in [McAllisterPACBayes] to prove PAC-Bayesian theorems for learning classifiers. Lemma 1 could be used to extend any of the results below which hold a.s. for each $\theta$ to ones which a.s. hold simultaneously for almost all $\theta$. This may seem too good to be true, like an alchemist's recipe for turning the lead of pointwise limits into the gold of uniform convergence. Fortunately or not, however, the lemma tells us nothing about the rate of convergence, and is compatible with its varying across $\Theta$ from instantaneous to arbitrarily slow, so uniform laws need stronger assumptions.
Lemma 3
For every $\epsilon > 0$, with probability 1,

$$\int_\Theta R_n(\theta)\, d\Pi_0(\theta) \ge e^{-n(h(\Theta) + \epsilon)} \tag{4}$$

for all but finitely many $n$; consequently, with probability 1,

$$\liminf_{n\to\infty} \frac{1}{n} \log \int_\Theta R_n(\theta)\, d\Pi_0(\theta) \ge -h(\Theta). \tag{5}$$
Proof: It’s enough to show that Eq. 4 holds for all in the set from the previous lemma, since that set has probability 1.
Let $N_\epsilon$ be the set of all $\theta$ in the support of $\Pi_0$ such that $h(\theta) \le h(\Theta) + \epsilon$. Since $\Pi_0(N_\epsilon) > 0$, the previous lemma tells us there exists a set of $\theta$, of full measure under $\Pi_0(\cdot \mid N_\epsilon)$, for which Eq. 1 holds.

By Assumption 3,

$$\lim_{n\to\infty} \frac{1}{n} \log R_n(\theta) = -h(\theta)$$

and for all $\theta \in N_\epsilon$, $h(\theta) \le h(\Theta) + \epsilon$, so

$$\liminf_{n\to\infty} \frac{1}{n} \log \left( e^{n(h(\Theta) + 2\epsilon)} R_n(\theta) \right) \ge \epsilon > 0$$

a.s., for all $\theta \in N_\epsilon$. We must have $\Pi_0(N_\epsilon) > 0$, otherwise $h(\Theta)$ would not be the essential infimum, and we know from the previous lemma that Eq. 1 holds simultaneously for $\Pi_0$-almost-all $\theta$, so $e^{n(h(\Theta)+2\epsilon)} R_n(\theta) \to \infty$ for $\Pi_0$-almost-every $\theta \in N_\epsilon$. Thus, Fatou's lemma gives

$$\liminf_{n\to\infty} e^{n(h(\Theta)+2\epsilon)} \int_{N_\epsilon} R_n(\theta)\, d\Pi_0(\theta) \ge \int_{N_\epsilon} \liminf_{n\to\infty} e^{n(h(\Theta)+2\epsilon)} R_n(\theta)\, d\Pi_0(\theta) = \infty$$

so

$$\int_\Theta R_n(\theta)\, d\Pi_0(\theta) \ge \int_{N_\epsilon} R_n(\theta)\, d\Pi_0(\theta) \ge e^{-n(h(\Theta)+2\epsilon)}$$

and hence

$$\frac{1}{n} \log \int_\Theta R_n(\theta)\, d\Pi_0(\theta) \ge -h(\Theta) - 2\epsilon \tag{6}$$

for all but finitely many $n$. Since this holds for all $\epsilon > 0$, and $\epsilon$ was arbitrary, Equation 6 holds a.s. and yields Eq. 4, as was to be shown. The corollary statement (5) follows immediately.
Theorem 1 (restated): Under Assumptions 1–3, for each $\theta \in \Theta$, with probability 1,

$$\limsup_{n\to\infty} \frac{1}{n} \log \frac{\pi_n(\theta)}{\pi_0(\theta)} \le h(\Theta) - h(\theta).$$

Proof: As remarked,

$$\pi_n(\theta) = \pi_0(\theta)\, \frac{R_n(\theta)}{\int_\Theta R_n(\theta')\, d\Pi_0(\theta')}$$

so

$$\frac{1}{n} \log \frac{\pi_n(\theta)}{\pi_0(\theta)} = \frac{1}{n} \log R_n(\theta) - \frac{1}{n} \log \int_\Theta R_n(\theta')\, d\Pi_0(\theta').$$

By Assumption 3, for each $\theta$ and each $\epsilon > 0$, it's almost sure that

$$\frac{1}{n} \log R_n(\theta) \le -h(\theta) + \epsilon$$

for all sufficiently large $n$, while by Lemma 3, it's almost sure that

$$\frac{1}{n} \log \int_\Theta R_n(\theta')\, d\Pi_0(\theta') \ge -h(\Theta) - \epsilon$$

for all sufficiently large $n$. Hence, with probability 1,

$$\frac{1}{n} \log \frac{\pi_n(\theta)}{\pi_0(\theta)} \le h(\Theta) - h(\theta) + 2\epsilon$$

for all sufficiently large $n$. Hence, $\epsilon$ being arbitrary,

$$\limsup_{n\to\infty} \frac{1}{n} \log \frac{\pi_n(\theta)}{\pi_0(\theta)} \le h(\Theta) - h(\theta).$$
Lemma 3 gives a lower bound on the integrated likelihood ratio, showing that in the long run it has to be at least as big as $e^{-nh(\Theta)}$, to within exponentially small factors. (More precisely, it is significantly smaller than that on vanishingly few occasions.) It does not, however, rule out being larger. Ideally, we would be able to match this lower bound with an upper bound of the same form, since $h(\Theta)$ is the best attainable divergence rate, and, by Lemma 2, log-likelihood ratios per unit time are converging to divergence rates for almost-all $\theta$, so values of $\theta$ for which $h(\theta)$ is close to $h(\Theta)$ should come to dominate the integral $\int_\Theta R_n(\theta)\, d\Pi_0(\theta)$. It would then be fairly straightforward to show convergence of the posterior distribution.
Unfortunately, additional assumptions are required for such an upper bound, because (as remarked earlier) Lemma 2 does not give uniform convergence, merely convergence for each hypothesis separately; with a large enough space of hypotheses, the slowest pointwise convergence rates can be made arbitrarily slow. For instance, let  be the distribution on  which assigns probability 1 to endless repetitions of ; clearly, under this distribution, seeing  is almost certain. If such measures fall within the support of , they will dominate the likelihood, even though  under all but very special circumstances (e.g., ). Generically, then, the likelihood and the posterior weight of  will rapidly plummet at times . To ensure convergence of the posterior, overly flexible measures like the family of 's must either be excluded from the support of  (possibly because they are excluded from ), or be assigned so little prior weight that they do not end up dominating the integrated likelihood; otherwise the posterior will concentrate on them.
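To make the danger concrete, here is a minimal numerical sketch (all names hypothetical, and an i.i.d. fair-coin data-generating process assumed purely for illustration) of how a sequence-memorizing hypothesis dominates an honest model:

```python
import numpy as np

# A minimal sketch of why "memorizing" hypotheses dominate the likelihood.
# Assumed setup (illustrative only): data are i.i.d. fair-coin flips.

# Honest model: i.i.d. Bernoulli(1/2).  Any binary sequence of length n
# has probability 2^{-n} under it.
def loglik_honest(n):
    return n * np.log(0.5)

# "Memorizing" hypothesis: puts probability 1 on endless repetition of
# whatever sequence was actually observed, so the observed prefix has
# probability 1 (log-likelihood 0).
def loglik_memorizer(n):
    return 0.0

# Log posterior odds of the memorizer over the honest model, under equal
# prior weights, after n observations: grows linearly in n.
for n in (1, 10, 50):
    print(n, loglik_memorizer(n) - loglik_honest(n))
```

Under equal prior weights the log posterior odds grow like $n \log 2$, so unless hypotheses of this kind receive exponentially small prior weight, they swamp the integrated likelihood.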
3.2 Convergence of Posterior Density via Control of the Integrated Likelihood
The next two lemmas tell us that sets in  of exponentially small prior measure make vanishingly small contributions to the integrated likelihood, and so to the posterior. They do not require assumptions beyond those used so far, but their application will.
Lemma 4
Proof: By Markov's inequality. First, use Fubini's theorem and the chain rule for Radon-Nikodym derivatives to calculate the expectation value of the ratio.
Now apply Markov’s inequality:
for all sufficiently large . Since these probabilities are summable, the Borel-Cantelli lemma implies that, with probability 1, Eq. 8 holds for all but finitely many .
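In generic notation (reconstructed for illustration: $R_n(\theta)$ the likelihood ratio, $\pi$ the prior, $A_n$ the sets of exponentially small prior measure, $\delta > 0$ arbitrary), the three steps of this proof are:
\[
\mathbb{E}\!\left[\int_{A_n} R_n(\theta)\,\pi(d\theta)\right]
= \int_{A_n} \mathbb{E}\!\left[R_n(\theta)\right]\pi(d\theta)
\le \pi(A_n)
\]
by Fubini's theorem and the fact that a likelihood ratio has expectation at most one under the data-generating distribution; then by Markov's inequality
\[
\Pr\!\left(\int_{A_n} R_n(\theta)\,\pi(d\theta) > e^{n\delta}\,\pi(A_n)\right) \le e^{-n\delta} ;
\]
and since $\sum_n e^{-n\delta} < \infty$, the Borel-Cantelli lemma gives $\int_{A_n} R_n\,d\pi \le e^{n\delta}\,\pi(A_n)$ for all but finitely many $n$, almost surely.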
The next lemma asserts that a sequence of exponentially small sets makes a (logarithmically) negligible contribution to the posterior distribution, provided the exponent is large enough compared to .
Lemma 5
Let be as in the previous lemma. If , then
(9) 
Proof: Begin with the likelihood integrated over rather than its complement, and apply Lemmas 3 and 4: for any
(10)  
(11) 
provided is sufficiently large. If , this bound can be made to go to zero as by taking to be sufficiently small. Since
it follows that
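Schematically, and again in reconstructed generic notation ($\pi_n$ the posterior, $h(\Theta)$ the essential infimum of the divergence rate), the proof combines the lower bound of Lemma 3 on the denominator with the upper bound of Lemma 4 on the numerator:
\[
\pi_n(A_n) \;=\; \frac{\int_{A_n} R_n(\theta)\,\pi(d\theta)}{\int_{\Theta} R_n(\theta)\,\pi(d\theta)}
\;\le\; \frac{e^{n\delta}\,\pi(A_n)}{e^{-n(h(\Theta)+\epsilon)}}
\]
eventually almost surely; so if $\pi(A_n) \le \alpha e^{-n\beta}$, then $n^{-1}\log \pi_n(A_n) \le \delta + \epsilon + h(\Theta) - \beta + n^{-1}\log\alpha$, which is eventually negative when $\beta$ is large enough relative to $h(\Theta)$ and $\delta, \epsilon$ are chosen small.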
Lemma 6
Proof: Pick any . By the hypothesis of uniform convergence, there almost surely exists a such that, for all and for all , . Hence
(13)  
(14)  
(15) 
Let denote the probability measure formed by conditioning to be in the set . Then
for any integrable function . Apply this to the last term from Eq. 15.
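Explicitly, in generic notation with $G$ the conditioning set (of positive prior measure) and $\nu = \pi(\cdot \mid G)$, the identity being applied is
\[
\int_G f(\theta)\,\pi(d\theta) \;=\; \pi(G)\int_\Theta f(\theta)\,\nu(d\theta),
\qquad \nu(B) = \frac{\pi(G \cap B)}{\pi(G)} ,
\]
which splits the restricted integral into a prior-mass factor and a conditional average.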
The second term on the right-hand side is the cumulant generating function of  with respect to , which turns out (cf. [Berkconsistency]) to have exactly the right behavior as .
(16)  
Since , , and the norm of the latter will grow towards its norm as grows. Hence, for sufficiently large ,
(17)  
where the next-to-last step uses the monotonicity of  and .
Putting everything together, we have that, for any and all sufficiently large ,
(18) 
Hence the limit superior of the left-hand side is at most .
Proof: By Lemma 5,
implying that
so for every , for large enough
Consequently, again for large enough ,
Now, for each set , for every , if then
by Lemma 6. By Assumption 6, for all sufficiently large . Hence
for all and all sufficiently large. Since, by Assumption 5, , for every , is within of for large enough , so
Thus, for every , we have that
for large enough , or, in short,
The standard version of Egorov's theorem concerns sequences of finite measurable functions converging pointwise to a finite measurable limiting function. However, the proof is easily adapted to the case of an infinite limiting function.
Lemma 9
Let  be a sequence of finite, measurable functions, converging to  almost everywhere () on . Then for each , there exists a possibly-empty  such that , and the convergence is uniform on .
Proof: Parallel to the usual proof of Egorov's theorem. Begin by removing the measure-zero set of points on which pointwise convergence fails; for simplicity, keep the name  for the remaining set. For each natural number  and , let  — the points where the function fails to be at least  by step . Since the limit of  is  everywhere on , each  has a last  such that , no matter how big  is. Hence . By continuity of measure, for any , there exists an  such that  if . Fix  as in the statement of the lemma, and set . Finally, set . By the union bound, , and by construction, the rate of convergence to  is uniform on .
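As a concrete check on this adapted construction, here is a minimal numerical sketch (all names hypothetical) using $f_n(x) = nx$ on $[0,1]$ with Lebesgue measure, which diverges to $+\infty$ everywhere except on the null set $\{0\}$:

```python
import numpy as np

# Egorov-type construction for f_n(x) = n*x on [0,1] (Lebesgue measure):
# f_n -> +infinity pointwise except on the null set {0}.

eps = 0.1  # target measure of the exceptional "bad" set

def measure_B(n, k):
    # B_{n,k} = {x : f_m(x) < k for some m >= n}.  Since f_m(x) = m*x is
    # increasing in m, this is just {x : n*x < k}, of measure min(1, k/n).
    return min(1.0, k / n)

def n_of(k):
    # Pick n(k) large enough that mu(B_{n(k),k}) < eps * 2^{-k}.
    return int(np.ceil(k * 2**k / eps)) + 1

# Union bound over the levels k: the bad set's total measure stays below eps.
bad_measure = sum(measure_B(n_of(k), k) for k in range(1, 30))
print(bad_measure)
```

The bad set has measure below $\epsilon$ by the union bound, and off it $f_n \ge k$ uniformly once $n \ge n(k)$, mirroring the uniform divergence in the proof.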
Lemma 10
The conclusion of Lemma 8 is unchanged if .
Proof: The integrated likelihood ratio can be divided into two parts, one from integrating over and one from integrating over its complement. Previous lemmas have established that the latter is upper bounded, in the long run, by a quantity which is . We can use Lemma 9 to divide into a sequence of subsets, on which the convergence is uniform, and hence on which the integrated likelihood shrinks faster than any exponential function, and remainder sets, of prior measure no more than