Posterior consistency for partially observed Markov models
Abstract
In this work we establish posterior consistency for a parametrized family of partially observed, fully dominated Markov models. As a main assumption, we suppose that the prior distribution assigns positive probability to all neighborhoods of the true parameter, for a distance induced by the expected Kullback–Leibler divergence between the family members’ Markov transition densities. This assumption is easily checked in general. In addition, under some additional mild assumptions we show that posterior consistency is implied by the consistency of the maximum likelihood estimator; the latter has recently been established also for models with noncompact state space. The result is then extended to possibly noncompact parameter spaces and nonstationary observations. Finally, we check our assumptions on examples including the partially observed Gaussian linear model with correlated noise and a widely used stochastic volatility model.
SAMOVAR, CNRS UMR 5157, Institut Télécom/Télécom SudParis, 9 rue Charles Fourier, 91000 Evry. KTH Royal Institute of Technology, Stockholm, Sweden; Jimmy Olsson is supported by the Swedish Research Council, Grant 2011-5577. LTCI, CNRS 5141, Institut Télécom/Télécom Paristech, 42 rue Barrault, 75000 Paris.
1 Introduction
We consider a very general framework where a bivariate Markov chain taking on values in some product state space , i.e., , is only partially observed through the second component . In this model, which we refer to as a partially observed Markov model (POMM) ([33] uses the alternative term pairwise Markov chain), any statistical inference has to be carried out on the basis of the observations only, which is generally far from straightforward since the observation process is, in contrast to , generally non-Markovian. Of particular interest are the hidden Markov models (HMMs) (alternatively termed state-space models in the case where is continuous), which constitute a special case of the POMMs in which the process is itself a Markov chain, referred to as the state process, and the observations are conditionally independent given the states, such that the marginal conditional distribution of each observation depends only on the corresponding state . The use of unobservable states provides the HMMs with a relatively simple dependence structure which is still generic enough to handle complex, real-world time series in a large variety of scientific and engineering disciplines (such as financial econometrics [21, 29], speech recognition [23], biology [7], neurophysiology [15], etc.; see also the monographs [28] and [6] for introductory and state-of-the-art treatments of the topic, respectively), and the POMMs can be viewed as a natural extension and generalization of this model class.
In this paper, we will consider a parameterized family of POMMs with parameter space , where the latter is assumed to be furnished with some metric. For each , the dynamics of the model is specified by the transition kernel of on , and we will in this work restrict ourselves to the fully dominated case where has a transition density w.r.t. some dominating measure (all these objects will be defined rigorously in the next section). Each transition kernel is assumed to have a unique invariant distribution .
We assume that we have access to a single observation trajectory sampled from the canonical law induced by and some initial distribution on , where is a distinguished parameter interpreted as the true parameter and is generally different from . In order to estimate via the observations we adopt a Bayesian framework and introduce a possibly improper prior distribution on , reflecting our a priori belief concerning , and compute the conditional—posterior—distribution of given the observations , which is, for measurable and , given by
where denotes the density of given . In this general setting, we examine the asymptotics of the posterior distribution and identify model conditions under which the posterior consistency
(1) 
holds true, where denotes weak convergence and denotes the Dirac mass located at the true parameter . In (1), denotes the distribution of corresponding to the true parameter ; in this sense we adopt, by proving (1), a frequentist point of view for the asymptotic behavior of the posterior distribution. However, establishing that the influence of the prior is overwhelmed by the data as the sample size grows to infinity is of fundamental interest in Bayesian analysis.
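The concentration (1) can be illustrated numerically. The sketch below uses a toy, fully observed Gaussian AR(1) chain rather than a genuine POMM, a flat prior on a parameter grid, and illustrative parameter values of our own choosing; it simply checks that the grid posterior piles up its mass near the true parameter for a long trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(theta, n):
    """Simulate a fully observed Gaussian AR(1) chain X_{k+1} = theta * X_k + eps_k."""
    x = np.zeros(n)
    for k in range(1, n):
        x[k] = theta * x[k - 1] + rng.standard_normal()
    return x

def grid_posterior(x, grid):
    """Grid posterior under a flat prior; transition densities are N(theta * x_k, 1)."""
    # Log-likelihood of the transitions, additive constants dropped.
    ll = np.array([-0.5 * np.sum((x[1:] - th * x[:-1]) ** 2) for th in grid])
    w = np.exp(ll - ll.max())  # subtract the max for numerical stability
    return w / w.sum()

theta_star = 0.5
grid = np.linspace(-0.95, 0.95, 381)
x = simulate_ar1(theta_star, 2000)
post = grid_posterior(x, grid)
mass_near_truth = post[np.abs(grid - theta_star) < 0.1].sum()
```

As the sample size grows, the mass near the truth tends to one, which is the grid analogue of the weak convergence to the Dirac mass in (1).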
1.1 Previous work
From the frequentist inference point of view, POMMs have been the subject of extensive research during the last decades. For the important subclass formed by HMMs with finite state space, the asymptotic consistency of the maximum likelihood estimator (MLE) was established by [5, 32] and [26] in the cases of finite and general observation spaces, respectively, and these results were gradually generalized to more general HMMs in [10, 13, 18]. The first MLE consistency result for general HMMs with possibly noncompact state space was obtained in [12], and [11] further extended this result to misspecified models. For POMMs that fall outside the HMM class, [13] established the MLE consistency for autoregressive models with Markov regimes by applying strong mixing assumptions, typically requiring the state space of the latent Markov chain to be compact. Recently, [9] established the MLE consistency for general observation-driven time series with possibly noncompact state space and [14] established the analogous result for general partially dominated POMMs, covering the observation-driven models as a special case. The latter work can be viewed as the state of the art as regards MLE analysis for POMMs. The mentioned works demonstrate a variety of techniques, but share the assumption that the parameter space is a compact set.
On the other hand, concerning Bayesian asymptotic analysis of POMMs, there are only a handful of results, all of which treat exclusively HMMs. In the case of HMMs with a finite state space, [16] provides the posterior consistency (with rates) for parametric models with an unknown number of states and the recent paper [36] deals with posterior concentration in the nonparametric case. For more general HMMs, [8] establishes, along the now classical lines of [25, Theorem 8.3], a Bernstein–von Mises-type result under the assumption that the model satisfies, first, a law of large numbers for the log-likelihood function, second, a central limit theorem for the score function and, third, a law of large numbers for the observed Fisher information. As these asymptotic properties, which are the cornerstones of the proof of the asymptotic normality of the MLE, can be established for models satisfying the strong mixing assumption (see [13] and [6, Chapter 12]), the result holds, in principle, true for HMMs with a compact state space. A more direct approach to the posterior consistency for HMMs is taken in [31], where the author works with a large deviation bound for the observation process; nevertheless, the analysis is driven by very restrictive model assumptions in terms of strong mixing and additive observation noise.
In conclusion, all available results on posterior concentration for parametric HMMs rest on very stringent model assumptions, in the sense that the state space of the unobserved process is assumed to be compact. Needless to say, this is not the case for many models met in practical applications (such as the linear Gaussian state-space models). In addition, the mentioned results also require, without exception, the parameter space to be compact, which is generally a severe restriction for the Bayesian. Consequently, a general posterior consistency result for parametric POMMs (and, in particular, parametric HMMs) has hitherto been lacking. In the light of the widespread and ever increasing interest in Bayesian inference in models of this sort, which is boosted by novel achievements in computational statistics (especially in the form of particle [20] and particle Markov chain Monte Carlo methods [1]), this is indeed remarkable, and the goal of the present paper is to fill this gap.
1.2 Our approach
In this paper, we establish the posterior consistency (1) under very mild assumptions which can be checked for a large class of POMMs used in practice. The result is stated in Theorem 2.2 for general POMMs with positive transition density (Theorem 2.2 deals with the case of a compact parameter space) and in Theorem 2.2 for HMMs under an alternative set of assumptions requiring, e.g., only the emission density to be positive. The starting point of our analysis is the observation that it is, by the Portmanteau lemma, enough to show that for all , ,
(2) 
(see Remark 8 for details). Now, by expressing the posterior as
(3) 
we conclude that (2) will hold if

all closed sets not containing are remote from in the sense that the numerator of (3) tends to zero exponentially fast under ;

for all , there exists some subset of which is charged by the prior, i.e., , and such that for all , the ratio is, a.s., eventually bounded from below by . This asymptotic merging property forces the numerator of (3) to vanish at a faster rate than the denominator for all remote sets , implying (2).
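In symbols, the two points above combine as follows. This is a heuristic sketch in generic notation of ours (the constants, densities and sets below are illustrative), not the precise statement proved in Section 4.

```latex
% Posterior as a ratio (cf. (3)):
\pi\bigl(A \mid Y_{1:n}\bigr)
   = \frac{\int_A p_\theta(Y_{1:n}) \,\pi(\mathrm{d}\theta)}
          {\int_\Theta p_\theta(Y_{1:n}) \,\pi(\mathrm{d}\theta)} .
% Remoteness: for a closed set A not containing the true parameter,
% there is some c > 0 such that, eventually, almost surely,
\int_A p_\theta(Y_{1:n}) \,\pi(\mathrm{d}\theta)
   \leq \mathrm{e}^{-nc}\, p_{\theta^\star}(Y_{1:n}) .
% Asymptotic merging: for every \varepsilon > 0 there is a set
% \Theta_\varepsilon with \pi(\Theta_\varepsilon) > 0 such that,
% eventually, almost surely,
\int_\Theta p_\theta(Y_{1:n}) \,\pi(\mathrm{d}\theta)
   \geq \pi(\Theta_\varepsilon)\,
        \mathrm{e}^{-n\varepsilon}\, p_{\theta^\star}(Y_{1:n}) .
% Choosing \varepsilon < c yields
% \pi(A \mid Y_{1:n}) \leq \mathrm{e}^{-n(c-\varepsilon)} / \pi(\Theta_\varepsilon)
% \longrightarrow 0 .
```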
This machinery, which is adopted from [4] and described generally in Section 4.1, does not require the model under consideration to be a POMM; it is hence of independent interest. As we will see, the situation of a noncompact parameter space calls for a refined notion of remoteness; indeed, by operating under the assumption that the sequence of posterior distributions is tight, it is enough to require remoteness to hold on a sufficiently large compact subset of .
The remoteness, the asymptotic merging property and the tightness of the posterior are the fundamental building blocks of our analysis of the posterior concentration. Interestingly, a key finding of ours is that the remoteness is closely related to the MLE consistency; more specifically, in Proposition 4.2 we establish that if all sequences of approximate MLEs (see Definition 3) on some compact subset of (with ) containing are strongly consistent, then is remote for all closed sets not containing . As mentioned in the literature review above, the MLE can, under the assumption that the parameter space is compact, be proven to be consistent under very mild model assumptions satisfied for most fully dominated POMMs, and we will hence obtain the remoteness for free for a large set of models.
As for the asymptotic merging property, we derive the instrumental bound
(4) 
where , , denotes the density of the complete data given (see Lemma 4.1 for a more general formulation). In the stationary mode, i.e., when , the right-hand side of (4) tends, by Birkhoff’s ergodic theorem, to minus the expectation of the Kullback–Leibler divergence (KLD) between and under the stationary distribution. As a consequence, the asymptotic merging property holds true as long as the prior is information dense at in the sense that for all . This condition can, however, be checked straightforwardly in general, since it involves a KLD between perfectly known transition kernels. Without access to the bound (4), an alternative strategy would have been to study directly the limit of the left-hand side of (4) by, e.g., going to “the infinite past” in the spirit of [13, 11]; however, this approach would require the analysis of an expected KLD between ergodic limits and (we refer to the mentioned works for the meaning of these quantities) under the stationary distribution, which is infeasible in general.
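The convergence of the Birkhoff average described above can be checked numerically in a case where the expected KLD is available in closed form. The sketch below uses a scalar Gaussian AR(1) transition density with illustrative parameter values of ours; the time average of the per-step log-likelihood ratio is compared to the exact expected KLD under the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar Gaussian AR(1) transition density q_theta(x, x') = N(x'; theta * x, 1),
# standing in for a Markov transition density of the type considered here.
theta_star, theta = 0.5, 0.8
n = 200_000

# Simulate a long (approximately stationary) trajectory under theta_star.
x = np.zeros(n)
for k in range(1, n):
    x[k] = theta_star * x[k - 1] + rng.standard_normal()

# Birkhoff average of the per-step log-likelihood ratio
# log q_{theta_star}(X_k, X_{k+1}) - log q_theta(X_k, X_{k+1}) ...
log_ratio = 0.5 * ((x[1:] - theta * x[:-1]) ** 2
                   - (x[1:] - theta_star * x[:-1]) ** 2)
empirical_kld = log_ratio.mean()

# ... versus the closed-form expected KLD under the stationary law
# N(0, 1 / (1 - theta_star**2)) of the chain:
exact_kld = (theta_star - theta) ** 2 / (2.0 * (1.0 - theta_star ** 2))
```

Here the per-transition KLD is (theta_star - theta)^2 x^2 / 2, and taking its expectation under the stationary Gaussian law gives the closed form above.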
As described above, our technique for handling models with a noncompact parameter space is based on the assumption that the sequence of posterior distributions is tight. Recalling that our objective is to establish the gradual concentration of these distributions around the true parameter as increases, we may expect this assumption to be mild. Indeed, by operating in the stationary mode using Kingman’s subadditive theorem, we are able to derive handy assumptions under which the posterior tightness holds at an exponential rate (see Theorem 2.2 and Proposition 5.4). To the best of our knowledge, this is the first result in this direction for models of this sort, and we believe that a similar approach may be used also for extending existing results on MLE consistency to the setting of a noncompact parameter space.
We remark that Birkhoff’s ergodic theorem and Kingman’s subadditive theorem require the observation process to be stationary (i.e., ). Nevertheless, if for all parameters and initial distributions , the distribution of , when initialized according to and evolving according to , admits a positive density w.r.t. the distribution of under the stationary distribution , one may prove that any property that holds a.s. under the latter distribution holds a.s. under the former as well. That such positive densities exist can be established for POMMs in general under the assumption that the transition density is positive (Lemma 5.2) and for HMMs in particular under the weaker assumption that the emission density is positive and the hidden chain is geometrically ergodic (Lemma 5.3), and we are consequently able to treat also the nonstationary case. To the best of our knowledge, this efficient approach to nonstationarity has never been taken before.
Finally, we demonstrate the flexibility of our results by checking carefully our assumptions on a partially observed linear Gaussian Markov model as well as the widely used stochastic volatility model proposed by [21] (the latter falls into the framework of HMMs).
To sum up, our contribution is fourfold, since we

establish the posterior consistency for very general POMMs under mild assumptions, which allow the state space of the latent part of the model to be noncompact and which can be checked for a large number of models used in practice.

link, via the concept of remoteness, the posterior consistency to the consistency of the MLE.

are able to treat also the case of a noncompact parameter space.

treat efficiently the case of nonstationary observations.
The paper is structured as follows. In Section 2, we introduce the POMM framework under consideration and state, in Section 2.2, our main results (Theorems 2.2–2.2) and assumptions. In particular, we provide an alternative set of assumptions that are tailor-made for the special case of HMMs. Section 3 treats the two examples mentioned previously and discusses generally our assumptions in the light of nonlinear state-space models. In Section 4 we embed the problem of posterior concentration into the general framework outlined above, serving as a machinery for the proofs of our main results. The latter proofs are found in Section 5, and in Section 6 we conclude the paper.
2 Fully dominated partially observed Markov models (fdPOMM)
2.1 Setting
Let and be general measurable spaces, referred to as the state space and the observation space, respectively. The product space is then endowed with the product σ-field , and we set and . Let further , and denote the canonical processes taking on values in the spaces , and , respectively, and defined by , and , where . Now, for all , only is observable, and for this reason we refer to the model as partially observed. Let us define for all , with serving as our general notation for vectors. In addition, let be a measurable space and let be a collection of Markov transition kernels on . Denote, for all , by the law of the canonical Markov chain induced by the Markov transition kernel and the initial distribution . We say that the model is fully dominated if there exist two σ-finite measures and on and , respectively, such that for all and , the probability measure is dominated by , and we denote by the corresponding density. We may now introduce the main class of models studied in this article.
Definition 1
We say that follows a fully dominated partially observed Markov model (fdPOMM) if, for all , under the previous definitions and assumptions, the distribution of is given by for some and initial distribution on .
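As an illustration of Definition 1, the following sketch simulates a simple fully dominated POMM: the pair is a bivariate Gaussian VAR(1) with correlated noise components, so that the observable second coordinate is not Markov on its own. The matrices below are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy fdPOMM: Z_k = (X_k, Y_k) is a bivariate Gaussian VAR(1),
# Z_{k+1} = A @ Z_k + W_k with W_k ~ N(0, S); only Y_k is observed.
# A has spectral radius < 1 and S is positive definite, as required.
A = np.array([[0.7, 0.1],
              [0.3, 0.2]])
S = np.array([[1.0, 0.4],
              [0.4, 1.0]])
L = np.linalg.cholesky(S)  # to draw correlated Gaussian noise

def simulate_pomm(n):
    """Simulate n steps of the bivariate chain Z_k = (X_k, Y_k)."""
    z = np.zeros((n, 2))
    for k in range(1, n):
        z[k] = A @ z[k - 1] + L @ rng.standard_normal(2)
    return z

z = simulate_pomm(1000)
x_hidden, y_observed = z[:, 0], z[:, 1]  # only the second coordinate is observable
```

The correlated noise makes this a genuine POMM rather than an HMM, anticipating the linear Gaussian example of Section 3.1.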
We denote, with a slight abuse of notation, by the density of with respect to under , i.e.,
Now, let be some measure, called the prior distribution, on . We will always assume that is endowed with some metric and that is taken to be the corresponding Borel field.
Given observations , the posterior distribution associated with the initial probability distribution is defined by
(5) 
For the numerator and denominator of (5) to be well-defined, we will always assume that is measurable on . However, at this point it is not guaranteed that the ratio itself is well-defined (and does not degenerate into or ). In fact, we will only be interested in the case where is a probability distribution for large enough, where denotes the true distribution of and is introduced below.
We always assume the following.

For all , the Markov transition kernel has a unique stationary distribution .
Under (B2.1) it is typically assumed that the law of the observations is given by for some distinguished parameter interpreted as the true parameter (which is not known a priori). We proceed similarly and set . However, in the present paper we will also consider the more general case where for some possibly unknown initial distribution , and since the initial distribution appearing in (5) is chosen arbitrarily by the user, we cannot assume that . See also Remark 10 for further comments concerning this.
Remark 1
Under (B2.1) , since is dominated by , the stationary probability measure is also dominated by , and by abuse of notation, we still denote by the associated density.
2.2 Main results
We now state the main results of this contribution, which consist in providing general sufficient conditions for the posterior consistency
where denotes weak convergence and denotes a Dirac point mass located at . The proof of this result is based on two main ingredients. The first is to ensure that only parameters close to have a large likelihood as the number of observations tends to infinity. This is formalized by the following assumption.

If is a compact set containing , then all valued, adapted random sequences such that for all ,
with
converge to .
For identifiable models, this assumption follows directly from standard consistency results on the maximum likelihood estimator for ergodic partially observed Markov chains; see [14] and the references therein. In other words, it does not require a specific treatment from a Bayesian point of view.
Remark 2
Note that in the case where is compact, (B2.2) only needs to be checked for .
The second ingredient is to ensure that the prior distribution does not concentrate around parameters whose likelihood is too small asymptotically. We provide below sufficient conditions that are easily checked in two specific situations.
The fully dominated case.
For all we set
(6) 
where for any two probability measures and defined on the same probability space, denotes the KLD of from defined by
Theorem
The proof of this result can be found in Section 5.2.
An alternative set of conditions for HMMs.
The HMMs can be viewed as a subclass of the fdPOMMs defined by the following assumption.

For all , and ,
where and are kernel densities on and , respectively.
Under (C2.2) we denote by the Markov transition kernel on associated with the transition density . In this subclass of models it may happen that the positivity condition (B2.2) does not hold, but only because vanishes. In this case, we rely on the following weaker assumption.

For all and , a.s.
In this context, we define
(7) 
and consider the following assumption replacing (B2.2) .

For all , .
Finally, we impose the following condition.

For all and all initial distributions on , converges to the first marginal of in the total variation norm, i.e.,
The kernel notation in (C2.2) is standard and described in detail in Section A.
We can now state the following result, whose proof can be found in Section 5.3.
Remark 3
It is interesting to note that in the case of i.i.d. observations, which corresponds to the HMM case (C2.2) with arbitrary (say equal to 1 with an arbitrary probability measure) and not depending on , simply denoted by hereafter, we have
Hence (B2.2) and (C2.2) boil down to the well known condition of the i.i.d. setting introduced by [34], see also [19, Eqn. (1)].
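For reference, this classical i.i.d. condition can be written out as follows, in generic notation of ours (with a hypothetical symbol for the observation density under parameter and a dominating measure): the prior charges every Kullback–Leibler neighborhood of the truth.

```latex
% Schwartz's prior-positivity condition in the i.i.d. case
% (generic notation; f_theta is the observation density, nu a
% dominating measure, Pi the prior):
\Pi\Bigl(\Bigl\{\theta \in \Theta :
   \int f_{\theta^\star}(y)\,
        \log\frac{f_{\theta^\star}(y)}{f_{\theta}(y)}\,
        \nu(\mathrm{d}y) < \varepsilon \Bigr\}\Bigr) > 0
\qquad \text{for all } \varepsilon > 0 .
```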
Noncompact parameter space.
Needless to say, a drawback of Theorems and 2.2 is that is assumed to be compact. This assumption is standard in the frequentist setting, e.g., when studying the maximum likelihood estimator, but can be problematic in the Bayesian setting, where it is often convenient to work with priors defined on noncompact spaces for computational reasons. We now derive some additional conditions dealing with the noncompact case.
Define for all , and ,
(8) 
(with ). Now the following result holds true also for a possibly noncompact .
Theorem
Consider a fdPOMM satisfying (B2.1) and (B2.2) . Let be a Radon measure on and an arbitrary distribution on . In addition, suppose that the following conditions hold true.

There exist and a nondecreasing sequence of compact sets in such that
(9) (10)

There exists such that
(11) (12)
Then (B2.2–2.2) or (C2.2–2.2) imply that for all initial distributions on ,
The proof is postponed to Section 5.4. In (12), is, as usual, defined as the ratio , with the convention that the denominator is unity if . Note that this ratio is always well-defined under (B2.2) or (C2.2–2.2) .
Remark 4
It is interesting to observe that, as detailed in Lemma , each condition in (B2.2) implies the same condition with replaced by any larger than . It is therefore sufficient to check the conditions independently with two possibly different . The fact that (11) holds for all is of particular interest, since it guarantees that both the numerator and denominator in the definition of the posterior in (5) are finite. On the other hand, by (B2.2) or (C2.2–2.2) the denominator is positive. Hence, if (11) holds, the posterior is well-defined as a probability distribution for large enough.
3 Examples
3.1 Partially observed Gaussian linear Markov model
First, we consider a linear Gaussian fdPOMM defined on by
(13) 
where is a matrix and is a sequence of i.i.d. centered Gaussian vectors with covariance matrix . In the following we assume that is a compact subset of and that for all , has spectral radius strictly less than unity and is positive definite. Then is a vector autoregressive process with transition density
with , satisfying (B2.1) . This framework includes the widespread linear Gaussian state-space model
(14) 
corresponding to
Note that the model (14) is an HMM only if and are uncorrelated for all , which is not assumed in the model (13). Assumption (B2.2) is trivially satisfied. The expected KLD in (6) is easily computed; indeed, for all and in ,
where
It follows that is continuous at (where it always vanishes) and thus that (B2.2) is satisfied whenever is strictly positive on (i.e., for every nonempty open set ). Theorem hence applies as soon as Assumption (B2.2) holds true (examples are treated in, e.g., [18] or [12, Section 3.3]).
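The closed-form computation of the expected KLD described above can be sketched in code. Assuming, as in this section, Gaussian transition densities of the form N(A z, S), the expected KLD under the stationary law of the true chain equals a Gaussian KLD term plus a quadratic term involving the stationary covariance solving Gamma = A Gamma A^T + S; all parameter values below are illustrative choices of ours.

```python
import numpy as np

def stationary_cov(A, S):
    """Solve Gamma = A @ Gamma @ A.T + S via the Kronecker/vec identity."""
    d = A.shape[0]
    vec_gamma = np.linalg.solve(np.eye(d * d) - np.kron(A, A), S.reshape(-1))
    return vec_gamma.reshape(d, d)

def expected_kld(A_star, S_star, A, S):
    """Expected KLD between the Gaussian transitions N(A_star z, S_star)
    and N(A z, S), the expectation being taken over z under the stationary
    law of the chain with parameters (A_star, S_star)."""
    d = A.shape[0]
    gamma = stationary_cov(A_star, S_star)
    S_inv = np.linalg.inv(S)
    diff = A_star - A
    # E[ 0.5 * ((A_star - A) z)^T S^{-1} ((A_star - A) z) ]
    quad = 0.5 * np.trace(S_inv @ diff @ gamma @ diff.T)
    # Remaining Gaussian KLD terms (log-det and trace), independent of z.
    gauss = 0.5 * (np.log(np.linalg.det(S) / np.linalg.det(S_star))
                   - d + np.trace(S_inv @ S_star))
    return quad + gauss

# Illustrative parameter values (not from the paper).
A_star = np.array([[0.7, 0.1], [0.3, 0.2]])
S_star = np.array([[1.0, 0.4], [0.4, 1.0]])
```

The function vanishes at the true parameter and varies continuously with the perturbation, in line with the continuity argument used to verify (B2.2).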
3.2 General HMMs
In this section, we consider the case of HMMs, i.e., we assume (C2.2) . To the best of our knowledge, the following set of assumptions, borrowed from [12, Theorem 1], is the weakest available for obtaining the strong consistency of the (approximate) MLE for well-specified HMMs.

For all , the Markov kernel is aperiodic and positive Harris recurrent.
Note that under (C2.2) , this assumption is sufficient for (B2.1) , i.e., the existence of a unique stationary probability measure for each complete chain kernel , .

, .

For all , there exist a neighborhood of such that
and an integer such that

For all and , the function is upper semicontinuous at , a.s.

For all such that a.s., we have
for some sequence of sets .
As a consequence of Theorems and 2.2, the only requirement on the prior in the case of general HMMs with a compact parameter space is given by (B2.2) or (C2.2) , depending on the positivity assumption on the kernel densities ( (B2.2) or (C2.2) ).
Theorem
Proof
Thus, for HMMs, one may, in order to apply Theorem , choose to check (B2.2–2.2) or (C2.2–2.2) , depending on the model. Consider for example the nonlinear state-space model on ,
(15) 
where is an -dimensional parameter on a compact space , is an i.i.d. sequence of -dimensional random vectors with density with respect to the Lebesgue measure, and and are given (measurable) matrix-valued functions. Conditions on , , and to ensure (D3.2–3.2) can be found in [12, Section 3.3]. These conditions are stated in the case where is positive over but can easily be adapted to the case where the support of is compact. However, in the latter case, (B2.2) does not hold, and assuming (C2.2) , we can rely on (C2.2) as an alternative to (B2.2) . In what follows, we explain how to deal with , appearing in the definition (7) of , using standard conditions.
Assume that there exist a measurable function and constants such that for all and all ,
(16) 
where, for any signed measure on , , the supremum being taken over all measurable functions such that , and denotes the integral of w.r.t. (see Section A for details).
Proposition
Proof
Let . It is sufficient to show that the function is continuous at , where it takes on the value zero. For all , denote by the marginal probability measure on defined by for all , and let the function be defined by
We may then write
(17) 
(where, as usual, denotes the integral of w.r.t. ; see Section A). We proceed stepwise.
Step 1. We first show that . Denoting by the Banach space of signed measures such that , equipped with the norm, we will actually prove the following more precise assertion.

For all , converges to in , uniformly over .
(Hence for all .) For all probability measures , on , all and all such that , we have
Thus, (16) provides a constant such that
Taking and and combining with 2, we get that
and since is complete, we obtain that converges uniformly in over . Denoting by this limit, we are only required to show that in order to establish 1. Now, since all bounded functions have finite norm, for all ,
so that is an invariant probability measure for . It is thus as a consequence of (B2.1) .
Step 2. Next, we show that . Assumption 1 and the dominated convergence theorem immediately give that is continuous at , for . Moreover, . Consequently, using again the dominated convergence theorem, the last term on the right-hand side of (17) converges to as tends to .
Step 3. Finally, we consider the term between brackets in (17) and show that it converges to zero as . Since we just showed that , it suffices to show that converges to in . By 1, this boils down to proving that for all , converges to in . This can be done by induction on . The base case corresponds to 3. The induction follows easily from the following decomposition, valid for all such that :
By observing that 2 implies for all , we may conclude the proof.
3.3 Stochastic volatility models
Consider the stochastic volatility model
(18) 
where and are independent sequences of i.i.d. Gaussian random vectors in with zero mean and identity covariance matrix; see [21]. A general description of stochastic volatility models as HMMs is provided in [17]. In this case, . If the parameter vector belongs to a compact parameter space, we may apply the theory developed in the previous section. However, in this example, is assumed to belong to the noncompact parameter space
where , and . Denote by the true value of the parameter. In this model,
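For concreteness, the model (18) can be simulated as follows, under a standard parametrization of the stochastic volatility model of [21]: the latent log-volatility is a Gaussian AR(1), and each observation is conditionally centered Gaussian with variance driven by the current log-volatility. The names phi, sigma and beta are illustrative and need not match the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_sv(phi, sigma, beta, n):
    """Simulate a stochastic volatility HMM:
       X_{k+1} = phi * X_k + sigma * eta_k   (latent log-volatility),
       Y_k     = beta * exp(X_k / 2) * eps_k (observed return),
    with eta_k, eps_k i.i.d. standard Gaussian."""
    x = np.zeros(n)
    y = np.zeros(n)
    # Draw the initial state from the stationary N(0, sigma^2 / (1 - phi^2)) law.
    x[0] = sigma / np.sqrt(1.0 - phi ** 2) * rng.standard_normal()
    for k in range(n):
        y[k] = beta * np.exp(x[k] / 2.0) * rng.standard_normal()
        if k + 1 < n:
            x[k + 1] = phi * x[k] + sigma * rng.standard_normal()
    return x, y

x, y = simulate_sv(phi=0.95, sigma=0.3, beta=1.0, n=5000)
```

Only y would be available to the statistician; the log-volatility x plays the role of the hidden state in the HMM formulation.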