Asymptotic Behaviour of Approximate Bayesian Estimators
Although approximate Bayesian computation (ABC) has become a popular technique for performing parameter estimation when the likelihood functions are analytically intractable, there has not as yet been a complete investigation of the theoretical properties of the resulting estimators. In this paper we give a theoretical analysis of the asymptotic properties of ABC based parameter estimators for hidden Markov models and show that ABC based estimators satisfy asymptotically biased versions of the standard results in the statistical literature.
Approximate Bayesian Computation
T.A. Dean and S.S. Singh’s research is funded by the Engineering and Physical Sciences Research Council (EP/G037590/1) whose support is gratefully acknowledged.
AMS subject classifications: Primary 62M09; secondary 62B99, 62F12, 65C05.
Keywords: Parameter Estimation, Hidden Markov Model, Maximum Likelihood, Approximate Bayesian Computation, Sequential Monte Carlo.
One of the most fundamental problems in statistics is that of parameter estimation. Suppose that one has a collection of probability laws parametrised by a collection of parameter vectors . Suppose further that one has data generated by a process distributed according to some law where the exact value of is unknown. The problem of parameter estimation is to infer the value of the unknown parameter vector from the data . Many standard methods for estimating the value of are based upon using the likelihood function . For example, Bayesian approaches use the likelihood to reweight some prior distribution to obtain a posterior distribution on the space of parameter vectors that represents one's sense of certainty of any given parameter vector being equal to . Alternatively one may take a frequentist approach and estimate with the parameter vector which maximises the value of the corresponding likelihood (i.e. maximum likelihood estimation (MLE)).
Of course these approaches all rely on one being able to compute the likelihood functions , either exactly or numerically. However, in a wide range of applications this is not possible, either because no analytic expression for the likelihoods exists or else because computing them is computationally intractable. Despite this one is often still able, in such cases, to generate random variables distributed according to the corresponding laws . This has led to the development of methods in which is estimated by implementing a standard likelihood based parameter estimator using some principled approximation to the likelihood instead of the true likelihood function itself. In general these approximations are estimated using Monte Carlo simulation based on generating samples from the relevant probability distributions.
A method which has recently become very popular in practice and on which we shall focus our attention for the rest of this paper is approximate Bayesian computation (ABC). A non-exhaustive list of references for applications of the method includes: (McKinley et al., 2009; Peters et al., 2010; Pritchard et al., 1999; Ratmann et al., 2009; Tavaré et al., 1997). See also (Sisson and Fan, to be published) for a review on computational methodology. The standard ABC approach to approximating the likelihood is as follows. Suppose that the distributions all have a density on some space w.r.t. some dominating measure . Furthermore suppose that the functions cannot be evaluated directly but that one can generate random variables distributed according to the laws . Given some data the general ABC approach to approximating the values of the likelihood functions is to choose a metric on and a tolerance parameter and for all approximate the likelihood with
Typically the probabilities (1) are themselves estimated using Monte Carlo techniques. A particularly appealing feature of the ABC methodology is that, despite the method's name, the resulting approximations to the likelihoods may then be used in any likelihood based parameter inference methodology the user desires.
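As a minimal sketch of the Monte Carlo step just described, the following estimates the probability that a simulated observation falls within the tolerance of the data. The Gaussian simulator, the absolute-value metric and all names (`simulate`, `abc_likelihood`, `y_obs`, `eps`) are assumptions made purely for illustration and do not come from the paper.

```python
import math
import random

def simulate(theta, rng):
    """Hypothetical simulator: one draw Y ~ N(theta, 1)."""
    return rng.gauss(theta, 1.0)

def abc_likelihood(theta, y_obs, eps, n_sims, rng):
    """Monte Carlo estimate of P_theta(|Y - y_obs| <= eps)."""
    hits = sum(abs(simulate(theta, rng) - y_obs) <= eps for _ in range(n_sims))
    return hits / n_sims

rng = random.Random(0)
y_obs, eps = 0.3, 0.1
est = abc_likelihood(0.0, y_obs, eps, 200_000, rng)

# Dividing by the ball volume 2*eps gives an approximation to the density at y_obs.
approx_density = est / (2 * eps)
# In this toy model the exact density is known, so the two can be compared.
exact_density = math.exp(-0.5 * y_obs ** 2) / math.sqrt(2 * math.pi)
```

In this toy case the simulator has a known density, so the quality of the approximation can be checked directly; in a genuine ABC application only `simulate` would be available.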
Intuitively, the justification for the ABC approximation is that for sufficiently small
where denotes the -ball of radius around the point and thus the probabilities (1) will provide a good approximation to the likelihood, up to the value of some renormalising factor which is independent of and hence can be ignored.
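In symbols, the heuristic can be sketched as follows. The notation is assumed here for illustration only: $p_\theta$ is the density of the law $P_\theta$ with respect to the dominating measure $\nu$, $\hat y$ is the observed data and $B^{\epsilon}_{\hat y}$ is the ball of radius $\epsilon$ around $\hat y$. The approximation step presumes that $p_\theta$ is sufficiently regular near $\hat y$.

```latex
\[
P_\theta\bigl(Y \in B^{\epsilon}_{\hat y}\bigr)
  = \int_{B^{\epsilon}_{\hat y}} p_\theta(y)\,\nu(dy)
  \approx p_\theta(\hat y)\,\nu\bigl(B^{\epsilon}_{\hat y}\bigr)
  \qquad \text{for small } \epsilon .
\]
```

The factor $\nu(B^{\epsilon}_{\hat y})$ does not depend on $\theta$, which is why it can be dropped when comparing parameter values.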
Clearly in general the estimators based on ABC approximations to the likelihood will differ from those based on the exact value of the likelihood function. However, although the use of ABC has become commonplace, there has to date been little investigation of the precise nature of the theoretical properties of ABC based estimators. One notable exception is (Fearnhead and Prangle, 2010). In this paper the authors consider the problem of finding the optimal choice, for a given data set, of summary statistic and in order to minimise the mean square error of the resulting ABC posterior distribution on parameter space. Unfortunately the resulting optimal choice of summary statistic involves computing a conditional expectation w.r.t. the unknown posterior distribution and hence it can only be computed approximately and not exactly. Further, the analysis is done only for fixed size data sets and the asymptotic properties of the ABC estimator are left unexplored.
An alternative approach is taken in (Dean et al., 2010) in which the asymptotic behaviour of the MLE implemented with the ABC approximation to the likelihood (henceforth ABC MLE) was studied. The analysis in this paper is based on the observation that the ABC approximation to the likelihood can be considered as being equal to the likelihood function of a perturbed probability distribution. Using this observation it was shown that ABC MLE in some sense inherits its behaviour from the standard MLE but that the resulting estimator has an innate asymptotic bias. Furthermore, it was shown that this bias can be made arbitrarily small by choosing a sufficiently small value of the ABC parameter $\epsilon$.
The results in (Dean et al., 2010) concerning the asymptotic behaviour of ABC MLE provide a mathematical justification of this method analogous to that provided for the standard MLE by the results concerning asymptotic consistency. However, they do not establish any asymptotic normality type properties of this estimator and there are as yet no analogous results for the ABC Bayesian parameter estimator. The aim of this paper is to bridge these theoretical gaps by showing that the standard results in likelihood based parameter estimation, that is to say asymptotic consistency, asymptotic normality and Bernstein-von Mises type theorems, also hold in a suitably modified version for parameter estimators based on ABC approximations to the likelihood. In the next section we provide an outline of the approach that we shall take to proving these results.
1.1 Contributions and Structure
In this paper we shall study the asymptotic behaviour of ABC parameter estimators when used to perform inference for hidden Markov models (HMMs). This is convenient because, as we will show, the Markovian context imbues the ABC approximations with a particularly nice mathematical structure. Furthermore, as HMMs are used as statistical models in a wide range of applications including Bioinformatics (e.g. (Durbin et al., 1998)), Econometrics (e.g. (Kim et al., 1998)) and Population genetics (e.g. (Felsenstein and Churchill, 1996)) (see also (Cappé et al., 2005) for a recent overview), the class of models thus considered is sufficiently general to be of genuine practical interest.
For the purpose of this paper a HMM will be considered to be a pair of discrete-time stochastic processes, and . The hidden process, , is a homogeneous Markov chain taking values in some Polish space and the observed process takes values in for some . Conditional on the observations are statistically independent of the random variables . In many models the densities of the conditional laws of the observed process w.r.t. the hidden state either have no known analytic expression or else are computationally intractable. In this case it follows that standard methods for estimating the likelihoods of the observed process, e.g. SMC, can no longer be used and that an alternative approach like ABC must be used. For the rest of this paper we shall consider performing ABC based parameter estimation for HMMs using the following specialization of the standard ABC likelihood approximation (1), proposed in (Jasra et al., 2010), for when the observations are generated by a HMM. Specifically, given a sequence of observations from a HMM, we shall approximate the corresponding likelihood functions with the probabilities
where for all , denotes the ball of radius centered around the point . The benefit of this approach is that it retains the Markovian structure of the model. This facilitates both simpler Markov chain Monte Carlo (MCMC) (e.g. (McKinley et al., 2009)) and sequential Monte Carlo (SMC) (e.g. (Jasra et al., 2010)) implementation of the ABC approximation. Furthermore the resulting approximation has a structure which is particularly tractable to mathematical analysis.
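The SMC implementation mentioned above can be sketched as a particle filter in which each particle survives a step only if a pseudo-observation simulated from it lands in the ball around the corresponding data point. This is a minimal illustration under stated assumptions, not the algorithm of (Jasra et al., 2010): the linear Gaussian toy HMM, the function name `abc_smc_loglik` and all parameter values are hypothetical.

```python
import math
import random

def abc_smc_loglik(theta, ys, eps, n_particles, rng):
    """Log of an SMC estimate of P_theta(|Y_1 - y_1| <= eps, ..., |Y_n - y_n| <= eps).

    Assumed toy HMM (not from the paper): X_k = theta * X_{k-1} + V_k and
    Y_k = X_k + W_k, with V_k, W_k independent N(0, 1) and X_0 = 0.
    """
    particles = [0.0] * n_particles
    loglik = 0.0
    for y in ys:
        # Propagate each particle through the hidden dynamics.
        particles = [theta * x + rng.gauss(0.0, 1.0) for x in particles]
        # Simulate one pseudo-observation per particle; keep particles whose
        # pseudo-observation falls inside the eps-ball around the data point.
        alive = [x for x in particles if abs(x + rng.gauss(0.0, 1.0) - y) <= eps]
        if not alive:
            return -math.inf  # every particle missed the ball
        loglik += math.log(len(alive) / n_particles)
        # Multinomial resampling from the surviving particles.
        particles = [rng.choice(alive) for _ in range(n_particles)]
    return loglik

# Usage: data simulated at theta = 0.8, then scored at two parameter values.
rng = random.Random(1)
x, ys = 0.0, []
for _ in range(50):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    ys.append(x + rng.gauss(0.0, 1.0))
ll_true = abc_smc_loglik(0.8, ys, eps=1.0, n_particles=2000, rng=rng)
ll_wrong = abc_smc_loglik(-0.8, ys, eps=1.0, n_particles=2000, rng=rng)
```

Only forward simulation of the hidden state and of the observations is used; the conditional density of the observations is never evaluated, which is the situation the ABC approximation is designed for.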
The purpose of this paper is to show that one can prove results about the asymptotic behaviour of ABC based parameter estimators analogous to the standard results in the literature concerning the asymptotic behaviour of estimators based on the exact value of the likelihood. In particular we show that one can develop a theoretical justification of ABC parameter estimation procedures based on their large sample properties analogous to those provided for Bayesian and maximum likelihood based procedures by the standard Bernstein-von Mises and asymptotic consistency and normality results respectively. Our approach is based on the observation in (Dean et al., 2010) that ABC can be considered as performing parameter estimation using the likelihoods of a collection of perturbed HMMs, which suggests that in some sense ABC based parameter estimators should inherit their behaviour from the standard statistical estimators. We first show that unlike the MLE, which is asymptotically consistent, the ABC MLE has an innate asymptotic bias in the sense that the value of the estimator converges to the wrong point in parameter space as the number of observations tends to infinity. Moreover we show that asymptotically the ABC MLE is normally distributed around this biased estimate. Secondly we show that the resulting ABC Bayesian posterior distributions obey a Bernstein-von Mises type theorem but that the posteriors are again asymptotically biased in the sense that as the number of data points goes to infinity the resulting posterior distributions concentrate about the limit of the ABC MLE rather than the true parameter value. Finally we show that the size of the asymptotic bias of both the ABC Bayesian and ABC MLE estimators goes to zero as $\epsilon$ tends to zero and under mild regularity conditions we obtain sharp rates for this convergence.
Together these results show that ABC based parameter estimates are asymptotically biased with a bias which can be made arbitrarily small by a suitable choice of $\epsilon$, and thus provide a rigorous justification for performing statistical inference based on ABC approximations to the likelihood.
We note that the results in this paper extend those in (Dean et al., 2010) in several ways. In particular we provide a much sharper analysis of the ABC MLE than that contained in (Dean et al., 2010). The crucial difference between the current paper and (Dean et al., 2010) is that it is not possible using the techniques of (Dean et al., 2010) to show that the ABC MLE has a unique limit point. In contrast, in this paper we show that for sufficiently small values of $\epsilon$ the ABC MLE has one and only one limit point. This then enables us to extend the scope of the analysis in (Dean et al., 2010) to include asymptotic normality results for the ABC MLE and Bernstein-von Mises type results for ABC based Bayesian estimators.
This paper is structured as follows. In Section 2 the notation and assumptions are given and in Section 3 we present our main results concerning the asymptotic behaviour of ABC. The article is summarized in Section 4 and supporting technical lemmas and proofs of some of the theoretical results are housed in the four appendices.
2 Notation and Assumptions
2.1 Notation and Main Assumptions
Throughout this paper we shall use lower case letters to denote dummy variables and upper case letters to denote random variables. Observations of a random variable, i.e. data, will be denoted by . Given any and we shall let denote the closed ball of radius centered on the point and let denote the uniform distribution on . For any the indicator function of will be denoted by .
In what follows we need to refer to various different scalar, vector and matrix norms. Given a scalar and a vector we shall let and denote the standard Euclidean scalar and vector norms respectively and for any matrix we shall let denote the Frobenius norm. We note that although using to denote multiple norms is an abuse of notation there is in practice no loss of clarity as the precise meaning of these terms will always be made clear by the context in which they are used.
For any vector of variables we shall let denote the gradient operator with respect to . Moreover given vectors of variables of dimensions and we shall let and denote the and matrices of partial derivatives with entries given by and respectively. Further, for any vector of variables we shall let and denote and respectively. Further given vectors we shall let and denote the outer products of and and and denote the outer products and respectively.
It is assumed that for any HMM the hidden state is time-homogeneous and takes values in a compact Polish space with associated Borel -field . Throughout this paper it will be assumed that we have a collection of HMMs all defined on the same state space and parametrised by some parameter vector taking values in a connected compact set . Furthermore we shall reserve to denote the ‘true’ value of the parameter vector . For each we shall let denote the transition kernel of the corresponding Markov chain and for each and we assume that has a density w.r.t. some common finite dominating measure on . The initial distribution of the hidden state will be denoted by .
We also assume that the observations take values in a state space for some . Furthermore, for each we assume that the random variable is conditionally independent of and given and that the conditional laws have densities w.r.t. some common -finite dominating measure . We further assume that for every the joint chain is positive Harris recurrent and has a unique invariant distribution . For each we shall let denote the law of stationary distribution of the corresponding HMM and denote expectations with respect to the stationary distribution .
We shall frequently have to refer to various kinds of finite, infinite and doubly infinite sequences. For brevity the following shorthand notations are used. For any pair of integers , denotes the sequence of random variables ; denotes the sequence ; denotes the sequence and denotes the sequence . Further given a measure on a Polish space we let denote integration w.r.t. the n-fold product measure on the n-fold product space .
For any two probability measures on a measurable space we let denote the total variation distance between them. For all we let denote the set of real valued measurable functions satisfying .
Finally we note that when writing the likelihood of a sequence of observations we shall typically suppress the dependence of the likelihood function on the initial condition of the hidden state of the process unless we specifically need to refer to it, in which case we shall write the likelihood as .
2.2 Particular Assumptions
In addition to the assumptions above, the following particular assumptions are made at various points in the article.
The parameter vector belongs to the interior of and if and only if .
For all , , the mappings and are three times continuously differentiable w.r.t. .
There exist constants such that for every , ,
There exists a constant such that for every , ,
for all .
Assumptions (A1)-(A6) are similar to those used in (Douc et al., 2004) to prove consistency of the MLE for HMMs. We use similar assumptions in this paper as, broadly speaking, our approach will be to show that the ABC parameter estimators inherit their properties from standard statistical estimators. However the methods and emphasis of this paper differ from those in (Douc et al., 2004) and as a result the assumptions we require have a slightly different flavour. In particular we shall require slightly stronger conditions on the differentiability of the conditional densities but slightly weaker conditions on their integrability.
In general assumptions (A3)-(A6) will hold when the state space is compact. However we expect that the behaviours predicted by Theorems 2, 3, 4 and 5 will provide a good qualitative guide to the behaviour of ABC MLE in practice even in cases where the underlying HMMs do not satisfy these assumptions.
3 Approximate Bayesian Computation
3.1 Structure of ABC Estimators
Suppose that a collection of HMMs
parameterised by some are given. For any sequence of observations for let denote the likelihood of the observations under the corresponding HMM (6). Following (Jasra et al., 2010) we consider approximating by the ABC approximation,
The purpose of this paper is to analyse the asymptotic properties of likelihood based parameter estimators implemented using the ABC approximate likelihoods (7). The key to our analysis is the following observation (see (Dean et al., 2010) for more details):
The crucial point is that the quantity defined in (9) is the density of the measure obtained by convolving the measure corresponding to with where the density is taken w.r.t. the new dominating measure obtained by convolving with . One can then immediately see that the quantities and appearing in (8) are the transition kernels and conditional laws respectively for a perturbed HMM defined such that it is equal in law to the process
where is the original HMM and the are an i.i.d. sequence of distributed random variables.
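The identification of the ABC approximation with the likelihood of a perturbed model can be checked numerically in one dimension. The sketch below assumes, for illustration only, an observation law Y ~ N(0, 1) and a perturbation Z uniform on [-1, 1]; the names `abc_probability` and `perturbed_density_mc` are hypothetical. It compares the ABC probability divided by the ball volume with a Monte Carlo estimate of the density of the perturbed variable Y + eps * Z.

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def abc_probability(y, eps):
    """Exact P(|Y - y| <= eps) for Y ~ N(0, 1) (assumed toy observation law)."""
    return normal_cdf(y + eps) - normal_cdf(y - eps)

def perturbed_density_mc(y, eps, h, n_sims, rng):
    """Monte Carlo density estimate of Y + eps * Z at y, with Z ~ Uniform[-1, 1].

    The density is approximated by the fraction of draws within h of y,
    divided by the window width 2 * h.
    """
    hits = 0
    for _ in range(n_sims):
        draw = rng.gauss(0.0, 1.0) + eps * rng.uniform(-1.0, 1.0)
        hits += abs(draw - y) <= h
    return hits / (n_sims * 2.0 * h)

rng = random.Random(2)
y, eps = 0.5, 0.4
lhs = abc_probability(y, eps) / (2.0 * eps)  # ABC likelihood / ball volume
rhs = perturbed_density_mc(y, eps, h=0.05, n_sims=400_000, rng=rng)
```

Up to Monte Carlo error, `lhs` and `rhs` agree, which illustrates why inference with the ABC likelihood behaves like inference with a misspecified (perturbed) model.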
3.2 Theoretical Results
It follows that performing statistical inference using the ABC approximations to the likelihood is equivalent to performing inference using a misspecified collection of models. It is well known (see for example (White, 1982)) that this will in general lead to biased estimates of the true parameter value. In the rest of this paper we shall investigate the theoretical consequences of this for ABC based parameter estimators.
We start by showing that almost surely the ABC MLE will converge, with increasing sample size, to a given point in parameter space that is not equal to the true parameter value (more generally the set of accumulation points will belong to a given subset of parameter space) and hence that the ABC MLE is asymptotically biased (Theorem 2). Further, we show that these accumulation points must lie in some neighbourhood of the true parameter value and that the size of this neighbourhood shrinks to zero as $\epsilon$ goes to zero. Next we show that for sufficiently small values of $\epsilon$ the ABC MLE has a unique limit point and that asymptotically the ABC MLE is normally distributed about this point with a variance that is proportional to $1/n$, where $n$ is the sample size (Theorem 3). Third we show that asymptotically the ABC Bayesian posterior converges to that of a Normal random variable, centered on the location of the ABC MLE and with variance again proportional to $1/n$ (Theorem 4). Finally we show that under certain Lipschitz conditions one can obtain a rate for the decrease in the size of the asymptotic bias of the ABC parameter estimators (Theorem 5).
These results show that the error of ABC based parameter estimators may be decomposed into two parts: a bias component whose size depends on $\epsilon$ and a variance component whose size is proportional to $1/n$, where $n$ is the sample size. Furthermore they show that the size of the bias can be made arbitrarily small by a suitable choice of $\epsilon$. Thus taken together the results show that the accuracy of estimators based on ABC approximations to the likelihood can be made to be arbitrarily close to that of estimators based on the exact value of the likelihood, providing a rigorous mathematical justification for the ABC methodology.
We note that there are two important technical issues that arise in the proofs of these results. Firstly, as noted in (Dean et al., 2010), one cannot simply analyse the behaviour of the ABC MLE by extending the parameter space to include and then applying standard results from the theory of MLE because the perturbed likelihoods are in some sense insufficiently continuous. Instead one has to establish that in some sense the Lebesgue differentiation theorem still holds upon taking asymptotic limits.
Secondly we note that because the dominating measures of the original and perturbed HMMs are no longer necessarily mutually absolutely continuous with respect to each other we can no longer take the standard approach to analysing likelihood based estimators by studying the limits of
and interpreting them in terms of Kullback-Leibler distances. To avoid this problem we instead show that for any the relative mean log likelihood surfaces (considered as functions of )
almost surely converge to some limiting surface . The behaviour of ABC based parameter estimators can then be understood by examining the behaviour of the corresponding limiting log likelihood surfaces. The key result in doing so is the following whose proof is deferred until Appendix B.
Theorem 1. Suppose that one has a collection of HMMs parameterized by some parameter vector that satisfy assumptions (A1)-(A6). For any let denote the likelihood function w.r.t. the perturbed HMMs (10) (and where by definition we let denote the likelihood function of the original HMM (6)). Let data generated by the HMM corresponding to an unknown parameter vector be given. Then for every there exists a twice continuously differentiable function such that for all one has that a.s.
uniformly in .
Furthermore as , where the convergence is again uniform in .
We can now use Theorem 1 to analyse ABC based parameter estimators by comparing their asymptotic behaviour (encapsulated in the surfaces ) to the asymptotic behaviour of estimators based on using the true value of the likelihood (which is encapsulated in the surface ). We shall start by analysing the behaviour of the ABC MLE which we formally define below.
Procedure 1 (ABC MLE).
Given and data , estimate with
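Procedure 1 and its asymptotic bias can be illustrated on a toy model in which the ABC likelihood is available in closed form. The sketch assumes i.i.d. observations Y ~ N(0, theta^2), a degenerate special case chosen for illustration rather than one of the paper's HMMs, and maximises the ABC log likelihood over a grid; `ball_probability`, `abc_mle` and the grid are hypothetical names and choices.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def ball_probability(y, eps, theta):
    """P(|Y - y| <= eps) for Y ~ N(0, theta^2), computed stably in the tails."""
    a, b = (y - eps) / theta, (y + eps) / theta
    if a > 0:      # interval lies in the right tail
        p = 0.5 * (math.erfc(a / SQRT2) - math.erfc(b / SQRT2))
    elif b < 0:    # interval lies in the left tail
        p = 0.5 * (math.erfc(-b / SQRT2) - math.erfc(-a / SQRT2))
    else:          # interval straddles zero
        p = 0.5 * (math.erf(b / SQRT2) - math.erf(a / SQRT2))
    return max(p, 1e-300)  # floor keeps the logarithm finite

def abc_mle(ys, eps, grid):
    """Grid maximiser of the ABC log likelihood sum_k log P_theta(|Y - y_k| <= eps)."""
    def loglik(theta):
        return sum(math.log(ball_probability(y, eps, theta)) for y in ys)
    return max(grid, key=loglik)

rng = random.Random(3)
theta_star = 1.0
ys = [rng.gauss(0.0, theta_star) for _ in range(2000)]
grid = [0.5 + 0.01 * i for i in range(101)]  # 0.50, 0.51, ..., 1.50
est_small = abc_mle(ys, 0.05, grid)  # small tolerance
est_large = abc_mle(ys, 1.0, grid)   # large tolerance
```

With the small tolerance the estimate sits close to `theta_star`, while the large tolerance pulls it towards a smaller value: the perturbed model has inflated variance, so a smaller scale parameter fits the data best.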
Using Theorem 1 we can now establish the following biased asymptotic consistency and normality type properties of the ABC MLE whose proofs are deferred to Appendix C.
Theorem 2. Suppose that one has a collection of HMMs parameterized by some parameter vector that satisfy assumptions (A1)-(A6). Let data generated by the HMM corresponding to an unknown parameter vector be given and suppose that we use the ABC MLE to estimate the value of . Then for every there exists a collection of sets such that for all initial conditions the set of accumulation points of the ABC MLE lies a.s. in and
Furthermore let be as in Theorem 1. If is strictly negative definite then for sufficiently small values of the set consists of a singleton .
Remark 3. The quantity is equal to the asymptotic Fisher information of the HMM. For more details see (Douc et al., 2004).
Theorem 3. Suppose that one has a collection of HMMs parameterized by some parameter vector that satisfy assumptions (A1)-(A6) and that is strictly negative definite where is as in Theorem 1. Let data generated by the HMM corresponding to an unknown parameter vector be given and suppose that we use the ABC MLE to estimate the value of . Then for sufficiently small values of there exist strictly positive definite matrices such that a.s.
Furthermore as where is as in Remark 3.
Next we consider the properties of the ABC Bayesian parameter estimator which we define below.
Procedure 2 (ABC Bayesian Estimator).
Given a prior distribution and data estimate via the ABC posterior
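Procedure 2 can be sketched on a parameter grid by combining a prior with the ABC likelihood. As before, the i.i.d. model Y ~ N(0, theta^2) with true scale 1, the flat prior and the names `ball_probability` and `abc_posterior` are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def ball_probability(y, eps, theta):
    """P(|Y - y| <= eps) for Y ~ N(0, theta^2); floored to keep the log finite."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / (theta * math.sqrt(2.0))))
    return max(cdf(y + eps) - cdf(y - eps), 1e-300)

def abc_posterior(ys, eps, grid, prior_density):
    """Normalised ABC posterior weights on a parameter grid."""
    log_post = [
        math.log(prior_density(t))
        + sum(math.log(ball_probability(y, eps, t)) for y in ys)
        for t in grid
    ]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(4)
ys = [rng.gauss(0.0, 1.0) for _ in range(2000)]
grid = [0.5 + 0.01 * i for i in range(101)]  # 0.50, 0.51, ..., 1.50
posterior = abc_posterior(ys, eps=1.0, grid=grid, prior_density=lambda t: 1.0)
post_mean = sum(t * w for t, w in zip(grid, posterior))
post_mode = grid[posterior.index(max(posterior))]
```

In this toy setting the posterior mass concentrates near the grid maximiser of the ABC likelihood rather than near the true scale, in line with the biased Bernstein-von Mises behaviour described in this section.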
Theorem 4. Suppose that the assumptions of Theorem 3 hold and that one tries to infer the true value of using the ABC approximate Bayesian posterior (15). Suppose further that the prior distribution has a continuous density w.r.t. Lebesgue measure. Then for sufficiently small values of one has that a.s.
where is as in Theorem 3.
3.3 Asymptotic Rates of Convergence
Theorems 2, 3 and 4 show that asymptotically ABC based parameter estimators concentrate around a point and thus that the asymptotic bias will be of order . It is natural to ask at what rate does as . We begin our answer to this question with the following example.
Let be the distribution on the set of dyadic numbers of the form given by for all and let be the distribution on the set of dyadic numbers of the form given by for all . Furthermore let be the set of distributions defined such that for all , .
It is clear that the distributions satisfy the conditions of Theorem 1 and hence that for any the limiting approximate mean log likelihood surface exists and is well defined. Further if we assume that the true value of the parameter is equal to then it is easy to show that and that for all that from which it follows that
The above example shows that in the general case one should expect that the size of the asymptotic bias will be at least . The next theorem shows that the behaviour of the asymptotic bias will be no worse than this. In order for it to hold we need to make the following Lipschitz assumptions.
(A7) There exists some such that for all .
Theorem 5. Suppose that in addition to all of the assumptions of Theorem 4 one has that assumption (A7) above also holds. Then
Moreover, if the dominating measure is Lebesgue measure then one can show, under slightly stronger Lipschitz assumptions, that the asymptotic error in the ABC parameter estimate is of order .
(A8) There exists some such that for all .
Theorem 6. Suppose that is Lebesgue measure and that in addition to all of the assumptions of Theorem 5 one has that assumption (A8) above also holds. Then
The proofs of Theorems 5 and 6 are deferred to Appendix D. Finally we note that in the case that is Lebesgue measure we have from Theorems 3 and 4 that the variance of ABC based estimators is of order while their bias is of order . It follows that (at least in theory) it is optimal to scale as as goes to infinity. Intriguingly this is the same rate as the optimal bandwidth in kernel density estimation (see for example (Wand and Jones, 1995)). This suggests an alternative interpretation of ABC as approximating the likelihood via a kind of kernel density based estimate.
In this paper we have shown that the framework developed in (Dean et al., 2010) to analyse the behaviour of the ABC MLE can be extended to provide a rigorous analysis of the behaviour of ABC based estimators in both the Bayesian and frequentist contexts. In particular we have shown that ABC based parameter estimators satisfy results analogous to the asymptotic consistency, asymptotic normality and Bernstein-von Mises theorems for standard parameter estimators but that the ABC estimators are asymptotically biased. Furthermore we have shown that this asymptotic bias can be made arbitrarily small by choosing a sufficiently small value of the parameter $\epsilon$. Together these theoretical results help to solidify and extend existing intuition and provide a rigorous theoretical justification for ABC based parameter estimation procedures.
Appendix A: Auxiliary Results
Let a connected compact set and some constant be given. Suppose that there exists a continuous function and sequence of continuous functions , , such that for all the function is Lipschitz- continuous. Then uniformly in if and only if pointwise on a countable dense subset of .
Let a connected compact set be given and suppose that there exists a continuous function and sequence of continuously differentiable functions , , such that uniformly in and is Cauchy for some . Then there exists a uniformly bounded and continuously differentiable function such that uniformly in and .
Suppose that one has a collection of HMMs parameterised by some vectors that satisfy assumption (A2). Furthermore suppose that one has a HMM , defined on the same state spaces as the parameterised collection of HMMs, which satisfies assumption (A2) with the same values of and .
Given measurable functions and , and define the following functions of the HMM
and for any define the random variables , , and by
Then there exist measurable random variables , , and and constants and which depend only on and such that for any initial condition on the collection of parameterised HMMs
where for all
denotes expectation w.r.t. the law and stationary law respectively of the process .
Suppose that the assumptions of Lemma 3 all hold. Then there exist constants and such that for any initial condition on the collection of parameterised HMMs
Let the same assumptions and notation as Lemma 3 be given. Then there exist constants and such that for any
where denotes conditional expectation w.r.t. the law of the process .
The last Lemma is a statement of the Fisher identity and the Louis missing information principle (see for example (Douc et al., 2004)) plus an extension of these results to third order derivatives of the log likelihood function. Given assumptions (A2)-(A6) it follows from a simple application of the dominated convergence theorem.
Suppose that assumptions (A2)-(A6) hold for a collection of HMMs parametrised by some vector where for each we let and denote the densities of the conditional law and transition kernel of the corresponding HMM. For any let denote the density of the conditional law of the corresponding perturbed HMM (10). By convention we let .
Appendix B: Proof of Theorem 1
Theorem 1 is an immediate corollary of the following three lemmas.
Suppose that assumptions (A1)-(A6) hold for a collection of HMMs parametrised by some vector . Then for any there exists a twice continuously differentiable function such that