Adaptive approximate Bayesian computation
Sequential techniques can enhance the efficiency of the approximate Bayesian computation algorithm, as in Sisson et al.’s (sisson:fan:tanaka:2007) partial rejection control version. While this method is based upon the theoretical works of delmoral:doucet:jasra:2006, the application to approximate Bayesian computation results in a bias in the approximation to the posterior. An alternative version based on genuine importance sampling arguments bypasses this difficulty, in connection with the population Monte Carlo method of cappe:guillin:marin:robert:2003, and it includes an automatic scaling of the forward kernel. When applied to a population genetics example, it compares favourably with two other versions of the approximate algorithm.
Some key words: Markov chain Monte Carlo; partial rejection control; importance sampling; sequential Monte Carlo.
When the likelihood function is not available in a closed form, as in population genetics, pritchard:seielstad:perez:feldman:1999 introduced approximate Bayesian computational methods as a rejection technique bypassing the computation of the likelihood function via a simulation from the corresponding distribution. Namely, if we observe and if is the prior distribution on the parameter , then the original approximate Bayesian computation algorithm jointly simulates and , and accepts the simulated if, and only if, the auxiliary variable is equal to the observed value, . This algorithm is exact in that the accepted ’s are distributed from the posterior. In the more standard occurrence when is a continuous random variable, the approximate Bayesian computation algorithm relies on an approximation, where the equality is replaced with a tolerance condition, , being a measure of discrepancy, for instance a distance between summary statistics, and being the tolerance bound. The output is then distributed from the distribution with density proportional to , where represents the probability distribution of indexed by a value of . This density is denoted by , where the conditioning corresponds to the marginal distribution of given . Improvements to this general scheme have currently been achieved either by modifying the proposal distribution of the parameter to increase the density of ’s within a neighbourhood of (marjoram:etal:2003; bortot:coles:sisson:2007) or by viewing the problem as a conditional density estimation and developing techniques to allow for larger (beaumont:zhang:balding:2002).
marjoram:etal:2003 defined a Markov chain Monte Carlo version of the approximate Bayesian computation algorithm that enjoys the same validity as the original algorithm, namely that, if a Markov chain is created via the transition that sets equal to a new value only if is such that and if , then its stationary distribution is the posterior distribution . Since, in most settings, the distribution of is absolutely continuous, the above constraint is replaced with the approximation .
For the toy model studied in sisson:fan:tanaka:2007, where and , the posterior distribution associated with the observation is the normal mixture
restricted to the set . As in regular Markov chain Monte Carlo settings, the performance of Marjoram et al.’s (2003) algorithm depends on the choice of the scale in the random walk proposal, , where is most often a standardised normal or a density. However, even when as in sisson:fan:tanaka:2007, the Markov chain mixes slowly, but still produces an acceptable fit over iterations. Furthermore, the true target is available here as
and, for the value , it is indistinguishable from the exact posterior density . We can thus clearly separate the issue of poor convergence of the algorithm, depending on , from the issue of approximating the posterior density, depending on .
sisson:fan:tanaka:2007 modified the approximate Bayesian computation algorithm using partial rejection control, introduced in liu:2001. The method is sequential in that simulated populations of points are generated at each iteration of the algorithm and that those populations are exploited to produce better proposals for a given target distribution in later iterations. As demonstrated in douc:guillin:marin:robert:2005, the reliance on earlier populations to build proposals preserves convergence properties provided an importance sampling perspective is adopted. We also recall that a progressive improvement in the performance of proposals is the appeal of using a sequence of samples, rather than a single one. The sequential feature of the problem is augmented by the fact that the tolerance is decreasing with iterations, being chosen either from a deterministic scale or from quantiles of earlier iterations as in beaumont:zhang:balding:2002.
Sisson et al.’s (2007) algorithm produces samples by using, at iteration , a regular approximate Bayesian computation step and, at each iteration , Markov transition kernels for the generation of the ’s, namely until is such that , where is selected at random among the previous ’s with probabilities . The probability is derived by an importance sampling argument,
where is an arbitrary transition kernel. In their examples, sisson:fan:tanaka:2007 use , which means that the weights are all equal under a uniform prior. The ratio (1) is inspired from delmoral:doucet:jasra:2006 who use a sequence of backward kernels in a sequential Monte Carlo algorithm to achieve unbiasedness, up to the renormalisation effect, in the marginal distribution of the current value without computing this intractable marginal.
In this paper, we show via both theoretical and experimental arguments that the weight (1) is biased. Moreover, we introduce a population Monte Carlo correction to the algorithm. It is based on genuine importance sampling arguments and we demonstrate its applicability as well as the improvement it brings compared with the partial rejection control version. We also illustrate its efficiency in a population genetics example, when compared with two standard alternatives.
In order to establish the bias in the weight (1) as clearly as possible, we consider the limiting case when . In that case, both Pritchard et al.’s (1999) and Marjoram et al.’s (2003) algorithms are correct samplers from . The corresponding partial rejection control version selects a from the previous sample and then generates both and until . To properly evaluate the bias associated with one step of Sisson et al.’s (2007) algorithm, we also assume that the previous sample is correctly generated from the target, i.e. that . Then, denoting by the selected , the joint density of the accepted pair is proportional to where the normalisation constant only depends on . Using (1), the weighted distribution of is such that, for an arbitrary integrable function , is proportional to
with all proportionality terms being functions of only. Therefore we can conclude that there is a bias in the weight (1) unless the inner function integrates in to the same constant for all values of . Apart from this special case, which is achievable when but not in the random walk type proposal when , (1) is incorrect since the weighted output is not distributed from .
Paradoxically, the weight (1) used in this partial rejection control version misses a term in its denominator, while the method is used when is not available. This is exactly the difference between the weights used in sisson:fan:tanaka:2007 and those used in delmoral:doucet:jasra:2006, namely that, in the latter paper, the posterior explicitly appears in the denominator instead of the prior. The accept-reject principle at the core of approximate Bayesian computation allows for the replacement of the posterior by the prior in the numerator of the ratio (1), but not in the denominator.
When using the standard version of the algorithm that includes an additional approximation due to the tolerance zone , there is no reason for the bias to vanish, even though experiments show that the bias in the weights (1) for the tolerance target generally decreases as increases, which agrees with the fact that the limiting case is the prior. The following example illustrates this point:
For the mixture target and a normal random walk kernel , the first row of Figure 1 shows the output of five consecutive iterations of Sisson et al.’s (2007) algorithm, using a decreasing sequence of ’s, from down to , and a standard deviation in equal to . Clearly, using this algorithm leads to a bias in terms of the tolerance target for all values of , the tails being poorly covered. In contrast, using a much larger results in a good fit of the tolerance target, as shown by Figure 2 in sisson:fan:tanaka:2007. This fit does not contradict the bias exhibited above since using a large scale in the proposal very closely amounts to using a flat prior distribution. The corresponding normal density is then almost constant in the interval that supports the posterior distribution. For , using the weight (1) is thus equivalent to Pritchard et al.’s (1999) original algorithm and therefore as inefficient for this target.
Since the missing factor in (1) is the unknown likelihood , a first resolution of the problem is to resort to an estimation of the likelihood based on earlier samples. But a standard importance sampling perspective allows for a more direct approach, in a spirit similar to the population Monte Carlo algorithm of cappe:guillin:marin:robert:2003.
Since the -th iteration sample is produced from the proposal distribution
a natural importance weight associated with an accepted simulation is
The unbiasedness of this correction results from the identity
which does not depend on the distribution of the ’s. As in the original population Monte Carlo method, the fact that may depend on simulations from earlier iterations does not jeopardise the validity of the method. In addition, douc:guillin:marin:robert:2005 proved that must be modified at each iteration for the iterations to bring an asymptotic improvement on the Kullback–Leibler divergence between the proposal and the target: for instance, if the variance of the random walk does not change from one iteration to the next, the approximation of the target by does not change either and it is then more efficient to run a single iteration with twice as many points. Since is an approximation to the distribution of the sample simulated at iteration , when marginalised against all previous samples, this version of the approximate Bayesian computation algorithm can be interpreted as an approximate version of the sequential Monte Carlo algorithm of delmoral:doucet:jasra:2006, when using the optimal backward kernel.
When considering component-wise independent random walk proposals,
for each component of the parameter vector , the asymptotically optimal choice of the scale factor is available. Indeed, when using a Kullback–Leibler measure of divergence between the target and the proposal,
where the expectation is taken under the product distribution the minimisation of the Kullback divergence leads to the component-wise maximisation of under the product distribution . The optimal scale is then equal to , that is, under the posterior distribution. The implementation of this updating scheme on the scale is straightforward.
The algorithmic rendering of the corresponding optimised population Monte Carlo scheme is thus as follows:
Population Monte Carlo approximate Bayesian computation:
Given a decreasing sequence of tolerance thresholds ,
At iteration ,
Simulate and until
Take as twice the empirical variance of the ’s
At iteration ,
For , repeat
Pick from the ’s with probabilities
Take as twice the weighted empirical variance of the ’s
The expression of the importance weight involves a sum of terms and, therefore, the computational cost of this step of the population Monte Carlo approximate Bayesian computation algorithm is in . However, in realistic applications of approximate Bayesian computations, the main cost is associated with the previous step, namely the repeated simulations from the sampling density. For instance, in the population genetics example, the algorithm spends at least of the computing time in the repeat loop and less than of the time on the remaining computations.
For the mixure target, using the population Monte Carlo version leads to a recovery of the target, whether using a fixed standard deviation as shown on the second row of Figure 1, or , or a sequence of adaptive ’s as in the above algorithm. The graphical difference between these implementations is genuinely difficult to spot; the estimated variance stabilises very quickly in this toy example.
This example considers a simple evolutionary scenario of two populations having diverged from a common ancestral population. Data consists of the genotypes at five microsatellite loci of 50 diploid individuals from each of the populations. Loci are assumed to evolve according to the strict stepwise mutation model: when a mutation occurs, the number of repeats of the mutated gene increases or decreases by one unit with equal probability. Once diverged, populations do not exchange gene and there is no migration. The natural parameters of this model are the three effective population sizes, , and , the time of divergence () and the mutation rate () assumed here to be common to all loci. While the analyses will be performed with these natural parameters, only identifiable combinations of those, such as the three parameters , and , and the parameter , will be considered in the output.
In this experiment, several simulated datasets have been produced using the software developed by cornuet:santos:beaumont:etal:2008, with the following parameter values: , , and , out of which one is presented in this paper, but identical conclusions were drawn from the other datasets.
The tolerance region used in the approximate Bayesian computation schemes is based on twelve summary statistics as in Cornuet et al. (2008), namely mean number of alleles, mean genic diversity, mean size variance and mean Garza-Williamson index for each population sample, and distances between population samples, and mean probability of assignment of each sample to the other population. The distance is then chosen as the Euclidean distance between the observed and the simulated summary statistics, normalised by their standard deviation under the predictive model. The derivation of the distance thus requires a preliminary evaluation of those standard deviations for each summary statistic by simulation.
Each dataset has been submitted to three parallel analyses: a standard approximate Bayesian computation analysis following beaumont:zhang:balding:2002, performed via Cornuet et al.’s (2008) software, a tempered Markov chain Monte Carlo analysis as described in bortot:coles:sisson:2007, and a population Monte Carlo analysis. In each case, the same prior distributions have been used, consisting in a prior on the three effective sizes, a prior on the time of divergence, and a prior on the mutation rate. In addition, since Beaumont et al.’s (2002) algorithm includes a final local regression adjustment, this adjustment has been performed on the output of both alternative algorithms.
In order to operate a fair comparison between the three methods, the computing times are kept identical. Thus, reference posterior distributions have been constructed with Beaumont et al.’s (2002) approach, using simulated datasets and choosing as the quantile of the distances. For Bortot et al.’s (2007) approach, the tuning parameter driving the acceptance threshold is drawn from an exponential prior with parameter equal to the mean distance of the ten closest datasets (among generated in a pilot simulation), followed by regular iterations with a thinning factor of . Each Markov move combined two simultaneous updates : an independent draw of from its exponential prior and a random walk based on a log-normal deviate with standard deviation equal to of one of the other parameters chosen uniformly at random, the latter involving the computation of a Hastings term. Lastly, in the population Monte Carlo version, is based on the preliminary simulation as the quantile and four iterations are performed with , and , and with truncated normals in the random walk. Each iteration is based on samples of size . All three versions then require an average minutes. Figure 2 summarises the output of the comparison between the three methods over five independent runs for the same simulated dataset and it shows that, at least in this example, the posterior evaluations based on population Monte Carlo are more stable than Bortot et al.’s (2007) algorithm, while comparable with Beaumont et al.’s (2002).
While Sisson et al.’s (2007) algorithm induces biased weights, with a visible impact on the quality of the approximation, we have shown that the same Markov transition kernels and thus the same computing power can be used to produce an unbiased scheme.
The population Monte Carlo scheme is based on an importance argument that does not require a backward kernel as in (1). We have thus established that the adaptive scheme of cappe:douc:guillin:marin:robert:2007 is also appropriate in this setting, towards a better fit of the proposal kernel to the target . From a practical point of view, the number of iterations can be controlled via the modifications in the parameters of , a stopping rule being that the iterations should stop when those parameters have settled, while the more fundamental issue of selecting a sequence of ’s towards a proper approximation of the true posterior can rely on the stabilisation of the estimators of some quantities of interest associated with this posterior.
Acknowledgements This research is partly supported by the Agence Nationale de la Recherche through the Misgepop project. Both last authors are affiliated with CREST, Paris. Parts of this paper were written by the last author in the Isaac Newton Institute, Cambridge, whose peaceful and stimulating environment was deeply appreciated. Helpful comments from the editorial board of Biometrika and from O. Cappé are gratefully acknowledged.