Efficient Bayesian Synthetic Likelihood withWhitening Transformations

# Efficient Bayesian Synthetic Likelihood with Whitening Transformations

## Abstract

Likelihood-free methods are an established approach for performing approximate Bayesian inference for models with intractable likelihood functions. However, they can be computationally demanding. Bayesian synthetic likelihood (BSL) is a popular such method that approximates the likelihood function of the summary statistic with a known, tractable distribution – typically Gaussian – and then performs statistical inference using standard likelihood-based techniques. However, as the number of summary statistics grows, the number of model simulations required to accurately estimate the covariance matrix for this likelihood rapidly increases. This poses significant challenge for the application of BSL, especially in cases where model simulation is expensive. In this article we propose whitening BSL (wBSL) – an efficient BSL method that uses approximate whitening transformations to decorrelate the summary statistics at each algorithm iteration. We show empirically that this can reduce the number of model simulations required to implement BSL by more than an order of magnitude, without much loss of accuracy. We explore a range of whitening procedures and demonstrate the performance of wBSL on a range of simulated and real modelling scenarios from ecology and biology.

Keywords: Approximate Bayesian computation; covariance matrix estimation; likelihood-free inference; Markov chain Monte Carlo; shrinkage estimation.

## 1 Introduction

Likelihood-free methods have become a well established tool over the past two decades for performing statistical inference in the presence of computationally intractable likelihood functions. Such intractability can arise through a desire to fit realistically complex models, or through the shear size of a dataset, rendering the straightforward application of standard likelihood-based procedures practically infeasible. One popular and well studied likelihood-free approach is approximate Bayesian computation (ABC) (Sisson et al., 2018a). ABC methods operate by repeated simulation of data under the model of interest, and then comparing observed and simulated data on the basis of summary statistics of these data under some kernel function. ABC methods are known to scale poorly to high-dimensional problems (Prangle, 2018; Nott et al., 2018).

Recently, Bayesian synthetic likelihood (BSL) (Price et al., 2018) has been gaining popularity as an alternative method to ABC for likelihood-free inference. BSL is the Bayesian extension of the synthetic likelihood approach of Wood (2010), which approximates the unknown likelihood function of the summary statistics with a known, tractable distribution, typically Gaussian. Compared to the non-parametric estimate of the likelihood function that is implied by ABC methods (Blum, 2010; Sisson et al., 2018b), by making a parametric assumption, BSL is able to scale better than ABC to high dimensional problems (in both summary statistics and model parameters; Ong et al., 2018a; Nott et al., 2018), and makes the usual ABC trade-off between the dimensionality and informativeness of the summary statistics much easier. Frazier et al. (2019) show that an importance sampling BSL algorithm with the posterior as a proposal distribution is more computationally efficient than the corresponding ABC algorithm.

Despite the relative advantages and efficiencies of BSL, and recent work in this area (e.g. Frazier et al., 2019; An et al., 2019c; Ong et al., 2018b) there remain some key inefficiencies in the method. Most prominently, for a Gaussian synthetic likelihood the unknown mean and covariance matrix must be estimated by simulation for every proposed parameter within any inference algorithm. This is especially problematic when the dimension of the summary statistics is high, as a large number of model simulations are then required to produce an accurate estimate of the covariance matrix, or when simulation from the model itself is expensive.

A number of efficient covariance matrix estimation techniques have been considered to reduce the needed number of model simulations in BSL. An et al. (2019c) use the graphical lasso to provide a sparse estimate of the precision matrix. However, performance is inhibited when there is a low degree of sparsity in the covariance or inverse covariance matrix. Ong et al. (2018a) and Frazier et al. (2019) consider shrinkage estimation to shrink the off-diagonal elements of the correlation matrix by a factor and leave the estimated variances (i.e. the diagonals of the covariance matrix) unadjusted. However, in a number of empirical examples when there is significant correlation between summaries, these estimators result in poor BSL posterior approximations – in particular, recovering the wrong dependence structure between parameters and over- or under-estimates of variances. Frazier et al. (2019) deliberately mis-specify the form of the covariance matrix (as diagonal or taking a factor form) to allow more shrinkage to be applied, and then use asymptotic results to correct the resulting posterior variances post-hoc. Everitt (2017) consider an alternative method to reduce the number of model simulations in a bootstrapped version of synthetic likelihood.

In this article we consider the application of whitening transformations within BSL. Whitening is a linear transformation that maps a set of random variables into a new set of variables with an identity covariance matrix. In the context of BSL, we perform an approximate whitening transformation of the set of simulated summary statistics at each algorithm iteration. The transformation requires a whitening matrix which is based on a point estimate of the parameter that is supplied by the user (following e.g. Luciani et al., 2009). The whitening transformation can be effective in decorrelating the summary statistics across important parts of the parameter space. In addition, because the resulting transformed summary statistics should be significantly less correlated, a greater amount of shrinkage can then be applied to the covariance estimator. Accordingly, the number of required model simulations can be substantially reduced without a detrimental effect on the accuracy of the resulting posterior approximation, relative to standard BSL. We show that when estimating the summary statistic covariance matrix as a diagonal matrix (corresponding to complete shrinkage), the number of model simulations required to control the variance of the log synthetic likelihood scales linearly with the dimension of the summary statistic, as opposed to the standard synthetic likelihood estimator, for which we show that the number of simulations must grow quadratically with the summary statistic dimension. These results provide a strong motivation for our whitening method; we refer to the method of whitening transformation and covariance shrinkage within BSL as wBSL.

Due to the rotational freedom of the whitening transformation, there is an infinite number of whitening transformation matrices available. We consider the five whitening transformations examined by Kessy et al. (2018) and find that the principal component analysis (PCA) based whitening transformation performs best within the BSL framework. We also empirically demonstrate that the whitening BSL posterior approximation is quite insensitive to the point at which the whitening matrix is initially estimated.

This article is structured as follows: Section 2 details BSL, its properties and practical recommendations, as well as background information on shrinkage covariance matrix estimation. Section 3 describes the whitening transformations and introduces the wBSL algorithm. We examine the performance of wBSL under controlled simulations in Section 4, in addition to two real world analyses in ecology and biology. Section 5 explores the choice of whitening transformation in terms of the effectiveness of the transformation over the parameter space, and the sensitivity of the whitening procedure to the initial point estimate. We conclude with a discussion.

## 2 Bayesian Synthetic Likelihood

Suppose we have developed a statistical model and are interested in learning the parameters for a given set of observed data . The model may contain many parameters and hidden states, making it sufficiently complex so that a computationally tractable expression for the likelihood is unavailable. Bayesian synthetic likelihood (BSL) is a likelihood-free inference technique that permits an approximate Bayesian inference in this setting, but without direct evaluation of the intractable likelihood function (Wood, 2010; Price et al., 2018). Like ABC methods, BSL relies on reducing to a lower-dimensional set of informative summary statistics , where is a summary statistic mapping function. BSL aims to target the partial posterior distribution

 p(θ|sy)∝p(θ)p(sy|θ),

where is the prior for . Because will also likely be computationally intractable, BSL then makes the assumption that the summary statistic likelihood follows a convenient specified parametric form. Typically this will be a multivariate normal distribution, so that an auxiliary or synthetic approximation of the summary statistic likelihood is

 p(sy|θ)≈pA(sy|θ)=N(sy|μ(θ),Σ(θ)).

The auxiliary parameters and are generally unknown (as a function of ), but can be straightforwardly estimated by Monte Carlo simulation. Denote by the sequence of summary statistics for i.i.d. simulated data sets such that is the set of summary statistics for , and , where is the number of summary statistics. The parameters of the auxiliary likelihood can then be estimated by the sample statistics

 μn(θ) =1nn∑i=1si, (1) Σn(θ) =1n−1n∑i=1(si−μn(θ))(si−μn(θ))⊤, (2)

and so the estimated auxiliary likelihood, as an explicit function of , becomes

 pA,n(sy|θ)=N(sy|μn(θ),Σn(θ)).

We write to emphasise the dependence on . In practice, the dependence on is weak (Price et al., 2018), and if tends toward infinite with at any rate, then the effect of estimating and is asymptotically negligible (Frazier et al., 2019). As a result, Price et al. (2018) suggest choosing to maximise computational efficiency, with large providing expensive but precise likelihood estimates, compared to low producing fast but variable estimates. Price et al. (2018) recommend choosing a point of high posterior support for and then tuning so that the standard deviation of the synthetic likelihood is roughly between and (see also Doucet et al., 2015).

In practice, (2) is not the most efficient estimator of , and several authors have adopted different strategies, including shrinkage, to improve on this within the BSL context (e.g. An et al., 2019c; Ong et al., 2018a, b; Frazier et al., 2019; Everitt, 2017). With shrinkage, the primary aim is to estimate the covariance matrix with as few model simulations as possible such that the performance of a BSL sampler is efficient. As the number of model simulations approaches the number of summary statistics () from above, (2) becomes increasingly close to singular – with guaranteeing a singular estimate.

One simple approach used by e.g. Ong et al. (2018a, b) makes use of ridge regularisation to avoid such instabilities (Warton, 2008). The standard ridge regulariser for the covariance matrix estimate is , where is the ridge parameter and is the identity matrix. When the variables are measured on different scales (as is usual for the summary statistics in BSL), Warton (2008) derived a ridge estimator of the correlation matrix using maximum penalised Gaussian likelihood estimation, with a penalty. For the estimated correlation matrix

 ^R=Σ−1/2dΣnΣ−1/2d

where is formed using the diagonals of , the ridge estimator is

 ^Rγ=γ^R+(1−γ)Id, (3)

with . The estimator is always a valid correlation matrix with unit diagonals. The estimated covariance matrix is then

 Σn,γ=Σ1/2d^RγΣ1/2d. (4)

The smaller the value of , the closer comes to being a diagonal matrix. In the context of BSL, a smaller reduces the variance of the synthetic likelihood estimator for a given number of simulations, . This implies that less model simulations are required to achieve the same acceptance rate (as a measure of sampler performance) within BSL.

Any shrinkage estimator may be used within wBSL. However here we adopt the Warton estimator (4) since it does not shrink the estimated variances of the transformed summary statistics, and we find that this is crucial for the accuracy of the best performing whitening transformation in wBSL (see Section 5). It is also computationally trivial to calculate. We have also found (results not shown) that using the standard ridge shrinkage estimator with wBSL produces far less accurate posterior approximations. Shrinkage on its own works well in cases where there is a low degree of correlation between summaries (Ong et al., 2018a), but performs poorly for small when there is significant correlation between summaries (e.g. Section 4, Figure 1).

An understanding of the variance of the synthetic log-likelihood estimator provides insight into the computational efficiency of a BSL algorithm. To deduce the behaviour of the variance of the synthetic log-likelihood estimator, we consider the following assumptions.

###### Assumption 1.

For , the simulated summaries are generated i.i.d. and satisfy

 si(θ)∼N(μ(θ),Σ(θ)),

where and is positive-definite for all .

In addition, consider a version of BSL that uses simulated summaries with a diagonal covariance structure, i.e., summaries that are uncorrelated across the dimensions. Denote these simulated summaries as . We maintain the following assumption about these simulated summaries.

###### Assumption 2.

For , the simulated summaries are generated i.i.d. and satisfy

 ζi(θ)∼N(ζ(θ),Ω(θ)), where Ω(θ):=diag(ω11(θ),…,ωdd(θ)),

where

 supθ∈Θ∥ζ(θ)∥<∞, and maxj≤dsupθ∈Θωjj(θ)<∞,minj≤dinfθ∈Θωjj(θ)>0.

Let denote the sample mean of these uncorrelated summary statistics and let denote the sample variance. Throughout the remainder we explicitly consider that is a diagonal matrix.

Denote the BSL likelihood based on the uncorrelated summaries as

 pA,n,w(sy|θ)=N(sy|ζn(θ),Ωn(θ)),

and recall the standard BSL likelihood is given by:

 pA,n(sy|θ)=N(sy|μn(θ),Σn(θ)).

For two real sequences , we say that if the sequence is bounded. Theorem 1 demonstrates how the variance of the synthetic log-likelihood scales under each of the above assumptions.

###### Theorem 1.

For and large, but finite, with , under Assumption 3,

 \em Var[log(pA,n(sy|θ))] =O(d2⋅n2(n−d)3),

however, under Assumption 4,

 \em Var[log(pA,n,w(sy|θ))] =O(d⋅n2(n−d)3).

Therefore, for and large,

 \em Var[log(pA,n,w(sy|θ))]≤\em Var[log(pA,n(sy|θ))].

Proof of Theorem 1 is given in Appendix A. Theorem 1 demonstrates that for a given value of , the variance of is less than or equal to the variance of . Moreover, we can deduce how must scale as increases in order to control the variance of the synthetic log-likelihood. For the synthetic likelihood with uncorrelated summaries, using Theorem 1 and letting , we have that . In this case, must scale linearly with to control the variance of the synthetic log-likelihood. On the other hand, using Theorem 1, and letting , we have that for the standard synthetic likelihood estimator with a full covariance matrix, . Therefore, when using to estimate , must increase quadratically with to control the variance of the synthetic log-likelihood. Thus, if we had access to an uncorrelated set of summary statistics, this would greatly mitigate the curse of dimensionality with respect to the dimension of the summary statistic in BSL. However, in practice, a set of uncorrelated summary statistics that are informative about the model parameters is not usually available. In Section 3 we propose a method that approximately decorrelates a set of correlated summary statistics, allowing us to achieve the computational gains associated with using an uncorrelated set of summary statistics (when in the shrinkage covariance estimator of (3)).

## 3 Whitening Bayesian synthetic likelihood (wBSL)

In order to reduce shrinkage estimation induced error within BSL, and thereby also increase the efficiency of the method, we propose the use of a whitening transformation (e.g. Kessy et al., 2018) to decorrelate the summary statistics at each iteration of the BSL algorithm. Whitening, also known as sphering, is a linear transformation commonly employed in data preprocessing to produce a decorrelated set of data with unit variance (e.g. Bacus, 1976). Specifically, a whitening transformation converts the random vector of arbitrary distribution with covariance matrix , into a new set of variables

 ~s=Ws (5)

for some whitening matrix , such that the covariance is the identity matrix of dimension .

Of course, in the context of BSL, depends on and thus, to achieve exactly at every parameter value, the whitening matrix must also vary as a function of . However, for wBSL, we hold constant so that the transformed summary statistic likelihood

 g(~s|θ)=p(W−1~s|θ)|det(W⊤)|=p(W−1Ws|θ)|det(W⊤)|∝p(s|θ),

which ensures the posterior distribution conditional on the summary statistic remains unchanged with the whitening transformation. Therefore, for parameter values away from where is estimated, the transformation is not exact, so that . By approximately decorrelating summary statistics, the covariance shrinkage estimator may be applied with a greater amount of shrinkage. Heavier shrinkage permits fewer model simulations.

The only requirement for the whitening matrix is that it must satisfy . Hence, as , this is satisfied if

 W⊤W=Σ−1. (6)

Due to the rotational freedom, there are infinitely many whitening matrices that satisfy (6), each resulting in uncorrelated but differing sets of variables .

The most suitable whitening matrix for wBSL is the one that most effectively decorrelates those summary statistics generated under the model, for parameter values that reside in regions with non-negligible posterior density. This would minimise posterior approximation errors caused by shrinkage estimation of the transformed summary statistic covariance matrix , and thereby produce the most accurate inference. Here we consider the five natural whitening procedures outlined by Kessy et al. (2018): zero-phase component analysis whitening (ZCA), ZCA correlation whitening (ZCA-cor), principal component analysis whitening (PCA), PCA correlation whitening (PCA-cor) and Cholesky whitening. Each transform arises naturally by either optimising some criteria with respect to the cross-covariance , the cross-correlation , or by satisfying some symmetry constraint. Each transform is described briefly below (see e.g. Kessy et al., 2018, for further details).

The transformations make use of various matrix decompositions. Specifically, the covariance matrix may be decomposed as , where is the correlation matrix and is the diagonal matrix of variances. The eigendecomposition of the covariance matrix is , where is the matrix of eigenvectors and the diagonal matrix of eigenvalues, and the eigendecomposition of the correlation matrix is , where is the eigenvector matrix and is the diagonal matrix of eigenvalues. Finally, the Cholesky decomposition of the precision matrix is , where is a unique lower triangular matrix.

ZCA or ZCA-Mahalanobis whitening aims to produce a transformed set of data that remains maximally similar to the original data. This is achieved by minimising the squared distance between the original and transformed data

 E[(~s−s)⊤(~s−~s)] =tr(Id)−2E[(~ss)]+tr(Σ) =d−2tr(Φ)+tr(V),

or equivalently maximising the average cross-covariance . The resulting whitening matrix is . ZCA-cor whitening is the scale invariant analogue of ZCA whitening, where the objective is to minimise the distance between the variables on a standardised scale, so that

 E[(~s−V−1/2s)⊤(~s−V−1/2s)]=2d−2tr(Ψ).

The resulting whitening matrix, , minimises the average cross-correlation . PCA whitening and PCA-corr whitening are equivalent to maximising the compression with respect to the cross-covariance

 (ϕ1,...,ϕd)⊤=diag(ΦΦ⊤)

with and cross-correlation

 (ψ1,...,ψd)⊤=diag(ΨΨ⊤)

respectively. PCA whitening results in for the whitening matrix, whereas PCA-cor whitening results in . Finally, Cholesky whitening is directly based on the Cholesky decomposition of the precision matrix and results in . In the examples later we test all the whitening approaches in the context of BSL and find the PCA variants to be most effective.

The original BSL algorithm (Price et al., 2018) was presented in the form of a Metropolis-Hastings Markov chain Monte Carlo (MCMC) sampler, and so we present wBSL similarly. Of course, the (w)BSL procedure is Monte Carlo algorithm agnostic, and so alternative posterior simulation samplers (such as sequential Monte Carlo) are straightforward to construct. The full MCMC-based wBSL procedure is outlined in Algorithm 1.

In wBSL the unknown mean and covariance matrix of the are estimated by simulation. First, the matrix of summary statistics is transformed according to . The whitening matrix is estimated prior to implementing the MCMC sampler using simulations , given some parameter value located in a region of high posterior density. This is not an uncommon procedure within ABC (e.g. Luciani et al., 2009), and any suitable method can be used to find an appropriate , such as prior information, a pilot ABC analysis with a large kernel scale parameter (Fearnhead and Prangle, 2012) or a fast likelihood-free optimisation method (e.g. Gutmann and Corander, 2016).

Given an approximate whitening transformation, we use (4) to estimate the covariance matrix of the transformed summary statistics, . Ideally, this covariance matrix is approximately diagonal, meaning that the off-diagonal elements are close to zero. In this case, the whitened summary statistics in wBSL permit a large amount of shrinkage (a low value of ) to be used, and a correspondingly large reduction in compared to standard BSL. That is, shrinkage covariance estimation can be much more effective when a whitening transformation is applied.

## 4 Examples

We examine the performance of wBSL for three models with simulated data where the covariance of the summary statistics depends explicitly on the parameters. This is realistic in practice. We also consider two real data analyses from ecology and biology. For the first four models we compare wBSL with each of the five whitening transformations and Warton shrinkage by itself, to either standard BSL or to the true posterior (where known).

To find an appropriate combination of and , we estimate the number of model simulations required to maximise the computational efficiency of standard BSL, which we define as wBSL but with no whitening transformation or shrinkage covariance estimation (i.e. ). Following Price et al. (2018), this is the value for such that the estimate of the log synthetic likelihood at has a standard deviation in the range . We then fix to achieve a , and reduction in the number of model simulations at each sampler iteration compared to standard BSL, and tune the value of to similarly constrain the log-likelihood variance. We also consider complete shrinkage () so that the covariance matrix of is forced to be diagonal, and again choose to constrain the variance of the log-likelihood. This latter setting represents the most computationally efficient wBSL algorithm (lowest ), but is potentially the least accurate.

For each method and analysis we use a Gaussian random walk MCMC proposal distribution with covariance set to be roughly equal to the (approximate) posterior covariance. To quantify the accuracy of each method, we use the total variation distance between two probability density functions and given by . The distance is estimated using kernel density estimation from the (approximate) posterior samples and by numerical integration over a grid of carefully chosen parameter values. For models with more than two parameters, we present all pairwise results.

The first two examples are of an AR(1) model and a normal model, and full details of each are provided in Appendix B. The findings from both of these models are similar to the other examples.

An implementation of wBSL is available in the BSL package in R (An et al., 2019b), which is available at https://github.com/ziwenan/BSL.

### 4.1 An MA(2) model

The MA model represents a univariate series of temporally dependent observations as

 xt=wt+θ1wt−1+θ2wt−2wherewi∼N(0,σ2),i=−1,0,1,…,T0,

for , and has parameter constraints , and . Defining , then the likelihood is Gaussian with zero mean vector and covariance matrix constructed from , , and for . We generate 200 observations from the MA process with and fixed , and specify the full observed dataset as summary statistics. Under this setting, the summary statistics are exactly multivariate normal distributed and so standard BSL should perform well in terms of posterior approximation accuracy. We compare the results of wBSL and Warton shrinkage by itself to the output of a standard Metropolis-Hastings sampler using the known likelihood. We find that simulations are efficient for standard BSL, and we use model simulations at to accurately estimate . We use MCMC sampler iterations and a uniform prior over the parameter support.

Contour plots of the estimated joint posterior distribution under each method are shown in Figure 1. It is evident that when the number of model simulations for estimating the synthetic likelihood is less than , using Warton shrinkage alone (leftmost column) fails to recover an accurate posterior approximation. This is likely due to significant dependence between the summary statistics. As the level of shrinkage is increased (i.e.  is reduced), the estimated posterior variances and dependence structure become increasingly poor.

In contrast, all forms of whitening produce accurate dependence structures. PCA and PCA-cor whitening are the only procedures that consistently provide accurate estimates of the variance for varying : ZCA, ZCA-cor and Cholesky whitening all have inflated variances for smaller to roughly the same extent. Note that for model simulations (a 50% computational reduction compared to standard BSL), all whitening methods produce reasonably accurate posterior distributions. However, PCA-based results are reasonably accurate for all levels of shrinkage. This is impressive as for complete shrinkage () the number of model simulations () is reduced by two orders of magnitude for wBSL compared to BSL ( in BSL).

### 4.2 Movement models for Fowler’s toads

Understanding the movement behaviour of native and invasive species is an important topic in ecology (Lindstrom et al., 2013). Marchand et al. (2017) consider three individual-based movement models for a species of Fowler’s toads (Anaxyrus fowleri), motivated by a desire to understand the link between small scale movements and larger phenomena, such as home ranges, dispersal and migrations at seasonal, annual or life-time scales. In particular, the random-return model assumes that toads take refuge during the day and forage throughout the night, generating a net overnight displacement from a Levy alpha-stable distribution, , with stability parameter and scale parameter . Toads are assumed to return only at the end of the nighttime foraging path, with constant probability, . The refuge site is determined random from any of the previous refuge sites, with previously visited sites given a higher weighting.

Previously Marchand et al. (2017) and An et al. (2019a) used ABC and synthetic likelihood, respectively, for inference for this model. Following Marchand et al. (2017) we consider synthetically generated data for toads recorded at least once per night (active foraging) and once per day (resting in refuge) over days, with . We also specify uniform priors , and . The distance moved distribution for each toad at time lags of 1, 2, 4 and 8 days was found, and the log of the differences in the quantiles, the number of absolute displacements less than 10m, and the median of the absolute displacements greater than 10m are used as summary statistics (48 in total). As before, model simulations drawn at are used to estimate , and implement standard BSL and wBSL samplers for MCMC iterations. BSL was found to perform efficiently for this setting with model simulations per iteration.

The resulting estimated bivariate marginal distributions for wBSL with PCA whitening and Warton shrinkage by itself are shown in Figure 2; the results for wBSL with the remaining whitening methods are shown in Appendix C. Here Warton shrinkage by itself performs very poorly compared to standard BSL, producing both biased estimates and significantly underestimating marginal variances, unless large numbers of samples () are used.

In contrast, wBSL performs well compared to standard BSL. There are smaller differences between each of the whitening methods than seen in the previous examples, most likely due to a lack of sensitivity of the covariance matrix of the summary statistics to the model parameters. PCA-based whitening appears to perform the best, followed by ZCA-based whitening and Choleksy whitening the worst performing. However, for all whitening types, the posterior approximations with complete shrinkage () provide reasonable approximations to the standard BSL posterior. In this case, the number of model simulations is reduced by an order of magnitude from standard BSL () to wBSL ().

### 4.3 Collective cell spreading

Central to the understanding of many biological phenomena, such as tissue repair (Shaw and Martin, 2009) and cancer (Friedl and Wolf, 2003), is an understanding of collective cell behaviour. Mathematical models are a flexible tool for gaining insight into the movement, proliferation and interactions between cells on a cell-to-cell level (e.g. Vo et al., 2015 Johnston et al., 2014). An appealing approach is the continuous time, continuous space stochastic individual-based model of Binny et al. (2016). Using ABC methods Browning et al. (2018) calibrate this model to experimental results obtained by a cell proliferation assay experiment.

The model assumes that cells are uniformly sized discs with diameter and location for cells. Two events occur: proliferation and movement, each evolving according to a Poisson process with intrinsic parameters and , respectively. The rates of the cell, and depend on the crowding of neighbouring cells as determined by a Gaussian kernel given separation distance . Browning et al. (2018) assume that that the net proliferation and movement rates reduce to zero under maximum hexagonal cell packing. Upon proliferation events, the location of the daughter cell is simulated from a bivariate normal distribution, . For movement events, the preferred direction of movement is in the direction away from regions of high cell density , according to the crowding surface , with closeness governed by a Gaussian kernel and repulsive strength parameter . The movement distance is , which is equal to the cell diameter. The parameters to be inferred are .

We follow the results from Browning et al. (2018)’s PC-3 prostate cancer cell line in their cell proliferation assay experiment, for which images are taken every 12 hours for a total duration of 36 hours. We generated simulated data under this setting with . In our BSL implementation we use 21 summary statistics. At 12, 24 and 36 hours, we record the number of cells, Ripley’s function evaluated at and and Ripley’s function evaluated at and (see e.g. Baddeley et al., 2007, for a discussion of these). Priors are specified as , and , and it is found that model simulations are required to implement standard BSL efficiently. We use model simulations to estimate the whitening matrix given in a region of high posterior support, and a total of MCMC sampler iterations.

The resulting estimated bivariate posterior approximations are shown in Figure 3, for both PCA whitening wBSL (as the best performing wBSL method) and Warton shrinkage alone. The shrinkage-only posterior approximations are close to the BSL posterior approximations for (bottom row) and (middle row) model simulations (compared to for standard BSL). However, both posterior location and variance are much less accurate for model simulations (). In contrast, PCA wBSL performs very well for all levels of shrinkage. This order of magnitude performance gain is a particularly significant result, as simulating data under this model is very is computationally expensive.

## 5 Whitening method choice and sensitivity

### 5.1 Choice of Whitening Method

The empirical results in Section 4 suggest that PCA-based whitening methods provide the most accurate posterior approximation. Recall that for covariance shrinkage to be effective, the whitening transformation should decorrelate the summary statistics so that their covariance matrix is close to diagonal for parameter values that reside in regions with non-negligible posterior density. We explore how well this has been achieved for the MA(2) and AR(1) models considered earlier.

For each model we compute the whitening matrix using the known analytical covariance at the true parameter value . We then compute the covariances of the transformed summary statistics where is the known analytical covariance matrix for values of drawn from the true posterior. We compute the difference between the upper triangular portion of , both including and excluding the diagonal, from the identity and zero matrix respectively, and then calculate the matrix norm. These matrix norm deviations quantify the location and magnitude of the lack of effectiveness of the whitening transformation, and consequently where this deviation may have a direct effect on the wBSL posterior approximation. By using the known analytical covariances there is no Monte Carlo error in these results.

As might be expected, for both the MA and AR models, as moves further away from , the deviation of from the identity matrix increases for each whitening method (see Appendix C). Here, PCA-based whitening has slightly lower deviations than the other whitening methods. However, the differences between the different whitening methods become much clearer when considering the off-diagonal deviations only (see Figure 4 and Appendix C). Relative to the other whitening transformations, for PCA-based whitening, the covariance deviation (excluding variances) does not increase as rapidly as moves away from . This suggests that PCA-based whitening should be the most effective at decorrelating summary statistics within the BSL algorithm, and that PCA and PCA-cor whitening in wBSL should provide posterior approximations closest to standard BSL. This is aligned with the results in Section 4.

The results also demonstrate why coupling the Warton shrinkage with the whitening (particularly the PCA-based whitening) is so effective; in the Warton shrinkage estimator, the variances are always re-estimated from the model simulations while only the correlations are shrunk. Therefore it is only necessary for the whitening transformation to generate covariance matrices close to diagonal away from the point estimate, rather than being close to the identity, a stronger requirement.

The same conclusions can also be drawn based on the norm of the off diagonal elements of when is taken over the entire parameter space (see Appendix C). It is clear that correlations between the transformed summary statistics are far less sensitive to changing for PCA-based whitening compared to ZCA-based or Cholesky whitening. For PCA-based whitening the deviation surface is almost flat over , in contrast to the clear bowl-shaped surface for the other whitening methods.

### 5.2 Sensitivity to the value of θ0

A necessary step in implementing wBSL is estimation of the whitening matrix before implementing the Monte Carlo sampler. This requires specification of two quantities: a parameter vector believed to lie in region of high posterior probability, under which the summary statistics are generated and is estimated, and the number of these statistics . Because is only estimated once at the start of the wBSL algorithm, it will take up a small fraction of the overall computational budget, and so can be sufficiently large to estimate well. In the analyses in Section 4 we used for the first four examples.

Once has been estimated then is a natural candidate from which to initialise the MCMC sampler (or other Monte Carlo algorithm wBSL variant). In the examples in Section 4 we used the true parameter value from which the observed dataset was generated , however in practice little information about the true posterior may be available. Within the ABC literature a pilot analysis is commonly used to identify the region of high posterior density before performing a full analysis (Fearnhead and Prangle, 2012; Fan et al., 2013) and similar ideas could be adopted here. However there would still be uncertainty regarding the best choice of within this region. Accordingly interest is in understanding the sensitivity of the wBSL posterior approximation to the choice for . Here we re-examine the MA(2) and AR(1) models and focus on PCA whitening as the best performing whitening procedure. The results for the other whitening methods are provided in Appendix C.

For the MA process, the parameters and are subject to the constraints , and . We choose five different initial parameter configurations. The first three are the vertices of these boundaries: , and , which are likely the worst possible choices of the point estimate. The other two values are and such that and . We draw samples from the wBSL posterior approximation using iterations, with a burn-in of iterations, and model simulations.

The resulting posterior approximations are illustrated in Figure 5, which demonstrate that the wBSL posterior has some robustness to the value of . When the point estimate was chosen on the boundary, there are mixed results. For (top row), the posterior variances are inflated, and more so for greater shrinkage (lower ). However, when or (rows 2 and 3), the posterior is recovered with high accuracy. Interestingly, the wBSL posterior initialised at the true posterior quantiles (row 4) performs better than the posterior initialised at the quantiles (row 5), yet both appear less accurate than when initialising at or (rows 2 and 3), which are on the boundary of the parameter support. We would expect point estimates with high posterior support to perform better overall, since the whitening transform would likely be more effective at decorrelating the simulated summary statistics over the region of the parameter space closer to . The empirical results support this to some extent.

Details of the AR example are given in Appendix C. We observe similar results to the MA example. When the point estimate is close to regions of high posterior density the wBSL posterior approximation is accurate for all levels of shrinkage. When the point estimate is very far from the high posterior density region, then the wBSL approximation becomes poorer.

For both the MA and AR analyses, PCA-cor whitening wBSL performed similarly to PCA whitening under the same settings, whereas for ZCA-based whitening and Cholesky whitening there was a greater sensitivity to the initial point estimate (see Appendix C).

## 6 Discussion

In this article we have examined the integration of whitening transformations within the Bayesian synthetic likelihood framework. In combination with shrinkage covariance estimation, the number of model simulations required to estimate the synthetic likelihood function can be drastically reduced compared to standard BSL methods, enabling the efficient implementation of BSL with high dimensional and highly correlated summary statistics. In particular, we obtained orders of magnitude computational gains over standard BSL in all analyses considered, with little detrimental impact on accuracy.

We examined five different whitening transformations: PCA, PCA-cor, ZCA, ZCA-cor and Choleksy whitening, on both simplified and real-world models. In all cases, we empirically demonstrated that PCA-based whitening outperformed the other whitening methods, and produced transformations that were less sensitive to changes in the model parameter in the sense that the covariance of the transformed summary statistics was closer to being diagonal. We found that the PCA-based wBSL posterior approximation was fairly robust to the initial parameter point estimate used to compute the whitening matrix . Although there was some variability in our results for the MA(2) model, we suggest using initial parameter estimates approximately located in regions of high posterior density in order to produce more accurate posterior approximations. Overall, since PCA and PCA-cor whitening produced similar results, we recommend PCA whitening as the standard choice of whitening for wBSL, since it is slightly more computationally efficient to implement.

In practice, we recommend that the user choose an appropriate number of model simulations () given their computational budget and then tune the corresponding shrinkage level (). Following the recommendation of Price et al. (2018), we suggest tuning the shrinkage level so that the estimated log synthetic likelihood has a standard deviation between 1 and 2. This should produce a good trade-off between computational and statistical efficiency. Of course, wBSL can produce more accurate results with more model simulations (less shrinkage).

It would be of future interest to investigate the applicability of whitening transformations in various extensions of BSL. For example, An et al. (2019a) develop a semi-parametric synthetic likelihood, which is more robust to departures from normality. Further, Frazier and Drovandi (2019) develop synthetic likelihood methods that are more robust when there is incompatibility between the model and observed summary statistic. An alternative extension could involve a method for automatically finding a whitening transformation that minimises the loss of accuracy compared to standard BSL.

## Acknowledgements

The authors would like to thank Ziwen An for providing some code for the toad and the collective cell spreading models, and the High Performance Computing group at QUT for their computational resources. JWP was supported by a QUT Master of Philosophy Award. SAS was supported by the Australian Research Council under the Future Fellowship scheme (FT170100079). CD and SAS are also supported by the ARC Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS; CE140100049). CD is grateful to ACEMS for providing funding to visit SAS at UNSW where discussions on this research took place. We thank Ian Turner for some discussions on relevant linear algebra.

## Appendix A Proof of Theorem 1

For completeness, we first restate Assumption 1, Assumption 2 and Theorem 1. Let denote the observed summary statistic. Consider the standard BSL likelihood with and denoting the sample mean and variance of the simulated summary statistics. To deduce the behavior of the variance we consider the following assumption.

###### Assumption 3.

For , the simulated summaries are generated i.i.d. and satisfy

 si(θ)∼N(μ(θ),Σ(θ)),

where and is positive-definite for all .

In addition, consider a version of BSL that uses simulated summaries with a diagonal covariance structure, i.e., summaries that are uncorrelated across the dimensions. Denote these simulated summaries as . We maintain the following assumption about these simulated summaries.

###### Assumption 4.

For , the simulated summaries are generated i.i.d. and satisfy

 ζi(θ)∼N(ζ(θ),Ω(θ)), where Ω(θ):=diag(ω11(θ),…,ωdd(θ)),

where

 supθ∈Θ∥ζ(θ)∥<∞, and maxj≤dsupθ∈Θωjj(θ)<∞,minj≤dinfθ∈Θωjj(θ)>0.

Let denote the sample mean of these uncorrelated summary statistics and let denote the sample variance. Throughout the remainder we explicitly consider that is a diagonal matrix.

Denote the BSL likelihood based on the uncorrelated summaries as

 pA,n,w(sy|θ)=N(sy|ζn(θ),Ωn(θ)),

and recall the standard BSL likelihood is given by:

 pA,n(sy|θ)=N(sy|μn(θ),Σn(θ)).

For two real sequences , we say that if the sequence is bounded.

###### Theorem 2.

For and large, but finite, with , under Assumption 3,

 \em Var[log(pA,n(sy|θ))] =O(d2⋅n2(n−d)3),

however, under Assumption 4,

 \em Var[log(pA,n,w(sy|θ))] =O(d⋅n2(n−d)3).

Therefore, for and large,

 \em Var[log(pA,n,w(sy|θ))]≤\em Var[log(pA,n(sy|θ))].
###### Proof.

We deduce the order for standard BSL, , and the diagonal counterpart, separately, starting with standard BSL.

Standard BSL

Decompose as

 −log(pA,n(sy|θ)) ∝logdet{Σn(θ)}+(sy−μn(θ))⊤Σ−1n(θ)(sy−μn(θ)) :=X1n+X2n,

where the proportionality disregards terms that do not depend on . First, we deduce the variances of and . In what follows, we disregard the functions dependence on to simplify notations.2

(1) term: Consider the variance of . Let denote the Wishart distribution with degrees of freedom and scale matrix . Under Assumption 3, for ,

 logdet(Σn) =logdet{W(n−1)}=−dlog(n−1)+logdet{W}.

From Bishop (2006), for

 Var[logdet{W}] =d∑i=1ψ1(n−i2), (7)

where denotes the trigamma function. Applying equation (7), we have that

 Var[logdet{Σn}]=d∑i=1ψ1(n−i2)=O(1), (8)

where the latter follows from the asymptotic properties of the trigamma function.

(2) term: Recalling that , we bound this term via

 X2n≤2(μn−μ)⊤Σ−1n(μn−μ)+2(sy−μ)⊤Σ−1n(sy−μ).

Apply the Cauchy-Schwarz inequality to obtain

 Var[X2n] ≤4Var[(μn−μ)⊤Σ−1n(μn−μ)]+4Var[(sy−μ)⊤Σ−1n(sy−μ)] +2√Var[(μn−μ)⊤Σ−1n(μn−μ)]×√Var[(sy−μ)⊤Σ−1n(sy−μ)].

Rewrite the first term, , as

 t2:=1n[(μn−μ)⊤[Σn/n]−1(μn−μ)].

Under Assumption 3,

 n⋅t2∼T2d,n−1,

where denotes Hotelling’s T-squared distribution with and degrees of freedom, and where we note that

 T2d,n−1∼d(n−1)(n−d)Fd,n−d,

with denoting the -distribution with numerator and denominator degrees of freedom, respectively. The variance of can now be obtained using the moments of the -distribution. To this end, for ,

 Var[t2] =1n2(d(n−1)n−d)2Var[Fd,n−d] =1n2(d(n−1)n−d)22(n−d)2(n−2)d(n−d−2)2(n−d−4) =1n22d(n−1)2(n−2)(n−d−2)2(n−d−4) =O(d⋅n(n−d)3). (9)

To deduce the variance of the second term, , we can use the following result for the inverse-Wishart distribution (see Fujikoshi et al., 2011, Ch. 2): if , it follows that, for a constant vector , with , and for ,

 Var[z⊤Wz] =d∑i=1d∑j=1zizj(n−d+1)[Σ−1]2ij+(n−d+1)[Σ−1]ii[Σ−1]jj(n−d)(n−d−1)2(n−d−3)=O(d2(n−d)3). (10)

Now, rewrite the term as

 (n−1)(sy−μ)⊤[(n−1)Σn]−1(sy−μ).

Use the fact that and equation (10) to deduce that, for ,

 Var[(n−1)(sy−μ)⊤[(n−1)Σn]−1(sy−μ)] =(n−1)2Var[z⊤Wz] =O(d2n2(n−d)3). (11)

: To obtain the order of , again apply the Cauchy-Schwarz inequality to obtain:

 Var[log(pA,n(sy|θ))] ≤Var[X1n]+Var[X2n]+√Var[X1n]×√Var[X2n].

The dominant order is determined by the variance terms, and we can then apply the results in (8), (9), (11) to obtain