Learning Summary Statistic for Approximate Bayesian
Computation via Deep Neural Network
Bai Jiang, Tung-Yu Wu, Charles Zheng and Wing H. Wong
Stanford University
Abstract: Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.
Key words and phrases: Approximate Bayesian Computation, Summary Statistic, Deep Learning
1. Introduction
1.1. Approximate Bayesian Computation
Bayesian inference is traditionally centered around the ability to compute or sample from the posterior distribution of the parameters, having conditioned on the observed data. Suppose data X is generated from a model p(X|θ) with parameter θ, the prior of which is denoted by π(θ). If the closed form of the likelihood function p(X|θ) is available, the posterior distribution of θ given observed data X = x_obs can be computed via Bayes' rule
π(θ | x_obs) ∝ π(θ) p(x_obs | θ).
Alternatively, if the likelihood function can be computed only up to a normalizing constant, one can still draw samples from the posterior by using stochastic simulation techniques such as Markov chain Monte Carlo (MCMC) and rejection sampling (Asmussen and Glynn (2007)).
In many applications, the likelihood function cannot be explicitly obtained, or is intractable to compute; this precludes the possibility of direct computation or MCMC sampling. In these cases, approximate inference can still be performed as long as 1) it is possible to draw θ from the prior π(θ), and 2) it is possible to simulate X from the model p(X|θ) given θ, using the methods of Approximate Bayesian Computation (ABC) (See e.g. Beaumont, Zhang, and Balding (2002); Toni, Welch, Strelkowa, Ipsen, and Stumpf (2009); Lopes and Beaumont (2010); Beaumont (2010); Csilléry, Blum, Gaggiotti and François (2010); Marin, Pudlo, Robert, and Ryder (2012); Sunnåker, Busetto, Numminen, Corander, Foll, and Dessimoz (2013)).
While many variations of the core approach exist, the fundamental idea underlying ABC is quite simple: that one can use rejection sampling to obtain draws from the posterior distribution without computing any likelihoods. We draw parameter-data pairs (θ, X) from the prior π(θ) and the model p(X|θ), and accept only those θ such that X = x_obs, which occurs with conditional probability p(x_obs|θ) for any proposed θ. Algorithm 1 describes the ABC method for discrete data (Tavaré, Balding, Griffiths, and Donnelly (1997)), which yields an i.i.d. sample from the exact posterior distribution π(θ|x_obs).
The success of Algorithm 1 depends on the acceptance rate of proposed parameters θ. For continuous X and x_obs, the event X = x_obs happens with probability 0, and hence Algorithm 1 is unable to produce any draws. As a remedy, one can relax the acceptance criterion to ‖X − x_obs‖ ≤ ε, where ‖·‖ is a norm and ε > 0 is the tolerance threshold. The choice of ε is crucial for balancing efficiency and approximation error: as ε becomes smaller, the approximation error decreases while the acceptance probability also decreases.
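To make the relaxed rejection step concrete, here is a minimal sketch of ABC rejection with a tolerance threshold. The Gaussian toy simulator, the uniform prior bounds, and the tolerance value are illustrative choices for this sketch, not settings from the paper.

```python
import random

def abc_rejection(x_obs, simulate, prior_sample, eps, n_draws):
    """Basic ABC rejection: keep theta whenever the simulated data
    falls within eps of the observed data (absolute value as the norm)."""
    accepted = []
    while len(accepted) < n_draws:
        theta = prior_sample()       # propose theta from the prior
        x = simulate(theta)          # simulate data from the model
        if abs(x - x_obs) <= eps:    # relaxed acceptance criterion
            accepted.append(theta)
    return accepted

# Hypothetical toy model: x | theta ~ Normal(theta, 1), theta ~ Uniform(-5, 5).
random.seed(0)
draws = abc_rejection(
    x_obs=1.0,
    simulate=lambda t: random.gauss(t, 1.0),
    prior_sample=lambda: random.uniform(-5, 5),
    eps=0.1,
    n_draws=200,
)
```

With a small ε the accepted draws approximate the posterior, at the cost of many rejected proposals.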
1.2. Summary Statistic
When data vectors are high-dimensional, the inefficiency of rejection sampling in high dimensions results in either extreme inaccuracy, or accuracy at the expense of an extremely time-consuming procedure. To circumvent the problem, one can introduce a low-dimensional summary statistic S(X) and further relax the acceptance criterion to ‖S(X) − S(x_obs)‖ ≤ ε. The use of summary statistics results in Algorithm 2, which was first proposed as an extension of Algorithm 1 in population genetics applications (Fu and Li (1997); Weiss and von Haeseler (1998); Pritchard, Seielstad, Perez-Lezaun, and Feldman (1999)).
Instead of the exact posterior distribution, the resulting sample obtained by Algorithm 2 follows an approximate posterior distribution

(1.1)  π(θ | ‖S(X) − S(x_obs)‖ ≤ ε) ≈ π(θ | S(x_obs))

(1.2)  ≈ π(θ | x_obs).
The choice of the summary statistic is crucial for the approximation quality of the ABC posterior distribution. An effective summary statistic should offer a good tradeoff between two approximation errors (Blum, Nunes, Prangle, and Sisson (2013)). The approximation error (1.1) is introduced when one replaces "equal" with "similar" in the first relaxation of the acceptance criterion. Under appropriate regularity conditions, it vanishes as ε → 0. The approximation error (1.2) is introduced when one compares summary statistics S(X) and S(x_obs) rather than the original data X and x_obs. In essence, this is just the information loss of mapping high-dimensional X to low-dimensional S(X). A summary statistic of higher dimension is in general more informative, and hence reduces the approximation error (1.2). At the same time, increasing the dimension of the summary statistic slows down the rate at which the approximation error (1.1) vanishes in the limit of ε → 0. Ideally, we seek a statistic which is simultaneously low-dimensional and informative.
A sufficient statistic is an attractive option, since sufficiency, by definition, implies that the approximation error (1.2) is zero (Kolmogorov (1942); Lehmann and Casella (1998)). However, the sufficient statistic generally has the same dimensionality as the sample, except in special cases such as exponential families. And even when a low-dimensional sufficient statistic exists, it may be intractable to compute.
The main task of this article is to construct low-dimensional and informative summary statistics for ABC methods. Since our goal is to compare methods of constructing summary statistics (rather than present a complete methodology for ABC), the relatively simple Algorithm 2 suffices. In future work, we plan to use our approach for constructing summary statistics alongside more sophisticated variants of ABC methods, such as those which combine ABC with Markov chain Monte Carlo or sequential techniques (Marjoram, Molitor, Plagnol, and Tavaré (2003); Sisson, Fan, and Tanaka (2007)). Hereafter all ABC procedures mentioned use Algorithm 2.
1.3. Related Work and Our DNN Approach
Existing methods for constructing summary statistics can be roughly classified into two classes, both of which require a set of candidate summary statistics as input. The first class consists of approaches for best subset selection. Subsets of the candidate set are evaluated according to various information-based criteria, e.g. measure of sufficiency (Joyce and Marjoram (2008)), entropy (Nunes and Balding (2010)), Akaike and Bayesian information criteria (Blum, Nunes, Prangle, and Sisson (2013)), and the "best" subset is chosen to be the summary statistic. The second class is the linear regression approach, which constructs summary statistics by linear regression of the response θ on the candidate summary statistics (Wegmann, Leuenberger, and Excoffier (2009); Fearnhead and Prangle (2012)). Regularization techniques have also been considered to reduce overfitting in the regression models (Blum, Nunes, Prangle, and Sisson (2013)). Many of these methods rely on expert knowledge to provide candidate summary statistics.
In this paper, we propose to automatically learn summary statistics for high-dimensional X by using deep neural networks (DNN). Here the DNN is expected to effectively learn a good approximation to the posterior mean E[θ|X] when constructing a minimum squared error estimator on a large data set {(θ^(i), X^(i))}_{i=1}^N. The minimization problem is given by

min_β (1/N) Σ_{i=1}^N ‖θ^(i) − f_β(X^(i))‖²,

where f_β denotes a DNN with parameter β. The resulting estimator f_β(X) approximates E[θ|X] and further serves as the summary statistic for ABC.
Our motivation for using (an approximation to) E[θ|X] as a summary statistic for ABC is inspired by the semi-automatic method of Fearnhead and Prangle (2012). Their idea is that E[θ|X] as a summary statistic leads to an ABC posterior which has the same mean as the exact posterior in the limit of ε → 0. Therefore they proposed to linearly regress θ on candidate summary statistics g(X) = (g_1(X), ..., g_k(X)),
and to use the resulting minimum squared error estimator as the summary statistic for ABC. In their semi-automatic method, the g_j could be expert-designed statistics or polynomial bases (e.g. power terms of each component of X).
Our DNN approach aims to achieve a more accurate approximation and a higher degree of automation in constructing summary statistics than the semi-automatic method. First, a DNN with multiple hidden layers offers stronger representational power than the linear regression used by the semi-automatic method. A DNN is expected to better approximate E[θ|X] if the posterior mean is a highly nonlinear function of X. Second, DNNs simply use the original data vector X as the input, and automatically learn the appropriate nonlinear transformations as summaries from the raw data, in contrast to the semi-automatic method and many other existing methods, which require a set of expert-designed candidate summary statistics or a basis expansion. Therefore our approach achieves a higher degree of automation in constructing summary statistics.
Blum and François (2010) have considered fitting a feed-forward neural network (FFNN) with a single hidden layer by regressing θ on X. Their method differs significantly from ours, as it was originally motivated by reducing the error between the ABC posterior and the true posterior, rather than by constructing summary statistics. Specifically, their method assumes that an appropriate summary statistic S has already been given, and adjusts each draw θ from the ABC procedure using the summary statistic in the way

θ* = m(S(x_obs)) + σ(S(x_obs)) (θ − m(S(X))) / σ(S(X)).

Both m(·) and σ(·) are nonlinear functions represented by FFNNs. Another key difference is the network size: the FFNNs in Blum and François (2010) contained four hidden neurons in order to reduce the dimensionality of summary statistics, while our DNN approach contains hundreds of hidden neurons in order to gain representational power.
1.4. Organization
The rest of the article is organized as follows. In Section 2, we show how to approximate the posterior mean by training DNNs. In Sections 3 and 4, we report simulation studies on the Ising model and the moving average model of order 2, respectively. We describe in the supplementary materials the implementation details of DNNs and how consistency can be obtained by using the posterior mean of a basis of functions of the parameters.
2. Methods
Throughout the paper, we denote by X the data, and by θ the parameter. We assume it is possible to obtain a large number of independent draws from the model p(X|θ) given θ, despite the unavailability of the likelihood p(X|θ). Denote by x_obs the observed data, π(θ) the prior of θ, S the summary statistic, ‖·‖ the norm used to measure ‖S(X) − S(x_obs)‖, and ε the tolerance threshold. Let π_ε denote the approximate posterior distribution obtained by Algorithm 2.
The main task is to construct a low-dimensional and informative summary statistic for high-dimensional X, which will enable accurate approximation of π(θ|x_obs). We are interested mainly in the regime where ABC is most effective: settings in which the dimension of X is moderately high and the dimension of θ is low. Given a prior π(θ), our approach is as follows.

1. Generate a data set {(θ^(i), X^(i))}_{i=1}^N by repeatedly drawing θ^(i) from π(θ) and drawing X^(i) from p(X|θ) with θ = θ^(i).

2. Train a DNN with X as input and θ as target.

3. Run ABC Algorithm 2 with prior π(θ) and the DNN estimator as summary statistic.
Our motivation for training such a DNN is that the resulting statistic (estimator) should approximate the posterior mean E[θ|X].
2.1 Posterior Mean as Summary Statistic
The main advantage of using the posterior mean s(X) = E[θ|X] as a summary statistic is that the ABC posterior will then have the same mean as the exact posterior in the limit of ε → 0. That is to say, s(X) does not lose any first-order information when summarizing X.
This theoretical result has been discussed in Theorem 3 in Fearnhead and Prangle (2012), but their proof is not rigorous. We provide in Theorem 1 a more rigorous and general proof.
Theorem 1.
If E‖θ‖ < ∞, then s(X) = E[θ|X] is well defined. The ABC procedure with observed data x_obs, summary statistic s(X) = E[θ|X], norm ‖·‖, and tolerance threshold ε produces a posterior distribution
π_ε(θ) = π(θ | ‖s(X) − s(x_obs)‖ ≤ ε)
with
‖E_{π_ε}[θ] − E[θ | x_obs]‖ ≤ ε.
Proof.
First, we show s(X) is a version of the conditional expectation of θ given s(X). Denote by σ(X) and σ(s(X)) the σ-algebras generated by X and s(X), respectively. Since s(X) is a function of X, σ(s(X)) ⊆ σ(X), and s(X) is clearly measurable with respect to σ(s(X)). Then
E[θ | s(X)] = E[E[θ | X] | s(X)]  [tower property, σ(s(X)) ⊆ σ(X)]
= E[s(X) | s(X)]  [definition of s(X)]
= s(X).  [s(X) is known given σ(s(X))]
As the event A_ε = {‖s(X) − s(x_obs)‖ ≤ ε} lies in σ(s(X)), we have by the definition of conditional expectation
E[θ 1_{A_ε}] = E[s(X) 1_{A_ε}].
It follows that
E_{π_ε}[θ] = E[θ | A_ε] = E[s(X) | A_ε],
implying by Jensen's inequality that
‖E_{π_ε}[θ] − s(x_obs)‖ = ‖E[s(X) − s(x_obs) | A_ε]‖ ≤ E[‖s(X) − s(x_obs)‖ | A_ε] ≤ ε.
Since s(x_obs) = E[θ | x_obs], letting ε → 0 yields E_{π_ε}[θ] → E[θ | x_obs]. ∎
ABC procedures often give the sample mean of the ABC posterior as the point estimate for θ. Theorem 1 shows that the ABC procedure using E[θ|X] as the summary statistic maximizes point-estimation accuracy in the sense that the exact mean of the ABC posterior approximates the Bayes estimator E[θ|x_obs] under squared error loss.
Users of Bayesian inference generally desire more than just point estimates: ideally, one approximates the posterior globally. We observe that such a global approximation result is possible when extending Theorem 1: if one considers a basis of functions of the parameters, b_1(θ), ..., b_m(θ), and uses the m-dimensional statistic (E[b_1(θ)|X], ..., E[b_m(θ)|X]) as the summary statistic, the ABC posterior weakly converges to the exact posterior as ε → 0 and m → ∞ at the appropriate rate. We state this result in the supplementary material.
There is a nice connection between the posterior mean and sufficient statistics, especially minimal sufficient statistics in the exponential family. If there exists a sufficient statistic T(X) for θ, then from the concept of sufficiency in the Bayesian context (Kolmogorov (1942)) it follows that π(θ|X) = π(θ|T(X)) for almost every X, and further E[θ|X] is a function of T(X). In the special case of an exponential family with minimal sufficient statistic T(X) and parameter θ, the posterior mean E[θ|X] is a one-to-one function of T(X), and thus is a minimal sufficient statistic.
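As a concrete illustration of this connection (a standard worked example, not taken from the paper), consider the Beta-Bernoulli model, where the posterior mean is visibly a one-to-one function of the sufficient statistic:

```latex
% X_1,\dots,X_n i.i.d. Bernoulli(\theta), prior \theta \sim \mathrm{Beta}(a,b).
% The sufficient statistic and posterior mean are
\[
T(X) = \sum_{i=1}^{n} X_i,
\qquad
\mathbb{E}[\theta \mid X] = \frac{a + T(X)}{a + b + n},
\]
% which is a strictly increasing, hence one-to-one, function of T(X),
% so the posterior mean is itself (minimal) sufficient here.
```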
2.2. Structure of Deep Neural Network
At a high level, a deep neural network merely represents a nonlinear function for transforming an input vector X into an output f(X). The structure of a neural network can be described as a series of nonlinear transformations applied to X. Each of these transformations is described as a layer: the original input is X, the output of the first transformation is the 1st layer, the output of the second transformation is the 2nd layer, and so on, with the output f(X) as the final layer. The layers between the input and the output are called hidden layers because they represent intermediate computations, and we let H^(l) denote the lth hidden layer. Then the explicit form of the network is

H^(l) = tanh(W^(l) H^(l−1) + b^(l)), l = 1, ..., L;  f(X) = W^(L+1) H^(L) + b^(L+1),

where H^(0) = X is the input, f(X) is the output, and (W^(l), b^(l)) are the parameters controlling how the inputs of layer l are transformed into the outputs of layer l. Let d_l denote the size of the lth layer: then W^(l) is a d_l × d_{l−1} matrix, called the weight matrix, and b^(l) is a d_l-dimensional vector, called the bias vector. The components of each layer are also described evocatively as "neurons" or "hidden units". Figure 1 illustrates an example of a 3-layer DNN with 5/5/3 neurons in the 1st/2nd/3rd hidden layers, respectively.
The role of layer l is to apply a nonlinear transformation to the output of layer l−1, H^(l−1), and then output the transformed result as H^(l). First, a linear transformation is applied to the previous layer, yielding W^(l) H^(l−1) + b^(l). The nonlinearity (in this case tanh) is applied to each element of this vector to yield the output of the current layer, H^(l) = tanh(W^(l) H^(l−1) + b^(l)). The nonlinearity is traditionally called the "activation" function, drawing an analogy to the properties of biological neurons. We choose the tanh function as the activation function due to its smoothness and computational convenience. Other popular choices of activation function are the sigmoid and the rectified linear unit. To better explain the activity of each individual neuron, we illustrate how a neuron in the hidden layer works in Figure 2.
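The layer recursion above can be sketched directly in a few lines; the layer sizes and the uniform weight initialization below are arbitrary placeholders for illustration.

```python
import math
import random

def forward(x, params):
    """Forward pass of a feed-forward network: each hidden layer applies
    h = tanh(W h_prev + b); the final (output) layer is linear."""
    h = x
    for W, b in params[:-1]:  # hidden layers with tanh activation
        h = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
    W, b = params[-1]         # linear output layer
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical sizes: input dim 4, two hidden layers of 5 neurons, scalar output.
random.seed(1)
sizes = [4, 5, 5, 1]
params = [
    ([[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
     [0.0] * n_out)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
y = forward([0.1, -0.2, 0.3, 0.4], params)
```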
The output layer takes the top hidden layer H^(L) as input and predicts θ. In many existing applications of deep learning (e.g. computer vision and natural language processing), the goal is to predict a categorical target. In those cases, it is common to use a softmax transformation in the output layer. However, since our goal is prediction rather than classification, it suffices to use a linear transformation.
2.3. Approximating Posterior Mean by DNN
We use the DNN to construct a summary statistic: a function which maps X to an approximation of E[θ|X]. First, we generate a training set {(θ^(i), X^(i))}_{i=1}^N by drawing N samples from the joint distribution π(θ) p(X|θ). Next, we train the DNN to minimize the squared error loss between the training target θ^(i) and the estimate f_β(X^(i)). Thus we minimize (2.1) with respect to the DNN parameters β = (W^(l), b^(l))_l,

(2.1)  min_β (1/N) Σ_{i=1}^N ‖θ^(i) − f_β(X^(i))‖².
We compute the derivatives using backpropagation (LeCun, Bottou, Bengio, and Haffner (1998)) and optimize the objective function by the stochastic gradient descent method. See the supplementary material for details.
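The training procedure can be illustrated with a one-hidden-layer tanh network, hand-coded backpropagation, and per-sample stochastic gradient descent on a toy regression problem. The target function, network width, learning rate, and epoch count here are illustrative choices, not the paper's settings.

```python
import math
import random

# One-hidden-layer tanh network f(x) = sum_i v[i] * tanh(w[i] x + b[i]) + c,
# fit to (x, y) pairs by minimizing squared error with SGD.
random.seed(2)
H = 8
w = [random.uniform(-1, 1) for _ in range(H)]
b = [0.0] * H
v = [random.uniform(-1, 1) for _ in range(H)]
c = 0.0

def predict(x):
    hidden = [math.tanh(w[i] * x + b[i]) for i in range(H)]
    return sum(v[i] * hidden[i] for i in range(H)) + c, hidden

# Toy data: y = sin(x) on [-3, 3].
data = [(x / 10.0, math.sin(x / 10.0)) for x in range(-30, 31)]
lr = 0.05
for epoch in range(200):
    random.shuffle(data)
    for x, y in data:
        yhat, hidden = predict(x)
        err = yhat - y                                  # d(loss)/d(yhat), up to a factor
        for i in range(H):
            grad_h = err * v[i] * (1 - hidden[i] ** 2)  # backprop through tanh
            v[i] -= lr * err * hidden[i]
            w[i] -= lr * grad_h * x
            b[i] -= lr * grad_h
        c -= lr * err

mse = sum((predict(x)[0] - y) ** 2 for x, y in data) / len(data)
```

The gradient for each weight follows the chain rule through the tanh nonlinearity, which is the essence of backpropagation; deep networks repeat the same step layer by layer.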
Our approach is based on the fact that any function which minimizes the squared error risk for predicting θ from X may be viewed as an approximation of the posterior mean E[θ|X]. Hence, any supervised learning approach could be used to construct a prediction rule for predicting θ from X, and thereby provide an approximation of E[θ|X]. Since in many applications of ABC we can expect E[θ|X] to be a highly nonlinear and smooth function of X, it is important to choose a supervised learning approach which has the power to approximate such nonlinear smooth functions.
DNNs appear to be a good choice given their rich representational power for approximating nonlinear functions. A growing body of practical and theoretical results of deep learning in several areas of machine learning, especially computer vision and natural language processing (Hinton and Salakhutdinov (2006); Hinton, Osindero, and Teh (2006); Bengio, Courville, and Vincent (2013); Schmidhuber (2015)), shows that deep architectures composed of simple learning modules in multiple layers can model high-level abstraction in high-dimensional data. It is speculated that by increasing the depth and width of the network, the DNN gains the power to approximate any continuous function; however, a rigorous proof of the approximation properties of DNNs remains an important open problem (Faragó and Lugosi (1993); Sutskever and Hinton (2008); Le Roux and Bengio (2010)). Nonetheless we expect that DNNs can effectively learn a good approximation to the posterior mean given a sufficiently large training set.
2.4. Avoiding Overfitting
A DNN consists of simple learning modules in multiple layers and thus has very rich representational power to learn complicated relationships between the input X and the output θ. However, DNNs are prone to overfitting given limited training data. In order to avoid overfitting, we consider three methods: generating a large training set, early stopping, and regularization of the parameters.
Sufficiently Large Training Data. This is the fundamental way to avoid overfitting and improve generalization, which, however, is impossible in many applications of machine learning. Fortunately, in applications of Approximate Bayesian Computation, an arbitrarily large training set can be generated by repeatedly sampling θ from the prior π(θ) and X from the model p(X|θ), and this dataset sampling can be parallelized. In our experiments, the DNNs contain three hidden layers, each of which has 100 neurons, while the training set is far larger than the number of network parameters.
Early Stopping (Caruana, Lawrence, and Giles (2001)). This divides the available data into three subsets: the training set, the validation set and the testing set. The training set is used to compute the gradient and update the parameter. At the same time, we monitor both the training error and the validation error. The validation error usually decreases as does the training error in the early phase of the training process. However, when the network begins to overfit, the validation error begins to increase and we stop the training process. The testing error is reported only for evaluation.
Regularization. This adds an extra term to the loss function that penalizes complexity in neural networks (Nowlan and Hinton (1992)). Here we consider ℓ2 regularization (Ng (2004)) and minimize the objective function

(2.2)  min_β (1/N) Σ_{i=1}^N ‖θ^(i) − f_β(X^(i))‖² + λ Σ_l ‖W^(l)‖_F²,

where ‖W^(l)‖_F is the Frobenius norm of W^(l), the square root of the sum of the absolute squares of its elements, and λ ≥ 0 controls the strength of regularization.
More sophisticated methods like dropout (Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014)) and tuning of the network size can probably better combat overfitting and learn better summary statistics. We use only the simple methods and do minimal model-specific tuning in the simulation studies. Our goal is to show that a relatively simple DNN can learn a good summary statistic for ABC.
3. Example: Ising Model
3.1 ABC and Summary Statistics
The Ising model consists of discrete variables x_i ∈ {+1, −1} arranged in a lattice (Figure 3). Each binary variable, called a spin, is allowed to interact with its neighbors.
The inverse-temperature parameter θ characterizes the extent of interaction. Given θ, the probability mass function of the Ising model on the lattice is

p(x | θ) = exp(θ Σ_{i∼j} x_i x_j) / Z(θ),

where x collects the spins, i ∼ j means sites i and j are neighbors, and the normalizing constant is

Z(θ) = Σ_x exp(θ Σ_{i∼j} x_i x_j).
Since the normalizing constant requires an exponential-time computation, the probability mass function is intractable except in small cases.
Despite the unavailability of the probability mass function, data can still be simulated given θ using Monte Carlo methods such as the Metropolis algorithm (Asmussen and Glynn (2007)). This allows the use of ABC for parameter inference. The sufficient statistic S(x) = Σ_{i∼j} x_i x_j is the ideal summary statistic, because it is univariate, speeds up the convergence of approximation error (1.1) in the limit of ε → 0, and loses no information in approximation (1.2).
Since S(x) results in the ABC posterior of the highest quality, we take it as the gold standard and compare the DNN-based summary statistic to it. The DNN-based summary statistic, if it approximates E[θ|x] well, should be an approximately increasing function of S(x). As E[θ|x] is an increasing function of S(x), it is a sufficient statistic as well. To see this, view the posterior as an exponential family with S(x) as the "parameter" and θ as the "sufficient statistic",

π(θ | x) ∝ [π(θ)/Z(θ)] exp(θ S(x)),

and then use the mean reparametrization result for exponential families. As E[θ|x] and S(x) are highly nonlinear functions on the high-dimensional space of spin configurations, they are challenging to approximate.
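For concreteness, here is a small sketch of simulating Ising data given θ with the Metropolis algorithm and computing the sufficient statistic S(x). The 8 x 8 lattice size, free boundary conditions, and sweep count are assumptions made for this illustration, not the paper's experimental settings.

```python
import math
import random

def neighbor_sum_stat(spins, L):
    """Sufficient statistic S(x) = sum over neighboring pairs of x_i * x_j
    on an L x L lattice with free (non-wrapping) boundaries."""
    s = 0
    for i in range(L):
        for j in range(L):
            if i + 1 < L:
                s += spins[i][j] * spins[i + 1][j]
            if j + 1 < L:
                s += spins[i][j] * spins[i][j + 1]
    return s

def metropolis_ising(theta, L, n_sweeps, rng):
    """Single-spin-flip Metropolis sampler for p(x) proportional to exp(theta * S(x))."""
    spins = [[rng.choice([-1, 1]) for _ in range(L)] for _ in range(L)]
    for _ in range(n_sweeps * L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        nb = 0                      # sum of the neighbors of spin (i, j)
        if i > 0: nb += spins[i - 1][j]
        if i + 1 < L: nb += spins[i + 1][j]
        if j > 0: nb += spins[i][j - 1]
        if j + 1 < L: nb += spins[i][j + 1]
        delta = -2 * spins[i][j] * nb   # change in S(x) if (i, j) is flipped
        if delta >= 0 or rng.random() < math.exp(theta * delta):
            spins[i][j] *= -1
    return spins

rng = random.Random(3)
x = metropolis_ising(theta=0.8, L=8, n_sweeps=200, rng=rng)
S = neighbor_sum_stat(x, 8)
```

At θ = 0.8, above the critical point, most spins align and S(x) sits near its maximum, matching the ordered phase described above.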
3.2 Experimental Design
Figure 4 outlines the whole experimental scheme. We generate a training set by the Metropolis algorithm, train a DNN to learn a summary statistic, and then compare it to the "gold standard" S(x).
The Metropolis algorithm generated training, validation, and testing sets from the Ising model on a square lattice with a prior on θ. The value θ_c = ln(1 + √2)/2 ≈ 0.4407 is the phase transition point of the Ising model on the infinite lattice: when θ < θ_c, the spins tend to be disordered; when θ is large enough, the spins tend to have the same sign due to the strong neighbor-to-neighbor interactions (Onsager (1944)). The Ising model on a finite lattice undergoes a smooth phase transition around θ_c as θ increases, which is slightly different from the sharp phase transition on the infinite lattice (Landau (1976)).
A 3-layer DNN with 100 neurons in each hidden layer was trained to predict θ from x. For the purpose of comparison, the semi-automatic method with the components of the raw vector x as candidate summary statistics was used. We also tested an FFNN with a single hidden layer of 100 neurons, and considered the regularization technique (2.2). The FFNN used here is totally different from that used by Blum and François (2010); see details in Section 1.3.
Summary statistics learned by different methods led to different ABC posteriors, which were compared to the ABC posteriors resulting from the ideal summary statistic S(x).
3.3 Results
As shown in Table 1, the DNN learns a better prediction rule than the semi-automatic method and the FFNN, although it takes more training time. The regularization technique does not improve the performance, probably because overfitting is not a significant issue given that the training data far outnumber the parameters.
Method  Training RMSE  Testing RMSE  Time (s)

Semi-automatic  0.4401  0.4406  4.36
FFNN (no regularization)  0.2541  0.2541  480.08
DNN (no regularization)  0.2319  0.2318  1348.17
FFNN (ℓ2 regularization)  0.2583  0.2584  447.07
DNN (ℓ2 regularization)  0.2514  0.2512  1378.33
Figure 5(a) displays a scatterplot which compares the DNN-based summary statistic and the sufficient statistic S(x). Each point in the scatterplot represents an instance in the testing set. A large number of the instances are concentrated at the maximal value of S(x), appearing as points in the top-right corner of the scatterplot. These instances are relatively uninteresting, so we display a heatmap excluding them in Figure 5(b). It shows that the DNN-based summary statistic approximates an increasing function of S(x).
The semi-automatic method constructs a summary statistic that fails to approximate E[θ|x] (an increasing function of S(x)) and instead concentrates around the prior mean (Figure 6). This is not surprising, since the semi-automatic construction, a linear combination of the components of x, is unable to capture the nonlinearity of E[θ|x].
ABC posterior distributions were obtained with the sufficient statistic S(x) and the summary statistics constructed by the DNN and the semi-automatic method. For the sufficient statistic S(x), we set the tolerance level to 0 so that the ABC posterior sample follows the exact posterior π(θ|x_obs). For each constructed summary statistic, we set the tolerance threshold small enough that only a small fraction of proposed θ's was accepted. We repeated the comparison for four different observed data x_obs, each generated from a different true value of θ; in each case, we compared the posterior obtained from S(x) with the posteriors obtained from the constructed summary statistics, in Figure 7.
We highlight the case with the largest true θ (lower-right subplot in Figure 7). Since with high probability the spins have the same sign when θ is large, it becomes difficult to distinguish different values of θ above the critical point based on the data x. Hence we should expect the posterior to be small below the critical point and to have a similar shape to the prior distribution above it. All three ABC posteriors demonstrate this property.
4. Example: Moving Average of Order 2
4.1 ABC and Summary Statistics
The moving-average model is widely used in time series analysis. With observations y_1, ..., y_n, the moving-average model of order q, denoted by MA(q), is given by

y_t = ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},

where the ε_t are unobserved white noise error terms. We took q = 2 with Gaussian noise in order to enable exact calculation of the posterior distribution, and thereby evaluation of the ABC posterior distribution. If the ε_t's are non-Gaussian, the exact posterior is computationally intractable, but ABC is still applicable.
Approximate Bayesian Computation has been applied to study the posterior distribution of the MA(2) model using the autocovariances as the summary statistic (Marin, Pudlo, Robert, and Ryder (2012)). The autocovariance is a natural choice of summary statistic for the MA(2) model because, by the Weak Law of Large Numbers, it converges in probability as n → ∞ to a one-to-one function of the underlying parameter: for unit noise variance,

(1/n) Σ_t y_t y_{t+1} → θ_1 (1 + θ_2),  (1/n) Σ_t y_t y_{t+2} → θ_2.
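This convergence is easy to check by simulation; the parameter values and series length below are illustrative choices for this sketch, not the paper's settings.

```python
import random

def simulate_ma2(theta1, theta2, n, rng):
    """Simulate y_t = e_t + theta1 * e_{t-1} + theta2 * e_{t-2}
    with standard Gaussian white noise e_t."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]
    return [e[t + 2] + theta1 * e[t + 1] + theta2 * e[t] for t in range(n)]

def autocov(y, lag):
    """Empirical autocovariance at the given lag (mean taken as zero)."""
    n = len(y)
    return sum(y[t] * y[t + lag] for t in range(n - lag)) / n

rng = random.Random(4)
theta1, theta2 = 0.6, 0.2        # an illustrative point in the identifiable region
y = simulate_ma2(theta1, theta2, n=20000, rng=rng)
# For unit noise variance the limits are
#   gamma_1 -> theta1 * (1 + theta2),  gamma_2 -> theta2.
g1, g2 = autocov(y, 1), autocov(y, 2)
```

For a long series the empirical lag-1 and lag-2 autocovariances land close to θ_1(1 + θ_2) and θ_2, which is why they identify the parameters.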
4.2 Experimental Design
The MA(2) model is identifiable over the triangular region

{(θ_1, θ_2) : −2 < θ_1 < 2, θ_1 + θ_2 > −1, θ_1 − θ_2 < 1},

so we took a uniform prior over this region, and generated training, validation, and testing sets. Each instance was a simulated time series (y_1, ..., y_n).
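Sampling from the uniform prior over the identifiability region can be done by simple rejection from a bounding box. The triangle below, with vertices (−2, 1), (2, 1), (0, −1), is the standard MA(2) invertibility region, stated here as an assumption for the sketch.

```python
import random

def sample_ma2_prior(rng):
    """Rejection-sample (theta1, theta2) uniformly over the MA(2)
    identifiability triangle: theta1 + theta2 > -1, theta1 - theta2 < 1,
    with theta2 in (-1, 1) as the bounding box."""
    while True:
        t1 = rng.uniform(-2, 2)
        t2 = rng.uniform(-1, 1)
        if t1 + t2 > -1 and t1 - t2 < 1:
            return t1, t2

rng = random.Random(5)
draws = [sample_ma2_prior(rng) for _ in range(1000)]
```

Since the triangle fills half of the bounding box, roughly half of the proposals are accepted, so the rejection step is cheap.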
A 3-layer DNN with 100 neurons in each hidden layer was trained to predict (θ_1, θ_2) from the time series. For purposes of comparison, we constructed the semi-automatic summary statistic by fitting a linear regression of (θ_1, θ_2) on candidate summary statistics given by polynomial bases. We also tested an FFNN with a single hidden layer of 100 neurons, and considered the regularization technique (2.2). The FFNN used here is different from that used by Blum and François (2010); see details in Section 1.3.
Next, we generated some true parameters from the prior, drew the observed data x_obs, and numerically computed the exact posterior. Then we computed ABC posteriors using the autocovariance statistic, the DNN-based summary statistic, and the semi-automatic summary statistic. The resulting ABC posteriors were compared to the exact posterior and evaluated in terms of the accuracies of the posterior means of θ_1 and θ_2, the posterior marginal variances of θ_1 and θ_2, and the posterior correlation between θ_1 and θ_2.
4.3 Results
Method  Training RMSE (θ1)  Training RMSE (θ2)  Testing RMSE (θ1)  Testing RMSE (θ2)  Time (s)

Semi-automatic  0.8150  0.3867  0.8174  0.3857  45.63
FFNN (no regularization)  0.1857  0.2091  0.1884  0.2115  543.42
DNN (no regularization)  0.1272  0.1355  0.1293  0.1378  1402.02
FFNN (ℓ2 regularization)  0.2642  0.2522  0.2679  0.2546  432.27
DNN (ℓ2 regularization)  0.1958  0.1939  0.1980  0.1956  1282.66
Again, the DNN learns a better prediction rule than the semi-automatic method and the FFNN, but takes more training time (Table 2, Figures 8 and 9). The regularization technique does not improve the performance.
We ran ABC procedures for an observed datum x_obs generated from a fixed true parameter (θ_1, θ_2), with three different choices of summary statistic: the DNN-based summary statistic, the autocovariance, and the semi-automatic summary statistic. The tolerance threshold was set to accept a small fraction of proposed θ's in the ABC procedures. Figure 10 compares the ABC posterior draws to the exact posterior, which is numerically computed.
The DNN-based summary statistic gives a more accurate ABC posterior than either the ABC posterior obtained by the autocovariance statistic or the semi-automatic construction. One of the important features of the DNN-based summary statistic is that its ABC posterior correctly captures the correlation between θ_1 and θ_2, while the autocovariance statistic and the semi-automatic statistic appear to be insensitive to this information (Table 3).
Posterior  mean(θ1)  mean(θ2)  std(θ1)  std(θ2)  cor(θ1, θ2)

Exact  0.6418  0.2399  0.1046  0.1100  0.6995
ABC (DNN)  0.6230  0.2300  0.1210  0.1410  0.4776
ABC (autocov)  0.7033  0.1402  0.1218  0.2111  0.2606
ABC (semi-auto)  0.0442  0.1159  0.5160  0.4616  0.0645
We repeated the comparison for 100 different observed data x_obs. As Table 4 shows, the ABC procedure with the DNN-based statistic better approximates the posterior moments than those using the autocovariance statistic and the semi-automatic construction.
Posterior  MSE of mean(θ1)  MSE of mean(θ2)  MSE of std(θ1)  MSE of std(θ2)  MSE of cor(θ1, θ2)

ABC (DNN)  0.0096  0.0089  0.0025  0.0026  0.0517
ABC (autocov)  0.0111  0.0184  0.0041  0.0065  0.1886
ABC (semi-auto)  0.5405  0.1440  0.4794  0.0891  0.3116
5. Discussion
We address how to automatically construct low-dimensional and informative summary statistics for ABC methods, with minimal need of expert knowledge. We base our approach on the desirable properties of the posterior mean as a summary statistic for ABC, even though it is generally intractable. We take advantage of the representational power of DNNs to construct an approximation of the posterior mean for use as a summary statistic.
We justify our choice of DNNs only heuristically, but obtain promising empirical results. The Ising model has a univariate sufficient statistic that is the ideal summary statistic and results in the best achievable ABC posterior. Constructing a summary statistic akin to it is challenging due to its high nonlinearity and the high dimensionality of the data, but we see in our experiments that the DNN-based summary statistic approximates an increasing function of the sufficient statistic. In the moving-average model of order 2, the DNN-based summary statistic outperforms the semi-automatic construction. The DNN-based summary statistic, which is constructed automatically, also outperforms the autocovariances, even though the autocovariances in the MA(2) model can be transformed to yield a consistent estimate of the parameters and have been widely used in the literature.
A DNN is prone to overfitting given limited training data, but this is not an issue when constructing summary statistics for ABC. In the setting of Approximate Bayesian Computation, arbitrarily many training samples can be generated by repeatedly sampling θ from the prior π(θ) and X from the model p(X|θ). In our experiments, the size of the training data is much larger than the number of parameters in the neural networks, and there is little discrepancy between the prediction error losses on the training data and the testing data. The regularization technique does not significantly improve the performance.
We also compared DNNs with three hidden layers to feedforward neural networks (FFNNs) with a single hidden layer. Our experimental comparison indicates that FFNNs are less effective than DNNs for the task of summary statistic construction.
Supplementary Materials
The supplementary materials contain an extension of Theorem 1 that shows the convergence of posterior expectations under the ABC posterior obtained with the constructed summary statistic. This extension establishes a global approximation to the posterior distribution.
Implementation details of the backpropagation and stochastic gradient descent algorithms used to train the deep neural networks are also provided, including the derivatives of the squared-error loss function with respect to the network parameters.
Acknowledgements
The authors gratefully acknowledge support from National Science Foundation grants DMS-1407557 and DMS-1330132.
References
Asmussen, S. and Glynn, P. W. (2007). Stochastic simulation: Algorithms and analysis (Vol. 57). Springer Science & Business Media.
Beaumont, M. A., Zhang, W. and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162(4), 2025-2035.
Toni, T., Welch, D., Strelkowa, N., Ipsen, A. and Stumpf, M. P. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6(31), 187-202.
Lopes, J. S. and Beaumont, M. A. (2010). ABC: a useful Bayesian tool for the analysis of population data. Infection, Genetics and Evolution, 10(6), 825-832.
Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41, 379-406.
Csilléry, K., Blum, M. G., Gaggiotti, O. E. and François, O. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution, 25(7), 410-418.
Marin, J. M., Pudlo, P., Robert, C. P. and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6), 1167-1180.
Sunnåker, M., Busetto, A. G., Numminen, E., Corander, J., Foll, M. and Dessimoz, C. (2013). Approximate Bayesian computation. PLoS Comput. Biol., 9(1), e1002803.
Tavaré, S., Balding, D. J., Griffiths, R. C. and Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 145(2), 505-518.
Fu, Y. X. and Li, W. H. (1997). Estimating the age of the common ancestor of a sample of DNA sequences. Molecular Biology and Evolution, 14(2), 195-199.
Weiss, G. and von Haeseler, A. (1998). Inference of population history using a likelihood approach. Genetics, 149(3), 1539-1546.
Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A. and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12), 1791-1798.
Blum, M. G., Nunes, M. A., Prangle, D. and Sisson, S. A. (2013). A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2), 189-208.
Kolmogorov, A. N. (1942). Definition of center of dispersion and measure of accuracy from a finite number of observations. Izv. Akad. Nauk S.S.S.R. Ser. Mat., 6, 3-32 (in Russian).
Lehmann, E. L. and Casella, G. (1998). Theory of point estimation (Vol. 31). Springer Science & Business Media.
Marjoram, P., Molitor, J., Plagnol, V. and Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26), 15324-15328.
Sisson, S. A., Fan, Y. and Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6), 1760-1765.
Joyce, P. and Marjoram, P. (2008). Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7(1).
Nunes, M. A. and Balding, D. J. (2010). On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 9(1).
Wegmann, D., Leuenberger, C. and Excoffier, L. (2009). Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4), 1207-1218.
Fearnhead, P. and Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3), 419-474.
Blum, M. G. and François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1), 63-73.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
Hinton, G. E., Osindero, S. and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
Faragó, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4), 1146-1151.
Sutskever, I. and Hinton, G. E. (2008). Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629-2636.
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192-2207.
Caruana, R., Lawrence, S. and Giles, C. L. (2000). Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, 402-408.
Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473-493.
Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning (p. 78). ACM.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
Onsager, L. (1944). Crystal statistics. I. A two-dimensional model with an order-disorder transition. Physical Review, 65(3-4), 117.
Landau, D. P. (1976). Finite-size behavior of the Ising square lattice. Physical Review B, 13(7), 2997.