# Likelihood-free inference with emulator networks

###### Abstract

Approximate Bayesian Computation (ABC) provides methods for Bayesian inference in simulation-based models which do not permit tractable likelihoods. We present a new ABC method which uses probabilistic neural emulator networks to learn synthetic likelihoods on simulated data – both ‘local’ emulators which approximate the likelihood for specific observed data, as well as ‘global’ ones which are applicable to a range of data. Simulations are chosen adaptively using an acquisition function which takes into account uncertainty about either the posterior distribution of interest, or the parameters of the emulator. Our approach does not rely on user-defined rejection thresholds or distance functions. We illustrate inference with emulator networks on synthetic examples and on a biophysical neuron model, and show that emulators allow accurate and efficient inference even on problems which are challenging for conventional ABC approaches.

00C4Ä \DeclareUnicodeCharacter00E4ä \DeclareUnicodeCharacter00A0 \DeclareUnicodeCharacter2010- \DeclareUnicodeCharacter2013– \DeclareUnicodeCharacter2014— \DeclareUnicodeCharacter2009 \DeclareUnicodeCharacter2026… \DeclareUnicodeCharacter0391A \DeclareUnicodeCharacter0392B \DeclareUnicodeCharacter0393Γ \DeclareUnicodeCharacter0394Δ \DeclareUnicodeCharacter0395E \DeclareUnicodeCharacter0396Z \DeclareUnicodeCharacter0397H \DeclareUnicodeCharacter0398Θ \DeclareUnicodeCharacter0399I \DeclareUnicodeCharacter039AK \DeclareUnicodeCharacter039BΛ \DeclareUnicodeCharacter039CM \DeclareUnicodeCharacter039DN \DeclareUnicodeCharacter039EΞ \DeclareUnicodeCharacter039FO \DeclareUnicodeCharacter03A0Π \DeclareUnicodeCharacter03A1P \DeclareUnicodeCharacter03A3Σ \DeclareUnicodeCharacter03A4T \DeclareUnicodeCharacter03A5Y \DeclareUnicodeCharacter03A6Φ \DeclareUnicodeCharacter03A7X \DeclareUnicodeCharacter03A8Ψ \DeclareUnicodeCharacter03A9Ω \DeclareUnicodeCharacter03B1α \DeclareUnicodeCharacter03B2β \DeclareUnicodeCharacter03B3γ \DeclareUnicodeCharacter03B4δ \DeclareUnicodeCharacter03B5ϵ \DeclareUnicodeCharacter03B6ζ \DeclareUnicodeCharacter03B7η \DeclareUnicodeCharacter03B8θ \DeclareUnicodeCharacter03B9ι \DeclareUnicodeCharacter03BAκ \DeclareUnicodeCharacter03BBλ \DeclareUnicodeCharacter03BCμ \DeclareUnicodeCharacter03BDν \DeclareUnicodeCharacter03BEξ \DeclareUnicodeCharacter03BF\omicron \DeclareUnicodeCharacter03C0π \DeclareUnicodeCharacter03C1ρ \DeclareUnicodeCharacter03C3σ \DeclareUnicodeCharacter03C4τ \DeclareUnicodeCharacter03C5υ \DeclareUnicodeCharacter03C6ϕ \DeclareUnicodeCharacter03C7χ \DeclareUnicodeCharacter03C8ψ \DeclareUnicodeCharacter03C9ω \DeclareUnicodeCharacter0229ȩ

Giacomo Bassetto giacomo.bassetto@caesar.de

Theofanis Karaletsos theofanis@uber.com

Jakob H. Macke macke@tum.de

^{1}

^{1}footnotetext: Computational Neuroengineering, Department of Electrical and Computer Engineering, Technical University of Munich, Germany

^{2}

^{2}footnotetext: Neural Systems Analysis, Research Center caesar, an associate of the Max Planck Society, Bonn, Germany

^{3}

^{3}footnotetext: Uber AI Labs, Uber Technologies, Inc., San Francisco, CA

^{4}

^{4}footnotetext: Part of this work was done while J.H.M was at the Centre for Cognitive Science, Technische Universität Darmstadt, Germany

## 1 Introduction

Many areas of science and engineering make extensive use of complex, stochastic, numerical simulations to describe the structure and dynamics of the processes being investigated Karabatsos_2017. A key challenge in simulation-based science is linking simulation models to empirical data: Bayesian inference provides a general and powerful framework for identifying the set of parameters which are consistent both with empirical data and prior knowledge. One of the key quantities required for statistical inference, the likelihood of observed data given parameters, , is typically intractable for simulation-based models, rendering conventional statistical approaches inapplicable.

Approximate Bayesian Computation (ABC) aims to close this gap BeaumontZhang2002, but classical algorithms Pritchard1999; Marjoram2003 scale poorly to high-dimensional non-Gaussian data, and require ad-hoc choices (i.e., rejection thresholds, distance functions and summary statistics) which can significantly affect both computational efficiency and accuracy. In synthetic likelihood approaches to ABC Wood2010; OngNott2016; Price2018, one instead uses density estimation to approximate the likelihood on summary statistics of simulated data. A recent proposal by Jarvenpaa2017, Gutmann2018 uses a Gaussian process () to approximate the distribution of the discrepancy as a function of , and Bayesian Optimization to propose new parameters. While this approach can be very effective even with a small number of simulations, it still requires summary statistics, choice of a distance function , and relies on assuming a homoscedastic .

The goal of this paper is to scale synthetic-likelihood methods to multivariate and (potentially) non-Gaussian, heteroscedastic data. We use neural-network based conditional density estimators (which we call ‘emulator networks’, inspired by classical work on emulation methods; Kennedy_OHagan_2001), to develop likelihood-free inference algorithms which are efficient, flexible, and scale to high-dimensional observations. Our approach does not require the user to specify rejection thresholds or distance functions, or to restrict oneself to a small number of summary statistics.

## 2 Likelihood-free inference with emulator networks

Our goal is to obtain an approximation to the true posterior of a black-box simulator model, i.e. models from which we can generate samples , but for which we cannot evaluate likelihoods . To solve this task, we learn a synthetic likelihood function by training a conditional density estimator on simulated data. We actively propose parameters for simulations, since simulations are often the dominant cost in ABC: Therefore, we want to keep the number of calls to the simulator as low as possible (Fig. 1).

Core to our approach is an emulator , a conditional density estimator with parameters that approximates the simulator . Having collected an initial simulated dataset , e.g. by repeatedly drawing from the prior and simulating data, the emulator is trained. We actively select new locations for which to simulate new data points to keep the number of calls to the (potentially computationally expensive) simulator low. is appended to the dataset, the emulator is updated, and the active learning loop repeats. The emulator defines a synthetic likelihood function that we use to find an approximate posterior, which is proportional to . This approach is summarized in Appendix A in form of an algorithm.

Thus, our approach requires (1) an emulator, i.e., a flexible conditional density estimator, (2) an approach for learning the emulator on simulated data and expressing our uncertainty about its parameters, (3) an acquisition rule for proposing new sampling locations, and (4) an inference procedure for obtaining the posterior distribution from the synthetic likelihood and the prior. We will describe these steps in the following.

### 2.1 Choice of emulator

We use neural network based emulators : parameters are given as inputs to the network, and the network is trained to approximate . In contrast to traditional synthetic likelihood approaches Wood2010, we are not restricted to using a (multivariate) normal distribution to approximate the conditional density . The output form of the emulator is chosen according to our knowledge regarding the conditional density of the simulator. In our second example application, we e.g. model as a binomial distribution over 8-bit integer pixel values, and in the third example we model a categorical distribution. If the noise model of the simulation process is unknown, flexible conditional density estimators such as conditional autoregressive models oord2016pixel; papamakarios_masked_2017 can be readily used in our approach.

### 2.2 Inference on the parameters of the emulator

We use probabilistic neural networks, i.e. we represent uncertainty about the parameters of the emulator . We then use these uncertainties to guide the acquisition of training data for the emulator using active learning (as discussed in the next section).

In the Bayesian framework, uncertainty is represented through the posterior distribution. Multiple approaches for estimating the posterior distributions over neural network parameters have been proposed, including MCMC methods to draw samples from the full posterior Welling_2011; Chen_2014 and variational methods, e.g. using factorising posteriors Blundell_2015 or normalizing flows louizos2017multiplicative. Finally, deep ensemble approaches Lakshminarayanan16 represent predictive distributions through ensembles of networks. They have the advantage of not requiring the choice of a functional form of the approximation, and are simple to set up.

Our approach can be applied with any method that represents uncertainty over network parameters. In our experiments, we use deep ensembles to represent uncertainty about , as we found them to combine simplicity with good empirical performance. Instead of training a single emulator network and inferring its posterior distribution, we train an ensemble of networks with parameters . From here on, we treat as if they were samples from , the posterior over network parameters given data. (In practice, these samples will describe local maxima of the posterior.) The posterior-predictive distribution is approximated by .

Networks are trained supervised with data . During training, the parameters of the networks are optimized subject to the loss w.r.t. (a proper scoring rule as discussed in Lakshminarayanan16). Networks in the ensemble are initialized differently, and data points are randomly shuffled during training.

### 2.3 Acquisition rules

We use active learning to selectively acquire new samples. We distinguish between two scenarios: In the first, we have particular observed data available, and train a local emulator which approximates the likelihood near . This approach requires learning a new emulator for each new observed data .

We also consider a second scenario, in which we learn a global emulator – which approximates globally. Learning a global emulator is more challenging and may potentially require more flexible density estimators. However, once the emulator is learned, we can readily approximate the likelihood for any , therefore amortizing the cost of learning the emulator.

The two scenarios call for different acquisition functions for proposing new samples, which we will discuss next.

#### 2.3.1 Acquisitions for local emulator learning

With given , we want to learn a local emulator that allows us to derive a good approximation to the (unnormalized) posterior .

As we are interested in increasing our certainty about the posterior, we target its variance, , where denotes that we take the variance with respect to the posterior over network weights given data . Thus, we use an acquisition rule which targets the region of maximum variance in the predicted (unnormalized) posterior,

(1) |

We approximate with the sample variance across drawn from the posterior over networks. We refer to this rule as the MaxVar rule Jarvenpaa2017. We optimize this acquisition rule by using gradient descent, making use of automatic differentiation to take gradients with respect to through the synthetic likelihood specified by the emulator.

#### 2.3.2 Acquisitions for global emulator learning

A global emulator may be used to do inference once becomes available. Here, the goal for active learning is to bring the emulator close to the simulator for all s using as few runs of the simulator as possible. We use a rule based on information theory from the active learning literature Houlsby11; Gal17; Depeweg17. We refer to the rule

(2) |

as the maximum mutual information rule (MaxMI). See Appendix B for details.

### 2.4 Deriving the posterior distribution from the emulator

Once we have learned the emulator, we use Hamiltonian Monte Carlo (HMC, Neal_2010) to draw samples from our approximate posterior, using the emulator-based synthetic likelihood. We generate samples of drawn from the distribution . In practice, we sample from each ensemble member individually and use the union of all samples as a draw from the approximate posterior. We could also obtain the posterior through variational inference, but here prefer to retain full flexibility in the shape of the inferred posterior.

## 3 Results

We demonstrate likelihood-free inference with emulator networks on three examples: i) we show that emulators are competitive with state-of-the-art on an example with Gaussian observations; ii) we demonstrate the ability of emulators to work with high-dimensional observations while learning to amortize the simulator; iii) we show an application from neuroscience, and infer the posterior over parameters of a biophysical neuron model.

### i) Low-dimensional example: Simulator with Gaussian observations

We first demonstrate emulator networks on a non-linear model between parameters and data, corrupted by additive Gaussian observation noise: data is generated according to , , where is cubic in , is fixed, and is distributed uniformly (see Appendix D for complete specification). The goal is to approximate the posterior from a small number of draws from the generative model (Fig. 2a). We parameterize using a Gaussian distribution whose mean and precision are the output of a neural network with one hidden layer consisting of 10 units.

We will compare our method to BOLFI (Bayesian Optimization for Likelihood-free Inference, Gutmann2018), an ABC method which – given a user-specified discrepancy measure – learns a that models the distribution of discrepancies between summary statistics of and . Jarvenpaa2017 proposed multiple acquisition rules for BOLFI. The most principled (but also most costly) rule minimizes the expected integrated variance (ExpIntVar) of the approximate posterior after acquiring new data. BOLFI is a state-of-the-art method for simulation-efficient likelihood-free inference, and substantially more efficient than classical rejection-based methods such as rejection-ABC Pritchard1999, MCMC-ABC Marjoram2003, SMC-ABC Sisson2007.

We use the total variation (TV) between true and approximate posterior (evaluated using numerical integration) to quantify performance as a function of the number of acquisitions. The emulator is trained on an initial dataset and updated after each new acquisition. We find that emulators with MaxVar sampling work better than uniform sampling (Fig. 2b). Both BOLFI rules (ExpIntVar and MaxVar) exhibit very similar performance, but require higher number of simulations than emulators to reach low TV values. On a 2-dimensional version of the problem, the qualitative ordering is the same, but the differences between methods are greater (Fig. 2c). We did additional runs of BOLFI MaxVar to confirm that it eventually converges towards the correct posterior. However, convergence is slow and the quality of the inferred posterior depends strongly on the choice of the threshold parameter used in BOLFI (see Appendix G).

### ii) High-dimensional observations: Inferring the location and contrast of a blob

We show that our method can be applied to estimation problems with high-dimensional observations without having to resort to using summary statistics. We model the rendering of a blob on a 2D image, and learn a global emulator for the forward model.

The forward model takes as inputs three parameters (, and ) – which encode horizontal and vertical displacement, and contrast of the blob – and returns per-pixels activation probabilities . The value of each pixel is then generated according to a binomial distribution with total count (8-bits gray-scale image) and probability , resulting in a pixel image (Fig. 3a). In this application, we use a multi-layer neural network whose output is, for each pixel, the mean parameter of the binomial distribution (see Appendix E for further details).

Using the MaxMI rule to acquire new test points in parameters space results in faster learning of the emulator, compared to uniform random acquisitions. Eventually, both rules converge towards the log-likelihood of the held-out test set, indicating successful global emulation of the forward model (Fig. 3b). We show posteriors distributions and samples in Appendix H. Since alternative approaches for likelihood-free inference (e.g. BOLFI) do not allow one to globally approximate a simulator, no performance benchmark against these methods was performed.

### iii) Scientific application: Hodgkin-Huxley model

As an example of a scientific application, we use the Hodgkin-Huxley model HodgkinHuxley1952 which describes the evolution of membrane potential in neurons (Fig. 4a). Fitting single- and multi-compartment Hodgkin-Huxley models to neurophysiological data is a central problem in neuroscience, and typically addressed using non-Bayesian approaches based on evolutionary optimization Druckmann2007; VanGeit2016. In contrast to the previous examples, we do not model the raw data , but summary features derived from them. While this is often done out of necessity, calculating the posterior relative to summary statistics can be of scientific interest Cornebise2012. This is indeed the case when fitting biophysical models in neuroscience, which is typically performed with carefully chosen summary statistics representing properties of interest.

Here, we chose to model the number of action potentials (or spikes) in response to a step-current input, and we are interested in the set of parameters that are consistent with the observed number of action potentials. The conditional density of the emulator networks becomes a categorical distribution with 6 classes, modelling the probabilities of exactly 0, 1, …4 spikes, and 5 or more spikes (which never occurred under the parameter ranges we explored). Model parameters are the ion-channel conductances and , controlling the shape and frequency of the spikes (further details in Appendix F).

We trained emulator networks using MaxMI to infer the posterior probabilities over generating a given number of observed spikes – the acquisition surface is shown in Appendix I. Resulting posterior distributions are shown in Fig. 4b, along with a posterior predictive check showing that the mapping between parameters and summary features was learned correctly.

## 4 Discussion

We presented an approach for performing statistical inference on simulation-based models which do not permit tractable likelihood. We learn an ‘emulator network’, i.e. a probabilistic model that is consistent with the simulation, and for which likelihoods are tractable. The likelihoods of the emulator can then be plugged into any Bayesian inference approach (as in synthetic likelihood approaches Wood2010; OngNott2016; OngNott2017) to calculate the posterior. Active learning can be used to adaptively suggest new samples to reduce the number of calls to the simulator. We discussed two acquisition functions for learning ‘local’ and ‘global’ emulators, we showed that our approach scales to high-dimensional observation spaces, does not require user-defined distance functions or acceptance thresholds, and is not limited to Gaussian observations – all of which are challenging for conventional ABC approaches.

Our approach uses density estimation to approximate the likelihood. A complementary use of density-estimation in ABC is to directly target the posterior distribution PapamakariosMurray17; Lueckmann2017; LeBaydinZinkovWood2017; Izbicki2018. This approach can be very useful – however, one advantage of likelihood-based approaches is that they allow one to apply the same synthetic likelihood to multiple priors (without having to retrain), or to pool information from multiple observations (by multiplying the corresponding synthetic likelihoods). More technically, posterior density estimation gives less flexibility in proposing samples – in order to yield the correct posterior, samples have to be drawn from the prior, or approaches such as importance-weighting Lueckmann2017 or other post-hoc corrections PapamakariosMurray17 have to be applied. We discuss additional related work published concurrently with this manuscript in Appendix C.

There are multiple ways in which our approach can be improved further: First, one could use alternative, and more expressive neural-network based density estimators, e.g. ones based on normalizing flows papamakarios_masked_2017. Second, one could use Bayesian posterior estimation (rather than ensembles) to capture parameter uncertainty, and/or use variational inference (rather than HMC) to derive an estimate of the posterior from the synthetic likelihood provided by the emulator. Third, we presented two acquisition functions (one for local and one for global estimation) – it is likely that the approach can be made more simulation-efficient by using different, and more sophisticated acquisition functions. In particular, our MaxVar rule targets the parameters with maximal uncertainty, but does not try to predict whether that uncertainty will be effectively reduced. However, evaluating acquisition functions like ExpIntVar can be computationally expensive – it will be useful to develop approaches which are sensitive to the relative cost of simulations and proposals, and adaptively adjust the acquisition function used.

Numerical simulations make it possible to model complex phenomena from first principles, and are indispensable tools in many fields in engineering and science. The advent of powerful approaches for statistical inference in simulation-based models BrehmerCranmer2018 is opening up exciting opportunities for closing the gap between mechanistic, simulation-based and statistical approaches to modelling complex systems. Our Bayesian methodology based on emulators provides a fast, effective surrogate model for the intractable likelihood implied by the simulator, and the active-learning based rules lead to bounded-rational decisions about which simulations to run. In combination, they form a rigorous and resource-efficient basis for data analysis with simulators in the loop.

## Acknowledgements

We thank Marcel Nonnenmacher and Pedro J. Gonçalves for discussions, and help with simulation of the Hodgkin-Huxley model. We thank David Greenberg and all members of the Neural Systems Analysis group for comments on the manuscript.

This work was supported by BMBF (FKZ 01IS18052 A-D, Project ADIMEM) and DFG (SFB 1089, SPP 2041, and SFB 1233, Project ’Robust Vision’, 276693517) grants and by the caesar foundation.

manuscript.bbl

## A

## B Acquisition rule for global emulator learning

For global emulator learning, we use a rule based on information theory from the active learning literature that maximizes information gain Houlsby11, Gal17, Depeweg17. We refer to the rule

(3) | ||||

as the maximum mutual information rule (MaxMI).

The first term is the entropy of the data under the posterior-predictive distribution implied by the emulator:

(4) |

where is obtained by marginalizing out the emulator’s parameters w.r.t. :

(5) |

The expected conditional entropy, , is the average entropy of the output for a particular choice of inputs and emulator parameters , under the posterior distribution of emulator parameters . Again, we treat ensemble members as if they were draws from . Houlsby11 refer to this rule as Bayesian Active Learning by Disagreement (BALD): we query parameters where the posterior predictive is very uncertain about the output (entropy is high), but the emulator, conditioned on the value of its parameters , is on average quite certain about the model output (conditional entropy low on average).

For many distributions closed-form expressions of are available, but this is in general not true for the entropy of the marginal predictive distribution . To overcome this problem, we derived an upper-bound approximation to the entropy term based on the law of total variance: if we characterize the marginal distribution only in terms of its (co)variance , then . Using the law of total (co)variance, we get

(6) |

where all expectations can be approximated by samples drawn from .

Note that the density of the forward model, , does not appear in this rule. By using the upper-bound, we can use gradient-based optimization to find . Alternatively, entropies could be approximated using sample, which, however, would be slower.

appendix_acq_global.bbl

## C Additional related work

papamakarios_sequential_2018, concurrently and independently to our approach [lueckmann_2018, an earlier preprint version of this work], proposed learning synthetic likelihoods using neural density estimators for likelihood-free inference: They use Masked Autoregressive Flows as synthetic likelihoods and report state-of-the-art performance compared to methods that directly target the posterior. Like our approach, the density estimator is trained on sequentially chosen simulations. Rather than using acquisition functions that take into account uncertainty to guide sampling, they draw samples from the current estimate of the posterior. Their approach corresponds to an alternative way of learning a local emulator.

The recent workshop paper of durkan_2018 compares papamakarios_sequential_2018 and our approach on three toy problems learning local emulators. On these toy-problems, both methods are similarly efficient (and more efficient than methods directly targeting the posterior), however, the wallclock time of our method is substantially higher, because of the additional cost of evaluating the acquisition function. Whether this additional cost is warranted on a given problem will depend both on any additional gain brought about by the active selection of samples, as well as the cost of the simulator. For expensive simulation costs, additional computational budget should be spent to carefully decide for which parameters to simulate.

appendix_related_work.bbl

## D Gaussian simulator example

### d.1 Model

Data is generated independently according to , , where , , , for , , and is distributed uniformly in where is the dimensionality of the problem.

This problem is inspired by the Gaussian example studied in Jarvenpaa2017, where was chosen as . We introduce a nonlinearity in , since our method with uniform acquisitions would otherwise trivially generalize across the space – we observed that a neural network with the right amount of ReLu units can learn the linear mapping perfectly, independently of where the training samples are acquired.

### d.2 Evaluation

We evaluate our method and BOLFI Jarvenpaa2017 on this problem in D and D. In D, algorithms start with initial samples, in D with , and make 100 acquisitions after each of which we evaluate how well the ground truth posterior is recovered.

As performance metric, we calculate total variation (TV) between and , defined as

### d.3 Network architecture and training

Emulator networks model a normal distribution as output, so that the outputs of the network parametrise mean and covariance (Cholesky factor of the covariance matrix). Neural networks have one hidden layer consisting of 10 units. We train an ensemble of networks using Adam kingma_2014 with default parameters () for SGD, and a learning rate of .

### d.4 Bolfi

BOLFI requires choice of a distance function: We use the the Mahalanobis distance

in line with the distance function used for the Gaussian example studied in Jarvenpaa2017. We use the implementation provided by the authors elfi.

appendix_gauss.bbl

## E Image example

### e.1 Model

Images are generated according to:

where and are coordinates in the image, and is the binomial distribution.

Model parameters are and , which respectively determine the horizontal and the vertical offset of the blob, , defining its contrast, and , determining the width.

For our experiments, we use images of size pixels. We choose uniform priors in the range for and , and a uniform prior in the range for . We fix to 2.

### e.2 Evaluation

We evaluate different acquisition methods by keeping track of the log-likelihood of a test set consisting of 5000 parameters-image pairs over the course of acquisitions (starting from an initial sample of size ).

### e.3 Network architecture and training

Emulator networks model a binomial distribution as output. Neural networks have two hidden layers (200 units each) with ReLu activation functions. We train an ensemble of networks using Adam kingma_2014 with default parameters () for SGD with a learning rate of .

appendix_blob.bbl

## F Hodgkin-Huxley example

### f.1 Model

The dynamic equations describing the evolution of the membrane potential and of the gating variables of the neuron are taken from Pospischil_2008:

where is membrane capacitance, the membrane potential, are ionic currents () and is an externally applied current which we can imagine as the sum of a static bias and a time-varying zero-mean noise signal . and shape the up- and down-stroke phases of the action potential (spike), is responsible for spike-frequency adaptation, and is a leak current describing the passive properties of the cell membrane. Each current is in turn expressed as the product of a maximum conductance () and the voltage difference between the membrane potential and the reversal potential for that current(), possibly modulated by zero or more ‘gating’ variables (, , , ).

Each evolves according to first order kinetics in the form:

We provide a step current as input.

In our example application, free model parameters are and . We model uniform priors over these parameters: is between and and is between and .

### f.2 Evaluation

We evaluate the posterior obtained through the emulator after acquisitions, starting from an initial sample size . As posterior predictive check, we span a grid over the parameter space and compare simulator outputs to the posterior.

### f.3 Network architecture and training

Emulator networks model a categorical distribution with classes as output. Neural networks have two hidden layer (200 units each) with a ReLu activation functions. We train an ensemble of networks using Adam kingma_2014 with default parameters () for SGD with a learning rate of .

appendix_hh.bbl