Flow-based generative models for Markov chain Monte Carlo in lattice field theory
Abstract
A Markov chain update scheme using a machine-learned flow-based generative model is proposed for Monte Carlo sampling in lattice field theories. The generative model may be optimized (trained) to produce samples from a distribution approximating the desired Boltzmann distribution determined by the lattice action of the theory being studied. Training the model systematically improves autocorrelation times in the Markov chain, even in regions of parameter space where standard Markov chain Monte Carlo algorithms exhibit critical slowing down in producing decorrelated updates. Moreover, the model may be trained without existing samples from the desired distribution. The algorithm is compared with HMC and local Metropolis sampling for φ⁴ theory in two dimensions.
I Introduction
A key problem in lattice field theory and statistical mechanics is the evaluation of integrals over field configurations, referred to as path integrals. Typically, such integrals are evaluated via a Markov chain Monte Carlo (MCMC) approach: field configurations are sampled from the desired probability distribution, dictated by the action of the theory, using a Markov chain. A significant practical concern is the existence of correlations between configurations in the chain. Critical slowing down Wolff (1990) refers to the divergence of the associated autocorrelation time as a critical point in parameter space is approached. This behavior drastically increases the computational cost of simulations in these parameter regions Del Debbio et al. (2004); Meyer et al. (2007). For some models, algorithms have been found which significantly reduce or eliminate this slowing down Kandel et al. (1988); Bonati and D’Elia (2018); Hasenbusch and Schaefer (2018); Swendsen and Wang (1987); Wolff (1989); Prokof’ev and Svistunov (2001); Kawashima and Harada (2004); Bietenholz et al. (1995), enabling efficient simulation. For field theories, a number of methods have been proposed to circumvent critical slowing down by variations of Hybrid Monte Carlo (HMC) techniques Ramos (2012); Gambhir and Orginos (2015); Cossu et al. (2018); Jin and Osborn (2019), multiscale updating procedures Endres et al. (2015); Detmold and Endres (2016, 2018), open boundary conditions or nonorientable manifolds Luscher and Schaefer (2013); Mages et al. (2017); Burnier et al. (2018), metadynamics Laio et al. (2016), and machine learning tools Tanaka and Tomiya (2017); Shanahan et al. (2018). 
In important classes of theories, however, critical slowing down remains limiting; for example, in lattice formulations of Quantum Chromodynamics (QCD, the piece of the Standard Model describing the strong nuclear force) it is a major barrier to simulations at the fine lattice spacings required for precise control of the continuum limit.
Here, a new flow-based MCMC approach is proposed and applied to lattice field generation. The resulting Markov chain has autocorrelation properties that are systematically improvable by an optimization (training) step before sampling. The approach defines an improved Markov chain step using a flow-based generative model from which sampling is efficient and in which the probability density associated with each sample is easy to compute. In this method, a set of tunable model parameters defines a change-of-variables, a "flow", that maps input samples z from a simple prior distribution r(z) to output samples φ from a new distribution p̃_f(φ). By training the model parameters, the output distribution p̃_f can be made to approximate a desired probability distribution p. A flow-based Markov chain is then defined to guarantee asymptotic exactness of sampling by using a Metropolis-Hastings step with p̃_f taken as a proposal distribution. Since proposed samples are independent of the previous samples in the chain, the autocorrelation time and acceptance rate are coupled; the autocorrelation time drops to zero as the acceptance rate approaches 100%. This is true even in regions of parameter space where standard algorithms exhibit critical slowing down. Under mild conditions (detailed in Section II), this approach is guaranteed to generate samples from the desired probability distribution in the limit of a large number of updates.
This method has several features that make it attractive for the evaluation of path integrals in lattice field theories:

The autocorrelation time of the Markov chain can be systematically decreased by training the model;

Each step of the Markov chain requires only the model evaluation and an action computation;

Each update proposal is independent of the previous sample, thus proposals can be generated in parallel and efficiently composed into a Markov chain;

The model is trained using samples produced by the model itself, without the need for existing samples from the desired probability distribution.
The proposed flow-based MCMC algorithm is detailed in Section II. A numerical study of its effectiveness in the context of two-dimensional φ⁴ theory is presented in Section III. Finally, Section IV outlines the further development and scaling of the approach that will be required for applications to theories defined in a larger number of spacetime dimensions and to more complicated field theories such as QCD.
II A flow-based Markov chain Monte Carlo algorithm
In lattice field theory, a Markov chain Monte Carlo (MCMC) process is an efficient way to generate field configurations distributed according to a target probability distribution

p(φ) = e^{−S(φ)}/Z,   Z = ∫ ∏_i dφ_i e^{−S(φ)},   (1)

where i indexes the components of φ, S(φ) is the action that defines the theory, and Z is the partition function. Here, φ is defined to be a vector of real components representing the combined internal and spacetime degrees of freedom of the field evaluated on a finite, discrete spacetime lattice (generalizations to gauge fields are discussed in Section IV). An MCMC process generates a chain φ₁ → φ₂ → ⋯ by steps through configuration space starting with an arbitrary configuration φ₁. The steps are stochastic and are determined by the probabilities T(φ → φ′) associated with each possible transition φ → φ′. These probabilities must be non-negative and normalized:
T(φ → φ′) ≥ 0,   Σ_{φ′} T(φ → φ′) = 1.   (2)
They must also satisfy the conditions of ergodicity and balance to ensure that samples in the chain are drawn from a distribution that converges to p(φ) after thermalization. For the chain to be ergodic, it must be possible to transition from a starting configuration φ to any other configuration φ′ in a finite number of steps, i.e.,

for all φ, φ′ there exists a finite k such that T^k(φ → φ′) > 0,   (3)

and the chain must not have a period, for which it is sufficient that a single state has nonzero self-transition probability, i.e.,

there exists φ such that T(φ → φ) > 0.   (4)
Balance is the condition that p(φ) is a stationary distribution of the transition:

Σ_φ p(φ) T(φ → φ′) = p(φ′).   (5)
Any procedure which satisfies these conditions will, in the limit of a sufficiently long Markov chain, produce field configurations distributed according to p(φ).
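These conditions can be checked numerically on a small toy chain. The sketch below (assuming NumPy; the three-state target and proposal distributions are arbitrary stand-ins, not taken from this study) builds the full transition matrix of an independence Metropolis sampler and verifies the normalization and balance conditions of Eqs. (2) and (5):

```python
import numpy as np

# Hypothetical 3-state target p and proposal q for an independence
# Metropolis sampler; both are stand-in values for illustration.
p = np.array([0.5, 0.3, 0.2])   # desired distribution
q = np.array([0.4, 0.4, 0.2])   # proposal distribution

n = len(p)
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            # propose j with prob q[j]; accept with min(1, q[i]p[j]/(q[j]p[i]))
            T[i, j] = q[j] * min(1.0, (q[i] * p[j]) / (q[j] * p[i]))
    T[i, i] = 1.0 - T[i].sum()   # rejection keeps the current state

assert np.allclose(T.sum(axis=1), 1.0)   # normalization, Eq. (2)
assert np.allclose(p @ T, p)             # balance: p is stationary, Eq. (5)
```

The stationarity check passes because the Metropolis-Hastings acceptance rule makes p(φ)T(φ → φ′) symmetric in φ ↔ φ′, i.e., the chain in fact satisfies detailed balance.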
II.1 Metropolis-Hastings with generative models
Given a model that allows sampling from a known probability distribution p̃(φ), a Markov chain for a desired probability distribution p(φ) can be constructed via the independence Metropolis sampler, a specialization of the Metropolis-Hastings method Tierney (1994). For each step of the chain, an update proposal φ′ is generated by sampling from p̃(φ′), independent of the previous configuration φ. This proposal is accepted with probability

A(φ → φ′) = min[ 1, (p̃(φ)/p̃(φ′)) · (p(φ′)/p(φ)) ].   (6)

If the proposal is accepted, φ_{i+1} = φ′; otherwise, φ_{i+1} = φ_i. This procedure defines the transition probabilities of the Markov chain.
The general Metropolis-Hastings algorithm has been proven to satisfy balance Hastings (1970) for any proposal scheme. For the independence Metropolis sampler, under the further condition that every state with nonzero desired density also has nonzero proposal density,

p̃(φ) > 0 wherever p(φ) > 0,   (7)

the Markov chain is also ergodic and thus guaranteed to converge to the desired distribution Tierney (1994).
This Markov chain can be intuitively considered a method to correct an approximate distribution p̃ to the desired distribution p. The accept/reject statistics of the Metropolis-Hastings algorithm serve as a diagnostic for the closeness of the approximate and desired distributions; if the distributions are equal, proposals are accepted with probability 1 and the Markov chain process is equivalent to a direct sampling of the desired distribution. This is made precise in Section II.3.
II.2 Sampling using normalizing flows
Here, a normalizing flow model is used to define a proposal distribution for a generative Metropolis-Hastings algorithm. Normalizing flows Rezende and Mohamed (2015) are a machine learning approach to the task of sampling from complicated, intractable distributions. They do so by learning a map from an input distribution that is easy to sample to an output distribution that approximates the desired distribution. Normalizing flow models produce both samples and their associated probability densities, allowing the acceptance probability in Eq. (6) to be calculated.
A normalizing flow enacts the transformation between distributions by a change-of-variables (the convention of using z for the input of the change-of-variables stems from typical applications of normalizing flows): a smooth, bijective function f maps samples z from a prior distribution r(z) to φ = f(z). This mapping implicitly defines an output distribution p̃_f(φ), parameterized by the invertible function f. Typically, the prior distribution is a simple and analytically-understood distribution (e.g., a normal distribution). While the desired distribution is often complicated and difficult to sample from directly, optimizing the function f allows one to generate samples from p̃_f(φ) ≈ p(φ). The function f is chosen to have a tractable Jacobian such that the probability density associated with any sample can be computed exactly:

p̃_f(φ) = r(f⁻¹(φ)) |det ∂f⁻¹(φ)/∂φ|.   (8)
To encode a map from a simple distribution r to a complicated distribution p̃_f, the map f must be highly expressive while also being invertible and having a computable Jacobian. Here, the real non-volume-preserving (real NVP) flow Dinh et al. (2016) machine learning approach is used: f is constructed by the composition of affine coupling layers that scale and offset half of the components of the input at a time; the choice of which components of the data are transformed is part of the layer definition. Splitting the D-dimensional vector z into (D/2)-dimensional pieces z₁ and z₂ according to this choice, a single coupling layer g transforms z to z′ via

z′₁ = e^{s(z₂)} ⊙ z₁ + t(z₂),   z′₂ = z₂,   (9)
where s and t are neural networks mapping from ℝ^{D/2} to ℝ^{D/2} and ⊙ denotes element-wise multiplication. Importantly, each affine layer is invertible, because the half of the data that is used as input to the neural networks remains unchanged by the layer:

z₁ = e^{−s(z′₂)} ⊙ (z′₁ − t(z′₂)),   z₂ = z′₂.   (10)
The Jacobian matrix is lower-triangular and its determinant can be easily computed. For coupling layer g_i:

|det ∂g_i(z)/∂z| = ∏_j e^{[s_i(z₂)]_j},   (11)
where j indexes the components of the output of s_i. Stacking many coupling layers g₁, …, g_n which alternate which half of the data is transformed, the function f is defined as

f(z) = (g_n ∘ g_{n−1} ∘ ⋯ ∘ g₁)(z).   (12)
Using the chain rule, the determinant of the Jacobian of f is a product of the contributions from each g_i. By increasing the number of coupling layers and the complexity of the networks s and t, f can systematically be made more expressive and general. Figure 1 depicts how composing many coupling layers incrementally modifies a prior distribution which is easy to sample into a more complex output distribution that approximates a distribution of interest.
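A minimal numerical sketch may make Eqs. (9)-(12) concrete. In the code below the networks s and t are replaced by fixed random maps (stand-ins for trained neural networks, not the architecture of this study); the exact invertibility of the stack and the accumulation of the log-determinant are then checked directly:

```python
import numpy as np

# Sketch of a real NVP coupling stack (Eqs. 9-12) on a D-dimensional vector,
# alternating which half is transformed. s and t are stand-in random maps.
rng = np.random.default_rng(0)
D = 8

def make_layer(rng):
    Ws, Wt = 0.1 * rng.standard_normal((2, D // 2, D // 2))
    s = lambda x: np.tanh(x @ Ws)   # stand-in for the network s
    t = lambda x: x @ Wt            # stand-in for the network t
    return s, t

layers = [make_layer(rng) for _ in range(4)]

def forward(z):
    """Apply f = g_n o ... o g_1; return phi and log|det df/dz|."""
    logdet = 0.0
    for k, (s, t) in enumerate(layers):
        z1, z2 = (z[:D // 2], z[D // 2:]) if k % 2 == 0 else (z[D // 2:], z[:D // 2])
        z1 = np.exp(s(z2)) * z1 + t(z2)   # Eq. (9): scale and offset one half
        logdet += s(z2).sum()             # Eq. (11): log-det of this layer
        z = np.concatenate([z1, z2]) if k % 2 == 0 else np.concatenate([z2, z1])
    return z, logdet

def inverse(phi):
    for k in reversed(range(len(layers))):
        s, t = layers[k]
        z1, z2 = (phi[:D // 2], phi[D // 2:]) if k % 2 == 0 else (phi[D // 2:], phi[:D // 2])
        z1 = np.exp(-s(z2)) * (z1 - t(z2))  # Eq. (10): exact inverse
        phi = np.concatenate([z1, z2]) if k % 2 == 0 else np.concatenate([z2, z1])
    return phi

z = rng.standard_normal(D)
phi, logdet = forward(z)
assert np.allclose(inverse(phi), z)   # f is bijective by construction
```

Because only the half of the vector that is not fed into s and t is modified, each layer can be undone exactly, which is what makes the density of Eq. (8) computable.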
For a fixed initial distribution r(z), the neural networks within each affine coupling layer of f can be trained to bring p̃_f close to the desired distribution p. This training is undertaken by minimizing a loss function. Here, the loss function used is a shifted Kullback-Leibler (KL) divergence (this training paradigm is a specific instance of probability density distillation Oord et al. (2017); Li and Wang (2018)) between the target distribution of the form p(φ) = e^{−S(φ)}/Z and the proposal distribution p̃_f(φ):

L(p̃_f) ≡ D_KL(p̃_f ‖ p) − log Z = ∫ ∏_i dφ_i p̃_f(φ) [ log p̃_f(φ) + S(φ) ].   (13)
This loss function has been successfully applied in related generative approaches to statistical lattice models Zhang et al. (2018); Li and Wang (2018). The formal shift by −log Z in Eq. (13) eliminates the need to compute the true partition function, and does not affect the gradients or the location of the minima. By non-negativity of the KL divergence, the lower bound on the loss is −log Z, and this minimum is achieved exactly when p̃_f = p. In practice, the loss is stochastically estimated by drawing batches of N samples from the model and computing the sample mean:

L(p̃_f) ≈ (1/N) Σ_{k=1}^{N} [ log p̃_f(φ_k) + S(φ_k) ],   φ_k ~ p̃_f.   (14)
The loss minimization can then be undertaken using stochastic optimization techniques such as stochastic gradient descent or momentum-based methods including Adam and Nesterov-accelerated gradient descent Kingma and Ba (2015); Nesterov (1983).
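The stochastic loss estimate of Eq. (14) can be illustrated in one dimension, where both the model and the target are Gaussian so the minimum value −log Z is known analytically; the setup below is a hypothetical stand-in, not the lattice model of this study:

```python
import numpy as np

# Stochastic estimate of the shifted KL loss (Eq. 14) for a toy 1D target
# S(phi) = phi^2 / 2, i.e. p = e^{-S}/Z with Z = sqrt(2*pi).
rng = np.random.default_rng(1)
S = lambda phi: 0.5 * phi**2

def loss(sigma, n=200_000):
    """Model = Normal(0, sigma): sample phi, then average log p_model + S."""
    phi = sigma * rng.standard_normal(n)
    log_model = -0.5 * (phi / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.mean(log_model + S(phi))

# The minimum value is -log Z, attained when the model equals the target.
assert abs(loss(1.0) + np.log(np.sqrt(2 * np.pi))) < 0.01
assert loss(2.0) > loss(1.0)   # a mismatched model gives a larger loss
```

Note that no samples from the target distribution are required: the expectation in Eq. (13) is taken over the model's own samples, which is what enables the self-training discussed next.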
By construction, the flow model allows efficient sampling from p̃_f. The training process can thus be performed by drawing samples from the model itself, rather than using existing samples from the desired distribution as training data. This self-training is a key feature of the proposed approach to Monte Carlo sampling for field theories, where samples from the desired distribution are often computationally expensive to obtain. If samples do exist, they can be used to 'pre-train' the network, although in the numerical studies undertaken here this was not found to be markedly more efficient in network optimization than using only self-training.
Given a trained model with distribution p̃_f, samples from p̃_f can be used as proposals to advance a Markov chain using the generative Metropolis-Hastings algorithm described above. This forms the basis for the flow-based MCMC algorithm proposed here:

A flow-based generative model (here, a real NVP model) is trained using the shifted KL loss given in Eq. (13) to have output distribution p̃_f(φ) approximating the desired distribution p(φ);

N proposals are produced by sampling from the flow-based model (this can be done in parallel), and the associated action is computed for each proposal;

Starting from an arbitrary initial configuration, each proposed sample is successively accepted or rejected using the Metropolis-Hastings algorithm given in Eq. (6) to build a Markov chain of length N.
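The procedure above can be sketched as a short loop. For illustration, the "trained model" below is a stand-in one-dimensional Gaussian with σ = 1.2 and the target action is S(φ) = φ²/2, both hypothetical choices for which the exact answer (zero mean, unit variance) is known:

```python
import numpy as np

# Flow-based MCMC sketch: i.i.d. proposals from the model, filtered through
# the Metropolis-Hastings test of Eq. (6).
rng = np.random.default_rng(2)
sigma = 1.2
S = lambda phi: 0.5 * phi**2
# model log-density up to a constant, which cancels in the acceptance ratio
log_model = lambda phi: -0.5 * (phi / sigma) ** 2 - np.log(sigma)

# Step 2: proposals are independent, so they can be generated in one batch.
N = 100_000
proposals = sigma * rng.standard_normal(N)

# Step 3: a sequential accept/reject pass builds the chain.
chain, rejected = [proposals[0]], 0
for phi_new in proposals[1:]:
    phi_old = chain[-1]
    log_A = (log_model(phi_old) - log_model(phi_new)) + (S(phi_old) - S(phi_new))
    if np.log(rng.random()) < min(0.0, log_A):
        chain.append(phi_new)
    else:
        chain.append(phi_old)   # rejection repeats the previous configuration
        rejected += 1
chain = np.array(chain)

assert abs(chain.mean()) < 0.05        # target has zero mean
assert abs(chain.var() - 1.0) < 0.05   # ... and unit variance
```

Only the sequential accept/reject step is serial; the expensive model evaluations and action computations all happen in the batched proposal stage.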
When the prior distribution r(z) is strictly positive, the invertibility and continuity of f guarantee that the generated distribution p̃_f(φ) is also strictly positive. For all models with finite action, and thus p(φ) > 0, the resulting Markov chain is then ergodic by the arguments detailed in Section II.1.
II.3 Autocorrelation time for generative Metropolis-Hastings
For any Markov chain constructed via a generative Metropolis-Hastings algorithm (with independent update proposals), an observable-independent estimator for the autocorrelation time can be defined from the accept/reject statistics of the chain. This serves both as a measure of the similarity between the proposal and desired distributions and enables proper error estimation for lattice observables Wolff (2004).
Precisely, the autocorrelation at Markov chain separation τ, for all observables, is given by the probability of τ rejections in a row,

ρ(τ) = E_i[ ∏_{j=1}^{τ} 1^{rej}_{i+j} ],   (15)

where 1^{rej}_i is an indicator variable taking value 1 when the proposed step from φ_{i−1} to φ_i was rejected in the Metropolis-Hastings algorithm, and 0 otherwise. In practice, for a near-equilibrium, finite Markov chain with length N, a finite-sample estimator provides a good approximation to ρ(τ):

ρ̂(τ) = (1/(N − τ)) Σ_{i=1}^{N−τ} ∏_{j=1}^{τ} 1^{rej}_{i+j}.   (16)
This measure of autocorrelation is consistent with the usual definition; it is shown in Appendix A that the standard estimator for the autocorrelation of any given observable O,

ρ̂_O(τ) = [ (1/(N − τ)) Σ_{i=1}^{N−τ} (O_i − Ō)(O_{i+τ} − Ō) ] / [ (1/N) Σ_{i=1}^{N} (O_i − Ō)² ],   (17)

also converges to ρ(τ) in the limit N → ∞.
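The finite-sample estimator of Eq. (16) amounts to counting rejection runs. A sketch, applied to a synthetic rejection history with independent rejections at rate 0.3 (a stand-in for a real chain, chosen because then ρ(τ) = 0.3^τ exactly):

```python
import numpy as np

# Estimate rho(tau) from a 0/1 rejection history (Eq. 16): the empirical
# frequency of tau consecutive rejections.
rng = np.random.default_rng(3)
rej = (rng.random(50_000) < 0.3).astype(int)   # i.i.d. rejections, rate 0.3

def rho_hat(rej, tau):
    if tau == 0:
        return 1.0
    runs = np.ones(len(rej) - tau)
    for j in range(tau):
        runs *= rej[j : len(rej) - tau + j]   # runs[i] = rej[i]*...*rej[i+tau-1]
    return runs.mean()

# For independent accept/reject events, rho(tau) = (rejection rate)^tau.
for tau in range(4):
    assert abs(rho_hat(rej, tau) - 0.3**tau) < 0.02
```

For a real chain the rejection events need not be independent, which is why the product inside the window is averaged rather than factorized.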
The qualitative relation between acceptance rate and autocorrelations gives a convenient measure of the autocorrelation characteristics of a Markov chain. Precisely, the autocorrelation at distance τ can be bounded from below in terms of the average acceptance rate ⟨A⟩:

ρ(τ) ≥ max(0, 1 − τ⟨A⟩).   (18)
Increasing the acceptance rate of an independence Metropolis sampler is thus a necessary condition for reducing autocorrelations. Additionally, ⟨A⟩ = 1 exactly when the proposal and desired distributions are equal. In this case, ρ(τ) = 0 for each τ ≥ 1, and there are no autocorrelations. While bringing ⟨A⟩ close to 1 does not provide an upper bound on autocorrelations, stochastically improving a loss function that measures the distance between distributions is expected to reduce autocorrelations on average. In practice, autocorrelations should be evaluated as a test metric alongside the training loss to confirm improvement over the course of training the model. The correspondence between loss minimization, acceptance rate, and autocorrelations is studied in the context of φ⁴ theory in Section III, where a clear correlation between the acceptance rate and the integrated autocorrelation time is observed.
II.4 Critical slowing down
When a distribution is sampled using a Markov chain with a large autocorrelation time, many updates are required to produce decorrelated samples. Critical slowing down (CSD) is defined as the divergence of the autocorrelation time of Markov chain sampling as a critical point in parameter space is approached Wolff (1990). A numerically-stable definition of the characteristic autocorrelation time of a Markov chain is the integrated autocorrelation time:

τ_int = 1/2 + Σ_{τ=1}^{∞} ρ(τ)/ρ(0).   (19)
As a critical point is approached, analysis of standard local-update algorithms for lattice models suggests that τ_int is typically well-described by a power law in the lattice spacing or, for fixed physical volume, a power law in the number of lattice sites per dimension, τ_int ∝ L^z. A dynamical critical exponent z is thus defined by a fit to τ_int along a line of constant physics. An update algorithm for which the critical exponent is zero is unaffected by CSD.
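Extracting z from measured autocorrelation times reduces to a straight-line fit in log-log coordinates. The sketch below uses synthetic (L, τ_int) data generated with z = 2 plus small noise, purely to illustrate the fitting procedure (these are not measured values from the study):

```python
import numpy as np

# Fit tau_int ~ c * L^z: a power law is linear in log-log coordinates.
rng = np.random.default_rng(4)
L = np.array([6, 8, 10, 12, 14])
tau_int = 0.5 * L.astype(float) ** 2 * np.exp(0.02 * rng.standard_normal(5))

z, log_c = np.polyfit(np.log(L), np.log(tau_int), 1)   # slope = exponent z
assert abs(z - 2.0) < 0.2
```

In practice the fit would be repeated per observable and per algorithm, with uncertainties on τ_int propagated into the exponent.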
In any generative Metropolis-Hastings simulation, the autocorrelation time is completely fixed by the expected accept/reject statistics, which in turn result from the structure of the proposal and desired distributions. For models trained with a target value of the integrated autocorrelation time used as a stopping criterion, CSD associated with the Markov chain sampling is thus trivially removed at the expense of upfront training costs. The difficulty of CSD is in essence shifted to the training of the model, i.e., to the optimization of the proposal distribution.
In this study, the viability of using machine-learned models to produce p̃_f ≈ p is demonstrated for a simple system. It remains to be shown for theories of more physical interest that models can be trained to produce approximate distributions from which decorrelated samples can be efficiently generated. There are, however, reasons for optimism. Generative models, and in particular flow-based models, are rapidly evolving towards more efficient representation capacity. Complex coupling layers have been implemented Dinh et al. (2014, 2016), as have generalized convolutions Kingma and Dhariwal (2018); Hoogeboom et al. (2019) and transformations with continuous dynamics that are not dependent on restricted coupling layers Grathwohl et al. (2019). Moreover, the algorithm proposed here can be trained with constant memory cost Li and Grathwohl (2018), alleviating the storage limitations that can arise in gradient-based optimization. Large-scale parallelization of generative models can also be achieved with simultaneous inference and sampling Reed et al. (2017). Finally, models trained with respect to an action at a given set of parameter values can either be used to initialize training or as a prior distribution for models targeting that action at nearby parameter values, reducing the cost associated with parameter tuning.
II.5 Related machine learning approaches to MCMC
A number of other machine learning approaches to accelerate MCMC have been explored, primarily in the context of quantum manybody systems.
For example, self-learning Monte Carlo (SLMC) methods construct, by a variety of techniques, an effective Hamiltonian for a theory that can be more easily sampled than the original Hamiltonian Huang and Wang (2017); Liu et al. (2017, 2016); Nagai et al. (2017); Shen et al. (2018). The effective Hamiltonian is learned using supervised learning techniques based on training data drawn from a combination of existing MCMC simulations, randomly-mutated samples, and the accelerated Markov chain itself (hence the term "self-learning"). The flow-based MCMC algorithm proposed here similarly learns an approximate distribution by self-learning. In contrast to SLMC methods, the model used here allows direct sampling, eliminating the need for intermediate MCMC steps to produce training data and enabling final sampling using an independence Metropolis sampler.
Normalizing flows have also been used to sample from Boltzmann distributions in condensed matter systems. In Ref. Li and Wang (2018), a renormalization group architecture was employed to specify an expressive change-of-variables that exploited translational invariance and a hierarchy of scales. The normalizing flow model was trained using the shifted KL divergence also used here. In that work, accelerated MCMC sampling was achieved using HMC in the space of variables drawn from the prior distribution, in contrast to the independence Metropolis sampler proposed here. The renormalization group architecture for normalizing flows is an intriguing possibility for future applications of flow-based MCMC.
Finally, several machine learning generalizations of HMC have been developed. In Ref. Song et al. (2017), volume-preserving flows were learned in an augmented space that introduced auxiliary variables analogous to HMC momenta. Likelihood-free adversarial training Goodfellow et al. (2014) was employed to optimize these flows. Acquiring approximately independent training samples, however, requires running the Markov chain itself for many steps. Moreover, the volume-preserving constraint on the normalizing flow results in less expressive power in comparison to generic normalizing flows Dinh et al. (2014). By careful construction, non-volume-preserving flows have also been applied to construct Markov chains in spaces augmented with auxiliary variables Levy et al. (2017).
III Application of flow-based MCMC to φ⁴ theory
The φ⁴ theory of a massive scalar field with a quartic self-interaction is one of the simplest interacting field theories that can be constructed. It is thus a convenient testing ground for new algorithms for lattice field theory, such as the flow-based MCMC approach proposed here.
In a d-dimensional Euclidean spacetime, a discretized formulation of φ⁴ theory can be defined on a lattice with sites x, where a denotes the lattice spacing, μ ∈ {1, …, d} labels the spacetime dimensions, and x_μ/a ∈ {0, 1, …, L − 1}. Here, a finite lattice volume V = L^d is considered, with periodic boundary conditions in all dimensions. The lattice action (in lattice units, where a = 1) can be expressed as

S(φ) = Σ_x [ φ(x) □φ(x) + m² φ(x)² + λ φ(x)⁴ ],   (20)

where the parameters m² and λ are the bare mass squared and bare coupling, respectively, and the lattice d'Alembert operator is defined by

□φ(x) = Σ_{μ=1}^{d} [ 2φ(x) − φ(x + μ̂) − φ(x − μ̂) ].   (21)
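The action of Eqs. (20)-(21) is straightforward to evaluate with periodic boundary conditions. The sketch below (with illustrative coupling values, not those of Table 1) also checks the summation-by-parts identity Σ_x φ(x)□φ(x) = Σ_{x,μ} (φ(x + μ̂) − φ(x))², which holds on a periodic lattice:

```python
import numpy as np

# Lattice phi^4 action (Eqs. 20-21) on a periodic lattice, vectorized with
# np.roll; m2 and lam below are arbitrary illustrative values.
def action(phi, m2, lam):
    # phi(x) * Box phi(x), with Box phi = sum_mu [2 phi - phi(x+mu) - phi(x-mu)]
    kinetic = sum(
        phi * (2 * phi - np.roll(phi, -1, axis=mu) - np.roll(phi, 1, axis=mu))
        for mu in range(phi.ndim)
    )
    return np.sum(kinetic + m2 * phi**2 + lam * phi**4)

rng = np.random.default_rng(5)
phi = rng.standard_normal((8, 8))
S = action(phi, m2=-4.0, lam=6.0)

# Summation by parts: the kinetic term equals sum_{x,mu} (phi(x+mu)-phi(x))^2.
grad_sq = sum(np.sum((np.roll(phi, -1, axis=mu) - phi) ** 2) for mu in range(2))
assert np.isclose(S, grad_sq + np.sum(-4.0 * phi**2 + 6.0 * phi**4))
```

Written this way, each flow-based MCMC step needs only one such action evaluation per proposal, in line with the cost estimate given in Section I.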
By taking expectation values over the distribution p(φ) = e^{−S(φ)}/Z, observables in the theory can be estimated.
[Table 1: the parameter sets E1–E5, listing the lattice size L and the bare couplings m² and λ for each ensemble studied.]

The observables studied here are the connected two-point Green's function

G_c(x) = (1/V) Σ_y [ ⟨φ(y) φ(y + x)⟩ − ⟨φ(y)⟩ ⟨φ(y + x)⟩ ]   (22)

and its momentum-space representation

G̃_c(p) = Σ_x e^{−i p·x} G_c(x),   (23)

where the momenta take the discrete values p_μ = 2π n_μ / L with integer n_μ, as well as the corresponding pole mass

m_p = −lim_{t→∞} log [ G̃_c(p⃗ = 0, t + 1) / G̃_c(p⃗ = 0, t) ],   (24)

where G̃_c(p⃗, t) denotes the Fourier transform of G_c over the spatial directions only, evaluated at temporal separation t, and the two-point susceptibility

χ₂ = Σ_x G_c(x).   (25)
In the limit λ → ∞, with m²/λ held fixed, scalar φ⁴ theory reduces to an Ising model. Another observable of interest is therefore the average Ising energy density Vierhaus (2010), defined by

E = (1/d) Σ_{μ̂} G_c(μ̂),   (26)

where the sum runs over the single-site displacements μ̂ in all dimensions.
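Estimators for G_c, χ₂, and E can be implemented with fast Fourier transforms via the convolution theorem. The sketch below exercises them on a stand-in ensemble of uncorrelated unit-variance Gaussian configurations (not field-theory data), for which G_c(0) = 1 and G_c(x ≠ 0) = 0, so the estimators can be checked:

```python
import numpy as np

# FFT-based estimators for the connected two-point function (Eq. 22), the
# susceptibility (Eq. 25), and the Ising energy density (Eq. 26) in d = 2.
rng = np.random.default_rng(6)
ens = rng.standard_normal((4000, 8, 8))   # (N samples, L, L) configurations
V = ens[0].size

# <phi(y) phi(y+x)> averaged over y via the convolution theorem:
F = np.fft.fft2(ens)                                       # FFT over last two axes
full = np.fft.ifft2(np.conj(F) * F).real.mean(axis=0) / V
# disconnected piece from the ensemble-mean field:
M = np.fft.fft2(ens.mean(axis=0))
disc = np.fft.ifft2(np.conj(M) * M).real / V
Gc = full - disc                                           # Eq. (22)

chi2 = Gc.sum()                                            # Eq. (25)
E = 0.5 * (Gc[1, 0] + Gc[0, 1])                            # Eq. (26), d = 2

assert abs(Gc[0, 0] - 1.0) < 0.05   # unit on-site variance
assert abs(chi2 - 1.0) < 0.1        # no spatial correlations in the stand-in
assert abs(E) < 0.05                # nearest-neighbor correlator vanishes
```

On real ensembles the same estimators would be applied per jackknife or bootstrap block to obtain the uncertainties compared in Section III.2.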
The action of φ⁴ theory is invariant under the discrete symmetry φ → −φ. Depending on the values of the parameters m² and λ, this symmetry can be spontaneously broken. The theory thus has two phases: a symmetric phase and a broken-symmetry phase.
III.1 Model definition and training
For this proof-of-principle study, the flow-based MCMC algorithm detailed in Section II was applied to φ⁴ theory in two dimensions with L lattice sites in each dimension. The parameters m² and λ were chosen to fix the physical volume, in units of the pole mass, for each lattice size; their numerical values are given in Table 1. For simplicity in this initial work, all parameters were chosen to lie in the symmetric phase. In principle, the flow-based MCMC algorithm can be applied with identical methods to the broken-symmetry phase of the theory, but it remains to be shown that models can be trained for such choices of parameters.
For each set of parameters, real NVP models were defined using 8–12 affine coupling layers. The coupling layers were defined to update half of the lattice sites in a checkerboard pattern; successive layers alternately updated the odd and even sites. The neural networks s_i and t_i used in coupling layer g_i (see Eq. (9)) were constructed from two to six fully-connected layers of 100–1024 hidden units with leaky rectified linear units Nair and Hinton (2010) as nonlinear activation functions between each stage. The prior distribution r(z) was chosen to be an uncorrelated Gaussian distribution in which the value at each lattice site was independently sampled from a Gaussian with mean 0 and standard deviation 1. The models were trained to minimize the shifted KL loss between the output distribution p̃_f and the desired distribution p using gradient-based updates with the Adam optimizer Kingma and Ba (2015). A mean absolute error loss, defined in Appendix B, was optimized before training in one case, where it was found to accelerate convergence to the KL loss minimum.
An exhaustive study of the optimal choice of prior distribution r(z), model depth, architecture and initialization of the neural networks, and of the mode of coupling of the affine layers, is beyond the scope of this proof-of-principle study. The parameters used here, however, proved to define sufficiently expressive models such that the Metropolis-Hastings algorithm applied to output from the trained models easily achieved acceptance rates of well over 50%. With further investment in hyperparameter optimization, higher rates of acceptance could be achieved. In any Markov chain using the Metropolis-Hastings algorithm, there is a trade-off between computational cost and correlations resulting from low acceptance rates. The optimal acceptance rate minimizes the cost per decorrelated sample from the chain. Here, the cost of training, and not just model evaluation, must be considered, and the optimal level of training in future applications will depend on many factors, such as the desired ensemble size.
For each set of parameters studied, instances of the model were trained to reach both 50% and 70% average Metropolis-Hastings acceptance. Figure 2 shows histograms of the number of updates between accepted configurations for models at both levels of training. Models trained to reach the higher acceptance rate are seen to have shorter runs of consecutive rejections. Because the autocorrelation ρ(τ) is given by the probability of τ consecutive rejections for independence Metropolis sampling, a reduced frequency of rejection runs with length longer than τ directly implies a reduction in ρ(τ). Implications for critical slowing down of the generation of decorrelated configurations are discussed in Section III.3.
For comparison, ensembles of 100,000 lattice configurations were generated using the machine-learned models in flow-based MCMC as well as standard local Metropolis Wolff (1996) and Hybrid Monte Carlo (HMC) Duane et al. (1987) algorithms at matched parameters. The local Metropolis algorithm employed a fixed order of sequential updates to each site, with proposed updates to φ(x) sampled uniformly from an interval centered on the current value, followed by a Metropolis-Hastings accept/reject step; for all parameters considered, the interval width was tuned to achieve a 70% acceptance rate. The HMC method was implemented using a leapfrog integrator with a fixed division of the trajectory length into steps; the trajectory length was also tuned to achieve a 70% acceptance rate. Figure 3 shows pictorial representations of configurations randomly chosen from the ensembles produced using each method. The flow-generated configurations cannot be distinguished by eye from those generated using local Metropolis and HMC; the generated samples are varied and display correlations on similar length scales as those produced by the standard MCMC techniques. A quantitative comparison of the ensembles generated by the different algorithms is made in the following section.
III.2 Tests: physical observables and error scaling
Since the flow-based MCMC algorithm satisfies ergodicity and balance, it is guaranteed to produce samples from the desired probability distribution in the limit of an infinite chain. To test the performance of the algorithm for a finite number of samples, each of the physical observables defined above was computed on ensembles of configurations at the parameters of Table 1, generated both using standard HMC and local Metropolis methods, as well as with the trained flow-based MCMC algorithm. Figures 5–7 compare the observables computed on ensembles generated using all three methods.
To estimate the pole mass m_p, an effective mass is defined based on the zero-momentum Green's functions at various time separations:

m_eff(t) = arccosh[ ( G̃_c(p⃗ = 0, t − 1) + G̃_c(p⃗ = 0, t + 1) ) / ( 2 G̃_c(p⃗ = 0, t) ) ].   (27)
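Equation (27) can be checked on a synthetic zero-momentum correlator of the form G̃_c ∝ cosh(m_p(t − T/2)), the shape appropriate for a periodic time extent T, for which m_eff(t) recovers m_p exactly:

```python
import numpy as np

# Effective mass (Eq. 27) applied to a synthetic cosh-shaped correlator
# with illustrative values T = 16 and m = 0.35.
T, m = 16, 0.35
t = np.arange(T)
G = np.cosh(m * (t - T / 2))

# cosh identity: G(t-1) + G(t+1) = 2 cosh(m) G(t), so m_eff(t) = m exactly.
m_eff = np.arccosh((G[:-2] + G[2:]) / (2 * G[1:-1]))   # defined for t = 1..T-2
assert np.allclose(m_eff, m)
```

On Monte Carlo data the plateau of m_eff(t) at intermediate t, rather than a single separation, would be used to quote m_p.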
For all observables, the values computed using the flow-based MCMC ensembles are consistent within statistical uncertainties with those computed using the standard methods. Moreover, Figure 7 shows that the statistical uncertainties of the observables scale as 1/√N with the number of samples N, as expected for decorrelated samples.
III.3 Critical slowing down
For φ⁴ theory, a number of algorithms have been developed that mitigate CSD to various extents, such as worm algorithms Vierhaus (2010), multigrid methods Goodman and Sokal (1989), Fourier-accelerated Langevin updates Batrouni and Svetitsky (1987), and cluster updates via embedded Ising dynamics Brower and Tamayo (1989). The path towards generalizing those algorithms to more complicated theories such as QCD, however, is not clear. Algorithms such as HMC and local Metropolis, which are also used for studies of QCD and pure gauge theory, exhibit CSD for φ⁴ theory (as well as for more complicated theories) as the continuum limit is approached.
The parameter sets chosen for the study of φ⁴ theory in this work (Table 1) correspond to a line of constant physics approaching the critical point as L → ∞. For the flow-based MCMC approach proposed here, as well as for ensembles generated using the HMC and local Metropolis algorithms, the autocorrelation times of the set of physical observables discussed previously were fit to leading-order power laws in L to determine the dynamical critical exponent z for each observable. Figure 8 shows the autocorrelation times for each observable for each approach to ensemble generation. The absolute values of τ_int are not directly comparable between methods because the cost per update differs. The scaling with lattice size, on the other hand, indicates the sensitivity of each method to critical slowing down. For both HMC and local Metropolis, the critical behavior and consequently the performance of the algorithm was found to depend on the observable. In each case, the critical exponent z was significantly greater than zero. In comparison, for the flow-based MCMC ensembles at fixed acceptance, the critical exponent was found to be consistent with zero, with the autocorrelation time observable-independent and in agreement with the acceptance-based estimator defined in Section II.3.
Since the mean acceptance rate was used as the stopping criterion for training these models, it was not guaranteed a priori that the measured integrated autocorrelation time would be constant across the different models used. The results in Figure 8, however, suggest that beyond the simple lower bound from Eq. (18) there is a strong correlation between the mean acceptance rate and the integrated autocorrelation time for models trained using a shifted KL loss. This is further confirmed by the similarity of the rejection-run histograms across lattice sizes for flow-based MCMC, as shown in Figure 2.
As discussed in Section II.4, while CSD in the sampling step for the flow-based MCMC is eliminated, the cost is potentially transferred to the training of the flow-based generative model. For the models produced here, achieving the target acceptance rate on larger lattice volumes required more training epochs and more care in hyperparameter choices than for smaller volumes, although standard training techniques were sufficient in all cases.
IV Summary
This work defines a flow-based MCMC algorithm to sample lattice field configurations from a desired probability distribution:

A real NVP flow model is trained to produce samples approximately following the desired distribution;

Samples are proposed from the trained model;

Starting from an arbitrary configuration, each proposal is accepted or rejected to advance a Markov chain using the Metropolis-Hastings algorithm.
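The three steps above can be sketched as an independence Metropolis-Hastings sampler. This is a minimal illustration under stated assumptions: the `propose` interface, returning a sample together with its model log-density and the unnormalized target log-density $-S$, is invented here for exposition.

```python
import numpy as np

def flow_mcmc(propose, n_steps, rng):
    """Independence Metropolis-Hastings chain driven by a trained flow.

    `propose()` (an assumed interface for illustration) returns a candidate
    configuration, its model log-density log q, and the target log-density
    log p = -S. Normalization constants cancel in the acceptance ratio,
    so both densities may be unnormalized.
    """
    phi, log_q, log_p = propose()  # initialize from an arbitrary draw
    chain = []
    for _ in range(n_steps):
        phi_new, log_q_new, log_p_new = propose()
        # Acceptance probability for an independence sampler:
        # A = min(1, p(phi') q(phi) / (p(phi) q(phi')))
        log_A = (log_p_new - log_q_new) - (log_p - log_q)
        if np.log(rng.uniform()) < log_A:
            phi, log_q, log_p = phi_new, log_q_new, log_p_new
        chain.append(phi)  # on rejection, the previous configuration repeats
    return chain
```

When the model matches the target exactly ($q = p$), every proposal is accepted and successive samples are independent.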
The approach is shown to define an ergodic and balanced Markov chain, thus guaranteeing convergence to the desired probability distribution in the limit of a long Markov chain. In essence, the flow-based MCMC algorithm combines the expressiveness of normalizing flows based on neural networks with the theoretical guarantees of Markov chains to create a trainable and asymptotically correct sampler. Since these flows are applicable to arbitrary configurations with continuous, real-valued degrees of freedom, the method applies generically to a broad class of lattice theories. Here, the algorithm is implemented in practice for $\phi^4$ theory and is demonstrated to produce ensembles of configurations that are indistinguishable from those generated using standard local Metropolis and HMC algorithms, based on studies of a number of physical observables.
A key feature of the approach is that models trained to a fixed acceptance rate do not experience critical slowing down in the sampling stage. In particular, the autocorrelation time for all observables is dictated entirely by the accuracy with which the flow model has been trained; perfect training corresponds to decorrelated samples and 100% acceptance in the Metropolis-Hastings step of the MCMC process. Nevertheless, the efficiency with which the training step of this approach can be scaled to larger model sizes, and to more complicated theories such as QCD, remains to be studied. Recent advances in the training and scaling of flow models, in particular demonstrations of constant-memory-cost training Li and Grathwohl (2018), provide reasons for optimism on this front. Further, incorporating symmetries generally improves the data efficiency of training, and implementing spacetime and gauge symmetries Cohen et al. (2019) may be a natural next step towards practically training these flow models for lattice gauge theories like QCD.
In moving towards lattice gauge theories such as QCD, several theoretical developments are also required. The real NVP model chosen to parameterize the normalizing flows here is described in terms of vectors of real-valued variables. Gauge configurations, however, live on a compact manifold arising from the Lie group structure. Extending this method will require a normalizing flow model that can act on this manifold while remaining sufficiently expressive. The choice of prior will likewise need to be extended to a distribution over the manifold of lattice gauge configurations that can be easily sampled. A uniform distribution, for example, may be a candidate for a prior, but this choice must be tested in the context of a specific flow model.
If the flow-based MCMC algorithm proposed here can be implemented for a complex theory such as QCD, the advantages would be significant: arbitrarily large ensembles of field configurations could be generated at minimal cost. The independence of each proposal from the previous configuration allows proposals to be generated in parallel, and the continually improving support in hardware and software for neural-network execution suggests future practical gains for this style of ensemble generation. Given efficient sample generation from a trained model, ensembles would not need to be stored long-term. Moreover, a model trained for one action could either be retrained for, or used as a prior for another flow model targeting, an action with nearby parameter values. This would allow efficient tuning of parameters and the generation of additional ensembles interpolating between and extrapolating from existing models.
Acknowledgements
We thank J.-W. Chen, K. Cranmer, W. Detmold, R. Melko, D. Murphy, A. Pochinsky, and B. Trippe for helpful discussions. This work is supported in part by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under grant Contract Number DE-SC0011090. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, under the ALCF Aurora Early Science Program. Some work was undertaken at the Kavli Institute for Theoretical Physics, supported by the National Science Foundation under Grant No. NSF PHY-1748958. PES is supported by the National Science Foundation under CAREER Award 1841699, GK is supported by the U.S. Department of Energy under the SciDAC4 award DE-SC0018121, and PES and MSA are supported in part by NSERC and the Perimeter Institute for Theoretical Physics. Research at the Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Research and Innovation.
Appendix A Acceptance rate estimator for autocorrelation
Here it is shown that the standard estimator for the autocorrelation of an observable converges, in the limit of infinite chain length, to the acceptance-based estimator, as claimed in Section II.3. The standard estimator is defined by:
(28)  
(29)  
(30) 
where the final equality holds assuming the Markov chain is initialized with a sample from the stationary distribution (i.e., it is assumed that enough prior iterations were discarded that the chain is thermalized). The expectation value can then be split by cases and, conditioning on the fixed accept/reject pattern, the expectation values can be computed by identifying the distributions of the two observables involved:
(31)  
(32)  
(33)  
(34)  
(35) 
The limit of infinite chain length is used to drop biases arising from conditioning on behavior within a finite region of the chain.
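As a concrete illustration of the standard estimator discussed in this appendix, the following sketch computes the normalized autocorrelation function and a fixed-window integrated autocorrelation time. The fixed window is a simplification for exposition; in practice an automatic windowing procedure such as that of Wolff (2004) would be used.

```python
import numpy as np

def autocorrelation(obs, max_lag):
    """Standard estimator of the normalized autocorrelation rho(t)
    of a scalar observable measured along a Markov chain."""
    obs = np.asarray(obs, dtype=float)
    mean, var, n = obs.mean(), obs.var(), len(obs)
    rho = np.empty(max_lag + 1)
    for t in range(max_lag + 1):
        # Average of (O_i - mean)(O_{i+t} - mean) over all pairs at lag t.
        rho[t] = np.mean((obs[: n - t] - mean) * (obs[t:] - mean)) / var
    return rho

def integrated_autocorrelation_time(obs, max_lag):
    """Fixed-window estimate tau_int = 1/2 + sum_{t=1}^{max_lag} rho(t);
    for independent samples tau_int -> 1/2."""
    return 0.5 + autocorrelation(obs, max_lag)[1:].sum()
```

For uncorrelated samples the estimate fluctuates around 1/2, while a chain with repeated (rejected) configurations yields a larger value.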
Appendix B Mean absolute error loss
The mean absolute error (MAE) loss, optimized before training some models, is defined by:
$\mathcal{L}_{\text{MAE}}(\tilde{p}_f) = \mathbb{E}_{\phi \sim \tilde{p}_f}\big[\, \big| \log \tilde{p}_f(\phi) - \log p(\phi) \big| \,\big]$ (36)
It is bounded below by the KL divergence and has global minima exactly where the KL loss does, i.e., when $\tilde{p}_f = p$.
In practice, the loss is stochastically estimated by drawing batches of samples from the model and computing the sample mean:
$\mathcal{L}_{\text{MAE}} \approx \frac{1}{N} \sum_{i=1}^{N} \big| \log \tilde{p}_f(\phi_i) - \log p(\phi_i) \big|, \qquad \phi_i \sim \tilde{p}_f$ (37)
To employ this loss, the partition function must either be estimated ahead of time or initialized as a trainable parameter. In this study, a multistage method Valleau and Card (1972) was used to estimate and fix the partition function value used while optimizing this loss.
This loss is appealing due to the point-by-point potential driving the model distribution towards the correct one. Any errors in the estimate of the partition function, however, result in a minimum at which the model distribution does not necessarily agree with the desired distribution. This loss was therefore only used prior to training with the shifted KL loss.
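A stochastic estimate of such a loss can be sketched as follows, assuming it is the batch mean of $|\log \tilde{p}_f(\phi) - \log p(\phi)|$ with $\log p(\phi) = -S(\phi) - \log Z$. The function names and interface are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def mae_loss(log_model_density, action, log_Z, samples):
    """Batch estimate of mean_i |log p_model(phi_i) - log p(phi_i)|,
    where log p(phi) = -action(phi) - log_Z and `samples` are assumed
    to be drawn from the model distribution."""
    log_q = np.array([log_model_density(phi) for phi in samples])
    log_p = np.array([-action(phi) for phi in samples]) - log_Z
    return float(np.mean(np.abs(log_q - log_p)))
```

Note that a constant offset in `log_Z` shifts every term of the loss by the same amount, which illustrates why an error in the partition-function estimate moves the minimum away from the desired distribution.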
References
 Wolff (1990) U. Wolff, Nucl. Phys. Proc. Suppl. 17, 93 (1990).
 Del Debbio et al. (2004) L. Del Debbio, G. M. Manca, and E. Vicari, Phys. Lett. B 594, 315 (2004), eprint hep-lat/0403001.
 Meyer et al. (2007) H. B. Meyer, H. Simma, R. Sommer, M. Della Morte, O. Witzel, and U. Wolff, Comput. Phys. Commun. 176, 91 (2007), eprint hep-lat/0606004.
 Kandel et al. (1988) D. Kandel, E. Domany, D. Ron, A. Brandt, and E. Loh, Phys. Rev. Lett. 60, 1591 (1988).
 Bonati and D’Elia (2018) C. Bonati and M. D’Elia, Phys. Rev. E98, 013308 (2018), eprint 1709.10034.
 Hasenbusch and Schaefer (2018) M. Hasenbusch and S. Schaefer, Phys. Rev. D 98, 054502 (2018), URL https://link.aps.org/doi/10.1103/PhysRevD.98.054502.
 Swendsen and Wang (1987) R. H. Swendsen and J.-S. Wang, Phys. Rev. Lett. 58, 86 (1987), URL https://link.aps.org/doi/10.1103/PhysRevLett.58.86.
 Wolff (1989) U. Wolff, Phys. Rev. Lett. 62, 361 (1989), URL https://link.aps.org/doi/10.1103/PhysRevLett.62.361.
 Prokof’ev and Svistunov (2001) N. Prokof’ev and B. Svistunov, Phys. Rev. Lett. 87, 160601 (2001), URL https://link.aps.org/doi/10.1103/PhysRevLett.87.160601.
 Kawashima and Harada (2004) N. Kawashima and K. Harada, Journal of the Physical Society of Japan 73, 1379 (2004), eprint cond-mat/0312675.
 Bietenholz et al. (1995) W. Bietenholz, A. Pochinsky, and U. J. Wiese, Phys. Rev. Lett. 75, 4524 (1995), URL https://link.aps.org/doi/10.1103/PhysRevLett.75.4524.
 Ramos (2012) A. Ramos, PoS LATTICE2012, 193 (2012), eprint 1212.3800.
 Gambhir and Orginos (2015) A. S. Gambhir and K. Orginos, PoS LATTICE2014, 043 (2015), eprint 1506.06118.
 Cossu et al. (2018) G. Cossu, P. Boyle, N. Christ, C. Jung, A. Jüttner, and F. Sanfilippo, EPJ Web Conf. 175, 02008 (2018), eprint 1710.07036.
 Jin and Osborn (2019) X.-Y. Jin and J. C. Osborn, in Proceedings of Science (2019), eprint 1904.10039.
 Endres et al. (2015) M. G. Endres, R. C. Brower, W. Detmold, K. Orginos, and A. V. Pochinsky, Phys. Rev. D92, 114516 (2015), eprint 1510.04675.
 Detmold and Endres (2016) W. Detmold and M. G. Endres, Phys. Rev. D94, 114502 (2016), eprint 1605.09650.
 Detmold and Endres (2018) W. Detmold and M. G. Endres, Phys. Rev. D97, 074507 (2018), eprint 1801.06132.
 Luscher and Schaefer (2013) M. Luscher and S. Schaefer, Comput. Phys. Commun. 184, 519 (2013), eprint 1206.2809.
 Mages et al. (2017) S. Mages, B. C. Toth, S. Borsanyi, Z. Fodor, S. D. Katz, and K. K. Szabo, Phys. Rev. D95, 094512 (2017), eprint 1512.06804.
 Burnier et al. (2018) Y. Burnier, A. Florio, O. Kaczmarek, and L. Mazur, EPJ Web Conf. 175, 07004 (2018), eprint 1710.06472.
 Laio et al. (2016) A. Laio, G. Martinelli, and F. Sanfilippo, JHEP 07, 089 (2016), eprint 1508.07270.
 Tanaka and Tomiya (2017) A. Tanaka and A. Tomiya (2017), eprint 1712.03893.
 Shanahan et al. (2018) P. E. Shanahan, D. Trewartha, and W. Detmold, Phys. Rev. D97, 094506 (2018), eprint 1801.05784.
 Tierney (1994) L. Tierney, Ann. Statist. 22, 1701 (1994), URL https://doi.org/10.1214/aos/1176325750.
 Hastings (1970) W. K. Hastings, Biometrika 57, 97 (1970), ISSN 00063444, URL http://www.jstor.org/stable/2334940.
 Rezende and Mohamed (2015) D. J. Rezende and S. Mohamed, arXiv preprint arXiv:1505.05770 (2015).
 Dinh et al. (2016) L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016), URL http://arxiv.org/abs/1605.08803.
 Oord et al. (2017) A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. (2017), URL http://arxiv.org/abs/1711.10433.
 Li and Wang (2018) S. H. Li and L. Wang, Physical Review Letters 121, 1 (2018), ISSN 10797114.
 Zhang et al. (2018) L. Zhang, W. E, and L. Wang, CoRR abs/1809.10188 (2018), eprint 1809.10188, URL http://arxiv.org/abs/1809.10188.
 Kingma and Ba (2015) D. P. Kingma and J. L. Ba, in International Conference on Learning Representations (ICLR) (2015).
 Nesterov (1983) Y. E. Nesterov, Dokl. Akad. Nauk SSSR 269, 543 (1983), URL https://ci.nii.ac.jp/naid/10029946121/en/.
 Wolff (2004) U. Wolff (ALPHA), Comput. Phys. Commun. 156, 143 (2004), [Erratum: Comput. Phys. Commun. 176, 383 (2007)], eprint hep-lat/0306017.
 Dinh et al. (2014) L. Dinh, D. Krueger, and Y. Bengio, arXiv e-prints arXiv:1410.8516 (2014), URL http://arxiv.org/abs/1410.8516.
 Kingma and Dhariwal (2018) D. P. Kingma and P. Dhariwal, in Neural Information Processing Systems (2018), pp. 1–15, ISBN 0769501850, ISSN 10059040, URL http://arxiv.org/abs/1807.03039.
 Hoogeboom et al. (2019) E. Hoogeboom, R. V. D. Berg, and M. Welling (2019), URL http://arxiv.org/abs/1901.11137.
 Grathwohl et al. (2019) W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud, in ICLR (2019), pp. 1–13, ISBN 9783901882760, ISSN 16879139, URL http://arxiv.org/abs/1810.01367.
 Li and Grathwohl (2018) X. Li and W. Grathwohl, pp. 1–7 (2018), URL http://bayesiandeeplearning.org/2018/papers/37.pdf.
 Reed et al. (2017) S. Reed, A. v. d. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, D. Belov, and N. de Freitas, in ICML (2017), URL http://arxiv.org/abs/1703.03664.
 Huang and Wang (2017) L. Huang and L. Wang, Physical Review B 95, 1 (2017), ISSN 24699969.
 Liu et al. (2017) J. Liu, Y. Qi, Z. Y. Meng, and L. Fu, Physical Review B 95 (2017), ISSN 24699969.
 Liu et al. (2016) J. Liu, H. Shen, Y. Qi, Z. Y. Meng, and L. Fu, arXiv eprints arXiv:1611.09364 (2016), eprint 1611.09364.
 Nagai et al. (2017) Y. Nagai, H. Shen, Y. Qi, J. Liu, and L. Fu, Physical Review B 96, 1 (2017), ISSN 24699969.
 Shen et al. (2018) H. Shen, J. Liu, and L. Fu, Physical Review B 97 (2018), ISSN 24699969.
 Song et al. (2017) J. Song, S. Zhao, and S. Ermon, in Proceedings of Neural Information Processing Systems 2017 (2017), URL http://arxiv.org/abs/1706.07561.
 Goodfellow et al. (2014) I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Neural Information Processing Systems (2014), eprint 1406.2661, URL http://arxiv.org/abs/1406.2661.
 Levy et al. (2017) D. Levy, M. D. Hoffman, and J. Sohl-Dickstein, arXiv e-prints arXiv:1711.09268 (2017), eprint 1711.09268.
 Vierhaus (2010) I. Vierhaus, Ph.D. thesis, Humboldt University of Berlin (2010).
 Nair and Hinton (2010) V. Nair and G. E. Hinton, in ICML (2010).
 Wolff (1996) U. Wolff, Numerical Simulation in Quantum Field Theory (Springer Berlin Heidelberg, Berlin, Heidelberg, 1996), pp. 245–257, ISBN 9783642852381, URL https://doi.org/10.1007/9783642852381_14.
 Duane et al. (1987) S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth, Physics Letters B 195, 216 (1987), ISSN 03702693, URL http://www.sciencedirect.com/science/article/pii/037026938791197X.
 Goodman and Sokal (1989) J. Goodman and A. D. Sokal, Phys. Rev. D 40, 2035 (1989), URL https://link.aps.org/doi/10.1103/PhysRevD.40.2035.
 Batrouni and Svetitsky (1987) G. G. Batrouni and B. Svetitsky, Phys. Rev. B 36, 5647 (1987), URL https://link.aps.org/doi/10.1103/PhysRevB.36.5647.
 Brower and Tamayo (1989) R. C. Brower and P. Tamayo, Phys. Rev. Lett. 62, 1087 (1989), URL https://link.aps.org/doi/10.1103/PhysRevLett.62.1087.
 Cohen et al. (2019) T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling, arXiv eprints arXiv:1902.04615 (2019), eprint 1902.04615.
 Valleau and Card (1972) J. P. Valleau and D. N. Card, The Journal of Chemical Physics 57, 5457 (1972), eprint https://doi.org/10.1063/1.1678245, URL https://doi.org/10.1063/1.1678245.