Evolutionary reinforcement learning of dynamical large deviations
We show how to calculate dynamical large deviations using evolutionary reinforcement learning. An agent, a stochastic model, propagates a continuous-time Monte Carlo trajectory and receives a reward conditioned upon the values of certain path-extensive quantities. Evolution produces progressively fitter agents, eventually allowing the calculation of a piece of a large-deviation rate function for a particular model and path-extensive quantity. For models with small state spaces the evolutionary process acts directly on rates, and for models with large state spaces the process acts on the weights of a neural network that parameterizes the model’s rates. This approach shows how path-extensive physics problems can be considered within a framework widely used in machine learning.
Introduction— Machine learning provides the physics community with methods that complement the traditional ones of physical insight and manipulation of equations. Many-parameter ansätze, sometimes encoded in the form of neural networks, can learn connections between physical properties (such as the positions of atoms and a system’s internal energy) and detect phase transitions, without drawing upon an underlying physical model Behler and Parrinello (2007); Mills et al. (2017); Ferguson and Hachmann (2018); Artrith et al. (2018); Singraber et al. (2018); Desgranges and Delhommelle (2018); Thurston and Ferguson (2018); Singraber et al. (2019); Han et al. (2016); Schütt et al. (2017); Yao et al. (2018); Schütt et al. (2018); Carrasquilla and Melko (2017); Portman and Tamblyn (2017); Rem et al. (2019). Reinforcement learning is a branch of machine learning concerned with performing actions so as to maximize a numerical reward Sutton and Barto (2018). It has a close connection to ideas of stochastic control enacted by variational or adaptive algorithms Ahamed et al. (2006); Basu et al. (2008); Borkar (2010); Borkar et al. (2003); Chetrite and Touchette (2015); Kappen and Ruiz (2016); Nemoto et al. (2017); Ferré and Touchette (2018); Bojesen (2018). A recent success of reinforcement learning is the playing of computer games Watkins and Dayan (1992); Mnih et al. (2013, 2015); Bellemare et al. (2013); Mnih et al. (2016); Tassa et al. (2018); Todorov et al. (2012); Puterman (2014); Asperti et al. (2018); Riedmiller (2005); Riedmiller et al. (2009); Schulman et al. (2017); Such et al. (2017); Brockman et al. (2016); Kempka et al. (2016); Wydmuch et al. (2018); Silver et al. (2016, 2017). Here we show that reinforcement learning can also be used to propagate trajectories of a stochastic dynamics conditioned upon potentially rare values of a path-extensive observable. 
Doing so allows the calculation of dynamical large deviations, which are of fundamental importance, being to dynamical quantities what free energies are to static ones Touchette (2009); Garrahan et al. (2009); Den Hollander (2008); Ellis (2007).
Calculating large deviations is a challenging problem for which specialized numerical methods are required Giardina et al. (2006); Ahamed et al. (2006); Basu et al. (2008); Borkar (2010); Borkar et al. (2003); Touchette (2009); Garrahan et al. (2009); Chetrite and Touchette (2015); Ray et al. (2018); Nemoto et al. (2017); Ferré and Touchette (2018); Bañuls and Garrahan (2019). The reinforcement learning procedure we describe is conceptually and technically simple, and does not require insight into the models under study or access to the formal results of large-deviation theory. It therefore offers an alternative to existing approaches, and provides an example of one of the potentially large number of applications of reinforcement learning in physics. Our method focuses upon the likelihood ratio of a trajectory generated by two stochastic models, the second model being an ansatz for the behavior of the first model conditioned upon a particular value of a time-extensive observable. In previous work we showed that simple, physically motivated choices for the second model can be used to approximate and (in certain cases) calculate the likelihood of rare events in the first Jacobson and Whitelam (2019); here we show that multi-parameter ansätze, in some cases encoded by a neural network, are in this respect more accurate than their few-parameter counterparts. Furthermore, using only a few tens of parameters we can closely approximate the likelihood of rare events whose quantification requires state-of-the-art methods, showing the present approach to have the potential to address cutting-edge problems.
Large deviations by change of dynamics— To set the large-deviations problem in a form amenable to reinforcement learning, consider a continuous-time Monte Carlo dynamics on a set of discrete states, with $W_{xy}$ the rate for passing between states $x$ and $y$, and $R_x = \sum_{y \neq x} W_{xy}$ the escape rate from $x$ Gillespie (1977). This dynamics generates a trajectory consisting of jumps $x_0 \to x_1 \to \cdots \to x_K$ and associated jump times. In the language of reinforcement learning, the set of rates $W_{xy}$ is a policy (often denoted $\pi$) that stochastically selects a new state and a jump time given a current state.
Stochastic trajectories $\omega$ can be characterized by path-extensive observables

$a(\omega) = \frac{1}{T} \sum_{k=0}^{K-1} \alpha_{x_k x_{k+1}}.$ (1)

Here $\alpha_{xy}$ is the change of the observable upon moving between $x$ and $y$, and $T$ is the elapsed time of the trajectory. This type of observable describes many physically important quantities, including work, entropy production, and non-decreasing counting observables Seifert (2005); Speck et al. (2012); Lecomte et al. (2010); Garrahan et al. (2009); Fodor et al. (2015). Let the typical value of $a$ be $a_0$, the limiting value of (1) for a long trajectory of the model $W_{xy}$. Finite-time fluctuations $a \neq a_0$ occur with a probability controlled by the distribution $P_T(a)$, taken over all trajectories of length $T$. For large $T$ this distribution often adopts the large-deviation form Den Hollander (2008); Touchette (2009)

$P_T(a) \approx e^{-T J(a)}.$ (2)
Here $J(a)$ is the large-deviation rate function, which quantifies the likelihood of observing atypical values of $a$ Touchette (2009); Den Hollander (2008). Calculation of $J(a)$ far from $a_0$ using only the original model is not feasible, because such values of $a$ occur only rarely. Instead, we can consider a new stochastic model, which we call the reference model, whose purpose is to allow the calculation of $J(a)$ potentially far from $a_0$ Bucklew (2013); Chetrite and Touchette (2015).
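The dynamics and observable just described can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in this work: rates are stored as nested dictionaries `W[x][y]`, and the observable increment is a caller-supplied function `alpha(x, y)`.

```python
import math
import random

def run_trajectory(W, x0, n_events, alpha, rng=random.Random(0)):
    """Continuous-time Monte Carlo: from state x, wait an exponential
    time with rate R_x = sum_y W[x][y], then jump to y with probability
    W[x][y]/R_x, accumulating the observable increment alpha(x, y).
    Returns the time-intensive observable a = A/T and the elapsed time T."""
    x, A, T = x0, 0.0, 0.0
    for _ in range(n_events):
        R = sum(W[x].values())                  # escape rate from x
        T += -math.log(1.0 - rng.random()) / R  # exponential waiting time
        r, acc = rng.random() * R, 0.0
        for y, w in W[x].items():               # tower sampling of the jump
            acc += w
            if r < acc:
                break
        A += alpha(x, y)                        # path-extensive increment
        x = y
    return A / T, T

# two-state example; counting every jump gives the dynamical activity,
# which approaches 2 jumps per mean cycle time of 1/1 + 1/0.5 = 3
W = {0: {1: 1.0}, 1: {0: 0.5}}
a, T = run_trajectory(W, 0, 50000, lambda x, y: 1.0)
```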
To do so we use a form of importance sampling Ahamed et al. (2006); Basu et al. (2008); Borkar (2010); Borkar et al. (2003); Den Hollander (2008); Glynn and Iglehart (1989); Sadowsky and Bucklew (1990); Bucklew et al. (1990); Bucklew (1990); Asmussen and Glynn (2007); Juneja and Shahabuddin (2006); Bucklew (2013); Touchette (2009); Jacobson and Whitelam (2019). Let the rates of the reference model be $\tilde{W}_{xy}$, with escape rates $\tilde{R}_x = \sum_{y \neq x} \tilde{W}_{xy}$, and let the limiting value of (1) for a long reference-model trajectory be $\tilde{a}$. Then an upper bound on $J(a)$ at $a = \tilde{a}$ is given by the value of

$J_0 = \frac{1}{T} \sum_{k=1}^{K} \left[ \ln \frac{\tilde{W}_{x_{k-1} x_k}}{W_{x_{k-1} x_k}} - \left( \tilde{R}_{x_{k-1}} - R_{x_{k-1}} \right) \Delta t_k \right]$ (3)

for a long reference-model trajectory, where $\Delta t_k = -\tilde{R}_{x_{k-1}}^{-1} \ln u$ is the $k$th jump time of the reference model and $u$ is a random number uniformly distributed on $(0,1]$. Eq. (3) follows from straightforward algebra (see Section S1). It can be motivated by noting that a jump $x \to y$ in time $\Delta t$ occurs in the reference model with probability $\propto \tilde{W}_{xy} e^{-\tilde{R}_x \Delta t}$, and in the original model with probability $\propto W_{xy} e^{-R_x \Delta t}$; Eq. (3) is the sum over a trajectory of the log-ratio of such terms.
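The quantity in Eq. (3) can be accumulated on the fly while propagating the reference model. The sketch below is illustrative (the rate-dictionary layout and function names are ours, not the paper's code); it uses a reference model obtained by doubling all rates of a two-state model, and returns both the reference model's typical observable and the resulting bound.

```python
import math
import random

def bound_from_reference(W, Wref, x0, n_events, alpha, rng=random.Random(1)):
    """Propagate the reference model Wref and accumulate the per-jump
    log-ratio of reference and original trajectory weights,
    ln(Wref_xy e^{-Rref dt}) - ln(W_xy e^{-R dt}), divided by elapsed time.
    Returns (reference-model observable, bound on the rate function)."""
    x, A, T, Q = x0, 0.0, 0.0, 0.0
    for _ in range(n_events):
        Rref = sum(Wref[x].values())            # reference escape rate
        R = sum(W[x].values())                  # original escape rate
        dt = -math.log(1.0 - rng.random()) / Rref
        r, acc = rng.random() * Rref, 0.0
        for y, w in Wref[x].items():            # jump of the reference model
            acc += w
            if r < acc:
                break
        Q += math.log(Wref[x][y] / W[x][y]) - (Rref - R) * dt
        A += alpha(x, y)
        T += dt
        x = y
    return A / T, Q / T

W    = {0: {1: 1.0}, 1: {0: 0.5}}
Wref = {0: {1: 2.0}, 1: {0: 1.0}}   # uniformly rescaled: all rates doubled
a_ref, J0 = bound_from_reference(W, Wref, 0, 50000, lambda x, y: 1.0)
# a_ref is about twice the original activity, and J0 >= 0 bounds J at a_ref
```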
Our aim is to use evolutionary reinforcement learning to find a reference model (a new policy) that produces a particular typical value of (1), say $\tilde{a}$, and which minimizes (3). Given a value of $\tilde{a}$, the model associated with the smallest possible value of (3) is called the driven or effective model, and its typical behavior is equivalent to the conditioned rare behavior of the original model Chetrite and Touchette (2015). The typical behavior of the driven model yields, from (3), the piece of the rate function of the original model at the point $a = \tilde{a}$. In previous work Jacobson and Whitelam (2019) we showed that the reference model need only be close to the driven model (in a sense made precise in that paper) in order to calculate $J(a)$; from the typical behavior of such a reference model we get a bound $J_0 \geq J(\tilde{a})$, and by sampling the atypical behavior of the reference model we can compute a correction that recovers the exact value of $J(\tilde{a})$. There we showed that simple, physically motivated choices of reference model lead to relatively tight bounds and small corrections; in this paper we show how to improve upon such models using multiparameter ansätze determined by evolutionary learning. Formulated in this way this is an extreme example of reinforcement learning in which there is no instantaneous reward, only an overall reward (or return) associated with the entire trajectory Sutton and Barto (2018). Given that we impose a constraint on $\tilde{a}$ and work in continuous time, this problem also falls outside the (standard) Markov decision process framework Li and Cao (2013); Singh et al. (2007).
Large deviations via evolutionary reinforcement learning— As proof of principle we consider the example of entropy production in a 4-state model whose large-deviation rate function can be computed exactly.
We can reconstruct this function using evolutionary reinforcement learning by mutating the rates of a set of reference models until desired values of (1) and (3) are achieved (see Section S2). Some of the reference models produced in this way are shown in Fig. 1(a) and in the inset of panel (b), and the associated rate-function values are shown in panel (b). All points derived from the typical behavior of the reference models lie on the exact rate function of the original model, indicating that each reference model’s typical dynamics is equivalent to the conditioned rare dynamics of the original model. The blue and gray dashed lines are respectively the Conway-Maxwell-Poisson bound Garrahan (2017) (see Section S3) and the universal current bound Pietzonka et al. (2016); Gingrich et al. (2016).
A neural-network ansatz for models with large state spaces— Thus evolutionary reinforcement learning using 12 trainable parameters (the 12 rates of the reference model) permits accurate computation of probabilities exponentially small in the trajectory length $T$. However, direct application of rate-based evolution is impractical for models with a large number of rates. To overcome this problem we can encode the rates of the reference model as a neural network, and we illustrate this procedure using the one-dimensional Fredrickson-Andersen (FA) model Fredrickson and Andersen (1984). This is a lattice model with dynamical rules that give rise to slow relaxation and complex space-time behavior Garrahan and Chandler (2002). On each site $i$ of a lattice of length $L$ lives a spin, which can be up or down. Up spins (resp. down spins) flip down (resp. up) with rate $1-c$ (resp. $c$) if at least one of their neighboring spins is up; if not, then they cannot flip. We take the dynamical observable to be the number of configuration changes per unit time, $k$, often called the activity Garrahan et al. (2007, 2009). To determine the large-deviation rate function for activity we chose a reference-model parameterization
Here $\lambda$ is a parameter that effectively speeds up or slows down the clock Jacobson and Whitelam (2019),
where $C(x)$ is the configuration of the lattice in state $x$, and the term appearing in (6) returns the number of active hidden nodes in each $n$-spin filter [see Fig. 2(a)]. The reference model contains a small number of trainable parameters: $\lambda$, the weights of the spin filters (only one type of 1-spin filter is necessary), and the remaining network weights; the number of parameters remains modest even when all filter types are used.
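The homogeneous-block spin filters can be illustrated as follows. This sketch encodes our reading of Fig. 2 — an $n$-spin filter is taken to be active at lattice position $i$ when the $n$ consecutive spins starting at $i$ are all up, and the network output is a weighted sum of filter activations; the function names are ours.

```python
def filter_count(spins, n):
    """Number of active n-spin filter positions: sites i at which the n
    consecutive spins starting at i (periodic boundaries) are all up."""
    L = len(spins)
    return sum(all(spins[(i + j) % L] for j in range(n)) for i in range(L))

def filter_sum(spins, weights):
    """Weighted sum of filter activations; weights[n] couples to the
    n-spin filter."""
    return sum(w * filter_count(spins, n) for n, w in weights.items())

spins = [1, 1, 0, 1, 0, 0, 1, 1]   # up = 1, down = 0
counts = [filter_count(spins, n) for n in (1, 2, 3)]
# counts == [5, 3, 2]: 5 up spins, 3 up-up pairs, 2 up-up-up triples (periodic)
```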
The form of (6) is similar to the multi-parameter auxiliary potential of Ref. Nemoto et al. (2017), used to improve the convergence of the cloning method Giardina et al. (2006) in order to calculate the large-deviation function of the FA model. The present approach is different, however, in that the calculation is done by direct simulation of a reference model whose parameters are determined by an evolutionary process (rather than by rare-event algorithms such as cloning or transition-path sampling Bolhuis et al. (2002)), and results in the calculation of $J(a)$ directly (rather than its Legendre transform, which in general contains less information Touchette (2009)).
To test the method we considered the FA model with periodic boundary conditions, with lattice size small enough that the exact $J(a)$ can be determined by diagonalization of the model’s rate matrix; that function is shown as a black dashed line in Fig. 3(a). We next introduce the reference model (5), and do evolutionary reinforcement learning on the weights of the network (see Section S4). In Fig. 3(a) we show results of these calculations using spin filters up to order $n$. Increasing $n$ from 0 improves the quality of the bound until it becomes numerically close to the exact answer; see Fig. S3. That figure demonstrates that the quality of the bound exceeds that of the few-parameter, physically motivated ansatz used in Ref. Jacobson and Whitelam (2019). The neural network contains many fewer parameters than the model has rates (unlike in many deep-learning studies), and so we do not necessarily expect the bound to be exact. If the bound is good, the exact answer can be calculated by computing a correction term Jacobson and Whitelam (2019). Here, though, the correction term is small, indicating that the typical dynamics of this set of reference models is similar to the conditioned rare behavior of the original model. Comparison of these results with the exact result, and with the bound from Ref. Jacobson and Whitelam (2019) (Fig. S3), indicates that rare trajectories of the FA model resemble the typical trajectories of versions of the FA model with different values of the rate parameter $c$, but with slightly different tendencies to display spin domains of different lengths. These tendencies are quantified by the weights of the neural network, some of which are shown in Fig. 3(b). In panel (c) we show space-time plots of the trajectories of 5 reference models.
With proof of principle demonstrated using models whose state space is small enough to solve by matrix diagonalization, we show in Fig. 4 and Section S4 that evolutionary learning can closely approximate rate functions whose calculation requires state-of-the-art numerical methods Bañuls and Garrahan (2019). This demonstration indicates the potential of the present method to address cutting-edge problems, an open question being which neural-network designs are required for particular types of model.
Conclusions— We have shown that the calculation of dynamical large-deviation rate functions can be done using direct simulation guided by evolutionary reinforcement learning. In previous work we showed how to calculate such rate functions using a variational ansatz for rare dynamics (VARD) that contains a few parameters motivated by physical insight Jacobson and Whitelam (2019); here we show that evolutionary reinforcement learning using a multi-parameter ansatz or “VARDnet” removes the need for such insight. Our approach does not rely on the formal results of large-deviation theory, making it complementary to the growing body of methods based on such results Giardina et al. (2006); Touchette (2009); Garrahan et al. (2009); Chetrite and Touchette (2015); Jack and Sollich (2015); Ray et al. (2018); Nemoto et al. (2017); Ferré and Touchette (2018). More generally, many physical problems involve time- or path-extensive quantities – for instance, the yield of molecular self-assembly depends on the history of the interactions of molecules – and we have shown that it is possible to generate dynamical trajectories conditioned upon particular values of path-extensive quantities without physical insight or access to formal results.
Acknowledgments— We thank Hugo Touchette for comments. This work was performed as part of a user project at the Molecular Foundry, Lawrence Berkeley National Laboratory, supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02–05CH11231. D.J. acknowledges support from the Department of Energy Computational Science Graduate Fellowship. I.T. performed work at the National Research Council of Canada under the auspices of the AI4D Program.
Appendix S1 Large deviations by change of model
For completeness we present the derivation of Eq. (3) of the main text, which follows straightforwardly from the definition of the probability distribution. The derivation follows Ref. Jacobson and Whitelam (2019) with minor notational changes. For more on the ideas of dynamic importance sampling see e.g. Refs. Glynn and Iglehart (1989); Sadowsky and Bucklew (1990); Bucklew et al. (1990); Bucklew (1990); Asmussen and Glynn (2007); Juneja and Shahabuddin (2006); Bucklew (2013); Touchette (2009) and Ref. Chetrite and Touchette (2015) (esp. Section 5).
Consider a continuous-time dynamics on a set of discrete states, defined by the master equation Binder (1986)

$\partial_t P_x(t) = \sum_{y \neq x} \left[ W_{yx} P_y(t) - W_{xy} P_x(t) \right].$ (S1)

Here $P_x(t)$ is the probability that the system is in (micro)state $x$ at time $t$, $W_{xy}$ is the rate for passing from state $x$ to state $y$, and $R_x = \sum_{y \neq x} W_{xy}$ is the escape rate from $x$. A standard way of simulating (S1) is as follows Gillespie (1977): from state $x$, choose a new state $y$ with probability

$p_{xy} = \frac{W_{xy}}{R_x},$ (S2)

and a time increment $\Delta t$ from the distribution

$p_x(\Delta t) = R_x e^{-R_x \Delta t}.$ (S3)
We are interested in the probability distribution

$\rho_T(A) = \sum_\omega P[\omega]\, \delta\!\left( A - A[\omega] \right)$ (S4)

of a time-extensive dynamical observable $A[\omega] = \sum_{k=0}^{K-1} \alpha_{x_k x_{k+1}}$.
In these expressions the delta function specifies a constraint on the trajectory, $\alpha_{xy}$ is the change of the observable upon moving from $x$ to $y$, and $A[\omega]$ is the sum of these quantities over a single trajectory $\omega$. We define

$a \equiv A/T$ (S5)

as the time-intensive version of $A$; $T$ is the elapsed time of trajectory $\omega$, and $P[\omega]$ is the probability of a trajectory $\omega$, proportional to a product of factors (S2) and (S3) for all jumps of the trajectory.
Direct evaluation of (S4) using the dynamics (S2) and (S3) leads to good sampling of $\rho_T(a)$ near the typical value $a_0$ of $a$, and poor sampling elsewhere. To overcome this problem we can introduce a reference dynamics

$\tilde{p}_{xy} = \frac{\tilde{W}_{xy}}{\tilde{R}_x},$ (S7)

$\tilde{p}_x(\Delta t) = \tilde{R}_x e^{-\tilde{R}_x \Delta t},$ (S8)

in which $\tilde{W}_{xy}$ is a modified version of the rate $W_{xy}$ of the original model, and $\tilde{R}_x = \sum_{y \neq x} \tilde{W}_{xy}$. Let $\tilde{P}[\omega]$ be the trajectory weight of the reference dynamics, proportional to a product of factors (S7) and (S8) for all jumps of the trajectory. We write
for the ensemble averages over trajectories (having length $T$ and observable value $A$) of the original and reference models, respectively. We can then write (S4) as
Here $\Delta t_k$ is the $k$th jump time of the reference model ($u$ is a random number uniformly distributed on $(0,1]$). The quantity appearing in the average in (S12) is the ratio $P[\omega]/\tilde{P}[\omega]$ of the trajectory weights of the two models.
Taking logarithms of (S12) and the large-$T$ limit gives us
In these expressions $\tilde{a}$ is the typical value of $a$ for the reference model. The term (S16) is, by Jensen’s inequality, an upper bound on the piece of the rate function at the point $a = \tilde{a}$, i.e. $J(\tilde{a}) \leq J_0$.
The bound can be determined by computing the values of (S5) and (S13) for a suitably long reference-model trajectory. If the reference model’s typical dynamics is similar to the conditioned rare dynamics of the original model (something we generally do not know in advance), then the bound will be tight, and if it is tight enough the exact value of can be calculated by sampling the (slightly) atypical behavior of the reference model Jacobson and Whitelam (2019). In the main text we show that evolutionary reinforcement learning can generate reference models for which the correction term is very small. The optimal reference model, called the driven or auxiliary process Chetrite and Touchette (2015); Jack and Sollich (2015); Bucklew (2013); Touchette (2009), is one for which the bound is exact, meaning that its typical behavior is equivalent to the conditioned rare behavior of the original model.
Appendix S2 4-state model evolutionary procedure
The evolutionary procedure used to make Fig. 1 of the main text is as follows. We start by running a trajectory of the reference model (of a fixed number of events) and recording the typical value of the observable and of the bound, i.e. the long-time limits of (1) and (3), respectively; call these $a$ and $J_0$. Initially the reference model is the original model, $\tilde{W}_{xy} = W_{xy}$, and so $a = a_0$ and $J_0 = 0$.
To perform an evolutionary step we create a mutant model whose rates are perturbed versions of the reference-model rates.
Here the size of the perturbation is set by an evolutionary rate $\sigma$ and a uniformly distributed random number. The parameter $\sigma$ is a learning rate, and its effect is similar to that of other types of learning rate in machine learning, or of the basic step size in Monte Carlo simulation: if it is too small then we do not explore configuration space rapidly enough; if it is too large then the acceptance rate is too low; and somewhere in between these extremes its precise numerical value does not matter. This optimal region must be determined empirically, and we found values of $\sigma$ of order 0.1 to be acceptable.
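A minimal version of this mutation step might look as follows; the multiplicative form of the perturbation is our illustrative choice (the essential requirements are only that rates stay positive and that $\sigma$ sets the step size).

```python
import math
import random

def mutate_rates(W, sigma, rng):
    """Propose a mutant model: multiply each rate by exp(sigma * u),
    with u uniform on [-1, 1], so that rates stay positive."""
    return {x: {y: w * math.exp(sigma * rng.uniform(-1.0, 1.0))
                for y, w in targets.items()}
            for x, targets in W.items()}

rng = random.Random(2)
W = {0: {1: 1.0}, 1: {0: 0.5}}
W_mut = mutate_rates(W, 0.1, rng)   # every rate within a factor e^{+-0.1}
```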
With this new set of rates we run a new trajectory and compute the new values of $a$ and $J_0$, called $a'$ and $J_0'$, respectively. If our selection criteria are fulfilled (see below) then we accept the mutation, and set $a \to a'$ and $J_0 \to J_0'$ (the mutant model becomes the new reference model); if not, we retain the current reference model.
We imposed two types of selection criteria. For the first, called $a$-evolution, we accept the mutation if $a'$ is closer than $a$ to a specified target value $a_{\rm target}$, i.e. if

$|a' - a_{\rm target}| \leq |a - a_{\rm target}|.$ (S20)

For the second, called $J_0$-evolution, we accept the mutation if $J_0'$ is smaller than $J_0$ and if $a'$ lies within a tolerance $\Delta$ of a specified pinning value $a^\star$, i.e. if

$J_0' < J_0 \quad \text{and} \quad |a' - a^\star| \leq \Delta.$ (S21)
The process of $a$-evolution leads to reference models able to generate values of $a$ far from $a_0$, while $J_0$-evolution leads to reference models that generate those values of $a$ in a manner as close as possible to that of the original model. The role of the tolerance $\Delta$ in (S21) is to constrain the reference model to a particular window of $a$, and its value can be chosen for convenience (e.g. to ensure that we plot particular points along the rate function).
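In code, the two acceptance tests take one line each; this is an illustrative transcription (the strictness of the inequalities is our assumption):

```python
def accept_a_evolution(a_new, a_old, a_target):
    """a-evolution: accept if the mutant's typical observable is at
    least as close to the target value as the current one."""
    return abs(a_new - a_target) <= abs(a_old - a_target)

def accept_j_evolution(j_new, j_old, a_new, a_pin, tol):
    """Bound-evolution: accept if the bound decreases and the observable
    stays within a tolerance of the pinning value."""
    return j_new < j_old and abs(a_new - a_pin) <= tol

# moving toward a target of 2.0 is accepted; moving away is not
ok = accept_a_evolution(1.5, 1.0, 2.0)
bad = accept_a_evolution(0.5, 1.0, 2.0)
```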
To produce Fig. 1 we alternated steps of $a$-evolution with steps of $J_0$-evolution, using a separate evolutionary rate for each. During $J_0$-evolution we chose the pinning value $a^\star$ to be the last value of $a$ produced by the preceding phase of $a$-evolution. Upon reaching a specified value $a_{\rm target}$ we carried out additional steps of $J_0$-evolution, enough that the bound had stopped evolving. For the 4-state model the chosen number of steps is much larger than necessary, because the bound stopped evolving after a few hundred trajectories. We carried out 100 independent simulations, each with a different target value $a_{\rm target}$.
Appendix S3 The CMP universal activity bound
The Conway-Maxwell-Poisson (CMP) formula

$J(a) \leq \bar{k} \left[ \frac{a}{\bar{a}} \ln \frac{a}{\bar{a}} - \frac{a}{\bar{a}} + 1 \right]$ (S22)

gives a bound on the large-deviation rate function for any non-decreasing counting observable $a$; here $\bar{a}$ is the typical value of the observable, and $\bar{k}$ is the typical dynamical activity (the total number of configuration changes per unit time). Eq. (S22) was derived in Ref. Garrahan (2017) from Level 2.5 of large deviations Maes and Netočnỳ (2008); Bertini et al. (2015), and we have used this form in Fig. 3, Fig. S3, and Fig. 4 (for the case in which the observable is the activity itself).
We note here that the CMP formula can be straightforwardly derived from the generic bound (3), without using the result known as Level 2.5 of large deviations. Let $\bar{a}$ and $\bar{k}$ be the typical values of the observable and of the activity produced by an original model $W_{xy}$. Then a reference model $\tilde{W}_{xy} = \lambda W_{xy}$, whose rates are uniformly rescaled versions of those of the original model, will produce typical values $\lambda \bar{a}$ and $\lambda \bar{k}$ (a uniform rescaling of rates does not affect the choice of new state, i.e. $\tilde{W}_{xy}/\tilde{R}_x = W_{xy}/R_x$, and so the reference model will visit the same set of states as the original model, just faster or slower). The reference-model escape rate is then $\tilde{R}_x = \lambda R_x$. In (3) we assume the long-time, steady-state limit, and so replace the fluctuating jump time with its mean $\tilde{R}_x^{-1}$, giving

$J(\tilde{a}) \leq \tilde{k} \ln \lambda - (\lambda - 1) \bar{k},$ (S23)

where $\tilde{k} = \lambda \bar{k}$ and $\tilde{a} = \lambda \bar{a}$ are respectively the mean values of the activity and of the observable for the reference model ($\bar{k}$ and $\bar{a}$ are the analogous quantities for the original model). Eq. (S23) is the bound associated with the single reference model whose rates are $\lambda W_{xy}$. By choosing different values of $\lambda$ we create a family of reference models, each with a distinct typical behavior, and so can replace $\lambda$ in Eq. (S23) with the general value $a/\bar{a}$; doing so, we recover Eq. (S22), the CMP bound.
The derivation leading to (S23) specifies only that $a$ be derived from a time-extensive quantity $A$, so that rescaling all rates by a factor $\lambda$ changes the typical value of the observable, $\bar{a}$, to $\lambda \bar{a}$. Thus the CMP bound applies to any time-extensive quantity, including currents, not just non-decreasing counting variables. This fact justifies its inclusion in Fig. 1, where we consider entropy production (a current). However, the CMP bound does not address the $a < 0$ sector, which cannot be accessed if the typical value of the observable of the original model is positive (because any reference model obtained under a rescaling of rates has $\tilde{a} = \lambda \bar{a} > 0$).
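The algebra of this derivation is easy to check numerically. The sketch below (function names are ours) evaluates the rescaling bound for the activity itself, under our reading of the derivation above, and confirms that substituting $\lambda = k/\bar{k}$ reproduces the CMP form $k \ln(k/\bar{k}) - k + \bar{k}$.

```python
import math

def rescaling_bound(lmbda, k_typ):
    """Bound from a model whose rates are uniformly rescaled by lmbda:
    the reference activity is lmbda * k_typ (form of Eq. S23)."""
    return lmbda * k_typ * math.log(lmbda) - (lmbda - 1.0) * k_typ

def cmp_activity_bound(k, k_typ):
    """CMP bound specialized to the activity: k ln(k/k_typ) - k + k_typ."""
    return k * math.log(k / k_typ) - k + k_typ

# the two forms agree for every rescaling factor, as claimed in the text
k_typ = 2.0 / 3.0
for lmbda in (0.25, 0.5, 2.0, 4.0):
    k = lmbda * k_typ
    assert abs(rescaling_bound(lmbda, k_typ) - cmp_activity_bound(k, k_typ)) < 1e-12
```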
A simple way to produce a bound pertaining to the $a < 0$ sector is to apply the $\lambda$-rescaling to the time-reversed version of the original model. To see this, we proceed as follows. Consider the reference model obtained by rescaling the rates of the original model, $W_{xy}$, by the exponential of (minus) the entropy production per jump:

$\tilde{W}_{xy} = W_{xy} e^{-\sigma_{xy}}, \quad \sigma_{xy} = \ln \frac{\pi_x W_{xy}}{\pi_y W_{yx}},$ (S24)

so that

$\tilde{W}_{xy} = \frac{\pi_y W_{yx}}{\pi_x}.$ (S25)
Here we are using standard notation for Markov chains: $p_{xy} = W_{xy}/R_x$ is the probability of moving to (micro)state $y$, given that we are in state $x$; $R_x$ is the escape rate from $x$; and $\pi_x$ is the invariant measure, which satisfies

$\sum_y \pi_y W_{yx} = \pi_x R_x.$
Then upon dividing (S25) by $R_x$ we have

$\tilde{p}_{xy} = \frac{\pi_y W_{yx}}{\pi_x R_x},$

and so this reference model generates the time-reversed Markov chain Hastings (1970). If the observable $a$ is a current, odd under time reversal, then the typical value of the observable in the reference model is $-\bar{a}$.
Thus choosing the time-reversed model to be the reference model gives as a bound a single point, $J(-\bar{a}) \leq \bar{\sigma}$, on the rate function of any current $a$; here $\bar{a}$ and $\bar{\sigma}$ are the typical values of the current and the entropy production rate in the original model.
If we now apply a $\lambda$-rescaling to the time-reversed model we create a family of reference models with rates

$\tilde{W}_{xy} = \lambda \frac{\pi_y W_{yx}}{\pi_x}.$
Hence one bound on any current $a$ is provided by the combination of (S23), supplemented by the entropy-production term generated by time reversal, for $a < 0$, and (S23) itself for $a > 0$. We show this bound (for the 4-state model) as a blue dotted line in Fig. S2. The double-well form results from the fact that the associated family of reference models is a glued-together combination of forward and time-reversed ‘original’ models with uniformly rescaled rates. Comparison with the universal current bound Pietzonka et al. (2016); Gingrich et al. (2016) (gray dotted line) shows the latter to derive from a different family of models (see Figs. 2 and 3 of Ref. Jacobson and Whitelam (2019) for a comparison between the universal current bound and the bounds produced by other families of reference models).
Appendix S4 FA model evolutionary procedure
S4.1 Small FA model
All neural-network weights of the reference model (5) were initially zero. Each proposed evolutionary move consisted of a shift of each weight by an independent Gaussian-distributed random number of zero mean and variance $\sigma^2$.
The parameter $\sigma$ is a learning rate, and its effect is similar to that of other types of learning rate in machine learning, or of the basic step size in Monte Carlo simulation. We found values of $\sigma$ of order 0.01 to be acceptable. We ran trajectories of a fixed number of events, and recorded the values of (1) and (3) after each proposed trajectory. We did $a$-evolution on the parameters of (5) until a specified value $a_{\rm target}$ was reached. This procedure was as described for the 4-state model, with the additional restriction that the new bound must be not more than a value $q$ larger than the current bound. That is, the proposed set of weights was accepted if

$|a' - a_{\rm target}| \leq |a - a_{\rm target}| \quad \text{and} \quad J_0' \leq J_0 + q.$ (S34)
We introduced the parameter $q$ in order to test the effect of replacing the alternating $a$- and $J_0$-evolution of Section S2 with a “regularized” form of $a$-evolution (one that does not allow the bound to grow by more than a set amount in any one step). If $q$ was chosen very small then $a$-evolution could not get going at all, because the bound must be allowed to increase at some point in the calculation. If $q$ was set very large (e.g. of order 10, so that the second requirement in (S34) was effectively absent), then we observed the effect seen with the gray line in Fig. 4, whereby the bound obtained after $a$-evolution and prior to $J_0$-evolution was much larger than the exact value of $J(a)$. For intermediate values of $q$, such as the value 0.2 chosen here, the bound obtained prior to $J_0$-evolution was in general close to the exact answer (see the gray and orange lines in Fig. 3). However, the total CPU time required in the cases of moderate and large $q$ was similar.
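The weight mutation and regularized acceptance just described can be sketched as follows (the symbols and function names are ours; the tolerance argument plays the role of the quantity set to 0.2 above):

```python
import random

def mutate_weights(weights, sigma, rng):
    """Shift each network weight by an independent Gaussian random
    number of zero mean and standard deviation sigma."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def accept_regularized(a_new, a_old, a_target, j_new, j_old, q):
    """Regularized a-evolution: move toward the target value while
    forbidding the bound from growing by more than q in a single step."""
    return abs(a_new - a_target) <= abs(a_old - a_target) and j_new <= j_old + q

rng = random.Random(3)
w = mutate_weights([0.0] * 5, 0.01, rng)   # small perturbation of zero weights
```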
We then did $J_0$-evolution using the tolerance $\Delta$ [see Eq. (S21)], for a fixed number of proposed trajectories, with higher-order spin filters applied. We ran 50 simulations, each with a different target value of $a$.
In Fig. S3 we reproduce some of the results shown in Fig. 3(a) of the main text, together with results obtained for different values of the filter order $n$. In physical terms the value of $n$ determines the lengthscale over which the dynamical rules of the reference model act. The original model possesses nearest-neighbor dynamics; its dynamics conditioned upon certain values of $a$ involves potentially long-range correlations. Fig. S3 shows how closely (in terms of probabilities) the typical dynamics of reference models whose dynamical rules possess $n$-spin correlations approximates this conditioned dynamics: a modest value of $n$ is sufficient to closely approximate the rate function.
S4.2 Large FA model
We also tested the method on the 100-site FA model of Ref. Bañuls and Garrahan (2019), whose state space is large enough that state-of-the-art methods are needed to compute its large-deviation rate function. Initial tests done using evolutionary learning on the network shown in Fig. 2 produced a bound that was inexact, but close enough to the exact answer to suggest that modifications of that scheme might more closely approximate the exact answer. We therefore replaced the network shown in Fig. 2 of the main text with the one shown in Fig. 4(a) of the main text. Each hidden node in this network couples to a set of input nodes (lattice sites), and takes one of a discrete set of values. This value, evaluated in each microstate, is determined by the state of the spins to which it is connected, via
The output of the network in a given microstate is then
where the weights associated with each hidden-node value are, along with $\lambda$, the trainable parameters of the model. This network is capable of learning which features (spin patterns) are most significant; by contrast, the network shown in Fig. 2 looks only for homogeneous blocks of spins. The reference-model ansatz is again (5), but now with (S36) replacing (6).
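This hidden-node construction can be sketched as follows. We assume, purely for illustration (Fig. 4(a) specifies the actual connectivity), that each hidden node reads a window of $n$ consecutive sites, so that its value is the integer encoding the spin pattern in that window, one of $2^n$ possibilities, and that the weight attached to each pattern value is shared across nodes.

```python
def node_values(spins, n):
    """Value of each hidden node: the integer encoding the pattern of
    the n consecutive spins it reads (periodic boundaries)."""
    L = len(spins)
    return [sum(spins[(i + j) % L] << j for j in range(n)) for i in range(L)]

def pattern_output(spins, pattern_weights, n):
    """Network output: sum over hidden nodes of the weight attached to
    each node's pattern value (one trainable weight per pattern)."""
    return sum(pattern_weights[v] for v in node_values(spins, n))

spins = [1, 0, 1, 1]
vals = node_values(spins, 2)   # one 2-site pattern value per lattice site
```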
We ran 40 evolutionary simulations with target values of $a$ between 0.1 and 30 (the typical value for the original model is approximately 3.5). We turned on the neural network from the start, and used trajectories of a fixed number of events. We used $a$-evolution [using Eq. (S20)] to generate the desired values of $a$, and then did $J_0$-evolution for a further set of proposed trajectories. Results are shown in Fig. 4(b): the bound produced is inexact, but numerically close to the exact answer. We show space-time plots of reference-model dynamics in panel (c).
These calculations illustrate that relatively simple neural networks are capable of parameterizing (to a good approximation) the rates corresponding to the conditioned dynamics of lattice models that display complex behavior. An important open question is the nature of the neural network best suited for particular models.
We also note that we used slightly different variants of the $a$- and $J_0$-evolution protocols for each of the 4-state model and the small and large FA models (each protocol is detailed above), in order to explore the effect of changing the protocol. We did not find one protocol to be obviously better than another, suggesting that a number of different evolutionary strategies can be used to tackle these problems.
- J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007).
- K. Mills, M. Spanner, and I. Tamblyn, Physical Review A 96, 042113 (2017).
- A. L. Ferguson and J. Hachmann, Molecular Systems Design & Engineering (2018).
- N. Artrith, A. Urban, and G. Ceder, The Journal of Chemical Physics 148, 241711 (2018).
- A. Singraber, T. Morawietz, J. Behler, and C. Dellago, Journal of Physics: Condensed Matter 30, 254005 (2018).
- C. Desgranges and J. Delhommelle, The Journal of Chemical Physics 149, 044118 (2018).
- B. Thurston and A. Ferguson, Molecular Simulation , 1 (2018).
- A. Singraber, J. Behler, and C. Dellago, Journal of Chemical Theory and Computation 15, 1827 (2019).
- J. Han et al., arXiv preprint arXiv:1611.07422 (2016).
- K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, Nature Communications 8, 13890 (2017).
- K. Yao, J. E. Herr, D. Toth, R. Mckintyre, and J. Parkhill, Chem. Sci. 9, 2261 (2018).
- K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, The Journal of Chemical Physics 148, 241722 (2018).
- J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017), 1605.01735 .
- N. Portman and I. Tamblyn, Journal of Computational Physics 350, 871 (2017).
- B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg, Nature Physics (2019), 10.1038/s41567-019-0554-0.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018).
- T. I. Ahamed, V. S. Borkar, and S. Juneja, Operations Research 54, 489 (2006).
- A. Basu, T. Bhattacharyya, and V. S. Borkar, Mathematics of Operations Research 33, 880 (2008).
- V. S. Borkar, in Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems–MTNS, Vol. 5 (2010).
- V. Borkar, S. Juneja, A. Kherani, et al., Communications in Information & Systems 3, 259 (2003).
- R. Chetrite and H. Touchette, Journal of Statistical Mechanics: Theory and Experiment 2015, P12001 (2015).
- H. J. Kappen and H. C. Ruiz, Journal of Statistical Physics 162, 1244 (2016).
- T. Nemoto, R. L. Jack, and V. Lecomte, Physical Review Letters 118, 115702 (2017).
- G. Ferré and H. Touchette, arXiv preprint arXiv:1803.11117 (2018).
- T. A. Bojesen, Phys. Rev. E 98, 063303 (2018).
- C. J. Watkins and P. Dayan, Machine learning 8, 279 (1992).
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, arXiv preprint arXiv:1312.5602 (2013).
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Nature 518, 529 (2015).
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, Journal of Artificial Intelligence Research 47, 253 (2013).
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, in International conference on machine learning (2016) pp. 1928–1937.
- Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al., arXiv preprint arXiv:1801.00690 (2018).
- E. Todorov, T. Erez, and Y. Tassa, in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on (IEEE, 2012) pp. 5026–5033.
- M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming (John Wiley & Sons, 2014).
- A. Asperti, D. Cortesi, and F. Sovrano, arXiv preprint arXiv:1804.08685 (2018).
- M. Riedmiller, in European Conference on Machine Learning (Springer, 2005) pp. 317–328.
- M. Riedmiller, T. Gabel, R. Hafner, and S. Lange, Autonomous Robots 27, 55 (2009).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, arXiv preprint arXiv:1707.06347 (2017).
- F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune, arXiv preprint arXiv:1712.06567 (2017).
- G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, arXiv preprint arXiv:1606.01540 (2016).
- M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski, in Computational Intelligence and Games (CIG), 2016 IEEE Conference on (IEEE, 2016) pp. 1–8.
- M. Wydmuch, M. Kempka, and W. Jaśkowski, arXiv preprint arXiv:1809.03470 (2018).
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Nature 529, 484 (2016).
- D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Nature 550, 354 (2017).
- H. Touchette, Physics Reports 478, 1 (2009).
- J. P. Garrahan, R. L. Jack, V. Lecomte, E. Pitard, K. van Duijvendijk, and F. van Wijland, Journal of Physics A: Mathematical and Theoretical 42, 075007 (2009).
- F. den Hollander, Large Deviations, Vol. 14 (American Mathematical Society, 2008).
- R. S. Ellis, Entropy, large deviations, and statistical mechanics (Springer, 2007).
- C. Giardina, J. Kurchan, and L. Peliti, Physical Review Letters 96, 120603 (2006).
- U. Ray, G. K.-L. Chan, and D. T. Limmer, Physical Review Letters 120, 210602 (2018).
- M. C. Bañuls and J. P. Garrahan, arXiv preprint arXiv:1903.01570 (2019).
- D. Jacobson and S. Whitelam, Phys. Rev. E 100, 052139 (2019).
- D. T. Gillespie, The Journal of Physical Chemistry 81, 2340 (1977).
- T. R. Gingrich, J. M. Horowitz, N. Perunov, and J. L. England, Physical Review Letters 116, 120601 (2016).
- J. P. Garrahan, Physical Review E 95, 032134 (2017).
- P. Pietzonka, A. C. Barato, and U. Seifert, Physical Review E 93, 052145 (2016).
- U. Seifert, Physical Review Letters 95, 040602 (2005).
- T. Speck, A. Engel, and U. Seifert, Journal of Statistical Mechanics: Theory and Experiment 2012, P12001 (2012).
- V. Lecomte, A. Imparato, and F. v. Wijland, Progress of Theoretical Physics Supplement 184, 276 (2010).
- É. Fodor, M. Guo, N. Gov, P. Visco, D. Weitz, and F. van Wijland, EPL (Europhysics Letters) 110, 48005 (2015).
- J. Bucklew, Introduction to rare event simulation (Springer Science & Business Media, 2013).
- P. W. Glynn and D. L. Iglehart, Management Science 35, 1367 (1989).
- J. S. Sadowsky and J. A. Bucklew, IEEE Transactions on Information Theory 36, 579 (1990).
- J. A. Bucklew, P. Ney, and J. S. Sadowsky, Journal of Applied Probability 27, 44 (1990).
- J. A. Bucklew, Large deviation techniques in decision, simulation, and estimation (Wiley New York, 1990).
- S. Asmussen and P. W. Glynn, Stochastic simulation: algorithms and analysis, Vol. 57 (Springer Science & Business Media, 2007).
- S. Juneja and P. Shahabuddin, Handbooks in Operations Research and Management Science 13, 291 (2006).
- Y. Li and F. Cao, European Journal of Operational Research 224, 333 (2013).
- S. S. Singh, V. B. Tadić, and A. Doucet, European Journal of Operational Research 178, 808 (2007).
- The model’s rates are , , , , , , , , , , , and .
- G. Fredrickson and H. C. Andersen, Physical Review Letters 53, 1244 (1984).
- J. P. Garrahan and D. Chandler, Physical Review Letters 89, 035704 (2002).
- J. P. Garrahan, R. L. Jack, V. Lecomte, E. Pitard, K. van Duijvendijk, and F. van Wijland, Physical Review Letters 98, 195702 (2007).
- In order to calculate the correction to the bound (3), the parameterization using the variable called in Ref. Jacobson and Whitelam (2019) is more efficient; to calculate the bound itself the choice ( or ) makes little difference.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in neural information processing systems (2012) pp. 1097–1105.
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Proceedings of the IEEE 86, 2278 (1998).
- P. G. Bolhuis, D. Chandler, C. Dellago, and P. L. Geissler, Annual Review of Physical Chemistry 53, 291 (2002).
- R. L. Jack and P. Sollich, The European Physical Journal Special Topics 224, 2351 (2015).
- K. Binder, in Monte Carlo Methods in Statistical Physics (Springer, 1986) pp. 1–45.
- C. Maes and K. Netočný, EPL (Europhysics Letters) 82, 30003 (2008).
- L. Bertini, A. Faggionato, D. Gabrielli, et al., in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Vol. 51 (Institut Henri Poincaré, 2015) pp. 867–900.
- W. K. Hastings, Biometrika 57, 97 (1970).