Hamiltonian Graph Networks with ODE Integrators
Abstract
We introduce an approach for imposing physically informed inductive biases in learned simulation models. We combine graph networks with a differentiable ordinary differential equation integrator as a mechanism for predicting future states, and a Hamiltonian as an internal representation. We find that our approach outperforms baselines without these biases in terms of predictive accuracy, energy accuracy, and zeroshot generalization to timestep sizes and integrator orders not experienced during training. This advances the stateoftheart of learned simulation, and in principle is applicable beyond physical domains.
1 Introduction
Learning to simulate complex physical processes has been shown to benefit from rich inductive biases about the structure of the system’s state, such as modeling particles and their interactions via nodes and edges in a graph Battaglia et al. (2016); Chang et al. (2017); SanchezGonzalez et al. (2018); Mrowca et al. (2018); Li et al. (2018), and optimizing internal representations to predict known physical quantities Seo and Liu (2019). Here we explore another class of physically informed inductive biases: that a system’s dynamics can be modeled as an ordinary differential equation (ODE) and formulated as Hamiltonian mechanics. These have also been individually studied recently by Chen et al. (2018); Rubanova et al. (2019) (ODE bias), and Greydanus et al. (2019) (Hamiltonian bias), but here we combine them and incorporate both into graphnetworkbased neural architectures Battaglia et al. (2018, 2016); SanchezGonzalez et al. (2018) for learning simulation.
Our experiments show that our trained models have greater predictive accuracy than baselines, as well as better energy conservation (via the Hamiltonian inductive bias) and stronger generalization to novel timesteps and integrator orders (via the ODE integrator inductive bias). These inductive biases also carry tradeoffs: with lower order integrators and coarse timesteps, both our learned Hamiltonian model as well as a model which uses the true Hamiltonian derived from physics struggle, because the Hamiltonian puts hard constraints on how the higher order structure of the dynamics must be modeled, whereas the less constrained models can learn ways of approximating such structure.
2 Background
Graph networks
We represent a particle system as a graph whose nodes correspond to particles, and with edges connecting all nodes to each other. All of our models use a graph network (GN) Battaglia et al. (2018), which operates on graphs with global features, , and variable numbers of nodes, , and edges, . Here we focus on GNs which can compute pernode outputs, , and global outputs . Preliminary experiments show our GN model outperformed an MLPbased approach by several orders of magnitude, consistently with previous work Battaglia et al. (2016); SanchezGonzalez et al. (2018).
Numerical integrators for solving ODEs
Given a firstorder ODE and initial conditions, numerical integration can be used to approximate solutions to the initial value problem,
(1) 
The family of RungeKutta (RK) integrators are fully differentiable and can generate trajectories via iterations of the form, , where . The lowest order RK integrator (Euler integrator, RK1) iterates as, . Higher order RK integrators produce more accurate trajectories by composing multiple queries to the function . Here we explore first through fourthorder RK integrators: RK1, RK2, RK3, and RK4 (See Supp. Sec. D.1 for results with symplectic integrators).
Hamiltonian mechanics
In Hamiltonian mechanics, the Hamiltonian, , is a function of the canonical position, , and momentum, , coordinates, and usually corresponds to the energy of the system ^{1}^{1}1Our methods are also compatible with timedependent Hamiltonians, . However, since our dataset does not require time dependency, for simplicity we will omit everywhere from here on.. The dynamics of the system follow Hamilton’s equations, two firstorder ODEs (analogous to Eq. 1 with ),
(2) 
3 Models
Delta graph network (DeltaGN)
Our main baseline, a “delta graph network” (DeltaGN), replicates previous approaches for learning to simulate Battaglia et al. (2016); SanchezGonzalez et al. (2018), and directly predicts changes to and ,
(3) 
The are static parameters (masses, spring constants) of the system, and are the neural network parameters. The ’s signature matches the integrator’s, and so is analogous to learning an integrator.
ODE graph network (OGN)
Our “ODE graph network” (OGN) imposes an ODE integrator as an inductive bias in the GN, by assuming that the dynamics of follow a firstorder ODE (Eq. 1). We train a neural network that learns the ODE, that is, learns to produce the time derivatives (which are independent from ). We again use the node output of a GN to model the perparticle time derivatives, and provide the GN, together with the initial conditions and , to an RK integrator,
(4)  
(5) 
The is a function that the integrator can use to operate on any and query more than once. For fixed and when using the RK1/Euler integrator (see Sec. 2) this is equivalent (up to a scale factor) to the DeltaGN.
Hamiltonian ODE graph network (HOGN)
Our “Hamiltonian ODE graph network” (HOGN) imposes further constraints by using a to compute a single scalar for the system—the Hamiltonian—via the global output, analytically differentiating it with respect to its inputs, and , and, in accordance with Hamilton’s equations (Eq. 2), treating these gradients as and , respectively,
(6)  
(7) 
The resulting ODE defined by can be integrated similarly to the OGN by passing those functions to the integrator (Eq. 4). Crucially, the is not supervised directly, but instead learned endtoend through the integrator.
4 Results
We train and test our approach on datasets consisting of particle systems where particle exerts a spring force on particle , as defined by Hooke’s law, , where is the spring constant. All systems contained between and particles, and simulations were computed with timesteps of . To form the data from the raw simulations, we subsampled the trajectories at a fixed timestep of , except in the timestep generalization conditions described below. We trained all models to make nextstep predictions of all particles’ positions and momenta, with a loss function which penalized mean squared error between the true and predicted values.
Fig. 2a shows that for models trained and tested on RK4, the OGN and HOGN have lower rollout error than the DeltaGN, and even lower than integrating the true Hamiltonian at the same time step. Fig. 2b shows that the average energy of the system also stays closer to the initial energy for the OGN and HOGN. The OGN energy seems to preserve energy better than the HOGN, however this is not surprising: as shown below, the HOGN matches the true Hamiltonian well, and both have imperfect energy conservation when integrated with RK4, which is not symplectic^{2}^{2}2Symplectic integrators are designed to have more accurate energy conservation.. Similar experiments with a thirdorder symplectic integratorCandy and Rozmus (1991) yielded similar rollout errors for the HOGN, with 3 times better energy conservation than the OGN (See Supp. Sec. D.1). Figs. 2cg (and video link) illustrate the rollout accuracy of the different models for a long 500step trajectory.
Generalizing to untrained timesteps
Both the OGN and HOGN exhibit better zeroshot generalization than the DeltaGN on timesteps which were never experienced during training. Fig. 3a shows each model’s performance when trained on a timestep of and tested on timesteps between and . We also trained models on multiple timesteps, from to , to study whether that could improve generalization. We varied the timesteps of the ODE models via the integrator, while for the DeltaGN we provided the timestep as input to the model. Fig. 3b shows that the DeltaGN is on par with the other two models within the training range, but becomes worse outside this range. Note, predictions for long timesteps () become very inaccurate for all learned models, as well as the true Hamiltonian, which suggests that the models are determining the dynamics on the basis of features only appropriate for shorter timesteps. Results for other integrators in Supp. Sec. D.2.
Generalization across integrators
We varied the integrators used by the model to examine how the quality of the integrator interacts with our different models. Figs. 3cd show that for lower order integrators (RK13), the HOGN and true Hamiltonian have higher rollout and energy errors than the nonHamiltonian models (DeltaGN and OGN). We speculate the nonHamiltonian models are more accurate because they are not obligated to respect the constraint the Hamiltonian imposes between the ’s and ’s dynamics, and can instead, for instance, learn approximations to the time derivatives which are not consistent with the and gradient vector fields of any Hamiltonian.
However, Fig. 3e shows that when the OGN is tested on integrators different from those on which it was trained, its performance is always significantly worse. This indicates that the OGN is not learning accurate , but rather some other function, which interfaces with the trainingtime integrator to produce accurate trajectories, but is inappropriate for other integrators. By contrast, Fig. 3f shows that the HOGN model usually generalizes to greater accuracy when used with higher order integrators at test time, and specifically when trained with RK4 (dark purple line in Fig. 3f), perfectly matches the behavior of the true Hamiltonian (grey line in Fig. 3c).
Greydanus et al. (2019), in their Hamiltonian Neural Networks, use finite differences to approximate the derivatives of and when not available analytically, use those approximated values during training to supervise the Hamiltonian gradients, and then generate test trajectories with an RK4 integrator. In our experiments this is analogous to training the HOGN using RK1 and testing on RK4, and indeed, our findings are generally consistent with theirs, but also indicate that directly learning through higher order integrators is generally better.
5 Discussion and future work
We incorporated two physically informed inductive biases—ODE integrators and Hamiltonian mechanics—into graph networks for learning simulation, and found they could improve performance, energy, and zeroshot timestep generalization. We also analyzed the impact of varying the ODE integrator independently between training and test, and found that the Hamiltonian approaches benefit most from training with an RK4 integrator, while nonHamiltonian ones can learn more accurate predictions when trained on lower order integrators. However the HOGN generalizes better across integrators, and approximates the true Hamiltonian best in terms of performance and energy accuracy.
Similar to the graph network’s inductive bias for representing complex systems as nodes interacting via edges, the ODE integrator and Hamiltonian inductive biases are informed by physics, but are not specific to physics. Our approach may be applicable to many complex multientity systems which can be modeled with ODEs or have some notion of canonical position and momentum coordinates. Moreover, there are other physically informed inductive biases to explore in future work, such as imposing separable potential and kinetic energy terms, Lagrangian mechanics, timereversibility, or entropic constraints.
Acknowledgments
We acknowledge Irina Higgins, Danilo Rezende, Razvan Pascanu, Shirley Ho and Miles Cranmer for helpful discussions and comments on the manuscript.
References
 Battaglia et al. (2016) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS 2016), 2016.
 Chang et al. (2017) Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional objectbased approach to learning physical dynamics. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
 SanchezGonzalez et al. (2018) Alvaro SanchezGonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
 Mrowca et al. (2018) Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li FeiFei, Joshua B. Tenenbaum, and Daniel L. K. Yamins. Flexible neural representation for physics prediction. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
 Li et al. (2018) Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566, 2018.
 Seo and Liu (2019) Sungyong Seo and Yan Liu. Differentiable physicsinformed graph networks. arXiv preprint arXiv:1902.02950, 2019.
 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
 Rubanova et al. (2019) Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent odes for irregularlysampled time series. CoRR, abs/1907.03907, 2019. URL http://arxiv.org/abs/1907.03907.
 Greydanus et al. (2019) Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. CoRR, abs/1906.01563, 2019. URL http://arxiv.org/abs/1906.01563.
 Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro SanchezGonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, pages 1–38, 2018.
 Candy and Rozmus (1991) J Candy and W Rozmus. A symplectic integration algorithm for separable hamiltonian functions. Journal of Computational Physics, 92(1):230–256, 1991.
 sym (2015) Test of 3rdorder vs 4thorder symplectic integrator with strange result. Computational Science Stack Exchange, 2015. URL https://scicomp.stackexchange.com/q/20535.
Appendix A Data generation
a.1 Initial conditions
Each particle had initial conditions (all in figures in international system of units) drawn from independent uniform distributions:

Mass

Spring constant

Initial position

Initial velocity ,
a.2 Physics simulator
Our dataset consists of simulations for particlespring systems where particle exerts a force on particle (Hooke’s force) where is the spring constant, calculated as .
Trajectories were generated using RK4 with a generation time step of 0.005s, with trajectories of up to 4s in length.
a.3 Dataset
The datasets were obtained by subsampling the trajectories generated with the physics simulator at different time intervals.
We generated data for systems with between 2 and 15 particles, although we only used data with 4, 5, 6, 8 and 9 particles for training.
The training dataset consisted of 10000 onestep pairs of states (for each number of particles) separated by either a fixed time step of 0.1 s, or by a random time step drawn uniformly from the s interval and rounded to the nearest exact multiple of the generation timestep. In the latter case, to avoid having discrete values of the time step, instead of using a fixed generation timestep of 0.005 s, we used random values form the interval 0.005 s 10%. Each pair was sampled from a different randomly generated trajectory. Additionally we generated 1000 validation and 1000 test pairs for each number of particles.
We also stored validation and test trajectories of length 20 sampled at 0.005, 0.01, 0.03, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5 seconds (1000 trajectories for each number of particles and time step). All results presented in the paper correspond to the full test dataset of 20step test trajectories for systems with 4, 5, 6, 8 and 9 particles.
Appendix B Graph Network
Each of the MLPs used on the edge model, node model and global model of the graph network, have output sizes of [64, 64], and softplus activations after each layer including the last one. The graph network always uses sum as the aggregation function for edges and nodes.
The output of the network is obtained by applying an additional nonactivated linear layer with the desired output size to the output globals (HOGN, output size of 1 to represent a single scalar value per graph) or to the output nodes (OGN and DeltaGN, output size of 4 to represent values associated to 4 canonical coordinates for each node/particle).
We chose softplus because relu activation did not work well with the Hamiltonian model. We hypothesize this is a consequence of Eq. 7: a function approximator with relu activation is a piecewise linear function (linear almost everywhere), so after taking the derivatives of the output of the network with respect to its inputs, it becomes a function that is constant almost everywhere, and the gradientbased optimization process struggles to optimize it. This is consistent with Greydanus et al. [2019] where they find better results for tanh activation than for relu activation. In our case softplus seemed to produce better results than tanh.
We always remove the mean position of the objects from the inputs to the GraphNetwork. This does not prevent the models from predicting correct absolute positions since all dynamics are translation invariant and models (Including RK integrators) only predict relative differences with respect to the previous state. This allows the model to produce correct predictions even when particles have travelled far away from their initial positions.
Also, because the ground truth Hamiltonian in spring systems is timeindependent (it is a conservative system), we omit feeding absolute time to the GraphNetworks to avoid unnecessary dependencies.
Appendix C Training details
We used TensorFlow and the DeepMind Graph Nets library to build our neural networks, calculate the analytic gradients of the Hamiltonian, and train our models. We trained with a batch size of 100, for a million training steps. We used AdamOptimizer to minimize the loss.
Due to the different nature of the models (which may require different learning rates) we ran each of the models with 13 initial learning rates uniformly spanning an interval between and in log scale. Learning rate was decayed exponentially at a rate of 0.1 every , with a lower limit of .
For each model, we report median values and minmax range for the 4 learning rates with the smallest rollout error. In practice the variance across seeds given the same learning rate was negligible (except on generalization results).
Rollout error is defined as the RMS position error averaged across all examples, dimensions, particles, and sequence axis. Energy error is calculated as the RMS of the deviation between the mean energy of a trajectory and the initial energy (normalized by the initial energy, to yield relative errors), averaged across all examples.
Appendix D Additional results
d.1 Symplectic integrators
We attempted using symplectic integrators of first order (Symplectic Euler, S1), second order (Verlet Integrator, S2) and third order (S3) with coefficients as described in Candy and Rozmus [1991]^{3}^{3}3We also tried a fourth order symplectic integrator, as described in Candy and Rozmus [1991], however it did not improve results, possibly due to the linear nature of the Hooke’s forcesym [2015].. It is worth noting that these integrators are only symplectic for separable Hamiltonians , or equivalently, for systems for which and . However, we do not impose this constraint, neither to the HOGN (by adding the scalar output of two independent networks, with and inputs respectively) nor the OGN (by only feeding to the network that outputs and vice versa), so there are not any guarantees that the learned models will correctly preserve energy for any Hamiltonian.
Fig. D.1a shows the predictive accuracy including results for models trained with symplectic integrators. Contrary to the RK integrators, in this case the HOGN seems to be able to produce lower errors than the true Hamiltonian even for low order integrators (S1 and S2), in a similar way to how the OGN was able to produce very low errors for low order RK integrators. While this seems like an advantage of using symplectic integrators it also indicates that the HOGN model is not really learning something close to the true Hamiltonian in these cases, and, as we will see in the next section, it will cause worse generalization to unseen test timesteps. Similarly, Fig. D.2d shows that the models trained with the S1, S2 integrators do not generalize at all to higher order symplectic integrators, and only the model trained with S3 shows good generalization across all integrators that is comparable to the generalization across integrators of the model trained with RK4 (Fig. D.2b) and the behavior of the true Hamiltonian (Fig. D.1a).
We speculate these differences in accuracy between training the HOGN with the RK integrators and the symplectic integrators are related to fundamental differences between the algorithms behind the two types of integrators. Firstly while the output of RK integrators is usually a weighted average of the results obtained across several internal iterations for intermediate timesteps, the symplectic integrators feed each internal iteration with just the output of the previous internal iteration, and output just the result of the very last internal iteration. This means that the trained models can learn to encode additional information in the outputs of the intermediate internal iterations, which may allow the neural networks to distinguish whether the gradients are being evaluated for the first, the second, etc. or the final internal iteration. Secondly, in RK integrators, the derivatives of the position and momentum are always evaluated at the same point in each internal iteration, but in the symplectic integrators the derivatives of the position and the momentum are evaluated at different points, in alternating fashion. This has two consequences, the first consequence is that an th order symplectic integrator performs network evaluations, while a th order RK integrator performs only network evaluations, as it shares the evaluations of the position and momentum gradients at the same point. The additional network evaluations may explain why apparently lower order symplectic integrators, when used with learned models, can produce lower errors (e.g. S2 vs RK2). The second consequence is that, if as speculated before, the model can identify which calls corresponds to which internal iterations, it may also identify when it is being evaluated to produce gradients for the momentum or for the position. This would allow the model to get around the Hamiltonian constraint (that used to force it to produce a position and momentum vector field of gradients consistent with a common Hamiltonian), and allow it to start behaving more similarly to the OGN, where the derivatives of the position and momentum can be fully independent.
With respect to energy conservation, Fig. D.1b, shows that models trained with higher order symplectic integrators (S1 and S2) preserve energy much better than the RK integrators, and, as mentioned in the main text, allow the HOGN to preserve energy much better than the OGN. It is also worth noting, that the HOGN trained with RK2, RK3 and RK4 integrators also improves in energy conservation when integrated with symplectic integrators (Fig. D.3b), confirming that the models trained with high order RK integrators learn something very similar to to the ground truth Hamiltonian that even generalizes to a different family of integrators.
d.2 Time generalization for other integrators
Figs. D.4, D.5, and D.6 show that for RK integrators of lower order (RK1RK3) the OGN overfits heavily to the timestep used during training, however, the HOGN presents consistent generalization to timesteps not seen during training, yielding approximately similar errors. This again is evidence that the HOGN learns something more consistent with a Hamiltonian, and that in fact follows more closely the true Hamiltonian than the OGN.
In the case of symplectic integrators, we observe that even the HOGN overfits to the timestep used during training when using first order (S1) and second order (S2) integrators (Figs. D.7 and D.8), and starts generalizing well only with the third order (S3) integrator (Fig. D.9), although even for S3 the generalization to longer (> 0.2) timesteps is comparatively worse than with RK4 (Fig. 3ab). This is consistent with the previous discussion (Section D.1) in that the HOGN is not really learning an accurate Hamiltonian when trained with a symplectic integrator.