A Domain Specific Language for Performance Portable Molecular Dynamics Algorithms

# A Domain Specific Language for Performance Portable Molecular Dynamics Algorithms

William Robert Saunders Department of Mathematical Sciences James Grant Department of Chemistry Eike Hermann Müller
###### Abstract

Developers of Molecular Dynamics (MD) codes face significant challenges when adapting existing simulation packages to new hardware. In a continuously diversifying hardware landscape it becomes increasingly difficult for scientists to be experts both in their own domain (physics/chemistry/biology) and specialists in the low level parallelisation and optimisation of their codes. To address this challenge, we describe a “Separation of Concerns” approach for the development of parallel and optimised MD codes: the science specialist writes code at a high abstraction level in a domain specific language (DSL), which is then translated into efficient computer code by a scientific programmer. In a related context, an abstraction for the solution of partial differential equations with grid based methods has recently been implemented in the (Py)OP2 library. Inspired by this approach, we develop a Python code generation system for molecular dynamics simulations on different parallel architectures, including massively parallel distributed memory systems and GPUs. We demonstrate the efficiency of the auto-generated code by studying its performance and scalability on different hardware and compare it to other state-of-the-art simulation packages. With growing data volumes the extraction of physically meaningful information from the simulation becomes increasingly challenging and requires equally efficient implementations. A particular advantage of our approach is the easy expression of such analysis algorithms. We consider two popular methods for deducing the crystalline structure of a material from the local environment of each atom, show how they can be expressed in our abstraction and implement them in the code generation framework.

keywords: Molecular Dynamics, Domain Specific Language, Performance Portability, Parallel Computing, GPU

## 1 Introduction

Molecular Dynamics (MD) codes such as NAMD [1, 2], LAMMPS [3], GROMACS [4, 5] and DL-POLY [6, 7] are important computational tools for understanding the fundamental properties of physical, chemical and biological systems. They can be used to verify phenomenological theories about atomistic interactions, understand complex biomolecules [8] and self assembly processes [9], replace costly laboratory experiments and allow access to areas of parameter space which are very difficult to reproduce experimentally. For example, simulations can be run at high pressures and temperatures found in stellar atmospheres [10], or for dangerous substances, such as radioactive materials (see e.g. [11]). Classical MD codes simulate a material by following the time evolution of a large number of particles which obey the laws of classical physics (in particular Newton’s laws [12]) and interact via phenomenological potentials. To extract meaningful information, the state of the system (i.e. the distribution of particle positions and velocities) has to be analysed, for example by calculating pairwise distribution functions. Information on the crystalline structure of a material can be derived by inspecting the local environment of each particle [13, 14, 15].

In order to study systems at physically relevant length- and timescales and to produce statistically converged results, modern codes typically run in parallel on state-of-the art supercomputers [2]. With the recent rise of novel manycore chips, such as GPU and Xeon Phi processors, several popular MD simulation packages have been successfully adapted to those new architectures, see e.g. [16, 17, 18, 19, 20, 21]. However, developers of MD codes face significant challenges: adapting and optimising existing codes requires not only a deep understanding of the physics and chemistry of the simulated system, but also detailed knowledge of the rapidly evolving hardware. To name just a few complications, GPUs have a complex memory hierarchy (host/device memory, shared memory and local registers) and any data access has to be coalesced to avoid unnecessary data movement. Write conflicts have to be avoided in threaded implementations on manycore chips and recent CPUs, such as the Intel Haswell and Broadwell chip, only run at peak performance if the code can be vectorised. Since in practice it is rare for a chemist/physicist to possess the skills for optimising code on this level, it can be very challenging to port MD software to a new architecture and maintain its performance in a rapidly evolving hardware landscape. To address this fundamental issue, we describe an approach based on the idea of a “Separation of Concerns” between the domain specialist and scientific programmer. By using a suitable abstraction, both the scientific capabilities and computational performance can be improved independently.

##### DSLs for grid-based PDE solvers

Very similar issues have been faced by developers of grid-based solvers for partial differential equations (PDEs). The key observation there was that the fundamental and computationally most expensive operations can be expressed in terms of a suitable abstraction: the algorithms (e.g. explicit time stepping methods or iterative solvers for elliptic PDEs) can be formulated as the repeated iterations over a set of grid entities (cells, vertices, faces, edges), each of which can hold information, such as a local field value. This expression of the algorithm in a Domain Specific Language (DSL) simplifies the implementation significantly: once the domain-specialist has expressed the code in terms of those basic operations at the correct abstraction level and encapsulated any data in the corresponding fundamental data structures, a computational scientist can implement and optimise the code on a particular architecture.

By introducing the correct abstraction, only a small set of typical loops, which can be parametrised over the set of input and output data, has to be considered. This concept has been applied very successfully in the development of the performance-portable OP2 library [22, 23], which allows the execution of finite element and finite volume codes on a range of architectures. As demonstrated in [24, 22, 23, 25], the code achieves excellent performance on CPUs, GPUs and Xeon Phi processors. Similar techniques for structured grids have been used to develop the C++ based STELLA grid library for the COSMO numerical weather forecast model [26]. DSLs for highly efficient stencil computations on GPUs have also been described in [27, 28].

Recently OP2 was re-implemented in Python as the PyOP2 [29] framework. In PyOP2 the science user specifies the computationally most expensive operations as a set of small kernels written in C. Using code generation techniques, those kernels are then compiled and executed on a particular architecture. By employing just-in-time compilation, the kernels are launched from a high-level Python code which implements the overall solver algorithm. The performance of the resulting code is on a par with that of monolithic Fortran- or C- implementations.

##### A new DSL for MD simulations

In this paper we describe a similar DSL approach for molecular dynamics simulations. The fundamental operation we consider is a two-particle kernel: the user implements a short C-code which is executed for each combination of particle pairs in the simulation. This kernel can modify any properties stored on those particles. A classic example is the force calculation: for each pair of particles, the force (output) is calculated as a function of the two particle positions (input). This local operation can be expressed in a few lines of C-code. The code is then executed over all particle pairs, using the optimal algorithm for a particular hardware and the nature and size of the problem. For example, on a CPU architecture, cell-list or neighbour-list methods can be used, whereas on GPU a neighbour-matrix approach as in [30] might be more suitable. Those details of the kernel execution, however, are of no interest for the science developer who can focus on (i) the implementation of the local kernel and (ii) the overall algorithm which orchestrates the kernel calls in an outer timestepping loop.

To achieve this we developed a Python-based code generation system which creates and compiles fast, architecture dependent wrapper code to execute the C-kernel over all particle pairs. Our approach is shown schematically in Fig. 1.

By using Python as a high-level language, looping algorithms such as the Velocity Verlet method [31] (see also e.g. [32, 33]) for timestepping or advanced thermostats [34, 35] can be implemented very easily, while still generating fast code for the computationally expensive particle loops.

In the following we describe a proof-of-concept implementation of the DSL and concentrate on short-range two-particle kernels, i.e. kernels which are only executed for particles which are separated by no more than a specified cutoff distance. We demonstrate that for a Lennard-Jones benchmark we achieve performance similar to state-of-the-art simulation tools such as DL-POLY and LAMMPS.

While many atomistic models require the calculation of long range forces and intra-molecular interactions, systems containing only short range interactions remain actively studied, particularly in problems in soft matter and nucleation see e.g. [36, 37]. In a separate paper [38] we report on the implementation of a particle-Ewald method [39] for electrostatic forces in our framework. As discussed in Section 6, more advanced long range algorithms and further generalisations of the framework to support multiple species and bonded interactions for molecules will be implemented in the future.

We stress, however, that our approach is not limited to force calculations. To extract meaningful information from a simulation, the results have to be analysed. With growing problem sizes and data volumes, this step becomes computationally expensive and requires efficient and parallel implementations. Below we consider two methods for analysing local environments which can be used to classify the crystalline phase of a material: the bond order analysis in [13] and common neighbour analysis in [14] (see also [15] for an overview of other analysis methods). In the traditional approach, the user would run the simulation with an existing MD package and then write post-processing code to extract physically meaningful information from the output. However, in contrast to the MD code itself, parallelising this analysis code or porting it to a different architecture is often too time consuming to be feasible. As we will demonstrate below, the fundamental kernels for various common analysis methods can be expressed in our framework. This implies that optimised and parallel code is automatically generated for this important stage of the simulation workflow.

A high-level approach for introducing new algorithms to existing MD packages has been realised in the PLUMED [40] and MIST [41] libraries. They are written as plug-ins to well-established codes and introduce free energy methods and alternative integrators respectively. However, this approach still requires the underlying MD code to be implemented efficiently in the first instance. Similar high-level Python interfaces are provided by OpenMM [42] and HOOMD-blue [17]; in those two cases the underlying code is part of the package itself. Using these interfaces both OpenMM and HOOMD-blue allow the user to control a simulation and access available particle data through calls to the underlying library. A Python based DSL for MD simulations is described in [43]: the Molecular Dynamics Language (MDL) provides data structures for particle vectors and allows the easy construction of new integrators via Python classes. It also provides an interface to existing algorithms from the ProtoMol packages and support for reading MD configuration file formats. The main purpose of MDL is to provide a scripting environment for rapid prototyping of new timestepping algorithms. Although there is support for MPI parallelism, the main focus is not on performance or portability. While using optimised C++ implementations from ProtoMol, in contrast to our approach there is no code generation.

Many MD libraries support the implementation of custom interactions by either providing a mechanism that interpolates tabulated values to produce a potential, or a plugin system that allows users to write and compile extensions that implement the desired interaction. The OpenMM Python interface allows a custom potential to be described in symbolic form. Based in this, OpenMM will automatically generate GPU code by using symbolic differentiation and code generation. The resulting code is compiled at runtime through the OpenCL compiler.

However in all cases (with the exception of kernel code generation in OpenMM) the primary aim of the provided Python interface is to simplify access to functionality in an underlying C++ or Fortran code, i.e. Python acts as a “glue” for combining existing functionality. If a desired simulation or technique cannot be described within the Python interface for the library, the user needs to program extensions for the specific MD package. In contrast, our approach is more invasive and allows the expression of both the high-level algorithm and low level kernel in one code. We support general kernels, which are not restricted to force calculations that can be expressed in mathematical form.

##### Structure

This paper is organised as follows: in Section 2 we introduce the fundamental abstractions and data structures used in our approach. The implementation of the abstractions in a Python library and code generation techniques for different architectures are discussed in Section 3. In Section 4 we show how fairly complex structure analysis techniques based on bond order- and common neighbour- analysis can be expressed in our abstraction and explain how they can be added to the simulation. To demonstrate the performance of the generated code, we compare runtime and scalability to other popular MD packages both on MPI-parallel clusters and for GPUs in Section 5. Here we also show output of the structure analysis algorithms described in Section 4. We conclude and outline ideas for further developments in Section 6.

## 2 Abstraction

We begin by formulating the key operations which are required to develop a generic MD code. If the domain specialist (computational physicist or chemist) can express their algorithms in terms of those operations, then the code can be implemented in a performance portable way in the high-level Python framework described in Section 3.

Throughout this paper we assume that we want to simulate and analyse a collection of particles. Let each particle with global index have a set of properties such that is the value of the -th property on particle . Each particle has exactly properties, i.e. . Properties can, for example, be the particle’s position and momentum vector, its charge or the particle index. In addition there can be global properties with . Typical global properties might be the total kinetic energy or the radial distribution function (represented as a vector with entries which count the average number of particles in each distance interval ).

Operations which involve one or more particles are described in the following three definitions:

###### Definition 1.

A Particle Loop is an operation which for each particle reads properties with and writes properties with . The operation can also read global properties with and write with such that the final value of these global properties is independent of the order in which it loops over the particles.

###### Example 1.

Kinetic energy calculation. To calculate the total kinetic energy, we loop over all particles and add to the global variable . The particle properties considered in this example are the mass and the three components , of the particle’s velocity vector .

###### Definition 2.

A Particle Pair Loop is an operation which for all particle pairs reads properties and with and modifies properties with such that the result is independent of the order of execution. The kernel can also read global properties with and write with such that the result does not depend on the order in which the loop is executed over all particle pairs.

###### Example 2.

Force Calculation. The most obvious example of a Particle Pair Loop is the force calculation. Here each particle has six relevant properties, namely the three entries of its position vector and the three entries of the force exerted on the particle by all other particles. For each particle pair the total force on the first particle is incremented by the interaction force which depends on the relative position of the particles, i.e. the three position properties for are read and the three force properties are incremented as .

###### Definition 3.

A Local Particle Pair Loop is a Particle Pair Loop which is only executed for particles which are separated by no more than a specified cutoff distance .

###### Example 3.

Local environment. Suppose that each atom can be in one of two possible states. For every atom we want to count the number of other atoms in the same state which are up to a distance away. In this case each particle would have five properties, namely the three entries of the position vector, the state of the atom and the number of atoms in the same state in the local environment. For each pair of atoms the Particle Pair Kernel would first check whether they are less than apart by calculating the distance between the particle positions. If this is the case, and both particles are in the same state, the counter for the number of same-state atoms is increased.

Further examples will be given in Section 4 where we show how the bond order analysis in [13] and a common neighbour analysis [14] can be expressed as Particle- and Particle Pair- Loops. The Particle Pair Loop can be easily generalised to a loop involving particles for multiparticle forces.

Note that the computational complexity of a Local Particle Pair loop is where is the average number of local neighbours. Since, for constant density , the number is constant and relatively small, the computational complexity is and therefore significantly smaller than the complexity of a Particle Loop.

##### Comment on Newton’s third law

For most physically relevant interactions the force on the first particle of the pair is equal and opposite to the force acting on the second particle. Hence, instead of looping over all unordered pairs , one could also only loop over the ordered pairs with , calculate the force once and update it on both particles. Naively this should lead to a speedup of a factor of two. However, it introduces write conflicts in a (shared memory) parallel implementation. While those can be avoided by adding suitable atomic statements or using a colouring approach, the more serious issue is that it prevents automatic vectorisation. When writing back to memory, the compiler has to assume that there could be aliasing between particle data (from the compiler’s point of view two of the neighbours of each particle could be identical), and will not generate vectorised code. This can be overcome by suitable clustering of the neighbour lists [44] or blocking of the pair lists [45] and explicit vector load/store operations. Note, however, that the authors of [44] use architecture dependent vector instructions in their kernels, which we want to avoid to achieve portability.

Here we do not use any of those approaches and rely on automatic vectorisation, which works well if we only write to the first particle in each pair. In summary we observe that the factor of two which could be gained by using Newton’s third law is more than offset by the advantages of vectorisation and we find that the code is faster overall if we loop over all ordered pairs and only write to the first particle. As will be demonstrated in Section 5.1, for short range forces we achieve equal or better performance than other common MD packages. If necessary, it would of course be possible to implement a version of the pair looping mechanism which exploits Newton’s second law in our code generation framework and improvements such as those described in [44, 45] could be considered in future extensions.

## 3 Implementation

The operations identified in the previous section are the computationally most expensive components of an MD simulation. We now describe their efficient parallel implementation in a code generation framework. From the discussion above it should be clear that our framework will have to provide (1) data structures to represent particle properties as well as global properties and (2) mechanisms for executing Particle- and Particle Pair-Loops. The following choices are inspired by the PyOP2 [29] data structures and execution model. An implementation of the framework described in this section can be found at:

https://bitbucket.org/wrs20/ppmd

All results in this paper were obtained with the release available as [46].

### 3.1 Data structures

Particle properties are represented as instances of a ParticleDat class. This class is a wrapper around a two-dimensional numpy array, where the first index labels the particle and the second corresponds to the property index . Similarly we provide storage for global data shared by all particles in a ScalarArray class.

For convenience and to support different data types, we do not collect all properties into a single ParticleDat (or ScalarArray), but rather allow several ParticleDats and ScalarArrays instances which can be named by the user. For example, consider a simulation with particles which have three dimensional position and momentum vectors and a species index . We also store the total kinetic- and potential energies . This set of local and global properties would be implemented as shown in Listing 1.

The underlying numpy array can be accessed as the ParticleDat.data property; however the “getitem” and “setitem” methods have been overloaded to automatically mark the ParticleDat as “dirty” if the internal data has been modified directly by the user. This is important in parallel implementations based on a domain decomposition approach, where data owned by neighbouring processors is duplicated in a “halo” region. If “dirty” data is used subsequently in a loop, a exchange of halo data will be triggered automatically and ensures that data is consistent between processors. The interface to the stored data is identical for both CPU- and GPU- ParticleDat data structures. When accessing data stored in a ParticleDat stored on the GPU in device memory, “getitem” and “setitem” calls will automatically trigger data copies between host- and device-memory. The correct architecture is chosen at the beginning of the Python script by setting aliases for the appropriate objects as shown in Listing 2.

### 3.2 Particle Pair Loops

In addition to data structures, an execution model is required to launch the computational kernel over all particle pairs. For this, the user writes a brief C-kernel which describes how the properties of the two particles involved in the interaction are modified. In addition, the ParticleDats which are operated on have to be passed explicitly to the pair looping mechanism. For each ParticleDat an access descriptor describes whether the property is read from or written to. The allowed access descriptors are READ (property is only read), WRITE (property is only written to), RW (property is read and written), INC (property is incremented) and INC_ZERO (identical to INC except the values are set to zero before the kernel is launched); see also Tab. 3 for a summary. Since the code generation system does not inspect the C-kernel provided by the user, this information allows the looping system to handle read- and write- access to particle properties in a parallel setting. For example, in a distributed memory implementation, before the execution of the loop halo regions have to be updated for all variables which have a READ access descriptor. Similarly, if a particle has WRITE or INC access, in a threaded implementation write conflicts have to be avoided by generating atomic write statements or employing suitable colouring (see for example the layer algorithm described in [47]). In addition to ParticleDats, global variables (represented as ScalarArrays) can be passed to the kernel with the same access descriptors. To treat numerical constants which do not change during the kernel execution, each kernel can also be be passed a list of Constant objects. Any instances of Constant variables in a kernel are replaced by their numerical values at compile time; this allows the compiler to make additional optimisations, for example by exploiting static loop bounds.

As an (fictitious) example, imagine that on each particle we store the properties (which has components) and (which has one component). For all particles we carry out the operation which calculates

 b(i)=∑all pairs (i,j)d−1∑r=0(a(i)r−a(j)r)2 (1)

and updates the global sum

 Sg=∑all pairs (i,j)d−1∑r=0(a(i)r−a(j)r)4. (2)

A Particle Pair loop which performs this operation can be implemented as shown in Listing 3. The execution over all particle pairs is illustrated schematically in Fig. 2.

Inside the Particle Pair Loop the two involved particles are accessed as the .i and .j component of a structure, and the names of the ParticleDats are given in the dictionary which is passed as the second argument to the PairLoop constructor. For example, the -th component of the first particle is accessed as a.i[r]. This C-variable automatically points to the correct position in the numpy array which holds the ParticleDat values. Particle Loops are conceptually very similar and can be implemented in the same way. While the simple example above aims to illustrate the key concepts of our approach, we also describe the implementation of a complete Lennard-Jones benchmark with Velocity-Verlet integrator in Section 5. The C- and Python-code for executing the force calculation in this case is given in Listings 9 and 10 in A.1.

We note that the code in Listing 3 resembles what would be written in PyOP2 to implement a loop over a set of mesh entities. In PyOP2 the fundamental data types are called Dat and GlobalDat. A Dat object represents data which is associated with topological entities of the mesh, for example the average value of a field in each grid cell. A GlobalDat variable contains globally available data. The main difference is that PyOP2 loops over a particular static set of topological entities and can access data on other related entities which are specified via indirection maps. Those indirection maps are provided as additional arguments to the Dat dictionary of the looping class. An important difference is that the indirection maps in PyOP2 have a fixed “arity”, i.e. each unknown depends on a fixed number of other unknowns. In contrast, in an MD code, the number and identity of nearest neighbours of each particle varies throughout the simulation. In a parallel MD code the distribution of particles over processors also changes over time, and this requires additional parallel communication.

### 3.3 Domain Specific Language

The key Python classes for representing MD specific data objects in our embedded DSL are summarised in Tab. 1. The looping classes which are used to modify those fundamental objects according to the Particle Loop and Particle Pair Loop operations defined mathematically in Section 2 are given in Tab. 2. Valid access descriptors are listed in Tab. 3. For clarity instances of fundamental Python types are coloured in blue, the DSL specific classes are shown in orange and instances of those classes in red. The semantics of the language have been explained in the preceeding sections. The code strings used in the Kernel objects have to be legal C-code, and the particle properties can be accessed as described in Section 3.2.

While the spectrum between pure DSLs (such as the Unified Form Language [48]) and APIs (such as, for example, the BLAS/LAPACK libraries [49, 50]) is somewhat fluid, we argue that our approach does represent an (embedded) DSL since:

1. It allows the expression of domain-specific mathematical operations (Particle- and Particle Pair loops) for the fundamental data objects (= particle properties).

2. It is relatively complete in the sense that it allows the expression of key operations in MD codes; it is not restricted to the composition of high-level operations such as calls to pre-defined force terms.

3. The user has full low-level control in the sense that they can directly manipulate the fundamental data objects in the C-kernel; this allows the implementation of complex force calculations or analysis algorithms.

In this sense it differs from other, more scripting-like approaches such as the PLUMED [40] or MIST [41] libraries which mainly provide high-level APIs to existing MD packages.

### 3.4 Code generation for performance-portability

To execute a pairloop we use a code generation approach. Given the kernel and information on how data is accessed, appropriate wrapper C-code for launching the kernel over all particle pairs is generated for a particular hardware backend. This means that to target different architectures, the user has to write the kernel code only once: it is up to the code generation system (developed by a computational scientist) to execute this on a specific architecture. The implementation generates C code by first inserting the user written kernel into a pre-made template for the specified looping type, then for each passed ParticleDat or ScalarArray C code is added that matches the specified access descriptor. The result of the code generation stage is a C function which is subsequently compiled into a shared library using the C compiler defined by the user. The shared library is then loaded by the framework using the ctypes Python module such that it may be called directly from the Python code.

Note that the user never has to explicitly add calls to MPI routines or guarantee the correctness of the results on a threaded architecture by protecting write statements with “atomic” or “critical” keywords.

On a particular architecture different pair looping mechanisms (described below) lead to the same scientific result but can have different computational performance. Our method allows the straightforward comparison between different looping mechanism without the user intervention to modify code, a feature that could potentially be exploited to optimise performance on a problem-by-problem basis. Since the system is aware of data dependencies between different kernel, loop fusion to reduce the amount of data movement could be implemented to further improve performance.

On a sequential machine, the simplest possible wrapper code is shown in Listing 4.

The computational complexity of this nested loop is and for short range kernels this method would be extremely inefficient. In the following we describe more advanced looping mechanisms for executing Local Particle Pair Loops on parallel architectures.

### 3.5 Cell based methods for Local Particle Pair Loops

If we only consider Local Particle Pair kernels with a fixed cutoff , the computational complexity is reduced to and it is possible to use cell based looping methods (see [51] for an introduction). In this approach the physical domain of size is divided into small cells of size such that ; to simplify the presentation, we assume in the following. At a given point in time every particle can be uniquely associated with one of those small cells. The local Particle Pair loop with cutoff can then be executed by visiting all cells in an outer loop and then iterating over all 26 neighbouring cells . Since it is then sufficient to consider pairs of particles such that and .

For each particle this algorithm considers potential interactions with other particles in a volume . However, most of these pairs will be separated by a distance . To avoid unnecessary execution of the kernel for non-interacting particles, it is possible to add another preprocessing step which loops through all potential pairs and only stores those which are a distance of up to away. Interactions can then be calculated by looping through this neighbour list. In three dimensions this reduces the cost of the force calculation by up to a factor . The computational overhead for building the neighbour list is usually amortised by the gain in the force calculation.

For both pair looping mechanisms described above, different ParticleDats can no longer be considered independently, but rather have to be seen as members of a State object which also stores the shared cell- and neighbour lists. One particular ParticleDat in this State object stores the particle position, and this information is required when building the cell- or neighbour-lists. To distinguish it from other properties such as velocity and acceleration, a special derived class PositionDat is used. As shown in Listing 5, all ParticleDats in a simulation have to be associated with a State object by setting (user-defined) properties of the state as

A.PROPERTY = ParticleDat(...).

Each state also contains a domain object, which stores information about the physical domain size and boundary conditions.

During the simulation, particles will move between cells and hence if the cell- and neighbour lists need to be rebuilt at every iteration, which can be very expensive. This can be avoided by increasing the cell size and choosing an extended cutoff : if the relevant interaction range is and is the maximal particle velocity, a rebuild of the cell- and neighbour lists is only necessary every time steps if

 ¯¯¯rc=rc+2n⋅δt⋅vmax=rc+δ (3)

where is the time step size. For time integration loops we further provide the IntegratorRange class, which allows timestepping methods to be implemented in a way that retains the simplicity and flexibility of a standard Python range based loop without explicit cell- and neighbour list rebuilds by the user. An example is shown in Listing 6. In addition to the number of integration steps, IntegratorRange is passed the following information:

• the timestep size ,

• a ParticleDat containing particle velocities,

• a maximum reuse count and

• the thickness of the additional shell.

#### 3.5.1 Parallelisation

To simulate the interactions of a very large number of particles in a reasonable time, MD codes have to be parallelised. Modern HPC installations expose parallelism on different levels and the implication of this complex hierarchy on MD implementations will be discussed in the following.

##### Distributed memory

The cell-based methods described above can be parallelised with a standard domain-decomposition approach. For this the global domain is split up into smaller subdomains stored on each processor. To correctly include interactions with particles stored on neighbouring subdomains, a layer of halo cells is added. Those cells hold copies of particles which are owned by other processors. Data in halo cells needs to be updated whenever this data changes. Note, however, that this is only necessary if the values of a particular ParticleDat are actually read in the loop, and this information is made explicit via the access descriptors passed to the pairloop. Our code generation system will therefore only launch the corresponding parallel communication calls if necessary. This guarantees the parallel correctness of the code while avoiding superfluous and expensive parallel communications. Since particles can move to different processors and hence the data layout is not fixed, there are actually two parallel communication types which need to be carried out:

1. Data on particles in the halo region has to be updated if it has changed and is used in a pair loop.

2. Particles which leave the local domain need to be moved to a different processor.

The first operation typically needs to be performed whenever dirty data is read. For example halo exchanges on particle positions are required before every force update. The second communication type only needs to be performed every steps, since the increased cut off in Eq. (3) ensures the accuracy of the calculation even if a particle leaves the cell during those steps. In addition to rebuilding the cell list, when a particle has left the subdomain owned by a processor, the State object will automatically move all data owned by the particle to the receiving processor.

Virtually all modern supercomputers now consist of a large collection of relatively complex compute units (CPUs, GPUs or Xeon Phis) organised into nodes. While parallelisation between nodes is achieved with the distributed memory approach described above, each node consists of a large number of compute cores which have access to the same memory. Parallelisation across those cores on a node requires a different approach which will be described in the following section. To make use of the full machine, a hybrid approach which combines both parallelisation strategies is typically used.

##### Threading and GPU parallelisation

To reduce memory requirements, in a sequential implementation (or if the code is parallelised purely with a distributed memory approach), the cell-list is stored as a linked list and the neighbour list is realised by storing all neighbours in a long array. This prevents any further shared memory parallelisation based on threading since neither the cell-list nor the neighbour-list can be built in parallel. To avoid this problem on GPUs we use the approach in [30] and replace the cell list by a cell-occupancy matrix . For this each particle is associated with a cell and the particles in a cell are arranged into “layers”, such that all particles in a cell have a different layer-index. If the layer index of particle is , then , and can be built in parallel. Based on this, a neighbour matrix can be built such that is the index of the -th neighbour of particle . An alternative approach which is described in [47] and avoids building , would be to loop over all pairs of layers and use the matrix to identify interacting particles.

We recently also extended our framework by an OpenMP backend which is described in [38].

##### Vectorisation

Modern HPC CPUs contain floating point units (FPU) which are capable of executing Single Instruction Multiple Data (SIMD) instructions. By using SIMD instructions the FPU can apply the same operation to multiple data points simultaneously. For example a 256bit wide vector FPU may simultaneously apply the same operation to four 64bit doubles or eight 32bit floats. However, producing machine code that contains these SIMD instructions is a non-trivial task. One approach to produce SIMD instructions involves explicitly implementing the desired mathematical operations using “intrinsic” functions for a target architecture (see e.g. [44]). This ensures that SIMD instructions are generated by the compiler but requires careful implementation to be technically correct and produce efficient code. Since the intrinsics are hardware specific, this approach is not portable. In our code we currently simply avoid code patterns which inhibit auto-vectorisation by the compiler. The ability to replace loop bounds by their numerical values via Constant objects also helps with vectorisation. As noted in Section 2, we do not currently exploit symmetry in Newton’s third law when computing forces between particles (although the framework would in principle support this). We find that the Intel C/C++ compiler will successfully auto-vectorise kernels without explicitly implementing gather or scatter operations provided the kernel itself does not contain a code pattern that inhibits vectorisation. The strong- and weak- scaling results reported in Section 5.1 were obtained with vectorised code. We have also tried to vectorise the code by blocking pair-loops as described in [45], but find that for the simple examples we considered this did not give any improvement due to additional explicit memory movement. In the future we will also explore further optimisations which are necessary for more complex kernels and consider for example a portable implementation of the vectorisation approaches in [44].

## 4 Structure analysis algorithms

To demonstrate that the abstraction and implementation described in the previous sections can be used to implement more complex kernels and is not restricted to force calculations, we now discuss two popular algorithms for classifying the local environment of a particle. We show how these algorithms can be expressed in terms of particle- and local particle-pair loops. Both algorithms can be used to identify the crystalline structure of the material; an overview of other common methods can be found in [15].

### 4.1 Bond order analysis

The bond order analysis (BOA) in [13] introduces a set of order parameters which are defined for each particle as

 Q(i)ℓ= ⎷4π2ℓ+1+ℓ∑m=−ℓ|q(i)ℓm|2 (4)

with . The sum

 q(i)ℓm=1|N(i)|∑j∈N(i)Ymℓ(^r(i,j)) (5)

is computed by evaluating the spherical harmonics in the directions

 ^r(i,j)=r(i)−r(j)|r(i)−r(j)|

pointing from the atom to each of its neighbours . Atoms are considered to be neighbours if their distance is smaller than a predefined cutoff range . The moments describe the angular dependence of the charge density of the atom’s neighbours in spectral space. It can then be shown that the integral of the squared averaged charge density can be written as

 ∫Ω|ρ(i)(r)|2dΩ=∞∑ℓ=0(Q(i)ℓ)2.

Perfect crystal lattices have well defined values for . In particular the order parameters with are often used to estimate the degree and nature of crystalinity. Specific values for fcc, hcp and bcc lattices are given in Tab. 4 ([15, 52]).

In a simulation the local structure of the material can therefore be estimated by calculating and comparing to the reference values in Tab. 4. If they agree within some tolerance, the system is classified to be in the corresponding state.

The order parameters can be calculated with the two loops shown in Algorithms 1 and 2. The first particle pair loop (Algorithm 1) calculates the number of neighbours and the moments

 ~q(i)ℓm=∑j∈N(i)Ymℓ(^r(i,j))(=ν(i)nbq(i)ℓm)

for for each atom ; those quantities are stored in two ParticleDats. The particle loop in Algorithm 2 uses and to calculate the according to Eq. (4); the result is stored in a third ParticleDat. The corresponding source code can be found in the examples/structure/boa/ subdirectory of the accompanying code release [46].

### 4.2 Common neighbour analysis

Common neighbour analysis (CNA) [14] is a purely topological method for classifying the local environment of each particle. All atoms within a certain cutoff distance are considered to be “bonded”. For any bonded pair the set of all other atoms which are bonded to both and are referred to as common neighbours. The bonds between those common neighbours define a graph . For each pair this graph is now classified by three numbers [15]: (1) the number of common neighbours , i.e. the number of vertices in , (2) the number of bonds , i.e. the number of edges in , and (3) , the number of bonds in the largest cluster (connected subgraph) . For each pair of bonded atoms this defines a triplet (see Fig. 3). To classify the local environment of an atom, the triplets are computed for all its neighbours and compared to reference signatures for periodic crystal structures. For example, in an hcp lattice, each atom has 12 bonds, six of which are classified as and the other six are ; see Tab. 1 in [15].

There is some ambiguity in the cutoff distance . To overcome this limitation, the author of [15] suggests an adaptive extension of the method. While this improved algorithm can also be implemented in our framework, for the sake of brevity we do not discuss this extension here and focus on the original method.

To implement the CNA algorithm in our framework we proceed in two steps: For each atom we first calculate all directly and indirectly bonded atoms. The set describes the direct bonds; the indirect bonds in the local environment are collected in (see Fig. 4):

 E(i)d ={(i,v):v∈N,|r(i)−r(v)|

Since some of the indirect bonds are counted twice in , the set is an ordered representation of the same bonds:

 (7)

As before, is global index set and the set of all neighbours of particle , i.e. all other particles which are no more than a distance away. In a second step we loop over all pairs of atoms and calculate the sets

 C =N(i)∩N(j) (8) E ={(v,w):v,w∈C,v

is the set of common neighbours and is the set of common neighbour bonds. Note that, to avoid double counting, here we consider ordered bounds such that . Together the two sets and define the graph introduced above. The first two entries of the triplet can be calculated directly as and . To calculate the size of all subgraphs , a random node is chosen. The size of the subgraph such that is obtained with a breadth-first traversal of the connected component containing , removing all visited nodes from in the process. This is repeated until all nodes have been removed, thus calculating the size of all subgraphs . The computation of the maximal cluster size with this method is shown explicitly in Algorithm 7 in D.

We now show how the CNA algorithm can be implemented as a set of Local Particle Pair- loops. For this, define the following ParticleDats:

• (ncomp=3): Particle coordinates, stores the position of particle

• (ncomp=1): Global id, stores the unique global index of particle .

• (ncomp=1): Number of neighbours, i.e. ; this is the number of red particles in the inner circle in Fig. 5.

• (ncomp=1): Number of bonds in the local environment. counts the directly bonded neighbours of a particle plus the number of indirect bonds defined in Eq. (6).

• (ncomp=): Array representation of the set defined in Eq. (6). Two consecutive entries represent a bonded pair in the local environment of particle , i.e. one of the links shown in Fig. 5. The entries of are arranged as follows:

• with for

• with , for

In other words, the first tuples represent the bonds in and are shown as red (solid) lines in Fig. 5. The remaining tuples describe the set and correspond to the blue (dashed) lines. The static size of the list has to be chosen sufficiently large, i.e. .

• (ncomp=) stores the triplets such that is the triplet for the -th bonded neighbour of particle . The number of components has to be chosen such that .

• (ncomp=1) stores the number of classified bonds of particle .

Using those ParticleDats, for each particle the list representation of the set can now be calculated with two Local Particle Pair Loops: the first loop, shown in Algorithm 3, calculates the first entries of by inspecting the direct neighbours of each particle. Based on this, the second loop in algorithm 4 adds the remaining entries, i.e. the blue (dashed) lines in Fig. 5. The final Particle Pair Loop in algorithm 5 then uses the information stored in and to extract the tuple .

The C-code for Algorithms 3 and 4 is shown in A.2. All source code (include the one for the slightly longer Algorithm 5) can be found in the subdirectory examples/structure/cna of the accompanying code release [46]. Results obtained with our implementation of both a bond order- and common-neighbour-analysis algorithm are shown below in Section 5.2.

## 5 Results

To demonstrate the performance, portability and scalability of our code generation framework on two different chip architectures, we implemented the Velocity Verlet integrator [31] (see also e.g. [32, 33]) shown in Algorithm 6. Access descriptors for all loops are given in Tab. 5. The main time stepping loop is realised with an IntegratorRange iterator (see Section 3.5), which takes care of cell-list and neighbour-list updates. C-kernels for the particle-loops that update velocity and position in lines 6 and 8 are shown in Listings 7 and 8.

We simulated a Lennard-Jones liquid system of non-bonded particles interacting via the potential

 V(r)=4ϵ((σr)12−(σr)6+14) (9)

with a specified cutoff . The C-kernel for the calculation of the resulting short-range force in line 7 is given in A.1. The full source code can be found in the code/examples/lennard-jones subdirectory of [46]. It should be stressed that exactly the same code can be used to run the simulation both on a CPU and a GPU if the appropriate definitions shown in listing 2 are added at the beginning of the Python code.

### 5.1 Comparison to other codes

To verify that the code generation approach does not introduce any sizable computational overheads, we compare the performance of our code to monolithic C/Fortran implementations in well established and optimised MD libraries. For this we performed the same strong scaling experiment with DL-POLY (version 4.08), LAMMPS (release dated 1 March 2016) and our code generation framework (subdirectory release of [46]). Raw results can be found in the accompanying data repository [53].

All codes were built with the Intel 2016 compiler suite and OpenMPI 1.8.4 (with the exception of DL-POLY, which used OpenMPI 2.0.0). The NVIDIA CUDA toolkit version 7.5.18 was used for the GPU compilation and the framework was run with Python 2.7.8. The numerical experiments were carried out on the University of Bath HPC facility “Balena”. All nodes of the cluster consist of two Intel Xeon E5-2650v2 (2.6GHz) processors with eight cores each; in addition some nodes are equipped with Nvidia Tesla K20X GPU accelerator cards. As the GPU port of LAMMPS offloads the force calculation, we allowed LAMMPS to use all 16 cores of the host CPU along with the GPU. In contrast, in our framework the entire simulation is run on the GPU and it is sufficient to use a single MPI rank which acts as the host controller.

We use the parameters in Tab. 6, adapted from a LAMMPS benchmark [54]. All three codes implement the neighbour list method for force calculations. For LAMMPS and our framework the extended cutoff in Eq. (3) was chosen such that be with a neighbour list update every 20 iterations. In contrast, DL-POLY automatically updates the neighbour-list when necessary.

The total integration time on up to 1024 cores (64 nodes) and up to 8 GPUs is tabulated in Table 7. Parallel speed-up and parallel efficiency are plotted in Figure 6; grey regions indicate core counts contained within a single CPU node. On the largest core count (1024 cores) the average local problem size is reduced to 1,000 particles per processor. To provide a fair comparison, one K20X GPU is compared to a full 16-core CPU node since in this case the power consumption is comparable ( for the K20X GPU [55] vs. for the Intel Xeon E5-2650v2 CPU [56]). We write for the measured wallclock time required to integrate a system with particles on CPU nodes or GPUs. The corresponding speed-up and parallel efficiency (relative to one CPU node or one GPU) are defined as

 Speed-up =t(1,N)t(p,N) (10) Strong parallel efficiency =t(1,N)p×t(p,N)

and shown in Fig. 6. The absolute times demonstrate that the framework provides comparable performance and scalability to DL-POLY and LAMMPS. In fact we find that for this particular setup both LAMMPS and our code are significantly faster than DL-POLY and scale better. It should be kept in mind, however, that currently both LAMMPS and DL-POLY have a much wider range of applications and provide functionality which is not yet implemented in our framework. A socket-to-socket comparison demonstrates that one full GPU can only deliver a slightly higher performance than a full CPU node. Again, the same is observed for LAMMPS. The framework can make effective use of multi-GPU systems to accelerate computation.

To test performance for very large problem sizes we also carried out a weak scaling experiment. In this setup the average work per unit computational resource is fixed and the total problem size grows proportional to the number of nodes. A system with particles per CPU core ( particles per node) was integrated over timesteps. For the largest computational configuration (1024 cores) the total problem size is about half a billion () particles. All other system parameters are unchanged from Tab. 6. The total time for increasing problem sizes is shown in Fig. 7 (left). The weak parallel efficiency is defined as

 Weak parallel efficiency=t(1,N)t(p,N⋅p) (11)

and plotted in Fig. 7 (right). We observe that (relative to one node) the parallel efficiency never drops below and conclude that the framework will effectively scale to systems containing very large numbers of particles on a significant core count.

The number of particles on a single CPU node in the previous weak scaling run is too large to fit into GPU memory. To also compare the weak scalability of the generated CPU and GPU code we therefore repeat the same experiment with a reduced number of 512,000 particles per node. The resulting time and parallel efficiency are shown in Fig. 8. While the parallel efficiency is worse for the GPU, it never drops below 60%. On one node the GPU code is about twice as fast as the CPU code and on 16 nodes this speedup factor drops to around . This can be explained by the fact that on one node the CPU implementation is slower and therefore communication overheads will have a relatively larger impact on the GPU code. To improve scalability further, we will investigate overlapping communication and communication in the future. This, however, is usually more challenging on GPUs due to the reduced work in halo regions.

#### 5.1.1 Absolute performance

To quantify the absolute performance on both CPU and GPU we use data collected in the second weak scaling experiment (see Fig. 8). The computationally most expensive operation in the simulation is the force update step performed with a particle pair loop. This accounts for of the total runtime on the CPU and on the GPU. As in this simulation the potential energy was updated every 10 iterations, we also report performance metrics for the combined force- and potential-energy (PE) update.

With the vector instruction set each core of an E5-2650v2 (2.6 GHz) Intel CPU can perform 4 double precision additions and 4 double precision multiplications per clock cycle, resulting in a total performance of 332.8 GFLOPs per node. The peak double precision floating point performance of the nVidia Tesla K20x GPU is quoted as 1.31 TFLOPs [57].

Absolute performance numbers for a single-node run are reported in Tab. 8. The measured times only include the time spent in the auto-generated C-code, but we found that the launch of a shared library function from Python has a negligible overhead ( 10–20s). Since the system is spatially homogeneous and there is little load imbalance, we report measurements collected by a single core on the fully populated node. The results demonstrate that the computationally most relevant kernels use a significant fraction of the peak floating point performance. As confirmed by the report generated by the compiler, the kernel for the Lennard-Jones force calculation in Listing 9 is automatically vectorised.

### 5.2 Structure analysis algorithms

We finally demonstrate how the structure analysis algorithms described in Section 4 can be implemented with our framework. For this we first add an on-the-fly implementation of the BOA analysis method. This is achieved by extending the main timestepping loop in Algorithm 6 by calls to the PairLoop and ParticleLoop which evaluate according to Algorithms 1 and 2. The source code is available in the examples/on-the-fly-analysis subdirectory of [46].

To initialise the simulation, 125000 identical particles are arranged in a periodic cubic lattice and their velocities are sampled from a normal distribution. After allowing the system to equilibrate for 50,000 steps in an microcanonical ensemble we coupled the system to an Andersen thermostat with a target temperature near zero for 500,000 iterations. The final configuration consists of two distinct regions. The first is void of particles while the second contains a crystal structure. Fig. 9 shows the change of , and throughout the simulation.

A distribution of the and values at the final timestep is shown in Figs. 11 and 12 in C. This distribution describes the proportion of FCC and HCP in the final configuration as classified by the BOA method. In this work we purely focus on the implementation of the method and do not attempt a physical interpretation of the results.

To demonstrate that the resulting code still scales well in parallel, we carry out a weak scaling experiment with the parameters in Tab. 9. The results are shown in Fig. 10 and confirm that adding the on-the-fly analysis and thermostat have no negative impact on scalability.

Finally the common neighbour analysis was implemented as a parallel post-processing step. C-Kernels for Algorithms 3, 4, 5 and 7 can be found in the subdirectory examples/structure/cna of [46]. We validated our implementations by verifying that perfect crystals are correctly classified in each of the FCC, BCC and HPC configurations. We then applied the method to the test case with 125000 particles mentioned above. For the final configuration the algorithm classified 19360 (15.5%) particles as FCC and 13052 (10.4%) particles as HCP while 92588 (74.1%) particles were left unclassified. Again a physical interpretation of this result would be beyond the scope of this article.

## 6 Conclusions

The key computational components of a Molecular Dynamics simulation can be expressed as loops over all particles or all particle pairs. Based on this observation, we described an abstraction for implementing those loops and introduced the necessary data structures and execution model. Our approach is inspired by the OP2 and PyOP2 frameworks for the solution of PDEs with grid based methods. We implemented a Python-based code generation system which allows the developer to write performance portable molecular dynamics algorithms based on a separation of concerns philosophy. By considering two popular analysis methods for the classification of crystalline structures, we showed that it is easy to apply our approach to write performant and scalable analysis code. In principle the framework also allows for biasing dynamics within a simulation dependent on the local environment of each particle.

The performance and scalability of our code generation framework compares favourably to two existing and well established Molecular Dynamics codes (LAMMPS and DL-POLY) both on CPUs and GPUs. This demonstrates that for the model system considered here the code generation approach does not introduce any computational overheads; the autogenerated code runs at similar speed as monolithic codes in C++ (LAMMPS) or Fortran (DL-POLY). We stress, however, that our main aim is not to out-perform existing codes but rather explore new ways of implementing both timestepping methods and analysis algorithms with minimal programmer effort.

There are many ways in which our framework has to be extended to provide similar functionality to existing MD packages. As reported in [38], long range force calculations with the Ewald summation method [39] are supported in a more recent version of the code; this method can be implemented directly with the data structures and looping algorithms described here. However, the computational cost of this algorithm grows with and it can therefore only be used for moderate size systems. To overcome this limitation we are currently also implementing a Fast Multipole algorithm [58] which has optimal complexity. This approach will require new data structures such as a hierarchical mesh which stores multipole- and local- expansions in each grid cell. Since the functional form of the electrostatic interaction is fixed, long range interactions could also be simulated by linking to a standalone C-code or an existing library such as the SPME method in DL-POLY [59]. Another important extension is support for multiple species. While currently different species can be simulated by adding a species label as a ParticleDat and adding corresponding if-branches to the computational kernels, this is clearly not efficient and should be replaced by native support in the fundamental data structures. Adding constraints to incorporate bonded interactions will require further work. We note, however, that excluded particles can already be treated in our framework. For this, a ParticleDat stores a list with global ids of all excluded particle for each atom. In the PairLoop kernel this exclusion list can be inspected to calculate only the relevant forces.

The performance of the GPU implementation of an algorithm is sensitive to the memory access pattern. At the beginning of a simulation particles are arranged in an ordered fashion in memory that corresponds to the physical location of the particle. As the simulation evolves the movement of particles within the simulation domain introduces an essentially random ordering of particles in memory. The results we present exhibit a slow down effect as the simulation evolves due to this sub-optimal memory ordering effect. Future versions of the framework will periodically reorder the particle data to mitigate this effect. More generally, an in-depth performance study from the perspective of memory utilisation both for the CPU and the GPU backend is important since many MD codes are memory bandwidth limited.

Finally, automatic generation of kernels from the analytical form of the potential as implemented in the OpenMM library [42] could be added. We stress, however, that it is important to still allow the user to also implement arbitrary kernels by hand to cover more general applications.

## 7 Acknowledgements

We would like to thank Alexander Stukowski (Darmstadt) for useful correspondence to clarify the exact definitions of the quantities in the common neighbour analysis algorithm. The PhD project of William Saunders is funded by an EPSRC studentship. This research made use of the Balena High Performance Computing (HPC) service at the University of Bath.

## Appendix A Kernels

This appendix lists some C-kernels which are used for the Lennard-Jones force calculation in Section 5 and in the common neighbour analysis discussed in Section 4.2. The full source code can be found in [46].

### a.1 Force calculation

The Lennard-Jones potential in Eqn. (9) gives rise to the force

 F(r) =−∇V(r)=−rr∂V∂r (12) =48ϵσ2r((σr)14−12(σr)8).

The corresponding kernel for the force- and potential calculation is shown in Listing 9 and the Python code for creating the corresponding data objects and executing the PairLoop is given in Listing 10. The particle position is passed in as the ParticleDat r and the resulting force and potential energy are returned in the ParticleDat F and ScalarArray u. The squared cutoff distance and the numerical constants , and are passed to the pairloop as Constant objects. Since we use a hard cutoff, the force and potential are nonzero only if and only need to be calculated in this case. However, to ensure that the code can be vectorised, the force and potential is calculated for all relative distances and written to the variable F with a ternary operator.