# Exploiting multi-scale parallelism for large scale numerical modelling of laser wakefield accelerators

###### Abstract

A new generation of laser wakefield accelerators, supported by the extreme accelerating fields generated in the interaction of PW-Class lasers and underdense targets, promises the production of high quality electron beams in short distances for multiple applications. Achieving this goal will rely heavily on numerical modeling for further understanding of the underlying physics and identification of optimal regimes, but large scale modeling of these scenarios is computationally heavy and requires efficient use of state-of-the-art Petascale supercomputing systems. We discuss the main difficulties involved in running these simulations and the new developments implemented in the OSIRIS framework to address these issues, ranging from multi-dimensional dynamic load balancing and hybrid distributed / shared memory parallelism to the vectorization of the PIC algorithm. We present the results of the OASCR Joule Metric program on the issue of large scale modeling of LWFA, demonstrating speedups of over 1 order of magnitude on the same hardware. Finally, scalability to over cores, and sustained performance over PFlops is demonstrated, opening the way for large scale modeling of laser wakefield accelerator scenarios.

###### pacs:

52.65, 52.38^{†}

^{†}: Plasma Phys. Control. Fusion

## 1 Introduction

Extreme laser intensities, in excess of , and laser pulse durations shorter than 1 picosecond, where the electron quiver motion in the laser field becomes relativistic [1], are an extraordinary tool for new physics and new applications. Their interaction with plasmas can generate tremendous accelerating fields, which can be used to accelerate particles to high energies in very short distances in what is called laser wakefield acceleration (LWFA) [2]. In LWFA an intense laser driver propagating through a plasma drives a strong, non-linear plasma wave, by pushing the plasma electrons aways through radiation pressure. The excitation of the wake is non-linear in most scenarios, and the fields within the wake structure are electromagnetic in nature requiring full electromagnetic treatment. Also, the trajectories of individual electrons can cross each other and their oscillation energy in the wakefield will reach highly relativistic values. This makes purely theoretical descriptions difficult, and requires the use of full kinetic models that track the motion of individual particles and that solve for the relativistically correct equations of motion

Numerical modeling has played a key role in the development of this field, for example predicting the generation of low energy spread electron bunches [3, 4] that were later demonstrated experimentally [5, 6, 7] and that can now reach energies of GeV [8]. Modeling these scenarios requires us to accurately model the propagation and evolution of a short ( fs) intense laser driver over long distances ( cm), how the wake is excited and how it evolves and how the properties of the injected beams evolve as they are accelerated. However, full 3D scale of modeling of these scenarios is computationally heavy, and require the use of high end high performance computing (HPC) systems. These systems rely on a hierarchy of parallelism that allows many systems worldwide to operate at maximum performances well in excess of 1 Petaflop [9], following a trend that can be traced back 60 years, and that relied on substantial changes in computer architecture. While these systems provide the raw computing power required, numerical codes must evolve and adapt to match these architectures and efficiently use these resources for modeling relevant problems.

In this paper we discuss the issues involved in exploiting this multi-scale parallelism for running efficient large scale LWFA simulation on high end HPC systems. We begin by giving some details on the numerical algorithm used, and estimate the computational cost of running these large scale numerical experiments of LWFA. We follow by describing the multi-scale parallel hierarchy that represents the most common paradigm of modern HPC systems. We then discuss the techniques employed for scaling the code to cores systems, and for exploiting existing hardware acceleration features. We continue by discussing the results of the OASCR Joule Software effectiveness metric test for modeling LWFA on a full system, and show results for the code performance on Tier-0 systems. We conclude with an overview.

The developments presented in this paper were done in the OSIRIS framework [10], which is a massively parallel, fully relativistic, electromagnetic particle-in-cell code that was originally developed for the study of plasma based accelerators which is, to the best of our knowledge, the only code currently supporting all of these features. There are several other codes that have been applied to 3D modeling of LWFA such as VLPL [11], CALDER [12], Vorpal [13], and others [14, 15, 16], and some implement some of the features described here, such as vectorization of the PIC algorithm [16]. There has also been significant developments in implementing these algorithms in novel architectures, most notably using GPUs [17].

## 2 Numerical modeling of LWFA

Due to the highly non-linear kinetic processes occurring in high intensity laser plasma interactions, and in particular in laser wakefield acceleration scenarios, kinetic modeling is required to capture the full detail of the acceleration and injection mechanisms in LWFA. The most complete method that can presently be employed for modeling these scenarios is the fully explicit, full Maxwell solver, fully relativistic, particle-in-cell (PIC) method [18]. In this method, plasma particles are represented by a set of macro-particles that interact through forces deposited on a grid, implementing a particle-mesh algorithm. These particles are initialized with positions within a grid and momenta. At each timestep, the particle motion generates an electrical current that is deposited onto the appropriate points on the grid. MaxwellÕs equations are then used to advance the electric and magnetic fields, generally through the finite difference time domain (FDTD) method. The updated fields can then be interpolated at the particle positions to calculate the force acting on the particle and then advance the fully relativistic equations of motion for the particle, thus closing the simulation loop. The grid cell size is chosen to resolve the smallest scale length of physical relevance, typically the laser wavelength in LWFA scenarios, and the time step is then chosen according to the Courant condition.

If we apply this algorithm to a 10 GeV LWFA stage, theoretical scalings [19] indicate that this will require an acceleration on the order of m, using a plasma density of . The simulation needs to resolve the laser wavelength longitudinally along the laser propagation axis with points per , and transversely the plasma skin depth with points per . The grid cells will be much smaller along the longitudinal direction, which will improve the numerical dispersion of the FDTD method along the laser propagation axis. The algorithm uses a moving window that follows the laser as it propagates in the plasma, that should be large enough to model skin depths longitudinally (or 2 accelerating buckets), and laser spot sizes, , transversely. For a laser focused on a spot the simulation grid will therefore require cells to model the required interaction volume in 3D. The simulation time step will be on the order of the longitudinal resolution and the total number of iterations can be calculated by dividing the total propagation distance by this value, yielding a total of time steps. Using 8 particles per cell, the total number of particles that need to be modeled will be on the order of . A 3D EM-PIC algorithm, using linear interpolation, requires operations per particle and operations per grid cell, which leads to a total number of required calculations on the order of . Realistic, high-fidelity simulations therefore require the efficient use of Petascale computing systems, and adapting our algorithms and codes to the specific requirements involved.

The high performance attained by present computing systems is reached by exploiting several levels of parallelism, ranging from inside a single compute core to a full system consisting of cores. The details of each of these systems will vary, but the currently prevailing paradigm generally follows a three level hierarchy of parallelism: i) At the highest level we have a computer network of independent computing nodes, each one having private memory spaces and processing units, ii) each of these nodes is generally comprised by a set of CPUs/cores that share the memory and other resources inside these nodes, and iii) at the lowest level the computing cores have vector units capable of single instruction multiple data (SIMD) calculation i.e. being able to execute the same instruction on a set of data. Efficient use of these systems requires taking advantage of all these levels of parallelism, only then tapping the real potential of current day Petascale computing systems. Over the past decade some systems have also include some additional accelerator hardware at the node level, such as a general purpose graphics processing unit (GPGPU), or Intel Xeon Phi (MIC) cards, such as the ones used in the current number 1 machine in the world. These can also be viewed as a computing node with memory shared by a (large) number of computing elements and some of the techniques described below can be applied to these systems, but they will not be covered in detail here.

## 3 Scaling to cores

Given the computational requirements for realistic simulations PIC codes must be designed from scratch to be used in massively parallel computer systems. As described above, at the highest level these systems can be viewed as a network of independent processing elements (PEs): each PE has its own private memory space, and information residing on other nodes can only be accessed by exchanging messages over the interconnect. PIC codes are good candidates for parallelization in these systems because the particle calculations are inherently local, requiring only information close to the particle position on the grid. The field solver chosen should also be a local field solver, requiring only the information regarding neighboring cells, which requires less parallel communication than a global field solver such as a spectral solver. With this in mind, the code can simply split the problem across multiple processing elements using a spatial domain decomposition. In this parallel partition each node will handle a fraction of the global simulation grid and the particles that occupy that region of simulation space. The simulation grids use guard cells that hold values belonging to neighboring nodes, that are updated only once per iteration. These guard cells must be enough to accommodate both the needs of the field solver and the particle interpolation scheme. After advancing particles, these are checked to see if they have crossed the node boundary, and if so, they are sent to the appropriate neighboring node. This parallelization strategy is well established [20] and has proven to be very effective for scenarios with small density variations, as shown by the results in section 7 below.

However, for high resolution 3D LWFA simulations this strategy leads to poor performance at high () core counts due to load imbalance. The plasma dynamics during LWFA, where the wave structure and the self-injection mechanism places many particles in a localized region of real space, leads to a severe accumulation of particles in a very narrow region at the back if each wavelength. If the region where the particles accumulate resides on a very small number of the total PEs then a severe load imbalance will arise. Figure 1 shows results from a 3D LWFA simulation that is described in detail in section 6 below that illustrates this problem. Compared to a warm plasma benchmark run on the same partition, the code ran approximately times slower as a result of the severe load imbalance occurring in this simulation. Figure 1(a) plots the number of particles per core in a 2D slice of the 3D parallel partition closest to the laser propagation axis at iteration number 50000, showing a large accumulation of particles behind the laser pulse. The average load imbalance for this simulation (ratio between the maximum number of particles per node and average number of particles per node) was 8.68 which is very close to the observed slowdown. This situation persists throughout most the simulation, as shown in figure 1(b). As soon as the plasma enters the simulation box it starts to accumulate behind the laser and a significant imbalance is maintained over the whole simulation length. The dynamic nature of this imbalance should also be noted, since the position of the accumulated particles in the simulation window will vary over time and slowly move between PEs.

### 3.1 Shared memory parallelism

One way to address this issue is to explore the fact many computer cores share memory access inside a node. The parallelization strategy described previously is designed for a distributed memory system, where each core handles a separate region of simulation space. This has proven extremely efficient because at high core counts the simulation space handled by each core is quite small, and the probability that significant load imbalance will occur is quite high. However, since many cores will share memory, we can therefore have them share a given simulation region, whose volume will be much larger, and distribute the particles inside this larger volume evenly across these cores. The workload for these cores will always be perfectly balanced; globally some imbalance may still occur, but since the volume handled by each group of cores is much larger, the probability for significant load imbalance will be lower.

This can be achieved by parallelizing the particle pusher using a shared memory approach: each process will be responsible for a larger region of simulation space to be shared by a given number of cores. This process will spawn threads (ideally matching the number of cores available) to process the particles, dividing particles evenly across threads. This may result in memory conflicts, as particles being processed by different threads may try to deposit current to the same memory location. To overcome this, we also use copies of the electric current grid, and each thread will only deposit to 1 of these copies. After this has finished, the algorithm will accumulate all current grids in a single current grid, also using dividing the work over threads. This algorithm has the overhead of having to zero multiple current grid copies and also performing the reduction operation of all them, and therefore will not scale for an arbitrary number of cores, and a balance needs to be found. However, this overhead is quite small, especially because these operations can also benefit from shared memory parallelism, and it is generally possible to scale it to all cores in a CPU with good efficiency. Since a larger simulation volume is now being shared by multiple cores, it is also necessary to use shared memory parallelism on other sections of the code that would now represent a more significant part of the simulation loop, such as the field solver. However these are trivial to parallelize in shared memory nodes using textbook strategies with OpenMP constructs. This technique has the added benefit of reducing the communication volume (the total number of cells that need to be communicated between nodes), slightly offsetting the overhead of the shared memory algorithm.

We measured the performance of the shared memory algorithm using a standard warm plasma benchmark on 2 different systems, a Cray XT5 (Jaguar/ORNL), using 2, 6-core, AMD 6276 cpus per node, and an IBM BlueGene/P (Jugene/JFZ), using 1, 4-core, PowerPC 450 cpu per node. For the baseline we used the standard distributed memory algorithm (MPI only). On the IBM system the performance degradation was negligible up to 4 threads per MPI process (the maximum) value. On the Cray system, the performance degradation was below 10% up to 6 threads per MPI process (or 1 MPI process per cpu). When using 12 threads per MPI process the performance drops 27% below the baseline because the additional calculations become relevant and also because of memory affinity issues.

## 4 Multi-dimensional Dynamic Load Balance

Another approach to the imbalance problem is to dynamically adjust the positions of node boundaries to adjust the load in each computing node, attempting to maintain an even load across processors [21, 22]. The algorithm has 2 steps: i) determining the best possible partition from current computational load and ii) redistribute the domain boundaries. Both steps are non-trivial, since finding an optimal partition can be quite difficult, and redistributing the domains can represent a large overhead, with significant communication between nodes.

The only scenario for which a single solution exists for the best parallel partition is for a one-dimensional parallel partition. At high core counts, however, one must resort to multi-dimensional partitions, and attempting to improve the load imbalance in one direction may result in worse results in the other directions. One possible solution is to assume that the computational load is a separable function (i.e. load is a product of ). In this case dynamic load balance in each direction becomes independent of each other and a solution can be found as shown in figure 2. This assumption is not true for all scenarios (e.g. spherical plasma) but will generally result in a better parallel partition. It should also be noted however that at very high core counts, there isnÕt significant room to move your boundaries, given that we will be close to the minimum limit of cells per node, and therefore our chances for optimization are quite low.

Figure 3 shows this algorithm being used in a LWFA scenario. In this example the algorithm was applied simultaneously in the and (not shown here) directions, while the partition along was kept constant, since there was no significant room for the boundaries to move. The algorithm concentrated more partitions near the center of the laser propagation axis where particles accumulate, and has wider partitions close to the edge of the simulation box (similar partitions were used along ). For this particular example however, although the algorithm improved the average imbalance over the simulation by 30%, the code showed no performance improvement. This was due to the fact that the imbalance improvement was not sufficient to offset the overhead of reshaping the simulation partition.

## 5 Using vector (SIMD) units

Further improving simulation performance requires exploiting the vector capabilities available on most current CPUs. As mentioned earlier, these CPUs include a single instruction multiple data (SIMD). These vector units allow for the same instruction to be applied on a wide vector of distinct values at once, enabling much greater floating-point performance. The particular width of the vector will depend on the system; for example the Power A2 QPX unit (BlueGene/Q) allows for 4-wide double precision calculations and the x86 AVX unit allows for 8-wide single precision calculations. Modern optimizing compilers will automatically generate code for these units by attempting to identify vectorization opportunities in serial code, but optimal performance can only be achieved by rethinking the algorithm and implementing the vector code explicitly. Also, this is not required for all parts of the PIC loop, and optimizations should be focused on the most time consuming steps of the algorithm, in particular the particle push (field interpolation and equations of motion) and the electric current deposition that take up of execution time.

### 5.1 Data structures for vector code

Efficient use of the SIMD hardware requires first and foremost the ability to efficiently load the data into the vector registers for calculations. With current CPUs global memory access is generally much more expensive than computations, since up to 8 operations can be performed in a single cycle, efficiently loading the data into the vector registers is critical. Ideally, data structures should be organized so that data could be loaded into vector registers in a single instruction e.g. by grouping each position component in a different buffer sequentially in memory. However, to benefit from the existing code base that handles a large set of functionalities, ranging from initialization, to diagnostics and parallel communications, we need to maintain compatibility with the existing serial data structures, where particle data is is stored with different , and position components for the same particle sequentially in memory.

An efficient way of translating between the two data structures is to use data shuffling instructions that are also available in SIMD units. These instructions allow us to efficiently exchange vector elements inside a single vector or pairs of vectors without accessing memory, and with very little overhead. For example, when reading 3D positions for processing in a 4-wide vector unit we i) read 3 vectors sequentially from the particle buffer, that will correspond to 12 position components and 4 particles, and ii) shuffle the vector data to get one vector of 4 positions, one vector of 4 positions and one vector of 4 positions. This operation is known as a transpose, and it can be done very efficiently using only the vector registers, giving a minimal overhead of about operations per particle (compared to the total floating point operations required for a linear push). For storing the particles back to memory, the opposite operation is performed, again allowing us to maintain the global data structures with negligible performance impact. This method can easily be extended to work with other vector widths. It should also be noted that for benefiting from the highest performance of most SIMD units available the simulation should be done in single precision, as this allows for more values to be processed in a vector of the same bit width. To ensure accuracy and stability of the simulation, this requires storing particle positions relative to the local mesh vertex, normalized to cell size, together with the particle mesh vertex index. This brings the added benefit of making the noise properties of the algorithm uniform across the entire grid, and allows large scale LWFA simulations to be performed correctly in single precision.

Regarding grid quantities, the field interpolation calculations present a different challenge, requiring what is known as a gather operation. We need the appropriate field values for the 4 particles being processed to be loaded into a single vector: since particles are free to move about the grid, these field values will belong to random positions in the grids. Furthermore, when using a Yee grid [23] for the fields, different field components will be defined at different positions inside the cell, and a particle may need to interpolate data from different cells for different components. While the current generation of CPUs do not support vector gathering natively, it can be implemented through a set of serial reads and shuffle operations, with negligible overhead when compared to a serial read, again maintaining compatibility with the global memory structures. For the electric current deposition a different approach was chosen: we add a dummy component to the electric current so that there are 4 values per cell. This allows us to very efficiently accumulate the current in a given cell reading these 4 values into a vector at once, vectorially adding the new current and then storing the 4 values also in a single instruction. Besides adding a small memory overhead, this has no impact in the global code structure.

### 5.2 Vector version of the PIC algorithm

PIC codes, and the particle pusher/current deposition in particular, are good candidates for vectorization since most operations acting on each particle are exactly the same and independent from each other. The exception is the electric current accumulation on the global grid, where more than 1 particle in the same vector may be accumulating current in the same cell, which requires special treatment to avoid memory collisions. The vectorization strategy for an -wide vector unit is therefore straightforward: i) we load particle data (position and momenta) into the vector unit, using the procedure described in the previous section, and calculate the interpolation weights, ii) we interpolate the EM fields for these particles, loading the appropriate field values and using the previously calculated weights, iii) we advance the particles equations of motion using the particle data and the interpolated fields, iv) we create up to virtual particles trajectories for current deposition and v) we store the particles back to main memory. With the exception of the trajectory splitting, which is mostly a serial operation, all calculations are done vectorially.

We then turn our attention to depositing the current using a charge conserving method [24, 25] of the virtual particles trajectories created and i) load virtual trajectories into the vector unit, ii) calculate the current contribution for the virtual particles and iii) accumulate this current in the global electric current grid. This last step needs to be handled differently as more than one of the virtual particles may be in the same simulation cell, and there would be a memory conflict when accumulating the results on the global grid. This can just be done serially by looping through each of the virtual trajectories and depositing the previously calculated current for each one. To improve performance and benefit from SIMD calculations in this step, we accumulate all current components in a single vector step. We convert the 3 vectors holding the , and electric current components of all virtual particles, into vectors holding all current components (and a dummy value) for each individual virtual particle. We then loop over these vectors and accumulate them in global memory using a vector operation, provided the electric current grid is organized as described in the previous section.

## 6 OASCR Joule Metric

Run | Grid | Particles | Performance [G part/s] | |||
---|---|---|---|---|---|---|

Baseline | Improvements | |||||

Warm Plasma | 76.2 | 179.9 | ||||

LWFA 1 | 4.22 | 29.6 | ||||

LWFA 2 | 3.72 | 27.43 | ||||

LWFA 3 | 8.85 | 61.2 |

The performance of the new features implemented for real laser wakefield problems were benchmarked within the framework of the 2011 OASCR Joule Software Effectiveness Metric project, that aims to analyze/improve applications requiring high capability/capacity HPC systems. These tests focused on the Jaguar supercomputer at ORNL, a Cray XT5-HE system with a measured performance of 1.76 Pflop/s, that ranked number 3 in the Top 500 list [9] at the time of the tests. Jaguar had 18680 computing nodes with 2 AMD 6276 cpus with 6 cores each for a total of 224160 computing cores. Each core had an SSE SIMD unit capable of performing 4-wide vector operations in single precision. The project started with a set of initial tests that used of the full systems. These tests aimed to establish a baseline performance for the code for normal simulation parameters relevant for current research, and allowed us to find the relevant bottlenecks and performance hotspots. The developments described above were then implemented in the code, and the same tests were repeated to measure the performance improvements. At the end of the project we also analyzed the strong scaling of the test problems using the full system.

The simulation parameters for the tests performed can be found in table 1. To test the baseline performance of the code on modern hardware a uniform warm plasma test, using a temperature distribution with a generalized thermal velocity of . The laser wakefield scenarios modeled focused on propagating a 200 TW (6 Joule) laser in a plasma with and without moving ions (LWFA-1 and LWFA-3 respectively), and a 1 PW (30 Joule) laser in a (LWFA-2). All simulations were done with quadratic (2nd order) interpolation on a parallel partition of thousand cores using the standard fortran version of the code. The measured baseline performance of the code can be found in table 1. The code had a peak performance of 76.2 G particles/s for the warm plasma. However, due to the load imbalance problems mentioned above the performance of the LWFA runs was 9 - 20 times lower.

The same tests were then repeated using a SIMD (SSE) accelerated version of the code on the same parallel partition. The warm plasma test used the pure MPI distributed memory parallel algorithm, and the LWFA runs were done using the hybrid MPI/OpenMP shared memory algorithm, using 6 threads per MPI process (1 MPI process per CPU). The results are summarized in table 1. With the new features the peak performance of the code improved to 179.9 G particles/s, and all LWFA simulations witnessed speedups of , without any change in the hardware. The gap between peak performance and LWFA performance went down by a factor of . This speedup comes from a combination a factors. For the LWFA-1 test the breakdown of this speedup was as follows: i) the change in parallel partition, using more nodes longitudinally lowered the imbalance by a factor of 1.91, ii) the shared memory algorithm further lowered the imbalance by a similar factor, improving performance by 1.82 and iii) the SSE code and minor memory optimizations provided an additional 2.02 fold speedup. This behavior was similar for the other LWFA simulations.

Run | PIC | Speedup | FP |

[ G part / s ] | [T Flops] | ||

Frozen (s1) | 1463.5 | - | 516.9 |

Frozen (s2) | 784.0 | - | 736.1 |

Warm plasma (strong) | 719.8 | 679.7 | |

Warm plasma (weak) | 741.2 | 700.1 | |

LWFA 1 (strong) | 70.9 | 76.55 | |

HPL | - | - | 1759 |

The final tests aimed to analyze code performance in the full system, and to establish our capability in scaling the test problems in this hardware partition. A series of tests were chosen to establish theoretical peak code performance, strong (fixed problem size) and weak (increasing problem size with number of cores) scaling of well balanced problems, and strong scaling of one of the laser wakefield problems. The hardware partition used was exactly 4 times the size of the partition used previously, and corresponds to 98.7% of the full system. Table 2 summarizes the performance results for these runs. The frozen plasma tests measure the asymptotic limit for code performance by modeling a zero temperature plasma with particles, using linear (s1) and quadratic (s2) interpolation. In this scenario there is no particle motion, which leads to optimal performance in current deposition and minimizes the communications for particle boundaries. For linear interpolation the code performance was particles / s and for quadratic interpolation particles / s. The warm plasma tests focus on a more realistic scenario that uses a warm plasma distribution that more accurately mimics typical simulation parameters, while maintaining a balanced load. Two scaling tests were made, for strong and weak scaling of this simulation. The total number of particles in these simulations were and particles respectively. The code performance was very similar in both cases, on the order of particles / s, slightly below the frozen plasma limit. In both scenarios the code was found to hyper-scale from the previous tests, with speedups of 9.48 to 16.8, with only 4 times the number of cores. This is as result of both the shared memory algorithm, that improved load balance issues, and the vector optimizations, that improved floating point efficiency. Table 2 also shows the floating point (FP) performance (floating point operations per second) of the code for the large scale tests. For comparison the High Performance Linpack (HPL) benchmark, which is the standard HPC system performance benchmark [9], was included for this system. The code reaches very high floating point / parallel efficiency for the full system, reaching 0.74 PFlops peak performance or 42% of the HPL benchmark. Quadratic Interpolation was found to improve operation count / memory ratio which offsets the overhead of memory access, achieving higher floating point efficiency. As a result, although the number of operations involved in the quadratic interpolation algorithm is larger than for linear interpolation, the code is only slower.

Finally, figure 4 shows a visualization of the results from the strong scaling of the LWFA run 1. The global performance of this simulation was particles / s, or core push time, which corresponds to a speedup of 16.8 from the previous tests, and a 2.4 speedup from the improved results at 55 thousand cores. The load imbalance was 4.66, larger than for the 55 thousand core partition tests, which is expected since the node volume decreased by a factor of 4. The floating point performance of the code was 77 TFlops (4.4% of the HPL benchmark). When compared with the warm plasma tests we see that the code ran 9.8 times slower. This is mainly due to the imbalance shown above and the fact that we have a very small final problem size, on average 15 thousand particles per node. In this situation the overhead of the remaining code structure and the parallel communication overhead become important, and lead to a slowdown of about a factor of two, even for a perfectly load balanced situation. These numbers indicate that the LWFA run 1 simulation described above, that would take 2 days and 18 hours using the performance established by the initial tests, can now be performed in under 4 hours using the full Jaguar system.

## 7 Performance on Tier-0 Systems

To test the baseline performance of the code on modern hardware we chose the same uniform warm plasma test described in the previous sections. We chose a warm plasma instead of a frozen plasma for these tests to benchmark a more realistic scenario where particles move between nodes and stress memory access issues. Table 3 shows the performance measured on a single workstation with a single 8 core Intel E5-2680 cpu running at 2.7 GHz for a grid with 8 particles per cell (16 million particles total), using all cores. The test used the AVX vector unit of the cpus operating in single precision. Results are shown as a function of different interpolation level orders, from linear (1st order) to quartic (4th order), per core and per cpu. For linear interpolation the code achieved a performance of 134.4 million particle pushes per second per cpu, completing one iteration in the test node in ms. The floating point efficiency was about of peak.

Interpolation | Push time | Performance | |||||
---|---|---|---|---|---|---|---|

Level | [ns/part] | [M part/s] | |||||

core | cpu | core | cpu | ||||

1 | 59.5 | 7.4 | 16.79 | 134.34 | |||

2 | 108.7 | 13.6 | 9.20 | 73.62 | |||

3 | 198.3 | 24.8 | 5.04 | 40.34 | |||

4 | 492.2 | 61.5 | 14.4 | 16.25 |

We also performed parallel scalability tests at the Sequoia system (IBM BlueGene/Q) at the Lawrence Livermore National Laboratory in the US, which is currently the largest CPU based system in the world, with a total of 1572864 of computing cores. In these tests we analyzed the code scalability starting from a 4096 core partition all the way to the full system partition. We used the same warm plasma test described above with quadratic (2nd order) particle interpolation, and tested both weak parallel scaling and strong parallel scaling. For weak scaling, we used a grid size of and 8 particles per cell, with a final problem size of Gcells. For strong scaling we chose a problem size of grid cells with 16 particles per cell. The results are summarized in figure 5. The weak scaling results show near perfect scaling up to the full system partition, with scaling efficiencies over 96% for all partitions tested. The choice of a finite difference field solver for the algorithm means that all calculations are local, and each partition elements needs only to communicate with its nearest neighbors, allowing for very high weak scalability, that should also hold for larger systems. The strong scaling tests show a final scaling efficiency of 75%. This small drop from optimal efficiency comes mainly from 2 factors, namely i) the problem size could not be divided evenly across all cores at the largest partition sizes, because of the high number of cores involved, which meant that some cores had to deal with a larger computational domain resulting in load imbalance and ii) the final problem size per core was only cells, meaning that the ratio between computation and communication is quite small. The final simulation ran in 44.2 s, compared to the initial hours at 4096 cores.

We also had access to the BlueWaters system at NCSA Illinois with the goal of demonstrating sustained Petascale performance in PIC plasma simulations. The BlueWaters system is a hybrid cpu / gpu system, with an aggregate peak (theoretical) performance of 11.61 PFlops. The tests were run on the cpu partition only that uses a total of 772 480 cores of AMD 6276 cpus running at 2.3 GHz, again using the AVX optimized version of the code in single precision. We used the same warm plasma test, with a problem size of () grid cells, with 400 particles/cell, or particles total, which were, to the best of our knowledge, the largest PIC simulations ever performed. The measured average floating point performance of the code was 2.2 PFlop/s, corresponding to 31% of the peak theoretical performance of the cpu partitions, clearly demonstrating sustained petascale performance.

## 8 Overview

The present problems in laser wakefield acceleration present a formidable challenge to computational physicists. Present day state of the art machines already provide the necessary computing power for detailed large scale simulations on this field, but efficient use of these systems is required. The developments presented in this paper exploit the multi-scale parallelism available in current HPC systems, allowing us to perform large scale numerical modeling of laser wakefield accelerators.

The simulation tools presented here can push particles in under 8 ns / particle / cpu using modern hardware. This performance was reached in single precision that allows for twice the number of operations to be performed simultaneously in vector operations. Provided the PIC algorithm is carefully implemented, in particular defining particle positions relative to local mesh vertices, this poses no limitation for LWFA modeling, allowing for stable and accurate large scale simulations. The simulation code can scale efficiently to over cores, and demonstrates an aggregated performance of over particle pushes per second using quadratic interpolation, modeling over cells and particles, operating at a sustained floating point performance in excess of 2 PFlop/s. The excellent parallel scalability attained is a result of the algorithm and communication pattern choices: all calculations are local, and communication only occurs with nearest neighbors. The communication overhead will be independent of the global core count and, provided the problem size per core is above a certain threshold, the ratio between computation and communication will render this overhead negligible, allowing for near perfect weak scaling of well balanced problems on this number of cores.

For LWFA problems however, load imbalance is a serious issue at high core counts. Dynamically load balancing the simulation by adjusting the position of node boundaries, even considering multi-dimensional scenarios, was not found to improve simulation performance. This was mainly due to the very small simulation volume per core that limits the range of motion for the boundaries, and therefore the possibility for improvement, and the overhead involved in redistributing the computational load. Alternatively, exploiting the shared memory parallelism available in current HPC systems, allows for perfect load balance inside each shared memory node and provide a significant speedup, effectively mitigating this issue and allowing for efficient modeling of these scenarios. The tools presented here open new avenues of research between theoretical/ massive computational studies and laboratory experiments in plasma based particle acceleration, with full scale models, accurate physics, and quantitative high fidelity simulations.

## References

## References

- [1] Mourou G, Tajima T and Bulanov S V 2006 Reviews of Modern Physics 78 309–371
- [2] Tajima T and Dawson J M 1979 Physical Review Letters 43 267–270
- [3] Pukhov A and Meyer-ter Vehn J 2002 Applied Physics B: Lasers and Optics 74 355–361
- [4] Tsung F S, Narang R, Mori W B, Joshi C, Fonseca R A and Silva L O 2004 Physical Review Letters 93 185002
- [5] Faure J, Glinec Y, Pukhov A, Kiselev S, Gordienko S, Lefebvre E, Rousseau J P, Burgy F and Malka V 2004 Nature 431 541
- [6] Geddes C G R, Tóth C, Van Tilborg J, Esarey E, Schroeder C B, Bruhwiler D L, Nieter C, Cary J R and Leemans W P 2004 Nature 431 538
- [7] Mangles S P D, Murphy C D, Najmudin Z, Thomas A G R, Collier J L, Dangor A E, Divall E J, Foster P, Gallacher J G, Hooker C J, Jaroszynski D A, Langley A J, Mori W B, Norreys P A, Tsung F S, Viskup R, Walton B R and Krushelnick K 2004 Nature 431 535
- [8] Wang X, Zgadzaj R, Fazel N, Li Z, Yi S A, Zhang X, Henderson W, Chang Y Y, Korzekwa R, Tsai H E, Pai C H, Quevedo H, Dyer G, Gaul E, Martinez M, Bernstein A C, Borger T, Spinks M, Donovan M, Khudik V, Shvets G, Ditmire T and Downer M C 2013 Nature Communications 4
- [9] Top 500 supercomputer sites http://www.top500.org/lists/2013/06/
- [10] Fonseca R A, Silva L O, Tsung F S, Decyk V K, Lu W, Ren C, Mori W B, Deng S, Lee S, Katsouleas T and Adam J C 2002 (Lecture Notes in Computer Science vol 2331) (Berlin, Heidelberg: Springer Berlin Heidelberg)
- [11] Pukhov A 1999 Journal of Plasma Physics 61 425–433
- [12] Lefebvre E, Cochet N, Fritzler S, Malka V, Al onard M M, Chemin J F, Darbon S, Disdier L, Faure J, Fedotoff A, Landoas O, Malka G, M ot V, Morel P, Gloahec M R L, Rouyer A, Rubbelynck C, Tikhonchuk V, Wrobel R, Audebert P and Rousseaux C 2003 Nuclear Fusion 43 629–633
- [13] Nieter C and Cary J R 2004 Journal of Computational Physics 196 448–473
- [14] Ruhl H 2006 Introduction to Computational Methods in Many Body Physics ed Bonitz M and Semkat D (Rinton Press, Paramus, New Jersey) p 416
- [15] Geissler M, Schreiber J and Meyer-Ter-Vehn J 2006 New Journal of Physics 8 186
- [16] Bowers K J, Albright B J, Yin L, Bergen B and Kwan T J T 2008 Physics Of Plasmas 15 055703
- [17] Burau H, Widera R, Honig W, Juckeland G, Debus A, Kluge T, Schramm U, Cowan T E, Sauerbrey R and Bussmann M 2010 Plasma Science, IEEE Transactions on 38 2831–2839
- [18] Dawson J M 1983 Reviews of Modern Physics 55 403–447
- [19] Lu W, Tzoufras M, Joshi C, Tsung F S, Mori W B, Vieira J, Fonseca R A and Silva L O 2007 Physical Review Special Topics-Accelerators and Beams 10 061301
- [20] Wang J, LIewer P and Decyk V K 1995 Computer Physics Communications 87 35–53
- [21] Ferraro R, LIewer P and Decyk V K 1993 Journal of Computational Physics 109 329–340
- [22] Fonseca R A, Martins S F, Silva L O, Tonge J W, Tsung F S and Mori W B 2008 Plasma Phys Contr Fusion 124034
- [23] Yee K 1966 IEEE Transactions on Antenna Propagation AP14 302–307
- [24] Villasenor J and Buneman O 1992 Computer Physics Communications 69 306–316
- [25] Esirkepov T Z 2001 Computer Physics Communications 135 144–153