Strong scaling for numerical weather prediction at petascale with the atmospheric model NUMA
Numerical weather prediction (NWP) has proven to be computationally challenging due to its inherent multiscale nature. Currently, the highest resolution NWP models use a horizontal resolution of about 10km. At this resolution many important processes in the atmosphere are not resolved. Needless to say this introduces errors. In order to increase the resolution of NWP models highly scalable atmospheric models are needed.
The Non-hydrostatic Unified Model of the Atmosphere (NUMA), developed by the authors at the Naval Postgraduate School, was designed to achieve this purpose. NUMA is used by the Naval Research Laboratory, Monterey as the engine inside its next generation weather prediction system NEPTUNE. NUMA solves the fully compressible Navier-Stokes equations by means of high-order Galerkin methods (both spectral element as well as discontinuous Galerkin methods can be used). Mesh generation is done using the p4est library. NUMA is capable of running middle and upper atmosphere simulations since it does not make use of the shallow-atmosphere approximation.
This paper presents the performance analysis and optimization of the spectral element version of NUMA. The performance at different optimization stages is analyzed using a theoretical performance model as well as measurements via hardware counters. Machine independent optimization is compared to machine specific optimization using BG/Q vector intrinsics. By using vector intrinsics the main computations reach 1.2 PFlops on the entire machine Mira (12% of the theoretical peak performance). The paper also presents scalability studies for two idealized test cases that are relevant for NWP applications. The atmospheric model NUMA delivers an excellent strong scaling efficiency of 99% on the entire supercomputer Mira using a mesh with 1.8 billion grid points. This allows to run a global forecast of a baroclinic wave test case at 3km uniform horizontal resolution and double precision within the time frame required for operational weather prediction.
Andreas Müller, work was done at the Naval Postgraduate School, Department of Applied Mathematics, Monterey, CA 93943, USA. Current address: European Center for Medium-Range Weather Forecasts, Shinfield Park, Reading RG2 9AX, UK.
Numerical weather prediction (NWP) has always been considered one of the important computationally intensive uses of supercomputers. Nevertheless there is a big gap between the size of the available supercomputers and the amount of computing power that is used for operational weather prediction. State of the art operational deterministic weather forecasts typically use about 1000 processors (Bauer et al., 2015) with a global resolution of about 10km, whereas the biggest available supercomputers offer more than one million processors allowing more than floating point operations in one second (petascale). One of the reasons for this discrepancy is that many weather models do not scale to this large number of processors and therefore are not able to make good use of these big machines. The National Oceanic and Atmospheric Administration (NOAA) has initiated the High-Impact Weather Prediction Project (HIWPP) with the goal to reach numerical weather prediction at 3km resolution by the year 2020 (Schneider, 2014). Being able to improve the resolution by almost one order of magnitude will allow resolving some of the atmospheric processes explicitly that are currently only described by heuristic approximations (parameterizations). For this reason, it is expected that such a significant improvement in the resolution of weather prediction models will reduce the error and improve the accuracy of weather forecasts significantly.
In this paper, we show that the Non-hydrostatic Unified Model of the Atmosphere, NUMA (Giraldo and Restelli, 2008; Kelly and Giraldo, 2012; Giraldo et al., 2013), is capable of simulating a global baroclinic wave test case within the timeframe required for operational weather prediction at 3km resolution using a uniform global mesh with 31 layers in the vertical direction. We achieve this performance with double precision and without making use of the commonly used shallow atmosphere approximation; in fact, NUMA (within the NEPTUNE modeling system) was the only model studied by NOAA in the HIWPP study that did not use the shallow atmosphere approximation (Whitaker, 2015). Using the deep atmosphere equations instead allows our simulations to include middle and upper atmospheric processes which are important for long-term (seasonal) weather and climate predictions. Furthermore, our code does not assume any special alignment of its mesh with the horizontal and vertical direction which allows the simulation of arbitrary steep terrain. It was possible to reach the desired resolution thanks to a careful optimization of the code and an excellent strong scaling efficiency of 99% on the entire 3.14 million threads of the supercomputer Mira using a mesh with 1.8 billion grid points. To our knowledge, this paper not only presents the first atmospheric model that is capable of reaching the envisioned resolution within operational requirements, but also presents the first published strong scalability study up to petascale of fully compressible 3D global simulations.
Johnsen et al. (2013) present strong scaling efficiency of about 65% at almost 300 TFlops sustained performance on the Cray machine Blue Waters for a Hurricane simulation using 4 billion grid points. Wyszogrodzki et al. (2012) present strong scaling up to cores on the Hopper II system including full parameterizations for moisture using up to 84 million grid points. Strong scaling for the atmospheric model CAM-SE using a spectral element method similar to the one utilized in NUMA is presented by Dennis et al. (2012). CAM-SE is targeted at climate prediction. Dennis et al. report strong scaling up to 172,800 cores on the Cray system JaguarPF using 81 million grid points. Other publications either do not solve the fully compressible Navier-Stokes equations (Tufo and Fischer, 1999; Xue et al., 2014) or show strong scaling only at much smaller scale (Nair et al., 2009). None of these publications is targeted at enabling numerical weather prediction at petascale.
Our paper is organized as follows: the numerical methods are introduced in Section 2. Section 3 presents the two test cases considered for the studies of this paper, Section 4 describes the mesh generation with the p4est library and Section 5 gives some important technical details about the supercomputer Mira that we use for this work. A theoretical performance model is presented in Section 6. The code optimizations are presented in Section 7 and scalability results are shown in Section 8.
2 Numerical Methods
NUMA solves the compressible Navier-Stokes equations which can be written as (see e.g. Müller et al., 2013)
where is the time, is a time dependent vector field containing the so called prognostic variables (air density , 3D wind speed, , and potential temperature, ) and are the coordinates in the three space dimensions. The nonlinear operator F denotes the flux tensor and is a source function.
In the following subsections we illustrate the main steps of the numerical solution of these equations using a spectral element method. In the last subsection of this section we describe two different numerical possibilities to organize the data of our simulation. The two methods are identified as CG storage and DG storage (Abdi and Giraldo, 2016).
2.1 Spatial Discretization
In order to discretize Eq. (1) we introduce a mesh of elements. An example for a 2D cross section of our mesh is illustrated in Fig. 1a. Inside each element we approximate the solution in each dimension by polynomials of order . We indicate with the approximation of . To define these polynomials we introduce a mesh of grid points inside each element and in each direction. To simplify numerical integration we use Lobatto points. We denote the coordinates of these grid points by . We restrict the rest of this section to the special case of because this case is most efficient for vectorization on the supercomputer Mira (see Section 7). We define our polynomials in a reference element over the interval in each direction, with coordinates and grid points . We denote the Jacobian determinant of the coordinate transformation between reference element and physical element at grid point with . The numerical solution inside element is given by
with and the 1D Lagrange basis polynomials are given by
where is the domain of element .
The goal is now to insert Eq. (2) into Eq. (1) and solve it for the values of at the grid points . In this paper we use a spectral element method. From now on, the spectral element method will be often referred to with the acronym CG, from Continuous Galerkin. We multiply Eq. (1) by (including ) and integrate over the entire domain . By using Gauss-Lobatto quadrature with quadrature weights we obtain the following equation:
where , , and are the entries of the diagonal mass matrix.
The spatial derivatives in the divergence of the flux tensor are given with Eq. (2) by:
The products of the values of at the grid points and the derivatives of our basis functions are essentially 44 matrix-matrix-multiplications. All of the derivatives in Eq. (2.1) are computed once at the beginning of the simulation.
The basis functions vanish outside of element . For this reason the sum over all elements in Eq. (4) reduces to a single element for interior grid points. For the grid points along the edges we need to sum over all neighboring elements weighted by the volume of the elements (this summation is called DSS which stands for Direct Stiffness Summation).
2.2 Time Discretization
In order to keep communication between different processors simple we use explicit time integration in the horizontal direction. If the vertical resolution is of the same order of magnitude of the horizontal resolution we use a fully explicit Runge-Kutta scheme with five stages and third order. In each of those five stages we need to evaluate the right hand side of Eq. (4) and communicate the values of the grid points along the process boundaries (blue lines in Fig. 1a).
If the vertical resolution is much finer than the horizontal resolution we organize our mesh in such a way that all the vertical columns of our elements are always computed in the same MPI process. This allows us to make implicit corrections along the vertical columns after an explicit step of a leap-frog scheme. We call this approach 1D-IMEX (IMplicit-EXplicit) (Giraldo et al., 2013).
Spectral element methods require stabilization (Marras et al., 2015a). NUMA allows for the use of different stabilization schemes that range from subgrid-scale models (Marras et al., 2015b) to low-pass filters (Boyd-Vandeven). In this paper we use a Boyd-Vandeven filter. The main idea of this filter is to perform a spectral transformation of the nodal values and to dampen the highest order modes. From a computational point of view this results in multiplying all the values of the element with a filter matrix. Each time the filter is applied the new filtered values need to be communicated between neighboring MPI processes. In the future we will move to Laplacian based stabilization methods which do not require an additional communication step (see Giraldo et al., 2013).
2.4 CG and DG storage
Each MPI process needs to own a copy of values at the grid points along the process boundaries. This is illustrated in Fig. 1b by drawing a gray gap between the different processes. There is only one copy in memory for interior grid points even if they are located on a boundary between different elements (green lines in Fig. 1b). We call this approach CG storage because it requires the solution to be continuous and works only for Continous Galerkin methods.
Another possibility to organize the data is to always store the values along element boundaries for each neighboring element separately (Fig. 1c). We call this approach DG storage because it allows the use of Discontinous Galerkin methods. We compare the performance of these two approaches in a simple performance model in Section 6.
3 Test Cases
Two test cases are considered in this paper. One test case is the baroclinic wave instability problem by Jablonowski and Williamson (2006). This problem is classically used to test the dynamical core of global circulation models (also in HIWPP, cf. Schneider, 2014). It is initialized by a zonal band of high wind speed in the mid-latitudes (jet stream). A Gaussian perturbation of the zonal wind is added. This perturbation leads to wave like meridional perturbations of the jet stream (Fig. 2). After some time the flow pattern looks similar to the polar front jet stream of the real atmosphere.
The other test case is a 3D rising thermal bubble in a box of 1000m in each direction. This test case is initialized with a temperature perturbation in a neutrally stratified atmosphere. The precise definition and analysis of the full simulation is reported in Kelly and Giraldo (2012).
Both test cases are important for NWP applications. Operational weather prediction needs to cover the global circulation on the entire Earth like in the baroclinic wave test case. In order to use a higher resolution, for specific localized features of the atmosphere like hurricanes, one needs to run the simulation in limited area mode like in the 3D rising thermal bubble test case.
4 Mesh generation and load balancing
The data structures and algorithms for parallel mesh generation, partitioning, and load balancing used in our simulations were provided by the p4est library. The p4est library has been used for efficient and scalable parallel adaptive mesh refinement for 2D advection on the sphere (Burstedde et al., 2014), in other applications such as mantle convection and seismic wave propagation (Burstedde et al., 2010), and as a backend for the deal.II finite element library (Bangerth et al., 2015). Our present paper is the first time that p4est is used for full 3D atmospheric simulations.
The p4est library represents two- and three-dimensional domains via a two-level structure, with a macro mesh and a micro mesh. The macro mesh is a conformal quadrilateral or hexahedral mesh, which is encoded as an unstructured mesh that is reproduced on each MPI process. Each element in the macro mesh is then treated as the root of a partitioned quadtree or octree, which recursively refines the macro element isotropically to create a micro mesh. The tree structure is represented in memory as a list of the leaves of the tree, ordered by the Morton curve (also known as the z-curve). This ordering induces a space filling curve that visits the centers of the leaves: while this curve is not a continuous space filling curve, it has many of their nice properties. One important property is that partitioning a domain by dividing the Morton curve into continuous segments creates subdomains that are fairly compact, with low surface-to-volume ratios (Hungershöfer and Wierum, 2002). This means that partitioning by this method keeps the intra-process communication during simulations low. A full description of p4est’s forest-of-quadtree and forest-of-octree data structures and algorithms can be found in Burstedde et al. (2011).
When used in its raw form, the neighborhood information of an element in the micro mesh (i.e., which elements are adjacent) takes time to calculate, where is the number of micro mesh elements in the th partition. To avoid incurring this cost during each time step, the adjacency information for all elements in the th partition can be converted into a lookup table, much like an unstructured mesh. An efficient approach to creating this information, which can also be used to enumerate the nodes for high-order CG finite elements, is described in Isaac et al. (2015a).
The numerical methods in our simulations involve three-dimensional computations, but the radial direction is treated differently from the other two: its grid resolution requirements are different, and achieving efficiency in the IMEX time evolution scheme (and other calculations that are performed only in the radial direction) requires that radial columns of elements and degrees of freedom be contiguous in memory. A forest-of-octrees approach would be ill-suited for these constraints. First, octree refinement is isotropic: the aspect ratio of a macro element is inherited by all of the micro elements created by refinement. This means that the relationship between horizontal and radial resolution would have to be respected at the macro mesh level, increasing the macro mesh’s complexity. Second, the three-dimensional Morton curve does not respect the need to keep radial columns contiguous: elements in a column would be separated in memory, and without care would even be placed in separate partitions.
For these reasons, we want to use a forest-of-quadtrees approach to generate and partition radial columns, but to handle the elements within each column using a different approach. An extension to the p4est library, which was first used in the context of ice sheet modeling (Isaac et al., 2015b), provides the necessary data structures and interface. This extension is a set of “p6est” data types and functions (so named because it uses aspects of the two-dimensional “p4est” interface and the three-dimensional “p8est” interface). Essentially, it treats each radial column as a list, from the bottom to top, of the elements created by recursive bisection of the full column, and uses the existing two-dimensional p4est routines to manage the partitioning of columns and intercolumn interactions. A more complete description of this approach can be found in (Isaac, 2015, Chapter 2). The p6est mesh format is illustrated in Figure 3.
As the elements within a column are defined by recursive bisection, p6est was designed for meshes with elements per column for some . Because NUMA must work with meshes which do not have this property, the p6est format was extended for this work to support an arbitrary number of elements per column. The runtime of the mesh generation is shown in Fig. 4. Even for 43 billion grid points (blue results) it takes always less than 20 seconds runtime to generate the mesh.
It should be noted that, although mesh adaptivity is not used in this work, the p6est data structures support bimodal local mesh adaptivity: elements may be independently refined in the radial direction, and each column can be independently refined into four smaller columns.
5 Blue Gene Q Mira
The simulations presented in this paper were performed on the supercomputer Mira of the Argonne National Laboratory. Mira is an IBM Blue Gene/Q system offering computational nodes. Each of these nodes has 16 cores resulting in a total number of cores. Each core has a quad floating point unit. This permits running up to four MPI processes or up to four OpenMP threads on each core. The maximum total number of hardware threads is therefore .
Important for the performance of our code is the memory architecture of Mira. Each computational node has 16GB of random access memory (RAM). The processor receives its data from RAM through two levels of cache. Each core has its own level 1 (L1) cache of 16KB while the L2 cache of Mira is shared among all 16 processors of the computational node and has a size of 32MB.
In addition to these two levels of cache, each core has a L1 cache prefetch (L1P) of 4KB (Chung et al., 2012). Whenever the data needed for the simulation cannot be found in the L1 cache (L1 cache miss) the computer checks if this data is available in L1P. This is always done for an entire cache line of 128 bytes (16 double precision floating point numbers). If the requested cache line is not in L1P it goes on and looks in L2 cache and eventually in the RAM for this data. The prefetcher of the Blue Gene/Q keeps track of previous cache misses. The stream prefetcher of the BG/Q establishes a stream if consecutive cache lines are requested, i.e. L1P loads the next cache lines even when no cache miss occurs. If this consecutive data is actually needed by the code this can save a lot of runtime. However, if the data is not used consecutively by the code the prefetcher will read a lot of data from L2 cache without ever using it. This can produce a huge number of unnecessary cache misses.
Important for our optimizations is also the vector unit. The registers of Mira have a length of 256 bits offering space for four double precision floating point numbers. This allows performing up to 8 double precision fused multiply-add floating point operations per core within the same clock cycle. Each core can perform one floating point instruction and one integer instruction per clock cycle. Load and store instructions take each one cycle of the integer unit.
6 Performance Model
Making a theoretical performance model allows us to estimate the expected performance and to compare different numerical methods without fully optimizing them in our code. We created the performance model by counting all floating point operations as well as memory read and write accesses throughout our entire code. The results of this theoretical model are presented in Tables 1 to 3. We do this for three possible implementations: our CG version avoids having multiple copies of data in memory wherever possible (as in Fig.1b). The only variables that still require multiple values at the same physical locations are the metric terms in Eq. (2.1). The CG/DG-version uses the same approach like the CG version for dynamically changing variables but allows additional copies of the data at element boundaries (Fig. 1c) for data related to the reference atmosphere that does not change throughout the simulation. The DG-version allows multiple copies of the data along element interfaces (Fig. 1c) for all variables.
Table 1 shows the expected number of floating point operations and total memory read and write traffic throughout the entire simulation for the rising thermal bubble test case. The additional copies of data along the element interfaces in DG storage lead to a significant increase in the number of floating point operations as well as memory traffic and to a significant reduction of arithmetic intensity. Our results for arithmetic intensity show that all of our cases are memory bound (see roofline plot in Fig. 6 and 7). The optimal runtime in Table 1 is the runtime that we expect if we manage to reach 100% of the theoretical peak memory bandwidth as given by the STREAM benchmark in Morozov et al. (2013). The percentage of the theoretical peak performance of the processor is also given in Table 1 for this optimal runtime. Our CG-only version gives us the best performance in all of these results.
|GFlops per node||3007.00||3007.00||4023.19|
|read traffic in GB||2129.42||2537.28||3489.66|
|write traffic in GB||661.83||688.34||1168.69|
|arithmetic intensity in Flops/Bytes||1.08||0.93||0.86|
|optimal runtime in seconds||97.94||113.18||163.45|
|% of theoretical peak of processor||14.99||12.97||12.02|
CG-storage has a big disadvantage that we have not included in Table 1: we have to perform the computations of Eq. (2.1) on a per element basis while data stored in CG-storage is not arranged on a per element basis. This leads to non-contiguous memory access that appears to be random. Having random memory access will not affect the results in Table 1 if most of the data is already in L2 cache. Therefore the results of Table 1 will still be correct if the number of elements per compute core is small. Our goal for the rising thermal bubble problem is to use a very high grid resolution which produces much more data than we can fit into L2 cache. In this case we expect to get a much better estimate for the performance of our code by assuming that none of the data is already in L2 cache and by counting the full cache lines in those cases where CG storage requires us to access only a small portion of that cache line. The results including this estimate for the effect of random memory access are shown in Table 2. Including the effect of random memory access does not change the number of floating point operations but it leads to a significant increase in memory traffic. This makes DG storage now the version with the highest arithmetic intensity and therefore the best percentage of the theoretical peak performance of the processes. The overall runtime of the entire simulation is still slower for DG storage due to the increased number of floating point operations. The winner in terms of overall runtime is our CG/DG-version.
|GFlops per node||3007.00||3007.00||4023.19|
|read traffic in GB||3483.05||3138.46||3682.77|
|write traffic in GB||853.44||879.95||1360.30|
|arithmetic intensity in Flops/Bytes||0.69||0.75||0.80|
|optimal runtime in seconds||152.16||141.00||176.95|
|% of theoretical peak of processor||9.65||10.41||11.10|
All of the results presented so far have been obtained for . Our performance model allows us to compare the optimal performance of different polynomial orders for the three different versions from Table 2. The results are shown in Fig. 5. The runtime per timestep increases with decreasing order due to the additional copies of data for DG storage (Fig. 5b). This effect occurs also in our CG-version because even in this version we still need to store metric terms in DG-storage.
The results look very different when the actual time to solution for the entire simulation is considered (Fig. 5a). The numerical methods considered in this paper require to reduce the time step with increasing order to keep the explicitly resolved part of the simulation stable. This makes low polynomial order much faster than high order. As shown in Fig. 5a we get the most efficient time to solution for convergence order 3 which corresponds to polynomial order . We still choose for the remainder of this paper because we expect to be easier to vectorize on the Blue Gene Q architecture.
Table 3 shows the results of our theoretical performance model for the baroclinic instability test case. As described in the introduction our final goal is to run this test case at 3km resolution on the entire supercomputer Mira. This setup gives us only 20 elements per hardware thread which makes it possible to fit most of the data in each timestep into L2 cache. For this reason we do not estimate the effect of random memory access in this case. Our CG and CG/DG versions are again significantly faster than our DG version. We will use the CG/DG version for our optimizations in the next section because this version gives us a significant advantage for the rising thermal bubble test case in Table 2.
|GFlops per node||61.96||61.96||83.00|
|read traffic in GB||58.55||62.18||94.22|
|write traffic in GB||22.48||22.48||36.69|
|arithmetic intensity in Flops/Bytes||0.76||0.73||0.63|
|optimal runtime in seconds||2.84||2.97||4.59|
|% of theoretical peak of processor||10.64||10.18||8.82|
7 Code Optimizations
The goal of this section is to optimize our CG storage version of NUMA for . To take advantage of the reduced amount of data compared to DG storage we aim at computing as much work as possible on a per grid point basis and try to avoid making computations on a per element basis. The main structure of our code is illustrated in code example 1. Computations that need to be computed element-wise are highlighted in blue. Communication is highlighted in red.
We tried to optimize all parts of our code. We found create_rhs (the computation of the right hand side in Eq.(4), see also code example 1) to be the only function that contains enough floating point operations to allow significant optimizations.
Tables 5 to 8 show performance measurements for different versions of create_rhs. The different versions are explained in Table 4. For the rest of this section we simply refer to the different versions in these tables. Version T gives theoretical expectations from our performance model as presented in the previous section. As in Table 2 we include here our estimate for random memory access for the rising thermal bubble test case. Version O is another theoretical result that is obtained by making some further optimizations. These optimizations consist of avoiding some unnecessary memory access and minimizing memory access by computing metric terms in Eq. (2.1) in each stage of our time integration method. This leads to a significant increase in the number of floating point operations but more importantly it allows a significant reduction in memory traffic. We will try these optimizations in our future work.
All of the measurements in Tables 5 to 8 are taken over the entire simulation. This includes that (version A) requires a larger number of timesteps than (version B). The column “% peak” gives the percentage of the theoretical peak performance of the processor. The column “arith. int.” shows the arithmetic intensity in Flops/Bytes. The column “% max.” shows how close this part of the code is to the maximum performance according to the roofline model for the given arithmetic intensity. The three columns “flops”, “read” and “write” show the total number of flops, read and write traffic summed over the entire simulation on one BG/Q node. The column “vect.” shows how many percent of all floating point operations are vectorized. The column “fma” gives the percentage of fused multiply-add operations among all floating point operations. We have not estimated the optimal number of fused multiply-add operations in our performance model. For this reason we leave the column “fma” empty for the versions T and O. The column “mix” shows the percentage of floating point instructions among all instructions. The column “issue” shows how close this part of the code is to the maximum issue rate of one integer/load/store instruction per cycle per core. The three columns “L1P”, “L2” and “DDR” show how many of the loads hit the L1P buffer, the L2 cache and the DDR memory respectively.
To confirm our theoretical results from Fig. 5 we first tried a fairly high polynomial order of (version A). In agreement with our theoretical results we found that order gives us significantly better time to solution (version B). We have not seen a significant impact on the accuracy of our test cases. We use for the rest of this paper because this order is very well suited for vectorization on Mira (four double precision floating point numbers fit into one register). Another significant speedup was obtained by running four MPI processes per core (version C).
We computed the derivatives of Eq. (2.1) in versions A, B and C for each of the five variables of separately. By merging all these derivatives into one matrix for each direction and element we achieved another significant speedup (version D). We also used the BLAS function dgemm (version E) but without any improvement on the performance.
The rest of our optimizations can be categorized into three main topics which we discuss in the following subsections: compiler optimizations, BG/Q vector intrinsics and OpenMP. At the end of this section we give a short description of possible next steps for further optimization.
7.1 Compiler Optimization
To improve the performance while retaining portability we worked first on enabling better optimization through the compiler. We spent some time on finding the best level of compiler optimization for each function of our code. We found a few functions for which level 3 optimization gave us wrong results. This is not surprising because level 3 compiler optimization is not IEEE compliant.
Many of our operations in create_rhs looked initially like code example 2. The operations were computed for each grid point of the element separately which makes it impossible for the compiler to vectorize the code. This explains the very low fraction of vectorized operations in versions A, B, C and D (column “vect.” in Table 5 to 8).
|A||, 1 MPI process per core|
|B||, 1 MPI process per core|
|C||like B, 4 MPI processes per core|
|D||like C, all derivatives of in one matrix|
|E||like D, use BLAS function dgemm|
|F||like C, optimized for compiler vectorization|
|G||like D, some vector intrinsics|
|H||like C, rewritten using vector intrinsics|
|I||like H, 4 OpenMP threads per core|
|T||theoretical expectation for version I|
|O||like T, further optimized (see text for details)|
|arith.||traffic in GB||flops||instructions||loads that hit|
|runtime||% peak||int.||% max.||GFlops||read||write||vect.||fma||mix||issue||L1P||L2||DDR|
|arith.||traffic in GB||flops||instructions||loads that hit|
|runtime||% peak||int.||% max.||GFlops||read||write||vect.||fma||mix||issue||L1P||L2||DDR|
|arith.||traffic in GB||flops||instructions||loads that hit|
|runtime||% peak||int.||% max.||GFlops||read||write||vect.||fma||mix||issue||L1P||L2||DDR|
|arith.||traffic in GB||flops||instructions||loads that hit|
|runtime||% peak||int.||% max.||GFlops||read||write||vect.||fma||mix||issue||L1P||L2||DDR|
To improve vectorization we changed our code in such a way that the operations are performed for the entire element at once (code example 3). Our measurements for version F show that this simple modification leads to a significant improvement of the vectorization.
7.2 BG/Q Vector Intrinsics
To make even better use of the vector unit we rewrote our function create_rhs by using BG/Q vector intrinsics (code example 4). We first kept the computation of the derivatives unchanged (version G). This gave us another significant speedup. Using vector intrinsics for the entire function create_rhs gave us another minor speedup (version H). This brings create_rhs to an excellent level of about 80% of the maximum attainable performance according to the roofline model (column “% max.” in Table 5 and 7 and Fig. 6 and 7).
The fairly large number of loads that hit L1P buffer and L2 cache seems to be due to the random memory access that CG storage produces. Optimal would be if the prefetcher could bring all data into L1 cache before it is needed. We tried different prefetching strategies and handwritten prefetching but could not improve the performance compared to the default strategy. According to our performance model we still expect CG storage to be significantly faster compared to DG storage even though CG storage makes prefetching very difficult.
OpenMP allows a reduction in the number of MPI processes. This leads for CG storage to a reduced amount of work for some parts of the code (namely the black text in code example 1). However, we need to be very careful to avoid race conditions. In create_rhs, race conditions can occur in the summation over all the elements in Eq. (4). Using OpenMP atomic statements made our code too slow. The best solution that we could find was to reorder the elements inside each MPI process in such a way that different OpenMP threads can never compute neighboring elements at the same time. To ensure this we need to synchronize all threads by using an OpenMP barrier after each element computation. These barriers slow down create_rhs by less than 10% (version I). Nevertheless we obtain in the case of the baroclinic instability a noticeable improvement on the runtime of the entire simulation due to the reduced amount of work for the IMEX corrections in the vertical direction. We obtained the best performance by using 4 OpenMP threads per MPI process (2, 8, 16 and 64 OpenMP threads per MPI process were slower).
Comparing our final version I with our theoretical results (version T) shows reasonably good agreement in the total amount of read and write traffic for the entire timeloop (Table 6 and 8). The measured read and write traffic for create_rhs does not agree very well with our theoretical results. The explanation for this is probably that the L2 cache miss counters that are used to measure the traffic are shared among all hardware threads of the node. Since create_rhs is only a small part of the entire timeloop it can easily happen that our measurements include memory traffic from threads that have not reached create_rhs. This effect would result in too large measurements for the memory traffic. Prefetching on the other hand might lead to data being loaded before the code reaches create_rhs which would result in too low measurements. These effects are not present when measuring the entire timeloop which explains the much better agreement between theoretical results and measurements for the entire timeloop. The number of floating point operations in our theoretical performance model has been tuned to agree with our overall measurements of version I in order to take compiler optimizations into account.
7.4 Next Steps
The roofline plots in Fig. 6 and 7 show that our optimizations have given us a massive improvement in the number of GFlops per second per node which has brought us very close to the maximum attainable performance at the given arithmetic intensity. Our measurements show also that our final version I achieves an excellent level of vectorization (98.6% of all floating point operations are vectorized). In order to improve performance further we need to increase arithmetic intensity by reducing the overall amount of memory traffic. We have found some unnecessary memory access which we will remove in our future work and we have found that we could reduce memory traffic by recomputing the metric terms in Eq. (2.1) in each stage of our time integration method. These optimizations have been included in our performance model and are shown in version O in Table 5 to 8. According to our performance model we should be able to achieve another factor 2 of speedup for create_rhs. This should bring us to about 35% of the theoretical peak performance of the processor. Recomputing the metric terms in each time integration stage should also improve the very low percentage of floating point instructions among all instructions (column “mix” in Table 5 to 8). We will try these optimizations in our future work.
8 Strong Scaling Results
We present in this section strong scaling results up to the entire machine Mira for the baroclinic wave test case (Fig. 8, 9 and 11) and the rising thermal bubble test case (Fig. 10 and 12). All these results use version I from Section 7.
The runtime of the entire simulation for the baroclinic wave test case is shown in Fig. 8. The dynamics of a one day forecast needs to be finished within less than about 4.5 minutes runtime (more than 320 model days per wallclock day). We reach this goal on the entire machine Mira for our 3.0km uniform horizontal resolution simulation of the baroclinic wave test case which takes 4.15 minutes runtime per one day forecast (346.6 model days per wallclock day).
The strong scaling efficiency of the simulations in Fig. 8 is shown in Fig. 9 for the different parts of the code. The entire code reaches a strong scaling efficiency of 99.1% on the entire machine Mira. The parts create_rhs and filter show a scaling efficiency of more than 100%. This is not surprising because the problem fits better into L2 cache with increasing number of threads and at the same time the time spent in our OpenMP barriers is decreasing. The IMEX part gives us the lowest scaling efficiency. We still need to understand the reason for this behavior.
The lowest scaling efficiency for the entire simulation is obtained for 2.1 threads. This is due to non-optimal load balancing. The number of elements per thread is always perfectly balanced for all results shown in this paper. However, the arrangement of these elements can vary which leads for CG storage to variations in the number of grid points. So far we use the very simple mesh partitioning built into the p4est library. We expect to be able to improve this result by using more advanced mesh partitioning algorithms.
The strong scaling efficiency of the rising thermal bubble test case is shown in Fig. 10. We achieve 99.7% strong scaling efficiency on the entire machine for this case. We use a much larger total number of grid points of about 43 billion grid points for this case because we plan to use our code for hurricane and cloud simulations at this kind of problem size. We have not optimized the memory usage of our code. The smallest number of threads that can handle this problem is currently 7.7. We expect to be able to reduce the memory usage of our code significantly. Also we need to understand the reason why the simulation using 1.6 shows a reduced performance.
The percentage of the theoretical peak performance in terms of floating point operations is shown in Fig. 11 and 12. Not surprisingly we obtain the best performance for our optimized part create_rhs. For the baroclinic wave test create_rhs reaches 1.21 PFlops (12.1% of peak) and for the rising thermal bubble test it reaches 1.28 PFlops (12.8% of peak) on the entire machine. The sustained performance of the entire simulation is at 0.55 PFlops for the baroclinic wave test and at 0.70 PFlops for the rising thermal bubble test on the entire machine Mira.
In this paper, we present the optimization and performance analysis of the atmospheric model NUMA. Our optimizations have improved the performance of our code by a factor of about 10 and have brought us very close to the maximum attainable performance due to the peak memory bandwidth (Fig. 6 and 7). These optimizations allow us to perform most of the computations at 1.2 PFlops on the entire supercomputer Mira by using BG/Q vector intrinsics. The sustained performance of the entire simulation is at 0.70 PFlops for a rising thermal bubble test case using explicit time integration. We have not optimized all parts of the code yet. For the baroclinic wave test the non-optimized computations for the implicit part of the time integration lead to a slightly lower sustained performance of 0.55 PFlops. We expect to improve our performance significantly by optimizing the remaining non-optimized parts of our code.
We have shown that NUMA achieves a near perfect strong scaling efficiency of 99.7% for the rising thermal bubble test case using 43 billion grid points on the entire 3.14 million threads of Mira. For the baroclinic wave test case on the sphere we obtain a strong scaling efficiency of 99.1% using a mesh with 1.8 billion grid points. This allows us to compute a one day forecast at 3.0km resolution within 4.15 minutes runtime and fulfills the requirements for operational weather prediction (less than 4.5 minutes runtime for the dynamics of a one day forecast).
As explained in the introduction, we expect this massive increase in resolution to be a major step towards more accurate weather forecasts. Nevertheless, the demand to increase the resolution of NWP models does not end at 3km resolution (Bauer et al., 2015). The demand for better performance is even more severe when high resolution climate prediction is considered. Climate prediction requires forecast periods of more than one hundred years. To simulate such a long period of time at a resolution of 3km would still require about one year of runtime on the entire machine Mira when tracers and physics parameterizations are taken into account. For this reason we need to continue to work on improving the performance of our code and to optimize it for next generation supercomputers.
Our analysis in this paper shows that we need to reduce the amount of memory traffic to further improve our performance. It should be possible to achieve this by recomputing metric terms in each stage of our time integration method. For this reason, we expect to improve our sustained performance of the entire simulation beyond 1 PFlops. This should allow us to reach a uniform horizontal resolution close to 2km within operational requirements. The next goal will be the optimization of our code for the upcoming next generation supercomputer Aurora at the Argonne National Laboratory. Aurora is expected to achieve a peak performance of 180 PFlops (18 times more than Mira). We hope to be able to reach 1km resolution for global numerical weather prediction once Aurora is available and once our code is fully optimized for that machine.
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We would like to thank Vitali Morozov at the Argonne National Laboratory for his support in analyzing the performance of our code with the Hardware Performance Monitor Toolkit. AM, MK, and SM are grateful to the National Research Council of the National Academies.
The Authors declare that there is no conflict of interest.
This work was supported by the Office of Naval Research [PE-0602435N]; the Air Force Office of Scientific Research (Computational Mathematics program); and the National Science Foundation (Division of Mathematical Sciences) .
- Even though only create_rhs is changed between the different simulations in this table the runtime saved for the entire timeloop is larger than the time saved in create_rhs. The reason for this behavior is that the reduced runtime of create_rhs leads to an improved synchronization between different MPI processes which reduces the time spent in MPI communication.
- Abdi DS and Giraldo FX (2016) Efficient construction of unified continuous and discontinuous Galerkin formulations for the 3D Euler equations. J. Comput. Phys. 320: 46–68.
- Bangerth W, Heister T, Heltai L, Kanschat G, Kronbichler M, Maier M, Turcksin B and Young T (2015) The deal.II library, version 8.2. Archive of Numerical Software 3(100).
- Bauer P, Thorpe A and Brunet G (2015) The quiet revolution of numerical weather prediction. Nature 525(7567): 47–55.
- Burstedde C, Calhoun D, Mandli K and Terrel AR (2014) ForestClaw: Hybrid forest-of-octrees AMR for hyperbolic conservation laws. In: Bader M, Bode A and Bungartz HJ (eds.) Parallel Computing: Accelerating Computational Science and Engineering (CSE). Amsterdam: IOS Press BV, pp. 253–262.
- Burstedde C, Ghattas O, Gurnis M, Isaac T, Stadler G, Warburton T and Wilcox L (2010) Extreme-scale AMR. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, pp. 1–12.
- Burstedde C, Wilcox LC and Ghattas O (2011) p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees. SIAM J. Sci. Comput. 33(3): 1103–1133.
- Chung IH, Kim C, Wen HF and Cong G (2012) Application data prefetching on the IBM Blue Gene/Q supercomputer. In: High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, pp. 1–8.
- Dennis JM, Edwards J, Evans KJ, Guba O, Lauritzen PH, Mirin AA, St-Cyr A, Taylor MA and Worley PH (2012) CAM-SE: A scalable spectral element dynamical core for the Community Atmosphere Model. International Journal of High Performance Computing Applications 26(1): 74–89.
- Giraldo FX, Kelly JF and Constantinescu EM (2013) Implicit-explicit formulations of a three-dimensional nonhydrostatic unified model of the atmosphere (NUMA). SIAM J. Sci. Comput. 35(5): B1162–B1194.
- Giraldo FX and Restelli M (2008) A study of spectral element and discontinuous Galerkin methods for the Navier-Stokes equations in nonhydrostatic mesoscale atmospheric modeling: Equation sets and test cases. J. Comput. Phys. 227(8): 3849–3877.
- Hungershöfer J and Wierum JMM (2002) On the quality of partitions based on space-filling curves. In: Computational Science — ICCS 2002. Springer, pp. 36–45.
- Isaac T, Burstedde C, Wilcox LC and Ghattas O (2015a) Recursive algorithms for distributed forests of octrees. SIAM J. Sci. Comput. 37(5): C497–C531.
- Isaac T, Stadler G and Ghattas O (2015b) Solution of nonlinear Stokes equations discretized by high-order finite elements on nonconforming and anisotropic meshes, with application to ice sheet dynamics. SIAM J. Sci. Comput. 37(6): B804–B833.
- Isaac TG (2015) Scalable, adaptive methods for forward and inverse problems in continental-scale ice sheet modeling. PhD Thesis.
- Jablonowski C and Williamson DL (2006) A baroclinic instability test case for atmospheric model dynamical cores. Q. J. R. Meteor. Soc. 132(621C): 2943–2975.
- Johnsen P, Straka M, Shapiro M, Norton A and Galarneau T (2013) Petascale WRF simulation of hurricane sandy: Deployment of NCSA’s cray XE6 blue waters. In: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, pp. 1–7.
- Kelly JF and Giraldo FX (2012) Continuous and discontinuous Galerkin methods for a scalable three-dimensional nonhydrostatic atmospheric model: limited-area mode. J. Comput. Phys. 231(24): 7988 – 8008.
- Marras S, Kelly JF, Moragues M, Müller A, Kopera MA, Vázquez M, Giraldo FX, Houzeaux G and Jorba O (2015a) A Review of Element-Based Galerkin Methods for Numerical Weather Prediction: Finite Elements, Spectral Elements, and Discontinuous Galerkin. Archives of Computational Methods in Engineering : 1–50.
- Marras S, Nazarov M and Giraldo FX (2015b) Stabilized high-order Galerkin methods based on a parameter-free dynamic SGS model for LES. J. Comput. Phys. 301: 77–101.
- Morozov V, Kumaran K, Vishwanath V, Meng J and Papka ME (2013) Early Experience on the Blue Gene/Q Supercomputing System. In: IEEE Xplore. Argonne Nat. Lab., Argonne, IL, USA, IEEE, pp. 1229–1240.
- Müller A, Behrens J, Giraldo FX and Wirth V (2013) Comparison between adaptive and uniform discontinuous Galerkin simulations in dry 2D bubble experiments. J. Comput. Phys. 235(0): 371 – 393.
- Nair RD, Choi HW and Tufo HM (2009) Computational aspects of a scalable high-order discontinuous Galerkin atmospheric dynamical core. Comput. Fluids 38(2): 309 – 319.
- Schneider T (2014) High-Impact Weather Prediction Project (HIWPP). Technical report, NOAA. URL http://web.archive.org/web/20150221115209/http://hiwpp.noaa.gov/docs/HIWPP_ProjectPlan_Public.pdf.
- Tufo HM and Fischer PF (1999) Terascale spectral element algorithms and implementations. In: Proceedings of the 1999 ACM/IEEE conference on Supercomputing. ACM, p. 68.
- Whitaker J (2015) HIWPP non-hydrostatic dynamical core tests: Results from idealized test cases. Technical report, NOAA. URL http://web.archive.org/web/20151003234909/http://www.nws.noaa.gov/ost/nggps/DycoreTestingFiles/HIWPP_idealized_tests-v8%20revised%2005212015.pdf.
- Wyszogrodzki AA, Piotrowski ZP and Grabowski WW (2012) Parallel implementation and scalability of cloud resolving EULAG model. In: Parallel Processing and Applied Mathematics. Springer, pp. 252–261.
- Xue W, Yang C, Fu H, Wang X, Xu Y, Gan L, Lu Y and Zhu X (2014) Enabling and scaling a global shallow-water atmospheric model on tianhe-2. In: Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, pp. 745–754.