Porting Large HPC Applications to
GPU Clusters: The Codes GENE and VERTEX
We have developed GPU versions for two major high-performance-computing (HPC) applications originating from two different scientific domains. GENE [1, 2] is a plasma microturbulence code which is employed for simulations of nuclear fusion plasmas. VERTEX [3, 4, 5] is a neutrino-radiation hydrodynamics code for ”first principles”-simulations of core-collapse supernova explosions [6, 7, 8]. The codes are considered state of the art in their respective scientific domains, both concerning their scientific scope and functionality as well as the achievable compute performance, in particular parallel scalability on all relevant HPC platforms. GENE and VERTEX were ported by us to HPC cluster architectures with two NVidia Kepler GPUs mounted in each node in addition to two Intel Xeon CPUs of the Sandy Bridge family. On such platforms we achieve up to twofold gains in the overall application performance in the sense of a reduction of the time to solution for a given setup with respect to a pure CPU cluster. The paper describes our basic porting strategies and benchmarking methodology, and details the main algorithmic and technical challenges we faced on the new, heterogeneous architecture.
A]\fnmsTilman \snmDannert, A]\fnmsAndreas \snmMarek and A]\fnmsMarkus \snmRampp††thanks: Corresponding Author: Markus Rampp; E-mail: firstname.lastname@example.org.
PU, HPC application, GENE, VERTEX
With GPU hardware and the corresponding software environments becoming mature, compute clusters with GPU-accelerated nodes establish as a new, powerful platform for high-performance computing (HPC). Mainly motivated by the expected boost for application performance (i.e. reducing ”time to solution”) and also by energy-efficiency considerations (i.e. reducing ”energy to solution”), major research organizations and providers of HPC resources have already deployed an appreciable amount of GPU-accelerated resources worldwide . Moreover, GPU-like architectures are expected to play a major role in the upcoming exascale era .
It is well known, however, in the community, that the new hardware architecture together with the apparently disruptive programming models pose substantial challenges to scientific application developers (e.g. ). While selected algorithms and applications have in fact been demonstrated to keep up with the shiny performance promises of GPUs, in some cases even at the very large scale (e.g. ), it remains to be seen whether a broader class of scientific applications can take advantage of GPU-accelerated systems with reasonable programming effort and in a sustainable way. Often inappropriately termed ”legacy applications” in this context, leading scientific HPC codes are typically being actively developed, comprise many tens or hundreds of thousands of lines of code achieved by a team effort of many dozens of person years, they provide state-of-the-art functionality, as well as high optimization, parallel scalability and portability. The codes GENE111GENE is developed by the group of F. Jenko (Max-Planck-Institute for Plasma Physics) and VERTEX222VERTEX is developed by the group of H.-Th. Janka (Max-Planck-Institute for Astrophysics), which have been developed in the Max-Planck-Society with continuous support from its high-performance Computing Centre (RZG) may serve as prototypical examples in this respect. But as a matter of fact, such highly tuned codes are often reaching the limits of (strong) scalability. For example, due to increasing inter-node communication times, or, as in the case of VERTEX, due to the lack of conventionally exploitable parallelism in the code structure, the time to solution for a given setup can no more be efficiently reduced by utilizing more CPU resources. Thus, a significantly increased node performance due to accelerators appears as a promising route towards further boosting application performances at scale.
Although both, GENE and VERTEX are written in FORTRAN (with MPI and hybrid MPI/OpenMP parallelization, respectively) we decided to adopt the C-based CUDA programming model because it is the performance reference for NVidia GPUs. While we found the commercial CUDA-FORTRAN language to deliver competitive performance on the GPU, the employed PGI compiler falls behind the Intel compiler (which is our reference for the CPU) on the remaining CPU parts which marginalizes the overall application speedups of the heterogeneous code. For the same reason we did not yet make productive use of the OpenACC programming model .
The performance baseline for all comparisons is defined by a highly optimized, parallel CPU implementation of the respective algorithms. Rather than quoting single-core speedups (which, in our opinion is hardly meaningful in most cases) our comparisons are always based on the same number of GPU cards and multicore CPUs (”sockets”). Specifically, we compare the run time obtained on a certain number of nodes, each equipped with two Intel Xeon E5-2670 8-core CPUs and two NVidia K20X GPUs, with the run time measured with the original, parallel CPU code on the same number of nodes (without GPUs).
2 The GENE code
GENE [1, 2] is a massively parallel code for the simulation of plasma turbulence in fusion devices. The code solves the time-dependent, five-dimensional Vlasov-Maxwell system of equations on a fixed phase-space grid. Depending on the physical problem, typical GENE simulations take between a few days and many weeks, using thousands of cores on x86_64-based HPC systems. GENE is open-source  and has a world-wide user base.
2.1 Code structure
The GENE algorithm employs coordinates aligned to the magnetic field lines in a fusion device like a tokamak. In this paper we use the so called x-global version, where all physical quantities are handled in a spectral representation with respect to the coordinate, which is the second of the three space dimensions (radial), (binormal) and (along the field line). The remaining phase-space coordinates are (in this order) the velocity along the field line and the magnetic moment (see  for details). Although GENE is able to handle any number of ion species and the electrons in the framework of gyrokinetics, we use for this paper only a single ion species, neutralized by electrons. For all performance comparisons a problem setup with a number of grid points is used.
The starting point of this work was a profiling of the GENE code (SVN revision 3440), with the times given in Table 2.1.
|region 1 SNB 2 SNB time loop 13.5s 6.9s field solver * 2.0s * 1.1s rhs computation * 8.9s * 4.4s nonlinearity ** 5.4s ** 2.7s||1 SNB 2 SNB CPU only 4.4s 2.1s Fermi M2090 7.4s 3.5s Kepler K20X 4.1s 2.3s|
|Table 1: Profiling of GENE rev. 3440 on Sandy Bridge (SNB) sockets. The symbol * indicates the nesting level of the performance region.||Table 2: Performance of the nonlinearity with GPU acceleration.|
The times of Table 2.1 show that the computation of all terms of the right-hand side (rhs) of the Vlasov equation consumes more than 60% of the computing time in GENE, with the computation of the (quadratic) nonlinearity dominating the other terms.
The computation of the nonlinearity follows the usual approach  to avoid the computationally expensive convolutions in spectral space by multiplication of the two fields corresponding to the quadratic term in real space after Fourier transforming them. Specifically, the fields are preprocessed (transposition, extension in direction for dealiasing according to the -rule) in a first step, and a fast Fourier transform to real space is applied. After multiplication the result is transformed back to the spectral representation, the additional modes are deleted, the array is transposed and multiplied with a prefactor, and is finally added to the right-hand side vector.
A high level of data parallelism () can be realized, as the algorithm for the nonlinearity depends on the remaining phase-space coordinates (, and and the number of ion species) in a parametric way. This thread concurrency, however, competes with regular MPI parallelism on the CPU: A larger number of MPI ranks leaves less thread parallelism for the GPU and vice versa.
2.2 Algorithmic details and GPU implementation
The nonlinear term needs as input four arrays, the and component of the -velocity and the derivatives of the distribution function with respect to and . Additionally it needs the already computed right-hand side, to which the result is added after multiplication with a prefactor.
To minimize the transfer costs, we split each input array into (usually four) contiguous chunks of -planes, which are transferred and computed in two asynchronous CUDA streams. Additional asynchronicity comes from the fact that all MPI tasks assigned to one CPU socket share one GPU. This also accounts for an overlap of different kernels and transfer and helps to fully utilize the GPU card.
|transpose||transposes in - dimension|
|extend_for_dealiasing||add 50% of zero modes which are filled by the multiplication|
|shrink_for_dealiasing||after multiplication, remove the added modes to get the dealiased solution|
|compute_nonlinearity||multiplication in real space of the two arrays according to the structure of the nonlinearity|
|multiply_with_prefactor||multiply the result with a prefactor and add to the original right-hand-side|
|CUFFT||use this library for the Fourier transformations|
From profiling with the NVidia visual profiler, we find that all of our kernels have a good coalescing memory access pattern and show a high utilization of the GPU (occupancy 65%). We also find that more than half of the GPU run time is used by the Fourier transforms from the CUFFT library (53%), followed by transpose (17.9%), extend_for_dealiasing (11.7%), and compute_nonlinearity (8.8%).
2.3 Performance results
The performance results for a Sandy Bridge CPU (8 cores) with Fermi or Kepler GPU is shown in Table 2.1. So far, we achieve only a very moderate acceleration of the code by using a GPU. To get a deeper insight, we use the roofline performance model . We define as performance metric for the nonlinearity of GENE the number of -planes computed per second (unit: ) and as metric for the bandwidth, we use the number of -planes transferred from host to device via the PCI express bus (unit: ).
Having defined the target metrics, it is necessary for the model to assess a peak value (or ceiling) for these two metrices. The ceiling for the bandwidth can be computed from the measured values of the bandwidth of the PCIe bus, which is (for PCIe 2.0). Therefore we get in our units (using , the size of a double-complex representation)
The peak value for the computing performance is determined by measuring the number of computed -pages, assuming the data is already present on the GPU. Hence, after transferring all data to the device, synchronizing the GPU with all MPI ranks, we time only the computation of the -planes on the GPU. This number depends on the quality of the kernels and not on the transfer and hence can define a ceiling for the computing performance.
In a next step we switch on the asynchronous CUDA streams to hide the transfer and measure the achieved total performance. The transfer time decreases significantly, hence doubling the transfer bandwidth which gives a much higher ceiling in the roofline graph (the solid, inclined bandwidth line). The computational intensity of the algorithm is (indicated by the dotted blue vertical line in the figure), as we need 6 transferred planes (5 input, 1 output) to compute one result -plane .
For the K20X, the algorithm is still in the bandwidth-limited region, hence we cannot exploit the full computational power of the GPU, whereas for the older model (M2090) the limitation comes from the computational performance of the GPU. In the latter case, faster kernels would help to get a better performance, but for obvious reasons, we do not put effort in optimizing for obsolete hardware. This roofline model shows clearly that when using the PCIexpress bus, gen. 2.0, the performance of the nonlinearity on a Kepler K20X card is bound by the data transfer between host and device. Hence, a further optimization of the kernels will not help to improve the overall performance. This analysis also shows a way for further improvements. One can try to overcome the bandwidth limit by moving more computation to the GPU, while keeping the amount of transfer constant, or one uses the next generation 3 of the PCI bus, which nearly doubles the bandwidth. In both cases, the algorithm will become compute-bound and reaches the compute ceiling. Note that contrary to the original roofline model , this ceiling is not defined “top-down” from the nominal peak performance of the hardware, but is given by the actual kernel performance. Using such an application-specific metric facilitates the determination of the computational intensity in our case. It is justified, provided that the analysis is confined to the bandwidth-limited region of the roofline diagram.
3 The VERTEX code
VERTEX [3, 4] is a massively parallel, multi-dimensional neutrino-radiation hydrodynamics code for simulating core-collapse supernova explosions [6, 7, 8]. Typical model runs require between a few months on 256 cores (for two-dimensional, axisymmetric simulations) and up to 64 000 cores (for the latest generation of three-dimensional models) on HPC systems based on x86_64 processors.
3.1 Code structure
VERTEX employs a spherical grid and a hybrid MPIOpenMP parallelization, based on a standard domain decomposition. Each MPI domain is further divided into angular ’rays’ (see Fig. 2a) of computational work by virtue of a coarse-grained OpenMP parallelization.
Figure 2a schematically shows the division of the computational work into MPI tasks and OpenMP threads (’rays’), together with the mapping on the hardware. The figure also sketches the execution flow of the application, with different threads working independent of each other during a time-step.
Table 5a shows that the run time of VERTEX on the CPU is dominated by solving the radiative transfer equations (item ’transport’), and in particular for computing neutrino absorption and emission rates (item ’rates’). Fig. 2a identifies the positions of the individual routines in the execution flow. About 50 percent of the run time is spent in the computation of one particular interaction rate (named ’rate kernel’, ’C2’). The different interaction rates are often termed ”local physics”, which expresses the fact that the computations are to a high degree independent of each other and provide a data parallelism on the grid level. Different interaction processes (’rates’) can be computed independently of each other, which implies additional, coarse-grained parallelism on the function level (see blowup in Fig. 2a).
3.2 Algorithmic details and GPU implementation
In the following the algorithm for offloading the ’rate kernel’ to the GPU is outlined. Due to its dominance in the code, high data parallelism and arithmetic intensity the suitability for the GPU shall become immediately apparent.
As input for the computations a few one-dimensional arrays are needed, which represent the local thermodynamic conditions for which the interaction kernel is evaluated. All operations are performed on a five-dimensional grid representing discretized phase space. The size of this grid varies with the resolution, in a typical setup the total number of grid points is about . For the major part of the kernel, computations on each grid zone can be done independently of the others, which leads to a high degree of data parallelism (up to threads). Only after all grid zones are processed, a reduction (corresponding to a phase space integral) to a three dimensional grid is performed. This can be still done in parallel, but with much less parallelism ( threads). All computations are done twice for subsets of different input data, accounting for two possible reaction channels.
The actual implementation of the part ’C2’ is straightforward: the data is copied asynchronously to the GPU and the five-fold nested loops of the CPU version are separated in kernel calls with about 100 000 threads. The kernels are scheduled in streams, in order to allow the CUDA run time to overlap kernel executions corresponding to the twofold computation of the processes. The problem is compute bound, as data transfer is negligible (0.9 ms) compared to GPU computations (40 ms) and at least 140 double-precision floating-point operations are executed per transferred byte.
For good performance results it turned out to be crucial to use shared memory for the input data and to use as much registers as possible on the device. After tuning our CUDA code with the help of the NVidia profiler, we achieve an occupancy of 93% of the theoretical upper limit for the most important kernels. However, we still encounter about 10% of branch-divergence overhead and 25% of global memory replay overhead. Work is still ongoing to improve on the latter performance metrics.
As mentioned above, the different sub-steps ’C1’ to ’Cn’ (see Fig. 2b) are independent of each other and can be computed in any order within one OpenMP thread, or ’ray’. In the original code, however, the order across different OpenMP threads is always the same, e.g. when a thread computes sub-step ’C1’, also the other threads work on the same sub-step. An overlap of computations on the cores and the GPU was thus achieved by: a) individually shuffling the computations of the sub-steps ’C1’ to ’Cn’ on each ’ray’, and b) ensuring that the sub-step ’C2’ from each ’ray’ to the GPU is offloaded in a queue (see Fig. 2b). In an ideal situation where all steps ’C1’ to ’Cn’ take the same amount of execution time, work on the CPUs and the GPU would be perfectly overlapped. In reality, a balancing of about 80% could be reached.
3.3 Performance results
The rate kernel ’C2’ requires 2.16 s on one CPU thread (cf. Tab 5) and scales almost perfectly with OpenMP. The same kernel can be computed on the GPU in 0.04 s. Thus, with one GPU, speedups of 7 or 54 are achieved when comparing with one CPU socket or a single core, respectively. This demonstrates that a significant speedup was achieved with respect to a Sandy Bridge CPU. As the coarse grained OpenMP parallelization of VERTEX (which is crucial for achieving its excellent weak scalability) does not allow to use the threaded rate kernel on the CPU, the acceleration factor of 54 applies for production applications which effectively eliminates the rate kernel from the computing time budget and in practice accounts for a twofold acceleration (corresponding to the original 50% share of the rate kernel, cf. Tab 5) of the entire application.
4 Summary and Conclusions
With the specific cases of GENE and VERTEX we have shown that complex HPC applications can successfully be ported to heterogeneous CPU-GPU clusters. Besides writing fast GPU code, exploiting and balancing both the GPU and the CPU resources of the heterogeneous compute nodes turned out to be an essential prerequisite for achieving good overall ”speedups”, which we define as the ratio of the run time obtained on a number of GPU-accelerated sockets and the run time measured with parallel code on the same number of CPU sockets.
In the case of VERTEX we have demonstrated twofold speedups which hold for production applications on GPU-clusters with many hundreds of nodes. In particular, the excellent weak scalability of VERTEX  is not affected by the additional acceleration due to GPUs. Threefold speedups appear in reach but would require at least additional porting of a linear solver for a block-tridiagonal system. Limitations in the software environment (lack of device-callable LAPACK functionality) have so far impeded a successful port of this part of the algorithm. Importantly, due to the specific code structure of VERTEX, such speedups would not have been possible with comparable programming effort by simply using more CPU cores.
The performance of GENE is currently limited by the data transfer between the host CPU and the GPU as we have shown by an elaborate performance-modeling analysis. After this bottleneck will have relaxed by upcoming hardware improvements (PCIe 3) further optimization efforts on the GPU code will increase the overall speedups on a heterogeneous cluster.
The question whether the effort of several person-months, which we have invested for each code, and which we consider typical for such projects, is well justified cannot be answered straightforwardly. For complex, and ”living” scientific HPC codes, for which GENE and VERTEX can serve as prototypical examples, achieving up to threefold speedups in overall application performance appears very competitive . Also from the point of view of hardware investment (buying GPUs instead of CPUs) and operational costs (”energy to solution”) the migration of applications from pure CPU machines to GPU-accelerated clusters can be considered cost-effective if speedups of at least about two are achieved. On the other hand, while very valuable for increasing simulation throughput, twofold or threefold application speedups usually do not enable qualitatively new science objectives. For this reason we sometimes observe reluctance in the scientific community to invest significant human resources for achieving GPU-performance improvements in this range. This is further exacerbated by legitimate concerns about sustainability, maintainability and portability of GPU-kernel code. These are no serious issues for GENE and VERTEX, where the parts we have ported to the GPU are not under heavy algorithmic development and were also carefully encapsulated by us. In general, however, the need for kernel programming, which is considered as a pain by many, currently appears as the largest hurdle for a broader adoption of GPU programming in the scientific HPC community. Moreover, it may turn out necessary to port significant parts of the application code to the GPU, e.g. in cases like GENE where the data transfers become a limiting factor, or even to completely reimplement the application.
These concerns could be mitigated by the establishment of a high-level, directive based programming model, based e.g. on the OpenACC standard  or a future revision of OpenMP , together with appropriate compiler support. Also Intel’s Xeon Phi many-core coprocessor with its less disruptive programming model appears very prospective in this respect. Despite serious efforts, however, we were not yet successful with GENE or VERTEX to achieve performances on this platform which are competitive with the GPU. We attribute this mostly to a comparably lower maturity of the Xeon Phi software stack and we expect improvements with upcoming versions of the compiler and the OpenMP run time.
Most importantly, today’s GPUs (and many-core coprocessors) might provide a first glimpse on the architecture and the related programming challenges of future HPC architectures of the exascale era . Applications need to be prepared in time for the massive SIMT and SIMD parallelism which is expected to become prevalent in such systems. Even on contemporary multicore CPUs with comparably moderate thread-counts and SIMD width, the experience we have gained with porting GENE and VERTEX has already led to appreciable performance improvements of the CPU codes.
We thank F. Jenko and H.-Th. Janka for encouraging the development of GPU versions for GENE and VERTEX, respectively. NVidia Corp. and Intel Corp. are acknowledged for providing hardware samples and technical consulting.
-  Jenko, F., Dorland, W. Kotschenreuther, W., Rogers, B. N., Electron temperature gradient driven turbulence. Phys. Plasmas 7, 1904 (2000).
-  Görler, T., Lapillonne, X. et al., The global version of the gyrokinetic turbulence code GENE. Journal of Computational Physics 230, 7053 (2011).
-  Rampp, M., Janka, H.-Th. Radiation hydrodynamics with neutrinos: Variable Eddington factor method for core-collapse supernova simulations. Astronomy & Astrophysics, 396, 361 (2002).
-  Buras, R., Janka, H.-Th., Rampp, M., Kifonidis, K. Two-dimensional hydrodynamic core-collapse supernova simulations with spectral neutrino transport. I. Numerical method and results for a 15 M star. Astronomy & Astrophysics 447, 1049 (2006).
-  Hanke, F., Müller, B., Wongwathanarat, A., Marek, A., Janka, H.-Th. SASI Activity in Three-Dimensional Neutrino-Hydrodynamics Simulations of Supernova Cores. Astrophysical Journal 770, 66 (2013).
-  Janka, H.-Th. Explosion Mechanisms of Core-Collapse Supernovae. Annual Review of Nuclear and Particle Science. 62, 407 (2012).
-  Burrows, A. Colloquium: Perspectives on Core-Collapse Supernova Theory. Reviews of Modern Physics 85, 245 (2013).
-  Cardall, C., Endeve, E., Budiardja, R. D., Marronetti, P., Mezzacappa, A. Towards the Core-Collapse Supernova Explosion Mechanism. Advances in computational astrophysics: methods, tools, and outcome. ASP Conference Proceedings, 453, 81 (2012).
-  The Top500 List of supercomputers, http://www.top500.org.
-  Dongarra, J., Beckman, P. et al. The International Exascale Software Roadmap. International Journal of High Performance Computer Applications, 25(1), 2011, ISSN 1094-3420.
-  Shimokawabe, T., Aoki, T. et al. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, ACM New York, NY, US (2011) 3:1–3:11
-  The OpenACC standard. http://www.openacc-standard.org/
-  The GENE code. http://gene.rzg.mpg.de/.
-  Canuto, C., Hussaini, M. Y., Quarteroni, A., Zang, T. A., Spectral Methods. Vol. 1, Fundamentals in Single Domains. Springer, Berlin (2006).
-  Williams, S., Waterman, A. and Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM 52, 65 (2009).
-  Janka, H.-Th. Explosion Models of Core-Collapse Supernovae. Hirschegg, (2012), http://theorie.ikp.physik.tu-darmstadt.de/nhc/pages/events/hirschegg/2013/talks/Wed/Janka.pdf
-  Marek, A., Rampp, M., Hanke, F., Janka, H.-Th. Towards Petaflops Capability of the VERTEX Supernova Code. This conference volume (2013).
-  Accelerating Computational Science Symposium 2012. Washington, US, (2012). https://www.olcf.ornl.gov/wp-content/uploads/2012/07/ACSS_Final_v3.pdf
-  Stotzer, E. et al. Technical Report on Directives for Attached Accelerators (2012), http://www.openmp.org/mp-documents/TR1_167.pdf