Porting Large HPC Applications to
GPU Clusters: The Codes GENE and VERTEX
Abstract
We have developed GPU versions for two major high-performance computing (HPC) applications originating from two different scientific domains. GENE [1, 2] is a plasma microturbulence code which is employed for simulations of nuclear fusion plasmas. VERTEX [3, 4, 5] is a neutrino-radiation hydrodynamics code for "first principles" simulations of core-collapse supernova explosions [6, 7, 8]. The codes are considered state of the art in their respective scientific domains, both concerning their scientific scope and functionality and the achievable compute performance, in particular parallel scalability, on all relevant HPC platforms. GENE and VERTEX were ported by us to HPC cluster architectures with two NVidia Kepler GPUs mounted in each node in addition to two Intel Xeon CPUs of the Sandy Bridge family. On such platforms we achieve up to twofold gains in overall application performance, in the sense of a reduction of the time to solution for a given setup, with respect to a pure CPU cluster. The paper describes our basic porting strategies and benchmarking methodology, and details the main algorithmic and technical challenges we faced on the new, heterogeneous architecture.
Tilman Dannert, Andreas Marek and Markus Rampp†
† Corresponding Author: Markus Rampp; Email: markus.rampp@rzg.mpg.de.
Keywords: GPU, HPC application, GENE, VERTEX
Introduction
With GPU hardware and the corresponding software environments becoming mature, compute clusters with GPU-accelerated nodes are establishing themselves as a new, powerful platform for high-performance computing (HPC). Mainly motivated by the expected boost in application performance (i.e. reducing "time to solution") and also by energy-efficiency considerations (i.e. reducing "energy to solution"), major research organizations and providers of HPC resources have already deployed an appreciable amount of GPU-accelerated resources worldwide [9]. Moreover, GPU-like architectures are expected to play a major role in the upcoming exascale era [10].
It is well known in the community, however, that the new hardware architecture together with the apparently disruptive programming models poses substantial challenges to scientific application developers (e.g. [10]). While selected algorithms and applications have in fact been demonstrated to keep up with the shiny performance promises of GPUs, in some cases even at very large scale (e.g. [11]), it remains to be seen whether a broader class of scientific applications can take advantage of GPU-accelerated systems with reasonable programming effort and in a sustainable way. Although often inappropriately termed "legacy applications" in this context, leading scientific HPC codes are typically under active development; they comprise many tens or hundreds of thousands of lines of code, built up by team efforts of many dozens of person years, and they provide state-of-the-art functionality as well as a high degree of optimization, parallel scalability and portability. The codes GENE (developed by the group of F. Jenko, Max Planck Institute for Plasma Physics) and VERTEX (developed by the group of H.-Th. Janka, Max Planck Institute for Astrophysics), which have been developed in the Max Planck Society with continuous support from its high-performance computing centre (RZG), may serve as prototypical examples in this respect. As a matter of fact, such highly tuned codes are often reaching the limits of (strong) scalability: due to increasing inter-node communication times, or, as in the case of VERTEX, due to a lack of conventionally exploitable parallelism in the code structure, the time to solution for a given setup can no longer be efficiently reduced by utilizing more CPU resources. Thus, a significantly increased node performance due to accelerators appears as a promising route towards further boosting application performance at scale.
1 Methods
Although both GENE and VERTEX are written in Fortran (with MPI and hybrid MPI/OpenMP parallelization, respectively), we decided to adopt the C-based CUDA programming model because it is the performance reference for NVidia GPUs. While we found the commercial CUDA Fortran language to deliver competitive performance on the GPU, the employed PGI compiler falls behind the Intel compiler (which is our reference for the CPU) on the remaining CPU parts, which marginalizes the overall application speedup of the heterogeneous code. For the same reason we have not yet made productive use of the OpenACC programming model [12].
The performance baseline for all comparisons is defined by a highly optimized, parallel CPU implementation of the respective algorithms. Rather than quoting single-core speedups (which, in our opinion, are hardly meaningful in most cases), our comparisons are always based on the same number of GPU cards and multi-core CPUs ("sockets"). Specifically, we compare the run time obtained on a certain number of nodes, each equipped with two Intel Xeon E5-2670 8-core CPUs and two NVidia K20X GPUs, with the run time measured with the original, parallel CPU code on the same number of nodes (without GPUs).
2 The GENE code
GENE [1, 2] is a massively parallel code for the simulation of plasma turbulence in fusion devices. The code solves the time-dependent, five-dimensional Vlasov-Maxwell system of equations on a fixed phase-space grid. Depending on the physical problem, typical GENE simulations take between a few days and many weeks, using thousands of cores on x86_64-based HPC systems. GENE is open-source [13] and has a worldwide user base.
2.1 Code structure
The GENE algorithm employs coordinates aligned to the magnetic field lines in a fusion device like a tokamak. In this paper we use the so-called x-global version, where all physical quantities are handled in a spectral representation with respect to the y coordinate, which is the second of the three space dimensions x (radial), y (binormal) and z (along the field line). The remaining phase-space coordinates are (in this order) the velocity v∥ along the field line and the magnetic moment μ (see [2] for details). Although GENE is able to handle any number of ion species and the electrons in the framework of gyrokinetics, we use for this paper only a single ion species, neutralized by electrons. All performance comparisons are based on one fixed problem setup (i.e. a fixed number of grid points).
The starting point of this work was a profiling of the GENE code (SVN revision 3440), with the times given in Table 1.
Table 1: Profiling of GENE rev. 3440 on Sandy Bridge (SNB) sockets. The symbol * indicates the nesting level of the performance region.

  region               1 SNB    2 SNB
  time loop            13.5s    6.9s
  * field solver        2.0s    1.1s
  * rhs computation     8.9s    4.4s
  ** nonlinearity       5.4s    2.7s

Table 2: Performance of the nonlinearity with GPU acceleration.

                       1 SNB    2 SNB
  CPU only              4.4s    2.1s
  Fermi M2090           7.4s    3.5s
  Kepler K20X           4.1s    2.3s
The times in Table 1 show that the computation of all terms of the right-hand side (rhs) of the Vlasov equation consumes more than 60% of the computing time in GENE, with the computation of the (quadratic) nonlinearity dominating the other terms.
The computation of the nonlinearity follows the usual approach [14] of avoiding the computationally expensive convolutions in spectral space by multiplying the two fields corresponding to the quadratic term in real space after Fourier transforming them. Specifically, the fields are preprocessed (transposition, extension in the y direction for dealiasing according to the 3/2 rule) in a first step, and a fast Fourier transform to real space is applied. After multiplication the result is transformed back to the spectral representation, the additional modes are deleted, and the array is transposed, multiplied with a prefactor, and finally added to the right-hand-side vector.
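The extend/multiply/shrink sequence above can be illustrated in one spatial dimension. The following pure-Python sketch (a naive DFT stands in for CUFFT, and the helper names merely mirror the kernel names used later in this section; GENE itself works on batched multi-dimensional transforms) computes the dealiased spectrum of a quadratic product via the 3/2 rule:

```python
import cmath

def dft(a):
    """Forward DFT: A_k = sum_j a_j exp(-2*pi*i*j*k/n)."""
    n = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def idft(A):
    """Inverse DFT with 1/n normalization."""
    n = len(A)
    return [sum(A[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def extend_for_dealiasing(A, m):
    """Pad an n-mode spectrum (FFT index layout) with zeros to m = 3n/2 modes."""
    n = len(A)
    B = [0j] * m
    for k in range(n // 2 + 1):       # non-negative wavenumbers
        B[k] = A[k]
    for k in range(1, n - n // 2):    # negative wavenumbers
        B[m - k] = A[n - k]
    return B

def shrink_for_dealiasing(B, n):
    """Remove the padding modes again, keeping the original n-mode band."""
    m = len(B)
    A = [0j] * n
    for k in range(n // 2 + 1):
        A[k] = B[k]
    for k in range(1, n - n // 2):
        A[n - k] = B[m - k]
    return A

def quadratic_term(A, B):
    """Dealiased spectrum of the pointwise product a(x)*b(x); n must be even."""
    n = len(A)
    m = 3 * n // 2
    a = [m * v for v in idft(extend_for_dealiasing(A, m))]  # field values a(x_j)
    b = [m * v for v in idft(extend_for_dealiasing(B, m))]
    prod = [x * y for x, y in zip(a, b)]
    return shrink_for_dealiasing([v / m for v in dft(prod)], n)
```

Because the product of two band-limited fields contains wavenumbers up to twice the original band, the 3/2-times-larger grid ensures that all aliased contributions fall outside the retained band and are discarded by the final shrink step.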
A high level of data parallelism can be realized, as the algorithm for the nonlinearity depends on the remaining phase-space coordinates (z, v∥, μ and the number of ion species) only in a parametric way. This thread concurrency, however, competes with regular MPI parallelism on the CPU: a larger number of MPI ranks leaves less thread parallelism for the GPU, and vice versa.
2.2 Algorithmic details and GPU implementation
The nonlinear term needs four input arrays: the x and y components of the velocity, and the derivatives of the distribution function with respect to x and y. Additionally it needs the already computed right-hand side, to which the result is added after multiplication with a prefactor.
To minimize the transfer costs, we split each input array into (usually four) contiguous chunks of planes, which are transferred and computed in two asynchronous CUDA streams. Additional asynchronicity comes from the fact that all MPI tasks assigned to one CPU socket share one GPU. This also allows an overlap of different kernels and transfers and helps to fully utilize the GPU card.
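The benefit of chunking can be seen with a schematic timing model (not taken from the GENE sources; the engine layout and stage times below are assumptions for illustration). It assumes one host-to-device copy engine, one compute engine and one device-to-host copy engine, each processing consecutive chunks in order, as a CUDA stream scheduler would:

```python
def transfer_compute_time(chunks, t_h2d, t_comp, t_d2h, overlap=True):
    """Makespan of a chunked transfer/compute pipeline.

    With overlap disabled, every chunk runs its three stages back to back;
    with overlap enabled, a chunk's kernel starts as soon as both its own
    input transfer and the previous kernel have finished.
    """
    if not overlap:
        return chunks * (t_h2d + t_comp + t_d2h)
    h2d_free = comp_free = d2h_free = 0.0
    for _ in range(chunks):
        h2d_free += t_h2d                              # copy chunk in
        comp_free = max(h2d_free, comp_free) + t_comp  # kernel waits for its data
        d2h_free = max(comp_free, d2h_free) + t_d2h    # copy result out
    return d2h_free
```

For four chunks with stage times 2, 3 and 1 (arbitrary units), the pipelined makespan is 15 versus 24 for the serial schedule; in the limit of many chunks, only the slowest stage remains visible.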
The following kernels were implemented for the GPU:

  transpose                transposes the array (in the x and y dimensions)
  extend_for_dealiasing    adds 50% zero modes which are filled by the multiplication
  shrink_for_dealiasing    removes the added modes after multiplication to obtain the dealiased solution
  compute_nonlinearity     multiplies the two arrays in real space according to the structure of the nonlinearity
  multiply_with_prefactor  multiplies the result with a prefactor and adds it to the original right-hand side
  CUFFT                    library used for the Fourier transformations
From profiling with the NVidia visual profiler, we find that all of our kernels have well-coalesced memory access patterns and show a high utilization of the GPU (occupancy 65%). We also find that more than half of the GPU run time is spent in the Fourier transforms from the CUFFT library (53%), followed by transpose (17.9%), extend_for_dealiasing (11.7%), and compute_nonlinearity (8.8%).
2.3 Performance results
The performance results for a Sandy Bridge CPU (8 cores) with a Fermi or Kepler GPU are shown in Table 2. So far, we achieve only a very moderate acceleration of the code by using a GPU. To get deeper insight, we use the roofline performance model [15]. As performance metric for the nonlinearity of GENE we define the number of planes computed per second (unit: planes/s), and as metric for the bandwidth we use the number of planes transferred per second from host to device via the PCI express bus (unit: planes/s).
Having defined the target metrics, the model requires a peak value (or ceiling) for each of them. The ceiling for the bandwidth can be computed from the measured bandwidth of the PCIe bus (generation 2.0), converting bytes per second into planes per second via the size of a plane in double-complex representation (16 bytes per element).
The peak value for the computing performance is determined by measuring the number of computed planes, assuming the data is already present on the GPU. Hence, after transferring all data to the device and synchronizing the GPU with all MPI ranks, we time only the computation of the planes on the GPU. This number depends on the quality of the kernels and not on the transfer, and hence defines a ceiling for the computing performance.
Using these measured ceilings in our modified roofline model, we obtain the solid horizontal line and the dashed line labelled "no overlapping" in Fig. 1.
In a next step we switch on the asynchronous CUDA streams to hide the transfer and measure the achieved total performance. The transfer time decreases significantly, effectively doubling the transfer bandwidth, which gives a much higher ceiling in the roofline graph (the solid, inclined bandwidth line). The computational intensity of the algorithm is 1/6 (indicated by the dotted blue vertical line in the figure), as we need 6 transferred planes (5 input, 1 output) to compute one result plane.
For the K20X, the algorithm is still in the bandwidth-limited region, hence we cannot exploit the full computational power of the GPU, whereas for the older model (M2090) the limitation comes from the computational performance of the GPU. In the latter case, faster kernels would help to get better performance but, for obvious reasons, we do not put effort into optimizing for obsolete hardware. This roofline model shows clearly that, when using the PCI express bus of generation 2.0, the performance of the nonlinearity on a Kepler K20X card is bound by the data transfer between host and device. Hence, further optimization of the kernels will not improve the overall performance. The analysis also shows a way towards further improvements: one can try to overcome the bandwidth limit by moving more computation to the GPU while keeping the amount of transfer constant, or one uses the next generation (3.0) of the PCI express bus, which nearly doubles the bandwidth. In both cases, the algorithm will become compute-bound and will reach the compute ceiling. Note that, contrary to the original roofline model [15], this ceiling is not defined "top-down" from the nominal peak performance of the hardware, but is given by the actual kernel performance. Using such an application-specific metric facilitates the determination of the computational intensity in our case. It is justified provided that the analysis is confined to the bandwidth-limited region of the roofline diagram.
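The roofline argument above reduces to one line of arithmetic. The following sketch uses the application-specific units of this section (planes per second) with made-up ceiling values, since the measured numbers are not reproduced here; only the intensity of 1/6 is taken from the text:

```python
def attainable(intensity, compute_ceiling, bandwidth_ceiling):
    """Roofline model: attainable performance is the minimum of the
    compute ceiling and bandwidth * computational intensity."""
    return min(compute_ceiling, bandwidth_ceiling * intensity)

# Hypothetical ceilings, for illustration only (planes/s):
compute_peak = 1000.0      # measured kernel-only throughput
pcie2_bw = 3000.0          # host-to-device transfer rate, no overlap
intensity = 1.0 / 6.0      # 6 planes transferred per result plane

no_overlap = attainable(intensity, compute_peak, pcie2_bw)        # bandwidth-bound
overlapped = attainable(intensity, compute_peak, 2.0 * pcie2_bw)  # compute-bound
```

With these numbers, overlapping the transfers (or moving to PCIe 3.0) doubles the effective bandwidth ceiling and shifts the operating point from the bandwidth-limited to the compute-limited region, exactly as argued above.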
3 The VERTEX code
VERTEX [3, 4] is a massively parallel, multi-dimensional neutrino-radiation hydrodynamics code for simulating core-collapse supernova explosions [6, 7, 8]. Typical model runs require a few months, using between 256 cores (for two-dimensional, axisymmetric simulations) and up to 64 000 cores (for the latest generation of three-dimensional models) on HPC systems based on x86_64 processors.
3.1 Code structure
VERTEX employs a spherical grid and a hybrid MPI/OpenMP parallelization, based on a standard domain decomposition. Each MPI domain is further divided into angular 'rays' (see Fig. 2a) of computational work by virtue of a coarse-grained OpenMP parallelization.

Figure 2a schematically shows the division of the computational work into MPI tasks and OpenMP threads ('rays'), together with the mapping onto the hardware. The figure also sketches the execution flow of the application, with different threads working independently of each other during a time step.



Table 5a shows that the run time of VERTEX on the CPU is dominated by solving the radiative transfer equations (item 'transport'), and in particular by computing neutrino absorption and emission rates (item 'rates'). Fig. 2a identifies the positions of the individual routines in the execution flow. About 50 percent of the run time is spent in the computation of one particular interaction rate (named 'rate kernel', 'C2'). The different interaction rates are often termed "local physics", which expresses the fact that the computations are to a high degree independent of each other and provide data parallelism on the grid level. Different interaction processes ('rates') can be computed independently of each other, which implies additional, coarse-grained parallelism on the function level (see blowup in Fig. 2a).
3.2 Algorithmic details and GPU implementation
In the following, the algorithm for offloading the 'rate kernel' to the GPU is outlined. Given its dominant share of the run time, its high data parallelism and its high arithmetic intensity, its suitability for the GPU will become immediately apparent.
As input for the computations, a few one-dimensional arrays are needed, which represent the local thermodynamic conditions for which the interaction kernel is evaluated. All operations are performed on a five-dimensional grid representing discretized phase space, whose total number of grid points varies with the resolution of the setup. For the major part of the kernel, computations on each grid zone can be done independently of the others, which leads to a high degree of data parallelism (on the order of 10^5 independent threads, cf. below). Only after all grid zones are processed is a reduction (corresponding to a phase-space integral) to a three-dimensional grid performed. This can still be done in parallel, but with much less parallelism. All computations are done twice for subsets of different input data, accounting for two possible reaction channels.
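The two-phase structure described above (independent per-zone evaluation, followed by a phase-space reduction) can be sketched as follows. This is a schematic stand-in, not the VERTEX rate kernel itself; the grid shape and the `local_rate` function are placeholders:

```python
def compute_rates(shape, local_rate):
    """Evaluate local_rate independently on every zone of a 5-D grid
    (the data-parallel part: one GPU thread per zone), then integrate
    over the last two dimensions down to the 3-D spatial grid (the
    reduction, with far fewer independent work items)."""
    n1, n2, n3, n4, n5 = shape
    # phase 1: every zone is independent of all others
    full = [[[[[local_rate(i, j, k, l, m) for m in range(n5)]
               for l in range(n4)] for k in range(n3)]
             for j in range(n2)] for i in range(n1)]
    # phase 2: reduction over the last two (phase-space) dimensions
    reduced = [[[sum(full[i][j][k][l][m]
                     for l in range(n4) for m in range(n5))
                 for k in range(n3)] for j in range(n2)] for i in range(n1)]
    return reduced
```

In the GPU version, phase 1 maps naturally onto one thread per zone, while phase 2 corresponds to a parallel reduction with roughly n4*n5 times fewer independent work items.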
The actual implementation of the part 'C2' is straightforward: the data is copied asynchronously to the GPU, and the fivefold nested loops of the CPU version are mapped to kernel calls with about 100 000 threads. The kernels are scheduled in streams, in order to allow the CUDA run time to overlap kernel executions corresponding to the twofold computation of the processes. The problem is compute-bound, as data transfer is negligible (0.9 ms) compared to GPU computations (40 ms) and at least 140 double-precision floating-point operations are executed per transferred byte.
For good performance it turned out to be crucial to use shared memory for the input data and to use as many registers as possible on the device. After tuning our CUDA code with the help of the NVidia profiler, we achieve an occupancy of 93% of the theoretical upper limit for the most important kernels. However, we still encounter about 10% branch-divergence overhead and 25% global-memory replay overhead. Work is ongoing to improve on the latter performance metrics.
As mentioned above, the different substeps 'C1' to 'Cn' (see Fig. 2b) are independent of each other and can be computed in any order within one OpenMP thread, or 'ray'. In the original code, however, the order across different OpenMP threads is always the same, i.e. when one thread computes substep 'C1', the other threads also work on the same substep. An overlap of computations on the cores and the GPU was thus achieved by: a) individually shuffling the order of the substeps 'C1' to 'Cn' on each 'ray', and b) ensuring that the substeps 'C2' from the different 'rays' are offloaded to the GPU in a queue (see Fig. 2b). In an ideal situation where all steps 'C1' to 'Cn' take the same amount of execution time, work on the CPUs and the GPU would be perfectly overlapped. In reality, a balancing of about 80% could be reached.
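A minimal way to realize such a staggered schedule is to rotate the substep order by one position per ray, so that at any time only one ray reaches the GPU-offloaded step. This is an illustrative sketch (assuming, for perfect staggering, that the number of rays equals the number of substeps), not the actual VERTEX scheduling code:

```python
def shuffled_schedule(substeps, nrays):
    """Give each ray a cyclically rotated substep order, so that the
    offloaded step ('C2') arrives from one ray at a time and the GPU
    requests form a queue instead of a burst."""
    n = len(substeps)
    return [substeps[r % n:] + substeps[:r % n] for r in range(nrays)]
```

For four rays and substeps C1..C4, each time slot then contains exactly one 'C2' across all rays, while every ray still executes all of its substeps; with equal substep durations the GPU is busy in every slot.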
3.3 Performance results
The rate kernel 'C2' requires 2.16 s on one CPU thread (cf. Tab. 5) and scales almost perfectly with OpenMP. The same kernel can be computed on the GPU in 0.04 s. Thus, with one GPU, speedups of 7 or 54 are achieved when comparing with one CPU socket or a single core, respectively. This demonstrates that a significant speedup was achieved with respect to a Sandy Bridge CPU. As the coarse-grained OpenMP parallelization of VERTEX (which is crucial for achieving its excellent weak scalability) does not allow using a thread-parallel rate kernel on the CPU, the acceleration factor of 54 applies for production runs; it effectively eliminates the rate kernel from the computing-time budget and in practice yields a twofold acceleration of the entire application (corresponding to the original 50% share of the rate kernel, cf. Tab. 5).
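The quoted factors follow directly from the measured times, and the twofold application speedup is just Amdahl's law applied to the 50% run-time share of the rate kernel:

```python
def overall_speedup(f, s):
    """Amdahl's law: overall speedup when a fraction f of the run time
    is accelerated by a factor s (the rest is unchanged)."""
    return 1.0 / ((1.0 - f) + f / s)

# measured times from the text: 2.16 s on one core, 0.04 s on the GPU
core_speedup = 2.16 / 0.04          # 54 vs. a single core
socket_speedup = (2.16 / 8) / 0.04  # about 7 vs. one 8-core socket
app_speedup = overall_speedup(0.5, core_speedup)  # about 2 for the full code
```

With the rate kernel effectively eliminated (s = 54, f = 0.5), the overall speedup evaluates to about 1.96, consistent with the "twofold acceleration" quoted above.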
4 Summary and Conclusions
With the specific cases of GENE and VERTEX we have shown that complex HPC applications can successfully be ported to heterogeneous CPU-GPU clusters. Besides writing fast GPU code, exploiting and balancing both the GPU and the CPU resources of the heterogeneous compute nodes turned out to be an essential prerequisite for achieving good overall "speedups", which we define as the ratio of the run time obtained on a number of GPU-accelerated sockets to the run time measured with the parallel CPU code on the same number of CPU sockets.
In the case of VERTEX we have demonstrated twofold speedups which hold for production applications on GPU clusters with many hundreds of nodes. In particular, the excellent weak scalability of VERTEX [17] is not affected by the additional acceleration due to GPUs. Threefold speedups appear within reach, but would require at least the additional porting of a linear solver for a block-tridiagonal system. Limitations in the software environment (lack of device-callable LAPACK functionality) have so far impeded a successful port of this part of the algorithm. Importantly, due to the specific code structure of VERTEX, such speedups could not have been obtained with comparable programming effort by simply using more CPU cores.
The performance of GENE is currently limited by the data transfer between the host CPU and the GPU, as we have shown by an elaborate performance-modeling analysis. Once this bottleneck is relaxed by upcoming hardware improvements (PCIe 3.0), further optimization of the GPU code will increase the overall speedup on a heterogeneous cluster.
The question whether the effort of several person-months, which we have invested for each code and which we consider typical for such projects, is well justified cannot be answered straightforwardly. For complex, "living" scientific HPC codes, for which GENE and VERTEX can serve as prototypical examples, achieving up to threefold speedups in overall application performance appears very competitive [18]. Also from the point of view of hardware investment (buying GPUs instead of CPUs) and operational costs ("energy to solution"), the migration of applications from pure CPU machines to GPU-accelerated clusters can be considered cost-effective if speedups of at least about two are achieved. On the other hand, while very valuable for increasing simulation throughput, twofold or threefold application speedups usually do not enable qualitatively new science objectives. For this reason we sometimes observe reluctance in the scientific community to invest significant human resources in achieving GPU performance improvements in this range. This is further exacerbated by legitimate concerns about the sustainability, maintainability and portability of GPU-kernel code. These are not serious issues for GENE and VERTEX, where the parts we have ported to the GPU are not under heavy algorithmic development and were carefully encapsulated by us. In general, however, the need for kernel programming, which many consider a burden, currently appears to be the largest hurdle for a broader adoption of GPU programming in the scientific HPC community. Moreover, it may turn out to be necessary to port significant parts of the application code to the GPU, e.g. in cases like GENE where the data transfers become a limiting factor, or even to completely reimplement the application.
These concerns could be mitigated by the establishment of a high-level, directive-based programming model, based e.g. on the OpenACC standard [12] or a future revision of OpenMP [19], together with appropriate compiler support. Intel's Xeon Phi many-core coprocessor, with its less disruptive programming model, also appears promising in this respect. Despite serious efforts, however, we have not yet succeeded in achieving performance on this platform with GENE or VERTEX that is competitive with the GPU. We attribute this mostly to the comparably lower maturity of the Xeon Phi software stack, and we expect improvements with upcoming versions of the compiler and the OpenMP run time.
Most importantly, today's GPUs (and many-core coprocessors) may provide a first glimpse of the architecture and the related programming challenges of future HPC systems of the exascale era [10]. Applications need to be prepared in time for the massive SIMT and SIMD parallelism which is expected to become prevalent in such systems. Even on contemporary multi-core CPUs with comparably moderate thread counts and SIMD widths, the experience we have gained with porting GENE and VERTEX has already led to appreciable performance improvements of the CPU codes.
Acknowledgments
We thank F. Jenko and H.Th. Janka for encouraging the development of GPU versions for GENE and VERTEX, respectively. NVidia Corp. and Intel Corp. are acknowledged for providing hardware samples and technical consulting.
References
 [1] Jenko, F., Dorland, W., Kotschenreuther, M., Rogers, B. N. Electron temperature gradient driven turbulence. Phys. Plasmas 7, 1904 (2000).
 [2] Görler, T., Lapillonne, X. et al., The global version of the gyrokinetic turbulence code GENE. Journal of Computational Physics 230, 7053 (2011).
 [3] Rampp, M., Janka, H.Th. Radiation hydrodynamics with neutrinos: Variable Eddington factor method for corecollapse supernova simulations. Astronomy & Astrophysics, 396, 361 (2002).
 [4] Buras, R., Janka, H.-Th., Rampp, M., Kifonidis, K. Two-dimensional hydrodynamic core-collapse supernova simulations with spectral neutrino transport. I. Numerical method and results for a 15 M_⊙ star. Astronomy & Astrophysics 447, 1049 (2006).
 [5] Hanke, F., Müller, B., Wongwathanarat, A., Marek, A., Janka, H.Th. SASI Activity in ThreeDimensional NeutrinoHydrodynamics Simulations of Supernova Cores. Astrophysical Journal 770, 66 (2013).
 [6] Janka, H.Th. Explosion Mechanisms of CoreCollapse Supernovae. Annual Review of Nuclear and Particle Science. 62, 407 (2012).
 [7] Burrows, A. Colloquium: Perspectives on CoreCollapse Supernova Theory. Reviews of Modern Physics 85, 245 (2013).
 [8] Cardall, C., Endeve, E., Budiardja, R. D., Marronetti, P., Mezzacappa, A. Towards the CoreCollapse Supernova Explosion Mechanism. Advances in computational astrophysics: methods, tools, and outcome. ASP Conference Proceedings, 453, 81 (2012).
 [9] The Top500 List of supercomputers, http://www.top500.org.
 [10] Dongarra, J., Beckman, P. et al. The International Exascale Software Roadmap. International Journal of High Performance Computing Applications, 25(1), 2011, ISSN 1094-3420.
 [11] Shimokawabe, T., Aoki, T. et al. Petascale phasefield simulation for dendritic solidification on the TSUBAME 2.0 supercomputer. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, ACM New York, NY, US (2011) 3:1–3:11
 [12] The OpenACC standard. http://www.openacc-standard.org/
 [13] The GENE code. http://gene.rzg.mpg.de/.
 [14] Canuto, C., Hussaini, M. Y., Quarteroni, A., Zang, T. A., Spectral Methods. Vol. 1, Fundamentals in Single Domains. Springer, Berlin (2006).
 [15] Williams, S., Waterman, A. and Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM 52, 65 (2009).
 [16] Janka, H.-Th. Explosion Models of Core-Collapse Supernovae. Hirschegg (2013), http://theorie.ikp.physik.tu-darmstadt.de/nhc/pages/events/hirschegg/2013/talks/Wed/Janka.pdf
 [17] Marek, A., Rampp, M., Hanke, F., Janka, H.Th. Towards Petaflops Capability of the VERTEX Supernova Code. This conference volume (2013).
 [18] Accelerating Computational Science Symposium 2012. Washington, US (2012). https://www.olcf.ornl.gov/wp-content/uploads/2012/07/ACSS_Final_v3.pdf
 [19] Stotzer, E. et al. Technical Report on Directives for Attached Accelerators (2012), http://www.openmp.org/mp-documents/TR1_167.pdf