Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods
Abstract
In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and on the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed at the Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by reengineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to obtain efficient and portable shared-memory code enables a fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show the considerable speedups attained by minor and relatively straightforward code changes that offload parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5 with respect to our optimized shared-memory code, which in turn means between 7.5 and 12.5 speedup with respect to the original FLEUR code.
1 Introduction
Many legacy codes in scientific computing have grown over time with an eye on functionality, but little emphasis on portable performance and scalability. Often, these codes are a direct translation of mathematical formulae, and lack proper engineering (i.e., modularity, code reuse, etc.). One such example is FLEUR, a software for electronic structure calculations developed at the Jülich Research Center during the last two decades [3]. In previous work [2], Di Napoli et al. reengineered a portion of FLEUR's code base to demonstrate the value of writing the computational bottlenecks in terms of kernels provided by standardized and highly tuned libraries. There, they showed an increase in performance and anticipated its portability beyond multicore architectures. In this paper, we confirm that the reengineering process indeed guarantees quick portability and high performance on emerging heterogeneous architectures, such as multicore CPUs equipped with one or more coprocessors such as GPUs.
In times where massively-parallel heterogeneous architectures have become the most common computing platform, legacy scientific code has to be modernized. New software is often designed with portable efficiency and scalability in mind, and some old code is undergoing major rewriting to adapt to the newest architectures. However, there is still a lot of reluctance to undertake the rewriting process, since it requires a vast initial effort and may incur validation issues. While it is understandable that domain scientists are hesitant to introduce major changes into code developed and tested over the course of many years, legacy codes that do not go through this process are destined to be marginalized.
A critical insight in writing long-lasting scientific code is to have a modular design where, at the bottom layers, the computational bottlenecks are written in terms of kernels available from standardized and highly tuned libraries. Examples of such kernels are fast Fourier transforms, matrix products, and eigensolvers, provided by a number of commercial as well as academic libraries. The most prominent example of a standard and optimized scientific library is the Basic Linear Algebra Subprograms (BLAS). This library has its roots in the early realization of the necessity for portable performance. Today, BLAS kernels, which include matrix products and triangular solves, are the building blocks for a multitude of libraries, so much so that BLAS is the first library to be ported to and optimized for every new architecture. Therefore, writing code on top of BLAS and other standardized and broadly available libraries constitutes a first and essential step in the modernization of scientific software.
In this paper we follow the approach illustrated in [2], where the authors made a major effort to address the computational bottleneck of FLEUR's code base: the generation of the so-called Hamiltonian and Overlap matrices. In generating these matrices, the main goal of the original FLEUR code was the minimization of memory usage. Furthermore, there is no notion of abstraction and encapsulation, and the different modules are tightly coupled. Under these conditions, low-level optimizations were unfeasible, and the authors opted for a clean-slate approach: starting from the mathematical formulation behind the code, they cast it in terms of matrix operations supported by the BLAS and LAPACK libraries. As presented in their results, despite lacking some mathematical insight that reduces the amount of computation in the FLEUR code, HSDLA (the new code) outperformed the original one with speedups between 1.5 and 2.5. Most importantly, the authors claimed that HSDLA could be easily ported to other architectures.
In this paper, we continue their work and illustrate how such an initial reengineering effort enables a quick port to heterogeneous architectures consisting of multicore CPUs and one or more GPUs. More specifically, we quantify the minimal effort required, in terms of additional code, to attain substantial speedups. When running on a system equipped with two GPUs, we observe speedups of up to 5 with respect to HSDLA and of about one order of magnitude with respect to the corresponding FLEUR code. Moreover, we identify the additional work required to attain close-to-optimal efficiency and scalability, and partially implement it to illustrate the idea. Finally, beyond the results specific to the case of FLEUR, the main contribution of this paper is to demonstrate that, despite the major initial effort, the reengineering of legacy codes is not only worthwhile but imperative in order to obtain long-lasting portable performance and scalability.
This paper is organized as follows. Section 1.1 provides the background on Density Functional Theory (DFT) and the math behind the computation of the Hamiltonian and Overlap matrices. Section 2 gives an overview of the optimized algorithm behind HSDLA for the generation of these matrices. In Sec. 3 we discuss the porting of the code to heterogeneous architectures, including a review of the available BLAS libraries for GPUs and a simple analysis of the computation to decide which portions of the code to offload to the GPUs. Section 4 presents experimental results for 1, 2 and 4 GPUs, and points at potential improvements to the hybrid code. Finally, Sec. 5 draws conclusions and discusses future research directions.
1.1 DFT, FLAPW and the H and S matrices
The FLEUR code is based on the widely accepted framework of Density Functional Theory (DFT) [6, 7], which in the last decades has become the "standard model" in Materials Science. Within the DFT framework, it is possible to simulate the physical properties of complex quantum mechanical systems made of a few dozen up to a few hundred atoms. The core of the method relies on the simultaneous solution of a set of Schrödinger-like equations. These equations are determined by a Hamiltonian operator containing an effective potential $V_{\mathrm{eff}}$ that depends functionally on the one-particle electron density $n(\mathbf{r})$. In turn, the solutions of the equations determine the one-particle electron density used in calculating the effective potential:
\[
  \hat{H}\,\psi_i(\mathbf{r}) \;=\; \left[ -\frac{\hbar^2}{2m}\nabla^2 + V_{\mathrm{eff}}[n](\mathbf{r}) \right] \psi_i(\mathbf{r}) \;=\; \epsilon_i\,\psi_i(\mathbf{r}) \qquad (1)
\]
In practice, this set of equations, also known as the Kohn-Sham (KS) equations [5], is solved self-consistently: an initial guess for $n(\mathbf{r})$ is used to calculate the effective potential $V_{\mathrm{eff}}$ which, in turn, is inserted in Eq. (1), whose solutions $\psi_i$ are used to compute a new charge density $n'(\mathbf{r})$. Convergence is checked by comparing the new density to the starting one. When convergence is not reached, an opportune mixing of the two densities is selected as a new guess, and the process is repeated. This is properly called a Self-Consistent Field (SCF) iteration.
In principle, the theory only requires as input the atomic numbers and the positions of the atoms that are part of the investigated system. In practice, there is a plethora of DFT methods, which depend on the discretization used to parameterize the KS equations. The discretization in the Full-potential Linearized Augmented Plane Wave (FLAPW) method [10, 4] is based on a plane wave expansion of the wave functions $\psi_{\mathbf{k},\nu}(\mathbf{r})$, where the momentum vector $\mathbf{k}$ and the band index $\nu$ replace the generic index $i$. Each $\mathbf{k}$-point wave function is expanded in terms of a basis set indexed by the vectors $\mathbf{G}$ lying in the lattice reciprocal to configuration space, up to a chosen cutoff value $K_{\max}$. In FLAPW, the physical (configuration) space of the simulation cell is divided into spherical regions, called Muffin-Tin (MT) spheres, centered around the atomic nuclei, and interstitial areas between the MT spheres. The basis set takes a different expression depending on the region:
\[
  \phi_{\mathbf{G}}(\mathbf{k},\mathbf{r}) =
  \begin{cases}
    e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}} & \text{Interstitial region} \\[4pt]
    \displaystyle\sum_{\ell,m} \left[ A_{\ell m}^{a,\mathbf{G}}(\mathbf{k})\, u_{\ell}^{a}(r) + B_{\ell m}^{a,\mathbf{G}}(\mathbf{k})\, \dot{u}_{\ell}^{a}(r) \right] Y_{\ell m}(\hat{\mathbf{r}}_a) & \text{MT}^{a}
  \end{cases} \qquad (2)
\]
where the coefficients $A_{\ell m}^{a,\mathbf{G}}$ and $B_{\ell m}^{a,\mathbf{G}}$ are determined by imposing continuity of each basis function and of its derivative at the boundary of the MTs. Due to this expansion, the KS equations naturally translate to a set of generalized eigenvalue problems for the coefficients of the expansion, where the Hamiltonian and Overlap matrices $H$ and $S$ are given by multiple integrals and sums:
\[
  H_{\mathbf{G}',\mathbf{G}}(\mathbf{k}) = \int \phi_{\mathbf{G}'}^{*}(\mathbf{k},\mathbf{r})\, \hat{H}\, \phi_{\mathbf{G}}(\mathbf{k},\mathbf{r})\, d\mathbf{r},
  \qquad
  S_{\mathbf{G}',\mathbf{G}}(\mathbf{k}) = \int \phi_{\mathbf{G}'}^{*}(\mathbf{k},\mathbf{r})\, \phi_{\mathbf{G}}(\mathbf{k},\mathbf{r})\, d\mathbf{r} \qquad (3)
\]
Since the set of basis functions in Eq. (2) is implicitly labeled by the values the variable $\mathbf{k}$ takes in the Brillouin zone, there are multiple Hamiltonian and Overlap matrices, one for each independent $\mathbf{k}$-point.
Without loss of generality, we can abstract from the $\mathbf{k}$-point index and recover an explicit formulation of the $H$ and $S$ matrices by substituting Eq. (2) into Eq. (3) and carrying out the multiple integration. The computation is particularly complex within the MT regions, where the initialization of the Hamiltonian and Overlap matrices is by far the most computationally intensive task. By exploiting the properties of the basis functions, the $H$ and $S$ matrices are directly expressed as functions of the set of $A$ and $B$ coefficients:
\[
  S_{\mathbf{G}',\mathbf{G}} = \sum_{a=1}^{N_A} \sum_{L} \left[ \left(A_{L}^{a,\mathbf{G}'}\right)^{*} A_{L}^{a,\mathbf{G}} + \left(B_{L}^{a,\mathbf{G}'}\right)^{*} B_{L}^{a,\mathbf{G}}\, \|\dot{u}_{\ell}^{a}\|^{2} \right] \qquad (4)
\]
\[
  H_{\mathbf{G}',\mathbf{G}} = \sum_{a=1}^{N_A} \sum_{L',L} \left[ \left(A_{L'}^{a,\mathbf{G}'}\right)^{*} T_{a,L'L}^{[AA]} A_{L}^{a,\mathbf{G}} + \left(A_{L'}^{a,\mathbf{G}'}\right)^{*} T_{a,L'L}^{[AB]} B_{L}^{a,\mathbf{G}} + \left(B_{L'}^{a,\mathbf{G}'}\right)^{*} T_{a,L'L}^{[BA]} A_{L}^{a,\mathbf{G}} + \left(B_{L'}^{a,\mathbf{G}'}\right)^{*} T_{a,L'L}^{[BB]} B_{L}^{a,\mathbf{G}} \right] \qquad (5)
\]
Notice that in Eq. (5), for convenience, we have compacted the indexes $\ell$ and $m$ into $L$, and expressed the range of the index $a$ over all the $N_A$ distinct atom types. The new matrices $T_a^{[\cdot]}$ are dense, and their computation involves multiple integrals between the basis functions and the non-spherical part of the potential (see [2, Appendix A.2] for details). Due to the non-orthonormality of the basis function set (2), the matrix $S$ is non-diagonal, dense, and generically positive definite, with the exception of a few very small singular values. By contrast, $H$ is always indefinite, and both matrices are either complex Hermitian or real symmetric.
2 Algorithm
As a first step towards using the BLAS and LAPACK libraries, all the objects involved in Eqs. (4) and (5) are expressed in matrix form, dropping the indexes $L$, $L'$, $\mathbf{G}$, and $\mathbf{G}'$. Assuming the coefficient objects $A_a$ and $B_a$ as well as the matrices $T_a^{[AA]}$, $T_a^{[AB]}$, $T_a^{[BA]}$, and $T_a^{[BB]}$ as input, the matrices $H$ and $S$ can be computed as follows:
\[
  H = \sum_{a=1}^{N_A} \left[ A_a^H T_a^{[AA]} A_a + A_a^H T_a^{[AB]} B_a + B_a^H T_a^{[BA]} A_a + B_a^H T_a^{[BB]} B_a \right] \qquad (6)
\]
\[
  S = \sum_{a=1}^{N_A} \left[ A_a^H A_a + B_a^H U_a B_a \right] \qquad (7)
\]
where $A_a, B_a \in \mathbb{C}^{N_L \times N_G}$, $T_a^{[AA]}, T_a^{[AB]}, T_a^{[BA]}, T_a^{[BB]} \in \mathbb{C}^{N_L \times N_L}$, $H, S \in \mathbb{C}^{N_G \times N_G}$, and $U_a$ is a diagonal matrix containing the norms $\|\dot{u}_{\ell}^{a}\|^{2}$. Typical values for the matrix sizes in our test cases are $N_A$ in the order of hundreds, $N_L$ from 49 to 121, and $N_G$ from about 2,000 to 13,000 (see Tab. 1).
Algorithm 1 illustrates the algorithm used to compute Eqs. (6) and (7) in HSDLA. Two main insights drive the design of the algorithm. First, it exploits symmetries to reduce the computational cost; then, it casts the computation in terms of BLAS and LAPACK kernels. Furthermore, when possible, multiple matrices are stacked together to allow for larger matrix products, which in general results in higher performance.
The computation of $H$ is split into two parts: the terms involving $T_a^{[AA]}$, $T_a^{[AB]}$, and $T_a^{[BA]}$, and the term involving $T_a^{[BB]}$. The computation of the first part corresponds to lines 4 through 10. The key idea behind the calculation is to rewrite the expression as
\[
  \sum_{a=1}^{N_A} \left[ A_a^H \left( \tfrac{1}{2} T_a^{[AA]} A_a + T_a^{[AB]} B_a \right) + \left( \tfrac{1}{2} T_a^{[AA]} A_a + T_a^{[AB]} B_a \right)^H A_a \right],
\]
noting that $T_a^{[BA]}$ is the Hermitian transpose of $T_a^{[AB]}$ and that $T_a^{[AA]}$ is itself Hermitian. The operations in parentheses are computed one at a time for each $a$. Then, the results are aggregated into single large matrices for a large product.
The computation of the $T_a^{[BB]}$ part corresponds to lines 17 through 29. The algorithm first attempts a Cholesky factorization of each $T_a^{[BB]}$ ($T_a^{[BB]} = L_a L_a^H$), which requires the matrix to be Hermitian positive definite (HPD). While, in theory, $T_a^{[BB]}$ is HPD, in practice, due to numerical considerations, the factorization may fail. Depending on the success or failure of each individual factorization, the results of the operations in lines 21 and 24 are stacked on different temporary operands to then compute the contribution to $H$ in two steps (lines 28 and 29).
The computation of $S$ (lines 13 through 15) is more straightforward. First, the product $\sum_a A_a^H A_a$ is computed as a single large product. Then $B$ is updated with the norms stored in $U$, and a second large product completes the computation.
3 Software reengineering and performance portability
In this section we focus on porting the multicore implementation of Alg. 1 to heterogeneous architectures consisting of a multicore node equipped with one or more GPUs. We perform a quick analysis of the computation to determine how to split it between CPU and GPU(s) with minimal modifications to the code, and illustrate how, with these minor modifications, one can already benefit considerably from the combined computational power of CPU and GPUs.
Given the characteristic values for $N_A$, $N_L$, and $N_G$ observed in our test cases, at least 97% of the computation is performed by the operations in lines 10, 13, 15, 28 and 29. Thanks to the aggregation of many small matrix products into single large ones, all five operations are large enough to be computed efficiently on the GPUs. Therefore, the first step is to offload these computations to the GPU, making sure that a relatively high efficiency is attained.
All five calls correspond to BLAS kernels; we thus look into the available BLAS implementations for GPUs. There exists a range of GPU libraries that offer BLAS functionality, both academic and commercial, such as cuBLAS [1], cuBLASXT, MAGMA [8], and BLASX [9]. The first two are commercial products developed by NVIDIA; the other two are academic efforts. From the point of view of programmability, the most convenient alternatives are cuBLASXT and BLASX, since they require minor or no changes to the calls to BLAS routines and also take care of the data transfers between CPU and GPU transparently. While BLASX offers certain advantages from the programmability perspective and claims higher performance and scalability (see [9]), we encountered some problems in its integration and opted for cuBLASXT for our initial study.
Since cuBLASXT does not abide by the BLAS standard interface, three wrappers of about 15 lines of code each, around the calls to zherk, zher2k and zgemm, are required to ensure that the code works seamlessly in both CPU-only and CPU-GPU(s) modes. An example for zgemm follows:
void gpu_zgemm_( char *transa, char *transb, int *m, int *n, int *k,
                 std::complex<double> *alpha, std::complex<double> *A, int *lda,
                 std::complex<double> *B, int *ldb,
                 std::complex<double> *beta, std::complex<double> *C, int *ldc )
{
    cublasOperation_t cu_transa = transa[0] == 'N' ? CUBLAS_OP_N :
                                  transa[0] == 'T' ? CUBLAS_OP_T : CUBLAS_OP_C;
    cublasOperation_t cu_transb = transb[0] == 'N' ? CUBLAS_OP_N :
                                  transb[0] == 'T' ? CUBLAS_OP_T : CUBLAS_OP_C;
    cublasXtZgemm( handle, cu_transa, cu_transb, *m, *n, *k,
                   (cuDoubleComplex *)alpha, (cuDoubleComplex *)A, *lda,
                   (cuDoubleComplex *)B, *ldb,
                   (cuDoubleComplex *)beta, (cuDoubleComplex *)C, *ldc );
}
In addition, two routines for the proper initialization and cleanup of the CUDA runtime and the devices are needed. Finally, for the data transfers between CPU and GPU to be efficient, the memory for the matrices involved must be pinned (page-locked).
With these minor modifications, amounting to about 100 lines of extra code, the program is ready to offload most of the computation to multiple GPUs and attain n-fold speedups. It is important to highlight that this simple extension is only possible thanks to the initial effort of rewriting the original FLEUR code in terms of matrix (BLAS/LAPACK) operations. From that point on, the efficiency and scalability of the code can be easily ported to more complex architectures. Had the original FLEUR code not undergone the reengineering process, the coding of efficient low-level routines for the GPUs would be a much more complex and time-consuming effort.
4 Experimental Results
We now turn our attention to the experimental results. We compare the performance of our hybrid CPU-GPU implementation of Alg. 1 with that of the multicore (CPU-only) HSDLA. As test cases we use inputs describing two distinct physical systems, to which we refer as NaCl and AuAg, respectively. By including both an insulator and a conductor, these systems represent a heterogeneous sample with different physical properties. The tests generate the matrices $H$ and $S$ for one single k-point and different $K_{\max}$ values. The actual problem sizes, that is, the values for $N_A$, $N_L$, and $N_G$ in each case, are given in Tab. 1.
Test case    N_A    N_L    N_G (for increasing K_max)
NaCl         512     49    2256    3893    6217    9273
AuAg         108    121    3275    5638    8970   13379
We ran our experiments on two different compute nodes, which we will refer to as RWTH and JURECA. The RWTH node consists of two eight-core Sandy Bridge E5-2650 processors, running at a nominal frequency of 2.0 GHz, and 2 NVIDIA Tesla K20Xm GPUs. The node is equipped with 64 GB of RAM. The combined double-precision peak performance of the 16 CPU cores is 256 GFlops, while the double-precision peak performance of each GPU is 1.3 TFlops, for a total of 2.6 TFlops. The JURECA node consists of two twelve-core Haswell E5-2680 v3 processors, running at a nominal frequency of 2.5 GHz, and 2 NVIDIA K80 GPUs (each of which consists of two K40 GPU devices). The node is equipped with 128 GB of RAM. The combined double-precision peak performance of the 24 CPU cores is 960 GFlops, while the double-precision peak performance of each GPU device is about 1.45 TFlops, for a total of 5.8 TFlops. In all cases, the code was linked to Intel MKL version 11.3.2 for the BLAS and LAPACK routines on the CPU; the GPU code was linked to NVIDIA cuBLASXT version 7.5.
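As a quick sanity check, the quoted CPU peaks follow from cores $\times$ frequency $\times$ double-precision flops per cycle per core (8 for Sandy Bridge with AVX, 16 for Haswell with AVX2 and FMA):

```latex
16 \times 2.0\,\mathrm{GHz} \times 8\ \tfrac{\mathrm{flops}}{\mathrm{cycle}} = 256\ \mathrm{GFlops},
\qquad
24 \times 2.5\,\mathrm{GHz} \times 16\ \tfrac{\mathrm{flops}}{\mathrm{cycle}} = 960\ \mathrm{GFlops}.
```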
RWTH.
Table 2 collects the timings for the NaCl test case in the three scenarios of interest (CPU only, CPU + 1 GPU, CPU + 2 GPUs). The speedup with respect to HSDLA is given in parentheses. As expected, the considerable gap in performance between CPU and GPU is reflected in the observed large speedups: up to 2.76 and 4.04 for 1 and 2 GPUs, respectively.
Similar results are presented in Tab. 3 for the AuAg test case, but in this case the observed speedups are even larger. The reason for this is that, while MKL is already close to its peak performance for the matrix sizes of NaCl, cuBLASXT still has room for improvement and benefits from the larger matrices in AuAg. In fact, one can expect still better speedups for larger systems.
Setup          N_G = 2256     N_G = 3893      N_G = 6217      N_G = 9273
CPU only       18.27s         39.84s          91.52s          189.53s
CPU + 1 GPU    8.03s (2.28)   15.87s (2.51)   35.64s (2.57)   68.59s (2.76)
CPU + 2 GPUs   6.51s (2.81)   12.37s (3.22)   24.39s (3.75)   46.97s (4.04)
Setup          N_G = 3275     N_G = 5638      N_G = 8970      N_G = 13379
CPU only       15.64s         46.23s          104.25s         215.98s
CPU + 1 GPU    7.52s (2.08)   16.16s (2.86)   35.62s (2.93)   71.35s (3.03)
CPU + 2 GPUs   5.62s (2.78)   11.28s (4.10)   23.10s (4.51)   43.54s (4.96)
JURECA.
Results for the JURECA node are presented in Tabs. 4 and 5 for NaCl and AuAg, respectively. In this case we show timings and speedups for up to 4 GPUs. The maximum observed speedups are 1.77, 2.76 and 4.26 for 1, 2 and 4 GPUs, respectively. Given that the corresponding increase in the node's combined peak performance is roughly 2.5x, 4x and 7x, respectively, these numbers are quite satisfactory.
Setup          N_G = 2256      N_G = 3893       N_G = 6217       N_G = 9273
CPU only       9.334s          23.293s          41.500s          74.731s
CPU + 1 GPU    6.474s (1.44)   14.502s (1.61)   32.728s (1.27)   66.546s (1.12)
CPU + 2 GPUs   4.995s (1.87)   10.381s (2.24)   21.842s (1.90)   42.581s (1.76)
CPU + 4 GPUs   4.720s (1.98)   8.760s (2.66)    15.449s (2.69)   26.575s (2.81)
Setup          N_G = 3275      N_G = 5638       N_G = 8970       N_G = 13379
CPU only       9.102s          22.681s          57.314s          100.190s
CPU + 1 GPU    6.136s (1.48)   14.466s (1.57)   32.338s (1.77)   68.910s (1.45)
CPU + 2 GPUs   4.376s (2.08)   9.699s (2.34)    20.800s (2.76)   42.242s (2.37)
CPU + 4 GPUs   3.533s (2.58)   6.690s (3.39)    13.457s (4.26)   25.359s (3.95)
4.1 Fine-tuning for performance and scalability
The observed speedups are substantial. Yet, one could expect even better results, especially in the case of the RWTH node, where the computational power of the two GPUs combined is ten times larger than that of the CPUs. This potential for improvement comes as no surprise, since ours is only a basic port meant to illustrate how far one can get with minimal code modifications; to attain close-to-optimal performance, further work is required. For ideal results, a hybrid and highly tuned BLAS as well as a GPU-accelerated version of the computation in the loops are needed.
To make the discussion more tangible, Tab. 6 provides a breakdown of the timings for the NaCl test case running on the RWTH node with two K20X GPUs. The bottom rows correspond to the large BLAS operations offloaded to the two GPUs; the top rows correspond to the rest of the code (both loops and the application of the norms) executed on the CPU only. The efficiency is measured with respect to the combined peak performance of CPU plus GPUs.
Section (lines)   Time     Performance        Efficiency
Loop 1 (4-9)      2.27s    80.35 GFlops/s     0.03
Loop 2 (17-27)    2.62s    34.81 GFlops/s     0.01
U norm (14)       0.23s    1.01 GFlops/s      0.00
S1 (13)           4.37s    1974.63 GFlops/s   0.69
S2 (15)           4.41s    1956.72 GFlops/s   0.68
H1 (10)           9.49s    1818.57 GFlops/s   0.63
H2 (28)           2.32s    1859.72 GFlops/s   0.65
H3 (29)           4.75s    1816.66 GFlops/s   0.63
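The efficiency column can be reproduced from the node's combined double-precision peak, 256 GFlops for the CPUs plus 2 x 1.3 TFlops for the two K20X GPUs; for instance, for S1:

```latex
\mathrm{eff}(\mathrm{S1}) \;=\; \frac{1974.63\ \mathrm{GFlops/s}}{(256 + 2 \times 1300)\ \mathrm{GFlops/s}} \;=\; \frac{1974.63}{2856} \;\approx\; 0.69 .
```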
Three main messages can be extracted from Tab. 6:

NVIDIA’s cuBLASXT does a good job, attaining an efficiency between 0.63 and 0.69 for the offloaded operations.

Yet, these operations may attain about 85% of the peak if the matrices are large enough and the code is highly optimized and hybrid. This would mean attaining around 2.5 TFlops/s, that is, an extra speedup of roughly 1.3 for these heavy computations.

When the target architecture offers massive parallelism, minor portions of code that do not scale may become a bottleneck. In our case, the roughly 3% of the computation that was not offloaded to the GPUs becomes non-negligible. In fact, the weight of these operations in our experiments may account for up to 35% of the time to solution, and compromise the overall scalability. Due to the small size of the matrices involved in these operations, these products do not scale well, especially on GPUs. Specialized code is required to mitigate their impact on the overall time to solution.
5 Conclusions and Future Work
We concentrated on the benefits of rewriting scientific code in terms of standardized libraries for portable performance and scalability. As a use case, we considered a portion of the code base of FLEUR, a software for electronic structure calculations.
We demonstrated that the major effort of reengineering part of the original FLEUR code, writing it in terms of the BLAS and LAPACK libraries, enables a fast porting that exploits the vast computational power of emerging heterogeneous architectures, such as multicore CPUs combined with multiple GPUs. Most importantly, the porting required less than 100 new lines of code. The resulting implementation attains speedups of up to 3.03 and 4.96 for runs with 1 and 2 GPUs, respectively, on a system equipped with two K20X GPUs, and speedups of up to 1.77, 2.76 and 4.26 for runs with 1, 2 and 4 GPUs, respectively, on a system equipped with two K80 GPUs (each consisting of two K40 GPUs).
While satisfactory, these results leave room for improvement. In the future, we aim to develop more efficient hybrid CPU-GPU routines for the major matrix products in the code, as well as to attain sufficient scalability in the rest of the code to ensure uniform overall scalability.
6 Acknowledgements
This work was partially funded by the Ministry of Science and Education of the Republic of Croatia and the Deutscher Akademischer Austauschdienst (DAAD) from funds of the Bundesministerium für Bildung und Forschung (BMBF) through project "PPP Kroatien" ID 57216700. Financial support from the Jülich Aachen Research Alliance High-Performance Computing and the Deutsche Forschungsgemeinschaft (DFG) through grant GSC 111 is also gratefully acknowledged. Furthermore, the authors thank the RWTH IT Center and the Jülich Supercomputing Centre for the computational resources.
References
 [1] cuBLAS: The NVIDIA CUDA Basic Linear Algebra Subroutines, https://developer.nvidia.com/cublas
 [2] Di Napoli, E., Peise, E., Hrywniak, M., Bientinesi, P.: High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods. Computer Physics Communications (Oct 2016), DOI: 10.1016/j.cpc.2016.10.003
 [3] FLEUR: The Jülich FLAPW code family (Oct 2016), http://www.flapw.de/pm/index.php
 [4] Jansen, H.J.F., Freeman, A.J.: Total-Energy Full-Potential Linearized Augmented-Plane-Wave Method for Bulk Solids: Electronic and Structural Properties of Tungsten. Physical Review B 30(2), 561–569 (Jul 1984)
 [5] Kohn, W., Sham, L.J.: Self-Consistent Equations Including Exchange and Correlation Effects. Phys. Rev. 140, A1133–A1138 (1965)
 [6] Nogueira, F., Marques, M.A.L., Fiolhais, C.: A primer in density functional theory. Lecture Notes in Physics, Springer, Berlin (2003)
 [7] Sholl, D., Steckel, J.A.: Density Functional Theory: A Practical Introduction. John Wiley & Sons (Sep 2011)
 [8] Tomov, S., Nath, R., Ltaief, H., Dongarra, J.: Dense linear algebra solvers for multicore with GPU accelerators. In: Proc. of the IEEE IPDPS'10. pp. 1–8. IEEE Computer Society, Atlanta, GA (April 19–23, 2010), DOI: 10.1109/IPDPSW.2010.5470941
 [9] Wang, L., Wu, W., Xu, Z., Xiao, J., Yang, Y.: BLASX: A High Performance Level3 BLAS Library for Heterogeneous MultiGPU Computing. In: Proceedings of the 2016 International Conference on Supercomputing. pp. 20:1–20:11. ICS ’16, ACM, New York, NY, USA (2016)
 [10] Wimmer, E., Krakauer, H., Weinert, M., Freeman, A.J.: Full-Potential Self-Consistent Linearized-Augmented-Plane-Wave Method for Calculating the Electronic Structure of Molecules and Surfaces: O2 Molecule. Physical Review B 24(2), 864–875 (Jul 1981)