Optimization and parallelization of Bspline based orbital evaluations in QMC on multi/manycore shared memory processors
Abstract
Bspline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large fourdimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present nodelevel optimizations of Bspline evaluations on multi/manycore shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply data layout transformation from arrayofstructures to structureofarrays (SoA). Then by blocking SoA objects, we optimize cache reuse and get sustained throughput for a range of problem sizes. We implement efficient nested threading in Bspline orbital evaluation kernels, paving the way towards enabling strong scaling of QMC simulations. These optimizations are portable on four distinct cachecoherent architectures and result in up to 5.6x performance enhancements on Intel^{®} Xeon Phi™ processor 7250P (KNL), 5.7x on Intel^{®} Xeon Phi™ coprocessor 7120P, 10x on an Intel^{®} Xeon^{®} processor E5v4 CPU and 9.5x on BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impacts of our optimizations. This work combined with our current efforts of optimizing other QMC kernels, result in greater than 4.5x speedup of miniQMC on KNL.
I Introduction
With the advances in algorithms and growing computing powers, quantum Monte Carlo (QMC) methods have become a leading contender for high accuracy calculations for the electronic structure of realistic systems. It is general and applicable to a wide range of physical and chemical systems in any dimension, boundary conditions etc. The favorable scaling of for an electron system and ample opportunities for parallelization make QMC methods uniquely powerful tools to study the electronic structure of realistic systems on largescale parallel computers. Despite the high computational cost of the method, this scalability has enabled calculations on a wide variety of systems, including large molecules[1], layered materials such as the graphite shown in Figure 1[2, 3, 4], transition metal oxides[5, 6] and general bulk properties of solids[7].
A prime trend in HPC field is the increase in parallelism available on a single SMP node. The average number of nodes of big clusters has not increased in recent years but the compute capacity of a node has been increasing through more cores, multiple hardware threads per core, and wider SIMD units. The manycore architecture expands the parallelism on a node dramatically. For example, the second generation Intel^{®} Xeon Phi™ processors, formerly codenamed Knights Landing (KNL) [8], have up to 72 cores, 4 threads per core and a doubleprecision SIMD width of 8. Its theoretical peak is more than 10 times of one IBM Blue Gene/Q (BG/Q) node of 16 cores, 4 threads per core and a doubleprecision SIMD width of 4 [9]. To take full advantage of modern day systems, it is essential to utilize SIMD units and cache efficiently.
One of the main computational bottlenecks in QMC simulations is the evaluation of single particle orbitals (SPOs) for an electron system. A Bspline basis is the most efficient representation for SPOs whose cost per orbital is . Spline interpolations are widely used in many applications including classical and quantum molecular dynamics[10] to lower the cost of complex analytic function evaluations. In QMC, Bspline based orbitals provide orders of magnitude better timetosolution than other standard basis sets such as Gaussiantype orbitals even for a modest problem size for the same accuracy. This work presents processes to improve onnode performance of Bspline evaluation routines in QMCPACK [17]. The random access patterns that are inherent in QMC make achieving high performance of Bspline based SPOs on any architecture challenging. Its arithmetic intensity, FLOPS per bytes, is low and its performance is sensitive to the memory subsystems – bandwidths and cache levels and their sizes. We demonstrate how one can achieve efficient vectorization, optimally utilize the multilevel cache system and exploit a large number of hardware threads as a proof of concept.
SIMD and cache utilization: Extensive analysis of QMCPACK simulations reveals that SIMD efficiency is low except for Bspline evaluations in intrinsics and BLAS/LAPACK routines that are available in optimized libraries, such as Intel MKL [11]. The abstractions for 3D physics, defined in arrayofstructures (AoS) formats, such as representing positions for particles with R[N][3] data layout, are primarily responsible for the low SIMD efficiency. AoS representations for physical entities are logical for expressing concepts and have been widely adopted in HPC applications. However, the computations using them are not efficient on modern CPUs. Implementing “hotspots” using architecturespecific intrinsics is possible but it is not scalable nor portable. Also currently, without any cacheblocking optimizations, the working set fall out of cache for large problems, causing low cache uitlization. Our goal is to transform all QMCPACK core routines to increase onnode performance and science productivity, while minimizing the development cost, and this work presents the first most important step to achieve the goal.
Beyond walker parallelism: The common parallelization strategy in QMC is distributing walkers (Monte Carlo samples) among tasks, where is the number of parallel processing units for a simulation. The OpenMP/MPI hybrid programming model adopted in QMCPACK has proven to be very effective on clusters of multicore sharedmemory processor (SMP) nodes. It minimizes the memory footprint per MPI task, and improves the overall parallel efficiency by considerably reducing collective communication time at a large task count [12]. However, the cost of processing each walker increases as or higher for an electron system. Also, the overall memory usage on a node increase as for walkers. This makes it essential to parallelize the execution of each walker for reducing the total time to solution and memory footprints; hence, expanding the applicability of QMC methods to larger system sizes. We implement efficient nested threading for Bspline based SPO evaluations, paving the way towards reducing and enabling strong scaling of QMC simulations.
Contributions and results: The two main contributions of this work are (i) nodelevel optimizations to improve single node efficiency in a portable fashion, without the use of processor specific intrinsics and (ii) parallelization of the orbital evaluations for each walker to reduce the timetosolution in the strongscaling runs. We develop simplified QMC codes (miniQMC) to facilitate rapid prototyping and benchmarking. They capture the computational and data access patterns and have similar profile of common QMC workloads on a SMP node. We demonstrate the impact of our optimizations on four distinct cachecoherent architectures with extensive analysis and on a wide range of problems, starting from 64carbon (128 SPOs) to 2048carbon (4096 SPOs) systems. Finally, we employ the roofline model analysis to quantify the relative and absolute performance impact from the optimizations.
The key accomplishments of this work are as follows. We apply data layout transformation from arrayofstructures to structureofarrays (SoA) to increase SIMD efficiency and bandwidth utilization. We implement “tunable tiling” or AoSoA transformation to achieve efficient cache utilization allowing sustained throughput across problem sizes. The obtained optimal tile size is independent of the problem size and can be tuned once for a specific architecture with miniQMC. These optimizations result in up to 5.6x(KNL), 5.7x(KNC), 10x(BDW) and 9.5x (BG/Q) speedups of Bspline routines. We provide an efficient nested threading implementation for each walker (Monte Carlo unit) and demonstrate more than 14x reduction in the timetosolution on 16 KNL nodes.
The optimizations discussed in this publication are broadly applicable to many HPC applications for increasing the utilization of wide SIMD units and caches. In addition to Bspline kernels, we also implement them in miniQMC for optimizing the other time consuming QMC kernels. With the combined work, we obtain greater than 4.5x speedup of miniQMC on KNL and BDW processors.
To faciliate easy transition of performance gains proven in miniQMC to QMCPACK, we take help of object oriented programming. We design and implement SoA container classes for particle abstractions in 3D and develop classes hiding the lowlevel computational engines for SIMD and memory/cache optimizations from the applicationlevel classes. The optimized kernels of miniQMC will directly be called from QMCPACK driver routines to transfer the performance gains.
Ii Related work
To conduct cutting edge scientific research on material science, QMCPACK has been deployed on the current generation of leadership supercomputers, Mira (IBM Blue Gene/Q) at Argonne National Laboratory and Titan (AMD Opteron CPUs and NVIDIA Tesla K20 GPU) at Oak Ridge National Laboratory. As the most performancecritical component in QMCPACK, 3D Bsplines orbitals have been extensively optimized over the years [13][14]. The highly optimized routines evaluating Bspline SPOs are implemented in QPX intrinsics [15] on BG/Q, SSE/SSE2 intrinsics on x86 and in CUDA [16] to maximize singlenode performance. The QPX and CUDA versions have adopted SoA data layout and FMA instructions to achieve high memory bandwidth utilization and FLOP rates. The single precision was first implemented in QMCPACK GPU port with significant speedups and memory saving and later introduced to the CPU version. Multiple algorithms using loop transformations and unrolling have been developed. The QMCPACK build system allows selecting precision and the optimal algorithms and implementations at compile time.
Iii QMC and Bsplinebased SPOs
In quantum mechanics, all physically observable quantities for a system containing particles can be computed from the dimensional wave function, . The SlaterJastrow trial wave functions , commonly used in QMC applications for solving electronic structures, are the product of Slater determinants of singleparticle orbitals (SPOs) and a Jastrow factor:
(1) 
with for spins. For the rest of the paper, we assume , . Here, denotes a set of SPOs and
(2) 
which ensures the antisymmetric property of a Fermionic wave function upon exchange of a pair of electrons. The Jastrow factors, which are factorized for computational efficiency, describe the dynamic correlation, whereas static correlation is described by the Slater determinants.
In the diffusion Monte Carlo algorithm (DMC), an ensemble of walkers (population) is propagated stochastically from generation to generation, where each walker is represented by . In each propagation step, the walkers are moved through position space by (i) a driftdiffusion process. Once a new configuration is sampled, the physical quantities (observables) such as the kinetic energy and Coulomb potential energies are computed for each walker, (ii) a measurement stage. Finally, the ensemble averages of the observables and trial energy are computed and depending on the energy of a walker relative to , each walker may reproduce itself, be killed, or remain unchanged by a (iii) branching process.
The particlebyparticle moves we employ change only one column of the matrices at a time and the ratio can be computed as
(3) 
When the move is accepted, we employ a rank1 update of using the ShermanMorrison formula. This allows the inverse to be updated in time rather than time for a full inversion and the ratios in . The computations of the manybody gradients and laplacian use the same ratio formula as
(4) 
The cost to compute the value of a scales linearly with the number of basis function evaluations which tends to grow with the system size. This amounts to cost for each particle move and in total SPO evaluations scale as per Monte Carlo step. For this reason, it is efficient to use a localized basis with compact support. In particular, 3D tricubic Bsplines provide a basis in which only 64 elements are nonzero at any given point in space [22, 13] and have complexity of . The onedimensional cubic Bspline is given by,
(5) 
where are the piecewise cubic polynomial basis functions and , the lower bound of for a grid spacing . Figure 2(a) shows the four basis functions contributing to the values at when .
Constructing a tensor product in each Cartesian direction, we can represent a 3D orbital as
(6) 
This allows for rapid evaluation of each orbital in constant time. Furthermore, the basis is systematically improvable with a single spacing parameter, so that the accuracy is not compromised. The coefficients are the interpolation tables for each orbital and remain constant throughout the simulations.
The use of 3D tricubic Bspline SPOs greatly improves the timetosolution. For a typical modest example problem with 32 electrons, the speed up of Bspline is more than sixfold over an equivalent PlaneWave (PW) basis set. The advantage of the Bspline SPOs grows as the system size grows. This computational efficiency comes at the expense of increased memory use. To keep the memory footprints small, QMCPACK uses hybrid parallelism with OpenMP/MPI where all the threads share the read only coefficient table.
Iv Baseline and miniQMC
We use the CORAL benchmark 4x4x1 problem [23] to establish the baseline
performance.
This benchmark represents typical QMC workloads on
current generation of HPC systems. It solves 256 electrons of 64atom ABstacked graphite
system consisting of 4 by 4 periodic images of the 4atom unit cell, shown
in blue in Fig. 1(b). It defines the grid sizes
==48 and =60 of =128 orbitals.
The details
of the systems used are listed in Table I. Systems
are described in section VI, containing the experiments.
Intel^{®} C++ Compilers
BDW  KNC  KNL  BG/Q  
Bsplines  18  28  21  22 
Distance Tables  30  23  34  39 
Jastrow  13  19  19  21 
# cores used  18  60  64  16 
OpenMP threads  36  240  256  64 
Bsplines  Distance Tables  Jastrow  
KNL  68.5  20.3  11.2 
Intel^{®} Xeon^{®} E52698 v4  55.3  22.6  22.1 
Table II shows the profile of the benchmark with publicly released QMCPACK [17] on single nodes. The main computational groups are Bsplines, distance tables and onebody and twobody Jastrow evaluations. Their total amounts to  across the platforms. Rest of the time is mostly spent on the assembly of SPOs using Bspline outputs, determinant updates and inverses. The optimization steps mentioned in Sec. II and QPX specializations accounted to reduce the Bspline share of the profile from 22% to 11% on BG/Q.
In parallel optimization efforts, we optimize DistanceTables and Jastrow kernels with the SoA transformation of particle position abstractions and with other algorithmic improvements. With these, Bspline routines consume more than 55% of run time for miniQMC as shown in table III.
Figure 3 shows the pseudocode for simplified miniQMC that captures the essential computations involving Bspline evaluations. An object named bSpline, of BsplineAoS class is created at L10, which creates and initializes 4D coefficient array P with dimensions [nx][ny][nz][N]. This read only P array is shared by all the threads through bSpline object. Independent Monte Carlo units called walkers, are created using pragma omp parallel at L13. To imitate the random access nature of QMC, each walker generates ns random positions (x,y,z) (shown on L18) for each of the three Bspline kernels, namely V, VGL and VGH. Each walker creates an object w of WalkerAoS class (shown on L16) which contains its own copy of output arrays. With the generated random positions, each walker fills up these output arrays. The iteration loop at L20 represents working through multiple generations of Monte Carlo simulation.
The pseudocode in Fig. 4 shows the kernel of VGH which computes values, gradients (real 3D vectors) and Hessians (symmetric 3D tensors). A total of 13 output components are evaluated. The input position (x,y,z) is random and provided by the miniQMC driver (Fig. 3 at L22, L24 and L26) to mimic QMC random moves by the quantum forces. The outputs v,g and h passed as parameters to the function call, are the starting addresses of the particle attributes, V[N] (values) and AoS types of G[N][3] (gradients) and H[N][3][3] (Hessians). Gradients (Hessians) are the first (second) derivatives with respect to the particle position and are used to compute quantum forces and Laplacians. The allocation of the coefficient array is done as 1D array and uses an aligned allocator and includes padding to ensure the alignment of P[i][j][k] to a 512bit cacheline boundary. All the computations in miniQMC are performed in single precision.
In 4D memory space, accesses to start at a random input grid point P[i0][j0][k0] that satisfies , for the grid spacing in direction. The same conditions apply to and . Total of 64 input streams are issued to access coefficient values. In total, 64 strideone reads and 13 mixedstrided accumulations are executed for each random input point. For large problems, the arithmetic intensity is low at 1 FMA for each accumulation of the output value and therefore, memory bandwidth plays a critical role in deciding the throughput of the Bspline routines. The cost of computing at (x,y,z) in Eq. 6 is amortized for , which had big impact on the performance of scalar processors.
Functions V and VGL differ from VGH by how many components are computed, while having the same computational and data access patterns. Based on the simulation cell type, either VGH or VGL is used. V is used with pseudopotentials for the local energy computation. For the graphite systems, VGH is used during the driftdiffusion phase. Multiple versions of Bspline implementations are available in QMCPACK and we use the optimized CPU algorithm [15] as the baseline.
V Optimizations and Parallelization
In this section, we describe the process and techniques to optimize Bspline kernels with SOA and AoSoA data layout transformations. Further, we describe the nested threading implementation, which helps reducing time to solution and memory footprints.
Va AoStoSoA transformation of outputs
Aligned data objects and contiguous accesses for reading and writing are essential for efficient vectorization. The innermost loop in Fig. 4(a) runs over number of splines N and provides sufficient work for efficient SIMD parallelism. Vectorization of this inner most loop results in streamed unit stride accesses of read only P array. However, accesses to the g(h) array have a 3(9)strided pattern due to the AoS of the particle abstractions in 3dimensional space. For example, g assumes the internal ordering of the gradients in [xyzxyz..xyz] sequence. These strided accesses result in gather and scatter instructions, which are less efficient than contiguous loads and stores and reduce SIMD efficiency. A better way to store this data is in a structureofarrays (SoA) format and use three separate streams for each component. It lets us align the individual output streams for efficient loading and storing and also eliminates the need to use gather/scatter instructions.
Figure 4 shows the pseudocode with AoStoSoA data layout transformation of the gradient and Hessian arrays. BsplineSoA class replaces BsplineAoS class in this optimized version for Fig. 3. This class uses three separate streams for the gradient vector and 6 streams for the Hessian tensor. Only a few components are shown for brevity. A minor change is made to exploit the symmetric nature of the Hessian, which results in a total of 10 (1+3+6) output streams instead of 13 for the baseline. To help the compiler autovectorize and ignore any assumed dependencies, we place #pragma omp simd (OpenMP 4.0) on top of the inner most loop. Also, we inform the compiler about the alignment of output and input arrays using directives. These changes lead to efficient vectorization over and ensure strideone accesses to both and the output arrays. In addition to the AoStoSoA transformation, we do a few other optimizations to the VGL function which are already present for VGH ; such as unrolling the loop over the z dimension, and move the allocation of temporary arrays of the baseline code out of the loops.
Datalayout transformations from AoS to SoA are commonly employed by HPC applications on Intel^{®} Xeon Phi™ [18, 19] and have been shown to be essential on the processors with wide SIMD units. The same transformation boosts performance of the other critical computational steps involving distance tables and Jastrow of QMCPACK. To minimize the impact on theoretical development, it is important to be able to implement the data layout transformation with minimal source code changes. To achieve this while keeping the performance penalty low, we only modify the code in performance critical regions to explicitly use the SoA containers representing abstractions for particle positions, and overload their square bracket operators to return the particle positions at an index, in the current AoS format. This lets us keep the internal data layout in SoA format and allows the use in both AoS and SoA formats. Other HPC codes written with object oriented programming can take advantage of this technique to gain performance with minimal programming efforts.
VB AoSoA transformation (“tiling”)
Efficient cache utilization is critical in getting high performance on large cachebased architectures. To realize this, tiling or blocking optimizations have proven to be highly effective in many gridbased applications [18]. This is also probably one of the most difficult and a tricky one to implement in a way that allows maximizing performance. We present our method of tiling proven to be efficient for Bspline routines in QMC.
Simulating periodic images of a primitive unit cell for the SPOs involves keeping the grid constant and increasing the number of splines , for solving larger systems. Accesses to the read only coefficient array are random, limiting the cache reuse. The working set size in bytes for this single precision (SP) array is , making it too big to fit in to even last level caches, for typical values. Also, output working set size grows with ; for example, full SP output working set size in bytes for VGH is , for walkers. For big N, these arrays fall out of caches. We apply AoSoA transformation or tiling optimization to increase performance by better cache utilization.
After the SoA transformation, the spline dimension becomes the innermost and contiguous dimension for both the input and output arrays. We tile both, input 4D coefficient array and the output arrays, along the spline dimension . We create arrays of BsplineSoA and WalkerSoA objects, effectively splitting the coefficients and outputs along in groups. We use as the tile size. Tiling of array is pictorially shown in Fig. 5(b). Figure 6 is a pseudo user code using BsplineSoA and WalkerSoA objects with as the spline dimension. Object oriented programming in QMCPACK allows us to use the SoA data containers developed for the previous section without any modification.
The AoSoA transformation enhances spatial and temporal locality by reducing working set size. The VGH working set size becomes and in bytes for inputs and outputs respectively. It allows the output working set to be kept in cache for accelerated reduction operations for appropriate values. Also, the tiled array can stay in shared LLC (if available) for small values. In addition, this AoSoA transformation exposes parallelism that will be utilized to shorten computational time in the strongscaling sense.
VC Exposing Parallelism Within Walker Update
Moving forward, exposing parallelism within a walker and increasing scaling to more nodes, are becoming necessary i) to solve problems faster and ii) to reduce memory footprints. The AoSoA data layout transformation described in the previous subsection exposes natural parallelism of Bspline operations. Each tile containing BsplineSoA and WalkerSoA objects shown in Fig. 6 can be executed independently, without any synchronization. The line 9 in Fig. 6 shows a method for exploiting this parallelism using #pragma omp parallel for. The implementation in miniQMC adopts an explicit data partition scheme by assigning threads for each walker and distributing objects among threads. This avoids any potential overhead from OpenMP nested run time environment and sets the maximum performance that can be obtained with our cacheandthreading optimizations. The pseudocode in Fig. 6 is representative of the one shown in Fig. 3, with the tiling and threading optimizations and omission of redundant details including the calls to VGL and V routines.
This parallelization strategy over independent objects keeps the benefits of smaller working sets, enabling efficient cache utilization. Also, independent execution of the tiles eliminates any potential overhead of synchronization. The working set size in bytes for VGH becomes (inputs) and (outputs). For the strong scaling runs, we decrease by factor, keeping the output working set size same. Alternate approach explored for the nested threading includes threading over the innermost dimension, without tiling. This approach does not reap the benefits of smaller working sets for efficient cache utilization and performs worse than the approach chosen here. We plan to extend this AoSoA design to parallelize other parts of QMCPACK and are evaluating various parallel algorithms and run time.
Vi Experiments and Results
Our experiments
For the analysis, we vary , the number of splines, from 128 to 4096, from current day problems to large problems planned as the grandchallenge on preexascale systems. The grid size is kept constant at , simulating periodic images of the primitive unit cell for the SPOs, e.g., the blue box in Fig. 1(b). The tile size is a tunable parameter on each system. We plan to provide an autotuning capability using miniQMC to guide the production runs similar to FFTW’s solution using wisdom files [28].
We use for the throughput per node, a QMC specific metric (operations/sec) to compare performance across problem sizes and processors. represents the work done on a node per second and is computed as , where is the total time for VGL or VGH. It is directly related to the efficiency of QMC simulations using Bspline SPOs. For the ideal performance, should be independent of and the grid sizes . Higher indicates both a more powerful node, e.g., KNC vs KNL, and the higher efficiency of the implementations. The speedup is the ratio of s before and after optimizations using the same number of nodes. As pointed out before, QMCPACK implementation does very little communication across nodes and the parallel efficiency with MPI is excellent. This way, we can reliably use the singlenode performance for the comparisons. We choose = 36 (BDW), 240 (KNC), 256 (KNL) and 64 (BG/Q), corresponding to 1 walker per thread on each system, as used in current QMC simulations.
The rest of the section presents the performance evolution with the processes detailed in Sec. V. We focus on the throughput of VGH for the analysis. Memory bandwidth usage is a key indicator of the efficiency. We measure the bandwidth utilization on KNL using VTune. The performance characteristics of VGL is similar to those of VGH on each system and therefore, only the final data for VGL are presented. Kernel V, having a single floating point output array do not need SoA data layout and only benefits with the AoSoA transformation and the nested threading versions.
Via Aos to SoA Data Layout Transformation
Figure 7(a) presents the performance before and after the AoStoSoA transformation of output arrays, showing 24x speedups for small to medium problem sizes on the Intel processors. We estimate the current gain with vectorization on KNL, of an implementation by comparing s without autovectorization and with vectorization by the compiler. We call it as vector efficiency in the following text. We use novec nosimd noopenmpsimd to turn off autovectorization by the Intel compilers. For a small problem such as 256, the vector efficiency is low at 1.2x with the AoS datatypes. In contrast, this efficiency with SoA objects is greater than 4. The read bandwidth utilization increases to sustained 238 GB/s from 6098 GB/s. KNC gets the biggest boost with the AoStoSoA transformation being an inorder processor with high memory bandwidth. This SoA technique is currently implemented in the QPX version for BG/Q, leading to 2.2x speedup. However, this intrinsics solution is not portable and more importantly, not needed with the transformation.
The AoStoSoA transformation achieves good speedups for small to medium problem sizes but, its effectiveness diminishes as increases beyond =512. Almost no speedup is obtained on KNC and KNL at =2048 and 4096. The sizes of L1/L2 caches are mainly responsible for the poor performance at large . Output arrays fall out of caches for large values. This is confirmed with VTune analysis on KNL: the write bandwidth usage increase to 177 GB/s at =4096 from 38 GB/s at 256. The drop in with increase in is less severe on BDW with the large shared L3 cache. Nevertheless, the absolute performance goes down for the big problems on all the cachebased systems, we have tested. To solve this problem, we apply an AoSoA data layout transformation.
ViB AoSoA Data Layout Transformation or Tiling
The ArrayofSoA data layout transformation is a cacheblocking optimization. We break up the problem into “active” working sets to make them fit in caches. Please note that, we divide the working set across the spline dimension , which is common to both the inputs and outputs. Output arrays get split and fit in to caches for efficient reduction operations. Input four dimensional coefficient array which is shared among all the threads, also gets split across its innermost dimension. This allows it to fit in the big last level caches. As shown in Fig. 4, access to array although random but, exhibit small amount of spatial locality across and dimensions. This tiling method also shortens the stride for outer dimensions, increasing cache reuse and possibly avoiding page faults for big grid sizes and large .
Figure 7(c) shows VGH performance with the increasing tile size for =2048. Starting at , we explore tile sizes in the multiple of two till and call with the highest throughput as the optimal tile size.
A striking feature for BDW is the peak at 64. The entire single precision working set of roughly 28MB, including array of bytes and the output of bytes fit in the 45MB L3 cache. The volume for the input working set at of roughly 56MB, exceeds the L3 size and therefore, VGH becomes memory bandwidth limited for . BG/Q having shared L2 of size 32MB, can hold 28MB of working set; hence also has a peak performance at . Intel^{®} Xeon^{®} processors with shared LLC behave similarly.
In contrast, KNC and KNL do not show the same pronounced peak. Although L2 caches are kept fully coherent among cores, but they are private to cores on KNC and to tiles with two cores on KNL. Due to random accesses of the large array, private L2 caches may contain duplicate copies of elements at any point of time. The duplication reduces the effective size for L2 cache. Even for tile size, the input working set size is 7MB, which may not allow it to fit in to their L2. For KNC and KNL, a performance peak is obtained at . The improvement comes from fitting output arrays in cache, allowing efficient reduction operations over 64 grid points. The performance goes up as increases till , reflecting the amortized cost of redundant computations of the prefactors for the same random position.
Figure 7(b) presents VGH performance before and after the AoSoA transformation, showing significant improvement for =2048 and 4096. We use the obtained optimal on each platform: =64 on BDW and 512 on KNC and KNL. With tiling, we obtain sustained throughput across the problem sizes on all the cachebased architectures we have tested. On BDW and BG/Q with the shared LLC, =64 gives the speedups of 1.22.5 (BDW) and 1.21.5 (BG/Q) across all the problem sizes. The speedups on KNC and KNL are obtained with =512. With the optimal =512 on KNC and KNL, output arrays stay in L1/L2 caches and the writebandwidth decreases from 177 GB/s to 43 GB/s even for =4096. The vector efficiency also increases from 2.5x to 4.2x.
Figure 8 summarizes the performance improvement on KNL for all the three functions V, VGL and VGH with the AoSoA transformation. Speedups are computed keeping the current AoS implementation in C/C++ in QMCPACK distribution as reference. AoStoSoA transformation does not apply to V and it only gets speedup with tiling. Speedup for VGL includes performance gain from the basic optimizations stated earlier, which are in addition to the AoSoA optimization; and provide greater overall speedup. Our optimizations boost the throughput by 1.85x(V), 6.4x(VGL) and 2.5x(VGH) on a node at .
In QMCPACK, each thread owns a ParticleSet object and can benefit from keeping it in cache during the entire run. This AoSoA approach will facilitate similar cacheblocking efficiency in the full application. The optimal tile size is independent of problem size for a particular cachebased architecture and can be estimated based on the cache sizes present and their sharing properties. For the production runs, can be tuned once for each architecture using miniQMC, making the new algorithm cacheoblivious for any and .
ViC Exploiting Parallelism within Walker Update
Extending parallelism beyond the parallelization over walkers, is a necessary step to scale out to more nodes, reducing the timetosolution and the memory footprints. We use threads for each walker update and reduce the number of walkers on a node by the same factor. This allows us to scale the same problem size from one node to nodes. This is well justified since the MPI efficiency remains perfect up to 1000s of nodes on multiple HPC systems [12].
Performance with the nested threading again heavily depends on cache system properties. The optimal tile sizes for Intel^{®} Xeon^{®} and BG/Q processors are determined mainly by the input working set size (bytes). This increases with , reducing the optimal tile size by the same factor and limiting the scalability. For , scaling on these platforms are limited to only 2 threads per walker for close to 80% parallel efficiency. For KNC and KNL, the optimal tile size is determined by the output working set size; for VGH bytes. This remains constant due to the decrease in by the same factor; keeping the optimal tile size same. This allows for strong scaling with nearly ideal parallel efficiency till for the optimal . Beyond that, the scaling depends on the performance with smaller nonoptimal tile sizes. For , scaling on KNC is limited to 8 threads per walker corresponding to for close to 80% parallel efficiency.
Figure 9 presents the speedup of the three Bspline kernels in miniQMC on KNL at =2048 with respect to . The total number of walkers per node and memory usage are reduced by the same factor. The optimal configuration of the AoSoA version is used as the reference for computing speedups with =512, =256 and =1. The parallel efficiency for =16 is greater than 90%, even though =128 is smaller than the optimal tile size. For bigger problem size of , we can expect the same parallel efficiency at 32 threads.
Vii Performance Summary and Roofline analysis
Our optimization processes enhance the utilization of cache and memory bandwidth on all the four distinct cachebased architectures, discussed here. The relative impact of each optimization step, varies depending upon the memory bandwidth and cache subsystem properties on each platform, as summarized in Table IV. The AoStoSoA data layout conversion (Opt A) significantly raises the performance for all of them. The same is true on GPUs [16]. The cache levels, sizes and sharing properties play critical roles for the AoSoA transformation (Opt B). The performance on the systems with shared LLC (BDW and BG/Q) is the best when a subblock of the Bspline table can fit into LLC. On Intel^{®} Xeon Phi™ (co)processors, AoSoA is an effective cacheblocking method and helps keep the output arrays in cache for the reduction operations, allowing sustained performance for all problem sizes. Finally, nested threading (Opt C) is used to exploit a new parallelism over the AoSoA objects. KNL benefits the most from Opt C and provides strong scaling of up to 14x with 16 threads on a node with roughly parallel efficiency. Users can tune and the number of nodes to use, to optimize their productivity, either to maximize the number of simulations per day (throughput) or minimize the timetosolution.
BDW  KNC  KNL  BG/Q  
V  A/B  2.0  1.2  1.3  1.3 
C  3.4  5.9  18.7  2.0  
VGL  A  4.2  4.0  5.1  7.4 
B  10.2  5.7  5.6  9.5  
C  17.2  42.1  80.6  15.8  
VGH  A  1.7  2.6  1.7  1.9 
B  3.7  5.2  2.3  2.7  
C  6.4  35.2  33.1  5.2  
for C  2(32)  8(256)  16(128)  2(32) 
Figure 10 presents roofline performance analysis[29][30] of VGH at various optimization steps for =2048 splines. Intel^{®} Advisor 2017 [31] is used to compute GFLOPS and cacheaware arithmetic intensity (AI), FLOPS per Byte, with static and runtime analysis. The rooflines based on the measured bounds on BDW and KNL are shown for guidance.
In all cases, the bytes transferred from the main memory are the same, 64 reads and 10 writes, and the difference in AI reflects the SIMD efficiency and cache reuse. The AoStoSoA transformation increases the AI as well as GFLOPS as we eliminate the inefficient gather and scatter instructions for the strided access to the outputs. The AoSoA transformation does not affect the AIs but increases the performance with the optimal tile sizes through the increased cache locality. Higher bandwidth available with MCDRAM on KNL is critical for higher performance, as indicated by the best 150 GFLOPS obtained on DDR with the AoSoA version.
In addition to optimizing Bspline orbital evaluations, we are applying the techniques described here such as SoA and AoSoA data layout transformations and nested threading, for the other compute intensive kernels of miniQMC and QMCPACK. Combining these optimizations, we get higher than 4.5x speedup of full miniQMC for a range of problem sizes on KNL and BDW systems. Our goal is to transfer these performance gains to QMCPACK with minimal code changes. Our SoA container classes developed in object oriented C++ and optimized compute kernels will be directly called from QMCPACK driver code. And, with the use of operator overloading feature in C++, non performance critical parts of QMCPACK require minimal changes for the transition.
Viii Conclusion and Future Work
We presented nodelevel optimizations of Bspline based SPO evaluations widely used in QMC, on multi/manycore shared memory processors and nested parallelism implementation, enabling strong scaling to reduce memory usage and time to solution. The techniques described in this work, such as SoA data layout transformation, cacheblocking and threading over the AoSoA objects are applicable to a broad range of algorithms and applications and must be taken into account for optimizations and code modernization. We are applying them in full QMCPACK to deliver the performance obtained in miniQMC and enable breakthrough QMC studies. We demonstrate the impact of our optimization techniques on four distinct cachecoherent architectures and on a range of problem sizes, used in current day simulations to the highly futuristic ones. Our methods to improve the performance enable performance portable solutions in QMCPACK and other QMC applications on current and future cachecoherent architectures.
Acknowledgment
We would like to thank Lawrence Meadows, John Pennycook, Jason Sewall, Harald Servat and Victor Lee for their helpful discussions and reviewing this manuscript. This work is supported by Intel Corporation to establish the Intel Parallel Computing Center at Argonne National Laboratory. LS was supported by the Advanced Simulation and Computing  Physics and Engineering models program at Sandia National Laboratories. AB was supported through the Predictive Theory and Modeling for Materials and Chemical Science program by the Office of Basic Energy Science (BES), Department of Energy (DOE). Sandia National Laboratories is a multi program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energyâs National Nuclear Security Administration under Contract No. DEAC0494AL85000. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DEAC0206CH11357.
Footnotes
 Optimization Notice: Intel’s compilers may or may not optimize to the same degree for nonIntel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessordependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
References
 A. Benali, L. Shulenburger, N. A. Romero, J. Kim, and O. A. von Lilienfeld, “Application of diffusion monte carlo to materials dominated by van der waals interactions,” Journal of chemical theory and computation, vol. 10, no. 8, pp. 3417–3422, 2014.
 H. Shin, S. Kang, J. Koo, H. Lee, J. Kim, and Y. Kwon, “Cohesion energetics of carbon allotropes: Quantum monte carlo study,” The Journal of chemical physics, vol. 140, no. 11, p. 114702, 2014.
 P. Ganesh, J. Kim, C. Park, M. Yoon, F. A. Reboredo, and P. R. Kent, “Binding and diffusion of lithium in graphite: quantum monte carlo benchmarks and validation of van der waals density functional methods,” Journal of Chemical Theory and Computation, vol. 10, no. 12, pp. 5318–5323, 2014.
 L. Shulenburger, A. D. Baczewski, Z. Zhu, J. Guan, and D. Tomanek, “The nature of the interlayer interaction in bulk and fewlayer phosphorus,” Nano letters, vol. 15, no. 12, pp. 8170–8175, 2015.
 J. A. Santana, J. T. Krogel, J. Kim, P. R. Kent, and F. A. Reboredo, “Structural stability and defect energetics of zno from diffusion quantum monte carlo,” The Journal of chemical physics, vol. 142, no. 16, p. 164705, 2015.
 K. Foyevtsova, J. T. Krogel, J. Kim, P. Kent, E. Dagotto, and F. A. Reboredo, “Ab initio quantum monte carlo calculations of spin superexchange in cuprates: The benchmarking case of ca 2 cuo 3,” Physical Review X, vol. 4, no. 3, p. 031003, 2014.
 L. Shulenburger and T. R. Mattsson, “Quantum monte carlo applied to solids,” Physical Review B, vol. 88, no. 24, p. 245117, 2013.
 “Knights landing (knl): 2nd generation intel^{®} xeon phi™ processor.” [Online]. Available: http://www.hotchips.org/
 “Top 500, june 2013.” [Online]. Available: http://www.top500.org/lists/2013/06/
 M. Deserno and C. Holm, “How to mesh up ewald sums. i. a theoretical and numerical comparison of various particle mesh routines,” The Journal of chemical physics, vol. 109, no. 18, pp. 7678–7693, 1998.
 “Intel^{®} math kernel library (intel^{®} mkl).” [Online]. Available: https://software.intel.com/enus/intelmkl
 J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, “Hybrid algorithms in quantum monte carlo,” Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/17426596/402/i=1/a=012008
 K. P. Esler, “Einspline bspline library.” [Online]. Available: http://einspline.sf.net
 Q. Niu, J. Dinan, S. Tirukkovalur, A. Benali, J. Kim, L. Mitas, L. Wagner, and P. Sadayappan, “Globalview coefficients: a data management solution for parallel quantum monte carlo applications,” Concurrency and Computation: Practice and Experience, pp. n/a–n/a, 2016, cpe.3748. [Online]. Available: http://dx.doi.org/10.1002/cpe.3748
 Y. Luo, A. Benali, and V. Morozov, “Accelerating the bspline evaluation in quantum monte carlo,” 2015. [Online]. Available: http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post337.html
 K. P. Esler, J. Kim, L. Shulenburger, and D. M. Ceperley, “Fully accelerating quantum monte carlo simulations of real materials on gpu clusters,” Computing in Science and Engineering, vol. 14, p. 40, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/MCSE.2010.122
 “Qmcpack.” [Online]. Available: http://www.qmcpack.org
 J. Jeffers and J. Reinders, Eds., High Performance Parallelism Pearls: Multicore and Manycore Programming Approaches. Boston, MA, USA: Morgan Kaufmann Publishers Inc., 2015, vol. 1.
 J. Reinders and J. Jeffers, Eds., High Performance Parallelism Pearls: Multicore and Manycore Programming Approaches. Boston, MA, USA: Morgan Kaufmann Publishers Inc., 2015, vol. 2.
 “Intel^{®} simd data layout templates (intel^{®} sdlt).” [Online]. Available: https://software.intel.com/enus/node/600110
 A. Wells, “Simd building block.” [Online]. Available: unpublished
 D. Alfè and M. J. Gillan, “An efficient localized basis set for quantum Monte Carlo calculations on condensed matter,” Physical Review B, vol. 70, no. 16, p. 161101, 2004.
 “Coral collaboration, benchmark codes.” [Online]. Available: http://https://asc.llnl.gov/CORALbenchmarks/
 “Intel compiler options: g ip restrict unroll o3 qopenmp std=c++11”, with march=coreavx2 (bdw), mmic (knc) and xmicavx512 (knl).”
 “Ibm compiler options: g o3 qinline=auto:level=10 qarch=qp qsmp=omp qthreaded qstrict qhot=level=1 qtune=qp qsimd=auto.”
 “Intel^{®} vtune™ amplifier 2016.” [Online]. Available: https://software.intel.com/enus/intelvtuneamplifierxe
 “Hpctoolkit.” [Online]. Available: http://www.hpctoolkit.org
 M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on “Program Generation, Optimization, and Platform Adaptation”.
 S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for floatingpoint programs and multicore architectures,” Communications of the ACM, 2009.
 A. Ilic, F. Pratas, and L. Sousa, “Cacheaware roofline model: Upgrading the loft,” IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2014.
 “Intel^{®} advisor 2017.” [Online]. Available: https://software.intel.com/enus/articles/gettingstartedwithinteladvisorrooflinefeature