# Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs

###### Abstract

Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi “Knights Landing” and Nvidia “Pascal.”

We propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic.

As an application scenario we use filter diagonalization studies on topological insulator materials. Performance numbers on up to 512 nodes of the Oakforest-PACS and Piz Daint supercomputers are presented, achieving beyond 100 Tflop/s for computing inner eigenvalues of sparse matrices of dimension .

## 1 Introduction and related work

Stacked memory technologies such as HBM2 and MCDRAM have boosted the attainable main memory bandwidth by a factor of five to six compared to conventional multicore systems. Soon after the commercial availability of these technologies, three out of the ten most powerful supercomputers were equipped with the new fast memories (see the TOP500 [2] list as of June 2017). Typically holding 16 GiB of data, the size of stacked memories is still very limited and hierarchical concepts have been implemented, offering additional large DDR4 memory spaces. The two major players as of today, Intel with its “self-hosted” Xeon Phi “Knights Landing” (KNL) series and Nvidia with its “Pascal” (P100) GPGPUs, implement these hierarchical concepts in different ways. While the KNL is directly connected to the DDR4 partition, the P100 accesses the large host node memory through the PCIe interface. However, both architectures are capable of transparently addressing the complete (slow and large) memory on a node, thereby offering easy access to large data sets.

The computation of bulks of eigenvalues of large sparse matrices is very data intensive, both in terms of bandwidth demands (i.e., low computational intensity) and data set sizes. Subspace projection using polynomial filters based on the Chebyshev iteration is an efficient approach for the computation of extremal and interior eigenvalues in quantum physics and quantum chemistry. Application areas include inner eigenvalue problems in the context of modeling graphene or topological insulator materials [21, 22] or the solution of an eigenvalue problem in density functional theory [28]. Beyond eigenvalue computations, Chebyshev polynomials can be used as acceleration techniques for linear solvers (see, e.g., [20, 5]) in various application areas (e.g., power flow modeling [10, 16, 15]). Moreover, the closely related kernel polynomial method (KPM) (see [26] for a review on KPM and its relation to Chebyshev polynomials) also relies on evaluating those polynomials to calculate spectral properties of sparse matrices, such as the density of states [6, 8].

From a computational perspective, the evaluation of Chebyshev polynomials is a simple series of vector operations and sparse matrix-vector multiplications (SpMV). It allows for kernel fusion to increase the computational intensity [12]. Global communication can be avoided or limited to a single invocation for the full-degree polynomial. In the above application scenarios the polynomial is usually evaluated for multiple vectors and the algorithm can be reformulated to use blocks of vectors. This further increases the computational intensity and pushes the corresponding sparse matrix-multiple-vector multiplication (SpMMV) towards regular data access [12]. We emphasize that the benefits of SpMMV have been known for a long time [9] but have only recently gained renewed interest (see, e.g., [17, 3, 4]).

Performance modeling, code optimization strategies, and parallel scalability
studies have been presented
for KPM [12] and Chebyshev filter diagonalization [19]. These investigations were performed on two Pflop/s-class supercomputers: the SuperMUC-Phase2 system^{1}^{1}1https://www.lrz.de/services/compute/supermuc/systemdescription/, which is based on the Intel Xeon Haswell, and the first phase of the Piz Daint supercomputer^{2}^{2}2http://www.cscs.ch/computers/piz_daint (Cray XC30), using Intel Xeon Sandy Bridge processors and Nvidia K20 GPGPUs.

### 1.1 Contribution

This paper extends existing work towards the new class of supercomputers using compute nodes that feature both high- and low-bandwidth memory and transparent access to the full memory address space of a node. The systems under consideration are phase two of Piz Daint and the Oakforest-PACS ^{3}^{3}3http://www.cc.u-tokyo.ac.jp/system/ofp/index-e.html system, representing the Nvidia P100-based accelerator and the standalone Intel Xeon Phi approach, respectively. As of June 2017 these supercomputers were ranked on position 3 and 7 of the TOP500 list.

We first investigate the attainable bandwidth within the compute nodes, focusing on the usage modes for accessing the low-bandwidth partitions. Concerning the transparent use of the low-bandwidth memory, the tighter hardware integration allows much faster access to large data sets on the KNL. Then we report on efforts porting and optimizing the code for the new compute device architectures and analyze the attainable performance levels and hardware bottlenecks if the working set data fits into the high-bandwidth memory. Our block vector implementation (i.e., storing all vectors in a consecutive array) and the simplicity of the algorithm allow for a straightforward implementation of subspace blocking strategies. We perform these in three directions: (1) We block for optimal compute performance, i.e., the computation of the Chebyshev filter polynomial is restricted to a subset of vectors at a time. (2) We show that the subspace blocking is adequate to enable the efficient use of transparent DRAM data access for large problems. (3) We interchange the original order of polynomial evaluation in combination with a pipeline strategy and demonstrate that overlapping of communication and computation between successive subblocks of size can be realized, avoiding the redundant memory transfers of standard communication hiding mechanisms in SpMMV. We investigate these approaches using scalable test cases (sparse matrices) from eigenvalue computations for topological insulator simulations together with realistic parameter settings for the filter diagonalization algorithm. We also show that these kinds of computations fit very well to this new class of supercomputers.

As our library is available as open-source software, our implementations and approaches can be easily adapted by the large community using numerical methods that involve the evaluation of Chebyshev polynomials of large sparse matrices.

### 1.2 Hardware Testbed

P100 | KNL | ||

Vendor | Nvidia | Intel | |

Model | Tesla P100 | Xeon Phi 7250 | |

Codename | Pascal | Knights Landing | |

Cores | 1792 (FP64 CUDA cores) | 68 (64 used) | |

Clock frequency | [MHz] | 1328–1480 | 1400 |

Peak performance | [Tflop/s] | 4.7–5.3 | 3 |

L2 cache capacity | [MiB] | 4 | 34 |

Fast memory technology | HBM2 | MCDRAM | |

Fast memory capacity | [GiB] | 16 | 16 |

Slow memory capacity | [GiB] | 64 | 96 |

The two supercomputers considered in the present work harness non-standard compute devices to bring forth their massive computational power. The Piz Daint system consists of 5,320 nodes, each equipped with an Intel Xeon E5-2690v3 compute node hosting one Nvidia Tesla “Pascal” (P100) GPGPU. Oakforest-PACS features 8,208 compute nodes, each with a self-hosted Intel Xeon Phi 7250 “Knights Landing” manycore CPU. In Table 1 we summarize the key features of the P100 and the KNL. From a high level point of view, both architectures have similar memory organization and key performance figures. However, technical implementations (e.g., SIMD vs SIMT execution; slow memory organization) and programming approaches (e.g., access to slow memory) are substantially different. As we focus in this work on large data sets and ways to use the slow memory, a more extensive evaluation of the different memory modes and the respective attainable data access rates is provided in the next section. Finally, the network structure of both supercomputers is briefly discussed in Section 4.

#### 1.2.1 Memory subsystems and operating modes

A crucial difference between both architectures is their basic operating mode. KNL is self-hosted, i.e., everything, including the operating system and management processes, runs on the compute device. The processor features a large partition of slow DDR4 memory and a small partition of fast MCDRAM memory. It can be configured such that each of them is visible to the programmer as a separate ccNUMA domain (“flat mode”). If both domains should be used, the programmer explicitly needs to specify the data location and, if required, copy data between the domains. Another operating mode uses MCDRAM as a transparent cache for the DDR4 memory (“cache mode”). In this case all memory requests go to the MCDRAM; if data is not available there it will be loaded from DDR4 memory transparently to the MCDRAM and delivered to the processing units. No explicit data management is required by the programmer.

The P100 GPGPU is installed as an accelerator via PCI-Express. The device itself only contains the fast HBM2 memory. In case the data sets exceed its capacity, the host memory has to be used. This can be done via explicit CUDA calls that copy data between host and GPGPU. The Pascal architecture is the first to support a transparent view to device memory and full host memory. Similar to the cache mode on KNL, this “Unified Memory” feature enables transparent data transfers between the host and the device (“managed mode”). Programmers need to allocate data with a special function (cudaMallocManaged()). Data transfers between host and GPGPU is then managed automatically by the Page Migration Engine (PME).

The memory subsystems and operating modes are illustrated in Figure 1.

For data sets fitting into the fast memory partitions, the operating modes on the left are preferred and data should be transferred via path (1) or (4). If the data set exceeds 16 GiB, KNL offers the cache mode which corresponds to path (6). Explicit transfers via path (1)+(2) can be used on P100, but require explicit coding of the data transfers by the programmer. This can be avoided using the managed mode, i.e., data path (3), which provides a transparent view of the complete address space of the host and the GPGPU.

In order to get estimates for achievable performance we investigate attainable bandwidth numbers for accessing large consecutive data sets, which is the typical memory access scenario for the application considered in this work. We use the STREAM benchmark [18] and adapt it to the different memory access modes. Appropriate data set sizes are chosen to measure the different bandwidth paths shown in Figure 1, i.e., for transparent access to the slow memory we use data sets larger than the fast memories. The measurements for all relevant data path combinations are shown in Table 2.

Mode | Copy | Scale | Add | Triad | |
---|---|---|---|---|---|

P100 |
HBM2 (1) | 542 | 542 | 556 | 557 |

DDR4-HBM2 explicit (1)+(2) | 13 | 13 | 12 | 12 | |

DDR4-HBM2 managed (3) | 3 | 2 | 3 | 3 | |

KNL |
MCDRAM (4) | 466 | 468 | 481 | 489 |

DDR4 (5) | 81 | 81 | 85 | 85 | |

DDR4-MCDRAM cache (6) | 60 | 60 | 60 | 59 |

On both architectures the highest bandwidth is naturally attained when using the fast memory only. Access speed to the slow memory component is substantially higher on the KNL owing to its on-chip DDR4 memory controllers, while host memory access on P100 is limited by the capabilities of the PCIe 3.016 interfaces. For explicit slow memory access approximately 75% of the maximum uni-directional PCIe bandwidth can be attained on the P100. However, the bandwidth for transparent access (“managed mode”; path (3) in Figure 1) breaks down to 2–3 Gbyte/s in our benchmarks, which may severely restrict the use of this mode in real world applications. These low transfer rates are caused by the PME, which handles all remote page faults generated by the GPGPU and tries to consolidate them into consecutive PCIe data transfers. An analysis of the “Host to Device Transfers” using the Nvidia profiler shows that the average transfer size granularity even for simple kernels like STREAM is in the order of 40 KiB, at which the PCIe 3.016 interface can deliver only 2–3 Gbyte/s. The minimum message size is 4 KiB (page size of host node), and the maximum size is close to 1 MiB (all threads executing concurrently access consecutive data).

In summary, the transparent access to large consecutive data sets in the slow memory on the P100 is twenty to thirty times slower, while the access to the fast memory is 25% faster than on the KNL.

### 1.3 Software Testbed

All computations were carried out using (real or complex) double precision data. Index values are 4-byte integers. For the P100, the CUDA toolkit in version 8.0.44 was used for compilation and the respective cuBLAS version was employed as a baseline implementation. The Intel C Compiler (ICC) version 17.0.1 was used for KNL with the corresponding MKL and MPI versions. On Piz Daint we used Cray MPICH 7.5.0.

The performance numbers we present for the Chebyshev Filter Diagonalization kernel are median values from ten consecutive runs applying the full filter polynomial. Before the actual measurements, one additional warmup run was performed on the assigned set of nodes. No error bars are given for the performance results because the variations were small ( 5%).

## 2 Chebyshev Filter Diagonalization

We investigate Chebyshev Filter Diagonalization (ChebFD) as a representative algorithm for large-scale and efficient eigenvalue computations. Filter diagonalization is frequently used to find a set of inner eigenstates of a sparse matrix in a given search interval of eigenvalues. It uses a window function approximated by a polynomial filter of degree to project a subspace of search vectors to a given search interval of eigenvalues. A comprehensive description of this method is given in [19]. The computational core of ChebFD is the application of the polynomial filter together with the computation of the Chebyshev moments. This is shown in Algorithm 1 for a formulation with block vectors.

Basic numerical operations involved in the filter application kernel (lines 7-13) are a SpMV involving a large sparse matrix and a series of scaled vector addition kernels (i.e., BLAS1 kernels). These kernels can be formulated as a single SpMMV operation involving special scaling factors and offset computations (line 9). In lines 10/11 the Chebyshev moments are computed; they are used to monitor the number of eigenstates in the search interval, which is not known a priori. Finally, in line 12 the “filtered vector” is updated. As the polynomial filter of degree is applied independently to search vectors, a block formulation as indicated in Algorithm 1 can be used. In particular, the block variant of SpMV, i.e., the SpMMV kernel, is favorable in terms of computational intensity since the matrix data () has to be loaded only times instead of times if the full filter polynomial is applied separately to each vector in the SpMV. Note that the block formulation of the vector kernels does not impact their computational efficiency as BLAS1 type operations are still involved, e.g., in line 10 independent dot products are computed.

The full filter diagonalization algorithm requires orthogonalization of the “filtered vectors” in the block vector after applying the filter above and before restarting the procedure. A rank-revealing technique such as SVQB [24] or TSQR [7] is used in the orthogonalization step, but as its contribution to the overall runtime is typically small for reasonably large filter polynomial degrees we do not include it in the performance analysis.

At this point we must emphasize that performance numbers presented in [19] use the ChebFD formulation presented here, while in [12] only the Chebyshev moments have been computed, i.e., Algorithm 1 without line 12.

### 2.1 Physical Application and Problem Setting

We have chosen the computation of a bulk of central eigenstates of a topological insulator as a test case for our performance study. Such applications are of current interest in quantum physics research. The model Hamiltonian [23] describing the topological insulator acts on a discrete 3D lattice of size carrying four degrees of freedom per site. The matrix formulation leads to a sparse matrix of size with an average of complex double precision non-zero elements per matrix row (denoted by “Topi---”). The matrices have several subdiagonals leading to a structure similar to 3D stencil. Please see [12, 19] for more details on the model Hamiltonian and its mapping to a sparse matrix. Relevant problem parameter settings in topological insulator research are matrix dimensions of and search spaces of . In terms of algorithmic efficiency it has been shown in [19] that high polynomial degrees () deliver best results.

## 3 Node-level implementation and performance analysis

The compute kernels and implementation alternatives discussed in the following are available for download at (https://bitbucket.org/essex/ghost). As parallelization approaches we use OpenMP for KNL (and CPU architectures) and CUDA for Nvidia GPGPUs. Best performance on the KNL for our application is typically achieved with four OpenMP threads per core.

### 3.1 Implementation

The structure of the ChebFD algorithm presented in Algorithm 1 can easily be mapped to a series of vector operations (of BLAS1 type) and a SpMMV. All vendors provide highly optimized library routines for these. However, calling those routines results in redundant data transfers for the involved block vectors. As discussed in [12, 19] it is possible to fuse all operations in the -loop of Algorithm 1 to a single algorithm-specific kernel (chebfd_op()) and perform all computations on the three block vectors () once they are in the cache or register. While this strategy allows for minimum data transfer, the tailored kernels become very bulky and complex. In particular, for GPU architectures the resulting CUDA kernel requires manual architecture-specific tuning as demonstrated in [12] for the Nvidia K20.

Another important issue to consider is the storage format of the block vectors. Here, a row-major approach is beneficial which drives the irregular access pattern of the SpMV towards streaming access as consecutive block vector elements are loaded for a single matrix element. Moreover row-major storage also enables SIMD/SIMT vectorization along the block vector elements and a simple compressed row storage (CRS) format can be used to store the matrix on all architectures.

On KNL, the implementation is done via AVX512 compiler intrinsics. The rather bulky nature of the kernel, together with the use of complex arithmetic, prevents efficient vectorization of high-level code from the compiler and necessitates the use of compiler intrinsics.

On the P100 we started with the KPM implementation for the K20m presented in [12]. Extending this kernel by the update of the “filtered” block vector (line 12 in Algorithm 1) is straightforward but does not change the computational bottlenecks, which were the reductions required in the dot products. Here a new feature of the Pascal architecture is employed: atomic additions using double precision numbers can increase the performance of the reduction operation for Chebyshev moments ().

### 3.2 Performance measurement

In Figure 2 we present the performance levels which can be achieved on the KNL and P100 architectures and demonstrate the need for optimized algorithm-specific kernels.

We find that a tuned implementation of the chebfd_op() kernel (labeled “GHOST”) outperforms implementations based on a minimum set of standard library calls (SpMMV and BLAS1) typically by 50%. It is interesting to note that our manual implementation of those standard calls (“GHOST-nofuse”) can even outperform the latest vendor-tuned implementations (MKL and cuBLAS/cuSPARSE). Achieving approximately 10% of their peak performance, the two compute devices outperform a standard CPU-based compute node by a factor of up to four.

### 3.3 ChebFD polynomial filter application subspace blocking

In agreement with published results for CPUs and previous-generation Nvidia GPGPUs [12, 19] we find that performance saturates (P100) or even decreases (KNL) at intermediate block vector size of . To enable optimal performance levels for the large values of required by the above application scenario, subspace blocking for the block vectors needs to be employed. Introducing a factor (), the application of the filter can be restricted to a sufficiently small vector block (holding vectors) to achieve optimal performance.

The corresponding implementation is shown in Algorithm 2. Though simple, this code transformation has strong implications on the row-major data layout of the block vectors. Row-major ordering now must also be restricted to blocks containing vectors while the blocks are stored column-wise (see LABEL:fig:layout). \@floatfigure\end@float We are now free to choose the vector block size independently of our baseline application, thus we will restrict the following performance analysis to vector blocks of size .

### 3.4 Performance analysis

We choose the Roofline performance model [27] to investigate the quality of our implementations and to detect the current hardware bottlenecks:

(1) |

This model assumes that the attainable performance is either limited by in-core execution () or by data transfer (), where is the main memory bandwidth (see Table 2 for typical values) if data comes from main memory. The arithmetic intensity of the ChebFD scheme for the Topi test case as a function of is given by [19]:

(2) |

This intensity value is calculated as the average numerical workload and minimum data traffic for applying one matrix row. The first term in the denominator () represents the matrix data traffic: As we have double complex entries, per matrix entry (using indices) are required. In average one row has 13 entries and we expect to reuse the row entries for each of the block vector entries. The minimum data traffic for the three vectors involved in chebfd_op() accounts for of data traffic, as is read only (see lines 9-12 in Algorithm 1).

Combining Eqs. 1 and 2 it becomes obvious that our performance measurements are far off the main memory bandwidth limit of the Roofline model for the full range of in Figure 2 on both architectures. Assuming an attainable main memory bandwidth of for the P100 ( for the KNL) the maximum performance should increase from () at to () at for the P100 (KNL). This is a first indication that the code is limited on both architectures not by the available main-memory bandwidth.

Choosing an intermediate value of , we investigate the actual data transfer volumes and attained data transfer rates in more detail, see Table 3.

Read (GB) | Write(GB) | Bandwidth (GB/s) | |
---|---|---|---|

Minimum | 3.77 | 2.15 | - |

KNL2 MCDRAM | 8.09 | 2.26 | 205.81 |

P100 HBM2 | 7.02 | 2.28 | 412.17 |

P100 L2 | 14.82 | 2.42 | 764.18 |

P100 TEX | 38.23 | - | }3129.36 |

P100 L1 | 29.96 | 2.42 |

Here, the data volumes have been measured with the Nvidia profiler [1] and
likwid-perfctr [25] for the P100 and the KNL^{4}^{4}4Due to the absence
of suitable tools, measurements were not conducted on the Oakforest-PACS system but on an Intel Xeon Phi 7210 with 64 cores and the same amount
of L2 cache., respectively.

In line with our analysis we find that the actual memory memory bandwidth rates using MCDRAM and HBM2 are far off the maximum attainable numbers presented in Table 2. We also find that the write data volume matches our assumption underlying (2) very well (two vector blocks each of size need to be written to main memory). On the other hand the measured read data volume is substantially higher than our model assumption indicating that the the right hand vector block involved in the spMMV is reloaded on the P100 (KNL) approximately four (five) times (see [13] for modeling right hand vector access). As always consecutive chunks of the block vectors are loaded latency effects are not expected to be the reason for low memory bandwidth utilization.

#### 3.4.1 P100

The Nvidia profiler identifies the high L1/TEX cache utilization as primary hardware bottleneck which operates at more than 3 TB/s bandwidth. Caching right hand side vector elements and warp broadcasts for reusing the matrix elements across threads may cause this pressure. This is different from previous results for the K20 presented in [19], where the TEX cache utilization was also high, but performance was limited by the reduction operations required in lines 10 and 11 of Algorithm 1. This bottleneck has been removed using the new atomic additions (see above).

#### 3.4.2 Knl

Reliable in-cache data traffic volume measurements were not available for the KNL architecture at the time of writing. Therefore we substantiate our expectation that the performance bottleneck is in-core by a brief scalability analysis of the device architecture and of our code on one KNL.

In Figure 3 we show attainable
bandwidth values and scalability of MCDRAM and L2 when running a
simple DAXPY kernel. As
the L2 cache segments are shared by two cores and we only perform
local L2 cache accesses we find perfect scalability across the
segments. On contrary, the MCDRAM bandwidth shows the typical
saturation behavior even if only one thread per core is run. However, our
ChebFD implementation scales well across the device and also
benefits substantially from using all SMT threads (see
LABEL:fig:knl_socketscaling), indicating that neither MCDRAM nor L2
access are the limiting factor. Thus, we identify the in-core execution
of the code to be the bottleneck.^{5}^{5}5The identification of the
respective bottleneck is ongoing but probably pointless as this
architecture line will not be continued by Intel..
\@floatfigure\end@float

In summary, we find that both architectures can achieve approximately 10% of their theoretical peak performance but are not able to fully utilize the new memory technology due to in-cache and in-core bottlenecks despite manually optimized kernels. Hence, future architectural developments on both systems should focus on improving cache and core architectures to leverage the additional costs of the new high bandwidth memory technologies for a broad range of applications.

### 3.5 Subspace blocking and large problems

Often in real world filter diagonalization applications the available main memory is the limiting factor as one typically aims at large physical problem sizes () and a large number of inner eigenstates () at the same time. Thus, the size of the high bandwidth memory can easily restrict the accessible problem space and may require (massive) parallelization to provide the required main memory space. As described in Section 1.2, the two architectures under consideration in this work address this problem and allow for a transparent access to large but slow memory regions located in the host node (P100) or in a separate DDR4 domain (KNL). We now investigate the transparent memory access mechanisms provided on both architectures, i.e., the “managed mode” on P100 and the “cache mode” on KNL, to use those large memory spaces implicitly.

As we have demonstrated in Table 2, the transfer rates of transparent data accesses to the slow memories are very low. In our case, the time (i.e., the work to be done on the device) between two accesses to the slow memory is determined by the polynomial filter degree : As shown in Algorithm 2, a local working set of vectors is loaded to the high-bandwidth memory and then reused times. Thus the data access to slow memory may be amortized if is large enough. \@floatfigure\end@float Indeed we observe no significant impact of the low-bandwidth memory access for an overall working set of 60 GiB beyond on P100 and on KNL (see LABEL:fig:large_data). The different behavior is expected due to the much lower transparent access bandwidth on the P100 (see Table 2).

As discussed in Section 2.1, algorithmic efficiency requires high polynomial degrees , which matches the demands of both architectures to achieve high single device performance for large data sets on the two architectures under consideration.

## 4 Large-scale performance

In this section we present scaling results on both supercomputers using data sets fitting into the fast memory of both devices. The Oakforest-PACS nodes are operated in “flat” mode. Distributed-memory parallelization is done using the GHOST library [14], which supports heterogeneous parallel execution using an MPI+X approach (currently, X). On Piz Daint we run one MPI process per host node and on Oakforest-PACS we use one MPI process and 256 OpenMP threads per KNL node (64 cores). We employ the standard data-level parallelization approach for ChebFD: Matrix elements and vector data are distributed across the MPI processes, each process working with a contiguous set of matrix rows and the corresponding part of the block vectors. The communication pattern is determined by the sparsity pattern of the matrix, and communication of remote block vector elements to local buffers must be performed before the process-local chebfd_op is applied As the matrix structure is reminiscent of a 3D stencil, nearest-neighbor communication dominates and leads to easy load balancing and a well-controlled communication volume.

The Cray-proprietary interconnect of Piz Daint uses a dragonfly network topology. Oakforest-PACS is based on Intel Omni-Path with a full fat-tree network built on 48-port leaf switches and 768-port spine switches. As these networks should provide sufficient bandwidth for nearest-neighbor communication, no optimized mapping of MPI ranks to the topology was done.

### 4.1 Weak scaling

The weak scaling experiments are based on the problem scaling used in [19]. A subdomain of is assigned to each process, corresponding to the Topi-128-64-64 problem considered so far. For scaling out we run processes () on a lattice with fixed dimension that is quadratic in and , i.e., Topi-()-()- for a given . As long as the communication time is small compared to the actual computation, which is the case for our choice of parameters, a simple communication scheme (“vector mode,” see Algorithm 3), can be used:

Data exchange using non-blocking MPI (lines 4 and 5) is separated from the process-local computation (). The weak scaling performance results for both systems are shown in Figure 4 for up to 512 nodes.

The communication overhead introduced by the vector mode is visible for two and eight nodes because communication sets in first in the direction at and then additionally in the direction at . From eight nodes onward we see perfect scaling since per-node communication and computation times stay constant. Based on the single-node performance the systems achieve a parallel efficiency of 81% (Oakforest-PACS) and 62% (Piz Daint), which compares to 73% obtained on the SuperMUC-Phase2 system [19]. The lower efficiency of Piz Daint can be attributed to the transfer of vector data over the PCIe bus (even though our implementation uses GPUdirect communication avoiding an intermediate copy of communication data in the host memory) and the high single-node performance. Still Piz Daint provides best absolute performance, achieving 132.7 TF/s at 512 nodes, thereby outperforming the CPU-only SuperMUC-Phase2 results in [19] by a factor of three at the same node count. The underlying numerical problem considered here is the computation of approximately 100 inner eigenvalues of a matrix of dimension .

### 4.2 Strong scaling and subspace pipelining

Sparse linear algebra problems often show limited strong scalability because the computation time per process decreases faster than the corresponding communication time. Our vector mode implementation, which was acceptable with weak scaling and rather large per-node problem sizes, must thus be improved by explicitly overlapping communication with computation. A typical approach to this problem is to do local computations (i.e., handle matrix elements which only access local vector elements) while communicating the non-local vector elements and then doing the remaining work with the just-received data, updating the partial results [11]. This implementation needs to update the local result vector twice and thus increases the main memory data traffic. Modifying the subspace blocking scheme introduced in Section 3.3 towards pipelining of computation and communication steps of successive filter applications as presented in Algorithm 4 offers an interesting alternative.

Instead of calculating the full polynomial for a given block of vectors, the polynomial degree for the full block vector is increased step by step. The inner loop runs over the full block vector and the computation on the current subblock can be overlapped with the communication required for the next subblock. This strategy avoids the overhead of writing the result vector twice, maintaining the same computational intensity as the non-MPI code. As long as is sufficiently large and asynchronous communication is supported by the MPI implementation, the communication should be effectively hidden. A comparison of vector mode and subspace pipelining for strong scaling on Oakforest-PACS is given in LABEL:fig:strongscaling for the Topi-128-128-64 problem (). \@floatfigure\end@float As expected, the benefit of subspace pipelining increases as the number of processes goes up because the communication becomes more relevant. A maximum speed-up with respect to vector mode of 50% could be observed in this test. Note that the speed-up of communication-hiding approaches in SpMV is limited to a factor of two. Further increasing the processor count will diminish the benefit of subspace pipelining as we reach the completely communication-bound regime.

On Piz Daint subspace pipelining did not show any benefits. With low-level experiments we have checked that non-blocking MPI communication using GPUdirect does not overlap with GPU computation (Cray is currently investigating this problem).

## Summary

This work has investigated performance properties and subspace blocking optimization techniques for a Chebyshev filter diagonalization (ChebFD) algorithm on the Intel Xeon Phi (“Knights Landing”) and Nvidia P100 (“Pascal”) architectures. Our block vector implementation achieves approximately 10% of the theoretical peak performance and is no longer memory bound on both architectures. We have demonstrated that subspace blocking with a sufficiently large polynomial filter degree enables efficient use of the complete node-level address space (i.e., high-bandwidth and low-bandwidth memory) transparently without impacting the node performance even if the working set exceeds the high-bandwidth memory size by far. Subspace blocking can be further extended towards a pipelining of the communication and computation phase in the filter application, which allows for simple communication hiding. Though this study has focused on using ChebFD in the context of inner eigenvalue computations for topological insulators, the basic strategies presented can be applied to many applications evaluating Chebyshev polynomials of large sparse matrices and should be of interest for block formulations of iterative solvers in sparse linear algebra.

## Acknowledgments

This work was funded by DFG SPP1648 through the ESSEX-II project and by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d35. We gratefully acknowledge the access to the Oakforest-PACS supercomputers at JCAHPC, Univ. of Tokyo. HF and GW gratefully acknowledge the hospitality of Los Alamos National Laboratory.

## References

- [1] NVIDIA Profiler, http://docs.nvidia.com/cuda/profiler-users-guide
- [2] TOP500 Supercomputer Sites (June 2017), http://www.top500.org
- [3] Aktulga, H.M., Buluç, A., Williams, S., Yang, C.: Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In: Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium, May 2012. IEEE Computer Society (2014)
- [4] Anzt, H., Tomov, S., Dongarra, J.: Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In: Proceedings of the Symposium on High Performance Computing. pp. 75–82. HPC ’15, Society for Computer Simulation International, San Diego, CA, USA (2015), http://dl.acm.org/citation.cfm?id=2872599.2872609
- [5] Basermann, A., Reichel, B., Schelthoff, C.: Preconditioned CG methods for sparse matrices on massively parallel machines. Parallel Computing 23(3), 381–398 (1997), http://www.sciencedirect.com/science/article/pii/S0167819197000057
- [6] Bhardwaj, O., Ineichen, Y., Bekas, C., Curioni, A.: Highly scalable linear time estimation of spectrograms — a tool for very large scale data analysis (2013), poster at 2013 ACM/IEEE International Conference on High Performance Computing Networking, Storage and Analysis
- [7] Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comp. 34, 206–239 (Feb 2012)
- [8] Di Napoli, E., Polizzi, E., Saad, Y.: Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications 23(4), 674–692 (Jul 2016)
- [9] Gropp, W.D., Kaushik, D.K., Keyes, D.E., Smith, B.F.: Towards realistic performance bounds for implicit CFD codes. In: Proceedings of Parallel CFDâ99. pp. 233–240. Elsevier (1999)
- [10] Kamiabad, A.A., Tate, J.E.: Polynomial Preconditioning of Power System Matrices with Graphics Processing Units, pp. 229–246. Springer Berlin Heidelberg, Berlin, Heidelberg (2013), https://doi.org/10.1007/978-3-642-32683-7_8
- [11] Kreutzer, M., Hager, G., Wellein, G., Fehske, H., Basermann, A., Bishop, A.R.: Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum. pp. 1696–1702 (May 2012)
- [12] Kreutzer, M., Pieper, A., Hager, G., Wellein, G., Alvermann, A., Fehske, H.: Performance engineering of the Kernel Polynomal Method on large-scale CPU-GPU systems. In: Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. pp. 417–426 (May 2015)
- [13] Kreutzer, M., Hager, G., Wellein, G., Fehske, H., Bishop, A.R.: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5), C401–C423 (2014), https://doi.org/10.1137/130930352
- [14] Kreutzer, M., Thies, J., Röhrig-Zöllner, M., Pieper, A., Shahzad, F., Galgon, M., Basermann, A., Fehske, H., Hager, G., Wellein, G.: GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming pp. 1–27 (2016)
- [15] Li, X., Li, F.: Estimation of the largest eigenvalue in Chebyshev preconditioner for parallel conjugate gradient method-based power flow computation. IET Generation, Transmission Distribution 10(1), 123–130 (2016)
- [16] Li, X., Li, F.: GPU-based power flow analysis with Chebyshev preconditioner and conjugate gradient method. Electric Power Systems Research 116(Supplement C), 87–93 (2014), http://www.sciencedirect.com/science/article/pii/S0378779614001850
- [17] Liu, X., Chow, E., Vaidyanathan, K., Smelyanskiy, M.: Improving the performance of dynamical simulations via multiple right-hand sides. In: Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium, May 2012. pp. 36–47. IEEE Computer Society (2012)
- [18] McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter pp. 19–25 (Dec 1995)
- [19] Pieper, A., Kreutzer, M., Alvermann, A., Galgon, M., Fehske, H., Hager, G., Lang, B., Wellein, G.: High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations. Journal of Computational Physics 325, 226–243 (2016), http://www.sciencedirect.com/science/article/pii/S0021999116303837
- [20] Saad, Y.: Chebyshev acceleration techniques for solving nonsymmetric eigenvalue problems. Math. Comput. 42, 567–588 (1984)
- [21] Schubert, G., Fehske, H.: Metal-to-insulator transition and electron-hole puddle formation in disordered graphene nanoribbons. Phys. Rev. Lett. 108, 066402 (2012)
- [22] Schubert, G., Fehske, H., Fritz, L., Vojta, M.: Fate of topological-insulator surface states under strong disorder. Phys. Rev. B 85, 201105 (2012)
- [23] Sitte, M., Rosch, A., Altman, E., Fritz, L.: Topological insulators in magnetic fields: Quantum Hall effect and edge channels with a nonquantized term. Phys. Rev. Lett. 108, 126807 (Mar 2012)
- [24] Stathopoulos, A., Wu, K.: A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput. 23, 2165–2182 (2002)
- [25] Treibig, J., Hager, G., Wellein, G.: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In: Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures. San Diego CA (2010)
- [26] Weiße, A., Wellein, G., Alvermann, A., Fehske, H.: The kernel polynomial method. Rev. Mod. Phys. 78, 275–306 (Mar 2006), https://link.aps.org/doi/10.1103/RevModPhys.78.275
- [27] Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (Apr 2009), http://doi.acm.org/10.1145/1498765.1498785
- [28] Zhou, Y., Saad, Y., Tiago, M.L., Chelikowsky, J.R.: Self-consistent-field calculations using Chebyshev-filtered subspace iteration. Journal of Computational Physics 219(1), 172–184 (2006), http://www.sciencedirect.com/science/article/pii/S002199910600146X