# A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel

###### Abstract

The Trotter-Suzuki approximation leads to an efficient algorithm for solving the time-dependent Schrödinger equation. Using existing highly optimized CPU and GPU kernels, we developed a distributed version of the algorithm that runs efficiently on a cluster. Our implementation also improves single-node performance, and is able to use multiple GPUs within a node. The scaling is close to linear using the CPU kernels, whereas the efficiency of the GPU kernels improves with larger matrices. We also introduce a hybrid kernel that simultaneously uses multicore CPUs and GPUs in a distributed system. This kernel is shown to be efficient when the matrix size would not fit in the GPU memory. Larger quantum systems scale especially well on a high number of nodes. The code is available under an open source license.

###### keywords:

GPU Computing, MPI, Hamiltonian, Quantum Evolution, Trotter-Suzuki Algorithm, Hybrid Kernels

## 1 Introduction

The evolution of a general quantum system is described by the time-dependent Schrödinger equation. The generic solution of this equation involves calculating a matrix exponential, which is formally simple. However, computer implementations must consider several factors to achieve high performance and high accuracy – usually making a trade-off between these two indicators.

There is a wide range of numerical approaches to calculating a matrix exponential. However, since they are approximate, not all of them preserve the desired analytical properties of the original matrix. This is crucial, for example, for the quantum evolution operator – the exponential of the Hamiltonian matrix appearing in the Schrödinger equation – which must be unitary in order to conserve the total probability. The reference method for calculating a matrix exponential is to diagonalize the matrix using an eigendecomposition, which is typically computationally intensive. While efficient eigendecomposition algorithms exist that use multicore CPUs and multiple GPUs in a single system Haidar et al. (2012), distributed variants that use several nodes in a cluster are hard to parallelize with close to ideal efficiency Özdoğan (2007). Traditional numerical integrators like the Runge-Kutta algorithm do not conserve unitarity, and unitary algorithms like the Crank-Nicolson scheme involve inverting a large matrix.

The Trotter-Suzuki algorithm approaches the problem from a slightly different angle. It decomposes the Hamiltonian into a sum of matrices that are easy to exponentiate De Raedt (1996), which are then used to approximate the exponential of the full Hamiltonian. The end result is an algorithm that is easy to parallelize. For the case of a single particle in real space that we treat here, the algorithm discretizes the domain with a finite mesh and calculates the pairwise evolution between neighboring sites in the mesh. The Trotter-Suzuki algorithm has been used successfully in a range of applications De Raedt (1996); Cucchietti et al. (2002); De Raedt et al. (2000). Efficient kernels for contemporary multicore CPUs and GPUs have already been developed Bederián and Dente (2011).

This paper introduces a distributed variant of the Trotter-Suzuki algorithm based on existing high-performance kernels. Our implementation improves the single-node efficiency of the CPU and GPU kernels, is able to use multiple GPUs in a single system, and also introduces a hybrid kernel to deal with large quantum systems that do not fit in the GPU memory. The implementation also scales well in a cluster: the CPU variant shows an almost linear speedup with the number of nodes, whereas the GPU variant is more efficient when the device has a higher load.

The rest of this paper is organized as follows. Section 2 gives a brief overview of the foundations of the Trotter-Suzuki algorithm. Section 3 provides the details of the distributed kernels, and Section 4 discusses the benchmark results on a large cluster. Finally, Section 5 concludes the paper and outlines our future work.

## 2 Trotter-Suzuki Algorithm and Efficient Kernels

The non-relativistic Schrödinger equation describing the evolution of a quantum system is

$$ i\hbar\,\frac{\partial}{\partial t}\,|\psi(t)\rangle = \mathcal{H}\,|\psi(t)\rangle, \qquad (1) $$

where $\mathcal{H}$ is the Hamiltonian of the system, and $|\psi(t)\rangle$ the state of the system at time $t$. By choosing a basis, the Hamiltonian can be written as a Hermitian matrix $H$.
The formal solution to the Schrödinger equation for a time-independent Hamiltonian (the general time-dependent form of the Trotter-Suzuki algorithm is a trivial extension of the one we present Poulin et al. (2011)) is given by

$$ |\psi(t)\rangle = e^{-iHt/\hbar}\,|\psi(0)\rangle = U(t)\,|\psi(0)\rangle, \qquad (2) $$

where $|\psi(0)\rangle$ is the initial state of the system, with norm $\langle\psi(0)|\psi(0)\rangle = 1$, and $U(t) = e^{-iHt/\hbar}$ is the quantum evolution operator associated to the Hamiltonian $H$. Since $H$ is Hermitian, it is easy to see that the evolution operator is unitary, and that the norm of the state vector is constant over time, $\langle\psi(t)|\psi(t)\rangle = \langle\psi(0)|\psi(0)\rangle$. Thus, it is crucial that the numerical approximation of the evolution operator be unitary De Raedt (1996), or the norm of the wave function – which gives the total probability of finding the particle somewhere, and must equal one – would not be conserved.

The Trotter-Suzuki algorithm decomposes the Hamiltonian into small diagonal or block-diagonal matrices whose exponentials are easy to compute. The decomposition is based on the Trotter formula Trotter (1959):

$$ e^{A+B} = \lim_{m\to\infty}\left(e^{A/m}\,e^{B/m}\right)^{m}, $$

where $A$ and $B$ are matrices. For sufficiently large $m$, $e^{A+B} \approx \left(e^{A/m}\,e^{B/m}\right)^{m}$. The Trotter formula is readily generalized to the case of more than two contributions to $H$ by writing $H = \sum_i H_i$. This allows for choosing a decomposition that can be exploited to construct efficient algorithms. The error of the Trotter formula scales as $1/m$ times the norm of the commutator $[A, B]$. Higher-order approximations were later developed by Suzuki Suzuki (1990, 1993), who obtained expressions that are unitary for all orders.
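The $1/m$ convergence can be checked numerically. The following toy Python sketch (plain lists as $2\times 2$ matrices and a truncated Taylor series for the exponential; all helper names are illustrative and not taken from the paper's code) compares $(e^{A/m}e^{B/m})^m$ with $e^{A+B}$ for a pair of non-commuting matrices:

```python
# Toy numerical check of the Trotter formula with plain-Python 2x2
# matrices; helper names (matmul, expm, trotter, ...) are illustrative
# and not taken from the paper's code.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def scale(A, c):
    return [[c * A[i][j] for j in range(2)] for i in range(2)]

def expm(A, terms=30):
    # Truncated Taylor series e^A = sum_n A^n / n! (fine for small norms)
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms):
        term = scale(matmul(term, A), 1.0 / n)
        result = add(result, term)
    return result

def trotter(A, B, m):
    # (e^{A/m} e^{B/m})^m
    step = matmul(expm(scale(A, 1.0 / m)), expm(scale(B, 1.0 / m)))
    out = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(m):
        out = matmul(out, step)
    return out

def maxdiff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(2) for j in range(2))

# Two non-commuting matrices
A = [[0.0, 1.0], [0.0, 0.0]]
B = [[0.0, 0.0], [1.0, 0.0]]
exact = expm(add(A, B))

errors = [maxdiff(trotter(A, B, m), exact) for m in (1, 10, 100)]
```

The measured errors decrease monotonically as $m$ grows, consistent with the $1/m$ scaling.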

To explain the algorithm, let us assume a one-dimensional Hamiltonian of the form $H = -\frac{\hbar^2}{2m}\partial_x^2 + V(x)$, where $m$ is the mass of the particle, and $V(x)$ the potential energy. Discretizing the Schrödinger equation using a finite mesh of points spaced by a distance $\Delta$, the Hamiltonian matrix is like that of a tight-binding chain – a tridiagonal matrix. Such a matrix can be split as a sum of a diagonal matrix and two block-diagonal commuting matrices made up of $2\times 2$ matrices:

$$ H = H_0 + H_1 + H_2, $$

where the components are as follows. The diagonal matrix $H_0$ is written as

$$ H_0 = \mathrm{diag}\left(\epsilon_1, \epsilon_2, \ldots, \epsilon_N\right), $$

where $\epsilon_i = V(x_i) + \hbar^2/(m\Delta^2)$ is the effective energy at site $i$, and $N$ is the number of sites. The block-diagonal matrices $H_1$ and $H_2$ are written as

$$ H_1 = \begin{pmatrix} 0 & J & & & \\ J & 0 & & & \\ & & 0 & J & \\ & & J & 0 & \\ & & & & \ddots \end{pmatrix} $$

and

$$ H_2 = \begin{pmatrix} 0 & & & & \\ & 0 & J & & \\ & J & 0 & & \\ & & & \ddots & \\ & & & & 0 \end{pmatrix}, $$

where $J = -\hbar^2/(2m\Delta^2)$ is the tunneling element between the sites. The exponential of a block matrix is itself a block matrix built from exponentials of the $2\times 2$ blocks. These plane rotation matrices can be written as

$$ \exp\left[-\frac{i\,\delta\tau}{\hbar}\begin{pmatrix} 0 & J \\ J & 0 \end{pmatrix}\right] = \begin{pmatrix} \cos(J\,\delta\tau/\hbar) & -i\,\sin(J\,\delta\tau/\hbar) \\ -i\,\sin(J\,\delta\tau/\hbar) & \cos(J\,\delta\tau/\hbar) \end{pmatrix}, $$

where $\delta\tau$ is the discrete time step. Using the above decomposition, the first-order approximation of the unitary time-step evolution operator is given by

$$ U_1(\delta\tau) = e^{-i\,\delta\tau H_0/\hbar}\; e^{-i\,\delta\tau H_1/\hbar}\; e^{-i\,\delta\tau H_2/\hbar}. $$

Approximants correct up to second order are obtained by symmetrization Suzuki (1985); De Raedt (1996):

$$ U_2(\delta\tau) = \left[U_1(\delta\tau/2)\right]^{T}\, U_1(\delta\tau/2), \qquad (3) $$

where $A^T$ is the transpose of matrix $A$. Extending the algorithm to more than one dimension is also straightforward, as we can simply decompose the Hamiltonian into five parts: the diagonal energies, and two terms for each dimension. Our implementation is based on the second-order formulation of Equation 3. Higher-order approximants are expressed as a sequence of applications of the first- and second-order operators. For example, the fourth-order evolution operator is

$$ U_4(\delta\tau) = U_2(p\,\delta\tau)^2\; U_2\big((1-4p)\,\delta\tau\big)\; U_2(p\,\delta\tau)^2, \qquad (4) $$

where $p = 1/(4 - 4^{1/3})$ Suzuki (1990).
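To make the decomposition concrete, here is a toy Python sketch of repeated second-order steps on a short chain, verifying that the norm of the wave function is conserved. It sets $\hbar = 1$ and a flat potential; all names are illustrative and unrelated to the paper's implementation.

```python
# Toy second-order Trotter-Suzuki evolution on a short 1D chain
# (hbar = 1, flat potential). Names are illustrative only.
import cmath
import math

N, J, dt = 8, 1.0, 0.01
eps = [0.0] * N          # flat potential: effective energies set to zero

def apply_diagonal(psi, tau):
    # exp(-i tau H0): a pure phase on every site
    return [cmath.exp(-1j * e * tau) * p for e, p in zip(eps, psi)]

def apply_pairs(psi, tau, start):
    # Plane rotation on each pair (start, start+1), (start+2, start+3), ...
    psi = psi[:]
    c, s = math.cos(J * tau), math.sin(J * tau)
    for i in range(start, N - 1, 2):
        a, b = psi[i], psi[i + 1]
        psi[i] = c * a - 1j * s * b
        psi[i + 1] = -1j * s * a + c * b
    return psi

def u1(psi, tau):
    # One first-order step: the H0, H1, H2 factors in one order
    return apply_pairs(apply_pairs(apply_diagonal(psi, tau), tau, 0), tau, 1)

def u1_reversed(psi, tau):
    # The same factors in reverse order (the transposed product of Eq. 3)
    return apply_diagonal(apply_pairs(apply_pairs(psi, tau, 1), tau, 0), tau)

def u2(psi, tau):
    # Symmetrized second-order step
    return u1_reversed(u1(psi, tau / 2), tau / 2)

# Normalized initial state localized on one site
psi = [0.0] * N
psi[N // 2] = 1.0
for _ in range(100):
    psi = u2(psi, dt)

norm = sum(abs(p) ** 2 for p in psi)   # stays 1 up to rounding
```

Every factor is unitary, so the norm is conserved exactly, up to floating-point rounding, regardless of the time step.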

Because of its structure, the cost of applying any order of the Trotter-Suzuki operator scales linearly with the number of mesh points, in both time and memory. A general external potential is straightforward to implement and always adds the same cost in time. Therefore, we perform our benchmarks assuming a flat potential landscape, $V(x) = 0$, for which $H_0$ only induces a global phase factor in the wave function, which we can ignore.

Notice that in the above discussion at no point was an assumption made about the “importance” of a particular contribution to the Hamiltonian. This is the reason why the Trotter-Suzuki approach can be used where perturbation methods break down De Raedt (1996).

## 3 Distributing the Workload Across a Cluster

We took the optimized CPU and GPU kernels of Bederián and Dente (2011) as our starting point. These kernels use a double-buffered data access pattern: the result is not calculated in place, but written to a new buffer, and when a new iteration starts, the buffers are swapped. There are two CPU kernels: one that is cache-optimized, and another that is additionally optimized to use the SSE instruction set of the CPU.

Cache optimization means that the input data is divided into blocks. A similar block division is present in the GPU kernel, where, instead of a hardware cache, the shared memory of the simultaneous multiprocessors is used to fetch and explicitly cache data. This block division results in extra calculations: a halo for each block has to be computed to get the correct results for the internal cells. This extra work pays off by the benefit of cached access to the data.
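The idea that the recomputed halo can simply be discarded is illustrated by the following toy Python sketch, which uses a one-dimensional three-point averaging stencil as a stand-in for the real evolution kernel (the sizes and names are made up for illustration):

```python
# Toy illustration of the blocked access pattern: each block is computed
# with an extra halo of recomputed cells, the halo is discarded, and the
# blocked result matches the straightforward global computation.
N, BLOCK, HALO = 16, 8, 1

def step(row):
    # 1D three-point stencil with fixed end points (double-buffered:
    # the result goes to a new list; a caller would swap buffers).
    return [row[i] if i in (0, len(row) - 1)
            else (row[i - 1] + row[i] + row[i + 1]) / 3
            for i in range(len(row))]

src = [float(i * i % 7) for i in range(N)]
reference = step(src)               # unblocked, global computation

blocked = [0.0] * N
for start in range(0, N, BLOCK):
    lo = max(0, start - HALO)
    hi = min(N, start + BLOCK + HALO)
    out = step(src[lo:hi])          # halo cells are recomputed here...
    # ...and discarded: only the block's own cells are kept
    blocked[start:start + BLOCK] = out[start - lo:start - lo + BLOCK]
```

The blocked result matches the global computation exactly, because the halo gives each block the neighbour values it needs.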

The unit of calculation of our distributed version is a process. Using a similar computational structure to the above, we refer to a block assigned to a process as a tile. The block division in a single node is simple: the halo computed by a block is simply discarded, and the next block reads its own halo from the main memory. In a distributed version the halos between the tiles have to be sent across the processes. Using a two-dimensional grid of processes, a tile contains elements of halos belonging to a total of eight other tiles: left, right, top, and bottom neighbours, and also the four diagonal neighbours. To minimize the number of communication requests, a wave pattern is used in the communication: left and right neighbours receive the halo first. This halo has the height of the inner cells of the tile (see Figure 1). Then the horizontal halos are sent to the top and bottom neighbours – the width is the full tile width. In this way the appropriate corner elements are propagated to the diagonal neighbours.
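The corner propagation of the wave pattern can be simulated sequentially. In this toy Python sketch (a 2x2 grid of tiles with a one-cell halo; all names are illustrative), vertical halos of inner height are exchanged first, then horizontal halos of full tile width, and a corner value indeed reaches the diagonal neighbour without any explicit diagonal message:

```python
# Toy sequential simulation of the two-phase "wave" halo exchange on a
# 2x2 grid of tiles. Names and sizes are illustrative only.
H = 1                       # halo width
TILE = 4                    # inner tile size
FULL = TILE + 2 * H         # tile storage including the halo

# tiles[(row, col)][y][x]; inner cells hold a tile-specific marker value
tiles = {(r, c): [[10 * r + c if H <= y < H + TILE and H <= x < H + TILE
                   else None for x in range(FULL)] for y in range(FULL)]
         for r in range(2) for c in range(2)}

def exchange_left_right():
    # Phase 1: vertical halos, inner height only (no corners yet)
    for r in range(2):
        left, right = tiles[(r, 0)], tiles[(r, 1)]
        for y in range(H, H + TILE):
            for k in range(H):
                right[y][k] = left[y][TILE + k]           # left -> right
                left[y][H + TILE + k] = right[y][H + k]   # right -> left

def exchange_top_bottom():
    # Phase 2: horizontal halos over the FULL width, so the columns
    # received in phase 1 carry corner data to the diagonal neighbours
    for c in range(2):
        top, bottom = tiles[(0, c)], tiles[(1, c)]
        for x in range(FULL):
            for k in range(H):
                bottom[k][x] = top[TILE + k][x]           # top -> bottom
                top[H + TILE + k][x] = bottom[H + k][x]   # bottom -> top

exchange_left_right()
exchange_top_bottom()

# Tile (1,1)'s top-left halo corner now holds tile (0,0)'s marker,
# and vice versa, without any diagonal message having been sent.
corner = tiles[(1, 1)][0][0]
opposite = tiles[(0, 0)][FULL - 1][FULL - 1]
```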

Communication is performed asynchronously, but there is a communication barrier between the left-right and the top-bottom halo exchanges due to the data dependency. The generic approach attempts to overlap computation and communication as much as possible, calculating the halo first and starting the communication simultaneously with the calculation of the rest of the evolution step (Figure 1).

The step of starting the halo exchange initiates the asynchronous left-right halo exchange. The last step, finishing the halo exchange, has a communication barrier waiting for the first exchange to finish. After this, it initiates the asynchronous top-bottom halo exchange, and has a second barrier, waiting for this communication to finish. There are variations to this pattern, as detailed below.

### 3.1 The CPU kernels

The cache-optimized kernels of Bederián and Dente (2011) used OpenMP, a directive-driven parallelization of the code, to exploit the power of multicore architectures Dagum and Menon (1998). We found that with our explicit parallelization, the overall performance was better by 30 % if we used the same number of processes within a node as there are cores. Hence, even single-node execution is accelerated by our approach. This finding corresponds well with other benchmarks that compare OpenMP parallelism with more explicit forms of parallelism Krawezik (2003).

Finishing the halo exchange cannot be efficiently overlapped with computation. This means that once the computation of the iteration is completed, there is a communication overhead while the vertical halos are exchanged.

### 3.2 The GPU kernel

The GPU kernel had to be adjusted to work with communication. As it is customary in GPU programming, the device keeps the double-buffered matrices in its own memory, whereas MPI communication is performed from the host memory. This increases complexity as asynchronous memory copies from the device and to the device have to be performed.

To work with such transfer efficiently, we implemented the kernel with streams. A GPU stream is basically a queue of tasks for the GPU to perform: kernel execution, memory copies from and to the device. Given two streams, memory copies in one stream can overlap with computation in the other stream. We use two streams, queueing the halo computation and the memory copies in stream one, and the computation of the rest of cells in stream two. Since a kernel launch with streams returns the control to the calling process immediately, this also means that while stream two is executing, every halo exchange can be completed before the iteration is finished. Hence the distributed GPU version has a much better communication efficiency than the CPU variant.

The communication routine typically does an internal buffering of the data to be sent. This extra memory copy is unseen by the user. To avoid it, we use pinned memory, a specific way to allocate host memory for data that will be exchanged with the GPU. Pinned memory avoids internal buffering when it comes to communication, hence further increasing communication performance.

We use one process per GPU, so single-node execution with multiple GPUs is also possible.

### 3.3 The hybrid kernel

Using streams means most cores of a multicore CPU idle while waiting for the GPU to complete the calculations. GPUs also have less memory, limiting the size of the quantum system for which the evolution is computed. To address these two issues, we introduce a hybrid kernel.

The algorithm first calculates the maximum amount of the tile that can be computed on the GPU assigned to the process. Then, using only one stream, it launches the kernel for the corresponding elements on the internal area of the tile. After the asynchronous kernel launches, it proceeds to calculate the halo and start the halo exchange. Once the halo exchange is initiated, the elements not in the halo and not on the GPU are computed by the CPU. Finishing the halo exchange includes an extra step, an internal halo exchange between the part of the tile associated with the GPU and the rest of the matrix.
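As a rough illustration of this split, the following Python sketch computes how many rows of a tile fit on the GPU, leaving the rest (plus the halo) to the CPU cores. The element size and all numbers are assumptions made for illustration, not figures from the paper:

```python
# Back-of-the-envelope sketch of the hybrid kernel's tile split. The
# 32 bytes/cell (double-precision complex, double-buffered) and all
# sizes below are illustrative assumptions, not the paper's figures.

def split_tile(tile_rows, tile_cols, halo, gpu_mem_bytes,
               bytes_per_cell=32):
    # Rows of the tile's inner area assigned to the GPU vs. the CPU
    inner_rows = tile_rows - 2 * halo
    rows_that_fit = gpu_mem_bytes // (tile_cols * bytes_per_cell)
    gpu_rows = min(inner_rows, rows_that_fit)
    cpu_rows = inner_rows - gpu_rows      # computed by the CPU cores
    return gpu_rows, cpu_rows

# A tall tile that does not fit a hypothetical 6 GB device:
gpu_rows, cpu_rows = split_tile(100_000, 20_000, halo=4,
                                gpu_mem_bytes=6 * 1024 ** 3)
```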

This kernel also uses one process per GPU. This means that in an eight-core system with two GPUs, six cores would be idle. To utilize the power of the extra cores, we use the same directive-driven parallelism as the original CPU kernels of Bederián and Dente (2011), relying on OpenMP. Hence all cores and all GPUs contribute to the work.

## 4 Discussion

### 4.1 Experimental configuration

The implementation of the distributed algorithm used MPI for communication; the source code is available at https://github.com/peterwittek/trotter-suzuki-mpi. We used bullx MPI, which is compatible with the MPI 2.1 standard and is built around OpenMPI. The compiler was the Intel Compiler Chain, with all optimizations turned on and OpenMP enabled. While these are proprietary tools, the code can also be compiled with open source software, such as OpenMPI and GCC. The GPU code was implemented with CUDA and compiled with CUDA 4.0, running with the corresponding runtime.

The benchmarks were performed on the Minotauro cluster at the Barcelona Supercomputing Center. Every node has two Intel Xeon E5649 six-core processors clocked at 2.53 GHz with 12 MB of cache memory and 24 GB of RAM, running the Linux operating system. Every node is equipped with two NVIDIA M2090 cards, each with 512 CUDA cores and 6 GB of GDDR5 memory. The MPI communication across the nodes goes through an InfiniBand network.

Figure 2 illustrates the memory constraints on this cluster with respect to the size of the quantum system. The benchmarks below were chosen so as to maximize the memory usage on different cluster sizes.

### 4.2 Benchmark results

The benchmarks ran ten iterations on increasing cluster sizes, using a synthetic state in a square domain of different linear sizes $N$, in both single and double precision. Note that since this is a two-dimensional quantum system, the dimension of the corresponding Hilbert space is $N^2$, and the Hamiltonian matrices would have size $N^2 \times N^2$. The dimensions were chosen so as to fill the device memory on cluster sizes of one, four, and sixteen nodes. GPUs perform better when the load is higher, whereas CPUs are less sensitive to the load. Choosing the matrix dimensions to fit the GPUs in certain configurations shows how the overall GPU performance decays as the cluster size increases, and it is also fair with respect to the CPU kernels due to their insensitivity to the load. The timing results are plotted in Figures 3, 4, and 5. The results show only the time taken in the main loop of the evolution (as depicted in Figure 1), as the initialization takes considerably different amounts of time with the different types of kernels.

The CPU kernels show an almost linear scaling: the execution time is divided by approximately two as the cluster size is doubled. Communication overhead increases with the cluster size, so eventually the advantage of SSE optimization vanishes with large clusters (see also the next section on parallel efficiency).

The GPU kernel has a more interesting scaling pattern. When the device memory is loaded to at least 50 %, the scaling is close to linear, just as in the case of the CPU kernels. Below this load, the execution time of individual GPUs remains almost constant, the curve flattens out, and there is little benefit to be gained from this kernel in large clusters.

The hybrid kernel trails the GPU kernel in cases where the problem would fit the GPU memory, with the execution time being marginally longer. The real advantage is in cases where the device memory is insufficient. In such cases, the speedup can be close to 2x in double precision compared to the CPU kernels.

### 4.3 Parallel efficiency

Analysing the MPI traces reveals important information about the communication patterns (Figure 6). We restrict our attention to four nodes alone, other cluster sizes show similar patterns. The CPU kernels communicate in an almost identical pattern irrespective of the SSE optimization. Both of them are quite close to being optimal. Without SSE optimization, the average load is 94 %, so on average, the communication overhead is approximately 6 % (Figure 7). The variation of load is small, the average/maximum ratio is 0.97. The SSE-optimized kernel has slightly worse indicators, with an average load of 90 %. This indicates that as the computation gets more efficient, the communication becomes a bottleneck.

The GPU kernel has a very different pattern. Only a fraction of the processes do any kind of work: the ones that are associated with a GPU. The plot of the MPI trace does not show the time spent in the CUDA kernel, since the launch is asynchronous with streams. Having no computational load, the CPU spends most of its time communicating, resulting in an average load of barely 71 %, and an average/maximum ratio of 0.72, indicating a larger variation across the processes.

The trace of the hybrid kernel is surprising, as it looks similar to the GPU trace. While there are considerably more threads, the ones that are associated with GPU operations follow the same pattern as above, resulting in long waits. The rest of the threads, however, overlap communication extremely efficiently. The overall load balance and parallel efficiency are very similar to the GPU case.

## 5 Conclusions and Future Work

In this paper we have shown a distributed variant of the Trotter-Suzuki algorithm based on efficient kernel implementations. We have improved the single-node efficiency of CPU kernels by replacing OpenMP directives by explicit MPI parallelization, and the GPU kernel by using streams. We have shown that our algorithm scales almost ideally, and that our hybrid kernel is efficient for calculating the evolution of large systems in smaller clusters.

The implementation can be improved in different ways. The current breakdown of tasks is entirely manual and hard-coded: halo calculation, halo communication, and calculation of internal cells, the last of which might be split between CPU and GPU resources. If we regard the computations alone, there is a well-defined task: the calculation of a block. The block size is different on the CPU and the GPU. The former is larger, but it is an integer multiple of the GPU block size, so we can use the CPU block size as the unit of calculation. Our implementation is a typical thread-parallel approach, where heavy processes perform the work. Task-based parallelism is another approach, and the task in our case would be the calculation of a block. OmpSs is a variant of OpenMP which takes this approach to parallelism Bueno et al. (2011). It handles heterogeneous hardware that includes GPUs and CPUs, and takes care of the memory copies based on explicitly expressed data dependencies. Asynchronous communication can also be defined as a task. Hence, in theory, it is possible to have a hybrid approach that is not hard-coded, where the distribution of halo calculation and internal cell calculation is decided by the OmpSs runtime. This also means that part of the halo might be calculated by the GPU, and it should also be easier to work on a cluster where some nodes have GPUs and others do not. To achieve this flexibility, we are working on an OmpSs version of our implementation.

Another clear direction is to extend our implementation to three dimensional systems. The variety of possible decomposition strategies in this case is large, and a flexible implementation would be very useful to test out the performance of the different choices. The extension to three dimensions is also motivated by recent theoretical and experimental developments in ultracold atomic gases Lewenstein et al. (2007), which could not be simulated in a single computer with enough precision.

## 6 Acknowledgment

This work was carried out while P. W. was visiting the Department of Computer Applications in Science & Engineering at the Barcelona Supercomputing Center, funded by the “Access to BSC Facilities” project of the HPC-Europe2 programme (contract no. 228398).

## References

- Bederián and Dente (2011) Bederián, C., Dente, A., December 2011. Boosting quantum evolutions using Trotter-Suzuki algorithms on GPUs. In: Proceedings of HPCLatAm-11, 4th High-Performance Computing Symposium. Córdoba, Argentina.
- Bueno et al. (2011) Bueno, J., Martinell, L., Duran, A., Farreras, M., Martorell, X., Badia, R., Ayguade, E., Labarta, J., August 2011. Productive cluster programming with OmpSs. In: Proceedings of Europar-11. Bordeaux, France, pp. 555–566.
- Cucchietti et al. (2002) Cucchietti, F. M., Pastawski, H. M., Wisniacki, D. A., 2002. Decoherence as decay of the Loschmidt echo in a Lorentz gas. Physical Review E 65, 045206.
- Dagum and Menon (1998) Dagum, L., Menon, R., 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering 5 (1), 46–55.
- De Raedt (1996) De Raedt, H., 1996. Computer simulation of quantum phenomena in nano-scale devices. Annual Reviews of Computational Physics 4, 107–146.
- De Raedt et al. (2000) De Raedt, H., Hams, A. H., Michielsen, K., De Raedt, K., 2000. Quantum computer emulator. Computer Physics Communications 132 (1–2), 1 – 20.
- Haidar et al. (2012) Haidar, A., Tomov, S., Yamazaki, I., Dong, T., Dongarra, J., Solca, R., Schulthess, T., May 2012. MAGMA: A breakthrough in solvers for eigenvalue problems. In: GPU Technology Conference. San Jose, CA, USA.
- Krawezik (2003) Krawezik, G., June 2003. Performance comparison of MPI and three OpenMP programming styles on shared memory multiprocessors. In: Proceedings of SPAA-03, 15th Annual Symposium on Parallel Algorithms and Architectures. San Diego, CA, USA, pp. 118–127.
- Lewenstein et al. (2007) Lewenstein, M., Sanpera, A., Ahufinger, V., Damski, B., Sen, A., Sen, U., 2007. Ultracold atomic gases in optical lattices: mimicking condensed matter physics and beyond. Advances in Physics 56 (2), 243–379.
- Özdoğan (2007) Özdoğan, C., 2007. Scaling behavior of TBMD code with parallel eigensolver. In: Science and Supercomputing in Europe Report. HPC-Europe, pp. 1040–1043.
- Poulin et al. (2011) Poulin, D., Qarry, A., Somma, R., Verstraete, F., 2011. Quantum simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert space. Physical Review Letters 106 (17), 170501.
- Suzuki (1985) Suzuki, M., 1985. Decomposition formulas of exponential operators and Lie exponentials with some applications to quantum mechanics and statistical physics. Journal of Mathematical Physics 26, 601.
- Suzuki (1990) Suzuki, M., 1990. Fractal decomposition of exponential operators with applications to many-body theories and Monte Carlo simulations. Physics Letters A 146 (6), 319–323.
- Suzuki (1993) Suzuki, M., 1993. General decomposition theory of ordered exponentials. Proceedings of the Japan Academy. Ser. B: Physical and Biological Sciences 69 (7), 161–166.
- Trotter (1959) Trotter, H., 1959. On the product of semi-groups of operators. Proceedings of the American Mathematical Society 10, 545–551.