GPU accelerated Trotter-Suzuki solver for quantum spin dynamics

GPU accelerated Trotter-Suzuki solver for quantum spin dynamics

Axel D. Dente Carlos S. Bederián Pablo R. Zangara Horacio M. Pastawski Instituto de Física Enrique Gaviola (IFEG), CONICET-UNC, 5000, Córdoba, Argentina Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, 5000, Córdoba, Argentina INVAP S.E., 8403, San Carlos de Bariloche, Argentina

The resolution of dynamics in out of equilibrium quantum spin systems lies at the heart of fundamental questions among Quantum Information Processing, Statistical Mechanics and Nano-Technologies. Efficient computational simulations of interacting many-spin systems are extremely valuable tools for tackling such questions. Here, we use the Trotter-Suzuki (TS) algorithm, a well-known strategy that provides the evolution of quantum systems, to address the spin dynamics. We present a GPU implementation of a particular TS version, which has been previously implemented on single cores in CPUs. We develop a massive parallel version of this algorithm and compare the efficiency between CPU and GPU implementations. This method reduces the execution time considerably and is capable of dealing with systems of up to 27 spins (only limited by GPU memory).

GPU Computing, Many-Body Quantum Dynamics, Trotter Suzuki
journal: Computer Physics Communications

1 Introduction

The evolution of a generic quantum spin system is described by an appropriate Schrödinger equation, where the Hamiltonian operator encloses all the information about its dynamics, including external fields and spin-spin interactions. The “academic” strategy involves the solution of an eigen-problem for and the state vectors merzbacher-1998-quantum-Mechanics (). This can be a major computational task, since the dimensionality of the underlying Hilbert space scales exponentially with the number of interacting spins in the system. In practice, such scenario constrains the problem to the treatment of few-spins systems, or dealing with special cases where symmetries provide further simplifications LiebMattis-1961 ().

Nowadays, there are many alternative strategies that can be employed instead of the full matrix diagonalization. A reliable approximate method needs to be unitary, i.e. it must conserve the total probability during the quantum evolution. In particular, standard numerical integrators like the Runge-Kutta algorithm fail to fulfill such condition. By contrast, the Trotter-Suzuki (TS) algorithm HdeRaedt-TrotterApplications (); DeRaedt-trotter () does preserve unitarity, since it approximates the exact evolution operator by a set of adequately built partial evolution operators. Provided that the time steps are short enough, the exact evolution is well approximated by the successive partial evolutions. Additionally, it is worthy to mention that the TS method does not rely on any Hilbert space truncation. This is particularly important when addressing long-time asymptotics of interacting many-body systems, a problem of major relevance in fundamental physics rigolNATURE2008 (); polkovnikovRMP ().

Within the last years, Graphics Processing Units (GPU) have been successfully employed to accelerate numerical algorithms that solve the Schrödinger equation Bederian-Dente-Trotter (); Cucchietti (); Bauke (). Most of these implementations deal with one-body or wave-packet quantum dynamics. However, the potential of the GPU has not been exploited within a wide range of many-body problems, such as interacting spin systems. Here, spin-spin interactions provide a substantial complexity, which turns out to be a crucial challenge not only for the computational implementation but also for out-of-equilibrium physics.

In this article we report on an implementation of a 4th order TS decomposition in GPU. We consider intrinsically interacting spin Hamiltonians without any Hilbert space truncation or any further assumption about symmetries or ad hoc dynamical conditions. Our implementation is based on the XYZ decomposition DeRaedt-XYZ-Descomposition (), in which the Hamiltonian is partitioned in the X, Y and Z components of the bilinear spin interactions. Since the Z component can be written in a simple diagonal form, it only modifies the phases of the states. Moreover, since the X and Y components are non-diagonal, local rotations are performed in order to map them into Z-like terms.

The Z-like terms and the local rotations are implemented by means of a massive parallelization scheme. The speedup obtained goes beyond 30 times compared with an OpenMP implementation on a 6-core CPU. Despite this decrease in execution time, the system size is still bounded by the maximum amount of memory available, which is about 6GB for current-generation GPUs. Thus, our implementation is capable of reaching a system of 27 spins, which is still a significant number when compared to other methods.

The paper is organized as follows. In Sec. 2 we summarize the basis of the TS algorithm. In Sec. 3 we describe the main procedures performed by our GPU implementation. In Sec. 4 we discuss the performance and accuracy. We include also in A a brief discussion on two scenarios where this computational strategy can be employed: magnetic resonance spin ensemble calculations, where the present work is currently being exploited, and its potential use in the evolution of Matrix Product States. Sec. 5 concludes the present work.

2 The Trotter-Suzuki Algorithm

In this section we summarize the TS method for spin systems as presented in Ref. DeRaedt-trotter (). We consider a time-independent spin Hamiltonian:


where is the spin operator at site and . The parameters define the local fields or chemical shifts, with the constraint , while are the couplings between two spins.

We assume that the -spin system is initially described by a state vector , which is expanded in the computational Ising basis:


Here, are complex coefficients and are tensor products of eigenvectors of each , typically referred as the -decoupled or Ising basis. In A we discuss two particularly relevant cases in which pure states, as in Eq. 2, are employed to evaluate the dynamics of complex many-body systems. In fact, we stress that the present computational strategy can be substantially exploited when combined with sophisticated physically-based representations Alvarez-PRL-2008 ().

The state of the system at an arbitrary time is formally given by . For simplicity, we set from now on .

The main idea of the TS method relies on finding an appropriate partition of which may yield a set of simple partial evolutions that approximate the exact one . If we consider Eq. 1 as a particular sum of terms , then:

The first order approximation to is


while the second and fourth order approximations can be expressed in terms of the first:

where . These are indeed satisfactory approximations to provided that must be small enough compared to the maximum local time-scale determined by . Such role is played by the local second moment of the interactions. This means that the partial evolution time step must satisfy:


Now we address specifically how to build an operator given by Eq. 3 from the total Hamiltonian of Eq. 1. The natural choice for the partition sum are the single-spin and the two-spin terms, which shall be mapped into a diagonal representation by suitable rotations. In particular, we consider the operators that rotate and axis into the axis:


and satisfy:

These operators rotate and into with the purpose of performing any phase correction in the computational Ising basis. The partial evolution yields a trivial phase for , while need to be properly rotated using and respectively. The -index labels each spin, and the corresponding global rotations are defined by and .

Let us first consider the single-spin operations,


Here, we stress that the approximate equality relies on the validity of Eq. 4. As mentioned above, non trivial exponentials are rotated:


Notice that this kind of single-spin operations can be computed exactly without rotations and without the approximation of Eq. 6. Nevertheless, our purpose is to write all the partial evolution operators in an explicit diagonal form. This strategy will be specifically exploited in the computational implementation, as shown in Section 3.

In analogy with Eq. 6, the two-spin operators yield:


Once again, the Z-terms yield diagonal operators:


The remaining two-spin terms are rotated by and accordingly,


Notice that, from Eqs. 7 and 10, the rotations for single and double spin terms can be performed simultaneously. In such sense, we stress that the TS implementation described here has been intentionally prepared in order to enable parallelization.

3 Implementation

Any state written in the form of Eq. 2 can be evolved by the successive application of partial evolution operators, as those defined in Sec. 2. At any time, the state vector is represented by a double precision complex array that occupies bytes of memory. Its th element stores the probability amplitude for the corresponding state of the Ising basis. Since these states are tensor products of up and down configurations for each spin, they can be written according to the -bit binary representation of . Naturally, the size of the system is constrained by the maximum amount of memory available in the GPU (in our case, 6GB for a NVIDIA Tesla C2075).

The implementation of the evolution is divided into two main modules: phase-corrections and axis-rotations.

3.1 Phase correction

This module performs the phase corrections depending on the Z-projection (up or down) and position of each spin in each state of the Ising basis. In this stage there are no cross terms between different basis states (all operations are diagonal), and therefore the parallelization is trivial. However, due to the two-spin terms, the time of evaluation of each phase correction increases quadratically with the number of spins.

Since the Hamiltonian is time-independent, the phase corrections can be computed in advance (i.e. before performing the actual sequence of partial evolutions) and stored in memory for the following evolutions, amortizing its high cost. In fact, Eq. 9 ensures that the phases only depend on the time step and the Hamiltonian parameters. The disadvantage of such pre-computed phase corrections lies on the amount of memory required to store three -element vectors, increasing memory usage by 150% and limiting simulations to 27 spins on 6GB GPUs instead of 28. This technique, however, reduces phase correction time for large systems by 97%.

Additionally, since every phase correction is followed by a backward rotation that also operates on the whole state, the phase correction kernel is merged with the backward rotation kernel. Therefore, the phase correction is performed when the rotation kernel reads elements from memory and just before applying the rotation. This saves two global memory operations on each element and a kernel call.

Combined, these improvements reduce the overall simulation time by up to 52%.

3.2 Rotation

The rotation module acts on pairs of basis states by means of the operators and defined in Eq. 5. If a particular basis state has the -th spin down, then it is paired to the basis state that has the -th spin up and the same configuration for the rest of spins. Let be the binary representation of a particular basis state in which the -th spin111Consistently with the binary representation, spins are enumerated from right to left. is down, i.e. its -th bit is : . Then, is paired to .

This pairing procedure must range the whole Hilbert space. Thus, for every value of the pairing must be repeated over the states that have the sought configuration in the -th spin.

In order to show how the parallelization can be performed at this stage, let us exemplify one case of the pairing procedure. If we address the rotation of the second spin in a four-spin system, then the pairs of states under consideration are the following:

Notice that the number of pairings is . From this example it is clear that there are no repetitions in any of the basis states when executing one specific rotation (i.e. a particular ). Thus, each rotation operation within the same -spin results independent and can be parallelized.

3.3 Efficient rotation on GPUs

As stated above, the rotation of the -th spin involves independent operations. Accordingly, a GPU kernel that rotates the -th spin on a single pair is launched on threads, one per pair. Since the rotation has an extremely low arithmetic intensity of 0.125 operations per byte transferred (i.e. 8 operations per pair of states read and written back to memory) compared to our GPU’s 27.46 double precision FLOPS/bandwidth ratio, the kernel is severely limited by memory bandwidth to feed each thread with data. To avoid this bottleneck, the optimized implementation reads each of these states from global memory and performs as many rotations as possible before writing them back.

Let us consider a subset of specific spins () denoted by and fix its state configuration, i.e. with a particular choice of or for each of the spins. Then, the set of all states that satisfy the specified configuration for the spins in is closed under the pairing operation that flips any spin which is not in . For example, in the case one can consider the set composed by spins 1 and 3 (i.e. ). Now, if we fix the configuration , the set of states that satisfy such configuration is , and this set is closed under the pairing operation for spins 2 and 4 (those which are not in ).

The previous observation allows us to launch thread blocks on the GPU, and define their specific spin configurations (the state of the spins in ) using the binary encoding of each thread block’s unique index. This leaves each thread block with its own set of states to rotate and, by the observation above, every state’s peers for spins are also within the set. Each thread block can then copy its set of states to the fast shared memory available in each multiprocessor within a GPU, and perform the rotations for all the spins in without any further global memory access. Once these rotations are done, the results are copied back to global memory and the kernel finishes.

3.4 Coalescing memory accesses

While the previous procedure improves rotation time, the approach has a problem: when rotating “higher” spins, the lower bits of the state handled by a thread block are defined by its thread block index, which results in adjacent states being interleaved across different thread blocks. For example, when rotating spins 1 and 2 in a single kernel call on a 4-spin problem, the set of states rotated by each thread block is:

Thread block States
0 , , ,
1 , , ,
2 , , ,
3 , , ,

However, when rotating spins 3 and 4, the set of states rotated by each thread block is:

Thread block States
0 , , ,
1 , , ,
2 , , ,
3 , , ,

It is easy to notice that consecutive states (e.g. and ) are assigned to different thread blocks. Since GPU memory controllers fetch long bursts (currently 128 bytes long) of contiguous data to L1 cache, which is local to each multiprocessor (which execute few thread blocks at a time), this interleaving wastes memory bandwidth significantly.

In order to mitigate the mentioned problem, we never include the lowest spins in , so that each thread block always operates on runs of at least consecutive elements in memory. Only the first rotation kernel that is called in a full evolution rotates spins , while the rest of the kernels perform fewer rotations than in the previous approach because these lower spins still consume shared memory. The improved bandwidth efficiency allowed by coalesced memory accesses results in a speedup over the previous kernel on high-spin rotations for , which matches the memory controller’s 128-byte line length. Figure 1 shows how a 6-spin rotation is performed using one kernel call with and two kernel calls with .

We also tried storing these elements in registers instead of shared memory, to preserve the maximum amount of rotations per kernel call. This generated significant register pressure due to the size of complex double precision numbers, producing low multiprocessor occupancy, and resulted in the kernel performing worse than the kernel that uses shared memory to store every element.

Figure 1: Scheme of memory accesses for a 6-spin rotation. The spins are arranged from left to right, enumerated from 1. The clear spins are set by the thread block index, and chequered spins are assigned and rotated within each block. Left panel: rotation of spins without memory coalescing. Right panel: the striped spins are used for memory coalescing; two kernel calls are performed, rotating spins and then .

We measured the rotation times for different values of and different thread block sizes, and found out that there are no optimal values that work for every rotation and every system size. Thus, we wrote a small tool that does a brute force search for the fastest series of rotations that perform a full evolution for every system size that fits in memory. These optimal configurations are stored in a file and are used by our rotation code in following executions.

3.5 Porting the model back to the CPU

We replicated the GPU scheme on a CPU using OpenMP intrinsics with few changes to the algorithm: we still precompute phase corrections, divide the system into blocks and perform as many rotations as possible, but each thread loops over whole blocks at a time. The rotation code is instantiated for many different block sizes and values of using C++ template metaprogramming, providing the compiler with as much compile-time data as possible to perform automatic vectorization. Instead of using shared memory to store intermediate values, since the blocking strategy has strong temporal and spatial locality we assume that the caches on the CPU will hold the values without hitting the memory bus. As in the GPU code, optimal rotation sizes and their tunable parameters are precomputed using a separate brute force search tool.

4 Results

4.1 Performance

Figure 2: Execution time (in logarithmic scale) for CPU and GPU implementations of the Trotter-Suzuki method as a function of the number of spins. Each point correspond to the evolution with 50 Trotter-Suzuki steps.

As a result of this implementation we plot in Fig. 2 the execution times for the original Fortran CPU code without optimizations, the backported CPU code running on one and six threads, and the three different GPU versions. The tests were run on an Intel Core i7-980 six-core processor with triple channel DDR3-1066 memory and a Tesla C2075 GPU. The CPU code was compiled with GCC 4.7.2, using -Ofast -march=native -mtune=native -fno-exceptions -fno-rtti -flto -fopenmp compiler flags, while the GPU code was compiled with CUDA 5.0 using -O -use_fast_math -gencode arch=compute_20,code=sm_20 compiler flags.

In Fig. 2, it can also be observed that for small systems, the original CPU code is faster than parallel implementations because computing the quadratic phase correction is faster than performing a table lookup for small values of , and it does not suffer from parallelization overhead like its CPU siblings. Meanwhile, the GPU versions do not have enough work to feed all the GPU’s execution resources.

On the other hand, when is considerably large, the execution time of the GPU implementation increases exponentially in . The inflection point around indicates that the GPU reaches its full capacity, and afterwards each GPU version scales as , with a factor higher than . In fact, this parameter can be obtained by a linear fitting of each curve in Fig. 2 for . The original CPU implementation yields , while the CPU Backported (6 thread) yields . The naive GPU-Basic and the GPU-Optimized result and respectively. These small differences between the implementations are reflected in the behavior of the relative speedup, which is shown in Fig. 3-a and b for selected implementations.

Since the memory bandwidth is the main limiting factor of the algorithm’s performance, we measured the bandwidth achieved by the multiple rotation kernel to verify the quality of our implementation. Our fastest CUDA kernel implementation uses 80% of the 120GBps reported by NVIDIA’s bandwidth test tool included in the CUDA SDK, while the backported CPU implementation uses between 50% and 70% of the 10 GBps reported by the STREAM benchmark McCalpin2007 () on our system. These results reflect our decision to focus our optimizations on GPU code and ensure that our CPU implementation avoids hitting the memory bus during a series of rotations. We estimate that better tuned CPU implementations can extract a similar percentage of the platform’s memory bandwidth. However, with current high-end dual-socket quad-channel DDR3 platforms reaching 100 GBps XeonBW () and current high-end GPUs doubling this quantity TeslaBW () we expect GPUs to maintain their dominance. The high bandwidth utilization on the GPU also implies that further improvements will require a different algorithm to perform rotations.

Figure 3: Speedup between the implementations. a: CPU-Backported vs GPU-Optimized. For large spin systems the GPU is about 30 times faster than the CPU. b: Optimized GPU version versus a naive implementation in the GPU. An increase greater than 6 times for long spin systems is observed.

4.2 Accuracy

The last issue to be addressed concerns the accuracy of our TS implementation. Several comparisons were performed between the present method and exact diagonalization schemes Lea2 (); Lea3 (); Lea4 (), i.e. the solution of the Schrödinger equation as an eigenproblem. The evolution of local and non-local physical magnitudes was compared for one-dimensional spin systems described by Hamiltonians in the form of Eq. 1. For 222Standard exact diagonalization is limited by the size of systems that can be handled., a sequence of TS steps yields a relative difference bounded by . A second accuracy test was provided by the spurious total magnetization observed in polarization-conserving Hamiltonians. In fact, this a strictly physical condition: if the initial state has zero total magnetization, then the evolved one should remain so. In general, we observe a linear increase of the total magnetization of the spin system, which is consistent with the general expectancies of a linear increase of the TS error as function of the number of steps. In particular, for and TS steps, we observe a relative deviation less than , a precision which is good enough for practical purposes.

An alternative strategy for error quantification is provided by the Loschmidt echo (LE) scholarpedia (). It evaluates the reversibility of a system’s dynamics in the presence of uncontrolled degrees of freedom Zangara-PRA-2012 (). For a particular initial state , the standard LE definition jalpa () is:


where is a reversed Hamiltonian, and encloses uncontrolled non-reversed degrees of freedom. In the present case, we set , which for an ideal perfectly accurate computation would yield , . In principle, this statement is completely irrespective of the TS approximation and its intrinsic accuracy (i.e. the validity of Eq. 4). As a matter of fact, exactly the same sequences of TS partial evolutions are applied in the forward and backwards dynamics, except for the change in the sign of . Therefore, the deviations of away from the unity could be intrinsically originated in the execution of floating point operations by the specific Hardware.

We consider initial states given by random superpositions of Ising states (e.g. Eq. 13), for a spin set. Two cases of different dynamical complexity are evaluated. The first case is a ring configuration described by a nearest neighbors Heisenberg Hamiltonian. As above, this interaction conserves total magnetization, and then the dynamics does not explore the whole Hilbert space. For this case, TS steps yield , while TS steps yield . Again, the error appears to increase linearly with the number of TS steps. However, we stress here that these deviations cannot be associated to the TS decomposition. The second case is built from the first, adding double quantum terms up to third next nearest neighbors. In this situation, total magnetization is not conserved, and therefore dynamics effectively mixes all spin projection subspaces exploring the whole Hilbert space. It turns out that TS steps yield , which is slightly larger than the first case. This may indicate that errors in computational operations can depend on the complexity of the many-body dynamics.

Once again, these examples evidence a satisfactory accuracy of our TS-GPU implementation. A systematic study of the LE as an error quantifier is well beyond the scope of this article. This would require ranging over and the number of TS steps, different Hamiltonian complexities and initial states, among many other factors. However, the LE turns to be a promising witness to address computational errors in GPU and CPU implementations.

5 Conclusions

We presented here a GPU implementation that boosted the Trotter-Suzuki method for quantum spin dynamics. We developed a parallelization scheme to exploit the massive parallel architecture of the GPU cards. The results showed a significant increase of performance when compared to a similar CPU implementation. In our tested platform, the speedup was measured to be of up to 30 times.

The benefits provided by this massive parallel hardware, boosted the capability of evaluating the dynamics of considerably large quantum spin systems. In particular, we were able to evolve a maximum of spins (limited only by the GPU memory) in reasonable execution times.

The comparison between our Trotter-Suzuki implementation with exact numerical approaches yielded estimated relative errors which turned to be fairly acceptable within the standard expectancies of this kind of computational strategy.

The implementation of this algorithm and the efficiency achieved open promising opportunities for studying fundamental questions within the field of out-of-equilibrium quantum many-body systems.

6 Acknowledgements

The authors acknowledge Lea F. Santos for providing the data for accuracy comparison, and Nicolás Wolovick for fruitful discussions. This work was performed with the financial support from CONICET, ANPCyT, SeCyT-UNC, MinCyT-Cor, and the NVIDIA professor partnership program.

Appendix A Spin systems described by single pure states

We will briefly mention here two cases where an interacting many-spin system can be described or approximately described by, a pure state: the random superposition (entangled) states for high temperature systems, and the so-called Matrix Product States. These are suited candidates for our GPU-boosted TS implementation.

In the first case, is characterized by the infinite temperature limit, as it is often the situation in Magnetic Resonance spin dynamics. Strictly speaking, cannot be described by a pure (single vector) state as in Eq. 2, but by a highly mixed state, typically denoted by a density matrix. This represents a whole probabilistic ensemble, and contains all the statistical information about blum-1996 (). The manipulation of the density matrix may rapidly be cumbersome due to memory requirements, as soon as its dimension scales as . But, provided the observables to be evaluated are local (they involve just a few spins within ), one can use just a few pure entangled states to compute ensemble-averaged quantities Alvarez-PRL-2008 (). This procedure enables a physical parallelization which relies on the quantum superposition principle. A simple case may be an initial state given by a single spin up-polarized (localized excitation), and the rest of them in the high-temperature thermal equilibrium, represented by a mixture of all states with amplitudes satisfying the appropriate statistical weights and random phases:


where, analogously to Eq. 2, are the states of the computational Ising basis. This case has been employed to evaluate specific time-dependent correlation functions for spin systems alvarez-PRA-2010 (); Zangara-PRA-2012 (); Fine2012 (), avoiding the storage, manipulation and diagonalization of overwhelmingly large matrices. Most importantly, it is nowadays employed to address the problem of thermalization in closed quantum systems, being assisted with our GPU implementation.

The second case, refers to Matrix Product States Werner1992 (); Banuls2009 (), which constitute a set of states that successfully approximates the exact state of in many physical situations. They are intimately related to renormalization group methodsWhite-DMRG-1992 (), and have proved very useful to deal with dynamical observables in one-dimensional quantum spin systemsBanuls2012 ().

For a chain of spins, a Matrix Product State is given by:


where is a -dimensional matrix, and represents an Ising state. The time-evolution of this kind of states is performed by a suitable TS algorithmBanuls2012 (), and therefore they are promising candidates for TS-GPU implementations like the one we present in this article.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description