GPU accelerated TrotterSuzuki solver for quantum spin dynamics
Abstract
The resolution of dynamics in out of equilibrium quantum spin systems lies at the heart of fundamental questions among Quantum Information Processing, Statistical Mechanics and NanoTechnologies. Efficient computational simulations of interacting manyspin systems are extremely valuable tools for tackling such questions. Here, we use the TrotterSuzuki (TS) algorithm, a wellknown strategy that provides the evolution of quantum systems, to address the spin dynamics. We present a GPU implementation of a particular TS version, which has been previously implemented on single cores in CPUs. We develop a massive parallel version of this algorithm and compare the efficiency between CPU and GPU implementations. This method reduces the execution time considerably and is capable of dealing with systems of up to 27 spins (only limited by GPU memory).
keywords:
GPU Computing, ManyBody Quantum Dynamics, Trotter Suzuki1 Introduction
The evolution of a generic quantum spin system is described by an appropriate Schrödinger equation, where the Hamiltonian operator encloses all the information about its dynamics, including external fields and spinspin interactions. The “academic” strategy involves the solution of an eigenproblem for and the state vectors merzbacher1998quantumMechanics (). This can be a major computational task, since the dimensionality of the underlying Hilbert space scales exponentially with the number of interacting spins in the system. In practice, such scenario constrains the problem to the treatment of fewspins systems, or dealing with special cases where symmetries provide further simplifications LiebMattis1961 ().
Nowadays, there are many alternative strategies that can be employed instead of the full matrix diagonalization. A reliable approximate method needs to be unitary, i.e. it must conserve the total probability during the quantum evolution. In particular, standard numerical integrators like the RungeKutta algorithm fail to fulfill such condition. By contrast, the TrotterSuzuki (TS) algorithm HdeRaedtTrotterApplications (); DeRaedttrotter () does preserve unitarity, since it approximates the exact evolution operator by a set of adequately built partial evolution operators. Provided that the time steps are short enough, the exact evolution is well approximated by the successive partial evolutions. Additionally, it is worthy to mention that the TS method does not rely on any Hilbert space truncation. This is particularly important when addressing longtime asymptotics of interacting manybody systems, a problem of major relevance in fundamental physics rigolNATURE2008 (); polkovnikovRMP ().
Within the last years, Graphics Processing Units (GPU) have been successfully employed to accelerate numerical algorithms that solve the Schrödinger equation BederianDenteTrotter (); Cucchietti (); Bauke (). Most of these implementations deal with onebody or wavepacket quantum dynamics. However, the potential of the GPU has not been exploited within a wide range of manybody problems, such as interacting spin systems. Here, spinspin interactions provide a substantial complexity, which turns out to be a crucial challenge not only for the computational implementation but also for outofequilibrium physics.
In this article we report on an implementation of a 4th order TS decomposition in GPU. We consider intrinsically interacting spin Hamiltonians without any Hilbert space truncation or any further assumption about symmetries or ad hoc dynamical conditions. Our implementation is based on the XYZ decomposition DeRaedtXYZDescomposition (), in which the Hamiltonian is partitioned in the X, Y and Z components of the bilinear spin interactions. Since the Z component can be written in a simple diagonal form, it only modifies the phases of the states. Moreover, since the X and Y components are nondiagonal, local rotations are performed in order to map them into Zlike terms.
The Zlike terms and the local rotations are implemented by means of a massive parallelization scheme. The speedup obtained goes beyond 30 times compared with an OpenMP implementation on a 6core CPU. Despite this decrease in execution time, the system size is still bounded by the maximum amount of memory available, which is about 6GB for currentgeneration GPUs. Thus, our implementation is capable of reaching a system of 27 spins, which is still a significant number when compared to other methods.
The paper is organized as follows. In Sec. 2 we summarize the basis of the TS algorithm. In Sec. 3 we describe the main procedures performed by our GPU implementation. In Sec. 4 we discuss the performance and accuracy. We include also in A a brief discussion on two scenarios where this computational strategy can be employed: magnetic resonance spin ensemble calculations, where the present work is currently being exploited, and its potential use in the evolution of Matrix Product States. Sec. 5 concludes the present work.
2 The TrotterSuzuki Algorithm
In this section we summarize the TS method for spin systems as presented in Ref. DeRaedttrotter (). We consider a timeindependent spin Hamiltonian:
(1) 
where is the spin operator at site and . The parameters define the local fields or chemical shifts, with the constraint , while are the couplings between two spins.
We assume that the spin system is initially described by a state vector , which is expanded in the computational Ising basis:
(2) 
Here, are complex coefficients and are tensor products of eigenvectors of each , typically referred as the decoupled or Ising basis. In A we discuss two particularly relevant cases in which pure states, as in Eq. 2, are employed to evaluate the dynamics of complex manybody systems. In fact, we stress that the present computational strategy can be substantially exploited when combined with sophisticated physicallybased representations AlvarezPRL2008 ().
The state of the system at an arbitrary time is formally given by . For simplicity, we set from now on .
The main idea of the TS method relies on finding an appropriate partition of which may yield a set of simple partial evolutions that approximate the exact one . If we consider Eq. 1 as a particular sum of terms , then:
The first order approximation to is
(3) 
while the second and fourth order approximations can be expressed in terms of the first:
where . These are indeed satisfactory approximations to provided that must be small enough compared to the maximum local timescale determined by . Such role is played by the local second moment of the interactions. This means that the partial evolution time step must satisfy:
(4) 
Now we address specifically how to build an operator given by Eq. 3 from the total Hamiltonian of Eq. 1. The natural choice for the partition sum are the singlespin and the twospin terms, which shall be mapped into a diagonal representation by suitable rotations. In particular, we consider the operators that rotate and axis into the axis:
(5) 
and satisfy:
These operators rotate and into with the purpose of performing any phase correction in the computational Ising basis. The partial evolution yields a trivial phase for , while need to be properly rotated using and respectively. The index labels each spin, and the corresponding global rotations are defined by and .
Let us first consider the singlespin operations,
(6) 
Here, we stress that the approximate equality relies on the validity of Eq. 4. As mentioned above, non trivial exponentials are rotated:
(7) 
Notice that this kind of singlespin operations can be computed exactly without rotations and without the approximation of Eq. 6. Nevertheless, our purpose is to write all the partial evolution operators in an explicit diagonal form. This strategy will be specifically exploited in the computational implementation, as shown in Section 3.
In analogy with Eq. 6, the twospin operators yield:
(8) 
Once again, the Zterms yield diagonal operators:
(9) 
The remaining twospin terms are rotated by and accordingly,
(10) 
3 Implementation
Any state written in the form of Eq. 2 can be evolved by the successive application of partial evolution operators, as those defined in Sec. 2. At any time, the state vector is represented by a double precision complex array that occupies bytes of memory. Its th element stores the probability amplitude for the corresponding state of the Ising basis. Since these states are tensor products of up and down configurations for each spin, they can be written according to the bit binary representation of . Naturally, the size of the system is constrained by the maximum amount of memory available in the GPU (in our case, 6GB for a NVIDIA Tesla C2075).
The implementation of the evolution is divided into two main modules: phasecorrections and axisrotations.
3.1 Phase correction
This module performs the phase corrections depending on the Zprojection (up or down) and position of each spin in each state of the Ising basis. In this stage there are no cross terms between different basis states (all operations are diagonal), and therefore the parallelization is trivial. However, due to the twospin terms, the time of evaluation of each phase correction increases quadratically with the number of spins.
Since the Hamiltonian is timeindependent, the phase corrections can be computed in advance (i.e. before performing the actual sequence of partial evolutions) and stored in memory for the following evolutions, amortizing its high cost. In fact, Eq. 9 ensures that the phases only depend on the time step and the Hamiltonian parameters. The disadvantage of such precomputed phase corrections lies on the amount of memory required to store three element vectors, increasing memory usage by 150% and limiting simulations to 27 spins on 6GB GPUs instead of 28. This technique, however, reduces phase correction time for large systems by 97%.
Additionally, since every phase correction is followed by a backward rotation that also operates on the whole state, the phase correction kernel is merged with the backward rotation kernel. Therefore, the phase correction is performed when the rotation kernel reads elements from memory and just before applying the rotation. This saves two global memory operations on each element and a kernel call.
Combined, these improvements reduce the overall simulation time by up to 52%.
3.2 Rotation
The rotation module acts on pairs of basis states by means of the operators and defined in Eq. 5. If a particular basis state has the th spin down, then it is paired to the basis state that has the th spin up and the same configuration for the rest of spins. Let be the binary representation of a particular basis state in which the th spin^{1}^{1}1Consistently with the binary representation, spins are enumerated from right to left. is down, i.e. its th bit is : . Then, is paired to .
This pairing procedure must range the whole Hilbert space. Thus, for every value of the pairing must be repeated over the states that have the sought configuration in the th spin.
In order to show how the parallelization can be performed at this stage, let us exemplify one case of the pairing procedure. If we address the rotation of the second spin in a fourspin system, then the pairs of states under consideration are the following:
Notice that the number of pairings is . From this example it is clear that there are no repetitions in any of the basis states when executing one specific rotation (i.e. a particular ). Thus, each rotation operation within the same spin results independent and can be parallelized.
3.3 Efficient rotation on GPUs
As stated above, the rotation of the th spin involves independent operations. Accordingly, a GPU kernel that rotates the th spin on a single pair is launched on threads, one per pair. Since the rotation has an extremely low arithmetic intensity of 0.125 operations per byte transferred (i.e. 8 operations per pair of states read and written back to memory) compared to our GPU’s 27.46 double precision FLOPS/bandwidth ratio, the kernel is severely limited by memory bandwidth to feed each thread with data. To avoid this bottleneck, the optimized implementation reads each of these states from global memory and performs as many rotations as possible before writing them back.
Let us consider a subset of specific spins () denoted by and fix its state configuration, i.e. with a particular choice of or for each of the spins. Then, the set of all states that satisfy the specified configuration for the spins in is closed under the pairing operation that flips any spin which is not in . For example, in the case one can consider the set composed by spins 1 and 3 (i.e. ). Now, if we fix the configuration , the set of states that satisfy such configuration is , and this set is closed under the pairing operation for spins 2 and 4 (those which are not in ).
The previous observation allows us to launch thread blocks on the GPU, and define their specific spin configurations (the state of the spins in ) using the binary encoding of each thread block’s unique index. This leaves each thread block with its own set of states to rotate and, by the observation above, every state’s peers for spins are also within the set. Each thread block can then copy its set of states to the fast shared memory available in each multiprocessor within a GPU, and perform the rotations for all the spins in without any further global memory access. Once these rotations are done, the results are copied back to global memory and the kernel finishes.
3.4 Coalescing memory accesses
While the previous procedure improves rotation time, the approach has a problem: when rotating “higher” spins, the lower bits of the state handled by a thread block are defined by its thread block index, which results in adjacent states being interleaved across different thread blocks. For example, when rotating spins 1 and 2 in a single kernel call on a 4spin problem, the set of states rotated by each thread block is:
Thread block  States  

0  ,  ,  ,  
1  ,  ,  ,  
2  ,  ,  ,  
3  ,  ,  , 
However, when rotating spins 3 and 4, the set of states rotated by each thread block is:
Thread block  States  

0  ,  ,  ,  
1  ,  ,  ,  
2  ,  ,  ,  
3  ,  ,  , 
It is easy to notice that consecutive states (e.g. and ) are assigned to different thread blocks. Since GPU memory controllers fetch long bursts (currently 128 bytes long) of contiguous data to L1 cache, which is local to each multiprocessor (which execute few thread blocks at a time), this interleaving wastes memory bandwidth significantly.
In order to mitigate the mentioned problem, we never include the lowest spins in , so that each thread block always operates on runs of at least consecutive elements in memory. Only the first rotation kernel that is called in a full evolution rotates spins , while the rest of the kernels perform fewer rotations than in the previous approach because these lower spins still consume shared memory. The improved bandwidth efficiency allowed by coalesced memory accesses results in a speedup over the previous kernel on highspin rotations for , which matches the memory controller’s 128byte line length. Figure 1 shows how a 6spin rotation is performed using one kernel call with and two kernel calls with .
We also tried storing these elements in registers instead of shared memory, to preserve the maximum amount of rotations per kernel call. This generated significant register pressure due to the size of complex double precision numbers, producing low multiprocessor occupancy, and resulted in the kernel performing worse than the kernel that uses shared memory to store every element.
We measured the rotation times for different values of and different thread block sizes, and found out that there are no optimal values that work for every rotation and every system size. Thus, we wrote a small tool that does a brute force search for the fastest series of rotations that perform a full evolution for every system size that fits in memory. These optimal configurations are stored in a file and are used by our rotation code in following executions.
3.5 Porting the model back to the CPU
We replicated the GPU scheme on a CPU using OpenMP intrinsics with few changes to the algorithm: we still precompute phase corrections, divide the system into blocks and perform as many rotations as possible, but each thread loops over whole blocks at a time. The rotation code is instantiated for many different block sizes and values of using C++ template metaprogramming, providing the compiler with as much compiletime data as possible to perform automatic vectorization. Instead of using shared memory to store intermediate values, since the blocking strategy has strong temporal and spatial locality we assume that the caches on the CPU will hold the values without hitting the memory bus. As in the GPU code, optimal rotation sizes and their tunable parameters are precomputed using a separate brute force search tool.
4 Results
4.1 Performance
As a result of this implementation we plot in Fig. 2 the execution times for the original Fortran CPU code without optimizations, the backported CPU code running on one and six threads, and the three different GPU versions. The tests were run on an Intel Core i7980 sixcore processor with triple channel DDR31066 memory and a Tesla C2075 GPU. The CPU code was compiled with GCC 4.7.2, using Ofast march=native mtune=native fnoexceptions fnortti flto fopenmp compiler flags, while the GPU code was compiled with CUDA 5.0 using O use_fast_math gencode arch=compute_20,code=sm_20 compiler flags.
In Fig. 2, it can also be observed that for small systems, the original CPU code is faster than parallel implementations because computing the quadratic phase correction is faster than performing a table lookup for small values of , and it does not suffer from parallelization overhead like its CPU siblings. Meanwhile, the GPU versions do not have enough work to feed all the GPU’s execution resources.
On the other hand, when is considerably large, the execution time of the GPU implementation increases exponentially in . The inflection point around indicates that the GPU reaches its full capacity, and afterwards each GPU version scales as , with a factor higher than . In fact, this parameter can be obtained by a linear fitting of each curve in Fig. 2 for . The original CPU implementation yields , while the CPU Backported (6 thread) yields . The naive GPUBasic and the GPUOptimized result and respectively. These small differences between the implementations are reflected in the behavior of the relative speedup, which is shown in Fig. 3a and b for selected implementations.
Since the memory bandwidth is the main limiting factor of the algorithm’s performance, we measured the bandwidth achieved by the multiple rotation kernel to verify the quality of our implementation. Our fastest CUDA kernel implementation uses 80% of the 120GBps reported by NVIDIA’s bandwidth test tool included in the CUDA SDK, while the backported CPU implementation uses between 50% and 70% of the 10 GBps reported by the STREAM benchmark McCalpin2007 () on our system. These results reflect our decision to focus our optimizations on GPU code and ensure that our CPU implementation avoids hitting the memory bus during a series of rotations. We estimate that better tuned CPU implementations can extract a similar percentage of the platform’s memory bandwidth. However, with current highend dualsocket quadchannel DDR3 platforms reaching 100 GBps XeonBW () and current highend GPUs doubling this quantity TeslaBW () we expect GPUs to maintain their dominance. The high bandwidth utilization on the GPU also implies that further improvements will require a different algorithm to perform rotations.
4.2 Accuracy
The last issue to be addressed concerns the accuracy of our TS implementation. Several comparisons were performed between the present method and exact diagonalization schemes Lea2 (); Lea3 (); Lea4 (), i.e. the solution of the Schrödinger equation as an eigenproblem. The evolution of local and nonlocal physical magnitudes was compared for onedimensional spin systems described by Hamiltonians in the form of Eq. 1. For ^{2}^{2}2Standard exact diagonalization is limited by the size of systems that can be handled., a sequence of TS steps yields a relative difference bounded by . A second accuracy test was provided by the spurious total magnetization observed in polarizationconserving Hamiltonians. In fact, this a strictly physical condition: if the initial state has zero total magnetization, then the evolved one should remain so. In general, we observe a linear increase of the total magnetization of the spin system, which is consistent with the general expectancies of a linear increase of the TS error as function of the number of steps. In particular, for and TS steps, we observe a relative deviation less than , a precision which is good enough for practical purposes.
An alternative strategy for error quantification is provided by the Loschmidt echo (LE) scholarpedia (). It evaluates the reversibility of a system’s dynamics in the presence of uncontrolled degrees of freedom ZangaraPRA2012 (). For a particular initial state , the standard LE definition jalpa () is:
(11) 
where is a reversed Hamiltonian, and encloses uncontrolled nonreversed degrees of freedom. In the present case, we set , which for an ideal perfectly accurate computation would yield , . In principle, this statement is completely irrespective of the TS approximation and its intrinsic accuracy (i.e. the validity of Eq. 4). As a matter of fact, exactly the same sequences of TS partial evolutions are applied in the forward and backwards dynamics, except for the change in the sign of . Therefore, the deviations of away from the unity could be intrinsically originated in the execution of floating point operations by the specific Hardware.
We consider initial states given by random superpositions of Ising states (e.g. Eq. 13), for a spin set. Two cases of different dynamical complexity are evaluated. The first case is a ring configuration described by a nearest neighbors Heisenberg Hamiltonian. As above, this interaction conserves total magnetization, and then the dynamics does not explore the whole Hilbert space. For this case, TS steps yield , while TS steps yield . Again, the error appears to increase linearly with the number of TS steps. However, we stress here that these deviations cannot be associated to the TS decomposition. The second case is built from the first, adding double quantum terms up to third next nearest neighbors. In this situation, total magnetization is not conserved, and therefore dynamics effectively mixes all spin projection subspaces exploring the whole Hilbert space. It turns out that TS steps yield , which is slightly larger than the first case. This may indicate that errors in computational operations can depend on the complexity of the manybody dynamics.
Once again, these examples evidence a satisfactory accuracy of our TSGPU implementation. A systematic study of the LE as an error quantifier is well beyond the scope of this article. This would require ranging over and the number of TS steps, different Hamiltonian complexities and initial states, among many other factors. However, the LE turns to be a promising witness to address computational errors in GPU and CPU implementations.
5 Conclusions
We presented here a GPU implementation that boosted the TrotterSuzuki method for quantum spin dynamics. We developed a parallelization scheme to exploit the massive parallel architecture of the GPU cards. The results showed a significant increase of performance when compared to a similar CPU implementation. In our tested platform, the speedup was measured to be of up to 30 times.
The benefits provided by this massive parallel hardware, boosted the capability of evaluating the dynamics of considerably large quantum spin systems. In particular, we were able to evolve a maximum of spins (limited only by the GPU memory) in reasonable execution times.
The comparison between our TrotterSuzuki implementation with exact numerical approaches yielded estimated relative errors which turned to be fairly acceptable within the standard expectancies of this kind of computational strategy.
The implementation of this algorithm and the efficiency achieved open promising opportunities for studying fundamental questions within the field of outofequilibrium quantum manybody systems.
6 Acknowledgements
The authors acknowledge Lea F. Santos for providing the data for accuracy comparison, and Nicolás Wolovick for fruitful discussions. This work was performed with the financial support from CONICET, ANPCyT, SeCyTUNC, MinCyTCor, and the NVIDIA professor partnership program.
Appendix A Spin systems described by single pure states
We will briefly mention here two cases where an interacting manyspin system can be described or approximately described by, a pure state: the random superposition (entangled) states for high temperature systems, and the socalled Matrix Product States. These are suited candidates for our GPUboosted TS implementation.
In the first case, is characterized by the infinite temperature limit, as it is often the situation in Magnetic Resonance spin dynamics. Strictly speaking, cannot be described by a pure (single vector) state as in Eq. 2, but by a highly mixed state, typically denoted by a density matrix. This represents a whole probabilistic ensemble, and contains all the statistical information about blum1996 (). The manipulation of the density matrix may rapidly be cumbersome due to memory requirements, as soon as its dimension scales as . But, provided the observables to be evaluated are local (they involve just a few spins within ), one can use just a few pure entangled states to compute ensembleaveraged quantities AlvarezPRL2008 (). This procedure enables a physical parallelization which relies on the quantum superposition principle. A simple case may be an initial state given by a single spin uppolarized (localized excitation), and the rest of them in the hightemperature thermal equilibrium, represented by a mixture of all states with amplitudes satisfying the appropriate statistical weights and random phases:
(12) 
where, analogously to Eq. 2, are the states of the computational Ising basis. This case has been employed to evaluate specific timedependent correlation functions for spin systems alvarezPRA2010 (); ZangaraPRA2012 (); Fine2012 (), avoiding the storage, manipulation and diagonalization of overwhelmingly large matrices. Most importantly, it is nowadays employed to address the problem of thermalization in closed quantum systems, being assisted with our GPU implementation.
The second case, refers to Matrix Product States Werner1992 (); Banuls2009 (), which constitute a set of states that successfully approximates the exact state of in many physical situations. They are intimately related to renormalization group methodsWhiteDMRG1992 (), and have proved very useful to deal with dynamical observables in onedimensional quantum spin systemsBanuls2012 ().
For a chain of spins, a Matrix Product State is given by:
(13) 
where is a dimensional matrix, and represents an Ising state. The timeevolution of this kind of states is performed by a suitable TS algorithmBanuls2012 (), and therefore they are promising candidates for TSGPU implementations like the one we present in this article.
References

(1)
E. Merzbacher, Quantum
Mechanics, Wiley, New York, 1998.
URL http://books.google.com.ar/books?id=6Ja_QgAACAAJ 
(2)
E. Lieb, T. Schultz, D. Mattis,
Two
soluble models of an antiferromagnetic chain, Annals of Physics 16 (3)
(1961) 407 – 466.
doi:DOI:10.1016/00034916(61)901154.
URL http://www.sciencedirect.com/science/article/pii/0003491661901154 
(3)
H. De Raedt, B. De Raedt,
Applications of the
generalized trotter formula, Phys. Rev. A 28 (1983) 3575–3580.
doi:10.1103/PhysRevA.28.3575.
URL http://link.aps.org/doi/10.1103/PhysRevA.28.3575 
(4)
H. De Raedt, K. Michielsen, Handbook of
Theoretical and Computational Nanotechnology, American Scientic Publishers,
Los Angeles, 2006.
URL arXiv:quantph/0406210  (5) M. Rigol, V. Dunjko, M. Olshanii, Thermalization and its mechanism for generic isolated quantum systems, Nature 452 (2008) 854–858. arXiv:0708.1324, doi:10.1038/nature06838.

(6)
A. Polkovnikov, K. Sengupta, A. Silva, M. Vengalattore,
Colloquium:
Nonequilibrium dynamics of closed interacting quantum systems, Rev. Mod.
Phys. 83 (2011) 863–883.
doi:10.1103/RevModPhys.83.863.
URL http://link.aps.org/doi/10.1103/RevModPhys.83.863 
(7)
C. S. Bederián, A. D. Dente,
Boosting
quantum evolutions using trottersuzuki algorithms on gpus, Proceedings of
HPCLatAm 2011: HighPerformance Computing Symposium 40 (2011) 63–75.
URL http://www.40jaiio.org.ar/sites/default/files/T2011/HPC/902.pdf 
(8)
P. Wittek, F. M. Cucchietti,
A
secondorder distributed trotterâsuzuki solver with a hybrid cpuâgpu
kernel, Computer Physics Communications 184 (4) (2013) 1165 – 1171.
doi:10.1016/j.cpc.2012.12.008.
URL http://www.sciencedirect.com/science/article/pii/S0010465512004043 
(9)
H. Bauke, C. H. Keitel,
Accelerating
the fourier split operator method via graphics processing units, Computer
Physics Communications 182 (12) (2011) 2454 – 2463.
doi:10.1016/j.cpc.2011.07.003.
URL http://www.sciencedirect.com/science/article/pii/S0010465511002414 
(10)
H. De Raedt, A. Hams, K. Michielsen, K. De Raedt,
Quantum computer emulator,
Computer Physics Communications 132 (12) (1999) 28.
URL http://arxiv.org/abs/quantph/9911041  (11) G. A. Álvarez, E. P. Danieli, P. R. Levstein, H. M. Pastawski, Quantum parallelism as a tool for ensemble spin dynamics calculations, Phys. Rev. Lett. 101 (12) (2008) 120503. doi:10.1103/PhysRevLett.101.120503.

(12)
J. D. McCalpin, Stream: Sustainable
memory bandwidth in high performance computers, Tech. rep., University of
Virginia, Charlottesville, Virginia, a continually updated technical report.
http://www.cs.virginia.edu/stream/ (19912007).
URL http://www.cs.virginia.edu/stream/ 
(13)
Intel Corporation,
IntelÂ®
Xeon Processor E52600 v2 Product Family Memory bandwidth benchmarks,
[Online; accessed 28October2013] (2013).
URL http://www.intel.com/content/www/us/en/benchmarks/server/xeone52600v2/xeone5v2hpcmemorybandwidth.html 
(14)
Eleks Software,
NVIDIA
Tesla K20 benchmark: facts, figures and some conclusions, [Online; accessed
28October2013] (2012).
URL http://www.elekslabs.com/2012/11/nvidiateslak20benchmarkfacts.html  (15) K. Joel, D. Kollmar, L. F. Santos, An introduction to the spectrum, symmetries, and dynamics of heisenberg spins1/2 chains, ArXiv eprintsarXiv:1209.0115.
 (16) L. F. Santos, F. Borgonovi, F. M. Izrailev, Chaos and statistical relaxation in quantum systems of interacting particles, Physical Review Letters 108 (9) (2012) 094102. arXiv:1110.4663, doi:10.1103/PhysRevLett.108.094102.
 (17) L. F. Santos, private communication (2013).
 (18) D. Wisniacki, A. Goussev, R. Jalabert, H. Pastawski, Loschmidt echo, Scholarpedia 7 (8) (2012) 11687.

(19)
P. R. Zangara, A. D. Dente, P. R. Levstein, H. M. Pastawski,
Loschmidt echo as a
robust decoherence quantifier for manybody systems, Phys. Rev. A 86 (2012)
012322.
doi:10.1103/PhysRevA.86.012322.
URL http://link.aps.org/doi/10.1103/PhysRevA.86.012322  (20) R. A. Jalabert, H. M. Pastawski, Environmentindependent decoherence rate in classically chaotic systems, Phys. Rev. Lett. 86 (12) (2001) 2490–2493. doi:10.1103/PhysRevLett.86.2490.
 (21) K. Blum, Density Matrix Theory and Applications, Physics of Atoms and Molecules, Plenum Press, New York, 1996.

(22)
G. A. Álvarez, E. P. Danieli, P. R. Levstein, H. M. Pastawski,
Decoherence as
attenuation of mesoscopic echoes in a spinchain channel, Phys. Rev. A 82
(2010) 012310.
doi:10.1103/PhysRevA.82.012310.
URL http://link.aps.org/doi/10.1103/PhysRevA.82.012310  (23) T. A. Elsayed, B. V. Fine, Regression relation for pure quantum states and its implications for efficient computing, ArXiv eprintsarXiv:1208.4652.

(24)
M. Fannes, B. Nachtergaele, R. Werner,
Finitely correlated states on
quantum spin chains, Communications in Mathematical Physics 144 (3) (1992)
443–490.
doi:10.1007/BF02099178.
URL http://dx.doi.org/10.1007/BF02099178 
(25)
M. C. Bañuls, M. B. Hastings, F. Verstraete, J. I. Cirac,
Matrix product
states for dynamical simulation of infinite chains, Phys. Rev. Lett. 102
(2009) 240603.
doi:10.1103/PhysRevLett.102.240603.
URL http://link.aps.org/doi/10.1103/PhysRevLett.102.240603 
(26)
S. R. White, Density
matrix formulation for quantum renormalization groups, Phys. Rev. Lett. 69
(1992) 2863–2866.
doi:10.1103/PhysRevLett.69.2863.
URL http://link.aps.org/doi/10.1103/PhysRevLett.69.2863 
(27)
A. MüllerHermes, J. I. Cirac, M. C. Bañuls,
Tensor network
techniques for the computation of dynamical observables in onedimensional
quantum spin systems, New Journal of Physics 14 (7) (2012) 075003.
URL http://stacks.iop.org/13672630/14/i=7/a=075003