Optimizing LargeScale ODE Simulations
Abstract
We present a strategy to speed up RungeKuttabased ODE simulations of large systems with nearestneighbor coupling. We identify the cache/memory bandwidth as the crucial performance bottleneck. To reduce the required bandwidth, we introduce a granularity in the simulation and identify the optimal cluster size in a performance study. This leads to a considerable performance increase and transforms the algorithm from bandwidth bound to CPU bound. By additionally employing SIMD instructions we are able to boost the efficiency even further. In the end, a total performance increase of up to a factor three is observed when using cache optimization and SIMD instructions compared to a standard implementation. All simulation codes are written in C++ and made publicly available. By using the modern C++ libraries Boost.odeint and Boost.SIMD, these optimizations can be implemented with minimal programming effort.
1 Introduction
The numerical integration of ordinary differential equations (ODEs) is a frequent and highly important task in all scientific areas. The efficient implementation of ODE simulations therefore plays a crucial role in many scientific applications as for example seen from the numerous scientific libraries devoted to solving ODEs, e.g. [25, 12, 23, 3]. In recent years, the available computational power has continued to grow tremendously, but the utilization of this power introduces new challenges, especially in the field of numerical simulations where optimal performance is often crucial. The most prominent example is GPGPU computing that has become very popular in the past years, and for which the applicability of ODE simulations was shown recently [19, 6, 7, 1]. Besides the highly parallelized GPUs containing hundreds of cores, also normal CPUs now consist of many cores, and modern workstations include several CPUs. Hence it seems that nowadays performance gains are best achieved by parallelization of the algorithms and thus utilizing all available cores. But this is shortsighted. To reach optimal runtime one always should start tuning the singlethread performance before considering parallelization and multithreading. This is a rather obvious fact, as an optimized singlethread performance is also beneficial for possible later parallelization of the algorithm. However, current processors are highly complex and although they can provide enormous computational power, it has become increasingly difficult to utilize their full performance.
An often overlooked problem in numerical computing is memory bandwidth limitation. In many cases programs are not bound by the available computational power (CPU throughput), but rather by the fact that the CPU has to wait for the required data to arrive from the main memory. Indeed, memory access typically has a latency of the order of a hundred CPU cycles, which can introduce enormous performance losses. Therefore, CPUs are equipped with several layers of caches: L1, L2 and in modern processors also L3 cache. Those caches can preload the required data for the CPU with the purpose of reducing latencies due to main memory access. However, many algorithms are implemented in a way that makes efficient cache usage hard or even impossible. In such situations, adjusting the algorithm to reduce the required cache and/or memory bandwidth can greatly improve the performance. The importance of cache tuning is known since a long time [13, 15] and has become increasingly relevant as the computational power continued to grow faster than the memory bandwidth [8]. Consequently, many modern numerical libraries address this problem and include optimization techniques for enhancing cache performance [14, 11]. In the following, we will see that to reduce memory bandwidth, it can be even beneficial to repeat parts of the computation instead of accessing previous results.
But even if the algorithm is cache friendly and limited by the CPU throughput, there is room for improvement. Modern processors include SIMD extensions (Single Instruction Multiple Data) that allow the parallel computation of two or four doubleprecision operations in a single clockcycle on a single core [22]. Although compilers are capable of making use of SIMD instructions automatically, a process typically called autovectorization, the specific use of SIMD instructions by the programmer can give further significant performance gains as will be also demonstrated here.
In the following sections, we present a new technique to increase the performance of large ODE simulations of onedimensional chains with nearestneighbor interactions. The speed up is based on reducing the required cache bandwidth by introducing a granularity in the algorithm. By adjusting the granularity we will be able to optimize the cache usage and thus increase the performance significantly. We will then make use of SIMD instructions to further tune the efficiency and reach a speed up of up to a factor of three compared to the standard implementation (cf. Fig. 7).
2 Optimizing Simulation Performance
We address the quite general situation of a highdimensional ODE with nearestneighbor interactions. The numerical approximation of the solution will be computed with the explicit RungeKutta4 method [23]. This is the standard, general purpose routine and widely used due to its robustness and simple implementation. However, the techniques presented below can be applied also to any other explicit RungeKutta method.
2.1 Model with NearestNeighbor Coupling
Let us introduce a chain with nearest neighbor interactions as follows: the state of the th element of the chain is represented by , in general a dimensional vector. The chain has length , i.e. , hence in total the system is described by scalar values. The independent variable is usually representing the time in case of dynamical systems. Most generally, an ODE with nearestneighbor interactions can be written as:
(1) 
where denotes the derivative with respect to and we have omitted the explicit timedependence . The function represents the local term and , the nearestneighbor coupling. Note that in this setup both the local and the coupling term can be different for each site. In many cases, however, one faces homogeneous situations where and are the same for all elements:
(2) 
Here, we will also assume such a homogeneous situation. Examples for such systems include nonlinear oscillator chains [18, 4], nonlinear KleinGordon models [10], the discrete nonlinear Schrödinger model [16] and classical spin chains [20]. Furthermore, sometimes also discretized partial differential equations are solved by explicit schemes, e.g. [27]. Nevertheless, the cache optimization technique presented below can be generalized to heterogenic cases, but will not work with implicit routines.
2.2 RungeKutta Schemes
To simulate such a system with nearestneighbor coupling, we first have to choose a memory layout for representing the state . To avoid unnecessary cache misses, the best strategy is to place the data required at the same time in the algorithm also close in memory. For a system with nearestneighbor coupling this is quite simple: the local function needs all elements of a single state , while the coupling additionally requires the left and right neighbors. So the best memory layout is very naturally to put all elements of consecutively one after another. This is shown in Fig. 1 for an example with , i.e. . There, it is also illustrated how all data required to evaluate the rhs of the ODE for one element (see Eq. (3) below) is found in a continuous memory block. Only at the boundary there is potentially an access to the other end of the chain if periodic boundary conditions are employed. However, in the case of large systems, which are considered here, the boundary becomes negligible for the overall performance.
The implementation of an explicit RungeKutta scheme is rather simple. Starting from some given initial condition for the time we will find an approximate solution at for some small step size . The calculation consists of several stages, where at each stage first an evaluation of the rhs of the ODE is performed:
(3) 
Then, a new intermediate approximation is computed from the current, and possibly previous, rhs evaluations:
(4) 
This intermediate approximation is then used in the next stage to compute another rhs evalution and so on. Fig. 2 schematically shows the flow of the algorithm. After stages the final result is obtained using the previously computed evaluations of the rhs:
(5) 
The number of stages as well as the parameters and are properties of the RungeKutta scheme. The prominent RungeKutta4 scheme for example has stages, i.e. four evaluations of the rhs function in Eq. (3) per time step.
Implementing this algorithm is rather straightforward and several libraries exist offering such functionality [25, 12, 23, 3]. However, all those implementations potentially suffer from severe performance loss when dealing with large systems due to memory bandwidth limitations. For example, evaluating the rhs of the ODE Eq. (3) using double precision (8 Bytes per value) has a memory requirement of . For optimal performance, this data should be available in the L1 or L2 cache of the CPU, as those caches can be accessed by the CPU with low latency and extremely high bandwidth. However, those caches have only limited size, typically less than a few hundred Kilobytes. For large systems, i.e. , the memory requirement of the state and the rhs outrun the available L1/L2 cache and the CPU has to employ the L3 cache, which has a larger latency and much lower bandwidth. For very large systems, , eventually also the main memory has to be used resulting in even more significant bandwidth limitations. That means, for large systems any iteration over or is slowed down by cache latency and/or memory bandwidth.
The RungeKutta scheme involves many such iterations. For each stage, the rhs evaluation Eq. (3) as well as the computation of the intermediate approximation Eq. (4) involves an iteration over all elements, hence a total of such iterations. This can lead to a severe performance loss as shown in Fig. 4(a). There, we use and which means one state occupies 24 Megabytes, much more than the available L1/L2 cache size. And indeed, we observe a significant activity for the L3 cache (bottom panel in Fig. 4(a)) for the standard implementation, which results in a suboptimal performance. More details on the performance measurements follow below. First, we will provide an approach to circumvent this problem by introducing granularity to the algorithm.
2.3 Introducing Granularity
As described above, every iteration across all elements can lead to performance loss if the state becomes too large. The solution is rather simple: divide the state into clusters such that each cluster fits into L1/L2 and do a whole RungeKutta step on each cluster in turn. Instead of iterations for each RungeKutta step this would involve only a single iteration and therefore reduce the required L3/memory bandwidth by a factor of . For an uncoupled system, i.e. , this is trivial as all systems are independent from each other and thus can be treated separately. In the presence of coupling, however, the situation is more complicated. Suppose we divide the state into clusters, each having elements, i.e. . The cluster size is also called granularity. Now we want to perform one RungeKutta step for a single cluster that starts at index and ends at : , where is the start index of the next cluster. Hence, we have to compute the function values in Eq. (3). But in general (arbitrary coupling), for this computation we need all values , not only those from the cluster due to the coupling terms.
For nearestneighbor coupling, the situation is much better. Here, we are only missing the values and to compute the values at the left and right boundary of the cluster: and . So we simply append those values to the left and right end of the cluster. Although we are only interested in the result , we perform the RungeKutta step on the extended cluster . However, this only fixes the first stage of the RungeKutta step. That means we obtain a correct intermediate approximation for , while the boundary values of the extended cluster and are now wrong as for those again the required coupling terms are not available. But for the next stage we need exactly those boundary values. Therefore, we extend the cluster even further and add two neighbors on the left and right, which then gives correct results for two stages of the RungeKutta scheme. For each stage of the algorithm the error introduced by the missing coupling term at the edges advances one index into the cluster, as sketched in Fig. 3. So finally, for a RungeKutta scheme with stages we have to extend the original cluster by left and right neighbors to ensure the correct computation of the cluster, as depicted in Fig. 3. In the end, we have the correct result for the cluster , while the extra values on the left and right are wrong and will be neglected. Those values belong to the left (right) neighboring clusters and their correct computation will be performed typically before (after) the current cluster. Consequently, we add an overlap computation of values per cluster, which means an increase of the required computations by a factor of . But we can hope that this increase of computations is outweighted by the reduced cache/memory access of the algorithm.
And indeed, for an optimal choice of the granularity , the cache optimization described above leads to a performance increase of roughly a factor of two, as shown examplarily in Fig. 4(a) (“Cache optimized” vs “Standard”). Some estimate for the optimal granularity can be deduced from the L1/L2 cache size. However, there is a tradeoff between cache usage and the amount of overlap computation and thus we employ performance study where different granularity values are compared, as shown in Fig. 4(b). Again, details on the performance results will follow below.
2.4 SIMD Instructions
As seen in Fig. 4(a), we can reach about 4200 Megaflops/s, that is 1.2 Flops/cycle on a 3.8 GHz Intel Xeon processor with the cache optimized implementation. This is quite impressive, as typically the floating point unit of the CPU saturates at 1 Flop/cycle. Hence, we were able to turn the memory bandwidth bound implementation into a CPU bound version by introducing granularity. The extra performance comes from the SIMD (Single Instruction Multiple Data) units in the CPU, that can perform up to four floating point operations per cycle. Modern compilers can utilize those extra SIMD registers automatically (autovectorization). However, by explicitly using SIMD registers in the program one can hope to reach an even higher performance than by relying on autovectorization.
For that we first have to adjust the memory layout of the implementation. To make optimal use of the SIMD registers, we will organize the data in packs of size .^{1}^{1}1We use a pack size of four as the target processor supports the AVX extensions with 256 Bit registers capable of holding four values (double precision). Older CPUs might only support SSE with 128 Bit registers, where a pack size of two should be used. To allow for optimal utilization of the SIMD units, we have to ensure that for all elements of an SIMD pack the same operations are performed, and that the operation of one entry in the pack does not depend on any other entry of the same pack. Only then can the code be fully vectorized with significant performance gains. Again assume we have a chain of threedimensional systems (), i.e. , and a chain length of : . To make sure the same operations are performed on all entries of a pack we should only put either , , or values in a pack, e.g. , and so on. But with this organization, the computation of the rhs of the ODE in Eq. (3) would introduce coupling between the entries of the pack. To avoid that, we divide the chain in pieces of length and represent each piece by one entry of the SIMD packs. So let , , and be the starting indices of the four parts of the chain. Then the first SIMD pack will contain , the second one and so on. This memory layout is sketched in Fig. 4. Compared to the original memory layout in Fig. 1, we replace the single values, e.g. , by four values from distant points of the chain ().
Using such a memory layout, it is now possible to compute four values at once by virtue of SIMD. In principle this could give a performance gain of a factor four in the best case. However, note that this relies on the fact that the dynamical equations are indeed independent of the chain index, as the above usage of SIMD registers requires that the same operations are performed for each entry of the SIMD pack and hence in the above description also for each element in the chain. Note also that at the edges of the four parts of the chain there is coupling between different entries of the SIMD pack. For example the second chain is stored in the second SIMD entries, but the left neighbor of its first element is the last element of the first chain and hence stored in the first entries of the SIMD packs. So the edges need special treatment and their calculation can not be carried out using the SIMD vectorization. But again for long chains this has negligible performance impact. Finally, the cache optimization described above can be similarly applied to this memory layout as well. It works exactly in the same way, only that one now deals with SIMD packs instead of scalar values, but in return one has a reduced chain length of .
Fig. 4(a) shows the performance comparison of the standard implementation, the optimized cache version and SIMD version. As seen there, using SIMD instructions gives another performance boost of about 50% compared to the cache optimized version, which means a total speed up of a factor of three compared to the standard implementation in this simulation. Details of these performance tests are described below.
3 Performance Study
In the following, we will compare the performance of the different approaches presented above. The examplary system is a chain of Roessler oscillators [24] with nearestneighbor coupling.
3.1 Coupled Roessler Chain
A single Roessler oscillator has a threedimensional () state following the dynamics:
(6) 
The parameters are constant and fixed to some typical values: , , . For those parameter values, the dynamics of the Roessler system is generally chaotic due to the existence of a strange attractor [21]. We add dispersive nearestneighbor coupling in the first coordinate by defining:
(7) 
At the boundaries, we set and , i.e. periodic boundary conditions. Substituting and in Eq. (2) defines our chain of coupled Roessler oscillators that will be used for performance investigations below.
3.2 Performance Results
We simulate a chain of coupled Roessler oscillators. As mentioned above, we use the wellknown and popular RungeKutta4 algorithm with stages for the numerical time evolution. We start from random initial conditions and perform steps with a stepsize of . We measure the runtime needed to perform those steps and then quantify the performance in terms of RungeKutta steps per second:
(8) 
This is repeated for times and we use the minimal runtime (maximal performance) of those 10 trials. Additionally, we measure the CPU throughput in terms of Flops/s (FLoating point OPerations per second) as well as the cache transfer bandwidth using the likwid framwork [26], which is based on the processor’s performance counters.
3.2.1 Results on Intel Xeon
In a first run, the simulations are performed on an Intel Xeon E5 processor with 3.8 GHz (single core, turbo mode). The simulations are implemented in C++ and compiled with the Intel Compiler version 15.0. The Intel Xeon processor has 32KB/256KB/20MB of L1/L2/L3 cache respectively. The representation of one state requires , clearly above the L1/L2 cache capacity, but also bigger than the available L3 cache. Therefore, we expect the standard implementation, where iteration over the state are performed in each RungeKutta step, to suffer severely from bandwidth limitations. This is seen in Fig. 4(a), where the top bar in all three panels represents this standard implementation. We find a CPU throughput of roughly 2000 MFlops/s, which is significantly below the 3800 MFlops/s the FPU unit of this CPU is capable of. Accordingly, the L3 cache bandwidth of about 17 GByte/s clearly indicates significant L3 usage. Hence, the performance remains at below 30 RungeKutta steps per second.
Next, we analyze the performance of the cache optimized implementations introducing granularity and overlap computations as described above. Therefore, in a first step we identify the optimal granularity by means of a performance study for different values . The result is shown in Fig. 4(b) (circles), where we plot the performance (top panel), the CPU throughput (center panel) and the L2 cache bandwidth (bottom panel) for increasing granularity. The granularity is measured in Bytes via to allow for comparison with the cache size. The analyzed range for spawns from values significantly smaller than the L1 cache size () to about the L2 cache size (). We find that the CPU throughput (center panel) is rather independent of the granularity and stays at about 4 GFlops/s. The performance (top panel) however increases with increasing granularity because due to the larger sizes of the clusters (larger granularity), less overlap computation are performed and less Flops/s are wasted for neglected results. The L2 bandwidth starts to increase at a granularity size of about 16 KByte, which is consistent with the L1 cache size of 32 KByte as for example a rhs computation needs to store the state as well as the result. However, the L2 cache access does not have a severe influence on the performance as it is typically fast enough to supply the FPU with the required data. Nevertheless, we identify the optimal granularity at 48 KBytes, which corresponds to a length of each cluster of elements. Fig. 4(a) then shows the performance results for this optimal value of the granularity (“Cache optimized”). And indeed, we find a performance improvement as well as a CPU throughput improvement of a factor of two, while the L3 bandwidth drops by a factor of eight, as expected as now only one iteration over all clusters needs to be performed instead of before.
Finally, we additionally employ the SIMD instructions as explained above. Again we identify the optimal granularity in a similar fashion as before. The results are shown as squares in Fig. 4(b). With SIMD instructions, the performance is more sensitive to the granularity than before and we find an optimal value of 24 KByte, which corresponds to an optimal cluster length of elements^{2}^{2}2Remember that each element of the chain in the SIMD case consists of a pack of four values.. Note, that with SIMD instructions, the L2 access does lead to a drop in the CPU throughput, as now potentially four times as much data is required per CPU cycle. However, the optimal performance (top panel in Fig. 4(b)) does not correspond to the best CPU throughput (second panel) due to the overlap computations that contribute to the throughput but not to the real performance.
Fig. 4(a) shows the performance, CPU throughput and cache bandwidths for the SIMD version at optimal granularity (“Cache optimized + SIMD”). We observe a speed up of about 50% over the plain cache optimized version and a factor three over the standard implementation. The gain in CPU throughput is significantly higher, which can be partially explained by the fact that in this case more overlap computations have to be performed ( at vs at ). However, note that the performance counters used to measure the CPU throughput are not completely reliable. Hence, one should always measure the actual runtime of an algorithm to obtain real performance results. Finally, note that the maximal CPU throughput of about 10 GFlops/s shown in Fig. 4(a) means about 2.6 Flops/cycle on the 3.8 GHz CPU. This is quite impressive, but there is still room for improvements as the SIMD registers can generally do up to 4 Flops/cycle, or even more for specific sequences of additions/multiplications. However, this would involve handtuned code which is beyond the scope of this article.
3.2.2 Results on AMD Opteron
We repeated the same tests on an AMD Opteron 6272 with 2.1 GHz and L1/L2/L3 cache sizes of 48KB/1MB/8MB [5], where we used clang3.4 to compile the C++ implementations.^{3}^{3}3Test with the g++4.9 compiler showed similar, but slightly lower (%) performance results. Fig. 4(a) shows the performance results for the three different implementations. Again, the standard implementation suffers greatly from bandwidth limitations with a rather poor performance of only 10 steps per second. Introducing granularity again greatly improves the performance by reducing the required L3 cache bandwidth. Fig. 5(b) shows the granularity dependence of the performance for the AMD processor. For the nonSIMD version we find similar results as for the Intel Xeon before, with a performance increase for increasing granularity at a rather constant rate of CPU throughput of about 2000 MFlops/s corresponding to roughly 1 Flop/cycle. Hence again we are able to turn the algorithm from cache bandwidth bound to Flops bound by introducing granularity. Additional utilization of SIMD instructions leads to further performance gains of about 40%. The maximal CPU throughput of about 3000 MFlop/s corresponds to roughly 1.4 Flops/cycle, clearly below the 4 Flops/cycle that is possible with this processor’s AVX extensions. As for the Intel Xeon, this indicates that the simulation does not yet make full use of the available SIMD registers Again, it might be that with further handtuning of SIMD instructions additional performance gains can be reached, but this is beyond the scope of this article.
Nevertheless, we identified the optimal granularity and found a speed up of a factor of three compared to the standard implementation. The overall speed up is similar to the Intel Xeon, with similar contributions from cache optimization and SIMD usage. However, the final performance of the Intel Xeon is about another three times better than for the AMD Opteron. This is partly due to the higher clock speed of 3.8 GHz vs only 2.1 GHz for the Opteron, but also because of the better SIMD utilization for the Intel Xeon. Indeed, the AMD Opteron only reaches about 1.4 Flops/cycle, roughly half of what was observed for the Intel Xeon above. Note however, that these results should not be taken as a comparison of the two processors, as they have very different properties, e.g. 16 cores in the AMD Opteron vs eight cores in the Intel Xeon.
3.2.3 Dependence on System Size
Finally, we measured the speed up due to cache optimization and SIMD instructions for different chain lengths . The result is shown in Fig. 7 for both the Intel Xeon (upper panel) and the AMD Opteron (lower panel). The darker regions correspond to the performance gain from cache optimization, while the lighter regions indicate the additional gain from SIMD instructions. The system size is given in Bytes required to store one instance of the state by computing . Note that the optimal granularity was observed to be independent of the system size (results not presented), as one expects because it only depends on the available L1/L2 cache size.
For both processors, using the SIMD instructions gives a constant gain independent on system size. For the Intel Xeon, the influence of cache optimization has a clear increase when the systems get larger than 2 MBytes. This can be explained by the L3 cache size. To perform a whole RungeKutta4 step, the algorithm needs to store the state , an intermediate state and rhs evaluations . This gives a total memory requirement of states. That means for the Intel Xeon with 20 MBytes L3 cache, for system sizes larger than about 3 MBytes the L3 chache is not sufficient to contain all required data and the even slower main memory has to be used. This leads to a further performance decrease of the standard implementation and hence bigger speed up from the cache optimizations. For the AMD Opteron with an L3 cache size of 8 MBytes, the transition to main memory should happen at system sizes of about 1 MByte, which is also consistent with Fig. 7. Furthermore, there we observe another increase of the speed up at system sizes reaching about 2 GBytes. This is probably due to the fact that the main memory of this machine is divided into NuMA domains (Nonuniform Memory Access). So for very large systems, the singlecore simulation requires memory outside of its NuMA domain, which induces further bandwidth limitations and hence even bigger speed up due to cache optimization.
4 Implementation
All simulations were implemented in C++ and the sources are made publicly available [17]. For the numerical RungeKutta4 algorithm the Boost.odeint library was used [3, 2]. It provides modern, generic implementations of ODE solvers and greatly simplified the implementation of the cache optimized algorithm. For example, Boost.odeint performs the required memory allocations for the temporary results of the RK4 steps and also supports an easy way to introduce granularity with the help of a range interface. To efficiently employ the SIMD registers, ’s SIMD library [9] (proposed Boost.SIMD) was used. It provides a C++ abstraction to the lowlevel SIMD registers and allows for a very fast and easy porting of the simulation to SIMD. By using this library, the source code is generic in the sense that it will be decided at compiletime, and not in the source code, which SIMD extensions are going to be employed (e.g. SSE3, SSE4 or AVX). Furthermore, the generic implementation of Boost.odeint allowed us to readily use SIMD data types, and therefore SIMD registers, without the need to reimplement the RungeKutta4 algorithm with SIMD instructions.
For more details of the implementations we refer to the source codes [17]. We only emphasize here that the introduction of granularity as well as the SIMD usage can be done with reasonable programming effort when using Boost.odeint and Boost.SIMD. For example, the simulation of the coupled Roessler system with granularity and SIMD instructions requires merely about 220 lines of C++ code, only about 100 lines more than a basic standard implementation with Boost.odeint. Hence, applying the techniques presented here in ODE simulations with C++ is certainly achievable in most situations.
5 Conclusions and Outlook
In this article, we have presented a technique to improve the performance of largescale ODE simulations with nearestneighbor coupling based on explicit RungeKutta schemes. The main performance bottleneck for such simulations is the cache and memory bandwidth which prevents the full employment of CPU throughput due to the lack of data available to the CPU. We are able to overcome this problem by introducing a granularity to the algorithm which leads to a more cache efficient implementation. To ensure the correctness of the algorithm, we introduced overlap computations, which effectively increased the required computational effort. However, performance tests on two processors, an Intel Xeon and an AMD Opteron, showed that for optimal granularity, this is clearly outweighted by the better cache utilization. With this improvement, we again arrived at an implementation that is bound by the CPU throughput instead of cache/memory bandwidth. We were then able to further increase the performance by employing SIMD instructions. For both processors we found a total performance increase of up to a factor of three, depending on the system size (cf. Fig. 7). We conclude that for largescale simulations, the techniques presented here should be considered as a valuable option to increase simulation performance, typically even before exploring multicore parallelization.
Note that from Figures 4(b) and 5(b) one finds that the optimal granularity roughly corresponds to the L1 cache sizes of the used processor. Hence, it is possible to automatically choose the optimal granularity without employing a performance study by adjusting the granularity to the L1 cache size of the target machine.
However, the cache optimizations are only promising if the program indeed suffers from memory/cache bandwidth limitations. In cases, for example, where the rhs of the ODE involves more complicated expressions that require several CPU cycles to be computed, the cache/memory bandwidths might not impose a bottleneck and hence introducing granularity would not lead to any performance gain. In such situations, one should directly try to utilize SIMD instructions to increase the CPU throughput and obtain better performance this way.
To ensure the correct computation in cases with granularity we introduced overlap computations. The presented strategy only works for nearestneighbor coupling. In general, however, the same idea could also be applied in more complicated situations. In case of nextnearestneighbor couplings, for example, simply twice as long overlaps have to be introduced, and similarly with longer coupling ranges. In principle, also for situations with distant, but sparse couplings granularity can be implemented. But this is more complicated as one has to keep exact track of the error propagation in the RungeKutta scheme. Furthermore, if the coupling is not sparse enough, the gain from cache optimization might not outweight the additional overlap computations anymore. Nevertheless, with a factor three speed up, the cache optimization techniques presented here proved to be highly valuable for nearestneighbor coupling and can potentially be generalized to more complex situations as well.
6 Acknowledgements
I thank Karsten Ahnert and Denis Demidov for fruitful discussions. Moreover, I thank Denis Demidov for proof reading the manuscript. Finally, I thank Hartmut Kaiser and the STE AR group for hospitality and computer cluster access.
This work was supported by the European Comission through the Marie Curie Initial Training Network Neural Engineering Transformative Technologies (NETT) under the project number 289146.
References
 [1] K. Ahnert, D. Demidov, and M. Mulansky, Solving ordinary differential equations on GPUs, in Numerical Computations with GPUs, Springer, 2014, pp. 125–157.

[2]
K. Ahnert and M. Mulansky, odeint.
http://www.odeint.com
, 2009–2014.  [3] , odeint  Solving ordinary differential equations in C++, in Symposium on the âNumerical Solution of Differential Eq. and their Applications, AIP Conference Proceedings, 2011.
 [4] G. Bordyugov, A. Pikovsky, and M. Rosenblum, Selfemerging and turbulent chimeras in oscillator chains, Physical Review E, 82 (2010), p. 035205.
 [5] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes, Cache hierarchy and memory subsystem of the amd opteron processor, IEEE micro, 30 (2010), pp. 16–29.
 [6] D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling, Programming CUDA and OpenCL: A case study using modern C++ libraries, SIAM Journal on Scientific Computing, 35(5) (2013), pp. C453–C472.
 [7] S. Dindar, E.B. Ford, M. Juric, Y.I. Yeo, J. Gao, A.C. Boley, B. Nelson, and J. Peters, SwarmNG: A CUDA library for parallel body integrations with focus on simulations of planetary systems, New Astronomy, 23 (2013), pp. 6–18.
 [8] C. Ding and K. Kennedy, The memory of bandwidth bottleneck and its amelioration by a compiler, in Proceedings of the 14th International Symposium on Parallel and Distributed Processing IPDPS., IEEE, 2000, pp. 181–189.
 [9] P. Estérie, J. Falcou, M. Gaunard, and J.T. Lapresté, Boost.SIMD: generic programming for portable simdization, in Proceedings of the 2014 Workshop on programming models for SIMD/Vector processing, ACM, 2014, pp. 1–8.
 [10] S. Flach, D.O. Krimer, and C. Skokos, Universal spreading of wave packets in disordered nonlinear systems, Physical Review Letters, 102 (2009), p. 024101.
 [11] F. Günther, M. Mehl, M. Pögl, and C. Zenger, A cacheaware algorithm for PDEs on hierarchical data structures based on spacefilling curves, SIAM Journal on Scientific Computing, 28 (2006), pp. 1634–1650.
 [12] A.C. Hindmarsh, P.N. Brown, K.E. Grant, S.L. Lee, R. Serban, D.E. Shumaker, and C.S. Woodward, Sundials: Suite of nonlinear and differential/algebraic equation solvers, ACM Transactions on Mathematical Software (TOMS), 31 (2005), pp. 363–396.
 [13] A. Kagi, J.R. Goodman, and D. Burger, Memory bandwidth limitations of future microprocessors, in Computer Architecture, 1996 23rd Annual International Symposium on, IEEE, 1996, pp. 78–78.
 [14] M. Kowarschik and C. Weiß, An overview of cache optimization techniques and cacheaware numerical algorithms, in Algorithms for Memory Hierarchies, Springer, 2003, pp. 213–232.
 [15] A.R. Lebeck and D.A. Wood, Cache profiling and the SPEC benchmarks: A case study, Computer, 27 (1994), pp. 15–26.
 [16] M. Mulansky, Simulating DNLS models, arXiv:1304.1608, (2013).

[17]
, Roessler performance
tests source codes.
Available on Github:
https://github.com/mariomulansky/olsos
, 2014.  [18] M. Mulansky and A. Pikovsky, Energy spreading in strongly nonlinear disordered lattices, New Journal of Physics, 15 (2013), p. 053015.
 [19] L. Murray, GPU acceleration of RungeKutta integrators, Parallel and Distributed Systems, IEEE Transactions on, 23 (2012), pp. 94–101.
 [20] V. Oganesyan, A. Pal, and D.A. Huse, Energy transport in disordered classical spin chains, Physical Review B, 80 (2009), p. 115104.
 [21] E. Ott, Chaos in Dynamical Systems, Cambridge University Press, 2002.
 [22] D.A. Patterson and J.L. Hennessy, Computer organization and design: the hardware/software interface, Newnes, 2013.
 [23] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, Cambridge University Press, 2007.
 [24] O.E. Rössler, An equation for continuous chaos, Physics Letters A, 57 (1976), pp. 397 – 398.
 [25] L.F. Shampine and M.W. Reichelt, The Matlab ODE suite, SIAM Journal on Scientific Computing, 18 (1997), pp. 1–22.
 [26] J. Treibig, G. Hager, and G. Wellein, Likwid: A lightweight performanceoriented tool suite for x86 multicore environments, in 39th International Conference on Parallel Processing Workshops (ICPPW), IEEE, 2010, pp. 207–216.
 [27] W. YuShun, L. QingHong, and S. YongZhong, Two new simple multisymplectic schemes for the nonlinear Schrödinger equation, Chinese Physics Letters, 25 (2008), pp. 1538–1540.