# Gravitational tree-code on graphics processing units: implementation in CUDA

###### Abstract

We present a new very fast tree-code which runs on massively
parallel Graphical Processing Units (GPU) with NVIDIA CUDA
architecture. The tree-construction and calculation of multipole
moments is carried out on the host CPU, while the force calculation
which consists of tree walks and evaluation of interaction list is
carried out on the GPU. In this way we achieve a sustained
performance of about 100GFLOP/s and data transfer rates of about
50GB/s. It takes about a second to compute forces on a million
particles with an opening angle of . The code has
a convenient user interface and is freely available for
use^{1}^{1}1http://castle.strw.leidenuniv.nl/software/octgrav.html.

###### keywords:

, gravity, tree-code, GPU, parallel, CUDA## 1 Introduction

Direct force evaluation methods have always been popular because of their simplicity and unprecedented accuracy. Since the mid 1980’s, however, approximation methods like the hierarchical tree-code (1986Natur.324..446B, ) have gained enormous popularity among researchers, in particular for studying astronomical self-gravitating -body systems (2003gnbs.book…..A, ) and for studying soft-matter molecular-dynamics problems (Frenkel2001, ). For these applications, direct force evaluation algorithms strongly limit the applicability, mainly due to the time complexity of the problem.

Tree-codes, however, have always had a dramatic set back compared to direct methods, in the sense that the latter benefits from the developments in special purpose hardware, like the GRAPE and MD-GRAPE family of computers 1998sssp.book…..M (); 2001ASPC..228…87M (), which increase workstation performance by two to three orders of magnitude. On the other hand, tree-codes show a better scaling of the compute time with the number of processors on large parallel supercomputers 169640 (); WarBecGod97 () compared to direct -body methods 2007NewA…12..357H (); 1233038 (). As a results, large scale tree-code simulations are generally performed on Beowulf-type clusters or supercomputers, whereas direct -body simulations are performed on workstations with attached GRAPE hardware.

Tree-codes, due to their hierarchical and recursive nature are hard to run efficiently on dedicated Single Instruction Multiple Data (SIMD) hardware like GRAPE, though some benefit has been demonstrated by using pseudo-particle methods to solve for the higher-order moments in the calculation of multipole moments of the particle distributions in grid cells 2004ApJS..151…13K ().

Recently, the popularity of computer games has led to the development of massively parallel vector processors for rendering three-dimensional graphical images. Graphical Processing Units (or GPUs) have evolved from fixed function hardware to general purpose parallel processors. The theoretical peak speed of these processors increases at a rate faster than Moores’ law (Moore, ), and at the moment top roughly 200 GFLOP for a single card. The cost of these cards is dramatically reduced by the enormous volumes in which they are produced, mainly for gamers, whereas GRAPE hardware remains relatively expensive.

The gravitational -body problem proved to be rather ideal to port to modern GPUs, and the first successes in porting the -body problem to programmable GPUs were achieved by Nyland04 (), but it was only after the introduction of the NVIDIA G80 architecture that accurate force evaluation algorithms could be implemented 2007NewA…12..641P () and that the performance became comparable to special purpose computers 2008NewA…13..103B (); Gaburov2009630 ().

Even in these implementations, the tree-code, though pioneered in 2008NewA…13..103B (), still hardly resulted in a speed-up compared to general purpose processors. In this paper we present a novel implementation of a tree-code on the NVIDIA GPU hardware using the CUDA programming environment.

## 2 Implementation

In the classical implementation of the tree-code algorithm all the work is done on the CPU, since special purpose hardware was not available at that time 1986Natur.324..446B (). With the introduction of GRAPE special purpose hardware 1990CoPhC..60..187I (); 1991PASJ…43..841F (), it became computationally favourable to let the special purpose hardware, instead of the CPU, calculate accelerations. Construction of the interaction list in these implementations takes nearly as much time as calculating the accelerations. Since the latest generation of GPUs allows more complex operations, it becomes possible to build the interaction list directly on the GPU. In this case, it is only necessary to transport the tree-structure to the GPU. Since the bandwidth on the host computer is about an order of magnitude lower than on the GPU, it is also desirable to offload bandwidth intensive operations to the GPU. The construction of the interaction list is such an operation. The novel element in our tree-code is construction of the interaction list on the GPU. The remaining parts of the tree-code algorithm (tree-construction, calculation of node properties and time integration) are executed on the host. The host is also responsible for the allocation of memory and the data transfer to and from the GPU. In the next sections we will cover the details of the host and device steps.

### 2.1 Building the octree

We construct the octree in the same way as done in the original BH tree-code. We define the computational domain as a cube containing all particles in the system. This cube is recursively divided into eight equal-size cubes called cells. The length of the resulting cells is half the length of the parent cell. Each of these cells is further subdivided, until less than particles are left. We call these cells leaves, whereas cells containing more than particles are referred to as nodes. The cell containing all particles is the root node.

The resulting tree-structure is schematically shown in Fig. 1. From this tree-structure we construct groups for the tree walk (c.f. section 2.2), which are the cells with the number of particles less than , and compute properties for each cell, such as boundary, mass, centre of mass, and quadrupole moments, which are required to calculate accelerations 1993ApJ…414..200M ().

In order to efficiently walk the octree on the device, its structure requires some reorganisation. In particular, we would like to minimise the number of memory accesses since they are relative expensive (up to 600 clock cycles). In Fig. 2, we show the tree-structure as stored on the GPU. The upper array in the figure is the link-list of the tree, which we call the main tree-array. Each element in this array (separated by blue vertical lines) stores four integers in a single 128-bit words (dashed vertical lines). This structure is particularly favourable because the device is able to read a 128-bit word into four 32-bit registers using one memory access instruction. Two array-elements represent one cell in the tree (green line) with indices to each of the eight children in the main tree-array (indicated by the arrows). A grey filled element in this list means that a child is a leaf (it has no children of its own), and hence it needs not to be referenced. We also use auxiliary tree-arrays in the device memory which store properties of each cell, such as its boundary, mass, centre of mass and multiple moments. The index of each cell in the main tree-array is directly related to its index in the auxiliary tree-arrays by bitshift and addition operations.

The device execution model is designed in such a way that threads which execute the same operation are grouped in warps, where each warp consists of 32 threads. Therefore, all threads in a warp follow the same code path. If this condition is not fulfilled, the divergent code path is serialised, therefore negatively impacting the performance (NVIDIA.CUDA, ). To minimise this, we rearrange groups in memory to make sure that neighbouring groups in space are also neighbouring in memory. Combined with similar tree paths that neighbouring groups have, this will minimise data and code path divergence for neighbouring threads, and therefore further improves the performance.

### 2.2 Construction of an interaction list

In the standard BH-tree algorithm, the interaction lists are constructed for each particle, but particles in the same groups have similar interaction lists. We make use of this fact by building the lists for groups instead of particles (1990JCoPh..87..161B, ). The particles in each group, therefore, share the same interaction list, which is typically longer than it would have been by determining it on a particle basis. The advantage here is the reduction of the number of tree walks by . The tree walk is performed on the GPU in such a way that a single GPU thread is used per group. To take advantage of the cached texture memory, we make sure that neighbouring threads correspond to neighbouring groups.

Owing to the massively parallel architecture of the GPU, two tree walks are required to construct interaction lists. In the first walk, each thread computes the size of the interaction list for a group. This data is copied to the host, where we compute the total size of the interaction list, and memory addresses to which threads should write lists without intruding on other threads’ data. In the second tree walk, the threads write the interaction lists to the device memory.

We implemented the tree walk via a non-recursive stack-based algorithm (the pseudo code is shown in List 1), because the current GPU architecture does not support recursive function calls. In short, every node of the tree, starting with the root node, reads indices of its children by fetching two consecutive 128-bit words (eight 32 bit integers) from texture memory. Each of these eight children is tested against the node-opening criteria (the pseudo code for this step is shown in List 2), and in the case of a positive result a child is stored in the stack (line 4 in the listing), otherwise it is considered to be a part of the interaction list. In the latter case, we check whether the child is a leaf, and if so, we increment a counter for the leaf-interaction list (line 5), otherwise a counter for the node-interaction list (line 6). This division of the interaction lists is motivated by the different methods used to compute the accelerations from nodes and leaves (c.f. section 2.3). In the second tree walk, we store the index of the cell in the appropriate interaction list instead of counting the nodes and leafs.

### 2.3 Calculating accelerations from the interaction list

In the previous step, we have obtained two interaction lists: one for nodes and one for leaves. The former is used to compute accelerations due to nodes, and the latter due to leaves. The pseudo-code for a particle-node interaction is shown in List 3 and the memory access pattern is demonstrated in the left panel of Fig. 3. This algorithm is similar to the one used in the kirin and sapporo libraries for direct -body simulations 2008NewA…13..103B (); Gaburov2009630 (). In short, we use a block of threads per group, such that a thread in a block is assigned to a particle in a group; these particles share the same interaction list. Each thread loads a fraction of the nodes from the node-interaction list into shared memory (blue threads in the figure, lines 2 and 3 in the listing). To ensure that all the data is loaded into shared memory, we put a barrier for all threads (line 4), and afterwards each thread computes gravitational acceleration from the nodes in shared memory (line 5). Prior loading a new set of nodes into the shared memory (green threads in the figure), we ensure that all the threads have completed their calculations (line 6). We repeat this cycle until the interaction list is exhausted.

Calculations of gravitational acceleration due to the leaves differs in several ways. The pseudo-code of this algorithm is presented in List 4, and the memory access pattern is displayed in the right panel of Fig. 3. First, each thread fetches leaf properties, such as index of the first body and the number of bodies in the leaf, from texture memory into shared memory (red lines in the figure, lines 2 and 3 in the listing). This data is used to identify bodies from which the accelerations have to be computed (black lines). Finally, threads read these bodies into shared memory (blue and green lines, line 6) in order to calculate accelerations (line 8). This process is repeated until the leaf-interaction list is exhausted.

## 3 Results

In this section we study the accuracy and performance of the tree code. First we quantify the errors in acceleration produced by the code and then we test its performance. For this purpose we use a model of the Milky Way galaxy (2005ApJ…631..838W, ). We model the galaxy with and particles, such that the mass ratio of bulge, disk and halo particles is 1:1:4. We then proceed with the measurements of the code performance. In all test we use and which we find produce the best performance on both G80/G92 and GT200 architecture. The GPU used in all the tests is a GeForce 8800Ultra.

### 3.1 Accuracy of approximation

We quantify the error in acceleration in the following way: ,
where and are
accelerations calculated by the tree and direct summation
respectively. The latter was carried out on the same GPU as the
tree-code. This allowed us to asses errors on systems as large as 10
million particles^{2}^{2}2We used the NVIDIA 8800Ultra GPU for this
paper, and it takes 10 GPU hours to compute the exact force on a
system with 10 million particles with double precision emulation
(Gaburov2009630, ). In Fig. 4 we show error
distributions for different numbers of particles and for different
opening angles. In each panel, we show which fraction of particles
(vertical-axis) has a relative error in acceleration larger than a
given value (horizontal axis). The horizontal lines show 50th, 10th
and 1st percentile of cumulative distribution. This data shows that
acceleration errors in this tree-code are consistent with the errors
produced by existing tree-codes with quadrupole corrections
(2002JCoPh.179…27D, ; 2001NewA….6…79S, ; 2001PhDT……..21S, ).

We show dependence of errors on both opening angle and number of particles in Fig. 5. In the leftmost panel of the figure, we plot the median and the first percentile of the relative force error distribution as a function of the opening angle for various number of particles (the lowest blue dotted and red dashed lines), and (the upper blue dotted and red dashed lines). As expected, the error increases as a function of with the following scaling from the least-squared fit, . However, the errors increase with the number of particles: the error doubles when the number of particles is increased by two orders of magnitude. This increase of the error is caused by the large number of particles in a leaf, which in our case is 64, to obtain the best performance. We conducted a test with , and indeed observed the expected decrease of the error when the number of particles increases; this error, however, is twice as large compared to for .

### 3.2 Timing

In Fig. 6 we present the timing data as a function of and for various . The (dotted line in the figure) is independent of , which demonstrates that construction of the octree only depends on the number of particles in the system, with . This time becomes comparable to the time spend on the GPU calculating accelerations for and . This is caused by the empirically measured near-linear scaling of time spend on GPU with . As the number of particles increases, the GPU part of the code performs more efficiently, and therefore the scaling drops from to near-linear (Fig. 7). We therefore conclude, that the optimal opening angle for our code is .

In the leftmost panel of Fig. 7 we show dependence of the time spent on the host and the device for various opening angles. In particular, scaling falls between and , which we explained by the increased efficiency of the GPU part of our code with larger number of particles. This plot also shows that the host calculation time is a small fraction of the GPU calculation time, except for and . The middle panel of the figure shows the ratio of the time spent on the device to the total time. Finally, the rightmost panel shows the ratio between the time required to compute forces by direct summation and the time required by the tree-code. As we expected, the scaling is consistent with .

### 3.3 Device utilisation

We quantify the efficiency of our code to utilise the GPU resources by measuring both
instruction and data throughput, and then compare the results to the theoretical peak
values provided by the device manufacturer. In Fig. 8 we show both
bandwidth and computational performance as function of for three different
. We see that the calculation of accelerations operates at about 100GFLOPs^{3}^{3}3We
count 38 and 70 FLOPs for each body-leaf and body-node interaction respectively.. This
is comparable to the peak performance of the GRAPE-6a special-purpose hardware, but this
utilises only % of the computational
power of the GPU^{4}^{4}4Our tests were carried out on a NVIDIA
8800Ultra GPU, which has 128 streaming processors each operating at
clock speed of 1.5Ghz. Given that the GPU is able to perform up to two
floating point operation per clock cycle, the theoretical peak performance
is GFLOP/s.. This occurs because the average
number of bodies in a group is a factor of 3 or 4 smaller than the
, which we set to 64 in our tests. On average,
therefore, only about 30% of the threads in a block are
active.

The novelty of this code is the GPU-based tree walk. Since there is little arithmetic intensity in these operations, the code is, therefore, limited by the bandwidth of the device. We show in Fig. 8 that our code achieves respectable bandwidth: in excess of 50GB/s during the first tree walk, in which only (cached) scatter memory reads are executed. The second tree walk, which constructs the interaction list, is notably slower because there data is written to memory–an operation which is significantly slower compared to reads from texture memory. We observe that the bandwidth decreases with in both tree walks, which is due to increasingly divergent tree-paths between neighbouring groups, and an increase of the write to read ratio in memory operations.

## 4 Discussion and Conclusions

We present a fast gravitational tree-code which is executed on Graphics Processing Units
(GPU) supporting the NVIDIA CUDA architecture. The novelty of this code is the GPU-based
tree-walk which, combined with the GPU-based calculation of accelerations, shows good
scaling for various particle numbers and different opening angles . The hereby
produced energy error is comparable to existing CPU based tree-codes with quadrupole
corrections. The code makes optimal use of the available device resources and shows
excellent scaling to new architectures. Tests indicate that the NVIDIA GT200 architecture,
which has roughly twice the resources as the used G80 architecture, performs the
integration twice as fast. Specifically, the sustained science rate on a realistic galaxy
merger simulation with particles is particles/second. Our
tests revealed our GPU implementation to be two order of magnitudes faster than the
widely-used CPU version of the Barnes-Hut tree-code from the NEMO stellar dynamics package
(1995ASPC…77..398T, ). However, our code is only an order of magnitude faster
compared to a SSE-vectorised tree-code specially tuned for x86-64 processors and
the Phantom-grape library
^{5}^{5}5Private communication with Keigo Nitadori, the author of
the Phtanom-grape library http://grape.mtk.nao.ac.jp/ nitadori/phantom/. Hamada et al
(1654123, ) presented a similarly tuned tree code, in which Phantom-grape is replaced
with a GPU library (2007astro.ph..3100H, ). In this way, it was possible to achieve
the 200 GFLOP/s, and science rate of about particles/s. However, this code does not
include quadrupole corrections, and therefore GFLOP/s comparison between the two codes is
meaningless. Nevertheless, the science rate of the two codes is comparable for similar
opening angles, which implies that our tree-code provides more accurate results for the
same performance.

As it generally occurs with other algorithms, the introduction of a massively parallel accelerator usually makes the host calculations and non-parallelisable parts of the code, as small as they may be, the bottleneck. In our case, we used optimised device code and for the host code we used general tree-construction and tree-walk recursive algorithms. It is possible to improve these algorithms to increase the performance of the host part, but it is likely to remain a bottleneck. Even with the use of modern quad-core processors this part is hard to optimize since its largely a sequential operation.

## Acknowledgements

We thank Derek Groen, Stefan Harfst and Keigo Nitadori for valuable suggestions and reading of the manuscript. This work is supported by NWO (via grants #635.000.303, #643.200.503, VIDI grant #639.042.607, VICI grant #6939.073.803 and grant #643.000.802). We thank the University of Amsterdam, where part of this work was done, for their hospitality.

## References

## References

- (1) J. Barnes, P. Hut, A Hierarchical O(NlogN) Force-Calculation Algorithm, Nature324 (1986) 446–449.
- (2) S. J. Aarseth, Gravitational N-Body Simulations, Gravitational N-Body Simulations, by Sverre J. Aarseth, pp. 430. ISBN 0521432723. Cambridge, UK: Cambridge University Press, November 2003., 2003.
- (3) D. Frenkel, B. Smit, Understanding Molecular Simulation, Second Edition: From Algorithms to Applications (Computational Science Series, Vol 1), 2nd Edition, Academic Press, 2001.
- (4) J. Makino, M. Taiji, Scientific simulations with special-purpose computers : The GRAPE systems, Scientific simulations with special-purpose computers : The GRAPE systems by Junichiro Makino & Makoto Taiji. Chichester ; Toronto : John Wiley & Sons, c1998., 1998.
- (5) J. Makino, Direct Simulation of Dense Stellar Systems with GRAPE-6, in: Dynamics of Star Clusters and the Milky Way, Vol. 228 of Astronomical Society of the Pacific Conference Series, 2001, pp. 87–+.
- (6) M. S. Warren, J. K. Salmon, A parallel hashed oct-tree n-body algorithm, in: Supercomputing ’93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, ACM, New York, NY, USA, 1993, pp. 12–21. doi:http://doi.acm.org/10.1145/169627.169640.
- (7) M. S. Warren et al, Parallel supercomputing with commodity components, in: H. R. Arabnia (Ed.), Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’97), 1997, pp. 1372–1381.
- (8) S. Harfst, A. Gualandris, D. Merritt, R. Spurzem, S. Portegies Zwart, P. Berczik, Performance analysis of direct N-body algorithms on special-purpose supercomputers, New Astronomy 12 (2007) 357–377. arXiv:arXiv:astro-ph/0608125, doi:10.1016/j.newast.2006.11.003.
- (9) A. Gualandris, S. Portegies Zwart, A. Tirado-Ramos, Performance analysis of direct n-body algorithms for astrophysical simulations on distributed systems, Parallel Comput. 33 (3) (2007) 159–173. doi:http://dx.doi.org/10.1016/j.parco.2007.01.001.
- (10) A. Kawai, J. Makino, T. Ebisuzaki, Performance Analysis of High-Accuracy Tree Code Based on the Pseudoparticle Multipole Method, ApJS151 (2004) 13–33. doi:10.1086/381391.
- (11) G. E. Moore, Cramming more components onto integrated circuits, Electronics 38 (8).
- (12) L. Nyland, M. Harris, J. Prins, The rapid evaluation of potential fields using programmable graphics hardware (2004).
- (13) S. F. Portegies Zwart, R. G. Belleman, P. M. Geldof, High-performance direct gravitational N-body simulations on graphics processing units, New Astronomy 12 (2007) 641–650. doi:10.1016/j.newast.2007.05.004.
- (14) R. G. Belleman, J. Bédorf, S. F. Portegies Zwart, High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA, New Astronomy 13 (2008) 103–112. arXiv:arXiv:0707.0438, doi:10.1016/j.newast.2007.07.004.
- (15) E. Gaburov, S. Harfst, S. Portegies Zwart, Sapporo: A way to turn your graphics cards into a grape-6, New Astronomy 14 (7) (2009) 630 – 637. doi:DOI:10.1016/j.newast.2009.03.002.
- (16) T. Ito, J. Makino, T. Ebisuzaki, D. Sugimoto, A special-purpose N-body machine GRAPE-1, Computer Physics Communications 60 (1990) 187–194. doi:10.1016/0010-4655(90)90003-J.
- (17) T. Fukushige, T. Ito, J. Makino, T. Ebisuzaki, D. Sugimoto, M. Umemura, GRAPE-1A: Special-Purpose Computer for N-body Simulation with a Tree Code, Publ. Astr. Soc. Japan 43 (1991) 841–858.
- (18) S. L. W. McMillan, S. J. Aarseth, An O(N log N) integration scheme for collisional stellar systems, ApJ414 (1993) 200–212. doi:10.1086/173068.
- (19) NVIDIA Corp., CUDA Programming manual, CUDA Programming manual.
- (20) J. E. Barnes, A Modified Tree Code Don’t Laugh: It Runs, Journal of Computational Physics 87 (1990) 161–+.
- (21) L. M. Widrow, J. Dubinski, Equilibrium Disk-Bulge-Halo Models for the Milky Way and Andromeda Galaxies, ApJ631 (2005) 838–855. arXiv:arXiv:astro-ph/0506177, doi:10.1086/432710.
- (22) W. Dehnen, A Hierarchical O(N) Force Calculation Algorithm, Journal of Computational Physics 179 (2002) 27–42. arXiv:arXiv:astro-ph/0202512.
- (23) V. Springel, N. Yoshida, S. D. M. White, GADGET: a code for collisionless and gasdynamical cosmological simulations, New Astronomy 6 (2001) 79–117. doi:10.1016/S1384-1076(01)00042-2.
- (24) J. G. Stadel, Cosmological N-body simulations and their analysis, Ph.D. thesis, AA(UNIVERSITY OF WASHINGTON) (2001).
- (25) P. Teuben, The Stellar Dynamics Toolbox NEMO, in: Astronomical Data Analysis Software and Systems IV, Vol. 77 of Astronomical Society of the Pacific Conference Series, 1995, pp. 398–+.
- (26) T. e. a. Hamada, 42 tflops hierarchical n-body simulations on gpus with applications in both astrophysics and turbulence, in: SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, NY, USA, 2009, pp. 1–12. doi:http://doi.acm.org/10.1145/1654059.1654123.
- (27) T. Hamada, T. Iitaka, The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units, ArXiv Astrophysics e-printsarXiv:arXiv:astro-ph/0703100.