Bonsai: A GPU Tree-Code


Jeroen Bédorf, Evghenii Gaburov, and Simon Portegies Zwart
Leiden Observatory, Leiden University, P.O. Box 9513, 2300 RA Leiden, The Netherlands
Northwestern University, 2131 Tech Drive, Evanston 60208, IL, USA
Abstract

We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU, which eliminates the need for data transfer between the Central Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code outperforms tuned CPU code for all parts of the algorithm, with an overall performance improvement of more than a factor of 20, resulting in a processing rate of more than a million particles per second.

1 Introduction

A popular method for simulating gravitational N-body systems is the hierarchical tree-code algorithm originally introduced by 1986Natur.324..446B. This method reduces the computational complexity of the simulation from O(N^2) to O(N log N) per crossing time. The former, though computationally expensive, can easily be implemented in parallel for many particles. The latter requires more communication and bookkeeping when developing a parallel method. Still, for large numbers of particles, hierarchical methods (tree data-structures are commonly referred to as hierarchical data-structures; in this work we use an octree data-structure) are more efficient than brute-force methods. Currently, parallel octree implementations are found in a wide range of problems, such as self-gravitating systems, smoothed particle hydrodynamics, molecular dynamics, clump finding, ray tracing and voxel rendering. All of these problems require large amounts of computation time. For high-resolution simulations a single Central Processing Unit (CPU) is not sufficient, and one has to use computer clusters or even supercomputers, both of which are expensive and scarce. A GPU provides an attractive alternative to such systems.
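To illustrate where the O(N log N) scaling comes from, the minimal CPU-side sketch below shows how a tree walk replaces a group of distant particles by a single multipole (here only the monopole term), opening a cell only when a simple size-over-distance acceptance test fails. The data structures, the acceptance test and the names are illustrative assumptions for this example and are not the Bonsai implementation, which performs the walk on the GPU.

#include <cmath>
#include <vector>

struct Body { float pos[3]; float mass; };
struct Cell {
    float com[3];                    // centre of mass of the cell
    float mass;                      // total mass of the cell
    float size;                      // geometric size of the cell
    std::vector<Cell*> child;        // empty for leaf cells
    std::vector<const Body*> bodies; // particles stored in a leaf
};

// Accumulate the acceleration on body b due to cell c.
void accumulate(const Body& b, const Cell& c, float theta, float acc[3]) {
    float d[3], r2 = 1e-10f;
    for (int k = 0; k < 3; ++k) { d[k] = c.com[k] - b.pos[k]; r2 += d[k] * d[k]; }
    const float r = std::sqrt(r2);
    if (c.size / r < theta) {                      // cell is far enough: use its monopole
        const float f = c.mass / (r2 * r);
        for (int k = 0; k < 3; ++k) acc[k] += f * d[k];
    } else if (!c.child.empty()) {                 // cell is too close: open it and recurse
        for (const Cell* ch : c.child) accumulate(b, *ch, theta, acc);
    } else {                                       // leaf cell: direct particle-particle sums
        for (const Body* p : c.bodies) {
            float e[3], s2 = 1e-10f;
            for (int k = 0; k < 3; ++k) { e[k] = p->pos[k] - b.pos[k]; s2 += e[k] * e[k]; }
            const float f = p->mass / (s2 * std::sqrt(s2));
            for (int k = 0; k < 3; ++k) acc[k] += f * e[k];
        }
    }
}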

The GPU is a massively parallel processor that is specialised in performing independent parallel computations. For an overview and more details see 2007NewA...12..641P; 2007astro.ph..3100H; 2008NewA...13..103B; Gaburov2009630. In this work we have implemented all the separate parts of the Barnes-Hut tree-code algorithm on the GPU, including the tree construction and the computation of the tree properties (multipole moments). By doing so we remove the need to communicate large amounts of data between the CPU and the GPU. Since no time is lost on CPU-GPU communication, we can make optimal use of the GPU in a block time-step algorithm. In previous work OctGravICCS10; Hamada:2010:TAN:1884643.1884644 parts of the algorithm were executed on the CPU, which only allows for efficient (parallel) execution if shared time-steps are used.

Full implementation details, performance and accuracy characteristics can be found in 2011arXiv1106.1900B.

2 Results

The algorithms are implemented as part of the gravitational N-body code Bonsai (the code is publicly available at http://castle.strw.leidenuniv.nl/software.html). We compare the performance of the GPU algorithms with (optimized) CPU implementations of comparable algorithms. The performance of the tree-traverse depends critically on the multipole acceptance criterion (MAC) parameter θ, which sets the trade-off between speed and accuracy. In our implementation we use a combination of the method introduced by Barnes (1994) and the method used for tree traversal on vector machines, see Barnes1994; 1990JCoPh..87..161B. To compare the CPU and GPU implementations we measure the wall-clock time for the most time-critical parts of the algorithms. For the tree construction we distinguish three parts: sorting of the particles along a Space Filling Curve (Morton order) (sorting in the figure), reordering of the particle properties based on the sort (moving), and construction of the tree data-structure (tree-construction). Furthermore, timings are presented for the computation of the multipole moments and the tree-traverse. As for hardware, we used a Xeon E5620 CPU with 4 physical cores and an NVIDIA GTX480 GPU. The results are presented in Fig. 1.
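As a concrete illustration of the sorting stage, the snippet below sketches one common way to compute a 30-bit Morton (Z-order) key by interleaving the bits of a particle's three integer grid coordinates. The 10-bit coordinate range, the bit layout and the function names are assumptions for the example and do not necessarily match the key used in Bonsai.

#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits separate each original bit.
static inline std::uint32_t expandBits(std::uint32_t v) {
    v = (v | (v << 16)) & 0x030000FFu;
    v = (v | (v <<  8)) & 0x0300F00Fu;
    v = (v | (v <<  4)) & 0x030C30C3u;
    v = (v | (v <<  2)) & 0x09249249u;
    return v;
}

// 30-bit Morton key for integer grid coordinates x, y, z in [0, 1023].
// Sorting particles by this key orders them along a Z-order space filling curve,
// so spatially nearby particles end up close together in memory.
static inline std::uint32_t mortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}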

Figure 1: Wall-clock time spent by the CPU and the GPU on various primitive algorithms. The bars show the time spent on the five selected sections of code. The results indicate that our GPU code outperforms the CPU code on all fronts and is between 2 and 30 times faster. Note that the y-axis is in log scale. (Timings using a million-body Plummer sphere.)

To measure the scaling of the implemented algorithms we execute simulations using Plummer spheres (1915MNRAS..76..107P) with roughly 32k up to 4M particles. We measure the performance of the same algorithms as in the previous paragraph. The results can be found in Fig. 2. The wall-clock time spent in the sorting, moving, tree-construction and multipole-computation algorithms scales linearly with N for sufficiently large N. For smaller N, however, the scaling is sub-linear, because these algorithms require a sufficiently large number of particles to saturate the GPU. In theory the tree-traverse scales as O(N log N), whereas empirically the wall-clock time scales almost linearly with N. This is explained in the inset of Fig. 2, which shows the average number of interactions per particle during the simulation. The average number of particle-cell interactions doubles between 32k and 1M particles and keeps gradually increasing beyond 1M. This break is not clearly visible in the timing results, since for small particle numbers (below roughly 1M) not all GPU resources are saturated. Finally, more than 90% of the wall-clock time is spent on the tree-traverse. This allows for block time-step execution, where the tree-traverse time is reduced roughly in proportion to the fraction of particles that have to be updated.
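Written out, and using N_act purely as a placeholder symbol for the number of particles that are active in a given block step, the expected reduction of the tree-traverse time is

\[ t_{\mathrm{traverse}}^{\mathrm{block}} \;\propto\; N_{\mathrm{act}} \log N \quad\Longrightarrow\quad \frac{t_{\mathrm{traverse}}^{\mathrm{shared}}}{t_{\mathrm{traverse}}^{\mathrm{block}}} \;\sim\; \frac{N}{N_{\mathrm{act}}}, \]

so a block step in which only a small fraction of the particles is active costs correspondingly less than a shared time-step over all N particles.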

Figure 2: The wall-clock time spent by various parts of the program versus the number of particles N. We used Plummer models as initial conditions and varied the number of particles over two orders of magnitude. The scaling of the tree-walk lies between O(N) (shown with the black solid line) and the theoretical O(N log N), and is due to the average number of interactions per particle staying roughly constant (see inset). The asymptotic complexity of the tree-construction approaches O(N), as expected, since all the constituent primitives share the same complexity. The timings are from the GTX480 GPU with θ = 0.75.

3 Discussion and Conclusions

We have presented an efficient gravitational N-body tree-code. In contrast to other existing GPU tree-codes, this implementation is executed completely on the GPU. On a GTX480 the code processes more than a million particles per second. This allows us to routinely carry out large simulations on the GPU. Since the current version can only use one GPU, the main limitation is the amount of available GPU memory: a simulation with 5 million particles already requires on the order of a gigabyte of GPU memory.

Even though the sorting, moving and tree-construction parts of the code take up roughly 10% of the execution time in the presented timings, these methods do not have to be executed during each time-step when using the block time-step method. It is sufficient to recompute only the multipole moments of tree-cells that have updated child particles; only when the tree-traverse shows a considerable decline in performance does the complete tree-structure have to be rebuilt. This decline is the result of inefficient memory reads and an increase of the average number of particle-cell and particle-particle interactions. This number increases because the tree-cell size grows, which causes more cells to be opened by the MAC.
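A possible host-side structure for this update policy is sketched below; the type and function names, as well as the 1.5x slowdown threshold that triggers a rebuild, are illustrative placeholders and not the actual Bonsai interface.

struct System { /* particle data, tree-structure and multipole moments */ };

// Placeholder routines standing in for the real GPU kernels.
void   updateMultipoles(System&) { /* refresh moments of cells with updated children */ }
void   rebuildTree(System&)      { /* sort along the Morton curve, move, reconstruct  */ }
double traverseTree(System&, float /*theta*/) { /* return traversal wall-clock time */ return 0.0; }

// One block time-step: the multipole moments are always refreshed, but the full
// tree is only rebuilt once the traversal has slowed down noticeably compared to
// the first traversal after the previous rebuild.
void blockStep(System& sys, float theta, double& tRef) {
    updateMultipoles(sys);
    const double t = traverseTree(sys, theta);
    if (tRef < 0.0) tRef = t;          // remember the post-rebuild baseline
    if (t > 1.5 * tRef) {              // considerable decline in performance
        rebuildTree(sys);
        tRef = -1.0;                   // reset the baseline at the next step
    }
}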

Although the implemented algorithms are designed for a shared-memory architecture, they can be used to construct and traverse tree-structures on parallel GPU clusters using the methods described in 169640; 1996NewA....1..133D. Furthermore, in the case of a parallel GPU tree-code, the CPU can exchange particles with the other nodes while the GPU is traversing the tree-structure of the local data, making it possible to hide most of the communication time.

The implemented algorithms are not limited to the evaluation of gravitational forces, but can be applied to a variety of problems, such as neighbour search, clump-finding algorithms, the fast multipole method and ray tracing. In particular, it is straightforward to implement Smoothed Particle Hydrodynamics in such a code, which would yield a self-gravitating, particle-based hydrodynamics code running entirely on the GPU.

Acknowledgments

This work is supported by NOVA and NWO grants (#639.073.803, #643.000.802, #614.061.608, VICI #643.200.503, and VIDI #639.042.607). The authors would like to thank Massimiliano Fatica and Mark Harris of NVIDIA for their help with getting the code to run on the Fermi architecture.
