Gravitational tree-code on graphics processing units: implementation in CUDA


Evghenii Gaburov egaburov@strw.leidenuniv.nl Jeroen Bédorf bedorf@strw.leidenuniv.nl Simon Portegies Zwart spz@strw.leidenuniv.nl Leiden Observatory, Leiden University, Leiden, The Netherlands
Abstract

We present a new, very fast tree-code which runs on massively parallel Graphics Processing Units (GPUs) with the NVIDIA CUDA architecture. The tree construction and the calculation of the multipole moments are carried out on the host CPU, while the force calculation, which consists of the tree walks and the evaluation of the interaction lists, is carried out on the GPU. In this way we achieve a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s. It takes about a second to compute the forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use (http://castle.strw.leidenuniv.nl/software/octgrav.html).

keywords:
N-body, gravity, tree-code, GPU, parallel, CUDA

1 Introduction

Direct force evaluation methods have always been popular because of their simplicity and unprecedented accuracy. Since the mid-1980s, however, approximation methods like the hierarchical tree-code (1986Natur.324..446B, ) have gained enormous popularity among researchers, in particular for studying astronomical self-gravitating N-body systems (2003gnbs.book…..A, ) and for studying soft-matter molecular-dynamics problems (Frenkel2001, ). For these applications, direct force evaluation algorithms strongly limit the applicability, mainly due to the O(N²) time complexity of the problem.

Tree-codes, however, have always had a dramatic setback compared to direct methods, in the sense that the latter benefit from developments in special-purpose hardware, like the GRAPE and MD-GRAPE family of computers 1998sssp.book…..M (); 2001ASPC..228…87M (), which increase workstation performance by two to three orders of magnitude. On the other hand, tree-codes show a better scaling of the compute time with the number of processors on large parallel supercomputers 169640 (); WarBecGod97 () compared to direct N-body methods 2007NewA…12..357H (); 1233038 (). As a result, large-scale tree-code simulations are generally performed on Beowulf-type clusters or supercomputers, whereas direct N-body simulations are performed on workstations with attached GRAPE hardware.

Tree-codes, due to their hierarchical and recursive nature, are hard to run efficiently on dedicated Single Instruction Multiple Data (SIMD) hardware like GRAPE, though some benefit has been demonstrated by using pseudo-particle methods to solve for the higher-order moments in the calculation of multipole moments of the particle distributions in grid cells 2004ApJS..151…13K ().

Recently, the popularity of computer games has led to the development of massively parallel vector processors for rendering three-dimensional graphical images. Graphics Processing Units (GPUs) have evolved from fixed-function hardware to general-purpose parallel processors. The theoretical peak speed of these processors increases at a rate faster than Moore's law (Moore, ), and at the moment tops roughly 200 GFLOP/s for a single card. The cost of these cards is dramatically reduced by the enormous volumes in which they are produced, mainly for gamers, whereas GRAPE hardware remains relatively expensive.

The gravitational N-body problem proved to be well suited for porting to modern GPUs. The first successes in porting the N-body problem to programmable GPUs were achieved by Nyland04 (), but it was only after the introduction of the NVIDIA G80 architecture that accurate force evaluation algorithms could be implemented 2007NewA…12..641P () and that the performance became comparable to that of special-purpose computers 2008NewA…13..103B (); Gaburov2009630 ().

Even in these implementations the tree-code, though pioneered on the GPU in 2008NewA…13..103B (), hardly resulted in a speed-up compared to general-purpose processors. In this paper we present a novel implementation of a tree-code on NVIDIA GPU hardware using the CUDA programming environment.

2 Implementation

Figure 1: Illustration of our tree-structure, shown in 2D for clarity. Initially, the space is recursively subdivided into cubic cells until all cells contain fewer than N_leaf particles (blue squares). All cells (including parent cells) are stored in a tree-structure. Afterwards, we compute a tight bounding box for the particles in each cell (dotted rectangles) and the cell's boundary. The latter is a cube with a side length equal to the largest side length of the bounding box and with the same centre (green squares).

In the classical implementation of the tree-code algorithm all the work is done on the CPU, since special-purpose hardware was not available at the time 1986Natur.324..446B (). With the introduction of GRAPE special-purpose hardware 1990CoPhC..60..187I (); 1991PASJ…43..841F (), it became computationally favourable to let the special-purpose hardware, instead of the CPU, calculate the accelerations. Construction of the interaction list in these implementations takes nearly as much time as calculating the accelerations. Since the latest generation of GPUs allows more complex operations, it has become possible to build the interaction list directly on the GPU. In this case, only the tree-structure needs to be transported to the GPU. Since the bandwidth on the host computer is about an order of magnitude lower than on the GPU, it is also desirable to offload bandwidth-intensive operations to the GPU, and the construction of the interaction list is such an operation. The novel element in our tree-code is therefore the construction of the interaction list on the GPU. The remaining parts of the tree-code algorithm (tree construction, calculation of node properties and time integration) are executed on the host. The host is also responsible for the allocation of memory and for the data transfer to and from the GPU. In the next sections we cover the details of the host and device steps.

2.1 Building the octree

We construct the octree in the same way as in the original BH tree-code. We define the computational domain as a cube containing all particles in the system. This cube is recursively divided into eight equal-size cubes, called cells; the side length of the resulting cells is half that of the parent cell. Each of these cells is further subdivided until it contains fewer than N_leaf particles. We call such cells leaves, whereas cells containing more than N_leaf particles are referred to as nodes. The cell containing all particles is the root node.
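A minimal host-side sketch of this recursive subdivision is given below; the Cell layout and all names are illustrative and differ from the actual octgrav data structures (cf. Fig. 2). The root cell is the cube bounding all particles, holding every particle index in its bodies list, and a single call to split on it produces the hierarchy of Fig. 1.

  #include <vector>

  struct Vec3 { float x, y, z; };

  struct Cell {
    Vec3 centre;               // geometric centre of this cubic cell
    float half;                // half of the side length
    std::vector<int> bodies;   // particle indices (kept only in leaves)
    int child[8];              // indices of the children (nodes only), -1 for empty octants
    bool is_leaf;
  };

  // Octant (0..7) of point p with respect to centre c.
  inline int octant(const Vec3 &p, const Vec3 &c) {
    return (p.x > c.x) | ((p.y > c.y) << 1) | ((p.z > c.z) << 2);
  }

  // Recursively split cell 'c' until every leaf holds fewer than n_leaf bodies.
  void split(std::vector<Cell> &cells, int c,
             const std::vector<Vec3> &pos, int n_leaf) {
    if ((int)cells[c].bodies.size() < n_leaf) { cells[c].is_leaf = true; return; }
    cells[c].is_leaf = false;
    for (int oct = 0; oct < 8; oct++) {
      Cell sub{};
      sub.half   = 0.5f * cells[c].half;                  // children have half the side length
      sub.centre = { cells[c].centre.x + ((oct & 1) ? sub.half : -sub.half),
                     cells[c].centre.y + ((oct & 2) ? sub.half : -sub.half),
                     cells[c].centre.z + ((oct & 4) ? sub.half : -sub.half) };
      for (int i : cells[c].bodies)                       // hand each body to its octant
        if (octant(pos[i], cells[c].centre) == oct) sub.bodies.push_back(i);
      if (sub.bodies.empty()) { cells[c].child[oct] = -1; continue; }
      cells[c].child[oct] = (int)cells.size();
      cells.push_back(sub);                               // access cells only by index after this
      split(cells, cells[c].child[oct], pos, n_leaf);
    }
    cells[c].bodies.clear();                              // bodies now live in the children
  }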

The resulting tree-structure is schematically shown in Fig. 1. From this tree-structure we construct groups for the tree walk (cf. section 2.2), which are cells containing fewer than N_crit particles, and we compute the properties of each cell, such as its boundary, mass, centre of mass and quadrupole moments, which are required to calculate the accelerations 1993ApJ…414..200M ().
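Each cell's mass, centre of mass and quadrupole moments can be accumulated over the particles it contains. The sketch below (reusing Vec3 from the previous listing) adopts the common traceless convention Q_ij = Σ m (3 x_i x_j − r² δ_ij) with coordinates measured from the centre of mass; the normalisation actually used by the code follows 1993ApJ…414..200M () and may differ in detail.

  #include <vector>

  struct CellMoments {
    float mass;
    float com[3];     // centre of mass
    float quad[6];    // Q_xx, Q_xy, Q_xz, Q_yy, Q_yz, Q_zz (traceless convention)
  };

  CellMoments compute_moments(const std::vector<Vec3> &pos,
                              const std::vector<float> &m,
                              const std::vector<int> &members) {
    CellMoments M = {};                                 // zero-initialise all fields
    for (int i : members) {                             // first pass: mass and centre of mass
      M.mass   += m[i];
      M.com[0] += m[i] * pos[i].x;
      M.com[1] += m[i] * pos[i].y;
      M.com[2] += m[i] * pos[i].z;
    }
    for (int k = 0; k < 3; k++) M.com[k] /= M.mass;
    for (int i : members) {                             // second pass: quadrupole tensor
      const float dx = pos[i].x - M.com[0];
      const float dy = pos[i].y - M.com[1];
      const float dz = pos[i].z - M.com[2];
      const float r2 = dx*dx + dy*dy + dz*dz;
      M.quad[0] += m[i] * (3.0f*dx*dx - r2);
      M.quad[1] += m[i] * (3.0f*dx*dy);
      M.quad[2] += m[i] * (3.0f*dx*dz);
      M.quad[3] += m[i] * (3.0f*dy*dy - r2);
      M.quad[4] += m[i] * (3.0f*dy*dz);
      M.quad[5] += m[i] * (3.0f*dz*dz - r2);
    }
    return M;
  }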

In order to walk the octree efficiently on the device, its structure requires some reorganisation. In particular, we would like to minimise the number of memory accesses, since they are relatively expensive (up to 600 clock cycles). In Fig. 2, we show the tree-structure as stored on the GPU. The upper array in the figure is the linked list of the tree, which we call the main tree-array. Each element in this array (separated by blue vertical lines) stores four integers in a single 128-bit word (dashed vertical lines). This structure is particularly favourable because the device is able to read a 128-bit word into four 32-bit registers using a single memory-access instruction. Two array elements represent one cell in the tree (green line), with indices to each of the eight children in the main tree-array (indicated by the arrows). A grey-filled element in this list means that the child is a leaf (it has no children of its own), and hence it need not be referenced. We also use auxiliary tree-arrays in device memory which store the properties of each cell, such as its boundary, mass, centre of mass and multipole moments. The index of each cell in the main tree-array is directly related to its index in the auxiliary tree-arrays by bit-shift and addition operations.
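On the device, reading one cell of this main tree-array then amounts to two 128-bit texture fetches (cf. lines 3 and 4 of List 1). Below is a sketch, assuming the CUDA texture-object API and the convention that node indexes the first of the cell's two int4 words; how leaves are flagged (e.g. by a negative child index) is likewise an assumption for illustration.

  #include <cuda_runtime.h>

  // Read the eight child indices of one cell from the main tree-array (Fig. 2).
  __device__ void read_children(cudaTextureObject_t main_tree_tex,
                                int node, int child[8])
  {
    const int4 lo = tex1Dfetch<int4>(main_tree_tex, node + 0);  // 1st four children
    const int4 hi = tex1Dfetch<int4>(main_tree_tex, node + 1);  // 2nd four children
    child[0] = lo.x; child[1] = lo.y; child[2] = lo.z; child[3] = lo.w;
    child[4] = hi.x; child[5] = hi.y; child[6] = hi.z; child[7] = hi.w;
  }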

Figure 2: Illustration of the tree structure as stored in device memory.

The device execution model is designed in such a way that threads which execute the same operation are grouped in warps, where each warp consists of 32 threads. Ideally, all threads in a warp follow the same code path; if this condition is not fulfilled, the divergent code paths are serialised, which negatively impacts the performance (NVIDIA.CUDA, ). To minimise this, we rearrange the groups in memory such that groups which are neighbours in space are also neighbours in memory. Combined with the similar tree paths that neighbouring groups have, this minimises data and code-path divergence between neighbouring threads, and therefore further improves the performance.
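The text does not specify how the groups are reordered; a Morton (Z-order) key computed from each group's centre, as sketched below, is one common way to make spatial neighbours adjacent in memory, and is shown here purely as an illustration of the idea.

  #include <cstdint>
  #include <algorithm>

  static inline uint32_t expand_bits(uint32_t v) {     // spread 10 bits over 30 bits
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
  }

  // x, y, z are the group centre, normalised to [0,1] with respect to the root cell.
  uint32_t morton_key(float x, float y, float z) {
    const uint32_t xi = (uint32_t)std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    const uint32_t yi = (uint32_t)std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    const uint32_t zi = (uint32_t)std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi);
  }

Sorting the groups by such a key before the tree walk places spatially neighbouring groups in neighbouring threads, which is the property the paragraph above relies on to reduce warp divergence.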

2.2 Construction of an interaction list

In the standard BH tree algorithm the interaction lists are constructed for each particle, but particles in the same group have similar interaction lists. We make use of this fact by building the lists for groups instead of for individual particles (1990JCoPh..87..161B, ). The particles in a group therefore share the same interaction list, which is typically longer than it would be if determined on a per-particle basis. The advantage is that the number of tree walks is reduced by a factor equal to the average number of particles per group. The tree walk is performed on the GPU in such a way that a single GPU thread is used per group. To take advantage of the cached texture memory, we make sure that neighbouring threads correspond to neighbouring groups.

Owing to the massively parallel architecture of the GPU, two tree walks are required to construct the interaction lists. In the first walk, each thread computes the size of the interaction list for its group. This data is copied to the host, where we compute the total size of the interaction lists and the memory addresses to which the threads should write their lists without intruding on other threads' data. In the second tree walk, the threads write the interaction lists to the device memory.
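The memory addresses mentioned above follow from an exclusive prefix sum over the per-group list sizes returned by the first walk. A minimal host-side sketch, with illustrative names, is given here; the offsets and the total length are then used to allocate and address the device buffer written in the second walk.

  #include <vector>

  // Turn per-group interaction-list sizes into disjoint write offsets.
  std::vector<int> list_offsets(const std::vector<int> &list_sizes, int &total) {
    std::vector<int> offset(list_sizes.size());
    total = 0;
    for (size_t g = 0; g < list_sizes.size(); g++) {
      offset[g] = total;          // group g writes its list starting here
      total    += list_sizes[g];  // running sum = total buffer length so far
    }
    return offset;                // 'total' is the length of the buffer to allocate
  }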

1 while (stack.non_empty)
2   node = stack.pop                      ;; get next node from the stack
3   one  = fetch(children, node + 0)      ;; cached fetch 1st four children
4   two  = fetch(children, node + 1)      ;; cached fetch 2nd four children
5   test_cell<0...3>(node, one, stack)    ;; test sub-cells in octants one to four
6   test_cell<4...7>(node, two, stack)    ;; test sub-cells in octants five to eight
List 1: A pseudo code for our non-recursive stack-based tree walk.
1 template<oct> test_cell(node, child, stack)
2   cdata = fetch(cell_pos, 8*node + oct)   ;; cached fetch of the child's cell data
3   if (open_node(group_data, cdata))       ;; if the child has to be opened w.r.t. this group,
4     if (child != leaf) stack.push(child)  ;;  store it in the stack if it is a node
5     else               leaf += 1          ;;  otherwise increment the leaf counter
6   else                 cell += 1          ;; else, increment the cell counter
List 2: Pseudo code for test_cell subroutine.

We implement the tree walk via a non-recursive, stack-based algorithm (the pseudo code is shown in List 1), because the current GPU architecture does not support recursive function calls. In short, for every node of the tree, starting with the root node, the indices of its children are read by fetching two consecutive 128-bit words (eight 32-bit integers) from texture memory. Each of these eight children is tested against the node-opening criterion (the pseudo code for this step is shown in List 2). If the test is positive and the child is a node, it is stored on the stack (line 4 in the listing); if the test is positive but the child is a leaf, which cannot be opened further, we increment a counter for the leaf-interaction list (line 5); otherwise the child contributes to the node-interaction list and we increment the corresponding counter (line 6). This division of the interaction lists is motivated by the different methods used to compute the accelerations from nodes and from leaves (cf. section 2.3). In the second tree walk, we store the index of the cell in the appropriate interaction list instead of merely counting the nodes and leaves.
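The opening test itself is not spelled out above. The sketch below shows a Barnes-Hut style geometric criterion of the kind open_node could implement for a group-based walk, using the cube boundaries of Sect. 2.1; the exact criterion used by the code may differ in detail, so this is an illustration rather than the actual test.

  #include <math.h>

  struct Box { float cx, cy, cz, half; };   // cube: centre and half side length

  // Open the cell if it looks "big" from anywhere inside the group.
  __host__ __device__ inline bool open_node(const Box &group, const Box &cell,
                                            float theta) {
    const float dx = cell.cx - group.cx;
    const float dy = cell.cy - group.cy;
    const float dz = cell.cz - group.cz;
    const float d  = sqrtf(dx*dx + dy*dy + dz*dz);       // centre-to-centre distance
    const float d_min = d - 1.7320508f * group.half;     // subtract the group's half diagonal
    const float l     = 2.0f * cell.half;                // cell side length
    return l > theta * d_min;                            // also opens when d_min <= 0
  }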

2.3 Calculating accelerations from the interaction list

In the previous step we obtained two interaction lists: one for nodes and one for leaves. The former is used to compute the accelerations due to nodes, the latter those due to leaves. The pseudo-code for a particle-node interaction is shown in List 3, and the memory access pattern is demonstrated in the left panel of Fig. 3. This algorithm is similar to the one used in the kirin and sapporo libraries for direct N-body simulations 2008NewA…13..103B (); Gaburov2009630 (). In short, we use a block of threads per group, such that each thread in a block is assigned to a particle in the group; these particles share the same interaction list. Each thread loads a fraction of the nodes from the node-interaction list into shared memory (blue threads in the figure, lines 2 and 3 in the listing). To ensure that all the data has been loaded into shared memory, we place a barrier for all threads (line 4), after which each thread computes the gravitational acceleration due to the nodes in shared memory (line 5). Prior to loading a new set of nodes into shared memory (green threads in the figure), we ensure that all threads have completed their calculations (line 6). We repeat this cycle until the interaction list is exhausted.

1 for (i = 0; i < list_len; i += block_size)
2   cellIdx = cell_interact_lst[i + thread_id]
3   shared_cells[thread_id] = cells_lst[cellIdx] ;; read nodes to the shared memory
4   __syncthreads()                              ;; thread barrier
5   interact(body_in_a_thread, shared_cells)     ;; evaluate accelerations
6   __syncthreads()                              ;; thread barrier
List 3: Body-node interaction
1 for (i = 0; i < list_len; i += block_size)
2   leaf = leaf_interact_lst[i + thread_id]
3   shared_leaves[thread_id] = cells_lst[leaf]   ;; read leaves to the shared memory
4   __syncthreads()
5   for (j = 0; j < block_size; j++)             ;; process each leaf
6     shared_bodies[thread_id] = bodies[shared_leaves[j].first + thread_id]
7     __syncthreads()
8     interact(body_in_a_thread, shared_bodies, shared_leaves[j].len)
9     __syncthreads()
List 4: Body-leaf interaction
Figure 3: Memory access pattern in a body-node (left) and body-leaf (right) interaction.

The calculation of the gravitational acceleration due to the leaves differs in several ways. The pseudo-code of this algorithm is presented in List 4, and the memory access pattern is displayed in the right panel of Fig. 3. First, each thread fetches leaf properties, such as the index of the first body and the number of bodies in the leaf, from texture memory into shared memory (red lines in the figure, lines 2 and 3 in the listing). This data is used to identify the bodies from which the accelerations have to be computed (black lines). Finally, the threads read these bodies into shared memory (blue and green lines, line 6) in order to calculate the accelerations (line 8). This process is repeated until the leaf-interaction list is exhausted.
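To make the shared-memory tiling of List 3 concrete, the sketch below shows a simplified CUDA kernel for the body-node part. For brevity it evaluates only the monopole (point-mass plus softening) contribution, whereas the actual code also applies the quadrupole correction; it further assumes that each block handles blockDim.x consecutive bodies which all share the single interaction list passed in, and that the number of bodies is a multiple of blockDim.x. All kernel and array names are illustrative, not the octgrav API.

  __global__ void body_node_kernel(const float4 *body_pos,   // xyz + softening^2 in .w
                                   float4       *body_acc,   // accumulated acceleration + potential
                                   const int    *node_list,  // node-interaction list of this group
                                   const float4 *node_com,   // node centre of mass, mass in .w
                                   int list_len)
  {
    extern __shared__ float4 sh_node[];                 // one tile of nodes

    const int   body = blockIdx.x * blockDim.x + threadIdx.x;
    const float4 pos = body_pos[body];
    float4 acc = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

    for (int i = 0; i < list_len; i += blockDim.x) {
      if (i + threadIdx.x < list_len)                    // each thread loads one node of the tile
        sh_node[threadIdx.x] = node_com[node_list[i + threadIdx.x]];
      __syncthreads();                                   // tile fully loaded

      const int tile = min((int)blockDim.x, list_len - i);
      for (int j = 0; j < tile; j++) {                   // interact with every node in the tile
        const float4 n  = sh_node[j];
        const float dx = n.x - pos.x, dy = n.y - pos.y, dz = n.z - pos.z;
        const float r2 = dx*dx + dy*dy + dz*dz + pos.w;  // softened distance squared
        const float rinv  = rsqrtf(r2);
        const float mr3   = n.w * rinv * rinv * rinv;    // m / r^3
        acc.x += mr3 * dx;  acc.y += mr3 * dy;  acc.z += mr3 * dz;
        acc.w -= n.w * rinv;                             // potential
      }
      __syncthreads();                                   // done with this tile
    }
    body_acc[body] = acc;
  }

Such a kernel would be launched with blockDim.x * sizeof(float4) bytes of dynamic shared memory per block; in the real code each block processes one group with its own interaction list rather than a single shared one.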

3 Results

In this section we study the accuracy and performance of the tree-code. First we quantify the errors in the accelerations produced by the code, and then we test its performance. For this purpose we use a model of the Milky Way galaxy (2005ApJ…631..838W, ). We model the galaxy with a range of particle numbers, such that the mass ratio of bulge, disk and halo particles is 1:1:4. We then proceed with the measurements of the code performance. In all tests we use the values of N_leaf and N_crit which we find produce the best performance on both the G80/G92 and GT200 architectures. The GPU used in all tests is a GeForce 8800 Ultra.

Figure 4: Each panel displays the fraction of particles having a relative acceleration error (vertical axis) greater than a given value (horizontal axis). In each panel, we show the errors for a range of opening angles, from the smallest (the leftmost curve in each panel) to the largest (the rightmost curve). The number of particles increases from the left panel to the right panel. The dotted horizontal lines show 50%, 10% and 1% of the error distribution.

3.1 Accuracy of approximation

We quantify the error in the acceleration in the following way: Δa/a = |a_tree − a_direct| / |a_direct|, where a_tree and a_direct are the accelerations calculated by the tree-code and by direct summation, respectively. The latter was carried out on the same GPU as the tree-code. This allowed us to assess errors on systems as large as 10 million particles (we used the NVIDIA 8800 Ultra GPU for this paper; it takes 10 GPU hours to compute the exact forces on a system of 10 million particles with double-precision emulation (Gaburov2009630, )). In Fig. 4 we show the error distributions for different numbers of particles and for different opening angles. In each panel, we show which fraction of particles (vertical axis) has a relative error in the acceleration larger than a given value (horizontal axis). The horizontal lines show the 50th, 10th and 1st percentiles of the cumulative distribution. This data shows that the acceleration errors in this tree-code are consistent with the errors produced by existing tree-codes with quadrupole corrections (2002JCoPh.179…27D, ; 2001NewA….6…79S, ; 2001PhDT……..21S, ).
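Written out for a single particle, the error measure above is simply the norm of the difference between the two accelerations divided by the norm of the direct-summation acceleration; a trivial host-side sketch:

  #include <cmath>

  // Relative acceleration error |a_tree - a_direct| / |a_direct| for one particle.
  float relative_acc_error(const float a_tree[3], const float a_direct[3]) {
    float diff2 = 0.0f, ref2 = 0.0f;
    for (int k = 0; k < 3; k++) {
      const float d = a_tree[k] - a_direct[k];
      diff2 += d * d;
      ref2  += a_direct[k] * a_direct[k];
    }
    return std::sqrt(diff2 / ref2);   // assumes a_direct is non-zero
  }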

We show the dependence of the errors on both the opening angle and the number of particles in Fig. 5. In the leftmost panel of the figure, we plot the median and the first percentile of the relative acceleration error distribution as a function of the opening angle for the smallest (the lower blue dotted and red dashed lines) and the largest (the upper blue dotted and red dashed lines) particle numbers. As expected, the error increases as a function of the opening angle, following a power law obtained from a least-squares fit. The errors, however, also increase with the number of particles: the error doubles when the number of particles is increased by two orders of magnitude. This increase of the error is caused by the large number of particles per leaf, which in our case is 64 in order to obtain the best performance. We conducted a test with a smaller N_leaf and indeed observed the expected decrease of the error when the number of particles increases; this error, however, is about twice as large as that obtained with N_leaf = 64.

Figure 5: The median and the first percentile of the relative acceleration error distribution as a function of the opening angle and of the number of particles. In the leftmost panel we show lines for the smallest (the bottom dotted and dashed lines) and the largest (the top dotted and dashed lines) particle numbers. The middle and the right panels display the error for the smallest (the bottom lines) and the largest (the upper lines) opening angles.

3.2 Timing

In Fig. 6 we present the timing data as a function of the opening angle θ for various particle numbers. The time spent on the host (dotted line in the figure) is independent of θ, which demonstrates that the construction of the octree depends only on the number of particles in the system, scaling roughly as O(N log N). This time becomes comparable to the time spent on the GPU calculating accelerations for the largest opening angles and particle numbers. This is caused by the empirically measured near-linear scaling with N of the time spent on the GPU: as the number of particles increases, the GPU part of the code performs more efficiently, and therefore its scaling drops from O(N log N) to near-linear (Fig. 7). We therefore conclude that the optimal opening angle for our code is θ ≈ 0.5.

Figure 6: Wall-clock timing results as a function of the opening angle and the number of particles. In each panel, the solid line shows the time spent on the GPU. The dotted line in the top panel shows the time spent on the host, and the total wall-clock time is shown with the dashed line.

In the leftmost panel of Fig. 7 we show the dependence of the time spent on the host and on the device on the number of particles for various opening angles. In particular, the scaling of the device time falls between linear and O(N log N), which we explain by the increased efficiency of the GPU part of our code at larger particle numbers. This plot also shows that the host calculation time is a small fraction of the GPU calculation time, except for the largest opening angles and particle numbers. The middle panel of the figure shows the ratio of the time spent on the device to the total time. Finally, the rightmost panel shows the ratio between the time required to compute the forces by direct summation and the time required by the tree-code. As expected, this speed-up is consistent with the ratio of the O(N²) cost of direct summation to the near O(N log N) cost of the tree-code.

3.3 Device utilisation

We quantify how efficiently our code utilises the GPU resources by measuring both the instruction and the data throughput, and comparing the results to the theoretical peak values provided by the device manufacturer. In Fig. 8 we show both the bandwidth and the computational performance as a function of the opening angle for three different particle numbers. We see that the calculation of the accelerations operates at about 100 GFLOP/s (we count 38 and 70 FLOPs for each body-leaf and body-node interaction, respectively). This is comparable to the peak performance of the GRAPE-6a special-purpose hardware, but it utilises only about a quarter of the computational power of the GPU: our tests were carried out on an NVIDIA 8800 Ultra, which has 128 streaming processors, each operating at a clock speed of 1.5 GHz; given that the GPU is able to perform up to two floating point operations per clock cycle, the theoretical peak performance is 384 GFLOP/s. This occurs because the average number of bodies in a group is a factor of 3 or 4 smaller than N_crit, which we set to 64 in our tests. On average, therefore, only about 30% of the threads in a block are active.
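The conversion from interaction counts to the quoted GFLOP/s figure is simple bookkeeping based on the per-interaction FLOP counts given above; a sketch with illustrative names:

  // GFLOP/s from the number of body-leaf and body-node interactions evaluated
  // in 'seconds' of kernel time, charging 38 and 70 FLOPs per interaction.
  double gflops(long long n_body_leaf, long long n_body_node, double seconds) {
    const double flops = 38.0 * (double)n_body_leaf + 70.0 * (double)n_body_node;
    return flops / seconds * 1.0e-9;
  }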

Figure 7: Timing results as a function of particle number. The leftmost panel displays the time spent on the GPU (black dash-dotted lines) and on the host CPU (blue solid line) as a function of the number of particles. The expected scaling is shown with the red solid line. The ratio of the time spent on the GPU to the total wall-clock time is given in the middle panel. The speed-up compared to direct summation is shown in the rightmost panel. The expected scaling is shown with a solid red line.

The novelty of this code is the GPU-based tree walk. Since these operations have little arithmetic intensity, the code is limited by the bandwidth of the device. We show in Fig. 8 that our code achieves a respectable bandwidth of more than 50 GB/s during the first tree walk, in which only (cached) scatter memory reads are executed. The second tree walk, which constructs the interaction lists, is notably slower because it writes data to memory, an operation which is significantly slower than reads from texture memory. We observe that the bandwidth decreases with increasing opening angle in both tree walks, which is due to increasingly divergent tree paths between neighbouring groups and an increase of the write-to-read ratio in the memory operations.

Figure 8: Device utilisation as a function of the opening angle and the number of particles. The bottom half of each panel shows the bandwidth for the first tree walk (solid line) and the second tree walk (dotted line). The top half shows the performance of the calculation of the accelerations for the node-interaction list (solid line) and the leaf-interaction list (dotted line) in GFLOP/s.

4 Discussion and Conclusions

We present a fast gravitational tree-code which is executed on Graphics Processing Units (GPUs) supporting the NVIDIA CUDA architecture. The novelty of this code is the GPU-based tree walk which, combined with the GPU-based calculation of the accelerations, shows good scaling for various particle numbers and opening angles. The resulting errors are comparable to those of existing CPU-based tree-codes with quadrupole corrections. The code makes optimal use of the available device resources and shows excellent scaling to new architectures: tests indicate that the NVIDIA GT200 architecture, which has roughly twice the resources of the G80 architecture used here, performs the integration twice as fast. Specifically, we measure the sustained science rate, in particles per second, on a realistic galaxy merger simulation.

Our tests revealed our GPU implementation to be two orders of magnitude faster than the widely used CPU version of the Barnes-Hut tree-code from the NEMO stellar dynamics package (1995ASPC…77..398T, ). However, our code is only an order of magnitude faster than an SSE-vectorised tree-code specially tuned for x86-64 processors and the Phantom-grape library (private communication with Keigo Nitadori, the author of the Phantom-grape library, http://grape.mtk.nao.ac.jp/~nitadori/phantom/). Hamada et al. (1654123, ) presented a similarly tuned tree-code, in which Phantom-grape is replaced with a GPU library (2007astro.ph..3100H, ). In this way, it was possible to achieve about 200 GFLOP/s and a correspondingly high science rate in particles per second. However, this code does not include quadrupole corrections, and therefore a GFLOP/s comparison between the two codes is meaningless. Nevertheless, the science rates of the two codes are comparable for similar opening angles, which implies that our tree-code provides more accurate results for the same performance.

As generally occurs when algorithms are accelerated, the introduction of a massively parallel accelerator tends to make the host calculations and the non-parallelisable parts of the code, as small as they may be, the bottleneck. In our case we use optimised device code, whereas for the host code we use general recursive tree-construction and tree-walk algorithms. It is possible to improve these algorithms to increase the performance of the host part, but it is likely to remain a bottleneck: even with modern quad-core processors this part is hard to optimise, since it is largely a sequential operation.

Acknowledgements

We thank Derek Groen, Stefan Harfst and Keigo Nitadori for valuable suggestions and reading of the manuscript. This work is supported by NWO (via grants #635.000.303, #643.200.503, VIDI grant #639.042.607, VICI grant #6939.073.803 and grant #643.000.802). We thank the University of Amsterdam, where part of this work was done, for their hospitality.

References


  • (1) J. Barnes, P. Hut, A Hierarchical O(N log N) Force-Calculation Algorithm, Nature 324 (1986) 446–449.
  • (2) S. J. Aarseth, Gravitational N-Body Simulations, Cambridge University Press, Cambridge, UK, 2003.
  • (3) D. Frenkel, B. Smit, Understanding Molecular Simulation, Second Edition: From Algorithms to Applications (Computational Science Series, Vol 1), 2nd Edition, Academic Press, 2001.
  • (4) J. Makino, M. Taiji, Scientific Simulations with Special-Purpose Computers: The GRAPE Systems, John Wiley & Sons, Chichester, 1998.
  • (5) J. Makino, Direct Simulation of Dense Stellar Systems with GRAPE-6, in: Dynamics of Star Clusters and the Milky Way, Vol. 228 of Astronomical Society of the Pacific Conference Series, 2001, p. 87.
  • (6) M. S. Warren, J. K. Salmon, A parallel hashed oct-tree n-body algorithm, in: Supercomputing ’93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, ACM, New York, NY, USA, 1993, pp. 12–21. doi:http://doi.acm.org/10.1145/169627.169640.
  • (7) M. S. Warren et al, Parallel supercomputing with commodity components, in: H. R. Arabnia (Ed.), Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’97), 1997, pp. 1372–1381.
  • (8) S. Harfst, A. Gualandris, D. Merritt, R. Spurzem, S. Portegies Zwart, P. Berczik, Performance analysis of direct N-body algorithms on special-purpose supercomputers, New Astronomy 12 (2007) 357–377. arXiv:astro-ph/0608125, doi:10.1016/j.newast.2006.11.003.
  • (9) A. Gualandris, S. Portegies Zwart, A. Tirado-Ramos, Performance analysis of direct n-body algorithms for astrophysical simulations on distributed systems, Parallel Comput. 33 (3) (2007) 159–173. doi:http://dx.doi.org/10.1016/j.parco.2007.01.001.
  • (10) A. Kawai, J. Makino, T. Ebisuzaki, Performance Analysis of High-Accuracy Tree Code Based on the Pseudoparticle Multipole Method, ApJS 151 (2004) 13–33. doi:10.1086/381391.
  • (11) G. E. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965).
  • (12) L. Nyland, M. Harris, J. Prins, The rapid evaluation of potential fields using programmable graphics hardware (2004).
  • (13) S. F. Portegies Zwart, R. G. Belleman, P. M. Geldof, High-performance direct gravitational N-body simulations on graphics processing units, New Astronomy 12 (2007) 641–650. doi:10.1016/j.newast.2007.05.004.
  • (14) R. G. Belleman, J. Bédorf, S. F. Portegies Zwart, High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA, New Astronomy 13 (2008) 103–112. arXiv:0707.0438, doi:10.1016/j.newast.2007.07.004.
  • (15) E. Gaburov, S. Harfst, S. Portegies Zwart, Sapporo: A way to turn your graphics cards into a GRAPE-6, New Astronomy 14 (7) (2009) 630–637. doi:10.1016/j.newast.2009.03.002.
  • (16) T. Ito, J. Makino, T. Ebisuzaki, D. Sugimoto, A special-purpose N-body machine GRAPE-1, Computer Physics Communications 60 (1990) 187–194. doi:10.1016/0010-4655(90)90003-J.
  • (17) T. Fukushige, T. Ito, J. Makino, T. Ebisuzaki, D. Sugimoto, M. Umemura, GRAPE-1A: Special-Purpose Computer for N-body Simulation with a Tree Code, Publ. Astr. Soc. Japan 43 (1991) 841–858.
  • (18) S. L. W. McMillan, S. J. Aarseth, An O(N log N) integration scheme for collisional stellar systems, ApJ 414 (1993) 200–212. doi:10.1086/173068.
  • (19) NVIDIA Corp., CUDA Programming Manual.
  • (20) J. E. Barnes, A Modified Tree Code: Don't Laugh, It Runs, Journal of Computational Physics 87 (1990) 161.
  • (21) L. M. Widrow, J. Dubinski, Equilibrium Disk-Bulge-Halo Models for the Milky Way and Andromeda Galaxies, ApJ 631 (2005) 838–855. arXiv:astro-ph/0506177, doi:10.1086/432710.
  • (22) W. Dehnen, A Hierarchical O(N) Force Calculation Algorithm, Journal of Computational Physics 179 (2002) 27–42. arXiv:astro-ph/0202512.
  • (23) V. Springel, N. Yoshida, S. D. M. White, GADGET: a code for collisionless and gasdynamical cosmological simulations, New Astronomy 6 (2001) 79–117. doi:10.1016/S1384-1076(01)00042-2.
  • (24) J. G. Stadel, Cosmological N-body simulations and their analysis, Ph.D. thesis, AA(UNIVERSITY OF WASHINGTON) (2001).
  • (25) P. Teuben, The Stellar Dynamics Toolbox NEMO, in: Astronomical Data Analysis Software and Systems IV, Vol. 77 of Astronomical Society of the Pacific Conference Series, 1995, p. 398.
  • (26) T. Hamada, et al., 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence, in: SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, NY, USA, 2009, pp. 1–12. doi:http://doi.acm.org/10.1145/1654059.1654123.
  • (27) T. Hamada, T. Iitaka, The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units, ArXiv Astrophysics e-prints, arXiv:astro-ph/0703100.