Tree-based solvers for the adaptive mesh refinement code FLASH – I: gravity and optical depths.
Abstract
We describe an OctTree algorithm for the MPI-parallel, adaptive mesh-refinement code FLASH, which can be used to calculate the gas self-gravity, and also the angle-averaged local optical depth for treating ambient diffuse radiation. The algorithm communicates to the different processors only those parts of the tree that are needed to perform the tree walk locally. The advantage of this approach is a relatively low memory requirement, which is particularly important for the optical depth calculation, since it needs to process information from many different directions. This feature also enables a general tree-based radiation transport algorithm, to be described in a subsequent paper, and delivers excellent scaling up to at least 1500 cores. Boundary conditions for gravity can be either isolated or periodic, and they can be specified in each direction independently, using a newly developed generalisation of the Ewald method. The gravity calculation can be accelerated with the adaptive block update technique, which partially reuses the solution from the previous timestep. Comparison with the FLASH internal multigrid gravity solver shows that tree-based methods provide a competitive alternative, particularly for problems with isolated or mixed boundary conditions. We evaluate several multipole acceptance criteria (MACs) and identify a relatively simple APE MAC which provides high accuracy at low computational cost. The optical depth estimates agree very well with those of the RADMC-3D radiation transport code, with the tree solver being much faster. Our algorithm is available in the standard release of the FLASH code in version 4.0 and later.
keywords:
galaxies: ISM – gravitation – hydrodynamics – ISM: evolution – radiative transfer1 Introduction
Solving Poisson’s equation for general mass distributions is a common problem in numerical astrophysics. Grid-based hydrodynamic codes frequently use iterative multigrid or spectral methods for that purpose. On the other hand, particle codes often use tree-based algorithms. The extensive experience with tree gravity solvers in particle codes can be transferred to the domain of grid-based codes. Here we describe an implementation of a tree-based gravity solver for the Adaptive Mesh Refinement (AMR) code FLASH (Fryxell et al., 2000) and show that its efficiency is comparable to the FLASH intrinsic multigrid solver (Ricker, 2008). An advantage of this approach is that the tree code can be used for more general calculations performed in parallel with the gravity; in particular, the calculation of the optical depth in every cell of the computational domain with the algorithm developed by Clark et al. (2012), and general radiation transport with the TreeRay algorithm (described in Paper II; Wünsch et al., in prep.).
Hierarchically structured, tree-based algorithms represent a well-established technique for solving the gravitational N-body problem at reduced computational cost (Barnes & Hut, 1986, hereafter BH86). Many Lagrangian codes implement trees to compute the self-gravity of both collisionless (stars or dark matter) and collisional (gas) particles, e.g. Gadget-2 (Springel, 2005), Vine (Wetzstein et al., 2009; Nelson et al., 2009), EvoL (Merlin et al., 2010), Seren (Hubber et al., 2011) and Gandalf (Hubber et al., 2018). The three most important characteristics of a tree algorithm are the tree structure (also called the grouping strategy), the multipole acceptance criterion (MAC), which decides whether to open a child-node or not, and the order of approximation of the integrated quantity within nodes (e.g. the mass distribution).
Tree structure: Each node on the tree represents a part of the computational domain, hereafter a volume, and the child-nodes of a given parent-node collectively represent the same volume as the parent-node. The most common 'OctTree' structure is built by a recursive subdivision of the computational domain, where every parent-node is split into eight equal-volume child-nodes, until we reach the last generation. The nodes of the last generation are called leaf-nodes, and they cover the whole computational domain.
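The recursive subdivision can be sketched as follows (a hypothetical, minimal Python illustration over point masses; FLASH itself is written in Fortran and subdivides blocks of grid cells rather than particle lists):

```python
# Minimal sketch of OctTree construction: each parent-node is split into
# eight equal-volume child-nodes until every leaf-node holds at most
# `leaf_size` particles.  All names here are illustrative only.
class Node:
    def __init__(self, center, size):
        self.center = center        # geometric centre of the node's cube
        self.size = size            # edge length of the cube
        self.children = []          # eight child-nodes, or [] for a leaf
        self.mass = 0.0
        self.com = (0.0, 0.0, 0.0)  # centre of mass

def build(node, points, masses, leaf_size):
    """Recursively subdivide `node` over the given point masses."""
    node.mass = sum(masses)
    if node.mass > 0.0:
        node.com = tuple(sum(m * p[d] for p, m in zip(points, masses)) / node.mass
                         for d in range(3))
    if len(points) <= leaf_size:
        return node                 # leaf-node reached, stop subdividing
    # sort the points into the eight octants around the node centre
    groups = {o: ([], []) for o in range(8)}
    for p, m in zip(points, masses):
        o = sum((p[d] >= node.center[d]) << d for d in range(3))
        groups[o][0].append(p)
        groups[o][1].append(m)
    half = node.size / 2.0
    for o in range(8):
        ccen = tuple(node.center[d] + (half / 2 if (o >> d) & 1 else -half / 2)
                     for d in range(3))
        child = Node(ccen, half)
        build(child, groups[o][0], groups[o][1], leaf_size)
        node.children.append(child)
    return node
```

Note that every level of the recursion conserves the total mass, and a parent's centre of mass is the mass-weighted mean of its children's.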
Tree structures other than the OctTree are also often used. Bentley (1979) constructs a balanced 'k-d' binary tree by recursively dividing parent-nodes so that each of the resulting child-nodes contains half of the particles in the parent-node; this tree structure is used in the codes pkdgrav (Stadel, 2001) and Gasoline (Wadsley et al., 2004). In contrast, Press (1986) constructs a binary tree, from the bottom up, by successively amalgamating nearest-neighbour particles or nodes into parent nodes. This 'Press-tree' has been further improved by Jernigan & Porter (1989), and is used, for instance, by Benz et al. (1990) and Nelson et al. (2009). More complex structures have been suggested. For example, Ahn & Lee (2008) describe the 'k-means' algorithm, in which a parent-node is adaptively divided into child-nodes according to the particle distribution in the parent-node.
There seems to be no unequivocally superior tree structure. Waltz et al. (2002) compare OctTrees with binary trees, and find that OctTrees provide slightly better performance at the same accuracy. On the other hand, Anderson (1999) argues, on the basis of an analytical study, that certain types of binary trees should provide better performance than OctTrees. Makino (1990) points out that the differences in performance lie mainly in the tree construction part, and that the tree walk takes a comparable amount of time in either type of tree structure. Therefore, the choice of tree structure should be informed by more technical issues, like the architecture of the computer to be used, other software to which the tree will be linked, and so on.
Multipole acceptance criterion: Another essential part of a tree code is the criterion, or criteria, used to decide whether a given node can be used to calculate the gravitational field, or whether its child-nodes should be considered instead. This is a key factor determining the accuracy and performance of the code. Since this criterion often reduces to deciding whether the multipole expansion representing the contribution from the node in question provides a sufficiently accurate approximation for the calculation of the gravitational potential, it is commonly referred to as the multipole acceptance criterion (MAC). We retain this terminology even though nodes in the code presented here may possess more general properties than just a multipole expansion.
The original BH86 geometric MAC uses a simple criterion, based purely on the ratio of the size of a given node to its distance from the cell at which the gravitational potential is to be computed. More elaborate methods also take into account the mass distribution within a particular node, or even constrain the allowed total acceleration error (Salmon & Warren, 1994, SW94; see §2.2.1).
Order of approximation: Springel et al. (2001) suggest that if the gravitational acceleration is computed using multipole moments up to order p, then the maximum error is of the order of the contribution from the (p+1)-th multipole moment. There is no consensus on where to terminate the multipole expansion of the mass distribution in a node. The original BH86 tree code uses moments up to second order (p = 2), i.e. quadrupoles, and many authors follow this choice. Wadsley et al. (2004) find the highest efficiency using hexadecapole moments (p = 4) in the Gasoline code. On the other hand, SW94 find that their code using the SumSquare MAC is most efficient with p = 0, i.e. just monopole moments. This suggests that the optimal choice of p may depend strongly on other properties of the code and its implementation, and possibly also on the architecture of the computer. Springel (2005) advocates using just monopole moments on the basis of memory and cache usage efficiency. We follow this approach and consider only monopole moments for all implemented MACs.
Further improvements: Tree codes have often been extended with new features or modified to improve their behaviour. Barnes (1990) noted that neighbouring particles interact with essentially the same nodes, and introduced interaction lists that save time during a tree walk. This idea was further extended by Dehnen (2000, 2002), who describes a tree with mutual node-node interactions. This greatly reduces the number of interactions that have to be calculated, leading – in theory – to a CPU-time that scales linearly, O(N), with the number of particles N. Dehnen’s implementation also symmetrizes the gravitational interactions to ensure accurate momentum conservation, which is in general not guaranteed with tree codes. Recently, Potter et al. (2017) developed this so-called Fast Multipole Method (FMM) further and implemented it in the massively parallel cosmological N-body code PKDGRAV3.
Hybrid codes: Tree codes are also sometimes combined with other algorithms into 'hybrid' codes. For example, Xu (1995) describes a TreePM code which uses a tree to calculate short-range interactions, and a particle-mesh method (Hockney & Eastwood, 1981) to calculate long-range interactions. The TreePM approach has been developed further by Bode et al. (2000); Bagla (2002); Bode & Ostriker (2003); Bagla & Khandai (2009); Khandai & Bagla (2009). There are also general-purpose tree codes available that can work with both N-body and grid-based codes, e.g. the MPI-parallel tree gravity solver FLY (Becciani et al., 2007).
In this paper we describe a newly developed, cost-efficient, tree-based solver for self-gravity and diffuse radiation that has been implemented in the AMR code FLASH. This code has been developed since 2008, and since FLASH version 4.0 it has been part of the official release. A GPU-accelerated tree gravity solver, based on an early version of the presented code, has been developed by Lukat & Banerjee (2016). The paper is organized as follows: in §2 we describe the implemented algorithm, which splits into the tree solver (§2.1), the gravity module (§2.2) and the optical depth module (§2.3). Accuracy and performance for several static and dynamic tests are discussed in §3, and we conclude in §4. In Appendix A we provide formulae for the acceleration in computational domains with periodic and mixed boundary conditions, and in Appendix B we give the runtime parameters of the code.
2 The algorithm
The FLASH code (Fryxell et al., 2000) is a complex framework consisting of many interoperable modules that can be combined to solve a specific problem. The tree code described here can only be used with a subset of the possible FLASH configurations. The basic requirement is usage of the paramesh-based grid unit (see MacNeice et al. 2000 for a description of the paramesh library); support for other grid units (uniform grid, Chombo) can be added in the future. Furthermore, the grid geometry must be 3D Cartesian.
The paramesh library defines the computational domain as a collection of blocks organised into a tree data structure which we refer to as the amr-tree. Each node on the amr-tree represents a block. The block at the top of the amr-tree, corresponding to the entire computational domain, is called the root-block and represents refinement level 1. The root-block is divided into eight equal-volume blocks having the same shape and orientation as the root-block, and these blocks represent refinement level 2. This process of block division is then repeated recursively until the blocks created satisfy an adaptive-mesh refinement criterion. The blocks at the bottom of the tree, which are not divided, are called leaf-blocks, and the refinement level of a leaf-block is labelled l_leaf. In regions where the AMR criterion requires higher spatial resolution, the leaf-blocks are smaller and their refinement level, l_leaf, is larger (i.e. they are further down the tree).
The number of grid cells in a block (a logically cuboidal collection of cells; see below) must be the same in each direction and equal to 2^k, where k is an arbitrary integer number. In practice, it should be k ≥ 2, because most hydrodynamic solvers do not allow blocks containing fewer than 4 cells in each direction, in order to avoid overlapping of ghost cells. Note that the above requirements do not exclude non-cubic computational domains, because such domains can be created either by setting up blocks with different physical sizes in each direction or by using more than one root block.
Within each leaf-block is a local block-tree which extends the amr-tree down to the level of individual grid cells. All block-trees have the same number of levels. The nodes on a block-tree represent refinement levels l_leaf (1 node), l_leaf + 1 (8 nodes), l_leaf + 2 (64 nodes), and so on. The nodes at the bottom of the block-tree are leaf-nodes, and represent the grid cells on which the equations of hydrodynamics are solved.
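The bookkeeping implied above is simple to state explicitly (a toy illustration):

```python
# A block with 2**k cells per side has a block-tree of k + 1 levels, with
# 8**i nodes on level i (1, 8, 64, ... down to the individual grid cells).
def blocktree_level_sizes(cells_per_side):
    k = cells_per_side.bit_length() - 1
    assert cells_per_side == 2 ** k, "cells per side must be a power of two"
    return [8 ** i for i in range(k + 1)]
```

For the common choice of 8 cells per side, this gives four block-tree levels with 1, 8, 64 and 512 nodes.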
Each node – both the nodes on the amr-tree and the nodes on the local block-trees – stores collective information about the set of grid cells that it contains, e.g. their total mass, the position of their centre of mass, etc.
Our algorithm consists of a general tree solver, implementing the tree construction, communication and tree walk, and modules which implement the calculations of specific physical quantities, e.g. gravitational accelerations or optical depths. The tree solver communicates with the physical modules by means of interface subroutines, which allow the physical modules, on the one hand, to store various quantities on the nodes, and on the other hand, to walk the tree, accessing the quantities stored on the nodes. When walking the tree, physical modules may use different MACs that reflect the nature of the quantity they are seeking to evaluate. An advantage of this approach is that it makes code maintenance more straightforward and efficient. Moreover, new functionality can be added easily by writing new physical modules or extending existing ones, without needing to change the relatively complex tree solver algorithm.
The boundary conditions can be either isolated or periodic, and they can be specified in each direction independently, i.e. mixed boundary conditions with one or two directions periodic and the remaining one(s) isolated are allowed (see §2.2).
In the following §2.1, we describe the tree solver, and in §2.2 and §2.3, respectively, we give descriptions of the gravity module and the module (called OpticalDepth) which calculates heating by the interstellar radiation field.
2.1 Tree solver
The tree solver creates and utilises the tree data structure described above. Maintaining a copy of the whole tree on each processor would incur prohibitively large memory requirements. Therefore, only the amr-tree (i.e. the top part of the tree, between the root-block node and the leaf-block nodes) is communicated to all processors. The block-tree within a leaf-block is held on the processor whose domain contains that leaf-block, and communicated wholly or partially to another processor only if it will be needed by that processor during a subsequent tree walk. The tree solver itself stores in each tree-node – with the exception of the leaf-nodes – the total mass of the node and the position of its centre of mass, i.e. four floating point numbers. For leaf-nodes (the nodes corresponding to individual grid cells) only the masses are stored, because the positions of their centres of mass are identical to their geometrical centres and are already known. Additionally, each physical module can store any other required quantity on the tree-nodes.
The tree solver consists of three steps: tree build, communication and tree walk. In the tree build step, the tree is built from the bottom up by collecting information from the individual grid cells, summing it, and propagating it to the parent tree-nodes. The initial stages of this step, those that involve the block-trees within individual leaf-blocks, are performed locally. However, as soon as the leaf-block nodes are reached, information has to be exchanged between processors, because parent-nodes are not necessarily located on the same processor. At the end of this step, each processor possesses a copy of the amr-tree plus all the block-trees corresponding to leaf-blocks that are located on that processor.
The communication step ensures that each processor imports from all other processors all the information that it will need for the tree walks, which are subsequently called by the physical modules. To this end, the code considers all pairs of processors, and determines what tree information the one processor (say CPU0; see Figure 1) needs to export to the other processor (say CPU1). To do this, the code walks the block-trees of all the leaf-blocks on CPU0, and applies a suite of MACs (required by the tree solver itself and by the physical modules in use) in relation to all the leaf-blocks on CPU1. This suite of MACs determines, for each leaf-block on CPU0, the level of its block-tree that delivers sufficient detail to CPU1 to satisfy the resolution requirements of all the physical modules that will be called before the tree is rebuilt. Thus, a leaf-block on CPU0 that has very little physical influence on any of the leaf-blocks on CPU1 (for example, by virtue of being very distant or of low mass) may only need to send CPU1 the information stored on its lowest (i.e. coarsest resolution) level. Conversely, a leaf-block on CPU0 that has a strong influence on at least one of the leaf-blocks on CPU1 (for example, by virtue of being very close or very massive) may need to send the information stored on its highest (finest resolution) level. In order to simplify communication, the required nodes of each block-tree on CPU0 are then stored in a one-dimensional array, ordered by level, starting at the coarsest level and proceeding to finer levels (see Figure 2). Finally, the arrays from all the block-trees on CPU0 are collated into a single message and sent to CPU1. This minimizes the number of messages sent, thereby ensuring efficient communication, even on networks with high latency.
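The level-ordered packing can be sketched as follows (the exact FLASH message layout is an assumption here; the point is only that a level-ordered flat buffer can be truncated at any depth):

```python
# Block-tree node payloads are flattened level by level, so truncating the
# buffer after level L transmits exactly the L + 1 coarsest levels.
def pack_blocktree(levels, max_level):
    """levels[i] is the list of payloads (e.g. [mass, x, y, z]) of the
    8**i nodes on block-tree level i; pack levels 0..max_level."""
    buf = []
    for lvl in range(max_level + 1):
        for payload in levels[lvl]:
            buf.extend(payload)
    return buf
```

A distant, low-mass leaf-block would be packed with `max_level = 0` (a single node), while a nearby, massive one would be packed down to the grid-cell level.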
Note that this communication strategy, in which tree-nodes are communicated, differs from the commonly used one in which particles (the equivalents of grid cells) are communicated instead (e.g. Gadget; Springel, 2005). In this way, the communication is completed before the tree walk is executed, and the tree walk runs locally, i.e. separately on each processor. The communication strategy adopted in this work provides a significant benefit for the OpticalDepth and TreeRay modules, as they work with a large amount of additional information per grid cell (or particle), which does not have to be stored and communicated (see §2.3).
The final step is the tree walk, in which the whole tree is traversed in a depth-first manner for each grid cell, or in general for an arbitrary target point (e.g. the position of a sink particle). During this process, the suite of MACs is evaluated recursively for each node: if the node is acceptable for the calculation, the subroutines of the physical modules that do the calculation are called; otherwise, its child-nodes are opened.
The tree solver itself only implements a simple geometric MAC (Barnes & Hut, 1986), which accepts a node if its angular size as seen from the target point, θ, is smaller than a user-set limit, θ_lim. Specifically, if h is the linear size of the node, r is the position of the target point and x_cm is the position of the centre of mass of the node, the node is accepted (and so its child-nodes need not be considered) if

θ ≈ h / |r − x_cm| < θ_lim .  (1)
It has been shown by Salmon & Warren (1994, hereafter SW94) that the BH86 MAC can lead to unexpectedly large errors when the target point is relatively far from the centre of mass of the node but very close to its edge. Several alternative geometric MACs have been suggested to mitigate this problem (Salmon & Warren, 1994; Dubinski, 1996). Following Springel (2005), we extend the geometric MAC so that a node is only accepted if, additionally, the target point lies outside a cuboid concentric with the node and a fixed factor larger than it. Additional MACs specific to the physical modules are implemented by those modules (see §2.2).
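The depth-first walk with the opening-angle test can be sketched as follows (a minimal illustration with G = 1; the dict-based nodes and names are ours, not FLASH's):

```python
import math

# Depth-first walk with a BH86-style geometric MAC: accept a node whenever
# its angular size h/d falls below theta_lim, otherwise open its children.
def monopole_accel(node, target, theta_lim):
    """Gravitational acceleration at `target` from `node` (monopole only)."""
    dx = [c - t for c, t in zip(node['com'], target)]
    d = math.sqrt(sum(x * x for x in dx))
    if not node['children'] or (d > 0.0 and node['size'] / d < theta_lim):
        if d == 0.0 or node['mass'] == 0.0:
            return [0.0, 0.0, 0.0]    # skip empty nodes / self-interaction
        return [node['mass'] * x / d ** 3 for x in dx]   # monopole term
    acc = [0.0, 0.0, 0.0]
    for child in node['children']:
        acc = [a + b for a, b in zip(acc, monopole_accel(child, target, theta_lim))]
    return acc
```

Shrinking `theta_lim` forces more nodes to be opened and makes the result converge towards the direct cell-by-cell summation.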
The tree walk is the most time-consuming part of the tree solver; typically it takes more than 90% of the computational time spent in the whole tree solver. We stress that the tree walk does not include any communication; the tree is traversed in parallel, independently on each processor, for all the grid cells in the spatial domain of that processor. The tree solver exhibits very good scaling, with speedup increasing almost linearly up to at least 1500 CPU cores (see §3.5).
2.2 Gravity module
This module calculates the gravitational potential and/or the gravitational acceleration. We use the same approach as Springel (2005) and store only monopole moments in the tree, because this substantially reduces memory requirements and communication costs. Since masses and centres of mass are already stored on the tree-nodes by the tree solver, the gravity module does not contribute any extra quantities to the tree.
In §2.2.1 we describe three data-dependent MACs which can be used instead of the geometric MACs of the tree solver: the Maximum Partial Error (MPE) MAC, the Approximate Partial Error (APE) MAC, and an (experimental) implementation of the SumSquare MAC. Furthermore, the code features three different types of gravity boundary conditions: isolated (see §2.2.2), fully periodic (§2.2.3), and mixed (§2.2.4). Finally, in §2.2.6, we describe a technique called the Adaptive Block Update, which saves computational time by reusing the solution from the previous timestep where possible.
Data-dependent MACs
A general weakness of the purely geometric MACs is that they do not take into account the amount and internal distribution of mass in a node. This can make the code inefficient if the density is highly non-uniform. For example, if the code calculates the gravitational potential of the multi-phase interstellar medium, the contribution from nodes in the hot rarefied gas is very small, but it is calculated with the same opening angle as the much more important contribution from nodes in dense molecular cores.
MPE MAC (Maximum Partial Error): To compensate for the above problem, SW94 propose a MAC based on evaluating the maximum possible error, Δa_p, in the contribution to the gravitational acceleration at the target point that could derive from calculating it using the multipole expansion of the node up to order p (instead of adding directly the contributions from all the constituent grid cells),

Δa_p ≤ (G/d²) (1 − δ_max/d)⁻² [ (p+2) B_{p+1}/d^{p+1} − (p+1) B_{p+2}/d^{p+2} ] ,  (2)

B_n = Σ_a m_a |r_a − x_cm|ⁿ .  (3)

Here, x_cm is the mass centre of the node; d is the distance from x_cm to the target point; δ_max is the distance from x_cm to the furthest point in the node; B_n is the n-th order multipole moment, obtained by summing contributions from all the grid cells in the node; and m_a and r_a are the masses and positions of these grid cells. The node is then accepted only if Δa_p is smaller than some specified maximum allowable acceleration error. This threshold can either be set by the user as a constant value, Δa_lim, in the physical units used by the simulation,

Δa_p ≤ Δa_lim ,  (4)

or it can be set as a relative value, ε_lim, with respect to the acceleration a_prev from the previous timestep,

Δa_p ≤ ε_lim |a_prev| .  (5)
APE MAC (Approximate Partial Error): An alternative way to estimate the partial error of a node contribution was suggested by Springel et al. (2001). It takes into account the total mass of the node, but ignores the internal distribution of that mass. It is therefore faster, but less accurate. Using multipole moments up to order p, the error in the gravitational acceleration is of order the contribution from the (p+1)-th multipole moment, and the node is accepted if

(G m / d²) (h/d)^{p+1} ≤ Δa_lim ,  (6)

where m is the mass in the node and h is its linear size; in our case p = 1, since we only store monopole moments and the dipole moment about the centre of mass vanishes identically. As with the MPE MAC, the APE error limit can either be set absolutely, as Δa_lim (Equation 4), or relatively, through ε_lim (Equation 5).
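Under this error estimate the acceptance test reduces to a one-line comparison (a sketch; the symbol names follow the text above, and the form of the estimate is our reconstruction):

```python
# APE-style acceptance test: accept a node of mass m and size h at
# distance d if the estimated truncation error of an order-p expansion
# is below the user-set limit da_lim.
def ape_accept(m, h, d, p, da_lim, G=1.0):
    return G * m / d ** 2 * (h / d) ** (p + 1) <= da_lim
```

Because only the node's total mass enters, the test is cheap: distant or low-mass nodes pass with large opening angles, while massive nearby nodes are opened further.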
SumSquare MAC: SW94 argue that it is unsafe to constrain the error using the contribution of a single node only, since it is not known a priori how these contributions combine. They suggest an alternative procedure, which limits the error in the total acceleration at the target point; one variant of this procedure is the SumSquare MAC, which sums the squares of the errors Δa_p given by Equation (2) over all nodes considered for the calculation of the potential/acceleration at a given target point. In this way, the SumSquare MAC controls the total error in the acceleration resulting from the contributions of all tree-nodes. This MAC requires a special tree walk which does not proceed in the depth-first manner. Instead, it uses a priority queue, which reorders on-the-fly a list of nodes waiting for evaluation, according to the estimated error resulting from their contribution. This feature is still experimental in our implementation; nevertheless, we evaluate its accuracy and performance and compare it to the other MACs in §3.4.
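The priority-queue walk can be sketched as follows (an illustrative simplification: we recompute the root-sum-square bound each iteration and take the per-node error estimate as a plain callback rather than evaluating Equation 2):

```python
import heapq
import math

# Sketch in the spirit of SumSquare: keep opening the node with the largest
# error estimate until the root-sum-square of all remaining estimates
# drops below the tolerance.
def sumsquare_walk(root, err_estimate, tol):
    """err_estimate(node) -> upper bound on the node's acceleration error.
    Returns the list of nodes accepted for the interaction."""
    heap = [(-err_estimate(root), 0, root)]   # max-heap via negated keys
    counter = 1                               # unique tie-breaker
    while True:
        total = math.sqrt(sum(k * k for k, _, _ in heap))
        if total <= tol:
            break
        _, _, node = heapq.heappop(heap)      # worst offender first
        if not node['children']:
            heapq.heappush(heap, (0.0, counter, node))   # exact leaf, error 0
            counter += 1
        else:
            for child in node['children']:
                heapq.heappush(heap, (-err_estimate(child), counter, child))
                counter += 1
    return [n for _, _, n in heap]
```

The walk terminates because every refinement replaces a node by children with smaller (ultimately zero) error estimates, so the global bound decreases monotonically towards the tolerance.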
Isolated boundary conditions
In the case of isolated boundary conditions (BCs), the gravitational potential at a target point given by position vector r is

Φ(r) = −G Σ_i m_i / |r − r_i| ,  (7)

where the index i runs over all nodes accepted by the MAC during the tree walk, and m_i and r_i are the node masses and positions. The gravitational acceleration is then obtained either by differentiating the potential numerically, or it is calculated directly, as

a(r) = −G Σ_i m_i (r − r_i) / |r − r_i|³ .  (8)
The first approach needs less memory and is slightly faster. The second approach results in less noise, because numerical differentiation is not needed.
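The two routes can be contrasted for a single monopole (G = 1; a sketch, not the FLASH implementation):

```python
import math

# Route 1: store only the potential and differentiate it numerically.
# Route 2: accumulate the acceleration directly during the tree walk.
def potential(target, m, pos):
    return -m / math.dist(target, pos)

def accel_direct(target, m, pos):
    r = math.dist(target, pos)
    return [-m * (t - p) / r ** 3 for t, p in zip(target, pos)]

def accel_numeric(target, m, pos, h=1e-5):
    """Central differences of the potential: needs only Phi, but is noisier."""
    acc = []
    for d in range(3):
        tp, tm = list(target), list(target)
        tp[d] += h
        tm[d] -= h
        acc.append(-(potential(tp, m, pos) - potential(tm, m, pos)) / (2 * h))
    return acc
```

The finite-difference route carries an O(h²) truncation error plus round-off noise, which is the "noise" referred to above; the direct sum avoids both at the cost of storing three components instead of one.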
Periodic boundary conditions
In the case of periodic boundary conditions in all three directions, the gravitational potential is determined by the Ewald method (Ewald, 1921; Klessen, 1997), which is designed to mitigate the very slow convergence encountered if the contributions to the potential (essentially terms proportional to 1/r) are summed over an infinite number of periodic copies by brute force. This is achieved by splitting each contribution into two parts,

1/r = erfc(α r)/r + erf(α r)/r ,  (9)

and summing the erf(α r)/r term in Fourier space; α is an arbitrary constant controlling the number of nearby and distant terms which have to be taken into consideration. In this section, we present formulae only for the potential. The expressions for the acceleration are straightforward to derive, and we list them in Appendix A.
The computational domain is assumed to be a rectangular cuboid with sides L_x, L_y = a_y L_x and L_z = a_z L_x, where a_y and a_z are arbitrary real numbers. The gravitational potential at the target point, r, is then

Φ(r) = −G Σ_a m_a ζ_EW(r − r_a) ,  (10)

ζ_EW(x) = Σ_n erfc(α |x + n|)/|x + n| + (4π/V) Σ_{k_h ≠ 0} exp(−|k_h|²/(4α²)) cos(k_h · x)/|k_h|² − π/(V α²) .  (11)

Here, the first inner sum corresponds to short-range contributions, erfc(α r)/r, from the nearest domains in physical space, and the second sum constitutes long-range contributions, erf(α r)/r. The outer sum in Equation (10) runs over all accepted nodes in the computational domain; m_a is the mass of a node, and r_a is its centre of mass. The lattice vectors n and wave vectors k_h are given by

n = (i L_x, j L_y, k L_z) ,   with integers i, j, k ,  (12)

k_h = 2π (h_x/L_x, h_y/L_y, h_z/L_z) ,   with h_x² + h_y² + h_z² ≤ h_max² ,  (13)

and V = L_x L_y L_z.
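A compact numerical sketch of this decomposition for a cubic box is given below (G = 1; the function name and the truncation limits nmax, hmax are illustrative). A useful self-check of any Ewald implementation is that the result must not depend on the splitting parameter α:

```python
import math

# Ewald split for a cubic box of side L: real-space erfc sum over nearby
# images plus a Fourier-space erf sum, with the background constant that
# makes the result independent of alpha.
def ewald_zeta(x, L, alpha, nmax=2, hmax=4):
    """Periodic 1/r kernel at separation x, minus the uniform background."""
    V = L ** 3
    short = 0.0                      # short-range (real-space) part
    for i in range(-nmax, nmax + 1):
        for j in range(-nmax, nmax + 1):
            for k in range(-nmax, nmax + 1):
                r = math.sqrt((x[0] + i * L) ** 2 + (x[1] + j * L) ** 2
                              + (x[2] + k * L) ** 2)
                if r > 0.0:
                    short += math.erfc(alpha * r) / r
    long_r = 0.0                     # long-range (Fourier-space) part
    for hx in range(-hmax, hmax + 1):
        for hy in range(-hmax, hmax + 1):
            for hz in range(-hmax, hmax + 1):
                if hx == hy == hz == 0:
                    continue
                k2 = (2 * math.pi / L) ** 2 * (hx * hx + hy * hy + hz * hz)
                kx = 2 * math.pi / L * (hx * x[0] + hy * x[1] + hz * x[2])
                long_r += (4 * math.pi / V) * math.exp(-k2 / (4 * alpha ** 2)) \
                          * math.cos(kx) / k2
    return short + long_r - math.pi / (V * alpha ** 2)
```

With a moderate α (of order a few over L), both sums converge after only a handful of terms, which is the entire point of the method.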
Mixed boundary conditions
We generalise the Ewald method, which was developed for computational domains with periodic BCs in all spatial directions, to computational domains with mixed BCs. In three-dimensional space, mixed BCs can be of two types: periodic BCs in two directions (without loss of generality we choose the x- and y-directions) and isolated BCs in the third (z-)direction; or periodic BCs in one direction (we choose x) and isolated BCs in the other two directions. We abbreviate the former case of mixed BCs as 2P1I, and the latter case as 1P2I. Configuration 2P1I has planar symmetry perpendicular to the z-axis, while configuration 1P2I has axial symmetry along the x-axis. These configurations are convenient for studying systems with the corresponding symmetry (i.e. layers or filaments). We note that the directions that can be defined as periodic are given by the computational domain boundaries, and thus they can only be parallel to one or more of the Cartesian coordinate axes.
We find the expression for the potential for mixed BCs of 2P1I type by taking a limit of Equation (11). Consider a computational domain with side-lengths L_x, L_y and L_z, with periodic boundary conditions in all three directions, for which the gravitational potential is given by Equation (11). Next, we shift the periodic copies of this domain in the z-direction so that the periodicity in the z-direction is A times larger, i.e. L_z' = A L_z, where A is an integer number and L_z is the extent in the z-direction of the original computational domain (Figure 3). Since the copies are shifted and not stretched, the mass distribution between 0 and L_z is unaltered, and the density is zero between L_z and A L_z, leaving all mass concentrated in plane-parallel layers of thickness L_z with normals pointing in the z-direction. As A increases, the layers move away from one another, but Equation (11) still holds. In the limit A → ∞, the periodic copies of the computational domain touch one another in the x- and y-directions; however, neighbouring layers in the z-direction are at infinite distance, and hence they do not contribute to the gravitational field in the original computational domain.
As A increases, the short-range contributions from copies shifted in the z-direction vanish, because the argument of the complementary error function in Equation (11) tends to infinity. The long-range term in the limit A → ∞ becomes

ζ_long(x) = lim_{A→∞} (4π / (A L_x L_y L_z)) Σ_{h_x, h_y} cos(k_⊥ · x_⊥) Σ_{h_z} exp(−(κ² + k_z²)/(4α²)) cos(k_z z) / (κ² + k_z²) ,  (14)

with k_⊥ = 2π (h_x/L_x, h_y/L_y), x_⊥ = (x, y), κ = |k_⊥| and k_z = 2π h_z/(A L_z). The condition (13), which is now h_x² + h_y² + (h_z/A)² ≤ h_max², requires us to conserve the resolution in the k_z-direction in Fourier space, i.e. to increase the range of h_z linearly with A (see Figure 3). Note that z is independent of A, because we restrict all mass in the computational domain to the interval (0, L_z) (i.e. |z| < L_z for any target point at z_t and node at z_a, where z = z_t − z_a). Bearing this in mind, the term after the limit sign in Equation (14) corresponds to a Riemann sum over the interval (−∞, ∞) with equally spaced partitions of size Δk_z = 2π/(A L_z). Using the identity lim_{Δk_z→0} Σ_{h_z} f(k_z) Δk_z = ∫ f(k_z) dk_z, where f(k_z) is the summand, the limit becomes

ζ_long(x) = (2/(L_x L_y)) Σ_{h_x, h_y} cos(k_⊥ · x_⊥) I(κ, z) ,  (15)

where

I(κ, z) = ∫_{−∞}^{∞} exp(−(κ² + k_z²)/(4α²)) cos(k_z z) / (κ² + k_z²) dk_z .  (16)
To keep the notation compact, we have introduced κ and the shorthand I(κ, z) for the integral (16). In order to evaluate the integral analytically, we extend the interval of integration to infinity (this extension means that we evaluate the sum even slightly more accurately than required by condition 13). If κ ≠ 0, we have

I(κ, z) = (π/(2κ)) [ e^{κz} erfc(κ/(2α) + αz) + e^{−κz} erfc(κ/(2α) − αz) ] ,  (17)

where erfc is the complementary error function. When κ = 0, integral (16) is infinite, but this property can be circumvented. With the help of cos(k_z z) = 1 − (1 − cos(k_z z)) we get two integrals corresponding to these two terms. The former one is infinite, but independent of the spatial coordinates, and we set it to zero. The latter one can easily be integrated,

I(0, z) = −π [ z erf(αz) + exp(−α² z²)/(α √π) ] .  (18)
Now we can write the potential as

Φ(r) = −G Σ_a m_a [ Σ_{i,j} erfc(α |r − r_a + n_{ij}|)/|r − r_a + n_{ij}| + (2/(L_x L_y)) Σ_{h_x, h_y} cos(k_⊥ · (r − r_a)_⊥) I(κ, z) ] ,  (20)

where n_{ij} = (i L_x, j L_y, 0) with integers i, j, and the (h_x, h_y) = (0, 0) term is given by Equation (18). Note that the ratio a_z = L_z/L_x is not contained in Equation (20), as we may expect, because it is of no physical significance when the BCs are isolated in this direction.
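The analytic evaluation of the slab integral can be cross-checked numerically; the erfc-based closed form used here follows our reconstruction of Equation (17), so treat it as an assumption being verified against direct quadrature of Equation (16):

```python
import math

# Closed form of the slab integral versus composite Simpson integration.
def I_closed(kappa, z, alpha):
    return (math.pi / (2.0 * kappa)) * (
        math.exp(kappa * z) * math.erfc(kappa / (2 * alpha) + alpha * z)
        + math.exp(-kappa * z) * math.erfc(kappa / (2 * alpha) - alpha * z))

def I_numeric(kappa, z, alpha, tmax=20.0, n=20001):
    """Simpson rule over k_z in (-tmax, tmax); n must be odd."""
    h = 2.0 * tmax / (n - 1)
    s = 0.0
    for i in range(n):
        t = -tmax + i * h
        f = (math.exp(-(kappa ** 2 + t ** 2) / (4 * alpha ** 2))
             * math.cos(t * z) / (kappa ** 2 + t ** 2))
        w = 1 if i in (0, n - 1) else (4 if i % 2 else 2)
        s += w * f
    return s * h / 3.0
```

The Gaussian factor makes the integrand decay rapidly, so a modest integration range already reproduces the closed form to high accuracy.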
The modification of the Ewald method for a computational domain with mixed BCs of type 1P2I can be derived in a similar way to the previous case. However, the integration is more demanding here, because the result of the limiting process is a double integral (we integrate over two wave-vector directions, instead of obtaining equations (17) and (18)). Applying a substitution which corresponds to a rotation, this integral can be transformed into a 1D integral, but we have not been able to express it in a closed form. In this case (1P2I), we arrive at

Φ(r) = −G Σ_a m_a [ Σ_i erfc(α |r − r_a + n_i|)/|r − r_a + n_i| + (2/L_x) Σ_{h_x} cos(k_x (x − x_a)) ξ(k_x, ρ) ] ,  (21)

where n_i = (i L_x, 0, 0) with integer i, k_x = 2π h_x/L_x, and the function ξ is given by

ξ(k_x, ρ) = ∫_0^∞ J_0(ρ t) exp(−(t² + k_x²)/(4α²)) t/(t² + k_x²) dt ,  (22)

and ρ = √((y − y_a)² + (z − z_a)²). J_0 is the Bessel function of the first kind and zeroth order.
Lookup table for the Ewald array
Since the explicit evaluation of ζ_EW (and its mixed-BC variants) at each timestep would be prohibitively time consuming, these functions are precalculated before the first hydrodynamical timestep, and their values are stored in a lookup table. We experiment with two approaches to approximate the above functions from the lookup table at the time when the gravitational potential is evaluated.
In the first approach, the function ζ_EW is precalculated on a set of nested grids, and particular values are then found by trilinear interpolation on these grids. Coverage of the grids increases towards the singularity at the origin (x = 0). The gravitational potential at target point r is then calculated as

Φ(r) = −G Σ_a m_a ζ_EW(r − r_a) ,  (23)

with ζ_EW interpolated from the lookup table.
In the second approach, we avoid the singularity of ζ_EW by subtracting the term 1/|x| from it, defining δζ(x) = ζ_EW(x) − 1/|x|. This enables us to use only one interpolating grid, with uniform coverage of the whole computational domain. Moreover, for mixed BCs, δζ can be approximated in parts of the computational domain by analytic functions: δζ converges to an analytic expression with increasing |z| for configuration 2P1I, and with increasing ρ for configuration 1P2I. The convergence is exponential, and beyond a moderate distance from the layer (2P1I) or the filament (1P2I) the relative error in the acceleration becomes negligible. Accordingly, we use the analytic expression in these regions and precalculate δζ only in the region close to the mass distribution, so the grid covers only a fraction of the computational domain if the computational domain is elongated. In combination with using only one interpolating grid, this results in smaller memory demands, while retaining the same accuracy as the first approach.
In the second approach, we precalculate not only but also its gradient. The actual value of at a given location is then estimated by a first-order Taylor expansion. This is faster than the trilinear interpolation used in the first approach, and speeds up the Gravity module by a factor of to , depending on the shape of the computational domain, the adopted BCs, and whether the potential or the acceleration is computed. Thus the second approach appears superior to the first. In either approach, if gravitational accelerations rather than the potential are required, we adopt an analogous procedure for each Cartesian component.
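The second approach can be sketched as follows, with a smooth analytic stand-in for the tabulated (non-singular) Ewald correction; all names and grid parameters here are illustrative, not those of the FLASH implementation:

```python
import math

def precompute(f, grad_f, h, n):
    """Tabulate a function and its analytic gradient on an n^3 grid of
    spacing h (the real code tabulates the corrected Ewald field)."""
    tab_f = [[[f(i*h, j*h, k*h) for k in range(n)] for j in range(n)] for i in range(n)]
    tab_g = [[[grad_f(i*h, j*h, k*h) for k in range(n)] for j in range(n)] for i in range(n)]
    return tab_f, tab_g

def eval_taylor(tab_f, tab_g, h, x, y, z):
    """First-order Taylor expansion around the nearest tabulated point:
    f(r) ~ f(r0) + grad f(r0) . (r - r0); cheaper than trilinear interpolation."""
    i, j, k = round(x/h), round(y/h), round(z/h)
    dx, dy, dz = x - i*h, y - j*h, z - k*h
    gx, gy, gz = tab_g[i][j][k]
    return tab_f[i][j][k] + gx*dx + gy*dy + gz*dz

# smooth stand-in function for the (non-singular) tabulated field
f = lambda x, y, z: math.exp(-(x*x + y*y + z*z))
grad = lambda x, y, z: tuple(-2.0*c*math.exp(-(x*x + y*y + z*z)) for c in (x, y, z))

tab_f, tab_g = precompute(f, grad, h=0.1, n=12)
approx = eval_taylor(tab_f, tab_g, 0.1, 0.234, 0.51, 0.307)
exact = f(0.234, 0.51, 0.307)
```

The Taylor step needs only one table access per evaluation (value plus gradient), versus eight for trilinear interpolation, which is the source of the speed-up described above.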
Note that in a very elongated computational domain, the evaluation of can be accelerated by adjusting the parameter . Since is precalculated, the choice of is of little importance in our implementation and we do not discuss it further in this paper.
Adaptive block update
Often, it is not necessary to calculate the gravitational potential/acceleration at every grid cell in every timestep. Since the FLASH code uses a global timestep controlled by the Courant-Friedrichs-Lewy (CFL) condition, there may be large regions of the computational domain where the mass distribution barely changes during one timestep. In such regions, the gravitational potential/acceleration from the previous timestep may be accurate enough to be used also in the current one. Therefore, to save computational time, we implement a technique called the Adaptive Block Update (ABU). If activated, the treewalk is modified as follows. For each block, the treewalk is first executed only for the eight corner grid cells of the block. Then, the gravitational potential or acceleration (or any other quantity calculated by the treesolver, e.g. the optical depth) in those eight grid cells is compared to the values from the previous timestep. If all the differences are smaller than the required accuracy (given e.g. by Equation 4 or 5), the previous-timestep values are adopted for all grid cells of the block.
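The ABU decision logic can be sketched as follows (a hypothetical stand-in for the FLASH implementation; `corner_walk` and `full_walk` are placeholder callables, and the corner indices assume a flattened 8³ block):

```python
def block_needs_full_walk(corner_new, corner_old, eps):
    """Re-walk the whole block only if the tree-solver quantity changed by
    more than the accuracy limit eps in any of the eight corner cells."""
    return any(abs(a - b) > eps for a, b in zip(corner_new, corner_old))

def solve_block(prev_solution, corner_walk, full_walk, eps):
    """corner_walk() walks the tree for the 8 corner cells only;
    full_walk() walks it for every cell of the block (placeholders)."""
    corner_idx = [0, 7, 56, 63, 448, 455, 504, 511]  # corners of an 8^3 block
    corner_new = corner_walk()
    corner_old = [prev_solution[i] for i in corner_idx]
    if block_needs_full_walk(corner_new, corner_old, eps):
        return full_walk()
    return prev_solution          # reuse last timestep's potential/acceleration

prev = [1.0] * 512                # previous-timestep solution for one block
unchanged = solve_block(prev, lambda: [1.0] * 8, lambda: [2.0] * 512, eps=0.01)
changed = solve_block(prev, lambda: [1.5] * 8, lambda: [2.0] * 512, eps=0.01)
```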
For some applications, the eight test cells in the block corners may not be sufficient. For instance, if the gas changes its configuration in a spherically symmetric way within a block, the gravitational acceleration at the block corners does not change, even though it may change substantially in the block interior. Such a situation is more probable if blocks larger than the default cells are used. Therefore, more test cells can easily be added by editing the array gr_bhTestCells in file gr_bhData.F90, where test cells are listed by their cell indices within a block, i.e. in the form (1,1,1), (1,1,8) … (8,8,8).
ABU can save a substantial amount of computational time; however, on large numbers of processors it works well only if proper load balancing among processors is ensured, i.e. each processor should be assigned tasks of approximately the same computational cost. FLASH is parallelised using a domain decomposition scheme, and individual blocks are distributed among processors along the space-filling Morton curve (see Fryxell et al., 2000, for details). Each processor receives a number of blocks estimated so that their total expected computational cost, measured by a workload weight, is approximately the same as for the other processors. By default, FLASH assumes that each leafblock takes approximately the same amount of time to process, and it assigns workload weight to each leafblock (because it includes active grid cells) and workload weight to all other blocks (they are used only for interpolations between different AMR levels).
The assumption of the same workload per leafblock cannot be used with ABU, because if the full treewalk is executed for a given block less often, the average computational time spent on it is substantially lower than for more frequently updated blocks. It is generally hard to predict whether a given block will be fully updated in the next timestep without additional information about the problem at hand. Therefore, we implement a simple block workload estimate that in most cases performs better than a uniform workload, even though it may not be optimal. It is based on the assumption that the probability that a block will be updated is proportional to the amount of work done on the block during the several previous timesteps. This assumption is motivated by the observation that a typical simulation includes, on the one hand, regions where the density and the acceleration change rapidly (e.g. close to fast-moving dense massive objects) and, on the other hand, regions where the acceleration changes slowly (e.g. large volumes filled with hot rarefied gas). Consequently, the past workload of a given block provides an approximate estimate of its current workload. However, this information is valid only until the density field evolves enough to change the above property of the region. The time at which this happens can be estimated as the gas crossing time of a single block. Due to the CFL condition, the corresponding number of timesteps is approximately the number of grid cells in a block along one direction. Specifically, the block workload estimate works as follows. For each leafblock, the total number of node contributions during the treewalk to all its grid cells, , is determined. Then, the workload weight, , of that block is calculated as
(24) 
where is the workload weight from the previous timestep, is a characteristic number of timesteps on which the workload changes, is a dimensionless number limiting the maximum workload weight, and is the maximum taken over all leafblocks in the simulation. In this way, the block workload weight depends on its treesolver computational cost during the last several () timesteps and is between (zero cost) and (maximum cost). By default, we set two global parameters and . The workload weight of nonleaf blocks remains equal to .
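Since Equation (24) is not legible in this copy, the following is only a hedged sketch of the idea it describes: an exponential moving average, over roughly `n_steps` timesteps, of the block's tree-walk cost normalised to the busiest leafblock; the defaults for the two global parameters are illustrative, not the code's actual values.

```python
def update_workload(w_prev, n_contrib, n_contrib_max, n_steps=8.0, w_max=10.0):
    """Sketch of the ABU workload-weight update: blend the previous weight
    with the current normalised tree-walk cost of the block.  The weight
    stays between 0 (idle block) and w_max (busiest block)."""
    return (w_prev * (1.0 - 1.0 / n_steps)
            + (w_max / n_steps) * (n_contrib / float(n_contrib_max)))

w = 0.0
for _ in range(200):      # a block repeatedly as expensive as the busiest one
    w = update_workload(w, n_contrib=5000, n_contrib_max=5000)
# w approaches w_max; an idle block's weight instead decays towards zero
```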
2.3 OpticalDepth module
The OpticalDepth module is used to evaluate the simplified solution to the radiative transfer equation
(25) 
where is the specific intensity at frequency , is the specific intensity at the source location, and is the optical depth along a given path through the computational domain at frequency . In this form, the problem of evaluating what radiation intensity reaches a given point in the computational domain, i.e. a given target point, is reduced to computing the optical depth in between a radiation source and the target point. The optical depth is proportional to the absorption crosssection and the column density along the path.
Hence, the OpticalDepth module calculates the total and/or specific column densities (e.g. of molecular hydrogen) for each cell in the computational domain, and can therefore be used to compute the local attenuation of an arbitrary external radiation field. The implementation presented here follows the idea of the Treecol method (Clark et al., 2012), which has been implemented in the Gadget code (Springel et al., 2001). It has been established as a fast yet sufficiently accurate approximate radiative transfer scheme to treat the (self-)shielding of molecules on the fly in simulations of molecular cloud formation (e.g. Clark & Glover, 2014). Recently, the method has also been applied in larger-scale simulations of Milky-Way-like galaxies (Smith et al., 2014) with the Arepo code (Springel, 2010). The implementation presented here has been successfully used in several recent works on the evolution of the multiphase ISM in galactic discs (Walch et al., 2015; Girichidis et al., 2016; Gatto et al., 2017; Peters et al., 2017).
In principle, the OpticalDepth module adds another dimension to the accumulation of the node masses during the treewalk. For each grid cell, the module constructs a Healpix sphere (Górski et al., 2005) with a given number of pixels, , each representing a sphere surface element with index corresponding to the polar and azimuthal angles and , respectively. This temporary map is filled while walking the tree: only the treenodes in the line of sight of a given pixel contribute to it, and are added accordingly. At the end of the treewalk, one has acquired a column density map of a given quantity, e.g. the total mass.
Since the treewalk in FLASH is executed on a block-by-block basis, the additional memory requirement for the local pixel maps is , where is the number of quantities that are mapped and stored. For this paper, we map three variables: (1) the total mass, giving the total hydrogen column density, ; (2) the column density of molecular hydrogen, which is used to compute its self-shielding and which contributes to the shielding of CO; and (3) the column density of carbon monoxide, which is necessary to compute the self-shielding of CO. We store three separate maps because we actually follow the relative mass fractions of multiple species in the simulation using the FLASH Multispecies module. After the treewalk for a given block has finished, the local maps are erased and the arrays can be reused for the next block. This approach is only possible because the treewalk is computed locally on each processor (see §2.1).
When using the OpticalDepth module, there are two major modifications with respect to the usual treewalk (as described above). First, the intersection of a given treenode with the line of sight of each pixel has to be evaluated during the treewalk. Second, at the end of the treewalk for a given block, the acquired column density maps have to be evaluated for each cell.
Node-ray intersection: The mapping of treenodes onto the individual pixels represents the core of all additional numerical operations that have to be carried out when running OpticalDepth in addition to the gravity calculation. It has to be computationally efficient in order to minimise the additional cost. Here, we do not follow the implementation of Clark et al. (2012), who make a number of assumptions about the shape of the nodes and their projection onto the pixels in order to reduce the computational cost. Instead, we precompute the number of intersecting Healpix rays and their respective relative weights for a large set of nodes at different angular positions (, ) and different angular sizes . These values are stored in a lookup table, which is accessed during the treewalk. In this way, the mapping of the nodes is highly efficient. Since , , and are known, we can easily compute the contribution of a node to all intersecting pixels by simply multiplying the mass (or any other quantity that should be mapped) of the node with the corresponding weight for each pixel and adding this contribution to the pixel map. For better accuracy, we oversample the Healpix tessellation and construct the table for four times more rays than are actually used in the simulation.
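The construction of such node-to-pixel weights can be illustrated with a Monte-Carlo sketch (a stand-in for the actual lookup-table construction; the toy six-pixel tessellation replaces Healpix for illustration, and all names are hypothetical):

```python
import math, random

def unit_vec(theta, phi):
    """Direction vector for polar angle theta and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def node_pixel_weights(node_theta, node_phi, alpha, pixel_dirs, n_rays=4096, seed=42):
    """Fraction of a node's solid angle mapped to each pixel, estimated by
    oversampling the node cone (half-opening angle alpha) with random
    sub-rays and binning each one to the nearest pixel centre."""
    rng = random.Random(seed)
    n = unit_vec(node_theta, node_phi)
    # orthonormal basis (u, v) perpendicular to the node direction n
    a = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = [n[1]*a[2] - n[2]*a[1], n[2]*a[0] - n[0]*a[2], n[0]*a[1] - n[1]*a[0]]
    norm = math.sqrt(sum(c*c for c in u))
    u = [c / norm for c in u]
    v = [n[1]*u[2] - n[2]*u[1], n[2]*u[0] - n[0]*u[2], n[0]*u[1] - n[1]*u[0]]
    weights = [0.0] * len(pixel_dirs)
    for _ in range(n_rays):
        # sample a sub-ray uniformly inside the cone of half-opening angle alpha
        ct = 1.0 - rng.random() * (1.0 - math.cos(alpha))
        st = math.sqrt(max(0.0, 1.0 - ct*ct))
        ph = 2.0 * math.pi * rng.random()
        d = [n[i]*ct + st * (u[i]*math.cos(ph) + v[i]*math.sin(ph)) for i in range(3)]
        # bin the sub-ray to the pixel whose centre is closest in angle
        best = max(range(len(pixel_dirs)),
                   key=lambda p: sum(d[i] * pixel_dirs[p][i] for i in range(3)))
        weights[best] += 1.0 / n_rays
    return weights

# six pixels along the +/- coordinate axes, a toy stand-in for a Healpix sphere
pixels = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
w = node_pixel_weights(0.0, 0.0, 0.1, pixels)   # small node around the +z axis
```

In the real code this computation happens once, for a grid of node positions and sizes, and only the resulting table is consulted during the treewalk.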
Radiative heating and molecule formation: The information that is obtained by the OpticalDepth module is necessary to compute the local heating rates and the formation and dissociation rates of H and CO. At the end of the treewalk for a given block, the mean physical quantities needed by the Chemistry module calculating the interaction of the radiation with the gas are determined. For instance, the mean visual extinction in a given grid cell is
(26) 
where the constant comes from the standard relation between the hydrogen column density, , and the visual extinction in a given direction (Draine & Bertoldi, 1996). The weighted mean is calculated in this fashion because the photodissociation rates of molecules such as CO and the photoelectric heating rate of the gas all depend on exponential functions of the visual extinction (see Clark et al., 2012, for details). Additionally, the shielding coefficients, and (Glover & Mac Low, 2007; Glover et al., 2010), as well as the dust attenuation, (Glover & Clark, 2012; Clark et al., 2012), are computed by averaging over the Healpix maps in a similar way. These quantities are stored as globally accessible variables and can be used by other modules. In particular, we access them in the Chemistry module, which locally (in every cell) evaluates a small chemical network (Glover et al., 2010) on the basis of the current density and internal energy and recomputes the relative mass fractions of the different chemical species. The evaluation of the chemical network is operator-split and employs the Dvode solver (Brown et al., 1989) to solve the system of coupled ODEs that describes the chemically reactive flow for the given species, i.e. their creation and destruction within a given timestep. Here, we explicitly follow the evolution of five species: ionised, atomic, and molecular hydrogen, as well as ionised carbon and carbon monoxide (CO). Details about the chemical network, e.g. the considered reactions and the employed rate coefficients in the current implementation, can be found in Glover et al. (2010) and Walch et al. (2015).
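The exponentially weighted angular averaging can be sketched as follows; the N_H-to-A_V conversion constant and the attenuation exponent γ = 2.5 are assumed values for illustration, and the module's actual constants may differ:

```python
import math

AV_PER_NH = 1.0 / 1.87e21   # mag per cm^-2; standard N_H <-> A_V relation (assumed)
GAMMA = 2.5                 # exponent in the exp(-GAMMA*A_V) attenuation (assumed)

def mean_visual_extinction(nh_map):
    """Exponentially weighted mean of A_V over a pixel map of total hydrogen
    column densities [cm^-2]: low-extinction pixels dominate the mean, as
    they dominate the photodissociation and photoelectric-heating rates."""
    n = len(nh_map)
    s = sum(math.exp(-GAMMA * AV_PER_NH * nh) for nh in nh_map) / n
    return -math.log(s) / GAMMA

# a uniform map recovers the single-pixel extinction exactly
av = mean_visual_extinction([1.87e21] * 48)    # A_V = 1 in every pixel
```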
Parameters: The main parameters controlling both the accuracy and the speed of the calculation are the number of pixels per map, , and the opening angle, , with which the tree is walked (see Equation (1)). Both should be varied at the same time: a high number of pixels used with a relatively large opening angle does not improve the directional information, since the nodes that are mapped into each solid angle are not opened, and thus the spatial resolution required for a fine-grained map cannot be achieved.
The number of Healpix pixels is directly related to the solid angle of each element on the unit sphere
(27) 
Tests in §3.3.1 show, in agreement with Clark et al. (2012), that the code efficiency is optimal if is approximately the same as the angular size of the Healpix elements, i.e.
(28) 
Therefore, for , 48, and 192 pixels we recommend using , , and , respectively.
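The matching of the opening angle to the pixel scale can be made concrete with a small helper; the square-root form and the printed values below are one plausible reading of Equations (27)-(28), not the paper's quoted numbers (which are not legible in this copy):

```python
import math

def theta_limit(n_pix):
    """Opening angle matched to the pixel scale: each of the n_pix equal-area
    Healpix pixels subtends a solid angle 4*pi/n_pix, so an angular scale of
    sqrt(4*pi/n_pix) is a natural choice for the opening angle."""
    return math.sqrt(4.0 * math.pi / n_pix)

for n in (12, 48, 192):
    print(n, "->", round(math.degrees(theta_limit(n)), 1), "deg")
```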
3 Accuracy and performance
Since more computational time is needed to reach higher accuracy when solving numerical problems, accuracy and performance are connected, and these two properties should therefore always be evaluated together. However, they often depend strongly on the specific type of problem, and finding a test that allows one to measure both accuracy and performance objectively is hard. Another complication is that the treesolver saves time by using information from the previous timestep (if ABU is switched on); thus any realistic estimate of the performance must be obtained by running a simulation in which the mass moves in a similar way as in real applications and by integrating the computational time over a number of timesteps. Unfortunately, such simulations are unavoidably too complex to have an analytic solution against which the accuracy could easily be evaluated.
Therefore we perform two types of tests: static tests that measure accuracy using simple problems and dynamic tests that evaluate accuracy and performance together. The static tests need substantially less CPU time and thus allow for a higher number of parameter sets to be tested. Furthermore, analytic or semianalytic solutions are known and the results can be compared to them. On the other hand, the dynamic tests represent more complex simulations which are more similar to problems that one would actually want to solve with the presented code. They also show how well the treesolver is coupled with the hydrodynamic evolution (where we use the standard PPM Riemann solver of the Flash code) and how the error accumulates during the evolution. In this section, we describe four static and two dynamic tests of the Gravity module and one test of the OpticalDepth module.
When possible, i.e. for fully periodic or fully isolated boundary conditions, we compare the results obtained with the new treesolver to those obtained with the default multigrid Poisson solver of FLASH (Ricker, 2008). The multigrid solver is an iterative solver whose accuracy is controlled by checking the convergence of the L2 norm of the Poisson equation residual . The iteration process is stopped when , where is the residual norm in the th iteration and is the limit set by the user. If isolated boundary conditions are used, the gravitational potential at the boundary is calculated by a multipole Poisson solver expanding the density and potential fields into a series up to a multipole of order . By default in Flash version 4.4. However, using this value we found unexpectedly high errors close to the boundaries (see test §3.1.1 and Figures 4 and 5), and therefore we use (the highest value allowed for technical reasons) in most tests, because it yields the smallest error.
In general, the calculated gravitational acceleration deviates from the exact analytical solution due to two effects: the inherent inaccuracy of the gravity solver (either the tree gravity solver or the multigrid solver), and the imperfect discretisation of the density field on the grid. Since we are mainly interested in evaluating the first effect, we measure the error by comparing the calculated accelerations to the reference solution obtained by direct summation of the interactions of each grid cell with all the other grid cells in the computational domain. We additionally give the difference between the analytical and this "integrated" acceleration where possible.
We define the relative error of the gravitational acceleration at the point as
(29) 
where is the acceleration of the reference solution and is its maximum taken over the whole computational domain.
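A minimal sketch of this error measure, using scalar per-cell acceleration magnitudes:

```python
def relative_error(a, a_ref):
    """Eq. (29)-style error map: |a - a_ref| normalised by the maximum
    reference acceleration taken over the whole computational domain."""
    a_max = max(abs(x) for x in a_ref)
    return [abs(x - y) / a_max for x, y in zip(a, a_ref)]

err = relative_error([1.01, 2.05, 0.52], [1.0, 2.0, 0.5])
```

Normalising by the global maximum (rather than cell by cell) avoids spuriously large relative errors in regions where the reference acceleration itself is close to zero.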
In most of the gravity module tests, we control the error by setting the absolute limit on the acceleration, which is calculated from the initial maximum acceleration in the computational domain, , as ; typically, or .
Most of the tests were carried out on the cluster Salomon of the Czech National Supercomputing Centre IT4I.
3.1 Static tests of gravity module
In order to test all combinations of the boundary conditions implemented in the Gravity module, we present four static tests. A marginally stable Bonnor-Ebert sphere is used to test the code with isolated boundary conditions (§3.1.1), and a density field perturbed by a sine wave not aligned with any coordinate axis is used to test setups with fully periodic boundary conditions (§3.1.2). For mixed boundary conditions, periodic in two directions and isolated in the third, or periodic in a single direction and isolated in the remaining two, we use an isothermal layer in hydrostatic equilibrium (§3.1.3) and an isothermal cylinder in hydrostatic equilibrium (§3.1.4), respectively. Finally, in §3.1.5, we test how the code accuracy depends on the alignment or non-alignment of the gas structures with the grid axes, using a set of parallel cylinders lying in the xy-plane inclined at various angles with respect to the x-axis.
Bonnor-Ebert sphere
mod.  solver  quan.  MAC  

(a)  tree  accel.  APE      0.0009  83  
(b)  tree  accel.  APE      0.0057  35  
(c)  tree  accel.  BH    0.5    0.0008  110 
(d)  tree  pot.  APE      0.0085  80  
(e)  tree  pot.  APE      0.031  38  
(f)  tree  pot.  BH    0.5    0.0095  106 
(g)  mg  pot.        0  0.058  21 
(h)  mg  pot.        15  0.077  20 
We give the model name in column 1. The following columns are:

solver: indicates whether the treesolver or the multigrid solver (mg) is used

quan.: quantity calculated by the gravity solver (acceleration or potential which is then differentiated)

MAC: Multipole Acceptance Criterion (BarnesHut or Approximate Partial Error)

: requested accuracy of the solver as given by Equation (4) ( where is the maximum gravitational acceleration in the computational domain)

: maximum opening angle when the BarnesHut MAC is used

: maximum relative error in the computational domain given by Equation (29)

: time (in seconds) to calculate a single timestep on 8 cores
We calculate the radial gravitational acceleration of a marginally stable Bonnor-Ebert sphere (Ebert, 1955; Bonnor, 1956, BES) with mass M, temperature K and dimensionless radius . The resulting BES radius is pc and the central density is g cm. The sphere is embedded in a warm rarefied medium with temperature K and density g cm, which ensures that the gas pressure across the BES edge is continuous. We use an AMR grid controlled by the Jeans criterion: the Jeans length has to be resolved by at least 64 and at most 128 cells. This results in an effective resolution of in the centre of the BES.
Figure 4 shows the relative error in the gravitational acceleration, , as a function of the radial coordinate, , and Table 1 lists all models, their maximum relative error, , and the time to calculate one timestep, . We compare the solutions calculated with the tree gravity solver using the geometric (BH) MAC with (red curves) to those calculated using the APE MAC with (green curves) and (blue curves), respectively. The APE MAC with as well as the geometric MAC with always give a maximum relative error smaller than . In the case of the APE MAC with , the maximum relative error reaches . Note that the error due to the discretisation of the density field is also of the order of 1% (black curve; the jumps are due to changes in the refinement level of the AMR grid).
With the tree gravity solver, the user may choose to compute the gravitational accelerations directly (left panel of Figure 4) or to calculate them by numerical differentiation of the gravitational potential (right panel of Figure 4). The latter is the standard practice in grid-based 3D simulations, also because only one field variable, the potential, has to be stored instead of three, the accelerations in the three spatial directions. However, for the treesolver we generally find that the error in the gravitational accelerations is significantly smaller (by about a factor of in the test presented here) if they are computed directly. This is independent of the MAC used.
For comparison, we also show the results obtained with the multigrid solver (magenta lines) using and (solid lines) or (dotted lines), respectively. Although the mass distribution is spherically symmetric, the order of the multipole expansion of the boundary condition affects the accuracy of the multigrid solver relatively far away from the boundaries, even inside the BES. The error of the multigrid solver is very low in the central region; it reaches in regions where the refinement level changes (due to the numerical differentiation of the potential), and increases to relatively high values at the border of the computational domain ( for and for ), due to the inaccuracy of the boundary conditions calculated by the multipole solver. We note that a direct calculation of the gravitational acceleration is not possible with the multigrid solver.
The distribution of the relative error in the plane through the centre of the BES is depicted in Figure 5. The results show that the acceleration obtained with the tree gravity solver using the APE MAC with has a substantially smaller error if it is calculated directly (top left panel; see Table 1 model (b)) instead of by numerical differentiation of the potential (top right panel; model (e)). The bottom panels show the results for the multigrid solver with (model (g)) and (model (h)), respectively. The default setting of gives errors of 5% near the domain boundaries due to the low accuracy of the multipole solver. This error propagates into a large fraction of the computational domain.
Sine-wave perturbation (Jeans test)
In a computational domain with fully periodic boundary conditions we calculate the gravitational acceleration of a smooth density field with a harmonic perturbation,
(30) 
where is the mean density and is the amplitude of the perturbation. The computational domain is a cube of size pc with grid cells in each direction. The wavevector was chosen such that it is not aligned with any of the coordinate axes. The gravitational acceleration can be obtained analytically with the help of the Jeans swindle (Jeans, 1902; Kiessling, 1999)
(31) 
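Since Equations (30)-(31) are not legible in this copy, the derivation can be sketched as follows; the sine form of the perturbation is an assumption consistent with the surrounding text, not a verbatim reconstruction of the paper's equations:

```latex
% Assumed perturbation form (hypothetical):
%   \rho(\mathbf{x}) = \rho_0 + \rho_1 \sin(\mathbf{k}\cdot\mathbf{x})
% The Jeans swindle discards the uniform background \rho_0, so Poisson's
% equation reads \nabla^2\Phi = 4\pi G\,\rho_1 \sin(\mathbf{k}\cdot\mathbf{x}),
% which is solved by
\Phi(\mathbf{x}) = -\frac{4\pi G\,\rho_1}{|\mathbf{k}|^2}\,
                   \sin(\mathbf{k}\cdot\mathbf{x}),
\qquad
\mathbf{a}(\mathbf{x}) = -\nabla\Phi
  = \frac{4\pi G\,\rho_1}{|\mathbf{k}|^2}\,\mathbf{k}\,
    \cos(\mathbf{k}\cdot\mathbf{x}).
```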
Figure 6 shows the maximum relative error as a function of the position along a line parallel to the perturbation wavevector . The maximum error is computed from all points projected onto a given position on the line. It can be seen that the error of the multigrid solver (magenta curve) is very small, almost the same as the difference between the analytical solution and the reference solution (black curve). This is because, without the need to calculate the boundary conditions separately, and on a uniform grid, the FFT-accelerated multigrid method is extremely efficient. Again, the results for the treesolver show that the direct calculation of the acceleration (solid curves) leads to a much lower error than the calculation of the potential and its subsequent differentiation (dashed curves). In particular, the calculation of the potential with the geometric MAC, which does not take into account the different mass densities of the treenodes, leads to a relative error greater than %. However, a direct calculation of the acceleration gives very accurate results for both the geometric MAC and the APE MAC with . In Table 2 we list all models with their respective and .
model  solver  quan.  MAC  

(a)  tree  accel.  APE    0.0009  480  
(b)  tree  accel.  APE    0.0062  210  
(c)  tree  accel.  BH    0.5  0.0029  250 
(d)  tree  pot.  APE    0.0180  330  
(e)  tree  pot.  APE    0.0270  130  
(f)  tree  pot.  BH    0.5  0.15  150 
(g)  mg  pot.        0.0016  9 
Isothermal layer in hydrostatic equilibrium
In order to test the accuracy of the tree gravity module with mixed boundary conditions (periodic in two directions and isolated in the third one), we calculate the gravitational acceleration of an isothermal layer in hydrostatic equilibrium. The vertical density distribution of the layer is (Spitzer, 1942)
(32) 
where g cm is the midplane density and km s is the isothermal sound speed. The corresponding vertical component of the gravitational acceleration is
(33) 
The computational domain is a cube of side length pc with a uniform resolution of grid cells in each direction.
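This setup can be checked numerically with the standard Spitzer (1942) expressions, which are assumed here because Equations (32)-(33) are not legible in this copy; the physical values are purely illustrative:

```python
import math

G = 6.674e-8                  # gravitational constant [cgs]
rho0 = 1.0e-22                # midplane density [g cm^-3] (illustrative)
cs = 1.0e5                    # isothermal sound speed, 1 km/s [cgs] (illustrative)
z0 = cs / math.sqrt(2.0 * math.pi * G * rho0)    # scale height

def rho(z):
    """Spitzer (1942) sech^2 density profile (standard form, assumed here)."""
    return rho0 / math.cosh(z / z0)**2

def g_z(z):
    """Corresponding vertical gravitational acceleration of the layer."""
    return -(2.0 * cs * cs / z0) * math.tanh(z / z0)

# hydrostatic equilibrium check: c_s^2 * d(rho)/dz = rho * g_z
z, dz = 0.7 * z0, 1.0e-6 * z0
lhs = cs * cs * (rho(z + dz) - rho(z - dz)) / (2.0 * dz)
rhs = rho(z) * g_z(z)
```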
Figure 7 shows the maximum relative error in the acceleration as a function of the vertical coordinate, where the maximum is taken over all cells with the same vertical coordinate. It can be seen that the error is almost independent of height, and there is only a small difference between the cases where the gravitational acceleration is calculated directly (solid lines) and where it is obtained by differentiation of the potential (dashed lines). The reason is that the density field in this test has relatively shallow gradients (e.g. compared to the Jeans test discussed in the previous section), and numerical differentiation leads to particularly severe errors only for steep gradients. We find the largest error for runs with the APE MAC and . All other runs have small errors, comparable to the difference between the analytical and the reference solution resulting from the discretisation of the density field. The results are summarised in Table 3.
model  solver  quan.  MAC  

(a)  tree  accel.  APE    0.00017  170  
(b)  tree  accel.  APE    0.0035  106  
(c)  tree  accel.  BH    0.5  180  
(d)  tree  pot.  APE    0.00029  99  
(e)  tree  pot.  APE    0.0028  45  
(f)  tree  pot.  BH    0.5  0.00043  107 
Isothermal cylinder in hydrostatic equilibrium
In the next static test, we evaluate the accuracy of the tree gravity module for mixed boundary conditions that are isolated in two directions and periodic in the third. We calculate the gravitational acceleration of an isothermal cylinder in hydrostatic equilibrium. The long axis of the cylinder is parallel to the periodic direction and the radius is given as . The density distribution is (Ostriker, 1964)
(34) 
where g cm is the central density and km s is the isothermal sound speed. The density distribution is cut off at radius pc and embedded in an ambient gas with km s and the same pressure as at the cylinder boundary. The corresponding gravitational acceleration is
(35) 
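For reference, the standard Ostriker (1964) expressions, to which Equations (34)-(35) presumably correspond, can be written as follows; this is a reconstruction under the usual scale-height convention, not a verbatim copy of the paper's equations:

```latex
\rho(r) = \rho_c \left(1 + \frac{r^2}{8H^2}\right)^{-2},
\qquad
H^2 = \frac{c_s^2}{4\pi G \rho_c},
\qquad
g_r = -\frac{2G\,m(r)}{r} = -\frac{2\pi G \rho_c\, r}{1 + r^2/8H^2},
% with m(r) = \pi\rho_c r^2/(1 + r^2/8H^2) the mass per unit length within
% radius r; the total line mass 8\pi H^2 \rho_c = 2c_s^2/G is the familiar
% critical value for an isothermal cylinder.
```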
The computational domain has dimensions and contains grid cells.
Figure 8 shows the maximum relative error of the gravitational acceleration in the radial direction, where the maximum is calculated over all grid cells at the same distance from the cylinder axis. In all runs, the error is a very weak function of . If numerical differentiation of the potential is used, it is the dominant source of the error, which is as large as 1% in these cases (dashed lines). The results are summarised in Table 4.
model  solver  quan.  MAC  

(a)  tree  accel.  APE    
(b)  tree  accel.  APE    
(c)  tree  accel.  BH    0.5  
(d)  tree  pot.  APE    
(e)  tree  pot.  APE    
(f)  tree  pot.  BH    0.5 
Inclined cylinders
In order to test whether the alignment of gas structures with the coordinate axes has an impact on the code accuracy, i.e. whether the algorithm is sensitive to any grid effects, we calculate the gravitational field of a set of parallel cylinders in the 2P1I geometry. The axes of all cylinders lie in the plane and are inclined at angle with respect to the axis. The computational domain has an extent of pc in the isolated direction and approximately pc in the periodic and directions. The exact extents in the latter two directions are chosen so that the computational domain constitutes a periodic cell of the infinite plane of cylinders, i.e. the cylinders connect contiguously to each other at the and periodic boundaries. Each cylinder has the same radius and density profile as the cylinder described in §3.1.4; the distance between the cylinder axes is pc. We have calculated models with increasing from to in steps of . For all models, the gravity treesolver was run with the BH MAC and a maximum opening angle .
Figure 9 shows the relative error of the gravitational acceleration, , calculated in the plane using Equation (29). The reference acceleration, , is obtained either numerically by direct integration (four panels on the left, for ) or analytically by summing up the potentials of the parallel cylinders (four panels on the right). The error with respect to the integration is always smaller than . The error with respect to the analytical acceleration is of order and is always slightly higher than the former, as it includes a contribution from the imperfect discretisation of the density field, which reaches its highest values along the cylinder edges where the density is discontinuous. The bottom panel shows the maximum as a function of , demonstrating that the code accuracy is almost independent of the inclination of the gaseous structures with respect to the coordinate axes.
3.2 Dynamic tests of gravity module
We run two dynamic tests of the gravity module. The first one (described in §3.2.1) is the collapse of a cold adiabatic sphere suggested by Evrard (1988); it tests how well the energy is conserved during the gravitational collapse. The second one describes the evolution of a turbulent sphere (§3.2.2). Both test the accuracy of the gravity module and its coupling to the hydrodynamic solver.
Evrard test
The Evrard test (Evrard, 1988) describes the gravitational collapse and subsequent rebound of an adiabatic, initially cold sphere. It is often used to verify energy conservation in SPH codes (e.g. Springel et al., 2001; Wetzstein et al., 2009); its application to grid-based codes is unfortunately less common. The initial conditions consist of a gaseous sphere of mass , radius and density profile
(36) 
The initial, spatially constant temperature is set so that the internal energy per unit mass is
(37) 
where is the gravitational constant. The standard values of the above parameters, used also in this work, are .
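Since Equations (36)-(37) are not legible in this copy, the standard Evrard (1988) initial conditions are sketched here for reference; this is an assumption based on the common formulation of the test, not a verbatim reconstruction:

```latex
\rho(r) = \frac{M}{2\pi R^2\, r} \quad (r \le R),
\qquad
u = 0.05\,\frac{G M}{R},
% The 1/r profile integrates to the total mass M within R, and the test is
% commonly run in units with G = M = R = 1.
```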
In Figure 10<