Cholla : A New MassivelyParallel Hydrodynamics Code for Astrophysical Simulation
Abstract
We present Cholla (Computational Hydrodynamics On ParaLLel Architectures), a new threedimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using stateoftheart techniques, including the unsplit Corner Transport Upwind (CTU) algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over ten million cells per GPUsecond while using an exact Riemann solver and PPM reconstruction. Owing to the massivelyparallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions () can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
1. Introduction
Over the past fifty years, the field of computational hydrodynamics has grown to incorporate a wide array of numerical schemes that attempt to model a large range of astrophysical phenomena. From the pioneering work of Godunov (1959) and Courant et al. (1967), the sophistication of hydrodynamics solvers has steadily improved. Many astrophysical simulation codes now use high order reconstruction methods, implement very accurate or exact Riemann solvers, and model additional physics including gravity, cooling, magnetohydrodynamics, radiative transfer, and more (e.g. Kravtsov, 1999; Knebe et al., 2001; Fryxell et al., 2000; Teyssier, 2002; Hayes et al., 2006; Stone et al., 2008; Bryan et al., 2014). While these advanced techniques result in simulations of unprecedented physical accuracy, they can also be extremely computationally expensive. Given the detailed and expensive physical processes currently being modeled, new numerical approaches to modeling Eulerian hydrodynamics should be considered. This work presents a new, massivelyparallel hydrodynamics code Cholla (Computational Hydrodynamics On ParaLLel Architectures) that leverages Graphics Processing Units (GPUs) to accelerate astrophysical simulations.
Historically, our ability to numerically model larger and more complex systems has benefitted from improvements in technology, especially increased storage and faster clock speeds for central processing units (CPUs). Algorithmic improvements such as adaptive mesh refinement (e.g., Berger & Oliger, 1984; Berger & Colella, 1989) have had a major impact on the ability of codes to achieve higher resolution, but much of the basic structure of static mesh grid codes has remained. In the last decade, computer speed has improved significantly as a result of increased parallelization, and the fastest supercomputers^{1}^{1}1http://www.top500.org now rely on hardware accelerators like GPUs or Intel Xeon Phi coprocessors to provide the bulk of their computational power. To leverage the full capabilities of these systems, multicore CPU chips and accelerators must be used simultaneously in the context of a single hydrodynamics code. While similar parallelization and vectorization techniques apply to a variety of hardware accelerators, Cholla utilizes GPUs to perform all its hydrodynamical calculations. Engineering Cholla to run natively on GPUs allows us to take advantage of the inherently parallel structure of gridbased Eulerian hydrodynamics schemes, and enables the substantial computational performance gain demonstrated in this paper.
Accelerators and other special purpose hardware have been used in astrophysical simulations for many years (e.g., Sugimoto et al., 1990; Aarseth, 1999; Spurzem, 1999; Portegies Zwart et al., 2004; Harfst et al., 2007; Portegies Zwart & Bédorf, 2014). Early work adapting Eulerian hydrodynamics solvers to the GPU indicated a promising avenue to accelerate simulations, with developers reporting speedups of or more as compared to CPUonly implementations (e.g., Brandvik & Pullan, 2007; Pang et al., 2010; Bard & Dorelli, 2010; Kestener et al., 2010). These preliminary efforts clearly illustrated the substantial performance gains that could be achieved using a single GPU. More recently, a number of multidevice GPUbased hydrodynamical simulation methods have been presented, including AMR techniques (Schive et al., 2010; Wang et al., 2010), two dimensional Galerkin approaches (Chan et al., 2012), Smoothed Particle Hydrodynamics (SPH) codes (Sandalski, 2012; Domínguez et al., 2013), and hybrid schemes (Kulikov, 2014).
While these codes represent substantial advancement, the field of massivelyparallel astrophysical hydrodynamics is still relatively new. All of the aforementioned methods have been restricted to secondorder spatial reconstruction, and many would require considerable modification to run on a cluster. In contrast, the hydrodynamics solver implemented in Cholla is among the most complex and physically accurate of those that have been adapted to GPU hardware. Successfully implementing such a complex solver in a hybrid environment on cluster scales displays our ability to merge the stateoftheart in CPU hydrodynamics with a new generation of computer hardware.
As is evidenced by the number of CPU codes presented in the literature, room exists for many different approaches optimized for different purposes. With Cholla, we have built a fast, GPUaccelerated static mesh hydrodynamics module that can be used efficiently on its own or in conjunction with a variety of additional physics. Beyond accelerating the hydrodynamics calculation, offloading the work onto the GPU frees the CPU to perform other tasks. This excess computational capacity makes Cholla an excellent bedrock for developing complex physical models that require hydrodynamics.
The large dynamic range of spatial scales in many astrophysical problems requires simulations with both high resolution and a high level of physical accuracy. The interaction of a high mach number shock with a gas cloud falls into this category (Klein et al., 1994). Using the power of Cholla, we can efficiently run high resolution simulations of the cloudshock problem to investigate how cloud density structure affects the destruction of highdensity gas. Given the inhomogeneous nature of gas in galaxies, our results have wideranging implications, from the impact of supernovae on the gas in their immediate environment to the evolution of dense gas in galactic outflows.
In the following sections, we fully describe Cholla. The code models solutions to the equations of hydrodynamics using the Corner Transport Upwind (CTU) algorithm (Colella, 1990; Gardiner & Stone, 2008), and includes multiple choices for both interface reconstruction and Riemann solvers. The CTU algorithm is presented in Section 2 along with brief descriptions of the reconstruction methods and Riemann solvers, which are fully documented in the Appendices. The code structure, including the simulation setup, CUDA functions, optimization strategies necessary to take advantage of the GPU architecture, and Message Passing Interface (MPI; Forum, 1994) implementation and scalability, is described in Section 3. We then demonstrate the excellent performance of Cholla on a suite of canonical hydrodynamics tests in Section 4. In Section 5, we derive new results describing the interaction of a high mach number shock with a turbulent gas cloud. We conclude in Section 6.
2. Hydrodynamics
Hydrodynamics is relevant to many astrophysical processes and represents one of the most computationally demanding parts of numerical simulations. Creating a fast hydrodynamics solver is therefore an important step in increasing the resolution and speed with which astrophysical calculations can be performed. In this section, we present the equations modeled by Cholla, and then describe the numerical algorithms used to model them. Cholla includes a variety of reconstruction techniques and Riemann solvers, each of which is described below.
In differential conservation law form (see, e.g., Toro, 2009), the multidimensional Euler equations can be written:
(1) 
(2) 
(3) 
(4) 
(5) 
Here is the mass density, , , and are the , , and components of velocity, is the pressure, and is the total energy per unit volume,
(6) 
where is the threecomponent velocity vector. The total energy includes the specific internal energy, , and the specific kinetic energy,
(7) 
Equation 1 describes the conservation of mass, Equations 24 the conservation of momentum, and Equation 5 the conservation of energy. To model solutions to this system of conservation laws, an equation of state is also necessary. We use the equation of state for an ideal gas,
(8) 
where is the ratio of specific heats. Incorporating a real gas equationofstate model (Colella & Glaz, 1985) would not be incompatible with the structure of Cholla, though it is beyond the scope of our current work.
The Euler equations can also be written in vector notation. We define the vector of conserved quantities with components in three Cartesian dimensions,
(9) 
including density, the three components of momentum, and total energy. We will also refer to the vector of primitive variables,
(10) 
that includes density, the three components of velocity, and pressure. We define three flux vectors
(11) 
(12) 
and
(13) 
Using these definitions, we can compactly write the three dimensional Euler equations in the conservative form:
(14) 
The Euler equations can also be written using the primitive variables. In one dimension, the equations can be written as a set of linear hyperbolic equations of the form
(15) 
The matrix is diagonalizable and can be written
(16) 
where is a matrix of right eigenvectors, is a diagonal matrix of eigenvalues, and is a matrix of left eigenvectors. The eigenvalues of are real and correspond to the speeds at which information propagates for the fluid equations. If we further define the characteristic variables, , according to
(17) 
we can write the system a third way:
(18) 
This description is called the characteristic form of the Euler equations. The characteristic variables are sometimes called wave strengths, because they describe the magnitude of the jump in the primitive variables across an associated wave. For an extensive treatment of the Euler equations and related subjects see, e.g., Laney (1998), LeVeque (2002), and Toro (2009).
2.1. The CTU Algorithm
Cholla models the Euler equations using a threedimensional implementation of the Corner Transport Upwind (CTU) algorithm optimized for magnetohydrodynamics (MHD; Colella, 1990; Saltzman, 1994; Gardiner & Stone, 2008). The sixsolve version of CTU used in Cholla is fully adapted for MHD and documented in both Gardiner & Stone (2008) and Stone et al. (2008). We will describe here an abbreviated version including only the steps relevant for hydrodynamics calculations.
The CTU algorithm is a Godunovbased method. A Godunov scheme uses a finitevolume approximation to model the Euler equations, evolving average values of the conserved quantities in each cell using fluxes calculated at cell interfaces (Godunov, 1959). In three dimensions, the calculation to update the conserved quantities at timestep can be written as
(19)  
Here the superscript refers to the next time step, and are the updated values of the conserved variables. The subscript refers to the threedimensional Cartesian index of the cell. Indices that are displaced by half an integer refer to interfaces. For example, is the interface between cell and cell . The simulation time step is , and , , and refer to the cell widths in each dimension. We use lowercase versions of and when referring to a cellaveraged quantity, and uppercase versions when referring to an estimated value at a cell edge (e.g., and ). The fluxes in Equation 19 are averages in both space and time. Table 1 summarizes our notation.
Symbol  Definition 

Conserved variable averages at time  
,  Reconstructed boundary values at the left and right of an interface 
,  Initial timeevolved boundary values 
, ,  Initial onedimensional fluxes 
,  Transversefluxevolved boundary values 
, ,  CTU fluxes 
Updated conserved variable averages at time 
When applied to every cell in the grid, Equation 19 conserves each of the quantities in . However, the physical accuracy of a method based on Equation 19 depends strongly on how the flux averages are calculated. Many hydrodynamics codes calculate the fluxes using only onedimensional information as outlined in the method known as Strang (1968) splitting. While an elegant technique, Strang splitting may lead to asymmetries in hydrodynamics calculations and is not advantageous for some formulations of MHD (see, e.g., the discussion in Balsara, 2004). Therefore, we employ instead the unsplit CTU algorithm to improve on the onedimensional calculation of fluxes by taking into account transverse fluxes that can cross cell interfaces in multidimensional simulations.
Before beginning a simulation the computational domain and boundaries must be initialized, as described in Section 3. Once the fluid properties on the grid have been initialized, the first simulation time step, , is calculated using the equation
(20) 
where is the CourantFriedrichsLewy (CFL) number and is the average sound speed in the cell. In adiabatic hydrodynamics, the sound speed is a function of pressure and density
(21) 
and can be calculated for each cell using the average values of the primitive variables. Thus, is the maximum wave speed in a cell with respect to the grid.
The minimum value of across the entire grid is determined and used in the CTU calculation, constraining the time step for every cell to be equal. In two or three dimensions, Equation 20 is modified such that the minimum required timestep is computed for each direction in the cell. In three dimensions, this minimization is computed as
(22) 
With a suitable choice of , Equation 22 ensures that the Courant condition is satisfied in all three dimensions. Note that for the sixsolve CTU algorithm the CFL number must be below for the solution to remain stable (Gardiner & Stone, 2008).
Once we have calculated the timestep we carry out the following procedure:

Reconstruct values of the conserved variables on both sides of every interface using the average value of the conserved quantities in adjacent cells. These reconstructed boundary values, denoted and , represent an approximation to the true value of each conserved variable at the interface. For a multidimensional simulation, the reconstruction must be carried out in each dimension separately. We use additional subscripts to indicate which interface values we are calculating. For example, the lefthand reconstructed boundary value between cell and cell is denoted , while the righthand value at that interface is . Cholla includes several different options for the interface reconstruction, which we describe in Section 2.2. In order for the CTU algorithm to be secondorder accurate in time, the reconstructed boundary values must be evolved half a time step before using them to calculate fluxes. The initial time evolution is considered part of the reconstruction, and is also inherently onedimensional. We label the initial timeevolved boundary values and . An example of a piecewiselinear reconstruction in the direction is shown in Figure 1.

Using the initial timeevolved boundary values as inputs, solve a Riemann problem at each cell interface in each direction. The solution to the Riemann problems yields a set of onedimensional fluxes, , , and , corresponding to the , , and interfaces, respectively. Like the boundary value arrays, the flux arrays contain five conserved value fluxes for each direction and interface. The Riemann solvers implemented in Cholla are described in Section 2.3.

Evolve the initial onedimensional timeevolved boundary values half a time step using the transverse fluxes. For example, at the interface between cell and cell the transversefluxevolved boundary values are
(23) For and interfaces, a cyclic permutation is applied. Thus, the interface states are evolved using the and fluxes, the interface states are evolved using the and fluxes, and the interface states are evolved using the and fluxes. This step is only relevant for multidimensional problems. In one dimension, the CTU algorithm reduces to just the initial reconstruction step, a flux calculation, and a final conserved quantity update.

Use the transversefluxevolved boundary values, and , as inputs for a new set of Riemann problems. The solution to these Riemann problems yields the CTU fluxes , , and . Each flux is calculated using a onedimensional Riemann problem at a single interface, but includes contributions from adjacent perpendicular interfaces as a result of the evolution in Equation 23. The CTU fluxes are secondorder accurate in time (Colella, 1990).

Use the CTU fluxes to update the conserved quantities in each cell as in Equation 19,
(24) The updated conserved quantities are used to calculate the next time step .

Repeat the algorithm until the final simulation time is reached.
2.2. Interface Reconstruction
Cholla currently implements five methods for cell interface reconstruction. These include the piecewise constant method (PCM), two versions of the piecewise linear method (PLM), and two versions of the piecewise parabolic method (PPM). Differences between the versions of piecewise linear and piecewise parabolic reconstruction are demonstrated in the tests presented in Section 4. Access to multiple reconstruction options often proves useful, since lowerorder methods are faster but higherorder methods are typically more accurate. Employing different versions of cell reconstruction also enables the impact of reconstruction on the evolution of a simulation to be quantified. Here, we give a brief overview of each of the reconstruction techniques implemented in Cholla . Detailed descriptions of the piecewise linear and piecewise parabolic options can be found in Appendix A.
2.2.1 Piecewise Constant Reconstruction
The simplest reconstruction technique is the piecewise constant method (Godunov, 1959; Courant et al., 1967). In PCM, the initial timeevolved boundary values and are set equal to the cell average quantities on either side of the interface, i.e.
(25) 
and
(26) 
Note that in this notation, the boundary value at the right of the interface is at the left side of the cell, and vice versa. While the piecewise constant method is generally too diffusive for practical applications, it has merit for code testing, and is useful as a comparison to higher order reconstruction techniques.
2.2.2 Piecewise Linear Reconstruction
The second and third reconstruction techniques implemented in Cholla are both forms of the piecewise linear method, a scheme that is secondorder accurate in space and time (e.g. Toro, 2009). The PLMP reconstruction method detailed below primarily involves the primitive variables, while the PLMC method subsequently explicated involves projecting the primitive variables onto wave characteristics.
PLMP follows the method outlined in Chapter 13.4 of Toro (2009). First, the cellaverage values of the primitive variables are used to calculate slopes of each variable across the left and right interface of each cell. We use the van Leer (1979) limiter to monotonize the slopes, thereby reducing the likelihood of spurious oscillations in a numerical solution. The limited slopes are used to calculate reconstructed values of the primitive variables at the cell interfaces, and . To convert these reconstructed boundary values into input states for the Riemann problem, we need to evolve them by half a time step. For PLMP, we do this by converting primitive quantities back into conserved variables and calculating the associated fluxes using Equations 1113. We use these fluxes to evolve the reconstructed boundary values half a time step, generating the initial timeevolved boundary states for the first set of Riemann problems, and .
The second linear reconstruction technique, PLMC, is based on the method outlined in Stone et al. (2008). This reconstruction also uses a linear approximation to model the distribution of the conserved quantities in each cell, but limits the slopes of the characteristic variables (rather than the primitive quantities) and evolves the reconstructed boundary values differently. Rather than simply evolving the boundary values using the associated fluxes as in PLMP, we employ the more sophisticated approach first described and termed “characteristic tracing” in Colella & Woodward (1984). We calculate a first approximation to the timeevolved boundary values by integrating under the linear interpolation function used to calculate and . The domain of dependence of the reconstructed boundary value integral is defined by the minimum (for the lefthand interface) or maximum (for the righthand interface) wave speed. This integration is then corrected by including the contribution from each of the other characteristics approaching the interface. Once the corrections have been made, the calculation provides the initial timeevolved boundary values and that act as input states for the first set of Riemann problems. This process is more fully described in Appendix A.
2.2.3 Piecewise Parabolic Reconstruction
The remaining two reconstruction techniques implemented in Cholla are both versions of the piecewise parabolic method (PPM) originally described in Colella & Woodward (1984). We call the first method PPMP as it performs the reconstruction using primitive variables. Our PPMP implementation closely follows the FLASH code documentation (Fryxell et al., 2000). The second method, abbreviated PPMC, uses an eigenvalue decomposition to project onto the characteristic variables and is based on the Athena code documentation (Stone et al., 2008). Each PPM reconstruction method is described in detail in Appendix A.
The approach to slope limiting differs slightly between the two parabolic reconstruction techniques. PPMP calculates slopes at each interface the same way as PLMP, using van Leer (1979) limiting in the primitive variables. The slopes are limited in the characteristic variables for PPMC. In the parabolic methods, slopes are calculated using a stencil of five cells (two on either side of the cell for which we are calculating boundary values), which allows us to create a parabolic reconstruction of the primitive variables. This parabolic reconstruction makes PPM thirdorder accurate in space, though it remains only secondorder accurate in time (Colella & Woodward, 1984).
Several differences between the two parabolic reconstruction methods warrant further discussion. PPMP identifies and steepens the slopes near contact discontinuities, which results in a method that is less diffusive for contact waves. Downsides to contact steepening include the necessity of empirically determined criteria for selecting contact discontinuities and increased oscillatory behavior in the solution near shocks. In Cholla the ability to turn off contact steepening is retained, allowing an explicit comparison of results obtained with and without the technique. PPMP also flattens shocks that have become too narrow to be treated accurately, which reduces the potential for severe postshock oscillations. The more diffusive nature of PPMC renders a comparable correction unnecessary. Because the criteria for detecting a shock requires information from three cells on either side of an interface, the stencil for PPMP is larger than for PPMC. Both methods employ the characteristic tracing method of Colella & Woodward (1984) to translate from boundary extrapolated values based on the parabolic interpolation to input states for the Riemann problem, though the methods differ in detail (see Appendix A).
2.3. Riemann Solvers
Much effort has been devoted to finding efficient numerical algorithms to solve the Riemann problem (e.g., Toro, 2009), an initial value problem consisting of two constant states separated by a jump discontinuity. As displayed in Figure 2, the Riemann problem has an analytic solution that enables a numerical model for Eulerian hydrodynamics, as it allows for the calculation of the flux across an interface separating two initially discontinuous states. While Riemann solvers that calculate a numerical solution to the exact Riemann problem can be incorporated in hydrodynamics codes, the implicit nature of the Riemann solution requires an iterative step in the exact numerical solver. A large number of Riemann problems must be solved for every time step in a simulation, and the corresponding computational cost is substantial. As a result, a variety of approximate Riemann solvers have been engineered to quickly solve an approximation to the Euler equations at the expense of some physical accuracy.
Cholla computes numerical solutions to the Riemann problems on GPUs. Floating point operations are performed very efficiently on a GPU; additional factors like memory latency and data transfer contribute a larger share of the computational expense of the method. Thus, adding the extra operations needed for the iterative procedure in an exact solver versus an approximate one does not impact the performance speed of Cholla in the same way as for a CPUbased code. However, there are certain problems where the extra diffusion in an approximate solver is helpful, for example to deal with the well known carbuncle instability (Quirk, 1994) that affects gridaligned shocks. For this reason, Cholla includes both an exact solver and the linearized solver first described by Roe (1981) that gives an exact solution to a linear approximation of the Euler equations. Detailed descriptions of our implementation of both solvers can be found in Appendix B.
2.3.1 The Exact Solver
Cholla implements the exact Riemann solver presented in Toro (2009). The solver uses a NewtonRaphson iteration to calculate the pressure of the gas in the intermediate state of the Riemann solution that lies between the initial states on the left and right of the interface, as shown in Figure 2. Once the pressure in the intermediate state has been found, the exact solution for the primitive variables between the left and right initial states can be calculated explicitly at any later point in time. The pressure and velocity are used to determine the solution at the cell interface, and the values of the primitive variables at that point are used to calculate the fluxes of conserved variables at the interface according to Equations 11  13. Transverse velocities are passively advected as scalar quantities.
The Toro Riemann solver gives a numerically exact solution to the Riemann problem in one dimension, and will never return negative densities or pressures if the input states are physically selfconsistent. However, the input states on either side of the cell interface are estimated quantities, and because of the extrapolation involved in the reconstruction techniques they could be physically invalid. In these situations, the solver may be presented with an initial value problem without a physically valid solution. To prevent artificial vacuum or negative pressure solutions owing to such a circumstance, a pressure floor of in the adopted unit scheme is enforced in Cholla . In practice, when using an exact Riemann solver the pressure floor has proved necessary only when performing the Noh test described in Section 4.
2.3.2 The Roe Solver
One common alternative to calculating an exact solution to the Riemann problem is to linearize the nonlinear conservations laws and solve the resulting approximate problem exactly. In one dimension, the nonlinear Euler equations can be replaced with the following linearized equation
(27) 
where A is a constant Jacobian evaluated at some average state that is a function of the initial states on either side of the cell interface. This method was employed by Roe (1981), and Cholla includes a linearized solver very similar to the original Roe solver.
The first step in the Roe solver is to calculate the average state . This average state, along with the eigenvalues, , and left and right eigenvectors of the Jacobian , and , can be used to calculate the Roe fluxes at the interface:
(28) 
Here, are the characteristics of the solution, and
(29) 
are the characteristic variables, determined by projecting the differences in the initial left and right states, , onto the left eigenvectors. and are fluxes calculated with the left and right input states using Equation 11. Expressions for the average state, , as well as the eigenvalues and eigenvectors are given in Roe (1981) and Appendix B. The matrix is not actually needed in the calculation.
As pointed out by Einfeldt et al. (1991), there are certain Riemann problems that will cause any linearized solver to fail. In these cases, the linearized solution to the Riemann problem results in negative densities or pressures in the intermediate state calculated between the left and right input states. Because this intermediate state is used to calculate the fluxes returned by the solver, these unphysical solutions may lead to numerical pathologies. A failsafe is needed to deal with the case where the Roe solver produces negative pressures or densities. Following the method of Stone et al. (2008), we check the intermediate densities and pressures before returning the fluxes calculated with the Roe solver. Should any of them be negative, we revert to using the simpler HLLE Riemann solver, described below.
2.3.3 The HLLE Solver
The HLLE solver is a modification of the HLL solver first described by Harten et al. (1983) and later modified by Einfeldt (1988). Although the method is extremely diffusive for contact discontinuities, as demonstrated by Einfeldt et al. (1991) the HLLE solver is guaranteed to be positively conservative (that is, the density and internal energy remain positive). The HLLE solver calculates the interface flux using an average of the left and right state fluxes, together with bounding speeds comprising the largest and smallest physical signal velocities in the solution to the exact Riemann problem. If the Roe solver produces negative densities or pressures, we replace the Roe fluxes with a new numerical flux
(30) 
The fluxes and , and the slopes are calculated as in the Roe solver. The signal velocities and are calculated using the largest and smallest eigenvalues of the Roe matrix as described in Appendix B. Because the HLLE solver quickly allows contact discontinuities to diffuse, we do not use it as a standalone Riemann solver in Cholla.
3. Code Architecture
Cholla is a gridbased hydrodynamics code that takes advantage of the massively parallel computing power of GPUs. In order to harness this power, Cholla was designed with the operation of the GPU in mind. In this section, we describe the overall structure of Cholla, including optimization strategies necessary to benefit from the parallel architecture of GPUs. As is standard in GPU programming, we will use the term “host” to refer to the CPU, and “device” to refer to the GPU.
Cholla consists of a set of C/C++ routines that run on the host plus functions called kernels that execute on a device. The device kernels and the host functions that call them are written in CUDA C, an extension to the C language introduced by NVIDIA^{2}^{2}2http://developer.nvidia.com. All of the CUDA functions are contained in a separate hydro module so that they can be compiled independently with the NVIDIA nvcc compiler. In addition, we have written a C/C++ version of the hydro module that performs the same calculations as all of the GPU kernels, so it is possible to run Cholla without using graphics cards. We use this mode for testing, but it is not recommended for performance since the structure of the code is optimized for use with GPUs.
3.1. Simulation Overview
Before detailing each piece of the code, we give a general overview of the steps followed by Cholla when a simulation is run. Given the power of a single GPU, small problems can easily be run on a single host/device pair. For large problems, Cholla can be run using the MPI library, and we describe our MPI implementation in Section 3.7. If MPI is enabled, the simulation volume will be split into subvolumes according to the number of processes. Each subvolume will then be treated as a selfcontained simulation volume for the duration of each simulation time step. The main difference between an MPI and nonMPI simulation is the method for applying boundary conditions at the end of each time step; we describe that method in Section 3.7.
Portions of our algorithm that require information from potentially distant cells in the global simulation volume must be carried out on the host. The main host functions set initial conditions, apply boundary conditions, and perform any interprocess communications. Parts of the calculation that only require information from nearby cells can be carried out on the device. Because the bulk of the computational work resides in the CTU calculation that requires a stencil containing only local cells, essentially all of the hydrodynamical computations are performed on the GPU.
The steps in the Cholla algorithm are listed below and illustrated in Figure 3.

Initialize the simulation by setting the values of the conserved fluid quantities for all cells in the simulation volume, and calculate the first time step.

Transfer the array of conserved variables to the GPU. The conserved variable array contains the values of each conserved quantity for every cell in the simulation volume.

Perform the CTU calculation on the GPU, including updating the conserved variable array and computing the next time step.

Transfer the updated conserved variable array back to the CPU.

Apply the boundary conditions. When running an MPI simulation, this step may require interprocess communication to exchange information for cells at the edges of subvolumes.

Output simulation data if desired.
The initialization of the simulation is carried out on the host. The initialization includes setting the values of the conserved variables for both the real and the ghost cells according to the conditions specified in a text input file. Ghost cells are a buffer of cells added to the boundaries of a simulation volume to calculate fluxes for real cells near the edges. The number of ghost cells reflects the size of the local stencil used to perform fluid reconstruction. Because updating the ghost cells at each time step may require information from cells that are not local in memory, the values of the ghost cells are set on the host before transferring data to the GPU.
Once the simulation volume has been initialized on the CPU, the hydrodynamical calculation begins. The host copies the conserved variable array onto the device. Because the GPU has less memory than the CPU, the conserved variable array associated with a single CPU may be too large to fit into the GPU memory at once. If so, Cholla uses a series of splitting routines described in Section 3.6 to copy smaller pieces of the simulation onto the GPU and carries out the hydrodynamics calculations on each subvolume. At the end of the hydro calculation the next time step is calculated on the device using a GPUaccelerated parallel reduction. The updated conserved variables and new time step are then transferred back to the host. The host updates the values of the ghost cells using the newly calculated values of the real cells, and Steps 2  5 repeat until the desired final simulation time is reached.
After each time step the values of the ghost cells are reset using the newly updated values of the conserved variables. Cholla includes three standard boundary conditions: periodic, reflective, and transmissive. These can be set in any combination on any of the borders of the simulation volume. For periodic and transmissive boundaries, the conserved variable values of each ghost cell are copied from the appropriate real cell. For reflective boundaries we follow the same process but reverse the sign of the perpendicular component of momentum. Cholla also includes the capability to define custom boundary conditions, such as the analytic boundaries specified in the Noh Strong Shock test (see Section 4.3.1). In a simulation performed using MPI communication, any necessary boundary regions are exchanged between relevant processes as described in Section 3.7.
3.2. Memory Structure
The data for a simulation in Cholla are contained in two structures. A header stores information about the size and shape of the domain, as well as global variables including the simulation time. A second structure contains the values of the conserved variables for each cell in the simulation. In an object oriented programming model, these values would often be stored in memory as an array of structures,
where is the total number of cells in the grid. In a CPUbased simulation code, this configuration can improve the performance of memory accesses.
The object oriented model is intuitive, but the memory structure is not efficient when implemented on the GPU. On NVIDIA GPUs, calculations are performed simultaneously by thousands of individual computational elements called cores, analogous to but individually much less powerful than a typical CPU core. The set of instructions carried out on a single GPU core is called a thread. The efficiency of the GPU comes in part from its ability to efficiently schedule the execution of millions of threads requested in a single kernel call. The scheduling is organized by streaming multiprocessors on the device that schedule threads for execution in groups called warps. Each thread warp performs a given set of operations simultaneously in the execution model often referred to as Single Instruction Multiple Data (SIMD). Given that data operations across cores on the GPU are rapidly executed in a massively parallel manner via the SIMD approach, hardware timescales such as the GPU global memory access time can represent a considerable fraction of the total computational expense of a calculation. Techniques to reduce the expense of global memory accesses include the organization of data needed by each thread warp into adjacent regions in physical memory. To facilitate this advantageous data locality, Cholla organizes conserved variables into a structure of arrays:
The thread warps can retrieve the conserved quantities within this structure of arrays in global memory with unit stride in memory accesses, reducing collisions in the access pattern.
The process of initiating a data transfer from the host to the device involves an associated computational overhead. Limiting the number of transfers required by the algorithm mitigates this overhead, as hardware latency may cause many small transfers to take longer than one large transfer. The allocation of a single structure containing the conserved variable arrays ensures a contiguous data layout in memory, and limits the required data transfers to two (single transfers to and from the device).
3.3. The GPU Grid
Once the simulation volume has been initialized on the host and the values of the ghost cells have been set, the array of conserved variables is transferred to global memory on the GPU. When kernels are then executed on the device, the GPU launches a grid of thread blocks. The GPU grid and thread block dimensions are set by the programmer and are application dependent. Cholla typically uses one or two dimensional grids of one dimensional thread blocks; the latter arrangement is illustrated in Figure 4. We emphasize that the dimensions of the GPU grid are not constrained to match the dimensions of the simulation, as the location of a cell in the simulation volume can always be mapped to a unique index within the GPU grid of thread blocks.
The dimensions of the GPU grid can affect the efficiency with which the device performs calculations and dictate the mapping from a realspace cell index to a thread index. To define the thread index, the CUDA programming model includes builtin data elements that return specific values for each thread, as shown in the following pseudocode:
Here, threadId returns the ID of the thread within the block, blockIdx returns the ID of the thread block within the grid, and blockDim returns the dimensions of the block. By combining these predefined quantities, a unique global index can be calculated for each thread. This pseudocode assumes a one dimensional grid of one dimensional blocks, but could easily be adjusted to create an equivalent mapping for two or three dimensional blocks or grids. We use the thread index to assign each thread the work of computing the conserved variable update for a single cell in the simulation. For Cholla, we choose a one dimensional block of threads because most of the kernels are one dimensional in nature. The PPM reconstruction, for example, requires only a one dimensional stencil and is carried out separately for the , , and interfaces. Because a new GPU grid with different dimensions can be initiated every time a device function is called, the thread index calculation and subsequent mapping to a real cell index must be performed within each GPU kernel. No data needs to be transferred back to the CPU between kernels, as all information needed between kernels is stored in the GPU global memory.
3.4. The GPU Kernels
Cholla leverages a modular design that enables an easy selection of the reconstruction method or Riemann solver, and facilitates the incorporation of new features. Each reconstruction method and Riemann solver is performed through an associated kernel executed by the GPU. A routine implementing the CTU algorithm calls these kernels through a wrapper function that segregates CUDA calls from the rest of the code. This organization allows for a flexible compilation structure in which nonCUDA code (including MPI calls) can be compiled with standard C compilers. Within the CUDA wrapper for the CTU algorithm, the following steps are followed:

Allocate arrays within GPU global memory to hold the conserved variables , the initial timeevolved boundary values , the initial onedimensional fluxes , the transversefluxevolved boundary values , and the CTU fluxes .

Transfer the conserved variable data from the host to the device and store in the newly allocated arrays in GPU global memory.

Call the reconstruction kernel for each dimension.

Call the Riemann solver kernel for each dimension.

Call the kernel to perform the transverse flux update (Equation 23).

Call the Riemann solver kernel for each dimension again.

Call the kernel to update the conserved variables and calculate the next time step (Equation 24).

Transfer the conserved variable arrays back to the CPU.
Step 1 involves the allocation of memory on the GPU, and it should be noted that the global GPU memory available is typically small compared with the CPU memory. Depending on the device, the simulation size, the number of MPI processes, and the domain decomposition, each process’s conserved variable data may exceed the available GPU memory. An excess may occur even if the local grid governed by each process is small (e.g., ). When necessary, Cholla uses a set of splitting routines to divide the simulation volume into more manageable subvolumes that are then treated according to the steps listed above. Section 3.6 describes these splitting routines in more detail.
The GPU kernel calls in Steps 37 resemble traditional C function calls, but kernels are implemented with additional variables that establish the dimensions of the grid of thread blocks launched by the GPU. For example, the syntax for calling Cholla’s PPM reconstruction function is:
PPM_reconstruction<<<BlocksPerGrid,  
where the triple chevron syntax, <<<,>>>, informs the CUDAenabled compiler that this function should be executed on the GPU device. Since the amount of data processed at once by the GPU is limited by its available global memory, BlocksPerGrid can always be set large enough to assign a thread to each cell.
Separate kernels carry out different parts of the CTU algorithm, but each kernel shares common elements. Every kernel must begin with a calculation of the index of each thread, as described in Section 3.3. Using the appropriate mapping, the index of each thread of the kernel can be translated to a unique realspace cell index in the simulation volume. The threads within the kernel then retrieve necessary cell data from the GPU global memory. For the reconstruction function, these data would include the values of the conserved variables for the cell assigned to that thread, as well as those of the nearby cells within the reconstruction stencil. Once the data have been retrieved, the threads carry out any relevant calculations, and load the result into the relevant GPU global memory array. Once all of the threads have finished their calculations the kernel returns, and the process continues through each of the steps listed above.
3.5. Time Step Calculation
The implementation of most kernels described in the previous section (reconstruction, Riemann solver, and transverse flux update) follows closely the descriptions of the CTU calculation given in Section 2. However, the final conserved variable update kernel in each iteration of the algorithm is extended to include the calculation of the next simulation time step via a parallel reduction operation performed on the GPU. Reductions on the GPU are a commonly used process, and examples of this operation can be found in e.g., the CUDA toolkit^{3}^{3}3https://developer.nvidia.com/cudatoolkit. We include a brief explanation of our implementation of the parallel reduction operation here as a concrete example of the advantage of moving a given function from the CPU to the GPU.
Thread blocks on the GPU have a limited amount of “shared memory” that each thread in the block can access rapidly, as illustrated in Figure 4. At the end of the conserved variable computation, the updated conserved variable data for each cell are stored in the private register memory assigned to each thread. The updated values of the conserved variables are used by each thread to calculate the minimum time step associated with its cell, according to Equation 22. The individually calculated time steps are then loaded into an array the size of the thread block in shared memory  note that each thread block has its own array. The threads in the block then perform a treebased parallel reduction on the time step array, finding the minimum time step for the entire block. This minimum value is uploaded into an array in the GPU global memory, and is then passed back to the CPU, where the final reduction is performed.
Calculating the time step on the GPU achieves a performance gain relative to a CPU since executing a large number of floating point operations is extremely efficient on the GPU. The shared memory reduction on the GPU reduces the number of loops needed on the CPU by a factor of ThreadsPerBlock (typically set to for an NVIDIA Kepler K20X GPU). For reference, we find the parallel GPU reduction time step calculation for a simulation can achieve a performance gain relative to a single CPU core depending on the architecture ( vs. ).
3.6. Subgrid Splitting
As mentioned in previous sections, the total amount of memory on a single device may be quite limited when compared to the memory available on the host. The memory footprint on the GPU for each cell in the simulation volume is of order kilobytes, including conserved variables, interface values, and fluxes. At present, a typical GPU has only a few gigabytes of global memory, though this number has been increasing with each new generation of devices. Therefore, current devices can typically only hold the information for a 3D hydrodynamical simulation of size . In Cholla, slightly more cells can fit for 1D and 2D simulations owing to the reduced number of transverse interface states and fluxes that must be stored. The limited memory resources often require that the simulation volume associated with a local process may need to be successively subdivided to fit on a GPU. We term this subdivision process “subgrid splitting”. Similar methods have been used in CPUbased codes such as HERACLES (González et al., 2007).
In practice, subgrid splitting is typically only needed for multidimensional simulations or 1D simulations with millions of cells. A description of the 1D subgrid splitting is provided below as a straightforward example. First, the size of the simulation volume that will fit on the GPU at once given the global memory available is calculated and stored in a variable, e.g. MaxVol. The local volume governed by each local process is further split into subvolumes of size less than or equal to MaxVol. We refer to these subvolumes as “subgrid blocks”. The CTU calculations for each subgrid block are performed on the GPU sequentially, including any necessary ghost cells from nearby subvolumes. Memory buffers on the CPU are used to avoid overwriting grid cells that act as ghost cells for neighboring subgrid blocks. The procedure of copying data to the GPU, calculating, and transferring data back to the CPU is repeated until the hydro step for the entire simulation volume has completed. Because the conserved variables for the simulation are contiguous in memory on the host, copying them into buffers via memcpy contributes a negligible amount to the total simulation run time.
To illustrate the subgrid blocking method, Figure 5 displays a twodimensional grid with an example subgrid blocking by a factor of four. Each of the four subgrid blocks (red, green, blue, purple) require real cells from adjacent subgrid blocks to act as subgrid ghost cells to form the full computational stencil for the fluid reconstruction and CTU calculations (indicated by colored dashed lines). Since these subgrid blocks also abut either local simulation boundaries between local processes or global boundaries at the edges of the illustrated region, they also require standard ghost cells (gray regions) to complete their computational stencils. The subgrid block regions outlined by the dashed lines are transferred sequentially to the GPU for the CTU calculation, taking care to preserve in memory the subgrid ghost cell boundary regions between subgrid blocks until the entire local volume has been processed. The standard ghost cells are updated after the CTU calculation, since they depend either on communication between MPI processes or the global boundary conditions of the simulation. Note that Figure 5 illustrates a very small grid for convenience, and the actual subgrid regions of a 2D simulation would be orders of magnitude larger.
We performed a variety of tests using subgrid blocks of different shapes in order to determine a method of division that helps minimize GPU communication overhead. For 2D simulations, splitting the volume into approximately square regions works well. For 3D simulations, we find that maintaining an aspect ratio that is approximately even in the and directions with a longer aspect ratio in the direction works well. In practice, we keep the aspect ratio even in and while maintaining an aspect ratio in that is roughly 5 times the geometric mean of the  and aspect ratios. Memory access overheads can be reduced by first copying multiple subgrid blocks into buffers on the CPU, and then transferring subarrays containing individual subgrid blocks to the GPU for computation. Even in simulations where local volumes must be subdivided into subgrid blocks dozens of times the overhead associated with copying the conserved variables into buffers on the CPU is insignificant, typically limited to of the total time taken for the CTU calculation.
Transferring data to and from the GPU at each time step is time consuming, often taking of the entire GPU computation time for the hydro module. Therefore, strategies that reduce the fraction of the simulation volume transferred at each time step are desirable. For example, simulations that do not require subgrid splitting might achieve a performance boost by only transferring ghost cells that need to be updated via MPI. Such a strategy is beyond the scope of the current work, but is certainly worth exploring in future versions of Cholla.
3.7. MPI Implementation and Scaling
The massively parallel algorithm implemented by Cholla can be adapted to execute on multiple GPUs simultaneously. Cholla can thereby gain a multiplex advantage beyond the significant computation power afforded by a single GPU. The parallelization is implemented using the MPI library. The global simulation volume is decomposed into subvolumes, and the subvolumes are each assigned a single MPI process. In Cholla, each MPI process runs on a single CPU that has a single associated GPU, such that the number of MPI processes, CPUs, and GPUs are always equal. When the simulation volume is initialized, each process is assigned its simulation subvolume and surrounding ghost cells. Since the hydrodynamical calculation for every cell is localized to a finite stencil, only the ghost cells on the boundary of the volume may require updating from other processes via MPI communication every time step. Compared with a simulation done on a single CPU/GPU pair, additional overheads for a multiprocess simulation can therefore include MPI communications needed to exchange information at boundaries and potential inefficiencies in the GPU computation introduced by the domain decomposition. While domain decomposition influences communications overheads in all MPIparallelized codes by changing the surface areatovolume ratio of computational subvolumes, domain decomposition additionally affects the performance of a GPUaccelerated code by changing the ratio of ghost to real cells in memory that must be transferred to the GPU. Since memory transfers from the CPU to the GPU involve considerable overhead, domain decompositions that limit the fraction of ghost cells on a local process are favorable. Cholla therefore allows for two different domain decompositions, described below.
3.7.1 Slab Decomposition
Following the domain decomposition utilized by the Fastest Fourier Transform in the West discrete Fourier transform library (FFTW; Frigo & Johnson, 2005), Cholla can use a slabbased decomposition in which the simulation volume is sliced only in one dimension. In the slab decomposition a maximum of two boundaries may be shared between processes, and because there are limited communications the slab decomposition proves efficient for simulations run with a small number of processes. With the addition of more processes the slabs grow narrower, the ratio of boundary ghost cells to real cells for each subvolume increases rapidly, and the time required to exchange boundary cells between processes remains nearly constant. Although these features cause a computational inefficiency that continues to degrade with increasing numbers of processes, Cholla nonetheless includes an optional slab decomposition for use with limited processes and in conjunction with FFTW.
The division of the simulation volume for Cholla’s slab decomposition is straightforward. When the FFTW library slab decomposition is used, the slab width on each process is optimized for accelerating discrete Fourier transform computations. Otherwise, the number of cells in the dimension spanning the total simulation volume is divided evenly across the number of processes, and any remaining cells are split as evenly among the processes as possible. Once the domains have been assigned, each process initializes the real cells associated with its volume and exchanges boundary cells. First, each process posts a receive request for each MPI boundary. If the process has a global simulation boundary along the direction, it posts either one receive request in the case of reflective, transmissive, or analytic global boundary conditions, or two in the case of global periodic boundary conditions. Processes that are surrounded by other processes will always have two MPI boundaries. The processes then send the initialized values of the real cells from their subvolume that are needed by other processes. While waiting for the cell exchange communications to complete, each process computes the cell values on its nonMPI boundaries (typically the  and boundaries). This asynchronous ordering of communication and boundary computation minimizes the amount of time the CPU must sit idle while waiting to receive boundary cell information. Once all the receives have completed, each process proceeds through the CTU step as though it were an independent simulation volume. At the end of each time step, boundary cells must again be exchanged along with the information from each local subvolume required to determine the global simulation time step.
3.7.2 Block Decomposition
In addition to begin computationally inefficient, a slab decomposition limits the total number of processes that can be run as a result of the finite dimensions of the simulation volume. To improve upon both these factors, Cholla includes a block decomposition that seeks to minimize the aspect ratios of the simulation subvolume evolved by each process. For a block decomposition, up to six MPI communications for cell exchanges may be required per time step. Despite this increased number of communications, the reduction of the surface areatovolume ratio of the block decomposition improves its efficiency beyond that achieved by a slab decomposition for large numbers of processes.
In the case of a block decomposition the , , and dimensions for each subvolume are kept as even as possible. Once a simulation volume is appropriately divided and local real cells initialized by each process, boundary cells between subvolumes must be exchanged. To keep the number of MPI communications to a maximum of six, the processes exchange boundary cells in a specific order. First, each process posts a receive request for any MPI boundaries on an face. While waiting for those data to arrive, the processes set ghost cells on any nonMPI faces. Once the boundaries arrive, the processes post receive requests for MPI boundaries on faces, including the corner regions just transferred in the exchange along the direction. While waiting for the boundaries to arrive, each process computes the ghost cell values along nonMPI boundaries. By first sending the boundaries, processes can receive information needed for the boundary exchange from diagonallyadjacent processes without directly exchanging information with those processes. The same procedure is followed for the boundaries.
Figure 6 illustrates this process for a 2D simulation with periodic boundaries and a fourprocess decomposition. Each process first initializes its real cells, represented by the large colored squares. To perform the hydrodynamical simulation timestep in parallel across separate processes, each process must receive boundary ghost cells from the real cells hosted by surrounding processes. The boundary regions for each process are outlined in Figure 6 and labelled 1 and 2. The processes first exchange boundary information (region 1) via two MPI communications. Once the boundary exchange is complete, boundary information (region 2) is exchanged, including the corner regions received from other processes. The same procedure is repeated for boundaries in a 3D simulation. Following this pattern keeps the required number of MPI communications to a maximum of six, instead of the potential twentysix communications that would be required to separately exchange each face, edge, and corner boundary region for an interior subvolume in a 3D simulation. When using the block decomposition with a large number of GPUs, the MPI communications typically comprise only a few percent of the total computation time.
We note briefly that the block decomposition implemented in Cholla may also be adapted to enable the use of Fast Fourier Transform libraries that use a block decomposition, such as the Parallel FFT package written by Steve Plimpton^{4}^{4}4http://www.sandia.gov/ sjplimp/docs/fft/README.html.
3.7.3 Scaling
The scalability of the Cholla MPI implementation to more than one GPU warrants a discussion. To study the scaling of the code, the GPUaccelerated International Business Machines iDataplex cluster El Gato at the University of Arizona was used. Using El Gato, we have tested both the strong and weak scaling of Cholla using up to 64 GPUs. The results are shown in Figures 7 and 8. For both scaling tests, a threedimensional sound wave perturbation problem with periodic boundary conditions and a block decomposition is used to maximize the number of MPI communications required per timestep. In both the strong and weak scaling tests, Cholla updates an average of cells per second using the NVIDIA Kepler K20X GPUs available on El Gato. The 3D sound wave perturbation requires work to be done by every cell and uses thirdorder spatial reconstruction and an exact Riemann solver. The test is therefore relatively inefficient. By contrast, on a 2D sound wave test using secondorder spatial reconstruction and a Roe solver, Cholla updates an average of cells per second. All tests have been performed using double precision.
For the strong scaling test, a grid is evolved for time steps. The timing results for the total test runtime, the CTU algorithm (performed on the GPU), and the boundary computation including ghost cell exchange communication are tracked separately. We exclude the simulation initialization from the runtime, as it comprises a significant fraction of the runtime for these short tests and obscures the results. As Figure 7 shows, the overall scaling of Cholla is close to ideal, with the CTU step scaling slightly better than ideal beyond 8 processes. The increased efficiency at 16 processes or more owes to the decomposition decreasing the cells per process below the number that necessitates subgrid splitting, thereby reducing the CPUGPU communications overhead. All of the GPU calculations contained within the CTU step scale better than ideal at 64 processes. That the boundary condition computation does not scale as well primarily owes to the reduced number of MPI communications needed in runs with small numbers of processes compared with tests utilizing large numbers of processes where every subvolume boundary requires an MPI communication per timestep. The scaling between an 8 process run and a 64 process run, both of which require MPI communications for all boundaries, is close to ideal. We achieve an effective bandwidth for the MPI ghost cell exchange of 2.4 gigabits per second in all runs.
The weak scaling performance of Cholla is shown in Figure 8. A double precision sound wave perturbation with periodic boundaries is again used, but in this test the size of the total computational volume is rescaled with the number of processes to keep the number of cells per process approximately constant at (note that each process uses its own distinct GPU during the simulation). The test is run for 10 time steps. The total runtime efficiency as a function of the number of processes remains roughly constant beyond a single process. The CTU algorithm comprises the majority of the computational cost of each timestep, and exhibits nearly perfect weak scaling. The boundary condition calculation for the serial case does not involve MPI communications and is correspondingly inexpensive when using a single process. With two or more processes, MPI communications induce an additional overhead beyond the single process case. However, the weak scaling of the boundary conditions is reasonably maintained to 64 processes.
4. Tests
A large variety of hydrodynamics tests exist in the literature, some of which have been used for several decades (e.g. Sod, 1978). Their ubiquity makes these canonical tests an excellent way to compare the performance of Cholla with other codes. In addition, many tests have been designed to explicitly show the failings of hydrodynamics solvers in various environments or highlight the circumstances where they perform exceptionally well. In choosing the tests shown below, we attempt to demonstrate the breadth of problems Cholla can simulate. We also demonstrate the effects of changing reconstruction methods or Riemann solvers, and show differences in the outcomes of tests where they are relevant. If not otherwise specified, the following tests were performed using piecewise parabolic reconstruction with the characteristic variables (PPMC) and an exact Riemann solver. All tests were performed in double precision on GPUs.
Before delving into the specifics of each test, we make a note about the convergence rate of Cholla . Many shockcapturing methods revert to first order at shocks (see, e.g., Laney, 1998), so to test the convergence rate of Cholla the smooth perturbation test described in Stone et al. (2008) is employed. Both the PPMP and PPMC implementations in Cholla demonstrate secondorder convergence in the L1 error norm out to grid resolutions of cells.
4.1. 1D Hydrodynamics
4.1.1 Sod Shock Tube
The Sod problem (Sod, 1978) is often the first test performed by hydrodynamics codes, and we do not diverge from precedent here. Though the Sod problem is not a difficult test to run, the solution contains several important fluid features. We present the test here as an example of the ability of Cholla to resolve both shocks and contact discontinuities within a narrow region of just a few zones. The initial conditions are simply a Riemann problem, with density and pressure on the left, on the right, and an initial velocity . The initial discontinuity is at position . For this and all of the following one dimensional tests, orthogonal velocities are set to zero. We use an ideal gas equation of state with for all tests, unless otherwise noted.
The test is computed on a grid of 100 cells until a final time of . By that time a shock, a contact discontinuity, and a rarefaction fan have formed and spread enough to be clearly visible as seen in Figures 9 and 10. As described in Section 2.2, Cholla has two versions of piecewise parabolic interface reconstruction. PPMP follows the FLASH code documentation (Fryxell et al., 2000) and includes contact discontinuity steepening and shock flattening, while PPMC is based on the Athena code documentation (Stone et al., 2008) and reconstructs the interface values using characteristics without explicit steeping or flattening. As can be seen in the density plot, the contact discontinuity is resolved over just two zones using PPMP, and over three to four zones using PPMC. Because of its the explicit treatment of contacts, the PPMP method is slightly better at resolving contact discontinuities, but is also more susceptible to nonphysical oscillations as demonstrated in later tests.
4.1.2 Strong Shock Test
The strong shock test (Fryxell et al., 2000) resembles the Sod shock tube, but is more discriminating owing to the much more severe differences between the left and right initial states. This test starts with an initial discontinuity at , with left and right densities and . Initial pressures are and . The initial velocities are set to zero, as in the Sod test. The problem is calculated on a grid of 100 cells, and the resulting density in the numerical solution using both PPMP and PPMC is shown at time in Figure 11.
As can be seen in Figure 11, both PPMP and PPMC do a decent job reproducing the exact solution on this difficult problem. However, the differences between the two reconstruction methods have more discernible effects in this test. The contact discontinuity at is better resolved with PPMP, but the solution is more oscillatory in the region between the contact discontinuity and the tail of the rarefaction fan. In constructing the linear slopes across interfaces, both PPMP and PPMC use limiters that are designed to be total variation diminishing (TVD). However, the thirdorder reconstruction leads to added complications (Colella & Woodward, 1984). Despite attempts to preserve monotonicity (see Appendix A), problems with strong shocks are observed to cause oscillations in both methods. Due to its more diffusive nature, we find that PPMC is less susceptible to oscillations in regions with strong density and pressure contrasts. The inclusion of contact discontinuity steepening in PPMP keeps contacts sharp but tends to exacerbate the oscillations. The discontinuity detection relies on a number of heuristically determined constants, and the resulting slopes are not always TVD. The oscillations present in the upper panel of Figure 11 can be significantly reduced by lowering the value of the constant that determines whether a zone contains a density discontinuity or a shock (see Equation A36). This constant is labeled “” in Colella & Woodward 1984, not to be confused with “”, the coefficient used in their artificial dissipation scheme. When the discontinuity detection in PPMP is turned off entirely (equivalent to setting ) the two methods produce very similar results on the strong shock test.
4.1.3 Strong Rarefaction Test
The strong rarefaction test, or 123 problem, was originally used by Einfeldt et al. (1991) to illustrate a scenario that causes a subset of approximate Riemann solvers to fail. Because the solution contains a region where the energy is largely kinetic and the pressure is close to vacuum, the Roe solver (or any other linearized solver) will produce negative densities or pressures in the numerical solution (Einfeldt et al., 1991). The initial conditions consist of a fluid with constant density and pressure but opposite receding velocity at the center. Specifically, we set , , , and . In Figure 12, we show the results of this test on a grid of 128 cells at time using PPMP reconstruction and the exact solver. This test is not a challenge using the exact solver, but without modification the Roe solver would fail on this problem. For this reason, we test the density and pressure produced in the solution by the Roe solver and revert to the HLLE solver if necessary. With that fix we can run the problem with either solver, and in fact the HLLE fluxes are only needed on the first step of the simulation.
4.1.4 Shu and Osher Shocktube
The ShuOsher shocktube test shows the tendency of PPM to cut off maxima in smoothly varying problems as a result of the slope limiters imposed in the reconstruction method (Shu & Osher, 1989). The test consists of a strong shockwave propagating into a region with a sinusoidally varying density. The initial conditions are , , and ; , , and . We run the problem on the domain with the initial discontinuity at . The results of the test using both 200 cells and 800 cells are shown in Figure 13. As can be seen, the low resolution solution does lose some of the amplitude of the peaks. Using newer versions of the limiting functions can help alleviate this problem (Colella & Sekora, 2008; Stone et al., 2008), although we have not yet implemented these limiters in Cholla.
4.1.5 Interacting Blast Waves
Originally described in Colella & Woodward (1984), the interacting blast wave test helps quantify the behavior of a code near strong shocks and contact discontinuities. The test consists of a medium with initially constant density , with on the domain . Reflecting boundary conditions are used. Two shocks are initialized on either side of the domain, with for , for , and in between. The problem is run until time , at which point the shocks and rarefactions in the initial solution have interacted multiple times.
We show plots of the density computed with both PPMP and PPMC in Figure 14. We plot a low resolution solution with 400 grid cells over a high resolution reference solution computed with 9600 cells. As can be seen in the figure, PPMP does an excellent job keeping the contact discontinuities at and contained within just two zones, as compared to the solution computed with PPMC in which the contacts are smeared over many cells. In addition, PPMC tends to more severely cut off the maximum at , while PPMP does a decent job of keeping the full height although the peak is slightly offset. Both reconstruction techniques do a good job reproducing the shocks and the rarefaction fan between and .
4.2. 2D Hydrodynamics
4.2.1 Implosion Test
The implosion test is a converging shock problem first presented in Hui et al. (1999). The version presented here is described in Liska & Wendroff (2003), and begins with a square region of high density and pressure containing a diamondshaped region of low density and pressure. These initial conditions evoke the traditional Sod shock tube problem extended to two dimensions, but inclined to the grid by 45 degrees rather than aligned as in the one dimensional case. As the test begins material moves inward rapidly toward the center, leading to an implosion. When run for a short amount of time, this test demonstrates the ability of a code to resolve contact discontinuities and other fluid features for a nongrid aligned shock tube. When run for enough time to evolve well past the initial shock tube solution, the test illustrates the symmetry (or lack thereof) of a code.
Figure 15 shows the results of the implosion test run with PPMC and an exact solver at an early time and a later time . The problem was run on a grid with a domain , and reflecting boundary conditions at every boundary, comprising the upper right quadrant of the axisymmetric test described above. The initial density and pressure within the diamondshaped region are and , while outside the density and pressure are and . Initial velocities are zero, as in the Sod test. A discontinuous interface is located along the diagonal running from to . In the upper panel of Figure 15 a rarefaction fan can be seen expanding outward from this interface. As the upper panel shows, Cholla does an excellent job resolving the contact at early times, as can be seen along the diagonal from to .
At the later time a jet has appeared in the solution. The production of the jet is a direct result of ability to preserve symmetry in the Cholla solution to numerical accuracy. Liska & Wendroff (2003) demonstrated that codes that employ nonsymmetry preserving methods like Strang splitting may fail to produce the jetlike feature. The fact that this test is so sensitive to the symmetry of the problem makes it useful for diagnosing potential coding errors, but the test also demonstrates the extent to which a nonsymmetric algorithm can impact the physical accuracy of the result. As this test shows, relatively largescale features in the solution can be completely lost if a code fails to maintain a sufficient level of symmetric accuracy.
The bottom panel of Figure 15 shows the results of this test recomputed at a much higher resolution of . The same largescale features are apparent in the solution, but as expected the smallscale density perturbations and shape of the jet have clearly not converged. However, this high resolution test serves as further evidence of the ability of Cholla to preserve axisymmetry even in a very difficult problem. At this extreme resolution, the code must perform over time steps and more than cell updates to reach time . At that point, the results are still exactly symmetric (to floatingpoint precision), demonstrating that symmetry preservation in Cholla is a robust feature of the code.
4.2.2 Explosion Test
The explosion test, also from Liska & Wendroff (2003), is designed to test the evolution of an unstable contact discontinuity and is highly sensitive to numerical noise in the initial conditions. This noise seeds an instability that grows as the solution evolves. The test starts with a domain , that contains a circularly symmetric region of high density and pressure, with and inside a circle with radius . Reflecting inner boundaries and transmissive outer boundaries are used. Outside the circle the density and pressure are set to and . The initial velocities are zero. Because the problem is sensitive to initial perturbations at the interface, the density and pressure for cells are areaweighted at the boundary. For each cell on the boundary of the circle the percentage of the area inside the radial boundary is computed, and the initial cell data weighted appropriately.
The test problem is performed on a grid of cells, and Figure 16 shows the result of the calculation using PLMP and PPMP at . As expected, the higher order reconstruction method does a better job preserving the narrow structure of the contact, but is also more susceptible to structure along the interface as the instability develops. Thus, as the problem progresses, the lower order more diffusive method may result in a cleaner solution. We note that both methods preserve the exact symmetry of the problem, provided the initial conditions are symmetric.
4.2.3 KelvinHelmholtz Instability
A KelvinHelmholtz instability test demonstrates the extent to which a hydrodynamics code resolves mixing caused by shear flows. In this test, two fluids at different densities flow past each other and characteristic eddies appear and grow at the interface between the fluids. The growth of the eddies in the linear regime can be analytically described (Chandrasekhar, 1961) and depends on properties of the fluid and the interface itself. At a discontinuous interface, small eddies will develop first at the grid scale of the simulation, and these will gradually grow and combine into larger eddies as shown in Figure 17.
The exact nature of the instability depends sensitively on the resolution and initial conditions of the test (e.g., Robertson et al., 2010). The test shown in Figure 17 was run on a grid, with a domain , in order to maintain square cells. The simulation initial conditions include a dense fluid with density in the middle third of the box, surrounded by a less dense fluid with density in the outer thirds. The denser fluid has a velocity , and the less dense fluid has a velocity ; the velocities are initially . The entire simulation volume is initialized in pressure equilibrium with . A smallamplitude perturbation is added to the  and velocities of every cell in the grid, in proportion to the position following and . The simulation is evolved to , by which time the growth of the eddies has entered the nonlinear regime.
As the eddies grow, more mixing between the high density and low density material occurs. Resolving this mixing is an important task for a hydrodynamics code, as the amount of mixing can have a significant impact on broad features in the simulation outcome. The level of mixing tends to increase with resolution as well as with higher order reconstruction techniques. Therefore, KelvinHelmholtz instabilities highlight the importance of having an efficient high order reconstruction method and a fast code. As expected for a high resolution grid code with a high order reconstruction method, Cholla does an excellent job of resolving the shear mixing.
4.3. 3D Hydrodynamics
4.3.1 Noh’s Strong Shock
The Noh strong shock test, originally described in one dimension by Noh (1987), demonstrates how well a code can track a strong, high mach number shock. This test is considered difficult to perform in either two or three dimensions, as many hydrodynamics codes cannot run the test accurately and some fail completely (Liska & Wendroff, 2003). The test starts with a constant density of throughout the grid, with zero pressure and constant velocity toward the origin. For this test, the adiabatic index is set to . These initial conditions result in a formally infinite strength shock reflecting outward from the origin with spherical symmetry. Cholla cannot be run with zero pressure, so we set the initial pressure to a low number, , but we note that the results are relatively insensitive to the initial pressures below .
The Noh test is initialized in an octant on the domain with reflecting inner boundaries. The outer boundaries are set according to the analytic solution for the density and energy, which in two or three dimensions is
where is the radius in polar or spherical coordinates, and is the dimensionality of the problem. The momentum follows from the velocity and the solution for the density, and the total energy is set to
We evolve the solution to , by which time the shock has propagated through more than half of the computational domain. The density immediately in front of the shock as well as the density of the post shock gas can also be calculated analytically. In the 3D case, the gas immediately before the shock has a density of , and the postshock gas has a corresponding density of .
Running the Noh test on a Cartesian grid creates strong, gridaligned shocks that provoke a behavior in the numerical solution known as the carbuncle instability. The carbuncle instability arises as a result of oscillatory crossflow solutions to the Riemann problem near such shocks (Quirk, 1994). This problem is addressed by implementing a form of the H correction, as described in Sanders et al. (1998) and detailed in Appendix C. By incorporating information about the fastest transverse wave speeds, the H correction adds dissipation to the 1D fluxes calculated by the Roe Riemann solver that reduces the carbuncle strength.
The result of the Noh test using PPMC with and without the H correction can be seen in Figure 18. Without the H correction, the solution suffers from strong oscillatory behavior, particularly along the edges where the shock is aligned with the grid. In the version with the H correction applied, the unstable behavior along the axes is effectively absent. The region near the origin where the density dips down is a density error known as “wall heating” and is not related to the carbuncle instability. Wall heating is the feature that the 1D Noh test was originally designed to demonstrate. The slight noise along the shock front is a result of the strength of the shock, and is similar to the minor oscillations seen in the 1D strong shock test. We note that implementing the H correction increases the stencil required for CTU. Because the current version of Cholla is designed to accommodate a maximum of four ghost cells, we currently implement the H correction only with PPMC or lowerorder reconstruction methods.
5. New Results for Astrophysical Phenomena: ShockwaveISM Interactions
We now showcase the power of Cholla in an astrophysical context by simulating the interaction of supernova blast waves with clouds in the interstellar medium (ISM). Advancing theoretical understanding of the conditions in the ISM and the effects of supernova feedback is an active area of research (e.g. Agertz et al., 2013; Kim & Ostriker, 2014; Martizzi et al., 2014). Stellar feedback is thought to affect the evolution of galaxies on large scales by generating outflows and regulating star formation rates, but the scales on which this feedback couples to the ISM are unresolved in large cosmological simulations (Martin, 1999; Mac Low & Klessen, 2004). Using high resolution methods to constrain the physics of the ISM on subparsec scales is thus critical to improving the subgrid prescriptions applied in simulations of galaxy formation. In addition, simulating the ISM on smaller scales provides highresolution numerical results that can be compared to observations of gas within our galaxy.
The superior shockcapturing abilities of gridbased hydrodynamic codes enables the to serve as an important tool for simulating the types of highmach number shocks observed in starforming regions of the ISM. In addition, simulating the interaction of these shocks with gas clouds benefits from highorder spatial reconstruction techniques that accurately trace the hydrodynamic instabilities that develop in ISM gas. Fast, physically accurate codes like Cholla are therefore wellsuited for simulating problems like shockwaveISM interactions.
Theoretical work describing the interaction between shocks and gas clouds has a long history extending back at least to the calculations of McKee & Cowie (1975). In order to treat the problem analytically, these authors presented early computations of a high mach number, planar shock hitting a spherical cloud. Using this simple setup, the speed of the shock within the cloud, , can be calculated using only the density contrast between the cloud and the ambient medium, , and the speed of the shock in the ambient medium, , as follows,
(31) 
Klein et al. (1994) carried out a formative numerical study of the cloudshock problem, in which they defined a characteristic timescale for the evolution of the cloud. This “cloud crushing time”, , corresponds roughly to the internal shock crossing time, i.e. . Using Equation 31, the cloud crushing time can be related to the radius of the cloud, , the density contrast , and , via
(32) 
Using this characteristic timescale, various stages of the cloud’s evolution can be described. The mixing time of dense cloud gas and the ambient medium holds interest for both quantifying the impact of supernovae on their immediate environments and for the survival time of dense gas in galactic outflows. The shocked cloud experiences various hydrodynamic instabilities that cause its destruction, most importantly the KelvinHelmholtz instability (KHI). The long wavelength modes of the KHI tend to break the cloud apart, while the shorter wavelength modes mix cloud material with the surrounding medium. Using 2D adiabatic simulations with density contrasts in the range , Klein et al. (1994) demonstrated a spherical cloud is destroyed by largescale instabilities in , where the destruction time, , is defined as the time it takes for the mass in the core of the cloud to be reduced to a fraction of the initial cloud mass. Meanwhile, smallscale instabilities efficiently mix cloud material with the ambient medium in a time of order . These results were corroborated in 3D simulations by Xu & Stone (1995).
Subsequent work on the cloudshock problem has examined how additional physics such as magnetic fields, radiative cooling, selfgravity, and thermal conduction affect the cloud’s evolution (e.g. Mac Low et al., 1994; Fragile et al., 2005; Orlando et al., 2005; Melioli et al., 2006; Shin et al., 2008). These studies have produced many useful results, ranging from the stabilizing effects of magnetic fields in certain configurations to the structural properties of the cloud necessary for gravitational collapse. However, still missing from the literature is an attempt to connect the evolution of clouds with realistic density structures to the analytic theory derived for spheres. Almost all studies of the cloudshock problem have investigated only spherical or elliptical overdensities, despite the approximately lognormal density distributions of ISM clouds (Padoan & Nordlund, 2002). One exception is the work of Nakamura et al. (2006), which showed that clouds with a steeply tapering density profiles were destroyed and mixed more quickly than those with a shallow density gradient. Another is the study by Cooper et al. (2009), which included a simulation of a cloud with a fractal density distribution. However, that work focused primarily on the longterm survival of radiatively cooling cloud fragments in a galaxyscale hot wind and did not attempt to produce an analytic timescale for cloud destruction or mixing for the fractal cloud case. A numerical study evaluating the evolution of a cloud with a realistic density distribution in the context of the analytic timescales defined by Klein et al. (1994) remains to be performed.
In this section, we apply Cholla to a preliminary study of this problem. Our goal is to determine whether a realistic cloud with a given mean density is mixed with the ISM on a timescale comparable to a spherical cloud with the same mass and mean density. If not, we wish to adapt the analytical framework of Klein et al. (1994) to more realistic clouds to characterize their destruction process. To address these issues we carry out a series of highresolution, 3D hydrodynamic simulations comparing clouds with spherical density distributions to clouds with more realistic density distributions created using a Mach turbulence simulation, and we devise a simple alteration to the Klein et al. (1994) formalism that enables and analytical description of the cloud evolution for both spherical and realistic density distributions.
5.1. The Simulations
We run three sets of simulations with low, medium, and high mean density contrasts, as listed in Table 2. Each simulation is run in a box with side lengths , corresponding to a resolution of . In each simulation, a cloud is placed with its center at . In all simulations the cloud is initially at rest, and the temperature of the gas is set such that the cloud is in pressure equilibrium with the ambient medium. In the low density simulations, both the spherical cloud and the realistic cloud have a mean density that is 10 times the initial ambient density, . In the intermediatedensity simulations, , and in the highdensity simulations, . The realistic cloud consists of a spherical region excised from a Mach turbulence simulation (Robertson & Goldreich, 2012) and has a lognormal density distribution that is truncated at densities below that of the ambient medium. Higher density regions are scaled such that the mean density of the realistic cloud matches that of the spherical cloud. The highest density regions in the realistic clouds are three orders of magnitude above the ambient density, and the lowest temperatures are of order K. The radius of the spherical cloud in all simulations is pc, and is set such that the total mass in the spherical cloud matches that of the realistic cloud. The lowdensity clouds (both spherical and realistic) have an initial mass , the clouds in the intermediate simulation have an initial mass while the highdensity clouds have an initial mass . The initial cloud masses and mean densities include all material above the ambient density. Figure 19 shows projections of the initial conditions and several later snapshots for the intermediatedensity simulations.
We consider the interaction of small clouds with an old supernova blast wave, such that the radius of shock wave can be assumed to be infinite with respect to the cloud radius, and the shock can therefore be treated as planar. The simulation starts with a planar shock wave propagating upward through the box in the direction through an ambient medium with an initial number density of hydrogen atoms , and initial temperature of . Using the shock jump conditions in the strong shock limit (Zeldovich & Raizer, 1966), the postshock density, velocity, and pressure are given by
(33)  
where is the shock speed, is the Mach number of the shock, and is the sound speed in the interstellar medium. For the given initial ISM conditions and an adiabatic index , a Mach 50 shock travels at . The postshock density is , , and the postshock temperature is K. We set an inflowing boundary with the postshock quantities. All other boundaries are outflowing.
5.2. Results
Using Equation 32 we can calculate cloudcrushing times for the spherical clouds. For the the intermediatedensity simulation, , while for the low and highdensity simulations and , respectively. In order to analyze the mixing of the clouds in the following analysis, we define the cloud mass as the sum of material with a density , where is the postshock density of the ambient medium, which is for these simulations. As mentioned previously, Klein et al. (1994) defined a destruction time corresponding to the cloud breaking into large fragments. While this timescale is useful for the case of a spherical cloud, the realistic cloud contains many separate regions of high density, with no single welldefined core. Therefore, we investigate instead the mixing time , defined as the time at which the cloud mass (material with ) is reduced to 50% of the initial cloud mass. This definition leads to mixing times for the spherical clouds that agree to within a few percent of those quoted in Klein et al. (1994), and an easy comparison between timescales in the spherical and realistic cloud simulations.
The mass fractions of the spherical and realistic clouds as a function of time for all cases are shown in Figure 20. Here we briefly discuss the overall evolution of the intermediate density cases. The shock wave hits at kyr, at which point the clouds start to compress. Owing to the definition of cloud mass, the compression of lowdensity cloud material causes the realistic cloud to initially gain mass. In contrast, the spherical cloud lacks internal low density material that can be compressed above the threshold to add to the cloud mass, and the effective cloud mass begins to slowly decrease. At 10 kyr the shock wave has passed through the spherical cloud, which begins to reexpand (see Figure 19, middle panels). The shock wave finishes its passage through the realistic cloud about 1 kyr earlier. In both simulations, the cloud is accelerated by the postshock wind, and hydrodynamic instabilities begin to mix cloud material into the ambient medium. Figure 19 shows the continued evolution of each cloud in snapshots at , the typical cloud destruction time as defined by Klein et al. (1994), and at , the measured mixing time for the intermediate density spherical cloud in our simulations.
As can be seen in Figure 20, the realistic cloud is mixed significantly earlier than the spherical cloud. In the intermediatedensity case, the spherical cloud is mixed in 38 kyrs, or , comparable to the results from previous studies (e.g. Klein et al., 1994; Nakamura et al., 2006). Table 2 shows that the mixing time as a function of does not vary substantially as a function of density for the spherical clouds. In contrast, the realistic cloud is mixed in only 27 kyrs. Using a cloudcrushing time estimated from the mean density, this mixing time corresponds to , a much shorter timescale than in the spherical cloud simulation.
Parameter  

Spherical Cloud  9.98  19.97  39.93  
(kyr)  5.65  8.00  11.33  
(kyr)  28  38  53  
4.96  4.75  4.68  
Realistic Cloud  11.45  20.73  38.22  
6.11  10.68  18.95  
(kyr)  4.42  5.84  7.79  
(kyr)  21  27  36  
4.75  4.62  4.62 
Note. – Average parameters for both the spherical and realistic clouds for the lowdensity, intermediatedensity, and highdensity simulations are listed. Parameters include average cloud density, , median cloud density, , cloud crushing times using either the average (spherical cloud) or median (realistic cloud) density, or , mixing times in kyr, and mixing times as a function of cloudcrushing time.
Evolution of the low and highdensity simulations is qualitatively similar (see Table 2). In each case, the realistic cloud is mixed much more quickly than the spherical cloud, despite having approximately the same initial mass and mean density. Over the density range explored in this work, the mixing time for the realistic clouds is that of a spherical cloud with the same mean density. Clearly, the definition of the cloud crushing time using the mean density does not fit the evolution of the realistic cloud very well. In our simulations, a realistic cloud is mixed on a timescale comparable to a spherical cloud with half its mass.
We can explain this new result in terms of the structural properties of the spherical and realistic clouds. While the mean density and mass of the realistic cloud matches that of the spherical cloud, of the volume in the realistic cloud is filled with gas below the mean density. As a result, the internal shock propagates more quickly through the cloud, expediting the introduction of velocity shear between high density cloud material and the postshock wind. Rather than a single reflected bow shock, the realistic clouds experience many reflected shocks from each highdensity region, as can be seen in the middle panel of Figure 19. The individual highdensity regions fill a much smaller volume than the spherical cloud, effectively decreasing the cloud radius term that appears in Equation 32. The increased surface area for velocity shear for each high density region in the realistic cloud leads to material being more quickly ablated and mixed with the ambient medium. In contrast, the highdensity core of the spherical cloud is protected by the outer material and survives beyond the mixing time.
The evolution of the realistic cloud can still be described within the analytic framework of Klein et al. (1994) if a more appropriate density is used in the definition of the cloud crushing time. Rather than using the average density, we analyze the mixing time for the realistic cloud using the median density, as the median better represents the density of the volumefillling gas within the cloud. Median cloud densities and the corresponding cloudcrushing times are given in the bottom half of Table 2. Using these cloudcrushing times, the mixing time for the realistic cloud matches much better the results for the spherical case.
Further examinations of this problem are left for future work, but potentially interesting investigations include a more detailed parameter study incorporating clouds that are not initially at rest (i.e. have a realistic momentum distribution), and shocks with a finite radius of curvature (i.e. a nearby supernova). Nonetheless, our results clearly show that significant quantitative and qualitative differences exist between the destruction and mixing of an idealized, spherical cloud and a cloud with a lognormal density distribution, even when each has a similar mean density. Further, we newly show that these differences can be understood by applying the Klein et al. (1994) formalism adapted to use the median cloud density.
6. Conclusions
In this work we have presented Cholla, a new, massivelyparallel, threedimensional hydrodynamics code optimized for Graphics Processor Units (GPUs). Cholla uses the unsplit Corner Transport Upwind algorithm (Colella, 1990; Gardiner & Stone, 2008), multiple Riemann solvers, and a variety of reconstruction methods to model numerical solutions to the Euler equations on a static mesh.
In writing the code, we have maintained a modular structure that allows for the implementation of different hydrodynamical schemes. Cholla features five methods for interface reconstruction, including the firstorder piecewise constant method, two secondorder linear reconstruction methods, and two thirdorder methods based on the original piecewise parabolic method developed by Colella & Woodward (1984). There are multiple Riemann solvers, including the exact solver from Toro (2009) and a linear solver based on the method of Roe (1981). Incorporating multiple reconstruction and Riemann solver methods provides the ability to test results for a dependence on the particular numerical techniques used, and the strengths and weaknesses of the different methods are discussed. Cholla also implements an optional diffusive correction called the H correction (Sanders et al., 1998) to suppress instabilities along gridaligned shocks. The H correction adds additional diffusion to the Roe fluxes based on the fastest transverse velocities. The Appendices of this paper detail all of the equations used in the code, and supplement the discussion of each method presented in the main text.
The strategies employed in designing Cholla to run natively on GPUs are extensively detailed. The necessity of transferring data to and from the GPU with every time step requires a specific memory layout to improve efficiency. Once information has been transferred to the GPU, the CTU integration algorithm can be effectively divided into kernel functions that execute on the device, similar to functions in a traditional CPU code. Each of these kernels is selfcontained and contributes to the modularity of the code. Because the GPU has limited global memory storage, the simulation volume must often be subdivided to optimize performance and compensate for GPU memory constraints. The strategy we employ to execute this “subgrid splitting” efficiently is also presented.
The architectural differences between CPUs and massivelyparallel GPUs require the illucidation of several key concepts in GPU programming. When a device kernel is called, the GPU launches a large set of individual computational elements called threads. A single kernel call can launch millions of threads. In Cholla, each thread is assigned the work of computing the updated hydrodynamic conserved variables for a single real cell in the simulation volume. The streaming multiprocessors on the GPUs handle the work of assigning threads to GPU cores, which means that Cholla will be easily transportable to newer generations of GPU hardware or other coprocessors. Thousands of cores operating simultaneously on each device results in thousands of simultaneous cell updates, making the hydrodynamics solver in Cholla very fast. GPUs are designed to optimize for throughput as opposed to latency. As demonstrated in the GPUbased time step calculation presented in Section 3.5, adding additional floating point calculations to GPU kernels is relatively inexpensive. This feature can be exploited to incorporate more physics on the GPU, such as cooling, at relatively little computational cost.
The scalability of Cholla and its performance on a range of hydrodynamics problems was also documented. Using the Message Passing Interface (MPI) library (Forum, 1994), Cholla can be run across many GPUs in parallel. The code incorporates both slabbased and blockbased domain decompositions and exhibits excellent strong and weak scaling in blockdecomposed tests using at least 64 GPUs. We present the results of a suite of canonical hydrodynamics tests in one, two, and three dimensions. The stateofthe art hydrodynamics algorithms employed enables Cholla to perform accurately on a wide variety of tests, and the GPU architecture makes the computations fast without sacrificing physical accuracy. Since a single GPU can compute a simulation with nontrivial resolution, using Cholla with a cluster presents the option of running many problems to explore large parameter spaces rapidly. Further, the excellent weak scaling of Cholla suggests that very large problem sizes can be tackled on large GPUenabled supercomputers.
We demonstrate the power of Cholla for astrophysics by addressing the classic numerical problem of a shock hitting a small gas cloud in the interstellar medium. Calculations of the evolution of such clouds require high numerical resolution and excellent shockcapturing abilities (Klein et al., 1994), and Cholla was designed to address such astrophysical problems. Comparing the wellstudied ideal case of spherical overdensities to realistic clouds with lognormal density distributions created by supersonic turbulence, we present new results showing that realistic clouds are destroyed more quickly than spherical clouds of the same mass and mean density. In our simulations, realistic clouds are mixed with the ambient medium on timescales comparable to spherical clouds with half their mean density. We posit that the faster destruction time is a result of the lower median density of the gas in the realistic cloud. The shock propagates more quickly through the realistic clouds, allowing the hydrodynamic instabilities responsible for disrupting the cloud and mixing the gas to develop sooner than in the spherical case. We further show that the Klein et al. (1994) formalism can be used to describe the destruction of our realistic cloud simulations provided that the median density is used in place of the average cloud density. These results demonstrate the successful application of Cholla in the performance of highresolution, physicallyaccurate astrophysical simulations. We plan to use similar calculations in the future to connect smallscale stellar feedback to galaxy evolution on larger scales.
Lastly, we note that in contrast to many MPIenabled codes, we have designed Cholla to perform almost all hydrodynamics calculations on the GPU without transferring information to the CPU during the course of a single timestep, leaving the computational power of the CPU to address other tasks (see e.g. Figure 3). By performing calculations on the CPU and GPU simultaneously, additional physics could be modeled on the CPU during the hydrodynamical computation on the GPU. Logical extensions to Cholla include using Fourier transforms to solve gravity or drive turbulence. The addition of a magnetohydrodynamics module on the GPU is also an attractive possibility, as Cholla uses an unsplit integration algorithm that is optimized for MHD (Gardiner & Stone, 2008).
Appendix A Reconstruction Methods
To calculate the input states for the Riemann solvers used in Cholla, appropriate values of the conserved variables at each interface must be reconstructed using the cellaveraged quantities. Each Riemann problem requires input states at either side of a cell interface, referred to as and in the context of the CTU algorithm presented in Section 2.1. While previously and indicated the left and right of the interface, in this Appendix a cellcentered labeling is used to document the procedure for computing the input states at the left and right boundaries of a single cell.
The input states calculated at the left and right sides of the cell will be labeled and in the conserved variables, or and in the primitive variables. As described in Table 1, the asterisk indicates the timeevolved input state. The boundary values reconstructed before time evolution will be labeled and . For each method described below, only the steps involved in the reconstruction for the interfaces are shown. The  and reconstructions proceed in the same manner but with an appropriate change of stencil. The notation will drop the , , and subscripts unless they are needed for clarification.
a.1. Plmp
The simplest practical reconstruction method implemented in Cholla is PLMP, a piecewise linear reconstruction with slope limiting applied in the primitive variables. The stencil to calculate the boundary values for cell contains cells to the left and to the right. The first step in the method converts the cellaveraged values of the conserved variables into the primitive variables, . The cellaveraged primitive values are then used to reconstruct boundary values in the primitive variables, and at the left and right sides of cell using a local, piecewise linear reconstruction (Toro, 2009):
(A1)  
where is a vector containing the slope of each primitive variable across cell . To compute , we first calculate the left, right, and centered differences in the primitive variables across each of the cell interfaces:
(A2) 
A monotonized centraldifference limiter (van Leer, 1977) is then used to compute :
(A3) 
where is defined as
(A4) 
Each of the primitive variables is treated independently in the limiting process, so the vector of primitive variable slopes (for extrapolations to the left cell face) can be simply written as
(A5) 
The primitive variable slopes for extrapolating to the right cell face can be similarly defined.
The last step in computing input states for the Riemann problem is to evolve the reconstructed boundary values by half a time step . The time evolution is modeled using the conserved form of the Euler equations, and and are therefore converted back into conserved variables, and and used to calculate the associated fluxes via Equation 11. These fluxes are used to evolve the reconstructed boundary values and obtain input states appropriate for the Riemann problem:
(A6) 
(A7) 
a.2. Plmc
The second reconstruction method, PLMC, is also based on a piecewise linear reconstruction but with the slope limiting computed using the characteristic variables. The stencil again contains one cell to the left and right of cell . After converting the cellaveraged quantities from conserved to primitive variables, an eigenvector decomposition of the Euler equations is performed using the characteristic variables as described in Section 2. First, the eigenvalues of the linear system of equations for cell are calculated. For adiabatic hydrodynamics, the eigenvalues correspond to the three wave speeds,
(A8) 
where is the average sound speed in cell . The quantities and are speeds of the acoustic waves and is the speed of the contact wave. The corresponding eigenvalue for any advected scalar quantity (such as the transverse velocities in multidimensional problems) is simply the speed of the fluid in the normal direction .
The left (), right (), and centered () differences in the primitive variables shown in Equation A2 are then calculated. These differences are projected onto the characteristic variables, , using the left eigenvectors given in Appendix A of Stone et al. (2008). Rather than reproduce the the expressions for each eigenvector, equations describing the final projections are shown since they are actually used in the GPU kernel. The projection of the left difference is
(A9) 
where , , , , and are the components of the primitive variable difference vector . The projections for the right and central differences are calculated in the same manner, yielding and .
The characteristic differences are then monotonized using the van Leer (1977) limiter, computed as
(A10) 
We project the monotonized differences in the characteristic variables back onto the primitive variables, providing slopes in each variable that are analogous to the limited slopes described in PLMP: