A fast GPUbased Monte Carlo simulation of proton transport with detailed modeling of nonelastic interactions
Abstract
Purpose: Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic protonnucleus interactions. Our primary goal is to build a GPUbased proton transport MC with detailed modeling of elastic and nonelastic protonnucleus collisions.
Methods: Using the CUDA framework, we implemented GPU kernels for the following tasks: (1) Simulation of beam spots from our possible scanning nozzle configurations, (2) Proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) Modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) Simulation of nuclear evaporation, and (5) Statistical error estimates on the dose. To validate our MC, we performed: (1) Secondary particle yield calculations in proton collisions with therapeuticallyrelevant nuclei, (2) Dose calculations in homogeneous phantoms, (3) Recalculations of complex head and neck treatment plans from a commerciallyavailable treatment planning system, and compared with Geant4.9.6p2/TOPAS.
Results: Yields, energy and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the Geant4.9.6p2 Bertini and Binary cascade models. The 3Dgamma pass rate at 2%2 mm for treatment plan simulations is typically 98%. The net computational time on a NVIDIA GTX680 card, including all CPUGPU data transfers, is 20 s for proton histories.
Conclusions: Our GPUbased MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with Geant4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencilbeam based treatment plans, and is being used as the dose calculation engine in a clinicallyapplicable MCbased IMPT treatment planning system. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.
Keywords: Proton therapy, CUDA, GPU, Monte Carlo, Nonelastic
I Introduction
The ability to perform fast and accurate dose calculations is of great importance to modern radiation therapy, especially in proton therapy, where range uncertainties can be critical. In terms of accuracy, it is widely accepted that Monte Carlo (MC) techniques are the gold standard for dose calculations. MC simulations of proton treatment plans have previously been performed using wellestablished packages such as Geant4, FLUKA and MCNPX Paganetti2008 (); fluka (); MCNP (). The improved accuracy of MC simulations compared to pencil beam calculations has been explicitly demonstrated Paganetti2008 (). Because of the enhanced accuracy, MC dose calculations can be used to verify results from pencil beambased treatment planning systems (TPS), hence mitigating the effects of proton range uncertainties Paganetti2012 (). Unfortunately, computational times associated with MCbased treatment plan simulations using Geant4, FLUKA and MCNPX can be prohibitive. For example, a TOPAS TOPAS () simulation of the dose within 2% statistical error (1 particle histories) in a 150 cm tumor volume typically requires around 1 hour on a 100node CPU cluster. This makes routine clinical QA of treatment plans very difficult. At best, MC evaluations can be reserved for a handful of cases, and even so, only if significant computing resources are available.
Recently, several attempts to implement condensed history MC simulations of proton transport on graphics processing units (GPUs) have been made, with very encouraging results Kohno (); Jia (); Su (); Osiecki (). Using Berger’s classification, these attempts can be grouped as class I Kohno (); Su (); Osiecki () and II Jia (). The typical processing time for particle histories is reported to be around 2–30 s. The impressive speed gain compared to CPUbased calculations is due to both algorithmic simplification and hardware acceleration. By far the most important approximation in refs.Kohno (); Jia (); Su (); Osiecki () concerns protonnucleus collisions. Su et al. Su () and Osiecki et al. Osiecki () have computed proton depthdose curves in water on the GPU, without any nuclear interactions. Kohno et al. Kohno () indirectly consider nuclear processes by using measured proton depthdose curves in water to determine energy losses during particle stepping. Jia et al. Jia () use a simplified model of nonelastic nuclear interactions (based on the work of Fippel and Soukup Fippel ()), which can predict the energy spectrum of secondary protons from p–O collisions. The composition of all materials is assumed to be identical to water, and hence, only nonelastic collisions on O are considered. Their dose calculations in bone and lowdensity tissue demonstrate good agreement with Geant4, despite the lack of an indepth model of nuclear interactions.
However, in certain situations a more accurate model of nuclear processes is critical, e.g. particle propagation in patients with sizeable metallic implants, detailed studies of secondary particle production, neutron dose estimates and doseaveraged linear energy transfer (LET) calculations. A nonelastic protonnucleus interaction can be simulated with the MC method as a twostep process: a fast intranuclear (INC) cascade that leaves the nucleus in an excited state, followed by an evaporation stage Bertini (). Traditional CPUbased simulations are rather timeconsuming: it takes 200s to compute 200 MeV p–O nonelastic interactions
In this paper, we present a GPUbased class II condensed history proton transport MC, which uses our previously reported INC and evaporation kernels
This paper is organized as follows. In §II.1, we describe our physics model and all the major components of our simulation. This includes realistic beam spot modeling, energy loss, multiple scattering, energy straggling, INC, nuclear evaporation, elastic nuclear scattering and transport mechanics. In §II.2, the GPU implementation is described: our parallelization strategy (§II.2.1), the software structure (§II.2.2) and GPU memory management (§II.2.3). We give details of our validation methods and of our treatment plan recalculation workflow in §II.3 and II.4 respectively. In §III, we report on the validation results, comparing our MC simulation results with Geant4.9.6p2/TOPAS: (1) predictions from our model of nonelastic nuclear interactions (§III.1), (2) pencil beam dose distributions in homogeneous phantoms (§III.2), and (3) MC recalculations of complex head and neck treatment plans (§III.3). We then discuss a number of current and future applications of our work in §IV, before summarizing and concluding in §V.
Ii Methods
ii.1 Simulation model
Geometry
We focus only on transport through voxelized geometries. Patient geometry is implemented from CT scans, using the formalism of Schneider et al. Schneider () to convert the Hounsfield Unit (HU) of each voxel to material density and composition. The HU to density conversion factors are the same as in ref.Schneider (). Following ref.Paganetti2008 (), in the HU range between 1000 and 3060 we use 26 materials comprised of a combination of up to 12 elements, with everything above 3060 assumed to be metallic implant (titanium by default).
Stopping powers, multiple scattering, energy straggling
Total stopping power () tables for protons in water, air and titanium are obtained from Geant4.9.6p2. Proton rangeenergy tables are then generated via numerical integration in the continuous slowing down approximation (CSDA). Second derivatives are also precalculated so that the tables can be cubicspline interpolated. The stopping power of a proton of energy in material with is related to , the stopping power in water, by:
(1) 
where is the density of material, is the density of water, and is a scaling factor. is tabulated as a function of HU using Geant4.9.6p2. For air (HU = 1000) and titanium (HU 3060), this scaling factor is not applied, and pregenerated tables are used directly instead.
Multiple scattering in each condensed history step is simulated using the Highland formula RPP (). A multiplicative factor of 1.07 is added to obtain comparable angular spread with Geant4.9.6p2, which employs the Urban multiple scattering model Urban (). Radiation lengths for all defined materials are precalculated with Geant4.9.6p2. We also implemented a full Molière multiple scattering simulation, which takes into account the single scatter tail at the cost of slowing down the simulation by a factor of 5. All results presented in this work were produced using the scaled Highland formula.
Energy straggling is simulated with a VavilovGaussian model. Fast sampling from Vavilov distributions is achieved using Chibani’s method Chibani (). The mean energy loss in each step is found by cubicspline interpolating the rangeenergy table. For a material with , the CSDA proton range is derived from the range in water, by:
(2) 
In this work the kinetic energies of ray electrons are assumed to be deposited locally. Given the typical size of a voxel (1 mm), this approximation holds very well for soft tissue and bone.
Transport mechanics
The step size is initially picked as the minimum of three quantities: (1) the distance to the next elastic or nonelastic nuclear interaction, , which is selected according to the total nuclear crosssection (2) the distance to the next voxel boundary, , and (3) the range of the proton, . We use a ‘random hinge’ algorithm PENELOPE () to take into account the small transverse displacement after each multiple scattering step.
If , a nuclear collision occurs, and the proton is propagated to the collision point. A random number is thrown to determine if the collision is elastic or nonelastic. The proton history is terminated if one of three conditions are met: (1) , which means that the proton is stopping in the current voxel, (2) , where MeV, and (3) the proton is outside the boundaries of the voxelized geometry.
Nonelastic nuclear interactions
Nonelastic protonnucleus crosssections for all relevant nuclei in the energy range 0–250 MeV were tabulated using Geant4.9.6p2. Once a nonelastic collision is deemed to happen, a random number is thrown and the target is picked according to the relative sizes of the nonelastic crosssections of each nuclear component. After picking a random point of entry into the target, the proton is propagated through nuclear matter. This is repeated until an INC involving at least one Pauliallowed nucleonnucleon collision occurs within the nucleus. We very briefly summarize the INC and evaporation simulations below (for further details see ref.INCPaper ()).
The nucleus is assumed to be a twocomponent Fermi gas of neutrons and protons. The nucleon density follows a WoodsSaxon distribution that is approximated by 15 constant density shells. The incident proton and all collision products are tracked in the nuclear matter, until they exit the nucleus or become trapped. Pauli blocking is enforced by requiring the kinetic energies of all collision products to exceed their Fermi energies. Reflection and refraction at the potential shell boundaries are taken into account. In the therapeutic energy range ( MeV), only elastic collisions between nucleons need to be considered. Total and differential elastic nucleonnucleon crosssections are calculated using the parameterization of Cugnon et al. Cugnon (). The result of the INC simulation is a number of outgoing nucleons and a residual nucleus that subsequently deexcites by particle evaporation.
To model the evaporation phase, we use the generalized evaporation model (GEM) GEM (). The angular distribution of evaporated particles is assumed to be isotropic in the rest frame of the nucleus. Other postcascade processes such as photon evaporation, preequilibrium emission of particles, fission and Fermi breakup are currently not considered. Only protons produced in the INC and evaporation stages are stored for subsequent propagation. Neutrons are presently assumed to be dosimetrically irrelevant, and the kinetic energies of all charged particles heavier than protons are locally deposited.
Coherent nuclear elastic interactions
We simulate coherent nuclear elastic collisions with all constituent nuclei in the patient geometry. Collision kinematics are fully relativistic. Total elastic crosssections are obtained from Geant4.9.6p2. Scattering angles in the centerofmass frame are sampled according to the following parameterization of the elastic differential cross section by Ranft Ranft ():
(3) 
where is the mass number of the target and the invariant momentum transfer in (GeV/c) is given by , being the momentum of the scattered proton in GeV/c.
Simulation of proton beam spots from scanning nozzle
A TOPAS simulation of the upcoming Mayo proton center spot scanning nozzles has been developed (fig. 1). Details of the different treatment nozzle geometries and the proton beam phase space parameters at nozzle entrance were supplied by the vendor. Although commissioning data are not yet available to validate our model, preliminary spot size calculations are in good agreement with vendor specifications. The capability of Geant4 to accurately simulate spot scanning nozzles has previously been demonstrated in ref. MDAndersonNozzle ().
We used the TOPAS simulation to generate phasespace lists of protons near nozzle exit at 40 cm from isocenter plane, covering all synchrotron beam energies and treatment nozzle configurations. Five kinematic variables are stored for each proton: the transverse positions , transverse directional cosines and kinetic energy . For any given beam energy, we verified that probability distributions of kinematic variables for all relevant offaxis beamspots were essentially identical to those for centralaxis spots. This significantly reduces the size of input phasespace data, since centralaxis particle lists can be used to model offaxis beam spots. However, if the nozzle includes a range shifter, an energy reduction has to be applied to to account for the additional energy loss of protons in offaxis spots
The phasespace files are preloaded into our MC at initialization time, and protons are generated at the nozzle exit. In practice, the number of primary proton histories in a given calculation has to be scaled down to account for losses due to nuclear interactions in the nozzle; otherwise the dose deposited in the phantom or patient would be slightly overestimated. This scaling factor was calculated for each nozzle configuration and beam energy. The protons are then are propagated through air to the edge of the volume defined by the CT image in a single step, taking into account multiple scattering, energy straggling and nuclear interactions.
ii.2 GPU implementation
We now describe our implementation of the above model on a GPU using the CUDA C framework cuda (). The reader is assumed to be acquainted with GPU terminology; if not, he or she is referred to the abundant resources available on the NVIDIA developer website nvidia ().
Parallelization approach
Many different strategies have been used in the past to parallelize photon, neutron and charged particle transport MCs on the GPU. Although no two simulations are identical, one can identify two broad paradigms: the ‘one step per thread’, and ‘one particle history per thread’ methods.
In the first method (e.g. refs.Jahnke (); rpi ()), the simulation of particle histories is split into basic components (such as track stepping, boundary crossing and discrete physical interactions), which are accumulated and processed in parallel by kernels that contain little or no branching instructions. As shown by Du et al. rpi () this strategy can result in a divergencefree simulation because all threads in a warp are made to execute the same set of instructions.
The second method is the most commonly adopted approach. Here, a GPU thread processes a particle history until end conditions are satisfied. Since particle histories differ, thread divergence is inevitable, especially if more than one particle type are handled by a warp, or if a particle and its secondaries are tracked by the same thread. However, this method potentially results in a much reduced number of global memory transactions. For example, at the end of each step the particle position, direction and energy needs to be updated. These variables can be read from global to register memory, updated during track stepping and written back to global memory when end conditions are met. In contrast, in the ‘one step per thread’ method, processing each track step would require at least one global memory read and write per kinematic variable, since one kernel launch occurs per step.
Comparing the two methods for a neutron transport MC, Du et al. rpi () find the ‘one step per thread’ method to be 10 times slower. In their implementation, the benefits of a divergencefree simulation is overwhelmed by severe global memory access latency. In this work we therefore follow the ‘one history per thread’ approach. We limit the effects of thread divergence to some extent by grouping nonelastic events and processing them in parallel. Our code structure is discussed next.
Software structure
We implement the following five kernels: (1) a phase space kernel to set the initial kinematic variables of the protons in order to simulate realistic beam spots (§II.1.6), (2) a transport kernel to evolve the proton history step by step through the voxelized geometry, taking into account energy loss fluctuations and scattering (§II.1.2), until it undergoes a nonelastic collision or meets one of the end conditions outlined in §II.1.3, (3) an INC simulation kernel, (4) a nuclear evaporation kernel, and (5) a dose statistics kernel to evaluate the total and root mean square (RMS) dose in each voxel. Kernels (3) and (4) are described in further detail in ref.INCPaper (). We also use CURAND kernels curand () for onthefly random number generation, and CUDA Thrust Thrust () libraries for various bookkeeping operations.
For a dose calculation involving proton histories, we process groups of protons to evaluate statistical errors on the dose deposited in each voxel. Due to GPU memory constraints, a group is further split into subgroups of protons, where is an integer multiple of the number of GPU threads in a block.
The sequence of kernel calls is shown in fig. 2. Before looping through the primary proton groups, the voxelized geometry, CURAND states, crosssections, stopping power and range tables, physics constants, dose array and other simulation inputs are initialized. At the start of a proton subgroup loop, the phase space kernel is launched to sample the kinematic variables of all primary protons in that subgroup. Then the transport kernel is launched to simulate all protons until end conditions are met. If a nuclear interaction occurs, it is stored for subsequent batch processing. Thrust utility kernels are then called to collect and sort nonelastic interactions. These are simulated in parallel by the INC kernel, resulting in a list of residual nuclei. This list is passed to the evaporation kernel to simulate the deexcitation mechanism. The INC and evaporation kernels are based on the model in §II.1.4. Secondary protons produced in the transport, INC and evaporation kernels are collected using Thrust functions, and the three kernels are again invoked in the same order. This is repeated until all secondary protons are simulated. The program then moves on to the next primary proton group. After all groups have been processed, the dose statistics kernel calculates the total and RMS dose in each voxel from the dose tallies of each primary proton group.
In our MC, CPU operations are limited to simulation initialization, kernel call control, and file output. The thread workload for each GPU kernel is as follows: the phase space kernel initializes the position , directional cosines and kinetic energy of one proton in each GPU thread. The transport kernel handles one proton history per thread, and the Bertini cascade and evaporation kernels simulate one INC and one nuclear deexcitation per thread, respectively. As for the dose statistics kernel, it processes one voxel per thread.
GPU memory allocation and management
Readonly data such as phase space kinematic variables, interaction crosssections and stopping power and range tables are stored in as 1D textures. Constant memory is used for storing physics and mathematical constants. Shared memory is barely used in this work. CT geometry information and the 3D dose tallies of each primary proton group are kept in GPU global memory. All dose tallies need to be updated ‘atomically’ to prevent race conditions when multiple threads are updating the dose deposited to the same voxel. For any proton group, a single dose tally was found to be sufficient on Kepler cards
As seen in fig. 2, virtually all hostGPU data transfers occur either during the initialization stage or at simulation end, when the computed dose and dose RMS maps are transferred to the host CPU for file output. Particle SoAs are created and cleared at the beginning and end of a subgroup loop, respectively. Data contained in the SoAs do not need to be transferred to the host between kernel calls. In other words, we tried to eliminate CPUGPU data transfers as much as possible.
ii.3 Treatment plan calculations
The workflow that we adopted for recalculating plans from a commercially available TPS is shown in fig. 3. DICOM files (plan, structure and image) outputted by the TPS are processed using a set of Python scripts, to produce inputs for our MC. This step involves trimming the CT image to reduce computational time: using information contained in the structures file, a smaller rectangular volume containing the patient geometry is defined, so that airfilled voxels in the beam path are removed as much as possible. Beam spot weights, isocenter positions, gantry and couch rotation angles, and details of the nozzle configuration are also read and outputted in text format. A ‘bit mask’ file specifying whether or not each voxel in the reduced volume is within a userspecified list of structures is created. This information allows the user to override the HU values of all voxels inside or outside a given structure, and to restrict dose file output to certain structures only. This considerably reduces GPUCPU data transfer and file output times at simulation end.
ii.4 Validation
To validate our nuclear model, we simulated nonelastic proton collisions on a large number of therapeutically relevant nuclei, with incident energies between 70 and 230 MeV. We compared our predictions with Geant4.9.6p2 (using the Bertini and Binary cascade models) and published measurements when available.
Using our MC, we simulated infinitesimally narrow pencil beams of protons with energies between 70 and 230 MeV, in various homogeneous voxelized phantoms including water, soft tissue, dense bone and pure calcium and titanium. The voxel size was 1 mm along all directions. The simulations were also carried out with Geant4.9.6p2, using the QGSP_BIC_EMY qgspbicemy () physics list. All particles were terminated once they exit the phantom, so as to eliminate any inscattering contributions from outside the region of interest. These pencil beam simulations allowed us to verify and refine our implementation of the physics models for proton transport. We also simulated square fields of varying sizes in water at several beam energies, for different treatment nozzle configurations. These square field simulations were repeated with TOPAS, using the nozzle model described in §II.1.6.
To verify our treatment plan calculations, we simulated three complex head and neck cases involving different patients using both TOPAS and our GPUbased MC. The TOPAS simulation again included the nozzle model described in §II.1.6. Both simulations adopted the native CT voxel size ( mm) for particle transport and dose scoring. To quantify differences between the two MC we use the 3D gamma index Low (), which is also computed on the GPU. However, the 3D gamma index might not be sensitive to systematic biases that are smaller than the pass criterion. To investigate these, we examined distributions of the percentage dose difference, defined for a given voxel as:
(4) 
where the dose calculated by TOPAS and our MC in that voxel are and , respectively. We also define the quantities and . The latter is defined as:
(5) 
where is the dose calculated by our MC for the same number of protons, but using a different starting random number seed, and is the average percentage statistical uncertainty on the dose in the voxel. is estimated by dividing the total number of primary protons in 10 groups, as explained in §II.2.2. , which measures the statistical error in the TOPAS simulation, is defined in a similar way.
Ideally, 1D distributions of , and in the targets should be Gaussians peaked at zero, with identical widths. Biases and other model discrepancies result in peak shifts and RMS differences.
Iii Results
iii.1 Nonelastic interactions
Fig. 4 shows predicted differential crosssections for proton and neutron production in: (a) 90 MeV p–C, (b) 113 MeV p–C, (c) 200 MeV p–O and (d) 200 MeV p–Ca collisions at various angles. The values are normalized to total crosssections from Cugnon et al. Cugnon () and have been scaled down for display clarity (except for the topmost curves). In general, reasonable agreement is obtained with both Geant4.9.6p2 models and published data. The Bertini and Binary models are routinely included in the physics lists used in Geant4based dose calculations. Further comparisons of our INC and evaporation kernel predictions with Geant4 are shown in ref.INCPaper ().
Platform  Model  p–O (s)  p–Ca (s) 

(1) CPU  G4 Binary  2382.07  4620.92 
(2) CPU  G4 Bertini  492.42  534.13 
(3) CPU  this work  172.74  326.11 
(4) GPU  this work  8.39  15.56 
(5) GPU  this work  6.61  9.77 
Table 1 shows the computational times for proton–O and proton–Ca interactions at 200 MeV for the following: (1) Geant4 Binary cascade model on the CPU, (2) Geant4 Bertini cascade model on the CPU, (3) The CPU version of our MC, (4) Our GPUbased MC, and (5) Our GPUbased MC, compiled with the use_fast_math flag
iii.2 Calculations in homogeneous phantoms
To illustrate some of the pencil beam simulation results, figs. 5 and 6 show, clockwise from the top left: the depth dose, 2D isodose contours, transverse dose and centralaxis depthdose profiles for 100 and 200 MeV protons in water and titanium, respectively. In these two figures, the average statistical error on the dose in voxels containing at least 10% of the maximum dose was below 0.2%. Very good agreement was obtained with Geant4.9.6p2: using the Geant4 dose maps as the reference, the 3D gamma pass rate at 2%2 mm was 100% for all beam energies and phantoms investigated. In particular, our MC correctly models the transverse dose tails, down to the regime where nuclear elastic scattering dominates. Fig. 6 explicitly demonstrates the capability of our MC to correctly model proton transport in a typical metallic implant. We stress that the level of agreement illustrated fig. 6 would have been difficult to achieve with a nuclear model that only considers p–O interactions.
The square field simulations also demonstrated very good agreement between Geant4 and our MC. These calculations were performed to test our ability to compute large field sizes, and to correctly position and model beam spots. For brevity, we won’t show these results, since the treatment plan recalculation results below imply that these requirements were met.
iii.3 Treatment plan calculations
Fig. 7 shows MC recalculations of one of the three head and neck plans from a commercially available TPS. This plan consisted of two bilateral and one posterior fields. The dose in one representative coronal plane is shown in the bottom plots for our MC (left) and TOPAS (right). This plan contained one lowdose and one highdose target that are delineated in cyan and magenta, respectively. The top plot in fig. 7 shows predicted DVHs for the two targets and four critical structures, for our MC (triangle markers) and TOPAS (square markers). Very good agreement is obtained between the two simulations. Taking the TOPAS dose map as the reference, the 3D gamma pass rate at 2%2 mm in this case was 98.2%. A color wash plot showing the distribution of in the same coronal plane is shown in fig. 8 (left). No apparent pattern in was found; as expected higher values were obtained in voxels with low dose statistics.
Distributions of , and in the highdose target are Gaussian, as shown in the right plot in fig. 8. We observe that the distribution of is slightly offcenter, i.e. is on average 0.2% lower than . This small dose deficit is attributed primarily to neutrons and deexcitation gammas originating from nuclear interactions in the nozzle and in the patient, and which are not currently propagated in our MC simulation. Gaussian fits to the three curves yield RMS values of 2.2, 2.0 and 1.9, respectively. The value of 1.9 is consistent with the estimated value of % (see Eq. 5). The percentage statistical error in the TOPAS simulation is slightly higher (%). This is possibly caused by the difference in energy straggling implementations between TOPAS and our MC. Note that in our simulation the initial number of primary proton histories has been scaled down relative to TOPAS to account for nuclear interactions in the nozzle (see §II.1.6).
, the variance of the distribution of , can be written as:
(6) 
where is the additional variance due to model differences. From the above numbers, %. We hypothesize that the main contribution to is due to the handling of rays: TOPAS propagates these particles, while in our simulation the kinetic energy of electrons is locally deposited.
For the two other head and neck plan recalculations, the 3D gamma pass rates at 2%2mm were 97.8% and 98.9%. Very good agreement was again obtained between the DVHs generated from our MC and TOPAS, and dose difference distributions followed the same trend. Hence, for brevity we will not discuss these two plans in further detail.
The computational times for the three head and neck plans on two types of NVIDIA cards are shown in table 2. We verified that simulations performed with and without the use_fast_math flag were equivalent. It is seen that our MC on a single K20Xm card can be up to 200 times faster than TOPAS on a 100node cluster. Evidently, TOPAS performs a more detailed calculation, but as shown above, dosimetrically the results from the two simulations are very close.
Platform  HN Plan 1  HN Plan 2  HN Plan 3 

Our MC (K20Xm use_fast_math)  119 s  128 s  129 s 
Our MC (GTX680 use_fast_math)  155 s  165 s  176 s 
Our MC (K20Xm)  225 s  239 s  270 s 
Our MC (GTX680)  256 s  270 s  288 s 
TOPAS on CPU cluster  650 CPUhours 
Iv Discussion
iv.1 Current applications of our GPUbased MC
At our clinic, the GPUbased proton transport MC is being used in two applications that can potentially play an important role in mitigating the effects of proton range uncertainties:

Our MC was deployed on a dual K20Xm GPU server to routinely QA treatment plans from a commerciallyavailable TPS. Due to the very fast computational speed, near realtime feedback on the accuracy of most plans is achievable. This would not have been possible with a conventional, CPUbased MC. A friendly user interface for this verification system is being developed to make the software easily accessible to dosimetrists. Another Python script, not shown in fig. 3, allows the MCcalculated dose to be imported into the TPS for convenient evaluation and display.

In addition, we have developed a GPUbased intensity modulated proton therapy (IMPT) optimization code that adopts our MC as the dose computation engine Ma (). For a complex threefield head and neck plan, the full calculation, including the initial dose map for all relevant beam spots, is estimated to take half an hour on a cluster with 24 GPUs. Therefore, using GPU acceleration, MCbased IMPT planning can be clinically applicable on a routine basis. We are currently upgrading this code to a robust MCbased IMPT optimization system.
iv.2 Future applications
We are planning the following two improvements to our physics model: simulation of the production and transport of rays, and propagation of secondary neutrons from nonelastic nuclear interactions of protons in the patient. The ability to evolve neutron histories will allow us to make very fast and accurate neutron dose calculations.
As discussed above, one of the main strengths of this work is the ability to carry out detailed nuclear interaction simulations very rapidly on the GPU. It was shown in ref.Grassberger () that secondary protons have a significant impact on the LET distribution in clinical proton beams. In fact, our MC is capable of computing physical dose and LET simultaneously; this does not considerably extend the net computational time. The LET calculations in water are in good agreement with TOPAS. Fig. 9 shows our prediction of the LET distribution in areas with at least 10% of the maximum dose, in the same coronal plane as in fig. 7. As expected, LET values are higher in regions where the protons are ranging out. At present, our LET values neither include the small contributions of proton recoils from scattering of secondary neutrons, nor heavy ion recoils. The neutron scattering portion will be taken into account after the addition of a neutron transport kernel to our MC. The ability to perform MCbased LET calculations rapidly and accurately on the GPU opens very exciting prospects for biological treatment planning.
V Conclusions
To summarize, we have successfully developed a GPUbased proton transport MC that incorporates Bertini cascade and evaporation kernels for modeling nonelastic interactions of protons with any nucleus in the therapeutic energy range. We have verified our MC extensively using Geant4 and TOPAS. The net computation speed is typically 20 s per proton histories, which is comparable to processing times reported by previous authors using simulations with simpler nuclear interaction models. The very fast calculation speed allowed this MC to be used in our clinic as the central component of a treatment plan verification system, and also as the dose calculation engine for MCbased IMPT optimization. Furthermore, the detailed nuclear modeling not only gives us greater confidence in our physical dose calculations, but will allow us to perform accurate GPUbased MC calculations of LET, as well as fast neutron dose estimates.
Vi Acknowledgements
This work was partially funded by a grant from Varian Medical Systems, Inc. We are grateful to our colleagues at Mayo Clinic for carefully reading the manuscript.
References
Footnotes
 This is roughly the expected number of nonelastic events for proton histories.
 In this paper, ‘kernel’ refers to a piece of code that runs on the GPU but called from the host CPU.
 The ICRU definitions for the terms ‘elastic’, ‘nonelastic’ and ‘inelastic’ are adopted in this paper ICRU63 ().
 For example, if the thickness of the range shifter is 4.34 cm, on average the beam at the extreme edge of the scanning region would ‘see’ a thickness of 4.38 cm.
 On Fermi cards, Jia et al. Jia () report the calculation speed to be considerably extended by atomic operations. They use multiple dose tallies to mitigate this effect.
 Compiling with the use_fast_math flag enables the use of faster, but less accurate functions for standard mathematical operations on NVIDIA GPUs.
References
 H. Paganetti, H. Jiang, K. Parodi, R. Slopsema and M. Engelsman, Clinical implementation of full Monte Carlo dose calculation in proton beam therapy, Phys. Med. Biol. 53 (2008) 4825.
 K. Parodi, A. Ferrari, F. Sommerer and H. Paganetti, Clinical CTbased calculations of dose and positron emitter distributions in proton therapy using the FLUKA Monte Carlo code, Phys. Med. Biol. 52 (2007) 3369.
 U. Titt, B. Bednarz and H. Paganetti, Comparison of MCNPX and Geant4 proton energy deposition predictions for clinical use, Phys. Med. Biol. 57 (2012) 6381.
 H. Paganetti, Range uncertainties in proton therapy and the role of Monte Carlo simulations, Phys. Med. Biol. 57 (2012) R99.
 J. Perl, J. Shin, J. Schümann, B. Faddegon and H. Paganetti, TOPAS: An innovative proton Monte Carlo platform for research and clinical applications, Med. Phys. 39 (2012) 6818.
 R. Kohno, K. Hotta, S. Nishioka, K. Matsubara, R. Tansho and T. Suzuki, Clinical implementation of a GPUbased simplified Monte Carlo method for a treatment planning system of proton beam therapy, Phys. Med. Biol 56 (2011) N287.
 X. Jia, J. Schümann, H. Paganetti and S. Jiang, GPUbased fast Monte Carlo dose calculation for proton therapy, Phys. Med. Biol. 57 (2012) 7783.
 L. Su, T. Liu, A. Ding and X.G. Xu, A GPU/CUDA based Monte Carlo for proton transport: preliminary results of proton depth dose in water, Med. Phys. 39 (2012) 3945.
 T. Osiecki, M. Tsai, A. Gattiker, D. Jamsek, S. Nassif, W. Speight and C. Sze, Hardware acceleration of an efficient and accurate proton therapy Monte Carlo, Procedia Comput. Sci. 18 (2013) 2241.
 M. Fippel and M. Soukup, A Monte Carlo dose calculation algorithm for proton therapy, Med. Phys. 31 (2004) 2263.
 H.W. Bertini, Lowenergy intranuclear cascade calculation, Phys. Rev. 131 (1963) 1801.
 S. Leray, D. Mancusi, P. Kaitaneimi, J.C. David, A. Boudard, B. Braunn and J. Cugnon, Extension of the Liège intranuclear cascade model to light ioninduced collisions for medical and space applications, J. Phys. Conf Ser 420 (2013) 012065.
 H. Wan Chan Tseung, C. Beltran, “A graphics processorbased intranuclear cascade and evaporation simulation,” Comput. Phys. Commun. 185 (2014) 2029.
 The International Commission on Radiation Units and Measurements, Nuclear data for neutron and proton radiotherapy and for radiation protection, ICRU Report No. 63, ICRU Publications, Bethesda, MD, 2000.
 W. Schneider, T. Bortfeld and W. Schlegel, Correlation between CT numbers and tissue parameters needed for the Monte Carlo simulations of clinical dose distributions, Phys. Med. Biol. 45 (2000) 459.
 J. Beringer et al. (Particle Data Group), Review of particle physics 20122013, Phys. Rev. D86 (2012) 010001.
 L. Urban, ”Multiple scattering model in Geant4”, CERN Report No. CERNOPEN2002070 (CERN, Geneva, 2002).
 O. Chibani, New algorithms for the Vavilov distribution calculation and the corresponding energy loss sampling, IEEE Trans. Nucl. Sci. 45 (1998) 2288.
 J. Baro, J. Sempau, J.M. FernandezVaria and F. Salvat, PENELOPE: An algorithm for Monte Carlo simulation of the penetration and energy loss of electrons and positrons in matter, NIM B100 (1995) 31.
 J. Cugnon, D. L’Hôte and J. Vandermeulen, Simple parameterization of crosssections for nuclear transport studies up to the GeV range, NIM B 111 (1996) 215.
 S. Furihata, Statistical analysis of light fragment production from medium energy protoninduced reactions, NIM B 171 (2000) 251.
 J. Ranft, Estimation of radiation problems around high energy accelerators using calculations of the hadronic cascade in matter, Particle Accelerators 3 (1972) 129.
 S.W. Peterson, J. Polf, M. Bues, G. Ciangaru, L. Archambault, S. Beddar and A. Smith, Experimental validation of a Monte Carlo proton therapy nozzle model incorporating magnetically steered protons, Phys. Med. Biol. 54 (2009) 3217.
 J. Nickolls, I. Buck, M. Garland and K. Skadron, Scalable parallel programming with CUDA, ACM Queue 6 (2008) 40.
 URL: https://developer.nvidia.com
 L. Jahnke, J. Fleckenstein, F. Wenz and J. Hesser, GMC: a GPU implementation of a Monte Carlo dose calculation based on Geant4, Phys. Med. Biol. 57 (2012) 1217.
 X. Du, T. Lui, W. Ji and G. Xu, Evalulation of vectorized Monte Carlo algorithm on GPUs for a neutron eigenvalue problem, Proceedings of International Conference on Mathematics and Computational Methods Applied to Nuclear Science & Engineering (M&C 2013), Sun Valley, Idaho, USA, May 59, 2013, on CDROM, pp.25132522, American Nuclear Society, LaGrange Park, IL (2013).
 NVIDIA Corporation, CURAND library, PG05328032_V01 (2010).
 J. Hoberock and N. Bell, Thrust: a parallel template library, Version 1.7.0 (2010), URL: http://thrust.github.io.
 S.V. Förtsch, A.A Cowley, J.V. Pilcher, D.M. Whittal, J.J. Lawrie, J.C. Van Standen and E. Friedland, Continuum yields from C(p,p) at incident proton energies of 90 and 200 MeV, Nucl. Phys. A 485 (1988) 258.
 M.M. Meier, D.A. Clark, C.A. Goulding, J.B. McClelland, G.L. Morgan, C.E. Moss and W.B. Amian, Differential neutron production cross sections and neutron yields from stoppinglength targets for 113 MeV protons, Nucl. Sci. Eng 102 (1989) 310.
 A.A. Cowley, data taken from ICRU Report 63.
 G.A.P. Cirrone et al., Hadrontherapy: a Geant4based tool for proton/ion therapy studies, Progress in Nuclear Science and Technology 2 (2011) 207.
 D.A. Low, W.B. Harms, S. Mutic and J.A. Purdy, A technique for the quantitative evaluation of dose distributions, Med. Phys. 25 (1998) 656.
 J. Ma, H. Wan Chan Tseung and C. Beltran, A GPUaccelerated and Monte Carlobased intensity modulated proton therapy optimization system, AAPM 56 Annual Meeting THA19A12, Med. Phys. 41 (2014) 535.
 C. Grassberger and H. Paganetti, Elevated LET components in clinical proton beams, Phys. Med. Biol. 56 (2011) 6677.