FullyAsynchronous FullyImplicit VariableOrder VariableTimestep Simulation of Neural Networks
\sidecaptionvposfigurec \newwatermark[firstpage,color=gray!60,angle=90,scale=0.32, xpos=4.05in,ypos=0] arXiv:1907.00670 \newwatermark[firstpage,color=gray!60,angle=90,scale=0.32, xpos=3.9in,ypos=0] arXiv:1907.00670 \newwatermark[firstpage,color=gray!90,angle=0,scale=0.28, xpos=0in,ypos=5in] correspondence: bruno.magalhaes@epfl.ch
1 Introduction
Simulation of the electrical activity of networks of biologically detailed neuron models is a major impact scientific problem, as it allows for the better understanding of the brain. Simulation can be performed on different scales, ranging from molecularlevel to point neuron representations, or more detailed neuron models that include neuronal arborization or high number of complex biological mechanisms. Moreover, simulations may focus on different details of neurons activity, from biochemical reactions to conductancebased simulation models, or simpler models such as the integrateandfire [3]. Our problem is the simulation of the electrical activity of large scale morphologically detailed neuron networks. The model of neurons follows from the HodgkinHuxley (HH) formalism [10], modelling the electrical currents passing though connecting sections of neuron morphologies (spatially discretized as a tree of cylindrical leaky capacitors, henceforth referred to as compartments). Neurons are coupled via electrochemical transductors, denominated synapses. When the voltage at a neuron soma reaches a specific action potential threshold, a neuron spikes (or fires), leading to a chain of biological reactions that changes the voltage at their synapticallyconnected counterparts.
Due to the long simulation time required to express biological phenomena such as learning and synaptic plasticity, the acceleration of the simulation of neural networks is a relevant problem. Similarly to simulations in most scientific fields, existing acceleration efforts follow the Bulk Synchronous Parallel (BSP) execution model, computing several neurons simultaneously via synchronized multithredead and distributed execution. Execution time is divided in communication intervals equivalent to the time duration of the shortest synaptic delay across all neuron pairs in the network, given by the spike propagation delay between a pre and a postsynaptic neuron. A synchronous collective communication call performs both synchronization of stepping and synaptic exchange at the onset of each communication interval. Stepping of neurons is performed independently within the boundaries of each intervals, typically with fixed timestep interpolation, as illustrated in layout a) in Figure 1. Such approach is implemented in the NEURON scientific application [9] for the Single Instruction Single Data (SISD) memory layout and in CoreNEURON[16, 15] for the Single Instruction Multiple Data (SIMD, vector) equivalent.
Further acceleration of individual steps can be achieved on the strong scaling axis with finergrained parallelism of individual neuron models. Stateoftheart methods provide distributed multicore SIMD acceleration via graphparallelism of the resolution of the Ordinary Differential Equations (ODEs) [19], and branchparallelism of neuron topology sections [20]. Variable timestep interpolation on the BSP execution model — pictured in layout b) in Figure 1 — is also possible and has been presented by NEURON with SISD computation [18]. Speculative stepping is allowed by a single step that traverses the synchronization instant, with posterior backstepping for missed events. The step size is limited to the instant of the nearest synaptic or discontinuity event. The main advantage of variable step methods is the dynamic adaptation of the step size to the solution volatility. However, distributed variablestep executions struggle with load imbalance across compute nodes at synchronization barriers and are therefore infeasible for most simulation domains.
Complementary acceleration at the cache level has been pioneered in our previous article [21] — henceforth referred to as our previous work — where we pioneered the fullyasynchronous parallel (FAP) execution model applied to fixed timestep interpolations, illustrated in diagram c) in Figure 1. The underlying logic is that, because the BSP communication interval refers to the shortest event delay across the system, the removal of collective synchronization barriers and stepping neurons based on their pairwise connectivities allows for stepping intervals longer than the BSP communication timeframe. This promotes better processor prefetching, lower data volume across memory layers, data representations being kept longer in cache, and ultimately, a reduced time to solution.
This paper introduces a method for the distributed fullyasynchronous variableorder variabletimestep interpolation of detailed neuron models, that benefits from cacheefficient barrierfree synchronization and performs variable timesteps on the FAP execution model. We show that by following an earliest neuron steps next scheduler, we allow for large time interpolation intervals, and maximise the efficiency of the variable step interpolator beyond what was believed to be possible in BSPbased executions. A simple illustration of our methods is presented in layout d) in Figure 1.
The contributions of this paper are as follows. We present the mathematical formalism underlying the simulation of our use case and its resolution with variable step interpolation. We perform a study of numerical precision on the electrical activity of a single neuron and demonstrate higher numerical accuracy than its fixed step counterpart found in NEURON. We analyse and discuss the benefits and bottlenecks of two distinct distributed execution models — a speculative model with backstepping, and a scheduled nonspeculative model — and provide insights on their feasibility on distributed simulations of large networks of neurons. We demonstrate a low sensitivity of our model to stiffness of solution (spiking rate), and high susceptibility to discontinuity events (synaptic activity) with a guaranteed speedup for a discontinuity rate below approx. 1000 Hz. To measure the relevance of our findings on a real use case, we simulate a laboratory experiment based on the spontaneous activity of 219 thousand neurons, demonstrating a mean discontinuity frequency of 94 Hz, and large periods of absence of discontinuities, demonstrating the suitability of our methods to the problem domain.
An implementation of our methods on the core kernel of the NEURON simulator is detailed, and a benchmark is performed on 64 Cray XE6 compute nodes. Distributed asynchrony and multicore executions on a global memory address space is provided by the HPX runtime system [26]. We simulate five spiking regimes that characterize several dynamics of the mammal brain, on an input of 1024 to 65536 neurons, and compare our methods against five stateoftheart numerical solvers. Results demonstrate a speedup of 54465 for a quiet spiking regime of 0.25Hz representing a majority of neurons in regular brain activity, down to 7.71.8 to a moderate regime of 6.5Hz, and 2 to no acceleration for 38Hz, a pattern of unlike occurrence or short duration. An analysis of the overall performance achievable on the previous laboratory experiment demonstrates a speedup of 224.511.9x for an execution with precise delivery of events, increasing to 225.117.1x and 228.524.6x for two optimized alternatives that group events delivery in the next half and full timestep (a numerical precision similar to the stateoftheart fixedstep methods). To finalize, we show that over 95% of neurons fall in three spiking regimes that guarantee the preservation of the speedup in larger networks, demonstrating good scaling properties of our methods to very large networks of neurons.
2 Methods
2.1 Mathematical Model
The RC circuit that models the electrical current passing through the membrane of a compartment is modelled as:
(1) 
where and describe the conductance and reversal potential of the ionic channels. models the opening probability of the transmembrane ion channel currents, typically described by a voltagegated ODE, and omitted for brevity. Synaptic currents or injected current stimuli, if any, are included in . The branching contributions are provided by Ohm’s Law and the neuronal cable theory [23]: the subscript refers to the index of the parent compartment of , and to an iterator over the indices of its children in the compartmental tree. The variable defines the axial resistance as a function of the diameter and the cytoplasmic resistivity. The RC circuit underlying the current passing through a compartment is illustrated in Figure 2, layout c).
The solution of this system is solved numerically. The complexity of the spatiotemporal model of the neuron activity is reduced by performing a spatial discretization of the neuronal morphology, from biologically inspired to HHbased compartmental representation, and assume the spatial discretization to be small enough, so that the state across compartments’ length is constant. Thus, interpolation of solution is performed for consecutive discrete time intervals only. The resolution follows a fixed step defined as small enough to capture the currents with fastest dynamics, set to milliseconds. The fastest synaptic delay across our network model has been measured as or equivalently computation steps, accounting for of the total synapses, as shown in Figure 3.
2.2 Simple and Complex Neuron Models
A problem specific optimization allows for a substantial speedup on the resolution of simple neuron models such as the HodgkinHuxley, where state variables are described by linear ODEs and depend only on the voltage , and viceversa. An implicit resolution based on interleaved timestepping of voltage and states, by solving voltages at a given time and states at time , allows for the resolution of the system of ODEs as a system of linear equations.
Resolution of complex models, including nonlinear and/or correlated state equations cannot be resolved with the aforementioned method, and require a fullyimplicit (nonstaggered) resolution. A use case of high importance is the model of synaptic plasticity presented by Graupner et al. [7] — with cubic ODEs and correlated calcium and synaptic efficacy values — or Chindemi et al. [5] with higher complexity introduced by ODEs describing synaptic weight and calcium activity models. For such scenarios, numerically reliable resolutions rely on fixedstep iterative implicit methods such as Backward Euler. Alternatively, an implicit variable timestep method with variable order is possible. Its implementation to our use case is detailed in the following section.
2.3 Variable Step Implementation
The CVODE (C Variablestep solver for ODEs, [6]) is an implementation of the Backward Differentiation Formula (BDF) for the variablestep multistep implicit method solving the Initial Value Problem (IVP, , where ) for ODEs as:
(2) 
where is a vector representing the state variables of a neuron or a set of neurons following the variable notation in Equation 1, is the order of the current iteration, and and are the dependent BDFmethod coefficients. BDF1 is the Backward Euler. In brief, CVODE returns the and that solve BDF for an userprovided tolerance (atol). The computation is performed iteratively, with a suggested step size for each iteration based on the solution gradient and order . The implicit resolution relies on Newton iterations, with a stop condition based on the test for iteration in step : If error is greater than threshold, a reiteration follows with a smaller ; if error is smaller, proceeds to step with larger . The user provides the function that computes the ODE righthand side for a given value of time and state vector ; and the function that computes the Jacobian or an approximation to it. We utilise the Jacobian function with the CVODEdefined preconditioning function that solves , where is the input and approximates . This is the default Jacobian in NEURON. Two alternative Jacobian implementations were tested and deemed infeasible: (1) a diagonal linear approximation to the Jacobian, given by , displayed faster computation yet highly inaccurate results; and (2) a dense matrix approximation for a parameter variation , accurate yet infeasible due to the high time to solution and memory requirements for the dense matrix of states.
2.4 Asynchronous Timestepping of Neuron Networks
Control of neurons time advancement on synchronized distributed executions is a solved problem, by enforcing a BSPlike synchronization barrier [9], as in layouts a) and b) in Figure 1. Here we discuss alternatives following an asynchronous execution model. We implemented and analysed the feasibility of two distinct fullyasynchronous barrierfree execution models, displayed in Figure 4 and detailed next.
The initial approach is inspired on the speculative interpolator previously designed for NEURON for a single compute node [18]. Neurons are described by individual interpolators, and advance in time under the best assumption that no discontinuity of solution (synaptic current) will arrive with a delivery time earlier than the neuron current time. Discontinuities lead to a reset of the IVP problem and interpolator state history, and consequently to small steps in the following iterations. When a discontinuity is required to be delivered in a past instant in time, a backstepping operation must precede, in order to reset the recent step and interpolate neuron state back to a time instant of confidence (the time of the discontinuity). Simulations of small neuron networks on single compute nodes are possible and have previously shown a substantial runtime acceleration utilising this model [18]. The state of neurons is local to the compute node, thus backstepping of neuron states and synaptic activity are not computationally heavy and do not require communication. However, in our implementation this approach demonstrated two main drawbacks: (1) large spiking networks lead to a high number of IVP resets, and large amount of time spent on speculative stepping with posterior backstepping. This is mainly due to, in practice, one being unable to tell which neuron should be stepped at a time, such that the risk of backstepping would be minimized and the step length maximized. Moreover, (2) on distributed executions, a main problem arises when synaptic activity (exchanged across different compute nodes) needs to be reverted, requiring further communication. On large networks of neurons, the reversal may lead to an extremely complex cascade chain of reversal notifications and backstepping across neurons on different compute nodes, leading to a high communication and computation workload, deeming this methodology infeasible.
An alternative approach was implemented, based on the nonspeculative asynchronous stepping methodology detailed in our previous work[21]. Neurons hold a map storing the time instant of their presynaptic connectivities. The map is updated by stepping notifications received actively at a certain frequency, throughout the stepping of its presynaptic dependencies. The frequency value is set at a neuron pair level, to a minimum value that guarantees no deadlocking. More importantly, this value is at least the BSP communication interval of , and can reach up to several milliseconds (Figure 3), thus minimising overall communication. Neurons step to the maximum time allowed by their synaptic connectivities. This guarantees synapses to be delivered in future time instants, thus removing backstepping and reversion of sent synaptic spikes. The method is improved with an earliest neuron steps first scheduler at each compute node that keeps track of neurons advancement, and picks the earliest neuron in time as the next to interpolate. This guarantees the maximisation of the step length and provides a larger variable step interval. Due to the reduced communication and computation, the removal of solution resets and backstepping, and larger stepping intervals, this method will be the default used in the asynchronous variable step simulations of networks of neurons.
2.5 Implementation details
Our methods were implemented on the core kernel of the NEURON simulator [2]. The mathematical specifications of biological mechanisms were provided by the NEURON modelling language (NMODL). SIMD capabilities were implemented on all fixed and variable timestep methods discussed hereafter. Code changes to support SIMD capabilities were added to the MODLtoC code generator [8]. Communication, synchronisation control objects, memory allocation, threading, distributed execution and parallelism were implemented with HPX [26], the runtime system for the Parallex execution model [12]. HPX provides a multithreaded, cooperative thread scheduler, a global address space (GAS), an activemessage parcel transport, a group of globally addressable local synchronization object classes, and a dynamic workload scheduler. These features makes HPX very adequate for dynamic adaptive execution of high scalability and high concurrency asynchronous problems on distributed networks of nodes. The following features were implemented with HPX on a Global Address Space: a memory linearisation of a neurons static and dynamic data structures; an active pointtopoint asynchronous communication protocol for stepping and synapse notification; a reduce of communication by aggregating neurontoneuron into into neurontocompute node communications; and a neuron scheduler tracking neurons’ time advancement and dynamically allocating computing resources to the earliest neuron in time. Implementation details have been covered in our previous manuscript [21], and will be omitted for brevity.
3 Results
3.1 Numerical Accuracy
Backward Euler is an Astable and Lstable method of order 1. The numerical accuracy of CVODE depends on the order of the Backward Differential Formula iteration and the userprovided tolerance. The underlying BDF is Astable at order 1 or 2. At orders 3 to 5, it is not Astable but stifflystable: the region of instability grows as the order increases from 3 to 5, and stability depends on the step size [6].
We compare the numerical accuracy of both models by measuring the time difference of the main unit of interest in the activity of spiking neuron networks — the spiking time instants. Figure 5 presents the voltage trajectory and number of steps of a current clamp experiment for a single () and several () spikes of a layer 5 pyramidal cell. The default settings utilised in NEURON are plotted in red, with and for Euler and CVODE methods, respectively. The reference solutions are plotted in black, with and , and their numerical difference is considered negligible.
Results on the single spike voltage trajectory () display a reduced step count and better adaptation to trajectory change when comparing CVODE to Euler method. The rationale behind the better performance of adaptive stepping is that it is gradient sensitive and thus it follows the natural behaviour of a neuron: a cell spike requires small timesteps for higher precision, followed by interspike intervals or resting periods with little synaptic input therefore allowing large step sizes. CVODE displays less steps during long periods of low gradient (e.g. ), and greater number of steps for steep trajectories (the uprising trajectory of the spike) and sudden changes in gradient (the trajectory proximal to the peak voltage).
The simulation studies the impact and numerical accuracy of both interpolators to longer executions. Results display a phase shift in solution (measured as the time difference between peak voltage values of the reference and benchmark curves) that increases with the increase of the step size on the Euler methods. In practice, the timestep determines the fastest reaction time of the system. Thus, a large timestep will inevitably cause the system dynamics to be slow. The analysis of the performance of the CVODE tolerance values show that tolerance of approximates the resolution of the default Euler step size of , with a significant reduction of in step count. At longer runs, the variable step demonstrated to be more precise, due to no accumulation of phase shift, with the maximum trajectory shift measured at approximately . On the other hand, a tolerance value of approximates closely the optimal solution with less steps, and with a margin of error similar to its Euler counterpart for the period of , while yielding less interpolations.
On a longer execution, the measured phase shifting of the trajectory for a period of of simulation (omitted for brevity) was of approximately for , and for , considered a large value in the timescale of neurons activity. The execution displayed demonstrated a phase shifting of approximately a tenth of the whole second simulation.
3.2 Performance Dependency on Solution Stiffness
We measured the response of both stepping methods with the NEURON default parameters to different levels of changes in spiking frequency. Changes in trajectory were enforced by injecting a continuous current of a given amplitude on a neuron during . Current intensity is measured as a percentage of the threshold current, the minimum continuous current value that needs to be injected to force the neuron to spike. Performance was measured in terms of steps count and time to solution on an Intel i5 at 1.6 GHz. Results are presented in Figure 6, and demonstrate that high dynamics of the solution degrade the CVODE performance. This is justified by CVODE requiring smaller steps on high trajectory variations in order to respect the absolute tolerance value. For the range of tested scenarios, CVODE runs on less steps and shorter time to solution when compared to Backward Euler. For a neuron without any spikes, or equivalently for a current injection below the spiking threshold, it was measured a reduction of: (1) in step count and in runtime for injected currents below ; (2) step count and runtime for of the threshold current; and (3) step count and runtime for of the threshold current, a worst case scenario of little probability of occurrence.
3.3 Performance Dependency on Number of Discontinuities
We measured the effect of discontinuities on both methods by injecting several current pulses at a fixed frequency on a neuron soma, in the same Intel i5 compute architecture, mimicking synaptic events. The experiment results are displayed in Figure 7 and suggest that the CVODE performance depends on the trajectory change from each discontinuity, i.e. the amplitude of the current injected. This is due to the increase of the voltage trajectory being dependent on the amount of current injected. The larger the voltage increase, the larger the change in trajectory gradient, thus the more interpolation steps are required. Furthermore, and as expected, results demonstrate that the number of discontinuities plays a major role on the CVODE performance. CVODE is shown to deliver a reduction of steps in the order of for a frequency of 10 discontinuities per second for the current values of to . The step count equilibrium between Euler and CVODE method lies in the interval of for the same currents interval.
The runtime demonstrates a similar dependency on the injected current, yielding a speedup of for decreasing linearly up to the speedup equilibrium value at for the strongest current. For the lightest current injected, a speedup of is visible for the discontinuity rate, decreasing to an Euler matching value at circa events per second (). Although this exercise does not represent the stochastic pattern of spikes arrival on real neurons, typically described by a Poisson distribution with a long tail, it provides an estimation of the CVODE speedup allowance. With that in mind, the following section measures the discontinuity rates on a simulation of a laboratory experiment.
3.4 Simulation of a Laboratory Experiment
We tested the suitability of variable step methods to our problem by measuring the spiking activity of a simulation of 7.5 secs of electrical activity mimicking a laboratory experiment. The experimental setup performs a fixed step simulation of the spontaneous activity of 219.247 neurons during tonic depolarization. The network exhibits spontaneous slow oscillatory population bursts, initiated in layer 5 (L5), spreading down to L6, and then up to L4 and L2/3 with secondary bursts spreading back to L6. For further details, refer to section Simulating Spontaneous Activity in [22].
A representative distribution of discontinuity events for three groups of neurons — organized by highest 1%, median 1%, and lowest 1% number of discontinuities — is displayed in Figure 8. The simulation incurred a total of circa 155 million events, with the following distribution: (a) top 1% of neurons, between 3040 and 6146 events in 7.5 secs, or 405820 Hz; (b) median 1%, from 541 to 558 events (7274.4 Hz); and (c) bottom 1%: less than 100 events (10 Hz). The average number of events was of 707 events for the 7.5 secs of simulation, or equivalently, 94 Hz, significantly below the threshold discussed in the previous section. Moreover, the results on the distributions of time interval between discontinuities, plotted in red on the right, display large periods of silence between events arrival in the median and bottom use cases, but not on the top, suggesting the suitability of adaptive stepping to most (but not all) neurons in the population.
4 Benchmark
We simulate one second of the electrical activity of a digitally reconstructed neural network extracted from the model of Markram et al. [22]. Neuron models include 23 distinct biological mechanisms types modelled by 44 ODEs, and highly heterogeneous neuron morphologies. Each neuron requires a storage of 515 MB for its state, times for a order interpolation. Our CVODE solver is defined to utilise the default maximum BDF order value of 5.
On the setup of the simulation test bench, it is relevant to mention that neuronal activity is highly dependent on the mammal specie, brain region and momentary activity, among other factors. Furthermore, the network behaviour simulated must approximate a real use case, as spiking activity affect heavily the performance of variable step methods, as detailed in Section 3.3. An analysis of a single simulation combining several brain dynamics would be of little interest in the context of efficiency analysis, due to all free variables that affect performance. Therefore, our test bench simulates and studies the efficiency of five different brain dynamics described in the literature:

a model of quiet dynamics with a mean spiking rate of 0.25 Hz per neuron, representing neurons almost at rest and/or with little activity. This model provides an upper bound of the runtime of circa 90% of neurons in the brain during regular activity — see Table 1 in Shoham et al. [24] for evidence of silence and highly sparse activity among neurons; supported by Kerr et at. [13] for estimates of rat’s neocortex spiking rate at circa 0.1 Hz due to sparsity of activity; and the estimation of 0.16 Hz by Lennie et al. [17] based on brain energy levels;

a model of slow dynamics at 1.5Hz, representing the lower bound of active neurons, described next;

a model of moderate dynamics with a spiking rate of 6.5 Hz, an approximation of the irregular regime of slow oscillations displayed by the Brunel network [4]; also an upper limit to the 0.0055Hz spiking rate of the rate frontal cortex [27]; and a lose approximation of the to visual cortex of the cats in the 34 Hz interval[1].
The following benchmarks measure the scaling of our methods by simulating quasihomogeneous activity across neurons on densely connected networks. This setup is favourable to fixed step and BSPbased methods, therefore the results presented next are a lower bound of possible acceleration. Neurons activity is triggered by a constant current injection in all neurons throughout the whole duration of the simulation, strong enough to approximate the spiking rate of the network to the regimes described. For complete coverage of the topic, we include the following stateoftheart solvers for simple neuron models (labelled 1a to 1c, and restrained to linear ODEs with uncorrelated states) and complex models (2a2c), both detailed previously in Section 2.2:
 1a)

the cnexp fixed step solver in NEURON, with added SIMD, providing an intercalated resolution of current and states as linear equations, with an analytical resolution of the first order ODE describing state variables;
 1b)

the Euler solver in NEURON, with added SIMD, resolving the currentstates dependency with an explicit Euler method with staggered timestepping, and as a linear equations; a model computationally less expensive than the previous due to no exponential and division operator;
 1c)

the same Euler method on a fullyasynchronous parallel execution model, presented in our previous work, and illustrated in diagram c in Figure 1;
 2a)

the BSP fixed step derivimplicit solver available in NEURON, with added SIMD, with interleavedtimestep resolution of current as a linear equation, and implicit resolution of individual mechanism state ODEs;
 2b)

the BSP variable step method in NEURON with added SIMD and a collective communication barrier, show in diagram b) in Figure 1; and
 2c)

the SIMDenabled fullyasynchronous protocol with variable timestepping introduced in this paper (diagram d in Figure 1).
The following biological constraints were taken into account to guarantee that variablestep runtimes represent a biologically plausible use case: we verified that there were no continuous periods of silence in the network, and no collective synchrony of spiking or voltage trajectory across neurons, that would promote additional efficiency in variable timestep methods. Our input data is retrieved from neurons in layers 4 and 5 of the brain, typically represented by the longest dendritic trees, therefore representing a worst case scenario in terms of number of presynaptic (dependency) neurons. Finally, the number of synapses as AMPA and GABA receptors was measured, and are characterized by a mean of 2289.7 and 4418.8, and a std. dev. of 1284.9 and 3200.4 per neuron, below the counts described in the literature, but inline with the reference digital reconstruction of the rodent brain detailed in Section 3.4.
Execution times were collected on 64 Cray XE6 compute nodes, powered by an AMD Opteron 6380 with 16 cores at 2.5 GHz, 64 GB of RAM and 256bit floating point units. Efficient pointtopoint communication and remote direct access memory is provided with specialized Infiniband network hardware, interfaced via the photon API library [14].
To study the dependency of the algorithm on the input size and synaptic connectivity, we tested our methods in neural networks ranging from 1024 to 65536 neurons, a scale that approximates two columns in the rodent neocortex, and the maximum allowed due to memory requirements of the BDF solver of order 5 we used. Neurons were equally distributed in a round robin fashion across compute nodes. Due to the long simulation time required, runtimes for moderate, fast and burst dynamics were extrapolated from an execution of , and , respectively.
The benchmark results are presented in Figure 9. The FAP variable step method (2c) is presented alongside two variants — labelled 2c and 2c — that group and deliver instantly the discontinuity events within an interval equivalent to the timestep of the interleaved fixed timestep method, and the of the regular fixed step methods, respectively. This approach yields a level of reduced precision in the delivery of events — similar to fixed step methods — however maintaining the same high variableorder variablestep accuracy during periods of activity without discontinuities, with the main advantage of reducing significantly the number in IVP resets. CVODEbased executions are displayed for an absolute tolerance (atol) of , a value equal to the default value in the NEURON simulator. Executions with an absolute tolerance of yielded a reduction in runtime of 5%8%, and were omitted for brevity. The performance analysis for fixedstep simple model solvers (1a1c) was covered in depth in our previous work [21], and is omitted for brevity.
4.1 Fixed vs VariableTimestep Interpolators
Fixed step methods do not yield significantlydifferent execution times across different spiking regimes. This is due to the homogeneous computation of neuron state updates throughout time, and the light computation attached to synaptic events and collective communication not yielding a substantial increase of runtime. The difference in execution times measured across the five regimes was of about , which we consider negligible. On the other hand, as expected, variable step executions are penalized on regimes with high discontinuity rates. It is noticeable that the runtimes of fixed and variablestep solvers approximate as we increase the spiking rate, i.e. the increase of runtimes with the input size is steeper for variable timestep (2b and 2c) compared to fixed timestep methods (2a). This is due to discontinuities in variablestep being delivered throughout a continuous time line, compared to the discrete delivery instants of the fixedstep methods — therefore increase the number of interpolation steps; and the iterative model of the variable timestep reinitializing the state computation with small step sizes on each IVP reset, compared to the constantsized step of fixed step methods. A remarkable performance is visible on the quiet dynamics use case, where our fullyimplicit ODE solver of complex models (with Newton iterations), still runs faster than the simple solver resolving only a system of linear equations. The underlying rationale is that — despite the inherent computation cost of Newton iterations in the variable step methods — the low level of discontinuities allow for very long steps, that surpass the simulation throughput of simple solvers running on fixed step methods. The measured speedup of our reference method (2c) compared to the reference fixed step method (2a) was of 54465 across input sizes for the quiet dynamics, down to 7.71.8 to the moderate dynamics. The fast dynamics presented a speedup of twofold for the dataset of 1024 neurons, and a similar runtime for the 66K neurons. The burst dynamics, although of very unlikely probability of occurrence, demonstrated an acceleration of 1.5 for 1024 neurons and a deceleration of 1.5 for 66K neurons.
4.2 Variable Step Event Grouping
On the analysis of the performance of the CVODE with grouping of events within half fixed timestep (2c), when applied to the largest dataset tested, the previous acceleration was reduced to 47 for quiet, 4.4 for slow, and 1.2 for moderate dynamics, with an inferior performance on the remaining regimes. A further reduction of speedup to 33 for quiet and 1.9 for slow dynamics was noticeable on the CVODE implementation without events grouping (2c), with lower performance for the remaining spike regimes. Although being more precise and solving correlated states implicitly, this method runs slower than the reference implicit fixed step method 2a in the use cases characterized by a high number of neurons and/or strong network activity. This goes in line with the conclusions in Section 3.4, confirming that performance is activity dependent, and the achievable speedup depends on the network connectivity.
The speedup introduced across FAP CVODE variants (2c) increases with the amount of discontinuities in the system — correlated to high network activity or size — as the efficiency of the event grouping method is related to the amount of events in the same grouping interval that are delivered at once. No difference in runtime was measured for the smallest dataset tested due to the high sparsity of synaptic events in small networks.
4.3 FullyAsynchronous vs BulkSynchonous Execution Models
Our next analysis focuses on the performance difference between the BSP and FAP execution models. Results show that the runtimes of both implementations approximate with the increase of the input. This is visible by comparing the fixed step trajectories 1b and 1c , and the variable step trajectories 2b and 2c. For small network sizes, the difference in runtime is noticeably few orders of magnitude higher than for larger network sizes. On large models, the runtimes are similar. This property was demonstrated in our previous work. In brief, an increase of network leads to a higher number of network connectivity, therefore reducing the maximum allowed stepping interval per neuron, and approximating it to the minimum communication delay utilised in the BSP methods. On fixed step methods, it is noticeable a similar runtime on large (66K) networks of neurons, as timesteps are computationally homogeneous. On variable step methods, similar runtimes are only noticeable when significant network activity is present (moderate, fast and burst dynamics), as little network activity leads to few discontinuities, allowing stepping intervals to be modelled in a reduced number of variable steps.
4.4 Runtime Dependency on Input Size and Spike Activity
It is known that, on simulations of small networks, variabletimestep methods yield a significant acceleration in time to solution compared to fixed timestep methods [18]. The rationale is that the small number of neurons in the network leads to a reduced number of synapses per neuron, thus less discontinuities. However, the connectivity in larger networks reaches up to 10 thousand synapses per neuron, with the number of discontinuities being related also to the overall network activity. The question lies now on which conditions are required for similar computation complexity in both interpolators. To that extent, we measured the regions of similar runtime growth for the reference fixed step (2b) and our variable step methods (2c). The region is labelled as in Figure 9.
As expected, fixed step methods yield a quasilinear runtime growth with the increase of the input size, and are independent of the spiking regime, due to the almost ideal scaling of the algorithm in the BSP model. On the other hand, the runtime of variable timestep methods — dependent on the number of discontinuities — demonstrates a rapidly increasing growth with the input size outside the region of similar growth, and almost linearly inside. Moreover, as it depends on the network activity, the lower limit of the region increases with the spiking rate, and is delimited at 16.4K, 32.8K and 32.8K neurons or more for the quiet, slow and moderate dynamics, while not visible in the fast and burst dynamics. In the three spiking regimes where such region exists, the similar growth in both approaches provides a confidence of the scaling capabilities of our methods in larger network models. This is of high importance as it provides an estimation of runtime upper bound in simulations combining neurons with heterogeneous spiking rates, as discussed next.
4.5 Overall Runtime Speedup Estimation
To conclude our analysis, we computed an estimation of the performance acceleration on a simulation combining several spiking regimes. For that purpose, we measured the the distribution of neuron spike rates and neurons per spiking regime, following the laboratory experiment simulation described in Section 3.4. Estimations were collected from the central minicolumn (31346 neurons) of the 219K neurons network, to avoid boundaryeffects that would improve results due to neurons placed on the edges, with reduced connectivity. The counts relate to the 24s simulation time interval, to exclude the initial artificial synaptic burst due to the current injection. The results are presented in Figure 10. The measured percentage of neurons on each regime is 31.43%, 38.44%, 27.02%, 3.10% and 0.01%, relating to 68.9K, 84.3K, 59.2K, 6.8K and 22 neurons. Following the runtimes described in Section 4.1, the speedup range for the interval of 102466K neurons when comparing our methods with the stateoftheart solver for complex models (2a) are estimated as: 224.511.9x for the variable step method with precise event delivery (2c); 225.117.1x for the similar implementation with delivery of events within the next half timestep (2c); and 228.524.6x for the use case with fulltimestep event group delivery (2c). As a side note, the percentage of neurons in the quiet dynamics regime does not agree with the 90% described in the literature. We believe this is due to the reduced size of the network, and that simulations of larger networks and complete brain models include a larger portion of neurons in this regime. Therefore, we assume these results to be a lower bound estimate of possible acceleration.
Moreover, since the quiet, slow and moderate dynamics regimes weight over 95% in the runtime calculation, and as for datasets above 32.8K the reference vs benchmark runtimes have a similar runtime growth in those regimes, we believe the overall runtime for larger circuits do not yield a significant reduction in the speedup values presented, and that the scaling properties are almost fullypreserved on larger networks.
Discussion
This paper presented a strategy for the fullyasynchronous distributed simulation of detailed neuron models with variableorder variabletimestep interpolation. We detailed stateoftheart approaches based on the Bulk Synchronous Parallel execution model (BSP), their limitations on the numerical resolution of complex neuron models, and computation load imbalance at synchronization barriers in variablestep simulations. We discussed the problem of synaptic exchange and synchronization on networks of neurons on speculative variablestep executions, and the inherent issue of a cascade chain of states reset and reversal of false synaptic events. To overcome this issue, we proposed an alternative nonspeculative scheduled execution model on a distributed network of compute nodes. The approach follows a novel FullyAsynchronous Parallel (FAP) execution model (yielding asynchronous computation, communication and synchronisation), that relies on the individual stepping of neurons based on the time instant of their synaptic connectivities, avoiding backstepping and allowing for stepping intervals beyond the BSPbased synchronization interval. Step lengths are maximised by a scheduler that tracks neurons’ time advancement and dynamically allocates compute resources to the earliest neurons in time.
We performed an analysis of numerical accuracy and demonstrated better precision and less interpolation steps of the methods presented compared to the reference fixedstep implicit solver in the NEURON scientific application. An analysis of variablestep performance based on step count and time to solution was performed on three experimental setups. (1) A continuous current injection causing fast gradient changes in state on a single neuron demonstrated a reduced step count and runtime, and low dependency of performance on the stiffness of solution. (2) An injection of a sequence of current pulses simulating network activity demonstrated a high dependency of step count and runtime on incoming synaptic activity, which cause a solution discontinuity and a reset of solution state. We demonstrated a reduced step count for an incoming spike current rates between 1000 and 1600 Hz, depending on the current value. (3) A digital reconstruction of a laboratory experiment on a network of 219 thousand neurons measured the relevance of our result to a real use case. Results displayed a highly heterogeneous distribution of discontinuity rates across neurons, demonstrating the suitability of our methods to the problem domain.
A proof of concept simulator was implemented on top of the core kernel of the NEURON scientific application. Distributed asynchrony, global memory addressing space, remote procedure calls, threading, synchronization and communication methods were replaced and implemented by the HPX runtime system [26]. We analysed and benchmarked five spiking regimes that describe distinct patterns of brain activity in mammals, and six stateoftheart solvers for simple and complex neuron models. Benchmark results on a network of 102465536 neurons demonstrate an overall reduction in runtime in the order of 224.511.9x for the most accurate method with precise delivery of events in a continuous timeline, and 225.117.1x and 228.524.6x for two optimized variants with halfstep and fullstep grouped delivery of events. We demonstrated that performance is activitydependent and that almost ideal scaling is possible, as over 95% of neurons in a biologicallyinspired neuron network — from a simulated laboratory experiment — fall in spiking regimes with guaranteed preservation of the speedup achieved.
The fullyasynchronous variablestep methods presented open the prospectus for the redesign of simulations across a wide range of scientific domains, so far limited to synchronous fixedstep interpolation on the BSP execution model. Furthermore, the scaling, accuracy and performance demonstrated can be further improved with enhanced numerical resolution and asynchronous computation. Thus, as future work, we intend to study the simulation of further laboratory experiments, a lowerorder BDF for reduced memory usage, finetuned tolerance values, a mechanism for the fallback to standard Backward Euler for neurons in fast and burst spiking regimes, and a carefullyspeculative execution model performing speculative stepping while avoiding the initiation of cascades of solution resets throughout neurons. Moreover, we plan to investigate dynamic load balancing of neurons across compute nodes based on recent momentary activity. Combined with the FAP execution model presented, it opens new prospectus in solving the longstanding problem of computational imbalance at synchronization barriers that characterizes BSPbased variablestep simulations.
Acknowledgements
The work was supported by funding from the ETH Domain for the Blue Brain Project (BBP). The supercomputing infrastructures were provided by Indiana University. A portion of Michael Hines efforts was supported by NINDS grant R01NS11613. The authors would like to thank Francesco Cremonesi for scientific discussions and Max Nolte for the provision of the spiking rate information.
Mean spiking frequency rate (Hz) for Linear Current Injections

Input: continuous linear current (mAmp);

Dataset: PCP Layers 4 and 5;
Cell keys
0.25 Hz  quiet dynamics 
1.5 Hz  slow dynamics 
6.5 Hz  moderate dynamics 
38 Hz  fast dynamics 
55.8 Hz  burst dynamics 
Current  Network size  

(mAmp)  32  64  128  256  512  1024  2048  4096  8192  16384  32768 
0.01  0  0  0  0.21  0.22  0.23  0.27  0.25  0.25  0.25  0.25 
0.02  0  0  0.2  1.4  1.4  1.4  1.3  1.2  1.1  0.9  0.8 
0.03  0  1.2  3.4  3.1  2.9  2.7  2.5  2.4  2.0  1.8  1.5 
0.04  0  2.4  6.8  5.8  5.6  4.8  4.2  3.4  2.8  2.4  2.0 
0.05  0.5  1.0  4.2  10.0  7.6  6.9  5.9  4.7  3.7  3.02  2.5 
0.06  1.5  6.21  14.0  10.2  9.9  8.7  7.7  6.1  4.7  3.8  3.1 
0.08  4.1  9.0  17.5  12.7  12.4  11.6  11.0  9.0  6.9  5.3  4.3 
0.10  5.7  10.9  20.2  14.8  14.4  13.8  13.5  11.8  9.3  7.1  5.7 
0.15  7.7  14.2  25.7  19.3  19.0  18.4  18.4  16.8  14.7  11.5  9.6 
0.20  9.7  17.6  31.6  24.0  23.6  23.0  23.0  21.4  19.2  15.5  12.5 
0.25  11.7  21.5  39.1  30.2  29.6  28.7  28.8  26.9  24.5  20.9  16.3 
0.30  13.1  25.5  47.7  36.6  35.9  34.9  35.0  32.6  29.6  25.7  20.8 
0.40  16.7  33.2  64.0  48.9  48.2  47.1  47.7  45.3  41.6  35.8  29.3 
0.50  18.5  40.7  81.0  61.2  60.4  60.9  59.7  57.5  53.5  46.9  38.2 
0.75  23.2  53.5  108.7  78.8  77.1  77.1  76.6  75.7  73.0  67.3  55.6 
1.00  26.2  60.5  123.2  80.3  82.4  —  —  —  —  —  — 
1.50  30.1  33.0  38.4  47.1  48.8  —  —  —  —  —  — 
2.00  33.2  40.0  53.0  49.6  49.7  —  —  —  —  —  — 
3.00  36.8  30.6  22.0  45.9  46.5  —  —  —  —  —  — 
4.00  39.6  33.3  19.4  21.5  48.3  —  —  —  —  —  — 
5.00  44.1  37.2  19.9  14.3  14.2  —  —  —  —  —  — 
6.00  24.3  25.1  13.7  10.2  9.2  —  —  —  —  —  — 
7.00  29.6  17.9  14.9  10.8  9.5  —  —  —  —  —  — 
References
 (1997) Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society of London. Series B: Biological Sciences 264 (1389), pp. 1775–1783. Cited by: item 3.
 (2015) CoreNeuron  simulator optimized for large scale neural network simulations. GitHub. Note: \urlhttps://github.com/bluebrain/CoreNeuron Cited by: §2.5.
 (2007) Simulation of networks of spiking neurons: a review of tools and strategies. Journal of computational neuroscience 23 (3), pp. 349–398. Cited by: §1.
 (2000) Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of computational neuroscience 8 (3), pp. 183–208. Cited by: item 3, item 4, item 5.
 (2018) Towards a unified understanding of synaptic plasticity: parsimonious modeling and simulation of the glutamatergic synapse lifecycle. EPFL Infoscience scientific publications. Cited by: §2.2.
 (1996) CVODE, a stiff/nonstiff ode solver in c. Computers in physics 10 (2), pp. 138–143. Cited by: §2.3, §3.1.
 (2012) Calciumbased plasticity model explains sensitivity of synaptic changes to spike pattern, rate, and dendritic location. Proceedings of the National Academy of Sciences, pp. 201109359. Cited by: §2.2.
 (2000) Expanding neuron’s repertoire of mechanisms with nmodl. Neural Computation 12 (5), pp. 995–1007. Cited by: §2.5.
 (1997) The neuron simulation environment. Neural computation 9 (6), pp. 1179–1209. Cited by: §1, §2.4.
 (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology 117 (4), pp. 500–544. Cited by: §1.
 (2014) Digital reconstruction of neocortical microcircuitry. EPFL. Note: \urlhttps://bbp.epfl.ch/nmcportal/welcomeAccessed: 20160131 Cited by: Figure 5, Figure 6, Figure 7.
 (2009) Parallex an advanced parallel execution model for scalingimpaired applications. In Parallel Processing Workshops, 2009. ICPPW’09. International Conference on, pp. 394–401. Cited by: §2.5.
 (2005) Imaging input and output of neocortical networks in vivo. Proceedings of the National Academy of Sciences 102 (39), pp. 14063–14068. Cited by: item 1.
 (2016) Photon: remote memory access middleware for highperformance runtime systems. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pp. 1736–1743. Cited by: §4.
 (2019) CoreNEURON : an optimized compute engine for the neuron simulator. External Links: arXiv:1901.10975 Cited by: §1.
 (2016) Leveraging a clusterbooster architecture for brainscale simulations. In International Conference on High Performance Computing, pp. 363–380. Cited by: §1.
 (2003) The cost of cortical computation. Current biology 13 (6), pp. 493–497. Cited by: item 1.
 (2005) Independent variable timestep integration of individual neurons for network simulations. Neural computation 17 (4), pp. 903–921. Cited by: Figure 1, §1, §2.4, §4.4.
 Exploiting implicit flow graph of system of odes to accelerate the simulation of neural networks. Proceedings of IEEE International Parallel & Distributed Processing Symposium (IPDPS). Cited by: §1.
 (2019) Asynchronous simdenabled branchparallelism of morphologicallydetailed neuron models. Frontiers in Neuroinformatics. Cited by: §1.
 (2019) Fullyasynchronous cacheefficient simulation of detailed neural networks. Proceedings of International Conference on Computational Science (ICCS), pp. 421–434. Cited by: Figure 1, §1, §2.4, §2.5, §4.
 (2015) Reconstruction and simulation of neocortical microcircuitry. Cell 163 (2), pp. 456–492. Cited by: Figure 3, §3.4, §4.
 (2008) Neuronal cable theory. Scholarpedia 3 (5), pp. 2674. Note: revision 121893 Cited by: §2.1.
 (2006) How silent is the brain: is there a “dark matter” problem in neuroscience?. Journal of Comparative Physiology A 192 (8), pp. 777–784. Cited by: item 1.
 (1998) Dynamic properties of corticothalamic neurons and local cortical interneurons generating fast rhythmic (30–40 hz) spike bursts. Journal of Neurophysiology 79 (1), pp. 483–490. Cited by: item 4.
 (201404) Towards exascale codesign in a runtime system. In Exascale Applications and Software Conference, Stockholm, Sweden. Cited by: §1, §2.5, Discussion.
 (2016) Network homeostasis and state dynamics of neocortical sleep. Neuron 90 (4), pp. 839–852. Cited by: item 3.