BioSimulator.jl: Stochastic simulation in Julia

BioSimulator.jl: Stochastic simulation in Julia

Alfonso Landeros Department of Biomathematics, David Geffen School of Medicine at UCLA, USA Timothy Stutz Department of Biomathematics, David Geffen School of Medicine at UCLA, USA Kevin L. Keys Department of Medicine, University of California, San Francisco, USA Alexander Alekseyenko Department of Public Health Sciences, Medical University of South Carolina, USA Janet S. Sinsheimer Department of Human Genetics, David Geffen School of Medicine at UCLA, USA Kenneth Lange Department of Biomathematics, David Geffen School of Medicine at UCLA, USA Mary Sehl
\DeclareUnicodeCharacter

3F5 \setminted[julia]fontfamily=tt,fontsize=,breaklines

Corresponding author at:
Division of Hematology-Oncology
Department of Medicine
100 UCLA Medical Plaza, Suite 550, Los Angeles, CA 90095–7059, USA
Tel.: +1/310/825–9203
Fax: +1/310/443–0477

E-mail addresses: alanderos@ucla.edu (A. Landeros), stutztim@ucla.edu (T. Stutz), klkeys@g.ucla.edu (K. L. Keys), alekseye@musc.edu (A. Alekseyenko), JanetS@mednet.ucla.edu (J. S. Sinsheimer), klange@ucla.edu (K. Lange), msehl@mednet.ucla.edu (M. E. Sehl)

Abstract

Background and Objectives: Biological systems with intertwined feedback loops pose a challenge to mathematical modeling efforts. Moreover, rare events, such as mutation and extinction, complicate system dynamics. Stochastic simulation algorithms are useful in generating time-evolution trajectories for these systems because they can adequately capture the influence of random fluctuations and quantify rare events. We present a simple and flexible package, BioSimulator.jl, for implementing the Gillespie algorithm, -leaping, and related stochastic simulation algorithms. The objective of this work is to provide scientists across domains with fast, user-friendly simulation tools.

Methods: We used the high-performance programming language Julia because of its emphasis on scientific computing. Our software package implements a suite of stochastic simulation algorithms based on Markov chain theory. We provide the ability to (a) diagram Petri Nets describing interactions, (b) plot average trajectories and attached standard deviations of each participating species over time, and (c) generate frequency distributions of each species at a specified time.

Results: BioSimulator.jl’s interface allows users to build models programmatically within Julia. A model is then passed to the \mintinlinejuliasimulate routine to generate simulation data. The built-in tools allow one to visualize results and compute summary statistics. Our examples highlight the broad applicability of our software to systems of varying complexity from ecology, systems biology, chemistry, and genetics.

Conclusion: The user-friendly nature of BioSimulator.jl encourages the use of stochastic simulation, minimizes tedious programming efforts, and reduces errors during model specification.

Keywords: stochastic simulation, Gillespie algorithm, -leaping, systems biology, Julia language

1 Introduction

Biological systems with overlapping feedback and feedforward loops are often inherently stochastic. Furthermore, large, complex systems are mathematically intractable, and dynamical predictions based on deterministic models can be grossly misleading [29, 13]. Stochastic simulation algorithms based on continuous-time Markov chains allow researchers to generate accurate time-evolution trajectories, test the sensitivity of models to key parameters, and quantify frequencies of rare events [16, 35, 12, 34]. Stochastic simulation is helpful in cases where (a) rare events, such as extinction or mutation, influence system dynamics, (b) population compartments, such as numbers of biochemical molecules, are present in small numbers, and (c) population cycles arise from demographic stochasticity. Examples of such systems include gene expression networks, tumor suppressor pathways, and demographic and ecological systems. The current paper introduces a simple and flexible package, BioSimulator.jl, for implementing popular stochastic simulation algorithms based on Markov chain theory. BioSimulator.jl builds on previous software. Notable examples include:

  • StochSS, an integrated framework for deploying simulations on high-performance clusters [10]. It features a graphical interface for model editing, tools for deterministic, stochastic, and spatial simulations, a model analysis toolkit, and data visualization. StochKit2, a mature C++ library of stochastic simulation algorithms, serves as the simulation engine.

  • StochPy, an interactive stochastic modeling tool written in Python [27]. It provides a number of simulation algorithms, including delayed and single molecule methods.

  • GillespieSSA, an R implementation of exact and approximate simulation algorithms [30]. Gillespie.jl is a Julia extension of the original R package [14]. Today, Gillespie.jl features Jensen’s uniformization method and the True Jump Method.

  • DifferentialEquations.jl, an extensive Julia ecosystem for solving differential equations [31].

In addition, there are many specialized biological modeling tools:

  • COPASI, a large open-source software application for analyzing and simulating biochemical networks models [24]. It features a user-friendly graphical interface and supports both differential equation modeling and stochastic simulation algorithms.

  • Smoldyn, a particle-based stochastic spatial simulation engine [1]. It emphasizes biophysical and cell environment modeling.

  • BioNetGen, a rules-based model editor and simulation framework that focuses on cell regulatory networks [22].

  • PySB, a Python module for building mathematical descriptions of biological networks [26]. It provides a domain-specific language that is translated into a portable intermediate model representation. PySB connects to other software tools, such as SciPy and StochKit2, for simulations.

These packages have grown more sophisticated over time. Our goal in developing BioSimulator.jl is to provide a fast, open-source, and user-friendly library of stochastic simulation algorithms.

BioSimulator.jl is written in the high-level, high-performance programming language Julia [2]. Our software consists of three main components: an interactive interface for model prototyping, a simulation engine, and a small library of stochastic simulation algorithms. We briefly review the theory underlying stochastic simulation in Section 2 and present the algorithms implemented by BioSimulator.jl in Section 3. Section 4 outlines the model development process and describes the available visualization tools. We summarize BioSimulator.jl’s graphical inputs and outputs, including (a) Petri nets describing the connectivity of the reaction network, (b) time-evolution trajectories of the system, and (c) frequency distributions of events. Three examples from biology, chemistry, and genetics in Section 5 illustrate how to define a model in BioSimulator.jl and simulate it as a continuous-time Markov chain. BioSimulator.jl will be valuable to a broad range of molecular and systems biologists, physicists, chemists, applied mathematicians, statisticians, and computer scientists interested in stochastic simulation modeling.

2 Background

2.1 Markov jump processes

Before we discuss simulation specifics, we describe the time evolution of a Markov jump process [25]. The underlying Markov chain follows a column vector whose -th component is the number of particles of type at time . The components of track species counts and are necessarily non-negative integers. The system starts at time and evolves via a succession of random reactions. Let denote the number of reaction channels and the number of particle types. Channel is characterized by a propensity function depending on the current vector of counts . In a small time interval of length , we expect reactions of type to occur. Reaction changes the count vector by a fixed integer vector . Some components of may be positive, some 0, and some negative.

From the wait and jump perspective of Markov chain theory, the chain waits an exponential length of time until the next reaction. If the chain is currently in state , then the intensity of the waiting time until the next reaction is . Once the decision to jump is made, the chain jumps to the neighboring state with probability . Table 1 lists typical reactions, their propensities , and increment vectors . In the table, denotes a single particle of type . Only the nonzero increments are shown. The reaction propensities invoke the law of mass action and depend on rate constants [23]. Each discipline has its own vocabulary. Chemists use the term propensity instead of the term intensity and call the increment vector a stoichiometric vector. Physicists prefer creation to immigration. Biologists speak of death and mutation rather than of decay and isomerization.

Name Reaction
Immigration
Decay
Dimerization
Isomerization
Dissociation
Budding
Replacement
Complex Reaction
Table 1: Propensities and increment vectors for some typical reactions. Here, denotes a single particle of type , and denotes the reaction rate constant.

Often one is interested in the finite-time transition probabilities of a Markov chain. Let denote the probability that a Markov chain starting at state transitions to state by time . The chemical master equation reads

or equivalently, in differential form

The differential form is a system of possibly infinitely-many coupled differential equations. Solving the master equation is an enormous endeavor except in the simplest of models. Alternatively, simulating multiple realizations of the process yields data that can then be used to estimate transition probabilities and summary statistics. References [16, 17] offer a first-principles derivation of the chemical master equation for biochemical reactions and connect this formalism to simulation methods.

2.2 The Julia language

Julia is a fast, expressive, and flexible programming language for scientific computing [2]. Specifically, the language targets the so-called two-language problem, in which a methods developer builds working prototypes in a slow high-level language only to then move performance-critical subroutines to a fast low-level language. The implicit assumption is that high-level, dynamic languages are expressive and therefore easier to use but at the expense of performance. For example, some language features, such as “for loops”, may incur performance penalties through no fault of the user. Such performance hurdles are overlooked because dynamic programming languages typically avoid tedious compilation steps and provide users with an intuitive syntax. High-level languages allow users to express the tasks they wish to complete by handling the low-level details. On the other hand, low-level languages typically require technical expertise and offer the fastest possible execution.

A consequence of the two-language problem is that many tools, especially scientific software, become fragmented and difficult to manage. This is all the more relevant to scientists who often lack the necessary skills to maintain a software engineering project. Julia tackles this problem by providing an intuitive syntax and language features compatible with high performance underneath the hood, thereby making the user all the more productive. Interested readers are encouraged to peruse [2] for a deeper look into the Julia philosophy. Many online tutorials for learning Julia are available at https://julialang.org/learning/.

3 Simulation methods

BioSimulator.jl supports five different simulation algorithms. In the following subsections, we review each algorithm and briefly describe its strengths and weaknesses. The purpose of this section is to provide the reader with a high-level understanding of each algorithm and develop an intuition as to where methods succeed and fail. There are many references that provide details on these simulation methods, elucidate connections to spatial systems and related stochastic models, and motivate applications in many fields [20, 21].

3.1 Stochastic simulation algorithm

The Stochastic Simulation Algorithm (SSA), also known as the Direct or Gillespie method, implements the wait and jump mechanism for simulating a continuous-time Markov chain [16, 17]. At each step, the algorithm computes the propensities of each reaction channel and generates two random deviates. One of these is an exponential deviate indicating the time to the next reaction based on . The second is a uniform random number determining which reaction fires next based on the ratios . The two main computational steps in the SSA are

  1. Generate a random time to the next event by sampling from an exponential distribution with rate . This provides the update .

  2. Generate a random index denoting the reaction that occurred by sampling a categorical distribution with probabilities . This provides the update .

Since the propensities change after each event, the distributions underlying steps (I) and (II) change over time. Every Gillespie-like simulation algorithm is effectively distinguished by the sampling procedures used to generate the required random numbers, and the algorithms that update the underlying probability distributions.

The main advantage of the SSA is its ability to produce statistically correct trajectories and distributions by simulating every reaction. This strength is also its greatest weakness in models where a small subset of frequently occurring reactions dominate simulation. The detailed computational analysis of Cao et al. identifies the linear search on the propensities as a major obstacle to fast simulation [5]. The algorithm does not scale well with model size in the presence of different time scales. Thus, one must balance the value of accurate results versus speed in selecting SSA for simulation.

3.2 First reaction method

Gillespie proposed the First Reaction Method (FRM) as an alternative to SSA [16]. The main difference is the time to the next reaction

defined by independent exponentially distributed waiting times with intensity . Here again denotes the total number of reaction channels. The premise of the algorithm is to compute the minimum of exponential random variables explicitly. This approach is less computationally efficient than SSA in the number of exponential deviates required to compute the time to the next event. We include the FRM in BioSimulator.jl purely for educational purposes. While the FRM does not offer any advantage over the original SSA, it provides a different way of thinking about simulation. The Next Reaction Method builds upon this idea.

3.3 Next reaction method

The Next Reaction Method (NRM), also known as the Gibson-Bruck method, is another exact algorithm equivalent to SSA [15]. At time , the algorithm seeds each reaction channel a firing time and stores them inside a priority queue. In this context, a priority queue is a data structure that sorts pairs according to the value of in increasing order. That is, if is the minimum time, then the pair appears at the top of the queue. Thus, the next reaction is and its firing time is ; all other reactions fire at some future time. After reaction fires, the NRM updates the state vector . The next firing time is also updated by an appropriate exponential deviate:

where is the new propensity value. The remaining firing times change according to the recipe

based on the lack of memory property of the exponential distribution.

The NRM also minimizes the number of propensities updated by tracking dependencies between reaction channels. Typically, the SSA sweeps through all the propensities to reflect the change in . However, a reaction channel’s propensity only changes when the previous reaction event affected components of that appears as reactants. The NRM uses a reaction dependency graph to describe these relationships between reactions. This data structure reduces the number of propensities that must be updated. Kahan summation can also be used to update the cumulative intensity efficiently [15, 28].

The NRM excels in simulating systems with large numbers of species and lightly coupled reactions. Otherwise, in the extreme case, the algorithm must recalculate every firing time at every step. Systems with heavily coupled reaction channels are problematic for the NRM [5]. In this setting, the NRM becomes identical to the SSA but with the added burden of maintaining its priority queue.

3.4 Optimized direct method

The Optimized Direct Method (ODM) improves upon the original SSA by exploiting multi-scale properties inherent in large models [5]. BioSimulator.jl’s ODM implementation simulates a system once to count the number of times each reaction fires. This allows one to classify each reaction channel as fast (high frequency) or slow (low frequency). Sorting the reactions from fast to slow reduces the search depth in selecting the next reaction. This approach works well with heavily coupled reactions. Some systems exhibit more erratic behavior that prohibits classifying a reaction fast or slow. That is, switching between different time scales thwarts the ODM’s sorting optimization. The auto-regulation genetic network in Example 3 is an example of a system that undermines the optimized sorting of ODM.

3.5 -leaping

BioSimulator.jl implements performance optimizations described by Mauch and Stalzer to improve SSA, FRM, NRM, and ODM techniques [28]. However, algorithms that simulate every reaction ultimately succumb to the high computational expense of large models. The -leaping framework attempts to accelerate simulation by lumping reaction events together within a time leap , selected to be as large as possible [18, 19, 7]. The basic -leaping formula is

where is a Poisson random variable with rate . Thus, -leaping accelerates the SSA by lumping together multiple reaction events over an interval of size . The main challenge in -leaping is selecting the step size as large as possible while satisfying the leap condition

which states that the propensity for each reaction is approximately constant over a leap of size . Here, is a prescribed acceptable change in propensities that controls the accuracy of sample paths generated by a -leaping algorithm. A larger allows for larger leaps, while a smaller restricts leap size. In practice, many -leaping algorithms employ a surrogate condition that satisfies the leap condition with high probability.

In the stochastic simulation setting, a system is said to be stiff if the dynamics force a simulator to take “small” steps. Stiffness arises for a variety of reasons. Large models typically have a number of reactions occurring within a given interval. Reactions occurring on separate time scales split the system between “fast” and “slow” reactions, with the former occurring in a nearly deterministic fashion. In any case, stiffness causes the number of simulated events to increase in exact methods like the SSA. Stiffness poses a second threat to -leaping methods. In addition to decreasing the leap size, stiffness can cause -leaping to generate an excess of events due to the unbounded nature of the Poisson distribution.

There are two precautionary measures to protect against aberrant behavior in selection [6, 33]. For example, a parameter controls whether -leaping will default to SSA when the leap size is less than the expected change under the SSA; that is, if . This precaution is necessary to avoid taking suboptimal steps that introduce error and to mitigate leaps that send the system into negative population counts. In the event of a negative excursion, an acceptance parameter in contracts the leap step, effectively thinning the number of reaction events. Specifically, each event in a bad leap is randomly accepted if a uniform deviate is less than . The leap size is then set to . Leap contraction introduces bias in sample paths, so one must take care in setting . As a rule of thumb, one should first select conservative values for , , and and test performance using short numerical experiments. BioSimulator.jl sets , , and as default values, drawn from the literature, that ought to perform well in many cases.

-leaping discards reaction event times but reduces the burden of random number generation. Each leap in the algorithm requires Poisson random deviates, one for each reaction channel. This accelerates simulation when the leap size is significantly large compared to a single SSA step. Like the SSA, there are many sophisticated variations on the original -leaping algorithm. BioSimulator.jl implements the version found in [19] and is referred to as Ordinary -Leaping (OTL). Future development will implement additional -leaping algorithms from the literature. The next section reviews Step anticipation -leaping, a second -leaping method.

3.6 Step anticipation -leaping

The Step Anticipation -Leaping (SAL) algorithm is a variation on -leaping [33]. In the SAL algorithm, one approximates each propensity by a first-order Taylor polynomial around with starting value . The number of reactions of type is then sampled from a Poisson distribution with mean

The deterministic reaction rate equation

allows one to approximate the derivatives at by applying the chain rule of differentiation:

The required partial derivatives are typically constant or linear for mass-action kinetics. The main advantages of SAL are that it improves accuracy over other -leaping algorithms without compromising speed. This is crucial for complex systems exhibiting rapid fluctuations in reaction propensity and higher order kinetics. The critical step in SAL is selecting the leap size so that the linear approximation to the system holds and avoids negative populations. In our implementation, we select so that the bound

holds for every reaction [19]. Here is the rate constant of reaction and is the tuning parameter. Our implementation of SAL includes the negative population safeguards outlined in the previous section.

4 Software description

This section briefly reviews BioSimulator.jl’s interface and its tools. We refer interested readers to the package documentation for technical details. One may access BioSimulator.jl’s documentation through Julia’s help system based on the convention \mintinlinejulia?¡name¿, where \mintinlinejulia¡name¿ is the name of a function of interest, such as \mintinlinejuliasum.

4.1 Creating a model

First, a user loads BioSimulator.jl’s interface, simulation routines, and other helper functions with the command \mintinlinejuliausing BioSimulator. The \mintinlinejuliaNetwork construct is central to model specification. It represents a system of interacting particles starting from some initial state . A \mintinlinejuliaNetwork object stores the initial population sizes for each \mintinlinejuliaSpecies and the definitions for each \mintinlinejuliaReaction. One constructs a \mintinlinejuliaNetwork by passing a name to the system and successively adding each component with the \mintinlinejulia¡= symbol. For example, the following code {minted}[fontsize=]julia model = Network(”Michaelis-Menten”)

model ¡= Species(”S”, 301) model ¡= Species(”E”, 130) model ¡= Species(”SE”, 0) model ¡= Species(”P”, 0) defines four \mintinlinejuliaSpecies named S (substrate), E (enzyme), SE (substrate-enzyme complex), and P (protein) with initial counts . One defines a \mintinlinejuliaReaction by providing a label, a reaction rate constant, and the reaction equation itself. For example, the code {minted}[fontsize=]julia model ¡= Reaction(”dimerization”, 0.00166, ”S + E –¿ SE”) model ¡= Reaction(”dissociation”, 0.0001, ”SE –¿ S + E”) model ¡= Reaction(”conversion”, 0.1, ”SE –¿ P + E”) defines the dimerization, dissociation, and conversion reactions with rate constants , , and , respectively. BioSimulator.jl assumes mass action kinetics in simulations (see Table 1).

4.2 Running simulations

The \mintinlinejuliasimulate method parses a \mintinlinejuliaNetwork and carries out a simulation run using one of BioSimulator.jl’s algorithms. The command to simulate is written as {minted}[fontsize=]julia simulate(model, algname, output_type = Val(:fixed), time = 1.0, epochs = 1, trials = 1, kwargs…). Here \mintinlinejuliamodel and \mintinlinejuliaalgname are the required inputs. The remaining inputs are keyword arguments invoked by key-value pairs; each of these optional arguments has a default value. We summarize these inputs below along with any default values:

  • \mintinline

    juliamodel: The \mintinlinejuliaNetwork to simulate.

  • \mintinline

    juliaalgname: A simulation algorithm. One may choose between \mintinlinejuliaDirect(), \mintinlinejuliaFirstReaction(), \mintinlinejuliaNextReaction(), \mintinlinejuliaOptimizedDirect(), \mintinlinejuliaTauLeaping(), or \mintinlinejuliaStepAnticipation().

  • \mintinline

    juliaoutput_type = Val(:fixed): One of \mintinlinejuliaVal(:fixed) or \mintinlinejuliaVal(:full) denoting a strategy for saving the state vector. The \mintinlinejuliaVal(:full) option has the simulator sample the state vector after each reaction event and records its value. This option uses more memory and may incur a slight performance penalty due to the fact that BioSimulator.jl cannot determine in advance the size of the output. The \mintinlinejuliaVal(:fixed) option records the state vector at fixed intervals.

  • \mintinline

    juliatime = 1.0: The amount of time to simulate the model. If one specifies a time , then the model will be simulated over the interval .

  • \mintinline

    juliaepochs = 1: The number of save points when using fixed-interval output (\mintinlinejuliaVal(:fixed)). The default option \mintinlinejuliaepochs = 1 records the initial and final values of the state vector.

  • \mintinline

    juliatrials = 1: The number of independent realizations to simulate.

  • \mintinline

    juliatrack_stats = false: A Boolean value that indicates whether the simulation should keep track of algorithm-specific statistics. Using the option with SSA-like methods simply track the number of events. -leaping methods also include the number of times the simulation encounters negative population counts.

  • \mintinline

    juliakwargs: A catch-all for additional options specific to an algorithm. One can check these options using \mintinlinejulia?¡algname¿, where \mintinlinejulia¡algname¿ is to be replaced with an algorithm type. For example, \mintinlinejulia?TauLeaping will print a description of the simulation method.

As an example, the code {minted}[fontsize=]julia simulate(model, StepAnticipation(), Val(:fixed), time=100.0, epochs=50, trials=1000, epsilon = 0.125) will simulate the given model with SAL and return fixed-interval output. The time interval is discretized into epochs for each of the independent realizations of the stochastic process. Lastly, the \mintinlinejuliaepsilon = 0.125 option specifies the value of the parameter used by SAL to control leap size.

4.3 A note on epochs

By default, BioSimulator.jl partitions the simulation time span into epochs of equal length. After each simulation step, BioSimulator.jl checks whether the previous event pushed the simulation into the next epoch. If so, it will record the current value of at each of the previous epochs. We note that the \mintinlinejuliaVal(:fixed) option does not affect how each algorithm steps through a simulation. However, this strategy necessarily discards information, such as waiting times between reactions. Users must also take care to use a sufficiently large number of epochs so that the simulation data accurately captures system dynamics. In particular, one may fail to capture phenomena occurring on a time scale smaller than the one implied by the number of epochs. Despite its drawbacks, BioSimulator.jl favors fixed interval sampling because it assists in computing summary statistics, improves performance for long simulation runs, and facilitates interactive model prototyping. The developers of StochPy provide an excellent discussion on the trade-offs between fixed-interval and full simulation output in [27].

4.4 Running parallel simulations

Simulating large numbers of realizations is naturally amenable to parallelization because trial runs are independent. BioSimulator.jl takes advantage of Julia’s built-in parallelism to speed up large simulation tasks. This is achieved by specifying \mintinlinejuliajulia –procs=N when starting Julia. Here is the number of worker threads. Running Julia in parallel mode allows BioSimulator.jl to simulate a \mintinlinejuliaNetwork by delegating work to separate processes. For example, if one has specified threads then BioSimulator.jl will simulate realizations by delegating trials to each thread. In practice, simulations of large networks benefit more from parallelization than small networks because generating a single trajectory is typically more expensive in the former scenario.

4.5 Petri nets

BioSimulator.jl allows users to visualize the structure of their models as Petri nets using the \mintinlinejuliavisualize function (see Figure 1). A Petri net is a directed graph whose nodes represent either a species (oval) or reaction (rectangle). An arrow from a species to a reaction indicates that the species acts as a reactant (black arrows), while the opposite direction indicates a product (red arrows).

(a) Kendall’s process
(b) Michaelis-Menten enzyme kinetics
(c) Prokaryotic autoregulation
Figure 1: Petri net representations of (a) Kendall’s process, (b) the Michaelis-Menten model, and (c) a self-regulating gene network generated by BioSimulator.jl via the TikzGraphs package. Arrows connecting species to reaction denote how species enter into a reaction, and arrows from a reaction to a species denote how species are produced or deleted by the reaction. When the reaction produces more than 1 particle of a given species, its coefficient appears along the arrow connecting reaction to product species.

4.6 Simulation output

Output from stochastic simulation at a series of time points can be plotted as mean trajectories over time or as full distributions. BioSimulator.jl generates time series data for each species and stores it in a \mintinlinejuliaSimulationSummary object, which also tracks the model used in the simulation, simulation parameters, and key algorithm statistics. BioSimulator.jl provides a few convenient functions for visualization and summary statistics through the \mintinlinejuliaSimulationSummary construct. For example, one may access simulation data as a \mintinlinejuliaDataFrame provided by the DataFrames.jl package in Julia. The DataFrames.jl documentation provides examples for carrying out common operations, including data manipulation, computing summary statistics, and saving data to a file. \mintinlinejuliaDataFrame conversion is achieved by calling \mintinlinejuliaDataFrame(result), where \mintinlinejuliaresult is a \mintinlinejuliaSimulationSummary. The resulting table has three types of columns. The \mintinlinejuliatime and \mintinlinejuliatrial columns indicate the time point and trial number of a record. The remaining columns are labelled according to the species or compartment name. One must load the DataFrames.jl package before converting simulation output to a \mintinlinejuliaDataFrame: {minted}[fontsize=]julia ¿ using DataFrames ¿ result = simulate(model, time = 50.0, epochs = 3, trials = 3) ¿ DataFrame(result) — Row — time — SE — S — P — E — trial — ——–—————-——–——–——–———-— — 1 — 0.0 — 0 — 301 — 0 — 130 — 1 — — 2 — 25.0 — 66 — 46 — 189 — 64 — 1 — — 3 — 50.0 — 12 — 0 — 289 — 118 — 1 — — 4 — 0.0 — 0 — 301 — 0 — 130 — 2 — — 5 — 25.0 — 63 — 38 — 200 — 67 — 2 — — 6 — 50.0 — 9 — 0 — 292 — 121 — 2 — — 7 — 0.0 — 0 — 301 — 0 — 130 — 3 — — 8 — 25.0 — 64 — 29 — 208 — 66 — 3 — — 9 — 50.0 — 17 — 0 — 284 — 113 — 3 —

4.7 Plotting

BioSimulator.jl provides convenient methods for visualizing simulation results through the Plots.jl package. Installing the Plots.jl package provides one with default recipes for plotting individual realizations, mean trajectories, and frequency histograms. The \mintinlinejuliaplot function acts on a \mintinlinejuliaSimulationSummary to produce a figure depending on the value of \mintinlinejuliaplot_type: the possible choices are \mintinlinejulia:trajectory, \mintinlinejulia:meantrajectory, and \mintinlinejulia:histogram. Consider the following examples:

  • \mintinline

    juliaplot(result, plot_type=:trajectory, trial=1) will plot the sample paths for each species based on the results of the first trial.

  • \mintinline

    juliaplot(result, plot_type=:meantrajectory, species=[”S”, ”E”], epochs = 100) will plot the mean trajectories for the species and based on epochs. Error bars represent one standard deviation from the mean.

  • \mintinline

    juliaplot(result, plot_type=:histogram) will plot the distribution of each species at the end of a simulation, based on the number of trials.

Plotting options can be mixed and matches based on the interface provided by Plots.jl. Users may consult the documentation of Plots.jl for help in further customizing figures.

5 Results

To illustrate the workflow of BioSimulator.jl, we provide three simple numerical examples using the SAL method unless otherwise specified. In each case, we set , , and .

Example 1

Kendall’s process

Kendall’s birth, death, and immigration process is a continuous-time Markov chain governed by a birth rate per particle, a death rate per particle, and an immigration rate . Let denote the count of a species at time . The events (reactions)

occur with propensities , , and , respectively. The Petri net in Figure 1 (a) shows the relationships between the particles and reactions. It allows the modeler to visualize the flow of particles through the network and check whether the set of reactions specified accurately captures the biological pathways being studied.

The following code simulates this model in BioSimulator.jl: {minted}[fontsize=,samepage]julia using BioSimulator

model = Network(”Kendall’s Process”)

model ¡= Species(”S”, 5)

model ¡= Reaction(”Birth”, 2.0, ”S –¿ S + S”) model ¡= Reaction(”Death”, 1.0, ”S –¿ 0”) model ¡= Reaction(”Immigration”, 0.5, ”0 –¿ S”)

result = simulate(model, StepAnticipation(), time=4.0, epochs=40, trials=100_000)

In addition to tracking the mean and variance, BioSimulator.jl enables the display of the full distribution of species counts. In this way, one can quantify the frequency of rare extinction events. Figure 2 summarizes results for the mean trajectory and distribution of species in Kendall’s process. At , the average population is , and in approximately percent of the simulations, the species has gone extinct by this time point. Figure 3 illustrates the trade-off in selecting large values in -leaping methods, namely OTL and SAL. As decreases towards , the distribution of at approaches the exact statistical results from the SSA. Note that the smaller values may force these -leaping algorithms to perform slower than SSA because the proposed leap sizes become smaller than an SSA update. This means that each update wastes time computing a leap that is often rejected. On the other hand, large values tend to increase the number of bad leaps. In this case, -leaping wastes time contracting the leap size until a suitable update emerges.

(a)
(b)
Figure 2: Kendall’s process with , , and starting from particles. (a) Mean trajectory and distribution of species computed from SAL realizations. The light, colored region represents one standard deviation from the mean at each recorded point. (b) The histogram suggests extinction is possible at even though the mean value is approximately .
(a)
(b)
Figure 3: The distribution of in Kendall’s process at computed from realizations using (a) OTL and (b) SAL. SSA is used to derive an exact estimate of the distribution for comparison and is shown in black. The colored lines represent the resulting distribution under different values of . Conservative values are generally safe, but may increase the clock time performance of a specific algorithm on a non-trivial model. In this example, the time to generate all realizations for SSA is approximately seconds. The running times for OTL are (in order of increasing value): , , , , and seconds. Similarly, the running times for SAL are: , , , , and seconds.
Example 2

Michaelis-Menten enzyme kinetics

Our next example, Michaelis-Menten enzyme kinetics, involves a combination of first- and second-order reactions. The system consists of a substrate S, an enzyme E, the substrate-enzyme complex SE, and a product P. The three reactions connecting them

represent binding of the substrate to an enzyme, dissociation of the substrate-enzyme complex, and conversion of the substrate into a product. These reactions have rates , , and , respectively. The following code simulates this system under the initial conditions , , and rate constants , , and . {minted}[fontsize=]julia using BioSimulator

model = Network(”Michaelis-Menten”)

model ¡= Species(”S”, 301) model ¡= Species(”E”, 130) model ¡= Species(”SE”, 0) model ¡= Species(”P”, 0)

model ¡= Reaction(”dimerization”, 0.00166, ”S + E –¿ SE”) model ¡= Reaction(”dissociation”, 0.0001, ”SE –¿ S + E”) model ¡= Reaction(”conversion”, 0.1, ”SE –¿ P + E”)

result = simulate(model, StepAnticipation(), time = 50.0, epochs = 10_000, trials = 1_000) Figure 1 (b) provides a Petri net representation of the model. Figure 4 (a) demonstrates mean trajectories and standard deviations for each of the reactant species over time. These mean trajectories match the dynamics predicted by a deterministic model. Figure 4 (b) shows the full distribution of species counts after time steps. The substrate is typically exhausted at .

(a)
(b)
Figure 4: (a) Mean trajectory of each species in the Michaelis-Menten model using realizations. The region representing one standard deviation from the mean is small and suggests this network is not dominated by noise. (b) Population distributions for each species at generated from realizations.
Example 3

Auto-regulatory genetic network

The influence of noise at the cellular level is difficult to capture in deterministic models. Stochastic simulation is appropriate for the study of regulatory mechanisms in genetics, where key species may be present in low numbers. Figure 1 (c) is an example of a simplified negative auto-regulation network for a single gene, in the sense that the protein represses its own transcription.

There are eight possible reactions: (1) gene transcription into RNA, (2) translation of the protein, (3) dimerization of the protein with itself, (4) dissociation of the protein dimer, (5) binding to the gene, (6) unbinding from the gene, (7) RNA degradation, and (8) protein degradation. There are five species to track — the free copies of the gene, transcribed RNA, protein molecules, dimer molecules, and blocked copies of the gene. The model is easily implemented in BioSimulator.jl: {minted}[fontsize=,samepage]julia using BioSimulator

model ¡= Network(”negative auto-regulation”)

model ¡= Species(”gene”, 10) # assume 10 copies of the gene are present model ¡= Species(”RNA”, 0) # transcribed from the underlying gene model ¡= Species(”P”, 0) # protein model ¡= Species(”P2”, 0) # protein dimer model ¡= Species(”P2_gene”) # gene repression

model ¡= Reaction(”transcription”, 0.01, ”gene –¿ gene + RNA”) model ¡= Reaction(”translation”, 10.0, ”RNA –¿ RNA + P”) model ¡= Reaction(”dimerization”, 1.0, ”P + P –¿ P2”) model ¡= Reaction(”dissociation”, 1.0, ”P2 –¿ P + P”) model ¡= Reaction(”repression binding”, 1.0, ”gene + P2 –¿ P2_gene”) model ¡= Reaction(”reverse repression binding”, 10.0, ”P2_gene –¿ gene + P2”) model ¡= Reaction(”RNA degradation”, 0.1, ”RNA –¿ 0”) model ¡= Reaction(”protein degradation”, 0.01, ”P –¿ 0”)

result = simulate(model, StepAnticipation(), Val(:full), time = 500.0, trials = 100) RNA typically has a limited lifetime. Thus, the per particle reaction rates governing protein production are balanced to favor translation events following transcription. Moreover, the reaction rates for dimerization and dissociation reflect an assumption that the protein favors neither the monomer nor the dimer configuration.

(a)
(b)
Figure 5: (a) Mean trajectories for the protein and dimer in a simple auto-regulatory gene network. The bars represent standard deviation away from the mean. (b) Full sample paths for the protein (top) and RNA (bottom). Note that the RNA sample path is sensitive to the number of epochs if one were using the fixed-interval option. Using a large window may fail to capture peaks in the output. This subtlety is important for qualitative model assessment.

Figure 5 (a) compares the mean behavior of the protein and the dimer over time with results from a deterministic model. Plotting individual trajectories for the protein level reveals strong stochastic fluctuations driven by the relative rarity of RNA. Figure 5 (b) highlights the influence of noise in the system’s dynamics.

5.1 Algorithm comparison

Here we compare the performance of BioSimulator.jl’s algorithms across the three examples from the previous sections. Each model is simulated using the same model parameters outlined in the previous examples. Each simulation task involves generating realizations. In the case of fixed-interval output, epochs are used. The -leaping algorithms use the default values , , and .

Table 2 records the averaged clock times reported by the \mintinlinejulia@benchmark macro from Julia’s BenchmarkTools.jl package. Each simulation task is allotted minutes to generate samples of the running time. To be explicit, this means that the following command is executed at most times: {minted}[fontsize=,samepage]julia simulate(model, algname, output_option, time = tfinal, epochs = 1000, trials = 100) Thus, these benchmarks reflect both the cost of generating a single realization, the efficiency of completing a typical simulation task, and the economy of choosing fixed-interval versus full simulation output. Results are based on a MacBook Pro with 2 GHz Intel Core i7 (4 cores) and 16 GB of RAM running macOS High Sierra 10.13.3.

Algorithm Kendall’s process Michaelis-Menten Auto-regulation
SSA
FRM
NRM
ODM
OTL
SAL
Table 2: Median runtimes and interquantile ranges (in milliseconds) for Example 1, Example 2, and Example 3 based on samples of the \mintinlinejuliasimulate command. Each simulation task generates realizations of the underlying stochastic model. For example, the median time to generate realizations of Kendall’s process using SSA is milliseconds. The first row in a cell is the benchmark result using the \mintinlinejuliaVal(:fixed) option for fixed-interval output. The second row indicates the timing result using the \mintinlinejuliaVal(:full) option for full simulation output.

The NRM performs the worst across the selected model. However, we note that BioSimulator.jl uses the a priority queue implementation from the DataStructures.jl package which may not be optimal for the method’s specific demands. Moreover, we take a naïve approach to building dependency graphs. The overhead from BioSimulator.jl’s dependency graphs is sufficiently large that in some cases most of the simulation is spent on this step. This warrants further review of our NRM implementation, although we expect an improved implementation to perform similarly to the FRM. Only the auto-regulation model may show improved performance for the NRM over SSA because the reaction channels for that model are not tightly coupled. Similarly, our ODM implementation may be inefficient based on its performance on the birth-death-immigration process.

Both -leaping algorithms perform poorly on the auto-regulation model. This is to be expected because neither OTL nor SAL handle separate time scales. Figure 5b shows that RNA transcription becomes a rare event as the protein dimer population grows. Similarly, the free and blocked gene copies are always present in relatively small quantities compared to the protein and protein dimer populations. Thus, protein dimerization and dissociation become the dominant reaction channels further along the time axis. These two reactions drive the dynamics of the system and therefore heavily influence the -leap selection procedure. A natural consequence is that -leaping methods have an increased likelihood of violating the leap condition in this regime and therefore bias the simulation toward infeasible states. This assertion is easily verified using the \mintinlinejuliatrack_stats = true option and checking the number of negative excursions reported by BioSimulator.jl. In fact, the leap condition violations are egregious to the point that the recovery mechanism requires thinning the leap several times per negative excursion. This suggests that reducing the leap size via the control parameter is likely to reduce the algorithm to SSA because leaps eventually become smaller than the expected Gillespie updates.

5.2 Software comparison

We compare BioSimulator.jl’s features and performance against three software packages: StochPy, StochKit2, and Gillespie.jl. The tools in the space of stochastic simulation are manifold and vary in their features and applications. Naturally, there are many omissions. Each software package is selected for its similarity to BioSimulator.jl. Specifically, these tools are domain-independent, general purpose Gillespie-like simulators. Table 3 summarizes the differences and similarities between these software tools.

StochPy StochKit2 Gillespie.jl BioSimulator.jl
Software characteristics:
Language Python C++ Julia Julia
Open-source Yes Yes Yes Yes
GUI No via StochSS No No
Jupyter integration Yes No Yes Yes
Performance Fast Faster Faster Fastest
Interactivity:
Model editor No via StochSS No Yes
Simulation interface Yes via StochSS Yes Yes
Plotting Yes Limited No Yes
Simulation features:
Fixed-interval output via StochKit2 Yes No Yes
Full output Yes Yes Yes Yes
SBML support Yes Yes No No
Readable input Yes No No Yes
Parallelism No Yes No Yes
Table 3: A summary of features across StochPy, StochKit2, Gillespie.jl, and BioSimulator.jl. Jupyter notebooks are human-readable documents that combine code, text, and figures into a single interactive report.

In addition, we compare each software package on simulation speed across five different models:

  1. Kendall’s process permits an analytic mean trajectory. This model serves as an easy speed test and a correctness check.

  2. The Michaelis-Menten enzyme kinetics model provides a second simple speed test with multiple species.

  3. The auto-regulation model serves a benchmark against an interesting biological application of stochastic simulation.

  4. The dimer-decay model is an example of a small stiff system. It is featured in the literature as a -leaping benchmark [7, 4].

  5. A yeast model from biology is another interesting stiff system used here as a benchmark. [8, 9, 32].

Initial conditions and other model parameters are deferred to supplementary information (Section A.1, Tables S1 - S5). Table 4 reports the benchmark results and includes the simulation parameters.

Software StochPy StochKit2 Gillespie.jl BioSimulator.jl
Method SSA SSA (dep. graph) SSA SSA
Mode Serial Parallel Serial Serial Parallel
Kendall’s process
Michaelis-Menten
Auto-regulation
Dimer-decay
Yeast
Table 4: Median runtimes and interquantile ranges for StochPy, StochKit2, Gillespie.jl, and BioSimulator across selected models based on samples, reported in milliseconds (ms). Each sample measured the time to generate realizations of a given stochastic process. Results using fixed-interval and fixed output options are recorded in the first and second rows of each cell, respectively. Each simulation tool is used with its default settings. For example, StochKit2 automatically parallelizes simulation tasks involving multiple realizations and uses a dependency graph by default. We note that both StochKit2 and BioSimulator.jl used threads for the parallel simulation benchmarks. Direct comparisons based on these results are not possible, in the sense that slower performance does not necessarily indicate a particular tool is poorly implemented. Rather, our results reflect natural trade-offs in optimizing software for particular goals. Note: The StochPy benchmark on the auto-regulation model is based on only samples.

5.3 Package availability

BioSimulator.jl is available on GitHub (https://github.com/alanderos91/BioSimulator.jl). All source code is readily available to view, download, and distribute under the MIT license. We also maintain a documentation manual via GitHub Pages (https://alanderos91.github.io/BioSimulator.jl/stable/).

6 Discussion

BioSimulator.jl simplifies interactive stochastic modeling by virtue of being contained within a single programming language. Every aspect of the modeling workflow — model specification, simulation, and analysis — is handled by Julia and its type system. Here we discuss these three aspects of our simulation software and compare its features and timing benchmarks with other packages.

StochPy and BioSimulator.jl are mainly interactive simulation tools meant to be used in Read-Eval-Print Loop (REPL) environments and Jupyter notebooks, but they stand out in that they can also be used as libraries like StochKit2. BioSimulator.jl’s model specification interface is flexible enough to allow a user to build up a model using for loops and other language features. StochPy and StochKit2 mainly rely on input files whose main advantage is model sharing. Model editing interfaces can be built around standardized formats and indeed such tools exist; StochSS includes a graphical user interface that feeds into StochKit2. The strength of BioSimulator.jl’s model interface is the flexibility afforded by Julia’s features as it facilitates rapid model prototyping.

An additional benefit of our implementation is that models can be packaged into functions by defining a wrapper. This allows one to share models through \mintinlinejulia.jl files that provide the model through a function call that specifies parameters and initial conditions. A model author has full control over what parameters ought to be exposed to a user by designing the function signature appropriately. As an example, recall the model definition for Kendall’s process: {minted}[fontsize=,samepage]julia function birth_death_process(S; birth_rate = 2.0, death_rate = 1.0, immigration_rate = 0.5) model = Network(”Kendall’s Process”)

model ¡= Species(”S”, S)

model ¡= Reaction(”Birth”, birth_rate, ”S –¿ S + S”) model ¡= Reaction(”Death”, death_rate, ”S –¿ 0”) model ¡= Reaction(”Immigration”, immigration_rate, ”0 –¿ S”)

return model end Here \mintinlinejuliabirth_death_process is a function that requires an argument \mintinlinejuliaS to specify an initial condition. The variables \mintinlinejuliabirth_rate, \mintinlinejuliadeath_rate, and \mintinlinejuliaimmigration_rate are optional and have default values. Calling \mintinlinejuliabirth_death_process(5) builds up the model with and the default reaction rate constants.

Many models are implemented using data standards independent of software and programming languages. One notable standard is the Systems Biology Markup Language (SBML) [3]. SBML addresses crucial details such as parameter units, compartment sizes, and kinetic rate laws in a standardized format. An interface to SBML is required in order to connect BioSimulator.jl to existing software, enable thorough comparisons, and facilitate model sharing in a standardized way. StochPy and StochKit2 stand out in this regard because these tools provide an interface to SBML. We anticipate adopting the SBML standard in future versions of our software.

StochKit2 is primarily a command-line tool and a software development package. This makes StochKit2 highly reusable since other programs, such as model editors or simulation packages, can interface with it. In fact, one of StochPy’s features is its ability to call StochKit2 for tau-leaping. The main commands are \mintinlinejuliassa and \mintinlinejuliatau_leaping. At the start of a simulation job, StochKit2 performs an analysis to select a suitable algorithm based on the model structure. For example, using the \mintinlinejuliassa command will run some variation of the exact Gillespie algorithm based on model structure. The main drawback is the command-line interface, which may be off-putting to inexperienced users.

Overall, Gillespie.jl is a fast stochastic simulation tool for visualizing results and computing quantities of interest, but it requires a modest programming effort. Without a model editor, it requires the user to specify the net change increments and propensity functions for each reaction . This approach places a burden on users, especially those unfamiliar with Julia or any similar programming language. A benefit of this interface is that non-mass action propensities are automatically supported since a user must hard-code these for the software. Unlike the other tools, Gillespie.jl does not yet have an interface for running multiple simulations, and so collecting the results from independent simulations requires some effort from a user. While the main simulation functions are similar, Gillespie.jl and BioSimulator.jl have the advantage of Julia’s type system and multiple-dispatch. When used correctly, these two language features allow one to write powerful abstractions around scientific computing problems that Julia leverages to generate highly optimized instructions. The JuMP.jl and DifferentialEquations.jl packages are exemplars in this regard and serve as a testament to Julia’s advantages [11, 31].

The benchmarking results from Table 4 show that BioSimulator.jl is competitive with StochKit2. While we are pleased with BioSimulator.jl’s performance, one can only speculate on the precise meaning of these benchmarks without an intimate understanding of implementation details. Julia provides a theoretical competitive performance edge for Gillespie.jl and BioSimulator.jl over the Python software. StochPy’s timings may be slower due to the fact that it records a greater amount of information. In addition to recording the state vector after each step, StochPy offers the option to store reaction channel information such as propensity values and event waiting times. While fixed-interval output provides StochKit2 and BioSimulator.jl with a slight performance boost, the results between StochPy and BioSimulator.jl using the full output option suggest that the performance gap is in fact wide.

StochKit2 and BioSimulator.jl each support running parallel simulations. StochKit2 automates this feature; the software defaults to parallelism if a user’s machine supports it. However, BioSimulator.jl’s implementation is nearly automatic thanks to Julia’s abstractions for parallelism. Table 4 shows that BioSimulator.jl is an order of magnitude faster on each example in the serial case, except for the auto-regulation model. This warrants further investigation.

All four packages support varying degrees of simulation output analysis. Gillespie.jl and BioSimulator.jl offer automatic time series visualization. We improve upon Gillespie.jl by providing helper functions for additional visualization tools, such as histograms and mean trajectories, as well as integration with the Plots.jl ecosystem. Our goal is to emulate StochKit2 / StochSS and StochPy in their support for more sophisticated analysis tools.

7 Conclusion

In a biological system, interacting feedback loops can make mathematical analysis intractable and create challenges in choosing an optimal set of experiments to probe system behavior. Combining experiments and stochastic simulation of complex biological systems promotes model validation and the design of promising experiments. The user-friendly nature of BioSimulator.jl encourages the use of stochastic simulation, eliminates effort spent on modifying simulation code, reduces errors during model specification, and allows visualization of system interactions via Petri Nets. Tracking trajectories and distributions of interacting species over time helps modelers decide between deterministic and stochastic models. Future developments of BioSimulator.jl include

  • support for non-mass action kinetics,

  • adopting SBML as an input format,

  • extending the SAL algorithm implementation using higher order Taylor expansions,

  • implementing additional exact and approximate simulation algorithms from the literature,

  • incorporating spatial effects, and

  • implementing hybrid methods that integrate stochastic simulation with deterministic modeling.

The Julia language provides an ideal environment for this purpose. Its syntax allows one to write mathematical code in a natural way and facilitates fast prototyping. In particular, the integration with the IJulia package encourages code sharing and reproducible work. Furthermore, its interactive environment is well-suited for novice programmers. The package ecosystem in Julia provides software for visualization, statistical analysis, optimization, differential equations, and probability distributions. Emerging computational tools in Julia can only increase BioSimulator.jl’s strengths over time.

Acknowledgments

We thank Kevin Tieu for testing an early version of our software. This work was funded by NCATS Grant KL2TR000122. A.L. and T.S. were funded by the NIH Training Grant in Genomic Analysis and Interpretation T32HG002536. K.L.K. was supported by grant 5R01HL135156-02S1, the UCSF Bakar Computational Health Sciences Institute, and the UC Berkeley Institute for Data Sciences as part of the Moore-Sloan Data Sciences Environment initiative.

References

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
322140
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description