
RNA folding kinetics using Monte Carlo and Gillespie algorithms

Research supported by National Science Foundation grant DBI-1262439.

Peter Clote    Amir H. Bayegan
Department of Biology, Boston College, Chestnut Hill, MA 02467, USA. clote@bc.edu.
Abstract

RNA secondary structure folding kinetics is known to be important for the biological function of certain processes, such as the hok/sok system in E. coli. Although linear algebra provides an exact computational solution of secondary structure folding kinetics with respect to the Turner energy model for very short RNA sequences, the folding kinetics for larger sequences can only be approximated by binning structures into macrostates in a coarse-grained model, or by repeatedly simulating secondary structure folding with either the Monte Carlo algorithm or the Gillespie algorithm.

Here we investigate the relation between the Monte Carlo algorithm and the Gillespie algorithm. We prove that asymptotically, the expected time for a K-step trajectory of the Monte Carlo algorithm is equal to ⟨N⟩ times that of the Gillespie algorithm, where ⟨N⟩ denotes the Boltzmann expected network degree. If the network is regular (i.e. every node has the same degree N), then the mean first passage time (MFPT) computed by the Monte Carlo algorithm is equal to the MFPT computed by the Gillespie algorithm multiplied by N; however, this is not true for non-regular networks. In particular, RNA secondary structure folding kinetics, as computed by the Monte Carlo algorithm, is not equal to the folding kinetics, as computed by the Gillespie algorithm, although the mean first passage times are roughly correlated.

Simulation software for RNA secondary structure folding according to the Monte Carlo and Gillespie algorithms is publicly available, as is our software to compute the expected degree of the network of secondary structures of a given RNA sequence – see http://bioinformatics.bc.edu/clote/RNAexpNumNbors.

1 Introduction

Ever since Anfinsen’s pioneering experiment involving the denaturation and renaturation of bovine ribonuclease [3], it has long been held that the native state of a protein is that conformation which has minimum free energy (MFE) among all possible conformations. For that reason, both the Monte Carlo algorithm [34] and the Gillespie algorithm [22] have been used in biopolymer folding studies to estimate kinetic folding time, even when detailed balance does not hold. (Definitions of the Monte Carlo algorithm, Gillespie algorithm, detailed balance, mean first passage time, master equation, etc. are given later in the paper.) Since a number of different measures of folding time have been described for RNA secondary structure formation [42, 17, 53, 45], the goal of the current paper is to clarify the relations between these measures, to point out the possibly low correlation between mean first passage time (MFPT) when computed by the Monte Carlo algorithm versus the Gillespie algorithm, and to prove an asymptotic result that precisely relates Monte Carlo trajectory time with Gillespie trajectory time. We hope that this clarification will be useful to those users who perform computational experiments to estimate RNA folding kinetics.

Protein folding kinetics using Monte Carlo and/or Gillespie

For a protein to be biologically useful, its amino acid sequence must satisfy two requirements: (1) a thermodynamic requirement that the sequence have a unique, thermodynamically stable native structure, and (2) a kinetic requirement that the denatured polypeptide chain be capable of reaching the native state within reasonable time (milliseconds to seconds) under appropriate solution conditions. In [40, 41] a simple Monte Carlo folding experiment was designed for a 27 bead hetero-polymer on a cubic lattice using hydrophobic contact potentials. Defining a sequence to fold rapidly provided that its MFPT is sufficiently small, the authors concluded that a necessary and sufficient condition for a sequence to fold rapidly in their model is the existence of a pronounced energy minimum with a substantial gap between the energy of the native state and that of the most stable misfolded structure. A succinct summary of the theoretical work in [40, 41] is that the thermodynamic requirement entails the kinetic requirement. In [1, 2] Monte Carlo hetero-polymer folding experiments were performed to show how prebiotic proteins might have evolved to fold rapidly in a primordial soup in order to survive hydrolysis. Hetero-polymer sequences were evolved by applying a Metropolis criterion to the acceptance of pointwise mutations, depending on whether the mutation increased MFPT. One of the conclusions from this work was that the kinetic requirement entails the thermodynamic requirement.

More recently, the notion of Markov state model (MSM) has been introduced to analyze protein folding kinetics by statistical analysis of a number of molecular dynamics (MD) trajectories – see for instance [47, 6, 25, 51, 5, 23]. To construct a Markov state model, microstates are first formed by applying a clustering algorithm to conformations taken at fixed time intervals (snapshots) from MD trajectories by using a measure of structural similarity such as RMSD. States of the Markov state model are then defined to be kinetically reachable macrostates constructed from microstates, where transition probabilities are defined by counting the relative number of transitions between macrostates. The kinetics of folding are then analyzed by computing the MFPT to reach the macrostate of the MSM which contains the native state. Provided that the number of states is reasonably small, the MFPT can be computed analytically from the fundamental matrix as done in [51]. Analytic computations of MFPT are much more efficient and accurate than repeated simulations of the Monte Carlo algorithm – however, both methods are theoretically equivalent. In contrast, in some papers such as [58], instead of computing the MFPT of the Markov state model, transition rates are computed from which the master equation is solved to determine the equilibrium time, i.e. the time necessary to reach molecular equilibrium. Provided that the number of states is reasonably small, it is possible to solve the master equation analytically by spectral decomposition; otherwise, the Gillespie algorithm can be used to stochastically estimate the equilibrium time. In [32], a formula (involving the integral of an appropriate time correlation function) is given to estimate the equilibrium time among unfolded states of a protein having two state kinetics.
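The analytic MFPT computation mentioned above can be illustrated in miniature. The following pure-Python sketch (a hypothetical 4-state reflecting random walk, not taken from the paper or from [51]) solves the linear system h = 1 + T·h for the expected hitting times of an absorbing target state, which is exactly the fundamental-matrix computation (I − T)⁻¹·1 on a toy example.

```python
# Sketch: analytic MFPT for a small absorbing Markov chain, in the spirit of
# computing MFPT from the fundamental matrix (I - T)^(-1).  The 4-state chain
# below (a reflecting random walk on 0-1-2-3, absorbing at state 3) is a toy
# example, not taken from the paper.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Transition probabilities among the transient states {0, 1, 2}:
# state 0 reflects to 1; states 1 and 2 step left/right with probability 1/2
# (the missing probability 1/2 from state 2 leads to the absorbing state 3).
T = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0]]

# Mean first passage times h satisfy h = 1 + T h, i.e. (I - T) h = 1.
n = len(T)
I_minus_T = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(n)] for i in range(n)]
h = solve(I_minus_T, [1.0] * n)
print(h)  # expected hitting times of state 3 from states 0, 1, 2: [9.0, 8.0, 5.0]
```

For this walk the hitting time from state 0 is 9 = 3², the familiar quadratic scaling of an unbiased walk, which serves as a hand-checkable sanity test of the linear solve.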

RNA folding kinetics using Monte Carlo and/or Gillespie

In [42], the folding time for an RNA sequence is defined to be the sum of reciprocals of the rate constants in each step of a folding trajectory, where each rate constant is determined by a rate calibration constant together with the activation energy of adding or removing a base pair to/from the current structure, using the Turner free energy model [50]. The resultant folding time differs from MFPT by additionally accounting for the transition rates between trajectory steps. A similar notion of folding time is adopted in [21], where additionally basins of attraction are estimated by assigning a collection of sampled secondary structures to a locally optimal structure by a greedy procedure.

In [45], mean first passage time was estimated by the average number of Monte Carlo steps, taken over 50 runs, to fold the empty secondary structure into the MFE structure. The authors determined that for the selenocysteine (SECIS-1) family RF00031 from the Rfam database [37], there is a statistically significant Pearson correlation between the logarithm of MFPT for a given RNA sequence and the standard deviation of the number of base pairs taken over the ensemble of all secondary structures of that sequence. After publication of [45], computational experiments performed with the Gillespie algorithm in place of the Monte Carlo algorithm by using Kinfold [17] suggested that there is no such correlation between log MFPT and the standard deviation of the number of base pairs of secondary structures in the ensemble (data not shown). This observation raises the question of which method to use for RNA folding kinetics.

In [17] the program Kinfold is described, now part of the Vienna RNA Package [33], which implements the Gillespie algorithm for RNA secondary structure formation for the Turner energy model [50]. In that paper, folding time is defined to be the first passage time from the empty structure to a given target structure, so that the MFPT is the expected folding time. In [4], an increase in the speed of Kinfold is reported, obtained by modifying the source code to incorporate memoization. In [14], the program KFOLD is described, which also implements the Gillespie algorithm, uses the Turner energy model, and performs the same elementary base pair addition/removal transitions as does Kinfold; however, KFOLD achieves a remarkable speed-up over Kinfold by exploiting the fact that many of the base-pair addition/deletion moves and their corresponding rates do not change between each step in the simulation. As in [17], folding time is defined in [14] to be the first passage time from the empty structure to a given target structure.

Instead of determining the mean first passage time for RNA secondary structure formation by using the Monte Carlo algorithm or the Gillespie algorithm, an alternative approach is to compute the equilibrium time by solving the master equation or by simulating the Gillespie algorithm until equilibrium has been achieved. In [53] the method Treekin is described, where macrostates are defined by basins of attraction, and population occupancy curves are computed over time for each macrostate. A similar approach is described in [44], where the authors use a different method of constructing macrostates.

Relation between MFPT for Markov chain versus process

In this paper, we consider the mean first passage time (MFPT) of random walks in a finite, discrete-time Markov chain M1, with transition probabilities given by equation (2) in the next section. We also consider the MFPT of trajectories in the finite, continuous-time Markov process M2, with transition rates given by equation (4) in the next section. Mean first passage time for the Markov chain M1 is computed by repeated simulations of the Metropolis Monte Carlo algorithm, while that for the Markov process M2 is computed by repeated simulations of the Gillespie algorithm. Although distinct, both methods to compute MFPT have been described in the literature – see the summary in the previous two sections. There appears to be a tacit and sometimes explicit assumption that the kinetics results are essentially equivalent [55, 49].

Indeed, mean first passage times for protein and/or RNA folding have been computed by matrix inversion and/or Monte Carlo simulations over a Markov chain [30, 49, 7, 25, 51, 32], while the master equation and/or Gillespie simulations have been used in [17, 53, 4, 14, 56]. Of particular interest is the construction of Markov state models from molecular dynamics folding trajectories for a protein or RNA molecule, followed either by mean first passage time computed by matrix inversion [51] or by equilibrium time computed by solving the master equation [56]. This leads to the question: what is the relation between mean first passage time computed by the Monte Carlo algorithm versus the Gillespie algorithm?

The answer to this question is trivial in the case that the Markov chain is N-regular, i.e. each state has degree N (N neighbors). In this case, we show that the MFPT obtained by the time-driven Monte Carlo algorithm is equal to N times that of the Gillespie algorithm. In the context of RNA secondary structures of a given RNA sequence, this suggests that the MFPT determined by the time-driven Monte Carlo algorithm might be approximately equal to ⟨N⟩ times the MFPT determined by the Gillespie algorithm, where ⟨N⟩ denotes the Boltzmann expected node degree (expected number of structural neighbors), formally defined by

 ⟨N⟩ = ∑s N(s)⋅P(s) = ∑s N(s)⋅exp(−E(s)/RT)/Z    (1)

where the sum is taken over all secondary structures s of the RNA sequence, N(s) is the number of neighbors of s, i.e. structures that can be obtained from s by the addition or removal of a single base pair, E(s) is the free energy of structure s using the Turner energy model [50], R is the universal gas constant, T is absolute temperature, and Z = ∑s exp(−E(s)/RT) denotes the partition function. However, we show that this is not the case, and we can only conclude that the Monte Carlo and Gillespie algorithms return loosely correlated mean first passage times for RNA secondary structure folding. On the other hand, we prove that asymptotically, the expected time for a K-transition Monte Carlo trajectory is equal to that for a K-transition Gillespie trajectory multiplied by ⟨N⟩.
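Equation (1) is simply a Boltzmann-weighted average of node degrees, as the following sketch shows. The four energies and degrees below are made-up illustrative values (not an RNA example), and RT ≈ 0.6 kcal/mol corresponds roughly to 37°C.

```python
import math

# Sketch of equation (1) on a hypothetical 4-state network: energies (kcal/mol)
# and degrees are invented for illustration; RT ~ 0.6 kcal/mol at 37 C.
RT = 0.6
energies = [-3.0, -1.0, -2.0, -1.0]   # E(s) for each state s
degrees  = [1, 3, 2, 2]               # N(s), number of neighbors of s

Z = sum(math.exp(-E / RT) for E in energies)                  # partition function
boltz = [math.exp(-E / RT) / Z for E in energies]             # P(s)
expected_degree = sum(N * p for N, p in zip(degrees, boltz))  # <N>

print(expected_degree)
```

Note that the low-energy state dominates the Boltzmann distribution, so ⟨N⟩ here lands much closer to that state's degree 1 than to the arithmetic mean degree 2.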

A rapidly emerging area of synthetic biology concerns the computational design of synthetic RNA [9, 13] by using inverse folding software [48, 57, 19, 15]. It seems clear that the next step in synthetic RNA design will be to control the kinetics of RNA folding. Due to the differences between MFPT computations with the Monte Carlo and Gillespie algorithms shown in this paper, we suggest that the Gillespie algorithm and related population occupancy curves determined by solution of the master equation constitute a more appropriate approach to macromolecular folding kinetics [56], rather than the Monte Carlo algorithm and related computation of mean first passage time by matrix inversion [7]. In the context of synthetic RNA design, we advocate the computation of MFPT using matrix inversion for a coarse-grained model [43], as an efficient initial screen of potential candidates, followed by the slower but more accurate coarse-grained method Treekin [17, 52], followed by repeated simulations using KFOLD [14].

Plan of the paper

In Section 2, we provide some background on RNA secondary structure, Markov chains and Markov processes. Section 3 describes three versions of the Monte Carlo algorithm (discrete-time time-driven, discrete-time event-driven and continuous-time event-driven) and the Gillespie algorithm, and illustrates these algorithms with a simple example. Readers familiar with Markov chains, Markov processes, the Monte Carlo algorithm and the Gillespie algorithm should read the pseudocode of Algorithms 1-4 (given in Figures 1-4) and otherwise skip Sections 2 and 3 and proceed directly to Section 4. That section presents a proof of a new theorem that Monte Carlo trajectory time asymptotically equals Gillespie trajectory time multiplied by the expected degree. RNA secondary structure folding simulations using the Monte Carlo algorithm and the Gillespie algorithm are given, which provide a computational illustration of the theorem. Despite the close asymptotic relation between Monte Carlo and Gillespie trajectory time, Section 4 shows that there is no such relation between Monte Carlo and Gillespie mean first passage time for RNA secondary structure folding, but only a loose correlation. Section 5 presents some discussion and concluding remarks. The Appendix presents computations that show that detailed balance does not hold for the Markov chain of secondary structures of an RNA sequence.

2 Background

In this section, we describe some definitions concerning RNA secondary structures, the expected degree of a secondary structure network, Markov chains and Markov processes. We use the notation ℝ for the set of real numbers, ℝ≥0 for the set of non-negative real numbers, and ℕ for the set of natural numbers 0, 1, 2, ….

2.1 Background on RNA secondary structure

A secondary structure for an RNA nucleotide sequence is a set of Watson-Crick or wobble base pairs (i,j), containing neither base triples nor pseudoknots. More formally, we have the following definition.

Definition 1

A secondary structure s for a given RNA nucleotide sequence a = a1⋯an is a set of base pairs (i,j), where 1 ≤ i < j ≤ n, such that:

1. if (i,j) ∈ s, then ai, aj form either a Watson-Crick (AU,UA,CG,GC) or wobble (GU,UG) base pair,

2. if (i,j) ∈ s, then j − i > 3 (a steric constraint requiring that there be at least 3 unpaired bases between any two positions that are paired),

3. if (i,j) ∈ s and (k,l) ∈ s, then i = k if and only if j = l (nonexistence of base triples),

4. if (i,j) ∈ s and (k,l) ∈ s with i < k, then it is not the case that i < k < j < l (nonexistence of pseudoknots).

The Turner energy model [50] is an additive energy model, where enthalpies, entropies and free energies for stacked base pairs, hairpins, bulges, internal loops, etc. are derived by least-squares fitting of experimental data from UV-absorbance (so-called optical melting) experiments, following the pioneering work of Tinoco [27]. In this model, there is no energy contribution for an isolated base pair; in contrast, stacked base pairs contribute negative, stabilizing free energy, while loop regions contribute positive, destabilizing free energy due to entropic loss. For instance, from melting experiments at 37°C, Turner's rules assign a stabilizing stacking free energy and enthalpy to each stacked pair. The minimum free energy (MFE) secondary structure can be computed in time that is cubic with respect to the length of an input RNA sequence; this is done by dynamic programming following the Zuker algorithm [60], as implemented in publicly available software, such as the Vienna RNA Package [33].

Secondary structures can be depicted in several equivalent manners. For instance, the sequence and dot-bracket representation for a type III hammerhead ribozyme from Peach Latent Mosaic Viroid (PLMVd) AJ005312.1/282-335 is given by

GAUGAGUCUGUGCUAAGCACACUGAUGAGUCUAUGAAAUGAGACGAAACUCAUA
.((((((.(((((...))))).......((((........))))...)))))).


The left panel of Figure 5 displays the equivalent usual presentation of the same structure, computed by RNAfold from the Vienna RNA Package [33] using the Turner 1999 energy parameters. This minimum free energy (MFE) structure agrees with the consensus structure from the Rfam database [20]. The right panel of Figure 5 depicts the network of all secondary structures for the 11-nt RNA sequence GCGCGCGCGCG, containing 27 nodes and 42 edges, where nodes are secondary structures in dot-bracket notation, and edges connect each pair of structures s, t in which t is obtained from s by adding or removing a single base pair. The move set MS1 [resp. MS2] consists of adding or removing [resp. adding, removing or shifting [17, 12]] a base pair (in this paper, we mostly concentrate on the MS1 move set). Mean first passage time (MFPT) is the average first passage time from a designated initial state x to a designated final state y, which can be approximated by averaging over a number of runs of the Monte Carlo algorithm [resp. Gillespie algorithm]. In the context of RNA secondary structure folding kinetics, the initial structure is often taken to be the empty structure and the target structure is often taken to be the MFE structure – see Figure 5, where the empty structure is indicated by a green circle, and the MFE structure by an orange circle. The contribution of this paper is to show that asymptotic trajectory times for the Monte Carlo and Gillespie algorithms are closely related, but that mean first passage times appear to be only loosely correlated.
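The size of this network can be checked by brute-force enumeration. The sketch below is not the authors' software; it uses only the combinatorial constraints of Definition 1 (valid base pairs, at least 3 unpaired bases in a hairpin loop, no base triples, no pseudoknots), with no energy model, and is practical only for very short sequences.

```python
# Sketch: enumerate all secondary structures of GCGCGCGCGCG and count the
# nodes and edges of the add/remove-one-base-pair network (Figure 5, right).
SEQ = "GCGCGCGCGCG"
VALID = {("A","U"),("U","A"),("C","G"),("G","C"),("G","U"),("U","G")}

def valid_pairs(seq, theta=3):
    """All base pairs (i,j) with >= theta unpaired bases between them."""
    n = len(seq)
    return [(i, j) for i in range(n) for j in range(i + theta + 1, n)
            if (seq[i], seq[j]) in VALID]

def compatible(bp1, bp2):
    """Two pairs may coexist if they share no position and do not cross."""
    (i, j), (k, l) = sorted([bp1, bp2])
    if len({i, j, k, l}) < 4:       # shared position (base triple)
        return False
    return not (i < k < j < l)      # crossing pair (pseudoknot)

PAIRS = valid_pairs(SEQ)

# Enumerate all structures (sets of mutually compatible pairs) by backtracking.
structures = []
def extend(partial, start):
    structures.append(frozenset(partial))
    for idx in range(start, len(PAIRS)):
        bp = PAIRS[idx]
        if all(compatible(bp, other) for other in partial):
            partial.append(bp)
            extend(partial, idx + 1)
            partial.pop()
extend([], 0)

# Edges: structures at base pair distance 1, i.e. t = s plus one base pair.
struct_set = set(structures)
num_edges = sum(1 for s in structures for bp in PAIRS
                if bp not in s and (s | {bp}) in struct_set)

print(len(structures), num_edges)   # 27 nodes, 42 edges
```

Since pairs are added in a fixed order, each compatible subset is visited exactly once, and each edge is counted once (from the smaller structure to the larger).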

2.2 Background on Markov chains and processes

Suppose that X is a finite set of states, and for each state x ∈ X there is an associated energy E(x). Suppose that G is a graph with vertex set X and an arbitrary, but fixed, edge set. G could be a complete graph with edges between any two distinct states, or G could be the network of secondary structures for a given RNA sequence, whose vertices are the exponentially [59] many secondary structures of the sequence, and whose edges are between any two structures s, t where t is obtained from s by removing or adding a single base pair – i.e. t is obtained by a move from s, or equivalently, s and t have base pair distance [36] of 1. For state x, let Nx ambiguously denote either the number of immediate neighbors of x, or the set of immediate neighbors of x in graph G; i.e. those y such that there is an edge from x to y in G. Let M1 be a discrete Markov chain with set X of states, having transition probability matrix P = (px,y) given by

 px,y = ⎧ (1/Nx)⋅min(1, exp(−(E(y)−E(x))/RT))   if y ∈ Nx
        ⎨ 1 − ∑u∈Nx px,u                        if x = y
        ⎩ 0                                     if y ∉ Nx, y ≠ x.    (2)

In the context of biomolecular folding (especially in Markov state models [47, 25, 39, 7, 32]), the transition matrix P is used to describe protein and RNA folding kinetics. Let p(t) be the time-dependent population occupancy row vector, where px(t) is the probability that the molecule is in state x at (discrete) time t – here, state x represents a particular molecular conformation in a discretized space of all possible conformations. Then

 p(t) = p(0)⋅P^t    (3)

Let M2 be a continuous-time Markov process with set X of states, having rate matrix Q = (qx,y) given by

 qx,y = ⎧ min(1, exp(−(E(y)−E(x))/RT))   if y ∈ Nx
        ⎨ −∑u∈Nx qx,u                    if x = y
        ⎩ 0                              if y ∉ Nx, y ≠ x.    (4)

In the context of biomolecular folding, the rate matrix Q is usually used in the context of the master equation, also called the Kolmogorov forward equation, defined as follows. As before, if p(t) denotes the time-dependent population occupancy row vector, where px(t) is the probability that the molecule is in state x at (continuous) time t, then the master equation is the following system of ordinary differential equations

 dpi(t)/dt = ∑j≠i ( qj,i⋅pj(t) − qi,j⋅pi(t) ),   for i = 1, …, n.

Since qx,x = −∑u≠x qx,u, this set of equations is equivalent to the matrix differential equation

 dp(t)/dt = p(t)⋅Q    (5)

Let S = (sx,y) denote the embedded Markov chain, also called the jump chain, corresponding to the Markov process M2, having state space X and transition probability matrix given by

 sx,y = ⎧ qx,y / ∑z≠x qx,z   if y ∈ Nx
        ⎩ 0                  if x = y or y ∉ Nx.    (6)

The stationary probability distribution π for a Markov chain [resp. Markov process] with transition probability matrix P [resp. rate matrix Q] must satisfy π = π⋅P [resp. 0 = π⋅Q]. It is well-known that the stationary distribution exists and is unique for any finite, irreducible, aperiodic Markov chain [11]. Detailed balance is said to hold if, for all states x, y, πx⋅px,y = πy⋅py,x [resp. πx⋅qx,y = πy⋅qy,x].

Define the Boltzmann probabilities

 πx = exp(−E(x)/RT)/Z    (7)

where the partition function is defined by Z = ∑x exp(−E(x)/RT). It is straightforward to verify that detailed balance holds for the rate matrix Q defined in equation (4). From detailed balance, it is easy to verify that the Boltzmann distribution is the stationary distribution for the Markov process defined by equation (4); indeed, if x, y are neighbors and E(y) ≤ E(x), then qx,y = 1 and qy,x = exp(−(E(x)−E(y))/RT), so πx⋅qx,y = exp(−E(x)/RT)/Z = πy⋅qy,x. However, as shown in the Appendix, the Markov chain does not generally satisfy detailed balance for the network of RNA secondary structures.
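Detailed balance of the rate matrix of equation (4) with respect to the Boltzmann distribution of equation (7) is easy to confirm numerically. The 4-cycle network and energies below are illustrative assumptions, not taken from the paper.

```python
import math

# Numerical check, on a hypothetical 2-regular 4-cycle with made-up energies,
# that the rate matrix of equation (4) satisfies detailed balance with respect
# to the Boltzmann distribution of equation (7).
RT = 1.0
E = [-3.0, -1.0, -2.0, -1.0]                              # energies of states 0..3
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # the cycle 0-1-2-3-0

def rate(x, y):
    """Off-diagonal rate q_{x,y} of equation (4)."""
    return min(1.0, math.exp(-(E[y] - E[x]) / RT)) if y in neighbors[x] else 0.0

Z = sum(math.exp(-e / RT) for e in E)
pi = [math.exp(-e / RT) / Z for e in E]   # Boltzmann distribution, equation (7)

balanced = all(abs(pi[x] * rate(x, y) - pi[y] * rate(y, x)) < 1e-12
               for x in range(4) for y in range(4) if x != y)
print(balanced)   # True: pi_x q_{x,y} == pi_y q_{y,x} for all pairs
```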

Using Hastings' trick [24], where px,y in equation (2) is redefined for y ∈ Nx so as to compensate for the differing neighborhood sizes Nx and Ny (for instance, by px,y = (1/Nx)⋅min(1, (Nx/Ny)⋅exp(−(E(y)−E(x))/RT))), we obtain a Markov chain satisfying detailed balance. Moreover, the stationary distribution for the Hastings Markov chain is easily shown to be the Boltzmann distribution from equation (7). Nevertheless, this modification comes at a computational cost, since Ny must be determined for each neighbor y of x. It is perhaps for this reason that Hastings' trick is not used in some kinetics simulations from the literature, such as the Monte Carlo time-driven algorithm for protein folding [40] and Markov state models defined from long molecular dynamics (MD) trajectories [7].
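The effect of a Hastings-type correction can be seen on any graph with unequal degrees. In the sketch below, the corrected form px,y = (1/Nx)·min(1, (Nx/Ny)·exp(−(E(y)−E(x))/RT)) is one standard choice (an assumption for illustration, since the precise redefinition is not reproduced here), and the graph and energies are made up; the check shows that the corrected chain satisfies detailed balance with the Boltzmann distribution while the uncorrected chain of equation (2) does not, once degrees differ.

```python
import math

# Hypothetical irregular graph and energies, purely for illustration.
RT = 1.0
E = [-2.0, -1.0, -1.5, 0.0]
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}   # degrees 3, 2, 2, 1

def p_metropolis(x, y):
    """Off-diagonal entry of equation (2) (uncorrected)."""
    if y not in neighbors[x]:
        return 0.0
    return (1.0 / len(neighbors[x])) * min(1.0, math.exp(-(E[y] - E[x]) / RT))

def p_hastings(x, y):
    """One standard Hastings-type correction (an assumed form)."""
    if y not in neighbors[x]:
        return 0.0
    Nx, Ny = len(neighbors[x]), len(neighbors[y])
    return (1.0 / Nx) * min(1.0, (Nx / Ny) * math.exp(-(E[y] - E[x]) / RT))

Z = sum(math.exp(-e / RT) for e in E)
pi = [math.exp(-e / RT) / Z for e in E]

def balanced(p):
    return all(abs(pi[x] * p(x, y) - pi[y] * p(y, x)) < 1e-12
               for x in range(4) for y in range(4) if x != y)

print(balanced(p_metropolis), balanced(p_hastings))   # False True
```

Algebraically, πx·p_hastings(x,y) = min(πx/Nx, πy/Ny), which is symmetric in x and y; the uncorrected chain lacks this symmetry whenever Nx ≠ Ny.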

3 Monte Carlo and Gillespie algorithms

In this section, we explore the relation between the MFPT for a discrete-time Markov chain with transition probability matrix (2) and that of the continuous-time Markov process with rate matrix (4). Consider Algorithms 1, 2, 3 and 4. The MFPT of the Markov chain can be approximated by taking the average over many runs of Algorithm 1, as done to show that proteins fold quickly exactly when there is a large energy gap between the energy of the native state and that of the next lowest energy misfolded state [40]. Time-driven Monte Carlo Algorithm 1 is equivalent to event-driven Monte Carlo Algorithm 2, since the time spent in state x is geometrically distributed with success probability Φ1(x), the probability of leaving state x. By replacing the geometric distribution by the continuous exponential distribution with the same mean, Monte Carlo Algorithms 2 and 3 are equivalent (see Figure 6 for a comparison of relative histograms of samples from geometric and exponential distributions having the same mean). For a given state x, the probability Φ1(x) of leaving x is equal to the reciprocal of the number Nx of neighbors of x times the flux Φ2(x) out of x. Nevertheless, the jump probability is identical in Algorithms 2, 3, 4, since px,y/Φ1(x) is equal to qx,y/Φ2(x). It follows that Algorithms 1, 2, 3, 4 have identical probabilities of visiting exactly the same states in a trajectory, the only difference being that the expected waiting time in any state x differs by the factor Nx.
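These relations between Φ1, Φ2 and the jump probabilities can be verified directly. The 4-cycle network and energies below are illustrative assumptions, not taken from the paper.

```python
import math

# Check, on a hypothetical 2-regular 4-cycle, that the jump probabilities
# p_{x,y}/Phi1(x) and q_{x,y}/Phi2(x) coincide, while Phi2(x) = N_x * Phi1(x).
RT = 1.0
E = [-3.0, -1.0, -2.0, -1.0]
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

def q(x, y):  # rate matrix entries of equation (4), off-diagonal
    return min(1.0, math.exp(-(E[y] - E[x]) / RT)) if y in neighbors[x] else 0.0

def p(x, y):  # Markov chain entries of equation (2), off-diagonal
    return q(x, y) / len(neighbors[x])

ok = True
for x in range(4):
    phi1 = sum(p(x, y) for y in range(4) if y != x)   # probability of leaving x
    phi2 = sum(q(x, y) for y in range(4) if y != x)   # flux out of x
    Nx = len(neighbors[x])
    ok = ok and abs(phi2 - Nx * phi1) < 1e-12         # Phi2 = N_x * Phi1
    for y in neighbors[x]:
        ok = ok and abs(p(x, y) / phi1 - q(x, y) / phi2) < 1e-12  # same jump chain
print(ok)   # True
```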

The previous discussion shows that the only difference between the MFPT of a discrete Markov chain and that of the related continuous-time Markov process lies in the fact that the time increment in the former is sampled from a geometric distribution with mean 1/Φ1(x), while the time increment in the latter is sampled from an exponential distribution with mean 1/Φ2(x). Hence, if the value Nx equals a fixed constant N for each state x, then the Markov chain MFPT equals N times the Markov process MFPT, as shown in the following example.

Example 1

Let X = {0, 1, …, n−1} be the set of states, where the energy E(k) of state k is given by

 E(k) = ⎧ −3   if k = 0
        ⎨ −2   if k > 0 and k is even
        ⎩ −1   if k is odd.    (8)

The left panel of Figure 7 depicts the n-node network. Let 1[x ∼ y] denote the indicator function for whether x, y are adjacent in a circle of size n. Since the underlying graph is 2-regular, Nx = 2 for each state x, so the transition probability matrix P of M1 is given by equation (2) with Nx = 2, while the rate matrix Q of M2 is given by equation (4), in both cases with neighbors the circle-adjacent states.

Relation between M1 and M2

Given that the definition of the transition probability matrix P of the Markov chain M1, as given in equation (2), is similar to that of the rate matrix Q of the Markov process M2, as given in equation (4), it is natural to ask what the relation is between P and Q. The answer is that there is no relation. For the Markov chain M1, equation (3) states that the population occupancy vector at time t satisfies p(t) = p(0)⋅P^t. In contrast, for the Markov process M2, the population occupancy vector at time t satisfies p(t) = p(0)⋅exp(Q⋅t), since the solution of the master equation (5) is

 p(t) = p(0)⋅exp(Q⋅t) (12)

For instance, if we set n = 4 in Example 1, then the transition probability matrix is

 P = ⎛ 0.8647  0.0677  0.0000  0.0677 ⎞
     ⎜ 0.5000  0.0000  0.5000  0.0000 ⎟
     ⎜ 0.0000  0.1839  0.6321  0.1839 ⎟
     ⎝ 0.5000  0.0000  0.5000  0.0000 ⎠

while the rate matrix is

 Q = ⎛ −0.2707   0.1353   0.0000   0.1353 ⎞
     ⎜  1.0000  −2.0000   1.0000   0.0000 ⎟
     ⎜  0.0000   0.3679  −0.7358   0.3679 ⎟
     ⎝  1.0000   0.0000   1.0000  −2.0000 ⎠

and the exponential of the rate matrix is

 exp(Q) = ⎛ 0.8299  0.0566  0.0569  0.0566 ⎞
          ⎜ 0.4183  0.1983  0.3205  0.0629 ⎟
          ⎜ 0.1547  0.1179  0.6095  0.1179 ⎟
          ⎝ 0.4183  0.0629  0.3205  0.1983 ⎠

Clearly P ≠ exp(Q), so the value p(1) obtained by equation (3) is unequal to the value obtained by equation (5).
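The matrices P and Q of this example can be reproduced directly from equations (2) and (4). Matching the printed entries (e.g. p0,1 = 0.0677 = ½·e⁻²) requires taking RT = 1 in the units of the example; that value is an inference from the entries rather than a stated parameter.

```python
import math

# Reconstruction of P and Q of Example 1 with n = 4 and RT = 1 (inferred).
RT = 1.0
n = 4
E = [-3.0 if k == 0 else (-2.0 if k % 2 == 0 else -1.0) for k in range(n)]

def adjacent(x, y):
    """Neighbors on the circle of size n."""
    return (x - y) % n == 1 or (y - x) % n == 1

P = [[0.0] * n for _ in range(n)]
Q = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if adjacent(x, y):
            Q[x][y] = min(1.0, math.exp(-(E[y] - E[x]) / RT))  # equation (4)
            P[x][y] = Q[x][y] / 2.0        # equation (2): every state has N_x = 2
    P[x][x] = 1.0 - sum(P[x][y] for y in range(n) if y != x)
    Q[x][x] = -sum(Q[x][y] for y in range(n) if y != x)

for row in P:
    print([round(v, 4) for v in row])
```

Rounding each entry to four digits reproduces the printed P and Q above.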

Mean first passage time for Example 1

Using the 20-state Markov chain obtained in Example 1 by setting n = 20, the time to reach the minimum energy state 0 was computed for the (1) time-driven Monte Carlo Algorithm 1, (2) event-driven Monte Carlo Algorithm 2 with geometrically distributed waiting times, (3) event-driven Monte Carlo Algorithm 3 with exponentially distributed waiting times, and (4) Gillespie's Algorithm 4. Times for Algorithms 1, 2 and 3 were divided by 2, since the number of neighbors of each state is 2, while times for Algorithm 4 were reported as computed. The average time was taken over 10,000 separate runs of each algorithm, and then histograms were produced for 1000 repetitions of each of the 10,000 runs. It follows by the central limit theorem that each histogram is approximately normal; moreover, after this rescaling, the means and standard deviations of the four histograms are in close agreement. The right panel of Figure 7 displays the superimposed histograms of the expected number of moves obtained on average for Algorithms 1, 2, 3, 4 for a state set with n = 20 states.
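A stripped-down version of this experiment can be simulated as follows, assuming exponentially distributed waiting times as in Algorithms 3 and 4 and an arbitrary trajectory length; on the 2-regular circle, the Monte Carlo trajectory time should be about twice the Gillespie trajectory time, since 1/Φ1(x) = 2/Φ2(x) at every state.

```python
import math
import random

# Stochastic illustration on the 2-regular circle of Example 1 with n = 20:
# the total time of a K-transition event-driven Monte Carlo trajectory
# (Algorithm 3) is about N = 2 times that of a Gillespie trajectory
# (Algorithm 4).  The trajectory length K and the seed are arbitrary choices.
RT, n, K = 1.0, 20, 50_000
E = [-3.0 if k == 0 else (-2.0 if k % 2 == 0 else -1.0) for k in range(n)]

def run(scale_by_degree, rng):
    """One K-transition trajectory; returns its total (continuous) time."""
    x, t = n // 2, 0.0
    for _ in range(K):
        nbrs = [(x - 1) % n, (x + 1) % n]
        rates = [min(1.0, math.exp(-(E[y] - E[x]) / RT)) for y in nbrs]
        flux = sum(rates)                        # Phi2(x)
        leave = flux / 2.0 if scale_by_degree else flux   # Phi1(x) or Phi2(x)
        t += rng.expovariate(leave)              # exponential waiting time
        x = rng.choices(nbrs, weights=rates)[0]  # identical jump probabilities
    return t

rng = random.Random(42)
ratio = run(True, rng) / run(False, rng)
print(ratio)   # close to the common degree N = 2
```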

4 Results

4.1 Expected trajectory time

Throughout this section, we assume that M1 [resp. M2] is a finite, irreducible, aperiodic Markov chain [resp. Markov process] whose transition probabilities px,y [resp. transition rates qx,y] satisfy equation (2) [resp. equation (4)]. Moreover, we assume that M1 and M2 have the same underlying set X of states, and that each state x is labeled by the same energy E(x) in both M1 and M2.

The main result of this section is that asymptotically, for large values of K, the expected time taken by a K-transition trajectory (each step of Algorithms 2, 3 and 4 involves a transition from a current state to a distinct state; however, each step of the time-driven Monte Carlo Algorithm 1 does not necessarily involve a transition to a new state, especially if the current state has low energy, so that the number of steps may greatly exceed the number of transitions – for this reason, we use the term K-transition trajectory) for each of the Monte Carlo Algorithms 1, 2, 3 is equal to ⟨N⟩ multiplied by the expected time for a K-transition trajectory of the Gillespie Algorithm 4, where ⟨N⟩ denotes the expected degree as defined in equation (1); i.e. the expected number of neighbors, given by

 ⟨N⟩ = ∑x∈X (exp(−E(x)/RT)/Z)⋅Nx

Let Algorithm 1′, 2′, 3′, 4′ respectively denote Algorithm 1, 2, 3, 4 where line 3 of each algorithm is replaced by the line “while true”; i.e. there is no condition on the while loop, so the algorithms do not terminate. In Theorem 1, we establish that for any given RNA sequence, as the number of trajectory steps approaches infinity, the trajectory time of Algorithm 3′ equals that of Algorithm 4′ multiplied by the expected network degree ⟨N⟩. In the previous section, we showed that Algorithms 1, 2, 3 are equivalent; it follows that Algorithms 1′, 2′, 3′ are also equivalent. Thus Theorem 1 directly relates the (asymptotic) trajectory time of the Monte Carlo time-driven and event-driven algorithms with that of Gillespie's algorithm. In order to give a formal statement of this result, we need some definitions.

Definition 2 (K-step trajectory and expected trajectory time)

A K-step trajectory (sometimes called a K-transition trajectory) of Monte Carlo Algorithm 3 [resp. Gillespie Algorithm 4] is a sequence of states x0, x1, …, xK in X, where xi ≠ xi+1 for 0 ≤ i < K, although xi = xj can occur for non-consecutive indices i < j. Let τK1 [resp. τK2] denote the random variable whose value is the sum of the time increments in line 11 of Algorithm 3 [resp. Algorithm 4]. Denote the expected time for K steps of Algorithm 3 [resp. Algorithm 4] by E[τK1] [resp. E[τK2]], where the expectation is taken over all stochastically sampled K-step trajectories (lines 13-18) and sampled time increments (line 11) of Algorithm 3 [resp. Algorithm 4].

Theorem 1

Let μ(a) = ⟨N⟩ = ∑x πx⋅Nx denote the Boltzmann expected network degree, where Nx denotes the degree of state x. Then limK→∞ E[τK1]/E[τK2] = μ(a).

Proof: For each state x, the Boltzmann probability is defined by q∗x = exp(−E(x)/RT)/Z, where the partition function is defined by Z = ∑y exp(−E(y)/RT). For state x, define the probability of leaving x [resp. flux out of x], denoted Φ1(x) [resp. Φ2(x)], by

 Φ1(x) = ∑y≠x px,y = Φ2(x)/Nx    (13)
 Φ2(x) = ∑y≠x qx,y = Nx⋅Φ1(x)    (14)

Recall that the transition probability matrix S of the embedded chain, also called the jump chain, is defined in equation (6), and hence satisfies

 sx,y = ⎧ qx,y/Φ2(x) = px,y/Φ1(x)   if y ∈ Nx
        ⎩ 0                          if x = y or y ∉ Nx.    (15)

A K-step trajectory of either Algorithm 3 or 4 corresponds to a random walk on the jump chain with states X and transition matrix S as defined in equation (15). Since the rate matrix Q, defined in equation (4), for the Markov process corresponding to Algorithm 4 trivially satisfies detailed balance, so does the transition matrix S for the jump chain, defined in equation (15).

Claim: Define the row vector s∗ = (s∗x : x ∈ SS(a)) by

 s∗x = q∗x⋅Φ2(x) / ∑y∈SS(a) q∗y⋅Φ2(y).    (16)

Then s∗⋅S = s∗; hence s∗ is the (unique) stationary distribution for the jump chain S.

Proof: For fixed state y, the yth coordinate of the row vector s∗⋅S satisfies

 (s∗S)y = ∑x=1..n s∗x⋅sx,y = ∑x≠y s∗x⋅sx,y
        = ∑x≠y q∗x⋅qx,y / ∑z∈SS(a) q∗z⋅Φ2(z)
        = ∑x≠y q∗y⋅qy,x / ∑z∈SS(a) q∗z⋅Φ2(z)
        = q∗y⋅( ∑x≠y qy,x ) / ∑z∈SS(a) q∗z⋅Φ2(z)
        = q∗y⋅Φ2(y) / ∑z∈SS(a) q∗z⋅Φ2(z) = s∗y

Note that line 2 follows by the definition of $s^*$ and $S$; line 3 follows by detailed balance of $Q$; line 4 follows by factoring out terms that do not depend on $x$, and line 5 follows by the definition of $\Phi_2$ and $s^*$.

An equivalent, more intuitive statement of equation (16) is that $s^*_x$ equals the Boltzmann probability of state $x$, times the flux out of $x$, divided by the expected flux with respect to the Boltzmann distribution; i.e. $s^*_x$ is the ratio of the Boltzmann-weighted flux out of $x$ to the expected flux out of any state, where expectation is taken over all states. This proves the claim.

If the number $K$ of steps in the trajectory is large, then the expected number of occurrences of each state $x$ is $s^*_x \cdot K$. The expected time to leave state $x$ in Algorithm 3′ [resp. Algorithm 4′] is $1/\Phi_1(x)$ [resp. $1/\Phi_2(x)$], since waiting times are exponentially distributed. It follows that the ratio of the expected time for a $K$-step trajectory of Algorithm 3′ divided by the expected time for a $K$-step trajectory of Algorithm 4′ equals:

\[
\begin{aligned}
\frac{E[\tau_1^K]}{E[\tau_2^K]} &= \frac{\sum_{x=1}^n s^*_x \, K/\Phi_1(x)}{\sum_{x=1}^n s^*_x \, K/\Phi_2(x)} = \frac{\sum_{x=1}^n s^*_x \, N_x/\Phi_2(x)}{\sum_{x=1}^n s^*_x / \Phi_2(x)} \\
&= \frac{\sum_{x=1}^n \frac{q^*_x \Phi_2(x)}{\sum_z q^*_z \Phi_2(z)} \cdot \frac{N_x}{\Phi_2(x)}}{\sum_{x=1}^n \frac{q^*_x \Phi_2(x)}{\sum_z q^*_z \Phi_2(z)} \cdot \frac{1}{\Phi_2(x)}} \\
&= \frac{\sum_{x=1}^n q^*_x \, N_x}{\sum_{x=1}^n q^*_x} = \frac{\sum_{x=1}^n \pi_x \, N_x}{1} \\
&= \sum_{x=1}^n \frac{\exp(-E(x)/RT) \cdot N_x}{Z} = \mu(\mathbf{a}).
\end{aligned}
\]

This completes the proof of Theorem 1.
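The closing computation of the proof can be checked numerically on a small toy landscape. The following Python sketch is purely illustrative (it is not the authors' mc.c; the path graph, its energies, and $RT = 1$ are all hypothetical choices): it builds a non-regular path graph, forms the jump-chain stationary distribution $s^*$ of equation (16), and verifies that the ratio of expected per-step waiting times equals the Boltzmann expected degree $\mu$.

```python
import math

# Toy landscape: 4 states on a path 0-1-2-3 (a non-regular graph:
# endpoints have degree 1, interior states have degree 2).
E = [0.0, 1.5, 0.7, 2.0]                     # hypothetical energies, RT = 1
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

Z = sum(math.exp(-e) for e in E)
pi = [math.exp(-e) / Z for e in E]           # Boltzmann distribution
N = {x: len(ys) for x, ys in nbrs.items()}   # degree N_x of each state

def q(x, y):
    # Gillespie rate q_{x,y} = min(1, exp(-(E(y) - E(x))))
    return min(1.0, math.exp(-(E[y] - E[x])))

Phi2 = {x: sum(q(x, y) for y in nbrs[x]) for x in nbrs}  # flux out of x
Phi1 = {x: Phi2[x] / N[x] for x in nbrs}                 # MC leaving prob.

# Jump-chain stationary distribution: s*_x proportional to pi_x * Phi2(x)
w = {x: pi[x] * Phi2[x] for x in nbrs}
W = sum(w.values())
s = {x: w[x] / W for x in nbrs}

# Expected per-step waiting times: 1/Phi1(x) for MC, 1/Phi2(x) for Gillespie
ratio = sum(s[x] / Phi1[x] for x in nbrs) / sum(s[x] / Phi2[x] for x in nbrs)
mu = sum(pi[x] * N[x] for x in nbrs)         # Boltzmann expected degree
print(ratio, mu)                             # the two values coincide
```

The cancellation is exact here, mirroring the algebra in the displayed derivation: weighting $N_x/\Phi_2(x)$ and $1/\Phi_2(x)$ by $s^*_x \propto \pi_x \Phi_2(x)$ leaves exactly $\sum_x \pi_x N_x$.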

Corollary 1

The same proof shows that the mean recurrence time for the Markov chain is equal to $\mu(\mathbf{a})$ times the mean recurrence time for the Markov process. (See [16] for the definition of mean recurrence time.)

4.2 Illustrative examples from RNA

For a fixed RNA sequence $\mathbf{a}$, denote the set of all secondary structures of $\mathbf{a}$ by $SS(\mathbf{a})$. Define the Markov chain [resp. Markov process] on this set of states, where transitions occur between secondary structures that differ by base pair distance $1$. The chain [resp. process] has probability matrix $P$ [resp. rate matrix $Q$] defined by equation (2) [resp. equation (4)], where $E(x)$ denotes the free energy of secondary structure $x$ with respect to the Turner energy model [50]. Theorem 1 then implies that for trajectories of length $K$, for large $K$, the expected trajectory time for each of Algorithm 1′, Algorithm 2′, and Algorithm 3′ is approximately $\mu(\mathbf{a})$ multiplied by the trajectory time for Algorithm 4′, where $\mu(\mathbf{a})$ is the expected degree for the network of secondary structures of RNA sequence $\mathbf{a}$.

Although the set $SS(\mathbf{a})$ is generally of size exponential in the sequence length (see [46, 18]), we recently described a cubic time dynamic programming algorithm [10] to compute the expected network degree

\[
\mu(\mathbf{a}) = \sum_{x \in SS(\mathbf{a})} \frac{\exp(-E(x)/RT)}{Z} \cdot N_x
\]

where $N_x$ denotes the number of secondary structures having base pair distance $1$ to $x$.
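For a sequence as short as the 10-mer GGGGGCCCCC discussed below, $\mu(\mathbf{a})$ can also be obtained by brute-force enumeration, which makes the definition concrete (the cubic-time algorithm of [10] is of course what one would use in practice). The following Python sketch is a hypothetical reconstruction, not the authors' software: it enumerates all secondary structures with minimum hairpin loop size $\theta = 3$, counts each structure's neighbors under base pair addition/removal, and averages the degree under a simplified energy of $-1$ per base pair (a stand-in for the Turner model, which is not reproduced here), with an assumed $RT = 0.6$.

```python
import math
from itertools import combinations

SEQ = "GGGGGCCCCC"
THETA = 3                      # minimum hairpin loop size (Vienna default)
CANON = {("A","U"),("U","A"),("G","C"),("C","G"),("G","U"),("U","G")}

def can_pair(i, j):
    return j - i > THETA and (SEQ[i], SEQ[j]) in CANON

def enum(i, j):
    """All non-crossing sets of base pairs on positions i..j-1."""
    if j - i <= THETA:
        return [frozenset()]
    out = list(enum(i + 1, j))             # position i unpaired
    for k in range(i + THETA + 1, j):      # position i paired with k
        if can_pair(i, k):
            for s1 in enum(i + 1, k):
                for s2 in enum(k + 1, j):
                    out.append(s1 | s2 | frozenset([(i, k)]))
    return out

SS = enum(0, len(SEQ))

def degree(x):
    """Number N_x of structures at base pair distance 1 from x."""
    used = {p for bp in x for p in bp}
    add = sum(1 for i, j in combinations(range(len(SEQ)), 2)
              if can_pair(i, j) and i not in used and j not in used
              and all(not (a < i < b < j or i < a < j < b) for a, b in x))
    return len(x) + add                    # removals plus additions

RT = 0.6                                   # assumed; roughly kcal/mol at 37 C
E = {x: -1.0 * len(x) for x in SS}         # toy energy: -1 per base pair
Z = sum(math.exp(-E[x] / RT) for x in SS)
mu_boltz = sum(math.exp(-E[x] / RT) / Z * degree(x) for x in SS)
mu_unif = sum(degree(x) for x in SS) / len(SS)
print(len(SS), mu_unif, mu_boltz)          # 62 structures, as in the text
```

The enumeration recovers the 62 structures of GGGGGCCCCC cited below; only the expected degree values differ from the paper's, since the Turner energies are replaced by the toy energy.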

We illustrate Theorem 1 in the following computational experiments. For a given RNA sequence $\mathbf{a}$, let $x_0$ denote the empty secondary structure containing no base pairs. If two algorithms begin from the same initial state and use the same pseudo-random number generator seed, then the algorithms are said to be synchronized, or to generate synchronized trajectories; if no random seed is set, then the algorithms are said to be unsynchronized. If Algorithms 3 and 4 are synchronized, then clearly each algorithm visits exactly the same states in the same order.

The following computational experiment is described. Given an RNA sequence $\mathbf{a}$, we determine the minimum free energy (MFE) structure by Zuker’s algorithm [60], as implemented in the Vienna RNA Package [33]. For the 32 nt fruA SECIS element CCUCGAGGGG AACCCGAAAG GGACCCGAGA GG, for which the target minimum free energy structure is generally reached in fewer than one thousand moves, we analyzed a trajectory of one million moves. For synchronized runs, the ratio of the Algorithm 3 trajectory time divided by the expected number of neighbors $\mu(\mathbf{a})$ is close to the Algorithm 4 trajectory time; for unsynchronized runs, the corresponding ratio is very close to the Algorithm 4 trajectory time. Similar validation of Theorem 1 was shown in a second computational experiment, in which a 500 thousand step trajectory was generated by Algorithms 3 and 4 for the 76 nt ala-tRNA from Mycoplasma mycoides with Sprinzl ID RA1180 (tRNAdb ID tdbR00000006) [29]. A final illustration of Theorem 1 is given in the right panel of Figure 7, which displays a scatter plot, for both synchronized and unsynchronized runs of one million steps of the Monte Carlo Algorithm 3 and the Gillespie Algorithm 4, for forty 20 nt randomly generated RNA sequences, each of whose MFE secondary structures has negative free energy, each having at most 2500 secondary structures, and each with expected compositional frequency of 0.25 for each of A, C, G, U.

Figure 8 shows the distribution of the number of neighbors for the 76 nt transfer RNA (RA1180 from tRNAdb 2009 [29]), as well as that for the 56 nt spliced leader RNA from Leptomonas collosoma (a known conformational switch), together with their means and standard deviations. Although we can exactly compute the expected network degree in seconds [10], the only current method to obtain an approximation of the distribution is by sampling with RNAsubopt [54] from the Vienna RNA Package [33], a computation requiring many hours. It follows that the RNA examples illustrating Theorem 1 could not have been given without the dynamic programming algorithm to compute expected network degree [10].

Remark 1

In our examples of RNA secondary structure trajectory time, we considered elementary step transitions between secondary structures $x$ and $y$, in which $y$ is obtained from $x$ by the addition or removal of a single base pair. However, Theorem 1 is equally applicable to more general move sets between RNA secondary structures, including shift moves [17] and even the addition and removal of entire stems [28]. In each of these cases, it is easily established that the underlying Markov chain [resp. Markov process] is finite, irreducible and aperiodic, so that Theorem 1 applies.

Remark 2

The program Kinfold [17] is an implementation of the Gillespie algorithm, which supports both move set $MS_1$, consisting of adding or removing a single base pair, and move set $MS_2$, consisting of adding, removing, or shifting a base pair. Since we found it difficult to modify the source code of Kinfold to implement the Monte Carlo algorithm, we implemented the Monte Carlo and Gillespie algorithms for RNA secondary structure networks in a C program, mc.c, available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors/. In [12], we generalized the algorithm from [10] to efficiently compute the expected network degree for RNA secondary structure networks with respect to move set $MS_2$. At the present time, however, our program mc.c does not support shift moves.

4.3 Monte Carlo and Gillespie MFPT

Here, we show that unless the network is regular (each node has the same degree), there is no simple analogue of Theorem 1 relating the mean first passage time (MFPT) for the Monte Carlo Algorithm 3 and the Gillespie Algorithm 4 for RNA secondary structure folding kinetics. Algorithms 3′ and 4′ differ from Algorithms 3 and 4 in that the former do not terminate computation upon visiting the minimum free energy structure. This modification allowed us to establish Theorem 1.

In Example 1 from Section 3, we described the computation of mean first passage time (MFPT) using each of Algorithms 1, 2, 3, 4 for a Markov chain [resp. Markov process] having states $0, 1, \ldots, 2n-1$, where the energy of state $i$ is defined in equation (8), and the transition probability of moving from state $i$ to state $j$ is defined in equation (10). Since the underlying graph is $2$-regular, Figure 7 indeed shows that the MFPT for each of Monte Carlo Algorithms 1, 2, 3 is twice the MFPT of Gillespie Algorithm 4.

By slightly modifying the topology of Example 1, we obtain a non-regular network with states $0, 1, \ldots, 2n-1$, in which all states have degree $2$ except the (terminal) minimum energy state $0$ and the last state $2n-1$, which have degree $1$. For this example, described in the following, there appears to be no simple relation between the MFPT of Monte Carlo Algorithms 1, 2, 3 and that of Gillespie Algorithm 4.

Example 2

Define the Markov chain [resp. Markov process] as in Example 1, with the sole exception that state $2n-1$ is no longer connected to state $0$; i.e. the indicator function for whether states $i, j$ are adjacent is redefined by

\[
\text{Adj}(i,j) = \begin{cases} 1 & \text{if } 0 \leq i < 2n-1 \text{ and } j = i+1 \\ 1 & \text{if } 0 \leq j < 2n-1 \text{ and } i = j+1 \\ 0 & \text{otherwise} \end{cases} \qquad (17)
\]

Transition probabilities [resp. rates] are defined by equation (2) [resp. equation (4)], where we set $RT = 1$; hence

\[
p_{i,j} = \begin{cases} 0.5 \cdot \min\bigl(1, \exp(-(E(j)-E(i)))\bigr) & \text{if Adj}(i,j)=1,\ 1 \leq i,j < 2n-1 \\ \min\bigl(1, \exp(-(E(j)-E(i)))\bigr) & \text{if } i=0, j=1 \text{ or } i=2n-1, j=2n-2 \\ 1 - p_{i,(i+1 \bmod 2n)} - p_{i,(i-1 \bmod 2n)} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad (18)
\]

while the rate matrix of the Markov process remains unmodified, as defined by equation (4).

Let $n = 10$, so that there are $2n = 20$ states $0, 1, \ldots, 19$, and let the initial state be state $n = 10$, as in Example 1. The time to reach the minimum energy state $0$ was computed for the time-driven Monte Carlo Algorithm 1, the event-driven Monte Carlo Algorithm 2 with geometrically distributed waiting times, the event-driven Monte Carlo Algorithm 3 with exponentially distributed waiting times, and the Gillespie Algorithm 4. The average time was taken over 10,000 separate runs of each algorithm, and histograms were produced for 1000 repetitions of each of the 10,000 runs, recording the mean and standard deviation of each histogram. Figure 9 displays the relative histograms for these data, where for comparison purposes, first passage times for Gillespie Algorithm 4 are multiplied by a constant factor. The uniform expected number of neighbors, or network degree, computed over the collection of 19 non-terminal states $1, \ldots, 19$, differs from the Boltzmann expected number of neighbors, and the ratio of the average of the mean first passage times computed by Monte Carlo Algorithms 1, 2, 3 to the mean first passage time for the Gillespie Algorithm 4 matches neither. The p-value for the 2-tailed t-test for equality of Monte Carlo MFPT with Gillespie MFPT times uniform network degree falls below any standard significance threshold. It follows that Monte Carlo MFPT is statistically different from Gillespie MFPT times the expected degree of the network in the case of Example 2 (for either uniform or Boltzmann probability).
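The first passage times of Example 2 can be reproduced in outline with a short simulation. The Python sketch below is hypothetical: the energies of equation (8) are not reproduced in this section, so a simple zig-zag energy stands in for them, with $RT = 1$. It simulates the event-driven Monte Carlo Algorithm 3 (exponential waiting time with rate $\Phi_1(x)$) and the Gillespie Algorithm 4 (rate $\Phi_2(x)$) on the path graph of equation (17), both using the same jump-chain probabilities.

```python
import math
import random

random.seed(42)
n = 5                                    # 2n = 10 states on a path 0..9
# Hypothetical zig-zag energies (a stand-in for equation (8)), RT = 1
E = [0.5 * (i % 2) for i in range(2 * n)]

def nbrs(i):
    out = []
    if i > 0:
        out.append(i - 1)
    if i < 2 * n - 1:
        out.append(i + 1)
    return out

def rate(i, j):
    # q_{i,j} = min(1, exp(-(E(j) - E(i))))
    return min(1.0, math.exp(-(E[j] - E[i])))

def fpt(gillespie):
    """One first passage time from the middle state n to state 0."""
    x, t = n, 0.0
    while x != 0:
        ys = nbrs(x)
        qs = [rate(x, y) for y in ys]
        phi2 = sum(qs)                   # flux out of x (Gillespie rate)
        phi1 = phi2 / len(ys)            # MC leaving probability
        # exponential waiting time; only the rate differs between methods
        t += random.expovariate(phi2 if gillespie else phi1)
        # both algorithms jump with the same embedded-chain probabilities
        x = random.choices(ys, weights=qs)[0]
    return t

R = 3000
mc = sum(fpt(False) for _ in range(R)) / R
gil = sum(fpt(True) for _ in range(R)) / R
print(mc, gil, mc / gil)   # the ratio is not simply the network degree
```

Since the endpoint states have degree 1 and interior states degree 2, the observed MFPT ratio falls strictly between 1 and 2 and in general matches neither the uniform nor the Boltzmann expected network degree, consistent with the statistical analysis above.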

4.4 Folding trajectories for the 10-mer GGGGGCCCCC

Theorem 1 implies that IF there is no termination condition for Algorithms 3 and 4 (obtained by replacing line 3 by while true, yielding Algorithms 3′ and 4′), so that every secondary structure of the 10-mer GGGGGCCCCC may be visited multiple times (including the MFE structure), THEN for a sufficiently large number of algorithm steps $K$, the total time for the trajectory of Algorithm 3′ is (asymptotically) equal to the total time for the trajectory of Algorithm 4′ times the expected network degree $\mu(\mathbf{a})$. We now show that this is false for mean first passage times; i.e. if Algorithms 3 and 4 terminate upon reaching the MFE structure, then there is no such relation between their trajectory times, which correspond to the first passage time from the empty initial structure to the MFE structure. Before proceeding, we need some notation.

Definition 3 (Expected convergence time)

Given an RNA sequence $\mathbf{a}$, let $\tau_1$ [resp. $\tau_2$] denote the random variable whose value is the sum of the time increments in line 11 of Algorithm 3 [resp. Algorithm 4] until the algorithm converges, where the initial state is the empty secondary structure of $\mathbf{a}$ and the final state is the MFE structure of $\mathbf{a}$. Let $E[\tau_1]$ [resp. $E[\tau_2]$] denote the expected convergence time of Algorithm 3 [resp. Algorithm 4], in other words, the mean first passage time to fold the RNA sequence $\mathbf{a}$ using Monte Carlo Algorithm 3 [resp. Gillespie Algorithm 4].

For the 10 nt RNA sequence GGGGGCCCCC, we ran Algorithms 3 and 4 to compute the trajectory time for 100,000 distinct, synchronized folding trajectories, each trajectory starting from the empty initial structure and terminating upon reaching the minimum free energy (MFE) structure (recall that secondary structures may be represented as sets of base pairs). Figure 10a [resp. Figure 10b] depicts the relative histogram for the 100,000 trajectory times of Algorithm 3 [resp. Algorithm 4] – note that the first passage times appear to be exponentially distributed, although the fit is not good. Since the folding experiments were synchronized, the number of steps taken by each of Algorithms 3 and 4 was identical for each trajectory, as were the secondary structures visited in each step, so that the only difference between Algorithms 3 and 4 consisted in the incremental times determined in line 11 of each algorithm, as well as in the total trajectory time in line 20 of each algorithm.

Expected degree with respect to the Boltzmann distribution

Using the software from [10], we computed the Boltzmann expected number of neighbors $\mu(\mathbf{a})$, as defined in equation (1), for the set of all 62 secondary structures of the 10-mer GGGGGCCCCC. Figure 10c depicts the relative histogram for the pairwise differences $\tau_1 - \mu(\mathbf{a}) \cdot \tau_2$. Let the null hypothesis $H_0$ assert that the trajectory time for Algorithm 3 to reach the MFE structure is equal to $\mu(\mathbf{a})$ multiplied by the trajectory time for Algorithm 4 to reach the MFE structure; i.e. $H_0$ is the assertion that $E[\tau_1] = \mu(\mathbf{a}) \cdot E[\tau_2]$.

The p-value for the 100,000 paired values, computed from the mean of the paired differences, their standard deviation, the resulting test statistic, and the number of degrees of freedom, leads to rejection of $H_0$.

To compute the confidence interval for the mean paired difference, we determine the margin of error, yielding a 95% confidence interval. Under the null hypothesis, the mean paired difference is asserted to be $0$, which does not belong to the confidence interval. Even repeating the computation at a more stringent confidence level, we still observe that $0$ does not belong to the corresponding confidence interval. Since Algorithm 3 first passage times are greater than the expected network degree times Algorithm 4 first passage times, it must be that a disproportionately large set of the secondary structures visited in folding trajectories of the 10-mer GGGGGCCCCC have larger degree (a larger number of neighboring structures that can be reached by the addition or removal of a single base pair) than the network average. This suggests that the (identical) folding trajectories of Algorithms 3 and 4 do not visit mainly low energy structures, although our detailed analysis of one 10 nt RNA sequence can hardly be representative.
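The hypothesis test used here is a standard paired t-test on the differences $\tau_1 - \mu(\mathbf{a}) \cdot \tau_2$. Since the actual 100,000 simulated times are not reproduced in this section, the Python sketch below runs the same computation on synthetic stand-in data (the constants $\mu$ and the noise model are hypothetical), showing how the test statistic and the 95% confidence interval for the mean paired difference are formed; $H_0$ is rejected when $0$ falls outside the interval.

```python
import math
import random
import statistics

random.seed(0)
mu = 3.5                       # hypothetical expected network degree
# Synthetic stand-ins for the paired first passage times (tau1_i, tau2_i);
# the actual values come from the 100,000 synchronized simulations.
tau2 = [random.expovariate(1.0) for _ in range(10000)]
tau1 = [t * (mu + random.gauss(0.5, 0.3)) for t in tau2]

d = [a - mu * b for a, b in zip(tau1, tau2)]   # paired differences
N = len(d)
mean_d = statistics.fmean(d)
sd_d = statistics.stdev(d)
t_stat = mean_d / (sd_d / math.sqrt(N))        # paired t statistic
half = 1.96 * sd_d / math.sqrt(N)              # 95% margin of error
ci = (mean_d - half, mean_d + half)
# H0: E[tau1] = mu * E[tau2], i.e. the mean paired difference is 0;
# H0 is rejected when 0 lies outside the confidence interval.
print(t_stat, ci)
```

With $N = 100{,}000$ as in the text, the normal critical value 1.96 is an excellent approximation to the t distribution's, so the margin of error reduces to $1.96 \, s_d / \sqrt{N}$.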

Note as well that the paired times $\tau_1$ and $\tau_2$ are poorly correlated. Since first passage times are not normally distributed, we compute the Spearman correlation; although Pearson correlation should not be used for non-normal distributions, we report it as well. Analysis of the same data after removal of outliers (e.g. computing the 10%-trimmed mean, etc.) even strengthens the conclusions just reached (data not shown). It follows that the analogue of Theorem 1 for trajectories that terminate upon reaching the MFE structure does not hold – moreover, the correlation is low for the paired trajectory times of synchronized Algorithms 3 and 4 to reach the MFE structure. Since tremendous computational resources would be required to perform a similar analysis for longer RNA sequences, we restrict our attention to the correlation between the synchronized and non-synchronized first passage times for

Algorithms 3 (continuous-time event-driven Monte Carlo) and 4 (Gillespie) – see Tables 1 and 2.

Expected degree with respect to the uniform distribution

Our software [10] also computes the uniform expected number of neighbors for the set of all 62 secondary structures of the 10-mer GGGGGCCCCC, defined by equation (1) where the free energy of every secondary structure is set to zero, or equivalently by setting the probability of each structure to $1/62$, where 62 is the total number of secondary structures of GGGGGCCCCC. We now re-analyze the 100,000 first passage times of Algorithms 3 and 4.

Recall that the null hypothesis $H_0$ asserts that the trajectory time for Algorithm 3 to reach the MFE structure is equal to the expected number of neighbors multiplied by the trajectory time for Algorithm 4 to reach the MFE structure, except that we now use the uniform expected number of neighbors; i.e. $H_0$ is the assertion that $E[\tau_1]$ equals the uniform expected number of neighbors multiplied by $E[\tau_2]$.

The p-value for the 100,000 paired values again leads to rejection of $H_0$. We determine the 95% confidence interval for the mean paired difference, as well as the 95% confidence interval for an independent run of Algorithms 3 and 4 generating another set of 100,000 first passage times; in neither case does the interval contain the value asserted under $H_0$. It follows that the uniform probability analogue of Theorem 1 for trajectories that terminate upon reaching the MFE structure does not hold.

Since Algorithm 3 first passage times are now less than the uniform expected number of neighbors multiplied by Algorithm 4 first passage times, it must be that a disproportionately large set of the secondary structures visited in folding trajectories of the 10-mer GGGGGCCCCC have lower degree (a smaller number of neighboring structures that can be reached by the addition or removal of a single base pair) than the network average. This suggests that the (identical) folding trajectories of Algorithms 3 and 4 do not visit mainly high energy structures (having few base pairs, hence potentially a larger number of neighbors), although our detailed analysis of one 10 nt RNA sequence can hardly be representative.

Finally, we note that the ratio of the mean first passage time of Algorithm 3 to that of Algorithm 4, taken over the 100,000 runs, appears somewhat closer to the uniform expected network degree than to the Boltzmann expected network degree.

Correlation analysis for 20-mers

In the case of RNA secondary structure folding kinetics, it appears that Gillespie MFPT times expected network degree is even less correlated with Monte Carlo MFPT than unaltered Gillespie MFPT. Table 1 shows the Pearson correlation between the averages, taken over 1000 runs, of the MFPT for each of 1000 20 nt RNAs, when computed by the Monte Carlo (MC) method (Algorithm 3), synchronized (syn) [resp. not synchronized (ns)] with Gillespie’s (G) method (Algorithm 4). Additionally, we performed two repetitions of the unsynchronized runs, designated experiments A and B in the table. For these data, the MFPT computed by Monte Carlo is highly correlated with that computed by Gillespie: G (syn) has a correlation of 0.85702 with MC (syn) and a correlation of 0.81030 with MC (ns). Surprisingly, G (syn) has a somewhat lower correlation of 0.85381 with MC* (syn), obtained by dividing the MC (syn) time for each sequence by its expected number of neighbors; similarly, G (syn) has a somewhat lower correlation of 0.80559 with MC* (ns), obtained by dividing the MC (ns) time for each sequence by its expected number of neighbors. In other words, despite the fact that the ratio of Monte Carlo trajectory time to Gillespie trajectory time equals the expected number of neighbors for sufficiently long trajectories, as proved in Theorem 1, there is no such relation between Monte Carlo MFPT and Gillespie MFPT, presumably because the mean first passage time is generally reached within $K$ steps, where $K$ is too small for the asymptotic effects of Theorem 1 to apply.
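The correlations reported here (Pearson on the paired MFPTs, and Spearman when normality fails) can be computed without external libraries. The following Python sketch implements both from scratch, ignoring rank ties (adequate for continuous first passage times), and applies them to synthetic stand-in MFPT pairs, since the benchmark data themselves are not reproduced in this section.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks.
    Ties are not averaged, which is fine for continuous data."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

random.seed(1)
# Hypothetical paired MFPTs for 1000 sequences (stand-ins for the real data);
# Monte Carlo times are modeled as Gillespie times with multiplicative noise.
g = [random.lognormvariate(0, 1) for _ in range(1000)]
mc = [x * random.lognormvariate(1, 0.5) for x in g]
print(pearson(mc, g), spearman(mc, g))
```

Because first passage times are heavy-tailed, the rank-based Spearman coefficient is the more robust of the two, which is why it is preferred in the 10-mer analysis above.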

5 Discussion

In this paper, we have compared Markov chains and related Markov processes by considering four closely related Algorithms 1, 2, 3, 4. If the underlying graph for the Markov chain is $k$-regular, so that each state has exactly $k$ neighbors, then it follows that on average, the mean first passage time (MFPT) computed by each of the Monte Carlo Algorithms 1, 2 and 3 equals $k$ multiplied by the MFPT computed by Gillespie Algorithm 4. Although the Markov chain of secondary structures of a given RNA sequence is generally not $k$-regular for any $k$, the total time along a Monte Carlo trajectory is asymptotically equal to the Boltzmann expected number of neighbors multiplied by the total time along a Gillespie trajectory, as proved in Theorem 1. Computational experiments on several RNAs confirm this result, provided that trajectories are sufficiently long to exhibit asymptotic properties of Markov chains. Our code mc.c for Algorithms 3 and 4 is written in C and makes calls to the function energy_of_structure() from libRNA.a of the Vienna RNA Package [33]. This program, along with C programs to compute the expected network degree for RNA secondary structures with move set $MS_1$ (base pair addition or deletion) [10] or with move set $MS_2$ (base pair addition, deletion or shift) [12], is publicly available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors.

Since Anfinsen’s pioneering experimental result on the folding of bovine ribonuclease [3], it is widely accepted that the native state of a biomolecule is a free energy minimum. In the literature on biomolecular folding, there is sometimes a tacit assumption that kinetics simulations using the Monte Carlo and Gillespie algorithms yield comparable results. Indeed, in [55] Tang et al. assert that “We demonstrate with two different RNA that the different analysis methods (ME, MC, MMC) produce comparable results and can be used interchangeably.” Here ME stands for the Master Equation, meaning the computation of the time-dependent population occupancy vector by solution of the master equation governed by the rate matrix; MC stands for Monte Carlo simulation; and MMC stands for Map-based Monte Carlo simulation, a method inspired by probabilistic roadmap methods from robotics, in which a constant number of closest neighboring secondary structures is selected for each RNA secondary structure from a sampled collection of modest size, after which Monte Carlo simulation is applied to the resulting network, which is much smaller than the network of all secondary structures. The correctness of the authors’ assertion is due uniquely to the fact that the roadmap sampling network is regular. The results of this paper suggest that one should not make tacit assumptions concerning folding kinetics simulations using the Monte Carlo and Gillespie algorithms when different nodes in the network have different numbers of neighbors, as is the case for RNA secondary structures.

If computing the mean first passage time required sufficiently long trajectories, then we would expect Monte Carlo MFPT to approximately equal Gillespie MFPT times the expected number of neighbors. However, for a slight modification of Example 1, described in Example 2, as well as for a benchmarking set of 1000 20 nt RNAs, each of which has at most 2,500 secondary structures, no such relation was found. Even worse, for the set of 1000 RNAs, the correlation between Monte Carlo MFPT and Gillespie MFPT times the expected number of neighbors was found to be lower than the correlation without this factor. This result suggests that the number of trajectory steps necessary to reach the minimum free energy structure may be too small for the asymptotic relation of Theorem 1 to appear. We conclude that RNA secondary structure folding may occur faster than the time scale required for Theorem 1, a type of Levinthal paradox [31] which suggests that folding pathways could be encoded in RNA sequences, or that RNA secondary structure formation could follow either a kinetic funnel model [8] or a kinetic hub model [7].

Next, we argue that macromolecular folding kinetics are better captured by Gillespie’s Al