# Efficient maximum likelihood parameterization of continuous-time Markov processes

###### Abstract

Continuous-time Markov processes over finite state-spaces are widely used to model dynamical processes in many fields of natural and social science. Here, we introduce an maximum likelihood estimator for constructing such models from data observed at a finite time interval. This estimator is dramatically more efficient than prior approaches, enables the calculation of deterministic confidence intervals in all model parameters, and can easily enforce important physical constraints on the models such as detailed balance. We demonstrate and discuss the advantages of these models over existing discrete-time Markov models for the analysis of molecular dynamics simulations.

## I Introduction

Estimating the parameters of a continuous-time Markov jump process model based on discrete-time observations of the state of a dynamical system is a problem which arises in many fields of science, including physics, biology, sociology, meteorology, and finance.Singer and Spilerman (1976); Madsen, Spliid, and Thyregod (1985); Turner, Startz, and Nelson (1989); Minin and Suchard (2008) Diverse applications include the progression of credit risk spreads,Jarrow, Lando, and Turnbull (1997) social mobility,Spilerman (1972) and the evolution of DNA sequences in a phylogenetic tree.Kimura (1980) In chemical physics, these models, also called master equations, describe first-order chemical kinetics, and are the principle workhorse for modeling chemical reactions.Anderson and Kurtz (2011)

For complex physical systems, the derivation of kinetic models from first principles is often intractable. In these circumstances, the parameterization of models from data is often a superior approach. As an example, consider the dynamical behavior of solvated biomolecules, such as proteins and nucleic acids. Despite the microscopic complexity of their equations of motion, relatively simple multi-state kinetics often arise, as exemplified by the ubiquity of two- and few-state Markov process models for protein folding. Ikai and Tanford (1971); Zwanzig (1997); Sánchez and Kiefhaber (2003); Chan and Dill (1998); Pirchi et al. (2011); Beauchamp et al. (2012)

Due in part to the unavailability of computationally efficient and numerically robust estimators for continuous-time Markov models, in the field computational biophysics, discrete-time Markov models have been widely used to fit and interpret the output of molecular dynamics (MD) simulations. Also called Markov state models (MSMs), these methods describe the molecular kinetics observed in an MD simulation as a jump process with a discrete-time interval generally on the order of 10 – 100 ns.Beauchamp et al. (2011); Senne et al. (2012) These models provide convenient estimators for key quantities of interest for molecular systems, such as the free energies of various metastable conformational states, the timescales of their interconversion, and the dominant transition pathways.Banerjee and Cukier (2014); Shukla et al. (2015); Noé and Fischer (2008); Chodera and Noé (2014)

In this work, we introduce an efficient maximum likelihood estimator for continuous-time Markov models on a finite state space from discrete-time data. The source of data used here is identical to that employed in fitting discrete-time Markov chain models – namely, the number of observed transitions between each pair of states within a specified time interval. We demonstrate the properties of these models on simple systems, and apply them to the analysis of the folding of the FiP35 WW protein domain.

## Ii Background

Consider a time-homogenous continuous-time Markov process over a finite state space, . The process is determined completely by an matrix , variously called its rate matrix, infinitesimal generator,Metzner et al. (2007) substitution matrix,Holmes and Rubin (2002) or intensity matrix.Bladt and Sørensen (2005)

For an interval , begin with the matrix, , of probabilities that the process jumps from one state, , to another state, ,

(1) |

which, by time-homogeneity is assumed to be independent of . The process’s rate matrix, , is defined as

(2) |

Given and any time interval, , the transition probability matrix, , can be expressed a matrix exponential

(3) |

A particular rate matrix corresponds to a valid continuous-time Markov process if and only if its off-diagonal elements are nonnegative and its row sums equal zero. These constraints are necessary to ensure that the probabilities propagated by the dynamics remain positive and sum to one. We denote by this set of admissible rate matrices,

(4) | ||||

Furthermore, we denote by the set of all embeddable transition probability matrices, that is, those which could originate as the transition probability matrix, , induced by some continuous-time Markov process,

(5) |

It is well-known that set is a strict subset of the set of all stochastic matrices; not all stochastic matrices are embeddable.Kingman (1962); Davies (2010) A complete description of the topological structure of as well as the necessary and sufficient conditions for a stochastic matrix to be embeddable are open problems in the theory of Markov processes.

Although Eq. 2 serves as the definition of the rate matrix of a continuous-time Markov process, it is generally not directly suitable as a method for parameterizing Markov models, particularly for applications in chemical kinetics. The attempt to numerically approximate the limit in Eq. 2 from empirically measured transition probabilities would be valid if the generating process were exactly Markovian. However, in chemical kinetics, a Markov process model – the chemical master equation – is an approximation valid only for timescales longer than the molecular relaxation time.Adams and Doll (1981); Voter and Doll (1985) A suitable Markov model which is predictive over long timescales must capture both the instantaneous kinetics as well as, to use the vocabulary of Mori-Zwanzig formalism, the effective contribution of the integrated memory kernel.Mori (1965); Zwanzig (1961)

Our goal is to address this parameterization problem. The primary contribution of this work is an efficient algorithm for estimating from observed discrete-time observations. We adopt a direct maximum likelihood approach, with work per iteration. Many constraints on the solution, such as detailed balance or specific sparsity patterns on can be introduced in a straightforward manner without additional cost.

Prior work on this subject is numerous. Crommelin and Vanden-Eijnden proposed a method for estimating in which a discrete-time transition probability matrix is first fit to the observed data, followed by the determination of the rate matrix, such that is nearest to the target empirical transition probability matrix.Crommelin and Vanden-Eijnden (2006, 2009) The nature of this calculation depends on the norm used to define the concept of “nearest”: under a Frobenius norm, this problem has a closed form solution, while the norm of Crommelin and Vanden-Eijnden leads to a quadratic program. A similar approach was advocated by Israel et al.Israel, Rosenthal, and Wei (2001)

Kalbfleisch and Lawless proposed a maximum likelihood estimator for .Kalbfleisch and Lawless (1985) Without constraints on the rate matrix, their proposed optimization involves the construction and inversion of an Hessian matrix at each iteration of the optimization, rendering it prohibitively costly ( scaling per iteration) for moderate to large state spaces.

A series of expectation maximization (EM) algorithms are described by Asmussen, Nerman and Olsson, Holmes and Rubin, Bladt and Sørensen, and Hobolth and Jensen Asmussen, Nerman, and Olsson (1996); Holmes and Rubin (2002); Bladt and Sørensen (2005); Hobolth and Jensen (2005) These algorithms treat the state of the system between observation intervals as an unobserved latent variable, which when interpolated via EM leads to more efficient estimators. A review of these algorithms is presented by Metzner et al.Metzner et al. (2007) At best, each iteration of the proposed methods scales as .

## Iii Maximum likelihood estimation

### iii.1 Log-likelihood and gradient

We take our source of data to be one or more observed discrete-time trajectories from a Markov process, , in a finite state space, observed at a regular time interval.

The likelihood of the data given the model and the initial state is given in terms of the transition probability matrix as the product of the transition probabilities assigned to each of the observed jumps in the trajectory.

(6) |

When more than one independent trajectory is observed, the data likelihood is a product over trajectory with individual terms given by Eq. 6.

Because many transitions are potentially observed multiple times, Eq. 6 generally contains many repeated terms. Define the observed transition count matrix ,

(7) |

Collecting repeated terms, the likelihood can be rewritten more compactly as

(8) |

Suppose that the rate matrix, is parameterized by a vector, of independent variables, . In the most general case, every element of the rate matrix may individually be taken as an independent variable, with . As discussed in Section III.2, other parameterizations may be used to enforce certain properties on . The logarithm of data likelihood is

(9) | ||||

(10) | ||||

(11) |

where is the element-wise natural logarithm, matrix exponential, and is the Hadamard (element-wise) matrix product. Note that the element-wise logarithm and matrix exponential are not inverses of one another.

The most straightforward parameter estimator – the maximum likelihood estimator (MLE) – selects parameters which maximize the likelihood of the data,

(12) |

To maximize Eq. 12, we focus our attention on quasi-Newton optimizers that utilize the first derivatives of with respect to . This requires an efficient algorithm for computing . We achieve this by starting from the eigendecomposition of ,

(13) |

where the columns of and contain the left and right eigenvectors of respectively, jointly normalized such that , and are the corresponding eigenvalues. Assuming that has no repeated eigenvalues, the directional derivatives of induced transition probability matrix, are given by Kalbfleisch and Lawless (1985); Jennrich and Bright (1976)

(14) |

where is an matrix with entries

(15) |

The elements of the gradient of the log-likelihood can then be constructed as

(16) |

where .

A direct implementation of Eq. 16 requires at least 4 matrix multiplies for each element of , indexed by . If the parameter vector, , contains parameters, then computing the full gradient will require floating point operations (FLOPs). However, two properties of the Hadamard product and matrix trace can be exploited to dramatically reduce the computational complexity of constructing the gradient vector to FLOPs.

(17) | ||||

(18) |

Using these identities, the gradient of the log-likelihood can be rewritten as

(19) |

Note that because is independent of , it can be constructed once at the beginning of a gradient calculation at a cost of FLOPs, and reused for each index, . The remainder of the work involves constructing the derivative matrix , which is generally quite sparse, and a single inexpensive sum of a Hadamard product. Overall, this rearrangement reduces the complexity of constructing the full gradient vector from to FLOPs.

### iii.2 Reversible parameterization

In the application of these models to domain-specific problems, additional constraints on the Markov process may be known, and enforcing these constraints during parameterization can enhance the interpretability of solutions as well as provide a form of regularization.

For many molecular system, it is known that the underlying dynamics are reversible, and this property can be enforced in Markov models as well. A Markov process is reversible when the rate matrix, , satisfies the detailed balance condition with respect to a stationary distribution, , towards which the process relaxes over time.

(20) | ||||

(21) |

This constraint can be enforced on solutions through the design of the parameterization function, . If is reversible, Eq. 21 implies that a real symmetric matrix, , can be formed, which we refer to as the symmetric rate matrix, such that

(22) |

Because of this symmetry and the constraint on the row sums of , only the upper triangular (exclusive of the main diagonal) elements of , and the stationary vector, , need be directly encoded by the parameter vector, , to fully specify . Furthermore, since the elements of are constrained to be positive, working with the element-wise logarithm of can enhance numerical stability. For the elements of , which are only constrained to be nonnegative, the same logarithm transformation is inapplicable, as it is incompatible with sparse solutions that set one or more rate constants equal to zero. For these reasons, we use a parameter vector, of length , with . The first elements, notated , encode the off-diagonal elements of . The remaining elements are notated , and are used to construct the stationary distribution, . From and , the off-diagonal and diagonal elements of are then constructed from Eq. 22. In explicit notation, the construction is

(23) | ||||

(24) | ||||

(25) |

where is the row-major vectorization of the elements of a symmetric matrix above the main diagonal,

(26) |

The necessary gradients of Eq. 25, are sparse. For fixed , the matrix over all , contains only four nonzero entries, whereas for , the same matrix contains nonzero entries. The sum of its Hadamard product with in Eq. 19 can thus be computed in or time. For the remainder of this work, we focus exclusively on this reversible parameterization for .

### iii.3 Optimization

Equipped with the log-likelihood and an efficient algorithm for the gradient, we now consider the construction of maximum likelihood estimates, Eq. 12. Among the first-order quasi-Newton methods tested, we find Limited-memory Broyden-Fletcher-Goldfarb-Shanno optimizer with bound constraints (L-BFGS-B) to be the most successful and robust.Byrd et al. (1995); Zhu et al. (1997)

To begin the optimization, we choose the initial guess for according to the following procedure. First, we fit the maximum likelihood reversible transition probability matrix computed using Algorithm 1 of Prinz et al. (2011). Next, we compute its principle matrix logarithm, , using an inverse scaling and squaring algorithm, and scaling by .Al-Mohy and Higham (2012) Generally, the MLE reversible transition matrix is not embeddable, and thus the principle logarithm is complex or has negative off-diagonal entries, and does not correspond to any valid continuous-time Markov process. We take the initial guess from directly from the stationary eigenvector of the MLE transition matrix, and from the nearest (by Frobenius norm) valid rate matrix to , given by . Davies (2010)

The optimization problem is non-convex in the general case and may have multiple local minima. Varying the optimizer’s initialization procedure can thus mitigate the risk of convergence to a low quality local minimum. One alternative initialization is the pseudo-generator, , which arises from a first-order Taylor approximation to the matrix exponential. After the optimization has terminated, a useful check is to compare the maximum likelihood transition matrix estimated during initialization with the exponential of the recovered rate matrix, . Large differences between the two matricies, or their eigenspectra / relaxation timescales, may be symptomatic of non-embedability of the data or a convergence failure of the optimizer. If the data are available at a lag time shorter than , convergence failures can often also be circumvented by using a converged rate matrix obtained from a model at a shorter lag time as an initial guess for a model at a longer lag time.

### iii.4 Implementation notes

Because is symmetric, it can be diagonalized efficiently at cost of FLOPs. The eigenvectors can then be rotated by to give the eigenvectors of . Compared to diagonalizing the non-symmetric matrix directly, this can yield a speedup of 2-10 in the critical diagonalization step required to compute the gradient vector.

For each pair of states with an observed transition count, , the gradient expressions Eq. 16, and Eq. 19 are only defined when . A sufficient condition to ensure this property is that be irreducible,Bakry, Gentil, and Ledoux (2014) but this cannot be straightforwardly ensured throughout every iteration of the L-BFGS-B optimization without heavy-handed measures such as complete positivity of . In practice, we find that replacing any zeros values in with a small constant, such as , when computing the matrix in Eq. 19 is sufficient to avoid this instability.

Furthermore, note that calculation of by direct implementation of Eq. 15 can suffer from a substantial loss of accuracy for close-lying eigenvalues. The matrix can instead be computed in a more precise manner using the or routines, which are designed to be accurate for small and are available in numerical libraries such as SLATEC, GSL, and the upcoming release of SciPy.Vandevender and Haskell (1982); Gough (2009); Jones et al. (01)

## Iv Quantifying Uncertainty

Since all data sets are finite, statistical uncertainty in any estimate of a probabilistic model is unavoidable. Therefore, key quantities of interest beyond the maximum likelihood rate matrix itself, , are estimates of the sampling uncertainty in , and estimates of the sampling uncertainty in quantities derived from , such as its stationary eigenvector, , its eigenvalues, , and relaxation timescales.

In the large sample size limit, the central limit theorem guarantees that the distribution of converges to a multivariate normal distribution with a covariance matrix which can be estimated by the inverse of the Hessian of the log-likelihood function evaluated at , assuming that the MLE does not lie on a constraint boundary.Rao (1973) This can be thought of as a second order Taylor expansion for the log-likelihood surface at the MLE; the log-likelihood is approximated as a paraboloid with negative curvature whose peak is at the MLE and whose width is determined by the Hessian matrix at the peak. The exponential of the log-likelihood, the likelihood surface, is then Gaussian, and the multivariate delta theorem can be used to derive expressions for the asymptotic variance in scalar functions of .Rao (1973) Computationally, the critical component is the computation of the Hessian matrix,

(27) | |||

(28) |

and its inverse.

### iv.1 Approximate Analytic Hessian

Direct calculation of the Hessian requires both the evaluation of the first derivatives of as well as the more costly second derivatives. A more efficient alternative, as pointed out by Kalbfleisch and Lawless, is to approximate the second derivatives by estimates of their expectations.Kalbfleisch and Lawless (1985)

Let . Taking the expected value of conditional on , we approximate . This makes it possible to factor out of the summation over in Eq. 28, and exploit the property that , simplifying Eq. 28 to

(29) |

Equipped with the approximator Eq. 29, the asymptotic variance-covariance matrix of is calculated as the matrix inverse of the Hessian, , and the asymptotic variance in each derived quantity is estimated using the multivariate delta method.Rao (1973)

(30) |

For example, the asymptotic variance in the stationary distribution can be calculated as

(31) |

where represent the lower block of the asymptotic variance covariance matrix and

(32) |

Other key quantities of interest for biophysical applications include the exponential relaxation timescales of the Markov model

(33) |

The asymptotic variance in the relaxation timescales, , is

(34) |

where follows from standard expressions for derivatives of eigensystems,Murthy and Haftka (1988)

(35) |

The sampling uncertainty in other derived properties which depend continuously on can be calculated similarly.

When the MLE solution lies at the boundary of the feasible region, with one or more elements of equal to zero, we adopt an active set approach to approximate . We refer to the elements of which do not lie on a constraint boundary as free parameters. The Hessian block for the free parameters is constructed and inverted, and the variance and covariance of the constrained elements as well as their covariance with the free parameters is taken to be zero.

## V Numerical Experiments

We performed numerical experiments on three datasets, which demonstrate different aspects of our estimator for continuous-time Markov processes. Where appropriate, we compare these models to reversible discrete-time Markov models which directly estimate , parameterized via Algorithm 1 of Prinz et al. (2011)

### v.1 Recovering a Known Rate Matrix

First, we constructed a simple synthetic eight state Markov process with known rates. The network is shown in Fig. 1. The largest non-zero eigenvalue of is , which corresponds to a slowest exponential relaxation timescale, (arbitrary time units).

From this model, we simulated discrete-time data with a collection interval of 1 time unit by calculating the matrix exponential of and propagating the discrete-time Markov chain. In Fig. 2, we show the convergence of the models estimated from this simulation data to the true model, as the length of the simulated trajectories grows. As expected, the fit parameters get more accurate as the size of the data set grows. We observe approximately power law convergence as measured by the 2-norm and Frobenius norm over the range of trajectory lengths studied.

The true rate matrix for this continuous time Markov process is sparse – only 7 of the 28 possible pairs of distinct states are directly connected in Fig. 2. Can this graph structure be recovered by our estimator? This task is challenging because of the nature of the discrete-time data. The observation that the system transitioned from state (at time ) to state (at time ) does not imply that is non-zero. Instead, the observed transition may have been mediated by one or more other states – the process may have jumped from to , and then again from to , all within the observation interval.

When the rate matrix, is irreducible, the corresponding transition probability matrix is strictly positive for every positive lag time, .Bakry, Gentil, and Ledoux (2014) This implies that in the limit that the trajectory length, , approaches infinity, at least 1 transition count will almost surely be observed between any pair of states, regardless of the sparsity of .

In Fig. 3, we attempt to resolve the underlying graph structure using the model estimated with a trajectory of length . The plot compares the estimated rate matrix elements with the true values. We find that all of the true connections are well-estimated, and that many of the zero rates are also correctly identified. However, the maximum likelihood estimator also identifies very low, but non-zero rates between many of the states which are in fact disconnected.

We computed 95% () confidence intervals for each of the estimated rate matrix elements, . For each of the spuriously non-zero elements, these confidence intervals overlapped with zero. None of the confidence intervals for the properly non-zero rates overlapped with zero. These uncertainty estimates can therefore be used, in combination with the MLE, to identify the underlying graph structure.

This example demonstrates that some degree of sparsity-inducing regularization or variable selection may be required to robustly identify the underlying graph structure in Markov process.

### v.2 Accuracy of Uncertainty Estimates

How accurate are the approximate asymptotic uncertainty expressions derived in Section IV? To answer this question, we performed a numerical experiment with twenty independent and identically distributed collections of trajectories of Brownian dynamics on a two-dimensional potential. One of those trajectories is shown superimposed on the potential in Fig. 4, along with the grid used to discretize the process. The Brownian dynamics simulations were performed following the same procedure described in McGibbon, Schwantes and Pande.McGibbon, Schwantes, and Pande (2014)

To assess the accuracy of the asymptotic approximations, we compare the empirical distribution of the estimated parameters over the separate data sets with the theoretical distribution which would be expected based on the Gaussian approximation. Consider a scalar model parameter , such as one of the relaxation timescales or equilibrium populations. Fitting a model separately on each of the twenty data sets yields estimates, . If these estimates are accurate, then is normally distributed, . Our goal is to examine the accuracy of the estimated variances, . Note that the true value of is unknown, but subtracts out when examining standardized differences between the estimates, which, assuming normality, should follow a standard normal distribution,

(36) |

In Fig. 5, we compare the empirical and theoretical distributions of , , for estimates of the first two relaxation timescales using a quantile-quantile (Q-Q) plot, a powerful method of comparing distributions. The observation that Q-Q plot runs close to the line is encouraging, and shows that the observed deviates are close to normally distributed, and that the approximator’s variance estimates are of the appropriate magnitude. This suggests that the asymptotic error expressions can be of practical utility for practitioners.

### v.3 Comparison with discrete-time MSMs

In a data-limited regime, are continuous-time Markov models more capable than discrete-time MSMs? We extended the analysis in Section V.1 to a larger class of generating processes in order to address this question. We began by sampling random 100-state Markov process rate matrices from scale free random graphs.Barabási and Albert (1999). Details of the random rate matrix generation are described in Appendix A.

From each random rate matrix, , we sampled three discrete-time trajectories of different lengths. Each trajectory was used individually to fit both a continuous-time and discrete-time Markov model, and the parameterized models were then compared to the underlying system from which the trajectories were simulated to assess the convergence properties of the approaches.

In Fig. 6, we consider two notions of error. The first norm measures error in the elements of the estimated transition matrix, . Unlike the experiment in Fig. 2, we used the as the basis of the measure so that the continuous-time and discrete-time models could be compared on an equal footing. The second error norm we consider is the max-norm error in the estimated relaxation timescales, , which measures a critical spectral property of the models. In both panels of Fig. 6, the distribution of the difference in error between the continuous-time and discrete-time models is plotted; values below zero indicate that the continuous-time model performed better for a particular class of trajectories, whereas values above zero indicate the reverse. For each condition, we performed replicates.

Our results show that as measured by the transition matrix error, the continuous-time Markov process model is more accurate in the regimes considered. A binomial sign test rejects the hypothesis that the two estimators give the same error for all three conditions (two-sided values of for trajectories of length steps, respectively). The relative advantage of the continuous-time Markov model decreases as the trajectory length increases – its advantage is in the sparse data regime when no transition counts have been observed between a significant number of pairs of states.

In contrast, as measured by the relaxation timescale estimation error, we observe no significant difference between the continuous-time and discrete-time estimators. A binomial sign test does not definitively reject the hypothesis that the two estimators give the same error for any of the three conditions (two-sided values of for trajectories of length steps, respectively). Neither estimator is consistently more accurate in recovery of the dominant spectral properties of the dynamics.

### v.4 Application to Protein Folding and Lag Time Selection

How can these models be applied to the analysis of molecular dynamics (MD) simulations of protein folding? We obtained two independent ultra-long 100 MD simulations of the FiP35 WW protein,Liu et al. (2008) a small 35 residue -sheet protein (Fig. 7), performed by D.E. Shaw Research on the ANTON supercomputer.Shaw et al. (2010)

In order to focus on the construction of discrete-state Markov models, we initially projected every snapshot of the MD trajectories, which were available at a 200 time interval, into a discrete state space with 100 states in a way consistent with prior work.McGibbon, Schwantes, and Pande (2014) Briefly, this involved the extraction of the distance between the closest non-hydrogen atoms in each pair of amino acids in each simulation snapshot,McGibbon et al. (2014) followed by the application of time-structure independent components analysis (tICA) to extract the four most slowly decorrelating degrees of freedom,Schwantes and Pande (2013); Pérez-Hernández et al. (2013) which were then clustered into 100 states using the -means algorithm.Lloyd (1982); Arthur and Vassilvitskii (2007)

Although the equations of motion for a protein’s dynamics in an MD simulation are Markovian, the generating process of the data analyzed by our model is not. The pre-processing procedure which projects the original dynamics from a high-dimensional continuous state space (the position and momenta of the constituent atoms) into a lower dimensional continuous space or discrete state space is not information preserving, and destroys the Markov property.Mori (1965); Zwanzig (1961) For chemical dynamics, qualitative features of the non-Markovianity are well-understood. Consider, for example, a metastable system with two states, and , the system in state may stochastically oscillate across the boundary surface many times without committing to state . Whereas for a Markov process, the probability distribution of the waiting time that the system spends in any states before exiting is exponential, chemical dynamics are expected to show a higher propensity for short waiting times, corresponding to so-called recrossing events.Voter and Doll (1985) This effect is more pronounced when viewing the process at short lag times – the bias induced by approximating the process as Markov decreases with lag time.Sarich, Noé, and Schütte (2010)

For the FiP35 WW domain, we observe that the change in the relaxation timescales of the continuous-time and discrete-time Markov models with respect to lag time are essentially identical, as shown in Fig. 8. For both model classes, the estimated relaxation timescales increase and converge with respect to lag time. This is consistent with our results in Fig. 6 (b), which suggest that the estimated timescales are the same for both models, especially as the length of the trajectories grow. While fitting the models in Fig. 8, we observed a small number (2-4) of convergence failures at long lag times, which were notable due to a dramatic discontinuinity in the relaxation timescales curve. This problem was solved by reinitializing the optimization at these lag times from the converged solutions at adjacent lag times.

Because of the essentially unchanged nature of the relaxation timescale spectrum, we suggest that when choosing a particular lag time, the same approach be used for discrete-time and continuous-time Markov models. Ideally, this entails the selection of a lag time large enough that the relaxation timescales are independent of lag time.Swope, Pitera, and Suits (2004); Bowman, Huang, and Pande (2009) For the continuous-time Markov model, other techniques may be appropriate as well. For example, in Fig. 10 we show the convergence of selected diagonal entries of the rate matrix as a function of lag time. As described in the context of transition state theory, these rate constants should plateau with increasing , which provides another related basis on which select the parameter.Chandler (1978, 1998)

The most significant difference between the continuous-time and discrete-time estimators in this case is the sparsity of the parameterized models. In Fig. 8(c), we compare the number of non-zero independent parameters for both models as a function of . Of the independent parameters for both the continuous-time and discrete-time models, only are nonzero for the continuous-time model, regardless of lag time. In contrast, the number of nonzero parameters for the discrete-time model model continues to increase with lag time.

We anticipate that the sparsity of may aid in the analysis and interpretation of Markov models. In Fig. 9, we show the MLE rate matrix computed at . The state indices were sorted such that states grouped together via Peron cluster cluster analysis (PCCA) were given adjacent indices.Deuflhard et al. (2000) The evident block structure of the matrix visually indicates that the protein’s conformational space consists of a small number of regions with generally high within-region rate constants, but weak between-region coupling. Although a detailed analysis of the biophysics of these conformations is beyond the scope of this work, visual analysis of these structures indicate that the model resolves folded and unfolded, as well as partially folded intermediate states.

In interpreting the error bars on the relaxation timescales in Fig. 8(a), cautionary note is warranted. Our error analysis considers the number of observed transitions between states but does not take into account any notion of uncertainty in the proper definitions of the states themselves, or the error inherent in approximating a non-Markovian process with a Markov process. The observation in Fig. 8(a) that the magnitude of the systematic shift in the timescales with respect to lag time is much larger than the error bars suggests that the Markov approximation (a model misspecification) is a larger source of error, for this dataset, than the statistical uncertainty in model parameters. For these reasons, we caution that these error bars should be interpreted as lower bounds rather than upper bounds.

### v.5 Performance

In order to assess the performance of our maximum likelihood estimator, we compared it with an algorithm by Holmes and Rubin, which solves the same Markov process parameterization problem using an expectation-maximization approach.Holmes and Rubin (2002) Because the original code was unavailable, we reimplemented the algorithm following the description by Metzner et al., where it is denoted “Algorithm 4: Enhanced MLE-method for the reversible case”.Metzner et al. (2007) The algorithm scales as , where is the number of states. Its rate limiting step involves an FLOP contraction of five matrices into a rank-4 tensor of dimension on each axis. ^{1}^{1}1Both our algorithm and Holmes-Rubin estimator were implemented the Cython language and compiled to C++. Our implementation of the Holmes-Rubin estimator is available at https://github.com/rmcgibbo/holmes_rubin For benchmarking, we constructed a variant of the the FiP35 WW protein dataset from Section V.4, in which we varied the number of states between 10 and 100 during clustering. All models were fit on an Intel Xeon E5-2650 using a single CPU core.

As shown in Fig. 11, and expected on the basis of the vs. scaling, the performance difference between the algorithms is substantial. For , our algorithm is roughly four orders of magnitude faster per iteration; our algorithm takes on the order of 1 ms per iteration, while the Holmes-Rubin estimator’s iteration takes over 10 seconds. Using the L-BFGS-B optimizer’s default convergence criteria, roughly three quarters (68/91) of the runs of our algorithm converge in fewer than 100 iterations; a solution is often achieved long before the EM estimator has performed a single iteration.

## Vi Conclusions

In this work, we have introduced a maximum likelihood estimator for continuous-time Markov processes on discrete state spaces. This model can be used to estimate transition rates between various substates in a dynamical system based on observations of the system at a discrete time interval. Various constraints on the solution, such as detailed balance, can be easily incorporated into the model, and asymptotic error analysis can give confidence intervals in model parameters and derived quantities.

With the efficient parameterization problem solved, these continuous-time Markov models offer several advantages over existing MSM methodologies. As compared to discrete-time MSMs, these models are more interpretable for chemists and biologists because they do not arbitrarily discretize time. Although a lag time is used internally during parameterization, the final estimated quantities are familiar rate constants from chemical kinetics, as opposed to the somewhat unintuitive transition probabilities in a discrete-time MSM. Furthermore, these models are more parsimonious, and unlike the discrete-time MSM are able to detect that many pairs of states are not immediately kinetically adjacent to one another. This makes it possible to more clearly recover the underlying graph structure of the kinetics. For applications such as the determination of transition pathways in protein dynamics, we anticipate that this property will be valuable.

Many extensions of this model are possible in future work. The simple nature of the constraints on make Bayesian approaches, especially Hamiltonian Monte Carlo, particularly attractive.Neal (2011) In particular, because of the separation of and in the parameterization, strong informative priors on may be added to extend the work of Trendelkamp-Schroer and Noé.Trendelkamp-Schroer and Noé (2013) The appropriate sparsity inducing priors on may be a topic of future work.

An implementation of this estimator is available in the MSMBuilder software package at http://msmbuilder.org/ under the GNU Lesser General Public License.

## Acknowledgements

This work was supported by the National Science Foundation and National Institutes of Health under Grant Nos. NIH R01-GM62868, NIH S10 SIG 1S10RR02664701, and NSF MCB-0954714. We thank Mohammad M. Sultan, Matthew P. Harrigan, John D. Chodera, Kyle A. Beauchamp, Ariana Peck, and the reviewers for helpful feedback during the preparation of this work.

## Appendix A Random rate matrices

Scale-free random graphs with 100 states were generated using the Barabási–Albert preferential attachment model with .Barabási and Albert (1999) From the graph’s adjacency matrix, we generated a symmetric rate matrix by sampling a log-normally distributed random variable (, ) for each connected edge. The stationary distribution, , was sampled from . The matrix was then scaled by , which tuned the relaxation timescales in the range between and time steps, and used with in Eq. 25 to construct .

## References

## References

- Singer and Spilerman (1976) B. Singer and S. Spilerman, Am. J. Sociol. 82, 1 (1976).
- Madsen, Spliid, and Thyregod (1985) H. Madsen, H. Spliid, and P. Thyregod, J. Appl. Meteorol. 24, 629 (1985).
- Turner, Startz, and Nelson (1989) C. M. Turner, R. Startz, and C. R. Nelson, J. Financ. Econ. 25, 3 (1989).
- Minin and Suchard (2008) V. N. Minin and M. A. Suchard, J. Math. Biol. 56, 391 (2008).
- Jarrow, Lando, and Turnbull (1997) R. A. Jarrow, D. Lando, and S. M. Turnbull, Rev. Financ. Stud. 10, 481 (1997).
- Spilerman (1972) S. Spilerman, American Journal of Sociology , 599 (1972).
- Kimura (1980) M. Kimura, J. Mol Evol. 16, 111 (1980).
- Anderson and Kurtz (2011) D. F. Anderson and T. G. Kurtz, in Design and analysis of biomolecular circuits (Springer, 2011) pp. 3–42.
- Ikai and Tanford (1971) A. Ikai and C. Tanford, Nature 230, 100 (1971).
- Zwanzig (1997) R. Zwanzig, Proc. Natl. Acad. Sci. U.S.A. 94, 148 (1997).
- Sánchez and Kiefhaber (2003) I. E. Sánchez and T. Kiefhaber, J. Mol. Biol. 325, 367 (2003).
- Chan and Dill (1998) H. S. Chan and K. A. Dill, Proteins: Struct., Funct., Bioinf. 30, 2 (1998).
- Pirchi et al. (2011) M. Pirchi, G. Ziv, I. Riven, S. S. Cohen, N. Zohar, Y. Barak, and G. Haran, Nat. Commun. 2, 493 (2011).
- Beauchamp et al. (2012) K. A. Beauchamp, R. T. McGibbon, Y.-S. Lin, and V. S. Pande, Proc. Natl. Acad. Sci. U.S.A. 109, 17807 (2012).
- Beauchamp et al. (2011) K. A. Beauchamp, G. R. Bowman, T. J. Lane, L. Maibaum, I. S. Haque, and V. S. Pande, J. Chem. Theory Comput. 7, 3412 (2011).
- Senne et al. (2012) M. Senne, B. Trendelkamp-Schroer, A. S. Mey, C. Schütte, and F. Noé, J. Chem. Theory Comput. 8, 2223 (2012).
- Banerjee and Cukier (2014) R. Banerjee and R. I. Cukier, J. Phys. Chem. B 118, 2883 (2014).
- Shukla et al. (2015) D. Shukla, C. X. Hernández, J. K. Weber, and V. S. Pande, Acc. Chem. Res. 48, 414 (2015).
- Noé and Fischer (2008) F. Noé and S. Fischer, Curr. Opin. Struct. Biol. 18, 154 (2008).
- Chodera and Noé (2014) J. D. Chodera and F. F. Noé, Curr. Opin. Stuct. Biol 25, 135 (2014).
- Metzner et al. (2007) P. Metzner, E. Dittmer, T. Jahnke, and C. Schütte, J. Comput. Phys. 227, 353 (2007).
- Holmes and Rubin (2002) I. Holmes and G. Rubin, J. Mol. Biol. 317, 753 (2002).
- Bladt and Sørensen (2005) M. Bladt and M. Sørensen, J. R. Statistic. Soc. B 67, 395 (2005).
- Kingman (1962) J. F. C. Kingman, Probab. Theory Related Fields 1, 14 (1962).
- Davies (2010) E. Davies, Electron. J. Probab. 15, 1474 (2010).
- Adams and Doll (1981) J. Adams and J. Doll, Surface Science 103, 472 (1981).
- Voter and Doll (1985) A. F. Voter and J. D. Doll, J. Chem. Phys. 82, 80 (1985).
- Mori (1965) H. Mori, Prog. Theor. Phys. 33, 423 (1965).
- Zwanzig (1961) R. Zwanzig, Phys. Rev. 124, 983 (1961).
- Crommelin and Vanden-Eijnden (2006) D. Crommelin and E. Vanden-Eijnden, J. Comput. Phys. 217, 782 (2006).
- Crommelin and Vanden-Eijnden (2009) D. Crommelin and E. Vanden-Eijnden, Multiscale Modeling & Simulation 7, 1751 (2009).
- Israel, Rosenthal, and Wei (2001) R. B. Israel, J. S. Rosenthal, and J. Z. Wei, Math. Financ. 11, 245 (2001).
- Kalbfleisch and Lawless (1985) J. D. Kalbfleisch and J. F. Lawless, J. Am. Stat. Assoc. 80, 863 (1985).
- Asmussen, Nerman, and Olsson (1996) S. Asmussen, O. Nerman, and M. Olsson, Scand. J. Stat. 23, 419 (1996).
- Hobolth and Jensen (2005) A. Hobolth and J. L. Jensen, Stat. Appl. Genet. Mol. Biol. 4, 1 (2005).
- Jennrich and Bright (1976) R. I. Jennrich and P. B. Bright, Technometrics 18, 385 (1976).
- Byrd et al. (1995) R. Byrd, P. Lu, J. Nocedal, and C. Zhu, SIAM J. Sci. Comput. 16, 1190 (1995).
- Zhu et al. (1997) C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, ACM Trans. Math. Softw. 23, 550 (1997).
- Prinz et al. (2011) J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé, J. Chem. Phys. 134, 174105 (2011).
- Al-Mohy and Higham (2012) A. Al-Mohy and N. Higham, SIAM J. Sci. Comput. 34, C153 (2012).
- Bakry, Gentil, and Ledoux (2014) D. Bakry, I. Gentil, and M. Ledoux, Analysis and Geometry of Markov Diffusion Operators, Grundlehren der mathematischen Wissenschaften, Vol. 348 (Springer International Publishing, 2014).
- Vandevender and Haskell (1982) W. H. Vandevender and K. H. Haskell, SIGNUM Newsl. 17, 16 (1982).
- Gough (2009) B. Gough, GNU scientific library reference manual (Network Theory Ltd., 2009).
- Jones et al. (01 ) E. Jones, T. Oliphant, P. Peterson, et al., “SciPy: Open source scientific tools for Python,” (2001–).
- Rao (1973) C. R. Rao, ed., Linear Statistical Inference and its Applications (John Wiley & Sons, Inc., 1973).
- Murthy and Haftka (1988) D. V. Murthy and R. T. Haftka, Int. J. Numer. Methods Eng. 26, 293 (1988).
- McGibbon, Schwantes, and Pande (2014) R. T. McGibbon, C. R. Schwantes, and V. S. Pande, J. Phys. Chem. B 118, 6475 (2014).
- Barabási and Albert (1999) A.-L. Barabási and R. Albert, Science 286, 509 (1999).
- Shaw et al. (2010) D. E. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R. O. Dror, M. P. Eastwood, J. A. Bank, J. M. Jumper, J. K. Salmon, Y. Shan, and W. Wriggers, Science 330, 341 (2010).
- Liu et al. (2008) F. Liu, D. Du, A. A. Fuller, J. E. Davoren, P. Wipf, J. W. Kelly, and M. Gruebele, Proc. Natl. Acad. Sci. U.S.A. 105, 2369 (2008).
- McGibbon et al. (2014) R. T. McGibbon, K. A. Beauchamp, C. R. Schwantes, L.-P. Wang, C. X. Hernández, M. P. Harrigan, T. J. Lane, J. M. Swails, and V. S. Pande, bioRxiv (2014), 10.1101/008896.
- Schwantes and Pande (2013) C. R. Schwantes and V. S. Pande, J. Chem. Theory Comput. 9, 2000 (2013).
- Pérez-Hernández et al. (2013) G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, J. Chem. Phys. 139, 015102 (2013).
- Lloyd (1982) S. Lloyd, IEEE Trans. Inform. Theory 28, 129 (1982).
- Arthur and Vassilvitskii (2007) D. Arthur and S. Vassilvitskii, in Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms (SIAM, 2007) pp. 1027–1035.
- Deuflhard et al. (2000) P. Deuflhard, W. Huisinga, A. Fischer, and C. Schütte, Linear Algebra Appl. 315, 39 (2000).
- Sarich, Noé, and Schütte (2010) M. Sarich, F. Noé, and C. Schütte, Multiscale Model. Simul. 8, 1154 (2010).
- Swope, Pitera, and Suits (2004) W. C. Swope, J. W. Pitera, and F. Suits, J. Phys. Chem. B 108, 6571 (2004).
- Bowman, Huang, and Pande (2009) G. R. Bowman, X. Huang, and V. S. Pande, Methods 49, 197 (2009).
- Chandler (1978) D. Chandler, J. Chem. Phys. 68, 2959 (1978).
- Chandler (1998) D. Chandler, in Classical and Quantum Dynamics in Condensed Phase Simulations, edited by B. J. Berne, G. Ciccotti, and D. J. Coker (World Scientific, Singapore, 1998) pp. 3–23.
- (62) Both our algorithm and Holmes-Rubin estimator were implemented the Cython language and compiled to C++. Our implementation of the Holmes-Rubin estimator is available at https://github.com/rmcgibbo/holmes_rubin.
- Neal (2011) R. M. Neal, in Handbook of Markov Chain Monte Carlo, edited by S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng (Chapman & Hall / CRC, 2011) pp. 113–162.
- Trendelkamp-Schroer and Noé (2013) B. Trendelkamp-Schroer and F. Noé, J. Chem. Phys. 138, 164113 (2013).