Bayesian variable selection in linear dynamical systems
Abstract.
We develop a method for reconstructing regulatory interconnection networks between variables evolving according to a linear dynamical system. The work is motivated by the problem of gene regulatory network inference, that is, finding causal effects between genes from gene expression time series data. In biological applications, the typical problem is that the sampling frequency is low, and consequently the system identification problem is ill-posed. The low sampling frequency also makes it impossible to estimate derivatives directly from the data. We take a Bayesian approach to the problem, as it offers a natural way to incorporate prior information to deal with the ill-posedness, through the introduction of a sparsity-promoting prior for the underlying dynamics matrix. It also provides a framework for modelling both the process and measurement noises. We develop Markov Chain Monte Carlo samplers for the discrete-valued zero-structure of the dynamics matrix and for the continuous-time trajectory of the system.
Keywords: Variable selection, Bayesian inference, Markov Chain Monte Carlo, Network inference, Linear dynamical system
1. Introduction
We consider the problem of retrieving the sparsity pattern of the dynamics matrix $A$ in the system
(1.1) $\mathrm{d}x(t) = Ax(t)\,\mathrm{d}t + \mathrm{d}w(t), \qquad y_j = x(t_j) + v_j,$
from time series data $\{y_j\}_{j=1}^{m}$. Here $w$ is an unknown noise process modelled as a Brownian motion with incremental covariance $Q$, and the $v_j$'s are measurement noise terms. An additional, deterministic input can be treated by superposition. Our motivation for this problem arises from the field of systems biology, where a topical problem is finding the interconnection network structure between different species. More specifically, we are interested in reconstructing gene regulatory networks from gene expression time series data [17]. In this application, data collection is expensive and laborious, and therefore the temporal resolution tends to be relatively poor and the overall length of the time series short. Consequently, the problem is ill-posed, and additional information needs to be incorporated in order to obtain reasonable solutions. A typical resolution to the identifiability issues is to look for sparse matrices $A$.
Let us first discuss the related problem of variable selection in linear regression, that is, finding the zero-structure of the matrix $A$ from input-output data $(u_j, y_j)$ connected by
(1.2) $y_j = Au_j + v_j.$
A sparse solution could be obtained by solving the cardinality-penalised least squares problem, that is, minimising $\sum_{j}\|y_j - Au_j\|^2 + \lambda\|A\|_0$, where $\|A\|_0$ gives the number of nonzero entries in $A$. However, the cardinality penalty is nonconvex and, moreover, the problem becomes combinatorial in nature, as each variable combination must be tested separately. A typical remedy is to resort to convex relaxation, that is, penalising instead the 1-norm of the matrix $A$, defined by $\|A\|_1 = \sum_{i,j}|A_{i,j}|$. This approach is generally known as the Lasso [23]. The Lasso can also be interpreted as a maximum a posteriori estimate assuming Gaussian noise and Laplace priors for the parameters. This property has been exploited in the so-called Bayesian Lasso approach [21], [5]. However, for the Laplace distribution, the value zero is the mode, but the probability that a coefficient is exactly zero in a single realization is zero. Therefore the Lasso only produces sparse solutions when it is used for point estimates. Sparse realizations can be obtained by introducing a prior distribution for the parameters that has a point mass at zero, leading to a probabilistic counterpart of the cardinality-penalty setup. Clearly the combinatorial nature of the problem remains, but to some extent this can be overcome by using an MCMC strategy. Such strategies tend to spend more time in the "neighborhood" of solutions with high probability. Variable selection methods based on such probabilistic considerations are studied for example in [18], which introduces so-called "spike and slab" priors consisting of a mixture of a point mass at zero and a uniform distribution around zero. Indicator variables are introduced in [15] and [9]. The article [15] also discusses different types of global priors for the indicator variable; this means that the probability of a certain regression coefficient being zero depends on how many of the other coefficients are zero.
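The distinction drawn above, between a sparse point estimate and posterior realizations that are almost surely nonzero, can be illustrated with a small NumPy sketch. The threshold value and all variable names here are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# MAP estimation under a Laplace prior (the Lasso), for an orthogonal design,
# reduces to soft-thresholding of the least-squares estimate; this toy
# (with an assumed threshold lam) shows it produces exact zeros.
def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b_ls = np.array([2.0, 0.3, -0.1, -1.5])
b_map = soft_threshold(b_ls, lam=0.5)
print(b_map)                    # small coefficients are set exactly to zero

# In contrast, individual draws from a Laplace distribution are almost
# surely nonzero, so posterior *samples* under the Laplace prior are not sparse.
draws = rng.laplace(loc=0.0, scale=1.0, size=10_000)
print(np.sum(draws == 0.0))     # 0
```

This is why a point mass at zero is needed if one wants sparse posterior realizations rather than only a sparse point estimate.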
Different methods are reviewed and compared in [20] and [8].
Comparing the dynamical system (1.1) and the linear regression case (1.2), the main difference is that in (1.1), the "input" on which the matrix $A$ operates is the trajectory $x$, and the "output" is its derivative. This means that also the input is unknown. In addition, the derivative cannot be estimated directly from the samples due to the low sampling frequency. One approach to tackle this problem is taken in [14] and [4], where the Lasso approach is combined with a Kalman smoother estimating the latent trajectory. The result is an EM-type algorithm alternately updating the latent trajectory and the matrix $A$. Another approach is presented in [25], which is based on a discrete-time formulation imposing sparsity on the matrix logarithm. The approach taken here is to impose a sparsity-promoting prior probability distribution for the matrix $A$, which, together with the process noise model, gives rise to a probability measure for the continuous trajectory $x$. An MCMC sampler is then constructed for both the matrix $A$ (or, more precisely, its zero-structure) and the trajectory $x$. The problem of low sampling rate is addressed by sampling the full trajectory, as opposed to sampling only the state values at the measurement times. The prior for $A$ is defined through an indicator variable as in [15], and separate priors are employed for the indicator variable and the magnitudes of the nonzero values. A discrete-valued Markov chain is defined that moves between different indicator variables controlling the zero-structure of $A$. This approach bears some resemblance to the Reversible Jump MCMC, which is designed in [12] for sampling from distributions with varying dimension. However, assuming a normal distribution for the nonzero elements of $A$, it is possible to integrate out the magnitude parameters, thus avoiding the need to deal with a parameter space of varying dimension.
The proposed approach gives rise to some challenges related to the MCMC sampling. The continuous-time trajectory of the system is an infinite-dimensional random variable. We employ a Crank–Nicolson sampling scheme for the trajectory in order to achieve high acceptance rates in the sampler. A mixture of discrete and continuous variables is prone to multimodality problems; this is addressed by employing a parallel tempering scheme. The outline of the paper is as follows: in order to best convey the main idea, the sampling scheme for the zero-structure of the matrix is first introduced in the simpler context of variable selection in linear regression. This is the topic of Section 2. The case of dynamical systems is treated in Section 3, where we also introduce slight improvements and generalisations of the method involving higher order dynamics. Finally, in Section 4, we present a numerical example where the introduced method is compared to the Expectation Maximization (EM) method incorporating a Laplace prior for the elements of the matrix $A$, corresponding to the popular Lasso algorithm.
2. Variable selection in linear regression
In this section we introduce our scheme for sampling the zero-structure of the matrix $A$ in connection with a linear regression problem. Say we have data consisting of input-output pairs $(u_j, y_j)$ for $j = 1, \dots, m$ of the form
$$y_j = Au_j + v_j,$$
where $u_j \in \mathbb{R}^n$, $y_j \in \mathbb{R}^d$, and the $v_j$'s are Gaussian noise terms. The task is to identify the matrix $A$, for which we have prior information that it should be sparse.
Let us introduce some notation: decompose $A_{i,j} = S_{i,j}B_{i,j}$ with $S_{i,j} \in \{0,1\}$, that is, $S_{i,j}$ is a variable indicating whether the element $(i,j)$ of the matrix $A$ is nonzero, and $B_{i,j}$ is the magnitude variable. Denote the indicator matrix by $S$, that is, $[S]_{i,j} = S_{i,j}$, and similarly $[B]_{i,j} = B_{i,j}$. For a vector $v \in \mathbb{R}^n$ and a binary vector $s \in \{0,1\}^n$, the notation $v_s$ stands for the vector that consists of those elements $v_i$ for which $s_i = 1$. For a matrix $M$, the notation $M_{s,r}$ stands for the submatrix of $M$ that consists of the elements $M_{i,j}$ for which $s_i = 1$ and $r_j = 1$. For an $n \times k$ (or $k \times n$) matrix $M$, the notation $M_s$ stands for the matrix that consists of those rows (columns) of $M$ for which $s_i = 1$.
We wish to sample from the posterior distribution $p(S \mid U, Y)$, for which it holds that
$$p(S \mid U, Y) = \int p(S, B \mid U, Y)\,\mathrm{d}B \propto \int p(Y \mid S, B, U)\,p(S, B \mid U)\,\mathrm{d}B = p(S)\int p(Y \mid S, B, U)\,p(B \mid S)\,\mathrm{d}B,$$
where the first equality is a marginalization integral, the second is Bayes' rule, and the third follows from the probability chain rule. For given $S$ and $B$, the output is Gaussian, that is, $y_j \mid S, B, u_j \sim N\big((S \circ B)u_j, R\big)$, where $\circ$ denotes the elementwise product. The topology is independent of the input data, so $p(S \mid U) = p(S)$, which is just the prior probability for the topology. At this point let us assume that the measurement noise covariance is a diagonal matrix, $R = \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2)$.
For the matrix $B$ we assume that its rows are independent, and $B_{i,:}^{\top} \sim N(0, \gamma^2 I)$. Then the integrand $p(Y \mid S, B, U)\,p(B \mid S)$ is an exponential function where the exponent is a quadratic function of $B$:
$$-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{1}{\sigma_i^2}\big\|y^{(i)} - U_{s_i}^{\top} b_i\big\|^2 + \frac{1}{\gamma^2}\|b_i\|^2\right),$$
where $y^{(i)} = [y_{1,i}, \dots, y_{m,i}]^{\top}$ collects the $i$th components of the outputs, $U = [u_1, \dots, u_m] \in \mathbb{R}^{n \times m}$, and $b_i$ denotes the nonzero elements of the $i$th row of $B$. This marginalization integral can be computed analytically. Firstly, integrating over the variables $B_{i,j}$ for which $S_{i,j} = 0$ corresponds to the usual Gaussian marginalization integral, contributing only a normalisation constant. For the remaining part of the exponent it holds that
$$\frac{1}{\sigma_i^2}\big\|y^{(i)} - U_{s_i}^{\top} b_i\big\|^2 + \frac{1}{\gamma^2}\|b_i\|^2 = m_i + \big(b_i - \hat{b}_i\big)^{\top}\Big(\frac{1}{\sigma_i^2}U_{s_i}U_{s_i}^{\top} + \frac{1}{\gamma^2}I\Big)\big(b_i - \hat{b}_i\big),$$
where $m_i$ is the minimal value of the quadratic exponent and $\hat{b}_i$ is the vector attaining this minimum. The minimal value is obtained by straightforward differentiation, and it is
(2.1) $$m_i = \frac{1}{\sigma_i^2}\Big(\big\|y^{(i)}\big\|^2 - \big(y^{(i)}\big)^{\top} U_{s_i}^{\top}\hat{b}_i\Big), \qquad \hat{b}_i = \Big(U_{s_i}U_{s_i}^{\top} + \frac{\sigma_i^2}{\gamma^2}I\Big)^{-1}U_{s_i}\,y^{(i)},$$
where $s_i$ is the $i$th row of $S$, and $U_{s_i}$ consists of the rows of $U$ for which $s_{i,k} = 1$.
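As a numerical sanity check, the minimal value of the quadratic exponent (computed via the minimiser) agrees with the quadratic form of the marginal Gaussian $N\big(0, \sigma^2 I + \gamma^2 U_s^{\top}U_s\big)$, by the Woodbury identity. The following sketch uses illustrative dimensions and hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 20
sigma2, gamma2 = 0.5, 2.0        # illustrative noise and slab variances

U = rng.standard_normal((n, m))  # inputs as columns
y = rng.standard_normal(m)       # outputs for one row of the regression
s = np.array([1, 0, 1, 1, 0], dtype=bool)
Us = U[s]                        # rows of U in the support set

# Minimiser of the quadratic exponent, and the corresponding minimal value.
b_hat = np.linalg.solve(Us @ Us.T + (sigma2 / gamma2) * np.eye(s.sum()), Us @ y)
m_min = (y @ y - y @ (Us.T @ b_hat)) / sigma2

# Equivalent form: quadratic form of the marginal Gaussian covariance.
Sigma = sigma2 * np.eye(m) + gamma2 * Us.T @ Us
m_direct = y @ np.linalg.solve(Sigma, y)

print(np.isclose(m_min, m_direct))   # True
```

The two forms coincide for any data and support set, which is a quick way to validate an implementation of the marginal likelihood.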
2.1. The proposal Markov chain and the Metropolis–Hastings algorithm
There is some freedom in how to perform a jump from one connectivity matrix to another, that is, in designing the proposal distribution $q(S' \mid S)$. We will employ a simple scheme where we randomly pick an element of $S$ and flip it. That is, draw $(i,j)$ from the uniform distribution on $\{1, \dots, d\} \times \{1, \dots, n\}$ and set
$$S'_{i,j} = 1 - S_{i,j}, \qquad S'_{k,l} = S_{k,l} \ \text{for } (k,l) \neq (i,j).$$
This proposal is symmetric, so that $q(S' \mid S) = q(S \mid S')$.
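The flip proposal can be sketched in a few lines; the function name is illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

# Symmetric flip proposal: pick a uniformly random entry of the indicator
# matrix S and toggle it. Because the pick is uniform, q(S'|S) = q(S|S').
def flip_proposal(S, rng):
    S_new = S.copy()
    i = rng.integers(S.shape[0])
    j = rng.integers(S.shape[1])
    S_new[i, j] = 1 - S_new[i, j]
    return S_new

S = np.zeros((3, 3), dtype=int)
S_new = flip_proposal(S, rng)
print(int(np.abs(S_new - S).sum()))   # exactly one entry differs: 1
```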
Another possibility is presented in [5]. Their strategy is to decide whether to add or remove (or neither) a variable from the active regressor set. Say that the probability for an addition is $p_a$ and the probability for a removal is $p_r$, so that $p_a + p_r \le 1$ holds. With probability $1 - p_a - p_r$, the active regressor set is not changed, that is, $S' = S$. The ratio of the probabilities of a jump and its reverse is a bit complicated, since one has to take into account the extreme cases, when a removal step removes the last remaining regressor, or when an addition step results in a full matrix $S$. In the end, with $k$ denoting the number of nonzero elements of $S$ and $N = dn$ the total number of elements, the ratios are for an addition move $S \to S'$:
$$\frac{q(S \mid S')}{q(S' \mid S)} = \frac{p_r/(k+1)}{p_a/(N-k)},$$
and for a removal $S \to S'$:
$$\frac{q(S \mid S')}{q(S' \mid S)} = \frac{p_a/(N-k+1)}{p_r/k},$$
with the corresponding modifications in the extreme cases. Note that an addition move is not possible if $k = N$, and a removal is not possible if $k = 0$.
For a given connectivity matrix $S$ we define the Metropolis–Hastings number
$$h(S) = p_{\mathrm{pr}}(S)\prod_{i=1}^{d}\det\!\Big(\gamma^2 U_{s_i}^{\top}U_{s_i} + \sigma_i^2 I\Big)^{-1/2}\exp\!\big(-m_i/2\big),$$
where $m_i$ is given in (2.1), and $p_{\mathrm{pr}}(S)$ is the user-defined prior probability for this particular zero-structure. It can be defined, for example, using the full number of nonzero elements, $\|S\|_0$, or the numbers of nonzero elements on each row, etc.
In the Metropolis–Hastings MCMC algorithm we use the above procedure to sample a new network topology $S'$ from the old topology matrix $S^{(k-1)}$. The acceptance probability of the new topology is then $\min\{1, h(S')/h(S^{(k-1)})\}$. This algorithm is summarised below:
Algorithm 2.1.

(1) Set $k = 1$ and $P = 0 \in \mathbb{R}^{d \times n}$.

(2) Pick an initial topology $S^{(0)}$ and compute $h(S^{(0)})$.

(3) For $k = 1, \dots, K$:

(a) Form a candidate $S'$ from $S^{(k-1)}$ using the procedure described above.

(b) Compute $h(S')$.

(c) With probability $\min\{1, h(S')/h(S^{(k-1)})\}$, set $S^{(k)} = S'$. Otherwise set $S^{(k)} = S^{(k-1)}$.

(d) Compute $P \leftarrow P + S^{(k)}$.

(4) Compute $P \leftarrow P / K$.
As the number of samples grows, the elements of the matrix $P$ tend to the posterior probabilities with which the corresponding elements of $A$ are nonzero. This algorithm is slightly simplified, since a burn-in period and thinning are not explicitly included.
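The full loop can be sketched for a single-row system, combining the flip proposal with the Gaussian-slab marginal likelihood. All dimensions, hyperparameter values, and the independent Bernoulli prior used here are illustrative choices, not the paper's settings.

```python
import numpy as np

# Sketch of Algorithm 2.1 for one row of A (d = 1), with an assumed
# Gaussian slab and independent Bernoulli(theta) prior on the indicators.
rng = np.random.default_rng(3)
n, m = 4, 60
sigma2, gamma2, theta = 0.1, 4.0, 0.2   # noise var, slab var, P(S_j = 1)

U = rng.standard_normal((n, m))
a_true = np.array([1.5, 0.0, -2.0, 0.0])            # true sparse row of A
y = a_true @ U + np.sqrt(sigma2) * rng.standard_normal(m)

def log_h(s):
    """Log Metropolis-Hastings number: log marginal likelihood + log prior."""
    Us = U[s.astype(bool)]
    Sigma = sigma2 * np.eye(m) + gamma2 * Us.T @ Us
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + y @ np.linalg.solve(Sigma, y)) \
        + s.sum() * np.log(theta) + (n - s.sum()) * np.log(1 - theta)

K = 2000
s = np.zeros(n, dtype=int)
P = np.zeros(n)
lh = log_h(s)
for _ in range(K):
    s_new = s.copy()
    j = rng.integers(n)
    s_new[j] = 1 - s_new[j]                    # symmetric flip proposal
    lh_new = log_h(s_new)
    if np.log(rng.uniform()) < lh_new - lh:    # accept with min(1, h'/h)
        s, lh = s_new, lh_new
    P += s
P /= K
print(P)   # inclusion probabilities; entries 0 and 2 should be close to 1
```

Working with log quantities avoids overflow in the determinant and exponent terms; the accept step compares logs for the same reason.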
3. Linear dynamical systems
In this section, we will encounter a number of different indices. To improve readability, we shall use the index $j$ exclusively to refer to the time discretisation, $i$ for the dimensions of the state $x$ and the measurements $y$, and $k$ for numbering the samples in the MCMC scheme, used as a parenthesised superscript; e.g., the $k$th trajectory sample is $x^{(k)}$.
In this section, we formulate the approach for estimating the zero-structure of a sparse matrix $A$ from time series data $y_j \in \mathbb{R}^n$, $j = 1, \dots, m$, that is, $n$ is the dimension of one measurement, and $m$ is the number of samples in the time series. This data is assumed to arise from discrete measurements of a continuous-time trajectory,
$$y_j = x(t_j) + v_j.$$
The trajectory $x$ is the solution of
$$\mathrm{d}x(t) = Ax(t)\,\mathrm{d}t + \mathrm{d}w(t),$$
where $w$ is a Brownian motion with incremental covariance $Q$, which is assumed to be diagonal, $Q = \operatorname{diag}(q_1, \dots, q_n)$. Again the goal is to obtain the posterior probabilities for different structure matrices $S$. The main difference to the previous section is that now we have an additional unknown variable, namely the trajectory $x$. This trajectory will be treated as a latent variable, which will be sampled as well. What makes things slightly tricky is that $x$ is an infinite-dimensional variable. In particular, the posterior is a probability measure on the augmented variable consisting of the discrete-valued graph topology $S$, the continuous-valued parameters $B$, and the infinite-dimensional trajectory $x$.
For a background on infinitedimensional integrals, we refer to [13] and for background on stochastic processes, see [22].
Given $A$, that is, $S$ and $B$, the trajectory $x$ is a Gaussian process. The measure $\mu_A$ of the process is absolutely continuous with respect to the Wiener measure $\mu_W$ corresponding to the incremental covariance $Q$. By the Cameron–Martin theorem [22, Theorem 8.2.9], it holds that
$$\frac{\mathrm{d}\mu_A}{\mathrm{d}\mu_W}(x) = \exp\!\left(\int_0^T \big(Q^{-1}Ax(t)\big)^{\!\top}\mathrm{d}x(t) - \frac{1}{2}\int_0^T \big(Q^{-1}Ax(t)\big)^{\!\top}Ax(t)\,\mathrm{d}t\right).$$
The exponent is a quadratic function of $A$. Therefore, we impose a normal prior on the rows of $A$, that is, $A_{i,s_i}^{\top} \sim N(0, \gamma^2 I)$, and, as before in the linear regression case, the integral with respect to $B$ can be computed analytically like in the basic case in Section 2. Denote by $G$ the matrix defined elementwise by
$$G_{i,l} = \int_0^T x_i(t)\,x_l(t)\,\mathrm{d}t.$$
The integral then yields
The bracket notation $\langle f \rangle$ is defined for a vector-valued function $f$ as a vector in $\mathbb{R}^n$ defined elementwise as the Ito integral
$$\langle f \rangle_i = \int_0^T f_i(t)\,\mathrm{d}x_i(t).$$
The functional is defined by
(3.1) 
3.1. Sampling strategy
Common MCMC strategies suffer from acceptance probabilities that decrease to zero as the dimension of the distribution increases. In particular, this becomes a problem when sampling some discretised infinite-dimensional object, and one wishes to refine the discretisation. We introduce a Crank–Nicolson sampling scheme [2, 6] to speed up the sampling. Crank–Nicolson sampling is based on building the "Gaussian part" of the posterior distribution into the sampling scheme, so that this part does not affect the acceptance probability. That is, assume we wish to sample from a distribution that has the form $\pi(\mathrm{d}x) \propto \exp(-\Phi(x))\,\mu_0(\mathrm{d}x)$, where $\mu_0 = N(0, \mathcal{C})$ is Gaussian. A Crank–Nicolson sampler draws samples by
$$x' = \sqrt{1 - \beta^2}\,x^{(k)} + \beta\,\xi, \qquad \xi \sim N(0, \mathcal{C}),$$
where $\beta \in (0, 1]$ is a step length parameter, and $x^{(k)}$ is the current sample. The acceptance probability of the sample is computed using only the non-Gaussian part of the distribution, that is, $\min\{1, \exp(\Phi(x^{(k)}) - \Phi(x'))\}$.
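A one-dimensional toy run makes the mechanics concrete. The target below is chosen (Gaussian prior, Gaussian "non-Gaussian part") so that the exact posterior, $N(1/2, 1/2)$, is known and the output can be checked; the step length value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Minimal Crank-Nicolson sketch in one dimension: prior N(0, 1), and
# Phi(x) = (x - 1)^2 / 2 as the non-Gaussian part (a Gaussian likelihood
# here, so the posterior is N(1/2, 1/2) and easy to check).
def phi(x):
    return 0.5 * (x - 1.0) ** 2

beta = 0.5
x = 0.0
samples = []
for _ in range(50_000):
    x_new = np.sqrt(1 - beta**2) * x + beta * rng.standard_normal()
    if np.log(rng.uniform()) < phi(x) - phi(x_new):   # only Phi enters
        x = x_new
    samples.append(x)

print(np.mean(samples[5000:]))   # close to the posterior mean 0.5
```

Note that the prior never appears in the accept step; it is respected exactly by the proposal itself, which is what keeps the acceptance rate bounded away from zero under mesh refinement.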
In our case, the Gaussian part is the Wiener measure of the process noise. However, sampling from this distribution leads to poor performance, since, loosely speaking, the measure is concentrated on a very different area of the infinite-dimensional space of trajectories than the full posterior measure. We wish to design a sampling measure that is equivalent to the Wiener measure, and that is concentrated on the same area as the full posterior.
Let us introduce the sampling scheme in the infinite-dimensional context; the practical implementation in discretised form will be presented later. We propose a two-phase sampling scheme where we first sample values $z_j$ from the measurement distribution, that is, $z_j \sim N(y_j, R)$. Then define $\bar{x}$ as the continuous, piecewise linear (on the intervals $[t_j, t_{j+1}]$) function for which it holds that $\bar{x}(t_j) = z_j$. The trajectory sample is then $x = \bar{x} + b$, where $b$ is a collection of independent Brownian bridges that satisfy $b(t_j) = 0$. That is, on each interval $[t_j, t_{j+1}]$, each component $b_i$ is a Gaussian process with covariance function
$$c(s, t) = q_i\,\frac{\big(\min(s,t) - t_j\big)\big(t_{j+1} - \max(s,t)\big)}{t_{j+1} - t_j}.$$
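The two-phase proposal for one state component can be sketched as follows: a piecewise linear interpolant of values at the measurement times, plus independent Brownian bridges that vanish at those times. The grid sizes, variance, and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def bridge(t, q, rng):
    """Brownian bridge with incremental variance q on the grid t, pinned to 0 at both ends."""
    dt = np.diff(t)
    w = np.concatenate([[0.0], np.cumsum(np.sqrt(q * dt) * rng.standard_normal(len(dt)))])
    return w - (t - t[0]) / (t[-1] - t[0]) * w[-1]

t_meas = np.array([0.0, 1.0, 2.0])     # measurement times
z = np.array([0.0, 1.0, 0.5])          # sampled values at measurement times
q, n_sub = 0.2, 10                     # bridge variance, subintervals per gap

traj_t, traj_x = [], []
for j in range(len(t_meas) - 1):
    t = np.linspace(t_meas[j], t_meas[j + 1], n_sub + 1)
    lin = z[j] + (z[j + 1] - z[j]) * (t - t_meas[j]) / (t_meas[j + 1] - t_meas[j])
    x = lin + bridge(t, q, rng)
    traj_t.append(t[:-1])
    traj_x.append(x[:-1])
traj_t.append(t_meas[-1:])
traj_x.append(z[-1:])
t_all = np.concatenate(traj_t)
x_all = np.concatenate(traj_x)

# The proposal interpolates z exactly at the measurement times.
idx = np.isin(t_all, t_meas)
print(np.allclose(x_all[idx], z))   # True
```

Between measurements the sample fluctuates like Brownian motion, while the data fit is fixed by the interpolation values, mirroring the split used in the acceptance ratio below.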
In the following lemma, it is shown that the proposed sampling scheme, equipped with a suitable acceptance-rejection mechanism, is indeed equivalent to sampling from the conditioned measure. Only the one-dimensional case is considered for simplicity of notation. The higher-dimensional trajectory samples are just collections of one-dimensional trajectories.
Lemma 3.1.
Say the current state trajectory sample is . The sampling scheme described above, combined with Metropolis–Hastings acceptance ratio
is equivalent to sampling from the conditioned Wiener measure.
Note that the result does not depend on how the interpolation values $z$ and the bridges $b$ are sampled. Later, we will construct Crank–Nicolson samplers for both $z$ and $b$.
Proof.
An equivalent way to produce a sample as described above is to first sample $z$ and then a Wiener process with incremental covariance $q$. Then denote , and set
and so . The process belongs to the Cameron–Martin space of the Wiener measure, that is, , and so we can use the Cameron–Martin theorem to obtain a measure for the process with respect to the Wiener measure :
Now is exactly the described measure for the process . ∎
With this sampling scheme, the proposal distribution already takes into account the data fit and the trajectory smoothness between data points. Therefore the acceptance probability does not tend to zero as the discretisation is refined. In the full posterior sampling, the acceptance probability given in the lemma is combined with the part arising from the drift term $Ax$ in the dynamics equation.
For a practical implementation of the scheme, we need a finite-dimensional subspace of trajectories that contains the piecewise linear functions $\bar{x}$. Piecewise linear hat functions on a finer discretisation are a natural choice for the basis of the finite-dimensional subspace. Assume now that the output data is sampled with a constant sampling frequency and all dimensions of the state are measured at the same times $t_j$, $j = 1, \dots, m$, and denote $\Delta t = t_{j+1} - t_j$. This assumption is made mostly for clarity of presentation. Divide each interval $[t_j, t_{j+1}]$ into $N$ pieces, where $N$ is a design parameter, and denote $\delta = \Delta t / N$. The hat functions are defined as the piecewise linear functions that take the value one at a single grid point and zero at all other grid points.
Define the matrices and elementwise
With the chosen functions , these matrices are
Define also the embedding matrix such that for , the product gives in the basis . For example, with , this matrix is
To sample the Brownian bridge term, we use the Karhunen–Loève expansion with sinusoidal basis functions. To this end, define the matrix whose columns consist of the discretised basis functions:
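The Karhunen–Loève expansion of a Brownian bridge on $[0, T]$ with incremental variance $q$ uses the eigenfunctions $\sin(k\pi t/T)$; the sketch below checks the construction against the known bridge variance $q\,t(T-t)/T$. Grid sizes and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Karhunen-Loeve sketch: a Brownian bridge on [0, T] with incremental
# variance q expands as sum_k xi_k * sqrt(2 q T) / (k pi) * sin(k pi t / T),
# with xi_k i.i.d. standard normal.
T, q, n_modes, n_paths = 1.0, 1.0, 200, 20_000
t = np.linspace(0.0, T, 21)

k = np.arange(1, n_modes + 1)
Phi = np.sin(np.outer(t, k) * np.pi / T) * np.sqrt(2 * q * T) / (k * np.pi)
xi = rng.standard_normal((n_modes, n_paths))
paths = Phi @ xi                     # shape (len(t), n_paths)

# The bridge is pinned at both ends, and Var b(t) = q * t * (T - t) / T.
mid_var = paths[10].var()
print(round(float(mid_var), 2))      # about 0.25 = q * T / 4
```

Truncating the expansion at a modest number of modes gives a smooth low-rank parametrisation of the bridge, which is what makes the matrix-based implementation below efficient.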
The resulting sampling scheme is presented in the form of an algorithm.
Algorithm 3.1.
Initialisation:

Choose the discretisation level $N$ and the proposal step length parameter $\beta$.

Form , , , and .

Choose an initial trajectory $x^{(0)}$ and an initial topology $S^{(0)}$.
Sampling (for ):

Sample a candidate topology $S'$ from $S^{(k-1)}$ as described in Section 2.

Sample the candidate interpolation values by a Crank–Nicolson step, using a matrix whose elements are independent, normally distributed random variables with zero mean and variance one.

Sample where consists of the Brownian bridges between measurements.

Compute , and . The term arises from the Ito integral formula. Denote the row of by .

Compute the Metropolis–Hastings number for the new candidate sample
(3.2) where
(3.3) Note that is not explicitly a variable of because it can be obtained from by . The acceptance probability of the new sample is
that is, with this probability, set , , and . Otherwise, set , , and .
Forming the Brownian bridge term is difficult to present using standard notation, but it is efficiently done using the MATLAB code line
B = [zeros(n,1), C*reshape([Pb*randn(n1,n2); zeros(1,n2)], [], n)'];
where the dimensions n1 and n2 and the matrix C are determined by the discretisation.
Note that in principle there is no reason why the same step length $\beta$ should be used for both the interpolation values and the bridge term. Also, if some other method is used for sampling the indicator matrix $S$, then the proposal ratio must be included in the Metropolis–Hastings number, see Section 2.
3.2. Alternative Gibbs sampler
Under the fairly natural assumption that the topology prior factorises with respect to the rows of $S$, that is, $p(S) = \prod_{i=1}^{n} p_i(s_i)$, the algorithm can be made more efficient by introducing a Gibbs sampler that first updates each row of $S$ separately, and then the trajectory $x$. Note that the posterior also decomposes with respect to the rows of $S$, and subsequently the Metropolis–Hastings number in (3.2) can be factorised as
$$h = \prod_{i=1}^{n} h_i,$$
where $h_i$ contains simply the $i$th term of the sum given in (3.3).
The key steps of Algorithm 3.1 are modified as follows:

For

Sample a new row $s_i'$ from the current sample $s_i^{(k-1)}$.

Accept $s_i'$ with probability $\min\{1, h_i(s_i')/h_i(s_i^{(k-1)})\}$.


Sample and as in Algorithm 3.1.

Accept with probability
Note that in this modification, each factor is stored separately.
3.3. Hyperparameter sampling
Typically even the hyperparameters are not known, and they can be sampled as well. Again, sampling the process noise covariance $Q$ poses an additional technical problem, because the Wiener measures corresponding to different covariances are not equivalent. This means that if nothing else is done, then in the infinitesimal discretisation limit, a step where $Q$ increases is always accepted, whereas a step where $Q$ decreases is never accepted. To prevent this, the Brownian bridge term in the trajectory has to be scaled accordingly. That is, if the current trajectory is decomposed into $x = \bar{x} + b$, then when the hyperparameter $q_i$ is sampled, the trajectory is also scaled to obtain a candidate $x' = \bar{x} + \sqrt{q_i'/q_i}\,b$, which is accepted if $q_i'$ is accepted.
We assume that the prior covariance matrices are diagonal, of the form $\gamma^2 D$, where $D$ is diagonal with entries determined by the scales of the corresponding state components. The purpose of this choice is to scale all potential regulators to the same magnitude so that the scales do not matter in the variable selection.
When the hyperparameters $\gamma$ and $Q$ are sampled, their acceptance is based on computing the Metropolis–Hastings numbers using these new variables. That is, using for example random walk sampling, these samples are accepted with a probability determined by the ratio of the Metropolis–Hastings numbers and the user-defined hyperpriors. The product term arises from the Wiener measure factorisation in Lemma 3.1. It is not necessary to sample $\gamma$ and $Q$ simultaneously, and $Q$ can even be sampled one component at a time without increasing computational complexity.
The measurement noise variance can be sampled using random walk sampling, and the acceptance probability is given by
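In discretised form, a random walk update for a noise variance can be sketched as follows. This is a standalone toy with an assumed likelihood and an illustrative improper hyperprior; the random walk is performed on the log of the variance, which keeps proposals positive and the proposal symmetric.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random-walk sampling of a variance r, with y ~ N(0, r) as a stand-in
# likelihood and a 1/r hyperprior (both illustrative, not the paper's model).
y = rng.normal(0.0, np.sqrt(2.0), size=500)   # synthetic data, true variance 2.0

def log_post(log_r):
    r = np.exp(log_r)
    loglik = -0.5 * len(y) * np.log(r) - 0.5 * np.sum(y**2) / r
    logprior = -log_r                          # 1/r prior on r
    return loglik + logprior + log_r           # + log r: Jacobian of r -> log r

log_r = 0.0
trace = []
for _ in range(20_000):
    prop = log_r + 0.2 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(log_r):
        log_r = prop
    trace.append(np.exp(log_r))

print(np.mean(trace[2000:]))   # close to the true variance 2.0
```

The Jacobian term in `log_post` is what makes the chain target the posterior of $r$ itself rather than of $\log r$; omitting it is a common bug in this kind of reparametrised random walk.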