Variable selection in dynamical systems

A MCMC strategy for identifying a sparse continuous time dynamics matrix from time series data

A MCMC strategy for identifying a sparse continuous time dynamics matrix from time series data

Atte Aalto and Jorge Gonçalves
Luxembourg Centre for Systems Biomedicine
University of Luxembourg

Identifying a sparse continuous time dynamics matrix from time series data by a MCMC approach

Atte Aalto and Jorge Gonçalves
Luxembourg Centre for Systems Biomedicine
University of Luxembourg

Variable selection in linear dynamical systems by Markov Chain Monte Carlo approach

Atte Aalto and Jorge Gonçalves
Luxembourg Centre for Systems Biomedicine
University of Luxembourg

Variable selection in linear dynamical systems by an MCMC approach

Atte Aalto and Jorge Gonçalves
Luxembourg Centre for Systems Biomedicine
University of Luxembourg

Bayesian variable selection in linear dynamical systems

Atte Aalto and Jorge Gonçalves
Luxembourg Centre for Systems Biomedicine
University of Luxembourg

Abstract.

We develop a method for reconstructing regulatory interconnection networks between variables evolving according to a linear dynamical system. The work is motivated by the problem of gene regulatory network inference, that is, finding causal effects between genes from gene expression time series data. In biological applications, the typical problem is that the sampling frequency is low, and consequentially the system identification problem is ill-posed. The low sampling frequency also makes it impossible to estimate derivatives directly from the data. We take a Bayesian approach to the problem, as it offers a natural way to incorporate prior information to deal with the ill-posedness, through the introduction of sparsity promoting prior for the underlying dynamics matrix. It also provides a framework for modelling both the process and measurement noises. We develop Markov Chain Monte Carlo samplers for the discrete-valued zero-structure of the dynamics matrix, and for the continuous-time trajectory of the system.

Keywords: Variable selection, Bayesian inference, Markov Chain Monte Carlo, Network inference, Linear dynamical system

1. Introduction

We consider the problem of retrieving the sparsity pattern of the dynamics matrix in the system

(1.1)

from time series data . Here is an unknown noise process modelled as a Brownian motion with incremental covariance , and ’s are measurement noise terms. An additional, deterministic input can be treated by superposition. Our motivation for this problem arises from the field of systems biology, where a topical problem is finding the interconnection network structure between different species. More specifically, we are interested in reconstructing gene regulatory networks from gene expression time series data [17]. In this application, data collection is expensive and laborious, and therefore the temporal resolution tends to be relatively poor and the overall length of the time series short. Consequently, the problem is ill-posed, and additional information needs to be incorporated in order to obtain reasonable solutions. A typical resolution to the identifiability issues is to look for sparse matrices .

Let us discuss first the related problem of variable selection in linear regression, that is, finding the zero-structure of the matrix from input-output data connected by

(1.2)

A sparse solution could be obtained by solving the cardinality-penalised least squares problem, that is, minimising , where gives the number of non-zero entries in . However, the cardinality penalty is non-convex and moreover, the problem becomes combinatorial in nature, as each variable combination must be tested separately. A typical remedy is to resort to convex relaxation, that is, penalising instead for the 1-norm of the matrix , defined by . This approach is generally known as Lasso [23]. The Lasso approach can also be interpreted as a maximum likelihood estimate assuming Gaussian noise, and Laplace priors for the parameters. This property has been exploited in the so-called Bayesian Lasso approach [21],[5]. However, in the Laplace distribution, the value zero is the maximum likelihood estimate, but even there the probability that the coefficient would be zero in a single realization is zero. Therefore the Lasso only produces sparse solutions when it is used in the maximum likelihood estimates. Sparse realizations can be obtained by introducing a prior distribution for the parameters that have a point mass at zero, leading to a probabilistic counterpart of the cardinality-penalty setup. Clearly the combinatorial nature of the problem remains, but to some extent this can be overcome by using an MCMC strategy. Such strategies tend to spend more time in the “neighborhood” of solutions with high probability. Variable selection methods based on such probabilistic consideration are considered for example in [18] introducing so-called “spike and slab” priors consisting of a mixture of a point mass at zero, and a uniform distribution around zero. Indicator variables are introduced in [15] and [9]. The article [15] also discusses different types of global priors for the indicator variable. This means that the probability of a certain regression coefficient being zero depends on how many of the other coefficients are zero. Different methods are reviewed and compared in [20] and [8].

Comparing the dynamical system (1.1) and the linear regression case (1.2), the main difference is that in (1.1), the “input” on which matrix operates is the trajectory , and the “output” is its derivative . This means that also the input is unknown. In addition, the derivative cannot be estimated directly from the samples due to the low sampling frequency. One approach to tackle this problem is taken in [14] and [4] where the Lasso approach is combined with a Kalman smoother estimating the latent trajectory. The result is an EM type algorithm alternately updating the latent trajectory and the matrix . Another approach is presented in [25] which is based on a discrete time approach studying and imposing sparsity on the matrix logarithm. The approach taken here is to impose a sparsity promoting prior probability distribution for the matrix , which — together with the process noise model — gives rise to a probability measure for the continuous trajectory . A MCMC sampler is then constructed for both the matrix (or, more precisely, its zero-structure) and the trajectory . The problem of low sampling rate is addressed by sampling the full trajectory , as opposed to sampling only . The prior for is defined through an indicator variable as in [15], and separate priors are employed for the indicator variable, and the magnitudes of the non-zero values. A discrete-valued Markov chain is defined that is moving between different indicator variables, that are controlling the zero-structure of . This approach bears some resemblance to the Reversible Jump MCMC which is designed in [12] for sampling from distributions with varying dimension. However, assuming a normal distribution for the non-zero elements of , it is possible to integrate out the magnitude parameters, thus avoiding the need to deal with varying dimension of the parameter space.

The proposed approach gives rise to some challenges related to the MCMC sampling. The continuous-time trajectory of the system is an infinite-dimensional random variable. We will employ a Crank-Nicolson sampling scheme for the trajectory in order to achieve high acceptance rates in the sampler. A mixture of discrete and continuous variables is prone to multimodality problems. This problem is addressed by employing a parallel tempering scheme. The outline of the paper is as follows: in order to best convey the main idea, the sampling scheme for the zero-structure of the matrix is first introduced in the simpler context of variable selection in linear regression. This is the topic of Section 2. The case of dynamical systems is treated in Section 3, where we also introduce slight improvements and generalisations of the method involving higher order dynamics. Finally, in Section 4, we present a numerical example where the introduced method is compared to the Expectation Maximization (EM) method incorporating a Laplace prior for the elements of matrix , corresponding to the popular Lasso algorithm.

It should be noted that the proposed method is straightforwardly generalizable to nonlinear dynamics where is a library of selected nonlinear functions, and their weights are to be determined. A similar approach has been presented in [3] and [16].

2. Variable selection in linear regression

In this section we introduce our sampling scheme for sampling the zero-structure of the matrix in connection of a linear regression problem. Say we have data of input-output pairs for of the form

where and if . The task is to identify the matrix for which we have prior information that it should be sparse.

Let us introduce some notation:

that is, is a variable indicating whether the element of the matrix is non-zero, and is the magnitude variable. Denote the indicator matrix by , that is , and . For a vector , the notation stands for the vector in that consists of those elements for which . For a matrix , the notation stands for the submatrix of that consists of the elements for which and . For a matrix (or ) the notation stands for the (or ) matrix that consists of those rows (columns) for which .

We wish to sample from the posterior distribution for which it holds that

where the first line is a marginalization integral, the second line is the Bayes’ rule, and the third line follows from the probability chain rule. For given and , the output is Gaussian, that is, . The topology is independent of the input data, so which is just the prior probability for the topology. At this point let us assume that is a diagonal matrix, .

For the matrix we assume that its rows are independent, and . Then the function is an exponential function where the exponent is a quadratic function of :

This marginalization integral can be computed analytically. Firstly, integrating over the variables for which corresponds to the usual Gaussian marginalization integral

For the remaining part of the exponent it holds that

where is the minimal value of the quadratic exponent and is the vector attaining this minimum. The minimal value is obtained by straightforward differentiation and it is

(2.1)

where is the row of , that is, the vector containing the components of for .

Finally, the full marginalisation integral is

where is given in (2.1).

2.1. The proposal Markov chain and the Metropolis–Hastings algorithm

There is some freedom in how to perform a jump from one connectivity matrix to another, that is, designing the proposal distribution . We will employ a simple scheme where we randomly pick an element from , and flip it. That is, draw from the uniform distribution on and set

This proposal is symmetric so that .

Another possibility is presented in [5]. Their strategy is to decide whether to add or remove (or neither) a variable from the active regressor set. Say that the probability for an addition is and probability for a removal is , so that holds. With probability , the active regressor set is not changed, that is, . The ratio of the probabilities of a jump and its reverse is a bit complicated, since one has to take into account the extreme cases, when a removal step removes the last remaining regressor, or when an addition step results in a full matrix . In the end, the ratios are for an addition move :

and for a removal :

Note that an addition move is not possible if and a removal is not possible if .

For a given connectivity matrix we define the Metropolis–Hastings number

where is given in (2.1), and is the user-defined prior probability for this particular zero-structure. It can be defined, for example, using the full number of non-zero elements, , or the numbers of non-zero elements on each row, , etc.

In the Metropolis–Hastings MCMC algorithm we use the above procedure to sample a new network topology from the old topology matrix . The acceptance probability of the new topology is then . This algorithm is summarised below:

Algorithm 2.1.

  • Set and .

  • Pick an initial topology and compute .

  • For

    • Form from using the procedure described above.

    • Compute .

    • With probability , set . Otherwise set .

    • Compute .

  • Compute .

As the number of samples grows, the elements of the matrix tend to the matrix , whose elements are the probabilities with which the corresponding elements of are non-zero. This algorithm is slightly simplified since a burn-in period or any thinning are not explicitly included.

3. Linear dynamical systems

In this section, we will encounter a number of different indices. To improve readability, we shall use index exclusively to refer to the time discretisation, for the output dimension of the state and measurement , for the input dimension, and for numbering the samples in the MCMC scheme — used as a parenthesised superscript, e.g., the trajectory sample is .

In this section, we formulate the approach for estimating the zero-structure of a sparse matrix from time series data , that is, is the dimension of one measurement, and is the number of samples in the time series. This data is assumed to arise from discrete measurements of a continuous time trajectory,

The trajectory is the solution of

where is a Brownian motion with incremental covariance , which is assumed to be diagonal, with . Again the goal is to obtain the posterior probabilities for different structure matrices . The main difference to the previous section is that now we have an additional unknown variable, namely the trajectory . This trajectory will be treated as a latent variable, which will be sampled as well. What makes things slightly tricky is that is an infinite-dimensional variable. In particular, is a probability measure on the augmented variable consisting of the discrete-valued graph topology, the continuous-valued parameters , and the infinite-dimensional trajectories :

For a background on infinite-dimensional integrals, we refer to [13] and for background on stochastic processes, see [22].

Given , that is, and , the trajectory is a Gaussian process. The measure of the process is continuous with respect to the Wiener measure corresponding to the incremental covariance . By the Cameron–Martin theorem [22, Theorem 8.2.9], it holds that

The exponent is a quadratic function of . Therefore, we impose a normal prior to the rows of , that is, , and, as before in the linear regression case, the integral with respect to can be computed analytically like in the basic case in Section 2. Denote by the matrix defined elementwise

The integral then yields

The bracket notation is defined for a vector-valued function as a vector in defined elementwise as the Ito integral

The functional is defined by

(3.1)

3.1. Sampling strategy

Common MCMC strategies’ acceptance probabilities decrease to zero as the dimension of the distribution increases. In particular, this becomes a problem when sampling some discretised infinite-dimensional object, and one wishes to refine the discretisation. We introduce a Crank–Nicolson sampling scheme [2, 6] to speed up the sampling. Crank–Nicolson sampling is based on implementing the “Gaussian part” of the posterior distribution already in the sampling scheme, and then it does not affect the acceptance probability. That is, assume we wish to sample from a distribution that has the form . A Crank-Nicolson sampler draws samples from by

where , and is the current sample. The acceptance probability of the sample is computed using only the non-Gaussian part of the distribution, that is, .

In our case, the Gaussian part is . However, sampling from this distribution leads to poor performance, since — loosely speaking — the measure is concentrated on very different area in the infinite-dimensional space of trajectories as the full posterior measure. We wish to design a sampling measure that is proportional to , and that is concentrated on the same area as the full posterior.

Let us introduce the used sampling scheme in the infinite-dimensional context. The practical implementation in discretised form will be presented later. We propose a two-phase sampling scheme where we first sample from the measurement distribution, that is, . Then define as the continuous, piecewise linear (on intervals ) function, for which it holds . The trajectory sample is then where is a collection of independent Brownian bridges that satisfy . That is, each component is a Gaussian process with covariance function

In the following lemma, it is shown that the proposed sampling scheme equipped with a suitable acceptance-rejection mechanism, is indeed equivalent to sampling from the conditioned measure . Only the one-dimensional case is considered for simplicity of notation. The higher dimensional trajectory samples are just collections of one-dimensional trajectories.

Lemma 3.1.

Say the current state trajectory sample is . The sampling scheme described above, combined with Metropolis–Hastings acceptance ratio

is equivalent with sampling from the conditioned Wiener measure .

Note that the result does not depend on how and are sampled. Later, we will construct Crank–Nicolson samplers for both and .

Proof.

An equivalent way to sample as described above is to sample first and then a Wiener process with incremental covariance . Then denote , and set

and so . The process belongs to the Cameron–Martin space of the Wiener measure, that is, , and so we can use the Cameron–Martin theorem to obtain a measure for the process with respect to the Wiener measure :

Now is exactly the described measure for the process . ∎

With this sampling scheme, the proposal distribution already takes into account the data fit and the trajectory smoothness between data points. Therefore the acceptance probability does not tend to zero as the discretisation is refined. In the full posterior sampling, the acceptance probability given in the lemma is combined with the part arising from the term in the dynamics equations.

For a practical implementation of the scheme, we need a finite-dimensional subspace of that contains the piecewise linear functions . Piecewise linear hat functions with a finer discretisation are a natural choice for the basis of the finite-dimensional subspace. Assume now that the output data is sampled with constant sampling frequency and all dimensions of the state are measured at the same times , , and denote . This assumption is made mostly for clarity of presentation. Divide each interval [ to pieces, where is a design parameter, and denote . The hat functions are defined as

Define the matrices and elementwise

With the chosen functions , these matrices are

Define also the embedding matrix such that for , the product gives in the basis . For example, with , this matrix is

To sample the Brownian bridge term , we use the Karhunen–Loève expansion using sinusoidal basis functions. To this end, define the matrix whose columns consist of the discretised basis functions:

The resulting sampling scheme is presented in the form of an algorithm.

Algorithm 3.1.

Initialisation:

  • Choose the discretisation level and the proposal step length parameter .

  • Form , , , and .

  • Choose initial trajectory , and initial topology .

Sampling (for ):

  • Sample from as described in Section 2.

  • Sample where is an matrix whose each element is an independent, normally distributed random variable with zero mean and variance one.

  • Sample where consists of the Brownian bridges between measurements.

  • Compute , and . The term arises from the Ito integral formula. Denote the row of by .

  • Compute the Metropolis–Hastings number for the new candidate sample

    (3.2)

    where

    (3.3)

    Note that is not explicitly a variable of because it can be obtained from by . The acceptance probability of the new sample is

    that is, with this probability, set , , and . Otherwise, set , , and .

Forming the Brownian bridge term is difficult to present using standard notation, but it is efficiently done using the MATLAB code line

B=[zeros(n,1),C*reshape([Pb*randn(n1,n2);zeros(1,n2)],[],n)']

where n1 and n2 and C.

Note that in principle there is no reason why the same step size should be used for both and . Also, if some other method is used for sampling the indicator matrix , then the proposal ratio must be included in the Metropolis–Hastings number , see Section 2.

3.2. Alternative Gibbs sampler

Under the fairly natural assumption that the topology prior can be factorised with respect to the rows of , that is, , then the algorithm can be made more efficient by introducing a Gibbs sampler that is updating first each row of separately, and then the trajectory . Note that the posterior decomposes also with respect to the rows of , and subsequently the Metropolis–Hastings number in (3.2) can be factorised to

where , and contains simply the term of the sum in given in (3.3).

The key steps of Algorithm 3.1 are modified as follows:

  • For

    • Sample the new row from the current sample .

    • Accept with probability .

  • Sample and as in Algorithm 3.1.

  • Accept with probability

Note that in this modification, each factor is stored separately.

3.3. Hyperparameter sampling

Typically even the hyperparameters , and are not known, and they can be sampled as well. Again, sampling the process noise covariance poses an additional technical problem, because the Wiener measures corresponding to different covariances are not equivalent. This means that if nothing else is done, then in the infinitesimal discretisation limit, a step where increases is always accepted, whereas a step where decreases is never accepted. To prevent this, The Brownian bridge term in the trajectory has to be scaled by . That is, if the current trajectory is decomposed into , then when hyperparameter is sampled, then also the trajectory is scaled to obtain a candidate , which is accepted if is accepted.

We assume that the matrices are assumed diagonal, and they are assumed to be of the form where is diagonal with

The purpose of this choice is to scale all potential regulators to same magnitude so that the scales would not matter in the variable selection.

When the hyperparameters and are sampled, their acceptance is based on computing the Metropolis–Hastings numbers using these new variables. That is, using for example random walk sampling, , and , then these samples are accepted with probability