Bayesian Inference and Learning in Gaussian Process
StateSpace Models with Particle MCMC
Abstract
Statespace models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric statespace models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity.
Bayesian Inference and Learning in Gaussian Process
StateSpace Models with Particle MCMC
Roger Frigola, Fredrik Lindsten, Thomas B. Schön and Carl E. Rasmussen
1. Dept. of Engineering, University of Cambridge, UK, {rf342,cer54}@cam.ac.uk
2. Div. of Automatic Control, Linköping University, Sweden, lindsten@isy.liu.se
3. Dept. of Information Technology, Uppsala University, Sweden, thomas.schon@it.uu.se
1 Introduction
Statespace models (SSMs) constitute a popular and general class of models in the context of time series and dynamical systems. Their main feature is the presence of a latent variable, the state , which condenses all aspects of the system that can have an impact on its future. A discretetime SSM with nonlinear dynamics can be represented as
(1a)  
(1b) 
where denotes a known external input, denotes the measurements, and denote i.i.d. noises acting on the dynamics and the measurements, respectively. The function encodes the dynamics and describes the relationship between the observation and the unobserved states.
We are primarily concerned with the problem of learning general nonlinear SSMs. The aim is to find a model that can adaptively increase its complexity when more data is available. To this effect, we employ a Bayesian nonparametric model for the dynamics (1a). This provides a flexible model that is not constrained by any limiting assumptions caused by postulating a particular functional form. More specifically, we place a Gaussian process (GP) prior [1] over the unknown function . The resulting model is a generalization of the standard parametric SSM. The functional form of the observation model is assumed to be known, possibly parameterized by a finite dimensional parameter. This is often a natural assumption, for instance in engineering applications where corresponds to a sensor model – we typically know what the sensors are measuring, at least up to some unknown parameters. Furthermore, using too flexible models for both and can result in problems of nonidentifiability.
We adopt a fully Bayesian approach whereby we find a posterior distribution over all the latent entities of interest, namely the state transition function , the hidden state trajectory and any hyperparameter of the model. This is in contrast with existing approaches for using GPs to model SSMs, which tend to model the GP using a finite set of target points, in effect making the model parametric [2]. Inferring the distribution over the state trajectory is an important problem in itself known as smoothing. We use a tailored particle Markov Chain Monte Carlo (PMCMC) algorithm [3] to efficiently sample from the smoothing distribution whilst marginalizing over the state transition function. This contrasts with conventional approaches to smoothing which require a fixed model of the transition dynamics. Once we have obtained an approximation of the smoothing distribution, with the dynamics of the model marginalized out, learning the function is straightforward since its posterior is available in closed form given the state trajectory. Our only approximation is that of the sampling algorithm. We report very good mixing enabled by the use of recently developed PMCMC samplers [4] and the exact marginalization of the transition dynamics.
There is by now a rich literature on GPbased SSMs. For instance, Deisenroth et al. [5, 6] presented refined approximation methods for filtering and smoothing for already learned GP dynamics and measurement functions. In fact, the method proposed in the present paper provides a vital component needed for these inference methods, namely that of learning the GP model in the first place. Turner et al. [2] applied the EM algorithm to obtain a maximum likelihood estimate of parametric models which had the form of GPs where both inputs and outputs were parameters to be optimized. This type of approach can be traced back to [7] where Ghahramani and Roweis applied EM to learn models based on radial basis functions. Wang et al. [8] learn a SSM with GPs by finding a MAP estimate of the latent variables and hyperparameters. They apply the learning in cases where the dimension of the observation vector is much higher than that of the latent state in what becomes a form of dynamic dimensionality reduction. This procedure would have the risk of overfitting in the common situation where the statespace is highdimensional and there is significant uncertainty in the smoothing distribution.
2 Gaussian Process StateSpace Model
We describe the generative probabilistic model of the Gaussian process SSM (GPSSM) represented in Figure 1 by
(2a)  
(2b)  
(2c) 
and , where we avoid notational clutter by omitting the conditioning on the known inputs . In addition, we put a prior over the various hyperparameters . Also, note that the measurement model (2c) and the prior on can take any form since we do not rely on their properties for efficient inference.
The GP is fully described by its mean function and its covariance function. An interesting property of the GPSSM is that any a priori insight into the dynamics of the system can be readily encoded in the mean function. This is useful, since it is often possible to capture the main properties of the dynamics, e.g. by using a simple parametric model or a model based on first principles. Such simple models may be insufficient on their own, but useful together with the GPSSM, as the GP is flexible enough to model complex departures from the mean function. If no specific prior model is available, the linear mean function is a good generic choice. Interestingly, the prior information encoded in this model will normally be more vague than the prior information encoded in parametric models. The measurement model (2c) implicitly contains the observation function and the distribution of the i.i.d. measurement noise .
3 Inference over States and Hyperparameters
Direct learning of the function in (2a) from input/output data is challenging since the states are not observed. Most (if not all) previous approaches attack this problem by reverting to a parametric representation of which is learned alongside the states. We address this problem in a fundamentally different way by marginalizing out , allowing us to respect the nonparametric nature of the model. A challenge with this approach is that marginalization of will introduce dependencies across time for the state variables that lead to the loss of the Markovian structure of the stateprocess. However, recently developed inference methods, combining sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) allow us to tackle this problem. We discuss marginalization of in Section 3.1 and present the inference algorithms in Sections 3.2 and 3.3.
3.1 Marginalizing out the State Transition Function
Targeting the joint posterior distribution of the hyperparameters, the latent states and the latent function is problematic due to the strong dependencies between and . We therefore marginalize the dynamical function from the model, and instead target the distribution (recall that conditioning on is implicit). In the MCMC literature, this is referred to as collapsing [9]. Hence, we first need to find an expression for the marginal prior . Focusing on we note that, although this distribution is not Gaussian, it can be represented as a product of Gaussians. Omitting the dependence on in the notation, we obtain
(3a)  
with  
(3b)  
(3c) 
for and , . Equation (3) follows from the fact that, once conditioned on , a onestep prediction for the state variable is a standard GP prediction. Here, we have defined the mean vector and the positive definite matrix with block entries . We use two sets of indices, as in , to refer to the offdiagonal blocks of . We also define . We can also express (3a) more succinctly as,
(4) 
This expression looks very much like a multivariate Gaussian density function. However, we emphasize that this is not the case since both and depend (nonlinearly) on the argument . In fact, (4) will typically be very far from Gaussian.
3.2 Sequential Monte Carlo
With the prior (4) in place, we now turn to posterior inference and we start by considering the joint smoothing distribution . The sequential nature of the proposed model suggests the use of SMC. Though most well known for filtering in Markovian SSMs – see [10, 11] for an introduction – SMC is applicable also for nonMarkovian latent variable models. We seek to approximate the sequence of distributions , for . Let be a collection of weighted particles approximating by the empirical distribution, . Here, is a pointmass located at . To propagate this sample to time , we introduce the auxiliary variables , referred to as ancestor indices. The variable is the index of the ancestor particle at time , of particle . Hence, is generated by first sampling with . Then, is generated as,
(5) 
for . The particle trajectories are then augmented according to . Sampling from the onestep predictive density is a simple (and sensible) choice, but we may also consider other proposal distributions. In the above formulation the resampling step is implicit and corresponds to sampling the ancestor indices (cf. the auxiliary particle filter, [12]). Finally, the particles are weighted according to the measurement model, for , where the weights are normalized to sum to 1.
3.3 Particle Markov Chain Monte Carlo
There are two shortcomings of SMC: (i) it does not handle inference over hyperparameters; (ii) despite the fact that the sampler targets the joint smoothing distribution, it does in general not provide an accurate approximation of the full joint distribution due to path degeneracy. That is, the successive resampling steps cause the particle diversity to be very low for time points far from the final time instant .
To address these issues, we propose to use a particle Markov chain Monte Carlo (PMCMC, [3, 13]) sampler. PMCMC relies on SMC to generate samples of the highly correlated state trajectory within an MCMC sampler. We employ a specific PMCMC sampler referred to as particle Gibbs with ancestor sampling (PGAS, [4]), given in Algorithm 1. PGAS uses Gibbslike steps for the state trajectory and the hyperparameters , respectively. That is, we sample first given , then given , etc. However, the full conditionals are not explicitly available. Instead, we draw samples from specially tailored Markov kernels, leaving these conditionals invariant. We address these steps in the subsequent sections.
3.3.1 Sampling the State Trajectories
To sample the state trajectory, PGAS makes use of an SMClike procedure referred to as a conditional particle filter with ancestor sampling (CPFAS). This approach is particularly suitable for nonMarkovian latent variable models, as it relies only on a forward recursion (see [4]). The difference between a standard particle filter (PF) and the CPFAS is that, for the latter, one particle at each time step is specified a priori. Let these particles be denoted . We then sample according to (5) only for . The th particle is set deterministically: . To be able to construct the th particle trajectory, has to be associated with an ancestor particle at time . This is done by sampling a value for the corresponding ancestor index . Following [4], the ancestor sampling probabilities are computed as
(6) 
where the ratio is between the unnormalized target densities up to time and up to time , respectively. The second proportionality follows from the mutual conditional independence of the observations, given the states. Here, refers to a path in formed by concatenating the two partial trajectories. The above expression can be computed by using the prior over state trajectories given by (4). The ancestor sampling weights are then normalized to sum to 1 and the ancestor index is sampled with .
The conditioning on a prespecified collection of particles implies an invariance property in CPFAS, which is key to our development. More precisely, given let be generated as follows:

Run CPFAS from time to time , conditionally on .

Set to one of the resulting particle trajectories according to .
For any , this procedure defines an ergodic Markov kernel on , leaving the exact smoothing distribution invariant [4]. Note that this invariance holds for any , i.e. the number of particles that are used only affect the mixing rate of the kernel . However, it has been experienced in practice that the autocorrelation drops sharply as increases [4, 14], and for many models a moderate is enough to obtain a rapidly mixing kernel.
3.3.2 Sampling the Hyperparameters
Next, we consider sampling the hyperparameters given a state trajectory and sequence of observations, i.e. from . In the following, we consider the common situation where there are distinct hyperparameters for the likelihood and for the prior over trajectories . If the prior over the hyperparameters factorizes between those two groups we obtain . We can thus proceed to sample the two groups of hyperparameters independently. Sampling will be straightforward in most cases, in particular if conjugate priors for the likelihood are used. Sampling will, nevertheless, be harder since the covariance function hyperparameters enter the expression in a nontrivial way. However, we note that once the state trajectory is fixed, we are left with a problem analogous to Gaussian process regression where are the inputs, are the outputs and is the likelihood covariance matrix. Given that the latent dynamics can be marginalized out analytically, sampling the hyperparameters with slice sampling is straightforward [15].
4 A Sparse GPSSM Construction and Implementation Details
A naive implementation of the CPFAS algorithm will give rise to computational complexity, since at each time step , a matrix of size needs to be factorized. However, it is possible to update and reuse the factors from the previous time step, bringing the total computational complexity down to the familiar . Furthermore, by introducing a sparse GP model, we can reduce the complexity to where . In Section 4.1 we introduce the sparse GP model and in Section 4.2 we provide insight into the efficient implementation of both the vanilla GP and the sparse GP.
4.1 FIC Prior over the State Trajectory
An important alternative to GPSSM is given by exchanging the vanilla GP prior over for a sparse counterpart. We do not consider the resulting model to be an approximation to GPSSM, it is still a GPSSM, but with a different prior over functions. As a result we expect it to sometimes outperform its nonsparse version in the same way as it happens with their regression siblings [16].
Most sparse GP methods can be formulated in terms of a set of so called inducing variables [17]. These variables live in the space of the latent function and have a set of corresponding inducing inputs. The assumption is that, conditionally on the inducing variables, the latent function values are mutually independent. Although the inducing variables are marginalized analytically – this is key for the model to remain nonparametric – the inducing inputs have to be chosen in such a way that they, informally speaking, cover the same region of the input space covered by the data. Crucially, in order to achieve computational gains, the number of inducing variables is selected to be smaller than the original number of data points. In the following, we will use the fully independent conditional (FIC) sparse GP prior as defined in [17] due to its very good empirical performance [16].
As shown in [17], the FIC prior can be obtained by replacing the covariance function by,
(7) 
where , is Kronecker’s delta and we use the convention whereby when takes a set as one of its arguments it generates a matrix of covariances. Using the Woodbury matrix identity, we can express the onestep predictive density as in (3), with
(8a)  
(8b) 
where , and . Despite its apparent cumbersomeness, the computational complexity involved in computing the above mean and covariance is , as opposed to for (3). The same idea can be used to express (4) in a form which allows for efficient computation. Here refers to a block diagonalization if is not diagonal.
We do not address the problem of choosing the inducing inputs, but note that one option is to use greedy methods (e.g. [18]). The fast forward selection algorithm is appealing due to its very low computational complexity [18]. Moreover, its potential drawback of interference between hyperparameter learning and active set selection is not an issue in our case since hyperparameters will be fixed for a given run of the particle filter.
4.2 Implementation Details
As pointed out above, it is crucial to reuse computations across time to attain the or computational complexity for the vanilla GP and the FIC prior, respectively. We start by discussing the vanilla GP and then briefly comment on the implementation aspects of FIC.
There are two costly operations of the CPFAS algorithm: (i) sampling from the prior (5), requiring the computation of (3b) and (3c) and (ii) evaluating the ancestor sampling probabilities according to (6). Both of these operations can be carried out efficiently by keeping track of a Cholesky factorization of the matrix , for each particle . Here, is a matrix defined analogously to , but where the covariance function is evaluated for the concatenated state trajectory . From , it is possible to identify submatrices corresponding to the Cholesky factors for the covariance matrix as well as for the matrices needed to efficiently evaluate the ancestor sampling probabilities (6).
It remains to find an efficient update of the Cholesky factor to obtain . As we move from time to in the algorithm, will be replaced by in the concatenated trajectory. Hence, the matrix can be obtained from by replacing rows and columns, corresponding to a rank update. It follows that we can compute by making successive rank one updates and downdates on . In summary, all the operations at a specific time step can be done in computations, leading to a total computational complexity of .
For the FIC prior, a naive implementation will give rise to computational complexity. This can be reduced to by keeping track of a factorization for the matrix . However, to reach the cost all intermediate operations scaling with has to be avoided, requiring us to reuse not only the matrix factorizations, but also intermediate matrixvector multiplications.
5 Learning the Dynamics
Algorithm 1 gives us a tool to compute . We now discuss how this can be used to find an explicit model for . The goal of learning the state transition dynamics is equivalent to that of obtaining a predictive distribution over , evaluated at an arbitrary test point ,
(9) 
Using a samplebased approximation of , this integral can be approximated by
(10) 
where is the number of samples and and follow the expressions for the predictive distribution in standard GP regression if are treated as inputs, are treated as outputs and is the likelihood covariance matrix. This mixture of Gaussians is an expressive representation of the predictive density which can, for instance, correctly take into account multimodality arising from ambiguity in the measurements. Although factorized covariance matrices can be precomputed, the overall computational cost will increase linearly with .The computational cost can be reduced by thinning the Markov chain using e.g. random subsampling or kernel herding [19].
In some situations it could be useful to obtain an approximation from the mixture of Gaussians consisting in a single GP representation. This is the case in applications such as control or real time filtering where the cost of evaluating the mixture of Gaussians can be prohibitive. In those cases one could opt for a pragmatic approach and learn the mapping from a cloud of points using sparse GP regression. The latent function values can be easily sampled from the normally distributed .
6 Experiments
6.1 Learning a Nonlinear System Benchmark
Consider a system with dynamics given by and observations given by , with parameters and a known input . One of the difficulties of this system is that the smoothing density is multimodal since no information about the sign of is available in the observations. The system is simulated for time steps, using lognormal priors for the hyperparameters, and the PGAS sampler is then run for iterations using particles. To illustrate the capability of the GPSSM to make use of a parametric model as baseline, we use a mean function with the same parametric form as the true system, but parameters . This function, denoted model B, is manifestly different to the actual state transition (green vs. black surfaces in Figure 2), also demonstrating the flexibility of the GPSSM.
Figure 2 (left) shows the samples of (red). It is apparent that the distribution covers two alternative state trajectories at particular times (e.g. ). In fact, it is always the case that this bimodal distribution covers the two states of opposite signs that could have led to the same observation (cyan). In Figure 2 (right) we plot samples from the smoothing distribution, where each circle corresponds to . Although the parametric model used in the mean function of the GP (green) is clearly not representative of the true dynamics (black), the samples from the smoothing distribution accurately portray the underlying system. The smoothness prior embodied by the GP allows for accurate sampling from the smoothing distribution even when the parametric model of the dynamics fails to capture important features.
To measure the predictive capability of the learned transition dynamics, we generate a new dataset consisting of time steps and present the RMSE between the predicted value of and the actual one. We compare the results from GPSSM with the predictions obtained from two parametric models (one with the true model structure and one linear model) and two known models (the ground truth model and model B). We also report results for the sparse GPSSM using an FIC prior with 40 inducing points. Table 1 summarizes the results, averaged over 10 independent training and test datasets. We also report the RMSE from the joint smoothing sample to the ground truth trajectory.
RMSE  prediction of  smoothing 

Ground truth model (known parameters)  –  
GPSSM (proposed, model B mean function)  
Sparse GPSSM (proposed, model B mean function)  
Model B (fixed parameters)  
Ground truth model, learned parameters  
Linear model, learned parameters 
6.2 Learning a Cart and Pole System
We apply our approach to learn a model of a cart and pole system used in reinforcement learning. The system consists of a cart, with a freespinning pendulum, rolling on a horizontal track. An external force is applied to the cart. The system’s dynamics can be described by four states and a set of nonlinear ordinary differential equations [20]. We learn a GPSSM based on 100 observations of the state corrupted with Gaussian noise. Although the training set only explores a small region of the 4dimensional state space, we can learn a model of the dynamics which can produce one step ahead predictions such the ones in Figure 3. We obtain a predictive distribution in the form of a mixture of Gaussians from which we display the first and second moments. Crucially, the learned model reports different amounts of uncertainty in different regions of the statespace. For instance, note the narrower errorbars on some states between and . This is due to the model being more confident in its predictions in areas that are closer to the training data.
7 Conclusions
We have shown an efficient way to perform fully Bayesian inference and learning in the GPSSM. A key contribution is that our approach retains the full nonparametric expressivity of the model. This is made possible by marginalizing out the state transition function, which results in a nontrivial inference problem that we solve using a tailored PGAS sampler.
A particular characteristic of our approach is that the latent states can be sampled from the smoothing distribution even when the state transition function is unknown. Assumptions about smoothness and parsimony of this function embodied by the GP prior suffice to obtain highquality smoothing distributions. Once samples from the smoothing distribution are available, they can be used to describe a posterior over the state transition function. This contrasts with the conventional approach to inference in dynamical systems where smoothing is performed conditioned on a model of the state transition dynamics.
References
References
 [1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
 [2] R. Turner, M. P. Deisenroth, and C. E. Rasmussen, “Statespace inference and learning with Gaussian processes,” in 13th International Conference on Artificial Intelligence and Statistics, ser. W&CP, Y. W. Teh and M. Titterington, Eds., vol. 9, Chia Laguna, Sardinia, Italy, May 13–15 2010, pp. 868–875.
 [3] C. Andrieu, A. Doucet, and R. Holenstein, “Particle Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
 [4] F. Lindsten, M. Jordan, and T. B. Schön, “Ancestor sampling for particle Gibbs,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 2600–2608.
 [5] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen, “Robust filtering and smoothing with Gaussian processes,” IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1865 –1871, july 2012.
 [6] M. Deisenroth and S. Mohamed, “Expectation Propagation in Gaussian process dynamical systems,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 2618–2626.
 [7] Z. Ghahramani and S. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in Advances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds. MIT Press, 1999.
 [8] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 1441–1448.
 [9] J. S. Liu, Monte Carlo Strategies in Scientific Computing. Springer, 2001.
 [10] A. Doucet and A. Johansen, “A tutorial on particle filtering and smoothing: Fifteen years later,” in The Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovsky, Eds. Oxford University Press, 2011.
 [11] F. Gustafsson, “Particle filter theory and practice with positioning applications,” IEEE Aerospace and Electronic Systems Magazine, vol. 25, no. 7, pp. 53–82, 2010.
 [12] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle filters,” Journal of the American Statistical Association, vol. 94, no. 446, pp. 590–599, 1999.
 [13] F. Lindsten and T. B. Schön, “Backward simulation methods for Monte Carlo statistical inference,” Foundations and Trends in Machine Learning, vol. 6, no. 1, pp. 1–143, 2013.
 [14] F. Lindsten and T. B. Schön, “On the use of backward simulation in the particle Gibbs sampler,” in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, Mar. 2012.
 [15] D. K. Agarwal and A. E. Gelfand, “Slice sampling for simulation based fitting of spatial data models,” Statistics and Computing, vol. 15, no. 1, pp. 61–69, 2005.
 [16] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudoinputs,” in Advances in Neural Information Processing Systems (NIPS), Y. Weiss, B. Schölkopf, and J. Platt, Eds., Cambridge, MA, 2006, pp. 1257–1264.
 [17] J. QuiñoneroCandela and C. E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
 [18] M. Seeger, C. Williams, and N. Lawrence, “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression,” in Artificial Intelligence and Statistics 9, 2003.
 [19] Y. Chen, M. Welling, and A. Smola, “Supersamples from kernel herding,” in Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), P. Grünwald and P. Spirtes, Eds. AUAI Press, 2010.
 [20] M. Deisenroth, “Efficient reinforcement learning using Gaussian processes,” Ph.D. dissertation, Karlsruher Institut für Technologie, 2010.