Streaming Variational Monte Carlo

# Streaming Variational Monte Carlo

## Abstract

Nonlinear state-space models are powerful tools to describe dynamical structures in complex time series. In a streaming setting where data are processed one sample at a time, simultaneous inference of the state and its nonlinear dynamics has posed significant challenges in practice. We develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our method provides an approximation of the filtering posterior which can be made arbitrarily close to the true filtering distribution for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently approximate a posterior over the dynamics using sparse Gaussian processes, allowing for an interpretable model of the latent dynamics. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications.

Nonlinear state-space modeling, online filtering, Bayesian machine learning
\forLoop

126ct \forLoop126ct

## 1 Introduction

Nonlinear state-space models are generative models for complex time series with underlying nonlinear dynamical structure [Haykin1998, Ko2009, Mattos2016]. Specifically, they represent nonlinear dynamics in the latent state-space, , that capture the spatiotemporal structure of noisy observations, :

 xt =fθ(xt−1,ut)+ϵt (state dynamics model) (1a) yt ∼P(yt∣gψ(xt)) (observation model) (1b)

where and are continuous vector functions, denotes a probability distribution, and is intended to capture unobserved perturbations of the state . Such state-space models have many applications (e.g., object tracking) where the flow of the latent states is governed by known physical laws and constraints, or where learning an interpretable model of the laws is of great interest, especially in neuroscience [Roweis2001, Sussillo2013, Frigola2014, Daniels2015, Zhao2016, Nassar2019]. If the parametric form of the model and the parameters are known a priori, then the latent states can be inferred online through the filtering distribution, , or offline through the smoothing distribution,  [Ho1964, Sarkka2013]. Otherwise the challenge is in learning the parameters of the state-space model, , which is known in the literature as the system identification problem.

In a streaming setting where data is processed one sample at a time, joint inference of the state and its nonlinear dynamics has posed significant challenges in practice. In this study, we are interested in online algorithms that can recursively solve the dual estimation problem of learning both the latent trajectory, , in the state-space and the parameters of the model, , from streaming observations [Haykin2001].

Popular solutions such, as the extended Kalman filter (EKF) or the unscented Kalman filter (UKF) [wan2000unscented], build an online dual estimator using nonlinear Kalman filtering by augmenting the state-space with its parameters [wan2000unscented, Wan2001]. While powerful, they usually provide coarse approximations to the filtering distribution and involve many hyperparameters to be tuned which hinder their practical performance. Moreover, they do not take advantage of modern stochastic gradient optimization techniques commonly used throughout machine learning and are not easily applicable to arbitrary observation likelihoods.

Recursive stochastic variational inference has been proposed for streaming data assuming either independent [Broderick2013] or temporally-dependent samples [Frigola2014, Zhao2017]. However the proposed variational distributions are not guaranteed to be good approximations to the true posterior. As opposed to variational inference, sequential Monte Carlo (SMC) leverages importance sampling to build an approximation to the target distribution in a data streaming setting [doucet2009tutorial, Doucet2013]. However, its success heavily depends on the choice of proposal distribution and the (locally) optimal proposal distribution usually is only available in the simplest cases [doucet2009tutorial]. While work has been done on learning good proposals for SMC [cornebise2014adaptive, gu2015neural, guarniero2017iterated, Naesseth2018] most are designed only for offline scenarios targeting the smoothing distributions instead of the filtering distributions. In [cornebise2014adaptive], the proposal is learned online but the class of dynamics for which this is applicable to is extremely limited.

In this paper, we propose a novel sequential Monte Carlo method for inferring a state-space model for the streaming time series scenario that adapts the proposal distribution on-the-fly by optimizing a surrogate lower bound to the log normalizer of the filtering distribution. Moreover, we choose the sparse Gaussian process (GP) [Titsias2009] for modeling the unknown dynamics that allows for recursive Bayesian inference. Specifically our contributions are:

1. We prove that our approximation to the filtering distribution converges to the true filtering distribution.

2. Our objective function allows for unbiased gradients which lead to improved performance.

3. To the best of our knowledge, we are the first to use particles to represent the posterior of inducing variables of the sparse Gaussian processes, which allows for accurate Bayesian inference on the inducing variables rather than the typical variational approximation and closed-form weight updates.

4. Unlike many efficient filtering methods that usually assume Gaussian or continuous observations, our method allows arbitrary observational distributions.

## 2 Streaming Variational Monte Carlo

Given the state-space model defined in (1), the goal is to obtain the latent state given a new observation, , where is a measurable space (typically or ). Under the Bayesian framework, this corresponds to computing the filtering posterior distribution at time

 p(xt|\vy1:t)=p(yt|xt)p(yt|\vy1:t−1)p(xt|\vy1:t−1) (2)

which recursively uses the previous filtering posterior distribution, .

However the above posterior is generally intractable except for limited cases [Haykin2001] and thus we turn to approximate methods. Two popular approaches for approximating (2) are sequential Monte Carlo (SMC) [doucet2009tutorial] and variational inference (VI) [blei2017variational, zhang2018advances, wainwright2008graphical]. In this work, we propose to combine sequential Monte Carlo and variational inference, which allows us to utilize modern stochastic optimization while leveraging the flexibility and theoretical guarantees of SMC. We refer to our approach as streaming variational Monte Carlo (SVMC). For clarity, we review SMC and VI in the follow sections.

### 2.1 Sequential Monte Carlo

SMC is a sampling based approach to approximate Bayesian inference that is designed to recursively approximate a sequence of distributions for , using samples from a proposal distribution, where are the parameters of the proposal [doucet2009tutorial]. Due to the Markovian nature of the state-space model in (1), the smoothing distribution, , can be expressed as

 p(\vx0:t|\vy1:t)∝p(x0)t∏j=1p(xt|xt−1)p(yt|xt). (3)

We enforce the same factorization for the proposal, .

A naive approach to approximating (3) is to use standard importance sampling (IS) [mcbook]. samples are sampled from the proposal distribution, , and are given weights according to

 wi0:t=p(xi0)∏tj=1p(xij|xij−1)p(yj|xij)r0(xi0;λ0)∏tj=1rj(xij|xij−1,yj;λj). (4)

The importance weights can also be computed recursively

 wi0:t=t∏s=0wis, (5)

where

 wis=p(ys|xis)p(xis|xis−1)rs(xis|xis−1,ys;λs). (6)

The samples and their corresponding weights, , are used to build an approximation to the target distribution

 p(\vx0:t|\vy1:t)≈^p(\vx0:t|\vy1:t)=N∑i=1wi0:t∑ℓwℓ0:tδ\vxi0:t (7)

where is the Dirac-delta function centered at . While straightforward, naive IS suffers from the weight degeneracy issue; as the length of the time series, , increases all but one of the importance weights will go to 0  [doucet2009tutorial].

To alleviate this issue, SMC leverages sampling-importance-resampling (SIS). Suppose at time , we have the following approximation to the smoothing distribution

 ^p(\vx0:t−1|\vy1:t−1)=1NN∑i=1wit−1∑ℓwℓt−1δ\vxi0:t−1, (8)

where is computed according to (6). Given a new observation, , SMC starts by resampling ancestor variables, with probability proportional to the importance weights . samples are then drawn from the proposal, and their importance weights are computed, , according to (6). The introduction of resampling allows for a (greedy) solution to the weight degeneracy problem. Particles with high weights are deemed good candidates and are propagated forward while the ones with low weights are discarded.

The updated approximation to is now

 ^p(\vx0:t|\vy1:t)=1NN∑i=1wit∑ℓwℓtδ\vxi0:t, (9)

where . Marginalizing out in (9) gives an approximation to the filtering distribution:

 p(xt|\vy1:t)=∫p(\vx0:t|\vy1:t)d\vx0:t−1≈∫N∑i=1wit∑ℓwℓtδ\vxi0:t=N∑i=1wit∑ℓwℓtδxit. (10)

As a byproduct, the weights produced in an SMC run yield an unbiased estimate of the marginal likelihood of the smoothing distribution [Doucet2013]:

 E[^p(\vy1:t)]=E[t∏s=11NN∑i=1wis]=p(\vy1:t), (11)

and a biased but consistent estimate of the marginal likelihood of the filtering distribution [Doucet2013, van2000asymptotic]

 E[^p(yt|\vy1:t−1)]=E[1NN∑i=1wit]. (12)

For completeness, we reproduce the consistency proof of (12) in section LABEL:proof:consistent of the appendix. The recursive nature of SMC makes it constant complexity per time step and constant memory because only the samples and weights generated at time are needed, , making them a perfect candidate to be used in an online setting [adali2010adaptive]. These attractive properties have allowed SMC to enjoy much success in fields such as robotics [thrun2002particle], control engineering [greenfield2003adaptive] and target tracking [gustafsson2002particle]. The success of an SMC sampler crucially depends on the design of the proposal distribution, . A common choice for the proposal distribution is the transition distribution, , which is known as the bootstrap particle filter (BPF) [gordon1993novel]. While simple, it is well known that BPF needs a large number of particles to perform well and suffers in high-dimensions [bickel2008sharp].

Designing a proposal is even more difficult in an online setting because a proposal distribution that was optimized for the system at time may not be the best proposal steps ahead. For example, if the dynamics were to change abruptly, a phenomenon known as concept drift [vzliobaite2016overview], the previous proposal may fail for the current time step. Thus, we propose to adapt the proposal distribution online using variational inference. This allows us to utilize modern stochastic optimization to adapt the proposal on-the-fly while still leveraging the theoretical guarantees of SMC.

### 2.2 Variational Inference

In contrast to SMC, VI takes an optimization approach to approximate Bayesian inference. In VI, we approximate the target posterior, , by a class of simpler distributions, , where are the parameters of the distribution. We then minimize a divergence (which is usually the Kullback-Leibler divergence (KLD)) between the posterior and the approximate distribution in the hopes of making closer to . If the divergence used is KLD, then minimizing the KLD between these distributions is equivalent to maximizing the so-called evidence lower bound (ELBO) [wainwright2008graphical, blei2017variational]:

 L (ϑt)=Eq[logp(xt,\vy1:t)−logq(xt;ϑt)], (13) =Eq[logEp(xt−1|\vy1:t−1)[p(xt,xt−1,\vy1:t)]−logq(xt;ϑt)].

For filtering inference, the intractability introduced by marginalizing over in (13) makes the problem much harder to optimize, rendering variational inference impractical in a streaming setting where incoming data are temporally dependent.

### 2.3 A Tight Lower Bound

Due to the intractability of the filtering distribution, the standard ELBO is difficult to optimize forcing us to define a different objective function. As stated above, we know that the sum of importance weights is an unbiased estimator of . Jensen’s inequality applied to (11[Naesseth2018, anh2018autoencoding] gives,

 logp(\vy1:t)=logE[^p(\vy1:t)]≥E[log^p(\vy1:t)]. (14)

Expanding (14), we obtain

 logp(yt∣\vy1:t−1)+logp(\vy1:t−1)≥E[log^p(yt∣\vy1:t−1)]+E[log^p(\vy1:t−1)], (15) logp(yt∣\vy1:t−1)≥E[log^p(yt∣\vy1:t−1)]−Rt(N) (16)

where is the variational gap. Leveraging this we propose to optimize

 ˜Lt(Θt)=E[log^p(yt∣\vy1:t−1)]−Rt(N),=E[N∑i=1logwit]−Rt(N). (17)

We call the filtering ELBO; it is a lower bound to the log normalization constant (log partition function) of the filtering distribution where accounts for the bias of the estimator (12).

There exists an implicit posterior distribution that arises from performing SMC given by [Cremer2017],

 ~q(xt ∣\vy1:t)=p(xt,\vy1:t)E[1^p(\vy1:t)] (18) =p(xt,yt∣\vy1:t−1)E[^p(yt∣\vy1:t−1)−1p(\vy1:t−1)^p(\vy1:t−1)]

As the number of samples goes to infinity (17) can be made arbitrarily tight; as a result, the implicit approximation to the filtering distribution (18) will become arbitrarily close to the true posterior, almost everywhere which allows for a trade-off between accuracy and computational complexity. We note that this result is not applicable in most cases of VI due to the simplicity of variational families used. As an example, we showcase the accuracy of the implicit distribution in Fig. 1. We summarize this result in the following theorem (proof in section LABEL:prood:filtering_elbo of the appendix).

###### Theorem 2.1 (Filtering ELBO).

The filtering ELBO (17), , is a lower bound to the logarithm of the normalization constant of the filtering distribution, . As the number of samples, , goes to infinity, will become arbitrarily close to .

Theorem 2.1 leads to the following corollary [del1996non] (proof in section LABEL:proof:corollary of the appendix).

###### Corollary 2.1.1.

Theorem 2.1 implies that the implicit filtering distribution, , converges to the true posterior as .

### 2.4 Stochastic Optimization

As in variational inference, we fit the parameters of the proposal, dynamics and observation model, , by maximizing the (filtering) ELBO (Alg. 1). While the expectations in (17) are not in closed form, we can obtain unbiased estimates of and its gradients with Monte Carlo. Note that, when obtaining gradients with respect to , we only need to take gradients of . We also assume that the proposal distribution, is reparameterizable, i.e. we can sample from by setting for some function where and is a distribution independent of . Thus we can express the gradient of (17) using the reparameterization trick [Kingma2014] as

 ∇Θt˜Lt =∇ΘtEs(ϵ1:L)[log^p(yt|\vy1:t−1)], (19) =Es(ϵ1:L)[∇Θtlog^p(yt|\vy1:t−1)], =Es(ϵ1:L)[∇ΘtL∑i=1logwit].

In Algorithm 1, we do times stochastic gradient descent (SGD) update for each step.

While using more samples, N, will reduce the variational gap between the filtering ELBO, , and , using more samples for estimating (19), L, may be detrimental for optimizing the parameters, as it has been shown to decrease the signal-to-noise ratio (SNR) of the gradient estimator for importance-sampling-based ELBOs [rainforth2018tighter]. The intuition is as follows: as the number of samples used to compute the gradient increases, the bound gets tighter and tighter which in turn causes the magnitude of the gradient to become smaller. The rate at which the magnitude decreases is much faster than the variance of the estimator, thus driving the SNR to . In practice, we found that using a small number of samples to estimate  (19), , is enough to obtain good performance.

### 2.5 Learning Dynamics with Sparse Gaussian Processes

State-space models allow for various time series models to represent the evolution of state and ultimately predict the future [Shumway2010]. While in some scenarios, there exists prior knowledge on the functional form of the latent dynamics , this is usually never the case in practice; thus, , must be learned online as well. While one can assume a parametric form for , i.e. a recurrent neural network, and learn through SGD this does not allow uncertainty over the dynamics to be expressed which is key for many real-time, safety-critical tasks. An attractive alternative over parametric models are Gaussian processes (GPs) [Rasmussen2006]. Gaussian processes do not assume a functional form for the latent dynamics; rather general assumptions, such as continuity or smoothness, are imposed. Gaussian processes also allow for a principled notion of uncertainty, which is key when predicting the future.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean and covariance functions. A GP allows one to specify a prior distribution over functions

 f(x)∼GP(m(x),k(x,x′)) (20)

where is the mean function and is the covariance function; in this study, we assume that . With the GP prior imposed on the dynamics, one can do Bayesian inference with data.

With the current formulation, a GP can be incorporated by augmenting the state-space to , where . The importance weights are now computed according to

 wt=p(yt|xt)p(xt|ft)p(ft|\vf1:t−1,\vx0:t−1)r(xt,ft|ft−1,xt−1,yt;λt). (21)

Examining (21), it is evident that naively using a GP is impractical for online learning because its space and time complexity are proportional to the number of data points, which grows with time , i.e., and respectively. In other words, the space and time costs increase as more and more observations are processed.

To overcome this limitation, we employ the sparse GP method [Titsias2009, Snelson2006]. We introduce inducing points, , where and are pseudo-inputs and impose that the prior distribution over the inducing points is . In the experiments, we spread the inducing points uniformly over a finite volume in the latent space. Under the sparse GP framework, we assume that is a sufficient statistic for , i.e.

 p(ft|\vx0:t−1,\vf1:t−1,\vz)=p(ft|xt−1,\vz)=N(ft|mt+KtzK−1zz\vz,Ktt−KtzK−1zzKzt), (22)

where . Note that the inclusion of the inducing points in the model reduces the computational complexity to be constant with respect to time. Marginalizing out in (22)

 p(xt|xt−1,\vz)=∫p(xt|ft)p(ft|xt−1,\vz)dft=N(xt|mt+KtzK−1zz\vz,Ktt−KtzK−1zzKzt+Q). (23)

Equipped with equation (23), we can express the smoothing distribution as

 p(\vx0:t,\vz|\vy1:t)∝p(x0)p(\vz)∏p(yt|xt)p(xt|xt−1,\vz), (24)

and the importance weights can be computed according to

 wt=p(yt|xt)p(xt|xt−1,\vz)p(\vz|\vx0:t−1)r(xt,\vz|xt−1,yt;λt). (25)

Due to the conjugacy of the model, can be recursively updated efficiently. Let . Given and by Bayes rule

 p(\vz|\vx0:t)∝p(xt|xt−1,\vz)p(\vz|\vx0:t−1). (26)

we obtain the recursive updating rule:

 Γt =(Γ−1t−1+A⊤tC−1tAt)−1 (27) μt =Γt[Γ−1t−1μt−1+A⊤tC−1t(xt−mt)]

where and .

To facilitate learning in non-stationary environments, we impose a diffusion process over the inducing variables. Letting , we impose the following relationship between and

 \vzt=\vzt−1+ηt, (28)

where . We can rewrite (25)

 wt=p(yt|xt)p(xt|xt−1,\vzt)p(\vzt|\vx0:t−1)r(xt,\vzt|xt−1,\vzt−1,yt;λt), (29)

where

 p(\vzt|\vx0:t−1)=∫p(\vzt|\vzt−1)p(\vzt−1|\vx0:t−1)d\vzt−1=N(μt−1,Γt−1+σ2zI). (30)

To lower the computation we marginalize out the inducing points from the model, simplifying (29)

 wt=p(yt|xt)p(xt|\vx0:t−1)r(xt|xt−1,yt;λt), (31)

where

 p(xt|\vx0:t−1)=∫p(xt|xt−1,\vzt)p(\vzt|\vx0:t−1)d\vzt=N(vt,Σt) (32)

where and .

For each stream of particles, we store and . Due to the recursive updates (27), maintaining and updating and is of constant time complexity, making it amenable for online learning. The use of particles also allows for easy sampling of the predictive distribution (details are in section LABEL:sec:supp:prediction of the appendix). We call this approach SVMC-GP; the algorithm is summarized in Algorithm 2.

### 2.6 Design of Proposals

As stated previously, the accuracy of SVMC depends crucially on the functional form of the proposal. The (locally) optimal proposal is

 p(xt|xt−1,yt)∝p(yt|xt)p(xt|xt−1) (33)

which is a function of and  [doucet2000sequential]. In general (33) is intractable; to emulate (33) we parameterize the proposal as

 r(xt|xt−1,yt)=N(μλt(ft,yt),σ2λt(ft,yt)I) (34)

where and are neural networks with parameter .

## 3 Related Works

Much work has been done on learning good proposals for SMC. The method proposed in [guarniero2017iterated] iteratively adapts its proposal for an auxiliary particle filter. In [cornebise2014adaptive], the proposal is learned online using expectation-maximization but the class of dynamics for which the approach is applicable for is extremely limited. The method proposed in [gu2015neural] learns the proposal by minimizing the KLD between the smoothing distribution and the proposal, ; while this approach can be used to learn the proposal online, biased importance-weighted estimates of the gradient are used which can suffer from high variance if the initial proposal is bad. Conversely, [Naesseth2018] maximizes , which can be shown to minimize the KLD between the proposal and the implicit smoothing distribution, ; biased gradients were used to lower the variance. In contrast, SVMC allows for unbiased and low variance gradients that target the filtering distribution as opposed to the smoothing distribution. In [xu2019learning], the proposal is parameterized by a Riemannian Hamiltonian Monte Carlo and the algorithm updates the mass matrix by maximizing . At each time step (and for every stochastic gradient), the Hamiltonian must be computed forward in time using numerical integration, making the method impractical for an online setting.

Many works have focused on learning latent dynamics online using (sparse) GPs [Frigola2014, kocijan2016modelling, bijl2017system, huber2014recursive, deisenroth2011robust]. However those proposed methods in the literature either make an approximation to the posterior of the GP and/or only work for certain emission models i.e. linear and Gaussian. The use of particles allows SVMC-GP to work for arbitrary emission models without making such simplified assumptions.

## 4 Experiments

To showcase the power of SVMC, we employ it on a number of simulated and real experiments.

### 4.1 Synthetic Data

#### Linear Dynamical System

As a baseline, we use the proposed method on a linear dynamical system (LDS)

 xt =Axt−1+ϵt, ϵt∼N(0,Q), (35) yt =Cxt+ξt, ξt∼N(0,R).

LDS is the de facto dynamical system for many fields of science and engineering due to its simplicity and efficient exact filtering (i.e., Kalman filtering). The use of an LDS also endows a closed form expression of the log marginal likelihood for the filtering and smoothing distribution. Thus, as a baseline we compare the estimates of the log marginal likelihood, , produced by SVMC, variational sequential Monte Carlo (VSMC) [Naesseth2018] (run in offline mode) and BPF [gordon1993novel] in an experiment similar to the one used in [Naesseth2018]. We generated data from (35) with , , , with and where the state and emission parameters are fixed; the true negative log marginal likelihood is . For both VSMC and SVMC, 100 particles were used for both computing gradients and computing the lower bound. The proposal for VSMC and SVMC was

 r(xt|xt−1;λt)=N(μt+diag(βt)Axt−1,diag(σ2t)) (36)

where . 25,000 gradient steps were used for VSMC and 25,000/50 = 5,000 gradient steps were done at each time step for SVCMC to equate the total number of gradient steps between both methods. To equate the computational complexity between SVMC and BPF, we ran the BPF with 125,000 particles. We fixed the data generated from (35) and ran each method for 100 realizations. The average negative ELBO and its standard error of each the methods are shown in Table 4.1.1.

From Table (4.1.1), we see that the use of unbiased gradients in SVMC allows for faster convergence than the biased gradients used in VSMC [Naesseth2018]. Interestingly, SVMC performs worse than a BPF with 125,000 particles. We note that this may be due to the choice of the proposal function; the optimal proposal function for particle filtering is a function of the observation, , and the dynamics, , while (36) is only a function of the dynamics. To support this claim, we ran SVMC with 10,000 particles with the proposal defined by (34). It is evident that the use of this improved proposal allows SVMC to outperform BPF with a much lower computational budget.

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters