Variational joint filtering

Variational joint filtering

Yuan Zhao
Department of Neurobiology and Behavior
Center for Neural Circuit Dynamics
Institute for Advanced Computing Science
Institute of AI-driven Discovery and Innovation
Stony Brook University
Stony Brook, NY 11794
&Il Memming Park
Department of Neurobiology and Behavior
Center for Neural Circuit Dynamics
Institute for Advanced Computing Science
Institute of AI-driven Discovery and Innovation
Stony Brook University
Stony Brook, NY 11794

New technologies for recording the activity of large neural populations during complex behavior provide exciting opportunities for investigating the neural computations that underlie perception, cognition, and decision-making. Nonlinear state-space models provide an interpretable signal processing framework by combining an intuitive dynamical system with a probabilistic observation model, which can provide insights into neural dynamics, neural computation, and development of neural prosthetics and treatment through feedback control. It yet brings the challenge of learning both latent neural state and the underlying dynamical system because neither is known for neural systems a priori. We developed a flexible online learning framework for latent nonlinear state dynamics and filtered latent states. Using the stochastic gradient variational Bayes approach, our method jointly optimizes the parameters of the nonlinear dynamical system, the observation model, and the black-box recognition model. Unlike previous approaches, our framework can incorporate non-trivial distributions of observation noise and has constant time and space complexity. These features make our approach amenable to real-time applications and the potential to automate analysis and experimental design in ways that testably track and modify behavior using stimuli designed to influence learning.


126ct \forLoop126ct

1 Introduction

Discovering interpretable structure online from a streaming high-dimensional time series has many applications in science and engineering. State space models have been successful in providing a succinct description, explaining observed time series as trajectories in a low-dimensional state-space. Taking a step further, state-space models equipped with nonlinear dynamics provide an opportunity to describe the latent laws of the system that is generating the seemingly entangled time series Haykin1998 ; Ko2009 ; Mattos2016 . Specifically, we would like to identify a continuous nonlinear dynamics in the state-space that captures the spatiotemporal structure of a noisy observation :

(state dynamics) (1a)
(observation model) (1b)

where and are continuous functions, is control input, and denotes a probability distribution.

In practice, the continuous-time state dynamics is more conveniently formulated in discrete time as,

(state dynamics) (2)

where is intended to capture unobserved perturbations of the state . Such (spatially) continuous state-space models are natural in many applications where the changes are slow and the underlying system follows physical laws and constraints (e.g., object tracking), or where learning the laws are of great interest (e.g. in neuroscience and robotics) Roweis2001 ; Mante2013 ; Sussillo2013 ; Frigola2014 .

If the nonlinear state-space model is fully specified, Bayesian inference methods can be employed to estimate the current state (either the filtering distribution or the smoothing distribution ), and predict future states and observations for  Ho1964 ; Sarkka2013 . However, in many applications, the challenge is in learning the parameters of the state-space model (a.k.a. the system identification problem). Learning both the latent trajectory in state-space and the latent (nonlinear) dynamical system is known as the joint estimation problem Haykin2001 . Expectation maximization (EM) based methods have been widely used in practice Ghahramani1999 ; Valpola2002 ; Turner2010 ; Golub2013 , and more recently variational autoencoder methods Archer2015a ; Krishnan2015 ; Johnson2016 ; Krishnan2017 ; Karl2017 ; Watter2015 have been proposed, all of which are designed for offline analysis, and not appropriate for real-time applications. Recursive stochastic variational inference has been successful in streaming data assuming independent samples Broderick2013 , however, in the presence of temporal dependence, proposed variational algorithms (e.g. Frigola2014 ) remain theoretical and lack testing.

In this study, we are interested in real-time signal processing and state-space control setting Golub2013 where online algorithms are needed that can recursively solve the joint estimation problem on streaming observations. A popular solution to this problem exploits the fact that online state estimators for nonlinear state-space models such as extended Kalman filter (EKF) or unscented Kalman filter (UKF) can be used for nonlinear regression formulated as a state-space model:

(parameter diffusion) (3a)
(least-squares regression) (3b)

where is a function approximator parameterized by that maps to . Therefore, by augmenting the state-space with the parameters, one can build an online dual estimator using nonlinear Kalman filtering Wan2000 ; Wan2001 . However, they involve coarse approximation of Bayesian filtering, involve many hyperparameters, do not take advantage of modern stochastic gradient optimization, and are not easily applicable to arbitrary observation likelihoods. There are also closely related online version of EM-type algorithms Roweis2001 that share similar concerns. In hopes of lifting these concerns, we derive an online black-box variational inference framework, referred to as variational joint filtering (VJF), applicable to a wide range of nonlinear state-space models and observation models, that is, the computational demand of the algorithm is constant per time step.

2 Variational Principle for Online Joint Estimation

The crux of recursive Bayesian filtering is updating the posterior over the latent state one step at a time.


where the input and parameters are omitted for brevity. Unfortunately, the exact calculations of (4) are not tractable in general, especially for nonlinear dynamics models and/or non-conjugate distributions. We developed a recursive variational Bayesian filter by deriving an evidence lower bound for the marginal likelihood as the objective function. Let denote an arbitrary probability measure which will eventually approximate the filtering density . From (4a), we have

where denotes Shannon’s entropy and denotes the Kullback-Leibler divergence Cover1991 . Maximizing for this lower bound would result in a variational posterior . Naturally we plug in the previous step’s solution to the next time step, obtaining a loss function suitable for recursive estimation:


This also results in consistent for all time steps as they are in the same family of distribution.

Online inference is achieved by maximizing this objective w.r.t. the parameters (omitted for brevity) of the observation model , state dynamics and the variational posterior distribution provided that takes some parameterized form and is estimated from the previous time step. Maximizing is equivalent to minimizing the two variational gaps: (1) the variational filtering posterior has to be close to the true filtering posterior, and (2) the filtering posterior from the previous step needs to be close to . Note that this second gap is invariant to if , that is, the one-step backward smoothing distribution is identical to the filtering distribution.

On the flip side, intuitively, there are three components in that are jointly optimized: (1) reconstruction log-likelihood which is maximized if concentrates around the maximum likelihood estimate given only , (2) the dynamics log-likelihood which is maximized if concentrates at around the maximum of , and (3) the entropy that expands and prevents it from collapsing to a point mass.

In order for this recursive estimation to be real-time, we choose to be a multivariate normal with diagonal covariance . Moreover, to amortize the computational cost of optimization to obtain the best on each time step, we employ the variational autoencoder architecture Hinton1995 to parameterize with a recognition model. Intuitively, the recognition model embodies the optimization process of finding , that is, it performs an approximate Bayesian filtering computation (in constant time) of (4) according to the objective function . We use a recursive recognition network model that maps and to . In particular, the recognition model is a deterministic recurrent neural network (RNN; with no extra hidden state):


Specifically we use a simple multi-layer perceptron as in this study. Note that the Markovian architecture of the recognition model reflects the Markovian structure of filtering computation (c.f., smoothing networks often use bidirectional RNN Sussillo2016 or graphical models Archer2015a ; Johnson2016 ).

The expectations appearing in the reconstruction log-likelihood and dynamics log-likelihood are not always tractable in general. For those intractable cases, one can use the reparameterization trick and stochastic variational Bayes Rezende2014 ; Kingma2014 : rewriting the expectations over as expectation over a standard normal random variable, and using a single sample for each time step. Hence, in practice, we optimize the following objective function (the other variables and parameters are omitted for brevity),


where and represent random samples from and respectively. Note that the remaining expectation over has closed form solution for Gaussian state noise. Thus, our method can handle arbitrary observation and dynamics model unlike dual form nonlinear Kalman filtering methods.

The schematics of the proposed model is summarized by two passes in Figure 1. In the forward pass, the previous latent state generates the new state through the state dynamics, and the new state transforms into the observation through the observation model. In the backward pass, the recognition model recovers the current latent state from the observation, and the three components, observation model, recognition model and dynamical system, are updated by backpropagation.

Figure 1: Schematics of variational joint filtering. Blue arrows indicates forward pass in which the previous latent state generates the new state through the state dynamics, and the new state transforms into the observation through the observation model. Red arrows indicate backward pass in which the recognition model recovers the current latent state from the observation, and the three components, observation model, recognition model and dynamical system, are updated by backpropagation.

Denote the set of all parameters by from the generative, recognition and dynamics models. The objective function in Eq. (7) is differentiable w.r.t. , and thus we employ empirical Bayes and optimize it through stochastic gradient ascent (using Adam Kingma2014a ). Algorithm 1 is an overview of the recursive estimation algorithm. We outline the algorithm for a single vector time series, but we can filter multiple sequences with a common state-space model simultaneously, in which case the gradients are averaged across the instantiations.

procedure VJF()
      Draw random samples
      Recognition model
     Update with Stochastic gradient step
     return and
end procedure
Algorithm 1 Variational Joint Filtering (single step)

Note that this algorithm has constant time and space complexity per time step.

3 Application to Latent Neural Dynamics

Our primary applied aim is real-time neural interfaces where a population of neurons are recorded while a low-dimensional stimulation is delivered Newman2015 ; ElHady2016 ; Hocker2019a . State-space modeling of such neural time series have been successful in describing population dynamics Macke2011 ; Zhao2017 . Moreover, models of neural computation are often described as dynamical systems Hopfield1982 ; Dayan2001 ; Barak2013 . For example, attractor dynamics where the convergence to one of the attractors represents the result of computation Wang2002 ; Nassar2018b . Here we propose a parameterization and tools for visualization of the model suitable for studying neural dynamics and building neural interfaces Zhao2016 .

3.1 Parameterization

Having in mind high-temporal resolution neural spike trains where each time bin has at most one action potential, we describe the case for a point process observation. However, note that our method generalizes to arbitrary observation likelihood appropriate for other modalities, including calcium imaging or local field potentials. Our observed time series is a stream of sparse binary vectors. All analysis are done with time bins containing one event at most each for Poisson observation in this study.

Our observation model assumes that the spike train observation is sampled from a probability distribution determined by the latent state though a linear-nonlinear map possibly together with extra parameters at each time ,


where is a point-wise nonlinearity making rates non-negative. We use the canonical link for Poisson likelihood in this study. Note that this model is not identifiable since where is an arbitrary invertible matrix. Also, the mean of can be traded off with the bias term in . It is straightforward to include more additive exogenous variables, history-filter for refractory period, coupling between processes, and stimulation artifacts Pillow2008 ; Truccolo2005 .

For state dynamics we propose to use a specific parameterization with additive state transition function and input interaction as a special case of Eq. (2),


where is a vector of continuous basis functions, i.e. and can be linear or locally linear interaction parameterized as a linear combination of ’s as . In this study, we use squared exponential radial basis functions Roweis2001 ; Frigola2014 ; Sussillo2013 ; Zhao2016 ,


with centers and corresponding inverse squared kernel width .

The time complexity of our algorithm is where denote the dimensions of observation, latent space, input, the numbers of hidden units and radial basis functions for this specific parameterization. Note that the time complexity does not grow with time that enable efficient online inference. If we compare this to an efficient offline algorithm such as PLDS Macke2011 run repeatedly for every new observation (“online mode”), its time complexity is per time step at time which grows as time passes, making it impractical for real-time application.

We also employ the sparse Gaussian process (GP) to model the dynamics. The sparse GP enables Bayesian inference on the dynamics. We derived variational inference on the sparse GP. The details can be found in S.3. The RBF parameterization can be considered as a special case of sparse GP.

3.2 Phase portrait analysis

The function directly represents the velocity field of an underlying smooth dynamics (1a) in the absence of input Roweis2001 ; Zhao2016 . We visualize the estimated dynamics as phase portrait which consists of the vector field, example trajectories, and estimated dynamical features (namely fixed points) Strogatz2000 . We numerically identify candidate fixed points that satisfy . For the simulation studies, we do an affine transformation to orient the phase portrait to match the canonical equations in the main text when the simulation is done through the proposed observation model.

3.3 Prediction

The prediction contains both latent trajectory and observation which last for steps by sampling from:


given estimated parameters without seeing the data during these steps.

4 Experiments

4.1 Simulated data

Many theoretical models have been proposed in neuroscience to represent different schemes of computation. For the purpose of visualization, we choose to simulate from two or three dimensional dynamical systems. We apply the proposed method to four such low-dimensional models: a ring attractor model as a model of internal head direction representation, an nonlinear oscillator as a model of rhythmic population-wide activity, a biophysically realistic cortical network model for a visual discrimination experiment and a chaotic attractor.

The observations are either Gaussian or point-process (200 dimensions unless otherwise mentioned) via state dynamics driven processes. The average firing rate of the simulated spike trains are kept below 40 Hz. We refer to their conventional formulations under different coordinate systems, but our simulations and inferences are all done in Cartesian coordinates.

The approximate posterior distribution is defined recursively in Eq. (6) as diagonal Gaussian with mean and variance determined by corresponding observation, input and previous step via a recurrent neural network. We use one hidden layer MLP in this study. Typically the state noise variance is unknown and has to be estimated from data. To be consistent with Eq. (9c), we set the starting value of to be , and hence . We initialize the loading matrix by factor analysis, and column-wisely normalize it by norm every iteration to keep the system identifiable.

4.1.1 Ring attractor

First, we study the following two-variable ring attractor system:


where represents the direction driven by input , and is the radial component representing an internal circular variable, such as head direction. We simulate 100 trajectories of 1000 steps with step size , , , plus Gaussian noise (). We use strong input (tangent drift) to keep the trajectories flowing around the ring clockwise or counter-clockwise. We use 20 radial basis functions for dynamics model and 100 hidden units for the recognition model.

Figure 2: Ring attractor model. (\subreffig:ring:traj) One latent trajectory (black) in the training set and the corresponding filtered mean (blue). (\subreffig:ring:phase) Velocity field reconstructed from the trained proposed model. The colored streamlines indicates the speed and the directions. The black crosses are candidate fixed points obtained from inferred dynamics. Note the collection of fixed points around the ring shape. The central fixed point is unstable. (\subreffig:ring:density) Density of the posterior means. The density of inferred means of all trajectories in the training set. The higher it is, the more confidence we have on the inferred dynamics where we have more data. (\subreffig:ring:converge) Convergence on the ring attractor. We display the three components of the objective lower bound: reconstruction log-likelihood, dynamics log-likelihood, entropy, and the lower bound itself from Eq. (5).

Figure 1(a) illustrates one latent trajectory (black) and its variational posterior mean (blue). These two trajectories start at green circle and diamond respectively and end at the red markers. The inference starts near the center (origin) that is relatively far from the true location because the initial posterior mean is set at zero. The final states are very close which implies that the recognition model works well. Figure 1(b) shows the reconstructed velocity field by the model. We visualize the velocity as colored directional streamlines. We can see the velocity toward the ring attractor and the speed is smaller closer to the ring. The model also identifies a number of fixed points arranged around the ring attractor via numerical roots finding. Figure 1(c) shows the distribution of posterior means of all data points in the state-space. We have more confidence of the inferred dynamical system in the denser area.

Figure 1(d) shows the three components of (5) and the objective lower bound clearly, demonstrating the convergence of the algorithm. We can see each component reaches a plateau within sec. As the reconstruction and dynamics log-likelihoods increase, the recognition model and dynamical model are getting more accurate while the decreasing entropy indicates the increasing confidence (inverse posterior variance) on the inferred latent states. The average computation time of a joint estimation step is  ms (hardware specification: Intel Xeon E5-2680 2.50GHz, 128GB RAM, no GPU).

4.1.2 Nonlinear oscillator

We use a 2-dimensional relaxation oscillator with the following nonlinear state dynamics:


where is the membrane potential, is a recovery variable and is the magnitude of stimulus current in modeling single neuron biophysics Izhikevich2007 . This model was used to model global brain state that fluctuates between two levels of excitability in anesthetized cortex Curto2009 . We use the following parameter values , , and to simulate 100 trajectories of 1000 steps with step size and Gaussian noise (std=). At this regime, unlike the ring attractor, the spontaneous dynamics is a periodic oscillation, and the trajectory follows a limit cycle.

We use 20 radial basis functions for dynamics model and 100 hidden units for recognition model. While training the model, the input was clamped to zero, and expect the model to learn the spontaneous oscillator.

We also reconstruct the phase portrait (Fig. 2(b)) comparing to the truth (Fig. 2(c)). The two dashed lines are the theoretical nullclines of the true model on which the velocity of corresponding dimension is zero. The reconstructed field shows a low speed valley overlapping with the nullcline especially on the right half figure. The intersection of two nullclines is a unstable fixed point. We can see the identified fixed point is close to the intersection. As most of the trajectories lie on the oscillation path with merely few data points elsewhere, the inferred system shows the oscillation dynamics similar to the true system around the data region. The difference mostly happens in the region far from the trajectories because of the lack of data.

We run a long-term prediction using the proposed model without seeing the future data during these steps () beginning at the final state of training data. We show the truth and prediction in Figure 3(a). The upper row is the true latent trajectory and corresponding observations. The lower row is the filtered trajectory and prediction by our proposed method. The light-colored parts are the 500 steps of inference before prediction and the solid-colored parts are 1000 steps truth and prediction. We also show the sample observations from the trained observation model during the prediction period.

One of the popular latent process modeling tools for point process observation that can make prediction is Poisson Linear Dynamical System (PLDS) Macke2011 which assumes latent linear dynamics. We compare PLDS fit with EM on its long-term prediction on both the states and spike trains (Fig. 3(a)).

Figure 3: Nonlinear oscillator (FitzHugh-Nagumo) model. (\subreffig:fhn:density) One inferred latent trajectory and the density of posterior means of all trajectories. Most of the inferred trajectory lie on the oscillation path. (\subreffig:fhn:phase) Velocity field reconstructed by the inferred dynamical system. (\subreffig:fhn:phase_true) Velocity field of the true dynamical system. The dashed lines are two nullclines of the true model on which the gradients are zero so as the velocity.
Figure 4: Long-term prediction. (\subreffig:fhn:density) 1000-step prediction continuing the trajectory and sampled spike trains compared to ground truth from Figure 2(a). (\subreffig:fhn:rmse) Mean (solid line) and standard error (shade) of root mean square error of prediction of 2000 trials. The prediction started at the same states for the true system and models. Note that PLDS fails to predict long term due to its linear dynamics assumption. A linear dynamical system without noise can only produce damped oscillations.

We run LFADS Sussillo2016 using the same data. LFADS implements its dynamical model with GRU that requires high dimensions. For this 2D system, we tried different GRU dimensionality. We made minimal changes to its recommended setting including only the generator dimensionality, batch and no controller. The result shows that LFADS requires much higher dimension than the true system to capture the oscillation. (The figure of its inferred trajectories is shown in the supplement.) We report the fitted log-likelihood per time bin , and for 2D, 20D and 50D GRU respectively. In comparison, the log-likelihood of the proposed approach is with a 2D dynamical model.

4.1.3 Fixed point attractor for decision-making

Perceptual decision-making paradigm is a well-established cognitive task where typically a low-dimensional decision variable needs to be integrated over time, and subjects are close to optimal in their performance. To understand how the brain implements such neural computation, many competing theories have been proposed Barak2013 ; Mante2013 ; Ganguli2008 ; Wang2002 ; Wong2006 . We test our method on a simulated biophysically realistic cortical network model for a visual discrimination experiment Wang2002 . In the model, there are two excitatory subpopulations that are wired with slow recurrent excitation and feedback inhibition to produce attractor dynamics with two stable fixed points. Each fixed point represents the final perceptual decision, and the network dynamics amplify the difference between conflicting inputs and eventually generates a binary choice.

We subsample 480 selective neurons out of 1600 excitatory neurons from the simulation to be observed by our algorithm. The simulated data is organized into decision-making trials where each trial lasts for 2 sec and with different strength of visual evidence, controlled by “coherence”. Our method with 20 basis functions learned the dynamics from 140 training trials (20 per coherence where max is ). We visualize the velocity field at zero stimulus as colored streamlines in Figure 5. The left panel is the estimated velocity field and the right panel is the theoretical velocity field of a theoretically reduced model Wong2006 . Note that the spiking neuron model where the simulation comes from is different from the reduced model. Here we want to show that our method can provide a qualitatively similar result to the theoretical work which reduces the dimensionality and complexity of the original network.

Figure 5: Inferred (\subreffig:wang_velocity) and theoretical phase portrait (\subreffig:wong_field) for a recurrent spiking neural network performing decision making (zero stimulus). Although there are 2000 neurons in the simulated network, the collective population dynamics can be reduced to 2 dimensions Wong2006 . The red dots in (\subreffig:wang_velocity) are the inferred final states of zero-coherent trials. In (\subreffig:wong_field), the black dots are fixed points (the solid are stable and the gray are unstable) Although the absolute arrangement is dissimilar, the relative topology and relation of the three identified fixed points show correspondence.
(\subreffig:rmse:lds:prediction) Mean RMSEs of one-step-ahead prediction of nonstationary system during online learning (50 trials). The linear system was changed and perturbed at the 2000th step. The lines are average RMSEs. Both the online algorithms learned the change after a few steps.

4.1.4 Nonstationary system

We compare dual EKF and the proposed approach on non-stationary linear dynamical system. A spiral-in linear system was changed from clockwise to counter-clockwise at the 2000th step and perturbed. The observations were Gaussian. To focus on the dynamics, we fixed all the parameters but transition matrix of both methods except that our approach still have to learn the recognition model. Figure 4(c) shows that our approach achieved the same online performance as dual EKF in this experiment.

4.2 Neurophysiological Data

We apply the proposed method to a large scale recording to validate that it picks up meaningful dynamics. The dataset Graf2011 consists of 148 simultaneously recorded single units from the primary cortex (V1). In the experiment, directional drifting gratings were presented to an anesthetized monkey for around 1.3s, each direction of which was repeated 50 trials.

We finally fit a 2D dynamics model to the 50 trials of 63 well-tuned Graf2011 units of one same direction, and the spike counts of 1ms bin size. Due to low signal-to-noise ratio (SNR), we initialize the observation model with dimensionality reduction methods such as variational latent Gaussian processes (vLGP), and then train the model in online mode, each components alternatively. We also feed the data multiple times during training.

There is one continuous circular variables in the stimuli space: temporal phase of oscillation. In hopes that the inferred dynamical system would implement the oscillation, we did not included the stimulus drive as the input in the model. As expected, Figure 6 shows the inferred dynamical system is able to implement the oscillation.

Figure 6: (\subreffig:v1_velocity) Phase portrait of the inferred dynamical system (arrows: direction, blue: low speed, red: high speed) (\subreffig:v1_sim) Trajectories simulated from the inferred dynamical system (1000 steps, black lines: trajectories, green circles: initial states, red diamonds: final states).

5 Discussion

New and future technologies for recording the activity of large neural populations during meaningful behavior will provide exciting opportunities for investigating the neural computations that underlie perception, cognition, and decision-making. However, the datasets provided by these technologies currently require sophisticated offline analyses that slow down the scientific cycle of experiment, data analysis, hypothesis generation, and further experiment. Moreover, in closed-loop neurophysiological setting, real-time adaptive algorithms are extremely valuable. We propose an online algorithm for recursive variational Bayesian inference for joint system identification and filtering that can greatly impact neuroscience research and biomedical engineering. Our algorithm is flexible—allows wide range of likelihood and dynamics, computationally tractable, and produces interpretable visualizations of complex collective network dynamics.

This work opens many avenues for future work. One direction is to apply this model to large-scale neural recording from a behaving animal. We hope that further development would enable on-the-fly analysis of high-dimensional neural spike train during electrophysiological experiments. Clinically, a nonlinear state-space model provides a basis for nonlinear feedback control as a potential treatment for neurological diseases that arise from diseased dynamical states.


  • (1) S. Haykin and J. Principe, “Making sense of a complex world [chaotic events modeling],” IEEE Signal Processing Magazine, vol. 15, no. 3, pp. 66–81, May 1998.
  • (2) J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, may 2009.
  • (3) C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence, “Recurrent gaussian processes,” International Conference on Learning Representations (ICLR), 2016.
  • (4) S. Roweis and Z. Ghahramani, Learning nonlinear dynamical systems using the expectation-maximization algorithm.    John Wiley & Sons, Inc, 2001, pp. 175–220.
  • (5) V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome, “Context-dependent computation by recurrent dynamics in prefrontal cortex,” Nature, vol. 503, no. 7474, pp. 78–84, Nov. 2013.
  • (6) D. Sussillo and O. Barak, “Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks,” Neural Computation, vol. 25, no. 3, pp. 626–649, Mar. 2013.
  • (7) R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational gaussian process state-space models,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada, 2014, pp. 3680–3688.
  • (8) Y. Ho and R. Lee, “A Bayesian approach to problems in stochastic estimation and control,” IEEE Transactions on Automatic Control, vol. 9, no. 4, pp. 333–339, Oct. 1964.
  • (9) S. Särkkä, Bayesian filtering and smoothing.    Cambridge University Press, 2013.
  • (10) S. S. Haykin, Kalman filtering and neural networks.    Wiley, 2001.
  • (11) Z. Ghahramani and S. T. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in Advances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds.    MIT Press, 1999, pp. 431–437.
  • (12) H. Valpola and J. Karhunen, “An unsupervised ensemble learning method for nonlinear dynamic State-Space models,” Neural Computation, vol. 14, no. 11, pp. 2647–2692, Nov. 2002.
  • (13) R. Turner, M. Deisenroth, and C. Rasmussen, “State-Space Inference and Learning with Gaussian Processes,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 868–875.
  • (14) M. D. Golub, S. M. Chase, and B. M. Yu, “Learning an internal dynamics model from control demonstration.” JMLR workshop and conference proceedings, pp. 606–614, 2013.
  • (15) E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski, “Black box variational inference for state space models,” ArXiv e-prints, Nov. 2015.
  • (16) R. G. Krishnan, U. Shalit, and D. Sontag, “Deep Kalman filters,” arXiv, vol. abs/1511.05121, Nov. 2015.
  • (17) M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, “Composing graphical models with neural networks for structured representations and fast inference,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.    Curran Associates, Inc., 2016, pp. 2946–2954.
  • (18) R. G. Krishnan, U. Shalit, and D. Sontag, “Structured inference networks for nonlinear state space models,” arXiv, vol. abs/1511.05121, 2016.
  • (19) M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, “Deep variational Bayes filters: Unsupervised learning of state space models from raw data,” in 5th International Conference on Learning Representations, 2017.
  • (20) M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds.    Curran Associates, Inc., 2015, pp. 2746–2754.
  • (21) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, “Streaming variational bayes,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds.    Curran Associates, Inc., 2013, pp. 1727–1735.
  • (22) E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter for nonlinear estimation,” in Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373).    Lake Louise, Alta., Canada: IEEE, Aug. 2000, pp. 153–158.
  • (23) E. A. Wan and A. T. Nelson, Dual extended Kalman filter methods.    John Wiley & Sons, Inc, 2001, pp. 123–173.
  • (24) T. M. Cover and J. A. Thomas, Elements of Information Theory.    Wiley-Interscience, Aug. 1991.
  • (25) G. Hinton, P. Dayan, B. Frey, and R. Neal, “The "wake-sleep" algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, May 1995.
  • (26) D. Sussillo, R. Jozefowicz, L. F. Abbott, and C. Pandarinath, “LFADS - latent factor analysis via dynamical systems,” arXiv, vol. abs/1608.06315, Aug. 2016.
  • (27) D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in International Conference on Machine Learning, May 2014.
  • (28) D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv:1312.6114 [cs, stat], May 2014, arXiv: 1312.6114.
  • (29) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • (30) J. P. Newman, M.-f. Fong, D. C. Millard, C. J. Whitmire, G. B. Stanley, and S. M. Potter, “Optogenetic feedback control of neural activity,” eLife, 2015.
  • (31) A. El Hady, Closed Loop Neuroscience, A. El Hady, Ed.    Academic Press, 2016.
  • (32) D. Hocker and I. M. Park, “Myopic control of neural dynamics,” PLOS Computational Biology, 2019. [Online]. Available:
  • (33) J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani, “Empirical models of spiking in neural populations,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds.    Curran Associates, Inc., 2011, pp. 1350–1358.
  • (34) Y. Zhao and I. M. Park, “Variational latent Gaussian process for recovering single-trial dynamics from population spike trains,” Neural Computation, vol. 29, no. 5, May 2017.
  • (35) J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, Apr. 1982.
  • (36) P. Dayan and L. F. Abbott, Theoretical neuroscience: computational and mathematical modeling of neural systems.    Massachusetts Institute of Technology Press, 2001.
  • (37) O. Barak, D. Sussillo, R. Romo, M. Tsodyks, and L. F. Abbott, “From fixed points to chaos: three models of delayed discrimination.” Progress in neurobiology, vol. 103, pp. 214–222, Apr. 2013.
  • (38) X.-J. Wang, “Probabilistic decision making by slow reverberation in cortical circuits,” Neuron, vol. 36, no. 5, pp. 955–968, Dec. 2002.
  • (39) J. Nassar, S. W. Linderman, M. Bugallo, and I. M. Park, “Tree-structured recurrent switching linear dynamical systems for multi-scale modeling,” in International Conference on Learning Representations (ICLR), Nov. 2019. [Online]. Available:
  • (40) Y. Zhao and I. M. Park, “Interpretable nonlinear dynamic modeling of neural trajectories,” in Advances in Neural Information Processing Systems (NIPS), 2016.
  • (41) J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, “Spatio-temporal correlations and visual signalling in a complete neuronal population,” Nature, vol. 454, no. 7207, pp. 995–999, Jul. 2008.
  • (42) W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, “A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects,” Journal of Neurophysiology, vol. 93, no. 2, pp. 1074–1089, Feb. 2005.
  • (43) S. H. Strogatz, Nonlinear Dynamics and Chaos, 1st ed., ser. Studies in nonlinearity.    The Perseus Books Group, Jan. 2000.
  • (44) E. M. Izhikevich, Dynamical systems in neuroscience : the geometry of excitability and bursting, ser. Computational neuroscience.    MIT Press, 2007.
  • (45) C. Curto, S. Sakata, S. Marguet, V. Itskov, and K. D. Harris, “A simple model of cortical dynamics explains variability and state dependence of sensory responses in Urethane-Anesthetized auditory cortex,” The Journal of Neuroscience, vol. 29, no. 34, pp. 10 600–10 612, Aug. 2009.
  • (46) S. Ganguli, J. W. Bisley, J. D. Roitman, M. N. Shadlen, M. E. Goldberg, and K. D. Miller, “One-dimensional dynamics of attention and decision making in LIP,” Neuron, vol. 58, no. 1, pp. 15–25, Apr. 2008.
  • (47) K.-F. Wong and X.-J. Wang, “A recurrent network mechanism of time integration in perceptual decisions,” The Journal of Neuroscience, vol. 26, no. 4, pp. 1314–1328, Jan. 2006.
  • (48) A. B. A. Graf, A. Kohn, M. Jazayeri, and J. A. Movshon, “Decoding the activity of neuronal populations in macaque primary visual cortex,” Nature Neuroscience, no. 2, pp. 239–245, Jan. 2011.
  • (49) W. Maass, T. Natschläger, and H. Markram, “Real-time computing without stable states: A new framework for neural computation based on perturbations,” Neural Computation, vol. 14, pp. 2531–2560, 2002.
  • (50) R. Laje and D. V. Buonomano, “Robust timing and motor patterns by taming chaos in recurrent neural networks,” Nature Neuroscience, vol. 16, no. 7, pp. 925–933, May 2013.


s.1 Nonlinear oscillator

(a) 2D,
(b) 20D,
(c) 50D,
Figure 7: The first 2 PCs of inferred latent trajectories of FHN system by LFADS. We fit LFADS with 2D, 20D and 50D latent state. LFADS inherently requires much higher-dimensional latent space to recover the oscillation. We report the fitted log-likelihoods per time bin. On the contrary, the proposed approach gives a log-likelihood -0.1142.

s.2 Chaotic dynamics

Chaotic dynamics (or edge-of-chaos) has been postulated to support asynchronous states in the cortex, and neural computation over time by generating rich temporal patterns Maass2002 ; Laje2013 . We consider the 3-dimensional standard Lorenz attractor as an example chaotic system. We simulate 216 latent trajectories from:


The initial states of the trajectories are evenly-spaced. We discard the first 500 transient steps of each trajectory and then use the following 1000 steps. We generate 200-dimensional Gaussian observations driven by the trajectories. Figure 8 shows the inference and prediction. We continue simulating a trajectory for 2000 more steps from the final state where the training data end, and predict the latent state through the proposed model. The black lines show the three dimensions of true state respectively and the red lines are the model-prediction. We reset the prediction back to the true state every 500 steps.

Figure 8: Lorenz attractor. (7(a)) The true latent trajectories and their inferences. (7(b)) We simulate the system for 2000 more steps continuing at the final state where the training data end and predict the latent state through the trained model. The black lines show the three dimensions of true state respectively and the red lines are the model-prediction. We reset the prediction back to the true state every 500 steps.

s.3 Sparse GP dynamics

s.3.1 Model specification


s.3.2 Variational sparse GP

Each component of the vector-valued function is defined by a Gaussian Process


Then for any finite set of points has a covariance matrix given the covariance matrix between the components. The state noise covariance becomes as well. We introduce the inducing variable . Let , and omit the control input. Then we have



Introduce inducing variable

where denotes the variational approximation to .

Assumed that is a sufficient statistic of , we have


As is conditional on ,


and hence


The marginal distribution of is




As the marginal and conditional distributions in the integrand are Gaussian, so is the joint distribution, and the marginal distributions are




If we assume further that factorizes as


Usually sparse GP regression methods assume that

given .

and then

Jensen’s inequality

where can be seen as the parameters with prior.

Let . We have


Consider the derivative w.r.t. ,


Letting (34) equal gives


where the inner density can be sampled by (21). With the quadratic form, the optimal distribution is Gaussian if is so. Then we have