Variational joint filtering
Abstract
New technologies for recording the activity of large neural populations during complex behavior provide exciting opportunities for investigating the neural computations that underlie perception, cognition, and decision-making. Nonlinear state-space models provide an interpretable signal processing framework by combining an intuitive dynamical system with a probabilistic observation model, which can provide insights into neural dynamics, neural computation, and development of neural prosthetics and treatment through feedback control. It yet brings the challenge of learning both latent neural state and the underlying dynamical system because neither is known for neural systems a priori. We developed a flexible online learning framework for latent nonlinear state dynamics and filtered latent states. Using the stochastic gradient variational Bayes approach, our method jointly optimizes the parameters of the nonlinear dynamical system, the observation model, and the black-box recognition model. Unlike previous approaches, our framework can incorporate non-trivial distributions of observation noise and has constant time and space complexity. These features make our approach amenable to real-time applications and the potential to automate analysis and experimental design in ways that testably track and modify behavior using stimuli designed to influence learning.
126ct \forLoop126ct
1 Introduction
Discovering interpretable structure online from a streaming high-dimensional time series has many applications in science and engineering. State space models have been successful in providing a succinct description, explaining observed time series as trajectories in a low-dimensional state-space. Taking a step further, state-space models equipped with nonlinear dynamics provide an opportunity to describe the latent laws of the system that is generating the seemingly entangled time series Haykin1998 ; Ko2009 ; Mattos2016 . Specifically, we would like to identify a continuous nonlinear dynamics in the state-space that captures the spatiotemporal structure of a noisy observation :
(state dynamics) | (1a) | ||||
(observation model) | (1b) |
where and are continuous functions, is control input, and denotes a probability distribution.
In practice, the continuous-time state dynamics is more conveniently formulated in discrete time as,
(state dynamics) | (2) |
where is intended to capture unobserved perturbations of the state . Such (spatially) continuous state-space models are natural in many applications where the changes are slow and the underlying system follows physical laws and constraints (e.g., object tracking), or where learning the laws are of great interest (e.g. in neuroscience and robotics) Roweis2001 ; Mante2013 ; Sussillo2013 ; Frigola2014 .
If the nonlinear state-space model is fully specified, Bayesian inference methods can be employed to estimate the current state (either the filtering distribution or the smoothing distribution ), and predict future states and observations for Ho1964 ; Sarkka2013 . However, in many applications, the challenge is in learning the parameters of the state-space model (a.k.a. the system identification problem). Learning both the latent trajectory in state-space and the latent (nonlinear) dynamical system is known as the joint estimation problem Haykin2001 . Expectation maximization (EM) based methods have been widely used in practice Ghahramani1999 ; Valpola2002 ; Turner2010 ; Golub2013 , and more recently variational autoencoder methods Archer2015a ; Krishnan2015 ; Johnson2016 ; Krishnan2017 ; Karl2017 ; Watter2015 have been proposed, all of which are designed for offline analysis, and not appropriate for real-time applications. Recursive stochastic variational inference has been successful in streaming data assuming independent samples Broderick2013 , however, in the presence of temporal dependence, proposed variational algorithms (e.g. Frigola2014 ) remain theoretical and lack testing.
In this study, we are interested in real-time signal processing and state-space control setting Golub2013 where online algorithms are needed that can recursively solve the joint estimation problem on streaming observations. A popular solution to this problem exploits the fact that online state estimators for nonlinear state-space models such as extended Kalman filter (EKF) or unscented Kalman filter (UKF) can be used for nonlinear regression formulated as a state-space model:
(parameter diffusion) | (3a) | ||||
(least-squares regression) | (3b) |
where is a function approximator parameterized by that maps to . Therefore, by augmenting the state-space with the parameters, one can build an online dual estimator using nonlinear Kalman filtering Wan2000 ; Wan2001 . However, they involve coarse approximation of Bayesian filtering, involve many hyperparameters, do not take advantage of modern stochastic gradient optimization, and are not easily applicable to arbitrary observation likelihoods. There are also closely related online version of EM-type algorithms Roweis2001 that share similar concerns. In hopes of lifting these concerns, we derive an online black-box variational inference framework, referred to as variational joint filtering (VJF), applicable to a wide range of nonlinear state-space models and observation models, that is, the computational demand of the algorithm is constant per time step.
2 Variational Principle for Online Joint Estimation
The crux of recursive Bayesian filtering is updating the posterior over the latent state one step at a time.
(4a) | ||||
(4b) |
where the input and parameters are omitted for brevity. Unfortunately, the exact calculations of (4) are not tractable in general, especially for nonlinear dynamics models and/or non-conjugate distributions. We developed a recursive variational Bayesian filter by deriving an evidence lower bound for the marginal likelihood as the objective function. Let denote an arbitrary probability measure which will eventually approximate the filtering density . From (4a), we have
where denotes Shannon’s entropy and denotes the Kullback-Leibler divergence Cover1991 . Maximizing for this lower bound would result in a variational posterior . Naturally we plug in the previous step’s solution to the next time step, obtaining a loss function suitable for recursive estimation:
(5) |
This also results in consistent for all time steps as they are in the same family of distribution.
Online inference is achieved by maximizing this objective w.r.t. the parameters (omitted for brevity) of the observation model , state dynamics and the variational posterior distribution provided that takes some parameterized form and is estimated from the previous time step. Maximizing is equivalent to minimizing the two variational gaps: (1) the variational filtering posterior has to be close to the true filtering posterior, and (2) the filtering posterior from the previous step needs to be close to . Note that this second gap is invariant to if , that is, the one-step backward smoothing distribution is identical to the filtering distribution.
On the flip side, intuitively, there are three components in that are jointly optimized: (1) reconstruction log-likelihood which is maximized if concentrates around the maximum likelihood estimate given only , (2) the dynamics log-likelihood which is maximized if concentrates at around the maximum of , and (3) the entropy that expands and prevents it from collapsing to a point mass.
In order for this recursive estimation to be real-time, we choose to be a multivariate normal with diagonal covariance . Moreover, to amortize the computational cost of optimization to obtain the best on each time step, we employ the variational autoencoder architecture Hinton1995 to parameterize with a recognition model. Intuitively, the recognition model embodies the optimization process of finding , that is, it performs an approximate Bayesian filtering computation (in constant time) of (4) according to the objective function . We use a recursive recognition network model that maps and to . In particular, the recognition model is a deterministic recurrent neural network (RNN; with no extra hidden state):
(6) |
Specifically we use a simple multi-layer perceptron as in this study. Note that the Markovian architecture of the recognition model reflects the Markovian structure of filtering computation (c.f., smoothing networks often use bidirectional RNN Sussillo2016 or graphical models Archer2015a ; Johnson2016 ).
The expectations appearing in the reconstruction log-likelihood and dynamics log-likelihood are not always tractable in general. For those intractable cases, one can use the reparameterization trick and stochastic variational Bayes Rezende2014 ; Kingma2014 : rewriting the expectations over as expectation over a standard normal random variable, and using a single sample for each time step. Hence, in practice, we optimize the following objective function (the other variables and parameters are omitted for brevity),
(7) |
where and represent random samples from and respectively. Note that the remaining expectation over has closed form solution for Gaussian state noise. Thus, our method can handle arbitrary observation and dynamics model unlike dual form nonlinear Kalman filtering methods.
The schematics of the proposed model is summarized by two passes in Figure 1. In the forward pass, the previous latent state generates the new state through the state dynamics, and the new state transforms into the observation through the observation model. In the backward pass, the recognition model recovers the current latent state from the observation, and the three components, observation model, recognition model and dynamical system, are updated by backpropagation.
Denote the set of all parameters by from the generative, recognition and dynamics models. The objective function in Eq. (7) is differentiable w.r.t. , and thus we employ empirical Bayes and optimize it through stochastic gradient ascent (using Adam Kingma2014a ). Algorithm 1 is an overview of the recursive estimation algorithm. We outline the algorithm for a single vector time series, but we can filter multiple sequences with a common state-space model simultaneously, in which case the gradients are averaged across the instantiations.
Note that this algorithm has constant time and space complexity per time step.
3 Application to Latent Neural Dynamics
Our primary applied aim is real-time neural interfaces where a population of neurons are recorded while a low-dimensional stimulation is delivered Newman2015 ; ElHady2016 ; Hocker2019a . State-space modeling of such neural time series have been successful in describing population dynamics Macke2011 ; Zhao2017 . Moreover, models of neural computation are often described as dynamical systems Hopfield1982 ; Dayan2001 ; Barak2013 . For example, attractor dynamics where the convergence to one of the attractors represents the result of computation Wang2002 ; Nassar2018b . Here we propose a parameterization and tools for visualization of the model suitable for studying neural dynamics and building neural interfaces Zhao2016 .
3.1 Parameterization
Having in mind high-temporal resolution neural spike trains where each time bin has at most one action potential, we describe the case for a point process observation. However, note that our method generalizes to arbitrary observation likelihood appropriate for other modalities, including calcium imaging or local field potentials. Our observed time series is a stream of sparse binary vectors. All analysis are done with time bins containing one event at most each for Poisson observation in this study.
Our observation model assumes that the spike train observation is sampled from a probability distribution determined by the latent state though a linear-nonlinear map possibly together with extra parameters at each time ,
(8) |
where is a point-wise nonlinearity making rates non-negative. We use the canonical link for Poisson likelihood in this study. Note that this model is not identifiable since where is an arbitrary invertible matrix. Also, the mean of can be traded off with the bias term in . It is straightforward to include more additive exogenous variables, history-filter for refractory period, coupling between processes, and stimulation artifacts Pillow2008 ; Truccolo2005 .
For state dynamics we propose to use a specific parameterization with additive state transition function and input interaction as a special case of Eq. (2),
(9a) | ||||
(9b) | ||||
(9c) |
where is a vector of continuous basis functions, i.e. and can be linear or locally linear interaction parameterized as a linear combination of ’s as . In this study, we use squared exponential radial basis functions Roweis2001 ; Frigola2014 ; Sussillo2013 ; Zhao2016 ,
(10) |
with centers and corresponding inverse squared kernel width .
The time complexity of our algorithm is where denote the dimensions of observation, latent space, input, the numbers of hidden units and radial basis functions for this specific parameterization. Note that the time complexity does not grow with time that enable efficient online inference. If we compare this to an efficient offline algorithm such as PLDS Macke2011 run repeatedly for every new observation (“online mode”), its time complexity is per time step at time which grows as time passes, making it impractical for real-time application.
We also employ the sparse Gaussian process (GP) to model the dynamics. The sparse GP enables Bayesian inference on the dynamics. We derived variational inference on the sparse GP. The details can be found in S.3. The RBF parameterization can be considered as a special case of sparse GP.
3.2 Phase portrait analysis
The function directly represents the velocity field of an underlying smooth dynamics (1a) in the absence of input Roweis2001 ; Zhao2016 . We visualize the estimated dynamics as phase portrait which consists of the vector field, example trajectories, and estimated dynamical features (namely fixed points) Strogatz2000 . We numerically identify candidate fixed points that satisfy . For the simulation studies, we do an affine transformation to orient the phase portrait to match the canonical equations in the main text when the simulation is done through the proposed observation model.
3.3 Prediction
The prediction contains both latent trajectory and observation which last for steps by sampling from:
(11a) | ||||
(11b) |
given estimated parameters without seeing the data during these steps.
4 Experiments
4.1 Simulated data
Many theoretical models have been proposed in neuroscience to represent different schemes of computation. For the purpose of visualization, we choose to simulate from two or three dimensional dynamical systems. We apply the proposed method to four such low-dimensional models: a ring attractor model as a model of internal head direction representation, an nonlinear oscillator as a model of rhythmic population-wide activity, a biophysically realistic cortical network model for a visual discrimination experiment and a chaotic attractor.
The observations are either Gaussian or point-process (200 dimensions unless otherwise mentioned) via state dynamics driven processes. The average firing rate of the simulated spike trains are kept below 40 Hz. We refer to their conventional formulations under different coordinate systems, but our simulations and inferences are all done in Cartesian coordinates.
The approximate posterior distribution is defined recursively in Eq. (6) as diagonal Gaussian with mean and variance determined by corresponding observation, input and previous step via a recurrent neural network. We use one hidden layer MLP in this study. Typically the state noise variance is unknown and has to be estimated from data. To be consistent with Eq. (9c), we set the starting value of to be , and hence . We initialize the loading matrix by factor analysis, and column-wisely normalize it by norm every iteration to keep the system identifiable.
4.1.1 Ring attractor
First, we study the following two-variable ring attractor system:
(12) | ||||
where represents the direction driven by input , and is the radial component representing an internal circular variable, such as head direction. We simulate 100 trajectories of 1000 steps with step size , , , plus Gaussian noise (). We use strong input (tangent drift) to keep the trajectories flowing around the ring clockwise or counter-clockwise. We use 20 radial basis functions for dynamics model and 100 hidden units for the recognition model.
Figure 1(a) illustrates one latent trajectory (black) and its variational posterior mean (blue). These two trajectories start at green circle and diamond respectively and end at the red markers. The inference starts near the center (origin) that is relatively far from the true location because the initial posterior mean is set at zero. The final states are very close which implies that the recognition model works well. Figure 1(b) shows the reconstructed velocity field by the model. We visualize the velocity as colored directional streamlines. We can see the velocity toward the ring attractor and the speed is smaller closer to the ring. The model also identifies a number of fixed points arranged around the ring attractor via numerical roots finding. Figure 1(c) shows the distribution of posterior means of all data points in the state-space. We have more confidence of the inferred dynamical system in the denser area.
Figure 1(d) shows the three components of (5) and the objective lower bound clearly, demonstrating the convergence of the algorithm. We can see each component reaches a plateau within sec. As the reconstruction and dynamics log-likelihoods increase, the recognition model and dynamical model are getting more accurate while the decreasing entropy indicates the increasing confidence (inverse posterior variance) on the inferred latent states. The average computation time of a joint estimation step is ms (hardware specification: Intel Xeon E5-2680 2.50GHz, 128GB RAM, no GPU).
4.1.2 Nonlinear oscillator
We use a 2-dimensional relaxation oscillator with the following nonlinear state dynamics:
(13) | ||||
where is the membrane potential, is a recovery variable and is the magnitude of stimulus current in modeling single neuron biophysics Izhikevich2007 . This model was used to model global brain state that fluctuates between two levels of excitability in anesthetized cortex Curto2009 . We use the following parameter values , , and to simulate 100 trajectories of 1000 steps with step size and Gaussian noise (std=). At this regime, unlike the ring attractor, the spontaneous dynamics is a periodic oscillation, and the trajectory follows a limit cycle.
We use 20 radial basis functions for dynamics model and 100 hidden units for recognition model. While training the model, the input was clamped to zero, and expect the model to learn the spontaneous oscillator.
We also reconstruct the phase portrait (Fig. 2(b)) comparing to the truth (Fig. 2(c)). The two dashed lines are the theoretical nullclines of the true model on which the velocity of corresponding dimension is zero. The reconstructed field shows a low speed valley overlapping with the nullcline especially on the right half figure. The intersection of two nullclines is a unstable fixed point. We can see the identified fixed point is close to the intersection. As most of the trajectories lie on the oscillation path with merely few data points elsewhere, the inferred system shows the oscillation dynamics similar to the true system around the data region. The difference mostly happens in the region far from the trajectories because of the lack of data.
We run a long-term prediction using the proposed model without seeing the future data during these steps () beginning at the final state of training data. We show the truth and prediction in Figure 3(a). The upper row is the true latent trajectory and corresponding observations. The lower row is the filtered trajectory and prediction by our proposed method. The light-colored parts are the 500 steps of inference before prediction and the solid-colored parts are 1000 steps truth and prediction. We also show the sample observations from the trained observation model during the prediction period.
One of the popular latent process modeling tools for point process observation that can make prediction is Poisson Linear Dynamical System (PLDS) Macke2011 which assumes latent linear dynamics. We compare PLDS fit with EM on its long-term prediction on both the states and spike trains (Fig. 3(a)).
We run LFADS Sussillo2016 using the same data. LFADS implements its dynamical model with GRU that requires high dimensions. For this 2D system, we tried different GRU dimensionality. We made minimal changes to its recommended setting including only the generator dimensionality, batch and no controller. The result shows that LFADS requires much higher dimension than the true system to capture the oscillation. (The figure of its inferred trajectories is shown in the supplement.) We report the fitted log-likelihood per time bin , and for 2D, 20D and 50D GRU respectively. In comparison, the log-likelihood of the proposed approach is with a 2D dynamical model.
4.1.3 Fixed point attractor for decision-making
Perceptual decision-making paradigm is a well-established cognitive task where typically a low-dimensional decision variable needs to be integrated over time, and subjects are close to optimal in their performance. To understand how the brain implements such neural computation, many competing theories have been proposed Barak2013 ; Mante2013 ; Ganguli2008 ; Wang2002 ; Wong2006 . We test our method on a simulated biophysically realistic cortical network model for a visual discrimination experiment Wang2002 . In the model, there are two excitatory subpopulations that are wired with slow recurrent excitation and feedback inhibition to produce attractor dynamics with two stable fixed points. Each fixed point represents the final perceptual decision, and the network dynamics amplify the difference between conflicting inputs and eventually generates a binary choice.
We subsample 480 selective neurons out of 1600 excitatory neurons from the simulation to be observed by our algorithm. The simulated data is organized into decision-making trials where each trial lasts for 2 sec and with different strength of visual evidence, controlled by “coherence”. Our method with 20 basis functions learned the dynamics from 140 training trials (20 per coherence where max is ). We visualize the velocity field at zero stimulus as colored streamlines in Figure 5. The left panel is the estimated velocity field and the right panel is the theoretical velocity field of a theoretically reduced model Wong2006 . Note that the spiking neuron model where the simulation comes from is different from the reduced model. Here we want to show that our method can provide a qualitatively similar result to the theoretical work which reduces the dimensionality and complexity of the original network.
(\subreffig:rmse:lds:prediction) Mean RMSEs of one-step-ahead prediction of nonstationary system during online learning (50 trials). The linear system was changed and perturbed at the 2000th step. The lines are average RMSEs. Both the online algorithms learned the change after a few steps.
4.1.4 Nonstationary system
We compare dual EKF and the proposed approach on non-stationary linear dynamical system. A spiral-in linear system was changed from clockwise to counter-clockwise at the 2000th step and perturbed. The observations were Gaussian. To focus on the dynamics, we fixed all the parameters but transition matrix of both methods except that our approach still have to learn the recognition model. Figure 4(c) shows that our approach achieved the same online performance as dual EKF in this experiment.
4.2 Neurophysiological Data
We apply the proposed method to a large scale recording to validate that it picks up meaningful dynamics. The dataset Graf2011 consists of 148 simultaneously recorded single units from the primary cortex (V1). In the experiment, directional drifting gratings were presented to an anesthetized monkey for around 1.3s, each direction of which was repeated 50 trials.
We finally fit a 2D dynamics model to the 50 trials of 63 well-tuned Graf2011 units of one same direction, and the spike counts of 1ms bin size. Due to low signal-to-noise ratio (SNR), we initialize the observation model with dimensionality reduction methods such as variational latent Gaussian processes (vLGP), and then train the model in online mode, each components alternatively. We also feed the data multiple times during training.
There is one continuous circular variables in the stimuli space: temporal phase of oscillation. In hopes that the inferred dynamical system would implement the oscillation, we did not included the stimulus drive as the input in the model. As expected, Figure 6 shows the inferred dynamical system is able to implement the oscillation.
5 Discussion
New and future technologies for recording the activity of large neural populations during meaningful behavior will provide exciting opportunities for investigating the neural computations that underlie perception, cognition, and decision-making. However, the datasets provided by these technologies currently require sophisticated offline analyses that slow down the scientific cycle of experiment, data analysis, hypothesis generation, and further experiment. Moreover, in closed-loop neurophysiological setting, real-time adaptive algorithms are extremely valuable. We propose an online algorithm for recursive variational Bayesian inference for joint system identification and filtering that can greatly impact neuroscience research and biomedical engineering. Our algorithm is flexible—allows wide range of likelihood and dynamics, computationally tractable, and produces interpretable visualizations of complex collective network dynamics.
This work opens many avenues for future work. One direction is to apply this model to large-scale neural recording from a behaving animal. We hope that further development would enable on-the-fly analysis of high-dimensional neural spike train during electrophysiological experiments. Clinically, a nonlinear state-space model provides a basis for nonlinear feedback control as a potential treatment for neurological diseases that arise from diseased dynamical states.
References
- (1) S. Haykin and J. Principe, “Making sense of a complex world [chaotic events modeling],” IEEE Signal Processing Magazine, vol. 15, no. 3, pp. 66–81, May 1998.
- (2) J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, may 2009.
- (3) C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence, “Recurrent gaussian processes,” International Conference on Learning Representations (ICLR), 2016.
- (4) S. Roweis and Z. Ghahramani, Learning nonlinear dynamical systems using the expectation-maximization algorithm. John Wiley & Sons, Inc, 2001, pp. 175–220.
- (5) V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome, “Context-dependent computation by recurrent dynamics in prefrontal cortex,” Nature, vol. 503, no. 7474, pp. 78–84, Nov. 2013.
- (6) D. Sussillo and O. Barak, “Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks,” Neural Computation, vol. 25, no. 3, pp. 626–649, Mar. 2013.
- (7) R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational gaussian process state-space models,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada, 2014, pp. 3680–3688.
- (8) Y. Ho and R. Lee, “A Bayesian approach to problems in stochastic estimation and control,” IEEE Transactions on Automatic Control, vol. 9, no. 4, pp. 333–339, Oct. 1964.
- (9) S. Särkkä, Bayesian filtering and smoothing. Cambridge University Press, 2013.
- (10) S. S. Haykin, Kalman filtering and neural networks. Wiley, 2001.
- (11) Z. Ghahramani and S. T. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in Advances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds. MIT Press, 1999, pp. 431–437.
- (12) H. Valpola and J. Karhunen, “An unsupervised ensemble learning method for nonlinear dynamic State-Space models,” Neural Computation, vol. 14, no. 11, pp. 2647–2692, Nov. 2002.
- (13) R. Turner, M. Deisenroth, and C. Rasmussen, “State-Space Inference and Learning with Gaussian Processes,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 868–875.
- (14) M. D. Golub, S. M. Chase, and B. M. Yu, “Learning an internal dynamics model from control demonstration.” JMLR workshop and conference proceedings, pp. 606–614, 2013.
- (15) E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski, “Black box variational inference for state space models,” ArXiv e-prints, Nov. 2015.
- (16) R. G. Krishnan, U. Shalit, and D. Sontag, “Deep Kalman filters,” arXiv, vol. abs/1511.05121, Nov. 2015.
- (17) M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, “Composing graphical models with neural networks for structured representations and fast inference,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 2946–2954.
- (18) R. G. Krishnan, U. Shalit, and D. Sontag, “Structured inference networks for nonlinear state space models,” arXiv, vol. abs/1511.05121, 2016.
- (19) M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, “Deep variational Bayes filters: Unsupervised learning of state space models from raw data,” in 5th International Conference on Learning Representations, 2017.
- (20) M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2746–2754.
- (21) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, “Streaming variational bayes,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 1727–1735.
- (22) E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter for nonlinear estimation,” in Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373). Lake Louise, Alta., Canada: IEEE, Aug. 2000, pp. 153–158.
- (23) E. A. Wan and A. T. Nelson, Dual extended Kalman filter methods. John Wiley & Sons, Inc, 2001, pp. 123–173.
- (24) T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, Aug. 1991.
- (25) G. Hinton, P. Dayan, B. Frey, and R. Neal, “The "wake-sleep" algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, May 1995.
- (26) D. Sussillo, R. Jozefowicz, L. F. Abbott, and C. Pandarinath, “LFADS - latent factor analysis via dynamical systems,” arXiv, vol. abs/1608.06315, Aug. 2016.
- (27) D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in International Conference on Machine Learning, May 2014.
- (28) D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv:1312.6114 [cs, stat], May 2014, arXiv: 1312.6114.
- (29) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
- (30) J. P. Newman, M.-f. Fong, D. C. Millard, C. J. Whitmire, G. B. Stanley, and S. M. Potter, “Optogenetic feedback control of neural activity,” eLife, 2015.
- (31) A. El Hady, Closed Loop Neuroscience, A. El Hady, Ed. Academic Press, 2016.
- (32) D. Hocker and I. M. Park, “Myopic control of neural dynamics,” PLOS Computational Biology, 2019. [Online]. Available: https://www.biorxiv.org/content/early/2017/12/30/241299
- (33) J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani, “Empirical models of spiking in neural populations,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 1350–1358.
- (34) Y. Zhao and I. M. Park, “Variational latent Gaussian process for recovering single-trial dynamics from population spike trains,” Neural Computation, vol. 29, no. 5, May 2017.
- (35) J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, Apr. 1982.
- (36) P. Dayan and L. F. Abbott, Theoretical neuroscience: computational and mathematical modeling of neural systems. Massachusetts Institute of Technology Press, 2001.
- (37) O. Barak, D. Sussillo, R. Romo, M. Tsodyks, and L. F. Abbott, “From fixed points to chaos: three models of delayed discrimination.” Progress in neurobiology, vol. 103, pp. 214–222, Apr. 2013.
- (38) X.-J. Wang, “Probabilistic decision making by slow reverberation in cortical circuits,” Neuron, vol. 36, no. 5, pp. 955–968, Dec. 2002.
- (39) J. Nassar, S. W. Linderman, M. Bugallo, and I. M. Park, “Tree-structured recurrent switching linear dynamical systems for multi-scale modeling,” in International Conference on Learning Representations (ICLR), Nov. 2019. [Online]. Available: https://openreview.net/forum?id=HkzRQhR9YX
- (40) Y. Zhao and I. M. Park, “Interpretable nonlinear dynamic modeling of neural trajectories,” in Advances in Neural Information Processing Systems (NIPS), 2016.
- (41) J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, “Spatio-temporal correlations and visual signalling in a complete neuronal population,” Nature, vol. 454, no. 7207, pp. 995–999, Jul. 2008.
- (42) W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, “A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects,” Journal of Neurophysiology, vol. 93, no. 2, pp. 1074–1089, Feb. 2005.
- (43) S. H. Strogatz, Nonlinear Dynamics and Chaos, 1st ed., ser. Studies in nonlinearity. The Perseus Books Group, Jan. 2000.
- (44) E. M. Izhikevich, Dynamical systems in neuroscience : the geometry of excitability and bursting, ser. Computational neuroscience. MIT Press, 2007.
- (45) C. Curto, S. Sakata, S. Marguet, V. Itskov, and K. D. Harris, “A simple model of cortical dynamics explains variability and state dependence of sensory responses in Urethane-Anesthetized auditory cortex,” The Journal of Neuroscience, vol. 29, no. 34, pp. 10 600–10 612, Aug. 2009.
- (46) S. Ganguli, J. W. Bisley, J. D. Roitman, M. N. Shadlen, M. E. Goldberg, and K. D. Miller, “One-dimensional dynamics of attention and decision making in LIP,” Neuron, vol. 58, no. 1, pp. 15–25, Apr. 2008.
- (47) K.-F. Wong and X.-J. Wang, “A recurrent network mechanism of time integration in perceptual decisions,” The Journal of Neuroscience, vol. 26, no. 4, pp. 1314–1328, Jan. 2006.
- (48) A. B. A. Graf, A. Kohn, M. Jazayeri, and J. A. Movshon, “Decoding the activity of neuronal populations in macaque primary visual cortex,” Nature Neuroscience, no. 2, pp. 239–245, Jan. 2011.
- (49) W. Maass, T. Natschläger, and H. Markram, “Real-time computing without stable states: A new framework for neural computation based on perturbations,” Neural Computation, vol. 14, pp. 2531–2560, 2002.
- (50) R. Laje and D. V. Buonomano, “Robust timing and motor patterns by taming chaos in recurrent neural networks,” Nature Neuroscience, vol. 16, no. 7, pp. 925–933, May 2013.
Supplement
s.1 Nonlinear oscillator
s.2 Chaotic dynamics
Chaotic dynamics (or edge-of-chaos) has been postulated to support asynchronous states in the cortex, and neural computation over time by generating rich temporal patterns Maass2002 ; Laje2013 . We consider the 3-dimensional standard Lorenz attractor as an example chaotic system. We simulate 216 latent trajectories from:
(14) |
The initial states of the trajectories are evenly-spaced. We discard the first 500 transient steps of each trajectory and then use the following 1000 steps. We generate 200-dimensional Gaussian observations driven by the trajectories. Figure 8 shows the inference and prediction. We continue simulating a trajectory for 2000 more steps from the final state where the training data end, and predict the latent state through the proposed model. The black lines show the three dimensions of true state respectively and the red lines are the model-prediction. We reset the prediction back to the true state every 500 steps.
s.3 Sparse GP dynamics
s.3.1 Model specification
(15) | ||||
(16) | ||||
(17) |
s.3.2 Variational sparse GP
Each component of the vector-valued function is defined by a Gaussian Process
(18) |
Then for any finite set of points has a covariance matrix given the covariance matrix between the components. The state noise covariance becomes as well. We introduce the inducing variable . Let , and omit the control input. Then we have
(19) |
and
(20) | ||||
Introduce inducing variable | ||||
where denotes the variational approximation to .
Assumed that is a sufficient statistic of , we have
(21) | ||||
As is conditional on ,
(22) |
and hence
(23) |
The marginal distribution of is
(24) |
Let
(25) | ||||
(26) | ||||
(27) |
As the marginal and conditional distributions in the integrand are Gaussian, so is the joint distribution, and the marginal distributions are
(28) |
and
(29) |
If we assume further that factorizes as
(30) |
Usually sparse GP regression methods assume that
and then
(32) | ||||
Jensen’s inequality | ||||
where can be seen as the parameters with prior.