TreeStructured Recurrent Switching Linear Dynamical Systems for MultiScale Modeling
Abstract
Many realworld systems studied are governed by complex, nonlinear dynamics. By modeling these dynamics, we can gain insight into how these systems work, make predictions about how they will behave, and develop strategies for controlling them. While there are many methods for modeling nonlinear dynamical systems, existing techniques face a trade off between offering interpretable descriptions and making accurate predictions. Here, we develop a class of models that aims to achieve both simultaneously, smoothly interpolating between simple descriptions and more complex, yet also more accurate models. Our probabilistic model achieves this multiscale property through a hierarchy of locally linear dynamics that jointly approximate global nonlinear dynamics. We call it the treestructured recurrent switching linear dynamical system. To fit this model, we present a fullyBayesian sampling procedure using PólyaGamma data augmentation to allow for fast and conjugate Gibbs sampling. Through a variety of synthetic and real examples, we show how these models outperform existing methods in both interpretability and predictive capability.
TreeStructured Recurrent Switching Linear Dynamical Systems for MultiScale Modeling
Josue Nassar 

Department of Electrical & Computer Engineering 
Stony Brook University 
Stony Brook, NY 11794 USA 
josue.nassar@stonybrook.edu 
Scott W. Linderman 

Department of Statistics 
Columbia University 
116th St & Broadway, New York, NY 10027 
scott.linderman@gmail.com 
Mónica F. Bugallo 

Department of Electrical & Computer Engineering 
Stony Brook University 
Stony Brook, NY, 11794 USA 
monica.bugallo@stonybrook.edu 
Il Memming Park 

Department of Neurobiology and Behavior 
Stony Brook University 
Stony Brook, NY, 11794 USA 
memming.park@stonybrook.edu 
1 Introduction
Complex systems can often be described at multiple levels of abstraction. A computer program can be characterized by the list of functions it calls, the sequence of statements it executes, or the assembly instructions it sends to the microprocessor. As we zoom in, we gain an increasingly nuanced view of the system and its dynamics. The same is true of many natural systems. For example, brain activity can be described in terms of highlevel psychological states or via detailed ion channel activations; different tasks demand different levels of granularity. One of our principal aims as scientists is to identify appropriate levels of abstraction for complex natural phenomena and to discover the dynamics that govern how these systems behave at each level of resolution.
Modern machine learning offers a powerful toolkit to aid in modeling the dynamics of complex systems. Bayesian state space models and inference algorithms enable posterior inference of the latent states of a system and the parameters that govern their dynamics (Särkkä, 2013; Barber et al., 2011; Doucet et al., 2001). In recent years, this toolkit has been expanded to incorporate increasingly flexible components like Gaussian processes (Frigola et al., 2014) and neural networks (Chung et al., 2015; Johnson et al., 2016; Gao et al., 2016; Krishnan et al., 2017) into probabilistic time series models. In neuroscience, sequential autoencoders offer highly accurate models of brain activity (Pandarinath et al., 2018). However, while these methods offer state of the art predictive models, their dynamics are specified at only the most granular resolution, leaving the practitioner to tease out higher level structure post hoc.
Here we propose a probabilistic generative model that provides a multiscale view of the dynamics through a hierarchical architecture. We call it the treestructured recurrent switching linear dynamical system, or TrSLDS. The model builds on the recurrent SLDS (Linderman et al., 2017) to approximate latent nonlinear dynamics through a hierarchy of locally linear dynamics. Once fit, the TrSLDS can be queried at different levels of the hierarchy to obtain dynamical descriptions at multiple levels of resolution. As we proceed down the tree, we obtain higher fidelity, yet increasingly complex, descriptions. Thus, depth offers a simple knob for trading off interpretability and flexibility. The key contributions are twofold: first, we introduce a new form of treestructured stick breaking for multinomial models that strictly generalizes the sequential stick breaking of the original rSLDS, while still permitting Pólyagamma data augmentation (Polson et al., 2013) for efficient posterior inference; second, we develop a hierarchical prior that links dynamics parameters across levels of the tree, thereby providing descriptions that vary smoothly with depth.
The paper is organized as follows. Section 2 provides background material on switching linear dynamical systems and their recurrent variants. Section 3 presents our treestructured model and Section 4 derives an efficient fullyBayesian inference algorithm for the latent states and dynamics parameters. Finally, in Section 5 we show how our model yields multiscale dynamics descriptions for synthetic data from two standard nonlinear dynamical systems—the Lorenz attractor and the FitzHughNagumo model of nonlinear oscillation—as well as for a real dataset of neural responses to visual stimuli in a macaque monkey.
2 Background
Let and denote the latent state and the observation of the system at time respectively. The system can be described using a statespace model:
(state dynamics)  (1)  
(observation)  (2) 
where denotes the dynamics parameters, denotes the emission (observation) parameters, and and are the state and observation noises respectively. For simplicity, we restrict ourselves to systems of the form:
(3) 
If the state space model is completely specified then recursive Bayesian inference can be applied to obtain an estimate of the latent states using the posterior (Doucet et al., 2001). However in many applications, the parametric form of the state space model is unknown. While there exist methods that perform smoothing to obtain an estimate of (Barber, 2006; Fox et al., 2009; Djuric & Bugallo, 2006), we are often interested in not only obtaining an estimate of the continuous latent states but also in learning the dynamics that govern the dynamics of the system.
In the simplest case, we can take a parametric approach to solving this joint stateparameter estimation problem. When and are assumed to be linear functions, the posterior distribution over latent states is available in closedform and the parameters can be learned via expectationmaximization. On the other hand, we have nonparametric methods that use Gaussian processes and neural networks to learn highly nonlinear dynamics and observations where the joint estimation is untractable and approximations are necessarily imployed (Zhao & Park, 2016; 2018; Frigola et al., 2014; Sussillo et al., 2016). Switching linear dynamical systems (SLDS) (Ackerson & Fu, 1970; Chang & Athans, 1978; Hamilton, 1990; Ghahramani & Hinton, 1996; Murphy, 1998) balance between these two extremes, approximating the dynamics by stochastically transitioning between a small number of linear regimes.
2.1 Switching Linear Dynamical Systems
SLDS approximate nonlinear dynamics by switching between a discrete set of linear regimes. An additional discrete latent state determines the linear dynamics at time ,
(4) 
where and for . Typically, is endowed with Markovian dynamics, . The conditionally linear dynamics allow for fast and efficient learning of the model and can utilize the learning tools developed for linear systems (Haykin, 2001). While SLDS can estimate the continuous latent states , the assumption of Markovian dynamics for the discrete latent states severely limits their generative capacity.
2.2 Recurrent Switching Linear Dynamical Systems
Recurrent switching linear dynamical systems (rSLDS) (Linderman et al., 2017), also known as augmented SLDS (Barber, 2006), are an extension of SLDS where the transition density of the discrete latent state depends on the previous location in the continuous latent space
(5)  
(6) 
where and represents hyperplanes. maps from the reals to the probability simplex via stickbreaking:
(7) 
for and where is the th component of of and is the logistic function (Fig. 1). By including this recurrence in the transition density of , the rSLDS partitions the latent space into sections, where each section follows its own linear dynamics. It is through this combination of locally linear dynamical systems that the rSLDS approximates eq. (3); the partitioning of the space allows for a more interpretable visualization of the underlying dynamics.
Recurrent SLDS can be learned efficiently and in a fully Bayesian manner, and experiments empirically show that they are adept in modeling the underlying generative process in many cases. However, the stick breaking process used to partition the space poses problems for inference due to its dependence on the permutation of the discrete states (Linderman et al., 2017).
3 TreeStrucutred Recurrent Switching Linear Dynamical Systems
Building upon the rSLDS, we propose the treestructured recurrent switching linear dynamical system (TrSLDS). Rather than sequentially partitioning the latent space using stick breaking, we use a treestructured stick breaking procedure (Adams et al., 2010) to partition the space.
Let denote a tree structure with a finite set of nodes . Each node has a parent node denoted by par with the exception of the root node, , which has no parent. For simplicity, we initially restrict our scope to balanced binary trees where every internal node is the parent of two children, and . Let denote the set of children for internal node . Let denote the set of leaf nodes, which have no children. Let denote the depth of a node in the tree, with .
At time instant , the discrete latent state is chosen by starting at the root node and traversing down the tree until one of the leaf nodes are reached. The traversal is done through a sequence of left/right choices by the internal nodes. Unlike in standard regression trees where the choices are deterministic (Lakshminarayanan, 2016), we model the choices as random variables. The traversal through the tree can be described as a stick breaking process. We start at the root node with a unitlength stick , which we divide between its two children. The left child receives a fraction and the right child receives the remainder such that specifies the left/right balance. This process is repeated recursively, subdividing into two pieces at each internal node until we reach the leaves of the tree (Fig. 1). The stick assigned to each node is thus,
(8) 
We incorporate this into the TrSLDS by allowing to be a function of the continuous latent state
(9) 
where the parameters and specify a linear hyperplane in the continuous latent state space. As the continuous latent state evolves, the left/right choices become more or less probable. This in turn changes the probability distribution over the leaf nodes, where . In the TrSLDS, these leaf nodes correspond to the discrete latent states of the model, such that for each leaf node ,
(10) 
In general, the treestructured stickbreaking is not restricted to balanced binary trees. We can allow more than two children through an ordered sequential stickbreaking at each level. In this sense, treestructured stickbreaking is a strict generalization of stickbreaking. We also note that similar to rSLDS, the model can be made more flexible by introducing a dependence on the previous discrete latent in eq. (9) but for the rest of the paper, we stick to eq. (8).
3.1 A Hierarchical Dynamics Prior that Respects the Tree Structure
Similar to standard rSLDS, the dynamics are conditionally linear given a leaf node . A priori, it is natural to expect that locally linear dynamics of nearby regions in the latent space are similar. Thus, in the context of treestructured stick breaking, we impose that partitions that share a common parent should have similar dynamics. We explicitly model this by enforcing a hierarchical prior on the dynamics that respects the tree structure.
Let be the dynamics parameters associated with node . Although the locally linear dynamics of a discrete state are specified by the leaf nodes, we introduce dynamics at the internal nodes as well. These internal dynamics serve as a link between the leaf node dynamics via a hierarchical prior,
(11) 
where is the vectorization operator. The prior on the root node is
(12) 
We impose the following constraint on the covariance matrix of the prior
(13) 
where is a hyper parameter that dictates how "close" a parent and child are to one another. The prior over the parameters can be written as, where the affine term and the operator are dropped for compactness,
(14) 
It is through this hierarchical treestructured prior that TrSLDS obtains a multiscale view of the system. Parents are given the task of learning a higher level description of the dynamics over a larger region while children are tasked with learning the nuances of the dynamics. The use of hierarchical priors also allows for neighboring sections of latent space to share common underlying dynamics inherited from their parent. TrSLDS can be queried at different levels, where levels deeper in the tree provide more resolution.
TrSLDS shares some features with regression trees (Lakshminarayanan, 2016), even though regression trees are primarily used for standard, static regression problems. The biggest differences are that our treestructured model has stochastic choices and the internal nodes contribute to smoothing across partitions through the corresponding hierarchical prior.
There are other hierarchical extensions of SLDS that have been proposed in the literature. In Stanculescu et al. (2014), they propose adding a layer to factorized SLDS where the toplevel discrete latent variables determine the conditional distribution of , with no dependence on . While the treestructured stickbreaking used in TrSLDS is also a hierarchy of discrete latent variables, the model proposed in Stanculescu et al. (2014) has no hierarchy of dynamics, preventing it from obtaining a multiscale view of the dynamics. In Zoeter & Heskes (2003), the authors construct a tree of SLDSs where an SLDS with possible discrete states is first fit. An SLDS with discrete states is then fit to each of the clusters of points. This process continues iteratively, building a hierarchical collection of SLDSs that allow for a multiscale, lowdimensional representation of the observed data. While similar in spirit to TrSLDS, there are key differences between the two models. First, it is through the treestructured prior that TrSLDS obtains a multiscale view of the dynamics, thus we only need to fit one instantiation of TrSLDS; in contrast, they fit a separate SLDS for each node in the tree, which is computationally expensive. There is also no explicit probabilistic connection between the dynamics of a parent and child in Zoeter & Heskes (2003). We also note that TrSLDS aims to learn a multiscale view of the dynamics while Zoeter & Heskes (2003) focuses on smoothing, that is, they aim to learn a multiscale view of the latent states corresponding to data but not suitable for forecasting.
In the next section we show an alternate view of TrSLDS which we will refer to as the residual model in which internal nodes do contribute to the dynamics. Nevertheless, this residual model will turn out to be equivalent to the TrSLDS.
3.2 Residual Model
Let be the linear dynamics of node and let be the sequence of nodes visited to arrive at node . In contrast to TrSLDS, the dynamics for a leaf node are now determined by all the nodes in the tree:
(15)  
(16) 
We model the dynamics to be independent a priori, where once again the operator and the affine term aren’t shown for compactness,
(17) 
where and .
The residual model offers a different perspective of TrSLDS. The covariance matrix can be seen as representing how much of the dynamics a node is tasked with learning. The root node is given the broadest prior because it is present in eq. (16) for all leaf nodes; thus it is given the task of learning the global dynamics. The children then have to learn to explain the residuals of the root node. Nodes deeper in the tree become more associated with certain regions of the space, so they are tasked with learning more localized dynamics which is represented by the prior being more sharply centered on 0. The model ultimately learns a multiscale view of the dynamics where the root node captures a coarse estimate of the system while lower nodes learn a much finer grained picture. We show that TrSLDS and residual model yield the same joint distribution (See A for the proof).
Theorem 1.
TrSLDS and the residual model are equivalent if the following conditions are true: , , , and
4 Bayesian Inference
The linear dynamic matrices , the hyperplanes , the emission parameters , the continuous latent states and the discrete latent states must be inferred from the data. Under the Bayesian framework, this implies computing the posterior,
(18) 
We perform fully Bayesian inference via Gibbs sampling (Brooks et al., 2011) to obtain samples from the posterior distribution described in eq. (18). To allow for fast and closed form conditional posteriors, we augment the model with Pólyagamma auxiliary variables Polson et al. (2013).
4.1 PólyaGamma Augmentation
Consider a logistic regression from regressor to categorical distribution ; the likelihood is
(19) 
If a Gaussian prior is placed on then the model is nonconjugate and the posterior can’t be obtained in closed form. To circumvent this problem Polson et al. (2013) introduced a PólyaGamma (PG) augmentation scheme. This augmentation scheme is based on the following integral identity
(20) 
where and . Setting , it is evident that the integrand is a kernel for a Gaussian. Augmenting the model with PG axillary r.v.s , eq. (19) can be expressed as
(21) 
Conditioning on , the posterior of is
(22) 
where and . It can be shown that the conditional posterior of is also PG where (Polson et al., 2013).
4.2 Conditional Posteriors
The structure of the model allows for closed form conditional posterior distributions that are easy to sample from. For clarity, the conditional posterior distributions for the TrSLDS are given below:

The linear dynamic parameters and state variance of a leaf node are conjugate with a Matrix Normal Inverse Wishart (MNIW) prior

The linear dynamic parameters of an internal node are conditionally Gaussian given a Gaussian prior on

If we assume the observation model is linear and with additive white Gaussian noise then the emission parameters are also conjugate with a MNIW prior
We can also handle binomial observations through the use of Pólyagamma augmentation. In the interest of space, the details are explained in Section B.1 in the Appendix.

The choice parameters are logistic regressions which follow from the conditional posterior
where . The likelihood is of the same form as the left hand side of eq. (20), thus it is amenable to the PG augmentation. Let be the auxiliary Pólyagamma random variable introduced at time for an internal node . We can express the posterior over the hyperplane of an internal node as:
(23) where , . Augmenting the model with Pólyagamma random variables allows for the posterior to be conditionally Gaussian under a Gaussian prior.

Conditioned on the discrete latent states, the continuous latent states are Gaussian. However, the presence of the treestructured recurrence potentials introduced by eq. (10) destroys the Gaussinity of the conditional. When the model is augmented with PG random variables , the augmented recurrence potential, , becomes effectively Gaussian, allowing for the use of message passing for efficient sampling. Linderman et al. (2017) shows how to perform messagepassing using the Pólyagamma augmented recurrence potentials . In the interest of space, the details are explained in Section B.2 in the Appendix.

The discrete latent variables are conditionally independent given thus

The conditional posterior of the PólyaGamma random variables are also PólyaGamma: .
Due to the complexity of the model, good initialization is critical for the Gibbs sampler to converge to a mode in a reasonable number of iterations. Details of the initialization procedure are contained in Section C in the Appendix.
5 Experiments
We demonstrate the potential of the proposed model by testing it on a number of nonlinear dynamical systems. The first, FitzHughNagumo, is a common nonlinear system utilized throughout neuroscience to describe an action potential. We show that the proposed method can offer different angles of the system. We also compare our model with other approaches and show that we can achieve state of the art performance. We then move on to the Lorenz attractor, a chaotic nonlinear dynamical system, and show that the proposed model can once again break down the dynamics and offer an interesting perspective. Finally, we apply the proposed method on the data from Graf et al. (2011).
5.1 FitzHughNagumo
The FitzHughNagumo (FHN) model is a 2dimensional reduction of the HodgkinHuxley model which is completely described by the following system of differential equations (Izhikevich, 2007):
(24) 
We set the parameters to , , , and . We trained our model with 100 trajectories where the starting points were sampled uniformly from . Each of the trajectories consisted of 430 time points, where the last 30 time points of the trajectories were used for testing. The observation model is linear and Gaussian where , and where is an identity matrix of dimension n. We set the number of leaf nodes to be 4 and ran Gibbs for 1,000 samples; the last 50 samples were kept and we choose the sample that produced the highest log likelihood to produce Fig. 2 where the vector fields were produced using the mode of the conditional posteriors of the dynamics.
To quantitatively measure the predictive power of TrSLDS, we compute the step predictive mean squared error, , and its normalized version, , on a test set where and are defined as
(25) 
where is the average of a trial and is the prediction at time which is obtained by (i) using the the samples produced by the sampler to obtain an estimate of given , (ii) propagate for time steps forward to obtain and then (iii) obtain . We compare the model to LDS, SLDS and rSLDS for over the last 30 time steps for all 100 trajectories (Fig. 2I).
5.2 Lorenz Attractor
Lorenz attractors are chaotic systems whose nonlinear dynamics are defined by,
The parameters were set to , and . The data consisted of 50 trajectories, each of length of 230 where the first 200 time points are used for training and the last 30 are used for testing. The observation model was a projection onto 10 dimensional space with Gaussian noise.We set the number of leaf nodes to be 4 and ran Gibbs for 1,000 samples; the last 50 samples were kept and we choose the sample that produced the highest loglikelihood to produce Fig. 3.
The butterfly shape of the Lorenz attractor lends itself to being roughly approximated by two 2dimensional ellipsoids; this is exactly what TrSLDS learns in the second level of the tree. As is evident from Fig. 5B, the two ellipsoids don’t capture the nuances of the dynamics. Thus, the model partitions each of the ellipsoids to obtain a finer description. We can see that embedding the system with a hierarchical treestructured prior allows for the children to build off its parent’s approximations.
5.3 Neural Data
To validate the model and inference procedure, we used the neural spike train data recorded from the primary visual cortex of an anesthetized macaque monkey collected by Graf et al. (2011). The dataset is composed of short trials where the monkey viewed periodic temporal pattern of motions of 72 orientations, each repeated 50 times. Dimensionality reduction of the dataset showed that for each orientation of the drifting grating stimulus, the neural response oscillates over time, but in a stimulus dependent geometry captured in 3dimensions (Zhao & Park, 2017). We used 50 trials each from a subset of 4 stimulus orientations grouped in two (140 and 150 degrees vs. 230 and 240 degrees) where each trial contained 140 neurons. Out of the 140 neurons, we selected 63 welltuned neurons. The spike trains were binarized with a 10 ms window for Bernoulli observation model and we truncated the onset and offset neural responses, resulting in 111 time bins per trial.
We fit TrSLDS with leaf nodes and 3dimensional continuous latent space; the sampler was run for 500 samples where the last sample was used to produce the results shown in Fig. 4. To obtain an initial estimate for , we smoothed the spike trains using a Gaussian kernel and performed PPCA on the smoothed spike trains.
From Fig. 4, it is evident that TrSLDS has learned a multiscale view as expected. It is able to correctly distinguish between the two groups of orientations by assigning them to two different subtrees (greenyellow vs. redorange). The leaf nodes of each subtree refines the periodic orbit further. From Fig. 4, we can see that TrSLDS also learns two limit cycles that are separated.
6 Conclusion
In this paper, we propose treestructured recurrent switching linear dynamical systems (TrSLDS) which is an extension of rSLDS (Linderman et al., 2017). The system relies on the use of treestructured stickbreaking to partition the space. The treestructured stickbreaking paradigm naturally lends itself to imposing a hierarchical prior on the dynamics that respects the tree structure. This treestructured prior allows for a multiscale view of the system where one can query at different levels of the tree to see different scales of the resolution. We also developed a fully Bayesian sampler, which leverages the PólyaGamma augmentation, to learn the parameters of the model and infer latent states. The two synthetic experiments show that TrSLDS can recover a multiscale view of the system, where the resolution of the system increase as we delve deeper into the tree. The analysis on the real neural data verifies that TrSLDS can find a multiscale structure.
References
 Ackerson & Fu (1970) Guy A Ackerson and KingSun Fu. On state estimation in switching environments. IEEE Transactions on Automatic Control, 15(1):10–17, 1970.
 Adams et al. (2010) Ryan P Adams, Zoubin Ghahramani, and Michael I Jordan. TreeStructured Stick Breaking for Hierarchical Data. In J D Lafferty, C K I Williams, J ShaweTaylor, R S Zemel, and A Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 19–27. Curran Associates, Inc., 2010.
 Barber (2006) David Barber. Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems. Technical report, 2006.
 Barber et al. (2011) David Barber, A Taylan Cemgil, and Silvia Chiappa. Bayesian time series models. Cambridge University Press, 2011.
 Brooks et al. (2011) Steve Brooks, Andrew Gelman, Galin Jones, and XiaoLi Meng. Handbook of Markov Chain Monte Carlo. CRC press, 2011.
 Chang & Athans (1978) ChawBing Chang and Michael Athans. State estimation for discrete systems with switching parameters. IEEE Transactions on Aerospace and Electronic Systems, (3):418–425, 1978.
 Chung et al. (2015) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
 Djuric & Bugallo (2006) Petar Djuric and Mónica Bugallo. CostReference Particle Filtering for Dynamic Systems with Nonlinear and Conditionally Linear States, 9 2006.
 Doucet et al. (2001) Arnaud Doucet, Nando Freitas, and Neil Gordon. An Introduction to Sequential Monte Carlo Methods. In Sequential Monte Carlo Methods in Practice, pp. 3–14. Springer New York, New York, NY, 2001. doi: 10.1007/9781475734379{_}1.
 Erosheva & Curtis (2017) Elena A. Erosheva and S. McKay Curtis. Dealing with Reflection Invariance in Bayesian Factor Analysis. Psychometrika, 82(2):295–307, 6 2017. ISSN 00333123. doi: 10.1007/s113360179564y.
 Fox et al. (2009) Emily Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. Nonparametric Bayesian Learning of Switching Linear Dynamical Systems. In D Koller, D Schuurmans, Y Bengio, and L Bottou (eds.), Advances in Neural Information Processing Systems 21, pp. 457–464. Curran Associates, Inc., 2009.
 Frigola et al. (2014) Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian Process StateSpace Models. In Z Ghahramani, M Welling, C Cortes, N D Lawrence, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 3680–3688. Curran Associates, Inc., 2014.
 Gao et al. (2016) Yuanjun Gao, Evan W Archer, Liam Paninski, and John P Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in neural information processing systems, pp. 163–171, 2016.
 Geweke & Zhou (1996) John Geweke and Guofu Zhou. Measuring the Pricing Error of the Arbitrage Pricing Theory. Review of Financial Studies, 9(2):557–587, 4 1996. ISSN 08939454. doi: 10.1093/rfs/9.2.557.
 Ghahramani & Hinton (1996) Zoubin Ghahramani and Geoffrey E Hinton. Switching statespace models. Technical report, University of Toronto, 1996.
 Graf et al. (2011) Arnulf B. Graf, Adam Kohn, Mehrdad Jazayeri, and J. Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–245, February 2011. ISSN 15461726. doi: 10.1038/nn.2733.
 Hamilton (1990) James D Hamilton. Analysis of time series subject to changes in regime. Journal of econometrics, 45(1):39–70, 1990.
 Haykin (2001) Simon S Haykin. Kalman Filtering and Neural Networks. John Wiley & Sons, Inc., New York, NY, USA, 2001. ISBN 0471369985.
 Izhikevich (2007) Eugene M Izhikevich. Dynamical systems in neuroscience. MIT press, 2007.
 Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
 Krishnan et al. (2017) Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. 2017.
 Lakshminarayanan (2016) Balaji Lakshminarayanan. Decision Trees and Forests: A Probabilistic Perspective. Technical report, UCL (University College London), 2016.
 Linderman et al. (2015) Scott Linderman, Matthew Johnson, and Ryan P Adams. Dependent Multinomial Models Made Easy: StickBreaking with the Polyagamma Augmentation. In C Cortes, N D Lawrence, D D Lee, M Sugiyama, and R Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 3456–3464. Curran Associates, Inc., 2015.
 Linderman et al. (2017) Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Aarti Singh and Jerry Zhu (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 914–922, Fort Lauderdale, FL, USA, 9 2017. PMLR.
 Murphy (1998) Kevin P Murphy. Switching Kalman filters. Technical report, Compaq Cambridge Research, 1998.
 Pandarinath et al. (2018) Chethan Pandarinath, Daniel J O’Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, et al. Inferring singletrial neural population dynamics using sequential autoencoders. Nature methods, pp. 1, 2018.
 Polson et al. (2013) Nicholas G Polson, James G Scott, and Jesse Windle. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013. doi: 10.1080/01621459.2013.829001.
 Särkkä (2013) Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.
 Stanculescu et al. (2014) Ioan Stanculescu, Christopher KI Williams, and Yvonne Freer. A hierarchical switching linear dynamical system applied to the detection of sepsis in neonatal condition monitoring. In UAI, pp. 752–761, 2014.
 Sussillo et al. (2016) David Sussillo, Rafal Józefowicz, L. F Abbott, and Chethan Pandarinath. LFADS  Latent Factor Analysis via Dynamical Systems. CoRR, abs/1608.06315, 2016.
 Zhao & Park (2016) Yuan Zhao and Il Memming Park. Interpretable nonlinear dynamic modeling of neural trajectories. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Zhao & Park (2017) Yuan Zhao and Il Memming Park. Variational Latent Gaussian Process for Recovering SingleTrial Dynamics from Population Spike Trains. Neural Computation, 29(5), May 2017. doi: 10.1162/NECO_a_00953.
 Zhao & Park (2018) Yuan Zhao and Il Memming Park. Variational joint filtering. arXiv, abs/1707.09049, 2018.
 Zoeter & Heskes (2003) Onno Zoeter and Tom Heskes. Hierarchical visualization of timeseries data using switching linear dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1202–1214, 2003.
Appendix A Proof of Theorem 1
Proof.
Let be a balanced binary tree with leaf nodes. To show that the models are equal, it suffices to show the equivalence of the likelihood and the prior between models. For compactness, we drop the affine term and the operator. The likelihood of TrSLDS is
(26) 
and the likelihood of the residual model is
(27) 
where is defined in eq. (16). Substituting into eq. (27) equates the likelihoods. All that is left to do is to show the equality of the priors.
Appendix B Details on Bayesian Inference
b.1 Handling Bernoulli Observations
Suppose the observation of the system at time follows
(33)  
(34) 
where , . Equation 33 is of the same form as the left hand side of eq. (20), thus it is amenable to PG augmentation. We introduce PG axillary variables . Conditioning on , eq. (33) becomes
(35)  
(36)  
(37) 
where , and .
The observation is now effectively Gaussian and can be incorporated into the message passing for . The emission parameters are also conjugate with the augmented observation potential given a Matrix Normal prior. The conditional posterior on the axillary PG variables also follows a PG distribution i.e. . Note that this augmentation scheme can also work for negative binomial, binomial, and multinomial observations (Polson et al., 2013; Linderman et al., 2015).
b.2 Message Passing for
Assuming that the observations, , are linear and Gaussian, the posterior of the continuous latent states, , conditioned on all the other variables is proportional to
(38) 
where is the potential of the conditionally linear dynamics, is the potential of the observation and is the recurrence potential. is a product of all the internal nodes traversed at time
(39) 
If the potentials in eq. (38) were all linear and Gaussian, then we could efficiently sample from the posetrior of by passing messsages forward through Kalman Filtering and then sampling backwards; the prescence of the recurrence potentials prevent this because they aren’t Gaussian. By augmenting the model with the PG r.v.’s, the recurrence potential at internal node becomes
(40) 
which is effectively Gaussian , allowing for the use of the Kalman filter for message passing.
Appendix C Initialization
We initialized the Gibbs sampler using the following initialization procedure: (i) probabilistic PCA was performed on the data, to initialize the emission parameters, and the continuous latent states, . (ii) To initialize the dynamics of the nodes ,, and the hyperplanes, , we propose greedily fitting the proposed model using MSE as the loss function. We first optimize over the root node
(41) 
and obtain (Note that can obtained in closed form by computing their corresponding OLS estimates). Fixing and , we then optimize over the second level in the tree
(42)  
(43)  
(44) 
This procedure would continue until we reach the leaf nodes of the tree. and are then used to initialize the dynamics and the hyperplanes, respectively. In our simulations, we used stochastic gradient descent with momentum to perform the optimization. (iii) The discrete latent states, , were initialized by performing hard classification using and the initial estimate of .
Appendix D Dealing with rotational invariance
A well known problem with these types of model is it’s susceptibility to rotational and scaling transformation, thus we can only learn the dynamics up to an affine transformation Erosheva & Curtis (2017). During Gibbs sampling the parameters will continuously rotate and scale, which can slow down the mixing of the chains. One possible solution to the issue is if we constrained to have some special structure which would make the model identifiable; this would require sampling from manifolds which is usually inefficient. Similar to Geweke & Zhou (1996), we use the following procedure to prevent the samples from continuously rotating and scaling:

Once we obtain a sample from the conditional posterior of the emission parameters , we normalize the columns of .

RQ decomposition is performed on to obtain where is an upper triangular matrix and is an orthogonal matrix.

We set and rotate all the parameters of the model using .
Appendix E Scalability and Computational Complexity of the Inference
The rSLDS and the TrSLDS share the same linear time complexity for sampling the discrete and continuous states, and both models learn K1 hyperplanes to weakly partition the space. Specifically, both models incur: an cost for sampling the discrete states, which increases to if we allow Markovian dependencies between discrete states; an cost (D is the continuous state dimension) for sampling the continuous states, just like in a linear dynamical system; and an cost for sampling the hyperplanes. The only additional cost of the TrSLDS stems from the hierarchical prior on state dynamics. Unlike the rSLDS, we impose a treestructured prior on the dynamics to encourage similar dynamics between nearby nodes in the tree. Rather than sampling K dynamics parameters, we need to sample 2K1. Since they are all related via a treestructured Gaussian graphical model, the cost of an exact sample is just as in the rSLDS, with the only difference being a constant factor of about 2. Thus, we obtain a multiscale view of the underlying system with a negligible effect on the computational complexity.
To see how the number of discrete latent states effects the convergence speed of the Gibbs sampler, we fit 3 TrSLDS, with respectively, to a Lorenz Attractor described in Sec. but used 250 trajectories to train the model as opposed to 50. To assess convergence, we plotted the logarithm of the joint density as a function of Gibbs samples. The results are shown Fig. 5.
Appendix F Synthetic NASCAR^{®}
We ran the TrSLDS on the synthetic NASCAR^{®} example from (Linderman et al., 2017), where the underlying model is an rSLDS. We trained TrSLDS on 10,000 time points and ran Gibbs for 1,000 samples; the last sample was used to create Fig. 6 where the vector fields where created from the mode of the conditional posterior of the dynamics. Even though the partitions were created using sequential stickbreaking, TrSLDS was able to reconstruct the dynamics of the model.
Appendix G Tree Synthetic NASCAR^{®}
To check whether the sampler is mixing adequately, we test TrSLDS on a twist on the synthetic NASCAR^{®} example where the underlying model is a TrSLDS. We also ran rSLDS on the example to highlight the limitations of seqeuntail stickbreaking. We trained both rSLDS and TrSLDS on 20,000 time points and Gibbs was ran for 1,000 samples; the last sample was used to create Fig. 7. To compare the precditive performance between the models, the was computed for rSLDS and TrSLDS.