Linear Dynamics: Clustering without identification

Linear Dynamics: Clustering without identification

Chloe Ching-Yun Hsu
University of California, Berkeley
This work was done at Google.
   Michaela Hardt1
   Moritz Hardt1
University of California, Berkeley
1footnotemark: 1

Clustering time series is a delicate task; varying lengths and temporal offsets obscure direct comparisons. A natural strategy is to learn a parametric model for each time series and to cluster the model parameters rather than the sequences themselves. Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical systems is a venerable task, permitting provably efficient solutions only in special cases. In this work, we show that clustering the parameters of unknown linear dynamical systems is, in fact, easier than identifying them. We analyze a computationally efficient clustering algorithm that enjoys provable convergence guarantees under a natural separation assumption. Although easy to implement, our algorithm is general, handling multi-dimensional data with time offsets and partial sequences. Evaluating our algorithm on both synthetic data and real electrocardiogram (ECG) signals, we see significant improvements in clustering quality over existing baselines.

1 Introduction

Unlabeled time-series data arise in a wide range of application domains, such as sensor data from homes, hospitals, particle accelerators, oceans, and space. Clustering is a useful tool to explore such data and discover patterns. However, clustering time series is a challenging task. Standard measures of similarity, such as the Euclidean distance, commonly used for clustering static data, fail to account for shifts and variable lengths of different time series. Clustering instead the learned parameters of a dynamic model naturally overcomes these limitations, but often the underlying models can be difficult to learn.

In this work, we consider clustering based on the eigenvalues of linear dynamical systems. Linear dynamical systems (LDS) are a simple yet general choice as the hidden generative model for time-series data. Many machine learning models can be unified as special cases of linear dynamical systems, including principal component analysis (PCA), mixtures of Gaussian clusters, Kalman filter models, and hidden Markov models [74]. Although important, linear system identification has provably efficient solutions only in special cases, see e.g. [40, 41, 39, 79]. In practice, the expectation–maximization (EM) algorithm [74, 78, 29] is widely used for parameter estimation, but it is inherently non-convex and can get stuck in local minima [40]. Even when identification is hard, is there still hope to find meaningful clusters of linear systems without fully learning the systems? We provide a positive answer to this question.

Contributions. We observe that for clustering purposes, linear systems can be viewed as equivalent up to change of basis. A suitable similarity measure for time series is therefore the -distance of the eigenvalues from the underlying LDSs. To find the eigenvalues, we utilize a fundamental correspondence between linear systems and Autoregressive-Moving-Average (ARMA) models. Specifically, our bi-directional perturbation bounds prove that two LDSs have similar eigenvalues if and only if their generated time series have similar auto-regressive parameters. Based on a consistent estimator for the autoregressive model parameters of an ARMA model [89], we propose a regularized iterated least-squares regression method to estimate the LDS eigenvalues. Our method runs in time linear in the sequence length and converges to true eigenvalues at the rate .

This gives rise to a simple new approach to clustering time series: First use regularized iterated least-squares regression to fit the ARMA model; then cluster the fitted AR parameters.

We carry out detailed experiments on synthetic and real ECG data to compare our approach to strong baselines, including commonly used model-agnostic clustering of time series, dynamic time warping [18] and k-Shape [69], and alternative approaches to clustering based on LDS, AR and ARMA model parameters as well as PCA. We find that our approach yields clusters of superior quality. Moreover, it is very practical due to its simple implementation (<100 lines of Python code) and linear running time, which set it apart from dynamic time warping with a running time quadratic in the sequence length and alternative approaches to clustering based on LDS/ARMA parameters that often fail to converge.

Organization. We review LDS and ARMA models in Sec. 3. In Sec. 4 we show that outputs from any -dimensional LDS with hidden inputs can be equivalently generated by an ARMA() model, and that the autoregressive parameters from the ARMA() model can be used to provably learn the LDS eigenvalues. In Sec. 5, we present the regularized iterated regression algorithm, a consistent estimator of autoregressive parameters in ARMA models with applications to clustering. We carry out a series of experiments on simulated synthetic data and real ECG data to compare the clustering performance of our method to baselines in Sec. 6. In the appendix, we describe generalizations of our algorithm to observable inputs and multidimensional outputs, and include additional simulation results.

2 Related Work

Linear dynamical system identification. The LDS identification problem has been studied since Kalman [50] in the 60s, yet the theoretical bounds are still not fully understood. Recently developed provably efficient algorithms  [40, 41, 39, 79, 22] require setups not well-suited for clustering. Hazan et al. [40, 41] proposed spectral filtering algorithms that minimize prediction error and in general do not identify system parameters. Hardt et al. [39] show that gradient descent can identify system parameters under a strong assumption on the roots of the system. Simchowitz et al. [79] and Dean et al. [22] proved new bounds for the LDS identification problem, with the assumption of observable states. We focus on the case of non-observable hidden states that is common for time-series data.

Tsiamis et al. [90] recently provided a subspace identification algorithm that recovers system parameters with a non-asymptotic error convergence rate . Their focus is on new theoretical finite sample complexity bounds, and the proposed algorithm is complex to implement. Our method has the same asymptotic error convergence rate and is much simpler in concept and in implementation.

Linear dynamical system distance. The control systems community [38] have long studied the distance notion between LDSs, but most approaches are computationally expensive. More recently, there have been new definitions of LDS distance from the computer vision community based on the Kullback-Leibier (KL) divergence [12], Binet-Cauchy Kernels [94], cepstra [21, 62] and group theory [1], for applications in clustering video trajectories. Due to the nature of computer vision data, these approaches assume huge output dimension of 1000s - 10000s, while we focus on the case of single or a few output dimensions as typical in time series datasets.

Time series clustering. Time-series clustering has been extensively studied in many areas including biology, climatology, energy consumption, finance, medicine, robotics, and voice recognition [2, 60]. Most existing techniques use one of the three major approaches: raw-data-based (e.g. [72, 10, 66, 13, 14, 95, 5, 54, 30, 92, 55, 64, 75]), feature-based (e.g. [77, 32, 27, 97, 96] ), or model-based (e.g. [52, 98, 3, 80, 59, 68, 46, 15, 9, 61, 58, 71]). Raw-data based approaches have the downside of working directly with noisy high dimensional data, while feature-based approaches require domain-specific feature extraction. Model-based approaches have the drawback that underlying models might be hard to learn.

Our approach falls into the model-based approach category, with linear dynamical system as our model. In model-based approaches, each time series is assumed to be generated by some parametric model, and similarity of time series is defined over the similarity of their model parameters. Common choices of the model include Gaussian mixture models [9, 88], ARIMA models [52, 98, 61, 71, 46], and hidden Markov models [80, 68, 8, 43]. Since Gaussian mixture models, ARIMA models, and hidden Markov models are all special cases of the more general linear dynamical system model [74], our work is a highly general model-based approach for time-series clustering.

At first glimpse, our approach might seem similar to AR- and ARMA-based clustering, but it is based on only half of the ARMA parameters, i.e. the autoregressive parameters, as we show that changing the moving-average parameters stays within an equivalence class of LDSs. Compared to ARMA-model-based clustering, with hard to estimate parameters, our approach enjoys the benefit of reliable convergence. We also differ from AR-model-based clustering because fitting a pure AR model to ARMA processes results in biased estimates. To the best of our knowledge, this is the first work to propose time-series clustering based on estimating only the AR parameters in ARMA.

Autoregressive parameter estimation. In linear systems and control literature, there are known methods for estimating the AR parameters in ARMA models, including high-order Yule-Walker (HOYW), MUSIC, and ESPRIT [83, 82, 84, 81, 47]. Our method for AR parameter estimation is based on the iterated regression technique first proposed by Tsay et. al. [89]. This method is related to the Two–Stage Least Squares ARMA Method [63, 83], while a major difference is that the iterated regression method is a consistent estimator for a fixed degree of the AR part, whereas the Two–Stage Least Squares ARMA method is only consistent as (i.e. asymptotic bias tends to 0 as ).

Compared to HOYW, MUSIC, ESPRIT, and other spectral analysis methods, perhaps a more important difference is that we generalize the iterated regression method to estimate the AR part of ARMAX with observed exogenous inputs (see Appendix C). While the iterated regression based method naturally extend to ARMAX, it is unclear how to extend spectral analysis methods for exogenous inputs to our knowledge.

3 Preliminaries

In this section we review LDS and ARMA and state our assumptions.

3.1 Linear dynamical systems

Formally, a discrete-time linear dynamical system (LDS) with parameters receives inputs , has hidden states , and generates outputs according to the following time-invariant recursive equations:


Assumptions. We assume that the stochastic noise is sampled from . When the inputs are hidden, as commonly the case in practice, we assume to be i.i.d. Gaussians.

The model equivalence theorem (Theorem 4.1) and the general version of the approximation theorem (Theorem 4.2) do not require any additional assumptions and hold for any real matrix . Under the additional assumption that only has simple eigenvalues in , i.e. each eigenvalue has multiplicity 1, we give a better convergence bound. This is still a weak assumption compared to prior work.

Distance between linear dynamical systems. For clustering purposes, since we want to capture how the systems evolve, the LDSs can be viewed as equivalent up to change of basis. Therefore, we define distance based on the spectrum of the transition matrix . We define the distance as , where and are the spectrum of and listed in an order that minimizes the distance. This distance definition satisfies non-negativity, identity, symmetry, and triangle inequality.

3.2 Autoregressive-moving-average models

The autoregressive-moving-average (ARMA) model is a common tool for time-series analysis. It captures two aspects of dependencies in time series by combining the autoregressive (AR) model and the moving-average (MA) model. The AR part involves regressing the variable with respect its lagged past values, while the MA part involves regressing the variable against past error terms.

Autoregressive model. The AR model describes how the current value in the time series depends on the lagged past values. For example, if the GDP realization is high this quarter, the GDP in the next few quarters are likely high as well. An autoregressive model of order , noted as AR, depends on the past steps,

where are autoregressive parameters, is a constant, and is white noise.

When the errors are normally distributed, the ordinary least squares (OLS) regression is a conditional maximum likelihood estimator for AR models yielding optimal estimates [25].

Moving-average model. The MA model, on the other hand, captures the delayed effects of unobserved random shocks in the past. For example, changes in winter weather could have a delayed effect on food harvest in the next fall. A moving-average model of order , noted as MA, depends on unobserved lagged errors in the past steps,

where are moving-average parameters, is a constant, and the errors are white noise.

ARMA model. The autoregressive-moving-average (ARMA) model, denoted as ARMA, merges AR and MA models to consider dependencies both on past time series values and past unpredictable shocks,

The ARMA model can be generalized to handle exogenous inputs, as discussed in Appendix C.

Estimating ARMA models is significantly harder than AR, since the model depends on unobserved variables and the maximum likelihood equations are intractable [24, 16]. Maximum likelihood estimation (MLE) methods are commonly used for fitting ARMA, but have converge issues. Although regression methods are also used in practice, OLS is a biased estimator for ARMA models [87].

4 Learning eigenvalues without system identification

In this section, we provide theoretical foundations for learning LDS eigenvalues from autoregressive parameters without full system identification.

Model equivalence and characteristic polynomial.

As the theoretical foundation for our approach for learning eigenvalues, we prove that outputs from any LDS can be equivalently generated by an ARMA model, and that the AR parameters in the ARMA model contain full information about the LDS eigenvalues.

Theorem 4.1.

The outputs from an -dimensional linear dynamical system with hidden inputs can be equivalently generated by an ARMA model. Furthermore, the characteristic polynomial of the transition matrix in the LDS can be recovered by , where are the autoregressive parameters in the ARMA model.

We prove a generalized version of Theorem 4.1 in Appendix A. Theorem A.1 is generalized to incorporate observed inputs through an ARMAX() model with exogenous inputs.

While general model equivalence between LDS and ARMA/ARMAX is well known [49, 4], we are not aware of prior detailed analysis of the exact correspondence between the characteristic polynomial of LDS and the ARMA/ARMAX autoregressive parameters along with perturbation bounds.

Note that the converse of Theorem 4.1 also holds. An ARMA() model can be seen as a -dimensional LDS with the relevant past values and error terms in the hidden state, .

An immediate corollary of Theorem 4.1 is that the autoregressive parameters of system outputs contain full information about all non-zero eigenvalues of the system.

Corollary 4.1.

The output series of two linear dynamical systems have the same autoregressive parameters if and only if they have the same non-zero eigenvalues with the same multiplicities.


By Theorem 4.1, the autoregressive parameters are determined by the characteristic polynomial. Two LDSs of the same dimension have the same autoregressive parameters if and only if they have the same characteristic polynomials, and hence the same eigenvalues with the same multiplicities. Two LDSs of different dimensions can have the same autoregressive parameters if and only if and , in which case they have the same non-zero eigenvalues with the same multiplicities. ∎

Approximation theorem for LDS eigenvalues.

We show that small error in the AR parameter estimation guarantees small error in eigenvalue estimation. This implies that effective estimation algorithm for the AR parameters in ARMA models leads to effective estimation of LDS eigenvalues.

Theorem 4.2.

Let be the outputs from an -dimensional linear dynamical system with parameters , eigenvalues , and hidden inputs. Let be the estimated autoregressive parameters for with error , and let be the roots of the polynomial .

Without any additional assumption on , the roots converge to the true eigenvalues with convergence rate . If all eigenvalues of are simple (no multiplicity), then the convergence rate is .

When the LDS has all simple eigenvalues, we provide a more explicit bound on the condition number.

Theorem 4.3.

In the same setting as above in Theorem 4.2, when all eigenvalues of are simple, , then the condition number is bounded by

where is the spectral radius, i.e. largest absolute value of its eigenvalues.

In particular, when , i.e. when the matrix is Lyapunov stable, then the absolute difference between the root from the auto-regressive method and the eigenvalue is bounded by

We defer the proofs to Appendix B.

5 Estimation of ARMA autoregressive parameters

In general, learning ARMA models is hard, since the output series depends on unobserved error terms. Fortunately, for our purpose we are only interested in the autoregressive parameters, that are easier to learn since the past values of the time series are observed.

Note that the autoregressive parameters in an ARMA() model are not equivalent to the pure AR() parameters for the same time series. For AR() models, ordinary least squares (OLS) regression is a consistent estimator of the autoregressive parameters [56]. However, for ARMA() models, due to the serial correlation in the error term , the OLS estimates for autoregressive parameters can be biased [87].

Regularized iterated regression. Tsay et al. proposed iterated regression as a consistent estimator for the autoregressive parameters in ARMA models in the 80s [89]. While their method is theoretically well-grounded, it tends to over-fit and result in large parameters on sequence lengths in the common range of 100s - 1000s in our experiments. To resolve this issue, we propose a regularized iterated regression method, which has the same theoretical guarantees but better practical performance.

We can also generalize the method to handle multidimensional outputs from the LDS and observed inputs by using ARMAX instead of ARMA models. We leave the full description of the more general algorithm to Appendix C as Algorithm 2, and describe a simpler version in Algorithm 1.

Input: Time series , target hidden state dimension , and regularization coefficient .
Initialize error term estimates for ;
for  do
       Perform -regularized least squares regression to estimate , , and in with regularization strength only on the terms;
       Update to be the residuals;
end for
Return .
Algorithm 1 Regularized iterated regression for autoregressive parameter estimation

An important detail to notice is that the -th iteration of the regression only uses error terms from the past lags. In other words, the initial iteration is an ARMA() regression, the first iteration is an ARMA() regression, and so forth until ARMA() in the last iteration.

Time complexity. The iterated regression involves steps of least squares regression each on at most variables. Therefore, the total time complexity of Algorithm 1 is , where is the sequence length and is the hidden state dimension.

Convergence rate. Tsay et al. [89] proved the consistency and the convergence rate of iterated regression for estimating autoregressive parameters in ARMA processes. Adding regularization does not change the asymptotic property of the estimator.

Theorem 5.1 ([89]).

Suppose that is an ARMA() process, stationary or not. The estimated autoregressive parameters from iterated regression converges in probability to the true parameters with rate

or more explicitly, convergence in probability means that for all ,

The main application discussed here is time-series clustering. Other than clustering, the regularized iterated regression method for estimating AR parameters in ARMA/ARMAX also has potential applications as a feature engineering step for supervised time series classification and prediction tasks, or as a preliminary estimation step for initialization in full LDS system identification.

5.1 Applications to clustering

Clustering time series requires a definition of appropriate distance measure. Under the fairly general assumption that the time series are generated by latent LDSs, we can then define distance between time series as the eigenvalue distance between their latent LDSs. The assumption of a latent LDS is general since many classic models such as PCA, mixtures of Gaussians, and hidden Markov models are special cases of LDSs.

To estimate the LDS distance, we claim that the distance between autoregressive parameters can approximate the LDS distance. Theorem 4.2 shows that small autoregressive parameter distance implies small eigenvalue distance. The converse of Theorem 4.2 is also true, i.e. dynamical systems with small eigenvalue distance have small autoregressive parameter distance, which follows from perturbation bounds for characteristic polynomials [45]. In Appendix D.2 we show simulation results that the two distances have an approximately linear relation with high correlation.

Therefore, we propose a simple new time series clustering algorithm: 1) first use iterated regression to estimate the autoregressive parameters in ARMA models for each times series, and 2) then apply any standard clustering algorithm such as K-means on the distance between autoregressive parameters.

Our method is very flexible. It handles multi-dimensional data and exogenous inputs as illustrated in Algorithm 2. It is scale, shift, and offset invariant, as the autoregressive parameters in ARMA models are. It accommodates missing values in partial sequences as we can drop missing values in OLS. It also allows sequences to have different lengths, and could be adapted to handle sequences with different sampling frequencies, as the compound of multiple steps of LDS evolution is still linear.

6 Experiments

We experimentally evaluate the quality and efficiency of the clustering from our method and compare it to existing baselines. All experiments are carried out on an instance with 6 vCPUs, 15 GB memory running Debian GNU/Linux 9. We start out with simulated data in Sec. 6.3 satisfying the assumptions of our method. Then we turn to real ECG data in Sec. 6.4 to see if our method works in practice.

6.1 Methods

We compare the following 6 approaches including model-free and model-based clustering approaches.

  • [noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt]

  • ARMA: K-means on AR parameters in ARMA() model estimated by regularized iterated regression in Algorithm 1.

  • AR: K-means on AR parameters in AR() model estimated by OLS [76].

  • LDS: K-means on estimated LDS eigenvalues learned via Gibbs sampling in pylds [48].

  • PCA: K-means on PCA, mapping raw series into -dimensional space, using sklearn [70].

  • k-Shape: A shape-based time series clustering method [69] based on normalized cross-correlation with a scalable iterative refinement procedure, using tslearn [86].

  • DTW: K-means on soft dynamic time warping distance [18], using tslearn [86].

  • GAK: K-means on Global Alignment Kernel [19, 23], using tslearn [86].

Alternative methods attempted and omitted. We tried estimating AR parameters in ARMA model by the MLE method in statsmodels [76], which results in convergence failures on around 20% of time series and drastically worse performance (see Table 4 in Appendix D.2). For learning LDS eigenvalues, in addition to Gibbs sampling, we also tried the MLE method with Kalman filtering in statsmodels [76], and similarly omit it due to convergence failures on around 30% of time series and worse performance. We also tried K-means clustering directly based on raw time series, which resulted in worse performance than PCA, so we omit it.

6.2 Metrics of Cluster Quality

We measure the quality of learned clusters by comparing them to the ground truth cluster labels using the adjusted mutual information (AMI) score [93] implemented in sklearn [86]. AMI adjusts the mutual information score to account for the number of clusters that tends to increase MI. The AMI metric is symmetric and independent of permutation of label values. Additional metrics we consider are the adjusted Rand score [44] and V-measure [73] shown in Section 6.4 and Appendix D.2.

6.3 Simulation

We generate data following the assumptions behind our method. We study clustering quality and efficiency across methods and provide a deeper dive into the underlying eigenvalue estimation.

Dataset. We generate LDSs representing cluster centers with random matrices of standard Gaussians with a spectral radius of at most 1. We require a certain minimum distance between two cluster centers. For each cluster center, we derive sufficiently close LDSs. From those, we generate time series of length 1000 by drawing inputs from standard Gaussians and adding noise to the output sampled from . More details are provided in Appendix D.1.

Performance of Clustering. Our iterated ARMA regression method yields the best clustering quality as measured by the adjusted mutual information (AMI), and is significantly faster than k-Shape and LDS, as shown in Table 6.3. In Appendix D.2 we show that these results hold up for other metrics and choices of sequence length and the number of clusters. 111We omit DTW and GAK from Table 6.3 as DTW takes 10000+ secs and GAK takes 1000+ secs per run for 100 series of length 1000 and result in worse AMI. See Table 2 for DTW and GAK performance on ECG data.

3 true clusters 10 true clusters
Method AMI Runtime (secs) AMI Runtime (secs)
ARMA 0.14 (0.12-0.16) 0.37 (0.35-0.40) 0.24 (0.23-0.25) 0.33 (0.30-0.35)
AR 0.09 (0.08-0.10) 1.51 (1.42-1.59) 0.21 (0.20-0.22) 1.11 (1.03-1.19)
LDS 0.07 (0.05-0.09) 182.2 (164.6-199.8) 0.18 (0.17-0.19) 114.5 (107.9-121.1)
PCA 0.00 (-0.00-0.00) 0.04 (0.03-0.04) 0.01 (0.00-0.01) 0.05 (0.05-0.05)
k-Shape 0.06 (0.05-0.07) 18.5 (16.9-20.2) 0.07 (0.06-0.08) 48.1 (40.4-55.9)
Table 1: Performance for clustering 100 random 2-dimensional LDSs generating time series of length 1000. The 95% confidence intervals are derived from standard deviation on 100 random samples of 100 time series. AMI is the adjusted mutual information score between ground truth cluster labels and learned cluster labels.

Eigenvalue Estimation. Good clustering results rely on good approximations of the LDS eigenvalue distance. Our analyses in Theorem 5.1 and Theorem 4.2 proved that the iterative ARMA regression algorithm can learn the LDS eigenvalues with converge rate . In Figure 1, we see that the observed convergence rate in simulations matches the theoretical bound.



Figure 1: Convergence rate of absolute -error in eigenvalue estimation. The x-axis is the inverse square root of sequence length, i.e. , corresponding to = 448, 744, 1467, 4133, and 40000.

Compared to LDS and AR, the ARMA method with iterative regression achieves the lowest error for most configurations. When the sequences are too short, the AR method gives better results, although its estimation is biased. When the sequences are very long (), the error from LDS gets closer to that of the ARMA regression, but LDS is orders of magnitudes slower.

6.4 Experiments on real-world data

In this section we test our method on a real electrocardiogram (ECG) dataset. While our simulation results show the efficacy of our method, a lot of assumptions are satisfied by the data generation process that may not hold on real data.

Method AMI Adj. Rand Score V-measure Runtime (secs)
ARMA 0.12 (0.10-0.13) 0.12 (0.11-0.14) 0.14 (0.12-0.16) 0.07 (0.07-0.08)
AR 0.10 (0.09-0.12) 0.10 (0.09-0.12) 0.13 (0.11-0.14) 0.27 (0.25-0.29)
PCA 0.04 (0.03-0.05) 0.02 (0.02-0.03) 0.07 (0.06-0.08) 0.01 (0.01-0.01)
LDS 0.09 (0.07-0.10) 0.11 (0.10-0.13) 0.10 (0.09-0.11) 14.83 (14.43-15.23)
k-Shape 0.08 (0.07-0.10) 0.09 (0.07-0.11) 0.10 (0.09-0.12) 1.06 (0.80-1.31)
DTW 0.03 (0.02-0.04) 0.02 (0.01-0.03) 0.06 (0.05-0.07) 6.05 (5.63-6.47)
GAK 0.04 (0.03-0.05) 0.04 (0.03-0.05) 0.06 (0.05-0.07) 0.45 (0.45-0.46)
Table 2: Clustering performance on electrocardiogram (ECG) data separating segments of normal sinus rhythm from supraventricular tachycardia. 95% Confidence intervals are from 100 bootstrapped samples of 50 series.

Dataset. We use the MIT-BIH dataset from physionet [65, 31], the most common dataset used to design and evaluate ECG algorithms [99, 67, 11, 85, 53, 20]. It contains 48 half-hour recordings collected at the Beth Israel Hospital between 1975 and 1979. Each two-channel recording is digitized at a rate of 360 samples per second per channel. 15 distinct rhythms are annotated in recordings including abnormalities of cardiac rhythm (arrhythmias) by two cardiologists.

Detecting cardiac arrhythmias has stimulated a lot of research beyond the scope of this paper with product applications such as Apple’s FDA-approved detection of atrial fibrillation [91]. Notably, AR and ARIMA models were applied [51, 28, 17] and more recently convolutional neural networks [37].

We bootstrap 100 samples of 50 time series; each bootstrapped sample consists of 2 labeled clusters: 25 series with supraventricular tachycardia and 25 series with normal sinus rhythm. Each series has length 500 which adequately captures a complete cardiac cycle. We set the ARMA -regularization coefficient to be 0.01, chosen based on our simulation results. For DTW and GAK, further subsampling improves performance so we report the best metrics from subsampling to sequence length 100.

Results. Comparing our method, ARMA, to the methods outlined in Section 6.1 in Table 2, we see that our method achieves the best quality closely followed by the AR method, according to adjusted mutual information, adjusted Rand score and V-measure, while being very efficient at the same time.

To put our real-world experimental results on ECG data into perspective, the adjusted rand score of 0.12 is not too far off from the best score of 0.16 observed in simulation experiments even when LDS is the ground truth (Table 3-4 in Appendix D). Clustering time series based on their underlying dynamics is an inherently challenging task due to the unobserved latent states.

The main improvement of our ARMA method over AR is to correct for the bias in AR due to serial correlation in error terms (see line 194-198). In our ECG experiment the bias effect is small, but may increase further with longer sequence lengths. See Figure 1 in Section 6 and Figure 2 in Section D.2 for simulation results on varying sequence lengths. While the LDS method is only slightly worse than ARMA, it is 200+ times slower, and more complicated to implement.

7 Conclusion

We give a fast, simple, and provably effective method for clustering time series based on the distance between latent linear dynamical systems (LDSs) that is flexible to handle varying lengths, temporal offsets, as well as multidimensional inputs and outputs. Our algorithm combines statistical techniques from the 80’s with new insights on the correspondence between LDSs and ARMA models. Specifically, we show that two LDSs have similar eigenvalues if and only if their generated time series have similar auto-regressive parameters. While LDSs are very general models encompassing mixtures of Gaussian clusters, Kalman filter models, and hidden Markov models, they may not fit all practical applications, and we plan to extend our analysis to non-linear models in the future. Nevertheless, our experiments show that our efficient algorithm yields higher quality clusters compared to various strong baselines not just in simulations but also for real ECG data.


We would like to thank the anonymous reviewers for their thoughtful comments, especially for pointing us to relevant related work.


  • [1] B. Afsari, R. Chaudhry, A. Ravichandran, and R. Vidal (2012) Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2208–2215. Cited by: §2.
  • [2] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah (2015) Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §2.
  • [3] J. Aßfalg, H. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz (2006) Similarity search on time series based on threshold queries. In International Conference on Extending Database Technology, pp. 276–294. Cited by: §2.
  • [4] K. J. Åström and B. Wittenmark (2013) Computer-controlled systems: theory and design. Courier Corporation. Cited by: §4.
  • [5] A. Banerjee and J. Ghosh (2001) Clickstream clustering using weighted longest common subsequences. In Proceedings of the web mining workshop at the 1st SIAM conference on data mining, Vol. 143, pp. 144. Cited by: §2.
  • [6] B. Beauzamy (1999) How the roots of a polynomial vary with its coefficients: a local quantitative result. Canadian Mathematical Bulletin 42 (1), pp. 3–12. Cited by: §B.1.
  • [7] B. Bercu (1995) Weighted estimation and tracking for armax models. SIAM Journal on Control and Optimization 33 (1), pp. 89–106. Cited by: §A.1.
  • [8] M. Bicego, V. Murino, and M. A. Figueiredo (2003) Similarity-based clustering of sequences using hidden markov models. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 86–95. Cited by: §2.
  • [9] C. Biernacki, G. Celeux, and G. Govaert (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE transactions on pattern analysis and machine intelligence 22 (7), pp. 719–725. Cited by: §2, §2.
  • [10] D. Buzan, S. Sclaroff, and G. Kollios (2004) Extraction and clustering of motion trajectories in video. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Vol. 2, pp. 521–524. Cited by: §2.
  • [11] R. Ceylan, Y. Özbay, and B. Karlik (2009) A novel approach for classification of ecg arrhythmias: type-2 fuzzy clustering neural network. Expert Systems with Applications 36 (3), pp. 6721–6726. Cited by: §6.4.
  • [12] A. B. Chan and N. Vasconcelos (2005) Probabilistic kernels for the classification of auto-regressive visual processes. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 846–851. Cited by: §2.
  • [13] L. Chen and R. Ng (2004) On the marriage of lp-norms and edit distance. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pp. 792–803. Cited by: §2.
  • [14] L. Chen, M. T. Özsu, and V. Oria (2005) Robust and fast similarity search for moving object trajectories. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 491–502. Cited by: §2.
  • [15] Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. Tung (2007) Spade: on shape-based pattern detection in streaming time series. In 2007 IEEE 23rd International Conference on Data Engineering, pp. 786–795. Cited by: §2.
  • [16] B. Choi (2012) ARMA model identification. Springer Science & Business Media. Cited by: §3.2.
  • [17] M. Corduas and D. Piccolo (2008) Time series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis 52, pp. 1860–1872. Cited by: §6.4.
  • [18] M. Cuturi and M. Blondel (2017) Soft-dtw: a differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 894–903. Cited by: §1, 6th item.
  • [19] M. Cuturi (2011) Fast global alignment kernels. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 929–936. Cited by: 7th item.
  • [20] P. De Chazal, M. O’Dwyer, and R. B. Reilly (2004) Automatic classification of heartbeats using ecg morphology and heartbeat interval features. IEEE transactions on biomedical engineering 51 (7), pp. 1196–1206. Cited by: §6.4.
  • [21] K. De Cock and B. De Moor (2000) Subspace angles and distances between arma models. In Proc. of the Intl. Symp. of Math. Theory of networks and systems, Vol. 1. Cited by: §2.
  • [22] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2017) On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688. Cited by: §2.
  • [23] I. S. Dhillon, Y. Guan, and B. Kulis (2004) Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 551–556. Cited by: 7th item.
  • [24] J. Durbin (1959) Efficient estimation of parameters in moving-average models. Biometrika 46 (3/4), pp. 306–316. Cited by: §3.2.
  • [25] J. Durbin (1960) Estimation of parameters in time-series regression models. Journal of the Royal Statistical Society: Series B (Methodological) 22 (1), pp. 139–153. Cited by: §3.2.
  • [26] M. E. El-Mikkawy (2003) Explicit inverse of a generalized vandermonde matrix. Applied mathematics and computation 146 (2-3), pp. 643–651. Cited by: §B.3.
  • [27] T. Fu, F. Chung, V. Ng, and R. Luk (2001) Pattern discovery from stock time series using self-organizing maps. In Workshop Notes of KDD2001 Workshop on Temporal Data Mining, pp. 26–29. Cited by: §2.
  • [28] D. Ge, N. Srinivasan, and S. M Krishnan (2002-12) Cardiac arrhythmia classification using autoregressive modeling. Biomedical engineering online 1, pp. 5. External Links: Document Cited by: §6.4.
  • [29] Z. Ghahramani and G. E. Hinton (1996) Parameter estimation for linear dynamical systems. Technical report Technical Report CRG-TR-96-2, University of Totronto, Dept. of Computer Science. Cited by: §1.
  • [30] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, and P. Boesiger (1998) A new correlation-based fuzzy logic clustering algorithm for fmri. Magnetic Resonance in Medicine 40 (2), pp. 249–260. Cited by: §2.
  • [31] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley (2000 (June 13)) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Note: Circulation Electronic Pages: PMID:1085218; doi: 10.1161/01.CIR.101.23.e215 Cited by: §6.4.
  • [32] C. Goutte, P. Toft, E. Rostrup, F. Å. Nielsen, and L. K. Hansen (1999) On clustering fmri time series. NeuroImage 9 (3), pp. 298–310. Cited by: §2.
  • [33] C. W. Granger and M. J. Morris (1976) Time series modelling and interpretation. Journal of the Royal Statistical Society: Series A (General) 139 (2), pp. 246–257. Cited by: §A.1, §A.3, Lemma A.1.
  • [34] A. Greenbaum, R. Li, and M. L. Overton (2019) First-order perturbation theory for eigenvalues and eigenvectors. arXiv preprint arXiv:1903.00785. Cited by: §B.2.
  • [35] L. Guo (1996) Self-convergence of weighted least-squares with applications to stochastic adaptive control. IEEE transactions on automatic control 41 (1), pp. 79–89. Cited by: §A.1.
  • [36] E. J. Hannan, W. T. Dunsmuir, and M. Deistler (1980) Estimation of vector armax models. Journal of Multivariate Analysis 10 (3), pp. 275–295. Cited by: §A.1.
  • [37] A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, M. P. Turakhia, C. Bourn, and A. Y. Ng (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine 15. External Links: Link Cited by: §6.4.
  • [38] B. Hanzon (1989) Identifiability, recursive identification and spaces of linear dynamical systems: part i. CWI Tracts. Cited by: §2.
  • [39] M. Hardt, T. Ma, and B. Recht (2018) Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research 19 (1), pp. 1025–1068. Cited by: §1, §2.
  • [40] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang (2018) Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pp. 4634–4643. Cited by: §1, §2.
  • [41] E. Hazan, K. Singh, and C. Zhang (2017) Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pp. 6702–6712. Cited by: §1, §2.
  • [42] R. Hryniv and P. Lancaster (1999) On the perturbation of analytic matrix functions. Integral Equations and Operator Theory 34 (3), pp. 325–338. Cited by: §B.2.
  • [43] J. Hu, B. Ray, and L. Han (2006) An interweaved hmm/dtw approach to robust time series clustering. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 3, pp. 145–148. Cited by: §2.
  • [44] L. Hubert and P. Arabie (1985) Comparing partitions. Journal of classification 2 (1), pp. 193–218. Cited by: §D.2, §6.2.
  • [45] I. C. Ipsen and R. Rehman (2008) Perturbation bounds for determinants and characteristic polynomials. SIAM Journal on Matrix Analysis and Applications 30 (2), pp. 762–776. Cited by: §5.1.
  • [46] B. H. Jansen, A. Hasman, and R. Lenten (1981) Piecewise analysis of eegs using ar-modeling and clustering. Computers and Biomedical Research 14 (2), pp. 168–178. Cited by: §2, §2.
  • [47] M. Jansson and P. Stoica (2003) Optimal yule walker method for pole estimation of arma signals. IFAC Proceedings Volumes 36 (16), pp. 1891–1895. Cited by: §2.
  • [48] M. Johnson and S. Linderman (2018) PyLDS: bayesian inference for linear dynamical systems. GitHub. Note: Cited by: 3rd item.
  • [49] T. Kailath (1980) Linear systems. Vol. 156, Prentice-Hall Englewood Cliffs, NJ. Cited by: §4.
  • [50] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Journal of basic Engineering 82 (1), pp. 35–45. Cited by: §2.
  • [51] K. Kalpakis, D. Gada, and V. Puttagunta (2001-11) Distance measures for effective clustering of arima time-series. In Proceedings 2001 IEEE International Conference on Data Mining, Vol. , pp. 273–280. External Links: Document, ISSN Cited by: §6.4.
  • [52] K. Kalpakis, D. Gada, and V. Puttagunta (2001) Distance measures for effective clustering of arima time-series. In Proceedings 2001 IEEE international conference on data mining, pp. 273–280. Cited by: §2, §2.
  • [53] M. Korürek and A. Nizam (2010) Clustering mit–bih arrhythmias with ant colony optimization using time domain and pca compressed wavelet coefficients. Digital Signal Processing 20 (4), pp. 1050–1060. Cited by: §6.4.
  • [54] K. Košmelj and V. Batagelj (1990) Cross-sectional approach for clustering time varying data. Journal of Classification 7 (1), pp. 99–109. Cited by: §2.
  • [55] M. Kumar, N. R. Patel, and J. Woo (2002) Clustering seasonality patterns in the presence of errors. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 557–563. Cited by: §2.
  • [56] T. Lai and C. Wei (1983) Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters. Journal of multivariate analysis 13 (1), pp. 1–23. Cited by: §5.
  • [57] P. Lancaster, A. Markus, and F. Zhou (2003) Perturbation theory for analytic matrix functions: the semisimple case. SIAM Journal on Matrix Analysis and Applications 25 (3), pp. 606–626. Cited by: §B.2, Lemma B.2.
  • [58] C. Li and G. Biswas (1999) Temporal pattern generation using hidden markov model based unsupervised classification. In International Symposium on Intelligent Data Analysis, pp. 245–256. Cited by: §2.
  • [59] L. Li and B. A. Prakash (2011) Time series clustering: complex is simpler!. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 185–192. Cited by: §2.
  • [60] T. W. Liao (2005) Clustering of time series data—a survey. Pattern recognition 38 (11), pp. 1857–1874. Cited by: §2.
  • [61] E. A. Maharaj (2000) Cluster of time series. Journal of Classification 17 (2), pp. 297–314. Cited by: §2, §2.
  • [62] R. J. Martin (2000) A metric for arma processes. IEEE transactions on Signal Processing 48 (4), pp. 1164–1170. Cited by: §2.
  • [63] D. Q. Mayne and F. Firoozan (1982) Linear identification of arma processes. Automatica 18 (4), pp. 461–466. Cited by: §2.
  • [64] C. S. Möller-Levet, F. Klawonn, K. Cho, and O. Wolkenhauer (2003) Fuzzy clustering of short time-series and unevenly distributed sampling points. In International Symposium on Intelligent Data Analysis, pp. 330–340. Cited by: §2.
  • [65] G.B. Moody and R.G. Mark (2001-06) The impact of the mit-bih arrhythmia database. IEEE engineering in medicine and biology magazine : the quarterly magazine of the Engineering in Medicine & Biology Society 20, pp. 45–50. External Links: Document Cited by: §6.4.
  • [66] M. D. Morse and J. M. Patel (2007) An efficient and accurate method for evaluating time series similarity. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 569–580. Cited by: §2.
  • [67] Y. Özbay, R. Ceylan, and B. Karlik (2006) A fuzzy clustering neural network architecture for classification of ecg arrhythmias. Computers in Biology and Medicine 36 (4), pp. 376–388. Cited by: §6.4.
  • [68] A. Panuccio, M. Bicego, and V. Murino (2002) A hidden markov model-based approach to sequential data clustering. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 734–743. Cited by: §2, §2.
  • [69] J. Paparrizos and L. Gravano (2015) K-shape: efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1855–1870. Cited by: §1, 5th item.
  • [70] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: 4th item.
  • [71] D. Piccolo (1990) A distance measure for classifying arima models. Journal of Time Series Analysis 11 (2), pp. 153–164. Cited by: §2, §2.
  • [72] C. Piciarelli, G. L. Foresti, and L. Snidaro (2005) Trajectory clustering and its applications for video surveillance. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2005., pp. 40–45. Cited by: §2.
  • [73] A. Rosenberg and J. Hirschberg (2007) V-measure: a conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Cited by: §D.2, §6.2.
  • [74] S. Roweis and Z. Ghahramani (1999) A unifying review of linear gaussian models. Neural computation 11 (2), pp. 305–345. Cited by: §1, §2.
  • [75] H. Sakoe, S. Chiba, A. Waibel, and K. Lee (1990) Dynamic programming algorithm optimization for spoken word recognition. Readings in speech recognition 159, pp. 224. Cited by: §2.
  • [76] S. Seabold and J. Perktold (2010) Statsmodels: econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, Vol. 57, pp. 61. Cited by: §D.2, 2nd item, §6.1.
  • [77] C. Shaw and G. King (1992) Using cluster analysis to classify time series. Physica D: Nonlinear Phenomena 58 (1-4), pp. 288–298. Cited by: §2.
  • [78] R. H. Shumway and D. S. Stoffer (1982) An approach to time series smoothing and forecasting using the em algorithm. Journal of time series analysis 3 (4), pp. 253–264. Cited by: §1.
  • [79] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht (2018) Learning without mixing: towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334. Cited by: §1, §2.
  • [80] P. Smyth (1997) Clustering sequences with hidden markov models. In Advances in neural information processing systems, pp. 648–654. Cited by: §2, §2.
  • [81] P. Stoica, B. Friedlander, and T. Söderström (1987) Optimal instrumental variable multistep algorithms for estimation of the ar parameters of an arma process. International Journal of Control 45 (6), pp. 2083–2107. Cited by: §2.
  • [82] P. Stoica, B. Friedlander, and T. Söderström (1988) A high-order yule-walker method for estimation of the ar parameters of an arma model. Systems & control letters 11 (2), pp. 99–105. Cited by: §2.
  • [83] P. Stoica, R. L. Moses, et al. (2005) Spectral analysis of signals. Cited by: §2.
  • [84] P. Stoica, R. L. Moses, T. Soderstrom, and J. Li (1991) Optimal high-order yule-walker estimation of sinusoidal frequencies. IEEE Transactions on signal processing 39 (6), pp. 1360–1368. Cited by: §2.
  • [85] Z. Syed, J. Guttag, and C. Stultz (2007) Clustering and symbolic analysis of cardiovascular signals: discovery and visualization of medically relevant patterns in long-term data using limited prior knowledge. EURASIP Journal on Applied Signal Processing 2007 (1), pp. 97–97. Cited by: §6.4.
  • [86] R. Tavenard (2017) Tslearn: a machine learning toolkit dedicated to time-series data. URL https://github. com/rtavenar/tslearn. Cited by: 5th item, 6th item, 7th item, §6.2.
  • [87] G. C. Tiao and R. S. Tsay (1983) Consistency properties of least squares estimates of autoregressive parameters in arma models. The Annals of Statistics, pp. 856–871. Cited by: §3.2, §5.
  • [88] D. Tran and M. Wagner (2002) Fuzzy c-means clustering-based speaker verification. In AFSS International Conference on Fuzzy Systems, pp. 318–324. Cited by: §2.
  • [89] R. S. Tsay and G. C. Tiao (1984) Consistent estimates of autoregressive parameters and extended sample autocorrelation function for stationary and nonstationary arma models. Journal of the American Statistical Association 79 (385), pp. 84–96. Cited by: §1, §2, Theorem 5.1, §5, §5.
  • [90] A. Tsiamis and G. J. Pappas (2019) Finite sample analysis of stochastic system identification. arXiv preprint arXiv:1903.09122. Cited by: §2.
  • [91] M. Turakhia, M. Desai, H. Hedlin, A. Rajmane, N. Talati, T. Ferris, S. Desai, D. Nag, M. Patel, P. Kowey, J. Rumsfeld, A. M. Russo, M. True Hills, C. B. Granger, K. W. Mahaffey, and M. Perez (2018-09) Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: the apple heart study. American Heart Journal 207, pp. . External Links: Document Cited by: §6.4.
  • [92] J. J. Van Wijk and E. R. Van Selow (1999) Cluster and calendar based visualization of time series data. In Proceedings 1999 IEEE Symposium on Information Visualization (InfoVis’ 99), pp. 4–9. Cited by: §2.
  • [93] N. X. Vinh, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11 (Oct), pp. 2837–2854. Cited by: §6.2.
  • [94] S. Vishwanathan, A. J. Smola, and R. Vidal (2007) Binet-cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision 73 (1), pp. 95–119. Cited by: §2.
  • [95] M. Vlachos, G. Kollios, and D. Gunopulos (2002) Discovering similar multidimensional trajectories. In Proceedings 18th international conference on data engineering, pp. 673–684. Cited by: §2.
  • [96] M. Vlachos, J. Lin, E. Keogh, and D. Gunopulos (2003) A wavelet-based anytime algorithm for k-means clustering of time series. In In proc. workshop on clustering high dimensionality data and its applications, Cited by: §2.
  • [97] J. Wilpon and L. Rabiner (1985) A modified k-means clustering algorithm for use in isolated work recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (3), pp. 587–594. Cited by: §2.
  • [98] Y. Xiong and D. Yeung (2002) Mixtures of arma models for model-based time series clustering. In 2002 IEEE International Conference on Data Mining, 2002. Proceedings., pp. 717–720. Cited by: §2, §2.
  • [99] Y. Yeh, C. W. Chiou, and H. Lin (2012) Analyzing ecg for cardiac arrhythmia using cluster analysis. Expert Systems with Applications 39 (1), pp. 1000–1010. Cited by: §6.4.

Appendix A Proofs for model equivalence

In this section, we prove a generalization of Theorem 4.1 for both LDSs with observed inputs and LDSs with hidden inputs.

a.1 Preliminaries

ARMAX model

First, we describe a generalization of ARMA to handle exogenous inputs.

When there are additional exogenous inputs to the an ARMA model, it is then known as a autoregressive–moving-average model with exogenous inputs (ARMAX),

where is a known external time series, possibly multidimensional. In the case where is a vector, the parameters are also vectors.

ARMAX models can be estimated through MLE methods or weighted least squares [35, 7, 36].

Lag Operator

We also introduce the lag operator, a convenient notation that we will use repeatedly.

The lag operator , also called the backward operator, is a concise way to describe ARMA models [33], defined as . The lag operator could be raise to powers, or form polynomials. For example, , and . The lag polynomials can be multiplied or inverted.

An AR() model can be characterized by

where is a polynomial of the lag operator of degree . For example, any AR(2) model can be described as .

Similarly, an MA() can be characterized by a polynomial of degree ,

For example, for an MA(2) model the equation would be .

Merging the two and adding dependency to exogenous input, we can write an ARMAX() model as


where , and are polynomials of degree and respectively. When the exogenous time series is multidimensional, is a vector of degree- polynomials.

Sum of ARMA processes

It is known that the sum of ARMA processes is still an ARMA process.

Lemma A.1 (Main Theorem in [33]).

The sum of two independent stationary series generated by ARMA(, ) and ARMA(, ) is generated by ARMA(, ), where and .

In shorthand notation, .

When two ARMAX processes share the same exogenous input series, the dependency on exogenous input is additive, and the above can be extended to .

Jordan canonical form and canonical basis

Every square real matrix is similar to a complex block diagonal matrix known as its Jordan canonical form (JCF). In the special case for diagonalizable matrices, JCF is the same as the diagonal form. Based on JCF, there exists a canonical basis consisting only of eigenvectors and generalized eigenvectors of . A vector is a generalized eigenvector of rank with corresponding eigenvalue if and .

Relating the canonical basis to the characteristic polynomial, the characteristic polynomial can be completely factored into linear factors over . The complex roots are eigenvalues of . For each eigenvalue , there exist linearly independent generalized eigenvectors such that .

a.2 General model equivalence theorem

Now we state Theorem A.1, a more generalized and detailed version of Theorem 4.1.

Theorem A.1.

For any linear dynamical system with parameters , hidden dimension , inputs , and outputs , the outputs satisfy


where is the lag operator, is the reciprocal polynomial of the characteristic polynomial of , and is an -by- matrix of polynomials of degree .

This implies that each dimension of can be generated by an ARMAX() model, where the autoregressive parameters are the characteristic polynomial coefficients in reverse order and in negative values.

To prove the theorem, we introduce a lemma to analyze the autoregressive behavior of the hidden state projected to a generalized eigenvector direction.

Lemma A.2.

Consider a linear dynamical system with parameters , hidden states , inputs , and outputs as defined in (1). For any generalized eigenvector of with eigenvector and rank , the lag operator polynomial applied to time series results in


To expand the LHS, first observe that

We can apply again similarly to obtain

and in general we can show inductively that

where the RHS is a linear transformation of .

Since by definition of generalized eigenvectors, , and hence itself is a linear transformation of . ∎

Proof for Theorem a.1

Using Lemma A.2 and the canonical basis, we can prove Theorem A.1.


Let be the eigenvalues of with multiplicity . Since is a real-valued matrix, its adjoint has the same characteristic polynomial and eigenvalues as . There exists a canonical basis for , where are generalized eigenvectors with eigenvalue , are generalized eigenvectors with eigenvalue , so on and so forth, and are generalized eigenvectors with eigenvalue .

By Lemma (A.2),

We then apply lag operator polynomial to both sides of each equation. The lag polynomial in the LHS becomes For the RHS, since is of degree , it lags the RHS by at most additional steps, and the RHS becomes a linear transformation of .

Thus, for each , is a linear transformation of .

The outputs of the LDS are defined as . By linearity, and since is of degree , both and are linear transformations of . We can write any such linear transformation as for some -by- matrix of polynomials of degree . Thus, as desired,

The reciprocal polynomial has the same coefficients in reverse order as the original polynomial. According to the lag operator polynomial on the LHS, , and , so the -th order autoregressive parameter is the negative value of the -th order coefficient in the characteristic polynomial .

a.3 The hidden input case: Theorem 4.1 as a corollary

Theorem 4.1 comes as a corollary to Theorem A.1, with a short proof here.


Define to be the output without noise, i.e. . By Theorem A.1, . Since we assume the hidden inputs are i.i.d. Gaussians, is then generated by an ARMA() process with autoregressive polynomial .

The output noise itself can be seen as an ARMA() process. By Lemma A.1, ARMA() + ARMA() = ARMA() = ARMA(). Hence the outputs are generated by an ARMA() process as claimed in Theorem 4.1. It is easy to see in the proof of Lemma A.1 that the autoregressive parameters do not change when adding a white noise [33]. ∎

Appendix B Proof for eigenvalue approximation theorems

Here we restate Theorem 4.2 and Theorem 4.3 together, and prove it in three steps for 1) the general case, 2) the simple eigenvalue case, and 3) the explicit condition number bounds for the simple eigenvalue case.

Theorem B.1.

Suppose are the outputs from an -dimensional latent linear dynamical system with parameters and eigenvalues . Let be the estimated autoregressive parameters with error , and let be the roots of the polynomial .

Without any additional assumption on , the roots converge to the eigenvalues with convergence rate . If all eigenvalues of are simple (i.e. multiplicity 1), then the convergence rate is . If is symmetric, Lyapunov stable (spectral radius at most 1), and only has simple eigenvalues, then

b.1 General (1/n)-exponent bound

This is a known perturbation bound on polynomial root finding due to Ostrowski [6].

Lemma B.1.

Let and be two polynomials of degree . If , then the roots of and roots of under suitable order satisfy

where .

The general convergence rate in Theorem 4.2 follows directly from Lemma B.1 and Theorem 4.1.

b.2 Bound for simple eigenvalues

The -exponent in the above bound might seem not very ideal, but without additional assumptions the -exponent is tight. As an example, the polynomial has roots . This is a general phenomenon that a root with multiplicity could split into roots at rate , and is related to the regular splitting property [42, 57] in matrix eigenvalue perturbation theory.

Under the additional assumption that all the eigenvalues are simple (no multiplicity), we can prove a better bound using the following idea with companion matrix: Small perturbation in autoregressive parameters results in small perturbation in companion matrix, and small perturbation in companion matrix results in small perturbation in eigenvalues.

Matrix eigenvalue perturbation theory

The perturbation bound on eigenvalues is a well-studied problem [34]. The regular splitting property states that, for an eigenvalue with partial multiplicities , an perturbation to the matrix could split the eigenvalue into distinct eigenvalues for and , and each eigenvalue is moved from the original position by .

For semi-simple eigenvalues, geometric multiplicity equals algebraic multiplicity. Since geometric multiplicity is the number of partial multiplicities while algebraic multiplicity is the sum of partial multiplicities, for semi-simple eigenvalues all partial multiplicities . Therefore, the regular splitting property corresponds to the asymptotic relation in equation 4. It is known that regular splitting holds for any semi-simple eigenvalue even for non-Hermitian matrices.

Lemma B.2 (Theorem 6 in [57]).

Let be an analytic matrix function with semi-simple eigenvalue at of multiplicity . Then there are exactly eigenvalues of for which as , and for these eigenvalues

Companion Matrix

Matrix perturbation theory tell us how perturbations on matrices change eigenvalues, while we are interested in how perturbations on polynomial coefficients change roots. To apply matrix perturbation theory on polynomials, we introduce the companion matrix, also known as the controllable canonical form in control theory.

Definition B.1.

For a monic polynomial , the companion matrix of the polynomial is the square matrix

The matrix is the companion in the sense that its characteristic polynomial is equal to .

In relation to a pure autoregressive AR() model, the companion matrix corresponds to the transition matrix in the linear dynamical system when we encode the values form the past lags as a -dimensional state

If , then

Proof of Theorem 4.2 for simple eigenvalues

Let be the outputs of a linear dynamical system with only simple eigenvalues, and let be the ARMAX autoregressive parameters for . Let be the companion matrix of the polynomial . The companion matrix is the transition matrix of the LDS described in equation 5. Since this LDS the same autoregressive parameters and hidden state dimension as the original LDS, by Corollary 4.1 the companion matrix has the same characteristic polynomial as the original LDS, and thus also has simple (and hence also semi-simple) eigenvalues. The convergence rate then follows from Lemma B.2 and Theorem 5.1, as the error on ARMAX parameter estimation can be seen as perturbation on the companion matrix. ∎

A note on the companion matrix

One might hope that we could have a more generalized result using Lemma B.2 for all systems with semi-simple eigenvalues instead of restricting to matrices with simple eigenvalues. Unfortunately, even if the original linear dynamical system has only semi-simple eigenvalues, in general the companion matrix is not semi-simple unless the original linear dynamical system is simple. This is because the companion matrix always has its minimal polynomial equal to its characteristic polynomial, and hence has geometric multiplicity 1 for all eigenvalues. This also points to the fact that even though the companion matrix has the form of the controllable canonical form, in general it is not necessarily similar to the transition matrix in the original LDS.

b.3 Explicit bound for condition number

In this subsection, we write out explicitly the condition number for simple eigenvalues in the asymptotic relation , to show how it varies according to the spectrum. Here we use the notation to note the condition number for eigenvalue in companion matrix .

Lemma B.3.

For a companion matrix with simple eigenvalues , the eigenvalues of the perturbed matrix by satisfy


and the condition number is bounded by


where is the spectral radius, i.e. largest absolute value of its eigenvalues.

In particular, when , i.e. when the matrix is Lyapunov stable,


For each simple eigenvalue of the companion matrix with column eigenvector and row eigenvector , the condition number of the eigenvalue is


This is derived from differentiating the eigenvalue equation , and multiplying the differentiated equation by , which results in



The companion matrix can be diagonalized as , the rows of the Vandermonde matrix are the row eigenvectors of , while the columns of are the column eigenvectors of . Since the the -th row and the -th column have inner product 1 by definition of matrix inverse, the condition number is given by

Formula for inverse of Vandermonde matrix

The Vandermonde matrix is defined as


The inverse of the Vandermonde matrix is given by [26] using elementary symmetric polynomial.



Pulling out the common denominator, the -th column vector of is

where the elementary symmetric polynomials are over variables .

For example, if , then the 3rd column (up to scaling) would be

Bounding the condition number

As discussed before, the condition number for eigenvalue is

where is the -th row of the Vandermonde matrix and is the -th column of .

By definition