Learning Koopman Invariant Subspaces for Dynamic Mode Decomposition

This is a preprint of the following conference paper: N. Takeishi, Y. Kawahara, and T. Yairi. Learning Koopman Invariant Subspaces for Dynamic Mode Decomposition. In Advances in Neural Information Processing Systems, volume 30, 2017. (to appear)
Abstract
Spectral decomposition of the Koopman operator is attracting attention as a tool for the analysis of nonlinear dynamical systems. Dynamic mode decomposition is a popular numerical algorithm for Koopman spectral analysis; however, we often need to prepare nonlinear observables manually according to the underlying dynamics, which is not always possible since we may not have any a priori knowledge about them. In this paper, we propose a fully data-driven method for Koopman spectral analysis based on the principle of learning Koopman invariant subspaces from observed data. To this end, we propose minimization of the residual sum of squares of linear least-squares regression to estimate a set of functions that transforms data into a form in which the linear regression fits well. We introduce an implementation with neural networks and evaluate performance empirically using nonlinear dynamical systems and applications.
1 Introduction
A variety of time-series data are generated from nonlinear dynamical systems, in which a state evolves according to a nonlinear map or differential equation. In summarization, regression, or classification of such time-series data, precise analysis of the underlying dynamical systems provides valuable information to generate appropriate features and to select an appropriate computation method. In applied mathematics and physics, the analysis of nonlinear dynamical systems has received significant interest because a wide range of complex phenomena, such as fluid flows and neural signals, can be described in terms of nonlinear dynamics. A classical but popular view of dynamical systems is based on state space models, wherein the behavior of the trajectories of a vector in state space is discussed (see, e.g., [1]). Time-series modeling based on a state space is also common in machine learning. However, when the dynamics are highly nonlinear, analysis based on state space models becomes challenging compared to the case of linear dynamics.
Recently, there has been growing interest in operator-theoretic approaches for the analysis of dynamical systems. Operator-theoretic approaches are based on the Perron–Frobenius operator [2] or its adjoint, i.e., the Koopman operator (composition operator) [3, 4]. The Koopman operator defines the evolution of observation functions (observables) in a function space rather than state vectors in a state space. Based on the Koopman operator, the analysis of nonlinear dynamical systems can be lifted to a linear (but infinite-dimensional) regime. Consequently, we can consider modal decomposition, with which the global characteristics of nonlinear dynamics can be inspected [4, 5]. Such modal decomposition has been intensively used for scientific purposes to understand complex phenomena (e.g., [6, 7, 8, 9]) and also for engineering tasks, such as signal processing and machine learning. In fact, modal decomposition based on the Koopman operator has been utilized in various engineering tasks, including robotic control [10], image processing [11], and nonlinear system identification [12].
One of the most popular algorithms for modal decomposition based on the Koopman operator is dynamic mode decomposition (DMD) [6, 7, 13]. An important premise of DMD is that the target dataset is generated from a set of observables that spans a function space invariant to the Koopman operator (referred to as Koopman invariant subspace). However, when only the original state vectors are available as the dataset, we must prepare appropriate observables manually according to the underlying nonlinear dynamics. Several methods have been proposed to utilize such observables, including the use of basis functions [14] and reproducing kernels [15]. Note that these methods work well only if appropriate basis functions or kernels are prepared; however, it is not always possible to prepare such functions if we have no a priori knowledge about the underlying dynamics.
In this paper, we propose a fully data-driven method for modal decomposition via the Koopman operator based on the principle of learning Koopman invariant subspaces (LKIS) from scratch using observed data. To this end, we estimate a set of parametric functions by minimizing the residual sum of squares (RSS) of linear least-squares regression, so that the estimated set of functions transforms the original data into a form in which the linear regression fits well. In addition to the principle of LKIS, an implementation using neural networks is described. Moreover, we present the empirical performance of DMD based on the LKIS framework on several nonlinear dynamical systems and applications, which demonstrates the feasibility of LKIS-based DMD as a fully data-driven method for modal decomposition via the Koopman operator.
2 Background
2.1 Koopman spectral analysis
We focus on a (possibly nonlinear) discrete-time autonomous dynamical system
(1) $x_{t+1} = f(x_t), \qquad x_t \in \mathcal{M},$
where $\mathcal{M}$ denotes the state space and $(\mathcal{M}, \Sigma, \mu)$ represents the associated probability space. In dynamical system (1), the Koopman operator $\mathcal{K}$ [4, 5] is defined as an infinite-dimensional linear operator that acts on observables $g: \mathcal{M} \to \mathbb{C}$ (or $\mathbb{R}$), i.e.,
(2) $(\mathcal{K} g)(x) = g(f(x)),$
with which the analysis of nonlinear dynamics (1) can be lifted to a linear (but infinite-dimensional) regime. Since $\mathcal{K}$ is linear, let us consider a set of eigenfunctions $\{\varphi_1, \varphi_2, \dots\}$ of $\mathcal{K}$ with eigenvalues $\{\lambda_1, \lambda_2, \dots\}$, i.e., $\mathcal{K}\varphi_i = \lambda_i \varphi_i$ for $i \in \mathbb{N}$, where $\varphi_i: \mathcal{M} \to \mathbb{C}$ and $\lambda_i \in \mathbb{C}$. Further, suppose that an observable $g$ can be expressed as a linear combination of this infinite number of eigenfunctions, i.e., $g = \sum_{i=1}^{\infty} \varphi_i c_i$ with a set of coefficients $\{c_1, c_2, \dots\}$. By repeatedly applying $\mathcal{K}$ to both sides of this equation, we obtain the following modal decomposition:
(3) $g(x_t) = \sum_{i=1}^{\infty} \lambda_i^t \, \varphi_i(x_0) \, c_i.$
Here, the value of $g(x_t)$ is decomposed into a sum of Koopman modes $\varphi_i(x_0) c_i$, each of which evolves over time with its frequency and decay rate respectively given by $\angle\lambda_i$ and $|\lambda_i|$, since $\lambda_i$ is a complex value. The Koopman modes and their eigenvalues can be investigated to understand the dominant characteristics of complex phenomena that follow nonlinear dynamics. The above discussion can also be applied straightforwardly to continuous-time dynamical systems [4, 5].
Modal decomposition based on $\mathcal{K}$, often referred to as Koopman spectral analysis, has been receiving attention in nonlinear physics and applied mathematics. In addition, it is a useful tool for engineering tasks including machine learning and pattern recognition; the spectra (eigenvalues) of $\mathcal{K}$ can be used as features of dynamical systems, the eigenfunctions provide a useful representation of time-series for various tasks, such as regression and visualization, and $\mathcal{K}$ itself can be used for prediction and optimal control. Several methods have been proposed to compute modal decomposition based on $\mathcal{K}$, such as generalized Laplace analysis [5, 16], the Ulam–Galerkin method [17], and DMD [6, 7, 13]. DMD, which is reviewed in more detail in the next subsection, has received significant attention and has been utilized in various data analysis scenarios (e.g., [6, 7, 8, 9]).
Note that the Koopman operator and modal decomposition based on it can be extended to random dynamical systems actuated by process noise [4, 14, 18]. In addition, Proctor et al. [19, 20] discussed Koopman analysis of systems with control signals. In this paper, we primarily target autonomous deterministic dynamics (e.g., Eq. (1)) for the sake of presentation clarity.
2.2 Dynamic mode decomposition and Koopman invariant subspace
Let us review DMD, an algorithm for Koopman spectral analysis (further details are in the supplementary material). Consider a set of observables $\{g_1, \dots, g_p\}$ and let $\boldsymbol{g} = [g_1 \ \cdots \ g_p]^\mathsf{T}$ be a vector-valued observable. In addition, define two matrices generated by $\boldsymbol{g}$ and the sequence of states $x_0, \dots, x_n$, i.e.,
(4) $Y_0 = \left[\, \boldsymbol{g}(x_0) \ \cdots \ \boldsymbol{g}(x_{n-1}) \,\right], \qquad Y_1 = \left[\, \boldsymbol{g}(x_1) \ \cdots \ \boldsymbol{g}(x_n) \,\right],$
where $n+1$ is the number of snapshots in the dataset. The core functionality of DMD algorithms is computing the eigendecomposition of the matrix $A = Y_1 Y_0^\dagger$ [21, 13], where $Y_0^\dagger$ is the Moore–Penrose pseudoinverse of $Y_0$. The eigenvectors of $A$ are referred to as dynamic modes, and they coincide with the Koopman modes if the corresponding eigenfunctions of $\mathcal{K}$ are in $\mathrm{span}\{g_1, \dots, g_p\}$ [21]. Alternatively (but nearly equivalently), the condition under which DMD works as a numerical realization of Koopman spectral analysis can be described as follows.
Rather than calculating the infinite-dimensional $\mathcal{K}$ directly, we can consider the restriction of $\mathcal{K}$ to a finite-dimensional subspace. Assume the observables are elements of $L^2(\mathcal{M}, \mu)$. A Koopman invariant subspace is defined as a subspace $\mathcal{G} \subseteq L^2(\mathcal{M}, \mu)$ s.t. $\mathcal{K} g \in \mathcal{G}$ for all $g \in \mathcal{G}$. If $\mathcal{G}$ is spanned by a finite number of functions, then the restriction of $\mathcal{K}$ to $\mathcal{G}$, which we denote $\mathcal{K}|_{\mathcal{G}}$, becomes a finite-dimensional linear operator. In the sequel, we assume the existence of such a $\mathcal{G}$. If $\{g_1, \dots, g_p\}$ spans $\mathcal{G}$, then DMD's matrix $A = Y_1 Y_0^\dagger$ coincides asymptotically with $K$, the realization (matrix form) of $\mathcal{K}|_{\mathcal{G}}$ with regard to the frame (or basis) $\{g_1, \dots, g_p\}$. For modal decomposition (3), the (vector-valued) Koopman modes are given by the right eigenvectors $v_i$ of $K$, and the values of the eigenfunctions are obtained by $\varphi_i(x_t) = w_i^* \boldsymbol{g}(x_t)$, where $v_i$ and $w_i$ are the right and left eigenvectors of $K$ normalized such that $w_i^* v_i = 1$ [21, 14], and $w^*$ denotes the conjugate transpose of $w$.
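As a concrete illustration of the computation just described (not part of the original paper; all names are ours), the following NumPy sketch performs plain DMD: it forms $A = Y_1 Y_0^\dagger$, takes its eigendecomposition, and evaluates the eigenfunctions through the left eigenvectors.

```python
import numpy as np

def dmd(Y0, Y1):
    """Plain DMD on snapshot matrices Y0, Y1 (columns g(x_t), g(x_{t+1})).

    Returns eigenvalues, dynamic modes (columns of V), and the
    eigenfunction values Phi[i, t] = w_i^* g(x_t).
    """
    A = Y1 @ np.linalg.pinv(Y0)   # finite-dimensional realization of K
    lam, V = np.linalg.eig(A)     # right eigenvectors = dynamic modes
    Phi = np.linalg.inv(V) @ Y0   # rows of inv(V) are the normalized left eigenvectors
    return lam, V, Phi
```

For data generated by dynamics that are exactly linear in the chosen observables, the recovered eigenvalues match those of the true transition matrix.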
Here, an important problem arises in the practice of DMD, i.e., we often have no access to a set of observables $\{g_1, \dots, g_p\}$ that spans a Koopman invariant subspace $\mathcal{G}$. In this case, for nonlinear dynamics, we must manually prepare adequate observables. Several researchers have addressed this issue; Williams et al. [14] leveraged a dictionary of predefined basis functions to transform the original data, and Kawahara [15] defined Koopman spectral analysis in a reproducing kernel Hilbert space. Brunton et al. [22] proposed the use of observables selected in a data-driven manner [23] from a function dictionary. Note that, for these methods, we must select an appropriate function dictionary or kernel function according to the target dynamics. However, if we have no a priori knowledge about the dynamics, which is often the case, such existing methods cannot necessarily be applied successfully.
3 Learning Koopman invariant subspaces
3.1 Minimizing residual sum of squares of linear least-squares regression
In this paper, we propose a method to learn a set of observables that spans a Koopman invariant subspace $\mathcal{G}$, given a sequence of measurements as the dataset. In the following, we summarize desirable properties for such observables, upon which the proposed method is constructed.
Theorem 1.
Consider a set of square-integrable observables $\{g_1, \dots, g_p\} \subset L^2(\mathcal{M}, \mu)$, and define a vector-valued observable $\boldsymbol{g} = [g_1 \ \cdots \ g_p]^\mathsf{T}$. In addition, define a linear operator whose matrix form is given as $K = \mathrm{E}_\mu\!\left[ \boldsymbol{g}(f(x)) \boldsymbol{g}(x)^* \right] \mathrm{E}_\mu\!\left[ \boldsymbol{g}(x) \boldsymbol{g}(x)^* \right]^\dagger$. Then, $\boldsymbol{g} \circ f = K \boldsymbol{g}$ ($\mu$-almost everywhere) if and only if $\{g_1, \dots, g_p\}$ spans a Koopman invariant subspace.
Proof.
If $\boldsymbol{g} \circ f = K \boldsymbol{g}$, then for any $g = \sum_{i=1}^p a_i g_i \in \mathrm{span}\{g_1, \dots, g_p\}$,
$\mathcal{K} g = \sum_{i=1}^p a_i \, (g_i \circ f) = \sum_{i=1}^p \sum_{j=1}^p a_i K_{i,j} \, g_j \in \mathrm{span}\{g_1, \dots, g_p\},$
where $K_{i,j}$ denotes the $(i,j)$ element of $K$; thus, $\mathrm{span}\{g_1, \dots, g_p\}$ is a Koopman invariant subspace. On the other hand, if $\{g_1, \dots, g_p\}$ spans a Koopman invariant subspace $\mathcal{G}$, there exists a linear operator $\mathcal{K}|_{\mathcal{G}}$ such that $\mathcal{K}|_{\mathcal{G}} \boldsymbol{g} = \boldsymbol{g} \circ f$; thus, $\boldsymbol{g} \circ f$ is a linear transformation of $\boldsymbol{g}$. Therefore, an instance of the matrix form of $\mathcal{K}|_{\mathcal{G}}$ is obtained in the form of $K$. ∎
According to Theorem 1, we should obtain $\boldsymbol{g}$ that makes $\boldsymbol{g} \circ f - K \boldsymbol{g}$ zero. However, such a problem cannot be solved directly with finite data because $\boldsymbol{g} \circ f - K \boldsymbol{g}$ is a function. Thus, we give the corresponding empirical risk minimization problem, based on the assumption of ergodicity of (1) and the convergence property of the empirical matrix $Y_1 Y_0^\dagger$, as follows.
Assumption 1.
For dynamical system (1), the time-average and space-average of a function $\chi: \mathcal{M} \to \mathbb{C}$ (or $\mathbb{R}$) coincide in the limit $n \to \infty$ for almost all $x_0 \in \mathcal{M}$, i.e.,
$\lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} \chi(x_t) = \int_{\mathcal{M}} \chi(x) \, d\mu(x).$
Theorem 2.
Suppose that Assumption 1 holds for dynamical system (1) and that $\boldsymbol{g}$ is square-integrable. Then, the empirical matrix $Y_1 Y_0^\dagger$ converges to the matrix form $K$ defined in Theorem 1 as $n \to \infty$.
Proof.
By Assumption 1, $\frac{1}{n} Y_1 Y_0^*$ and $\frac{1}{n} Y_0 Y_0^*$ converge to $\mathrm{E}_\mu[\boldsymbol{g}(f(x)) \boldsymbol{g}(x)^*]$ and $\mathrm{E}_\mu[\boldsymbol{g}(x) \boldsymbol{g}(x)^*]$, respectively, for almost all $x_0$. The claim then follows from the continuity of the Moore–Penrose pseudoinverse [24]. ∎
Since $Y_1 Y_0^\dagger$ is the minimum-norm solution of the linear least-squares regression from the columns of $Y_0$ to those of $Y_1$, we constitute the learning problem as estimating a set of functions $\{g_1, \dots, g_p\}$ that transforms the original data into a form in which the linear least-squares regression fits well. In particular, we minimize the RSS, which measures the discrepancy between the data and the estimated regression model (i.e., linear least-squares in this case). We define the RSS loss as follows:
(5) $L_{\mathrm{RSS}}(\boldsymbol{g}) = \left\| Y_1 - \left( Y_1 Y_0^\dagger \right) Y_0 \right\|_F^2,$
which becomes zero when $\{g_1, \dots, g_p\}$ spans a Koopman invariant subspace. If we implement $\boldsymbol{g}$ with a smooth parametric model, the local minima of $L_{\mathrm{RSS}}$ can be found using gradient descent. We adopt a $\boldsymbol{g}$ that achieves a local minimum of $L_{\mathrm{RSS}}$ as a set of observables that (approximately) spans a Koopman invariant subspace.
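For concreteness, the RSS loss of Eq. (5) can be evaluated with a few lines of NumPy; this is an illustrative sketch of ours (names are not from the paper), not the authors' implementation:

```python
import numpy as np

def rss_loss(G):
    """RSS of the least-squares fit from g(x_t) to g(x_{t+1}).

    G: (p, n+1) array whose columns are g(x_0), ..., g(x_n).
    Returns ||Y1 - (Y1 Y0^+) Y0||_F^2, which vanishes when the learned
    observables (empirically) span a Koopman invariant subspace.
    """
    Y0, Y1 = G[:, :-1], G[:, 1:]
    A = Y1 @ np.linalg.pinv(Y0)   # minimum-norm least-squares solution
    return float(np.linalg.norm(Y1 - A @ Y0, 'fro') ** 2)
```

In a learning loop, G would be produced by the parametric model g, and this quantity minimized by gradient descent through the pseudoinverse.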
3.2 Linear delay embedder for state space reconstruction
In the previous subsection, we presented an important part of the principle of LKIS, i.e., minimization of the RSS of linear least-squares regression. Note that, to define RSS loss (5), we need access to a sequence of the original states, i.e., $x_0, \dots, x_n$, as a dataset. In practice, however, we cannot necessarily observe full states due to limited memory and sensor capabilities. In this case, only transformed (and possibly degenerated) measurements are available, which we denote $y_t = \psi(x_t)$ with a measurement function $\psi$. To define RSS loss (5) given only degenerated measurements, we must reconstruct the original states from the actual observations $\{y_t\}$.
Here, we utilize delay-coordinate embedding, which has been widely used for state-space reconstruction in the analysis of nonlinear dynamics. Consider a univariate time-series $\{y_t\}$, which is a sequence of degenerated measurements $y_t = \psi(x_t)$. According to the well-known Takens theorem [25, 26], a faithful representation of the attractor that preserves the structure of the state space can be obtained by $\tilde{x}_t = [y_t \ y_{t-\tau} \ \cdots \ y_{t-(d-1)\tau}]^\mathsf{T}$ with some lag parameter $\tau$ and embedding dimension $d$, if $d$ is greater than twice the dimension of the attractor. For a multivariate time-series, embedding with non-uniform lags provides better reconstruction [27]. For example, when we have a two-dimensional time-series $\{(y_{1,t}, y_{2,t})\}$, an embedding with non-uniform lags takes a form such as $[y_{1,t} \ y_{1,t-\tau_1} \ y_{2,t} \ y_{2,t-\tau_2}]^\mathsf{T}$ with individual values of the lags $\tau_1$ and $\tau_2$ and of the number of lagged copies per coordinate. Several methods have been proposed for the selection of $\tau$ and $d$ [27, 28, 29]; however, appropriate values may depend on the given application (attractor inspection, prediction, etc.).
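The uniform delay-coordinate embedding described above is straightforward to implement; the sketch below (our own illustration, not the paper's code) stacks lagged copies of a univariate series:

```python
import numpy as np

def delay_embed(y, dim, lag):
    """Uniform delay-coordinate embedding of a 1-D series y.

    Row t of the result is (y_t, y_{t+lag}, ..., y_{t+(dim-1)*lag}),
    so the output has shape (len(y) - (dim-1)*lag, dim).
    """
    n = len(y) - (dim - 1) * lag
    return np.column_stack([y[i * lag : i * lag + n] for i in range(dim)])
```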
In this paper, we propose to sidestep the manual parameter selection of the delay-coordinate embedding by learning a linear delay embedder from data. Formally, we learn an embedder $\phi$ such that
(6) $\tilde{x}_t = \phi(y_t, \dots, y_{t-k+1}) = W \left[ y_t^\mathsf{T} \ \cdots \ y_{t-k+1}^\mathsf{T} \right]^\mathsf{T},$
where $y_t \in \mathbb{R}^r$, $W \in \mathbb{R}^{q \times kr}$, and $k$ is a hyperparameter that specifies the maximum lag. We estimate the weight $W$ as well as the parameters of $\boldsymbol{g}$ by minimizing RSS loss (5), which is now defined using $\tilde{x}$ instead of $x$. Learning $\phi$ from data yields an embedding that is suitable specifically for learning a Koopman invariant subspace. Moreover, we can impose L1 regularization on the weight $W$ to make it highly interpretable if necessary according to the given application.
3.3 Reconstruction of original measurements
Simple minimization of $L_{\mathrm{RSS}}$ may yield a trivial $\boldsymbol{g}$, such as a set of constant functions, so we should impose some constraints to prevent such trivial solutions. In the proposed framework, modal decomposition is first obtained in terms of the learned observables $\boldsymbol{g}$; thus, the values of $\boldsymbol{g}$ must be back-projected to the space of the original measurements to obtain a physically meaningful representation of the dynamic modes. Therefore, we modify the loss function by employing an additional term such that the original measurements can be reconstructed from the values of $\boldsymbol{g}$ by a reconstructor $h$, i.e., $\tilde{x}_t \approx h(\boldsymbol{g}(\tilde{x}_t))$. Such a term is given as follows:
(7) $L_{\mathrm{rec}} = \sum_{t} \left\| \tilde{x}_t - h(\boldsymbol{g}(\tilde{x}_t)) \right\|_2^2,$
and, if $h$ is a smooth parametric model, this term can also be reduced using gradient descent. Finally, the objective function to be minimized becomes
(8) $L = L_{\mathrm{RSS}} + \alpha L_{\mathrm{rec}},$
where $\alpha$ is a parameter that controls the balance between $L_{\mathrm{RSS}}$ and $L_{\mathrm{rec}}$.
3.4 Implementation using neural networks
In Sections 3.1–3.3, we introduced the main concepts for the LKIS framework, i.e., RSS loss minimization, learning the linear delay embedder, and reconstruction of the original measurements. Here, we demonstrate an implementation of the LKIS framework using neural networks.
Figure 1 shows a schematic diagram of the implementation of the framework. We model $\boldsymbol{g}$ and $h$ using multilayer perceptrons (MLPs) with a parametric ReLU activation function [30]. Here, the sizes of the hidden layers of the MLPs are set to the arithmetic means of the sizes of the corresponding input and output layers. Thus, the remaining tunable hyperparameters are $k$ (the maximum delay of $\phi$), $q$ (the dimensionality of $\tilde{x}$), and $p$ (the dimensionality of $\boldsymbol{g}$). To obtain $\boldsymbol{g}$ with dimensionality much greater than that of the original measurements, we found that it was useful to set $k > 1$ even when full-state measurements (e.g., $y_t = x_t$) were available.
After estimating the parameters of $\phi$, $\boldsymbol{g}$, and $h$, DMD can be performed normally by computing the values of the learned $\boldsymbol{g}$, defining the data matrices $Y_0$ and $Y_1$ in Eq. (4), and computing the eigendecomposition of $A = Y_1 Y_0^\dagger$; the dynamic modes are obtained from the right eigenvectors $v_i$ of $A$, and the values of the eigenfunctions are obtained by $\varphi_i(\tilde{x}_t) = w_i^* \boldsymbol{g}(\tilde{x}_t)$, where $v_i$ and $w_i$ are the right and left eigenvectors of $A$. See Section 2.2 for details.
In the numerical experiments described in Sections 5 and 6, we performed optimization using first-order gradient descent. To stabilize optimization, batch normalization [31] was imposed on the inputs of the hidden layers. Note that, since RSS loss (5) is not decomposable with regard to data points, convergence of stochastic gradient descent (SGD) cannot be shown straightforwardly. However, we empirically found that the non-decomposable RSS loss was often reduced successfully, even with mini-batch SGD. As an example, the full-batch RSS loss under the updates of mini-batch SGD is plotted in the rightmost panel of Figure 4; it decreases rapidly and remains small. For SGD on non-decomposable losses, Kar et al. [32] provided guarantees for some cases; however, examining the behavior of more general non-decomposable losses under mini-batch updates remains an open problem.
4 Related work
The proposed framework is motivated by the operator-theoretic view of nonlinear dynamical systems. In contrast, learning a generative (state-space) model of nonlinear dynamical systems directly has been actively studied in the machine learning and optimal control communities, of which we mention a few examples. A classical but popular method for learning nonlinear dynamical systems is an expectation-maximization algorithm combined with Bayesian filtering/smoothing (see, e.g., [33]). Recently, approximate Bayesian inference with the variational autoencoder (VAE) technique [34] has been actively studied for learning generative dynamical models. Chung et al. [35] proposed a recurrent neural network with random latent variables, Gao et al. [36] utilized VAE-based inference for neural population models, and Johnson et al. [37] and Krishnan et al. [38] developed inference methods for structured models based on inference with a VAE. In addition, Karl et al. [39] proposed a method to obtain a more consistent estimation of nonlinear state-space models. Moreover, Watter et al. [40] proposed a similar approach in the context of optimal control. Since generative models are intrinsically aware of process and observation noise, incorporating the methodologies developed in such studies into the operator-theoretic perspective is an important open challenge for dealing explicitly with uncertainty.
5 Numerical examples
In this section, we provide numerical examples of DMD based on the LKIS framework (LKIS-DMD) implemented using neural networks. We conducted experiments on three typical nonlinear dynamical systems: a fixed-point attractor, a limit-cycle attractor, and a system with multiple basins of attraction. We show the results of comparisons with other recent DMD algorithms, i.e., Hankel DMD [41, 42], extended DMD [14], and DMD with reproducing kernels [15]. The detailed setups of the experiments discussed in this section and the next are described in the appendix.
Fixed-point attractor
Consider a two-dimensional nonlinear map on $\mathbb{R}^2$:
(9) $x_{1,t+1} = \lambda x_{1,t}, \qquad x_{2,t+1} = \mu x_{2,t} + (\lambda^2 - \mu) \, x_{1,t}^2,$
which has a stable equilibrium at the origin if $|\lambda| < 1$ and $|\mu| < 1$. The Koopman eigenvalues of system (9) include $\lambda$ and $\mu$, and the corresponding eigenfunctions are $\varphi_\lambda(x) = x_1$ and $\varphi_\mu(x) = x_2 - x_1^2$, respectively. $\lambda^2$ is also an eigenvalue, with corresponding eigenfunction $\varphi_{\lambda^2}(x) = x_1^2$. A minimal Koopman invariant subspace of system (9) is $\mathrm{span}\{x_1, x_2, x_1^2\}$, and the eigenvalues of the Koopman operator restricted to this subspace are $\lambda$, $\mu$, and $\lambda^2$. We generated a dataset using system (9) and applied LKIS-DMD, linear Hankel DMD [41, 42] (delay 2), and DMD with basis expansion by $\{x_1, x_2, x_1^2\}$, which corresponds to extended DMD [14] with a right and minimal observable dictionary. The estimated Koopman eigenvalues are shown in Figure 3, wherein LKIS-DMD successfully identifies the eigenvalues of the target invariant subspace. Figure 3 also shows eigenvalues estimated using data contaminated with white Gaussian observation noise. The eigenvalues estimated by LKIS-DMD coincide with the true values even with the observation noise, whereas the results of DMD with basis expansion (i.e., extended DMD) are directly affected by the observation noise.
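When the invariant subspace is known, this example can be replicated in a few lines: the sketch below simulates map (9), uses the dictionary $\{x_1, x_2, x_1^2\}$, and recovers the eigenvalues $\{\lambda, \mu, \lambda^2\}$ by plain DMD. The parameter values and initial condition are our own illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

lam, mu = 0.9, 0.5            # illustrative parameter values (assumed)

# simulate x1 <- lam*x1, x2 <- mu*x2 + (lam^2 - mu)*x1^2
n = 50
X = np.empty((2, n + 1))
X[:, 0] = [1.0, -0.5]
for t in range(n):
    x1, x2 = X[:, t]
    X[:, t + 1] = [lam * x1, mu * x2 + (lam**2 - mu) * x1**2]

# observables spanning a minimal Koopman invariant subspace
G = np.vstack([X[0], X[1], X[0] ** 2])
Y0, Y1 = G[:, :-1], G[:, 1:]
eigs = np.linalg.eigvals(Y1 @ np.linalg.pinv(Y0))
print(np.sort(eigs.real))     # approximately [mu, lam^2, lam] = [0.5, 0.81, 0.9]
```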
Limit-cycle attractor
We generated data from the limit cycle of the FitzHugh–Nagumo equation
(10) $\dot{x}_1 = x_1 - x_1^3/3 - x_2 + I, \qquad \dot{x}_2 = c \, (x_1 + a - b x_2),$
where $a$, $b$, $c$, and the input current $I$ are constants. Since trajectories on a limit cycle are periodic, the (discrete-time) Koopman eigenvalues should lie near the unit circle. Figure 4 shows the eigenvalues estimated by LKIS-DMD, linear Hankel DMD [41, 42] (delay 8), and DMDs with reproducing kernels [15] (polynomial kernel of degree 4 and RBF kernel of width 1). The eigenvalues produced by LKIS-DMD agree well with those produced by the kernel DMDs, whereas linear Hankel DMD produces eigenvalues that would correspond to rapidly decaying modes.
Multiple basins of attraction
Consider the unforced Duffing equation
(11) $\ddot{x} = -\delta \dot{x} - x \left( \beta + \alpha x^2 \right),$
where $\delta$, $\beta$, and $\alpha$ are constants. Unless the initial state is on the stable manifold of the saddle, states following (11) evolve toward one of the two stable equilibria, depending on which basin of attraction the initial value belongs to. Generally, a Koopman eigenfunction whose continuous-time eigenvalue is zero takes a constant value in each basin of attraction [14]; thus, the contour plot of such an eigenfunction shows the boundary of the basins of attraction. We generated 1,000 episodes of time-series starting at different initial values sampled uniformly. The left plot in Figure 5 shows the continuous-time Koopman eigenvalues estimated by LKIS-DMD, all of which correspond to decaying modes (i.e., negative real parts) and agree with the property of the data. The center plot in Figure 5 shows the true basins of attraction of (11), and the right plot shows the estimated values of the eigenfunction corresponding to the eigenvalue of the smallest magnitude. The surface of the estimated eigenfunction agrees qualitatively with the true boundary of the basins of attraction, which indicates that LKIS-DMD successfully identifies the Koopman eigenfunction.
6 Applications
The numerical experiments in the previous section demonstrated the feasibility of the proposed method as a fully data-driven method for Koopman spectral analysis. Here, we introduce practical applications of LKIS-DMD.
Chaotic timeseries prediction
Prediction of chaotic time-series has received significant interest in nonlinear physics. Here, we perform prediction of chaotic time-series using DMD, since DMD can be naturally utilized for prediction as follows. Since $\boldsymbol{g}(\tilde{x}_t)$ is decomposed as in Eq. (3) and the value of each eigenfunction is obtained by $\varphi_i(\tilde{x}_t) = w_i^* \boldsymbol{g}(\tilde{x}_t)$, where $w_i$ is a left eigenvector of $A$, the next step of $\boldsymbol{g}$ can be described in terms of the current step, i.e., $\boldsymbol{g}(\tilde{x}_{t+1}) = \sum_i \lambda_i \left( w_i^* \boldsymbol{g}(\tilde{x}_t) \right) v_i$. In addition, in the case of LKIS-DMD, the values of $\boldsymbol{g}$ must be back-projected to the measurement space using the learned $h$. We generated two types of univariate time-series by extracting a single coordinate series of the Lorenz attractor [43] and the Rössler attractor [44]. We simulated 25,000 steps for each attractor and used the first 10,000 steps for training, the next 5,000 steps for validation, and the last 10,000 steps for testing prediction accuracy. We examined the prediction accuracy of LKIS-DMD, a simple LSTM network, and linear Hankel DMD [41, 42], all of whose hyperparameters were tuned using the validation set. The prediction accuracy of every method and an example of the series predicted by LKIS-DMD on the test set are shown in Figure 7. As can be seen, the proposed LKIS-DMD achieves the smallest root-mean-square (RMS) errors in the 30-step prediction.
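DMD-based multi-step prediction amounts to iterating the estimated matrix $A = Y_1 Y_0^\dagger$ on the latest observable vector (equivalently, rescaling each eigenfunction value by its eigenvalue at every step). A minimal sketch of ours, with hypothetical names:

```python
import numpy as np

def dmd_predict(Y0, Y1, g_last, steps):
    """Predict future observable values by iterating A = Y1 Y0^+.

    g_last: observable vector at the last observed step.
    Returns an array whose k-th column is the k-step-ahead prediction.
    """
    A = Y1 @ np.linalg.pinv(Y0)
    preds, g = [], np.asarray(g_last, dtype=float)
    for _ in range(steps):
        g = A @ g                 # advance one step in observable space
        preds.append(g)
    return np.column_stack(preds)
```

In the LKIS setting, each predicted observable vector would additionally be passed through the learned reconstructor to return to the measurement space.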
Unstable phenomena detection
One of the most popular applications of DMD is the investigation of the global characteristics of dynamics by inspecting the spatial distribution of the dynamic modes. In addition to the spatial distribution, we can investigate the temporal profiles of mode activations by examining the values of the corresponding eigenfunctions. For example, assume there is an eigenfunction $\varphi$ that corresponds to a discrete-time eigenvalue whose magnitude is considerably smaller than one. Such a small eigenvalue indicates a rapidly decaying (i.e., unstable) mode; thus, we can detect occurrences of unstable phenomena by observing the values of $\varphi$. We applied LKIS-DMD to a time-series generated by a far-infrared laser, obtained from the Santa Fe Time Series Competition Data [45], and investigated the values of the eigenfunction $\hat{\varphi}$ corresponding to the eigenvalue of the smallest magnitude. The original time-series and the values of $\hat{\varphi}$ obtained by LKIS-DMD are shown in Figure 7. As can be seen, the activations of $\hat{\varphi}$ coincide with sudden decays of the pulsation amplitudes. For comparison, we applied novelty/change-point detection techniques using a one-class support vector machine (OC-SVM) [46] and direct density-ratio estimation by relative unconstrained least-squares importance fitting (RuLSIF) [47]. Defining the sudden decays of the amplitudes as the points to be detected, the AUCs were 0.924, 0.799, and 0.803 for LKIS, OC-SVM, and RuLSIF, respectively.
7 Conclusion
In this paper, we have proposed a framework for learning Koopman invariant subspaces, which yields a fully data-driven numerical algorithm for Koopman spectral analysis. In contrast to existing approaches, the proposed method learns (approximately) a Koopman invariant subspace entirely from the available data based on the minimization of RSS loss. We have shown empirical results for several typical nonlinear dynamics and application examples.
We have also introduced an implementation using multilayer perceptrons; however, one possible drawback of such an implementation is the local optima of the objective function, which makes it difficult to assess the adequacy of the obtained results. Rather than using neural networks, the observables to be learned could be modeled by a sparse combination of basis functions as in [23], while still utilizing optimization based on the RSS loss. Another possible future research direction is incorporating approximate Bayesian inference methods, such as the VAE [34]. The proposed framework is based on a discriminative viewpoint, but inference methodologies for generative models could be used to modify the proposed framework to explicitly consider uncertainty in the data.
Acknowledgments
This work was supported by JSPS KAKENHI Grant No. JP15J09172, JP26280086, JP16H01548, and JP26289320.
References
 Hirsch et al. [2012] M. W. Hirsch, S. Smale, and R. L. Devaney. Differential Equations, Dynamical Systems, and an Introduction to Chaos. Academic Press, 3rd edition, 2012.
 Lasota and Mackey [1994] A. Lasota and M. C. Mackey. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Springer, 2nd edition, 1994.
 Koopman [1931] B. O. Koopman. Hamiltonian systems and transformation in Hilbert space. Proc. of the National Academy of Sciences of the United States of America, 17(5):315–318, 1931.
 Mezić [2005] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynamics, 41:309–325, 2005.
 Budišić et al. [2012] M. Budišić, R. Mohr, and I. Mezić. Applied Koopmanism. Chaos, 22:047510, 2012.
 Rowley et al. [2009] C. W. Rowley, I. Mezić, S. Bagheri, P. Schlatter, and D. S. Henningson. Spectral analysis of nonlinear flows. J. of Fluid Mechanics, 641:115–127, 2009.
 Schmid [2010] P. J. Schmid. Dynamic mode decomposition of numerical and experimental data. J. of Fluid Mechanics, 656:5–28, 2010.
 Proctor and Eckhoff [2015] J. L. Proctor and P. A. Eckhoff. Discovering dynamic patterns from infectious disease data using dynamic mode decomposition. Int’l Health, 7(2):139–145, 2015.
 Brunton et al. [2016a] B. W. Brunton, L. A. Johnson, J. G. Ojemann, and J. N. Kutz. Extracting spatial–temporal coherent patterns in large-scale neural recordings using dynamic mode decomposition. J. of Neuroscience Methods, 258:1–15, 2016a.
 Berger et al. [2015] E. Berger, M. Sastuba, D. Vogt, B. Jung, and H. B. Amor. Estimation of perturbations in robotic behavior using dynamic mode decomposition. Advanced Robotics, 29(5):331–343, 2015.
 Kutz et al. [2016a] J. N. Kutz, X. Fu, and S. L. Brunton. Multiresolution dynamic mode decomposition. SIAM J. on Applied Dynamical Systems, 15(2):713–735, 2016a.
 Mauroy and Goncalves [2016] A. Mauroy and J. Goncalves. Linear identification of nonlinear systems: A lifting technique based on the Koopman operator. In Proc. of the IEEE 55th Conf. on Decision and Control, pages 6500–6505, 2016.
 Kutz et al. [2016b] J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor. Dynamic Mode Decomposition: Data-Driven Modeling of Complex Systems. SIAM, 2016b.
 Williams et al. [2015] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley. A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition. J. of Nonlinear Science, 25(6):1307–1346, 2015.
 Kawahara [2016] Y. Kawahara. Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis. In Advances in Neural Information Processing Systems, volume 29, pages 911–919. 2016.
 Mezić [2013] I. Mezić. Analysis of fluid flows via spectral properties of the Koopman operator. Annual Review of Fluid Mechanics, 45:357–378, 2013.
 Froyland et al. [2013] G. Froyland, G. A. Gottwald, and A. Hammerlindl. A computational method to extract macroscopic variables and their dynamics in multiscale systems. SIAM J. on Applied Dynamical Systems, 13(4):1816–1846, 2013.
 Takeishi et al. [2017] N. Takeishi, Y. Kawahara, and T. Yairi. Subspace dynamic mode decomposition for stochastic Koopman analysis. Physical Review E, 96:033310, 2017.
 Proctor et al. [2016a] J. L. Proctor, S. L. Brunton, and J. N. Kutz. Dynamic mode decomposition with control. SIAM J. on Applied Dynamical Systems, 15:142–161, 2016a.
 Proctor et al. [2016b] J. L. Proctor, S. L. Brunton, and J. N. Kutz. Generalizing Koopman theory to allow for inputs and control. art. arXiv:1602.07647, 2016b.
 Tu et al. [2014] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. of Computational Dynamics, 1(2):391–421, 2014.
 Brunton et al. [2016b] S. L. Brunton, B. W. Brunton, J. L. Proctor, and J. N. Kutz. Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control. PLoS ONE, 11(2):e0150171, 2016b.
 Brunton et al. [2016c] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. of the National Academy of Sciences of the United States of America, 113(15):3932–3937, 2016c.
 Rakočević [1997] V. Rakočević. On continuity of the Moore–Penrose and Drazin inverses. Matematički Vesnik, 49:163–172, 1997.
 Takens [1981] F. Takens. Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Lecture Notes in Mathematics, pages 366–381. 1981.
 Sauer et al. [1991] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. J. of Statistical Physics, 65(3):579–616, 1991.
 Garcia and Almeida [2005] S. P. Garcia and J. S. Almeida. Multivariate phase space reconstruction by nearest neighbor embedding with different time delays. Physical Review E, 72(2):027205, 2005.
 Hirata et al. [2006] Y. Hirata, H. Suzuki, and K. Aihara. Reconstructing state spaces from multivariate data using variable delays. Physical Review E, 74(2):026202, 2006.
 Vlachos and Kugiumtzis [2010] I. Vlachos and D. Kugiumtzis. Nonuniform statespace reconstruction and coupling detection. Physical Review E, 82(1):016207, 2010.
 He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. art. arXiv:1502.01852, 2015.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd Int’l Conf. on Machine Learning, pages 448–456, 2015.
 Kar et al. [2014] P. Kar, H. Narasimhan, and P. Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, volume 27, pages 694–702. 2014.
 Ghahramani and Roweis [1999] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, volume 11, pages 431–437. 1999.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Stochastic gradient VB and the variational autoencoder. In Proc. of the 2nd Int’l Conf. on Learning Representations, 2014.
 Chung et al. [2015] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, volume 28, pages 2980–2988. 2015.
 Gao et al. [2016] Y. Gao, E. W. Archer, L. Paninski, and J. P. Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, volume 29, pages 163–171. 2016.
 Johnson et al. [2016] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, volume 29, pages 2946–2954. 2016.
 Krishnan et al. [2017] R. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In Proc. of the 31st AAAI Conf. on Artificial Intelligence, pages 2101–2109, 2017.
 Karl et al. [2017] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In Proc. of the 5th Int’l Conf. on Learning Representations, 2017.
 Watter et al. [2015] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, volume 28, pages 2746–2754. 2015.
 Arbabi and Mezić [2016] H. Arbabi and I. Mezić. Ergodic theory, dynamic mode decomposition and computation of spectral properties of the Koopman operator. art. arXiv:1611.06664, 2016.
 Susuki and Mezić [2015] Y. Susuki and I. Mezić. A Prony approximation of Koopman mode decomposition. In Proc. of the IEEE 54th Conf. on Decision and Control, pages 7022–7027, 2015.
 Lorenz [1963] E. N. Lorenz. Deterministic nonperiodic flow. J. of the Atmospheric Sciences, 20:130–141, 1963.
 Rössler [1976] O. E. Rössler. An equation for continuous chaos. Physics Letters A, 57(5):397–398, 1976.
 Weigend and Gershenfeld [1993] A. S. Weigend and N. A. Gershenfeld. Time Series Prediction: Forecasting The Future And Understanding The Past. Westview Press, 2nd edition, 1993.
 Canu and Smola [2006] S. Canu and A. Smola. Kernel methods and the exponential family. Neurocomputing, 69(7–9):714–720, 2006.
 Liu et al. [2013] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
 Chen et al. [2012] K. K. Chen, J. H. Tu, and C. W. Rowley. Variants of dynamic mode decomposition: Boundary condition, Koopman, and Fourier analyses. J. of Nonlinear Science, 22(6):887–915, 2012.
 Funk [2015] S. Funk. RMSprop loses to SMORMS3 - Beware the Epsilon! http://sifter.org/~simon/journal/20150420.html, 2015. [Online; accessed 29-April-2017].
 Shampine and Reichelt [1997] L. F. Shampine and M. W. Reichelt. The MATLAB ODE suite. SIAM J. on Scientific Computing, 18:1–22, 1997.
 Brunton et al. [2017] S. L. Brunton, B. W. Brunton, J. L. Proctor, E. Kaiser, and J. N. Kutz. Chaos as an intermittently forced linear system. Nature Communications, 8:19, 2017.
 Chang and Lin [2011] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. ACM Trans. on Intelligent Systems and Technology, 2(3):1–27, 2011.
Appendix A Algorithm of dynamic mode decomposition
Dynamic mode decomposition (DMD) was originally invented as a tool for inspecting fluid flows [6, 7], and it has been utilized in various fields other than fluid dynamics. The output of DMD coincides with the Koopman spectral analysis if we have an observable $\boldsymbol{g}$ that spans a Koopman invariant subspace. The popular algorithm of DMD [21], which is based on the singular value decomposition (SVD) of a data matrix, is defined as follows.
Algorithm (DMD [21]).

1. Given a sequence of observable values $(\boldsymbol{y}_0, \boldsymbol{y}_1, \dots, \boldsymbol{y}_m)$, define data matrices $\boldsymbol{Y}_0 = [\boldsymbol{y}_0 \; \cdots \; \boldsymbol{y}_{m-1}]$ and $\boldsymbol{Y}_1 = [\boldsymbol{y}_1 \; \cdots \; \boldsymbol{y}_m]$.

2. Calculate the compact SVD of $\boldsymbol{Y}_0$ as $\boldsymbol{Y}_0 = \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^*$, where $r$ is the rank of $\boldsymbol{Y}_0$.

3. Compute a matrix $\tilde{\boldsymbol{A}} = \boldsymbol{U}^* \boldsymbol{Y}_1 \boldsymbol{V} \boldsymbol{S}^{-1}$.

4. Calculate the eigendecomposition of $\tilde{\boldsymbol{A}}$, i.e., compute $\lambda_i$ and $\boldsymbol{w}_i$ such that $\tilde{\boldsymbol{A}} \boldsymbol{w}_i = \lambda_i \boldsymbol{w}_i$.

5. In addition, calculate left eigenvectors $\boldsymbol{z}_i$ of $\tilde{\boldsymbol{A}}$, i.e., $\boldsymbol{z}_i^* \tilde{\boldsymbol{A}} = \lambda_i \boldsymbol{z}_i^*$.

6. Back-project the eigenvectors to the original space by $\boldsymbol{v}_i = \boldsymbol{U} \boldsymbol{w}_i$ and $\tilde{\boldsymbol{z}}_i = \boldsymbol{U} \boldsymbol{z}_i$.

7. Normalize $\boldsymbol{v}_i$ and $\tilde{\boldsymbol{z}}_i$ such that $\tilde{\boldsymbol{z}}_i^* \boldsymbol{v}_j = \delta_{i,j}$, where $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ otherwise.

8. Return the dynamic modes $\boldsymbol{v}_i$ and the corresponding eigenvalues $\lambda_i$. In addition, return the values of the corresponding eigenfunctions by $\varphi_i(\boldsymbol{x}_t) = \tilde{\boldsymbol{z}}_i^* \boldsymbol{y}_t$.
The eigenvalues computed using the algorithm above are “discrete-time” ones, in the sense that they represent frequencies and decay rates in terms of discrete-time dynamical systems. The “continuous-time” counterparts can be computed easily by $\lambda_i^{\mathrm{c}} = \log(\lambda_i) / \Delta t$, where $\Delta t$ is the time interval in the discrete-time setting.
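For reference, the SVD-based procedure above can be sketched with NumPy as follows. This is a minimal illustration of standard DMD, not the implementation used in the paper; it obtains the left eigenvectors from the rows of $W^{-1}$, which are automatically normalized so that $z_i^* w_j = \delta_{ij}$.

```python
import numpy as np

def dmd(Y0, Y1, dt=1.0):
    """SVD-based DMD of data matrices Y0 = [y_0 ... y_{m-1}], Y1 = [y_1 ... y_m]."""
    # Step 2: compact SVD of Y0, truncated to the numerical rank r.
    U, s, Vh = np.linalg.svd(Y0, full_matrices=False)
    tol = max(Y0.shape) * np.finfo(float).eps * s[0]
    r = int(np.sum(s > tol))
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    # Step 3: low-dimensional matrix  A~ = U* Y1 V S^{-1}.
    Atilde = U.conj().T @ Y1 @ Vh.conj().T / s
    # Steps 4-5: right eigenvectors (columns of W); rows of inv(W) give left
    # eigenvectors, already normalized so that z_i^* w_j = delta_{ij} (step 7).
    lam, W = np.linalg.eig(Atilde)
    Z = np.linalg.inv(W).conj().T              # left eigenvectors as columns
    # Step 6: back-project to the original space.
    modes = U @ W                              # dynamic modes v_i (columns)
    ztilde = U @ Z
    # Step 8: eigenfunction values  phi_i(x_t) = ztilde_i^* y_t.
    phi = ztilde.conj().T @ Y0
    # Discrete- to continuous-time eigenvalues.
    lam_c = np.log(lam.astype(complex)) / dt
    return lam, modes, phi, lam_c
```

For data generated exactly by a linear map $\boldsymbol{y}_{t+1} = \boldsymbol{A}\boldsymbol{y}_t$, the returned eigenvalues coincide with those of $\boldsymbol{A}$, and `modes @ diag(lam) @ phi` reconstructs $\boldsymbol{Y}_1$.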
Note that the definition above presumes access to an appropriate observable $\boldsymbol{g}$. In contrast, in the proposed method, the above-mentioned algorithm is run after applying the proposed framework to learn $\boldsymbol{g}$ from observed data.
Appendix B Detailed experimental setup
This appendix describes the configurations of the numerical examples and applications that were omitted from the main text.
B.1 General settings
Hyperparameters
In each experiment, the weighting parameter of the objective function was fixed at a single value; the quality of the results was not sensitive to this value. We modeled the observable $\boldsymbol{g}$ and the reconstructor $\boldsymbol{h}$ with multilayer perceptrons, setting the number of hidden nodes of each network to the arithmetic mean of its input and output sizes. Therefore, the remaining hyperparameters to be tuned were the maximum lag, the output dimensionality of the linear embedder, and the dimensionality of the observable. However, unless otherwise noted, the embedder dimensionality was fixed, so the independent hyperparameters were the maximum lag and the observable dimensionality.
Preprocessing
One must not subtract the mean from the original data, because subtracting values from the data may change the spectrum of the underlying dynamical system (see, e.g., [48]). If the absolute values of the data were too large, we simply divided each series by its maximum absolute value.
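This preprocessing can be sketched as follows (an illustrative Python helper, not the paper's code): each series is rescaled by its maximum absolute value, and no mean subtraction is performed.

```python
import numpy as np

def scale_without_centering(Y):
    """Divide each series (row of Y) by its maximum absolute value.

    The mean is deliberately NOT subtracted: shifting the data can alter
    the spectrum of the underlying dynamics (see, e.g., [48])."""
    m = np.max(np.abs(Y), axis=1, keepdims=True)
    m[m == 0] = 1.0  # leave all-zero series unchanged
    return Y / m
```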
Optimization
In optimization, we found that the adaptive learning rate of SMORMS3 [49] achieved faster convergence than a fixed learning rate and other adaptation techniques. The maximum learning rate of SMORMS3 was selected in each experiment according to the amount of data. In some cases, optimization was performed in two stages: in the first stage, the parameters of the embedder, $\boldsymbol{g}$, and $\boldsymbol{h}$ were all updated; in the second stage, the parameters of the embedder and $\boldsymbol{g}$ were fixed and only $\boldsymbol{h}$ was updated. This two-stage optimization was particularly useful for the prediction application, where a precise reconstruction of the original measurements was necessary. Moreover, when the original states of the dynamical system were available and used without delays, the weight of the linear embedder was fixed to an identity matrix (i.e., no embedder was learned). Also, we set the mini-batch size between 100 and 500, because smaller mini-batches often led to unstable computation of the pseudo-inverse.
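The role of the pseudo-inverse here can be made concrete with a sketch of the residual sum of squares of the linear least-squares fit, the quantity minimized when learning the observable (a hypothetical NumPy version, not the paper's implementation; `G0` and `G1` denote observable values at consecutive time steps). The least-squares solution involves `pinv(G0)`, whose conditioning degrades on small mini-batches.

```python
import numpy as np

def rss_loss(G0, G1):
    """Residual sum of squares of the linear fit G1 ~ A @ G0.

    G0, G1 : (n, m) matrices of observable values at consecutive steps.
    A = G1 @ pinv(G0) is the least-squares solution; the loss is the
    squared Frobenius norm of the residual G1 - A @ G0.
    """
    A = G1 @ np.linalg.pinv(G0)
    return float(np.sum((G1 - A @ G0) ** 2))
```

The loss vanishes exactly when the observable values evolve linearly, which is the criterion used to drive the learned functions toward a Koopman invariant subspace.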
B.2 Fixed-point attractor experiment
In the experiment using the fixed-point attractor, the data were generated from four different initial values, with episodes of equal length. In the case of the noisy dataset, observation noise with a fixed standard deviation was added. In both experiments (with and without observation noise), the maximum lag and the dimensionality of the observable were set so as to cover the minimal three-dimensional Koopman invariant subspace.
B.3 Limit-cycle attractor experiment
The data were generated using MATLAB’s ode45 function [50], which was run with a fixed time step from a single initial value for 2,000 steps. The hyperparameters of LKIS-DMD, linear Hankel DMD, and the kernel DMDs were set such that each method produced 16 eigenvalues: the maximum lag and observable dimensionality of LKIS-DMD were chosen accordingly, and, in the kernel DMDs, POD modes whose singular values fell below a threshold were discarded (with the threshold set separately for the polynomial kernel and for the RBF kernel).
B.4 Multiple basins of attraction experiment
The data were generated using the settings provided in the literature [14]; 1,000 initial values were drawn from a uniform distribution, and each initial value was advanced in time for 11 steps with a fixed time step. We used MATLAB’s ode45 function for numerical integration. For LKIS-DMD, we set the maximum lag and the dimensionality of the observable accordingly. Note that the values of the estimated eigenfunctions were evaluated and plotted at each data point.
B.5 Chaotic time-series prediction experiment
The data were generated from the Lorenz attractor [43] and the Rössler attractor [44], each with fixed parameter settings. We generated 25,000 steps for each attractor and divided them into training, validation, and test sets. For all methods, the delay dimension was fixed at a common value: it determined the maximum lag for LKIS-DMD and linear Hankel DMD, and backpropagation through time was truncated to the same length when training the LSTM network. We tuned the observable dimensionality of LKIS-DMD and the dimensionality of the LSTM’s hidden state according to the 30-step prediction accuracies obtained using the validation set, selecting the best values separately for the Lorenz and the Rössler data.
In this experiment, the LSTM was applied because it has been utilized for various nonlinear time series, and Hankel DMD was used because it has been successfully applied to the analysis of chaotic systems [51].
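Delay-embedded (Hankel-style) data matrices of the kind used by linear Hankel DMD, and by the delayed inputs of the embedder, can be built as follows; a small illustrative helper with names of our own choosing, not taken from the paper:

```python
import numpy as np

def delay_embed(y, k):
    """Stack k successive delays of a (d, T) series into a (d*k, T-k+1)
    Hankel-style data matrix: column t holds y_t, y_{t+1}, ..., y_{t+k-1}."""
    d, T = y.shape
    cols = T - k + 1
    return np.vstack([y[:, i:i + cols] for i in range(k)])
```

Feeding `delay_embed(y, k)[:, :-1]` and `delay_embed(y, k)[:, 1:]` to a DMD routine yields linear Hankel DMD on the delay coordinates.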
B.6 Unstable phenomena detection experiment
The dataset was obtained from the Santa Fe Time Series Competition Data [45]. Note that the authors’ [45] original web page was not available at the time of submission of this manuscript (May 2017); however, the dataset itself was still available online. The length of the delay (or sliding window) was fixed at a common value for all methods applied in this experiment. In addition, no intensive tuning of the other hyperparameters was conducted, because the purpose of this experiment was qualitative. The default settings of LIBSVM [52] were used for the one-class SVM (except for one parameter). For the density-ratio estimation by RuLSIF, the default values of the implementation by the authors of [47] were used.
In this experiment, the one-class SVM was applied because it is a de facto standard for novelty/change-point detection, and RuLSIF was used because it had achieved the best performance among methods based on density-ratio estimation [47].