Sparsitypromoting algorithms for the discovery of informative Koopman invariant subspaces
Abstract
Koopman decomposition is a nonlinear generalization of eigen decomposition, and is being increasingly utilized in the analysis of spatiotemporal dynamics. Wellknown techniques such as the dynamic mode decomposition (DMD) and its variants provide approximations to the Koopman operator, and have been applied extensively in many fluid dynamic problems. Despite being endowed with a richer dictionary of nonlinear observables, nonlinear variants of the DMD, such as extended/kernel dynamic mode decomposition (EDMD/KDMD) are seldom applied to largescale problems primarily due to the difficulty of discerning the Koopman invariant subspace from thousands of resulting Koopman triplets: eigenvalues, eigenvectors, and modes. To address this issue, we revisit the formulation of EDMD and KDMD, and propose an algorithm based on multitask feature learning to extract the most informative Koopman invariant subspace by removing redundant and spurious Koopman triplets. These algorithms can be viewed as sparsity promoting extensions of EDMD/KDMD and are presented in an opensource package. Further, we extend KDMD to a continuoustime setting and show a relationship between the present algorithm, sparsitypromoting DMD and a empirical criterion from the viewpoint of nonconvex optimization. The effectiveness of our algorithm is demonstrated on examples ranging from simple dynamical systems to twodimensional cylinder wake flows at different Reynolds numbers and a threedimensional turbulent ship airwake flow. The latter two problems are designed such that very strong transients are present in the flow evolution, thus requiring accurate representation of decaying modes.
oopman decomposition; Sparsity promoting techniques; Kernel methods; feature learning.
1 Introduction
Koopman theory (Budišić et al., 2012) offers an elegant framework to reduce spatiotemporal fields associated with nonlinear dynamics as a linear combination of time evolving modes (Rowley et al., 2009). Consider a general continuous nonlinear dynamical system , where the system state , is a vectorvalued smooth function and is the tangent bundle, i.e., . The aforementioned task of decomposition is equivalent to finding a set of observables associated with the system, , that evolve in time while linearly spanning the system state . The significance of the above statement is that it represents a global linearization of the nonlinear dynamical system (Budišić et al., 2012; Brunton et al., 2016a). Formally, this idea can be traced back to Koopman operator theory for Hamiltonian dynamical systems introduced by Koopman (1931) to study the evolution of observables on . For , the continuoustime Koopman operator is the set that governs the dynamics of observables as where is the flow map of the dynamical system (Mezić, 2013). While the Koopman operator is linear over the space of observables, is most often infinitedimensional, e.g., which makes the approximation of the Koopman operator a difficult problem. Throughout this work, we assume . Readers interested in a detailed discussion of are referred to Bruce et al. (2019).
is a Koopman invariant subspace of , if for any , for any , we have . Extracting such a subspace yields the linear representation we seek, and will yield useful coordinates for a multitude of applications including analysis and control.
In general, physical systems governed by PDEs are infinitedimensional. From a numerical viewpoint, the number of degrees of freedom can be related to the spatial discretization (for example, the number of grid points). Although a finitedimensional manifold can be extracted (Holmes et al., 2012), e.g.,  dimensions via Proper Orthogonal Decomposition (POD), finding a Koopman invariant subspace on such manifolds is still challenging. Currently, the most popular method to approximate the Koopman operator is the dynamic mode decomposition (DMD) (Schmid, 2010; Rowley and Dawson, 2017) mainly for two reasons. First, it is straightforward and computationally efficient compared to nonlinear counterparts such as Extended DMD (EDMD) (Williams et al., 2015) and Kernel DMD (KDMD) (Williams et al., 2014). Second, the essence of DMD is to decompose a spatiotemporal field into several temporally growing/decaying travelling/stationary harmonic waves, which are prevalent in fluid mechanics.However, the accuracy of DMD is limited by the assumption that the Koopman invariant subspace lies in the space spanned by snapshots of the state . Thus, DMD is used to mainly identify and visualize coherent structures. Indeed, DMD can be interpreted as a projection of the action of the Koopman operator on the linear space spanned by snapshots of the system state (Korda and Mezić, 2018).
To overcome the above limitation, it is natural to augment the observable space with either the history of the state (Arbabi and Mezić, 2017a; Brunton et al., 2017; Kamb et al., 2018; Le Clainche and Vega, 2017) or nonlinear observables of the state (Williams et al., 2014, 2015). Although the former is simpler to implement and has strong connections to Takens’ embedding (Kamb et al., 2018; Pan and Duraisamy, 2019), the main practical issue in a predictive setting is the requirement of a large number of data snapshots of the system. Further, if one is only interested in the posttransient dynamics of the system state on an attractor, linear observables with time delays are sufficient to extract an informative Koopman invariant subspace (Arbabi and Mezić, 2017a; Pan and Duraisamy, 2019; Brunton et al., 2017; Arbabi and Mezić, 2017b; Mezić, 2005; Röjsel, 2017). However, if one is interested in the transient dynamics leading to an attractor, or in a dissipative nonlinear system (Williams et al., 2015) for control (Kaiser et al., 2017) or reduced order modeling, time delay embedding may become inappropriate as hundreds of delay snapshots might be required to initialize the model. Hence, nonlinear observables may be more appropriate for such problems.
Driven by the interest in modal analysis and control of transient flow phenomena, we consider augmentation of the observable space with nonlinear functions of the state, e.g., EDMD (Williams et al., 2015)/KDMD (Williams et al., 2014). Although it has been reported that KDMD allows for a set of more interpretable Koopman eigenvalues (Williams et al., 2014) and better accuracy (Röjsel, 2017), issues such as mode selection, spurious modes (Kaiser et al., 2017; Zhang et al., 2017), and choice of dictionary/kernel in EDMD/KDMD remain open. In fact, the choice of kernel type and hyperparameter can significantly affect the resulting eigenmodes, distribution of eigenvalues (Kutz et al., 2016), and the accuracy of predictions (Zhang et al., 2017).
Searching for an accurate and informative Koopman invariant subspace has long been a pursuit in the DMD community. Rowley et al. (2009) and Schmid et al. (2012) considered selecting dominant DMD modes in order of their amplitudes. However, following such a criterion (Tu et al., 2013; Kou and Zhang, 2017), may result in the selection of spurious modes that may have large amplitudes, but decay rapidly. As a result, Tu et al. (2013) considered weighting the loss term by the magnitude of eigenvalues to penalize the retention of fast decaying modes. Sparsitypromoting DMD (spDMD) developed by Jovanović et al. (2014) recasts mode selection in DMD as an optimization problem with a penalty. With a preference of stable modes over fast decaying ones, Tissot et al. (2014) proposed a simpler criterion based on timeaveragedeigenvalueweighted amplitude. This was followed by Kou and Zhang (2017) who used a similar criterion but computed the “energy” of each mode, yielding similar performance to spDMD at a lower computational cost. Based on the orthonormal property of pseudoinverse, Hua et al. (2017) proposed an ordering of Koopman modes by defining a new “energy”. Compared with previous empirical criteria, the “energy” for each mode involves a pseudoinverse which combines the influence from all eigenmodes, and therefore the contribution from each mode cannot be isolated.
Instead of selecting modes from a “reconstruction” perspective, Zhang et al. (2017) studied the issue of spurious modes by evaluating the deviation of the identified eigenfunctions from linear evolution in an a priori sense. Further, Optimized DMD (Chen et al., 2012; Askham and Kutz, 2018) combines DMD with mode selection simultaneously, which is the forerunner of recently proposed neural networkbased models for Koopman eigenfunctions in spirit (Takeishi et al., 2017; Lusch et al., 2018; Otto and Rowley, 2019; Pan and Duraisamy, 2020; Li et al., 2017; Yeung et al., 2019). Regardless of the above issues related to nonconvex optimization (Dawson et al., 2016), extension of optimized DMD to EDMD/KDMD is not straightforward. Further, neural networkbased models require large amounts of data, are prone to local minima, and lack interpetability.
There have been a few attempts towards mode selection in EDMD/KDMD. Brunton et al. (2016a) present an iterative method that augments the dictionary of EDMD until a convergence criterion is reached for the subspace. This is effectively a recursive implementation of EDMD. Recently, Haseli and Cortés (2019) showed that given a sufficient amount of data, if there is any accurate Koopman eigenfunction spanned by the dictionary, it must correspond to one of the obtained eigenvectors. Moreover, they proposed the idea of mode selection by checking if the reciprocal of identified eigenvalue also appears when the temporal sequence of data is reversed, which is similar to the idea of comparing eigenvalues on the complex plane from different trajectories, as proposed by Hua et al. (2017). In contrast to the “bottomup” method of Brunton et al. (2016a) with subspace augmentation, Hua et al. (2017) proposed a “topdown” subspace subtraction method relying on iteratively projecting the features onto the null space. A similar idea can be traced back to Kaiser et al. (2017) who propose a search for the sparsest vector in the null space.
Enlightened by the aforementioned attempts, we propose a EDMD/KDMD framework equipped with the following strategy to extract the Koopman invariant subspace:

We first evaluate the normalized maximal deviation of the evolution of each eigenfunction from linear evolution in a posteriori way.

Using the above criteria, we select a userdefined number of accurate EDMD/KDMD modes;
To the best of our knowledge, this is the first attempt to address sparse identification of the Koopman invariant subspace in EDMD/KDMD with both sparse reconstruction and removal of spurious modes.
The organization of the paper is as follows. In Section 2, we provide a review of EDMD/KDMD in discretetime and present corresponding extensions to continuoustime. Following this, we discuss current challenges in standard EDMD/KDMD. In Section 3, we propose the framework of sparse identification of the informative Koopman invariant subspace for EDMD/KDMD with hyperparameter selection and implementation with the help of a parallel SVD tool. Differences and connections between spDMD, Kou’s emprical criterion, and our proposed framework is shown from the viewpoint of optimization. In Section 4, numerical verification and modal analysis on examples from a fixed point attractor to a transient cylinder wake flow, and turbulent shipairwake are provided. In Section 5, conclusions are drawn.
2 Review of EDMD/KDMD
In this section, we provide a summary of our framework to extract the Koopman operator using EDMD/KDMD in both discretetime and continuoustime. Although the original EDMD is seldom used for datadriven Koopman analysis in fluid flows due to its poor scaling, it has recently been reported that a scalable version of EDMD can be applied to highdimensional systems (DeGennaro and Urban, 2019). Since our algorithm to extract the Koopman invariant subspace is agnostic to the type of technique to compute the original set of Koopman modes, we include both EDMD and KDMD in this paper for completeness.
2.1 Extended Dynamic Mode Decomposition
Discretetime Formulation
For simplicity, consider sequential snapshots sampled uniformly in time with on a trajectory, . The EDMD algorithm (Williams et al., 2015) assumes a dictionary of linearly independent functions , that approximately spans a Koopman invariant subspace
(1) 
Thus we can write for any as with , where the feature vector is
(2) 
Consider a set of observables as , with . After the discretetime Koopman operator is applied on each , given data , we have the following,
(3) 
The standard EDMD algorithm seeks a such that
(4) 
as the sum of the square of residual from eq. 3 is minimized over . In the above equation,
(5) 
(6) 
Considering , one obtains . Thus, the corresponding minimizer is,
(7) 
where denotes the pseudoinverse and
(8)  
(9) 
where H denotes conjugate transpose. Note that when the set of observable fully span , i.e., is full rank, reduces to identity. Then we have the more familiar as,
(10) 
which is independent of the choice of .
Assuming that all eigenvalues of are simple
(11) 
where . Finally, the Koopman modes of a vectorvalued dimensional observable function with are the vectors of weights assoicated with the expansion in terms of eigenfunctions as,
(12) 
where is often numerically determined in practice by ordinary least squares,
(13) 
with .
Continuoustime Formulation
Consider data snapshots of the dynamical system with state sampled over as where . Recall the generator of the semigroup of Koopman operators where is the domain in which the aforementioned limit is welldefined and . One can have the evolution of feature vector as,
(14) 
Similarly, one can find a that minimizes the sum of the square of residual minimized solution,
(15) 
where
(16)  
(17) 
Consider eigenvalues and eigenvectors of . Koopman eigenfunctions are in the same form as that in discretetime formulations while continuoustime Koopman eigenvalues can be converted to the aforementioned discretetime sense as .
2.2 Kernel Dynamic Mode Decomposition
Discretetime Formulation
Instead of explicitly expressing a dictionary of candidate functions, one can instead implicitly define a dictionary of candidate functions via the kernel trick, which is essentially the dual form of EDMD (Williams et al., 2014). Note that, from the EDMD formulation in eq. 4, any vector in the range of orthogonal to the range of is annihilated, and therefore cannot be inferred (Williams et al., 2014). Thus, assuming has rank , denote a full SVD of and economicalSVD , one can represent without loss (Otto and Rowley, 2019). Then since the multiplication by a unitary matrix preserves the Frobenius form, one can have , where are the last rows of . By taking derivatives with respect to , one can obtain the minimalnorm minimizer as,
(18) 
Notice that, since any column in the span of that is orthogonal to the span of will be annihilated by and thus cannot be utilized to determine , it is reasonable to restrict within the column space of . Assuming is sufficiently large such that the column space of fully spans , eq. 19 can be proved (details in appendix A),
(19) 
With eq. 19, we can rewrite eq. 18 as the familiar KDMD formulation,
(20) 
where with , for . Again, such a minimizer is independent of the choice of .
Notice that, to compute Koopman eigenvalues and eigenfunctions, one would only need access to and , i.e., the inner product among features on all data points. Fortunately, on a compact domain , it is wellknown from Mercer’s theorem (Mercer, 1909) that once a suitable nonnegative kernel function is defined, is the inner product among a potentially infinite dimensional feature vector evaluated at . Note that the choice of such a feature vector might not be unique but the corresponding reproducing kernel Hilbert space (RKHS) is (Aronszajn, 1950). In the case of a Gaussian kernel, one can choose the canonical feature vector which are “bumps” of a certain bandwidth distributed on . From the view point of kernel PCA (Williams et al., 2014), resulting from the finite dimensional rank truncation on the Gram matrix is a numerical approximation to the most dominant varianceexplained mode shapes in the RKHS evaluated on the data points (Rasmussen, 2003), and represents the dominant varianceexplaining directions in terms of the feature vector in the RKHS.
Similar to EDMD, given , , for , the corresponding th Koopman eigenfunctions and Koopman modes for a vector observable are,
(21)  
(22) 
Continuoustime Formulation
For the kernel trick to be applied in the continuous formulation, write as,
(23) 
To compute , denoting the th component of as ,
(24) 
where the appearance of Jacobian indicates that a differentiable kernel function is required for the extension to continuoustime. For common kernels used in Koopman analysis, the kernel function, Jacobian, and hyperparameters are listed in table 1.
kernel type  hyper para.  

linear  
polynomial  
isotropic Gaussian 
2.3 Challenges in EDMD/KDMD
In this section, we briefly discuss two broad challenges in the use of EDMD and KDMD for Koopman analysis.
Mode Selection
The number of approximated Koopman tuples (eigenfunction, eigenvalue, modes) from EDMD grows with the dictionary size, whereas the KDMD grows with the number of snapshots. However, in most cases, a significant number of the eigenfunctions fail to evolve linearly, or are redundant in contribution to the reconstruction of the state . For example, the Koopman eigenfunctions that vanish nowhere form an Abelian group under pointwise products of functions, while polynomial observables evolve linearly for a general linear system (Budišić et al., 2012). These eigenfunctions, associated with the polynomial observables, are redundant in terms of providing an intrinsic coordinate for the linear dynamics.
When the number of features is larger than the number of data snapshots, EDMD eigenvalues can be misleading (Otto and Rowley, 2019) and often plagued with spurious eigenfunctions that do not evolve linearly even when the number of data snapshots is sufficient. Analytically, it is clear that a Koopman eigenfunction in the span of the dictionary will be associated with one of the eigenvectors obtained from EDMD, given is full rank, and contains sufficient snapshots (Haseli and Cortés, 2019). Indeed, the EDMD is a projection of the Koopman operator under the empirical measure (Korda and Mezić, 2018). As a result, we desired to search for the Koopman invariant subspace in the standard EDMD/KDMD. Since KDMD is just an efficient way of populating a dictionary of nonlinear features in high dimensional spaces, similar arguments apply to KDMD. It should be noted that the numerical conditioning issue can play a critical role since full rank matrices can be poorlyconditioned.
Choice of Dictionary (for EDMD) or Kernel (for KDMD)
Although the use of a kernel defines an infinitedimensional feature space, the resulting finite number of effective features can still be affected by both the type of the kernel and the hyperparameters in the kernel as clearly shown by Kutz et al. (2016). Compared to EDMD/KDMD, which are based on a fixed dictionary of features, neural network approaches (Otto and Rowley, 2019; Pan and Duraisamy, 2020; Lusch et al., 2018) have the potential to be more expressive in searching for a larger Koopman invariant subspace. From a kernel viewpoint (Cho and Saul, 2009), feedforward neural networks enable adaptation of the kernel function to the data. Such a characteristic could become significant when the underlying Koopman eigenfunction is discontinuous. From an efficiency standpoint, a kernelguided scalable EDMD (DeGennaro and Urban, 2019) may be pursued. This can be achieved by generating kernelconsistent random Fourier features or approximating a few components of the feature vector constructed from Mercer’s theorem, i.e., the eigenfunctions of the Hilbert–Schmidt integral operator on the RKHS.
3 Sparse Identification of Informative Koopman Invariant Subspace
In this section, we present an algorithm that uses EDMD/KDMD modes to identify a sparse, accurate, and informative Koopman invariant subspace. The algorithm first prunes spurious, inaccurate eigenmodes to determine a sparse representation of the system state . In addition to the training data, as required in standard EDMD/KDMD, a validation trajectory dataset is required to avoid overfitting on training data in the postprocessing algorithm. The terms spEDMD/spKDMD will refer to filtered mode selections of EDMD and KDMD, respectively.
3.1 Pruning spurious modes by a posteriori error analysis
Given a validation trajectory where associated with the nonlinear dynamical system, for , we define the goodness of th eigenfunctions in a posteriori way as the maximal normalized deviation from linear evolution conditioned on trajectory as in the form
(25)  
(26) 
where . In practice, we evaluate the above integral terms discretely in time. A similar a priori and less restrictive method has been previously proposed Zhang et al. (2017). In contrast, in the proposed method, the maximal error is evaluated in an a posteriori way to better differentiate spurious modes from accurate ones. For any , we can always select top accurate eigenmodes out of eigenmodes as with for the next sparse reconstruction step. To choose an appropriate to linearly reconstruct the system state , we monitor the normalized reconstruction error for the aforementioned set of top accurate eigenmodes in the following form
(27) 
where is the identity matrix, and
(28) 
As a result, the evaluation of eq. 27 for each is of similar expense to leastsquare regression. For an increasing number of selected eigenfunctions , the reconstruction error decreases, while the largest linear evolution error increases. Then, a truncation can be defined by the user to strike a balance between linear evolution error and reconstruction error . By selecting a smaller set of eigenmodes than , one can develop a Koopman invariant subspace of a desired dimension.
3.2 Sparse reconstruction via multitask feature learning
Numerical experiments revealed that, in the selected set of most accurate eigenfunctions, two types of redundant eigenfunctions were found:

Nearly constant eigenfunctions with eigenvalues close to zero,

Pointwise products of Koopman eigenfunctions introduced by nonlinear observables, not useful in linear reconstruction.
To filter the above modes, it is natural to consider sparse regression with most accurate eigenfunctions as features and the system state as target. Note that, since we have guaranteed the accuracy of selected eigenmodes, one can either choose features a priori: or a posteriori . Here we choose the latter since it is directly related to prediction, and can actually be reused from the previous step without additional computational cost. We denote the corresponding feature matrix as ,
(29) 
where . Note that similar features were also considered in sparsitypromoting DMD (Jovanović et al., 2014) and optimized DMD (Chen et al., 2012). Finally, the fact that there is no control over the magnitudes of the implicitly defined features in the standard KDMD may cause unequal weighting between different features. Thus, we consider scaling the initial value of all eigenfunctions to be unity in eq. 30,
(30) 
where
(31) 
Since is finitedimensional, searching for a sparse combination of to reconstruct is equivalent to the solution of a multitask feature learning problem with preference over a relatively small size of features. Note that this type of problem has been studied extensively in the machine learning community (Argyriou et al., 2008a; Zhao et al., 2015; Argyriou et al., 2008b). In this work, given and , we employ the multitask ElasticNet (Pedregosa et al., 2011) to search for a rowwise sparse , which solves the following convex optimization problem:
(32) 
and
(33) 
where defined in eq. 34 is the socalled norm for a matrix ,
(34) 
and is th row of . This norm is special in that it controls the number of shared features learned across all tasks, i.e., th Koopman mode is either driven to a zero vector or not while the standard only controls the number of features for each task independently. As a simple illustration, the norm for three different square matrices (here ) with 01 binary entries is displayed in fig. 1.
The above procedure not only serves the purpose of selecting modes that explains the behavior of all components in the state, but is also particularly natural for EDMD/KDMD since Koopman modes are obtained via regression.
is a penalty coefficient that controls the amount of total regularization in the and norms, while is the ElasticNet mixing parameter (Zou and Hastie, 2005) that ensures uniqueness of the solution when highly correlated features exist. In our case, we choose and sweep over a certain range with nonzero features denoted as for each , while monitoring the normalized residual to choose a proper . It has to be mentioned that, sometimes the sparsest solution from a multitask ElasticNet was found to shrink to a small number instead of zero. This is a consequence of the insufficiency of the current optimization algorithm which employs coordinate descent Pedregosa et al. (2011). Hence for each target component, we consider an additional hardthresholding step by setting the corresponding magnitude of the coefficient, i.e., contribution of any mode, to zero if it is smaller than a certain threshold .
Finally, we refit the Koopman modes as which avoids the bias introduced by the penalty term
Relationship between sparsitypromoting DMD, Kou’s criterion and multitask feature learning
For simplicity, neglecting the ElasticNet part (i.e., using ), eq. 32 with modes leads to a multitask Lasso problem,
(35) 
Recall that in spDMD (Jovanović et al., 2014), DMD modes with remain the same as standard DMD. Similariy, if we posit a structural constraint on in eq. 35 by enforcing the modes as those from DMD, then there exist such that,
(36) 
Note the fact that . Hence, we recover the optimization step in the spDMD (Jovanović et al., 2014) from eq. 35 as,
(37) 
where are the DMD modes corresponding to eigenvalues as . Hence, multitask feature learning solves a less constrained optimization than spDMD in the context of DMD.
Kou and Zhang (2017) proposed an empirical criterion for mode selection by ordering modes with “energy” defined as
(38) 
From an optimization viewpoint, consider a posteriori prediction matrix from DMD
(39) 
where is determined from DMD using the snapshot pair . is a rank1 summation of contributions from different modes (Schmid, 2010). Hence, a general mode selection technique with a userdefined preference weighting is the following weighted nonconvex optimization problem:
(40) 
where , is one if and zero otherwise. Note that this norm can be viewed as a limiting case of a weighted composite sine function smoothed regularization (Wang et al., 2019).
To solve this nonconvex optimization problem, compared to the popular relaxation method such as the one in sparsitypromoting DMD, a lessknown but rather efficient way is iterative leastsquares hard thresholding. This has been used in sparse identification of dynamical systems (SINDy) (Brunton et al., 2016b), and convergence to a local minimum has been proved (Zhang and Schaeffer, 2019). Indeed, a more rigorous framework that generalizes such an algorithm is the proximal gradient method (Parikh and Boyd, 2014). Much like Newton’s method is a standard tool for unconstrained smooth optimization, the proximal gradient method is the standard tool for constrained nonsmooth optimization. Here, it is straightforward to derive the iterative algorithm that extends to the weighted norm from step to step as
(41) 
where
(42) 
and is the stepsize at step . Notice that the weighted norm is a separable sum of . After some algebra, we have the proximal operator as
(43) 
where is an elementwise hard thresholding operator defined as if and zero otherwise.
Particularly, if one considers the initial stepsize to be extremely small , then the second term in eq. 41 can be neglected. Thus, for , with the following weighting scheme that penalizes fast decaying modes:
(44) 
one immediately realizes the thresholding criterion for th entry of becomes
(45) 
Then, the first iteration in eq. 41 reduces to mode selection with Kou’s criterion in eq. 38. Normally, is very large for unstable modes and small for decaying modes. It is important to note that a) such a choice of preferring unstable/longlasting modes over decaying modes is still userdefined; 2) Optimization is in an a priori sense to obtain DMD. Thus, the insufficiency of the a priori formulation to account for temporal evolution is indeed compensated by this criterion, while DMD in an a posteriori formulation (e.g., sparsitypromoting DMD) includes such a effect implicitly in the optimization. Hence, it is possible that in some circumstances spDMD and Kou’s criterion could achieve similar performance (Kou and Zhang, 2017).
Lastly, it is important to mention the similarities and differences between spKDMD and spDMD: 1) spKDMD will refit Koopman modes while spDMD does not; and 2) The amplitude for all the modes in spKDMD is fixed as unity while it has to be determined in spDMD.
3.3 Hyperparameter selection
For simplicity, hyperparameter selection for KDMD is only discussed in this section. To fully determine a kernel in KDMD, one would have to choose the following:

kernel type,

kernel parameters, e.g., scale parameters ,

rank truncation .
In this work, for simplicity, we fix the kernel type to be an isotropic Gaussian. Motivated by previous work on error evaluation in Koopman modes by Zhang et al. (2017), we consider evaluation with cross validation on a priori mean normalized accuracy given in eqs. 47 and 46 on validation data for different number of rank truncation and kernel parameters. Note that evaluation on maximal instead of mean normalized accuracy would lead to the error metric to be strongly dependent on the local sparsity of training data in the feature space. This is particularly true for a single trajectory for a highdimensional dynamical system is used, and random shuffled cross validation is performed (Pan and Duraisamy, 2018).
(46) 
(47) 
For each set of hyperparameters, we first compute the number of eigenfunctions of which the error defined in eqs. 47 and 46 is below a certain threshold on both training and validation data for each fold of cross validation. Then we compute the average number of such eigenfunctions over all folds which indicates the quality of the corresponding subspace. Finally, we plot the average number versus rank truncation and kernel scale parameters to select hyperparameters.
3.4 Implementation
We implement the described framework in Python with moderate parallelism in each module. We use scipy.special.hermitenorm (Jones et al., 2001) to generate normalized Hermite polynomials and MultiTaskElasticNet in the scikitlearn (Pedregosa et al., 2011) for multitask feature learning where we set the maximal iteration as and tolerance as . MPI parallelism using mpi4py (Dalcin et al., 2011) is used for the grid search in hyperparameter selection. To prepare data with hundreds of gigabytes collected from high fidelity simulations, a distributed SVD written in C++ named Parallel Data Processing (PDP) Tool is developed for dimension reduction. A brief description of this tool is given in appendix C.
4 Results and Discussion
4.1 2D fixed point attractor with the known finite Koopman invariant subspace
We first consider a simple fixed point nonlinear dynamical system which has an analytically determined, finitedimensional nontrivial Koopman invariant subspace (Brunton et al., 2016a; Kaiser et al., 2017) to show the effectiveness of proposed method. We consider a continuoustime formulation. The governing equation for the dynamical system is given as follows,
(48)  
(49) 
where . One natural choice of the minimal Koopman eigenfunctions that linearly reconstructs the state is (Brunton et al., 2016a)
(50) 
with eigenvalues , , respectively.
The way we generate training, validation, and testing data is described below with distribution of the data shown in fig. 3,

training data: a point cloud with pairs of , is generated by Latin hypercube sampling (Baudin, 2015) within the domain .

validation data: a single trajectory with initial condition as , sampling time interval from to .

testing data: a single trajectory with initial condition as , sampling time interval from to .
As an illustration, we consider two models to approximate the Koopman operator from training data:

a continuoustime EDMD with normalized Hermite polynomials up to fifth order with features,

a continuoustime KDMD with isotropic Gaussian kernel with reduced rank .
Details of the above choices based on the steps of hyperparameter selection in section 3.3 are given in section B.1.
Results for continuoustime EDMD with mode selection
As displayed in fig. 4, we begin with an error analysis of all of the eigenmodes on validation data in fig. 3 according to linearly evolving error defined in eq. 26 and defined in eq. 27. From fig. 4, considering both the linearly evolving error and the quality of the reconstruction, we choose the cutoff threshold at . We observe a sharp cutoff in the left subfigure in fig. 4 around the number of selected eigenmodes . This is a reasonable choice, since from the eigenvalues in the right subfigure, we notice the analytic Koopman eigenmodes are not covered until first 8 accurate eigenmodes are selected. Indeed, the first four eigenfunctions (index=) are redundant in terms of reconstruction in this problem
According to eq. 32, to further remove redundant modes, we perform multitask feature learning on the eigenmodes . The corresponding ElasticNet path is shown in fig. 5. Note that each corresponds to a minimizer of eq. 32. To choose a proper so as to find a proper Koopman invariant subspace, it is advisable to check the trend of the normalized reconstruction error and number of nonzero features. Given the dictionary, for simple problems for which there exists an exact Koopman invariant subspace that also spans system state, a proper model can be obtained by selecting which ends up with only 3 eigenfunctions as shown in fig. 5. Moreover, as is common for EDMD with polynomial basis (Williams et al., 2014, 2015), a pyramid of eigenvalues appears in fig. 5.
As shown in fig. 6, both the identified eigenvalues, and contour of the phase angle and magnitude of selected eigenfunctions from spEDMD match the analytic eigenfunctions given in eq. 50 very well. As expected, the prediction on unseen testing data is also excellent. Note that the indices of true eigenfunctions , and ordered by Kou’s criterion in eq. 38 are 8, 5 and 6. In this case, all of the true eigenfunctions are missing in the top 3 modes chosen by Kou’s criterion. Indeed, the top 3 modes chosen by Kou’s criterion have nearly zero eigenvalues.
Results of continuoustime KDMD with mode selection
The mode selection algorithm presented above can be applied in precisely the same form to KDMD, given a set of eigenfunctions and eigenvalues.
Error analysis of eigenfunctions is shown in fig. 7, from which we choose as well. Again, on the left subfigure in fig. 7, we observe a sharp decrease in the reconstruction error after the 4 most accurate modes are included. This is expected, as the second to fourth most accurate modes are analytically exact from the right subfigure. As shown in figs. 9 and 8, it is confirmed that that both spEDMD and spKDMD arrive at the same analytic eigenfunctions with difference up to a constant factor. It should be noted that, although polynomials are not analytically in the RKHS (Minh, 2010), good approximations can still be achieved conditioned on the data we have, i.e., . Again, the indices of true eigenfunctions to ordered by Kou’s criterion are 8, 2 and 3. Hence, is missing in the top 3 modes chosen by Kou’s criterion. Similarly, the first mode chosen by Kou’s criterion has a zero eigenvalue.
Effect of SVD regularization
SVD truncation is a standard regularization technique in the solution of a potentially illconditioned linear system. In the standard EDMD in eq. 7  for example  could be potentially illconditioned, leading to spurious eigenvalues in . Hence, Williams et al. (2014) recommend SVD truncation in eq. 8 to obtain a robust solution of . Effectively, it shrinks the number of EDMD/KDMD modes. It has to be recognized, however, that the mode reduction from SVD truncation is not the same as mode selection. Most importantly, one should not confuse numerical spuriousness from poor numerical conditioning with functional spuriousness from the orthogonal projection error of the Koopman operator (Korda and Mezić, 2018). Indeed, SVD truncation does not always lead to better approximation of a Koopman invariant subspace. It is rather a linear dimension reduction that optimally preserves the variance in the feature space conditioned on the training data without knowing the linear evolution property of each feature.
For demonstration, we take the above fixed point attractor system where we use the same data and standard EDMD algorithm with the same order of Hermite polynomials. The results of prediction on the unseen testing data shown in fig. 10 indicate that even though only 3 eigenfunctions (indeed 3 feature in the Hermite polynomial) are required, standard EDMD fails to identify the correct eigenfunctions with 26 SVD modes while the results improve with 31 modes retained. The sensitivity of standard EDMD with respect to SVD truncation is likely a result of the use of normalized Hermite polynomials where SVD truncation would lead to a strong preference over the subspace spanned by the higher order polynomials. We did not observe such a sensitivity for KDMD, unless the subspace is truncated below 10 modes.
4.2 Twodimensional, transient flow past a cylinder
Transient twodimensional flow past cylinder (fig. 11) is considered at different Reynolds numbers (), where is the freestream velocity, is the diameter of the cylinder, and is the kinematic viscosity. The twodimensional incompressible Navier–Stokes equations govern the dynamics with farfield boundary conditions for pressure and velocity and noslip velocity on the cylinder surface. Numerical simulations are performed using the icoFoam solver in OpenFOAM (Jasak et al., 2007) solving the 2D incompressible NavierStokes equations. We explore by changing the viscosity. The pressure field is initialized with i.i.d Gaussian noise . The velocity is initialized with a uniform freestream velocity superimposed with i.i.d Gaussian noise . It should be noted that the noise is generated on the coarsest mesh shown in fig. 11, and interpolated to the finer meshes. Grid convergence with increasing mesh resolution is assessed by comparing the temporal evolution of the drag coefficient and lift coefficient .
Note that the dynamics of a cylinder wake involves three regimes: nearequilibrium linear dynamics, postHopf birfucation transient dynamics, and periodic limit cycle dynamics (Chen et al., 2012). Instead of considering data only from each of these regimes separately, as reported in studies of DMD (Chen et al., 2012; Taira et al., 2019; Bagheri, 2013), we start collecting data immediately after the flowfield becomes unstable, and stop after the flowfield experiences several limit cycles. The sampling time interval where . For each , 891 snapshots of full field simulation of velocity field and with sampling time interval are collected as two matrices of size . Following this, each velocity component is shifted and scaled (normalized) between . Since spacefilling sampling in any highdimensional space would be extremely difficult, we split the trajectory into training, validation, and testing data by sampling with strides similar to the “evenodd” sampling scheme previously proposed by Otto and Rowley (2019). As illustrated in fig. 12, given index , if , the th point belongs to training set while corresponds to validation, and for testing data. Consequently, the time interval in the training, testing, validation trajectory is tripled as .
Thus, training, validation, and testing data are split with each contains 297 snapshots. Finally, we stack data matrices along the first axis corresponding to the number of grid points, and perform a distributed SVD described in section 3.4. For all three cases, the top 20 POD modes are retained, corresponding to 99% of kinetic energy. Next, we apply our algorithm to discretetime KDMD on this reducedorder nonlinear system. We choose the hyperparameters and , and further details are given in section B.2.
Results of discretetime KDMD with mode selection
For all three Reynolds numbers, a posteriori error analysis is shown in fig. 13. A good choice of the number of accurate modes retained for reconstruction is around 60 since the corresponding maximal deviation from linear evolution is still around 5% while the reconstruction error reaches a plateau after .
After the mode selection on validation data, a family of solutions is obtained with corresponding reconstruction error and the number of nonzeros terms as shown in fig. 14. Note that the chosen solution is highlighted as blue circles. As shown in table 2, nearly half of the accurate KDMD eigenmodes identified are removed with the proposed sparse feature selection. Note that for all three cases, the number of selected modes (around 32 to 34) is still larger than that required in neural network models (around 10) (Pan and Duraisamy, 2020; Otto and Rowley, 2019). This is because the subspace spanned by KDMD/EDMD relies on a predetermined dictionary rather than being dataadaptive like neural network models. Nevertheless, due to the additional expressiveness from nonlinearity, we will see in section 4.2.3 that spKDMD performs significantly better than DMD and spDMD, while enjoying the property of convex optimization at a much lower computational cost than the inherently nonconvex and computationally intensive neural network counterparts.
number of selected modes  34  32  33 

number of total modes  297  297  297 
normalized reconstruction error  0.075  0.105  0.113 