Kernel methods for detecting coherent structures in dynamical data
Abstract
We illustrate relationships between classical kernelbased dimensionality reduction techniques and eigendecompositions of empirical estimates of reproducing kernel Hilbert space (RKHS) operators associated with dynamical systems. In particular, we show that kernel canonical correlation analysis (CCA) can be interpreted in terms of kernel transfer operators and that it can be obtained by optimizing the variational approach for Markov processes (VAMP) score. As a result, we show that coherent sets of particle trajectories can be computed by kernel CCA. We demonstrate the efficiency of this approach with several examples, namely the wellknown Bickley jet, ocean drifter data, and a molecular dynamics problem with a timedependent potential. Finally, we propose a straightforward generalization of dynamic mode decomposition (DMD) called coherent mode decomposition (CMD). Our results provide a generic machine learning approach to the computation of coherent sets with an objective score that can be used for crossvalidation and the comparison of different methods.
While coherent sets of particles are common in dynamical systems, they are notoriously challenging to identify. In this article, we leverage the combination of a suite of methods designed to approximate the eigenfunctions of transfer operators with kernel embeddings in order to design an algorithm for detecting coherent structures in Langrangian data. It turns out that the resulting method is a wellknown technique to analyze relationships between multidimensional variables, namely kernel canonical correlation analysis. Our algorithm successfully identifies coherent structures in several diverse examples, including oceanic currents and a molecular dynamics problem with a moving potential. Furthermore, we show that a natural extension of our algorithm leads to a coherent mode decomposition, a counterpart to dynamic mode decomposition.
I Introduction
Representing and learning effective, lowdimensional manifolds for complex, highdimensional data is one of the cornerstones of machine learning and complex systems theory. In particular, for dynamical processes such as turbulent flows or molecular dynamics, it is known that much of their essential, longtime dynamics can be captured by linear models acting on lowdimensional manifolds, resulting in simpler, interpretable models, and potentially large savings in process simulation, prediction, and control. Due to their simplicity, linear methods to find lowdimensional subspaces are widely used, including principal component analysis (PCA) Hotelling33:PCA , canonical correlation analysis (CCA) Hotelling_Biometrika36_CCA , independent component analysis (ICA) Hyvarinen00:ICA , timelagged independent component analysis (TICA) MS94 ; PPGDN13 , timelagged canonical correlation analysis (TCCA) WuNo17 , and dynamic mode decomposition (DMD) SS08 .
Since the sought manifold is usually nonlinear in the direct state representation of the system, it is important to generalize the aforementioned methods to work in nonlinear feature spaces. Two particularly important classes of learning methods that go beyond applying linear algorithms to userdefined feature functions are neural network approachesMPWN:vampnets ; LiEtAl_Chaos17_EDMD_DL ; OttoRowley_LinearlyRecurrentAutoencoder and kernel methods such as kernel PCA Scholkopf98:KPCA , kernel CCA MRB01:CCA , kernel ICA Bach03:KICA , kernel TICA HZHM03:kTICA , and kernel EDMD WRK15 . The basic idea of kernel methods is to represent data by elements in reproducing kernel Hilbert spaces associated with positive definite kernel functions.
The novel contribution of this work is to derive kernel methods for the identification of coherent structures in highdimensional dynamical data. We do this by establishing deep mathematical connections between kernel methods that have been proposed in machine learning and dynamical systems theory. Our main results are:

[wide, itemindent=itemsep=0ex, topsep=0.5ex]

We show that kernel CCA, when applied to dynamical data, admits a natural interpretation in terms of kernel transfer operators and that the resulting eigenvalue problems are directly linked to methods for the computation of coherent sets. Importantly, kernel CCA predates recent methods for coherent set identification.

We show kernel CCA is optimal in the variational approach for Markov processes (VAMP) WuNo17 . Therefore, kernel CCA optimally approximates the transfer operator singular values and functions within kernel methods, and VAMP is a suitable optimization method for identifying coherent sets.

We propose a new method called coherent mode decomposition, which can be seen as a combination of CCA and DMD.
Establishing similar connections between machine learning and dynamical systems theory have previously led to advances in different applications: By means of the variational approach of conformation dynamics (VAC) NoNu13 , optimal estimators for the leading eigenfunctions of reversible timehomogeneous transfer operators have been made, which are important to identify metastable sets and rare events SchuetteFischerHuisingaDeuflhard_JCompPhys151_146 ; Bovier06:metastability . This insight has led to the introduction of the TICA method as a way to identify slow collective variables to molecular dynamics PPGDN13 ; SP13 —a key step in the modeling of rare events in molecules. Kernel embeddings of conditional probability distributionsSHSF09 ; MFSS16 have been related to Perron–Frobenius and Koopman transfer operators Ko31 ; LaMa94 and their eigenvalue decomposition in Ref. KSM17, . In a similar way, optimization of the VAMP score can be used to derive CCA as an optimal linear algorithm to approximate the singular functions of transfer operators WuNo17 . Doing the same with neural networks as function approximators leads to VAMPnets MPWN:vampnets . Here we show that a similar connection can be made with kernel methods and transfer operator singular functions and demonstrate that the kernel CCA algorithm approximates these functions. By exploiting that the singular functions are simultaneously the eigenfunctions of the forwardbackward dynamics, we can extend this framework to the identification of socalled coherent sets—a generalization of metastable sets to nonautonomous and aperiodic systems FJ18:coherent . Coherent sets are regions of the state space that are not dispersed over a specific time interval. That is, if we let the system evolve, elements of a coherent set will, with a high probability, stay close together, whereas other regions of the state space might be distorted entirely. A large number of publications investigate the numerical approximation of coherent sets with other methods, e.g., Refs. FrSaMo10:coherent, ; FrJu15, ; WRR15, ; HKTH16:coherent, ; BK17:coherent, ; HSD18, ; FJ18:coherent, , see Ref. AP15:review, for an overview of approaches for Lagrangian data. We will not address the problem of possibly sparse or incomplete data. Our goal is to illustrate relationships with established kernelbased approaches and to show that existing methods—developed independently and with different applications in mind, predating many algorithms for the computation of finitetime coherent sets—can be directly applied to detect coherent sets in Lagrangian data.
A highlevel overview of datadriven approaches for the approximation of transfer operators that are relevant for our considerations and relationships with the methods proposed below are shown in Figure 1. For the derivations of some of these methods the system is assumed to be reversible and there are other subtle differences, which will not be discussed here. A comparison of methods for timehomogeneous systems can be found in Ref. KNKWKSN18, and extensions to timeinhomogeneous systems in Ref. KWNS18:noneq, . Moreover, it was shown that the role played by the eigenfunctions in the timehomogeneous setting (for the detection of metastable sets) is assumed by the left and right singular functions in the timeinhomogeneous setting (for the detection of coherent sets)KWNS18:noneq . The right singular functions encode information about the system at initial time and the left singular functions correspond to the system’s state at final time . The connections between kernel CCA and the singular value decomposition of transfer operators will be described in more detail below.
The remainder of this paper is structured as follows: In Section II, we will briefly introduce transfer operators and review the notion of positive definite kernels and induced Hilbert spaces as well as nonlinear generalizations of covariance and crosscovariance matrices. We will then define empirical RKHS operators and show that diverse algorithms can be formulated as eigenvalue problems involving such operators. The relationships between kernel CCA and coherent sets will be studied in Section III. Furthermore, coherent mode decomposition will be derived. Section IV contains numerical results illustrating how to use the presented kernelbased methods for the analysis of dynamical systems. We conclude with a summary of the main results and open problems in Section V.
Ii Prerequisites
We briefly introduce transfer operators, reproducing kernel Hilbert spaces, and operators mapping from one such space to another one (or itself). For more details on the properties of these spaces and the introduced operators, we refer the reader to Refs. Schoe01, ; Steinwart2008:SVM, ; SC04:KernelMethods, and Refs. Baker70:XCov, ; Baker1973, ; KSM17, , respectively.
ii.1 Transfer operators
Let be a stochastic process defined on the state space and let be a fixed lag time. We assume that there exists a transition density function such that is the probability of given . For , let denote the standard space of Lebesgue integrable functions on . Then, for a probability density on , let be the spaces of integrable functions with respect to the corresponding probability measure induced by the density ; that is, .
Given a probability density and an observable , we define the Perron–Frobenius operator and the Koopman operator by
Assuming the process admits a unique equilibrium density , i.e., , we can define for the Perron–Frobenius operator with respect to the equilibrium density as
Under certain conditions, these transfer operators can be defined on and for other choices of . From now on, we will always assume that they are welldefined for (see Refs. LaMa94, ; BaRo95, ; KKS16, for details). This is common whenever Hilbert space properties are needed in the context of transfer operators.
Remark II.1.
For timehomogeneous systems, the associated transfer operators depend only on the lag time . If the system is timeinhomogeneous, on the other hand, the lag time is not sufficient to parametrize the evolution of the system since it also depends on the starting time. This is described in detail in Ref. KWNS18:noneq, . The transition density and the operators thus require two parameters; however, we will omit the starting time dependence for the sake of clarity.
ii.2 Reproducing kernel Hilbert spaces
Given a set and a space of functions , is called a reproducing kernel Hilbert space (RKHS) with inner product if there exists a function with the following properties:

[label=(), itemsep=0ex, topsep=1ex]

for all , and

.
The function is called a kernel and the first property above the reproducing property. A direct consequence is that . That is, the map given by can be regarded as a feature map associated with , the socalled canonical feature map.^{*}^{*}*Such a feature map admitting the property is not uniquely defined. There are other feature space representations such as, for instance, the Mercer feature space.Mercer ; Schoe01 ; Steinwart2008:SVM As long as we are only interested in kernel evaluations, however, it does not matter which one is considered. It is thus possible to represent data by functions in the RKHS. Frequently used kernels include the polynomial kernel and the Gaussian kernel, given by and , respectively. While the feature space associated with the polynomial kernel is finitedimensional, the feature space associated with the Gaussian kernel is infinitedimensional; see, e.g., Ref. Steinwart2008:SVM, . Inner products in these spaces, however, are not evaluated explicitly, but only implicitly through kernel evaluations. This is one of the main advantages of kernelbased methods Scholkopf98:KPCA ; SC04:KernelMethods . Algorithms that can be purely expressed in terms of inner product evaluations can thus be easily kernelized, resulting, as described above, in nonlinear extensions of methods such as PCA, CCA, or TICA.
ii.3 Covariance operators and Gram matrices
Let be a random variable on , where and . The dimensions and can in principle be different. For our applications, however, the spaces and are often identical. The associated marginal distributions are denoted by and , the joint distribution by , and the corresponding densities—which we assume exist—by , , and , respectively. Furthermore, let and be the kernels associated with and and and the respective feature maps. We will always assume that requirements such as measurability of the kernels and feature maps as well as separability of the RKHSs are satisfied.^{†}^{†}†In most cases, these properties follow from mild assumptions about and . For an indepth discussion of these technical details, see Ref. Steinwart2008:SVM, . The RKHSs induced by the kernels and are denoted by and .
We will now introduce covariance operators and crosscovariance operators Baker70:XCov ; Baker1973 on RKHSs. In what follows, we will always assume that and , which ensures that these operators are welldefined and Hilbert–Schmidt (for a comprehensive overview of kernel covariance operators and their applications, see Ref. MFSS16, and references therein). For any , let
denote the tensor product operator Reed from to defined by and .
Definition II.2 (Covariance operators).
The covariance operator and the crosscovariance operator are defined as
Kernel covariance operators satisfy
for all , . Defining and , the corresponding centered counterparts of the covariance and crosscovariance operators and are defined in terms of the meansubtracted feature maps.
As these operators can in general not be determined analytically, empirical estimates are computed from data, i.e.,
(1) 
where and and the training data is drawn i.i.d. from . Analogously, the meansubtracted feature maps can be used to obtain empirical estimates of the centered operators. Since in practice we often cannot explicitly deal with these operators, in particular if the feature space is infinitedimensional, we seek to reformulate algorithms in terms of Gram matrices.
Definition II.3 (Gram matrices).
Given training data as defined above, the Gram matrices are defined as
For a Gram matrix , its centered version is defined by , where and is a vector composed of ones Bach03:KICA . Note that centered Gram matrices are not regular.
Remark II.4.
In what follows, if not noted otherwise, we assume that the covariance operators and and the Gram matrices and are properly centered for CCA.
ii.4 Kernel transfer operators
We now show how transfer operators can be written in terms of covariance and crosscovariance operators—this leads to the concept of kernel transfer operators. We assume the Perron–Frobenius operator and the Koopman operator to be welldefined on as discussed in Section II.1. Kernel transfer operators follow from the assumption that densities and observables in can be represented as elements of the RKHS . Under some technical requirements, such as , the elements of are included in when they are identified with the respective equivalence class of square integrable functions. This correspondence can be derived from the theory of integral operatorsSteinwart2008:SVM and is often used in statistical learning theory RBD10 . We may therefore assume that we can identify RKHS elements with the corresponding equivalence classes of functions in . By requiring for a probability density , we obtain a similar statement for .
We refer to Ref. KSM17, for the derivation of kernel transfer operators and a description of their relationships with kernel embeddings of conditional distributions. We will omit the technical details and directly define kernel transfer operators as the RKHS analogue of the standard transfer operators defined in Section II.1. Using the same integral representations as before and defining the transfer operators on instead of , we obtain the kernel Perron–Frobenius operator and the kernel Koopman operator , respectively.
By defining the timelagged process , we can write kernel transfer operators in terms of covariance and crosscovariance operators KSM17 . Note that and are defined on the same state space ; therefore, we have and hence in this special case. We obtain the important properties and for all , which allows us to write
(2) 
Here, is the Tikhonovregularized inverse of with regularization parameter .^{‡}^{‡}‡See Refs. Gr93, ; EG96, ; EHN96, for a detailed discussion of illposed inverse problems and the regularization of bounded linear operators on Hilbert spaces. Note the abuse of notation, since equality in the above inverse problems is only given asymptotically for and pointwise for feasible . Since is a compact operator, it does not admit a globally defined bounded inverse if the RKHS is infinitedimensional. However, always exists and is bounded. In fact, the operators and as given in the regularized form above are Hilbert–Schmidt.
The above notation and regularization of inverse covariance operators is standard in the context of kernel embeddings of conditional distributions and related Bayesian learning techniques. We refer to Refs. SHSF09, ; Song2013, ; Fukumizu13:KBR, ; Fukumizu15:NBI, ; MFSS16, for detailed discussions of properties of this illposed inverse problem in specific applications.
By replacing the analytical covariance operators with their empirical estimates in (2), we obtain empirical estimates for kernel transfer operators KSM17 . As done with empirical covariance operators in (1), it is possible to rewrite the empirical estimates of kernel transfer operators in terms of RKHS features in and (see Refs. MFSS16, ; KSM17, for the derivation):
(3) 
In this case, and both contain observations in the same space , since and are both defined on .
ii.5 Empirical RKHS operators
In what follows, we will consider finiterank RKHS operators given by a matrix which represents the action of the operator on fixed elements in the RKHSs. We will use this general setting to formulate results about the eigenvalues and eigenfunctions of empirical RKHS operators. Given a matrix , we define the bounded finiterank operator by
(4) 
We remark that although and may contain infinitedimensional objects, we express inner products between RKHS elements in the classical matrixvector multiplication form. That is, we interpret the embedded RKHS elements as (potentially infinitedimensional) column vectors. This notation has become a defacto standard in the machine learning community MFSS16 . We can write empirical estimates of covariance operators in the form of (4). If the RKHS training features in and are generated i.i.d. by the joint probability distribution of random variables and , then the crosscovariance operator takes the general form of an empirical RKHS operator with . We obtain as another special case with identical features drawn only from . Furthermore, the empirical estimates of the kernel Perron–Frobenius and kernel Koopman operator are special cases of as seen in (3) with and , respectively. Note that the roles of and are interchanged for the empirical estimate of the Koopman operator, i.e., it is of the form .
We now show how spectral decomposition techniques can be applied to empirical RKHS operators in this general setting.^{§}^{§}§In general, all considered kernel transfer operators in this paper are compositions of compact and bounded operators and therefore compact. They admit series representations in terms of singular value decompositions as well as eigendecompositions in the selfadjoint caseReed . The functional analytic details and the convergence of and its spectral properties in the infinitedata limit depend on the specific scenario and are beyond the scope of this paper. We can compute eigenvalues and corresponding eigenfunctions of by solving auxiliary matrix eigenvalue problems. For the sake of selfcontainedness, we briefly reproduce the eigendecomposition result from Ref. KSM17, .
Proposition II.5.
Suppose and contain linearly independent elements. Let , then

[wide, label=(), itemindent=itemsep=0ex, topsep=0.5ex]

has an eigenvalue with corresponding eigenfunction if and only if is an eigenvector of associated with , and, similarly,

has an eigenvalue with corresponding eigenfunction if and only if is an eigenvector of .
For the Gaussian kernel, linear independence of elements in and reduces to requiring that the training data contains pairwise distinct elements in and , respectively. For dynamical systems applications, we typically assume that and contain information about the system at time and at time , respectively. A more detailed version of Proposition II.5 and its extension to the singular value decomposition are described in Ref. MSKS18, . Further properties of and its decompositions will be studied in future work. Note that we generally assume that empirical estimates of RKHS operators converge in probability to their analytical counterparts in operator norm in the infinite data limit. These statistical properties and the resulting associated spectral convergence are examined in for example in Ref. RBD10, .
ii.6 Applications of RKHS operators
Decompositions of RKHS operators have diverse applications, which we will only touch upon here. We will consider a specific problem—namely, kernel CCA—in Section III.

[wide, label=(), itemindent=itemsep=0ex, topsep=0.5ex]

By sampling points from the uniform distribution, the Mercer feature mapMercer ; Schoe01 ; Steinwart2008:SVM with respect to the Lebesgue measure on can be approximated by computing eigenfunctions of —i.e., and the auxiliary matrix eigenvalue problem is —as shown in Ref. MSKS18, . This can be easily extended to other measures.

Similarly, given an arbitrary data set , kernel PCA computes the eigenvectors corresponding to the largest eigenvalues of the centered Gram matrix and defines these eigenvectors as the data points projected onto the respective principal components. It is wellknown that kernel PCA can also be defined in terms of the centered covariance operator . A detailed connection of the spectrum of the Gram matrix and the covariance operator is given in Ref. STWCK02, . Up to scaling, the eigenfunctions evaluated in the data points correspond to the principal components.

Given training data and , where denotes the flow associated with the dynamical system and the lag time—that is, if is the state of the system at time , then is the state of the system at time —, we define and as above. Eigenvalues and eigenfunctions of kernel transfer operators can be computed by solving a standard matrix eigenvalue problem (see Proposition II.5). Eigendecompositions of these operators result in metastable sets. For more details and realworld examples, see Refs. KSM17, ; KBSS18, . The main goal of this paper is the extension of the aforementioned methods to compute coherent sets instead of metastable sets.
Iii Kernel CCA and coherent sets
This section contains the main results of our paper. We derive kernel CCA MRB01:CCA for finite and infinitedimensional feature spaces from the viewpoint of dynamical systems, and show that kernel CCA can be used to approximate coherent sets in dynamical data. Furthermore, we derive the new coherent mode decomposition method.
Given two multidimensional random variables and , standard CCA finds two sets of basis vectors such that the correlations between the projections of and onto these basis vectors are maximized Borga01:CCA . The new bases can be found by computing the dominant eigenvalues and corresponding eigenvectors of a matrix composed of covariance and crosscovariance matrices. Just like kernel PCA is a nonlinear extension of PCA, kernel CCA is a generalization of CCA. The goal of kernel CCA is to find two nonlinear mappings and , where and , such that their correlation is maximized Fukumizu07:KCCA . That is, instead of matrices, kernel CCA is now formulated in terms of covariance and crosscovariance operators. More precisely, the kernel CCA problem can be written as
and the solution is given by the eigenfunctions corresponding to the largest eigenvalue of the problem
(5) 
Further eigenfunctions corresponding to subsequent eigenvalues can be taken into account as in the standard setting described above. In practice, the eigenfunctions are estimated from finite samples. The empirical estimates of and are denoted by and , respectively.
Example III.1.
In order to illustrate kernel CCA, let us analyze a synthetic data set similar to the one described in Ref. Fukumizu07:KCCA, using a Gaussian kernel with bandwidth . Algorithms to solve the CCA problem will be described below. The results are shown in Figure 2. Classical CCA would not be able to capture the nonlinear relationship between and .
iii.1 RKHS operator formulation
Since the inverses of the covariance operators in general do not exist, the regularized versions and (cf. Section II.4) are also typically used in the context of CCAFukumizu07:KCCA . Solving the first equation in (5) for and inserting it into the second equation, this results in
(6) 
Comparing this with the aforementioned transfer operator representations (2), can be interpreted as an approximation of the kernel Koopman operator, and as a kernel Koopman operator where now the roles of and are reversed or as a reweighted Perron–Frobenius operator. The composition of these operators corresponds to a pushforward and subsequent pullback of a density . Eigenfunctions of the operator whose associated eigenvalues are close to one thus remain nearly unchanged under the forwardbackward dynamics. This is closely related to the notion of coherence as introduced in Refs. FrSaMo10:coherent, ; Froyland13:coherent, and will be discussed in Section III.4.
Lemma III.2.
Replacing the covariance and crosscovariance operators by their empirical estimates, the eigenvalue problem (6) can be written as
with .
Proof.
Inserting the definitions of the empirical covariance and crosscovariance operators yields
Using , see Ref. MFSS16, , and a similar identity for concludes the proof. ∎
That is, the empirical RKHS operator for kernel CCA is of the form . Applying Proposition II.5, we must solve the auxiliary problem

[label=(), itemsep=0ex, topsep=1.5ex]

, with , or

, with .
Since and as well as and commute, the first problem can be equivalently rewritten as and the second as . The eigenfunction associated with the largest eigenvalue solves the CCA problem, but in order to detect coherent sets, we will need more eigenfunctions later. To obtain the function corresponding to , we compute

[label=(), itemsep=0ex, topsep=1.5ex]

, or

.
[backgroundcolor=boxback,hidealllines=true]
Algorithm III.3.
The CCA problem can be solved as follows:

[leftmargin=3ex, itemindent=0ex, itemsep=0ex, topsep=0.5ex]

Choose a kernel and regularization .

Compute the centered gram matrices and .

Solve .
The corresponding eigenfunction evaluated at all data points , denoted by , is then approximately given by the vector . We can evaluate the eigenfunctions at any other point as described above, but we will mainly use the eigenfunction evaluations at the sampled data points for clustering into coherent sets.
Algorithm III.3 is based on the second problem formulation, i.e., item (ii) above. However, the first variant can be used in the same way. Alternatively, we can rewrite it as an eigenvalue problem of the form
and, consequently,
(7) 
Other formulations can be derived in a similar fashion. The advantage is that no matrices have to be inverted. However, the size of the eigenvalue problem doubles, which might be problematic if the number data points is large.
Remark III.4.
In order to apply the algorithms, we first need to choose a kernel and then tune its parameters, e.g., the bandwidth of the Gaussian kernel, and also the regularization parameter . If the bandwidth is too small, this leads to overfitting and to oversmoothing if it is too large. Crossvalidation techniques can be used to select suitable hyperparameters. The kernel itself determines the complexity of the function space in which the eigenfunctions are sought (see Section III.4). Additionally, the results depend on the lag time . Sets that are coherent for a given lag time are not necessarily coherent for a different lag time since these sets might be dispersed again.
The generalized eigenvalue problem (7) is almost identical to the one derived in Ref. Bach03:KICA, , with the difference that regularization is applied in a slightly different way. That is, the direct eigendecomposition of RKHS operators as proposed in Ref. KSM17, results, as expected, in variants of kernel CCA. The statistical convergence of kernel CCA, showing that finite sample estimators converge to the corresponding population counterparts, has been established in Ref. Fukumizu07:KCCA, . Kernel CCA can be extended to more than two variables or views of the data as described in Refs. Bach03:KICA, ; SC04:KernelMethods, , which might also have relevant applications in the dynamical systems context.
iii.2 Finitedimensional feature space
If the state spaces of the kernels and are finitedimensional, we can directly solve the eigenvalue problem (5) or (6). Assuming the feature space of the kernel is dimensional and spanned by the basis functions , we define by . That is, we are now using an explicit feature space representation. This induces a kernel by defining .^{¶}^{¶}¶For the Mercer feature space representationMercer ; Schoe01 the functions form an orthogonal basis, but orthogonality is not required here. We could, for instance, select a set of radial basis functions, monomials, or trigonometric functions. Analogously, we define a vectorvalued function , with , where is the dimension of the feature space of the kernel . Any function in the respective RKHS can be written as and , where and are coefficient vectors.
Given training data drawn from the joint probability distribution, we obtain and and can compute the centered covariance and crosscovariance matrices , , and explicitly.
[backgroundcolor=boxback,hidealllines=true]
Algorithm III.5.
Given explicit feature maps, we obtain the following CCA algorithm:

[leftmargin=3ex, itemindent=0ex, itemsep=0ex, topsep=0.5ex]

Select basis functions and and regularization .

Compute (cross)covariance matrices , , , and .

Solve the eigenvalue problem
.
The eigenfunctions are then given by . Expressions for can be derived analogously.
The difference between the Gram matrix approach described in Section III.1 and the algorithm proposed here is that the size of the eigenvalue problem associated with the former depends on the number of data points and permits the dimension of the feature space to be infinitedimensional, whereas the eigenvalue problem associated with the latter depends on the dimension of the feature space but not on the size of the training data set. This is equivalent to the distinction between extended dynamic mode decomposition (EDMD) WKR15 and kernel EDMD WRK15 (or the variational approach NoNu13 and kernel TICA SP15 , where the system is typically assumed to be reversible; see Ref. KSM17, for a detailed comparison) with the small difference that often the Moore–Penrose pseudoinverse Penrose is used for EDMD in lieu of the Tikhonovregularized inverse.
iii.3 Relationships with VAMP
Defining , the eigenvalue problem in Algorithm III.5 becomes
The transformed eigenvectors are thus equivalent to the right singular vectors of the matrix
(8) 
and the values are given by the singular values, which we assume to be sorted in nonincreasing order.
This form is the kernel version of the TCCA method that has been first derived as a way to approximate transfer operator singular functions using VAMP, see Ref. WuNo17, . VAMP is an optimization principle that defines a score function whose optimum leads to specific datadriven algorithms. The VAMP score is defined as
where are the singular value estimates obtained from an SVD of, e.g., (8), and is a positive integer.
When using and kernel feature functions, we obtain the kernel CCA algorithms III.3 or III.5. When doing the same on an explicit basis set of feature functions, we obtain TCCA, i.e., timelagged CCA WuNo17 ; No18 . However, since with VAMP a score (or loss) function is available, we have now turned the identification of coherent sets into a generic machine learning problem. For example, training neural networks with VAMP results in VAMPnets, a deep learning method to lowrank approximattion of the transfer operators and the identification of metastable or coherent sets.
On the other hand, the fact that kernel CCA results from maximizing the VAMP score within a kernel approach shows that we can use the VAMP score in the context of crossvalidation in order to optimally determine hyperparameters such as the kernel function. Furthermore we can explore other choices as . For example, the choice has an interesting interpretation in terms of kinetic maps, which are embeddings of the dominant eigenspace or singular space of a transfer operator where Euclidean distances are related to timescales of transitions NoeClementi_JCTC15_KineticMap .
iii.4 Relationships between kernel CCA and transfer operators
We have seen in Section III.1 that the resulting eigenvalue problem (6) involves expressions resembling kernel transfer operators. The goal now is illustrate how this eigenvalue problem is related to the operators derived in Ref. BK17:coherent, for detecting coherent sets. We first introduce a forward operator by
where is some reference density of interest. Furthermore, let be the image density obtained by mapping the indicator function on forward in time. Normalizing with respect to , we obtain a new operator and its adjoint , with
It holds that . Consequently, plays the role of a reweighted Perron–Frobenius operator, whereas can be interpreted as an analogue of the Koopman operator (note that and are defined on reweighted spaces). A more detailed derivation can be found in Ref. BK17:coherent, , where the operator (or a trajectoryaveraged version thereof) is used to detect coherent sets. We want to show that this is, up to regularization, equivalent to the operator in (6).
Proposition III.6.
Assuming that for all , it holds that .
Proof.
The proof is almost identical to the proof for the standard Perron–Frobenius operator (see Ref. KSM17, ). For all , we obtain
We define the RKHS approximation of the operator by . Note that the operator technically depends not only on but also on , which we omit for brevity. In practice, we typically use the same kernel for and . As a result, the eigenvalue problem (6) can now be written as
The adjointness property for , i.e., assuming that the inverse exists without regularization,^{∥}^{∥}∥Conditions for the existence of the inverse can be found, for instance, in Ref. Song2013, and in Section III.2. can be verified as follows:
We have thus shown that the eigenvalue problem for the computation of coherent sets and the CCA eigenvalue problem are equivalent, provided that the RKHS is an invariant subspace of . Although this is in general not the case—depending on the kernel the RKHS might be lowdimensional (e.g., for a polynomial kernel), but could also be infinitedimensional and isometrically isomorphic to (e.g., for the Gaussian kernel)—, we can use the kernelbased formulation as an approximation and solve it numerically to obtain coherent sets. This is the mathematical justification for the claim that CCA detects coherent sets, which will be corroborated by numerical results in Section IV.
iii.5 Coherent mode decomposition
Borrowing ideas from dynamic mode decomposition (DMD) Schmid10 ; TRLBK14 , we now introduce a method that approximates eigenfunctions or eigenmodes of the forwardbackward dynamics using linear basis functions and refer to it as coherent mode decomposition (CMD)—a mixture of CCA and DMD.^{**}^{**}**In fact, the method described below is closer to TICA than DMD, but other variants can be derived in the same fashion, using different combinations of covariance and crosscovariance operators. The relationships between DMD and TICA (including their extensions) and transfer operators are delineated in Refs. KNKWKSN18, ; KSM17, . DMD is often used for finding coherent structures in fluid flows, dimensionality reduction, and also prediction and control; see Ref. KBBP16, for an exhaustive analysis and potential applications.
Let us assume we have highdimensional timeseries data but only relatively few snapshots. That is, with , where and . This is, for instance, the case for fluid dynamics applications where the, e.g., two or threedimensional domain is discretized using (un)structured grids. It is important to note that this analysis is now not based on Lagrangian data as before, where we tracked the positions of particles or drifters over time, but on the Eulerian frame of reference.
Using Algorithm III.5 with and is infeasible here since the resulting covariance and crosscovariance matrices would be prohibitively large; thus, we apply the kernelbased counterpart. The linear kernel is defined by and the Gram matrices are simply given by
where .
[backgroundcolor=boxback,hidealllines=true]
Algorithm III.7.
Coherent mode decomposition.

[leftmargin=3ex, itemindent=0ex, itemsep=0ex, topsep=0.5ex]

Choose regularization .

Compute Gram matrices and .

Solve the eigenvalue problem
.
The eigenfunction evaluated in an arbitrary point is then given by
where we define the coherent mode corresponding to the eigenvalue by . That is, contains the coefficients for the basis functions . Analogously, we obtain
where and .
As mentioned above, DMD (as a special case of EDMD WKR15 ) typically uses the pseudoinverse to compute matrix representations of the corresponding operators. Nonetheless, a Tikhonovregularized variant is described in Ref. EMBK17:DMD, .
Iv Numerical results
As we have shown above, many dimensionality reduction techniques or methods to analyze highdimensional data can be regarded as eigendecompositions of certain empirical RKHS operators. We now seek to illustrate how kernel CCA results in coherent sets and potential applications of the coherent mode decomposition.
iv.1 Coherent sets
We will first apply the method to a wellknown benchmark problem, namely the Bickley jet, and then to ocean data and a molecular dynamics problem.
iv.1.1 Bickley jet
Let us consider a perturbed Bickley jet, which is an approximation of an idealized stratospheric flow Rypina07:coherent and a typical benchmark problem for detecting coherent sets (see, e.g., Refs. HKTH16:coherent, ; BK17:coherent, ; HSD18, ; FJ18:coherent, ). The flow is illustrated in Figure 3. For a detailed description of the model and its parameters, we refer to Ref. BK17:coherent, . Here, the state space is defined to be periodic in the direction with period . In order to demonstrate the notion of coherence, we arbitrarily color one circular set yellow and one red and observe their evolution. The yellow set is dispersed quickly by the flow; the red set, on the other hand, moves around but barely changes shape. The red set is hence called coherent.
We generate 10000 uniformly distributed test points in and then simulate their progression in time. For the computation of the coherent sets, we use only the start and end points of each trajectory, i.e., we define , where denotes the flow associated with the dynamical system. We set . From the vectors and , we then compute the Gram matrices and using the same Gaussian kernel. Here, we define the bandwidth to be and the regularization parameter to be .
A few dominant eigenfunctions are shown in Figure 4 (a)–(d). The first eigenfunction distinguishes between the top and bottom “half” and the second one between the middle part and the rest. The subsequent eigenfunctions pick up combinations of the vortices. Applying means with to the first eigenfunctions results in the coherent sets shown in Figure 4 (e). This is consistent with the results presented in Ref. BK17:coherent, as shown in Figure 4 (f), where we apply spacetime diffusion maps to the trajectory data (comprising 40 snapshots). While the results are qualitatively the same although kernel CCA uses only two snapshots, the coherent sets computed by our approach are less noisy, which might be due to the smoothing effects of the Gaussian kernel.
Choosing a finitedimensional feature space explicitly, as described in Section III.2, by selecting a set of radial basis functions whose centers are given by a regular grid leads to comparable results. Currently, only start and end points of trajectories are considered. As a result, points that drift apart and then reunite at time would constitute coherent sets. Applying kernel CCA to less wellbehaved systems might require more sophisticated kernels that take entire trajectories into account, e.g., by employing averaging techniques as suggested in Ref. BK17:coherent, .
iv.1.2 Ocean data
Ocean currents are driven by winds and tides, as well as differences in salinity. There are five major gyres as illustrated in Figure 5 (a), which has been reproduced with permission of the National Ocean Service (NOAA).^{††}^{††}††NOAA. What is a gyre? https://oceanservice.noaa.gov/facts/gyre.html Our goal now is to detect these gyres from virtual buoy trajectories. In order to generate Lagrangian data, we use the OceanParcels toolbox^{‡‡}^{‡‡}‡‡OceanParcels project: http://oceanparcels.org/ (see Ref. OceanParcels17, for details) and data from the GlobCurrent repository,^{**}^{**}**GlobCurrent data repository: http://www.globcurrent.org/ provided by the European Space Agency. More precisely, our drifter computations are based on the Eulerian total current at significant wave height from the sum of geostrophic and Ekman current components, starting on the 1st of January 2016 and ending on the 31st of December 2016 with 3hourly updates.
We place 15000 uniformly distributed virtual drifters in the oceans and let the flow evolve for one year, which thus constitutes the lag time . Let denote the initial positions and the new positions of the drifters after one year. The domain is , where the first dimension corresponds to the longitudes and the second to the latitudes. For the coherent set analysis, we select a Gaussian kernel with bandwidth , where is the distance between the points and in kilometers computed with the aid of the haversine formula. The regularization parameter is set to . The first two dominant eigenfunctions computed using kernel CCA are shown in Figure 5 (b) and (c) and a means clustering of the six dominant eigenfunctions in Figure 5 (d). CCA correctly detects the main gyres—the splitting of the South Atlantic Gyre and the Indian Ocean Gyre might be encoded in eigenfunctions associated with smaller eigenvalues—and the Antartic Circumpolar Current. The clusters, however, depend strongly on the lag time . In order to illustrate the flow properties, typical trajectories are shown in Figure 5 (e). The trajectories belonging to different coherent sets remain mostly separated, although weak mixing can be seen, for instance, at the borders between the red and purple and red and green clusters.
iv.1.3 Timedependent energy potential
As a last example, we will analyze a moleculardynamics inspired problem, namely diffusion in a timedependent twodimensional energy landscape, given by the stochastic differential equation
with
The parameter is the dimensionless inverse (absolute) temperature, a standard Wiener process, and specifies the number of wells. This is a generalization of a potential defined in Ref. BKKBDS18, , whose wells now move periodically towards and away from the center and which furthermore slowly rotates. We set . The resulting potential for is shown in Figure 6 (a). Particles will typically quickly equilibrate in radial direction towards the closest well and stay in this well, which moves over time. Particles trapped in one well will remain coherent for a relatively long time. The probability of escaping and moving to another one depends on the inverse temperature: The higher , the less likely are transitions between wells.
We generate 1000 uniformly distributed test points in and integrate the system with the aid of the Euler–Maruyama method and the step size from to . As before, we use only the start and end points of the trajectories and a Gaussian kernel (here, and ) for the coherent set analysis.
Due to the centering of the Gram matrices, the eigenvalue vanishes and—depending on the parameter —four eigenvalues close to one remain as illustrated in Figure 6 (b). Figure 6 (c) shows a clustering of the dominant four eigenfunctions for based on PCCA+ Roeblitz2013 , resulting in the expected five coherent sets. The clustering at (see Figure 6 (d)) illustrates that the computed sets indeed remain largely coherent.
Standard methods for the computation of metastable sets such as Ulam’s method, EDMD, or their variants are in general not suitable for nonequilibrium dynamics; see also Ref. KWNS18:noneq, and Section III.3.
iv.2 Coherent mode decomposition
In order to illustrate the coherent mode decomposition outlined in Algorithm III.7, we consider the classical von Kármán vortex street and generate data using a simple Python implementation.^{*†}^{*†}*†Palabos project: http://wiki.palabos.org/numerics:codes It is important to note that here we take into account the full trajectory data , where is the state at time , and define and , whereas we generated uniformly distributed data for the coherent set analysis in the previous subsection and furthermore used only the start and end points of the trajectories. We set and . Some snapshots of the system are shown in Figure 7 (a)–(d). Applying CMD results in the modes depicted in Figure 7 (e)–(h), where the color bar is the same as in Figure 4. As described above, we obtain two modes, denoted by and , for each eigenvalue , where can be interpreted as the timelagged counterpart of .
For this standard DMD benchmark problem, which we chose for illustration purposes, CMD and (regularized) DMD lead to modes that look highly similar. The interpretations, however, are different. While the DMD modes, which correspond to eigenvectors, are objects that are mapped onto scalar multiples of themselves, the CMD modes, which correspond to singular vectors, encode information about how coherent structures at time are transported by the flow to time . In fact, the DMD eigenvalues associated with the DMD modes resembling the CMD modes shown in Figure 7 are negative and close to , implying periodicity. Analogously, the CMD modes are akin to , which also implies periodic motion. Further applications of CMD pertaining to, for instance, more complicated fluid flows or also nonsequential data, will be investigated in future work.
V Conclusion
We demonstrated that several kernelbased dimensionality reduction techniques can be interpreted as eigendecompositions of empirical estimates of certain RKHS operators. Moreover, we showed that applying CCA to Lagrangian data results in coherent sets and illustrated the efficiency of the methods using several examples ranging from fluid to molecular dynamics. This approach worked out of the box, although taking into account entire trajectories might improve the results even further, which would then necessitate dedicated kernels. In this work, we analyzed only lowdimensional benchmark problems. Nevertheless, the kernelbased algorithms can be easily applied to more complex problems and also nonvectorial domains such as graphs or strings.
As a byproduct of the coherent set analysis, we derived a method called CMD that is a hybrid of CCA and DMD (or TICA). This method can, for instance, be applied to highdimensional fluid flow or video data. For specific problems, CMD and DMD—unsurprisingly, given the close proximity—result in highly similar modes. An interesting topic for future research would be to systematically analyze the relationships between these methods. Furthermore, as with the transfer operators and embedded transfer operators as well as their kernelbased estimates KSM17 , there are again several different combinations and variants of the proposed algorithms.
Another open problem is the influence of different regularization techniques on the numerical results. How does Tikhonov regularization compare to approaches based on pseudoinverses or other spectral filtering methods? And how do we choose the kernel and the regularization parameters in an optimal way, preferably without crossvalidation? Additionally, future work includes analyzing the properties of the empirical estimate . Can we show convergence in the infinitedata limit? Which operators can be approximated by and can we derive error bounds for the resulting eigenvalues and eigenfunctions?
We expect the results in this paper to be a starting point for further theoretical research into how RKHS operators in the context of dynamical systems could be approximated and, furthermore, how they connect to statistical learning theory. Additionally, the methods proposed here might be combined with classical modifications of CCA in order to improve the numerical performance. The experiments here were performed using Matlab, and the methods have been partially reimplemented in Python and are available at https://github.com/sklus/d3s/.
Acknowledgements
We would like to thank Péter Koltai for the Bickley jet implementation as well as helpful discussions related to coherent sets, Ingmar Schuster for pointing out similarities between CCA and kernel transfer operators, and the reviewers for many helpful suggestions for improvements. We gratefully acknowledge funding from Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1114 (Scaling Cascades in Complex Systems, project ID: 235221301, projects A04 and B06), Germany’s Excellence Strategy (MATH+: The Berlin Mathematics Research Center, EXC2046/1, project ID: 390685689, projects AA12, AA16, and EF12), and European Commission through ERC CoG 772230 “ScaleCell”.
References
 [1] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, 1933.
 [2] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
 [3] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Network, 13(45):411–430, 2000.
 [4] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72:3634–3637, 1994.
 [5] G. PérezHernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé. Identification of slow molecular order parameters for Markov model construction. The Journal of Chemical Physics, 139(1), 2013.
 [6] H. Wu and F. Noé. Variational approach for learning Markov processes from time series data. ArXiv eprints, 2017.
 [7] P. Schmid and J. Sesterhenn. Dynamic Mode Decomposition of numerical and experimental data. In 61st Annual Meeting of the APS Division of Fluid Dynamics. American Physical Society, 2008.
 [8] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9(1):5, 2018.
 [9] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A datadriven adaptive spectral decomposition of the Koopman operator. Chaos, 27:103111, 2017.
 [10] S. E. Otto and C. W. Rowley. Linearlyrecurrent autoencoder networks for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1):558–593, 2019.
 [11] B. Schölkopf, A. Smola, and K.R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
 [12] T. Melzer, M. Reiter, and H. Bischof. Nonlinear feature extraction using generalized canonical correlation analysis. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial Neural Networks — ICANN 2001, pages 353–360, Berlin Heidelberg, 2001. Springer.
 [13] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2003.
 [14] S. Harmeling, A. Ziehe, M. Kawanabe, and K.R. Müller. Kernelbased nonlinear blind source separation. Neural Computation, 15(5):1089–1124, 2003.
 [15] M. O. Williams, C. W. Rowley, and I. G. Kevrekidis. A kernelbased method for datadriven Koopman spectral analysis. Journal of Computational Dynamics, 2(2):247–265, 2015.
 [16] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Modeling & Simulation, 11(2):635–655, 2013.
 [17] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A Direct Approach to Conformational Dynamics based on Hybrid Monte Carlo. Journal of Computational Physics, 151:146–168, 1999.
 [18] Anton Bovier. Metastability: a potential theoretic approach. In Proceedings of the International Congress of Mathematicians, pages 499–518, 2006.
 [19] C. R. Schwantes and V. S. Pande. Improvements in Markov State Model construction reveal many nonnative interactions in the folding of NTL9. Journal of Chemical Theory and Computation, 9:2000–2009, 2013.
 [20] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
 [21] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141, 2017.
 [22] B. Koopman. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences, 17(5):315, 1931.
 [23] A. Lasota and M. C. Mackey. Chaos, fractals, and noise: Stochastic aspects of dynamics, volume 97 of Applied Mathematical Sciences. Springer, 2nd edition, 1994.
 [24] S. Klus, I. Schuster, and K. Muandet. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. ArXiv eprints, 2017.
 [25] G. Froyland and O. Junge. Robust FEMbased extraction of finitetime coherent sets using scattered, sparse, and incomplete trajectories. SIAM Journal on Applied Dynamical Systems, 17(2):1891–1924, 2018.
 [26] G. Froyland, N. Santitissadeekorn, and A. Monahan. Transport in timedependent dynamical systems: Finitetime coherent sets. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4):043116, 2010.
 [27] G. Froyland and O. Junge. On fast computation of finitetime coherent sets using radial basis functions. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(8), 2015.
 [28] M. O. Williams, I .I. Rypina, and C. W. Rowley. Identifying finitetime coherent sets from limited quantities of Lagrangian data. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(8), 2015.
 [29] A. Hadjighasem, D. Karrasch, H. Teramoto, and G. Haller. Spectralclustering approach to Lagrangian vortex detection. Physical Review E, 93:063107, 2016.
 [30] R. Banisch and P. Koltai. Understanding the geometry of transport: Diffusion maps for Lagrangian trajectory data unravel coherent sets. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(3):035804, 2017.
 [31] B. E. Husic, K. L. SchlueterKuck, and J. O. Dabiri. Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity. PLoS ONE, 14(3):e0212442, 2019.
 [32] M. R. Allshouse and T. Peacock. Lagrangian based methods for coherent structure detection. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(9):097617, 2015.
 [33] S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte, and F. Noé. Datadriven model reduction and transfer operator approximation. Journal of Nonlinear Science, 28:985–1010, 2018.
 [34] P. Koltai, H. Wu, F. Noé, and C. Schütte. Optimal datadriven estimation of generalized Markov state models for nonequilibrium dynamics. Computation, 6(1), 2018.
 [35] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT press, Cambridge, USA, 2001.
 [36] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 1st edition, 2008.
 [37] J. ShaweTaylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
 [38] C. Baker. Mutual information for Gaussian processes. SIAM Journal on Applied Mathematics, 19(2):451–458, 1970.
 [39] C. Baker. Joint measures and crosscovariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
 [40] J. R. Baxter and J. S. Rosenthal. Rates of convergence for everywherepositive Markov chains. Statistics & Probability Letters, 22(4):333–338, 1995.
 [41] S. Klus, P. Koltai, and C. Schütte. On the numerical approximation of the Perron–Frobenius and Koopman operator. Journal of Computational Dynamics, 3(1):51–79, 2016.
 [42] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, 209:415–446, 1909.
 [43] M. Reed and B. Simon. Methods of Mathematical Physics I: Functional Analysis. Academic Press Inc., 2 edition, 1980.
 [44] L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905–934, 2010.
 [45] C. W. Groetsch. Inverse Problems in the Mathematical Sciences. Vieweg, 1993.
 [46] H. Engl and C. W. Groetsch. Inverse and IllPosed Problems. Academic Press, 1996.
 [47] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, 1996.
 [48] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
 [49] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes’ rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.
 [50] K. Fukumizu. Nonparametric bayesian inference with kernel mean embedding. In G. Peters and T. Matsui, editors, Modern Methodology and Applications in SpatialTemporal Modeling. 2017.
 [51] M. Mollenhauer, I. Schuster, S. Klus, and C. Schütte. Singular value decomposition of operators on reproducing kernel Hilbert spaces. ArXiv eprints, 2018.
 [52] John ShaweTaylor, Christopher K. I. Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the gram matrix and its relationship to the operator eigenspectrum. In Algorithmic Learning Theory. ALT 2002. Lecture Notes in Computer Science, vol 2533., pages 23–40, 11 2002.
 [53] S. Klus, A. Bittracher, I. Schuster, and C. Schütte. A kernelbased approach to molecular conformation analysis. The Journal of Chemical Physics, 149:244109, 2018.
 [54] M. Borga. Canonical correlation: a tutorial, 2001.
 [55] K. Fukumizu, F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361–383, 2007.
 [56] G. Froyland. An analytic framework for identifying finitetime coherent sets in timedependent dynamical systems. Physica D: Nonlinear Phenomena, 250:1–19, 2013.
 [57] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley. A datadriven approximation of the Koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science, 25(6):1307–1346, 2015.
 [58] C. R. Schwantes and V. S. Pande. Modeling molecular kinetics with tICA and the kernel trick. Journal of Chemical Theory and Computation, 11(2):600–608, 2015.
 [59] R. Penrose. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406–413, 1955.
 [60] F. Noé. Machine learning for molecular dynamics on long timescales. ArXiv eprints, 2018.
 [61] F. Noé and C. Clementi. Kinetic distance and kinetic maps from molecular dynamics simulation. Journal of Chemical Theory and Computation, 11:5002–5011, 2015.
 [62] P. J. Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, 2010.
 [63] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. Journal of Computational Dynamics, 1(2), 2014.
 [64] J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor. Dynamic Mode Decomposition: DataDriven Modeling of Complex Systems. SIAM, 2016.
 [65] N. B. Erichson, L. Mathelin, S. L. Brunton, and N. J. Kutz. Randomized dynamic mode decomposition. ArXiv eprints, 2017.
 [66] I. I. Rypina, M. G. Brown, F. J. BeronVera, H. Koçak, M. J. Olascoaga, and I. A. Udovydchenkov. On the Lagrangian dynamics of atmospheric zonal jets and the permeability of the stratospheric polar vortex. Journal of the Atmospheric Sciences, 64(10):3595–3610, 2007.
 [67] M. Lange and E. van Sebille. Parcels v0.9: prototyping a Lagrangian ocean analysis framework for the petascale age. Geoscientific Model Development, 10(11):4175–4186, 2017.
 [68] A. Bittracher, P. Koltai, S. Klus, R. Banisch, M. Dellnitz, and C. Schütte. Transition manifolds of complex metastable systems: Theory and datadriven computation of effective dynamics. Journal of Nonlinear Science, 28(2):471–512, 2018.
 [69] S. Röblitz and M. Weber. Fuzzy spectral clustering by PCCA+: application to markov state models and data classification. Advances in Data Analysis and Classification, 7(2):147–179, 2013.