A kernel-based method for coarse graining complex dynamical systems

# A kernel-based method for coarse graining complex dynamical systems

Andreas Bittracher Department of Mathematics and Computer Science, Freie Universität Berlin, Germany Stefan Klus Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
Boumediene Hamzi
Christof Schütte
###### Abstract

We present a novel kernel-based machine learning algorithm for identifying the low-dimensional geometry of the effective dynamics of high-dimensional multiscale stochastic systems. Recently, the authors developed a mathematical framework for the computation of optimal reaction coordinates of such systems that is based on learning a parametrization of a low-dimensional transition manifold in a certain function space. In this article, we enhance this approach by embedding and learning this transition manifold in a reproducing kernel Hilbert space, exploiting the favorable properties of kernel embeddings. Under mild assumptions on the kernel, the manifold structure is shown to be preserved under the embedding, and distortion bounds can be derived. This leads to a more robust and more efficient algorithm compared to previous parametrization approaches.

## 1 Introduction

Many of the dynamical processes investigated in the sciences today are characterized by the existence of phenomena on multiple, interconnected time scales that determine the long-term behavior of the process. Examples include the inherently multiscale dynamics of atmospheric vortex- and current formation which needs to be considered for effective weather prediction [24, 29], or the vast difference in time scales on which bounded atomic interactions, side-chain interactions, and the resulting formation of structural motifs occur in biomolecules [19, 12, 10]. An effective approach to analyzing these systems is often the identification of a low-dimensional observable of the system that captures the interesting behaviour on the longest time scale. However, the computerized identification of such observables from simulation data poses a significant computational challenge, especially for high-dimensional systems.

Recently, the authors have developed a novel mathematical framework for identifying such essential observables for the slowest time scale of a system [6]. The method—called the transition manifold approach—was primarily motivated by molecular dynamics, where the dynamics is typically described by a thermostated Hamiltonian system or diffusive motion in molecular dynamics landscapes. In these systems, local minima of the potential energy landscape induce metastable behavior, which is the phenomenon that on long time scales, the dynamics is characterized by rare transitions between certain sets that happen roughly along interconnecting transition pathways [35, 41, 17]. The sought-after essential observables should thus resolve these transition events, and are called reaction coordinates in this context [46, 4], a notion that we will adopt here. Despite of its origins, the transition manifold approach is also applicable to other classes of reducible systems (which will also be demonstrated in this article).

At the heart of this approach is the insight that good reaction coordinates can be found by parametrizing a certain transition manifold in the function space . This manifold has strong connections to the aforementioned transition pathway [5], but is not equivalent. Its defining property is that, for times that fall between the fastest and slowest time scales, the transition density functions with relaxation time concentrate around . The algorithmic strategy to parametrize can then be summarized as

1. Choose test points in the dynamically relevant regions of the state space.

2. Sample the transition densities for each test point by Monte Carlo simulation.

3. Embed the transition densities into a Euclidean space by a generic embedding function.

4. Parametrize the embedded transition densities with the help of established manifold learning techniques.

The result is a reaction coordinate evaluated in the test points. This reaction coordinate has been shown to be as expressive as the dominant eigenfunctions of the transfer operator associated with the system [6], which can be considered an “optimal” reaction coordinate [20, 11, 30]. One decisive advantage of our method, however, is the ability to compute the reaction coordinate locally (by choosing the test points), whereas with conventional methods, the inherently global computation of transfer operator eigenfunctions quickly becomes infeasible due to the curse of dimensionality. Kernel-based methods for the computation of eigenfunctions alleviate this problem to some extent [43, 25]. Nevertheless, the number of dominant eigenfunctions critically depends on the number of metastable states, which can be significantly larger than the natural dimension of the reaction coordinate [6].

However, the algorithm proposed originally had several shortcomings related to the choice of the embedding function. First, in order to ensure the preservation of the manifold’s topology under the embedding, the dimension of had to be known in advance. Second, the particular way of choosing the embedding functions allowed no control over the distortion of , which may render the parametrization problem numerically ill-conditioned.

The goal of this article is to overcome the aforementioned problems by kernelizing the transition manifold embedding. That is, we present a method to implicitly embed the transition manifold into a reproducing kernel Hilbert space (RKHS) with a proper kernel. The RKHS is—depending on the kernel—a high- or even infinite-dimensional function space with the crucial property that inner products between points embedded into it can be computed by cheap kernel evaluations, without ever explicitly computing the embedding [48, 38]. In machine learning, this so-called kernel trick is often used to derive nonlinear versions of originally linear algorithms, by interpreting the RKHS-embedding of a data set as a high-dimensional, nonlinear transformation, and (implicitly) applying the linear algorithm to the transformed data. This approach has been successfully applied to methods such as principal component analysis (PCA) [40], canonical correlation analysis (CCA) [31], and time-lagged independent component analysis (TICA) [43], to name but a few.

Due to their popularity, the metric properties of the kernel embedding are well-studied [45, 21, 47, 22, 34]. In particular, for characteristic kernels, the RKHS is “large” in an appropriate sense, and geometrical information is well-preserved under the embedding. For our application, this means that distances between points on the transition manifold are approximately preserved, and thus the distortion of under the embedding can be bounded. This will guarantee that the final manifold learning problem is well-posed. Moreover, if the transformation induced by the kernel embedding is able to approximately linearize the transition manifold, there is hope that efficient linear manifold learning methods can be used to parametrize the embedded transition manifold.

The main contributions of this work are as follows:

1. We develop a kernel-based algorithm to approximate transition manifolds and compare it with the Euclidean embedding counterpart.

2. We derive measures for the distortion of the embedding and associated error bounds.

3. We illustrate the efficiency of the proposed approach using academic and molecular dynamics examples.

In Section 2, we will formalize the definition of transition manifolds and derive conditions under which systems possess such manifolds. Section 3 introduces kernels and the induced RKHSs. Furthermore, we show that the algorithm to compute transition manifolds numerically can be written purely in terms of kernel evaluations and derive measures for the distortion of the manifold caused by the embedding into the RKHS. Numerical results illustrating the benefits of the proposed kernel-based methods are presented in Section 4 and a conclusion and future work in Section 5.

## 2 Reaction coordinates based on transition manifolds

In what follows, let (abbreviated as ) be a reversible, thus ergodic, stochastic process on a compact state space and its unique invariant density. That is, if , then for all . For and , let denote the transition density function of the system, i.e., describes the probability of finding the system in at time , after starting in at time .

### 2.1 Reducibility of dynamical systems

We assume the state space dimension to be large. The main objective of this work is the identification of good low-dimensional reaction coordinates (RCs) or order parameters of the system. An -dimensional RC is a smooth map from the full state space to a lower-dimensional space , . Informally, we call such an RC good if the long-term dynamics of the projected process —i.e., the process observed through the RC—resembles the long-term dynamics of the full process . A quantitative measure for the quality of RCs will follow in Section 2.2.

A geometric characterization of good reaction coordinates based on the so-called transition manifold was derived in [6].

###### Definition 2.1.

Let be fixed. The set

 ˜M:={pτx∣x∈X},

is called the fuzzy transition manifold of the system.

The definition is motivated by the following observation: If the system is in a certain way separable into a slowly equilibrating -dimensional part and a quickly equilibrating -dimensional part, and is chosen to be in a certain intermediate111Intermediate here means that it has to be large enough so that the fast processes essentially equilibrated but the slow processes are still present and distinguishable. time scale range, then approximately forms an -dimensional manifold in . Let us illustrate this with the aid of a simple example.

###### Example 2.2.

Consider the process to be described by overdamped Langevin dynamics

 dXt=−∇V(Xt)dt+√2/βdWt, (1)

with the energy potential , the inverse temperature , and Brownian motion . The potential depicted in Figure 1 possesses two metastable states, located around the local energy minima. Any realization of this system that started in one of the the metastable sets will likely remain in that set for a long time. Moreover, a realization that started outside of the wells will with high probability quickly leave the transition region and move into one of the two metastable sets. The probability of whether the trajectory will be trapped in the left or right well depends almost exclusively on the horizontal coordinate of the starting point , or, in other words, the progress of along the transition path. Thus, for times that allow typical trajectories to move into one of the wells, the transition densities also depend only on the horizontal but not the vertical coordinate of . This means that the fuzzy transition manifold essentially forms a one-dimensional manifold in the space of probability densities. Also, any parametrization of this manifold corresponds to a parametrization of the horizontal coordinate, and thus to a good reaction coordinate.

The concept of (fuzzy) transition manifolds can be made rigorous by the following definition:

###### Definition 2.3.

The process is called -reducible if there exists an -dimensional manifold such that for a fixed lag time , it holds that

 minf∈M∥f−pτx∥L21/ρ≤εfor all x∈X.

Any such is called a transition manifold of the system.

Two remarks are in order:

1. The fact that we require closeness with respect to the -norm, instead of, for example, the -norm, is indeed important here, as measuring the quality of a given RC will require a certain Hilbert space structure. This will be detailed below; for now, simply note that for all it holds that since .

2. The original definition of -reducibility (see [6, Definition 4.4]), is marginally different from the definition above: Instead of , we here require . The introduction of this slightly stronger technical requirement allows us to later control a certain embedding error, see Proposition 2.8. Note that the proofs in [6] regarding the optimality of the final reaction coordinate are not affected by this change.

In what follows, we always assume that the process is -reducible with small and .

###### Remark 2.4.

In addition to metastable systems, many other relevant problems possess such transition manifolds. For instance, one can show that, under mild conditions on and , systems with explicit time scale separation, given by

 dXt =f(Xt,Yt)dt+dW(1)t, dYt =1εg(Xt,Yt)dt+1√εdW(2)t,

also possess a transition manifold since essentially depends only on for and .

### 2.2 A measure for the quality of reaction coordinates

We will now derive a measure for evaluating the quality of reaction coordinates that is based on transfer operators. The Perron–Frobenius operator associated with the process is defined by

 (Ptu)(y)=∫Xu(x)ptx(y)dx.

This operator can be seen as the push-forward of arbitrary starting densities, i.e., if , then .

The operator can be naturally extended to the inner product space and then has particularly advantageous properties (see [2, 37, 26]): Defined on , the operator is self-adjoint due to the reversibility of . Moreover, under relatively mild conditions, it does not exhibit any essential spectrum [41]. Hence, its eigenfunctions form an orthonormal basis of and the associated eigenvalues are real. Now, the significance of the dominant eigenpairs for the system’s time scales is well-known [41]. This is the primary reason for the choice of the -norm in Definition 2.3.

Let be the eigenvalues of , sorted by decreasing absolute value, and the corresponding eigenfunctions, where . It holds that is independent of , isolated and the sole eigenvalue with absolute value . Furthermore, . The subsequent eigenvalues decrease monotonously to zero both for increasing index and time. That is,

 limi→∞|θti|=0andlimt→∞|θti|=0.

The associated eigenfunctions can be interpreted as sub-processes of decreasing longevity in the following sense: Let , with , , then

 Ptu=∞∑i=0θtiαiψi≈d∑i=0θtiαiψi

since for the lag time as defined above, there exists an index such that for all and all . Hence, the major part of the information about the long-term density propagation of is encoded in the dominant eigenpairs.

In addition to the Perron–Frobenius operator for the full process , we can define the Perron–Frobenius operator for the projected process , denoted by . Formally, it can be computed by applying the Zwanzig projection operator to , for details see [50]. Assume now that we are interested in reproducing the long-term dynamics of , i.e., the action of , for bigger than some fixed lag time . A natural requirement for good reaction coordinates then is

 Ptu≈Ptξu (2)

for and all densities . Therefore, a necessary condition is

 Πξψi≈ψi,i=0,…,d.

That is, the eigenfunctions must be almost constant along the level sets of . For a formal proof of this statement, see [6]. This motivates the following definition:

###### Definition 2.5.

Let be the eigenpairs of the Perron–Frobenius operator. Let and such that for all and . We call a function a good reaction coordinate if for all there exist functions such that

 ∥∥ψi−~ψi∘ξ∥∥∞≈0. (3)

If condition (3) is fulfilled, we say that (approximately) parametrizes the dominant eigenfunctions.

### 2.3 Optimal reaction coordinates

We now justify why reaction coordinates that are based on parametrizations of the transition manifold indeed fulfill the optimality condition (3). Let be the nearest-point projection onto , i.e.,

 Q(f)=argming∈M∥f−g∥L21/ρ.

Assume further that some parametrization of is known, i.e., is one-to-one on and its image in . The reaction coordinate defined by

 ξ(x)=(γ∘Q)(pτx) (4)

is good in the sense of Definition 2.5.

###### Theorem 2.6 ([6, Corollary 3.8]).

Let the system be -reducible and defined as in (4). Then for all , there exist functions such that

 ∥∥ψi−~ψi∘ξ∥∥∞≤ε|θτi|. (5)

1. The choice of the -norm in Definition 2.3 is crucial for Theorem 2.6 to hold.

2. Metastable systems typically exhibit a time scale gap. Therefore, can always be chosen such that , . Consequently, the RC (4) is indeed good according to Definition 2.5.

The main task for the rest of the paper is now the numerical computation of an (approximate) parametrization of .

### 2.4 Whitney embedding of the transition manifold

One approach to find a parametrization of , proposed by the authors in [6], is to first embed into a more accessible Euclidean space and to parametrize the embedded manifold. In order to later compare it with our new method, we will briefly describe this approach here.

To construct an embedding that preserves the topological structure of , without prior knowledge about , a variant of the Whitney embedding theorem can be used. It extends the classic Whitney theorem to arbitrary Banach spaces and was proven by Hunt and Kaloshin in [23].

###### Theorem 2.7 (Whitney embedding theorem, [23]).

Let be a Banach space, let be a smooth manifold of dimension and let . Then almost every (in the sense of prevalence) bounded linear map is one-to-one on and its image in .

In our case, and is the transition manifold. Assuming that the dimension of the transition manifold is known in advance, this result states that we can effectively choose the embedding randomly, and the image will again be an -dimensional manifold in , provided that .

We restrict ourselves to embeddings of the form

 F(f):=∫η(x′)f(x′)dx′,

with arbitrarily chosen feature maps that we require to be bounded on . The dynamical embedding of a point is then defined by

 E(x):=F(ptx)=∫η(x′)ptx(x′)dx′. (6)

This is the Euclidean representation of the density , and the set is the Euclidean representation of the fuzzy transition manifold. It again clusters around an -dimensional manifold in , namely the image of the transition manifold under :

###### Proposition 2.8.

Let the process be -reducible with transition manifold , and and defined as above. Then

 infv∈F(M)∥v−E(x)∥∞≤∥η∥∞εfor all x∈X.
###### Proof.

Let . Then there is an such that . As , i.e., consists of transition densities, we have

 infv∈F(M)∥v−E(x)∥∞ ≤∥F(ptx∗)−F(ptx)∥∞ =∥∥∥∫Xη(x′)(ptx∗(x′)−ptx(x′))dx′∥∥∥∞ ≤∥η∥∞∥ptx∗−ptx∥L1≤∥ptx∗−ptx∥L21/ρ≤∥η∥∞ε.\qed
###### Remark 2.9.

Together, Theorem 2.7 and Proposition 2.8 guarantee at least a minimal degree of well-posedness of the embedding problem: The embedded manifold has the same topological structure as , and clusters closely around it (if is small). However, guarantees on the condition of the problem cannot be made. The manifold will in general be distorted by , to a degree that might pose problems for numerical manifold learning algorithms. This problem is illustrated in Figure 2. Such a situation typically occurs if some of the components of the embedding are strongly correlated.

Additionally, the Whitney embedding theorem cannot guarantee that the fuzzy transition manifold will be preserved under the embedding, as analytically is not a manifold. Thus, is in general not injective on .

### 2.5 Data-driven algorithm for parametrizing the transition manifold

Due to its implicit definition, the transition manifold is hard to analyze and parametrize without further knowledge of the system. However, as and concentrates -closely around , we can assume that any parametrization of the dominant directions of is also a good parametrization of . The embedding can be efficiently sampled from simulation data in the following way: Let denote the endpoint of the time- realization of starting in and outcome , where is the sample space underlying the process . For , fixed as in Definition 2.3 and arbitrarily chosen , let . In practice, the will be generated by multiple runs of a numerical SDE solver starting in with different random seeds.

The empirical dynamical embedding is defined by

 ˆE(x):=1MM∑k=1η(y(k)), (7)

and is an estimator for .

Let be a finite sample of state space points, which covers the “dynamically relevant” part of state space, i.e., the regions of of substantial measure . The exact distribution of the sampling points is not important here. If is bounded or periodic, could be drawn from the uniform distribution or chosen to form a regular grid. In practice, it often consists of a sample of the system’s equilibrium measure . Due to Proposition 2.8, the point cloud then clusters around an -dimensional manifold in . Parametrizing this manifold is a classical manifold learning task for which a wide variety of algorithms can be deployed. Algorithm 2.1 summarizes the overall procedure of computing good reaction coordinates based on parametrizing the embedded fuzzy transition manifold.

## 3 Kernel-based parametrization of the transition manifold

The approach described above for learning a parametrization of the transition manifold by embedding it into Euclidean spaces requires a priori knowledge of the dimension of . Also, more importantly, might be strongly distorted by the embedding, as described in Section 2.4. The kernel-based parametrization, which is the main novelty of this work, will address both of these shortcomings by embedding into reproducing kernel Hilbert spaces.

### 3.1 Kernel reformulation of the embedding algorithm

Dimensionality reduction algorithms that can be used in Algorithm 2.1 include Diffusion Maps [14], Multidimensional Scaling [49, 28], and Locally Linear Embedding [36]. These, and many others, require only a notion of distance between pairs of data points. In our case, this amounts to the Euclidean distances between embedded points, i.e., , which can be computed by the Euclidean inner products , as

 ∥∥E(xi)−E(xj)∥∥22=⟨E(xi),E(xi)⟩−2⟨E(xi),E(xj)⟩+⟨E(xj),E(xj)⟩.

Other compatible algorithms such as Principal Component Analysis are based directly on the inner products. The inner products can be written as

and the empirical counterpart is

However, rather than explicitly computing the inner product between the features on the right-hand side, we now assume that it can be computed implicitly by using a kernel function , i.e.,

 k(yi,yj)=⟨η(yi),η(yj)⟩. (8)

This assumption, called the kernel trick, is commonly used to avoid the costly computation of inner products between high-dimensional features. However, instead of defining the kernel based on previously chosen features, one typically considers kernels that implicitly define high- and possibly infinite-dimensional feature spaces. In this way, we are able to avoid the choice of feature maps altogether.

Kernels with this property span a so-called reproducing kernel Hilbert space:

###### Definition 3.1 (Reproducing kernel Hilbert space [45]).

Let be a positive definite function. The set of functions , together with the corresponding inner product which fulfills

1. and

2. for all

is called the reproducing kernel Hilbert space (RKHS) associated with the kernel .

Requirement 2 implies that

 ⟨k(x,⋅),k(x′,⋅)⟩=k(x,x′)for % all x,x′∈X, (9)

thus can be regarded as a function-valued feature map (the so-called canonical feature map). However, each positive definite kernel is guaranteed to also possess a feature map of at most countable dimension.

###### Theorem 3.2 (Mercer’s theorem [32]).

Let be a positive definite kernel and be a finite Borel measure with support . Define the integral operator by

 Tkf=∫k(⋅,x)f(x)dν(x). (10)

Then there is an orthonormal basis of consisting of eigenfunctions of rescaled with the square root of the corresponding nonnegative eigenvalues such that

 k(x,x′)=∞∑i=0λiφi(x)φi(x′)for all x,x′∈X. (11)

The above formulation of Mercer’s theorem has been taken from [34]. The Mercer features thus fulfill (8) for their corresponding kernel. In what follows, if not stated otherwise, will be the standard Lebesgue measure.

###### Example 3.3.

Examples of commonly used kernels are:

1. Linear kernel: . One sees immediately that (8) is fulfilled by choosing , (also spanning the Mercer feature space).

2. Polynomial kernel of degree : . It can be shown that the Mercer feature space is spanned by the monomials in up to degree .

3. Gaussian kernel: where is called the bandwidth of the kernel. Let and with be a multi-index. The Mercer features of then take the form

 ηp(x)=ep1(x1)⋯epn(xn),

see [48], where

 epi(xi)=√2piσ2pipi!xpiiexp(−1σ2x2i). △

Let denote the density embedding based on the Mercer features of the kernel , i.e.,

 (Fk(ptx))i:=∫ηi(x′)ptx(x′)dx′,i=0,1,2,…, (12)

and let . The amount of information about preserved by the embedding depends on the choice of the kernel . For the first two kernels in Example 3.3, the information preserved has a familiar stochastic interpretation (see, e.g., [34, 39, 47]):

1. Let be the linear kernel. Then

 ∥Fk(ptx1)−Fk(ptx2)∥=0⟺∫ptx1(y)dy=∫ptx2(y)dy,

i.e., the means of and coincide.

2. Let be the polynomial kernel of degree . Then

 ∥Fk(ptx1)−Fk(ptx2)∥=0⟺mi(ptx1)=mi(ptx2),i=1,…,p,

i.e., the first moments of and coincide.

###### Remark 3.4.

In practice, comparing the first moments often is enough to sufficiently distinguish the transition densities that constitute the transition manifold. However, densities that differ only in higher moments cannot be distinguished by , which means that for the above two kernels, is not injective. Therefore does not belong to the prevalent class of maps that is at the heart of the Whitney embedding theorem 2.7. We can therefore not utilize the Whitney embedding theorem to argue that the topology of is preserved under . Instead, in Section 3.2, we will use a different argument to show that the embedding is indeed injective for the Gaussian kernel (and others).

Still, by formally using the Mercer dynamical embedding in (6) (abusing notation if there are countably infinitely many such features), and using the kernel trick, we can now reformulate Algorithm 2.1 as a kernel-based method that does not require the explicit computation of any feature vector. This is summarized in Algorithm 3.1.

### 3.2 Condition of the kernel embedding

We will now investigate to what extend the kernel embedding preserves the topology and geometry of the transition manifold.

#### 3.2.1 Kernel mean embedding

We derived the kernel-based algorithm by considering the embedding of the transition manifold into the image space of the Mercer features in order to highlight the similarity to the Whitey embedding based on randomly drawn features. Of course, the Mercer features never had to be computed explicitly.

However, in order to investigate the quality of this embedding procedure it is advantageous to consider a different, yet equivalent embedding map: The transition manifold can be directly embedded into the RKHS by means of the kernel mean embedding operator.

###### Definition 3.5.

Let be a positive definite kernel and the associated RKHS. Let be a probability density over . Define the kernel mean embedding of  by

 μ(p) :=∫Xk(x,⋅)p(x)dx and the empirical kernel mean embedding by ^μ(p) :=1m∑ik(xi,⋅)with{x1,…,xm}∼p.

Note that and are again elements of and that for in (10) being the Lebesgue measure we obtain . Further, one sees that

Thus, for investigating whether the embedding preserves distances or inner products between densities, we can equivalently investigate the embedding . This is advantageous as injectivity and isometry properties of the kernel mean embedding are well-studied.

#### 3.2.2 Injectivity of the kernel mean embedding

A first important result is that can be chosen such that is injective. Such kernels are called characteristic [21]. In [47], several conditions for characteristic kernels are listed, including the following:

###### Theorem 3.6 ([47, Theorem 7]).

The kernel is characteristic if for all it holds that

 (13)

Condition (13) is known as the Mercer condition, which is, for example, fulfilled by the Gaussian kernel from Example 3.3. The Mercer features of such a kernel are particularly rich.

###### Theorem 3.7.

Assume that the kernel satisfies the Mercer condition (13). Then the eigenfunctions of form an orthonormal basis of .

For more details, see, e.g., [38, 48]. It is easy to see that for kernels fulfilling (13), as a map from to is Lipschitz continuous:

###### Lemma 3.8.

Let be a characteristic kernel with Mercer eigenvalues , . Then is Lipschitz continuous with constant

 c:=√λ0. (14)
###### Proof.

As is linear, it suffices to show that for all . We obtain

 ∥μ(f)∥2H =⟨μ(f),μ(f)⟩H =⟨∫Xf(x)k(x,⋅)dx,∫Xf(x)k(x,⋅)dx⟩H =∫X∫Xf(x)f(y)k(x,y)dxdy,

where (9) was used in the last line. By expanding into its Mercer features via (11), this becomes

 ∥μ(f)∥2H =∫X∫Xf(x)f(y)(∑i∈Nλiφi(x)φi(y))dxdy =∑i∈Nλi⟨f,φi⟩2L2.

By Theorem 3.7, the form an orthonormal basis of , and thus

 ∥μ(f)∥2H ≤λ0∥f∥2L2.\qed

Thus, if the kernel is characteristic, the structure of the TM and the fuzzy TM are qualitatively preserved under the embedding.

###### Corollary 3.9.

Let be a characteristic kernel and let be -reducible. Then is an -dimensional manifold in , and for all it holds that

 infg∈μ(M)∥g−μ(ptx)∥H≤√λ0∥√ρ∥∞ε.
###### Proof.

By Lemma 3.8, the map is continuous (and furthermore injective), and thus is again an -dimensional manifold. For , consider now any with . For , we then get

 ∥∥g−μ(ptx)∥∥H =∥∥μ(f−ptx)∥∥H which by Lemma 3.8 is
###### Remark 3.10.

This result should be seen as an analogue to Proposition 2.8 for the Whitney-based TM embedding. In short, for characteristic kernels, the injectivity and continuity of guarantee that the image of under is again an -dimensional manifold in , and Corollary 3.9 guarantees that the embedded fuzzy transition manifold still clusters closely around (if and in Corollary 3.9 are small). This again guarantees a minimal degree of well-posedness of the problem.

#### 3.2.3 Distortion under the kernel mean embedding

Unlike the Whitney embedding, the kernel embedding now allows us to derive conditions under which the distortion of is bounded. We have to show that the -distance between points on is not overly decreased or increased by the kernel mean embedding. To formalize this, we consider measures for the manifold’s internal distortion, following the notions of metric embedding theory [1].

###### Definition 3.11.

Let be the embedding corresponding to a characteristic kernel. The contraction of is defined by

 C(μ):=supp,q∈Mq≠p∥p−q∥L21/ρ∥μ(p)−μ(q)∥H. (15)

The expansion of is defined by

 E(μ):=supp,q∈Mq≠p∥μ(p)−μ(q)∥H∥p−q∥L21/ρ. (16)

Finally, the distortion of is defined by

 D(μ):=C(μ)E(μ). (17)

It follows that and if and only if is isometric. We call the embedding well-conditioned if is small.

###### Proposition 3.12.

Assume is the largest and the smallest bound such that

 d1∥p−q∥L21/ρ≤∥μ(p)−μ(q)∥H≤d2∥p−q∥L21/ρ (18)

holds for all . Then the distortion of the kernel embedding is .

###### Proof.

Due to the first inequality in (18),

 ∥p−q∥L21/ρ∥μ(p)−μ(q)∥H≤d−11for all p,q∈M,

and thus . Due to the second inequality in (18),

 ∥μ(p)−μ(q)∥H∥p−q∥L21/ρ≤d2for all p,q∈M,

and thus . Combined, we obtain . ∎

The second inequality in (18) is just Lipschitz continuity of , for which conditions have been derived in Lemma 3.8, so it can be assumed to hold with the corresponding Lipschitz constant.

However, one can show that, even for characteristic kernels, it is not possible to derive a bound for the first inequality that holds uniformly for all :

###### Proposition 3.13.

Assume the kernel embedding operator has absolutely bounded orthonormal eigenfunctions with corresponding nonnegative eigenvalues (arranged in nonincreasing order). Assume increases super-linearly with . Then there exist functions such that

 ∥μ(p)−μ(q)∥H∥p−q∥L21/ρ<ε

for any arbitrarily small .

###### Proof.

See Appendix A. ∎

A similar, but non-quantitative result has been derived in [47, Theorem 19]. The idea behind the proof is that, if and vary only in higher eigenfunctions of the embedding operator (see also Theorem 3.2), the -distance can become arbitrarily small. If, however, we can reasonably restrict our considerations to the subclass of functions whose variation in the higher is small compared to the variation in the lower , a favorable bound can be derived. Let the expansion of be given by

 h=∞∑i=0~hiφi

with the sequence . Now, for any such that there exists an index such that , define the factor

 c(h,imax):=1+∑∞i=imax+1~h2i∑imaxi=0~h2i. (19)

This factor bounds the contribution of the higher Mercer eigenfunctions to by the contribution of the lower ones:

 ∞∑i=0~h2i=c(h,imax)⋅imax∑i=0~h2i.

Thus, for an individual , we can bound the distortion of the -norm under with the help of .

###### Lemma 3.14.

Let , , and be defined as in (19). Then

 ∥μ(h)∥H≥ ⎷λimaxc(h,i% max)∥h∥2.
###### Proof.

See Appendix A. ∎

This already shows that for ,

 ∥μ(p)−μ(q)∥H≥ ⎷λimaxc(p−q,imax)∥p−q∥L2.

However, this is only an intermediate result, as the relevant distance measure in Definition 2.5 is the -norm, for reasons detailed in Section 2.2. Unfortunately, a naive estimation yields

 ∥h∥L21/ρ≤∥∥1/√ρ∥∥∞∥h∥L2. (20)

While due to ergodicity is indeed defined on , it becomes large in regions of small invariant measure , i.e., “dynamically irrelevant” regions. This would lead to a very small factor in (18), and thus large predicted distortion. For general , a more favorable estimate is difficult to obtain. In practice, however, these “dynamically irrelevant” regions are almost never visited by the system such that any practically relevant bound would not suffer from this problem.

Recall that our task is to parametrize the embedded transition manifold, in order to find a parametrization of the dominant transfer operator eigenfunctions . Therefore, differences of and in higher eigenfunctions , do not need to be resolved. Thus, we only need to preserve the distance

 ∥Πspan{ψ0,…,ψd}(p−q)∥L21/ρ,

where is the -orthogonal projection onto the dominant eigenfunctions:

 Πspan{ψ0,…,ψd}h=d∑i=1⟨h,ψi⟩ψi.

This allows us to derive a more favorable bound:

###### Lemma 3.15.

Let denote the -th dominant eigenfunction of the Perron–Frobenius operator and the associated eigenvalues. Let denote the -orthogonal projection onto the dominant eigenfunctions. Then for any ,

 ∥∥Πspan{ψ0,…,ψd}(ptx1−ptx2)∥∥L21/ρ≤√2dsupi=0,…,d∥ψi/ρ∥∞|X|∥∥Πspan{ψ0,…,ψd}(ptx1−ptx2)∥∥L2,

where denotes the Lebesgue measure of .

###### Proof.

See Appendix A. ∎

###### Remark 3.16.

Note that unlike in (20), the factor is not problematic here, as the dominant eigenfunctions are small whenever is small.

Combining Lemma 3.14 and Lemma 3.15 then finally yields the desired contraction bound:

###### Corollary 3.17.

Assume there exists and a constant such that

 c(ptx1−ptx2,imax)≤cX(imax)

for all . Then

where

###### Remark 3.18.

In summary it can be said that, if we can assume that the differences of all , , in the higher Mercer eigenfunctions of are uniformly bounded by the differences in the lower Mercer eigenfunctions, then the first inequality in (18) is fulfilled with the above constant . Combined with the Lipschitz constant in (14), we have therefore derived an upper bound for the distortion factor of :

 D(μ∘Πspan{ψ1,…,ψd})≤ ⎷2dλ0cX(imax)λimaxsupi=1,…,d∥ψi/ρ∥∞|X|. (21)

Most importantly, this bound guarantees the well-posedness of the embedding and parametrization problem as the distortion of even the fuzzy transition manifold cannot become arbitrarily large. We will support this statement with numerical evidence (see Section 4.2) showing that the distortion is, in practice, indeed small. Note, however, that we do not expect (21) to perform well as a quantitative error estimate, as many of the estimates that led to it are rather rough.

## 4 Illustrative examples and applications

### 4.1 Reaction coordinate of the Müller–Brown potential

As a first illustrating example, we compute the reaction coordinate of the two-dimensional Müller–Brown potential [33] via the new kernel-based Algorithm 3.1. The potential energy surface (see Figure 3 (a)) possesses three local minima, where the two bottom minima are separated only by a rather shallow energy barrier. Correspondingly, the system’s characteristic long-term behavior is determined by the rare transitions between the minima. These transitions happen predominantly along the potential’s minimum energy pathway (MEP), which is shown as white dashed line and was computed using the zero temperature string method [15, 16].

For two sets and a starting point , the committor function is defined as the probability that the process hits set before hitting set , provided it started in at time zero. For a precise definition see [41]. For the Müller–Brown potential, the committor function associated with the top left and bottom right energy minima, shown in Figure 3 (b), can be considered an optimal reaction coordinate [18]. Therefore, we use the (qualitative) comparison with the committor function as a benchmark for our reaction coordinate. Note that the computation of the committor function requires global knowledge of the metastable sets and is often not a practical option for the identification of reaction coordinates.

The governing dynamics is given by an overdamped Langevin equation (1), which we solve numerically using the Euler–Maruyama scheme. At inverse temperature , the lag time falls between the slow and fast time scales and is thus defined to be the intermediate lag time. The test points required by Algorithm 3.1 are given by a regular grid discretization of the domain . For the embedding, the Gaussian kernel

 k(x,x′)=exp(−∥x−x′∥22σ) (22)

with bandwidth is used and for the manifold learning task in Algorithm 3.1, the diffusion maps algorithm with bandwidth parameter . The reaction coordinate for the test points is shown in Figure