# Nonparametric clustering of functional data using pseudo-densities

Mattia Ciollaro (ciollaro@cmu.edu), Christopher R. Genovese (genovese@stat.cmu.edu), and Daren Wang (darenw@andrew.cmu.edu)

Carnegie Mellon University, Department of Statistics
5000 Forbes Avenue
Pittsburgh, PA 15213

###### Abstract

We study nonparametric clustering of smooth random curves on the basis of the gradient flow associated to a pseudo-density functional and we show that the clustering is well-defined both at the population and at the sample level. We provide an algorithm to mark significant local modes, which are associated to informative sample clusters, and we derive its consistency properties. Our theory is developed under weak assumptions, which essentially reduce to the integrability of the random curves, and does not require projecting the random curves onto a finite-dimensional subspace. However, if the underlying probability distribution is supported on a finite-dimensional subspace, we show that the pseudo-density and the expectation of a kernel density estimator induce the same gradient flow, and therefore the same clustering. Although our theory is developed for smooth curves that belong to an infinite-dimensional functional space, we also provide consistent procedures that can be used with real data (discretized and noisy observations).

Keywords: Modal clustering, pseudo-density, gradient flow, functional data analysis.

Research supported in part by National Science Foundation Grants NSF-DMS-1208354 and NSF-DMS-1513412.

## 1 Introduction

In Functional Data Analysis (Ramsay and Silverman, 2005, Ferraty and Vieu, 2006, Ferraty and Romain, 2011, Horváth and Kokoszka, 2012), henceforth FDA, we think of curves (and other functions) as the fundamental unit of measurement. Clustering is an important problem in FDA because it is often of critical interest to identify subpopulations based on the shapes of the measured curves. In this paper, we study the problem of functional clustering in a fully infinite-dimensional setting. We are motivated by recent work on modal clustering in finite dimensions (Chacón, 2012, Chacón, 2014, and references therein) that, in contrast to many commonly-used clustering methods, has a population formulation, and by recent advances in clustering of functional data (Bongiorno and Goia, 2015). Specifically, we prove the existence of population clusters in the infinite-dimensional functional case, under mild conditions. We show that an analogue of the mean-shift algorithm (see, for example, Fukunaga and Hostetler, 1975, Cheng, 1995, and the more recent works of Comaniciu, Ramesh and Meer, 2001 and Carreira-Perpiñán, 2006) can identify local modes of a “pseudo-density”. We devise an algorithm to classify local modes as representatives of significant clusters, and under some regularity assumptions on the pseudo-density, we further show that the algorithm is consistent. We develop our theory assuming that the data are observed as continuous curves defined on some interval. Because in practice one does not observe continuous curves, we also show how to apply the procedures that we propose to real data (e.g. noisy measurements of random curves on a grid).

Modal clustering is typically a finite-dimensional problem, but motivated by the flourishing literature on FDA and by the increasing interest in developing sound frameworks and algorithms for clustering of random curves, we extend the idea of modal clustering to the case where $X$ is a functional random variable valued in an infinite-dimensional space. In particular, we develop a theory of modal clustering for smooth random curves that are assumed to belong to the Sobolev space $H^1([0,1])$ of curves defined on the standard unit interval whose first weak derivative is square integrable. We focus on $H^1([0,1])$ for concreteness, but our theory generalizes to other Sobolev spaces. Furthermore, our theory is density-free and nonparametric, as no assumptions are made regarding the existence of a dominating measure for the law $P$ of the functional data, nor is it assumed that $P$ can be parametrized by a finite number of parameters.

In the finite-dimensional modal clustering problem, we have that $p$ is the probability density function associated to the law $P$ of a random variable $X$ valued in $\mathbb{R}^d$. If $p$ is a Morse function (i.e. $p$ is smooth and its Hessian is not singular at the critical points), then the local modes of $p$, $\mu_1, \dots, \mu_k$, induce a partition $\mathcal{C} = \{C_1, \dots, C_k\}$ of the sample space where the sets $C_i$ satisfy

1. $P(C_i \cap C_j) = 0$ if $i \neq j$

2. $x \in C_i$ if and only if the gradient ascent path on $p$ that starts from $x$ eventually converges to $\mu_i$.

Note that this framework characterizes $C_i$ as a high-density region surrounding the local mode $\mu_i$ of $p$ and each set $C_i$ is thought of as a cluster at the population level. Unlike other approaches to clustering which define clusters exclusively at the sample level (consider $k$-means, for instance), modal clustering provides an inferential framework in which the essential partition $\mathcal{C}$ is a population parameter that one wants to infer from the data. In fact, as soon as an i.i.d. sample $X_1, \dots, X_n \sim P$ and an estimator $\hat{p}$ of $p$ are available, the goals of modal clustering are exactly

• estimating the local modes of $p$ by means of the local modes of $\hat{p}$

• estimating the population clustering $\mathcal{C}$ by means of the empirical partition $\hat{\mathcal{C}}$ induced by $\hat{p}$.

Thus, the typical output of a modal clustering procedure consists of the estimated clustering structure $\hat{\mathcal{C}}$ and a set of cluster representatives $\hat{\mu}_1, \dots, \hat{\mu}_{\hat{k}}$. At the sample level, each data point $X_i$ is then uniquely assigned to a cluster $\hat{C}_j$ and represented by the corresponding local mode $\hat{\mu}_j$.
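
To make the finite-dimensional procedure concrete, the following is a minimal sketch of mean-shift clustering on the real line. It is an illustration under stated assumptions, not the procedure analyzed in this paper: the Gaussian kernel, the function name `mean_shift_1d`, the stopping tolerance, and the rule that merges limit points closer than the bandwidth are all hypothetical choices made for this example.

```python
import math

def mean_shift_1d(data, h, n_iter=200, tol=1e-8):
    """Gaussian mean-shift on the real line (illustrative sketch).

    Returns the estimated modes and, for each data point, the index of
    the mode that its ascent path converges to.
    """
    def shift(x):
        # Kernel-weighted average of the sample around the current position.
        w = [math.exp(-((xi - x) ** 2) / (2.0 * h * h)) for xi in data]
        return sum(wi * xi for wi, xi in zip(w, data)) / sum(w)

    modes, labels = [], []
    for x in data:
        for _ in range(n_iter):
            x_new = shift(x)
            converged = abs(x_new - x) < tol
            x = x_new
            if converged:
                break
        # Limits closer than the bandwidth are treated as the same mode
        # (a heuristic merging rule assumed for this sketch).
        for j, m in enumerate(modes):
            if abs(m - x) < h:
                labels.append(j)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return modes, labels

sample = [-2.1, -2.0, -1.9, -1.8, 1.9, 2.0, 2.1, 2.2]
modes, labels = mean_shift_1d(sample, h=0.5)
print(modes, labels)
```

In this toy sample the two well-separated groups are assigned to two distinct estimated modes, which play the role of the cluster representatives described above.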

Because it is generally not possible to define a probability density function in infinite-dimensional Hilbert spaces, we instead focus on a surrogate notion of density which we call “pseudo-density”. Generally, by pseudo-density we mean any suitably smooth functional which maps the sample space into the positive reals $\mathbb{R}_+$. In particular, we focus on a family of pseudo-densities $\{p_h\}_{h>0}$ which is parametrized by a bandwidth parameter $h$ and, more specifically, $p_h$ is the expected value of a kernel density estimator,

$$p_h(x) = E_P K\left(\frac{\|X-x\|^2_{L^2}}{h}\right), \tag{1.1}$$

where $K$ is an appropriately chosen kernel function and $h > 0$ is the bandwidth parameter. Clusters of curves are then defined in terms of the gradient flow associated to $p_h$.

The gradient flow associated to $p_h$ is the collection of the gradient ascent paths $\pi_x$ corresponding to the solution of the initial value problem

$$\begin{cases} \dfrac{d}{dt}\pi_x(t) = \nabla p_h(\pi_x(t)) \\ \pi_x(0) = x, \end{cases} \tag{1.2}$$

where $\nabla p_h(x)$ is the functional gradient of $p_h$ at $x$. In complete analogy with the finite-dimensional case, the gradient of $p_h$ induces a vector field and a gradient ascent path $\pi_x$ is a curve that solves the initial value problem and moves along the direction of the vector field, i.e. at any time $t$, the derivative of $\pi_x$ corresponds to the gradient of $p_h$ at $\pi_x(t)$. If the trajectory $\pi_x$ converges to a local mode $\mu_i$ of $p_h$ as $t \to \infty$, then $x$ is said to belong to the $i$-th cluster of $p_h$, $C_i$. Thus, the cluster $C_i$ is defined as the set

$$C_i = \left\{x \in H^1([0,1]) : \lim_{t\to\infty} \|\pi_x(t) - \mu_i\|_{L^2([0,1])} = 0\right\} \tag{1.3}$$

where $\pi_x$ is a solution of the initial value problem of equation (1.2). According to this definition, the $i$-th cluster of $p_h$ corresponds to the basin of attraction of the $i$-th local mode of $p_h$, and the collection of the clusters provides a good summary of the subpopulations associated to $P$.
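
As a rough numerical illustration of equations (1.1) to (1.3), the sketch below replaces the expectation by a sample average, represents each curve by its values on a uniform grid, and follows the gradient ascent path with an explicit Euler scheme. Everything specific here is an assumption made for the example rather than part of the paper: the exponential kernel $K(t) = e^{-t}$, the Riemann-sum approximation of the $L^2$ norm, the step size, the number of steps, and the helper names `l2_sq`, `pseudo_density`, `gradient` and `ascend`.

```python
import math

def l2_sq(x, y, dt):
    """Riemann-sum approximation of the squared L2([0,1]) distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) * dt

def pseudo_density(x, curves, h, dt):
    """Sample average of K(||X_i - x||^2 / h) with the assumed K(t) = exp(-t)."""
    return sum(math.exp(-l2_sq(c, x, dt) / h) for c in curves) / len(curves)

def gradient(x, curves, h, dt):
    """Sample version of the functional gradient of the pseudo-density."""
    g = [0.0] * len(x)
    for c in curves:
        # Derivative of t -> exp(-t / h), evaluated at ||X_i - x||^2.
        kprime = -math.exp(-l2_sq(c, x, dt) / h) / h
        for j in range(len(x)):
            g[j] += 2.0 * kprime * (x[j] - c[j]) / len(curves)
    return g

def ascend(x, curves, h, dt, step=0.1, n_steps=500):
    """Explicit Euler scheme for pi'(t) = grad p_h(pi(t)), pi(0) = x."""
    x = list(x)
    for _ in range(n_steps):
        g = gradient(x, curves, h, dt)
        x = [xj + step * gj for xj, gj in zip(x, g)]
    return x

m = 21                      # number of grid points on [0, 1]
dt = 1.0 / (m - 1)
amplitudes = (0.9, 1.0, 1.1, -0.9, -1.0, -1.1)
curves = [[a * math.sin(math.pi * j * dt) for j in range(m)] for a in amplitudes]
limits = [ascend(c, curves, h=0.5, dt=dt) for c in curves]
```

In this toy sample, the ascent paths started at the three curves with positive amplitude end at (numerically) the same limit, and those started at the three curves with negative amplitude end at a different one, so the two groups are separated by their basins of attraction.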

The main contribution of our work is to identify conditions under which

1. there exist population clusters in functional data, i.e. the population clusters defined in equation (1.3) exist and are well-defined

2. these clusters are estimable

and to provide a practical procedure to estimate the clusters and assess their statistical significance.

As we further discuss later in the paper, the most remarkable challenge arising in the infinite-dimensional setting is the lack of compactness. As opposed to the finite-dimensional setting, in the functional case it is hard to show the existence, the uniqueness, and the convergence of the gradient ascent paths described by the initial value problem of equation (1.2), unless the sample space can be compactly embedded in another space. We show that we can overcome this challenge by exploiting the compact embedding of $H^1([0,1])$ in $L^2([0,1])$, the space of square-integrable functions on the unit interval, and by studying equation (1.2) using these two non-equivalent topologies. For convenience, we focus on functional data belonging to $H^1([0,1])$ and on the gradient flow under the $L^2$ norm, but the exact same theory carries over to other function spaces, different norms and different pseudo-density functionals, as long as it is possible to compactly embed the sample space in a larger space and the chosen pseudo-density functional is sufficiently smooth. In particular, we remark that the results of this paper can be straightforwardly generalized to arbitrary pairs of Sobolev spaces of integer order satisfying the compact embedding requirement.

The theory of clustering that we develop in this work is projection-free, since it does not involve projecting the random curves onto a finite-dimensional space. However, if the probability law of the functional data is supported on a finite-dimensional space and admits a proper density with respect to the Lebesgue measure, we show that the gradient flow on the pseudo-density and the gradient flow on the expectation of the kernel density estimator of the data coincide, and so do the corresponding population clusterings.

One of the most important practical tasks in modal clustering is to identify significant local modes, as these are associated to informative clusters. We provide an algorithm that

• identifies the local modes of the population pseudo-density $p_h$ by analyzing its sample version $\hat{p}_h$; furthermore, all of the local modes of $\hat{p}_h$ identified by the algorithm converge asymptotically to their population correspondents of $p_h$

• is consistent (under additional regularity assumptions on $p_h$), in the sense that it establishes a one-to-one correspondence between the sample local modes that it identifies and their population equivalents.

While from a purely mathematical standpoint a sample of functional data is thought of as a collection of continuous curves defined on an interval, we never observe such objects in practice. Rather, we typically only observe noisy measurements of the $X_i$’s at a finite set of design points. As an intermediate step, we therefore estimate the $X_i$’s from these observations (which constitutes a typical regression problem), and then use the estimates as the input of our procedure.
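
The intermediate regression step can be sketched as follows. This is only one possible smoother, chosen here for brevity: the Nadaraya-Watson estimator with a Gaussian kernel, the function name `nw_smooth`, the smoothing bandwidth `b`, and the deterministic alternating perturbation used as a stand-in for measurement noise are all assumptions of this example and are not taken from the procedures developed later in the paper.

```python
import math

def nw_smooth(t_obs, y_obs, t_eval, b):
    """Nadaraya-Watson estimate with a Gaussian kernel of bandwidth b."""
    fitted = []
    for t in t_eval:
        w = [math.exp(-0.5 * ((t - s) / b) ** 2) for s in t_obs]
        fitted.append(sum(wi * yi for wi, yi in zip(w, y_obs)) / sum(w))
    return fitted

t_obs = [j / 20.0 for j in range(21)]               # design points on [0, 1]
truth = list(t_obs)                                 # underlying curve x(t) = t
noise = [0.3 * (-1) ** j for j in range(21)]        # stand-in for measurement noise
y_obs = [x + e for x, e in zip(truth, noise)]

x_hat = nw_smooth(t_obs, y_obs, t_obs, b=0.1)
max_err = max(abs(est - tru) for est, tru in zip(x_hat, truth))
```

The smoothed values `x_hat` are then the kind of input that a clustering procedure for curves would receive in place of the unobservable continuous curve.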

The remainder of the paper is organized as follows. Section 2 provides a concise literature review on pseudo-densities, finite-dimensional nonparametric clustering based on Morse densities, and mode-finding algorithms. Section 3 is devoted to the development of our theory of population clustering for smooth random curves. In particular, Section 3 studies in detail the gradient flow on the pseudo-density $p_h$ and establishes that, in analogy to the finite-dimensional case, population clusters of smooth random curves can be defined in terms of the basins of attraction of the critical points of $p_h$. Section 4 describes the behavior of the gradient flow of $p_h$ when the probability law of the data is supported on a finite-dimensional subspace. Section 5 provides an algorithm to identify the significant local modes of $p_h$ and shows that, under additional regularity assumptions on $p_h$, the algorithm is consistent. Section 6 extends the results of Section 5 to real data. Section 7 contains a discussion on the choice of the pseudo-density functional and some general guidelines for the choice of the smoothing parameter in practical applications. Section 8 summarizes the main contributions of this paper and indicates promising directions for future work. The proofs of the main results can be found in Appendix A, while other auxiliary results (such as probability bounds for the estimation of the pseudo-density functional and its derivatives) are deferred to Appendix B.

## 2 Related literature

The difficulties associated to the lack of proper density functions in infinite-dimensional spaces are well-known among statisticians. This has stimulated the introduction of various surrogate notions of density for functional spaces. The literature on pseudo-densities includes the work of Gasser, Hall and Presnell (1998), Hall and Heckman (2002), Dabo-Niang, Ferraty and Vieu (2004), Delaigle and Hall (2010), and Ferraty, Kudraszow and Vieu (2012).

A population framework based on Morse theory for nonparametric modal clustering in the finite dimensional setting is presented in Chacón (2012) and Chacón (2014). Whenever a proper density $p$ exists and it is a Morse function, the problem of equation (1.2) induces an essential partition of the sample space in the sense that each set $C_i$ in the partition $\mathcal{C}$ of $\mathbb{R}^d$ such that $P(C_i) > 0$ corresponds to the basin of attraction of a local mode $\mu_i$ of $p$, i.e. $C_i = \{x : \lim_{t\to\infty} \pi_x(t) = \mu_i\}$. Furthermore, if $p$ has saddle points, the basin of attraction of each saddle is a null probability set (similarly, the basin of attraction of a local minimum is a singleton and hence negligible as well).

A number of gradient ascent algorithms have been developed to perform modal clustering in the finite-dimensional case. One of the most popular mode-finding and modal clustering algorithms is the mean-shift algorithm (Fukunaga and Hostetler, 1975, Cheng, 1995). A version of the mean-shift algorithm for functional data is discussed in Ciollaro et al. (2014). A gradient ascent algorithm for functional data is proposed in Hall and Heckman (2002).

In their recent work, Bongiorno and Goia (2015) propose a clustering method for functional data based on the small ball probability function and on functional principal components. A recent overview of other clustering techniques for functional data can be found in Jacques and Preda (2013).

## 3 A population background for pseudo-density clustering of functional data

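We denote by $X$ a functional random variable valued in $L^2([0,1])$, the space of square integrable functions on the unit interval with its canonical inner product $\langle x, y\rangle_{L^2} = \int_0^1 x(t)y(t)\,dt$ and induced norm $\|x\|_{L^2} = \langle x, x\rangle_{L^2}^{1/2}$. As we previously mentioned, it is not possible to define a proper probability density function for $X$. Instead, we study the gradient flow of equation (1.2) associated to the functional

$$p_h(x) = E_P K\left(\frac{\|X-x\|^2_{L^2}}{h}\right) = \int_{\mathbb{R}} K(s)\, dP_{\|X-x\|^2_{L^2}/h}(s) \tag{3.1}$$

mapping $L^2([0,1])$ into $\mathbb{R}_+$, where $h > 0$ is a bandwidth parameter, $K$ is a kernel function, and $P_{\|X-x\|^2_{L^2}/h}$ denotes the probability measure induced by $P$ through the map $X \mapsto \|X-x\|^2_{L^2}/h$. Note that $p_h$ is closely related to the so-called small-ball probability function $\phi_h(x) = P(\|X-x\|^2_{L^2} \leq h)$. It is easy to see that $p_h = \phi_h$ when $K = 1_{[0,1]}$, therefore $p_h$ can be thought of as a smoother version of $\phi_h$.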

Unless otherwise noted, we make the following assumptions throughout the paper:

• (H1) $K$ is twice continuously differentiable and the following bounds hold on the derivatives of $K$:

• , for

where the constants may depend on .

• for all .

• (H3) $X$ is $P$-almost surely absolutely continuous and its moments are bounded by finite constants.

• (H4) All the non-trivial critical points of $p_h$ are isolated under the $L^2$ norm, i.e. there exists an open $L^2$ neighborhood around each critical point $x^*$ of $p_h$ with $p_h(x^*) > 0$ such that there are no other critical points of $p_h$ that also belong to that neighborhood.

Various kernels can be shown to satisfy assumptions (H1) and (H2). For instance, both the compactly supported kernel and the exponential kernel satisfy our assumptions. (H3) is an assumption on the smoothness of the random curves. Intuitively, (H3) corresponds to assuming that the probability law $P$ does not favor curves that are too irregular or wiggly. (H4) is a regularity assumption on the functional $p_h$: essentially, under the above assumptions on $K$, (H4) corresponds to assuming that the functional $p_h$ does not have flat “ridges” in regions where it is positive.

###### Remark 1.

A sufficient condition for (H4) to hold is that $p_h$ is a Morse functional. The following Proposition provides a sufficient condition under which $p_h$ is a Morse functional.

###### Proposition 1.

Suppose that $X$ has density $p$ with respect to the Lebesgue measure and is supported on a finite-dimensional compact domain $S_c$. Suppose furthermore that $p$ and $\partial S_c$, the boundary of $S_c$, satisfy

• $\partial S_c$ is smooth enough so that the normal vector $n(x)$ exists for any $x \in \partial S_c$

• $p$ is continuous on $S_c$

• $p$ is twice differentiable in the interior of $S_c$,

• is not vanishing on .

Then, for $h$ sufficiently small, all the critical points of $p_h$ in $S_c$ are non-degenerate and there are no non-trivial critical points outside of $S_c$.

In order to simplify the discussion, from now on we focus on the shifted random curves $X - X(0)$; however, with a little abuse of notation, we will keep using the letter $X$ to mean $X - X(0)$. This choice is just made for convenience as it significantly simplifies the proofs of many of the results that we present. However, it is simple to extend any of the results from $X - X(0)$ to $X$. Following this notational convention, $X$ thus belongs $P$-almost surely to the space $H^1_0([0,1]) = \{x \in H^1([0,1]) : x(0) = 0\}$. The Poincaré inequality ensures that the semi-norm $\|x'\|_{L^2([0,1])}$ is in fact a norm on $H^1_0([0,1])$ and $\|x\|_{L^2([0,1])} \leq C_p \|x'\|_{L^2([0,1])}$ with $C_p$ the Poincaré constant (i.e. $H^1_0([0,1])$ can be continuously embedded in $L^2([0,1])$). In the following, we denote $\|x\|_{H^1_0} = \|x'\|_{L^2([0,1])}$ for $x \in H^1_0([0,1])$. Moreover, to alleviate the notation, from now on we denote $L^2 = L^2([0,1])$ and $H^1_0 = H^1_0([0,1])$. If the curves were not shifted so that $X(0) = 0$, then they would belong $P$-almost surely to $H^1([0,1])$, which can still be continuously embedded in $L^2$.

The main goal of this section is to show that the gradient flow associated to $p_h$ is well-defined. In particular, we establish the following facts:

1. the gradient flow associated to $p_h$ is a flow in $H^1_0$

2. for any initial value in $H^1_0$, there exists exactly one trajectory of such flow which is a solution to the initial value problem of equation (1.2)

3. for any initial value in $H^1_0$, the unique solution of the initial value problem of equation (1.2) converges to a critical point of $p_h$ as $t \to \infty$ and the convergence is with respect to the $L^2$ norm

4. all the non-trivial critical points of $p_h$ are in $H^1_0$, the support of $P$.

These facts guarantee that the clusters described in equation (1.3) exist and are well-defined.

###### Remark 2.

In general, in an infinite dimensional Hilbert space, the trajectory of the solution of an ordinary differential equation such as the one of equation (1.2) may not converge as $t \to \infty$. In fact, such a trajectory can be entirely contained in a closed and bounded set without converging to any particular point of that set. To guarantee the convergence of the gradient flow trajectories, one needs that (see Jost, 2011)

1. the trajectories satisfy some compactness property

2. the functional of interest (in our case $p_h$) is reasonably well-behaved: for instance it is smooth, with isolated critical points.

In $L^2$, compactness is a delicate problem: no closed bounded ball in $L^2$ is compact. However, any closed and bounded $H^1$ ball is compact with respect to the $L^2$ norm (and so is any closed and bounded $H^1_0$ ball). In fact, $H^1$ can be compactly embedded in $L^2$ (see, for instance, Chapter 5.7 of Evans, 1998), which means that every bounded set in $H^1$ is totally bounded in $L^2$ and $H^1$ can be continuously embedded in $L^2$. Since $H^1_0$ is a closed subspace of $H^1$, $H^1_0$ can also be compactly embedded in $L^2$. From a theoretical point of view $L^2$ is strictly larger than $H^1_0$. However, $H^1_0$ is dense in $L^2$.

The remainder of our discussion focuses on the main results of this section, which concern the computation of the derivatives of $p_h$ and their properties, the existence, the uniqueness, and the convergence of the solution of the initial value problem of equation (1.2).

Before we state our results, let us recall that for a functional random variable $X$ valued in $L^2$ the expected value of $X$ is defined as the element $E_P X \in L^2$ such that $E_P \langle X, y\rangle_{L^2} = \langle E_P X, y\rangle_{L^2}$ for all $y \in L^2$ (Horváth and Kokoszka, 2012). Furthermore, the expectation commutes with bounded operators. Also, recall that for a functional $F$ mapping a Banach space $B_1$ into another Banach space $B_2$, the Fréchet derivative of $F$ at a point $x \in B_1$ is defined, if it exists, as the bounded linear operator $DF(x)$ such that $\|F(x + \delta) - F(x) - DF(x)(\delta)\|_{B_2} = o(\|\delta\|_{B_1})$. The most common case in this paper sets $B_1 = L^2$, $B_2 = \mathbb{R}$, and $F = p_h$. Because $DF(x)$ is a bounded linear operator, if $B_1$ is also a Hilbert space and $B_2 = \mathbb{R}$, then the Riesz representation theorem guarantees the existence of an element $\nabla F(x) \in B_1$ such that, for any $y \in B_1$, $DF(x)(y) = \langle \nabla F(x), y\rangle_{B_1}$. The element $\nabla F(x)$ corresponds to the gradient of $F$ at $x$. In this way, the gradient and the first derivative operator at $x$ can be identified. In the following, with a slight abuse of notation, we will use $Dp_h(x)$ both to mean the functional gradient of the operator (which is an element of $L^2$) and its Fréchet derivative (which is a bounded linear operator from $L^2$ to $\mathbb{R}$). It will be clear from the context whether we are referring to the derivative operator or to the functional gradient. Note that higher order Fréchet derivatives can be similarly identified with multilinear operators on $L^2$ (see, for example, Ambrosetti and Prodi, 1995).

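Recall that, by assumption, the function $K_h$ (with $K_h(t) = K(t/h)$) is bounded from above by a constant. Furthermore, $K_h(\|X-x\|^2_{L^2})$ is three times differentiable as a function of $x$ and its first Fréchet derivative at $x$ is

$$DK_h(\|X-x\|^2_{L^2}) = 2K_h'(\|X-x\|^2_{L^2})(x - X). \tag{3.2}$$

The second Fréchet derivative at $x$ corresponds to the symmetric bilinear operator

$$\begin{aligned} D^2K_h(\|X-x\|^2_{L^2})(z_1, z_2) ={}& 2K_h'(\|X-x\|^2_{L^2})\langle z_1, z_2\rangle_{L^2} \\ &+ 4K_h''(\|X-x\|^2_{L^2})\langle x-X, z_1\rangle_{L^2}\langle x-X, z_2\rangle_{L^2} \end{aligned} \tag{3.3}$$

for $z_1, z_2 \in L^2$.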

###### Remark 3.

Any bounded bilinear operator $B$ on $L^2 \times L^2$ can be represented as a bounded linear operator from $L^2$ to $L^2$. In fact, let $v$ be any element of $L^2$; then, $B(v, \cdot)$ is a bounded linear operator from $L^2$ to $\mathbb{R}$. By the Riesz representation theorem, one can define $B(v) \in L^2$ by letting $\langle B(v), w\rangle_{L^2} = B(v, w)$ for any $w \in L^2$. The operator norm of $B$ is then defined by

$$\|B\| = \sup_{\{v \,:\, \|v\|_{L^2} = 1\}} \|B(v)\|_{L^2}. \tag{3.4}$$

It is straightforward to check that both derivatives correspond to bounded linear operators under assumption (H1). The following Lemma provides the first and the second Fréchet derivatives of $p_h$.

###### Lemma 1.

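Under assumption (H1) the Fréchet derivative of $p_h$ at $x \in L^2$ corresponds to the element

$$Dp_h(x) = 2E_P K_h'(\|X-x\|^2_{L^2})(x - X). \tag{3.5}$$

The second Fréchet derivative of $p_h$ at $x$ corresponds to the symmetric bilinear operator

$$\begin{aligned} D^2p_h(x)(z_1, z_2) = E_P\big[ & 4K_h''(\|X-x\|^2_{L^2})\langle x-X, z_1\rangle_{L^2}\langle x-X, z_2\rangle_{L^2} \\ &+ 2K_h'(\|X-x\|^2_{L^2})\langle z_1, z_2\rangle_{L^2}\big]. \end{aligned} \tag{3.6}$$

Furthermore, both derivatives have bounded operator norm for any $x \in L^2$.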
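
The gradient formula of equation (3.5) can be sanity-checked numerically by comparing the inner product of the gradient with a direction $z$ against a central finite difference of the functional along $z$. The sketch below does this for the sample version of the functional (the expectation under the empirical measure) on discretized curves; the exponential kernel $K_h(t) = e^{-t/h}$, the grid size, the test curves, and the helper names `inner`, `p_hat` and `grad_p_hat` are assumptions made only for this check.

```python
import math

def inner(x, y, dt):
    """Riemann-sum approximation of the L2([0,1]) inner product."""
    return sum(a * b for a, b in zip(x, y)) * dt

def p_hat(x, curves, h, dt):
    """Sample average of K_h(||X_i - x||^2) with K_h(t) = exp(-t / h)."""
    total = 0.0
    for c in curves:
        diff = [a - b for a, b in zip(x, c)]
        total += math.exp(-inner(diff, diff, dt) / h)
    return total / len(curves)

def grad_p_hat(x, curves, h, dt):
    """Sample analogue of (3.5): (2 / n) * sum_i K_h'(||X_i - x||^2) (x - X_i)."""
    g = [0.0] * len(x)
    for c in curves:
        diff = [a - b for a, b in zip(x, c)]
        kprime = -math.exp(-inner(diff, diff, dt) / h) / h
        for j, d in enumerate(diff):
            g[j] += 2.0 * kprime * d / len(curves)
    return g

m, h = 51, 0.5
dt = 1.0 / (m - 1)
grid = [j * dt for j in range(m)]
curves = [[a * math.sin(math.pi * t) for t in grid] for a in (0.8, 1.0, 1.3)]
x = [t * (1.0 - t) for t in grid]          # evaluation point
z = [math.cos(3.0 * t) for t in grid]      # direction of perturbation

eps = 1e-5
x_plus = [a + eps * b for a, b in zip(x, z)]
x_minus = [a - eps * b for a, b in zip(x, z)]
finite_diff = (p_hat(x_plus, curves, h, dt) - p_hat(x_minus, curves, h, dt)) / (2.0 * eps)
analytic = inner(grad_p_hat(x, curves, h, dt), z, dt)
```

The two quantities `finite_diff` and `analytic` should agree up to the discretization error of the finite difference, which is the defining property of the Fréchet derivative along the direction $z$.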

We state without proof the following standard Lemma.

###### Lemma 2.

Let be a compactly supported infinitely differentiable function. Suppose is such that for any such , where is a bounded linear operator. Then the weak first derivative of exists, , and . Moreover, for any and therefore for any .

The following Proposition shows that the gradient of $p_h$ is an element of $H^1_0$. Intuitively, this means that if the starting point of the initial value problem of equation (1.2) is in $H^1_0$ (and a solution exists for that starting point), then we should expect that the path $\pi_x$ only visits elements of $H^1_0$, i.e. the gradient flow associated to $p_h$ is an $H^1_0$ flow.

###### Proposition 2.

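For any $x \in H^1_0$, the gradient of $p_h$ at $x$, $Dp_h(x)$, is an element of $H^1_0$ such that for any $y \in L^2$,

$$\langle Dp_h(x)', y\rangle_{L^2} = E_P\left[2K_h'(\|X-x\|^2_{L^2})\langle x' - X', y\rangle_{L^2}\right]. \tag{3.7}$$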

Proposition 2 also implies that the equation $\pi_x'(t) = Dp_h(\pi_x(t))$ is meaningful when restricted to $H^1_0$. The next Lemma, Lemma 3, establishes that $Dp_h$ is locally Lipschitz under the $H^1_0$ norm. The subsequent Lemma, Lemma 4, guarantees that a solution of the problem (if it exists) is necessarily bounded. These two Lemmas allow us to claim that if the starting point is an element of $H^1_0$, then the initial value problem of equation (1.2) has a unique solution in $H^1_0$. This claim is summarized in Proposition 3.

###### Lemma 3.

Under (H1), the gradient of $p_h$ corresponds to a locally Lipschitz map in $H^1_0$.

###### Lemma 4.

The following two results hold under (H1) and (H2)

• Suppose that . If , then
.

• Let . If a.s., then as soon as .

The intuitive interpretation of Lemma 4 is that a trajectory $\pi_x$ that is a solution to the initial value problem of equation (1.2) cannot wander too far from the origin in $H^1_0$. In fact, if the $H^1_0$ norm of $\pi_x(t)$ increases too much, then the path is eventually pushed back into a closed and bounded ball (with a different radius if one makes the stronger assumption that the probability law of the random curves is completely concentrated on a bounded ball). This “push-back” effect is captured by the conclusions of Lemma 4. By combining Lemma 3 and Lemma 4, we obtain the following

###### Proposition 3.

Under assumptions (H1), (H2), and (H3), the initial value problem $\pi_x'(t) = Dp_h(\pi_x(t))$ with $\pi_x(0) = x \in H^1_0$ has a unique solution in $H^1_0$ with respect to the $H^1_0$ topology. Moreover, if , then for all , where .

###### Remark 4.

Proposition 3 establishes the existence and the uniqueness of a solution to the initial value problem of equation (1.2) in the $H^1_0$ topology. The initial value problem can be solved uniquely in the $L^2$ topology as well. In fact, it is easily verified that, because $D^2p_h$ is bounded, then the first derivative of $p_h$, $Dp_h$, is uniformly Lipschitz with respect to the $L^2$ norm. Thus, one only has to show that the flow of Proposition 3 solved in the $H^1_0$ topology corresponds to the $L^2$ gradient flow associated to $p_h$. To verify this, one needs to check that the $H^1_0$ solution also satisfies the initial value problem of equation (1.2) under the $L^2$ norm. Specifically, consider the $H^1_0$ solution $\pi_x$ of Proposition 3 with any $x \in H^1_0$. The path $\pi_x$ is continuously differentiable as a map from $[0, \infty)$ to $H^1_0$. It suffices to check that $\pi_x$ is continuously differentiable as a map from $[0, \infty)$ to $L^2$ as well. This is easily established using the Poincaré inequality since

$$\begin{aligned} &\|\pi_x(t+\delta) - \pi_x(t) - \delta\, Dp_h(\pi_x(t))\|_{L^2} \\ &\qquad \leq C_p \|\pi_x(t+\delta) - \pi_x(t) - \delta\, Dp_h(\pi_x(t))\|_{H^1_0} = o(\delta), \end{aligned} \tag{3.8}$$

where the Poincaré constant $C_p$ is for the pair $(H^1_0, L^2)$. It is clear from equation (3.8) and the definition of Fréchet derivative that the $H^1_0$ solution also satisfies the initial value problem of equation (1.2) under the $L^2$ norm. Thus, $\pi_x$ is the unique $L^2$ solution of the initial value problem of equation (1.2).

The following Theorem, based on Proposition 3, guarantees the convergence of $\pi_x(t)$ to a critical point of $p_h$ as $t \to \infty$. The statement about the convergence strongly relies on the compact embedding of $H^1_0$ in $L^2$, the boundedness of the first two derivatives of $p_h$, and assumption (H4).

###### Theorem 1.

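Assume (H1), (H2), (H3), and (H4) hold. Let $\pi_x$ be the solution of the initial value problem of equation (1.2) with $\pi_x(0) = x \in H^1_0$. Let $M$ be such that $\|\pi_x(t)\|_{H^1_0} \leq M$ for all $t \geq 0$. Then there exists a unique $x^* \in H^1_0$ such that $\|x^*\|_{H^1_0} \leq M$, $Dp_h(x^*) = 0$, and $\lim_{t\to\infty} \|\pi_x(t) - x^*\|_{L^2} = 0$.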

The results above show that the gradient flow on $p_h$ is well-defined and its trajectories converge to critical points of $p_h$ that are in $H^1_0$ whenever the starting point is an element of $H^1_0$. We conclude this section with the following Lemma which states that all the non-trivial critical points of $p_h$ belong to $H^1_0$: thus, even though the functional $p_h$ “spreads” the probability law of the random curves outside of its support (in fact, it is easily seen that there exist points $x$ that are not in $H^1_0$ with $p_h(x) > 0$), all of its non-trivial critical points still lie in the support of $P$.

###### Lemma 5.

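Assume (H1), (H2), and (H3) hold. Let $x^*$ be a critical point of $p_h$ such that $p_h(x^*) > 0$ (i.e. $x^*$ is a nontrivial critical point of $p_h$). Then $x^* \in H^1_0$. Furthermore, if $\|X\|_{H^1_0} \leq M$ $P$-almost surely for some constant $M$, then all the nontrivial critical points of $p_h$ are contained in the ball $B_{H^1_0}(0, M)$.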

Note that the stronger assumption that $\|X\|_{H^1_0} \leq M$ $P$-almost surely for some constant $M$ is a functional analogue of the boundedness assumption which is frequently made with finite-dimensional data.

## 4 The finite-dimensional case

In this section, we assume that the distribution of the random function $X$ is supported on some compact subset of a finite dimensional vector space. In other words,

$$P(X \in S_c) = 1, \tag{4.1}$$

where $S_c$ is a compact subset of a finite-dimensional subspace $S \subset H^1_0$. Two insightful outcomes are discussed in detail.

• Under some mild extra assumptions on the finite dimensional distribution of $X$, it is shown in Lemma 7 that $p_h$, as a functional from $L^2$ to $\mathbb{R}$, is a Morse functional. This provides an important sufficient condition under which (H4) holds.

• If the functional random variable $X$ admits a finite-dimensional distribution on $S$, it is natural to ask whether the gradient flow on $p_h$ corresponds to the finite-dimensional gradient flow associated to the expectation of a kernel density estimator of the density of $X$ on $S$. This section provides a positive answer to this question. Furthermore, we show that such finite-dimensional gradient flow is entirely contained in $S$.

Suppose that the probability law $P$ of the functional random variable $X$ is supported on a compact subset $S_c$ of a finite-dimensional space $S$. If this is the case, there exists $h_0 > 0$ such that if $h \leq h_0$, then $p_h$ is a Morse function on the interior of $S_c$ (see Remark 1 and Proposition 1). Moreover, as implied by Lemma 6 and Lemma 7 of this section, the trajectories of the gradient flow associated to $p_h$ are all contained in $S$ and they end at critical points of $p_h$ that belong to $S_c$. It is natural to ask whether the gradient flow on $p_h$ corresponds to the finite-dimensional gradient flow associated to some pseudo-density on $S$. This section answers this question and shows that, if $X$ admits a density function $p$ (when $X$ is viewed as a finite-dimensional random vector in $S$), then the gradient flow associated to $p_h$ corresponds to the gradient flow associated to the expectation of a kernel density estimator of $p$ with bandwidth $h$.

Let $S = \mathrm{span}\{f_1, \dots, f_d\}$ be a linear subspace of $H^1_0$. Without loss of generality, assume that the $f_i$’s form an orthonormal basis of $S$ equipped with the $L^2$ norm and that $X \in S$ almost surely. Then, $X$ admits the decomposition $X = \sum_{i=1}^d a_i f_i$ for some random coefficients $a_1, \dots, a_d$. Let $\tilde{X} = (a_1, \dots, a_d)$ and suppose that the distribution of $\tilde{X}$ has density $p$ with respect to the Lebesgue measure. We have

###### Lemma 6.

Assume (H1) and (H2) hold and $P(X \in S) = 1$. If $\pi_x(0) = x \in S$, then $\pi_x(t) \in S$ for any $t \geq 0$. Furthermore, all the non-trivial critical points of $p_h$ belong to $S$.

For the rest of this section, let us replace assumption (H4) with

• (H4’) $X$ is an element of $S_c$ with probability 1, admits density $p$ on $S_c$, and $p$ satisfies the assumptions of Proposition 1.

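Consider $x = \sum_{i=1}^d x_i f_i \in S$. Let $\tilde{x} = (x_1, \dots, x_d)$. Define $\tilde{p}_h(\tilde{x})$ to be $E_P K_h(\|\tilde{X} - \tilde{x}\|_2^2)$, where $\|\cdot\|_2$ denotes the standard Euclidean norm. Note that $\tilde{p}_h(\tilde{x})$ is the expectation of a standard finite dimensional kernel density estimator at $\tilde{x}$. Since $\|X - x\|_{L^2} = \|\tilde{X} - \tilde{x}\|_2$, it is clear that $p_h(x) = \tilde{p}_h(\tilde{x})$. To see the connection between the functional gradient $Dp_h(x)$ and $\nabla\tilde{p}_h(\tilde{x})$, the gradient of $\tilde{p}_h$ at $\tilde{x}$, note that the quantity

$$\begin{aligned} \langle Dp_h(x), f_i\rangle_{L^2} &= \langle E_P 2K_h'(\|X-x\|^2_{L^2})(x - X), f_i\rangle_{L^2} \\ &= E_P 2K_h'(\|X-x\|^2_{L^2})\langle x - X, f_i\rangle_{L^2} \\ &= 2E_P K_h'(\|\tilde{X} - \tilde{x}\|_2^2)(x_i - a_i) \end{aligned} \tag{4.2}$$

agrees with the $i$-th component of the gradient of $\tilde{p}_h$ at $\tilde{x}$. This equivalence implies that the gradient flows (with starting points in the subspace $S$) on $p_h$ and $\tilde{p}_h$ coincide (note that rescaling $\tilde{p}_h$ by a positive constant does not affect the trajectories of the associated gradient flow). Furthermore, there exists an $h_0$ depending on $p$ such that $\tilde{p}_h$ is a Morse function for $h \leq h_0$ (see Remark 1). Therefore, all the non-trivial critical points of $\tilde{p}_h$ are separated in $\mathbb{R}^d$. In light of Lemma 6, all the non-trivial critical points of $p_h$ are thus separated in $L^2$ (and in $H^1_0$).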
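
The identity between the $L^2$ distance of two curves in the subspace and the Euclidean distance of their coefficient vectors, which drives the argument above, can be checked numerically. The sketch below is only an illustration: the sine basis $\sqrt{2}\sin(k\pi t)$, the grid, the coefficient values, and the helper names `basis`, `combine` and `l2_sq` are assumptions chosen for this example.

```python
import math

def basis(k, grid):
    """k-th sine function sqrt(2) sin(k pi t); it vanishes at t = 0."""
    return [math.sqrt(2.0) * math.sin(k * math.pi * t) for t in grid]

def combine(coefs, fs):
    """Curve sum_k coefs[k] * fs[k] evaluated on the grid."""
    return [sum(c * f[j] for c, f in zip(coefs, fs)) for j in range(len(fs[0]))]

def l2_sq(x, y, dt):
    """Riemann-sum approximation of the squared L2([0,1]) distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) * dt

m = 201
dt = 1.0 / (m - 1)
grid = [j * dt for j in range(m)]
fs = [basis(k, grid) for k in (1, 2, 3)]   # assumed orthonormal basis of the subspace

a_tilde = [0.5, -1.0, 0.25]   # coefficients of a realization of X
x_tilde = [0.1, 0.3, -0.7]    # coefficients of the evaluation point x

functional = l2_sq(combine(a_tilde, fs), combine(x_tilde, fs), dt)
euclidean = sum((a - b) ** 2 for a, b in zip(a_tilde, x_tilde))
```

Because the two squared distances coincide, any kernel evaluated at them returns the same value, which is why the functional pseudo-density and its finite-dimensional counterpart agree on the subspace.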

Next, we have the following Lemma which guarantees that if $p$ is a Morse density on $S_c$, then the non-trivial critical points of $p_h$ are non-degenerate for $h$ sufficiently small and they all belong to $S_c$ (a critical point $x^*$ of $p_h$ is non-degenerate if $D^2p_h(x^*)$ is an isomorphism from $L^2$ to $L^2$).

###### Lemma 7.

Under assumptions (H1), (H2), and (H4’), all the non-trivial critical points of $p_h$ lie in $S_c$ and are non-degenerate for $h$ sufficiently small. Thus, for sufficiently small $h$, (H4) holds.

In the finite-dimensional case considered in this section, we can say more about the behavior of the gradient flow on $p_h$. In particular, we can characterize the solutions to the initial value problem of equation (1.2) also for the case in which the starting point does not belong to the support of $P$ (which is, in this case, $S$). In fact, let $x$ be an element of $H^1_0$ which does not belong to $S$. The Gram-Schmidt orthogonalization process guarantees that there exists $f_{d+1} \in H^1_0$ such that $x \in \mathrm{span}\{f_1, \dots, f_{d+1}\}$, where $f_{d+1}$ is orthogonal to $S$ and $\|f_{d+1}\|_{L^2} = 1$. The following Lemma guarantees that the gradient ascent path originating from $x$ is entirely contained in $\mathrm{span}\{f_1, \dots, f_{d+1}\}$. Its proof is identical to that of Lemma 6.

###### Lemma 8.

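Assume (H1) and (H2) hold and that $P(X \in S) = 1$. Suppose $\pi_x(0) = x \in \mathrm{span}\{f_1, \dots, f_{d+1}\}$. Then $\pi_x(t) \in \mathrm{span}\{f_1, \dots, f_{d+1}\}$ for all $t \geq 0$.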

###### Remark 5.

In the finite-dimensional setting of this section (in particular under assumption (H4’)), and for $h$ sufficiently small, the basin of attraction of a saddle point of $p_h$ is negligible: in fact, from the above discussion, it is clear that if the random function $X$ is valued in a compact subset $S_c$ of a finite-dimensional linear subspace of $H^1_0$ and has a proper Morse density on $S_c$, then the basin of attraction of any saddle point of $p_h$ is negligible for $h$ sufficiently small (since $\tilde{p}_h$ is Morse on $\mathbb{R}^d$ for $h$ small enough). Stated more precisely, for $h$ sufficiently small, if $x^*$ is a saddle point of $p_h$ then $P\left(\{x : \lim_{t\to\infty}\|\pi_x(t) - x^*\|_{L^2} = 0\}\right) = 0$.

## 5 Statistical relevance of the estimated local modes

The empirical counterpart of $p_h$ is the functional $\hat{p}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(\|X_i - x\|^2_{L^2})$, where $X_1, \dots, X_n$ are an i.i.d. sample from the probability law $P$. The critical points of $\hat{p}_h$ can be found, for example, by using a functional version of the mean-shift algorithm (see Ciollaro et al., 2014). In this section, we provide a statistical algorithm to detect whether a critical point of $\hat{p}_h$ corresponds to a local maximum of $p_h$. This algorithm provides two insights for functional mode clustering.

• For finite-dimensional clustering problems, if the underlying density $p$ is a Morse function, then the basin of attraction of a saddle point of $p$ has null probability content as it corresponds to a manifold of lower dimension. In functional data clustering, however, the structure of the functional space is more complicated in the sense that there is no guarantee that the probability content of the basin of attraction of a saddle point of $p_h$ is negligible, even if $p_h$ is a Morse functional. However, in analogy with the finite-dimensional case, clusters associated to non-degenerate local modes should generally be considered more informative as opposed to clusters associated with saddle points.

• Several results in the previous section are derived under assumption (H4), which essentially states that the critical points of $p_h$ are well-behaved. Without assuming (H4), the algorithm provides a simple way to classify well-behaved local modes of $p_h$ by analyzing $\hat{p}_h$. In this way, informative clusters can still be revealed in a less restrictive setting.

Since the local modes of $\hat{p}_h$ that correspond to non-degenerate local modes of $p_h$ provide the greatest insight about the population clustering, we refer to these local modes as “significant” local modes. In the following, we derive a procedure that allows us to discriminate the significant local modes from the non-significant ones.

Before giving the definition of non-degeneracy for a critical point of a functional defined on a Hilbert space ($L^2$ in our case), it is convenient to adopt the convention that a linear operator from a Hilbert space to itself can be associated to a bilinear form on the Hilbert space and vice versa. For example if $T : L^2 \to L^2$ is a linear operator, then it can be associated to a bilinear form by letting $T(v, w) = \langle T(v), w\rangle_{L^2}$.

###### Definition 1.

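Let $T : L^2 \to L^2$ be a bounded linear operator. $T$ is said to be self-adjoint if $\langle T(v), w\rangle_{L^2} = \langle v, T(w)\rangle_{L^2}$ for all $v, w \in L^2$. $T$ is said to be positive (respectively negative) definite if $\langle T(v), v\rangle_{L^2} > 0$ (respectively $\langle T(v), v\rangle_{L^2} < 0$) for all $v \neq 0$. Furthermore, $T$ is said to be an isomorphism if both $T$ and $T^{-1}$ are bounded.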

###### Definition 2.

Let $f : L^2 \to \mathbb{R}$ be twice continuously differentiable with bounded third derivative. Suppose $x^*$ is a critical point of $f$, i.e. $Df(x^*) = 0$. Then, $x^*$ is said to be a non-degenerate local maximum (respectively minimum) if $D^2f(x^*)$ is a negative (respectively positive) definite isomorphism on $L^2$.

It is a known fact that for any $x \in L^2$, the second derivative of $f$, $D^2f(x)$, is a self-adjoint linear operator. Furthermore, the following Lemma follows as a simple consequence of the fact that the second derivative of $f$ at a non-degenerate local maximum is a self-adjoint negative-definite isomorphism.

###### Lemma 9.

Suppose that $x^*$ is a non-degenerate local maximum of $f$. Then there exists $\delta > 0$ such that

$$\sup_{\|v\|_{L^2} = 1} D^2f(x^*)(v, v) \leq -\delta. \tag{5.1}$$
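
In finite dimensions the supremum in (5.1) is the largest eigenvalue of the Hessian, so the condition reads $\lambda_{\max} \leq -\delta < 0$. The sketch below checks this on a toy function of two variables; the function `f`, the finite-difference Hessian, and the closed-form eigenvalue of a symmetric $2 \times 2$ matrix are illustrative assumptions and are not part of the algorithm of this section.

```python
import math

def f(x, y):
    """Toy smooth function with a single maximum at the origin."""
    return math.exp(-(x * x + 2.0 * y * y))

def hessian_2d(func, x, y, eps=1e-4):
    """Central finite-difference Hessian of a function of two variables."""
    fxx = (func(x + eps, y) - 2.0 * func(x, y) + func(x - eps, y)) / eps ** 2
    fyy = (func(x, y + eps) - 2.0 * func(x, y) + func(x, y - eps)) / eps ** 2
    fxy = (func(x + eps, y + eps) - func(x + eps, y - eps)
           - func(x - eps, y + eps) + func(x - eps, y - eps)) / (4.0 * eps ** 2)
    return fxx, fxy, fyy

def largest_eigenvalue(fxx, fxy, fyy):
    """Largest eigenvalue of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    return 0.5 * (fxx + fyy) + math.sqrt(0.25 * (fxx - fyy) ** 2 + fxy ** 2)

lam_max = largest_eigenvalue(*hessian_2d(f, 0.0, 0.0))
is_nondegenerate_max = lam_max < 0.0
```

For this toy function the exact Hessian at the origin is diagonal with entries $-2$ and $-4$, so the bound of (5.1) holds with $\delta = 2$.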

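Let now $f_1, f_2 : L^2 \to \mathbb{R}$ be twice continuously differentiable with bounded third derivative. Consider the following abstract setting for $f_1$ and $f_2$.

• The non-trivial critical points of $f_1$ and $f_2$ are all in $H^1_0$.

• For $i = 1, 2$, if $x \in H^1_0$ then $Df_i(x) \in H^1_0$. Moreover,

$$\begin{cases} \pi_i'(t) = Df_i(\pi_i(t)) \\ \pi_i(0) \in H^1_0, \end{cases} \tag{5.2}$$

have solutions whose trajectories admit a convergent subsequence in $L^2$.

• For each derivative order $\ell$, let $\eta_\ell$ denote

$$\eta_\ell = \sup_{x \in B_{H^1_0}(0, M)} \|D^\ell f_1(x) - D^\ell f_2(x)\|, \tag{5.3}$$

where $\|\cdot\|$ stands for the appropriate norms. Also, for $i = 1, 2$ and each derivative order $k$, let

$$\beta_k = \sup_{x \in L^2} \|D^k f_i(x)\| < \infty. \tag{5.4}$$
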
###### Remark 6.

Of course, the results that we obtain here are most useful for the particular case where

$$f_1(x) = p_h(x) = E_P K_h(\|X-x\|^2_{L^2}) \tag{5.5}$$