Sampled forms of functional PCA in reproducing kernel Hilbert spaces

Arash A. Amini
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, California 94720, USA
amini@eecs.berkeley.edu

Martin J. Wainwright
Department of Statistics
University of California, Berkeley
Berkeley, California 94720, USA
wainwrig@stat.berkeley.edu

Received September 2011; revised April 2012.
Abstract

We consider the sampling problem for functional PCA (fPCA), where the simplest example is the case of taking time samples of the underlying functional components. More generally, we model the sampling operation as a continuous linear map from $\mathcal{H}$ to $\mathbb{R}^m$, where the functional components to be estimated lie in a Hilbert subspace $\mathcal{H}$ of $L^2$, such as a reproducing kernel Hilbert space of smooth functions. This model includes time and frequency sampling as special cases. In contrast to the classical approach to fPCA, in which access to the entire functions is assumed, having a limited number of functional samples places limitations on the performance of statistical procedures. We study these effects by analyzing the rate of convergence of an $M$-estimator for the subspace spanned by the leading components in a multi-spiked covariance model. The estimator takes the form of regularized PCA, and hence is computationally attractive. We analyze the behavior of this estimator within a nonasymptotic framework, and provide bounds that hold with high probability as a function of the number of statistical samples $n$ and the number of functional samples $m$. We also derive lower bounds showing that the rates obtained are minimax optimal.

doi:10.1214/12-AOS1033. Volume 40, Issue 5 (2012), pages 2483-2510.

Running title: Sampled form of functional PCA in RKHS.

Supported in part by NSF Grants CCF-0545862 and DMS-09-07632.

AMS subject classifications: Primary 62G05; secondary 62H25, 62H12, 41A35, 41A25.
Keywords: functional principal component analysis, time sampling, Fourier truncation, linear sampling operator, reproducing kernel Hilbert space.

1 Introduction

The statistical analysis of functional data, commonly referred to as functional data analysis (FDA), is an established area of statistics with a great number of practical applications; see the books Ramsay2005, Ramsay2002 and references therein for various examples. When the data are available as finely sampled curves, say in time, it is common to treat them as a collection of continuous-time curves or functions, each observed in its totality. Such datasets are then termed "functional," and various statistical procedures applicable in finite dimensions can be extended to this functional setting. Among such procedures is principal component analysis (PCA), which is the focus of the present work.

If one thinks of continuity as a mathematical abstraction of reality, then treating functional data as continuous curves is arguably a valid modeling device. However, in practice, one is faced with finite computational resources and is forced to implement a (finite-dimensional) approximation of true functional procedures by some sort of truncation procedure, for instance, in the frequency domain. It is then important to understand the effects of this truncation on the statistical performance of the procedure. In other situations, such as in longitudinal data analysis Diggle2002 (), a continuous curve model is justified as a hidden underlying generating process to which one has access only through sparsely sampled measurements in time, possibly corrupted by noise. Studying how the time-sampling affects the estimation of the underlying functions in the presence of noise shares various common elements with the frequency-domain problem described above.

The aim of this paper is to study the effects of "sampling" (in a fairly general sense) on functional principal component analysis in smooth function spaces. In order to do so, we adopt a functional-theoretic approach by treating the sampling procedure as a (continuous) linear operator. This set-up provides us with a notion of sampling general enough to treat both frequency truncation and time sampling within a unified framework. We take as our smooth function space a Hilbert subspace $\mathcal{H}$ of $L^2[0,1]$, and denote the sampling operator by $\Phi_m : \mathcal{H} \to \mathbb{R}^m$. We assume that there are functions $f_1, \ldots, f_n$ in $\mathcal{H}$, generated i.i.d. from a probabilistic model (to be discussed). We then observe the collection $\{\Phi_m f_i\}_{i=1}^n$ in noise. We refer to the index $n$ as the number of statistical samples, and to the index $m$ as the number of functional samples.

We analyze a natural $M$-estimator which takes the form of a regularized PCA in $\mathbb{R}^m$, and provide nonasymptotic bounds on the estimation error in terms of $n$ and $m$. The rates are governed by the eigen-decay of two operators: the product of the sampling operator and its adjoint, and the product of the map embedding $\mathcal{H}$ in $L^2[0,1]$ and its adjoint. Our focus is on the setting where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), in which case the two eigen-decays are intimately related through the kernel function. In such cases, the two components of the rate interact and give rise to optimal values for the number of functional samples ($m$) in terms of the number of statistical samples ($n$), or vice versa. This has practical appeal in cases where obtaining either type of sample is costly.

Our model for the functions is an extension to function spaces of the spiked covariance model introduced by Johnstone and his collaborators John01, Johnstone2004, and studied by various authors (e.g., Johnstone2004, Paul2008, AmiWai2009). We consider such models with $r$ components, each lying within the Hilbert ball of radius $\rho$, with the goal of recovering the $r$-dimensional subspace spanned by the spiked components in this functional model. We analyze our $M$-estimators within a high-dimensional framework that allows both the number of statistical samples $n$ and the number of functional samples $m$ to diverge together. Our main theoretical contributions are to derive nonasymptotic bounds on the estimation error as a function of the pair $(n, m)$, which are shown to be sharp (minimax-optimal). Although our rates also explicitly track the number of components $r$ and the smoothness parameter, we do not make any effort to obtain optimal dependence on these parameters.

The general asymptotic properties of PCA in function spaces have been investigated by various authors (e.g., Dauxois1982, Bosq2000, Hall2006a). Accounting for the smoothness of functions by introducing various roughness/smoothness penalties is a standard approach, used in the papers Rice1991, Pezzulli1993, Silverman1996, Boente2000, among others. The problem of principal component analysis for sampled functions, with a similar functional-theoretic perspective, is discussed by Besse and Ramsay Besse1986 for the noiseless case. A more recent line of work is devoted to the case of functional PCA with noisy sampled functions Cardot2000, Yao2005, Hall2006. Cardot Cardot2000 considers estimation via spline-based approximation, and derives MISE rates in terms of various parameters of the model. Hall et al. Hall2006 study estimation via local linear smoothing, and establish minimax-optimality in certain settings that involve a fixed number of functional samples. Both papers Cardot2000, Hall2006 demonstrate trade-offs between the numbers of statistical and functional samples; we refer the reader to Hall et al. Hall2006 for an illuminating discussion of connections between FDA and longitudinal data analysis (LDA) approaches (i.e., having full versus sampled functions), which inspired much of the present work. We note that the regularization present in our $M$-estimator is closely related to classical roughness penalties Rice1991, Silverman1996 in the special case of spline kernels, although the discussion there applies to fully observed functions, as opposed to the sampled models considered here.

After the initial posting of this work, we became aware of more recent work on sampled functional PCA. Working within the framework of Hall et al. Hall2006, the analysis of Li and Hsing LiHsi10 allows for more flexible sample sizes per curve; they derive optimal uniform (i.e., $L^\infty$) rates of convergence for local linear smoothing estimators of the covariance function and the resulting eigenfunctions. Another line of work HuaETA08, QiZha11 has analyzed sampled forms of Silverman's criterion Silverman1996, with some variations. Huang et al. HuaETA08 derive a criterion based on rank-one approximation coupled with scale-invariance considerations, combined with an extra weighting of the covariance matrix. Qi and Zhao QiZha11 also show the consistency of their estimator for both regular and irregular sampling. The regular (time) sampling setup in both papers overlaps with our work; the eigenfunctions are assumed to lie in a second-order Sobolev space, corresponding to a special case of an RKHS. However, even in this particular case, our estimator is different, and it is an interesting question whether a version of the results presented here can be used to show the minimax optimality of these Silverman-type criteria. There has also been recent work with emphasis on sampled functional covariance estimation, including the work of Cai and Yuan CaiYua10, who analyze an estimator which can be described as regularized least squares, with the penalty being the norm of the tensor product of the RKHS with itself. They provide rates of convergence for the covariance function, from which certain rates (argued to be optimal within logarithmic factors) for eigenfunctions follow.

As mentioned above, our sampled model closely resembles the spiked covariance model for high-dimensional principal component analysis. A line of work on this model has treated various types of sparsity conditions on the leading eigenvectors Johnstone2004, Paul2008, AmiWai2009; in contrast, here the smoothness condition on the functional components translates into an ellipsoid condition on the vector principal components. Perhaps an even more significant difference is that in this paper, the effective scaling of the noise is substantially smaller in some cases (e.g., the case of time sampling). This difference could explain why the difficulty of the "high-dimensional" setting is not observed in such cases as one lets $m \to \infty$. On the other hand, a difficulty particular to our sampled model is the lack of orthonormality between components after sampling. It not only leads to identifiability issues, but also makes recovering individual components difficult.

In order to derive nonasymptotic bounds on our $M$-estimator, we exploit various techniques from empirical process theory (e.g., Geer2009), as well as concentration of measure (e.g., Ledoux2001). We also exploit recent work Mendelson2002 on the localized Rademacher complexities of unit balls in a reproducing kernel Hilbert space, as well as techniques from nonasymptotic random matrix theory, as discussed in Davidson and Szarek Davidson2001, in order to control various norms of random matrices. These techniques allow us to obtain finite-sample bounds that hold with high probability, and are specified explicitly in terms of the pair $(n, m)$ and the underlying smoothness of the Hilbert space.

The remainder of this paper is organized as follows. Section 2 is devoted to background material on reproducing kernel Hilbert spaces and adjoints of operators, as well as the class of sampled functional models that we study in this paper. In Section 3, we describe $M$-estimators for sampled functional PCA, and discuss various implementation details. Section 4 is devoted to the statements of our main results, and a discussion of their consequences for particular sampling models. In subsequent sections, we provide the proofs of our results, with some more technical aspects deferred to the supplementary material supp. Section 5 is devoted to bounds on the subspace-based error. We conclude with a discussion in Section 6. In the supplementary material supp, Section 7 is devoted to proofs of bounds on the error in the function space, whereas Section 8 provides proofs of matching lower bounds on the minimax error, showing that our analysis is sharp.

Notation. We use $\|\cdot\|_{\mathrm{HS}}$ to denote the Hilbert-Schmidt norm of an operator or a matrix, and $\langle\langle \cdot, \cdot \rangle\rangle$ to denote the corresponding inner product. If $T$ is an operator on a Hilbert space with an orthonormal basis $\{e_k\}$, then $\|T\|_{\mathrm{HS}}^2 = \sum_k \|T e_k\|^2$. For a matrix $A$, we have $\|A\|_{\mathrm{HS}}^2 = \sum_{i,j} A_{ij}^2$, that is, the Frobenius norm. For a linear operator $T$, the adjoint is denoted as $T^*$, the range as $\operatorname{Ra}(T)$ and the kernel (nullspace) as $\operatorname{Ker}(T)$.

2 Background and problem set-up

In this section, we begin by introducing background on reproducing kernel Hilbert spaces, as well as linear operators and their adjoints. We then introduce the functional and observation model that we study in this paper, and conclude with discussion of some approximation-theoretic issues that play an important role in parts of our analysis.

2.1 Reproducing kernel Hilbert spaces

We begin with a quick overview of some standard properties of reproducing kernel Hilbert spaces; we refer the reader to the books Wahba, Gu02 and references therein for more details. A reproducing kernel Hilbert space (or RKHS for short) is a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{T} \to \mathbb{R}$ that is equipped with a symmetric positive semidefinite function $\mathcal{K}: \mathcal{T} \times \mathcal{T} \to \mathbb{R}$, known as the kernel function. We assume the kernel to be continuous, and the set $\mathcal{T}$ to be compact. For concreteness, we think of $\mathcal{T} = [0,1]$ throughout this paper, but any compact subset of $\mathbb{R}^d$ suffices. For each $t \in \mathcal{T}$, the function $\mathcal{K}(\cdot, t)$ belongs to the Hilbert space $\mathcal{H}$, and it acts as the representer of evaluation, meaning that $\langle f, \mathcal{K}(\cdot, t)\rangle_{\mathcal{H}} = f(t)$ for all $f \in \mathcal{H}$.

The kernel defines an integral operator on $L^2[0,1]$, mapping the function $f$ to the function $g(\cdot) = \int_0^1 \mathcal{K}(\cdot, s) f(s)\, ds$. By the spectral theorem in Hilbert spaces, this operator can be associated with a sequence of eigenfunctions $\phi_k$, $k = 1, 2, \ldots$, orthogonal in $\mathcal{H}$ and orthonormal in $L^2[0,1]$, and a sequence of nonnegative eigenvalues $\mu_1 \geq \mu_2 \geq \cdots \geq 0$. Most useful for this paper is the fact that any function $f \in \mathcal{H}$ has an expansion in terms of these eigenfunctions and eigenvalues, namely

$f = \sum_{k=1}^{\infty} \sqrt{\mu_k}\, a_k \phi_k$    (1)

for some square-summable sequence $(a_k)_{k=1}^\infty$. In terms of this expansion, we have the representations $\|f\|_{L^2}^2 = \sum_k \mu_k a_k^2$ and $\|f\|_{\mathcal{H}}^2 = \sum_k a_k^2$. Many of our results involve the decay rate of these eigenvalues: in particular, for some parameter $\alpha$, we say that the kernel operator has eigenvalues with polynomial-$\alpha$ decay if there is a constant $c > 0$ such that

$\mu_k \leq \dfrac{c}{k^{2\alpha}}$ for all $k = 1, 2, \ldots.$    (2)

Let us consider an example to illustrate.

Example (Sobolev class with smoothness $\alpha = 1$). In the case $\mathcal{T} = [0,1]$ and $\alpha = 1$, we can consider the kernel function $\mathcal{K}(s,t) = \min\{s, t\}$. As discussed in Appendix A of the supplementary material supp, this kernel generates the class of functions

$\mathcal{H} = \bigl\{ f \in L^2[0,1] : f(0) = 0,\ f \text{ absolutely continuous},\ f' \in L^2[0,1] \bigr\}.$

The class $\mathcal{H}$ is an RKHS with inner product $\langle f, g\rangle_{\mathcal H} = \int_0^1 f'(s) g'(s)\, ds$, and the ball $\{f \in \mathcal{H} : \|f\|_{\mathcal H} \leq \rho\}$ corresponds to a Sobolev space with smoothness $\alpha = 1$. The eigen-decomposition of the kernel integral operator is

$\phi_k(t) = \sqrt{2}\, \sin\bigl(\tfrac{(2k-1)\pi t}{2}\bigr)$ and $\mu_k = \bigl(\tfrac{2}{(2k-1)\pi}\bigr)^2$, for $k = 1, 2, \ldots.$    (3)

Consequently, this class has polynomial decay with parameter $\alpha = 1$. We note that there are natural generalizations of this example to $\alpha = 2, 3, \ldots$, corresponding to the Sobolev classes of $\alpha$-times differentiable functions; for example, see the books Wahba, Gu02, BerTho04.
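The polynomial eigen-decay in this example can be checked numerically. The following sketch is a minimal illustration: it assumes the first-order Sobolev kernel $\mathcal{K}(s,t) = \min\{s,t\}$ and its eigen-decomposition as in (3); the grid size and the use of NumPy are illustrative choices, not part of the original text.

```python
import numpy as np

# Numerical sanity check of the polynomial eigen-decay in the example above.
# Assumption: first-order Sobolev kernel K(s, t) = min(s, t) on [0, 1], whose
# integral operator has eigenvalues mu_k = (2 / ((2k - 1) * pi))^2, i.e.,
# polynomial decay with alpha = 1.
m = 500
t = np.arange(1, m + 1) / m                      # uniform grid on (0, 1]
K = np.minimum.outer(t, t)                       # kernel (Gram) matrix
emp = np.sort(np.linalg.eigvalsh(K))[::-1] / m   # empirical approximations of mu_k
k = np.arange(1, 11)
analytic = (2.0 / ((2 * k - 1) * np.pi)) ** 2
print(np.column_stack([emp[:10], analytic]))     # leading eigenvalues agree closely
```

The rescaling by $1/m$ converts eigenvalues of the Gram matrix into approximations of the eigenvalues of the kernel integral operator, so the printed columns should agree to within discretization error.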

In this paper, the operation of generalized sampling is defined in terms of a bounded linear operator $\Phi_m: \mathcal{H} \to \mathbb{R}^m$ on the Hilbert space. Its adjoint is a mapping $\Phi_m^*: \mathbb{R}^m \to \mathcal{H}$, defined by the relation $\langle \Phi_m^* a, f\rangle_{\mathcal H} = \langle a, \Phi_m f\rangle_{\mathbb{R}^m}$ for all $f \in \mathcal{H}$ and $a \in \mathbb{R}^m$. In order to compute a representation of the adjoint, we note that by the Riesz representation theorem, the $\ell$th coordinate of this mapping, namely $f \mapsto (\Phi_m f)_\ell$, can be represented as an inner product $\langle f, \psi_\ell\rangle_{\mathcal H}$, for some element $\psi_\ell \in \mathcal{H}$, and we can write

$\Phi_m f = \bigl(\langle f, \psi_1\rangle_{\mathcal H}, \ldots, \langle f, \psi_m\rangle_{\mathcal H}\bigr).$    (4)

Consequently, we have $\langle a, \Phi_m f\rangle_{\mathbb{R}^m} = \sum_{\ell=1}^m a_\ell \langle f, \psi_\ell\rangle_{\mathcal H}$, so that for any $a \in \mathbb{R}^m$,

$\Phi_m^* a = \sum_{\ell=1}^m a_\ell\, \psi_\ell.$    (5)

This adjoint operator plays an important role in our analysis.

2.2 Functional model and observations

Let $s_1 \geq s_2 \geq \cdots \geq s_r > 0$ be a fixed sequence of positive numbers, and let $g_1^*, \ldots, g_r^*$ be a fixed sequence of functions orthonormal in $L^2[0,1]$. Consider a collection of i.i.d. random functions $f_1, \ldots, f_n$, generated according to the model

$f_i = \sum_{j=1}^r \sqrt{s_j}\, \beta_{ij}\, g_j^*$, for $i = 1, \ldots, n$,    (6)

where the coefficients $\beta_{ij}$ are i.i.d. standard Gaussian variables across all pairs $(i,j)$. This model corresponds to a finite-rank instantiation of functional PCA, in which the goal is to estimate the span of the unknown eigenfunctions $g_1^*, \ldots, g_r^*$. Typically, these eigenfunctions are assumed to satisfy certain smoothness conditions; in this paper, we model such conditions by assuming that the eigenfunctions belong to a reproducing kernel Hilbert space $\mathcal{H}$ embedded within $L^2[0,1]$; more specifically, they lie in some ball in $\mathcal{H}$,

$g_j^* \in \{f \in \mathcal{H} : \|f\|_{\mathcal H} \leq \rho\}$, for $j = 1, \ldots, r$.    (7)

For statistical problems involving estimation of functions, the random functions $f_i$ might only be observed at certain times $t_1, \ldots, t_m$, such as in longitudinal data analysis, or we might collect only projections of each $f_i$ in certain directions, such as in tomographic reconstruction. More concretely, in a time-sampling model, we observe $m$-dimensional vectors of the form

$y_i = \bigl(f_i(t_1), f_i(t_2), \ldots, f_i(t_m)\bigr) + \sigma w_i$, for $i = 1, \ldots, n$,    (8)

where $\{t_1, \ldots, t_m\} \subset [0,1]$ is a fixed collection of design points, and $w_i \in \mathbb{R}^m$ is a noise vector. Another observation model is the basis truncation model, in which we observe the projections of $f_i$ onto the first $m$ basis functions $\phi_1, \ldots, \phi_m$ of the kernel operator, namely

$y_i = \bigl(\langle f_i, \phi_1\rangle_{L^2}, \ldots, \langle f_i, \phi_m\rangle_{L^2}\bigr) + \sigma w_i$, for $i = 1, \ldots, n$,    (10)

where $\langle \cdot, \cdot\rangle_{L^2}$ represents the inner product in $L^2[0,1]$.

In order to model these and other scenarios in a unified manner, we introduce a linear operator $\Phi_m$ that maps any function $f$ in the Hilbert space $\mathcal{H}$ to a vector $\Phi_m(f) \in \mathbb{R}^m$ of samples, and then consider the linear observation model

$y_i = \Phi_m(f_i) + \sigma_m w_i$, for $i = 1, \ldots, n$,    (11)

where $w_i \in \mathbb{R}^m$ is a noise vector and $\sigma_m$ denotes the noise standard deviation. This model (11) can be viewed as a functional analog of the spiked covariance models introduced by Johnstone John01, Johnstone2004 as an analytically convenient model for studying high-dimensional effects in classical PCA.

Both the time-sampling (8) and frequency truncation (10) models can be represented in this way, for appropriate choices of the operator $\Phi_m$. Recall representation (4) of $\Phi_m$ in terms of the functions $\{\psi_\ell\}_{\ell=1}^m$.

  • For the time sampling model (8), we set $\psi_\ell = \frac{1}{\sqrt{m}} \mathcal{K}(\cdot, t_\ell)$, so that by the reproducing property of the kernel, we have $\langle f, \psi_\ell\rangle_{\mathcal H} = \frac{1}{\sqrt m} f(t_\ell)$ for all $f \in \mathcal{H}$ and $\ell = 1, \ldots, m$. With these choices, the operator $\Phi_m$ maps each $f \in \mathcal{H}$ to the $m$-vector of rescaled samples $\frac{1}{\sqrt m}\bigl(f(t_1), \ldots, f(t_m)\bigr)$; see the simulation sketch following this list.

    Defining the rescaled noise $\sigma_m = \sigma/\sqrt{m}$ yields an instantiation of model (11) which is equivalent to time-sampling (8).

  • For the basis truncation model (10), we set $\psi_\ell = \mu_\ell \phi_\ell$, so that the operator $\Phi_m$ maps each function $f \in \mathcal{H}$ to the vector of basis coefficients $\bigl(\langle f, \phi_1\rangle_{L^2}, \ldots, \langle f, \phi_m\rangle_{L^2}\bigr)$. Setting $\sigma_m = \sigma$ then yields another instantiation of model (11), this one equivalent to basis truncation (10).
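To make the sampled observation model concrete, the sketch below simulates the time-sampling instantiation of model (11) described in the first bullet above. All specific values (the sample sizes, the signal strengths $s_j$, the noise level $\sigma$, and the use of the first-order Sobolev eigenfunctions as the orthonormal components $g_j^*$) are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulation of the time-sampling form of model (11).
# All numerical choices below are assumptions made for the example.
n, m, r, sigma = 100, 80, 4, 0.5
s = np.array([4.0, 3.0, 2.0, 1.0])                 # signal strengths s_1 >= ... >= s_r
t = np.arange(1, m + 1) / m                        # uniformly spaced design points
j = np.arange(1, r + 1)[:, None]
G = np.sqrt(2) * np.sin((2 * j - 1) * np.pi * t / 2)   # r x m matrix of g_j^*(t_l)

beta = rng.standard_normal((n, r))                 # i.i.d. N(0, 1) component scores
F = (beta * np.sqrt(s)) @ G                        # n x m matrix of f_i(t_l)
W = rng.standard_normal((n, m))
Y = F / np.sqrt(m) + (sigma / np.sqrt(m)) * W      # y_i = Phi_m(f_i) + sigma_m * w_i
Sigma_hat = Y.T @ Y / n                            # sample covariance used in Section 3
```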

A remark on notation before proceeding: in the remainder of the paper, we use $\Phi$ as shorthand notation for $\Phi_m$, since the index $m$ should be implicitly understood throughout our analysis.

In this paper, we provide and analyze estimators for the $r$-dimensional eigen-subspace spanned by $\{g_1^*, \ldots, g_r^*\}$, in both the sampled domain and in the functional domain. To be more specific, for $j = 1, \ldots, r$, define the vectors $z_j^* = \Phi g_j^* \in \mathbb{R}^m$, and the subspaces

$\mathcal{Z}^* = \operatorname{span}\{z_1^*, \ldots, z_r^*\} \subseteq \mathbb{R}^m$ and $\mathcal{F}^* = \operatorname{span}\{g_1^*, \ldots, g_r^*\} \subseteq \mathcal{H}$,    (12)

and let $\widehat{\mathcal{Z}}$ and $\widehat{\mathcal{F}}$ denote the corresponding estimators. In order to measure the performance of the estimators, we will use projection-based distances between subspaces. In particular, let $\widehat{\Pi}_Z$ and $\Pi_{Z^*}$ be the orthogonal projection operators onto $\widehat{\mathcal{Z}}$ and $\mathcal{Z}^*$, respectively, considered as subspaces of $\mathbb{R}^m$. Similarly, let $\widehat{\Pi}_F$ and $\Pi_{F^*}$ be the orthogonal projection operators onto $\widehat{\mathcal{F}}$ and $\mathcal{F}^*$, respectively, considered as subspaces of $L^2[0,1]$. We are interested in bounding the deviations

$\|\widehat{\Pi}_Z - \Pi_{Z^*}\|_{\mathrm{HS}}$ and $\|\widehat{\Pi}_F - \Pi_{F^*}\|_{\mathrm{HS}}$,    (13)

where $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt norm of an operator (or matrix).

2.3 Approximation-theoretic quantities

One object that plays an important role in our analysis is the $m \times m$ matrix $K$ with entries $K_{\ell k} = \langle \psi_\ell, \psi_k\rangle_{\mathcal H}$. From the form of the adjoint, it can be seen that $K = \Phi\Phi^*$. For future reference, let us compute this matrix for the two special cases of linear operators considered thus far:

  • For the time sampling model (8), we have $\psi_\ell = \frac{1}{\sqrt m}\mathcal{K}(\cdot, t_\ell)$ for all $\ell$, and hence $K_{\ell k} = \frac{1}{m}\mathcal{K}(t_\ell, t_k)$, using the reproducing property of the kernel.

  • For the basis truncation model (10), we have $\psi_\ell = \mu_\ell \phi_\ell$, and hence $K_{\ell k} = \mu_\ell \mu_k \langle \phi_\ell, \phi_k\rangle_{\mathcal H} = \mu_\ell \delta_{\ell k}$. Thus, in this special case, we have $K = \operatorname{diag}(\mu_1, \ldots, \mu_m)$.

In general, the matrix $K$ is a type of Gram matrix, and so is symmetric and positive semidefinite. We assume throughout this paper that the functions $\{\psi_\ell\}_{\ell=1}^m$ are linearly independent in $\mathcal{H}$, which implies that $K$ is strictly positive definite. Consequently, it has a set of eigenvalues which can be ordered as

$\widehat{\mu}_1 \geq \widehat{\mu}_2 \geq \cdots \geq \widehat{\mu}_m > 0.$    (14)

Under this condition, we may use $K$ to define a norm on $\mathbb{R}^m$ via $\|a\|_K^2 = a^T K^{-1} a$. Moreover, we have the following interpolation lemma, which is proved in Appendix B.1 of the supplementary material supp:

Lemma 1

For any $f \in \mathcal{H}$, we have $\|\Phi f\|_K \leq \|f\|_{\mathcal H}$, with equality if and only if $f \in \operatorname{span}\{\psi_1, \ldots, \psi_m\}$. Moreover, for any $a \in \mathbb{R}^m$, the function $\Phi^* K^{-1} a$ has the smallest Hilbert norm of all functions $f$ satisfying $\Phi f = a$, and is the unique function with this property.

This lemma is useful in constructing a function-based estimator, as will be clarified in Section 3.

In our analysis of the functional error , a number of approximation-theoretic quantities play an important role. As a mapping from an infinite-dimensional space to , the operator has a nontrivial nullspace. Given the observation model (11), we receive no information about any component of a function that lies within this nullspace. For this reason, we define the width of the nullspace in the -norm, namely the quantity

(15)

In addition, the observation operator induces a semi-norm on the space , defined by

(16)

It is of interest to assess how well this semi-norm approximates the -norm. Accordingly, we define the quantity

(17)

which measures the worst-case gap between these two (semi)-norms, uniformly over the Hilbert ball of radius one, restricted to the subspace of interest . Given knowledge of the linear operator , the quantity can be computed in a relatively straightforward manner. In particular, recall the definition of the matrix , and let us define a second matrix with entries .

Lemma 2

We have the equivalence

(18)

where denotes the -operator norm.

See Appendix B.2 of the supplementary material supp () for the proof of this claim.

3 $M$-estimator and implementation

With this background in place, we now turn to the description of our $M$-estimator, as well as practical details associated with its implementation.

3.1 $M$-estimator

We begin with some preliminaries on notation, and our representation of subspaces. Recall definition (12) of $\mathcal{Z}^*$ as the $r$-dimensional subspace of $\mathbb{R}^m$ spanned by $\{z_1^*, \ldots, z_r^*\}$, where $z_j^* = \Phi g_j^*$. Our initial goal is to construct an estimate $\widehat{\mathcal{Z}}$, itself an $r$-dimensional subspace, of the unknown subspace $\mathcal{Z}^*$.

We represent subspaces by elements of the Stiefel manifold $\mathbb{V}_r(\mathbb{R}^m)$, which consists of $m \times r$ matrices with orthonormal columns,

$\mathbb{V}_r(\mathbb{R}^m) = \{ U \in \mathbb{R}^{m \times r} : U^T U = I_r \}.$

A given matrix $U \in \mathbb{V}_r(\mathbb{R}^m)$ acts as a representative of the subspace spanned by its columns, denoted by $\operatorname{col}(U)$. For any $r \times r$ orthogonal matrix $Q$, the matrix $UQ$ also belongs to the Stiefel manifold, and since $\operatorname{col}(UQ) = \operatorname{col}(U)$, we may call $UQ$ a version of $U$. We let $\Pi_U = UU^T$ be the orthogonal projection onto $\operatorname{col}(U)$. For two matrices $U_1, U_2 \in \mathbb{V}_r(\mathbb{R}^m)$, we measure the distance between the associated subspaces via $\|\Pi_{U_1} - \Pi_{U_2}\|_{\mathrm{HS}}$, where $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt (or Frobenius) matrix norm.

3.1.1 Subspace-based estimator

With this notation, we now specify an $M$-estimator for the subspace $\mathcal{Z}^*$. Let us begin with some intuition. Given the samples $\{y_i\}_{i=1}^n$, let us define the sample covariance matrix $\widehat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n y_i y_i^T$. Given the observation model (11), a straightforward computation shows that

$\mathbb{E}[\widehat{\Sigma}_n] = \sum_{j=1}^r s_j\, z_j^* (z_j^*)^T + \sigma_m^2 I_m.$    (19)

Thus, as $n$ becomes large, we expect that the top $r$ eigenvectors of $\widehat{\Sigma}_n$ might give a good approximation to $\mathcal{Z}^*$. By the Courant-Fischer variational representation, these eigenvectors can be obtained by maximizing the objective function

$\operatorname{tr}\bigl(U^T \widehat{\Sigma}_n U\bigr)$

over all matrices $U \in \mathbb{V}_r(\mathbb{R}^m)$.

However, this approach fails to take into account the smoothness constraints that the vectors $z_j^* = \Phi g_j^*$ inherit from the smoothness of the eigenfunctions $g_j^*$. Since $\|g_j^*\|_{\mathcal H} \leq \rho$ by assumption, Lemma 1 implies that

$\|z_j^*\|_K = \|\Phi g_j^*\|_K \leq \rho$, for $j = 1, \ldots, r$.

Consequently, if we define the matrix $Z^* = [z_1^* \ \cdots \ z_r^*] \in \mathbb{R}^{m \times r}$, then it must satisfy the trace smoothness condition

$\operatorname{tr}\bigl((Z^*)^T K^{-1} Z^*\bigr) = \sum_{j=1}^r \|z_j^*\|_K^2 \leq r\rho^2.$    (20)

This calculation motivates the constraint used in our estimation procedure.

Based on the preceding intuition, we are led to consider the optimization problem

$\widehat{U} \in \arg\max_{U \in \mathbb{V}_r(\mathbb{R}^m)} \operatorname{tr}\bigl(U^T \widehat{\Sigma}_n U\bigr)$ subject to $\operatorname{tr}\bigl(U^T K^{-1} U\bigr) \leq 2r\rho^2$,    (21)

where we recall that $\widehat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n y_i y_i^T$. Given any optimal solution $\widehat{U}$, we return the subspace $\widehat{\mathcal{Z}} = \operatorname{col}(\widehat{U})$ as our estimate of $\mathcal{Z}^*$. As discussed at more length in Section 3.2, it is straightforward to compute $\widehat{U}$ in polynomial time. The reader might wonder why we have included an additional factor of two in this trace smoothness condition. This slack is actually needed due to the potential infeasibility of the matrix $Z^*$ for problem (21), which arises because the columns of $Z^*$ are not guaranteed to be orthonormal. As shown by our analysis, the additional slack allows us to find a matrix that spans the same subspace as $Z^*$ and is also feasible for problem (21). More formally, we have:

Lemma 3

Under condition (30), there exists a matrix such that

(22)

See Appendix B.3 of the supplementary material supp () for the proof of this claim.

3.1.2 The functional estimate

Having thus obtained an estimate $\{\widehat{z}_1, \ldots, \widehat{z}_r\}$ of $\mathcal{Z}^*$ (here, $\{\widehat{z}_j\}_{j=1}^r$ is any collection of vectors that span $\widehat{\mathcal{Z}}$; as we are ultimately only interested in the resulting functional "subspace," it does not matter which particular collection we choose), we now need to construct an $r$-dimensional subspace of the Hilbert space $\mathcal{H}$ to be used as an estimate of $\mathcal{F}^*$. We do so using the interpolation suggested by Lemma 1. For each $j = 1, \ldots, r$, let us define the function

$\widehat{f}_j = \Phi^* K^{-1} \widehat{z}_j.$    (23)

Since $\Phi\Phi^* = K$ by definition, this construction ensures that $\Phi \widehat{f}_j = \widehat{z}_j$. Moreover, Lemma 1 guarantees that $\widehat{f}_j$ has the minimal Hilbert norm (and hence is smoothest in a certain sense) over all functions that have this property. Finally, since $\Phi$ is assumed to be surjective (equivalently, $K$ is assumed invertible), the map $a \mapsto \Phi^* K^{-1} a$ takes linearly independent vectors to linearly independent functions, and hence preserves dimension. Consequently, the space $\widehat{\mathcal{F}} = \operatorname{span}\{\widehat{f}_1, \ldots, \widehat{f}_r\}$ is an $r$-dimensional subspace of $\mathcal{H}$ that we take as our estimate of $\mathcal{F}^*$.
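For the time-sampling operator, the interpolation step above can be written out explicitly. The sketch below assumes the reconstruction $\widehat{f}_j = \Phi^* K^{-1} \widehat{z}_j$ with representers $\psi_\ell = \mathcal{K}(\cdot, t_\ell)/\sqrt{m}$ and $K_{\ell k} = \mathcal{K}(t_\ell, t_k)/m$, as in the time-sampling bullet of Section 2.2; the function name and arguments are hypothetical.

```python
import numpy as np

def functional_estimate(Z_hat, t_design, t_eval, kernel):
    """Map subspace vectors back to functions (sketch for time sampling).

    Assumes Phi f = (f(t_1), ..., f(t_m)) / sqrt(m), so that the representers are
    psi_l = kernel(t_l, .) / sqrt(m), K = [kernel(t_l, t_k)] / m, and each column
    z of Z_hat is interpolated as f_hat = sum_l (K^{-1} z)_l * psi_l.
    """
    m = len(t_design)
    K = kernel(t_design[:, None], t_design[None, :]) / m
    coef = np.linalg.solve(K, Z_hat)                               # m x r coefficients
    Psi = kernel(t_eval[:, None], t_design[None, :]) / np.sqrt(m)  # n_eval x m
    return Psi @ coef                                              # columns are f_hat_j(t_eval)

# Example usage with the first-order Sobolev kernel min(s, t) and the grid t
# from the simulation sketch in Section 2.2:
# f_hat = functional_estimate(Z_hat, t, np.linspace(0, 1, 200), np.minimum)
```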

3.2 Implementation details

In this section, we consider some practical aspects of implementing the $M$-estimator, and present some simulations to illustrate its qualitative properties. We begin by observing that once the subspace vectors $\{\widehat{z}_j\}_{j=1}^r$ have been computed, it is straightforward to compute the function estimates $\widehat{f}_j = \Phi^* K^{-1}\widehat{z}_j$, as weighted combinations of the representer functions $\psi_1, \ldots, \psi_m$. Accordingly, we focus our attention on solving problem (21).

On the surface, problem (21) might appear nonconvex, due to the Stiefel manifold constraint. However, it can be reformulated as a semidefinite program (SDP), a well-known class of convex programs, as clarified in the following:

Lemma 4

Problem (21) is equivalent to solving the SDP

$\widehat{Z} \in \arg\max_{Z} \operatorname{tr}\bigl(\widehat{\Sigma}_n Z\bigr)$ subject to $0 \preceq Z \preceq I_m$, $\operatorname{tr}(Z) = r$ and $\operatorname{tr}\bigl(K^{-1} Z\bigr) \leq 2r\rho^2$,    (25)

for which there always exists an optimal rank-$r$ solution. Moreover, by Lagrangian duality, for some $\mu \geq 0$, the problem is equivalent to

$\max_{0 \preceq Z \preceq I_m,\ \operatorname{tr}(Z) = r} \operatorname{tr}\bigl((\widehat{\Sigma}_n - \mu K^{-1}) Z\bigr)$,    (26)

which can be solved by an eigen-decomposition of $\widehat{\Sigma}_n - \mu K^{-1}$.

As a consequence, for a given Lagrange multiplier $\mu \geq 0$, the regularized form of the estimator can be computed at the cost of solving an eigenvalue problem. For a given constraint level (determined by the radius $\rho$), the appropriate value of $\mu$ can be found by a path-tracing algorithm, or a simple dyadic splitting approach.
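As an illustration of this eigenvalue computation, the sketch below implements one plausible regularized form. It assumes that the penalty enters as $\mu\,\operatorname{tr}(U^T K^{-1} U)$, consistent with the trace-smoothness constraint discussed above; the exact matrix appearing in (26) is not recoverable verbatim from this text, so the code should be read as a sketch rather than the authors' exact procedure.

```python
import numpy as np

def regularized_pca(Sigma_hat, K, mu, r):
    """Sketch of the regularized subspace estimate (cf. (26)), under the
    assumption that the penalty is mu * tr(U^T K^{-1} U).

    The maximizer of tr(U^T (Sigma_hat - mu * K^{-1}) U) over matrices with
    orthonormal columns is the matrix of top-r eigenvectors of Sigma_hat - mu * K^{-1}.
    """
    A = Sigma_hat - mu * np.linalg.inv(K)
    evals, evecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    return evecs[:, ::-1][:, :r]            # m x r matrix of top-r eigenvectors

# Example usage, continuing the simulation sketch of Section 2.2 (time sampling
# with the first-order Sobolev kernel, so K = [min(t_l, t_k)] / m):
# K = np.minimum.outer(t, t) / m
# U_hat = regularized_pca(Sigma_hat, K, mu=1e-3, r=4)
```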

In practice, where the radius $\rho$ is not known, one could use cross-validation to set a proper value for the Lagrange multiplier $\mu$. A possibly simpler approach is to evaluate the optimal objective value over a grid of $\mu$ and choose a value around which it is least variable. As for the choice of the number of components $r$, a standard approach would be to compute the estimator for different choices, and plot the residual sum of eigenvalues of the sample covariance matrix; a minimal sketch of this heuristic appears below. As in ordinary PCA, an elbow in such a plot indicates a proper trade-off between the number of components to keep and the amount of variation explained.
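The following snippet illustrates the residual-eigenvalue ("scree") heuristic just described, reusing the sample covariance Sigma_hat from the earlier simulation sketch; the display cutoff of 10 components is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Residual-eigenvalue plot for choosing the number of components r.
evals = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
residual = evals.sum() - np.cumsum(evals)        # variation left after keeping r components
plt.plot(np.arange(1, 11), residual[:10], marker="o")
plt.xlabel("number of components r")
plt.ylabel("residual sum of eigenvalues")
plt.show()
```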

Figure 1: Regularized PCA for time sampling in first-order Sobolev RKHS. Top row shows, from left to right, plots of the “true” principal components with signal-to-noise ratios and , respectively. The number of statistical and functional samples are and . Subsequent rows show the corresponding estimators obtained by applying the regularized form (26).

In order to illustrate the estimator, we consider the time sampling model (8), with uniformly spaced samples, in the context of a first-order Sobolev RKHS [with kernel function ]. The parameters of the model are taken to be , , , and . The regularized form (26) of the estimator is applied, and the results are shown in Figure 1. The top row corresponds to the four “true” signals , the leftmost being (i.e., having the highest signal-to-noise ratio) and the rightmost . The subsequent rows show the corresponding estimates , obtained using different values of . The second, third and fourth rows correspond to , and .

One observes that without regularization ($\mu = 0$), the estimates for the two weakest signals are poor. The intermediate value of $\mu$ (third row) is roughly the one which achieves the minimum for the dual problem; one observes that the quality of the estimates of the signals, and in particular of the weakest ones, is considerably improved. The optimal (oracle) value of $\mu$, that is, the one achieving the minimum error between the estimated and true subspaces, yields estimates qualitatively similar to those of the third row and is not shown.

The case shows the effect of over-regularization. It produces very smooth signals, and although it fails to reveal and , it reveals highly accurate versions of and . It is also interesting to note that the smoothest signal, , now occupies the position of the second (estimated) principal component. That is, the regularized PCA sees an effective signal-to-noise ratio which is influenced by smoothness. This suggests a rather practical appeal of the method in revealing smooth signals embedded in noise. One can vary from zero upward, and if some patterns seem to be present for a wide range of (and getting smoother as is increased), one might suspect that they are indeed present in data but masked by noise.

4 Main results

We now turn to the statistical analysis of our estimators, in particular deriving high-probability upper bounds on the error of the subspace-based estimate $\widehat{\mathcal{Z}}$, and the functional estimate $\widehat{\mathcal{F}}$. In both cases, we begin by stating general theorems that apply to arbitrary linear operators $\Phi$ (Theorems 1 and 2, respectively), and then derive a number of corollaries for particular instantiations of the observation operator.

4.1 Subspace-based estimation rates (for $\widehat{\mathcal{Z}}$)

We begin by stating high-probability upper bounds on the error of the subspace-based estimates. Our rates are stated in terms of a function that involves the eigenvalues of the matrix , ordered as . Consider the function given by

(27)

As will be clarified in our proofs, this function provides a measure of the statistical complexity of the function class

We require a few regularity assumptions. Define the quantity

(28)

which measures the departure from orthonormality of the vectors in . A straightforward argument using a polarization identity shows that is upper bounded (up to a constant factor) by the uniform quantity , as defined in equation (17). Recall that the random functions are generated according to the model , where the signal strengths are ordered as , and that denotes the noise standard deviation in the observation model (11).

In terms of these quantities, we require the following assumptions:

(29)
(30)
(31)
(32)
Remarks.

The first part of condition (A1) is to prevent the ratio from going to zero as the pair increases, where the constant is chosen for convenience. Such a lower bound is necessary for consistent estimation of the eigen-subspace corresponding to . The second part of condition (A1), involving the constant , provides a lower bound on the signal-to-noise ratio . Condition (A2) is required to prevent degeneracy among the vectors obtained by mapping the unknown eigenfunctions to the observation space . [In the ideal setting, we would have , but our analysis shows that the upper bound in (A2) is sufficient.] Condition (A3) is required so that the critical tolerance specified below is well-defined; as will be clarified, it is always satisfied for the time-sampling model, and holds for the basis truncation model whenever . Condition (A4) is easily satisfied, since the right-hand side of (32) goes to infinity while we usually take to be fixed. Our results, however, are still valid if grows slowly with and subject to (32).

Theorem 1

Under conditions (A1)–(A3) for a sufficiently small constant , let be the smallest positive number satisfying the inequality

(33)

Then there are universal positive constants such that

(34)

where .

We note that Theorem 1 is a general result, applying to an arbitrary bounded linear operator . However, we can obtain a number of concrete results by making specific choices of this sampling operator, as we explore in the following sections.

4.1.1 Consequences for time-sampling

Let us begin with the time-sampling model (8), in which we observe the noisy samples $y_{i\ell} = f_i(t_\ell) + \sigma w_{i\ell}$ for $\ell = 1, \ldots, m$.

As noted earlier, this set-up can be modeled in our general setting (11) with $\psi_\ell = \mathcal{K}(\cdot, t_\ell)/\sqrt{m}$ and $\sigma_m = \sigma/\sqrt{m}$.

In this case, by the reproducing property of the RKHS, the matrix $K$ has entries of the form $K_{\ell k} = \frac{1}{m}\mathcal{K}(t_\ell, t_k)$. Letting $\widehat{\mu}_1 \geq \cdots \geq \widehat{\mu}_m > 0$ denote its ordered eigenvalues, we say that the kernel matrix has polynomial decay with parameter $\alpha$ if there is a constant $c > 0$ such that $\widehat{\mu}_j \leq c\, j^{-2\alpha}$ for all $j = 1, \ldots, m$. Since the kernel matrix represents a discretized approximation of the kernel integral operator defined by $\mathcal{K}$, this type of polynomial decay is to be expected whenever the kernel operator has polynomial-$\alpha$ decaying eigenvalues. For example, the usual spline kernels that define Sobolev spaces have this type of polynomial decay Gu02. In Appendix A of the supplementary material supp, we verify this property explicitly for the kernel $\mathcal{K}(s,t) = \min\{s,t\}$ that defines the Sobolev class with smoothness $\alpha = 1$.

For any such kernel, we have the following consequence of Theorem 1:

Corollary 1 ((Achievable rates for time-sampling))

Consider the case of a time-sampling operator . In addition to conditions (A1) and (A2), suppose that the kernel matrix has polynomial-decay with parameter . Then we have

(35)

where , and .

Remarks.

(a) Disregarding constant pre-factors not depending on the pair , Corollary 1 guarantees that solving problem (21) returns a subspace estimate such that

with high probability as increase. Depending on the scaling of the number of time samples relative to the number of functional samples , either term in this upper bound can be the smallest (and hence active) one. For instance, it can be verified that whenever , then the first term is smallest, so that we achieve the rate . The appearance of the term is quite natural, as it corresponds to the minimax rate of a nonparametric regression problem with smoothness , based on samples each of variance . Later, in Section 4.3, we provide results guaranteeing that this scaling is minimax optimal under reasonable conditions on the choice of sample points; in particular, see Theorem 3(a).

(b) To be clear, although bound (35) allows for the possibility that the error is of order lower than , we note that the probability with which the guarantee holds includes a term of the order . Consequently, in terms of expected error, we cannot guarantee a rate faster than .

Proof of Corollary 1. We need to bound the critical value defined in the theorem statement (33). Define the function , and note that by construction. Under the assumption of polynomial-$\alpha$ eigendecay, we have

and some algebra then shows that . Disregarding constant factors, an upper bound on the critical can be obtained by solving the equation

Doing so yields the upper bound . Otherwise, we also have the trivial upper bound , which yields the alternative upper bound . Recalling that and combining the pieces yields the claim. Notice that this last (trivial) bound on implies that condition (A3) is always satisfied for the time-sampling model.

4.1.2 Consequences for basis truncation

We now turn to some consequences for the basis truncation model (10).

Corollary 2 ((Achievable rates for basis truncation))

Consider a basis truncation operator in a Hilbert space with polynomial- decay. Under conditions (A1), (A2) and , we have

(36)

where , and .

Proof.

We note that as long as , condition (A3) is satisfied, since . The rest of the proof follows that of Corollary 1, noting that in the last step we have for the basis truncation model.

4.2 Function-based estimation rates (for $\widehat{\mathcal{F}}$)

As mentioned earlier, given the consistency of , the consistency of is closely related to approximation properties of the semi-norm induced by , and in particular how closely it approximates the -norm. These approximation-theoretic properties are captured in part by the nullspace width and defect defined earlier in equations (15) and (17), respectively. In addition to these previously defined quantities, we require bounds on the following global quantity:

(37)

A general upper bound on this quantity is of the form

(38)

In fact, it is not hard to show that such a bound exists with and using the decomposition . However, this bound is not sharp. Instead, one can show that in most cases of interest, the term is of the order of .

There are a variety of conditions that ensure that has this scaling; we refer the reader to the paper AmiWaiApprox () for a general approach. Here we pro