L_{1}-Penalization in Functional Linear Regression

# L1-Penalization in Functional Linear Regression with Subgaussian Design

Vladimir Koltchinskii School of Mathematics, Georgia Institute of Technology  and  Stanislav Minsker Department of Mathematics, Duke University
July 23, 2019
###### Abstract.

We study functional regression with random subgaussian design and real-valued response. The focus is on the problems in which the regression function can be well approximated by a functional linear model with the slope function being “sparse” in the sense that it can be represented as a sum of a small number of well separated “spikes”. This can be viewed as an extension of now classical sparse estimation problems to the case of infinite dictionaries. We study an estimator of the regression function based on penalized empirical risk minimization with quadratic loss and the complexity penalty defined in terms of -norm (a continuous version of LASSO). The main goal is to introduce several important parameters characterizing sparsity in this class of problems and to prove sharp oracle inequalities showing how the -error of the continuous LASSO estimator depends on the underlying sparsity of the problem.

Keywords: Functional regression and Sparse recovery and LASSO and Oracle inequality and Infinite dictionaries.

Mathematics Subject Classification (2000): 62J02, 62G05, 62J07.

V. Koltchinskii was partially supported by NSF grants DMS-1207808, DMS-0906880, CCF-0808863 and CCF-1415498.
S. Minsker was partially supported by grant R01-ES-017436 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH), NSF grants DMS-0650413, CCF-0808847 and DOE contract 113054 G002745.

## 1. Introduction

Let be a random couple defined on a probability space where is a stochastic process with parameter set and is a real valued response variable. In what follows, it will be assumed that the process is subgaussian. Denote

 (1.1) dX(s,t):=√Var(X(s)−X(t)),s,t∈T.

It will be also assumed that the space is totally bounded with respect to pseudometric and, moreover, it satisfies Talagrand’s generic chaining conditions ensuring that there exists a version of the process that is a.s. uniformly bounded and -uniformly continuous. In what follows, we assume that is such a version. Let be a finite measure on the Borel -algebra of the pseudometric space

Consider the following regression model

 Y=f∗(X)+ξ,

where is the regression function and is a random noise with and variance independent of the design variable We will be interested in estimating the regression function under an underlying assumption that can be well approximated by a functional linear model (“oracle model”)

 fλ,a(X)=a+∫TX(t)λ(t)μ(dt),

where is the “slope” function and is the intercept of the model. More precisely, we will focus on the problems in which the oracle models are “sparse” in the sense that the slope function is supported in a relatively small subset of parameter space such that the set of random variables can be well approximated by a linear space of a small dimension. Often, will be a sum of several “spikes” with disjoint and well separated supports. Such models might be useful in a variety of applications, in particular, in image processing where, in many cases, only sparsely located regions of the image are correlated with the response variable. In what follows, denotes the marginal distribution of in the space of all uniformly bounded and uniformly continuous functions on , and denotes the joint distribution of in Let be a sample consisting of i.i.d. copies of defined on The regression function is to be estimated based on the data Our estimation method can be seen as a direct extension of (a version of) LASSO to the infinite-dimensional case. Namely, let be a convex subset of the space such that Consider the following penalized empirical risk minimization problem:

 (1.2) (^λε,^aε):=argmin λ∈D,a∈R[1nn∑j=1(Yj−fλ,a(Xj))2+ε∥λ∥1],

where and is the regularization parameter. The function will be used as an estimator of the regression function

When the parameter set is finite, (1.2) defines a standard LASSO-estimator of the vector of parameters of linear regression model (see [39]). This estimator is among the most popular in high-dimensional statistics and it has been intensively studied in the recent years (e.g., see [9], [40], [23], [6], [24], [4], [25]; see also the book by Bühlmann and van de Geer [8] for further references).

We will be more interested in the case of uncountable infinite parameter sets (functional linear models). In such problems, standard characteristics of finite dictionaries used in the theory of sparse recovery (restricted isometry constants, restricted eigenvalues, etc) are not directly applicable. Our goal will be to develop proper parameters characterizing sparsity in the case of functional models and to prove oracle inequalities for the -error of continuous LASSO-estimator in terms of these sparsity parameters. We concentrate on the case of subgaussian random design (that, of course, includes an important example of Gaussian design processes) since, in this case, we can rely on a number of probabilistic tools from the theory of subgaussian and empirical processes. In particular, we extensively use in the proofs recent generic chaining bounds for empirical processes due to Mendelson [32], [31].

It should be emphasized that there is vast literature on functional regression (see, e.g., [34], [35] and references therein). A commonly used general idea in this literature is to estimate the eigenfunctions of the covariance operator and to project the unknown slope function onto the linear span of the “principal components” corresponding to the largest eigenvalues (see [33], [10] and references therein). Under smoothness assumptions on the slope function, a natural approach to its estimation is to use a regularization penalty (see [14] for construction of estimators based on smoothing splines and [44] for a more general reproducing kernel Hilbert space approach).

The problem studied in our paper is much closer to the theory of sparse estimation in high-dimensional statistics and can be viewed as an extension of this theory to the case of functional models and uncountable dictionaries. Our approach is similar in spirit to [22], [24] where such characteristics as “alignment coefficient” (used below for functional models) were introduced and studied in the case of finite dictionaries, and [26] which extended some of these results to the case of infinite dictionaries. For a review of some modern methods in functional data processing and their connections to various notions of sparsity, we refer the reader to [19]. A recent paper by James, Wang and Zhu [20] is similar to the present work in terms of motivation and approach, however, the theoretical analysis in [20] is performed under the assumptions on the design distribution that might not hold if has smooth trajectories.

It is important to note that in practice we never observe the whole trajectory of but rather its densely sampled version. In this case, the natural choice for is a uniform measure on the sampling grid, whence (1.2) becomes the usual LASSO once again. However, there is often no reasons to assume that Gram matrix of the design satisfies RIP [13] or restricted eigenvalue type conditions [6, 21] in this case. Although LASSO might not perform well as a variable selection procedure in such a framework, we will provide examples showing that prediction power of an estimator can still benefit from the fact that the underlying model is (approximately) sparse. In particular, oracle inequalities with error rates depending on sparsity can be derived from the general results of our paper. Other interesting approaches to theoretical analysis of LASSO with highly correlated design were proposed in [41], [17]. For instance, in [41] (see, in particular, Corollary 4.2) the authors show that in the case of highly correlated design, it is often possible to choose the regularization parameter to be small and achieve reasonable error rates.

It should be also mentioned that in a number of very important applications one has to deal with sparse recovery in infinite dictionaries with random designs that are not subgaussian, or with deterministic designs. For instance, in [12], the authors develop a theory of super-resolution. In this case, the dictionary consists of complex exponentials the design is deterministic and the estimation method is based on minimizing the total variation norm of a signed measure on subject to data dependent constraints. Although the results of our paper do not apply immediately to such problems, it is possible to extend our approach in this direction.

We will introduce several assumptions and definitions used throughout the paper.

###### Definition 1.1.

A closed linear subspace will be called a subgaussian space if there exists a constant such that for all

 Eesη≤eΓs2σ2η,s∈R,

where

It is well known that and that -and -norms are equivalent on (more precisely, they are within a constant ). Also, if is a closed linear subspace of such that are jointly normal centered random variables, then is a subgaussian space with Another example is the closed linear span of independent centered subgaussian random variables such that

 Eesηj≤eΓσ2ηjs2,s∈R,j≥1

for some

 L:={∑j≥1cjηj:∑j≥1σ2ηjc2j<+∞}.

For instance, one can consider a sequence of i.i.d. Rademacher random variables (that is, takes valued and with probability ). In the case of a single random variable its linear span is a subgaussian space if and only if is subgaussian.

In what follows, a subgaussian space and constant will be fixed. All the constants depending only on will be called absolute.

###### Assumption 1.1.

Suppose that

 (1.3) X(t)−EX(t)∈L for all t∈T.

Denote by the closed (in and, as a consequence, also in the -norm) linear span of

This assumption easily implies that the stochastic process is subgaussian, meaning the for all is a subgaussian random variable with parameter

Next, we recall the notion of Talagrand’s generic chaining complexity (see [38] for a comprehensive introduction). Given a pseudo-metric space , let be a nested sequence of partitions such that and . For , let be the unique subset of containing . The generic chaining complexity is defined as

 γ2(T;dX):=inf{Δn}sups∈T∑n≥02n2D(Δn(s))

where stands for the diameter of a set . Let

 γ2(δ):=γ2(T;dX;δ)=inf{Δn}supt∈T∑n≥02n/2(D(Δn(t))∧δ).

If is another metric on such that for all , and , then clearly

 (1.4) γ2(T;dY)≤γ2(δ).

This bound will be often used below. Our main complexity assumptions on the design distribution are the following:

###### Assumption 1.2.

Pseudometric space is such that and, moreover,

 γ2(T;dX;δ)→0 as δ→0.

Under these assumptions, the process has a version that is uniformly bounded and -uniformly continuous a.s. Moreover, (in particular, all the moments of are finite). It what follows, we will denote

 S(T):=S(T,dX)=inft∈T√Var(X(t))+Lγ2(T;dX).

Note that Theorem A.2 implies that there exists a numerical constant such that

 (1.5) Esupt∈T|X(t)−EX(t)|≤S(T).

We will also need the following assumptions on the regression function and the noise

###### Assumption 1.3.

Suppose that and

Since this assumption also implies that Note that if is a family of centered Gaussian random variables and is its closed linear span, then is a subgaussian space and is the orthogonal projection of onto the subspace Thus,

## 2. Approximation error bounds, alignment coefficient and Sobolev norms

Recall that is the joint distribution of and let be the empirical distribution based on the sample The integrals with respect to and are denoted by

 Pg:=Eg(X,Y),Png:=1nn∑i=1g(Xi,Yi).

In what follows, it will be convenient to denote and

 (ℓ∙f)(x,y):=ℓ(y,f(x))=(y−f(x))2.

We also use the notation for the derivative of quadratic loss with respect to Throughout the paper, denotes the bilinear form

 ⟨f,g⟩:=∫Tf(t)g(t)μ(dt).

Let

 Fn(λ,a):=Pn(ℓ∙fλ,a)+ε∥λ∥1,  F(λ,a):=P(ℓ∙fλ,a)+ε∥λ∥1.

Denote also

 ¯Yn:=n−1n∑j=1Yj,  ¯Xn(t):=n−1n∑j=1Xj(t),t∈T.

Note that

 (2.1) ^a(λ):=argmin a∈RFn(λ,a)=¯Yn−⟨λ,¯Xn⟩, a(λ):=argmin a∈RF(λ,a)=EY−⟨λ,EX⟩.

The following penalized empirical risk minimization problem

 (2.2) (^λε,^aε):=argmin λ∈D,a∈RFn(λ;a),

is exactly problem (1.2) written in a more concise form. Note that (2.2) is the empirical version of

 (2.3) (λε,aε):=argmin λ∈D,a∈RF(λ,a).

Due to convexity of the loss, both (2.3) and (2.2) are convex optimization problems. It will be shown (Theorem A.1 in the appendix) that, under certain assumptions, they admit (not necessarily unique) solutions

###### Assumption 2.1.

It is assumed throughout the paper that solutions of (2.3) and of (2.2) exist.

It might be also possible to study the problem under an assumption that and are approximate solutions of the corresponding optimization problems, but we are not pursuing this to avoid further technicalities.

The goal of this section is to determine the parameters responsible for the size of the risk of , where is the (distribution-dependent) solution of the problem (2.3), and to find upper bounds on these parameters in terms of classical Sobolev type norms. Later on, it will be shown that the same parameters affect the error rate of empirical solution .

Recall that is a convex subset that contains zero. It immediately follows from (2.3) that we can take and also that

 (2.4) ∥fλε,aε−f∗∥2L2(Π)≤q(ε):=infλ∈D,a∈R[∥fλ,a−f∗∥2L2(Π)+ε∥λ∥1].

Clearly, is a non-decreasing concave function (concavity follows from the fact that it is an infimum of linear functions). Therefore, is a non-increasing function. Note also that (take in the expression under the infimum) and

 q(ε)≤ε∥λ∗∥1

provided that where (take ). The infimum in the definition of is attained at (a solution of problem (2.3) that is assumed to exist). Then, in addition to the bound (2.3) also implies

 ∥λε∥1≤q(ε)ε.

We will be interested, however, in other bounds on in which the “regularization error” is proportional to rather than to (as it is the case in the bounds for ). To this end, we have to introduce some new characteristics of the oracles

Let be the covariance function of the stochastic process Clearly, under Assumption 1.2, and the covariance operator defined by

 (Kv)(s):=∫Tk(s,t)v(t)μ(dt).

is Hilbert–Schmidt. For , define

 (2.5) ∥u∥K:=sup⟨Kv,v⟩≤1⟨u,v⟩.
###### Remark 2.1.

In the case when is finite, operator is represented by the Gram matrix of a finite dictionary and standard “restricted isometry” and “restricted eigenvalue” type constants are defined in terms of and are involved in oracle inequalities for LASSO and other related estimators.

Note that , where . The set

 H(K):={u∈L2(T): ∥u∥K<∞}

is a reproducing kernel Hilbert space of the covariance kernel

We will need the following description of the subdifferential of the convex function :

 (2.6) ∂∥λ∥1={w:T↦[−1,1]: μ−a.s. w(t)=sign(λ(t)) whenever λ(t)≠0}.

It follows from the general description of the subdifferential of a norm in a Banach space :

 ∂∥x∥={{x∗∈X∗: ∥x∗∥=1, x∗(x)=∥x∥},x≠0,{x∗∈X∗: ∥x∗∥≤1},x=0,

where is the dual space. For details on our specific example, see [18], paragraph 4.5.1.

Note that, in standard examples (such as ), the “canonical” version of subgradient of lacks smoothness and RKHS-norms are often large or infinite for such a choice of It will be seen below that existence of smoother versions of subgradient is important in such cases. Given a measurable , let . For smooth will play a role of support of Given define the cone by

 (2.7) C(b)w:={u∈L1(μ): ∫T∖Tw|u|dμ≤b⟨w,u⟩}.

Note that, for we have Therefore, implies that

 ∫T∖Tw|u|dμ≤b∫Tw|u|dμ.

Roughly, this means that, for functions is a “dominant set”. Let

 (2.8) a(b)(w):=sup{⟨w,u⟩: u∈C(b)w, ∥fu∥L2(Π)=1}.

Such quantities were introduced in the framework of sparse recovery in [22], [24] and its size is closely related to the RIP and restricted eigenvalue - type conditions. In some sense, characterizes the way in which vector (function) is “aligned” with eigenspaces of the covariance operator of the process and, following [22], it will be called the alignment coefficient. Clearly, we always have the bound , however, it can be improved in several important cases, see Section 4.4. Note that is a nondecreasing function of For we have In this case, so, the alignment coefficient coincides with the RKHS-norm associated to the covariance function For we have so, the cone coincides with the subspace of functions supported in In this case, is the RKHS-norm associated with restriction of the kernel to

In what follows, it will be convenient to take and denote (although in the statement of Theorem 2.1 below a smaller value could be used).

We will be interested in those oracles for which there exists a subgradient such that is not too large and is a “small” subset of . Such functions provide a natural analogue of sparse vectors in finite-dimensional problems.

###### Theorem 2.1.

The following inequality holds:

 (2.9) ∥fλε,aε−f∗∥2L2(Π)≤infλ∈D,w∈∂∥λ∥1,a∈R[∥fλ,a−f∗∥2L2(Π)+14ε2a2(w)].
###### Remark 2.2.

It easily follows from the proof of this theorem that for all

 ∫T∖Tw|λε|dμ≤4ε[∥fλ,a−f∗∥2L2(Π)+14ε2a2(w)].

The intuition behind these results is the following: if there exists an oracle with a small approximation error (say, of the order ) and not very large alignment coefficient then the risk is also small and is “almost” concentrated on the set

As we show below, in some cases and can be bounded in terms of Sobolev-type norms.

Since self-adjoint integral operator with kernel is Hilbert–Schmidt, it is compact, and the orthogonal complement to its kernel possesses an orthonormal system of eigenfunctions corresponding to positive eigenvalues . It is well-known that

 (2.10) H(K)={w(⋅)=∞∑j=1wjfj(⋅): ∥w∥2K=∞∑j=1w2jνj<∞}.

However, one might want to find a more direct characterization of . One way to proceed is to use the so-called Factorization theorem:

###### Theorem 2.2 ([29], Theorem 4 in Section 9).

Assume that there exists a Hilbert space and an injective linear operator such that , where is the adjoint of . Then , and .

The most obvious choice is and , whence which again gives (2.10). Other choices often lead to more insightful description. For example, if is the standard Brownian motion on , then one can check [29] that with the standard Lebesgue measure and satisfy the requirements. It immediately implies

###### Corollary 2.1.

The reproducing kernel Hilbert space associated with the Brownian motion is defined by

 (2.11) H(K)={h∈L2[0,1], h(0)=0, ∥h∥2K:=∫10(h′(s))2ds<∞}⊂W2,1[0,1],

where

 W2,1[0,1]={h ∈L2[0,1], h is abs. continuous, ∥h∥2W2,1:=∫10[h2(s)+(h′(s))2]ds<∞}

is the Sobolev space.

In particular, it means that Suppose now that is a bounded open subset and, for some and

 (2.12) a2(w)≤C∥w∥2W2,β.

Let be a “sparse” oracle such that where are disjoint sets. Moreover, assume that the distance between and is positive for all In other words, has components with well separated supports and it is zero in between. In this case, one can find such that and are smooth functions (from the space to be specific) with disjoint supports. For any such function we have

 a2(w)≤C∥w∥2W2,β≤C1d∑j=1∥wj∥2W2,β≤C1dmax1≤j≤d∥wj∥2W2,β

and the bound of Theorem 2.1 implies that

 (2.13) ∥fλε,aε−f∗∥2L2(Π)≤∥∥fλ,a(λ)−f∗∥∥2L2(Π)+C4dmax1≤j≤d∥wj∥2W2,βε2.

Thus, the size of the error explicitly depends on the number of components of “sparse” oracles approximating the target.

In Section 4, we will show that bound (2.12) holds for a number of stochastic processes and, moreover, there are other ways to take advantage of sparsity in the cases when the domain of can be partitioned in a number of regions such that the processes are “weakly correlated”.

## 3. Basic oracle inequalities

In this section, we present general oracle inequalities for the -risk of estimator The main goal is to show that if there exists an oracle such that the approximation error is small, the alignment coefficient is not large and is “sparse” in the sense that the set of random variables can be well approximated by a linear space of small dimension, then the -error of the estimator can be controlled in terms of the dimension of and the alignment coefficient To state the result precisely, we have to introduce one more parameter, an “approximate dimension”, providing an optimal choice of approximating space Thus, the degree of “sparsity” of the oracle will be characterized by the alignment coefficient that already appeared in approximation error bounds of Section 2 and also by “approximate dimension” introduced below.

We start, however, with a “slow-rate” oracle inequality that does not depend on “sparsity”. The inequalities of this type are well known in the literature on sparse recovery, in particular, for LASSO estimator in the case of finite dictionaries, see [4], [30].

Recall that is a convex set and Recall also the definition of (see 2.4) and its properties. Note that

 (3.1) σ2Y=Var(f∗(X))+σ2ξ=∥f∗−Πf∗∥2L2(Π)+σ2ξ.
###### Theorem 3.1.

There exist absolute constants and such that the following holds. For any with and for all satisfying

 (3.2) ε≥DσYS(T)√n,

with probability at least

 (3.3) ∥f^λε,^aε−f∗∥2L2(Π)+34ε∥^λε∥1≤infλ∈D,a∈R[∥fλ,a−f∗∥2L2(Π)+32ε∥λ∥1]+Cσ2Y¯sn.

As was mentioned earlier, our main goal is to obtain sharper bounds which would demonstrate connections between the risk of and the degree of sparsity of an underlying model. Our next result is a step in this direction. We will need the notion of Kolmogorov’s -width of the set of random variables defined as follows:

 ρd(C):=infL⊂LX,dim(L)≤dsupη∈C∥PL⊥η∥L2(P).

It characterizes the optimal accuracy of approximation of the set by -dimensional linear subspaces of Given let

 XT′:={X(t)−EX(t):t∈T′}.

Recall that Given an oracle and let

 ρd(w):=ρd(XTw).

The following number will play a role of approximate dimension of the set of random variables

 (3.4) d(w,λ):=min{d≥0:dσ2Yn≥∥λ∥1γ2(ρd(w))√n}.
###### Theorem 3.2.

There exist absolute constants and such that the following holds. For any with and for all satisfying

 (3.5) ε≥DσYS(T)√s√n,

with probability at least

 (3.6) ∥∥f^λε,^aε−f∗∥∥2L2(Π)≤infλ∈D,w∈∂∥λ∥1,a∈R [∥fλ,a−f∗∥2L2(Π)+2ε2a2(w)+Cσ2Yd(w,λ)n +C∥λ∥21S2(T)n]+Cσ2Y¯sn.

Under an additional assumption that is not too large, it is possible to prove the following modified version of Theorem 3.2 without the term in the oracle inequality.

###### Theorem 3.3.

Assume that conditions of Theorem 3.2 hold. If is such that

 D⊂{λ∈L1(μ): ∥λ∥1≤cσY√nS(T)}

for some absolute constant , then with probability

 ∥∥f^λε,^aε−f∗∥∥2L2(Π)≤ (3.7) infλ∈D,w∈∂∥λ∥1,a∈R[∥fλ,a−f∗∥2L2(Π)+2ε2a2(w)+Cσ2Yd(w,λ)n]+Cσ2Y¯sn.

The proof of this result follows from the proof of Theorem 3.2, see remark 7.1 for more details.

###### Remark 3.1.

Note that the oracle inequality of Theorem 3.2 is sharp, meaning that the constant in front of (the leading constant) is It is possible to derive an oracle inequality with the leading constant larger than which might yield faster rates when the variance of the noise is small. Define the following version of the “approximate dimension” (compare to (3.4)):

 dσξ(w,λ):=min{d≥0:dσ2ξn≥∥λ∥1γ2(ρd(w))√n}.

Then, under the assumptions of Theorem 3.2, the following inequality holds with probability :

 ∥∥f^λε,^aε−f∗∥∥2L2(Π)≤ infλ∈D,w∈∂∥λ∥1,a∈R[2∥fλ,a−f∗∥2L2(Π)+2ε2a2(w) +Cσ2ξdσξ(w,λ)n+C∥λ∥21S2(T)n]+Cσ2Y¯sn.

The proof of this result uses arguments similar to the proof of Theorem 3.2, so we omit the details.

Inequality (3.6) above depends on rather abstract parameters (such as the alignment coefficient and the approximate dimension) that have to be further bounded before one can get a meaningful bound in concrete examples. This will be dicussed in some detail in the following sections.

## 4. Bounding the alignment coefficient

First, we discuss the bounds on alignment coefficient in terms of Sobolev-type norms in some detail. After this, we turn to the problem of bounding the alignment coefficient in the cases when there exists a weakly correlated partition for the design process

### 4.1. Sacks-Ylvisacker conditions

In the univariate case it is possible to determine whether (a certain subspace of) the Sobolev space can be continuously embedded into based on the smoothness of the covariance function Existence of such an embedding is given by the so-called Sacks-Ylvisaker conditions [37]. This provides a way to bound the RKHS norm generated by the covariance function of (and, thus, also the alignment coefficient) in terms of a Sobolev norm. Definitions and statements below are taken from [36], Section 3.

Set and . Let be a continuous function on such that the restrictions are continuously extendable to the closures . will stand for the extension of to which is continuous on and on . Set . Then, the covariance kernel defined on satisfies the Sacks-Ylvisaker conditions of order if the following holds true:

1. , the partial derivatives of up to order 2 are continuous on and are continuously extendable to and to .