
# Empirical Wavelet-based Estimation for Non-linear Additive Regression Models.

German A. Schnaidt Grez¹ and Brani Vidakovic²

¹Georgia Institute of Technology. Email: gschnaidt@gatech.edu. 755 Ferst Dr. NW, Atlanta, Georgia 30332

²Georgia Institute of Technology. Email: brani@gatech.edu. 755 Ferst Dr. NW, Atlanta, Georgia 30332
###### Abstract

Additive regression models are actively researched in statistics because of their usefulness in the analysis of responses determined by non-linear relationships with multivariate predictors. In this kind of statistical model, the response depends linearly on unknown functions of the predictor variables, and typically the goal of the analysis is to make inference about these functions.

In this paper, we consider the problem of additive regression with random designs from a novel viewpoint: we propose an estimator based on an orthogonal projection onto a multiresolution space using empirical wavelet coefficients that are fully data driven. In this setting, we derive a mean-square consistent estimator based on wavelets periodic on the interval [0,1]. For the construction of the estimator, we assume that the joint density of the predictors is non-zero and bounded on its support; we also assume that the unknown functions belong to a Sobolev space and integrate to zero over the interval [0,1], which guarantees model identifiability and convergence of the proposed method. Moreover, we provide the risk analysis of the estimator and derive its convergence rate.

Theoretically, we show that this approach achieves good convergence rates when the dimensionality of the problem is relatively low and the set of unknown functions is sufficiently smooth. In this approach, the results are obtained without the assumption of an equispaced design, a condition that is typically assumed in most wavelet-based procedures.

Finally, we show practical results obtained from simulated data, demonstrating the potential applicability of our method in the problem of additive regression models with random designs.

###### keywords:
Wavelets, non-parametric regression, functional data analysis, robust statistical modeling

## 1 Introduction

Additive regression models are popular in statistics because of their usefulness in the analysis of responses determined by non-linear relationships involving multivariate predictors. In this kind of statistical model, the response depends linearly on unknown functions of the predictors, and typically the goal of the analysis is to make inferences about these functions. This model has been extensively studied through the application of piecewise polynomial approximations, splines, marginal integration, as well as back-fitting and functional principal components. Chapter 15 of Ramsay2005, Chapter 22 of Gyorfi2002, and Mammen2003, Buja1989, and Hastie1990 feature thorough discussions of the issues related to fitting such models and provide a comprehensive overview and analysis of various estimation techniques for this problem.

In general, the additive regression model relates a univariate response $Y$ to predictor variables $X_1,\dots,X_p$ via a set of unknown non-linear functions $f_1,\dots,f_p$. The functions may be assumed to have a specified parametric form (e.g. polynomial) or may be specified non-parametrically, simply as "smooth functions" that satisfy a set of constraints (e.g. membership in a certain functional space such as a Besov or Sobolev space, Lipschitz continuity, bounded derivatives, etc.). Though parametric estimates may seem more attractive from the modeling perspective, they can have a major drawback: a parametric model automatically restricts the space of functions that is used to approximate the unknown regression function, regardless of the available data. As a result, when the elicited parametric family is not "close" to the true functional form, the results obtained through the parametric approach can be misleading. For this reason, the non-parametric approach has gained more popularity in statistical research, providing a more general, flexible, and robust approach in tasks of functional inference.

In this paper we propose a linear functional estimator based on an orthogonal projection onto a specified multiresolution space $V_J$ using empirical wavelet coefficients that are fully data driven. Here, $V_J$ stands for the space spanned by the set of scaling functions of the form $\phi_{J,k}(x)=2^{J/2}\phi(2^{J}x-k)$, generated by a specified wavelet filter. Since we assume the predictors are random with an unknown distribution, we introduce a kernel density estimator in the model to estimate its density. In this setting, we propose a mean-square consistent estimator for the constant term $\beta_0$ and the wavelet coefficients in the orthogonal series representation of the model. Our results are based on wavelets periodic on the interval [0,1] and are derived under a set of assumptions that guarantee identifiability and convergence of the proposed estimator. Moreover, we derive convergence rates for the risk and propose a practical choice for the multiresolution index $J$ to be used in the wavelet expansion. In this approach, we obtain the stated results without the assumption of an equispaced design, a condition that is typically assumed in most wavelet-based procedures.

Our choice of wavelets as an orthonormal basis is motivated by the fact that wavelets are well localized in both time and scale (frequency), and possess superb approximation properties for signals with rapid local changes such as discontinuities, cusps, sharp spikes, etc. Moreover, the representation of these signals in the form of wavelet decompositions can be accurately done using only a few wavelet coefficients, enabling sparsity and dimensionality reduction. This adaptivity does not, in general, hold for other standard orthonormal bases (e.g. the Fourier basis), which may require many compensating coefficients to describe signal discontinuities or local bursts.

We also illustrate practical results for the proposed estimator using different exemplary functions and random designs, under different sample sizes, demonstrating the suitability of the proposed methodology.

As mentioned, additive regression models have been studied by many authors using a wide variety of approaches. The approaches include marginal integration, back-fitting, least squares (including penalized least squares), orthogonal series approximations, and local polynomials. Short descriptions of the most commonly used techniques are provided next:

1. Marginal Integration. This method was proposed by Tjostheim and Auestad (1994) and Linton and Nielsen (1995), and later generalized by Chen et al. (1996). The marginal integration idea is based on estimating the effect of each function in the model using sample averages of kernel functions, keeping the variable of interest fixed at each observed sample point while changing the remaining ones. This method has been shown to produce good results in simulation studies (Sperlich et al., 1999). However, the finite-sample performance of marginal integration tends to be inadequate when the dimension of the predictors is large. In particular, the bias-variance trade-off of the estimator in this case is challenging: for a given bandwidth there may be too few data points near any given x, which inflates the estimator's variance and reduces its numerical stability. On the other hand, choosing a larger bandwidth may reduce the variability but also increase the bias.

2. Back-fitting. This approach was first introduced by Buja et al. (1989) and further developed by Hastie and Tibshirani (1990). This technique uses nonparametric regression to estimate each additive component and then updates the preliminary estimates; the process continues iteratively until convergence. One drawback of this method is that it has proven theoretically challenging to analyze. In this context, Opsomer and Ruppert (1997) investigated the properties of a version of back-fitting and found that the estimator was not oracle efficient (an oracle-efficient estimator is one for which each component of the model can be estimated with the same convergence rate as if the remaining components were known). Later on, Mammen et al. (1999) and Mammen and Park (2006) proposed modifications of the back-fitting approach that produce estimators with better statistical properties, such as oracle efficiency and asymptotic normality, and that are free of the curse of dimensionality. Even though this is a popular method, it has been shown that its efficiency decreases when the unknown functions are observed at non-equispaced locations.

3. Series-based methods using wavelets. One important benefit of wavelets is that they are able to adapt to the unknown smoothness of functions (Donoho et al., 1995). Most of the work using wavelets is based on the requirement of equally spaced measurements (e.g. at equal time intervals, or a response observed on a regularly spaced grid). Antoniadis et al. (1997) proposed a method using interpolation and averaging: based on the observed sample, the function is approximated at equally spaced dyadic points. Most methods that take this approach lead to wavelet coefficients that can be computed via a matrix transformation of the original data, and are formulated in terms of a continuous wavelet transformation applied to a piecewise-constant interpolation of the observed samples. Pensky and Vidakovic (2001) proposed a method that places a probabilistic model on the design of the independent variables and can be applied to non-equally spaced designs (NESD). Their approach is based on a linear wavelet estimator that is similar to the wavelet modification of the Nadaraya-Watson estimator (Antoniadis et al., 1994). In the same context, Amato and Antoniadis (2001) proposed a wavelet estimator based on tensor wavelet series and a regularization rule that guarantees an adaptive solution to the estimation problem in the presence of NESD.

4. Other methods based on wavelets. Other wavelet-based approaches, different from those previously described, have also been investigated. Donoho et al. (1992) proposed an estimator that is the solution of a penalized least squares optimization problem, preventing the problem of ill-conditioned design matrices. Zhang and Wong (2003) proposed a two-stage wavelet thresholding procedure using local polynomial fitting and marginal integration for the estimation of the additive components. Their method is adaptive to different degrees of smoothness of the components and has good asymptotic properties. Later on, Sardy and Tseng (2004) proposed a non-linear smoother and non-linear back-fitting algorithm based on WaveShrink, modeling each function in the model as a parsimonious expansion on a wavelet basis that is further subjected to variable selection (i.e. which wavelets to use in the expansion) via non-linear shrinkage.

As was discussed above in the context of applying wavelets to additive models with NESD, another possibility is to simply ignore the non-equispaced condition on the predictors and apply wavelet methods directly to the observed sample. Even though this might seem a somewhat crude approach, we will show that it is possible to implement this procedure via a relatively simple algorithm, obtaining good statistical properties and estimation results.

For the implementation of the functional estimator, we choose periodic wavelets as an orthonormal basis. Even though this kind of wavelet exhibits poor behavior near the boundaries (when the analyzed function is not periodic, high-amplitude wavelet coefficients are generated in the neighborhood of the boundaries), periodic wavelets are typically used due to their relatively simple numerical implementation and compact support. Also, as was suggested by Johnstone (1994), this simplification affects only a small number of wavelet coefficients at each resolution level.

Periodic wavelets in $L^2([0,1])$ are defined by a modification of the standard scaling and wavelet functions:

$$\phi^{per}_{j,k}(x)=\sum_{l\in\mathbb{Z}}\phi_{j,k}(x-l),\tag{1}$$

$$\psi^{per}_{j,k}(x)=\sum_{l\in\mathbb{Z}}\psi_{j,k}(x-l).\tag{2}$$
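As a concrete illustration, the periodization in (1) can be computed directly for the Haar scaling function, whose dilated shifts have support $[k/2^j,(k+1)/2^j]$. The following sketch is ours (the function names are illustrative, not from the paper); it also checks numerically that $\int_0^1\phi^{per}_{J,k}(x)\,dx=2^{-J/2}$, a property of periodized scaling functions that the estimators below rely on.

```python
import numpy as np

def haar_phi(x):
    """Haar scaling function: 1 on [0, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0.0) & (x < 1.0), 1.0, 0.0)

def phi_per(j, k, x, n_wrap=3):
    """Periodized scaling function phi^per_{j,k}(x) = sum_l phi_{j,k}(x - l),
    truncating the sum to |l| <= n_wrap (enough for compactly supported phi)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for l in range(-n_wrap, n_wrap + 1):
        total += 2.0 ** (j / 2) * haar_phi(2.0 ** j * (x - l) - k)
    return total

# Numerical check: the integral of phi^per_{J,k} over [0, 1) equals 2^{-J/2}.
J, k = 3, 5
grid = np.linspace(0.0, 1.0, 100_000, endpoint=False)
integral = phi_per(J, k, grid).mean()   # Riemann sum over [0, 1)
```

For the Haar family with $0\le k<2^J$ no wrapping actually occurs, so the periodized function coincides with $\phi_{J,k}$ on $[0,1)$; the truncated sum matters for smoother filters with wider support.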

It is possible to show, as in Restrepo1996, that $\{\phi^{per}_{0,0}\}\cup\{\psi^{per}_{j,k}:\ j\ge 0,\ 0\le k\le 2^{j}-1\}$ constitutes an orthonormal basis for $L^2([0,1])$. Consequently, $L^2([0,1])=V_0\oplus\big(\bigoplus_{j\ge 0}W_j\big)$, where $W_j$ is the space spanned by $\{\psi^{per}_{j,k}:\ 0\le k\le 2^{j}-1\}$. This allows us to represent a function $f$ with support in $[0,1]$ as:

$$f(x)=\langle f,\phi^{per}_{0,0}\rangle\,\phi^{per}_{0,0}(x)+\sum_{j\ge 0}\sum_{k=0}^{2^{j}-1}\langle f,\psi^{per}_{j,k}\rangle\,\psi^{per}_{j,k}(x).\tag{3}$$

Also, for a fixed $J$, we can obtain an orthogonal projection of $f$ onto $V_J$, denoted $P_J(f)$, given by:

$$P_J(f)(x)=\sum_{k=0}^{2^{J}-1}\langle f,\phi^{per}_{J,k}\rangle\,\phi^{per}_{J,k}(x).\tag{4}$$

Since periodized wavelets provide a basis for $L^2([0,1])$, we have that $\|f-P_J(f)\|_2\to 0$ as $J\to\infty$; for sufficiently smooth $f$, it can also be shown that the convergence is uniform. Similarly, as discussed in Daubechies1992, it is possible to assess the approximation error for a function of interest using a truncated projection (i.e. onto a chosen multiresolution space $V_J$). For example, using the $s$-th Sobolev norm of a function, defined as:

$$\|f\|_{H^s}=\sqrt{\int\big(1+|\xi|^{2}\big)^{s}\,|\hat{f}(\xi)|^{2}\,d\xi},\tag{5}$$

where $\hat{f}$ denotes the Fourier transform of $f$, one defines the Sobolev space $H^s$ as the space consisting of all functions whose $s$-Sobolev norm exists and is finite. As shown in Daubechies1992:

$$\|f-P_J(f)\|_{2}\le 2^{-Js}\,\|f\|_{H^s[0,1]}.\tag{6}$$

From (6), for a pre-specified accuracy $\epsilon>0$ one can choose $J$ such that $\|f-P_J(f)\|_2\le\epsilon$. In fact, a possible choice of $J$ could be:

$$J\ \ge\ \left\lceil -\frac{1}{s}\log_{2}\!\left(\frac{\epsilon}{\|f\|_{H^s[0,1]}}\right)\right\rceil=\left\lceil \frac{1}{s}\log_{2}\!\left(\frac{\|f\|_{H^s[0,1]}}{\epsilon}\right)\right\rceil.\tag{7}$$

Therefore, it is possible to approximate a desired function to arbitrary precision using the multiresolution analysis (MRA) generated by a wavelet basis.
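The choice of $J$ in (7) can be turned into a small routine. This is a sketch under the assumption that the Sobolev norm $\|f\|_{H^s[0,1]}$ is known or has been estimated; the function name is ours.

```python
import math

def min_resolution(eps, sobolev_norm, s):
    """Smallest multiresolution index J with 2^{-J*s} * ||f||_{H^s} <= eps,
    following the bound (6) and the choice (7)."""
    if eps >= sobolev_norm:
        return 0                 # J = 0 already meets the target accuracy
    return math.ceil(math.log2(sobolev_norm / eps) / s)

# Example: eps = 0.01, ||f||_{H^s} = 1, s = 2  ->  J = 4, since 2^{-8} <= 0.01.
J = min_resolution(0.01, 1.0, 2)
```

Note the trade-off implicit in (7): the number of empirical coefficients to estimate grows like $2^J$, so a rougher target $\epsilon$ or a smoother function (larger $s$) permits a coarser, cheaper projection.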

## 2 Wavelet-based Estimation in Additive Regression Models

Suppose that instead of the typical linear regression model, which assumes linearity in the predictors $x$, we have the following:

$$f(x)=\beta_0+f_A(x)+\sigma\cdot\epsilon=\beta_0+\sum_{j=1}^{p}f_j(x_j)+\sigma\cdot\epsilon,\tag{8}$$

where $\epsilon$ is a random error, independent of $x$, with $E[\epsilon]=0$ and $Var(\epsilon)=1$, $\sigma>0$, and $x=(x_1,\dots,x_p)\in[0,1]^{p}$. Similarly, $h(x)$ denotes the unknown design density of the observations, and $f_1,\dots,f_p$ are unknown functions to be estimated.

### 2.1 Problem statement and derivation of the Estimator

Suppose that we are able to observe an iid sample $\{(x_i,y_i)\}_{i=1}^{n}$, where $y_i=\beta_0+\sum_{l=1}^{p}f_l(x_{il})+\sigma\epsilon_i$. We are interested in estimating $\beta_0$ and $f_1,\dots,f_p$. For simplicity (without loss of generality) and identifiability, we assume:

1. (A1) The density $h$ is of the continuous type and has support in $[0,1]^{p}$. Also, we assume $\exists\,\epsilon_h>0$ s.t. $h(x)\ge\epsilon_h$ for all $x\in[0,1]^{p}$.

2. (A2) For $l=1,\dots,p$, $\int_{0}^{1}f_l(x)\,dx=0$.

3. (A3) For $l=1,\dots,p$, $f_l\in L^2([0,1])$ and $\|f_l\|_\infty\le M_f<\infty$. This implies that for $x\in[0,1]^{p}$, $|f_A(x)|\le p\cdot M_f$.

4. (A4) The design density $h$ belongs to a generalized Hölder class of functions of the form:

$$H(\beta,L)=\Big\{h:\ |\partial^{\alpha}h(x)-\partial^{\alpha}h(y)|\le L\,\|x-y\|_{1}^{\beta-|\alpha|},\ \forall\,\alpha\in\mathbb{N}^{p}\ \text{s.t.}\ |\alpha|=\lfloor\beta\rfloor,\ \forall\,x,y\in[0,1]^{p}\Big\}\tag{9}$$

where $\beta>0$, $L>0$, and $|\alpha|=\alpha_1+\dots+\alpha_p$. Also, suppose that $h\in H(\beta,L)$.

5. (A5) The density $h$ is uniformly bounded in $[0,1]^{p}$; that is, $\exists\,M_h<\infty$ s.t. $h(x)\le M_h$, $\forall x\in[0,1]^{p}$.

Furthermore, since $\{\phi^{per}_{jk}\}$ spans $L^2([0,1])$, each of the functions $f_l$ in (8) can be represented as:

$$f_l(x)=\sum_{j\ge 0}\sum_{k=0}^{2^{j}-1}c^{(l)}_{jk}\,\phi^{per}_{jk}(x),\quad l=1,\dots,p,\tag{10}$$

where $c^{(l)}_{jk}$ denotes the $(j,k)$-th wavelet coefficient of the $l$-th function in the model. Similarly, for some fixed $J$, $f_{l,J}$ denotes the orthogonal projection of $f_l$ onto the multiresolution space $V_J$. Therefore, $f_{l,J}$ can be expressed as:

$$f_{l,J}(x)=\sum_{k=0}^{2^{J}-1}c^{(l)}_{Jk}\,\phi^{per}_{Jk}(x),\quad l=1,\dots,p,\tag{11}$$

where:

$$c^{(l)}_{Jk}=\langle f_l,\phi^{per}_{Jk}\rangle=\int_{0}^{1}f_l(x)\,\phi^{per}_{Jk}(x)\,dx,\quad l=1,\dots,p.\tag{12}$$

Based on the model (8) and (11), it is possible to approximate $f$ by an orthogonal projection $f_J$ onto the multiresolution space spanned by the set of scaling functions $\{\phi^{per}_{Jk}:\ 0\le k\le 2^{J}-1\}$, by approximating each of the functions $f_l$ as described above. Therefore, $f_J$ can be expressed as:

$$f_J(x)=\beta_0+\sum_{l=1}^{p}\sum_{k=0}^{2^{J}-1}c^{(l)}_{Jk}\,\phi^{per}_{Jk}(x_l).\tag{13}$$

Now, the goal is, for a pre-specified multiresolution index $J$, to use the observed samples to estimate the unknown constant $\beta_0$ and the orthogonal projections $f_{l,J}$ of the functions $f_l$.

#### Remarks

1. Note that the scaling function $\phi$ for the wavelet basis is absolutely integrable in $[0,1]$. Therefore, $\int_{0}^{1}\phi^{per}_{Jk}(x)\,dx=2^{-J/2}$.

2. Also, from the above conditions, the variance of the response is bounded for every $x\in[0,1]^{p}$.

3. The assumption that the support of the random vector X is $[0,1]^{p}$ can always be satisfied by carrying out appropriate monotone increasing transformations of each dimensional component, even in the case when the support before transformation is unbounded. In practice, it would be sufficient to transform the empirical support to $[0,1]^{p}$.
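The transformation described in the last remark can be sketched with a componentwise min-max rescaling over the empirical support; the function name and the margin device (keeping points strictly inside the cube) are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

def to_unit_cube(X, margin=1e-6):
    """Monotone increasing, componentwise map of an (n, p) design onto (0, 1)^p,
    based on the empirical support (min/max of each column)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo + margin) / (hi - lo + 2.0 * margin)

# Example: a design with unbounded support before transformation.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Z = to_unit_cube(X)
```

Because the map is affine with positive slope in each coordinate, it is monotone increasing and preserves the ordering of each predictor, which is all the remark requires.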

#### 2.1.1 Derivation of the estimator for β0

From the model definition presented in (8), and assumption (A2) we have that:

$$\int_{[0,1]^{p}}\Big(\beta_0+\sum_{l=1}^{p}f_l(x_l)\Big)dx=\beta_0+\sum_{l=1}^{p}\int_{0}^{1}f_l(x_l)\,dx_l=\beta_0.\tag{14}$$

Therefore, under assumption (A1) and the last result, it is possible to obtain $\beta_0$ as:

$$\beta_0=E_{X,\epsilon}\!\left[\frac{f(X)}{h(X)}\right].\tag{15}$$

Indeed,

$$E_{X,\epsilon}\!\left[\frac{f(X)}{h(X)}\right]=E_{X,\epsilon}\!\left[\frac{\beta_0+\sum_{j=1}^{p}f_j(X_j)+\sigma\epsilon}{h(X)}\right]=\beta_0+E_{X}\!\left[\frac{\sum_{j=1}^{p}f_j(X_j)}{h(X)}\right]=\beta_0.$$

As a result of (15), a natural data-driven estimator of $\beta_0$ is

$$\hat{\beta}_0=\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\hat{h}_n(x_i)},\tag{16}$$

where $\hat{h}_n$ is a suitable non-parametric density estimator of $h$, e.g. a kernel density estimator.
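To make (16) concrete, the following sketch simulates the model (8) with $p=2$, a uniform design ($h\equiv 1$), and component functions satisfying (A2), then plugs a Gaussian kernel density estimate into the estimator. The boundary-correction-by-reflection device and all names are our illustrative choices, not prescribed by the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n, beta0, sigma = 2000, 1.5, 0.1

X = rng.uniform(size=(n, 2))                 # design density h == 1 on [0,1]^2
f1 = lambda t: np.sin(2.0 * np.pi * t)       # integrates to 0 on [0,1]  (A2)
f2 = lambda t: t - 0.5                       # integrates to 0 on [0,1]  (A2)
y = beta0 + f1(X[:, 0]) + f2(X[:, 1]) + sigma * rng.standard_normal(n)

# KDE with boundary correction: reflect the sample across every face of the
# unit square, fit on the 9 copies, and rescale (the square holds ~1/9 of
# the smoothed mass), so h_hat does not collapse near the boundary.
maps = (lambda t: t, lambda t: -t, lambda t: 2.0 - t)
X_ref = np.vstack([np.column_stack([mx(X[:, 0]), my(X[:, 1])])
                   for mx in maps for my in maps])
h_hat = 9.0 * gaussian_kde(X_ref.T)(X.T)

beta0_hat = np.mean(y / h_hat)               # estimator (16)
```

Without the reflection step the KDE is biased downward near the faces of $[0,1]^2$, which inflates the weights $1/\hat{h}_n(x_i)$ there and visibly biases $\hat{\beta}_0$ upward; any other boundary-aware density estimator would serve the same purpose.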

#### 2.1.2 Derivation of the estimator for the wavelet coefficients c(l)Jk

Based on the multiresolution space spanned by the orthonormal functions $\{\phi^{per}_{Jk}\}$, (12), and assumption (A2), the wavelet coefficients for each functional component can be represented as:

$$c^{(l)}_{Jk}=\int_{0}^{1}f_l(x_l)\,\phi^{per}_{Jk}(x_l)\,dx_l.\tag{17}$$

Expanding the right-hand-side (rhs) of the last equation, we get:

$$\int_{0}^{1}f_l(x_l)\phi^{per}_{Jk}(x_l)\,dx_l=\int_{0}^{1}f_l(x_l)\Big(\phi^{per}_{Jk}(x_l)-2^{-J/2}\Big)dx_l=\int_{[0,1]^{p-1}}\int_{0}^{1}\Big(\beta_0+\sum_{j=1}^{p}f_j(x_j)\Big)\Big(\phi^{per}_{Jk}(x_l)-2^{-J/2}\Big)dx_l\,dx_{(-l)},$$

where $x_{(-l)}$ corresponds to the vector $x$ without the $l$-th entry. It is easy to see that (17) holds because of assumption (A2) and the fact that $\int_{0}^{1}\phi^{per}_{Jk}(x)\,dx=2^{-J/2}$. The proof of this last claim can be found in Appendix A.

Now, if we consider (A1), we can see that an alternative way to express (17) could be:

$$c^{(l)}_{Jk}=E_{X,\epsilon}\!\left[\frac{f(X)\big(\phi^{per}_{Jk}(X_l)-2^{-J/2}\big)}{h(X)}\right].\tag{18}$$

Indeed,

$$E_{X,\epsilon}\!\left[\frac{f(X)\big(\phi^{per}_{Jk}(X_l)-2^{-J/2}\big)}{h(X)}\right]=E_{X,\epsilon}\!\left[\frac{\big(\beta_0+\sum_{j=1}^{p}f_j(X_j)+\sigma\epsilon\big)\big(\phi^{per}_{Jk}(X_l)-2^{-J/2}\big)}{h(X)}\right]=\int_{[0,1]^{p}}\Big(\beta_0+\sum_{j=1}^{p}f_j(x_j)\Big)\Big(\phi^{per}_{Jk}(x_l)-2^{-J/2}\Big)dx=c^{(l)}_{Jk}.$$

From (18), similarly as for $\hat{\beta}_0$, we obtain a natural data-driven estimator of $c^{(l)}_{Jk}$ as:

$$\hat{c}^{(l)}_{Jk}=\frac{1}{n}\sum_{i=1}^{n}\frac{y_i\big(\phi^{per}_{Jk}(x_{il})-2^{-J/2}\big)}{\hat{h}_n(x_i)}.\tag{19}$$
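A short simulation sketch of (19), using the periodized Haar scaling basis; to keep the example focused on the coefficient estimator, the design is uniform and its density is taken as known ($\hat{h}_n\equiv 1$), whereas in practice $\hat{h}_n$ would be a kernel estimate as in (16). The function names and test functions are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, sigma, J = 2000, 1.5, 0.1, 2

X = rng.uniform(size=(n, 2))            # uniform design: h == 1, taken as known
f1 = lambda t: np.sin(2.0 * np.pi * t)  # target component, integrates to 0 (A2)
f2 = lambda t: t - 0.5
y = beta0 + f1(X[:, 0]) + f2(X[:, 1]) + sigma * rng.standard_normal(n)

def phi_per_haar(J, k, x):
    """Periodized Haar scaling function on [0,1): 2^{J/2} on [k/2^J, (k+1)/2^J)."""
    return np.where((x >= k / 2.0 ** J) & (x < (k + 1) / 2.0 ** J),
                    2.0 ** (J / 2), 0.0)

# Estimator (19) for l = 1 with the design density known (denominator = 1):
c_hat = np.array([np.mean(y * (phi_per_haar(J, k, X[:, 0]) - 2.0 ** (-J / 2)))
                  for k in range(2 ** J)])

# Exact coefficients c^{(1)}_{Jk} = int_0^1 f1(t) phi^per_{Jk}(t) dt for Haar:
edges = np.arange(2 ** J + 1) / 2.0 ** J
c_true = (np.cos(2.0 * np.pi * edges[:-1]) - np.cos(2.0 * np.pi * edges[1:])) \
         * 2.0 ** (J / 2) / (2.0 * np.pi)
```

The centering term $-2^{-J/2}$ in (19) is what removes the contributions of $\beta_0$ and of the other components $f_j$, $j\ne l$, so each `c_hat[k]` estimates a coefficient of $f_1$ alone.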

### 2.2 Asymptotic Properties of the Estimator

In this section, we study the asymptotic properties of the estimators proposed in (16) and (19), and state conditions for their pointwise mean squared consistency under assumptions (A1)-(A5).

#### 2.2.1 Unbiasedness and Consistency of $\hat{\beta}_0$

Next, we analyze the asymptotic behavior of the estimator $\hat{\beta}_0$, assuming that conditions (Ak1)-(Ak4) stated in Appendix C hold.

#### Asymptotic Behavior of $E(\hat{\beta}_0)$

From (63) and the hierarchy of convergence for random variables, it follows that for a fixed x, $\hat{h}_n(x)$ converges in probability to $h(x)$. Let us now consider the function $g(t)=1/t$, defined for $t\ge\epsilon_h>0$. On this domain $g$ is bounded and continuous, which implies:

$$E\!\left[\frac{1}{\hat{h}_n(x)}\right]\xrightarrow[n\to\infty]{}\frac{1}{h(x)}.\tag{20}$$

In fact, since $g$ is continuous and admits infinitely many derivatives on $[\epsilon_h,\infty)$, by using a Taylor series expansion around $h(x)$ and results (66) and (69), it is possible to obtain:

$$\left|E_{X_1,\dots,X_n}\!\left[\Big(\frac{1}{\hat{h}_n(x)}\Big)^{k}-\Big(\frac{1}{h(x)}\Big)^{k}\right]\right|\le\frac{1}{\epsilon_h^{k+2}}\Big\{\big|Bias\big(\hat{h}_n(x)\big)\big|+Var\big(\hat{h}_n(x)\big)+Bias\big(\hat{h}_n(x)\big)^{2}\Big\}\le C\Big\{\delta^{\beta}+\frac{1}{n\delta^{p}}+\delta^{2\beta}\Big\},\tag{21}$$

for $n$ sufficiently large and a constant $C>0$ (independent of $n$ and $\delta$), where $\delta$ denotes the bandwidth of the kernel density estimator $\hat{h}_n$.

Therefore, under the choice $\delta\asymp n^{-1/(\beta+p)}$, $E\big[(1/\hat{h}_n(x))^{k}\big]$ converges to $(1/h(x))^{k}$ at a rate $O\big(n^{-\beta/(\beta+p)}\big)$ for fixed $k$. Here the expectation is taken with respect to the joint density of the iid sample $X_1,\dots,X_n$.
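The bandwidth choice follows from balancing the two dominant terms in the bound of (21); the $\delta^{2\beta}$ term is of smaller order and does not affect the balance:

```latex
\delta^{\beta} \;=\; \frac{1}{n\,\delta^{p}}
\;\Longleftrightarrow\;
\delta^{\beta+p} \;=\; \frac{1}{n}
\;\Longleftrightarrow\;
\delta \;=\; n^{-1/(\beta+p)},
\qquad\text{giving the rate}\quad
\delta^{\beta} \;=\; n^{-\beta/(\beta+p)}.
```

Note that smoother design densities (larger $\beta$) allow a faster rate, while larger dimension $p$ slows it down, a standard manifestation of the curse of dimensionality in the kernel estimation step.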

Similarly, the last result leads to:

$$E_{X_1,\dots,X_n}\!\left[\Big(\frac{1}{\hat{h}_n(x)}-\frac{1}{h(x)}\Big)^{2}\right]\longrightarrow 0,\tag{22}$$

as $n\to\infty$, at the rate implied by (21).

Now, letting x be random and using conditional expectation, it is possible to obtain:

$$E\!\left[\frac{1}{\hat{h}_n(X)}\right]=E_{X}\!\left[E_{X_1,\dots,X_n|X}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right)\right].\tag{23}$$

From (20) and the last result, the dominated convergence theorem implies:

$$E\!\left[\frac{1}{\hat{h}_n(X)}\right]\xrightarrow[n\to\infty]{}\int_{[0,1]^{p}}\frac{1}{h(x)}\,h(x)\,dx=1.\tag{24}$$

Using the definition of $\hat{\beta}_0$ and the model (8), we obtain:

$$E[\hat{\beta}_0]=\beta_0+E\!\left[\frac{\sum_{l=1}^{p}f_l(X_l)}{\hat{h}_n(X)}\right]=\beta_0+E_{X}\!\left[E_{X_1,\dots,X_n|X}\!\left(\frac{\sum_{l=1}^{p}f_l(X_l)}{\hat{h}_n(X)}\,\Big|\,X\right)\right]=\beta_0+E_{X}\!\left[\sum_{l=1}^{p}f_l(X_l)\cdot E_{X_1,\dots,X_n|X}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right)\right].\tag{25}$$

Therefore, from (20)-(24) and under (A2) and (A3), the dominated convergence theorem leads to:

$$E[\hat{\beta}_0]\xrightarrow[n\to\infty]{}\beta_0,\tag{26}$$

which shows that $\hat{\beta}_0$ is asymptotically unbiased for $\beta_0$.

#### Asymptotic Behavior of $Var(\hat{\beta}_0)$

From the definition of $\hat{\beta}_0$ and (8), we can see that:

$$Var(\hat{\beta}_0)=\frac{1}{n}Var\!\left(\frac{Y}{\hat{h}_n(X)}\right)\le\frac{1}{n}E\!\left[\frac{Y^{2}}{\hat{h}_n(X)^{2}}\right]=\frac{1}{n}E_{X}\!\left[E_{X_1,\dots,X_n|X=x}\!\left(\frac{Y^{2}}{\hat{h}_n(X)^{2}}\,\Big|\,X=x\right)\right].\tag{27}$$

Now, if $E\big[Y^{2}/h(X)^{2}\big]<\infty$, then from conditions (A2) and (A3) and the dominated convergence theorem, it follows that:

$$E_{X}\!\left[E_{X_1,\dots,X_n|X=x}\!\left(\frac{Y^{2}}{\hat{h}_n(X)^{2}}\,\Big|\,X=x\right)\right]\xrightarrow[n\to\infty]{}E\!\left[\frac{Y^{2}}{h(X)^{2}}\right].\tag{28}$$

Thus,

$$Var(\hat{\beta}_0)\xrightarrow[n\to\infty]{}0,\tag{29}$$

provided $E\big[Y^{2}/h(X)^{2}\big]<\infty$.

Finally, putting together (26) and (29), we obtain that $\hat{\beta}_0$ is mean-square consistent for $\beta_0$.

#### 2.2.2 Unbiasedness and Consistency of the $\hat{c}^{(l)}_{Jk}$

In this section, we study the asymptotic behavior of the wavelet coefficient estimators $\hat{c}^{(l)}_{Jk}$ for a fixed $J$, assuming that conditions (A1)-(A5) and (Ak1)-(Ak4) hold.

#### Asymptotic Behavior of $E(\hat{c}^{(l)}_{Jk})$

For a fixed $J$, $k=0,\dots,2^{J}-1$, and $l=1,\dots,p$, we have that $\hat{c}^{(l)}_{Jk}=\frac{1}{n}\sum_{i=1}^{n}y_i\phi^{per}_{Jk}(x_{il})/\hat{h}_n(x_i)-2^{-J/2}\hat{\beta}_0$. Therefore,

$$E[\hat{c}^{(l)}_{Jk}]=E\!\left[\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\right]-2^{-J/2}E[\hat{\beta}_0].\tag{30}$$

Following the same argument as in the analysis of $E[\hat{\beta}_0]$, we find that the first term of (30) can be represented as:

$$E\!\left[\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\right]=E_{X}\!\left[E_{X_1,\dots,X_n|X}\!\left(\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\,\Big|\,X\right)\right]=E_{X}\!\left[Y\phi^{per}_{Jk}(X_l)\cdot E_{X_1,\dots,X_n|X}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right)\right].$$

Since $J$ is assumed fixed and (A3) holds, by the dominated convergence theorem it follows that:

$$E_{X}\!\left[Y\phi^{per}_{Jk}(X_l)\cdot E_{X_1,\dots,X_n|X}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right)\right]\xrightarrow[n\to\infty]{}E\!\left[\frac{Y\phi^{per}_{Jk}(X_l)}{h(X)}\right].\tag{31}$$

Furthermore, by (A3) and (54):

$$E\!\left[\frac{Y\phi^{per}_{Jk}(X_l)}{h(X)}\right]=\int_{[0,1]^{p}}\Big(\beta_0+\sum_{j=1}^{p}f_j(x_j)\Big)\phi^{per}_{Jk}(x_l)\,dx=\int_{0}^{1}f_l(x_l)\phi^{per}_{Jk}(x_l)\,dx_l+2^{-J/2}\beta_0=c^{(l)}_{Jk}+2^{-J/2}\beta_0.\tag{32}$$

Finally, putting together the last result and (26), it follows:

$$E[\hat{c}^{(l)}_{Jk}]\xrightarrow[n\to\infty]{}c^{(l)}_{Jk},\tag{33}$$

which shows that the wavelet coefficient estimators are asymptotically unbiased for fixed $J$, $k=0,\dots,2^{J}-1$, and $l=1,\dots,p$.

#### Asymptotic Behavior of $Var(\hat{c}^{(l)}_{Jk})$

For a fixed $J$, $k=0,\dots,2^{J}-1$, and $l=1,\dots,p$, the variance of $\hat{c}^{(l)}_{Jk}$ is given by:

$$Var(\hat{c}^{(l)}_{Jk})=Var\!\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Y_i\phi^{per}_{Jk}(X_{il})}{\hat{h}_n(X_i)}-2^{-J/2}\hat{\beta}_0\right)=\frac{1}{n}Var\!\left(\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\right)+2^{-J}Var(\hat{\beta}_0)-2\,Cov\!\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Y_i\phi^{per}_{Jk}(X_{il})}{\hat{h}_n(X_i)},\;2^{-J/2}\hat{\beta}_0\right)=\frac{1}{n}V_{c1}+2^{-J}V_{c2}+2V_{c3}.\tag{34}$$

By using the model defined in (8), we find that for $V_{c1}$:

$$V_{c1}=Var_{X}\!\left(E_{X_1,\dots,X_n|X}\!\left[\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\,\Big|\,X\right]\right)+E_{X}\!\left[Var_{X_1,\dots,X_n|X}\!\left(\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\,\Big|\,X\right)\right]$$
$$=Var_{X}\!\left(Y\phi^{per}_{Jk}(X_l)\cdot E_{X_1,\dots,X_n|X}\!\left[\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right]\right)+E_{X}\!\left[\big(Y\phi^{per}_{Jk}(X_l)\big)^{2}\cdot Var_{X_1,\dots,X_n|X}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X\right)\right].\tag{35}$$

By the dominated convergence theorem, it follows:

$$V_{c1}\xrightarrow[n\to\infty]{}Var\!\left(\frac{Y\phi^{per}_{Jk}(X_l)}{h(X)}\right),\tag{36}$$

where the last result holds since:

$$E_{X_1,\dots,X_n|X=x}\!\left[\frac{1}{\hat{h}_n(X)}\,\Big|\,X=x\right]\xrightarrow[n\to\infty]{}\frac{1}{h(x)},\qquad Var_{X_1,\dots,X_n|X=x}\!\left(\frac{1}{\hat{h}_n(X)}\,\Big|\,X=x\right)\xrightarrow[n\to\infty]{}0.$$

This implies,

$$\frac{1}{n}Var\!\left(\frac{Y\phi^{per}_{Jk}(X_l)}{\hat{h}_n(X)}\right)\xrightarrow[n\to\infty]{}0.\tag{37}$$

#### Proposition 2

Let us suppose that conditions (A1)-(A5) and (Ak1)-(Ak4) hold. Then:

$$E\!\left[\left(\frac{Y\phi^{per}_{Jk}(X_l)}{h(X)}\right)^{2}\right]\le C(\beta_0,p,\sigma^{2},M_f)\cdot\left\{\frac{1}{\epsilon_h}\Big(\Big\lceil\log_{2}\big(\tfrac{1}{\epsilon_h}\big)\Big\rceil-1\Big)+\frac{1}{\big\lceil\log_{2}\big(\tfrac{1}{\epsilon_h}\big)\big\rceil}\right\},\tag{38}$$

where $C(\beta_0,p,\sigma^{2},M_f)<\infty$. This result shows that $E\big[\big(Y\phi^{per}_{Jk}(X_l)/h(X)\big)^{2}\big]$ is bounded from above, provided $\epsilon_h>0$ and conditions (A1)-(A5) and (Ak1)-(Ak4) hold. Therefore, $\frac{1}{n}V_{c1}\to 0$ as $n\to\infty$.

The proof can be found in Appendix D.

Similarly, as for $V_{c1}$, let us consider the behavior of $V_{c3}$. Using the covariance definition and the iid assumption for the sample $\{(X_i,Y_i)\}_{i=1}^{n}$, it follows that:

$$V_{c3}=-\frac{2^{-J/2}}{n^{2}}\left\{\sum_{i=1}^{n}Cov\!\left(\frac{Y_i\phi^{per}_{Jk}(X_{il})}{\hat{h}_n(X_i)},\,\frac{Y_i}{\hat{h}_n(X_i)}\right)+\sum_{i=1}^{n}\sum_{\substack{j=1\\ j\ne i}}^{n}Cov\!\left(\frac{Y_i\phi^{per}_{Jk}(X_{il})}{\hat{h}_n(X_i)},\,\frac{Y_j}{\hat{h}_n(X_j)}\right)\right\}.\tag{39}$$

#### Proposition 3

Let us suppose assumptions (A1)-(A5) and (Ak1)-(Ak4) are satisfied. Then the following results hold: