
# Asymptotics for penalized additive B-spline regression

Takuma Yoshida  and  Kanta Naito
Graduate School of Science and Engineering
and  Department of Mathematics
Shimane University, Matsue 690-8504, Japan
E-mail: naito@riko.shimane-u.ac.jp
27th April 2011
###### Abstract

This paper is concerned with asymptotic theory for the penalized spline estimator in the bivariate additive model. The focus is on the penalized spline estimator obtained by the backfitting algorithm. The convergence of the algorithm, as well as the uniqueness of its solution, is shown. The asymptotic bias and variance of the penalized spline estimator are derived by an efficient use of the asymptotic results for the penalized spline estimator in the marginal univariate model. Asymptotic normality of the estimator is also developed, by which an approximate confidence interval can be obtained. Some numerical experiments confirming the theoretical results are provided.

Key words: Additive model; Backfitting algorithm; B-spline; Penalized spline.

## 1 Introduction

The additive model is a typical regression model with multidimensional covariates and is usually expressed as

$$y_i = f_1(x_{i1}) + \cdots + f_D(x_{iD}) + \varepsilon_i,$$

for given data $\{(y_i, x_{i1}, \ldots, x_{iD}) : i = 1, \ldots, n\}$, where each $f_d$ is a univariate function with a certain degree of smoothness. This paper focuses on the bivariate additive model, in which $D = 2$.

The additive model has become a popular smoothing technique and its fundamental properties have been summarized in literature such as Buja et al. (1989) and Hastie and Tibshirani (1990). Buja et al. (1989) proposed the so-called backfitting algorithm, which is efficient for nonparametric estimation of the components $f_1, \ldots, f_D$. The backfitting algorithm is an iterative updating algorithm, and its convergence and the uniqueness of its solution are not always assured. Buja et al. (1989) showed a sufficient condition for convergence of the backfitting algorithm and for the uniqueness of its solution in the bivariate additive model.

In this paper, we discuss the asymptotic properties of the penalized spline estimator for the additive model with $D = 2$. Unlike spline smoothing, kernel smoothing for the additive model already has established asymptotic results. Ruppert and Opsomer (1997) showed that a certain kernel smoothing for the additive model satisfies the sufficient condition for convergence of the backfitting algorithm and the uniqueness of its solution. Furthermore, they derived the asymptotic bias and variance of the kernel estimator for the bivariate additive model.

Opsomer (2000) presented the sufficient condition for convergence of the backfitting algorithm and the uniqueness of its solution for the $D$-variate additive model in Lemma 2.1. The asymptotic bias and variance of the kernel estimator for the $D$-variate additive model were also derived under the assumption that the sufficient condition for convergence of the backfitting algorithm holds. Wand (1999) investigated asymptotic normality of the kernel estimator for the $D$-variate additive model by elegant use of the results in Opsomer (2000). We observe from Wand's asymptotic normality results that the kernel estimators of the $f_d$'s are asymptotically independent.

Many researchers have explored the effectiveness of spline smoothing, such as Wahba (1975) and Green and Silverman (1994). Penalized spline estimators have been discussed in O'Sullivan (1986), Eilers and Marx (1996), Marx and Eilers (1998) and Ruppert et al. (2003). Despite its richness of application, asymptotics for spline smoothing seem not yet to have been sufficiently developed.

For the univariate model ($D = 1$), Agarwal and Studden (1980) and Zhou et al. (1998) obtained important asymptotic results for the regression spline. Hall and Opsomer (2005) gave the mean squared error and consistency of the penalized spline estimator. The asymptotic bias and variance of the penalized spline estimator were obtained in Claeskens et al. (2009). Kauermann et al. (2009) worked with the generalized linear model. Wang et al. (2011) showed that the penalized spline estimator is asymptotically equivalent to a Nadaraya-Watson estimator. Thus, developments of asymptotic theories for the penalized spline are relatively recent, and we note that those works mainly concern the univariate model ($D = 1$). In the case of multidimensional covariates, Stone (1985) showed the consistency of the regression spline in the $D$-variate additive model, but that estimator is not a penalized spline.

The aim of this paper is to derive the asymptotic bias, asymptotic variance, and asymptotic distribution of the penalized spline estimator in the bivariate additive model. The penalized spline estimator for the bivariate additive model is obtained using the penalized least squares method and the backfitting algorithm. The uniqueness of the solution of the backfitting algorithm cannot be proved in general, but its convergence property can be shown. Nevertheless, it is demonstrated that the solution of the backfitting algorithm is asymptotically unique and the objective function for the penalized least squares method is shown to be asymptotically convex. As will be seen in the subsequent section, the penalized spline estimator in a bivariate setting has a closed form, which we can use for asymptotic manipulations. The properties of band matrices play an important role as a mathematical tool in asymptotic considerations. The effect of the initial value required for implementing the backfitting algorithm is also investigated.

This paper is organized as follows. In Section 2, our model setting and estimating equation in the penalized least squares method are discussed and the backfitting algorithm to obtain the solution is composed. Section 3 provides the asymptotic bias and variance of the penalized spline estimator and then its asymptotic normality is developed. Furthermore, the uniqueness of the solution of the backfitting algorithm is discussed. Section 4 includes numerical studies to validate the theory and an application to real data is reported. In Section 5, some suggestions that are necessary to develop the asymptotics for the general $D$-variate spline additive model are noted by comparing similar results already developed for the kernel estimator. Proofs for theoretical results are all given in the Appendix.

## 2 Model setting

### 2.1 Bivariate additive spline model

Consider a bivariate additive regression model

$$y_i = f_1(x_{i1}) + f_2(x_{i2}) + \varepsilon_i, \qquad (1)$$

for data $\{(y_i, x_{i1}, x_{i2}) : i = 1, \ldots, n\}$, where $f_1$ and $f_2$ are unknown regression functions and the $\varepsilon_i$'s are independent random errors with $E[\varepsilon_i] = 0$ and $V[\varepsilon_i] = \sigma^2(x_{i1}, x_{i2})$. We assume $E[f_j(x_{ij})] = 0$ ($j = 1, 2$) to ensure identifiability of $f_1$ and $f_2$. Let $q_j$ be the density of $x_{ij}$ and $q(x_1, x_2)$ be the joint density of $(x_{i1}, x_{i2})$. We assume without loss of generality that $x_{ij} \in [0, 1]$ for all $i$ and $j$.

Now we consider the B-spline model

$$s_j(x_j) = \sum_{k=-p+1}^{K_n} B_k^{[p]}(x_j)\, b_{j,k}$$

as an approximation to $f_j(x_j)$ at any $x_j \in [0, 1]$ for $j = 1, 2$. Here, the $B_k^{[p]}(x)$ are $p$th degree B-spline basis functions defined recursively as

$$B_k^{[0]}(x) = \begin{cases} 1, & \kappa_{k-1} \le x < \kappa_k, \\ 0, & \text{otherwise}, \end{cases} \qquad B_k^{[p]}(x) = \frac{x - \kappa_{k-1}}{\kappa_{k+p-1} - \kappa_{k-1}}\, B_k^{[p-1]}(x) + \frac{\kappa_{k+p} - x}{\kappa_{k+p} - \kappa_k}\, B_{k+1}^{[p-1]}(x),$$

where the $\kappa_k$ are knots, with $\kappa_0 = 0 < \kappa_1 < \cdots < \kappa_{K_n} = 1$, and the $b_{j,k}$ are unknown parameters. We write $B_k(x) = B_k^{[p]}(x)$ in what follows since only the $p$th degree is treated. The details and many properties of the B-spline function are clarified in de Boor (2001). We aim to obtain estimators of $f_1$ and $f_2$ via the B-spline additive regression model

$$y_i = s_1(x_{i1}) + s_2(x_{i2}) + \varepsilon_i, \qquad (2)$$

instead of the model (1). The model (2) can be expressed as

$$\boldsymbol{y} = X_1\boldsymbol{b}_1 + X_2\boldsymbol{b}_2 + \boldsymbol{\varepsilon}$$

by using the notations $\boldsymbol{y} = (y_1, \ldots, y_n)'$, $X_j = (B_k(x_{ij}))_{ik}$ (the $n \times (K_n + p)$ design matrix for the $j$th covariate), $\boldsymbol{b}_j = (b_{j,-p+1}, \ldots, b_{j,K_n})'$ for $j = 1, 2$, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)'$. We adopt the estimators $(\hat{\boldsymbol{b}}_1, \hat{\boldsymbol{b}}_2)$ of $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ defined as the minimizer of

$$L(\boldsymbol{b}_1, \boldsymbol{b}_2) = (\boldsymbol{y} - X_1\boldsymbol{b}_1 - X_2\boldsymbol{b}_2)'(\boldsymbol{y} - X_1\boldsymbol{b}_1 - X_2\boldsymbol{b}_2) + \sum_{j=1}^{2} \lambda_{jn}\, \boldsymbol{b}_j' Q_m \boldsymbol{b}_j, \qquad (3)$$

where $\lambda_{1n}$ and $\lambda_{2n}$ are smoothing parameters and $Q_m$ is the $m$th order difference matrix. This criterion is called the penalized least squares method and has been frequently utilized in spline regression (Eilers and Marx (1996)). For a fixed point $x_j \in [0, 1]$, the estimator of $f_j(x_j)$ is

$$\hat f_j(x_j) = \sum_{k=-p+1}^{K_n} B_k(x_j)\, \hat b_{j,k},$$

and is called the penalized spline estimator of $f_j(x_j)$. The predictor of $y$ at a fixed point $(x_1, x_2)$ is defined as

$$\hat y = \hat f_1(x_1) + \hat f_2(x_2).$$

Since $E[f_j(x_{ij})] = 0$ is assumed for $j = 1, 2$, the estimator of each component is usually centered. Hence $\hat f_j(x_j)$ is rewritten as

$$\hat f_{j,c}(x_j) = \hat f_j(x_j) - \frac{1}{n}\sum_{i=1}^{n} \hat f_j(x_{ij}),$$

as discussed in Wang and Yang (2007). In this paper, however, we do not examine $\hat f_{j,c}$ because our interest is in asymptotics for $\hat f_1$ and $\hat f_2$, and the asymptotic distributions of $\hat f_j$ and $\hat f_{j,c}$ become equivalent.
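To make the ingredients of the criterion (3) concrete, the following sketch evaluates the B-spline basis by the standard Cox–de Boor recursion and assembles the difference-type penalty matrix. It is a minimal illustration under our own assumptions (equidistant knots on $[0,1]$; `bspline_basis` and `difference_penalty` are hypothetical names, not the authors' code):

```python
import numpy as np

def bspline_basis(x, p, K):
    """Evaluate the K + p B-spline basis functions of degree p at points x
    in [0, 1] via the Cox-de Boor recursion.  Equidistant knots extended
    p intervals beyond [0, 1] are an illustrative assumption."""
    knots = np.arange(-p, K + p + 1) / K          # extended knot sequence
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # degree-0 bases: indicators of the knot intervals
    B = ((knots[:-1][None, :] <= x[:, None]) &
         (x[:, None] < knots[1:][None, :])).astype(float)
    B[x == 1.0, :] = 0.0
    B[x == 1.0, K + p - 1] = 1.0                  # close the last interval at x = 1
    for d in range(1, p + 1):                     # raise the degree recursively
        nb = B.shape[1] - 1
        Bnew = np.zeros((len(x), nb))
        for k in range(nb):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            if left > 0:
                Bnew[:, k] += (x - knots[k]) / left * B[:, k]
            if right > 0:
                Bnew[:, k] += (knots[k + d + 1] - x) / right * B[:, k + 1]
        B = Bnew
    return B                                      # shape (len(x), K + p)

def difference_penalty(size, m):
    """Penalty matrix D_m' D_m built from the m-th order difference matrix D_m."""
    D = np.diff(np.eye(size), n=m, axis=0)
    return D.T @ D
```

With `bspline_basis`, the design matrix $X_j$ is simply the basis evaluated at the observed $x_{ij}$'s, and the penalty term in (3) corresponds to a quadratic form in `difference_penalty(K + p, m)`.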

### 2.2 Backfitting algorithm

Let $\boldsymbol{b} = (\boldsymbol{b}_1', \boldsymbol{b}_2')'$. In general, $(\hat{\boldsymbol{b}}_1, \hat{\boldsymbol{b}}_2)$ is a solution of

$$\frac{\partial L(\boldsymbol{b}_1, \boldsymbol{b}_2)}{\partial \boldsymbol{b}} = \boldsymbol{0}. \qquad (4)$$

In fact, the solution of (4) can be written as $\hat{\boldsymbol{b}}_1 = \Lambda_1^{-1}X_1'(\boldsymbol{y} - X_2\hat{\boldsymbol{b}}_2)$ and $\hat{\boldsymbol{b}}_2 = \Lambda_2^{-1}X_2'(\boldsymbol{y} - X_1\hat{\boldsymbol{b}}_1)$, where $\Lambda_j = X_j'X_j + \lambda_{jn}Q_m$. However, this method has one defect: $L(\boldsymbol{b}_1, \boldsymbol{b}_2)$ is not in general convex as a function of $(\boldsymbol{b}_1, \boldsymbol{b}_2)$. Hence, a solution of (4) does not necessarily minimize (3). Marx and Eilers (1998) also noted this point as a typical problem of additive spline regression.

Let a minimizer of (3) be given. Then it is important to investigate the difference between this minimizer and the solution of (4) asymptotically. If the difference is vanishingly small, the solution of (4) asymptotically minimizes (3). The details of this assertion are given in Section 3.2.

In this paper, our estimators of $f_1$ and $f_2$ are composed by using the backfitting algorithm obtained from the solution of (4). The merit and usage of the backfitting algorithm are clarified in Hastie and Tibshirani (1990). The $\ell$-stage backfitting estimators $\boldsymbol{b}_1^{(\ell)}$ and $\boldsymbol{b}_2^{(\ell)}$ are defined as

$$\boldsymbol{b}_1^{(\ell)} = \Lambda_1^{-1}X_1'\bigl(\boldsymbol{y} - X_2\boldsymbol{b}_2^{(\ell-1)}\bigr) \quad\text{and}\quad \boldsymbol{b}_2^{(\ell)} = \Lambda_2^{-1}X_2'\bigl(\boldsymbol{y} - X_1\boldsymbol{b}_1^{(\ell)}\bigr),$$

respectively, where $\boldsymbol{b}_2^{(0)}$ is an initial value. Then, the $\ell$-stage backfitting estimator of $f_j(x_j)$ at $x_j$ is obtained as

$$f_j^{(\ell)}(x_j) = \sum_{k=-p+1}^{K_n} B_k(x_j)\, b_{j,k}^{(\ell)} = \boldsymbol{B}(x_j)'\boldsymbol{b}_j^{(\ell)}, \qquad j = 1, 2,$$

where $\boldsymbol{B}(x) = (B_{-p+1}(x), \ldots, B_{K_n}(x))'$. A mathematical property of the backfitting algorithm is that the limit $\boldsymbol{b}^{(\infty)}$ satisfies

$$\left.\frac{\partial L(\boldsymbol{b})}{\partial \boldsymbol{b}}\right|_{\boldsymbol{b} = \boldsymbol{b}^{(\infty)}} = \boldsymbol{0}. \qquad (5)$$

The backfitting algorithm itself is applicable not only in the bivariate but also in the general $D$-variate additive model. However, $\boldsymbol{b}_j^{(\ell)}$ can be explicitly expressed only for the case $D = 2$. By referring to (5.24) on page 119 of Hastie and Tibshirani (1990), $\boldsymbol{b}_1^{(\ell)}$ and $\boldsymbol{b}_2^{(\ell)}$ can be calculated as

$$\boldsymbol{b}_1^{(\ell)} = (X_1'X_1)^{-1}X_1'\boldsymbol{y} - (X_1'X_1)^{-1}X_1'\sum_{j=0}^{\ell-1}\{S_1S_2\}^{j}(I_n - S_1)\boldsymbol{y} - (X_1'X_1)^{-1}X_1'(S_1S_2)^{\ell-1}S_1X_2\boldsymbol{b}_2^{(0)},$$
$$\boldsymbol{b}_2^{(\ell)} = \Lambda_2^{-1}X_2'\sum_{j=0}^{\ell-1}\{S_1S_2\}^{j}(I_n - S_1)\boldsymbol{y} + \Lambda_2^{-1}X_2'(S_1S_2)^{\ell-1}S_1X_2\boldsymbol{b}_2^{(0)}, \qquad (6)$$

where $S_j = X_j\Lambda_j^{-1}X_j'$. It is shown by Theorem 10 of Buja et al. (1989) that $\boldsymbol{b}_1^{(\ell)}$ and $\boldsymbol{b}_2^{(\ell)}$ converge as $\ell \to \infty$, depending on the initial value $\boldsymbol{b}_2^{(0)}$. Thus, the backfitting estimators $f_1^{(\ell)}$ and $f_2^{(\ell)}$ converge, but the vectors to which they converge are not unique, depending on the initial value. We will study the asymptotic behavior of $f_j^{(\ell)}$, as well as its relationship to the limit $f_j^{(\infty)}$, from now on.
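The backfitting updates above translate directly into code. The sketch below is a schematic implementation under illustrative assumptions (zero initial value, fixed iteration count); `backfit` is our own name, not the authors' implementation:

```python
import numpy as np

def backfit(y, X1, X2, lam1, lam2, Qm, n_iter=50):
    """Backfitting for the penalized least squares criterion (3):
    b1 <- Lambda1^{-1} X1'(y - X2 b2),  b2 <- Lambda2^{-1} X2'(y - X1 b1),
    where Lambda_j = X_j'X_j + lam_j * Qm."""
    L1 = X1.T @ X1 + lam1 * Qm
    L2 = X2.T @ X2 + lam2 * Qm
    b2 = np.zeros(X2.shape[1])          # initial value b2^(0) = 0
    for _ in range(n_iter):
        b1 = np.linalg.solve(L1, X1.T @ (y - X2 @ b2))
        b2 = np.linalg.solve(L2, X2.T @ (y - X1 @ b1))
    return b1, b2
```

At convergence the pair satisfies the stationarity condition (5), i.e. each coefficient vector solves its own penalized normal equations given the other component's fit.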

## 3 Asymptotic theory

We prepare some symbols and notations to be used hereafter. Let $I_n$ be the identity matrix of size $n$. Define a matrix $G_k$, whose $(i,j)$-component is

$$G_{k,ij} = \int_0^1 B_i(x)B_j(x)\, q_k(x)\, dx$$

for $k = 1, 2$. Define a matrix $\Sigma_k$, whose $(i,j)$-component is

$$\Sigma_{k,ij} = \int_0^1\!\!\int_0^1 \sigma^2(x_1, x_2)\, B_i(x_k)B_j(x_k)\, q(x_1, x_2)\, dx_1\, dx_2$$

for $k = 1, 2$.

Let a vector $\boldsymbol{b}_j^{*}$ be such that $s_j(x_j) = \boldsymbol{B}(x_j)'\boldsymbol{b}_j^{*}$ gives the best approximation to the true function $f_j$. For further information on this point, see Zhou et al. (1998).

For a matrix $A$ whose components are all of order $O(g(n))$, we write $A = O(g(n))$. This notation will be used both for matrices of fixed size and for matrices whose size depends on $n$.

In spline smoothing, the smoothing parameter $\lambda_{jn}$ is usually selected to be of smaller order than the sample size because a spline curve often yields overfitting for large $K_n$. In the following, we assume that $K_n \to \infty$ as $n \to \infty$, with the rates of $K_n$ and $\lambda_{jn}$ chosen accordingly.

### 3.1 Asymptotic distribution of the penalized spline estimator

Let $f_{0j}^{(\ell)}(x_j)$ be $f_j^{(\ell)}(x_j)$ with initial value $\boldsymbol{b}_2^{(0)} = \boldsymbol{0}$. Then, $f_1^{(\ell)}(x_1)$ and $f_2^{(\ell)}(x_2)$ with arbitrary initial value can be expressed as

$$f_1^{(\ell)}(x_1) = f_{01}^{(\ell)}(x_1) - \boldsymbol{B}(x_1)'(X_1'X_1)^{-1}X_1'(S_1S_2)^{\ell-1}S_1X_2\boldsymbol{b}_2^{(0)}$$

and

$$f_2^{(\ell)}(x_2) = f_{02}^{(\ell)}(x_2) + \boldsymbol{B}(x_2)'\Lambda_2^{-1}X_2'(S_1S_2)^{\ell-1}S_1X_2\boldsymbol{b}_2^{(0)},$$

respectively. First, we investigate the influence of the initial value $\boldsymbol{b}_2^{(0)}$ on $f_j^{(\ell)}$, which is summarized as follows.

###### Proposition 1

Suppose that the conditions on $K_n$ and $\lambda_{jn}$ stated above hold. Then, for any fixed point $x_j$,

$$\bigl|f_1^{(\ell)}(x_1) - f_{01}^{(\ell)}(x_1)\bigr| = O_P(K_n^{-2\ell+1}) \quad\text{and}\quad \bigl|f_2^{(\ell)}(x_2) - f_{02}^{(\ell)}(x_2)\bigr| = o_P(K_n^{-2\ell+1}), \quad\text{as } n \to \infty.$$

In particular, these differences vanish as $\ell \to \infty$.

Proposition 1 claims that the influence of $\boldsymbol{b}_2^{(0)}$ on $f_j^{(\ell)}$ can be ignored for large $n$. In other words, for any initial value, we obtain a unique limit $f_j^{(\infty)}$ as $\ell \to \infty$. Hence, it suffices to consider $f_{0j}^{(\ell)}$ instead of $f_j^{(\ell)}$ to develop asymptotics under the manipulations $n \to \infty$ and $\ell \to \infty$. Here, $f_{01}^{(\ell)}(x_1)$ and $f_{02}^{(\ell)}(x_2)$ can be written as

$$f_{01}^{(\ell)}(x_1) = f_{01}^{(1)}(x_1) - \boldsymbol{B}(x_1)'(X_1'X_1)^{-1}X_1'\sum_{k=1}^{\ell-1}\{S_1S_2\}^{k}(I_n - S_1)\boldsymbol{y}$$

and

$$f_{02}^{(\ell)}(x_2) = f_{02}^{(1)}(x_2) + \boldsymbol{B}(x_2)'\Lambda_2^{-1}X_2'\sum_{k=1}^{\ell-1}\{S_1S_2\}^{k}(I_n - S_1)\boldsymbol{y},$$

respectively. This allows the following.

###### Proposition 2

Suppose that the conditions of Proposition 1 hold. Then, for any fixed point $x_j$,

$$f_{01}^{(\infty)}(x_1) = f_{01}^{(1)}(x_1) + O_P(K_n^{-1}) \quad\text{and}\quad f_{02}^{(\infty)}(x_2) = f_{02}^{(1)}(x_2) + o_P(K_n^{-1}), \quad\text{as } n \to \infty.$$

We see from (6) that

$$f_{01}^{(1)}(x_1) = \boldsymbol{B}(x_1)'\Lambda_1^{-1}X_1'\boldsymbol{y} \quad\text{and}\quad f_{02}^{(1)}(x_2) = \boldsymbol{B}(x_2)'\Lambda_2^{-1}X_2'(I_n - S_1)\boldsymbol{y}.$$

It should be noted that $f_{01}^{(1)}(x_1)$ has the same form as the penalized spline estimator based on the dataset $\{(y_i, x_{i1}) : i = 1, \ldots, n\}$ in the univariate regression model ($D = 1$). This form is very important because the asymptotic bias and variance of the penalized spline estimator for the univariate regression model have already been derived by Claeskens et al. (2009). Similarly, $f_{02}^{(1)}(x_2)$ includes the term $\boldsymbol{B}(x_2)'\Lambda_2^{-1}X_2'\boldsymbol{y}$, which is the same as the penalized spline estimator for univariate regression based on $\{(y_i, x_{i2}) : i = 1, \ldots, n\}$.

We denote $f_j^{(\infty)}(x_j)$ as $\hat f_j(x_j)$, which does not depend on the initial value as $n \to \infty$. The usefulness of Propositions 1 and 2 is that they establish the asymptotic equivalence between the backfitting estimator and the (marginal) univariate penalized spline estimator. By using the results of Claeskens et al. (2009), we obtain Theorem 1.

###### Theorem 1

Suppose that the conditions on $K_n$ and $\lambda_{jn}$ stated above hold. Then, for any fixed point $(x_1, x_2)$, as $n \to \infty$,

$$E[\hat f_j(x_j)] = f_j(x_j) + b_{j,\lambda}(x_j) + O_P(K_n^{-1}) + o_P(\lambda_{jn}K_n n^{-1}),$$
$$V\begin{bmatrix}\hat f_1(x_1)\\ \hat f_2(x_2)\end{bmatrix} = \begin{bmatrix} V_1(x_1)(1 + o_P(1)) & O_P(n^{-1})\\ O_P(n^{-1}) & V_2(x_2)(1 + o_P(1)) \end{bmatrix},$$

where

$$b_{j,\lambda}(x_j) = -\frac{\lambda_{jn}}{n}\boldsymbol{B}(x_j)'G_j^{-1}Q_m\boldsymbol{b}_j^{*}, \qquad V_j(x_j) = \frac{1}{n}\boldsymbol{B}(x_j)'G_j^{-1}\Sigma_j G_j^{-1}\boldsymbol{B}(x_j).$$

By using Theorem 1, we have the asymptotic joint distribution of $(\hat f_1(x_1), \hat f_2(x_2))$.

###### Theorem 2

Suppose that the conditions of Theorem 1 hold, and that $K_n$ and $\lambda_{jn}$ additionally satisfy rate conditions under which the bias term of Theorem 1 is asymptotically negligible relative to the standard deviation. Then, for any fixed point $(x_1, x_2)$, as $n \to \infty$,

$$V\begin{bmatrix}\hat f_1(x_1)\\ \hat f_2(x_2)\end{bmatrix}^{-1/2}\begin{bmatrix}\hat f_1(x_1) - f_1(x_1)\\ \hat f_2(x_2) - f_2(x_2)\end{bmatrix} \xrightarrow{\;D\;} N_2\!\left(\begin{bmatrix}0\\0\end{bmatrix}, I_2\right).$$

From Theorem 2, $\hat f_1(x_1)$ and $\hat f_2(x_2)$ are asymptotically independent. Asymptotic normality and the independence of the component estimators in kernel smoothing also hold, as shown in Wand (1999). Thus, the penalized spline estimator and the kernel estimator for the additive model have the same asymptotic property. Asymptotic normality of $\hat y$ can be shown as a direct consequence of Theorem 2. We briefly note the pointwise confidence interval for $f_j(x_j)$ by exploiting the distribution of $\hat f_j(x_j)$ obtained from Theorem 2. Here, we treat $\sigma^2(x_1, x_2)$ as known for all $(x_1, x_2)$, but it should be estimated in data analysis.

###### Corollary 1

A $100(1-\alpha)\%$ asymptotic confidence interval of $f_j(x_j)$ at any fixed point $x_j$ is

$$\hat f_j(x_j) \pm z_{\alpha/2}\sqrt{V[\hat f_j(x_j)]},$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ normal percentile.

The confidence interval in Corollary 1 will be applied to a set of real data in Section 4, in which we need to prepare an estimate of $\sigma^2$.
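Corollary 1 is straightforward to apply once $\hat f_j(x_j)$ and an estimate of its variance are available. A minimal sketch, with `pointwise_ci` a hypothetical helper name:

```python
from statistics import NormalDist

def pointwise_ci(fhat, var, alpha=0.05):
    """Asymptotic 100(1-alpha)% confidence interval of Corollary 1:
    fhat +- z_{alpha/2} * sqrt(var)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 normal percentile
    half = z * var ** 0.5
    return fhat - half, fhat + half
```

For `alpha=0.05` this uses the familiar $z_{0.025} \approx 1.96$.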

### 3.2 Minimizer of $L(\boldsymbol{b})$

Here, we discuss the difference between $\boldsymbol{b}^{(\infty)}$ and the minimizer of (3). Although $\boldsymbol{b}^{(\infty)}$ is a solution of (4), whether it minimizes $L(\boldsymbol{b})$ or not is not trivial. That is, many solutions of (4) might exist because $L(\boldsymbol{b})$ is not convex, and for each such solution there exists an initial value from which the backfitting algorithm converges to it. However, $\boldsymbol{b}^{(\infty)}$ is asymptotically not dependent on the initial value, as implied in Proposition 1. Therefore, the uniqueness of the penalized spline estimator obtained by the backfitting algorithm is asymptotically satisfied. Furthermore, Theorem 3 says that $\boldsymbol{b}^{(\infty)}$ asymptotically minimizes $L(\boldsymbol{b})$.

###### Theorem 3

Let $H$ be the Hessian matrix of $L(\boldsymbol{b})$. Then $H$ is asymptotically positive definite.

We see that asymptotic properties of the penalized spline estimator for the additive model can be obtained not only by Theorems 1 and 2, but also by Theorem 3.

## 4 Numerical studies

In this section, we examine the behavior of the backfitting estimator and validate Theorem 2 numerically by simulation. In addition, we obtain an asymptotic confidence interval using a real dataset. We utilize the cubic spline ($p = 3$) and the second order difference penalty ($m = 2$) in all of the following numerical studies.

### 4.1 Simulation

We choose the true functions , and the error is . Here, is a uniform distribution on an interval . The explanatory variables are derived from . Then, and satisfy and , respectively. We demonstrate three simulations.

In Simulation-1, we compare the backfitting estimator $f_j^{(\ell)}$ with the true $f_j$.

In Simulation-2, we compare $f_j^{(\ell)}$ with

$$\hat f_{\mathrm{pen},j}(x_j) = \boldsymbol{B}(x_j)'\Lambda_j^{-1}X_j'\boldsymbol{y},$$

which is the penalized spline estimator for univariate regression based on $\{(y_i, x_{ij}) : i = 1, \ldots, n\}$.

In Simulation-3, we compare the density of $N_2(\boldsymbol{0}, I_2)$ with the kernel density estimate of the simulated values of

$$V\begin{bmatrix}f_1^{(\ell)}(x_1)\\ f_2^{(\ell)}(x_2)\end{bmatrix}^{-1/2}\begin{bmatrix}f_1^{(\ell)}(x_1) - f_1(x_1)\\ f_2^{(\ell)}(x_2) - f_2(x_2)\end{bmatrix} \qquad (7)$$

to validate Theorem 2, where we note that the covariance matrix of $(f_1^{(\ell)}(x_1), f_2^{(\ell)}(x_2))$ can be exactly calculated, and it was in fact used in this simulation. The bandwidth of the kernel density estimate is selected by the method of Sheather and Jones (1991). The algorithm of Simulation-3 is given as follows:

1. Generate $x_{i1}$ and $x_{i2}$ for $i = 1, \ldots, n$.

2. Generate the data $y_i$ from (1).

3. Calculate $f_1^{(\ell)}(x_1)$ and $f_2^{(\ell)}(x_2)$ at a fixed point $(x_1, x_2)$.

4. Calculate the values of (7).

5. Iterate from Step 2 to Step 4, 1000 times.

6. Draw the kernel density estimate of (7) and compare it with the density of $N_2(\boldsymbol{0}, I_2)$.
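The standardization idea behind these steps can be illustrated with a deliberately simplified stand-in: a fixed linear smoother $\boldsymbol{a}'\boldsymbol{y}$, whose variance is exactly computable, replaces the backfitting estimator, and the empirical distribution of the standardized quantity is checked against the standard normal. All settings below (weights, true function, noise level) are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 1000
x = rng.uniform(0.0, 1.0, n)
f = np.sin(2 * np.pi * x)                 # an illustrative true function
sigma = 0.3

# A fixed linear smoother fhat = a'y stands in for the backfitting
# estimator; V[a'y] = sigma^2 * a'a is exactly computable, mirroring the
# exact covariance used in Simulation-3.
x0 = 0.5
a = np.exp(-((x - x0) ** 2) / (2 * 0.1 ** 2))
a /= a.sum()
exact_sd = sigma * np.sqrt(np.sum(a ** 2))

z = np.empty(reps)
for r in range(reps):
    y = f + rng.normal(0.0, sigma, n)     # Step 2: generate data
    z[r] = (a @ y - a @ f) / exact_sd     # Step 4: standardize, centred at E[a'y]
# z should now behave like a standard normal sample (Steps 5-6)
```

Plotting a kernel density estimate of `z` against the $N(0,1)$ density reproduces, in one dimension, the comparison carried out in Figure 3.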

The results of Simulation-1, Simulation-2 and Simulation-3 are displayed in Figure 1, Figure 2, and Figure 3, respectively. In all simulation settings, , and were adopted. We set the sample size for Simulation-1 and Simulation-2, and and for Simulation-3.

We see from Figure 1 that the backfitting estimator approximates the true function well. We also observe in Figure 2 that the differences between $f_j^{(\ell)}$ and $\hat f_{\mathrm{pen},j}$ are small, which means that the univariate penalized spline estimator dominates the backfitting estimator, as claimed in Proposition 2.

The contour plots of the density estimate of (7) and of the density of $N_2(\boldsymbol{0}, I_2)$ are drawn in Figure 3. We observe that there is still a gap between the density estimate and the density of $N_2(\boldsymbol{0}, I_2)$ for the smaller sample size. However, we see from the larger sample size case that the density estimate is clearly approaching the density of $N_2(\boldsymbol{0}, I_2)$, as claimed in Theorem 2.

### 4.2 Application to real data

We construct the asymptotic pointwise confidence interval of $f_j$ by using real data. We utilize the ozone data (Hastie et al. (2001)). We use model (1), where $y$ is ozone concentration (ppb), $x_1$ is daily maximum temperature and $x_2$ is wind speed (mph). Each $y_i$ is centered and the $x_{ij}$'s are rescaled to $[0, 1]$. We composed the backfitting estimator and the asymptotic pointwise confidence interval of $f_j$ under the assumption that $\sigma^2(x_1, x_2) \equiv \sigma^2$, which can be estimated by

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl\{y_i - f_1^{(\ell)}(x_{i1}) - f_2^{(\ell)}(x_{i2})\bigr\}^2.$$
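This residual-based estimate is simple to compute once the backfitting fit is available. A minimal sketch with hypothetical names, where `b1` and `b2` denote the fitted coefficient vectors:

```python
import numpy as np

def sigma2_hat(y, X1, X2, b1, b2):
    """Mean squared residual from the fitted additive B-spline model,
    as in the displayed estimator for sigma^2."""
    resid = y - X1 @ b1 - X2 @ b2
    return float(np.mean(resid ** 2))
```

Plugging `sigma2_hat` into the variance formula of Theorem 1 then yields the estimated pointwise intervals of Corollary 1.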

Again, we used , and .

Hastie and Tibshirani (1990) estimated $f_j$ by using a pseudo additive method based on a smoothing spline. In addition, they constructed pointwise error bars, which are drawn in Figure 9.9 of Hastie and Tibshirani (1990). The asymptotic pointwise confidence interval exhibited in Figure 4 looks quite similar to those error bars. However, we see that the asymptotic intervals given in Figure 4 are both smoother than the error bars. Although this is only an application to one dataset, we thus confirm that the confidence intervals based on asymptotic normality can be applied to real data.

## 5 Discussion

In this paper, the asymptotic behavior of the penalized spline estimator in the bivariate additive model is investigated. The research in this paper can be seen as a spline version of the work by Ruppert and Opsomer (1997) and Wand (2000). To consider a generalization of the work in this paper to the $D$-variate additive model, it might be worthwhile to review the work by Opsomer (2000) on local polynomial fitting in the $D$-variate additive model, as introduced in Section 1. Let $\hat{\boldsymbol{f}}_j = (\hat f_j(x_{1j}), \ldots, \hat f_j(x_{nj}))'$ for $j = 1, \ldots, D$. Then a formal estimating equation yields the estimator of $(\boldsymbol{f}_1, \ldots, \boldsymbol{f}_D)$ as

$$\begin{bmatrix}\hat{\boldsymbol{f}}_1\\ \hat{\boldsymbol{f}}_2\\ \vdots\\ \hat{\boldsymbol{f}}_D\end{bmatrix} = \begin{bmatrix} I_n & S_1 & \cdots & S_1\\ S_2 & I_n & \cdots & S_2\\ \vdots & \vdots & \ddots & \vdots\\ S_D & S_D & \cdots & I_n \end{bmatrix}^{-1}\begin{bmatrix} S_1\\ S_2\\ \vdots\\ S_D \end{bmatrix}\boldsymbol{y}.$$