1 Introduction

In this paper, we propose a new semiparametric regression estimator by using a hybrid technique of a parametric approach and a nonparametric penalized spline method. The overall shape of the true regression function is captured by the parametric part, while its residual is consistently estimated by the nonparametric part. Asymptotic theory for the proposed semiparametric estimator is developed, showing that its behavior is dependent on the asymptotics for the nonparametric penalized spline estimator as well as on the discrepancy between the true regression function and the parametric part. As a naturally associated application of asymptotics, some criteria for the selection of parametric models are addressed. Numerical experiments show that the proposed estimator performs better than the existing kernel-based semiparametric estimator and the fully nonparametric estimator, and that the proposed criteria work well for choosing a reasonable parametric model.

Semiparametric Penalized Spline Regression

TAKUMA YOSHIDA AND KANTA NAITO

Graduate School of Science and Engineering, Shimane University, Matsue, Japan
Department of Mathematics, Shimane University, Matsue, Japan

Keywords Asymptotic theory; Bias reduction; -spline; Parametric model; Penalized spline;
Semiparametric regression

Mathematics Subject Classification Primary 62G08; Secondary 41A15, 62G20

## 1 Introduction

There have been several nonparametric smoothing techniques used in regression problems, such as lowess, kernel smoothing, spline smoothing, wavelet, the series method, and so on. The nonparametric estimators generally have consistency, which is an advantage of this approach. Hence, if the nonparametric estimator is used, we can expect that the true regression can be captured as the sample size increases. However, because the form of a nonparametric estimator is sometimes complicated, the interpretation of the estimated structure might not be clear.

On the other hand, in a parametric regression problem with the true regression function controlled by a finite-dimensional parameter vector, the estimated structure is easy to understand, however, the estimator does not have consistency. Therefore, there are advantages and disadvantages associated with each of these approaches. This motivates us to consider a hybrid of parametric and nonparametric methods for the regression problem and we, in fact, introduce a semiparametric regression method so that the estimator has the advantages of both approaches.

The semiparametric method in this paper consists of two steps. In the first step, we utilize an appropriate parametric estimator. In the second step, we apply a certain nonparametric smoother to the residual data associated with the parametric estimator in the first step. The parametric estimator in the first step and the nonparametric smoother in the second step are combined into the proposed semiparametric estimator.

Similar semiparametric approaches for smoothing have been developed by many authors. Hjort and Glad (1995) and Naito (2004) discussed similar methods in density estimation literature. Glad (1998) and Naito (2002) addressed the semiparametric regression method. Martins et al. (2008) introduced general decomposition, including additive and multiplicative corrections in regression. Recently, Fan et al. (2009) discussed the semiparametric approach in the framework of a generalized linear model. Note that the aforementioned works all used kernel smoothing in the second step estimation.

Our proposal is to utilize the penalized spline method for residual smoothing in the second step. This is a typical technique used in nonparametric regression problems with sufficient fitness and appropriate smoothness, which was developed by O’Sullivan (1986) and Eilers and Marx (1996). Many of its applications are summarized in Ruppert, et al (2003). Throughout this paper, the fully nonparametric penalized spline estimator is designated by NPSE, while the semiparametric penalized spline estimator, including the two-step manipulations mentioned above, is denoted by SPSE. In this paper, the advantages of using the penalized spline method instead of the kernel method are described both theoretically and numerically. In particular, we found that the SPSE has better behavior than the semiparametric local linear estimator (SLLE) in simulation.

This paper is organized as follows. We elaborate on the proposed SPSE in Section 2. Section 3 discusses the asymptotic properties of the SPSE, which can be obtained using a combination of the asymptotic results for the parametric estimator and for the NPSE developed by Claeskens et al. (2009). The asymptotic bias of the SPSE depends on the initial parametric model utilized in the first step. The form of the asymptotic bias suggests a method of choosing the parametric model for the first step. A theoretical comparison of SPSE with SLLE is also given in the context of asymptotic bias, which reveals that the use of the penalized spline rather than a kernel smoother in the second step is valid. In Section 4, some criteria for parametric model selection will be clarified. If a parametric model chosen by the criteria discussed in Section 4 is used as the parametric part of the SPSE, its asymptotic bias will become smaller than that of the NPSE. The results of a simulation are reported in Section 5. The simulation studies include checking the accuracy of the SPSE and comparing it with the NPSE and the SLLE as regression estimators. The performance of the parametric model selection discussed in Section 4 is also investigated. Related discussion and issues for future research are provided in Section 6. Proofs for the theoretical results are given in the Appendix.

## 2 Semiparametric penalized spline estimator

Consider the relationship of the dataset as the regression model

 yi=f(xi)+εi,  i=1,⋯,n,

where the explanatory is generated from density with its support on , is an unknown regression function, and the errors are assumed to be uncorrelated with and . Let be a parametric model. We now construct the semiparametric estimator of . First we obtain an appropriate estimator of via a suitable method of estimation. Then can be written as

 f(x)=f(x|^\boldmathβ)+f(x|^\boldmath% β)γrγ(x,^\boldmathβ), (1)

where for some . When , this decomposition becomes , which is called an additive correction. When , on the other hand, we have a multiplicative correction . By using the parameter , we can treat additive and multiplicative corrections systematically (see, Fan et al. (2009)). In the second step, is estimated by applying a nonparametric technique to . The SPSE is obtained as

 ^f(x,γ)=f(x|^\boldmathβ)+f(x|^\boldmathβ)γ^rγ(x,^\boldmathβ), (2)

where is a nonparametric estimator of .

We adopt the penalized spline to estimate . Let be a marginal -spline basis of degree with equally spaced knots . Then we consider the -spline model

 Kn∑k=−p+1B[p]k(x)bk

as an approximation to , where ’s are unknown parameters. The definition and fundamental properties of the -spline basis are detailed in de Boor (2001). Let be the -vector with th element and let and . The penalized spline estimator of is defined as the minimizer of

 (\boldmathRγ−Z\boldmathb)′(\boldmathRγ−Z\boldmathb)+λn\boldmathb′Qm\boldmathb,

where is the smoothing parameter and is the th difference matrix. The estimator of is defined as

 ^rγ(x,^\boldmathβ)=Kn∑k=−p+1B[p]k(x)^bk=\boldmathB(x)′(Z′Z+λnQm)−1Z′\boldmathRγ, (3)

where .

In Figure 1, an example of the SPSE is drawn. In the left panel, the true function and the least square estimator of are shown. In the middle panel, the residuals of and the penalized spline estimator of are drawn. In the right panel, the true function and the SPSE as given in (2) are drawn. As the interpretation of for this example, the parametric part captures the overall shape of and the nonparametric part explains details which could not be captured by the . Similarly, we can construct an SPSE with multiplicative correction.

## 3 Asymptotic Result

Asymptotics for the NPSE were developed by Claeskens et al. (2009). By using their results, we show the asymptotic bias and variance, and asymptotic distribution of the SPSE. We now give some assumptions regarding the asymptotics of the SPSE.

Assumptions

1. There exists such that for all , .

2. .

3. , ,

4. , ,

5. , .

6. and .

Define the matrix , where

 gij=∫10B[p]−p+i(u)B[p]−p+j(u)q(u)du

and the matrix , where

 gσ,ij=∫10B[p]−p+i(u)B[p]−p+j(u)σ2(u)q(u)f(u|\boldmathβ)2γdu.

Let be a best approximation to . This means that satisfies

 supx∈(0,1)∣∣∣f(x)−f(x|\boldmathβ)f(x|% \boldmathβ)γ+ba1(x|\boldmathβ,γ)−% \boldmathB(x)′\boldmathb∗(\boldmathβ,γ)∣∣∣=o(K−(p+1)n),

where

 ba1(x|\boldmathβ,γ)=−(f(x)−f(x|\boldmath% β)f(x|\boldmathβ)γ)(p+1)1Kp+1n(p+1)!Kn∑j=1I(κj−1≤x<κj)Bp+1(x−κj−1K−1n),

is the indicator function of the interval and is the th Bernoulli polynomial.

We now discuss a condition of the parametric estimator. Let be the true distribution of and let be the corresponding empirical distribution. The estimator of is defined as the functional form , where is a real valued function defined on the set of all distributions. We can then see that , where is defined as the optimizer of some distance measure . We assume that is the best approximation of . By the definition of , can be expressed as

 ^\boldmathβ−\boldmathβ0=1nn∑i=1I(Xi,Yi)+dn+δn, (4)

where is the influence function defined as

 I(X,Y)=limε→0{T((1−ε)F+εδ(X,Y))−T(F)ε}

with and finite covariance matrix, the delta function has probability 1 at a point , and is the bias of . The remaining term has mean for each component.

We investigate the asymptotic property of by a two-step procedure for clarity. First we derive the asymptotic expectation and variance of . Here, is the penalized spline smoother of . Second, we show that the difference between and vanishes asymptotically. Since is no longer stochastic, the asymptotic property of is dependent only on the nonparametric penalized spline estimator of . Hence we obtain

 E[^f0(x,γ)|\boldmathXn] = f(x|\boldmathβ0)+f(x|\boldmathβ0)γE[^rγ(x,\boldmathβ0)], V[^f0(x,γ)|\boldmathXn] = f(x|\boldmathβ0)2γV[^rγ(x,\boldmathβ0)].

Here for a random variable , and are the conditional expectation and variance of given . The asymptotic property of can be directly obtained by using Theorem 2 (a) of Claeskens et al. (2009).

###### Proposition 1.

Let . Then, under the Assumptions, for a fixed ,

 E[^f0(x,γ)|\boldmathXn] = f(x)+ba(x|\boldmathβ0,γ)+bλ(x|\boldmathβ0,γ)+oP(K−(p+1)n)+oP(λnKnn−1), V[^f0(x,γ)|\boldmathXn] = f(x|\boldmathβ0)2γn% \boldmathB(x)′G(q)−1G(σ,β0,γ,q)G(q)−1% \boldmathB(x)+oP(Knn−1),

where

 ba(x|\boldmathβ0,γ) = −f(x|\boldmathβ0)r(p+1)γ(x|%\boldmath$β$0)Kp+1n(p+1)!Kn∑j=1I(κj−1≤x<κj)Bp+1(x−κj−1K−1n), bλ(x|\boldmathβ0,γ) = −λnnf(x|\boldmathβ0)γ\boldmathB(x)′G(q)−1Qm\boldmathb∗(% \boldmathβ0,γ).

We now give the asymptotic result for . By using (4), and are expanded about and , respectively. From the details of the proof in the Appendix, we find that the asymptotic expectation and variance of are dominated by those of and we obtain the following theorem.

###### Theorem 1.

Let . Then under the Assumptions, for a fixed ,

 E[^f(x,γ)|\boldmathXn] = f(x)+ba(x|\boldmathβ0,γ)+bλ(x|\boldmathβ0,γ) +OP(n−1)+oP(K−(p+1)n)+oP(λnKnn−1), V[^f(x,γ)|\boldmathXn] = f(x|\boldmathβ0)2γn% \boldmathB(x)′G(q)−1G(σ,β0,γ,q)G(q)−1% \boldmathB(x)+oP(Knn−1),

where and are those given in Proposition 1.

Theorem 1 and Lyapunov’s theorem yield the asymptotic distribution of the SPSE.

###### Theorem 2.

Suppose that for some and the Assumptions are satisfied. Then, using and ,

 ^f(x,γ)−f(x)−ba(x|\boldmathβ0,γ)−bλ(x|\boldmathβ0,γ)√V[^f(x,γ)|\boldmathXn]D−→N(0,1),

where and are those given in Proposition 1.

If , we obtain the semiparametric regression spline estimator from (2). Thus, it is clear that the asymptotic result of the semiparametric regression spline is contained in Theorems 1 and 2. These are obtained from one parametric model. If we choose a polynomial model as , we obtain the following Corollary.

###### Corollary 1.

Let be the th polynomial model. Then, under and , or and , using , and equidistant knots, the SPSE is the same as the NPSE.

Remark 1  From Theorem 2, as the advanced analysis, we can construct the asymptotic pointwise confidence interval of by estimating the variance of the error.

Remark 2  Theorems 1 and 2 can be applied for . When , the results become those for additive correction. When , and the variance agrees with that of the estimator for multiplicative correction. In , it is understood that is a best approximation of . Therefore, can be written as , where is a best approximation of and is a vector with all components equal to 1. In conclusion, can be written as

 bλ(x|\boldmathβ0,1)=−λnnf(x|% \boldmathβ0)γ\boldmathB(x)′G(q)−1Qm\boldmathb∗

because all components of have vanished.

Remark 3  When is assumed, we obtain and by choosing as a best approximation of 0. For , in particular, and both hold even in cases where with any constant .

Remark 4  If we use the local th polynomial technique in the second step estimation, we obtain the asymptotic bias as

where is bandwidth and is the th order kernel function. If and are equal and is odd, the difference between and is only

 Kn∑j=1I(κj−1≤x<κj)Bp+1(x−κj−1K−1n)  and  ∫Rzp+1Hp(z)dz. (5)

If we can calculate (5), we would be able to compare the bias of the SPSE with that of the semiparametric local polynomial kernel estimator. As an example, when , it is easy to show that for , while we have for the Gaussian kernel and for the Epanechnikov kernel . Therefore is smaller than in this situation, which reveals that the SPSE is superior than the SLLE.

## 4 Parametric model selection

In this section, we describe how to choose a parametric model. From Remark 3, if the true regression function satisfies , the bias of the SPSE is reduced. Hence we determine the initial parametric model in a bias reduction context. Specifically, our purpose is to choose a parametric model such that the asymptotic bias of the SPSE becomes smaller than that of the NPSE:

 |ba(x|\boldmathβ0,γ)|<|ba(x)|   and   |bλ(x|\boldmathβ0,γ)|<|bλ(x)|, for all x∈(0,1), (6)

where and are the asymptotic biases of the NPSE. If is constant, and are equivalent to and , respectively. When the same and are used in both the SPSE and the NPSE, (6) can be rewritten as and for all , where

 La(x,γ)=|f(p+1)(x)|−∣∣ ∣∣f(x|\boldmathβ0)γ(f(x)−f(x|\boldmathβ0)f(x|% \boldmathβ0)γ)(p+1)∣∣ ∣∣

and

 Lλ(x,γ)=|\boldmathB(x)′G(q)−1Qm\boldmathb∗f|−|f(x|\boldmathβ0)γ\boldmathB(x)′G(q)−1Qm\boldmathb∗(% \boldmathβ0,γ)|,

where is a best approximation to . As a pilot estimator of and its th derivative, we can use the local polynomial estimator with degree . Then the estimator of and can be obtained as

 ^La(x,γ)=|^f(p+1)(x)|−∣∣ ∣∣f(x|^% \boldmathβ)γ⎛⎝^f(x)−f(x|^\boldmathβ)f(x|^\boldmathβ)γ⎞⎠(p+1)∣∣ ∣∣

and by using empirical form,

 ^Lλ(x,γ)=|\boldmathB(x)′Λ−1Qm(Z′Z)−1Z′^\boldmathf|−|f(x|^\boldmathβ)γ\boldmathB(x)′Λ−1Qm(Z′Z)−1Z′^\boldmathrγ|,

where and is an -vector with th component . Here, we use the fact that

 λnf(x|^\boldmathβ)γ% \boldmathB(x)′Λ−1Qm(Z′Z)−1Z′^\boldmathrγ=bλ(x|\boldmathβ0,γ)+oP(λnKnn−1),

which is detailed in the proof of Theorem 2 (a) of Claeskens et al. (2009). We choose one parametric model by relative evaluation. Let

 Ca∩λ(f(⋅|\boldmathβ))=#{zj∈(0,1)∣∣^La(zj,γ)>0,^Lλ(zj,γ)>0,j=1,⋯,J},

for a given parametric model and some finite grid points on . Here, for a set , is the cardinality of . After preparing a class of candidate parametric models , we choose a parametric model satisfying

 f(x|\boldmathβ)=argmaxfk{Ca∩λ(f(⋅|\boldmathβk))}. (7)

In summary, for each parametric model , we calculate , and . By using the parametric model which satisfies (7), we construct the SPSE. If we can choose a good parametric model and a good , the SPSE will have better behavior than the NPSE.

Remark 5  When we construct the semiparametric regression spline estimator (SPSE with ), we obtain . Therefore, depends only on .

Remark 6  We see that the bias term appears due to the use of the -spline model. On the other hand, arises from the penalty component. If we use the regression spline, vanishes and the bias of the estimator becomes less than that of the penalized spline estimator. However, the regression spline often provides overfitting. Thus, we use the penalized method for obtaining a smooth curve. If , a certain amount of smoothness in the estimator is assured. However, may grow too large because of the influence of the parametric model. Therefore under , we suggest choosing such that becomes less than . Hence, together with , the parametric model chosen by appears to bring fitness and smoothness to the SPSE.

## 5 Simulation

In this section, we examine the results of a numerical study to confirm the effects of the SPSE on a finite sample. We choose a parametric model by the criteria discussed in Section 4. We also compare the performance of the SPSE to those of the NPSE, the SLLE and the fully nonparametric local linear estimator (NLLE). In all situations, we utilize the linear and cubic splines and the second difference penalty for the second step nonparametric estimation. The SPSEs with linear and cubic splines are designated as SPSE1 and SPSE3, respectively. NPSE1 and NPSE3 are labeled similarly. The number of knots and the smoothing parameter are determined by GCV. The design points are drawn from a uniform density on and the errors are generated from the normal with mean 0 and variance . Let

 Ca = Ca(f(⋅|\boldmathβ))=#{zj∈(0,1)∣∣^La(zj,γ)>0,j=1⋯,J}, Cλ = Cλ(f(⋅|\boldmathβ))=#{zj∈(0,1)∣∣^Lλ(zj,γ)>0,j=1,⋯,J}, Ca∩λ = Ca∩λ(f(⋅|\boldmathβ))=#{zj∈(0,1)∣∣^La(zj,γ)>0,^Lλ(zj,γ)>0,j=1,⋯,J},

where . We prepare a class of candidate parametric models . For each , we calculate , and . We use a number of repetitions . For each iteration, we pick up from candidate models which maximize . The same manipulation is implemented for and . Finally we count the number of times that is picked up during the iterations. For comparison, we also show the model selection by using the AIC and the Takeuchi information criterion (TIC) detailed in Konishi and Kitagawa (2008).

Let

 Bj=1RR∑r=1^fr(zj)−f(zj),   Vj=1RR∑r=1{^fr(zj)−1RR∑r=1^fr(zj)}2,

where is the estimator for the th repetition. Let , and be the estimates of integrated squared bias, integrated variance and mean integrated squared error of , respectively. For comparison, the ISB, V and MISE of the SLLE and the NLLE were also calculated. In the SLLE and the NLLE, we used the Gaussian kernel and its bandwidth was obtained by the direct plug-in approach (Ruppert et al. (1995)).

Example 1  The true function is . We use three different specified parametric models:

The true curve can be approximated by sin. The curve poly1 is a rough model and poly3 is close to the true . The variance of the error is and the sample size is . The coefficients of the covariate are estimated by the maximum likelihood method for each model. This set-up is similar to that used by Glad (1998).

Table 1 includes the number of times that each parametric model was chosen based on each criterion. In , and , sin was selected in almost all iterations. This result is desirable because sin coincides with the true function . We also observe that the AIC and the TIC often choose sin. When the number of times sin is chosen is taken into consideration, it seems that is a better selector than the AIC and the TIC.

Results for ISB, V and MISE of the SPSE and the NPSE are given in Table 2. The SPSE with sin succeeds in regards to bias reduction even with a small sample size, and variance and MISE of the SPSE are also smaller than those of the NPSE. In additive correction, the result of SPSE1 with poly1 is exactly the same as that of the NPSE (see Corollary 1). If we use poly3, MISE of the SPSE is smaller than that of the NPSE, although the squared bias is somewhat larger in multiplicative correction. In both , V and MISE, the values of the SPSE are smaller than those of the SLLE. We implemented the same method of analysis for the case . The , V and MISE of the SPSE and those of the NPSE were almost the same, although these are not shown in this paper.

Example 2  The same true function used in Example 1 is adopted and the sample size is . A class of initial parametric models is chosen, consisting of th degree polynomials ranging from to and designated as poly1, …, poly6, respectively, and . This parametric model clearly does not contain the true and the estimator becomes unstable because the variance of error is relatively large.