Smoothing splines with varying smoothing parameter
This paper considers the development of spatially adaptive smoothing splines for the estimation of a regression function with non-homogeneous smoothness across the domain. Two challenging issues that arise in this context are the evaluation of the equivalent kernel and the determination of a local penalty. The roughness penalty is a function of the design points in order to accommodate local behavior of the regression function. It is shown that the spatially adaptive smoothing spline estimator is approximately a kernel estimator. The resulting equivalent kernel is spatially dependent. The equivalent kernels for traditional smoothing splines are a special case of this general solution. With the aid of the Green’s function for a two-point boundary value problem, the explicit forms of the asymptotic mean and variance are obtained for any interior point. Thus, the optimal roughness penalty function is obtained by approximately minimizing the asymptotic integrated mean square error. Simulation results and an application illustrate the performance of the proposed estimator.
Some key words: Equivalent kernel; Green’s function; Nonparametric regression; Smoothing splines; Spatially adaptive smoothing.
Smoothing splines play a central role in nonparametric curve-fitting. Recent synopses include Wahba (1990), Eubank (1999), Gu (2002), and Eggermont & LaRiccia (2009). Specifically, consider the problem of estimating the mean function from a regression model
where the are the design points on , the are independent and identically-distributed random variables with zero mean and unit variance, is the variance function, and is the underlying true regression function. The traditional smoothing spline is formulated as the solution to the minimization of
where is the penalty parameter controlling the trade-off between the goodness-of-fit and smoothness of the fitted function. Smoothing splines have a solid theoretical foundation and are among the most widely used methods for nonparametric regression (Speckman, 1981; Cox, 1983).
The traditional smoothing spline model has a major deficiency: it uses a global smoothing parameter , so the degree of smoothness of remains about the same across the design points. This makes it difficult to efficiently estimate functions with non-homogeneous smoothness. Wahba (1995) suggested using a more general penalty term, which replaces the constant by a roughness penalty function . Since is then a function of , the model becomes adaptive in the sense that it accommodates the local behavior of and imposes a heavier penalty in the regions of lower curvature of . Pintore et al. (2006) used a piecewise constant approximation for but this requires specification of the number of knots, the knot locations, and the values of between these locations. Storlie et al. (2010) discussed some computational issues on spatially adaptive smoothing splines. Liu & Guo (2010) refined the piecewise constant idea and designed a data-driven algorithm to determine the optimal jump locations and sizes for . Besides adaptive smoothing splines, other adaptive methods have been developed, including variable-bandwidth kernel smoothing (Mller & Stadtmller, 1987), adaptive wavelet shrinkage (Donoho & Johnstone, 1994, 1995, 1998), local polynomials with variable bandwidth (Fan & Gijbels, 1996), local penalized splines (Ruppert & Carroll, 2000), regression splines (Friedman & Silverman, 1989; Stone et al., 1997; Luo & Wahba, 1997; Hansen & Kooperberg, 2002), and free-knot splines (Mao & Zhao, 2003). Further, Bayesian adaptive regression has also been reported by Smith & Kohn (1996), DiMatteo et al. (2001), and Wood et al. (2002). Nevertheless, adaptive smoothing splines have the advantages of computational efficiency and easy extension to multidimensional covariates using the smoothing spline analysis of variance technique (Wahba, 1990; Gu, 2002). Further, the results in the present paper can be extended to the more general L-spline smoothing (Kimeldorf & Wahba, 1971; Kohn & Ansley, 1983; Wahba, 1985). Also, the usual Reinsch scheme can be easily modified to the present case.
Let , where is the space of Lebesgue square integrable functions, endowed with its usual norm and inner product . The method of adaptive smoothing splines finds to minimize the functional
where is the penalty parameter, and denotes the adaptive penalty function; more properties of will be stated later. Here, we incorporate a function into the roughness penalty, which generalizes the traditional smoothing splines, where . A two-point boundary value problem technique has been developed to find the asymptotic mean squared error of the adaptive smoothing spline estimator with the aid of the Green’s function. Thus the optimal roughness penalty function is obtained explicitly by approximately minimizing the asymptotic integrated mean squared error. Asymptotic analysis of traditional smoothing splines using Green’s functions was performed by Rice & Rosenblatt (1983), Silverman (1984), Messer (1991), Nychka (1995), and Eggermont & LaRiccia (2009); an extension to certain adaptive splines was made in Abramovich & Grinshtein (1999). In contrast to these results, the current paper develops a general framework for asymptotic analysis of adaptive smoothing splines, yielding a systematic, yet relatively simpler, approach to obtaining closed-form expressions of equivalent kernels for interior points and to asymptotic analysis. Our estimate possesses the interpretation of spatial adaptivity (Donoho & Johnstone, 1998), and the equivalent kernel may vary in shape and bandwidth from point to point, depending on the data.
In this section, we derive the optimality conditions for the solution that minimizes the functional ( ‣ Smoothing splines with varying smoothing parameter). Let where is the indicator function, and let be a distribution function with a continuous and strictly positive density function on . For a function , define and subsequent norms likewise. Let . If the design points are equally spaced, with for . If are independent and identically distributed regressors from a distribution with bounded positive density , then by the law of the iterated logarithm for empirical distribution functions.
Let be a piecewise constant function such that (). For any and , define
Necessary and sufficient conditions for to minimize in ( ‣ Smoothing splines with varying smoothing parameter) are that
almost everywhere, and
Both and are piecewise constant in . Therefore is a piecewise th order polynomial. Thus, Theorem 1 shows that is a piecewise th order polynomial. The exact form of will depend on additional assumptions about . For example, Pintore et al. (2006) assumed to be piecewise-constant with possible jumps at a subset of the design points. Then, the optimal solution is a polynomial spline of order . It is well-known that the traditional smoothing spline is a natural spline of order , which corresponds to the case here when .
We establish an equivalent kernel and asymptotic distribution of the spatially adaptive smoothing splines at interior points using a two-point boundary value problem technique. The key idea is to represent the solution to ( ‣ 1) by a Green’s function. It will be shown that the adaptive smoothing spline estimator can be approximated by a kernel estimator, using this Green’s function.
Denote (). Specifically, when , it follows from Theorem 1 that
Write . Thus, solves the two-point boundary value problem
subject to the boundary conditions from ( ‣ 1):
The solution to ( ‣ Smoothing splines with varying smoothing parameter) can be obtained explicitly with the aid of the Green’s function. For readers unfamiliar with Green’s functions, operationally speaking, if is the Green’s function for
then will solve ( ‣ Smoothing splines with varying smoothing parameter). This, together with the boundary conditions ( ‣ Smoothing splines with varying smoothing parameter), yields the solution to the two-point boundary value problem in ( ‣ Smoothing splines with varying smoothing parameter) and ( ‣ Smoothing splines with varying smoothing parameter). The derivations of the Green’s function and discussions of the boundary conditions are given in the online Supplementary Material. Specifically, let be linearly independent solutions for the homogeneous differential equation
Then, in ( ‣ Smoothing splines with varying smoothing parameter) can be represented as
where the last term is due to the boundary conditions and the coefficients are shown to be unique and stochastically bounded for all sufficiently small in the Supplementary Material. Equation ( ‣ Smoothing splines with varying smoothing parameter) can be decomposed into three parts: the asymptotic mean ; the random component ; and the remainder term , where . It will be shown that has a smaller order and the remainder term is negligible in the asymptotic analysis. Taking the -th derivative point-wise on both sides of ( ‣ Smoothing splines with varying smoothing parameter) gives the crucial representation of the adaptive smoothing spline estimator. This gives
We now introduce the main assumptions of this paper:
The functions , , and are -times continuously differentiable and strictly positive.
The function is -times continuously differentiable.
The smoothing parameter as . Denote
Assume as .
The random errors have a finite fourth moment.
Assumption 3 ensures that the smoothing parameter tends to zero not too quickly. In particular, it encompasses the cases of equally spaced design variables and of independent and identically-distributed regressors from a distribution with bounded positive density. In the former case, and in the second case, . The optimal choice of discussed subsequently is of order and it is easy to check that it satisfies Assumption 3.
Eggermont & LaRiccia (2006) were the first to show in full generality that the standard spline smoothing corresponds approximately to smoothing by a kernel method. A simple explicit formula of the equivalent kernel for all , denoted by , is given by Berlinet & Thomas-Agnan (2004). For interior points, the kernel is of the form for some function , and is a -th order kernel on . In particular, the shape of is defined by and is the same for different . For example, the closed form expressions for the first two equivalent kernels are:
Theorem 2 indicates that the spatially adaptive smoothing spline estimator is also approximately a kernel regression estimator. The equivalent kernel is the corresponding Green’s function. As shown in the Supplementary Material,
is an increasing function of , and . This shows that the shape of varies with . Our estimator possesses the interpretation of spatial adaptivity (Donoho & Johnstone, 1998); it is asymptotically equivalent to a kernel estimator with a kernel that varies in shape and bandwidth from point to point.
The number in ( ‣ 1) plays a role similar to the bandwidth in kernel smoothing. Theorem 2 shows that the asymptotic mean has bias , which can be negligible if is reasonably small. On the other hand, cannot be arbitrarily small since that will inflate the random component. The admissible range for is a compromise between these two.
The proof of Corollary 1 is given in the Supplementary Material. The asymptotic mean squared error of the spatially adaptive smoothing spline estimator is of order , which is the optimal rate of convergence given in Stone (1982).
The optimal and are chosen to minimize the integrated asymptotic mean squared error
which is in fact a function of . We choose the optimal to be . The optimal roughness penalty function minimizes the functional
Without any further assumptions, the above minimization problem does not have an optimal solution, since any arbitrarily large and positive function with on any sub-interval of will make arbitrarily small. To deal with this problem, we first impose a technical assumption on .
The set has zero measure.
Let , , and be the -fold integral operator. Then and
for and some . Moreover, we can define to be any positive constant for all where . This definition is assumed in the subsequent development. Hence, the functional in ( ‣ Smoothing splines with varying smoothing parameter) becomes
where is defined by . We then introduce another technical assumption on , or essentially on .
There exist positive constants and such that and for all . And is Lebesgue integrable on .
Consider the following set in ,
where is given in ( ‣ Smoothing splines with varying smoothing parameter) dependent on . Further development in the Supplemental Material establishes the following theorem that the objective functional attains a unique minimum in . In fact, under the additional Assumptions 5 and 6, the theorem first shows the existence of an optimal solution. Moreover, since the objective functional is strictly convex and the constraint set is convex, the uniqueness of an optimal solution also follows.
Given the optimal solution , is bounded on due to its absolute continuity. The lower bound in Assumption 6 ensures that the optimal is bounded below from zero. However, there is no guarantee that the optimal is bounded above due to the possibility for small values of . To avoid this problem, one may impose an additional upper bound constraint in Assumption 6. The proof of existence and uniqueness remains the same.
Obtaining an explicit solution of ( ‣ Smoothing splines with varying smoothing parameter) is difficult. Motivated from Pintore et al. (2006), we consider approximating by a piecewise constant function such that for , . Here , and are interior adaptive smoothing knots whose selection will be described below. When the integral in ( ‣ Smoothing splines with varying smoothing parameter) is taken ignoring the non-differentiability at the jump points , we obtain
Therefore, the optimal is
Unfortunately, the optimal values for the depend on and the -th derivative of the underlying regression function . We replace them by estimates in practice.
Rigorously speaking, such a step-function approximation to is not a valid solution to ( ‣ Smoothing splines with varying smoothing parameter) due to non-differentiability. However, simulations seem to suggest that such a simple approximation can yield good results. Furthermore, one can modify such , for example, to make it satisfy Assumption 2. In a sufficiently small neighborhood of each jump point, one can replace the steps by a smooth curve connecting the two steps such that the resulting function satisfies Assumption 2. Hence the piecewise constant can be viewed as a simple approximation to this smooth version of .
We now describe the detailed steps for approximate implementation. The first step is to select the interior smoothing knots . An abrupt change in the smoothness of the function is often associated with a similar change in the conditional probability density of given . For example, a steeper part of the function often comes with sparser data, or smaller conditional probability densities of given . Hence, we first use the sscden function in the R package gss to estimate the conditional probability densities of given on a dense grid, say . Then with a given , we select the top where the conditional probability density changes the most from to . A more accurate but considerably more time-consuming way of selecting the smoothing knots is a binary tree search algorithm proposed in Liu & Guo (2010).
Estimation of was first studied by Müller & Stadtmüller (1987). In this paper, we use the local polynomial approach in Fan & Yao (1998); see Hall & Carroll (1989), Ruppert et al. (1997), and Cai & Wang (2008) for other methods. This provides the weights for obtaining a weighted smoothing spline estimate of , whose derivative yields an estimate of . The function can be replaced by an estimate of the density function of . All these computations can be conveniently carried out using the R packages locpol and gss.
Ideally, the optimal computed as above work well. However, similar to the finding in Storlie et al. (2010), we have found that a powered-up version for some can often help in practice. Intuitively, this power-up makes up a bit for the under-estimated differences in across the predictor domain.
For the tuning parameters and , we consider and . Theoretically a larger might be preferred due to the better approximation of such step functions to the real function. However, as shown in Pintore et al. (2006) and Liu & Guo (2010), an greater than 8 tends to overfit the data. The options for were suggested in Storlie et al. (2010). In traditional smoothing splines, smoothing parameters are selected by the generalized cross-validation (Craven & Wahba, 1979) or the generalized maximum likelihood estimate (Wahba, 1985). As pointed out in Pintore et al. (2006), a proper criterion for selecting the piecewise constant should penalize on the number of segments of . The generalized Akaike information criterion proposed in Liu & Guo (2010) serves this purpose, which is a penalized version of the generalized maximum likelihood estimate where is penalized similar to the degrees of freedom in the conventional Akaike information criterion. In this paper, we will use the generalized Akaike information criterion to select and .
Once the piecewise constant penalty function is determined, we compute the corresponding adaptive smoothing spline estimate as follows. By the representer theorem (Wahba, 1990), the minimizer of ( ‣ Smoothing splines with varying smoothing parameter) lies in a finite-dimensional space of functions
where and are unknown coefficients, for , and is the reproducing kernel function whose closed form expressions at with a piecewise-constant are given in Section 2.2 of Pintore et al. (2006). Plugging ( ‣ Smoothing splines with varying smoothing parameter) into ( ‣ Smoothing splines with varying smoothing parameter), we solve for and by the Newton–Raphson procedure with a fixed . Here can be selected by the generalized cross-validation or the generalized maximum likelihood estimate with the adaptive reproducing kernel function.
This section compares the estimation performance of different smoothing spline methods. For traditional smoothing splines, we used the cubic smoothing splines from the function ssanova in the R package gss and the smoothing parameter was selected by the generalized cross validation score. For the spatially adaptive smoothing splines in Pintore et al. (2006), we used an equally-spaced five-step penalty function following their implementation and the optimal penalty function was selected to minimize the generalized cross validation function (19) in their paper. For the Loco-Spline in Storlie et al. (2010), we downloaded the authors’ original program from the site of the Journal of Computational and Graphical Statistics: http://amstat.tandfonline.com/doi/suppl/10.1198/jcgs.2010.09020/ suppl_file/r-code.zip. For the proposed adaptive smoothing splines, we used and cubic smooth splines to compute the optimal ’s.
Two well-known functions with varying smoothness on the domain were considered under the model with . We used and in all the simulations and repeated each simulation on 100 randomly generated data replicates. The integrated square error and point-wise absolute errors at were used to evaluate the performance of an estimate . To visualize the comparison, we also selected for each example and each method a data replicate with the median performance as follows. The function estimates from each method yielded 100 integrated square errors. After ranking them from the lowest to the highest, we chose the 50th integrated square error and its corresponding data replicate to represent the median performance. We then plotted the function estimates from these selected data replicates in Fig. 1-2 to compare the median estimation performances for different methods. To assess variability in estimation, we also superimposed in these plots the point-wise empirical 0.025 and 0.975 quantiles of the 100 estimates.
We first consider data generated from the Heaviside function with . Based on the error summary statistics in Table 1, all the adaptive methods outperform the traditional smoothing splines, with our method and that in Pintore et al. (2006) displaying clear advantages in all the error measures. Furthermore, our method had the smallest mean integrated square error. This advantage is better illustrated by the plots in Fig. 1. While the median estimates from all the three adaptive methods tracked the true function reasonably well, the Loco-Spline estimates show greater variability than the other two adaptive methods in estimating the flat parts of the Heaviside function. Further, our method does the best job in tracking down the jump. The estimate of Pintore et al. (2006) can oscillate around the jump of the Heaviside function, probably because the equally-spaced jump points for suggested in their paper sometimes have difficulty in characterizing the jump in the true function. This echoes the finding in Liu & Guo (2010) that the jump locations of also need to be adaptive, a concept adopted in our method.
The second example is the Mexican hat function with , where is the density function of . From Table 1 and Fig. 2, the estimates from our method and the Loco-Spline have competitive performance and both outperform the traditional smoothing spline and those of Pintore et al. (2006). The estimates of Pintore et al. (2006) again suffer close to the hat.
For the estimates plotted in Fig. 1–2, we also plot the estimated log penalties for all the methods in Figure 3. In general, the penalty functions from the three adaptive methods track the smoothness changes in the underlying functions reasonably well.
In this section, we apply the proposed adaptive smoothing splines to an example on electroencephalograms of epilepsy patients (Liu & Guo, 2010). Previous research (Qin et al., 2009) has shown that the low voltage frequency band 26-50Hz is important in characterizing electroencephalograms and may help determine the spatial-temporal initiation of seizure. The left panel of Figure 4 shows the raw time-varying log-spectral band power of 26-50Hz calculated every half second for a 15-minute long intracranial electroencephalogram series. The sampling rate was 200Hz and the seizure onset was at the 8th minute (Litt et al, 2001). The raw band powers are always very noisy and need to be smoothed before further analysis. The middle panel shows the reconstructions from traditional smoothing splines and the proposed adaptive smoothing splines. We also tried the Loco-Spline but the program exited due to a singular matrix error.
Traditional smoothing splines clearly under-smooth the pre- and post-seizure regions and over-smooth the seizure period, because a single smoothing parameter is insufficient to capture the abrupt change before the onset of the seizure. Our estimate smoothes out the noise on both ends but keeps the details before the onset of seizure. In particular, we see a fluctuation in power starting from a minute or so before the onset of the seizure, which may be a meaningful predictor of seizure initiation. The band power then increases sharply at the beginning of the seizure. Around the 10th minute at the end of the seizure, the band power drops sharply to a level even lower than the pre-seizure level, an indication of the suppression of neuronal activities after seizure. Afterwards, the band power starts to regain. But it still fails to reach the pre-seizure level even at the end of the 15th minute. These findings concur with those in Liu & Guo (2010).
The proposed method took less than 10 minutes for the whole analysis, compared with 40-50 minutes for the method in Liu & Guo (2010). This is not surprising, since the latter not only needs a dense grid search to locate the jump points but also lacks good initial step sizes.
We are grateful to two referees and Associate Editor for constructive and insightful comments. We are also thankful to Wensheng Guo and Ziyue Liu for providing the electroencephalogram data, and Howard Bondell for help with the Loco-spline program. Xiao Wang’s research is supported by US NSF grants CMMI-1030246 and DMS-1042967 and Jinglai Shen’s research is supported by US NSF grants CMMI-1030804 and DMS-1042916.
Supplementary Material available at Biometrika online includes the proofs of Theorems 1-3 and Corollary 1, and the detailed derivation of the Green’s function.
Outline Proof of Theorem 1. For any and ,
The function minimizes in ( ‣ Smoothing splines with varying smoothing parameter) if and only if for all .
We first have
Similarly, for .
If satisfies , , then for all ,
Let and . Define and , where is the indicator function. Since for all , we have and , unless and are of measure zero. This shows that almost everywhere.
Outline Proof of Theorem 2. It follows from ( ‣ Smoothing splines with varying smoothing parameter) that , where
Let minimize the functional
Similar to Theorem 1, we have
Hence, . Taking the th derivative of both sides of (A0), we get
Recall that is times continuously differentiable and . Combining this with (A0), it is easy to show that as for . Therefore,
Assume that a function satisfies