Fully Bayesian Estimation and Variable Selection in Partially Linear Wavelet Models
Abstract
In this paper we propose a waveletbased methodology for estimation and variable selection in partially linear models. The inference is conducted in the wavelet domain, which provides a sparse and localized decomposition appropriate for nonparametric components with various degrees of smoothness. A hierarchical Bayes model is formulated on the parameters of this representation, where the estimation and variable selection is performed by a Gibbs sampling procedure. For both the parametric and nonparametric part of the model we are using pointmassatzero contamination priors with a double exponential spread distribution. Only a few papers in the area of partially linear wavelet models exist, and we show that the proposed methodology is often superior to the existing methods with respect to the task of estimating model parameters. Moreover, the method is able to perform Bayesian variable selection by a stochastic search for the parametric part of the model.
Keywords: Bayesian variable selection, Gibbs sampling, hierarchical model, mixture prior, sparsity, wavelet shrinkage.
1 Introduction
In the present paper we consider a novel Bayesian approach for solving the following regression problem
(1) 
where , , are equispaced sampling points, , , are known dimensional design points, is an unknown dimensional parameter vector, is an unknown and potentially nonsmooth function, and the random errors are i.i.d. normal, with zero mean and variance . The model can be written in matrixvector form as
(2) 
Our interest is to simultaneously estimate the unknown parameter vector and nonparametric function using the observations . Another task is to identify important (nonzero) components of , that is to perform dimension reduction via variable selection on .
The model in (1) is called a partially linear model (PLM) in the literature. Engle et al. (1986) were among the first to use PLM to analyze electricity sales data. The model is semiparametric in nature because it combines parametric (linear) and nonparametric parts. In this paper we consider a model with one nonparametric part in it. The monograph by Härdle et al. (2000) discusses the general PLM model extensively.
Several approaches are proposed in the literature to represent the nonparametric component of the model in (2). These all build on existing nonparametric regression techniques, such as the kernel method, the local linear method (local polynomial or trigonometric polynomial techniques), or splines. In the most recent papers, wavelets are used (Chang and Qu, 2004; Fadili and Bullmore, 2005; Qu, 2006; Gannaz, 2007; Ding et al., 2011), which allows the nonparametric component to be parsimoniously represented by a limited number of coefficients. The wavelet representation can include a wide variety of nonparametric parts, including nonsmooth signals, and reduces the bias in estimating the parametric component.
In this paper we consider the latter approach and use the wavelet decomposition to represent . We use the Bayesian approach to formulate a hierarchical model in the wavelet domain and estimate its parameters. Only a few papers used wavelets in the partially linear model context, and besides Qu (2006), all used a penalized least squares estimation procedure. Therefore, using a fully Bayesian approach can be of interest.
After applying a linear and orthogonal wavelet transform, the model in (1) becomes
(3) 
where , and are the wavelet coefficients (at resolution and location ) corresponding to , and , and , where is an orthogonal matrix implementing the wavelet transform. In a matrixvector form,
which becomes
(4) 
Note that because of the orthogonality of , . Due to the whitening property of wavelet transforms (Flandrin, 1992), we can assume independence of the coefficients . To estimate and in model (3) in a Bayesian fashion, we build on results from the Bayesian linear models and wavelet regression literature.
To estimate in a simple nonparametric regression model , Bayesian shrinkage rules have been proposed in the literature by many authors. By a shrinkage rule, we mean that the observed wavelet coefficients are replaced with their shrunken version . The traditional Bayesian models consider a prior distribution on a generic wavelet coefficient as
(5) 
where is a point mass at zero, is a symmetric about 0 and unimodal distribution, and is a fixed parameter in [0,1], usually level dependent, that controls the amount of shrinkage for values of close to 0. This type of model was considered by Abramovich et al. (1998), Vidakovic (1998), Vidakovic and Ruggeri (2001), and Johnstone and Silverman (2005), among others.
The mixture prior approach was also utilized in Bayesian estimation and variable selection of linear models, , where a mixture prior is specified on parameters . This type of model was considered for example by George and McCulloch (1993, 1997), and Yuan and Lin (2004, 2005). It is natural to combine these approaches; therefore, we build on these modeling ideas to formulate a fully Bayesian model in the partially linear model context.
The paper is organized as follows. Section 2 formalizes the Bayesian model and presents some results related to it. In Section 3 we explain the estimation through a Gibbs sampling procedure developed for the hierarchical model. Section 4 discusses the selection of hyperparameters, contains simulations and comparisons to existing methods, and discusses how variable selection can be performed. Conclusions and discussion are provided in Section 5.
2 Hierarchical Model
In this section we propose a hierarchical model in which we use a mixture prior approach for both the parametric and the nonparametric components of the partially linear model. Let us consider the following hierarchical Bayesian model for a partially linear model in the wavelet domain (4):
(6) 
where pertains to the resolution level of and , , , , and stand for the normal, inverse gamma, double exponential, Bernoulli, and uniform distributions, respectively. Index refers to the regression coefficients in . Note that is an indicator vector of binary elements; therefore, subscript indicates that only those columns or elements of and with the corresponding element of 1 are included.
Note that the model in (6) uses the wellestablished mixture prior on with a point mass at zero, which accounts for the sparsity of the nonparametric part in the wavelet domain. Wavelet coefficients with large magnitudes are captured by the spread part of the mixture prior, for which we propose the double exponential or Laplace distribution with variance . The double exponential distribution is a popular choice for the spread part. It models wavelet coefficients with large energies and was used by several authors, for example Vidakovic and Ruggeri (2001), and Johnstone and Silverman (2005). The mixture prior on is specified levelwise, for each dyadic level ; however, the scale parameter is global. This serves the purpose of parsimony and contributes to the ease of estimation. Here is a latent variable indicating whether our parameter is coming from a point mass at zero () or from a double exponential part (), with prior probability of or , respectively. For the prior probability we assume a “noninformative” uniform prior. The uniform (0,1) prior is equivalent to a beta distribution, which is a conjugate prior for the Bernoulli distribution.
In our model we naturally propose the same mixture prior to model the regression parameters . Yuan and Lin (2004, 2005) used this prior in the Bayesian variable selection context for linear models. In case the model forces and if then is modeled with a double exponential prior accommodating large regression coefficients. For the elements of binary vector we use the Bernoulli prior with common parameter . This prior assumes that each predictor enters the model independently with prior probability . Although it does not take into account the possible correlation between the predictors, this type of prior works well in practice, and it was used by George and McCulloch (1993) and George and Foster (2000), to name a few. Unlike George and McCulloch (1993), who prespecified , we introduce another level of hierarchy by assuming a uniform “noninformative” prior on . Since it is not clear, in general, how to specify , it makes sense to put a prior distribution on the parameter, instead of using , which is a common suggestion in practice. As opposed to the fully Bayesian approach, George and Foster (2000) used the empirical Bayes approach to estimate .
Parameter represents the common noise variance for each resolution level on which we specified a conjugate inverse gamma prior. Spread parameters and will be given priors after a reformulated version of the model (6) is discussed.
The hierarchical model in (6) is not conjugate; however, with additional transformations, derivations and computational techniques, it is possible to develop a fast Gibbs sampling algorithm for updating of its parameters. Note that a standard approach for handling the double exponential prior in Markov chain Monte Carlo (MCMC) computations of hierarchical models is to represent the double exponential distribution as a scale mixture of normal distributions (Andrews and Mallows, 1974). This approach is used for example in Bayesian LASSO variable selection, where the double exponential prior (without point mass) is used on the regression parameters (Park and Casella, 2008). Here we will only use the scale mixture approach for the double exponential prior on . This introduces an additional parameter corresponding to each , which needs to be updated. Using the scale mixture representation, the model in (6) becomes
(7) 
In the model above . If we integrate out s from (7), we get back the model in (6), which follows from the scale mixture representation of the double exponential distribution. For the spread parameters and , inverse gamma and gamma priors are specified in the model, which turn out to be conjugate.
For parameters it is possible to derive the full conditional distributions without resorting to the scale mixture representation. This improves the speed of the Gibbs sampling algorithm. In order to do this, we first discuss some results related to model (7), which are instrumental in developing the Gibbs sampler.
First let from which it follows that . In the following notation refers to an arbitrary and the mean stands for the corresponding . If we consider a likelihood and elicit a double exponential prior on the , the marginal distribution becomes
(8) 
and the posterior distribution of becomes
where and respectively denote the pdf and cdf of the standard normal distribution. For derivations of these results, see Appendix. From the representation in (2) we can see that the posterior distribution is a mixture of truncated normals, which will be utilized in the Gibbs sampling algorithm. If we consider the mixture prior on in (6), we obtain the posterior distribution as
(10)  
where is the normal distribution with mean and variance , and
(11)  
is the mixing weight. Thus, the posterior distribution of is a mixture of point mass at zero and a mixture of truncated normal distributions with mixing weight .
3 Gibbs sampling scheme
To conduct posterior inference on the parameters and , we adopt a standard Gibbs sampling procedure. Gibbs sampling is an iterative algorithm that simulates from a joint posterior distribution through iterative simulation of the full conditional distributions. For more details on Gibbs sampling see Casella and George (1992) or Robert and Casella (1999). For the model in (7), full conditionals for all parameters can be determined exactly. We build on results given as (2), (10) and results derived by Yuan and Lin (2004). Derivations of the results in this section are deferred to Appendix.
Next we will find full conditional distributions and updating schemes for parameters , , , , , , , , , and , which are necessary to run the Gibbs sampler. Specification of the hyperparameters , , , , and will be done in Section 4.1.
3.1 Updating , and
In each Gibbs sampling iteration we first update the block by updating and for , and then we generate for .
3.1.1 Updating and as a block
Here we follow the results of Yuan and Lin (2004) and we get
where
and
Note that
and
Here the notation and refers to vectors and without the element and indicates the column of matrix . Therefore, in the iteration of the Gibbs sampling, update as a Bernoulli random variable with probabilities given
Then it is straightforward to update as
Note that in the above equation in which we substitute , and . Also, is a point mass distribution at zero, which is equivalent to .
3.1.2 Updating
For the scale mixture of normals representation of the double exponential distribution, we placed an exponential prior on in model (7). We update depending on the value of the latent variable , whether comes from a point mass or a normal prior. The updating scheme for is
(14) 
where denotes the generalized inverse Gaussian distribution (Johnson et al., 1994, p.284) with probability density function
Here denotes the modified Bessel function of the third kind. Simulation of random variates is available through a MATLAB implementation “randraw” based on Dagpunar (1989).
3.2 Updating , , and
Using a conjugate prior on results in an inverse gamma full conditional distribution. Therefore, update as
(15) 
Parameter has a conjugate prior. This results in a full conditional distributed as beta,
(16) 
Similarly, parameter is given a conjugate prior, and the update is
(17) 
Note that other choices from the family are possible for the prior of and , similarly. However, we used the noninformative choice and to facilitate datadriven estimation of and .
Using a conjugate prior on also results in an inverse gamma full conditional distribution. This leads to an update for as
(18) 
where and denotes the sample size. and refer to the finest and coarsest levels in the wavelet decomposition, respectively.
3.3 Updating
3.4 Updating
We approach updating in a novel way. As we mentioned before, the common approach for handling the double exponential prior in hierarchical models is the scale mixture representation. This approach, however, introduces an additional parameter corresponding to each , which needs to be updated. This adds new parameters. A faster and more direct method to update is possible by using results in (2) and (10). From the definition of latent variable we can easily see that if , because for such , is distributed as point mass at zero. In case , follows a mixture of truncated normal distributions a posteriori. Therefore, the update for is as follows:
(20) 
where , is a point mass distribution at zero, and is a mixture of truncated normal distributions with the density provided in (2). Simulating random variables from is nonstandard, and regular builtin methods fail, because we need to simulate random variables from tails of normal distributions having extremely low probability. The implementation of the updating algorithm is based on vectorizing a fast algorithm proposed by Robert (1995).
3.5 Updating
The Gibbs updating scheme is completed with the discussion of how to update . In the hierarchical model (7), we impose a gamma prior on the scale parameter of the double exponential distribution. This turns out to be a conjugate problem; therefore, we update by
(21) 
Note that the gamma distribution above is parameterized by its scale parameter.
Now the derivation of the updating algorithm is complete. Implementation of the described Gibbs sampler requires simulation routines for standard distributions such as the gamma, inverse gamma, Bernoulli, beta, exponential, normal, and also specialized routines to simulate from truncated normal, and generalized inverse Gaussian. The procedure was implemented in MATLAB and available from the author.
The Gibbs sampling procedure can be summarized as

Choose initial values for parameters

Repeat steps (iii)  (xi) for

Update the block for

Update for

Update

Update

Update

Update for

Update for

Update for

Update .
Note that the updating steps of vectors , and are vectorized in the implementation, which considerably speeds up the computation.
4 Simulations
In this section, we apply the proposed Gibbs sampling algorithm and simulate posterior realizations for the model in (7). We will name our method GSWaPaLiM, which is an acronym for Gibbs Sampling Waveletbased Partially Linear Model (GSWaPaLiM) method. Within each simulation step 20,000 Gibbs sampling iterations were performed, of which the first 5,000 were used for burnin. We used the sample averages and as the usual estimator for the posterior mean. In our setup, .
In what follows, we first discuss the selection of the hyperparameters, then compare the estimation performance with other methods on two simulated examples. Finally, variable selection will be demonstrated on an example.
4.1 Selection of Hyperparameters
In any Bayesian modeling task, the selection of hyperparameters is critical for good performance of the model. It is also desirable to have a default choice of the hyperparameters which makes the procedure automatic.
In order to apply the GSWaPaLiM method, we only need to specify hyperparameters , , , , , and in the hyperprior distributions. The advantage of the fully Bayesian approach is that once the hyperpriors are set, the estimation of parameters , , , , , , , , , and is automatic via the Gibbs sampling algorithm. The selection is governed by the data and hyperprior distributions on the parameters. Another advantage is that the method is relatively robust to the choice of hyperparameters since they influence the model at a higher level of hierarchy.
Critical parameters with respect to the performance of the shrinkage are and , which control the strength of shrinkage of and to zero. In model (7), we placed a uniform prior on these parameters; therefore, the estimation will be governed mostly by the data, which provides a degree of adaptiveness. Parameter represents the probability that a predictor enters the model a priori. When a priori information is available, it can be incorporated into the model, however, this is rarely the case. In the wavelet regression context, Abramovich et al. (1998) estimated parameter by a theoretically justified but somewhat involved method, and in Vidakovic and Ruggeri (2001), the estimation of this parameter depends on another hyperparameter , which is elicited based on empirical evidence. The proposed method provides a better alternative because of its automatic adaptiveness to the underlying nonparametric part of the model.
Another efficient way to elicit the hyperparameters of the model is through the empirical Bayes method performing maximization of the marginal likelihood. This approach was followed by Qu (2006) in the context of estimating partially linear wavelet models. However, the likelihood function is nonconcave; therefore, clever optimization algorithm and carefully set starting values are crucial for the performance of this method. The same method of estimating hyperparameters was used for example by Clyde and George (1999) and Johnstone and Silverman (2005) in the wavelet regression context, and by George and Foster (2000) in the linear regression context. Note that for the mixture priors specified on the parametric and nonparametric parts in model (7) the empirical Bayes approach might not be computationally tractable; therefore, the fully Bayesian approach provides a good alternative.
Default specification of hyperparameters , , , , , and in model (7) is given by the following:

We set , and .

Then we compute naive estimators from the data
where is an estimator of the nonparametric part of model (2), and is the ordinary least squares estimator for , although computed from the raw partially linear data.

Then we set , so that the mean of the inverse gamma prior becomes . We use , which is the usual robust estimator of the noise variation in the wavelet shrinkage literature (Donoho and Johnstone, 1994). Here MAD stands for the median absolute deviation of the wavelet coefficients at the finest level of detail and the constant 0.6745 calibrates the estimator to be comparable with the sample standard deviation. Note that coefficients correspond to , therefore, .

After this we set , which sets the mean of the gamma prior on equal to an estimator of . This estimator is adopted from Vidakovic and Ruggeri (2001), where .

Finally we set , so that the mean of the inverse gamma prior is a prespecified value, . Results in the estimation of s turned out to be somewhat sensitive to for small sample size and small number of linear predictors. We used , which specified a prior on with large enough variance to work well in practice.
4.2 Simulations and Comparisons with Various Methods
In this section, we discuss the estimation performance of the proposed GSWaPaLiM method and compare it to three methods from the partially linear wavelet model literature. The first one is the wavelet Backfitting algorithm (BF) proposed by Chang and Qu (2004), the second one is the LEGEND algorithm proposed by Gannaz (2007) and the last one is the double penalized PLM wavelet estimator (DPPLM) by Ding et al. (2011). A Bayesian waveletbased algorithm for the same problem was proposed by Qu (2006). However, we found that the implementation of that algorithm is not robust to different simulated examples and initial values of the empirical Bayes procedure, therefore, we omitted it from our discussion.
The coarsest wavelet decomposition level was , as suggested from Antoniadis et al. (2001). Reconstruction of the theoretical signal was measured by the average mean squared error (AMSE), calculated as
where is the number of simulation runs, and are known values of the simulated functions considered. We denote by the estimator from the th simulation run. Note again, that in each of these simulation runs we perform 20,000 Gibbs sampling iterations in order to get the estimators and . Also note that , where . We also assess the performance in estimating the parametric part of the model by AMSE, calculated as
In the following simulation study we also used a modification of the wavelet Backfitting algorithm proposed by Chang and Qu (2004). The original algorithm, denoted as BF, uses as a soft threshold value in each iteration. In the modified algorithm we run the iterative algorithm a second time using the generalized crossvalidation threshold as in Jansen et al. (1997). This simple modification significantly improves the performance of the original algorithm. The method will be denoted as BFM in the sequel.
The procedure based on Gannaz (2007), denoted as LEGEND, is a wavelet thresholding based estimation procedure solved by the proposed LEGEND algorithm. The formulation of the problem is similar to the one in Chang and Qu (2004) and Fadili and Bullmore (2005), penalizing only the wavelet coefficients of the nonparametric part, but the solution is faster by recognizing the connection with Huber’s Mestimation of a standard linear model with outliers.
The algorithm by Ding et al. (2011) will be denoted as DPPLM in the simulations. The authors discuss several simulation results based on how the Lasso penalty parameter was chosen and whether the adaptive Lasso algorithm was used or not in the estimation procedure. It was reported that the GCV criteria with adaptive Lasso provided the smallest AMSE results, therefore, that version of the algorithm is used in the present simulations. We will refer to the method as DPPLMGCV in the future.
For comparison purposes we use two simulation examples, one from Qu (2006), and another one from Ding et al. (2011). We set the number of replications .
Example 1
The first example is based on an example in Qu (2006). The simulated data are generated from
where and with . The nonparametric test functions are , where Blocks, Bumps, Doppler and Heavisine. These are four standard test functions considered by Donoho and Johnstone (1994). We chose , , and to have reasonable signaltonoise ratios (SNR). The test functions were simulated at and points, and the nonparametric components were equally spaced in the unit interval. The standard wavelet bases were used: Symmlet 8 for Heavisine and Doppler, Daubechies 6 for Bumps and Haar for Blocks. The two columns of the design matrix were generated as independent random variables.
Results of the simulation are presented in Table 1. It can be seen that the proposed GSWaPaLiM method gives better AMSE and AMSE results in most test scenarios. It is apparent that the modified version of the Backfitting algorithm (BFM) provides better results than the original backfitting algorithm (BF). Note that an additional uncertainty results from estimating the noise variance , which was assumed to be known in the simulations by Chang and Qu (2004). provides comparable results to the algorithm, since both are using the same least squares formulation penalizing only the wavelet coefficients of the nonparametric part of the model. The solution algorithm and estimation of the noise is different in these methods. Note that boldface numbers indicate the smallest AMSE result for each test scenario.
Signal  N  Method  AMSE  AMSE  Signal  N  Method  AMSE  AMSE 

Blocks  64  GSWaPaLiM  0.6012  0.1179  Doppler  64  GSWaPaLiM  1.0332  0.2009 
BF  8.3137  0.5647  BF  5.0350  0.3366  
BFM  1.0670  0.1606  BFM  1.1988  0.1607  
LEGEND  7.0360  0.4970  LEGEND  4.8372  0.2965  
DPPLMGCV  0.8781  0.1415  DPPLMGCV  1.0435  0.1535  
128  GSWaPaLiM  0.3933  0.0284  128  GSWaPaLiM  0.4865  0.0363  
BF  3.9360  0.1268  BF  2.5299  0.0758  
BFM  0.6040  0.0368  BFM  0.6676  0.0390  
LEGEND  3.6166  0.1166  LEGEND  2.8607  0.0811  
DPPLMGCV  0.5955  0.0372  DPPLMGCV  0.6540  0.0393  
256  GSWaPaLiM  0.2547  0.0107  256  GSWaPaLiM  0.3727  0.0126  
BF  2.1635  0.0320  BF  1.7494  0.0264  
BFM  0.4488  0.0138  BFM  0.4880  0.0145  
LEGEND  1.9638  0.0290  LEGEND  1.8516  0.0261  
DPPLMGCV  0.4465  0.0140  DPPLMGCV  0.4854  0.0146  
512  GSWaPaLiM  0.1776  0.0048  512  GSWaPaLiM  0.2293  0.0050  
BF  1.2914  0.0098  BF  0.9649  0.0085  
BFM  0.3247  0.0056  BFM  0.3129  0.0057  
LEGEND  1.2862  0.0096  LEGEND  1.0617  0.0089  
DPPLMGCV  0.3252  0.0057  DPPLMGCV  0.3122  0.0057  
Bumps  64  GSWaPaLiM  0.7932  0.2136  Heavisine  64  GSWaPaLiM  0.4265  0.0714 
BF  8.8940  0.6099  BF  1.4997  0.0959  
BFM  1.7003  0.2479  BFM  0.5890  0.0628  
LEGEND  9.2580  0.5556  LEGEND  1.5195  0.1013  
DPPLMGCV  1.4129  0.2222  DPPLMGCV  0.5833  0.0642  
128  GSWaPaLiM  0.7265  0.0783  128  GSWaPaLiM  0.2834  0.0227  
BF  6.3618  0.2014  BF  0.4817  0.0234  
BFM  1.0466  0.0728  BFM  0.3526  0.0218  
LEGEND  7.5358  0.2046  LEGEND  1.0038  0.0336  
DPPLMGCV  0.9931  0.0716  DPPLMGCV  0.3544  0.0226  
256  GSWaPaLiM  0.5522  0.0177  256  GSWaPaLiM  0.1972  0.0099  
BF  4.0588  0.0571  BF  0.3603  0.0112  
BFM  0.7845  0.0247  BFM  0.2623  0.0105  
LEGEND  3.9100  0.0533  LEGEND  0.6559  0.0140  
DPPLMGCV  0.7688  0.0243  DPPLMGCV  0.2608  0.0107  
512  GSWaPaLiM  0.4317  0.0065  512  GSWaPaLiM  0.1310  0.0045  
BF  2.9271  0.0196  BF  0.2576  0.0049  
BFM  0.6022  0.0091  BFM  0.1758  0.0047  
LEGEND  2.9142  0.0189  LEGEND  0.4219  0.0058  
DPPLMGCV  0.5999  0.0090  DPPLMGCV  0.1756  0.0047 
Example 2
The second example is based on a simulation example from Ding
et al. (2011). The simulated data are generated from
where and with . The parametric part of the model is sparse, where only the first 4 regression variables are significant. The nonparametric test functions are , where PiecePoly given in Nason (1996) and Bumps. We chose and to have reasonable signaltonoise ratios (SNR). The test functions were simulated at and points, and Daubechies 8 wavelet base were used in both cases of the test functions. Rows of the design matrix were independently generated from 20dimensional multivariate normal distribution with zero mean vector, variance 1 and pairwise correlation coefficient between consecutive elements of the rows .
Results of the simulation are presented in Table 2. Note that boldface numbers indicate the smallest AMSE results for each test scenario. It can be seen that the proposed GSWaPaLiM method gives better AMSE and AMSE results in all test scenarios. In this example the parametric part of the model is sparse, therefore, the double penalized wavelet estimator is superior to the wavelet backfitting and algorithms, especially in estimating s. Since the true is a sparse vector, penalized estimation of the coefficients provides superior results as opposed to the , and