Fully Bayesian Estimation and Variable Selection in Partially Linear Wavelet Models

Norbert Reményi
School of Industrial and Systems Engineering
Georgia Institute of Technology
Abstract

In this paper we propose a wavelet-based methodology for estimation and variable selection in partially linear models. The inference is conducted in the wavelet domain, which provides a sparse and localized decomposition suitable for nonparametric components with varying degrees of smoothness. A hierarchical Bayes model is formulated on the parameters of this representation, and estimation and variable selection are performed by a Gibbs sampling procedure. For both the parametric and the nonparametric part of the model we use point-mass-at-zero contamination priors with a double exponential spread distribution. Only a few papers in the area of partially linear wavelet models exist, and we show that the proposed methodology is often superior to the existing methods in estimating the model parameters. Moreover, the method performs Bayesian variable selection by a stochastic search over the parametric part of the model.


Keywords: Bayesian variable selection, Gibbs sampling, hierarchical model, mixture prior, sparsity, wavelet shrinkage.

1 Introduction

In the present paper we consider a novel Bayesian approach for solving the following regression problem:

$$y_i = x_i'\beta + f(t_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where $t_i$, $i = 1, \ldots, n$, are equispaced sampling points, $x_i$, $i = 1, \ldots, n$, are known $p$-dimensional design points, $\beta$ is an unknown $p$-dimensional parameter vector, $f$ is an unknown and potentially non-smooth function, and the random errors $\varepsilon_i$ are i.i.d. normal with zero mean and variance $\sigma^2$. The model can be written in matrix-vector form as

$$y = X\beta + f + \varepsilon, \qquad (2)$$

where $y = (y_1, \ldots, y_n)'$, $X = (x_1, \ldots, x_n)'$, $f = (f(t_1), \ldots, f(t_n))'$, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$.

Our interest is to simultaneously estimate the unknown parameter vector $\beta$ and the nonparametric function $f$ using the observations $(y_i, x_i, t_i)$, $i = 1, \ldots, n$. Another task is to identify the important (non-zero) components of $\beta$, that is, to perform dimension reduction via variable selection on $\beta$.

The model in (1) is called a partially linear model (PLM) in the literature. Engle et al. (1986) were among the first to use PLM to analyze electricity sales data. The model is semiparametric in nature because it combines parametric (linear) and nonparametric parts. In this paper we consider a model with one nonparametric part in it. The monograph by Härdle et al. (2000) discusses the general PLM model extensively.

Several approaches have been proposed in the literature to represent the nonparametric component of the model in (2). These all build on existing nonparametric regression techniques, such as the kernel method, the local linear method (local polynomial or trigonometric polynomial techniques), or splines. In more recent papers, wavelets are used (Chang and Qu, 2004; Fadili and Bullmore, 2005; Qu, 2006; Gannaz, 2007; Ding et al., 2011), which allows the nonparametric component to be parsimoniously represented by a limited number of coefficients. The wavelet representation can accommodate a wide variety of nonparametric parts, including non-smooth signals, and reduces the bias in estimating the parametric component.

In this paper we consider the latter approach and use the wavelet decomposition to represent the nonparametric component $f$. We use the Bayesian approach to formulate a hierarchical model in the wavelet domain and estimate its parameters. Only a few papers have used wavelets in the partially linear model context, and besides Qu (2006), all of them used a penalized least squares estimation procedure. Therefore, a fully Bayesian approach is of interest.

After applying a linear and orthogonal wavelet transform, the model in (1) becomes

$$d_{jk} = z_{jk}'\beta + \theta_{jk} + \eta_{jk}, \qquad (3)$$

where $d_{jk}$, $\theta_{jk}$ and $\eta_{jk}$ are the wavelet coefficients (at resolution $j$ and location $k$) corresponding to $y$, $f$ and $\varepsilon$, and $z_{jk}'$ is the corresponding row of $Z = WX$, where $W$ is an orthogonal matrix implementing the wavelet transform. In matrix-vector form,

$$Wy = WX\beta + Wf + W\varepsilon,$$

which becomes

$$d = Z\beta + \theta + \eta. \qquad (4)$$

Note that, because of the orthogonality of $W$, the transformed errors $\eta = W\varepsilon$ remain i.i.d. normal with zero mean and variance $\sigma^2$. Due to the whitening property of wavelet transforms (Flandrin, 1992), we can assume independence of the wavelet coefficients. To estimate $\beta$ and $\theta$ in model (3) in a Bayesian fashion, we build on results from the Bayesian linear models and wavelet regression literature.
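To make the transformation step concrete, the following is a minimal sketch in Python (not the author's MATLAB code), assuming the PyWavelets package and an orthogonal Daubechies filter; the signal `f`, design matrix `X`, and coefficient vector `beta` below are illustrative stand-ins.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
n, p = 256, 2                          # sample size (a power of 2) and number of predictors
t = np.arange(n) / n
f = np.sin(4 * np.pi * t)              # stand-in for the unknown nonparametric part
X = rng.standard_normal((n, p))        # design matrix of the linear part
beta = np.array([1.5, -2.0])
y = X @ beta + f + 0.3 * rng.standard_normal(n)

def dwt_vector(v, wavelet="db6", level=None):
    """Orthogonal DWT of a vector, returned as a single coefficient vector."""
    coeffs = pywt.wavedec(v, wavelet, mode="periodization", level=level)
    return np.concatenate(coeffs)

# Wavelet-domain version of the partially linear model: d = Z beta + theta + noise
d = dwt_vector(y)                                              # transformed response
Z = np.column_stack([dwt_vector(X[:, j]) for j in range(p)])   # transformed design
theta = dwt_vector(f)                                          # wavelet coefficients of f
```

Because the transform is applied columnwise to $X$ as well, the linear structure $d = Z\beta + \theta + \eta$ is preserved exactly, which is what makes estimation in the wavelet domain convenient.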

To estimate $f$ in a simple nonparametric regression model $y_i = f(t_i) + \varepsilon_i$, Bayesian shrinkage rules have been proposed in the literature by many authors. By a shrinkage rule we mean that the observed wavelet coefficients $d_{jk}$ are replaced with their shrunken versions $\hat\theta_{jk}$. The traditional Bayesian models consider a prior distribution on a generic wavelet coefficient $\theta$ of the form

$$\pi(\theta) = \epsilon\, \delta_0 + (1 - \epsilon)\, \xi(\theta), \qquad (5)$$

where $\delta_0$ is a point mass at zero, $\xi$ is a distribution symmetric about 0 and unimodal, and $\epsilon$ is a fixed parameter in $[0,1]$, usually level dependent, that controls the amount of shrinkage for values of $d_{jk}$ close to 0. This type of model was considered by Abramovich et al. (1998), Vidakovic (1998), Vidakovic and Ruggeri (2001), and Johnstone and Silverman (2005), among others.

The mixture prior approach was also utilized in Bayesian estimation and variable selection for linear models, $y = X\beta + \varepsilon$, where a mixture prior is specified on the parameters $\beta_i$. This type of model was considered, for example, by George and McCulloch (1993, 1997) and Yuan and Lin (2004, 2005). It is natural to combine these approaches; therefore, we build on these modeling ideas to formulate a fully Bayesian model in the partially linear model context.

The paper is organized as follows. Section 2 formalizes the Bayesian model and presents some results related to it. In Section 3 we explain the estimation through a Gibbs sampling procedure developed for the hierarchical model. Section 4 discusses the selection of hyperparameters, contains simulations and comparisons to existing methods, and discusses how variable selection can be performed. Conclusions and discussion are provided in Section 5.

2 Hierarchical Model

In this section we propose a hierarchical model in which we use a mixture prior approach for both the parametric and the nonparametric components of the partially linear model. Let us consider the following hierarchical Bayesian model for a partially linear model in the wavelet domain (4):

(6)

where $j$ pertains to the resolution level of the wavelet coefficients, and $\mathcal{N}$, $\mathcal{IG}$, $\mathcal{DE}$, $\mathcal{B}er$, and $\mathcal{U}$ stand for the normal, inverse gamma, double exponential, Bernoulli, and uniform distributions, respectively. Index $i$ refers to the regression coefficients in $\beta$. Note that $\gamma$ is an indicator vector of binary elements; therefore, subscript $\gamma$ indicates that only those columns or elements of $Z$ and $\beta$ whose corresponding element of $\gamma$ equals 1 are included.

Note that the model in (6) uses the well-established mixture prior on the wavelet coefficients $\theta_{jk}$, with a point mass at zero, which accounts for the sparsity of the nonparametric part in the wavelet domain. Wavelet coefficients with large magnitudes are captured by the spread part of the mixture prior, for which we propose the double exponential, or Laplace, distribution. The double exponential distribution is a popular choice for the spread part. It models wavelet coefficients with large energies and was used by several authors, for example Vidakovic and Ruggeri (2001) and Johnstone and Silverman (2005). The mixture prior on $\theta_{jk}$ is specified levelwise, for each dyadic level $j$; however, the scale parameter is global. This serves the purpose of parsimony and contributes to the ease of estimation. A latent indicator variable specifies whether a given coefficient comes from the point mass at zero (indicator equal to 0) or from the double exponential part (indicator equal to 1), with prior probabilities given by the level-dependent mixing weight. For this prior probability we assume a “noninformative” uniform prior. The uniform (0,1) prior is equivalent to a Beta(1,1) distribution, and the beta distribution is a conjugate prior for the Bernoulli distribution.

In our model we naturally propose the same type of mixture prior to model the regression parameters $\beta_i$. Yuan and Lin (2004, 2005) used this prior in the Bayesian variable selection context for linear models. When $\gamma_i = 0$ the model forces $\beta_i = 0$, and if $\gamma_i = 1$ then $\beta_i$ is modeled with a double exponential prior accommodating large regression coefficients. For the elements of the binary vector $\gamma$ we use a Bernoulli prior with a common success probability. This prior assumes that each predictor enters the model independently with the same prior probability. Although it does not take into account possible correlation between the predictors, this type of prior works well in practice, and it was used by George and McCulloch (1993) and George and Foster (2000), to name a few. Unlike George and McCulloch (1993), who prespecified this probability, we introduce another level of hierarchy by assuming a uniform “noninformative” prior on it. Since it is not clear, in general, how to specify the probability, it makes sense to put a prior distribution on the parameter instead of using a fixed value (such as 1/2), which is a common suggestion in practice. As opposed to the fully Bayesian approach, George and Foster (2000) used an empirical Bayes approach to estimate this parameter.

The parameter $\sigma^2$ represents the noise variance, which is common to all resolution levels, and on which we specify a conjugate inverse gamma prior. The spread parameters of the two double exponential priors will be given their own priors after a reformulated version of model (6) is discussed.

The hierarchical model in (6) is not conjugate; however, with additional transformations, derivations and computational techniques, it is possible to develop a fast Gibbs sampling algorithm for updating its parameters. Note that a standard approach for handling the double exponential prior in Markov chain Monte Carlo (MCMC) computations of hierarchical models is to represent the double exponential distribution as a scale mixture of normal distributions (Andrews and Mallows, 1974). This approach is used, for example, in Bayesian LASSO variable selection, where the double exponential prior (without a point mass) is placed on the regression parameters (Park and Casella, 2008). Here we will only use the scale mixture approach for the double exponential prior on $\beta$. This introduces an additional latent scale parameter $\tau_i^2$ corresponding to each $\beta_i$, which needs to be updated. Using the scale mixture representation, the model in (6) becomes

(7)

If we integrate out the $\tau_i^2$'s from (7), we get back the model in (6), which follows from the scale mixture representation of the double exponential distribution. For the two spread parameters, inverse gamma and gamma priors are specified in the model, and these turn out to be conjugate.
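For reference, one common form of this representation (used, e.g., by Park and Casella, 2008) is the following; the exact scaling in (7) may differ, so this should be read as an illustrative parameterization:

$$
\beta_i \mid \tau_i^2 \sim \mathcal{N}(0, \tau_i^2), \qquad
\tau_i^2 \sim \mathrm{Exp}\!\Big(\frac{\lambda_\beta^2}{2}\Big)
\;\;\Longrightarrow\;\;
p(\beta_i) = \int_0^\infty \mathcal{N}(\beta_i \mid 0,\tau_i^2)\,\frac{\lambda_\beta^2}{2}\,
e^{-\lambda_\beta^2 \tau_i^2/2}\, d\tau_i^2
= \frac{\lambda_\beta}{2}\, e^{-\lambda_\beta |\beta_i|},
$$

where the exponential distribution is written in its rate parameterization.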

For the wavelet coefficients $\theta_{jk}$ it is possible to derive the full conditional distributions without resorting to the scale mixture representation. This improves the speed of the Gibbs sampling algorithm. In order to do this, we first discuss some results related to model (7), which are instrumental in developing the Gibbs sampler.

First let $\tilde d_{jk} = d_{jk} - z_{jk}'\beta$, from which it follows that $\tilde d_{jk} \mid \theta_{jk} \sim \mathcal{N}(\theta_{jk}, \sigma^2)$. In the following, $\theta$ refers to an arbitrary $\theta_{jk}$ and $\tilde d$ stands for the corresponding $\tilde d_{jk}$. If we consider the likelihood $\tilde d \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ and elicit a double exponential prior on $\theta$, the marginal distribution becomes

(8)

and the posterior distribution of $\theta$ becomes

(9)

where $\phi$ and $\Phi$ respectively denote the pdf and cdf of the standard normal distribution. For derivations of these results, see the Appendix. From the representation in (9) we can see that the posterior distribution is a mixture of truncated normals, which will be utilized in the Gibbs sampling algorithm. If we consider the mixture prior on $\theta$ in (6), we obtain the posterior distribution as

(10)

where is the normal distribution with mean and variance , and

(11)

is the mixing weight. Thus, the posterior distribution of $\theta$ is a mixture of a point mass at zero and a mixture of truncated normal distributions, with the mixing weight given in (11).
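For completeness, we record the standard normal–double exponential calculations behind (8) and (9) in one common parameterization; the notation (in particular the scaling of the double exponential prior) is ours and may differ from that used in the Appendix. With likelihood $\tilde d \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ and prior $p(\theta) = \tfrac{\lambda}{2} e^{-\lambda|\theta|}$,

$$
m(\tilde d) = \frac{\lambda}{2}\, e^{\lambda^2\sigma^2/2}
\left[ e^{-\lambda \tilde d}\, \Phi\!\Big(\frac{\tilde d - \lambda\sigma^2}{\sigma}\Big)
+ e^{\lambda \tilde d}\, \Phi\!\Big(-\frac{\tilde d + \lambda\sigma^2}{\sigma}\Big) \right],
$$

and the posterior of $\theta$ is a two-component mixture: with weight proportional to $e^{-\lambda \tilde d}\,\Phi\big((\tilde d - \lambda\sigma^2)/\sigma\big)$ it is a $\mathcal{N}(\tilde d - \lambda\sigma^2, \sigma^2)$ distribution truncated to $(0, \infty)$, and with weight proportional to $e^{\lambda \tilde d}\,\Phi\big(-(\tilde d + \lambda\sigma^2)/\sigma\big)$ it is a $\mathcal{N}(\tilde d + \lambda\sigma^2, \sigma^2)$ distribution truncated to $(-\infty, 0)$.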

3 Gibbs sampling scheme

To conduct posterior inference on the parameters $\beta$ and $\theta$, we adopt a standard Gibbs sampling procedure. Gibbs sampling is an iterative algorithm that simulates from a joint posterior distribution through iterative simulation of the full conditional distributions. For more details on Gibbs sampling see Casella and George (1992) or Robert and Casella (1999). For the model in (7), the full conditionals of all parameters can be determined exactly. We build on the results given in (9) and (10) and on results derived by Yuan and Lin (2004). Derivations of the results in this section are deferred to the Appendix.

Next we find the full conditional distributions and updating schemes for all model parameters, which are necessary to run the Gibbs sampler. Specification of the hyperparameters will be done in Section 4.1.

3.1 Updating $\gamma$, $\beta$ and $\tau^2$

In each Gibbs sampling iteration we first update the block $(\gamma, \beta)$ by updating $\gamma_i$ and $\beta_i$ for each linear predictor $i$, and then we generate the latent scale $\tau_i^2$ for each $i$.

3.1.1 Updating $\gamma_i$ and $\beta_i$ as a block

Here we follow the results of Yuan and Lin (2004) and we get

where

and

Note that

and

Here the notation $\beta_{-i}$ and $\gamma_{-i}$ refers to the vectors $\beta$ and $\gamma$ without the $i$th element, and $Z_i$ indicates the $i$th column of the matrix $Z$. Therefore, in each iteration of the Gibbs sampler, update $\gamma_i$ as a Bernoulli random variable with the probabilities given above.

Then it is straightforward to update as

Note that in the above equation in which we substitute , and . Also, is a point mass distribution at zero, which is equivalent to .

3.1.2 Updating $\tau_i^2$

For the scale mixture of normals representation of the double exponential distribution, we placed an exponential prior on $\tau_i^2$ in model (7). We update $\tau_i^2$ depending on the value of the latent variable $\gamma_i$, that is, on whether $\beta_i$ comes from the point mass or from the normal prior. The updating scheme for $\tau_i^2$ is

(14)

where $\mathcal{GIG}$ denotes the generalized inverse Gaussian distribution (Johnson et al., 1994, p. 284) with probability density function

Here $K$ denotes the modified Bessel function of the third kind. Simulation of generalized inverse Gaussian random variates is available through the MATLAB implementation “randraw”, based on Dagpunar (1989).
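In Python, an equivalent of the randraw routine is available through scipy.stats.geninvgauss. The following is a small sketch (not the author's implementation), assuming a GIG density proportional to $x^{p-1}\exp\{-(a x + b/x)/2\}$; the parameter names are illustrative.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(p, a, b, size=1, rng=None):
    """Draw from the generalized inverse Gaussian distribution with density
    proportional to x**(p - 1) * exp(-(a*x + b/x) / 2), for a, b > 0."""
    # scipy's geninvgauss(p, c) has density proportional to
    # x**(p - 1) * exp(-c * (x + 1/x) / 2); rescaling by sqrt(b/a) matches (a, b).
    c = np.sqrt(a * b)
    scale = np.sqrt(b / a)
    return geninvgauss.rvs(p, c, scale=scale, size=size, random_state=rng)

draws = sample_gig(p=-0.5, a=2.0, b=1.0, size=5)
```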

3.2 Updating , , and

Using a conjugate prior on results in an inverse gamma full conditional distribution. Therefore, update as

(15)

Parameter has a conjugate prior. This results in a full conditional distributed as beta,

(16)

Similarly, parameter is given a conjugate prior, and the update is

(17)

Note that other choices from the beta family are possible for these priors. However, we used the noninformative uniform choice to facilitate data-driven estimation of both parameters.
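The beta–Bernoulli conjugacy behind (16) and (17) is standard; in generic notation (not necessarily that of the paper), if $s$ of $m$ independent indicators equal one and the success probability $w$ has a $\mathrm{Beta}(a, b)$ prior, then

$$
w \mid \text{indicators} \;\sim\; \mathrm{Beta}(a + s,\; b + m - s),
$$

so the uniform prior corresponds to $a = b = 1$. The update for the Bernoulli parameter of the selection indicators uses the counts over the linear predictors, while the levelwise update for each mixing weight uses the counts over the wavelet coefficients at level $j$.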

Using a conjugate prior on also results in an inverse gamma full conditional distribution. This leads to an update for as

(18)

where $n$ denotes the sample size, and the two level indices refer to the finest and coarsest levels in the wavelet decomposition, respectively.
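The normal–inverse gamma conjugacy behind these updates has the following generic form (notation ours): if residuals $r_1, \ldots, r_m$ satisfy $r_i \mid v \sim \mathcal{N}(0, v)$ independently and $v \sim \mathcal{IG}(a, b)$, then

$$
v \mid r_1, \ldots, r_m \;\sim\; \mathcal{IG}\!\Big(a + \frac{m}{2},\; b + \frac{1}{2}\sum_{i=1}^m r_i^2\Big).
$$

One of (15) and (18) corresponds to the noise variance, for which the $r_i$'s are the wavelet-domain residuals left after subtracting the parametric and nonparametric fits; the other corresponds to the spread parameter with the inverse gamma prior, with the appropriate quantities from model (7) in place of the $r_i$'s.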

3.3 Updating

We saw in model (7) that the latent indicator attached to each wavelet coefficient has a Bernoulli prior. Its full conditional distribution remains Bernoulli, with the parameter given in (11). Thus, the latent variable is updated as follows:

where .

3.4 Updating $\theta_{jk}$

We approach updating the wavelet coefficients $\theta_{jk}$ in a novel way. As mentioned before, the common approach for handling the double exponential prior in hierarchical models is the scale mixture representation. This approach, however, would introduce an additional latent scale parameter corresponding to each $\theta_{jk}$, which would need to be updated, adding a large number of new parameters. A faster and more direct method to update $\theta_{jk}$ is possible by using the results in (9) and (10). From the definition of the latent indicator we can easily see that $\theta_{jk} = 0$ whenever the indicator equals 0, because in that case $\theta_{jk}$ is distributed as a point mass at zero. When the indicator equals 1, $\theta_{jk}$ follows a mixture of truncated normal distributions a posteriori. Therefore, the update for $\theta_{jk}$ is as follows:

(20)

where $\delta_0$ is a point mass distribution at zero and the second component is a mixture of truncated normal distributions with the density provided in (9). Simulating random variables from this mixture is nonstandard, and regular built-in methods fail, because we need to simulate random variables from tails of normal distributions having extremely low probability. The implementation of the updating algorithm is based on vectorizing a fast algorithm proposed by Robert (1995).
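A minimal Python sketch of the one-sided tail sampler of Robert (1995), which underlies this step (the vectorized MATLAB implementation used in the paper is not shown; function names are illustrative):

```python
import numpy as np

def rtail_std_normal(a, rng):
    """Draw Z ~ N(0, 1) conditioned on Z >= a, for a > 0, by the
    translated-exponential rejection sampler of Robert (1995)."""
    alpha = 0.5 * (a + np.sqrt(a * a + 4.0))      # optimal exponential rate
    while True:
        z = a + rng.exponential(1.0 / alpha)       # proposal from a + Exp(alpha)
        if rng.random() <= np.exp(-0.5 * (z - alpha) ** 2):
            return z

def rtruncnorm_lower(mu, sigma, lower, rng=None):
    """Draw from N(mu, sigma^2) truncated to [lower, inf)."""
    rng = np.random.default_rng() if rng is None else rng
    a = (lower - mu) / sigma
    if a <= 0.0:                                   # bound not in the tail: naive rejection suffices
        while True:
            z = rng.standard_normal()
            if z >= a:
                return mu + sigma * z
    return mu + sigma * rtail_std_normal(a, rng)

x = rtruncnorm_lower(mu=0.0, sigma=1.0, lower=4.5)   # far-tail draw, stable even here
```

Draws truncated to an upper tail follow by symmetry, and the two pieces can then be combined according to the mixture weights in (9).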

3.5 Updating the double exponential scale parameter

The Gibbs updating scheme is completed with a discussion of how to update the scale parameter of the double exponential prior. In the hierarchical model (7), we impose a gamma prior on this parameter. This turns out to be a conjugate problem; therefore, we update it by

(21)

Note that the gamma distribution above is parameterized by its scale parameter.
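In generic shape–rate notation (which differs from the scale parameterization used in (21)), the conjugacy works as follows: if the active coefficients $\theta_{jk}$ (those not set to zero) have double exponential densities $\tfrac{\lambda}{2} e^{-\lambda |\theta_{jk}|}$ and $\lambda \sim \mathrm{Gamma}(a, b)$ with rate $b$, then

$$
\lambda \mid \cdots \;\sim\; \mathrm{Gamma}\!\Big(a + N_1,\; b + \sum_{(j,k):\, \theta_{jk} \neq 0} |\theta_{jk}|\Big),
$$

where $N_1$ is the number of active coefficients; the analogous computation applies if the prior is instead attached to the parametric part, and translating to the scale parameterization simply inverts the second argument.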

Now the derivation of the updating algorithm is complete. Implementation of the described Gibbs sampler requires simulation routines for standard distributions such as the gamma, inverse gamma, Bernoulli, beta, exponential, and normal, as well as specialized routines to simulate from the truncated normal and generalized inverse Gaussian distributions. The procedure was implemented in MATLAB and is available from the author.

The Gibbs sampling procedure can be summarized as

  1. Choose initial values for parameters

  2. Repeat steps 3–11 for each iteration of the sampler:

  3. Update the block for

  4. Update for

  5. Update

  6. Update

  7. Update

  8. Update for

  9. Update for

  10. Update for

  11. Update .

Note that the updating steps of the coefficient vectors are vectorized in the implementation, which considerably speeds up the computation.

4 Simulations

In this section, we apply the proposed Gibbs sampling algorithm and simulate posterior realizations for the model in (7). We name our method GS-WaPaLiM, an acronym for Gibbs Sampling Wavelet-based Partially Linear Model. Within each simulation run, 20,000 Gibbs sampling iterations were performed, of which the first 5,000 were used as burn-in. We used the sample averages of the retained draws as the usual estimators of the posterior means.
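A small illustration of how stored draws are turned into estimates (the arrays below are dummy stand-ins for actual Gibbs output):

```python
import numpy as np

rng = np.random.default_rng(1)
n_iter, burn_in, p = 20_000, 5_000, 4

# Stand-ins for stored Gibbs output (rows = iterations); in practice these
# arrays are filled by the sampler of Section 3.
beta_draws = 0.05 * rng.standard_normal((n_iter, p)) + np.array([1.0, 0.0, -0.5, 0.0])
gamma_draws = rng.binomial(1, [0.95, 0.10, 0.90, 0.05], size=(n_iter, p))

keep = slice(burn_in, None)
beta_hat = beta_draws[keep].mean(axis=0)     # posterior-mean estimates of the regression coefficients
incl_prob = gamma_draws[keep].mean(axis=0)   # estimated posterior inclusion probabilities
```

The same averaging over retained iterations is applied to the wavelet coefficients, which are then inverse-transformed to obtain the estimate of the nonparametric part.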

In what follows, we first discuss the selection of the hyperparameters, then compare the estimation performance with other methods on two simulated examples. Finally, variable selection will be demonstrated on an example.

4.1 Selection of Hyperparameters

In any Bayesian modeling task, the selection of hyperparameters is critical for good performance of the model. It is also desirable to have a default choice of the hyperparameters which makes the procedure automatic.

In order to apply the GS-WaPaLiM method, we only need to specify the hyperparameters of the hyperprior distributions. The advantage of the fully Bayesian approach is that once the hyperpriors are set, the estimation of all model parameters is automatic via the Gibbs sampling algorithm. The selection is governed by the data and the hyperprior distributions on the parameters. Another advantage is that the method is relatively robust to the choice of hyperparameters, since they influence the model at a higher level of the hierarchy.

Critical parameters with respect to the performance of the shrinkage are the level-dependent mixing weights and the Bernoulli parameter of the selection indicators, which control the strength of shrinkage of $\theta_{jk}$ and $\beta_i$ to zero. In model (7) we placed uniform priors on these parameters; therefore, their estimation is governed mostly by the data, which provides a degree of adaptiveness. The Bernoulli parameter represents the probability that a predictor enters the model a priori. When a priori information is available, it can be incorporated into the model; however, this is rarely the case. In the wavelet regression context, Abramovich et al. (1998) estimated the mixing weight by a theoretically justified but somewhat involved method, and in Vidakovic and Ruggeri (2001) the estimation of this parameter depends on another hyperparameter, which is elicited based on empirical evidence. The proposed method provides a better alternative because of its automatic adaptiveness to the underlying nonparametric part of the model.

Another efficient way to elicit the hyperparameters of the model is through the empirical Bayes method, performing maximization of the marginal likelihood. This approach was followed by Qu (2006) in the context of estimating partially linear wavelet models. However, the likelihood function is nonconcave; therefore, a clever optimization algorithm and carefully chosen starting values are crucial for the performance of this method. The same method of estimating hyperparameters was used, for example, by Clyde and George (1999) and Johnstone and Silverman (2005) in the wavelet regression context, and by George and Foster (2000) in the linear regression context. Note that for the mixture priors specified on the parametric and nonparametric parts in model (7), the empirical Bayes approach might not be computationally tractable; therefore, the fully Bayesian approach provides a good alternative.

A default specification of the hyperparameters in model (7) is given by the following:

  • We set , and .

  • Then we compute naive estimators from the data

    where the former is an estimator of the nonparametric part of model (2) and the latter is the ordinary least squares estimator of $\beta$, although computed from the raw partially linear data.

  • Then we set the inverse gamma hyperparameters so that the mean of the inverse gamma prior matches $\hat\sigma^2$, where $\hat\sigma = \mathrm{MAD}/0.6745$ is the usual robust estimator of the noise level in the wavelet shrinkage literature (Donoho and Johnstone, 1994). Here MAD stands for the median absolute deviation of the wavelet coefficients at the finest level of detail, and the constant 0.6745 calibrates the estimator to be comparable with the sample standard deviation (see the sketch after this list).

  • After this we set the gamma hyperparameters so that the mean of the gamma prior on the double exponential scale parameter equals an estimator of that scale; this estimator is adopted from Vidakovic and Ruggeri (2001).

  • Finally, we set the remaining inverse gamma hyperparameters so that the mean of the inverse gamma prior is a prespecified value. The resulting estimates of the regression coefficients turned out to be somewhat sensitive to this value for small sample sizes and a small number of linear predictors; we used a value that specifies a prior with large enough variance to work well in practice.
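A sketch of the robust noise-level estimate referenced above, in Python rather than the author's MATLAB code (wavelet filter and function name are illustrative):

```python
import numpy as np
import pywt

def mad_sigma(y, wavelet="db6"):
    """Estimate the noise level as the MAD of the finest-level detail
    coefficients divided by 0.6745 (Donoho and Johnstone, 1994)."""
    finest_details = pywt.wavedec(y, wavelet, mode="periodization")[-1]
    return np.median(np.abs(finest_details)) / 0.6745
```

The squared value of this estimate is then matched to the mean of the inverse gamma prior on the noise variance, as described above.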

4.2 Simulations and Comparisons with Various Methods

In this section, we discuss the estimation performance of the proposed GS-WaPaLiM method and compare it to three methods from the partially linear wavelet model literature. The first one is the wavelet Backfitting algorithm (BF) proposed by Chang and Qu (2004), the second one is the LEGEND algorithm proposed by Gannaz (2007) and the last one is the double penalized PLM wavelet estimator (DPPLM) by Ding et al. (2011). A Bayesian wavelet-based algorithm for the same problem was proposed by Qu (2006). However, we found that the implementation of that algorithm is not robust to different simulated examples and initial values of the empirical Bayes procedure, therefore, we omitted it from our discussion.

The coarsest wavelet decomposition level was chosen as suggested by Antoniadis et al. (2001). Reconstruction of the theoretical signal was measured by the average mean squared error (AMSE), calculated as

where $M$ is the number of simulation runs and the known values of the simulated functions serve as the reference. We denote by a superscript $(m)$ the estimator from the $m$th simulation run. Note again that in each of these simulation runs we perform 20,000 Gibbs sampling iterations in order to obtain the estimators of $f$ and $\beta$. We also assess the performance in estimating the parametric part of the model by an AMSE for $\beta$, calculated analogously.
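Under the notation introduced above (with $M$ simulation runs; the normalizing constants are our assumption and may differ from those used in the paper), the two criteria take the form

$$
\mathrm{AMSE}(\hat f) = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{n}\sum_{i=1}^{n}
\big(\hat f^{(m)}(t_i) - f(t_i)\big)^2,
\qquad
\mathrm{AMSE}(\hat\beta) = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{p}\sum_{i=1}^{p}
\big(\hat\beta_i^{(m)} - \beta_i\big)^2 .
$$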

In the following simulation study we also used a modification of the wavelet backfitting algorithm proposed by Chang and Qu (2004). The original algorithm, denoted BF, applies a soft threshold in each iteration. In the modified algorithm we run the iterative procedure a second time using the generalized cross-validation threshold, as in Jansen et al. (1997). This simple modification significantly improves the performance of the original algorithm. The method will be denoted BFM in the sequel.

The procedure based on Gannaz (2007), denoted as LEGEND, is a wavelet thresholding based estimation procedure solved by the proposed LEGEND algorithm. The formulation of the problem is similar to the one in Chang and Qu (2004) and Fadili and Bullmore (2005), penalizing only the wavelet coefficients of the nonparametric part, but the solution is faster by recognizing the connection with Huber’s M-estimation of a standard linear model with outliers.

The algorithm by Ding et al. (2011) will be denoted DPPLM in the simulations. The authors discuss several simulation results based on how the Lasso penalty parameter was chosen and on whether the adaptive Lasso algorithm was used in the estimation procedure. It was reported that the GCV criterion with the adaptive Lasso provided the smallest AMSE results; therefore, that version of the algorithm is used in the present simulations. We refer to this method as DPPLM-GCV in the sequel.

For comparison purposes we use two simulation examples, one from Qu (2006), and another one from Ding et al. (2011). We set the number of replications .

Example 1
The first example is based on an example in Qu (2006). The simulated data are generated from

The nonparametric test functions are the Blocks, Bumps, Doppler and Heavisine functions, four standard test functions considered by Donoho and Johnstone (1994). The remaining constants were chosen to give reasonable signal-to-noise ratios (SNR). The test functions were simulated at $n = 64$, 128, 256 and 512 points, and the design points of the nonparametric component were equally spaced in the unit interval. Standard wavelet bases were used: Symmlet 8 for Heavisine and Doppler, Daubechies 6 for Bumps, and Haar for Blocks. The two columns of the design matrix were generated as independent random variables.

Results of the simulation are presented in Table 1. It can be seen that the proposed GS-WaPaLiM method gives better AMSE results, for both the nonparametric and the parametric component, in most test scenarios. It is apparent that the modified version of the backfitting algorithm (BFM) provides better results than the original backfitting algorithm (BF). Note that an additional source of uncertainty comes from estimating the noise variance, which was assumed to be known in the simulations of Chang and Qu (2004). LEGEND provides results comparable to BF, since both use the same least squares formulation, penalizing only the wavelet coefficients of the nonparametric part of the model; the solution algorithm and the estimation of the noise level differ between the two methods. Note that boldface numbers indicate the smallest AMSE result for each test scenario.

Signal $n$ Method AMSE($\hat f$) AMSE($\hat\beta$) Signal $n$ Method AMSE($\hat f$) AMSE($\hat\beta$)
Blocks 64 GS-WaPaLiM 0.6012 0.1179 Doppler 64 GS-WaPaLiM 1.0332 0.2009
BF 8.3137 0.5647 BF 5.0350 0.3366
BFM 1.0670 0.1606 BFM 1.1988 0.1607
LEGEND 7.0360 0.4970 LEGEND 4.8372 0.2965
DPPLM-GCV 0.8781 0.1415 DPPLM-GCV 1.0435 0.1535
128 GS-WaPaLiM 0.3933 0.0284 128 GS-WaPaLiM 0.4865 0.0363
BF 3.9360 0.1268 BF 2.5299 0.0758
BFM 0.6040 0.0368 BFM 0.6676 0.0390
LEGEND 3.6166 0.1166 LEGEND 2.8607 0.0811
DPPLM-GCV 0.5955 0.0372 DPPLM-GCV 0.6540 0.0393
256 GS-WaPaLiM 0.2547 0.0107 256 GS-WaPaLiM 0.3727 0.0126
BF 2.1635 0.0320 BF 1.7494 0.0264
BFM 0.4488 0.0138 BFM 0.4880 0.0145
LEGEND 1.9638 0.0290 LEGEND 1.8516 0.0261
DPPLM-GCV 0.4465 0.0140 DPPLM-GCV 0.4854 0.0146
512 GS-WaPaLiM 0.1776 0.0048 512 GS-WaPaLiM 0.2293 0.0050
BF 1.2914 0.0098 BF 0.9649 0.0085
BFM 0.3247 0.0056 BFM 0.3129 0.0057
LEGEND 1.2862 0.0096 LEGEND 1.0617 0.0089
DPPLM-GCV 0.3252 0.0057 DPPLM-GCV 0.3122 0.0057
Bumps 64 GS-WaPaLiM 0.7932 0.2136 Heavisine 64 GS-WaPaLiM 0.4265 0.0714
BF 8.8940 0.6099 BF 1.4997 0.0959
BFM 1.7003 0.2479 BFM 0.5890 0.0628
LEGEND 9.2580 0.5556 LEGEND 1.5195 0.1013
DPPLM-GCV 1.4129 0.2222 DPPLM-GCV 0.5833 0.0642
128 GS-WaPaLiM 0.7265 0.0783 128 GS-WaPaLiM 0.2834 0.0227
BF 6.3618 0.2014 BF 0.4817 0.0234
BFM 1.0466 0.0728 BFM 0.3526 0.0218
LEGEND 7.5358 0.2046 LEGEND 1.0038 0.0336
DPPLM-GCV 0.9931 0.0716 DPPLM-GCV 0.3544 0.0226
256 GS-WaPaLiM 0.5522 0.0177 256 GS-WaPaLiM 0.1972 0.0099
BF 4.0588 0.0571 BF 0.3603 0.0112
BFM 0.7845 0.0247 BFM 0.2623 0.0105
LEGEND 3.9100 0.0533 LEGEND 0.6559 0.0140
DPPLM-GCV 0.7688 0.0243 DPPLM-GCV 0.2608 0.0107
512 GS-WaPaLiM 0.4317 0.0065 512 GS-WaPaLiM 0.1310 0.0045
BF 2.9271 0.0196 BF 0.2576 0.0049
BFM 0.6022 0.0091 BFM 0.1758 0.0047
LEGEND 2.9142 0.0189 LEGEND 0.4219 0.0058
DPPLM-GCV 0.5999 0.0090 DPPLM-GCV 0.1756 0.0047
Table 1: AMSE comparison of the GS-WaPaLiM method to other methods for Example 1.

Example 2
The second example is based on a simulation example from Ding et al. (2011). The simulated data are generated from

The parametric part of the model is sparse: only the first 4 regression variables are significant. The nonparametric test functions are the PiecePoly function given in Nason (1996) and the Bumps function. The remaining constants were chosen to give reasonable signal-to-noise ratios (SNR). The test functions were simulated at two sample sizes, and the Daubechies 8 wavelet basis was used for both test functions. Rows of the design matrix were independently generated from a 20-dimensional multivariate normal distribution with zero mean vector, unit variances, and a common pairwise correlation between consecutive elements of the rows.

Results of the simulation are presented in Table 2. Note that boldface numbers indicate the smallest AMSE results for each test scenario. It can be seen that the proposed GS-WaPaLiM method gives better AMSE results, for both the nonparametric and the parametric component, in all test scenarios. In this example the parametric part of the model is sparse; therefore, the double penalized wavelet estimator is superior to the wavelet backfitting and LEGEND algorithms, especially in estimating the regression coefficients. Since the true $\beta$ is a sparse vector, penalized estimation of the coefficients provides superior results as opposed to the , and