
Bayesian Penalized Regression

Abstract

We consider ordinary least squares, lasso, bridge, and ridge regression methods under a unified framework. The particular method is determined by the form of the penalty term, which is typically chosen by cross validation. We introduce a fully Bayesian approach which allows selection of the penalty through posterior inference if desired. We also show how to use a type of model averaging approach to eliminate the nuisance penalty parameters and perform inference through the marginal posterior distribution of the regression coefficients. We develop a component-wise Markov chain Monte Carlo algorithm for sampling and establish conditional and marginal posterior consistency for the Bayesian model. Numerical results show that the method tends to select the optimal penalty and performs well in both variable selection and prediction. Both simulated and real data examples are provided.

1 Introduction

Penalized regression methods such as the lasso (Tibshirani, 1996), ridge regression (Hoerl and Kennard, 1970), and bridge regression (Frank and Friedman, 1993; Fu, 1998) have become popular alternatives to ordinary least squares (OLS). All of these methods can be viewed in a common framework. If $y$ is a centered $n$-vector of responses, $X$ is a standardized $n \times p$ design matrix, and $\beta$ is a $p$-vector of regression coefficients, then estimates are obtained by solving

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \left\{ \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{\alpha} \right\},$$

where $\lambda \ge 0$ and $\alpha \in [1, 2]$. When $\lambda = 0$ the OLS estimator is recovered, while if $\lambda > 0$, then $\alpha = 1$ corresponds to the lasso, $1 < \alpha < 2$ corresponds to bridge regression, and $\alpha = 2$ corresponds to ridge regression. The success of penalized regression methods is undisputed, but it is well recognized that each method performs best in different regimes defined by the nature of the unknown true parameter $\beta$; for some discussion see Fu (1998), Tibshirani (1996), and Zou and Hastie (2005). In practice, $\lambda$ and $\alpha$ may be chosen using cross validation.
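To make the unified objective concrete, the following Python sketch (our illustration, not code from the paper; the function name penalized_fit and the use of a generic derivative-free optimizer are our own choices) minimizes the penalized least-squares criterion for a given pair $(\lambda, \alpha)$. Setting $\lambda = 0$ recovers OLS, while $\alpha = 1$, $1 < \alpha < 2$, and $\alpha = 2$ give lasso-, bridge-, and ridge-type fits; in practice one would use specialized solvers and choose the penalty by cross validation.

import numpy as np
from scipy.optimize import minimize

def penalized_fit(X, y, lam, alpha):
    # Minimize ||y - X b||^2 + lam * sum_j |b_j|^alpha over b.
    def objective(b):
        resid = y - X @ b
        return resid @ resid + lam * np.sum(np.abs(b) ** alpha)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # start from the OLS solution
    return minimize(objective, b_ols, method="Powell").x

# Toy example: lam = 0 gives OLS; alpha = 1 lasso-type, alpha = 2 ridge-type.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.standard_normal(50)
print(penalized_fit(X, y, lam=5.0, alpha=1.0))   # lasso-type shrinkage
print(penalized_fit(X, y, lam=5.0, alpha=2.0))   # ridge-type shrinkage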

We consider Bayesian approaches to penalized regression methods, which have received much recent attention. Tibshirani (1996) characterized the lasso estimates as a posterior quantity; however, the first explicit Bayesian approach to lasso regression was introduced by Park and Casella (2008), followed by Hans (2009) and Kyung et al. (2010). Fu (1998) and Polson et al. (2014) studied Bayesian bridge regression, while Casella (1980), Frank and Friedman (1993), and Griffin and Brown (2013) considered Bayesian ridge regression.

A standard Bayesian formulation of penalized regression models assumes

$$y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n),$$

with $I_n$ the $n \times n$ identity matrix, along with a prior for $\sigma^2$ and, for $\beta$, the prior

(1)

Notice that if $y$ is observed and $(\lambda, \alpha)$ is fixed, this yields a marginal posterior density

(2)

from which one can easily observe that the frequentist penalized estimate amounts to the posterior mode. The frequentist estimator is thus suboptimal under squared error loss, for which the Bayes estimator is the posterior mean; see Hans (2009) for a clear discussion of this point in the context of the Bayesian lasso.

Of course, just as in (1), the use of (2) requires a choice of $\lambda$ and $\alpha$. Some Bayesian approaches have incorporated a prior for $\lambda$, some have used empirical Bayes approaches to estimate it, and some have conditioned on it as a fixed quantity (Casella, 1980; Hans, 2009; Khare and Hobert, 2013; Kyung et al., 2010; Park and Casella, 2008; Roy and Chakraborty, 2017). However, existing Bayesian methods condition on the choice of $\alpha$.

In Section 2 we propose a fully Bayesian hierarchical model incorporating a prior for $\alpha$ and yielding a joint posterior density for all of the unknowns. This allows inference to proceed naturally using a type of model averaging. If estimation of the true value of $\beta$ is of interest, then the appropriate marginal density can be used to produce an estimate along with posterior credible intervals. If prediction of a future value is desired, we calculate the posterior mean of the posterior predictive density.

Prediction intervals based on the posterior predictive density are conceptually straightforward. We will demonstrate that the proposed model results in estimation and prediction that is comparable to, and often better than, existing methods.

We can also use the hierarchical model to perform inference about the penalty parameters based on the appropriate marginal density. Consider estimation of $\alpha$. In Section 5 we conduct a simulation study where four scenarios are identified such that in scenarios I and IV the lasso should be preferred, in scenario II ridge and lasso should be comparable, and in scenario III ridge should be preferred. The estimated marginal posterior density of $\alpha$ for a single simulated data set from each scenario is displayed in Figure 1. We see that the posterior density tends to have a mode near the values of $\alpha$ corresponding to the optimal penalization method.

Figure 1: Estimated marginal posterior density of $\alpha$ in four scenarios.

The posterior for the proposed hierarchical model is analytically intractable in the sense that it is difficult to calculate the required posterior quantities. Thus we develop an efficient component-wise Markov chain Monte Carlo (MCMC) algorithm (Johnson et al., 2013) to sample from the posterior. We also consider Monte Carlo approaches to estimating posterior credible intervals and interval estimates based on the posterior predictive distribution.

Throughout we pay particular attention to the setting where the dimension grows with the sample size. In addition to using simulation to demonstrate the properties of the proposed hierarchical model, we establish both conditional and marginal strong posterior consistency. That is, we show that the conditional posterior distribution of $\beta$ given the hyperparameters and the marginal posterior distribution of $\beta$ will concentrate in neighborhoods of the true regression coefficients as the dimension grows with the sample size. This is dealt with carefully in Section 3.

The rest of the paper is organized as follows. In Section 2 we introduce the hierarchical model. We study strong posterior consistency in Section 3. Section 4 addresses estimation and prediction with a Markov chain Monte Carlo algorithm. Simulation experiments and a data example are presented in Sections 5 and 6, respectively. Some final remarks are given in Section 7. All proofs are deferred to the appendix.

2 Hierarchical Model

We will construct a hierarchical model which shrinks small effects while simultaneously introducing little bias for larger effects. Indeed the posterior consistency results of Section 3 give sufficient conditions for this and we demonstrate it via simulation in Section 5.

We continue to assume the response follows a normal distribution

$$y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n). \tag{3}$$

We also assume a proper conjugate prior for $\sigma^2$. The prior on $\beta$ differs from (1). Instead we assume

(4)

The only difference is that to each $\beta_j$ we assign its own parameter $\lambda_j$, which allows for differing shrinkage in estimating each component. Figure 2 displays the prior density for several settings of $\lambda_j$ and $\alpha$.

Figure 2: Prior density of $\beta_j$ for different values of $\lambda_j$ and $\alpha$.

Routine calculation gives the prior mean and variance of each $\beta_j$. The variance is a decreasing function of $\lambda_j$: if $\lambda_j$ is small, larger values of $\beta_j$ are likely, but if $\lambda_j$ is large, smaller values of $\beta_j$ are likely. This suggests a way to incorporate a spike-and-slab prior through the prior for $\lambda_j$. Specifically, we assume

(5)

The hyperparameters in (5) are chosen so that one component of the mixture has a small mean and variance while the other can have a relatively large mean and variance.

Figure 3: A useful default prior density for $\alpha$.

Finally, we need to specify a prior for $\alpha$. Notice that, unlike $\lambda_j$, which controls shrinkage for an individual $\beta_j$, the parameter $\alpha$ is common to all of the $\beta_j$. We assume

(6)

where each component density is shifted to have support on $[1, 2]$ and the mixing weights are nonnegative and sum to one. The idea here is that each component represents the analyst's assessment of the relative importance of lasso, bridge, and ridge but, in our empirical work, we have found that different choices yield similar estimation and prediction. A useful default prior results from a particular choice of the weights and component parameters. This amounts to a "U" shaped curve that places roughly equal mass close to 1 and close to 2, with the remainder in between, meaning lasso, ridge, and bridge regression are roughly equally likely a priori. This prior density is displayed in Figure 3.
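Since the exact component densities and default weights are not reproduced here, the following Python sketch samples from a hypothetical "U" shaped prior on $[1, 2]$ built from shifted Beta components; the weights and shape parameters are illustrative stand-ins, not the paper's values.

import numpy as np

def sample_alpha_prior(size, weights=(0.4, 0.2, 0.4), rng=None):
    # Hypothetical "U" shaped prior on [1, 2]: a mixture of Beta densities
    # shifted to [1, 2], with one component piling mass near 1 (lasso),
    # one near 2 (ridge), and one in between (bridge). Illustrative only.
    rng = np.random.default_rng() if rng is None else rng
    comps = rng.choice(3, size=size, p=weights)
    draws = np.empty(size)
    draws[comps == 0] = 1 + rng.beta(1, 8, size=(comps == 0).sum())   # mass near 1
    draws[comps == 1] = 1 + rng.beta(4, 4, size=(comps == 1).sum())   # mass in the middle
    draws[comps == 2] = 1 + rng.beta(8, 1, size=(comps == 2).sum())   # mass near 2
    return draws

alpha_draws = sample_alpha_prior(10_000)
print(np.mean(alpha_draws < 1.1), np.mean(alpha_draws > 1.9))  # mass near the endpoints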

3 Posterior Consistency

In this section we establish sufficient conditions for the posterior to concentrate near the true regression coefficients as the dimension grows with the sample size. We slightly modify our notation to make the dependence on the sample size explicit, and we collect all of the hyperparameters into a single vector; in this section we assume the precision is known. We will establish both consistency with respect to the marginal posterior of $\beta$ and consistency with respect to the conditional posterior of $\beta$ given the hyperparameters.

We make the following assumptions throughout this section: (i) the dimension grows with the sample size, but not too quickly; (ii) the smallest and largest singular values of the design matrix are suitably controlled; (iii) the true regression parameter is bounded; and (iv) the number of nonzero elements in the true regression parameter grows slowly enough with the sample size. Finally, the data-generating distribution is that at (3) under the true regression parameter.

We are now in position to state our result on conditional consistency.

Theorem 1.

If, for each $n$, the hyperparameters are bounded by finite constants, then, for any $\epsilon > 0$, as $n \to \infty$,

Proof.

See Appendix A. ∎

Next we address marginal posterior consistency.

Theorem 2.

If, for each $n$, each element of the hyperparameter vector lies in a suitable range, then, for any $\epsilon > 0$, as $n \to \infty$,

Proof.

See Appendix A. ∎

4 Estimation and Prediction

The hierarchical model gives rise to a posterior density characterized by

(7)

and which yields the marginal density $\pi(\beta \mid y)$. Under squared error loss, the Bayes estimator of the regression coefficients is the posterior mean $E(\beta \mid y)$. Interval estimates can be constructed from quantiles of the marginal posterior distribution of each $\beta_j$. Similarly, we can estimate and make inference about the other parameters through the appropriate marginal distributions.

Under squared error loss, prediction of a future observation is based on the mean of the posterior predictive distribution

(8)

A routine calculation shows that if $x_{\text{new}}$ is the covariate vector for a new observation $y_{\text{new}}$, then $E(y_{\text{new}} \mid y) = x_{\text{new}}^{T} E(\beta \mid y)$. Interval estimates can be constructed from quantiles of the posterior predictive distribution.

Unfortunately, calculation of $E(\beta \mid y)$ and of quantiles of the posterior marginals or the posterior predictive distribution is analytically intractable, so we will have to resort to Monte Carlo methods, which are considered in the sequel.

4.1 Markov Chain Monte Carlo

We develop a deterministic scan component-wise MCMC algorithm, with invariant density given by (7), which consists of a mixture of Gibbs updates and Metropolis-Hastings updates. To begin we require the posterior full conditionals. Let $\beta_{-j}$ denote all of the entries of $\beta$ except $\beta_j$. Then

(9)
(10)

and

where the remaining full conditional densities are Gamma densities. We see that we can use Gibbs updates for $\sigma^2$, the $\lambda_j$, and the remaining parameters in the hierarchy. However, for the $\beta_j$ and $\alpha$ we will need Metropolis-Hastings updates, which are now described.

Consider updating $\beta_j$. If $\beta_j^{(t)}$ is the current value at the $t$th iteration, then we will use a random walk Metropolis-Hastings update with a symmetric proposal distribution centered at $\beta_j^{(t)}$, whose scale is chosen by the user, and invariant density given by (9).

The MH update for $\alpha$ is conducted via a proposal that is a mixture of kernels. Specifically, if the mixture weights and component kernels are chosen by the analyst, then the proposal distribution is

That is, we use an independence Metropolis-Hastings sampler with invariant density given by (10).
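The random walk update described above can be sketched generically in Python as follows; the log_target argument stands for the log of an unnormalized full conditional such as (9), and the example target at the end is purely illustrative, not the paper's actual conditional.

import numpy as np

def rw_mh_step(current, log_target, step_sd, rng):
    # One random-walk Metropolis-Hastings update for a scalar parameter.
    # log_target is the log of the (unnormalized) invariant density;
    # step_sd is the user-chosen proposal scale.
    proposal = current + step_sd * rng.standard_normal()
    log_accept = log_target(proposal) - log_target(current)
    if np.log(rng.uniform()) < log_accept:
        return proposal, True
    return current, False

# Illustrative target: a penalized-likelihood-style log density for one coefficient.
rng = np.random.default_rng(1)
log_target = lambda b: -0.5 * (b - 1.3) ** 2 / 0.2 - 2.0 * abs(b) ** 1.5
b, draws = 0.0, []
for _ in range(5000):
    b, _ = rw_mh_step(b, log_target, step_sd=0.5, rng=rng)
    draws.append(b)
print(np.mean(draws[1000:]))   # Monte Carlo estimate of the target mean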

Cycling through these updates in the usual fashion yields an MCMC sample from the posterior. Estimation is straightforward since the sample mean of the draws is strongly consistent for the corresponding posterior expectation as the number of iterations grows, and a sample quantile of the draws is strongly consistent for the corresponding quantile of the marginal posterior distribution (Doss et al., 2014).

Prediction intervals for a new observation require a further Monte Carlo step based on the posterior predictive density. Given the MCMC sample, we can sample from the posterior predictive distribution by drawing, at each iteration $t$, a value $y_{\text{new}}^{(t)}$ from a normal distribution with mean $x_{\text{new}}^{T}\beta^{(t)}$ and variance $\sigma^{2(t)}$. The sample quantiles of these draws are then strongly consistent for the corresponding quantiles of the posterior predictive distribution.
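A minimal sketch of this composition-sampling step, assuming the MCMC output has been stored as arrays beta_draws (one row per retained iteration) and sigma2_draws (these array names, and the hypothetical draws below, are ours):

import numpy as np

def predictive_draws(x_new, beta_draws, sigma2_draws, rng=None):
    # For each retained MCMC draw (beta^(t), sigma^2(t)), draw
    # y_new^(t) ~ N(x_new' beta^(t), sigma^2(t)).
    rng = np.random.default_rng() if rng is None else rng
    means = beta_draws @ x_new                       # shape (T,)
    return rng.normal(means, np.sqrt(sigma2_draws))  # one y_new per MCMC draw

# Hypothetical MCMC output: beta_draws is (T, p), sigma2_draws is (T,).
rng = np.random.default_rng(2)
T, p = 10_000, 4
beta_draws = rng.normal(size=(T, p)) * 0.05 + np.array([1.0, 0.0, -0.5, 0.2])
sigma2_draws = rng.gamma(shape=50, scale=0.02, size=T)   # concentrated near 1
y_new = predictive_draws(np.array([1.0, 0.3, -1.2, 0.5]), beta_draws, sigma2_draws, rng)
print(np.quantile(y_new, [0.025, 0.5, 0.975]))   # 95% prediction interval and median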

Remark 1.

If interest lies in extreme quantiles, then importance sampling is preferred (see, e.g., Robert and Casella, 2013). However, for standard settings such as the .05 or .95 quantiles, the approach suggested here will be much faster. In fact, compared to the above approach, our implementation of importance sampling with a Cauchy instrumental distribution was more than 550 times slower in the examples from Section 5, and hence we do not pursue it further here.

5 Simulation Experiments

5.1 Simulation Scenarios

We consider four scenarios: (I) a small number of large effects; (II) a small to moderate number of moderate-sized effects; (III) a large number of small effects; and (IV) a sparse setting where the number of predictors is large relative to the sample size. For each scenario we independently repeat the following procedure 1000 times. We generate 1000 observations from a linear model and split them into a training set and a test set. We then fit the hierarchical model from Section 2 on the training data using the MCMC algorithm and estimation procedure from Section 4. The hyperparameters were fixed at specified values and we used the suggested default prior for $\alpha$. The MCMC algorithm was run for $10^5$ iterations, a value chosen to obtain enough effective samples according to the procedure developed by Vats et al. (2017). The MCMC procedure is not computationally onerous since, in our most challenging simulation experiment, it took less than ten seconds to complete for a single data set.

In each scenario, we generate data from the linear model $y = X\beta + \varepsilon$ with $\varepsilon \sim N_n(0, \sigma^2 I_n)$. In each setting we include an intercept, so that the first column of the design matrix is a column of ones. The remaining columns are generated from a multivariate normal distribution $N(0, \Sigma)$, where the diagonal entries of $\Sigma$ equal 1 and the off-diagonal entries take a common value.
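A sketch of this data-generating step is given below; the sample sizes, correlation value, and nonzero coefficients shown are placeholders, since the paper's exact values are not reproduced in this extraction.

import numpy as np

def simulate_data(n, p, rho, beta, sigma, rng):
    # One replication: intercept column plus (p - 1) equicorrelated Gaussian
    # predictors, response from the linear model y = X beta + eps.
    cov = np.full((p - 1, p - 1), rho)
    np.fill_diagonal(cov, 1.0)
    Z = rng.multivariate_normal(np.zeros(p - 1), cov, size=n)
    X = np.column_stack([np.ones(n), Z])
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

# Placeholder constants (illustrative, not the paper's settings).
rng = np.random.default_rng(3)
p = 20
beta = np.zeros(p)
beta[[0, 3, 11]] = [1.0, 3.0, -2.5]          # scenario-I-style sparsity: few large effects
X, y = simulate_data(n=1000, p=p, rho=0.5, beta=beta, sigma=2.0, rng=rng)
X_train, y_train = X[:200], y[:200]          # training set
X_test, y_test = X[200:], y[200:]            # test set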

Scenario I. In each replication we randomly choose 17 of the 19 predictors to have zero coefficients, while the coefficients of the remaining two are independently sampled from a common distribution.

Scenario II. In each replication we randomly choose 9 of the 19 predictors to have zero coefficients, while the remaining coefficients are independently sampled from a common distribution.

Scenario III. In each replication all 19 predictor coefficients are independently sampled from a common distribution.

Scenario IV. In each replication we randomly choose 132 of the 149 predictors to have zero coefficients, while the remaining coefficients are independently sampled from a common distribution.

Remark 2.

While we assume Gaussian errors in our simulation experiments, we also investigated the situation where this assumption is violated. In particular, we considered the case where the errors follow a Student's $t$-distribution with 5 degrees of freedom. In this setting our method continued to provide reasonable estimation and prediction. In fact, the results were similar enough that we do not present them here in the interest of a concise presentation.

5.2 Posterior of $\alpha$

Recall Figure 1, which displays the estimated posterior density of $\alpha$ for a single data set in each of the four scenarios. The results coincide nicely with previous conclusions (Tibshirani, 1996; Friedman et al., 2001). In scenarios I and IV, where the lasso is preferred, more of the mass is close to 1. In scenario III ridge regression should dominate the lasso, and the posterior of $\alpha$ is concentrated in the region between 1.8 and 2. In scenario II ridge and lasso are often comparable, with a small advantage for the lasso. In the data set displayed here the estimated density favors larger values of $\alpha$, but we will see that the performance of the Bayesian methods is comparable to that of the optimal frequentist method. Overall, the proposed approach has the ability to provide a posterior density curve for $\alpha$ which puts most of its mass near the optimal values of $\alpha$.

5.3 Estimation

We compare the Bayesian penalized regression model with the Bayesian lasso (i.e., the hierarchical model of Section 2 with $\alpha$ fixed at 1) and Bayesian ridge regression (i.e., the hierarchical model of Section 2 with $\alpha$ fixed at 2). We also compare to the frequentist lasso and ridge regression with the tuning parameters chosen by 10-fold cross validation.
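The frequentist competitors can be fit, for example, with scikit-learn, where the penalty weight (the paper's $\lambda$) is called alpha, not to be confused with the exponent $\alpha$ of this paper. The training data below are simulated placeholders, not the paper's data.

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Placeholder training data for illustration.
rng = np.random.default_rng(4)
X_train = rng.standard_normal((200, 19))
y_train = X_train[:, 0] * 3.0 - X_train[:, 5] * 2.0 + rng.standard_normal(200)

# Tuning parameter chosen by 10-fold cross validation, as in the comparison above.
lasso = LassoCV(cv=10).fit(X_train, y_train)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=10).fit(X_train, y_train)
print("lasso penalty:", lasso.alpha_, "ridge penalty:", ridge.alpha_)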

Table 1 reports the average distance between the estimated coefficients and the truth. These results suggest that the hierarchical model dominates the other two when large effects exist (scenarios I and IV), especially in the sparse scenario IV. Our approach dominates ridge and is comparable to the others in scenario II. In scenario III, our model is superior to the Bayesian lasso and Bayesian ridge models, but is slightly worse than the lasso, while ridge regression dominates. Largely this appears to be due to the hierarchical model more aggressively shrinking tiny effects to zero.

Scenario   B.P.            B.L.            B.R.            Lasso           Ridge
I          0.429 (0.005)   0.410 (0.005)   0.475 (0.005)   1.107 (0.012)   10.025 (0.025)
II         2.649 (0.028)   2.381 (0.026)   2.517 (0.026)   2.159 (0.019)   4.958 (0.015)
III        2.959 (0.014)   3.149 (0.015)   3.033 (0.014)   2.539 (0.012)   0.639 (0.003)
IV         1.399 (0.035)   1.528 (0.044)   1.590 (0.046)   5.656 (0.070)   36.226 (0.011)
Table 1: Average distance between the estimated and true coefficients, with standard errors in parentheses. B.P. denotes the Bayesian penalized regression model, B.L. the Bayesian lasso, and B.R. the Bayesian ridge.

Table 2 gives an estimate of the signal-to-noise ratio; specifically, the sum of squares due to regression (SSReg) divided by the sum of squared residuals (SSR). One can observe that the proposed method is at least as good as the alternatives presented here.

Scenario   B.P.            B.L.            B.R.            Lasso          Ridge
I          140.4 (1.1)     139.8 (1.2)     141.6 (1.1)     140.7 (1.0)    51.9 (0.2)
II         179.1 (1.4)     180.7 (1.3)     180.7 (1.2)     175.6 (1.4)    131.4 (0.8)
III        42.6 (0.4)      42.9 (0.4)      42.8 (0.4)      45.0 (0.5)     40.2 (0.4)
IV         1076.5 (10.9)   1048.7 (9.6)    1269.4 (14.7)   732.8 (4.3)    2.5 (0.1)
Table 2: Median estimated signal-to-noise ratio (SSReg/SSR), with standard errors in parentheses.
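For reference, the signal-to-noise estimate reported in the tables can be computed from a vector of responses and fitted values as follows (a simple sketch with toy numbers, not the paper's data):

import numpy as np

def ssreg_ssr_ratio(y, y_hat):
    # Sum of squares due to regression divided by the sum of squared residuals.
    ss_reg = np.sum((y_hat - np.mean(y)) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)
    return ss_reg / ss_res

y = np.array([1.0, 2.0, 0.5, 3.0])
y_hat = np.array([0.9, 2.2, 0.6, 2.7])
print(ssreg_ssr_ratio(y, y_hat))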

Table 3 summarizes the estimated frequentist coverage probability of the 95% posterior credible intervals for each element of $\beta$. For instance, in scenario I we have 19 estimated coverage probabilities, one for each of the 19 predictors other than the intercept, and the table reports their minimum, quartiles, median, and maximum. The results again suggest good performance, except in scenario III, which is likely due to the aggressive shrinkage of tiny effects.

Scenario min 25% med 75% max
I 0.98 1.00 1.00 1.00 1.00
II 0.86 0.92 0.97 0.99 1.00
III 0.70 0.80 0.83 0.85 0.91
IV 0.96 1.00 1.00 1.00 1.00
Table 3: Estimated coverage probabilities of 95% posterior credible intervals, summarized across the elements of $\beta$.
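The coverage summaries in Table 3 can be computed from stored interval endpoints as sketched below; the arrays and the toy intervals are hypothetical, used only to illustrate the calculation.

import numpy as np

def coverage(lower, upper, beta_true):
    # Estimated frequentist coverage of credible intervals.
    # lower, upper, and beta_true all have shape (replications, p);
    # the true coefficients change across replications in our scenarios.
    hits = (lower <= beta_true) & (beta_true <= upper)
    return hits.mean(axis=0)          # per-coefficient coverage across replications

# Hypothetical stored results from 1000 replications with 19 predictors:
rng = np.random.default_rng(5)
beta_true = rng.normal(size=(1000, 19))
half_width = 2.1
lower = beta_true + rng.normal(scale=1.0, size=(1000, 19)) - half_width
upper = lower + 2 * half_width
cov = coverage(lower, upper, beta_true)
print(np.quantile(cov, [0.0, 0.25, 0.5, 0.75, 1.0]))   # min, 25%, median, 75%, max as in Table 3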

Estimation of Large Effects

We consider two additional scenarios to study what happens in the presence of large effects. All of the settings remain the same as above, except as noted below.

Scenario V. Values of the regressors other than the intercept are again drawn from a multivariate normal distribution whose covariance matrix has diagonal entries equal to 1 and a common off-diagonal value. The true vector of coefficients now contains some large effects.

Scenario VI. We use the same settings as in scenario V except that the true coefficients are rescaled to be much larger.

Table 4 reports the average distance between the estimated coefficients and the truth in scenarios V and VI. The Bayesian methods are all comparable and all dominate both the lasso and ridge regression. It is well known that the lasso and ridge regression produce highly biased estimates in the presence of large effects; however, this does not appear to happen with the Bayesian approaches. Notice also that the scale of the coefficients does not affect the results for the Bayesian methods.

Scenario   B.P.            B.L.            B.R.            Lasso             Ridge
V          1.369 (0.005)   1.364 (0.005)   1.368 (0.005)   1.575 (0.015)     3.247 (0.039)
VI         1.366 (0.005)   1.365 (0.005)   1.371 (0.005)   96.666 (1.189)    294.709 (2.311)
Table 4: Average distance between the estimated and true coefficients, with standard errors in parentheses.

Table 5 gives an estimate of the signal-to-noise ratio. Again, the proposed method is at least as good as the alternatives.

Scenario   B.P.               B.L.               B.R.               Lasso          Ridge
V          252.7 (1.4)        253.0 (1.3)        252.9 (1.3)        234.2 (1.0)    216.0 (1.3)
VI         1.43e6 (7481.1)    1.43e6 (7322.7)    1.43e6 (7585.4)    864.5 (5.9)    904.8 (1.6)
Table 5: Median estimated signal-to-noise ratio (SSReg/SSR), with standard errors in parentheses.

5.4 Prediction

We turn our attention to prediction of a future observation. The simulation results are based on the same simulated data as in Section 5.3. Table 6 reports our simulation results. The Bayesian approaches are comparable in all four scenarios, with the fully Bayesian approach being slightly better. In scenario I the Bayesian methods dominate both the lasso and ridge. In scenario II the Bayesian methods are all comparable to the lasso, with ridge being substantially worse. Ridge dominates in scenario III, but the other methods are comparable. In the sparse scenario IV, both the lasso and ridge regression are substantially worse.

Scenario   B.P.            B.L.            B.R.            Lasso            Ridge
I          4.045 (0.004)   4.040 (0.003)   4.057 (0.004)   4.192 (0.007)    11.947 (0.057)
II         4.741 (0.017)   4.626 (0.013)   4.682 (0.013)   4.586 (0.010)    5.790 (0.022)
III        4.903 (0.014)   4.966 (0.012)   4.905 (0.014)   4.640 (0.010)    4.249 (0.008)
IV         5.325 (0.024)   5.324 (0.019)   5.542 (0.021)   10.494 (0.123)   613.517 (3.070)
Table 6: Median mean-squared prediction error, with standard errors in parentheses.

Figure 4 displays the posterior predictive densities and prediction intervals for a future observation. In each panel, the dotted bell-shaped curve represents the true density function and the solid bell-shaped curve is the empirical posterior predictive density of the future observation. The vertical dotted line marks the true value, and the two dashed lines mark the 2.5 and 97.5 percentiles, respectively, so that the interval between them is the 95% prediction interval. Clearly the prediction interval contains the true value in each scenario.

Figure 4: Posterior predictive intervals. Scenario I is upper left, scenario II is upper right, scenario III is lower left, and scenario IV is lower right.

Prediction with Large Effects

We consider scenarios V and VI described above. Table 7 reports prediction performance. As with estimation, all of the Bayesian methods are comparable and the scale of the coefficients does not impact the quality of prediction. The lasso and ridge regression are dominated by the Bayesian methods. We also provide box plots of the prediction MSE across replications for the fully Bayesian model, the lasso, and ridge regression in Figure 5. Clearly, the Bayesian approach is much more stable than the others.

Scenario   B.P.            B.L.            B.R.            Lasso                   Ridge
V          4.988 (0.010)   4.973 (0.010)   4.985 (0.009)   8.016 (0.262)           8.207 (0.262)
VI         4.988 (0.013)   4.986 (0.012)   4.996 (0.012)   40655.773 (1258.053)    41790.063 (1458.483)
Table 7: Median mean-squared prediction errors, with standard errors in parentheses.
Figure 5: Box plots of prediction mean-squared errors across replications for scenario V (left) and scenario VI (right).

6 Diabetes Data

Figure 6: Marginal posterior distributions for three selected predictors. The dashed line is the marginal posterior mean of each coefficient, the dotted line the lasso estimate, and the dash-dot line the ridge estimate. An empirical 95% credible region is marked by solid lines for each coefficient as well.

The diabetes data set (Efron et al., 2004) contains 10 predictors, 1 response, and 442 observations. There is a strong positive correlation between predictors S1 and S2. When the correlation between two predictors is close to 1 and both of them tend to be unimportant in the model, some methods may fit a negative coefficient for one predictor and a positive one for the other. For example, an OLS linear model fit to the diabetes data produces a negative coefficient for S1 and a positive coefficient for S2.
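The diabetes data ship with scikit-learn in a standardized form, so the OLS behavior described above is easy to reproduce; the sketch below (ours, not the paper's code) prints the S1 and S2 coefficients and their correlation.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
names = list(data.feature_names)          # ['age', 'sex', 'bmi', 'bp', 's1', 's2', ...]
ols = LinearRegression().fit(data.data, data.target)
for name in ("s1", "s2"):
    print(name, ols.coef_[names.index(name)])   # opposite-sign coefficients
print("correlation(S1, S2):",
      np.corrcoef(data.data[:, names.index("s1")],
                  data.data[:, names.index("s2")])[0, 1])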

We applied the fully Bayesian hierarchical model to this data set. The hyperparameters were again held fixed and we used the suggested default prior for $\alpha$. We used $10^7$ MCMC samples. The left and middle panels of Figure 6 show the empirical marginal posterior distributions for S1 and S2, respectively. Marked by solid lines, both 95% credible intervals suggest that S1 and S2 should not be included in the model. The lasso and ridge solutions are also presented in the figures. An interesting observation is that the effect of gender is significant based on our model, while it is insignificant under the other two.

7 Final Remarks

We proposed a fully Bayesian method for penalized linear regression which incorporates shrinkage-related parameters at both the individual and the group level, namely the $\lambda_j$'s and $\alpha$. This allows the practitioner to address the uncertainty in the tuning parameters in a principled manner by averaging over their posterior distribution. Overall, the method has good performance in terms of prediction and estimation.

There are several potential directions for future research in this vein. For example, it would be conceptually straightforward to investigate priors which permit values of $\alpha$ outside $[1, 2]$, whereas we have restricted consideration to $\alpha \in [1, 2]$. It would also be interesting to consider a broader class of penalized regression approaches which would allow the embedding of lasso, ridge, and bridge regression in a framework which also incorporates other penalized regressions such as the elastic net (Zou and Hastie, 2005).

Appendix A Proof of Theorems 1 and 2

We begin with a preliminary result that will be used in both proofs.

Lemma 1.

Let denote the subset of nonzero entries in and and . Then

(11)
Proof.

Notice that

Then

and by conditional independence

(12)

For , we have

(13)

Since

combining (12) and (13) yields the claim. ∎

Proof of Theorem 1.

All we need to do is show that there exists such that for

The result would then follow directly from Theorem 1 in Armagan et al. (2013).

Consider (1). If for finite , then

Now take the negative logarithm of both sides of (1) to obtain

By assumption the relevant terms vanish as $n \to \infty$. Thus the first, second, and fifth terms on the right-hand side are asymptotically negligible. Consider the third and fourth terms. As $n \to \infty$,

Thus, as $n \to \infty$, the third and fourth terms are also asymptotically negligible. Then, for any $\epsilon > 0$, as $n \to \infty$,

and the result follows. ∎

Proof of Theorem 2.

We need to show that there exists such that for

The result would then follow directly from Theorem 1 in Armagan et al. (2013).

Let . Then, since is a proper prior distribution,

We will show that there exists such that for

From (1) we have