Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation

1. This paper is part of Mohammed EL Anbari's PhD thesis. This work has been partly supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris) through the 2009-2012 project ANR-09-BLAN-01 Emile for the last two authors, and by the Institut Universitaire de France for the last author. Jean-Michel Marin and Christian P. Robert are grateful to the participants to the BIRS 07w5079 meeting on "Bioinformatics, Genetics and Stochastic Computation: Bridging the Gap" for their helpful comments. Discussions in 2007 in Banff with Sylvia Richardson and in Roma with Jim Berger and Paul Speckman are also gratefully acknowledged. Given that Arnold Zellner sadly passed away last August, we would like to dedicate this paper to the memory of this leading Bayesian thinker, who influenced the field so much and will continue to do so much longer.
Abstract
Using a collection of simulated and real benchmarks, we compare Bayesian and frequentist regularization approaches in a poorly informative situation, when the number of variables is almost equal to the number of observations. This comparison includes new global noninformative approaches for Bayesian variable selection built on Zellner's $g$-priors that are similar to Liang et al. (2008). The interest of those calibration-free proposals is discussed. The numerical experiments we present highlight the appeal of Bayesian regularization methods when compared with non-Bayesian alternatives: they dominate frequentist methods in the sense that they provide smaller prediction errors while selecting the most relevant variables in a parsimonious way.
Keywords: Model choice, regularization methods, noninformative priors, Zellner's $g$-prior, calibration, Lasso, elastic net, Dantzig selector.
1 Introduction
Given a response variable $y$ and a collection of associated potential predictor variables $x_1,\ldots,x_p$, the classical linear regression model imposes a linear dependence of the conditional expectation of $y$ on the predictors (Rao, 1973).
A fundamental inferential direction for those models is the variable selection problem, namely that only variables of relevance should be kept within the regression while the others should be removed. While we cannot discuss at length the potential applications of this perspective, variable selection is particularly relevant when the number of regressors is larger than the number of observations (as in microarray and other genetic data analyses).
To deal with poorly posed or ill-posed regression problems, many regularization methods have been proposed, like ridge regression (Hoerl and Kennard, 1970) and the Lasso (Tibshirani, 1996). Recently, the interest in frequentist regularization methods has increased and this has produced a flurry of methods (see, among others, Candes and Tao, 2007, Zou and Hastie, 2005, Zou, 2006, Yuan and Lin, 2007).
However, a natural approach for regularization is to follow the Bayesian paradigm, as demonstrated recently by the Bayesian Lasso of Park and Casella (2008). The amount of literature on Bayesian variable selection is enormous (a small subset of which is, for instance, Mitchell and Beauchamp, 1988, George and McCulloch, 1993, Chipman, 1996, Smith and Kohn, 1996, George and McCulloch, 1997, Dupuis and Robert, 2003, Brown and Vannucci, 1998, Philips and Guttman, 1998, George, 2000, Kohn et al., 2001, Nott and Green, 2004, Schneider and Corcoran, 2004, Casella and Moreno, 2006, Cui and George, 2008, Liang et al., 2008, Bottolo and Richardson, 2010). The number of approaches and scenarios that have been advanced to undertake the selection of the most relevant variables given a set of observations is quite large, presumably due to the vague decisional setting induced by the question "Which variables do matter?" Such a variety of resolutions signals a lack of agreement between the actors in the field.
Most of the solutions, including Liang et al. (2008) and Bottolo and Richardson (2010), focus on the use of the $g$-prior introduced by Zellner (1986). While this prior has a long history and while it reduces the prior input to a single constant, $g$, the influence of this remaining prior factor is long-lasting and large values of $g$ are no guarantee of negligible effects, in connection with the Bartlett or Lindley-Jeffreys paradoxes (Bartlett, 1957, Lindley, 1957, Robert, 1993), as illustrated for instance in Celeux et al. (2006) or Marin and Robert (2007). In order to alleviate this influence, some empirical Bayes (Cui and George, 2008) and hierarchical Bayes (Zellner and Siow, 1980, Celeux et al., 2006, Marin and Robert, 2007, Liang et al., 2008, Bottolo and Richardson, 2010) solutions have been proposed. In this paper, we pay special attention to two calibration-free hierarchical Zellner $g$-priors. The first one is the Jeffreys prior, which is not location invariant. The second one avoids this problem by only considering models that include at least one variable.
The purpose of our paper is to compare the frequentist and the Bayesian points of view on regularization when the number of observations remains (slightly) greater than the number of regressors; we limit our attention to full-rank models. This comparison is considered from both the predictive and the explicative points of view. The outcome of this study is that the Bayesian methods are quite similar to one another while dominating their frequentist counterparts.
The plan of the paper is as follows: we recall the details of Zellner's (1986) original $g$-prior in Section 2 and discuss therein the potential choices of $g$. We present hierarchical noninformative alternatives in Section 3. Section 4 compares the results of Bayesian and frequentist methods on simulated and real datasets. Section 5 concludes the paper.
2 Zellner's $g$-priors
Following standard notations, we introduce a binary vector $\gamma \in \{0,1\}^p$ that indicates which variables are active in the regression, excluding the constant vector corresponding to the intercept, which is assumed to be always present in the linear regression model.
We observe $y \in \mathbb{R}^n$ and, for model $\gamma$, the likelihood is defined by the conditional distribution

$$y \mid \gamma, \beta_\gamma, \sigma^2 \sim \mathcal{N}_n\!\left(X_\gamma \beta_\gamma,\, \sigma^2 I_n\right), \qquad (1)$$

where
- $q_\gamma = \sum_{i=1}^p \gamma_i$ is the number of selected variables,
- $X_\gamma$ is the $n \times (q_\gamma+1)$ matrix whose columns are made of the vector $\mathbf{1}_n$ and of the variables $x_i$ for which $\gamma_i = 1$,
- $\beta_\gamma \in \mathbb{R}^{q_\gamma+1}$ and $\sigma^2 > 0$ are unknown parameters.
The same symbol $\sigma^2$ is used for the variance parameter across all models. For model $\gamma$, Zellner's $g$-prior is given by

$$\beta_\gamma \mid \sigma^2, \gamma \sim \mathcal{N}_{q_\gamma+1}\!\left(\tilde\beta_\gamma,\, g\sigma^2\left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1}\right), \qquad \pi(\sigma^2 \mid \gamma) \propto \sigma^{-2}.$$

The experimenter chooses the prior expectation $\tilde\beta_\gamma$ and the scale $g>0$. For such a prior, the posterior expectation of $\beta_\gamma$ is the classical weighted average between the prior guess and the least-squares estimate,

$$\mathbb{E}\left[\beta_\gamma \mid y, \gamma\right] = \frac{\tilde\beta_\gamma + g\hat\beta_\gamma}{g+1}\,, \qquad \hat\beta_\gamma = \left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1} X_\gamma^{\mathsf T} y\,.$$
This prior is traditionally called Zellner's $g$-prior in the Bayesian folklore because of the constant $g$ used by Zellner (1986) in front of the inverse Fisher information matrix $\sigma^2\left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1}$. Its appeal is that, by using the information matrix as a global scale,
- it avoids the specification of a whole prior covariance matrix, which would be a tremendous task;
- it allows for a specification of the constant $g$ in terms of observational units, or virtual prior pseudo-observations, in the sense of de Finetti (1972).
However, a fundamental feature of the $g$-prior is that it is improper, due to the use of an infinite mass on $\sigma^2$. From a theoretical point of view, this should jeopardize the use of posterior model probabilities, since these probabilities are not uniquely scaled under improper priors: there is no way of eliminating the residual constant factor in those priors (DeGroot, 1973, Kass and Raftery, 1995, Robert, 2001). However, under the assumption that $\sigma^2$ is a parameter that has a meaning common to all models $\gamma$, Berger et al. (1998) develop a framework that allows one to work with a single improper prior common to all models (see also Marin and Robert, 2007). A fundamental appeal of Zellner's prior in model comparison, and in particular in variable selection, is its simplicity, since it reduces the prior input to the sole specification of a scale parameter $g$.
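As a small illustration of the shrinkage induced by the $g$-prior, the posterior mean with prior expectation $\tilde\beta_\gamma = 0$ reduces to the least-squares estimate multiplied by $g/(g+1)$. The sketch below uses simulated data and the unit-information choice $g=n$; all numerical values are illustrative assumptions.

```python
# Sketch: posterior mean of beta under Zellner's g-prior with prior mean 0.
# The data and the choice g = n are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # intercept + p covariates
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.standard_normal(n)

g = float(n)  # unit-information choice g = n (Kass and Wasserman, 1995)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# With prior expectation 0, the posterior mean shrinks least squares by g/(g+1).
beta_post = g / (g + 1.0) * beta_ols
print(beta_post)
```

With $g=n=50$ the shrinkage factor is $50/51$, so the Bayes estimate stays very close to least squares while remaining strictly smaller in magnitude componentwise.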
At this stage, we need to point out that an alternative $g$-prior is often used (Berger et al., 1998, Fernandez et al., 2001, Liang et al., 2008, Bottolo and Richardson, 2010), by singling out the intercept parameter in the linear regression. By first assuming a centering of the covariates, i.e. $\bar{x}_j = 0$ for all $j$'s, the intercept $\alpha$ is given a flat prior while the other regression parameters are associated with a corresponding $g$-prior. This yields an alternative to model (1), which we denote model (2) to stress the distinction between both representations, and which is such that

$$y \mid \gamma, \alpha, \beta_\gamma, \sigma^2 \sim \mathcal{N}_n\!\left(\alpha\mathbf{1}_n + X_\gamma \beta_\gamma,\, \sigma^2 I_n\right), \qquad (2)$$

where
- $X_\gamma$ is now the $n \times q_\gamma$ matrix whose columns are made of the (centered) variables $x_i$ for which $\gamma_i = 1$,
- $\alpha \in \mathbb{R}$, $\beta_\gamma \in \mathbb{R}^{q_\gamma}$ and $\sigma^2 > 0$ are unknown parameters.

The parameters $\alpha$ and $\sigma^2$ are denoted the same way across all models and rely on the same prior. Namely, for model (2), the corresponding Zellner's prior is given by

$$\beta_\gamma \mid \sigma^2, \gamma \sim \mathcal{N}_{q_\gamma}\!\left(\tilde\beta_\gamma,\, g\sigma^2\left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1}\right), \qquad \pi(\alpha, \sigma^2 \mid \gamma) \propto \sigma^{-2}.$$

In that case, we obtain

$$\mathbb{E}\left[\alpha \mid y, \gamma\right] = \bar{y} \qquad\text{and}\qquad \mathbb{E}\left[\beta_\gamma \mid y, \gamma\right] = \frac{\tilde\beta_\gamma + g\hat\beta_\gamma}{g+1}\,.$$
For models (1) and (2), in a noninformative setting, we can for instance choose $\tilde\beta_\gamma = 0$ and $g$ large. However, as pointed out in Marin and Robert (2007, Chapter 3) among others, there is a lasting influence of $g$ over the resulting inference and it is impossible to "let $g$ go to infinity" to eliminate this influence, because of the Bartlett and Lindley-Jeffreys paradoxes (Bartlett, 1957, Lindley, 1957, Robert, 1993), whereby an infinite value of $g$ ends up selecting the null model, regardless of the information brought by the data. For this reason, data-dependent versions of $g$ have been proposed with various degrees of justification:

- Kass and Wasserman (1995) use $g=n$, so that the amount of information about the parameters contained in the prior equals the amount of information brought by a single observation. For $n$ large enough, this perspective is very close to using the Schwarz (Kass and Wasserman, 1995) or BIC criterion, in that the log-posterior corresponding to $g=n$ is equal to the penalized log-likelihood of this criterion.
- Foster and George (1994) calibrate $g=p^2$ towards the risk inflation criterion (RIC).
- Fernandez et al. (2001) gather both perspectives in $g=\max(n,p^2)$ as a conservative bridge between BIC and RIC, a choice that they christened the "benchmark prior".
These solutions, while commendable since based on asymptotic properties (see in particular Fernandez et al., 2001 for consistency results), are nonetheless unsatisfactory in that they depend on the sample size and involve a degree of arbitrariness.
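The Bartlett effect underlying these recommendations can be checked numerically with the closed-form $g$-prior Bayes factor of a model $\gamma$ against the null model, $(1+g)^{(n-1-q_\gamma)/2}\left[1+g(1-R^2_\gamma)\right]^{-(n-1)/2}$, as given in Liang et al. (2008); the values of $n$, $q_\gamma$ and $R^2_\gamma$ below are illustrative assumptions.

```python
# Numerical check of the Bartlett effect: as g grows without bound, the Bayes
# factor of any fixed model against the null model tends to 0, whatever the
# data say. Closed form from Liang et al. (2008); n, q, r2 are illustrative.
import numpy as np

def log_bf(g, n, q, r2):
    """Log Bayes factor of a model with q covariates and fit R^2 vs the null model."""
    return 0.5 * (n - 1 - q) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

n, q, r2 = 50, 3, 0.8          # a model that fits the data rather well
for g in [n, 1e4, 1e8, 1e12]:  # ever larger values of g
    print(g, log_bf(g, n, q, r2))
# For large g the log Bayes factor behaves like -(q/2) log g plus a constant,
# so an "infinitely vague" prior ends up supporting the null model.
```

Running this shows the log Bayes factor eventually decreasing in $g$ and turning negative, despite the strong fit $R^2 = 0.8$, which is exactly the Lindley-Jeffreys phenomenon discussed above.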
3 Mixtures of $g$-priors
The most natural Bayesian approach to solving the uncertainty on the parameter $g$ is to put a hyperprior on this parameter:

$$\pi(\beta_\gamma, \sigma^2, g \mid \gamma) = \pi(\beta_\gamma \mid \sigma^2, g, \gamma)\,\pi(\sigma^2 \mid \gamma)\,\pi(g)\,.$$

This was implicitly proposed by Zellner and Siow (1980), since the Cauchy priors those authors introduced on the $\beta_\gamma$'s correspond to a $g$-prior augmented by an Inverse Gamma$(1/2, n/2)$ prior on $g$.

For model (2), Liang et al. (2008), Cui and George (2008) and Bottolo and Richardson (2010) use

$$\beta_\gamma \mid \sigma^2, g, \gamma \sim \mathcal{N}_{q_\gamma}\!\left(0,\, g\sigma^2\left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1}\right)$$

and a hyperprior of the form

$$\pi(g) = \frac{a-2}{2}\,(1+g)^{-a/2}\,, \qquad g > 0\,,$$

with $a>2$. This constraint on $a$ is due to the fact that the hyperprior must be proper, in connection with the separate processing of the intercept and the use of a Lebesgue measure as a prior on $\alpha$. We note that $a$ needs to be specified, $a=3$ and $a=4$ being the solutions favored by Liang et al. (2008).
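As a quick sanity check that the constraint $a>2$ does make the hyperprior proper, the density above can be integrated numerically; the value $a=3$ used here is one of the choices favored by Liang et al. (2008).

```python
# Numerical check that pi(g) = (a-2)/2 * (1+g)^(-a/2) integrates to 1 on
# (0, inf) when a > 2; here a = 3 (a value favored by Liang et al., 2008).
from scipy.integrate import quad

a = 3.0
density = lambda g: (a - 2.0) / 2.0 * (1.0 + g) ** (-a / 2.0)
mass, _ = quad(density, 0.0, float("inf"))
print(mass)  # close to 1
```

For $a=3$ the integral can also be done by hand: $\int_0^\infty \tfrac12 (1+g)^{-3/2}\,dg = 1$, which matches the numerical value.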
For model (1), a more convincing modelling is possible since the Jeffreys prior on $g$ is available. Indeed, if $\tilde\beta_\gamma = 0$, integrating out $\beta_\gamma$ leads to

$$y \mid \gamma, \sigma^2, g \sim \mathcal{N}_n\!\left(0_n,\, \sigma^2\left(I_n + g P_\gamma\right)\right),$$

where $P_\gamma = X_\gamma\left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1} X_\gamma^{\mathsf T}$ is the orthogonal projector on the linear subspace spanned by the columns of $X_\gamma$. Since the Fisher information on $g$ in this marginal model is

$$I(g) = \frac{1}{2}\,\mathrm{tr}\!\left[\left(\frac{P_\gamma}{1+g}\right)^{2}\right] = \frac{q_\gamma+1}{2(1+g)^2}\,,$$

the corresponding Jeffreys prior on $g$ is

$$\pi(g \mid \gamma) \propto \frac{1}{1+g}\,.$$
Note that, for model (2), Liang et al. (2008) discuss the choice $a=2$, i.e. $\pi(g) \propto (1+g)^{-1}$, as leading to the reference prior and Jeffreys prior, presumably also under the marginal model after integrating out $\beta_\gamma$, although details are not given.
For such a prior modelling, the marginal likelihoods $m(y \mid \gamma)$, and hence the posterior model probabilities, are available in closed form, as expressions involving the Gaussian hypergeometric function ${}_2F_1$ (Butler and Wood, 2002). We can thus proceed to undertake Bayesian variable selection without resorting at all to numerical methods (Marin and Robert, 2007). Moreover, the shrinkage factor $\mathbb{E}\left[g/(1+g) \mid y, \gamma\right]$ induced by the Bayesian modelling can also be expressed in closed form.
This obviously leads to straightforward representations for Bayes estimates. If $X^{\text{new}}$ is a matrix containing new values of the explanatory variables for which we would like to predict the corresponding response $y^{\text{new}}$, the Bayesian predictor of $y^{\text{new}}$ under model $\gamma$ is given by

$$\hat y^{\text{new}} = X^{\text{new}}_\gamma\,\mathbb{E}\left[\beta_\gamma \mid y, \gamma\right].$$

Similarly, the Bayesian model averaging predictor of $y^{\text{new}}$ is given by

$$\hat y^{\text{new}} = \sum_\gamma \pi(\gamma \mid y)\, X^{\text{new}}_\gamma\,\mathbb{E}\left[\beta_\gamma \mid y, \gamma\right].$$
This numerical simplification in the derivation of Bayesian estimates and predictors is found in Liang et al. (2008) and exploited further in Bottolo and Richardson (2010). Note that Guo and Speckman (2009) have furthermore established the consistency of the Bayes factors based on such priors.
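The model averaging predictor above can be sketched in a few lines. In this sketch the candidate models, their posterior weights and the data are illustrative assumptions, and the per-model predictor uses the $g$-prior shrinkage with prior mean zero rather than the paper's hypergeometric closed forms.

```python
# Sketch of Bayesian model averaging (BMA) prediction: average per-model
# predictions, weighted by posterior model probabilities. Models, weights
# and data below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - X[:, 2] + rng.standard_normal(n)
X_new = rng.standard_normal((5, p))   # new design points to predict at
g = float(n)

models = [(0,), (0, 2), (0, 1, 2)]    # candidate subsets gamma
weights = np.array([0.2, 0.7, 0.1])   # stand-ins for posterior probabilities

preds = []
for gamma in models:
    Xg = X[:, gamma]
    beta_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    # g-prior Bayes predictor with prior mean 0: shrink least squares by g/(g+1)
    preds.append(X_new[:, gamma] @ (g / (g + 1.0) * beta_hat))

y_bma = np.average(preds, axis=0, weights=weights)  # model-averaged prediction
print(y_bma)
```

In the paper's setting the weights would be the closed-form posterior probabilities $\pi(\gamma \mid y)$ rather than fixed numbers.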
In contrast with this proposal, the prior of Liang et al. (2008) depends on a tuning parameter $a$. There nonetheless exist arguments to support the latter modelling, including the important issue of invariance under location-scale transforms. As seen in the above formulae, the Jeffreys prior associated with model (1) ensures scale invariance but not location invariance. In order to ensure location invariance for model (1), it would be necessary to center the response variable $y$ as well as the covariates $x_i$. Obviously, this centering of the data is completely unjustified from a Bayesian perspective and, further, it creates artificial correlations between observations. However, it could be argued that the lack of location invariance only pertains to quite specific and somehow artificial situations and that it is negligible in most settings. We will return to this point in the comparison section.
A location-scale invariant alternative consists in using the prior of Liang et al. (2008) with $a=2$, while excluding the null model from the competitors. This prior again leads to a closed-form expression for the model posterior probabilities.
4 Numerical comparisons
We present here the results of numerical experiments aiming at comparing the behavior of Bayesian variable selection and of some popular (non-Bayesian) regularization methods in regression, when considered from a variable selection point of view. The regularization methods that we consider are the Lasso, the Dantzig selector, and the elastic net, described in Section 4.1. The Bayesian variable selection procedures we consider oppose several strategies for selecting the hyperparameter $g$ in Zellner's $g$-priors. We include in this comparison the intrinsic prior (Casella and Moreno, 2006), which is another default objective prior for the noninformative setting that does not require any tuning parameter and is also invariant under location and scale changes. All procedures under comparison are described in Table 1. We have also included in this comparison the highly standard AIC and BIC penalized likelihood criteria. Moreover, we will refer to the performances of an ORACLE procedure that assumes the true model is known and estimates the regression coefficients by least squares.
4.1 Regularization methods
1) The Lasso: Introduced by Tibshirani (1996), the Lasso is a shrinkage method for linear regression. It is defined as the solution to the following penalized least squares optimization problem

$$\hat\beta^{\text{lasso}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_1\,,$$

where $\lambda$ is a positive tuning parameter.
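A minimal Lasso illustration with scikit-learn's implementation; the data and the value of the tuning parameter are assumptions made for the example (scikit-learn's `alpha` corresponds to $\lambda$ up to a $1/(2n)$ rescaling of the squared loss).

```python
# Minimal Lasso fit with scikit-learn; data and alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(n)

lasso = Lasso(alpha=0.3).fit(X, y)
print(lasso.coef_)  # several coefficients are set exactly to 0
```

The exact zeros in `coef_` are what makes the Lasso a variable selection method and not only a shrinkage method.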
2) The Dantzig Selector: Candes and Tao (2007) introduced the Dantzig selector as an alternative to the Lasso. The Dantzig selector is the solution to the optimization problem

$$\hat\beta^{DS} = \arg\min_{\beta}\; \|\beta\|_1 \quad\text{subject to}\quad \left\|X^{\mathsf T}(y - X\beta)\right\|_\infty \le \lambda\,,$$

where $\lambda$ is a positive tuning parameter. The constraint $\|X^{\mathsf T}(y - X\beta)\|_\infty \le \lambda$ can be viewed as a relaxation of the normal equations $X^{\mathsf T}(y - X\beta) = 0$ of classical linear regression.
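Since the objective and the constraint are both piecewise linear, the Dantzig selector can be solved with a generic linear programming solver. A sketch with scipy, writing beta = u - v with u, v >= 0 (the data and the value of lambda are illustrative assumptions):

```python
# Sketch of the Dantzig selector as a linear program:
#   minimise sum(u + v)  subject to  -lam <= X^T (y - X beta) <= lam,
# with beta = u - v and u, v >= 0. Data and lam are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - X[:, 3] + 0.5 * rng.standard_normal(n)

lam = 5.0
XtX, Xty = X.T @ X, X.T @ y

# Stack the two one-sided versions of the infinity-norm constraint.
A_ub = np.block([[XtX, -XtX], [-XtX, XtX]])
b_ub = np.concatenate([Xty + lam, -Xty + lam])
c = np.ones(2 * p)  # l1 objective: sum of u and v

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
beta = res.x[:p] - res.x[p:]
print(beta)
```

The least-squares solution is always feasible (it makes $X^{\mathsf T}(y-X\beta)$ vanish), so the program has a solution for any $\lambda \ge 0$.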
3) The Elastic Net (Enet): The Lasso has at least two limitations: a) it does not encourage grouped selection in the presence of highly correlated covariates and b) in the $p>n$ case it can select at most $n$ covariates. To overcome these limitations, Zou and Hastie (2005) proposed the elastic net, which combines both ridge and Lasso penalties, i.e.

$$\hat\beta^{\text{enet}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2\,,$$

where $\lambda_1$ and $\lambda_2$ are two positive tuning parameters.
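A minimal elastic net illustration with scikit-learn; the data and tuning values are assumptions for the example, and scikit-learn's `(alpha, l1_ratio)` parameterization maps onto $(\lambda_1, \lambda_2)$ up to rescaling.

```python
# Elastic net with scikit-learn: the ridge part of the penalty encourages
# grouped selection of highly correlated covariates. Data are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)  # two highly correlated covariates
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(enet.coef_[:2])  # the correlated pair tends to be kept together
```

With a pure Lasso penalty, one of the two correlated covariates would typically be dropped; the ridge component spreads the weight across the group.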
4.2 Numerical experiments on simulated datasets
We have designed six different simulated datasets as benchmarks chosen as follows:

Example 1 (sparse uncorrelated design) corresponds to an uncorrelated covariate setting, the components of the predictors being iid realizations. The response is simulated as

Example 2 (sparse correlated design) corresponds to a correlated case, in which three groups of predictors are built from shared terms, the remaining components being iid realizations. The use of common terms in the predictors obviously induces a correlation among them: the correlation is 0.9 within each of the three groups of variables, and there is no correlation between the groups. The response is simulated as

Example 3 (sparse noisy correlated design) involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations
The response is simulated as

Example 4 (saturated correlated design) is the same as Example 3, except that the response is simulated as

Example 5 involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations
The response is simulated as

Example 6 (null model) involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations
The response is simulated as
Each dataset consists of a training set, on which the regression model has been fitted, and a test set for assessing performances. The tuning parameters of the Lasso, the Dantzig selector (DZ), and the elastic net (ENET) have been selected by minimizing the leave-one-out cross-validation prediction error. For each example, independent datasets have been simulated. We use three performance measures:

The root mean squared error (MSE),

$$\text{MSE} = \sqrt{\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\left(\hat y_i - y_i\right)^2}\,,$$

$\hat y_i$ being the prediction of $y_i$ in the test set;

HITS: the number of correctly identified influential variables;

FP (False Positives): the number of noninfluential variables declared as influential.
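The three performance measures above can be sketched as small helper functions; the selected and true supports in the usage line are illustrative inputs, not results from the experiments.

```python
# Sketch of the three performance measures: root mean squared prediction
# error, HITS and FP. Inputs in the usage example are illustrative.
import numpy as np

def rmse(y_pred, y_test):
    """Root mean squared prediction error over the test set."""
    return float(np.sqrt(np.mean((y_pred - y_test) ** 2)))

def hits_and_fp(selected, true_support):
    """HITS: correctly identified influential variables; FP: false positives."""
    selected, true_support = set(selected), set(true_support)
    hits = len(selected & true_support)   # influential variables recovered
    fp = len(selected - true_support)     # noninfluential variables declared influential
    return hits, fp

print(hits_and_fp({0, 1, 4, 6}, {0, 1, 4}))  # (3, 1)
```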
Using those six different datasets as benchmarks, we compare the variable selection methods listed in Table 1. The performances of these selection methods are summarized in Tables 2-13. In the Bayesian approaches, the set of variables is naturally selected according to the maximum posterior probability and the predictions are obtained via the Bayesian model averaging predictors.
Table 1: Variable selection procedures under comparison.

AIC  Akaike Information Criterion
BIC  Bayesian Information Criterion
BRIC  $g$-prior with $g=\max(n,p^2)$ (Fernandez et al., 2001)
EBL  Local empirical Bayes estimate of $g$ in the $g$-prior (Cui and George, 2008)
EBG  Global empirical Bayes estimate of $g$ in the $g$-prior (Cui and George, 2008)
ZSN  Zellner-Siow prior, base model in the Bayes factor taken as the null model (Liang et al., 2008)
ZSF  Zellner-Siow prior, base model in the Bayes factor taken as the full model (Liang et al., 2008)
OVS  Objective variable selection using the intrinsic prior (Casella and Moreno, 2006)
HG3  Hyper-$g$ prior with $a=3$ (Liang et al., 2008)
HG4  Hyper-$g$ prior with $a=4$ (Liang et al., 2008)
HG2  Hyper-$g$ prior with $a=2$ (Liang et al., 2008), null model excluded
NIMS  Jeffreys prior on $g$ under the non-location-invariant model (1)
LASSO  Lasso (Tibshirani, 1996)
DZ  The Dantzig selector (Candes and Tao, 2007)
ENET  The elastic net (Zou and Hastie, 2005)
In this numerical experiment, the Bayesian procedures are clearly much more parsimonious than the regularization procedures in that they almost always avoid overfitting. In all examples, the false positive rate FP is smaller for the Bayesian solutions than for the regularization methods. Except for the ZSF and OVS scenarios which behave slightly worse than the others, all the Bayesian procedures tested here produce the same selection of predictors. It seems that ZSF has a slight tendency to select too many variables. The performances of OVS are somewhat disappointing and this procedure seems to have a tendency to be too parsimonious. From a predictive viewpoint, computing the MSE by model averaging, Bayesian approaches also perform better than regularization approaches except for the saturated correlated example (Example 4). We further note that the classical selection procedures based on AIC and BIC do not easily reject variables and are thus slightly worse than Bayesian and regularization procedures (a fact not surprising for AIC). In all examples, the NIMS and HG2 approaches lead to optimal performances in that they select the right covariates and only the right covariates, while achieving close to the minimal root mean squared error compared with all the other Bayesian solutions we considered. They also do almost systematically better than BIC and AIC.
A global remark about this comparison is that all Bayesian procedures have very similar MSEs and thus correspond to the same regularization effect, except for OVS, which does systematically worse. It is however important to notice that the MSE for OVS has not been computed by model averaging, but by using the best model only. Otherwise, it would be hazardous to recommend one of the priors on the basis of those simulations, since there is no noticeable difference between them from either the selection or the prediction point of view.
[Tables 2-13: for each of the six simulated examples, MSE, HITS and FP measures and per-variable selection frequencies for ORACLE, AIC, BIC, BRIC, EBL, EBG, ZSN, ZSF, OVS, HG3, HG4, HG2, NIMS, LASSO, DZ and ENET; the numerical entries were not recovered from the source.]