Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation¹

¹ This paper is part of Mohammed EL Anbari's PhD thesis. This work has been partly supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2009-2012 project ANR-09-BLAN-01 EMILE for the last two authors, and by the Institut Universitaire de France for the last author. Jean-Michel Marin and Christian P. Robert are grateful to the participants in the BIRS 07w5079 meeting on “Bioinformatics, Genetics and Stochastic Computation: Bridging the Gap” for their helpful comments. Discussions in 2007 in Banff with Sylvia Richardson and in Rome with Jim Berger and Paul Speckman are also gratefully acknowledged. Given that Arnold Zellner sadly passed away last August, we would like to dedicate this paper to the memory of this leading Bayesian thinker, who influenced the field so much and will continue to do so for much longer.

Gilles Celeux
Project select, INRIA Saclay, Université Paris Sud, Orsay, France
Mohammed EL Anbari
Université Caddi Ayyad, Marrakech, Maroc
Jean-Michel Marin
Institut de Mathématiques et Modélisation de Montpellier,
Université de Montpellier 2, France
Christian P. Robert
Université Paris-Dauphine, CEREMADE,
Institut Universitaire de France & CREST, France
Abstract

Using a collection of simulated and real benchmarks, we compare Bayesian and frequentist regularization approaches in a poorly informative setting where the number of variables is almost equal to the number of observations. This comparison includes new global noninformative approaches for Bayesian variable selection built on Zellner's $g$-priors that are similar to those of Liang et al. (2008). The interest of these calibration-free proposals is discussed. The numerical experiments we present highlight the appeal of Bayesian regularization methods when compared with non-Bayesian alternatives: they dominate frequentist methods in the sense that they provide smaller prediction errors while selecting the most relevant variables in a parsimonious way.

Keywords: Model choice, regularization methods, noninformative priors, Zellner's $g$-prior, calibration, Lasso, elastic net, Dantzig selector.

1 Introduction

Given a response variable $y$ and a collection of $p$ associated potential predictor variables $x_1,\ldots,x_p$, the classical linear regression model imposes a linear dependence of the conditional expectation of $y$ on the predictors (Rao, 1973),
\[
\mathbb{E}[y \mid x_1,\ldots,x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.
\]

A fundamental inferential direction for those models relates to the variable selection problem, namely that only variables of relevance should be kept within the regression while the others should be removed. While we cannot discuss at length the potential applications of this perspective, variable selection is particularly relevant when the number $p$ of regressors is larger than the number $n$ of observations (as in microarray and other genetic data analyses).

To deal with ill-posed or poorly informative regression problems, many regularization methods have been proposed, such as ridge regression (Hoerl and Kennard, 1970) and the Lasso (Tibshirani, 1996). Recently, interest in frequentist regularization methods has increased and this has produced a flurry of new methods (see, among others, Candes and Tao, 2007, Zou and Hastie, 2005, Zou, 2006, Yuan and Lin, 2007).

However, a natural approach to regularization is to follow the Bayesian paradigm, as demonstrated recently by the Bayesian Lasso of Park and Casella (2008). The literature on Bayesian variable selection is enormous (a small subset of which is, for instance, Mitchell and Beauchamp, 1988, George and McCulloch, 1993, Chipman, 1996, Smith and Kohn, 1996, George and McCulloch, 1997, Dupuis and Robert, 2003, Brown and Vannucci, 1998, Philips and Guttman, 1998, George, 2000, Kohn et al., 2001, Nott and Green, 2004, Schneider and Corcoran, 2004, Casella and Moreno, 2006, Cui and George, 2008, Liang et al., 2008, Bottolo and Richardson, 2010). The number of approaches and scenarios that have been advanced to undertake the selection of the most relevant variables given a set of observations is quite large, presumably due to the vague decisional setting induced by the question “Which variables do matter?” Such a variety of resolutions signals a lack of agreement between the actors in the field.

Most of these solutions, including Liang et al. (2008) and Bottolo and Richardson (2010), focus on the use of the $g$-prior introduced by Zellner (1986). While this prior has a long history and while it reduces the prior input to a single integer $g$, the influence of this remaining prior factor is long-lasting and large values of $g$ are no guarantee of negligible effects, in connection with the Bartlett or Lindley–Jeffreys paradoxes (Bartlett, 1957, Lindley, 1957, Robert, 1993), as illustrated for instance in Celeux et al. (2006) or Marin and Robert (2007). In order to alleviate this influence, some empirical Bayes (Cui and George, 2008) and hierarchical Bayes (Zellner and Siow, 1980, Celeux et al., 2006, Marin and Robert, 2007, Liang et al., 2008, Bottolo and Richardson, 2010) solutions have been proposed. In this paper, we pay special attention to two calibration-free hierarchical Zellner $g$-priors. The first one is based on the Jeffreys prior and is not location invariant. The second one avoids this problem by only considering models that include at least one variable.

The purpose of our paper is to compare the frequentist and the Bayesian points of view on regularization when the number of observations $n$ remains (slightly) greater than the number of covariates $p$; we limit our attention to full-rank models. This comparison is considered from both the predictive and the explanatory points of view. The outcome of this study is that the Bayesian methods behave quite similarly to one another while dominating their frequentist counterparts.

The plan of the paper is as follows: we recall the details of Zellner's (1986) original $g$-prior in Section 2 and discuss therein the potential choices of $g$. We present hierarchical noninformative alternatives in Section 3. Section 4 compares the results of Bayesian and frequentist methods on simulated and real datasets. Section 5 concludes the paper.

2 Zellner’s $g$-priors

Following standard notation, we introduce a binary vector $\gamma=(\gamma_1,\ldots,\gamma_p)\in\{0,1\}^p$ that indicates which variables are active in the regression ($\gamma_j=1$ when $x_j$ is included), excluding the constant vector corresponding to the intercept, which is assumed to be always present in the linear regression model.

We observe $y=(y_1,\ldots,y_n)$ and, for a given $\gamma$, the model $\mathcal{M}_\gamma$ is defined by the conditional distribution
\[
y \mid \gamma, \beta_\gamma, \sigma^2 \sim \mathcal{N}_n\!\left(X_\gamma\beta_\gamma,\ \sigma^2 I_n\right),
\tag{1}
\]
where

  • $q_\gamma=\sum_{j=1}^p\gamma_j$ denotes the number of variables included in model $\mathcal{M}_\gamma$,

  • $X_\gamma$ is the $n\times(q_\gamma+1)$ matrix whose columns are made of the vector $1_n$ and of the variables $x_j$ for which $\gamma_j=1$,

  • $\beta_\gamma\in\mathbb{R}^{q_\gamma+1}$ and $\sigma^2>0$ are unknown parameters.

The same symbol $\sigma^2$ is used for the variance parameter across all models. For model $\mathcal{M}_\gamma$, Zellner's $g$-prior is given by
\[
\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}_{q_\gamma+1}\!\left(\tilde\beta_\gamma,\ g\,\sigma^2\,(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}\right),
\qquad
\pi(\sigma^2\mid\gamma)\propto\sigma^{-2}.
\]
The experimenter chooses the prior expectation $\tilde\beta_\gamma$ and the scale factor $g>0$. For such a prior, the posterior expectation of $\beta_\gamma$ is the classical weighted average between the prior expectation and the least squares estimate,
\[
\mathbb{E}[\beta_\gamma\mid y,\gamma]
=\frac{g}{g+1}\,\hat\beta_\gamma+\frac{1}{g+1}\,\tilde\beta_\gamma,
\qquad
\hat\beta_\gamma=(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}X_\gamma^{\mathsf{T}}y.
\]

This prior is traditionally called Zellner's $g$-prior in the Bayesian folklore because of the use of the constant $g$ by Zellner (1986) in front of the inverse $(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}$ of Fisher's information matrix. Its appeal is that, by using the information matrix as a global scale,

  • it avoids the specification of a whole prior covariance matrix, which would be a tremendous task;

  • it allows for a specification of the constant $g$ in terms of observational units, or virtual prior pseudo-observations in the sense of de Finetti (1972).

However, a fundamental feature of the above $g$-prior is that it is improper, due to the infinite mass placed on $\sigma^2$. From a theoretical point of view, this should jeopardize the use of posterior model probabilities, since these probabilities are not uniquely scaled under improper priors: there is no way of eliminating the residual constant factor in those priors (DeGroot, 1973, Kass and Raftery, 1995, Robert, 2001). However, under the assumption that $\sigma^2$ is a parameter with a meaning common to all models $\mathcal{M}_\gamma$, Berger et al. (1998) developed a framework that allows one to work with a single improper prior common to all models (see also Marin and Robert, 2007). A fundamental appeal of Zellner's $g$-prior in model comparison, and in particular in variable selection, is its simplicity, since it reduces the prior input to the sole specification of the scale factor $g$.
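To make the weighted-average form of the posterior mean concrete, here is a minimal numerical sketch of the shrinkage induced by a $g$-prior; the simulated design, the value of $g$, the prior mean and all numerical settings are arbitrary illustrative choices, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n observations, an intercept and two covariates
# (all settings are hypothetical, not taken from the paper).
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mean under a Zellner g-prior: a weighted average of the least
# squares estimate and the prior expectation, with weights g/(g+1) and 1/(g+1).
g = float(n)                    # one possible calibration (unit information)
beta_prior_mean = np.zeros(3)   # hypothetical prior expectation
beta_post_mean = (g / (g + 1)) * beta_hat + (1 / (g + 1)) * beta_prior_mean

print("least squares estimate :", np.round(beta_hat, 3))
print("g-prior posterior mean :", np.round(beta_post_mean, 3))
```

As $g$ increases, the posterior mean moves from the prior expectation towards the least squares estimate.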

At this stage, we need to point out that an alternative $g$-prior is often used (Berger et al., 1998, Fernandez et al., 2001, Liang et al., 2008, Bottolo and Richardson, 2010), by singling out the intercept parameter in the linear regression. Assuming first a centering of the covariates, i.e. $\bar x_j=\frac{1}{n}\sum_{i=1}^n x_{ij}=0$ for all $j$'s, the intercept is given a flat prior while the other regression parameters are associated with a corresponding $g$-prior. This alternative to model $\mathcal{M}_\gamma$, which we denote by $\widetilde{\mathcal{M}}_\gamma$ to stress the distinction between both representations, is such that

\[
y \mid \gamma, \alpha, \beta_\gamma, \sigma^2 \sim \mathcal{N}_n\!\left(\alpha 1_n + Z_\gamma\beta_\gamma,\ \sigma^2 I_n\right),
\tag{2}
\]
where

  • $Z_\gamma$ is the $n\times q_\gamma$ matrix whose columns are made of the (centered) variables $x_j$ for which $\gamma_j=1$,

  • $\alpha\in\mathbb{R}$, $\beta_\gamma\in\mathbb{R}^{q_\gamma}$ and $\sigma^2>0$ are unknown parameters.

The parameters $\alpha$ and $\sigma^2$ are denoted the same way across all models and rely on the same prior. Namely, for model $\widetilde{\mathcal{M}}_\gamma$, the corresponding Zellner's $g$-prior is given by
\[
\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}_{q_\gamma}\!\left(\tilde\beta_\gamma,\ g\,\sigma^2\,(Z_\gamma^{\mathsf{T}}Z_\gamma)^{-1}\right),
\qquad
\pi(\alpha,\sigma^2\mid\gamma)\propto\sigma^{-2}.
\]
In that case, we obtain
\[
\mathbb{E}[\alpha\mid y,\gamma]=\bar y
\]
and
\[
\mathbb{E}[\beta_\gamma\mid y,\gamma]
=\frac{g}{g+1}\,\hat\beta_\gamma+\frac{1}{g+1}\,\tilde\beta_\gamma,
\qquad
\hat\beta_\gamma=(Z_\gamma^{\mathsf{T}}Z_\gamma)^{-1}Z_\gamma^{\mathsf{T}}y.
\]

For models $\mathcal{M}_\gamma$ and $\widetilde{\mathcal{M}}_\gamma$, in a noninformative setting, we can for instance choose $\tilde\beta_\gamma=0$ (in either representation) and $g$ large. However, as pointed out in Marin and Robert (2007, Chapter 3) among others, there is a lasting influence of $g$ over the resulting inference and it is impossible to “let $g$ go to infinity” to eliminate this influence, because of the Bartlett and Lindley–Jeffreys paradoxes (Bartlett, 1957, Lindley, 1957, Robert, 1993): an infinite value of $g$ ends up selecting the null model, regardless of the information brought by the data. For this reason, data-dependent versions of $g$ have been proposed, with various degrees of justification:

  • Kass and Wasserman (1995) use $g=n$, so that the amount of information about the parameters contained in the prior equals the amount of information brought by a single observation. As shown by Foster and George (1994), for $n$ large enough this perspective is very close to using the Schwarz (Kass and Wasserman, 1995) or BIC criterion, in that the log-posterior corresponding to $g=n$ is equal to the penalized log-likelihood of this criterion.

  • Foster and George (1994) and George and Foster (2000) propose $g=p^2$, in connection with the Risk Inflation Criterion (RIC), which penalizes the regression sum of squares.

  • Fernandez et al. (2001) gather both perspectives in $g=\max(n,p^2)$ as a conservative bridge between BIC and RIC, a choice that they christened the “benchmark prior”.

  • George and Foster (2000) and Cui and George (2008) resort to empirical Bayes techniques.

These solutions, while commendable since they are based on asymptotic properties (see in particular Fernandez et al., 2001, for consistency results), are nonetheless unsatisfactory in that they depend on the sample size and involve a degree of arbitrariness.
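To illustrate the Bartlett effect and the conventional calibrations listed above, the following sketch evaluates the $g$-prior Bayes factor against the null model for increasing values of $g$, using the closed form $(1+g)^{(n-1-q_\gamma)/2}\{1+g(1-R^2_\gamma)\}^{-(n-1)/2}$ that holds when the intercept carries a flat prior (as in, e.g., Liang et al., 2008); the sample size, model size and $R^2$ value are hypothetical.

```python
import numpy as np

def log_bf_vs_null(g, n, q, r2):
    """Log Bayes factor of a model with q covariates against the null model,
    under a g-prior with a flat prior on the intercept (closed form as in
    Liang et al., 2008); r2 is the model's coefficient of determination."""
    return 0.5 * (n - 1 - q) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))

# Hypothetical values: 50 observations, a 3-covariate model with a decent fit.
n, q, r2 = 50, 3, 0.6
for g in [1, 10, 1e2, 1e4, 1e8, 1e12]:
    print(f"g = {g:8.0e}   log BF vs null = {log_bf_vs_null(g, n, q, r2):8.2f}")

# The log Bayes factor eventually decreases without bound as g grows, so letting
# g "go to infinity" ends up favouring the null model (the Bartlett effect).

# Conventional data-dependent calibrations mentioned in the text
# (p is the total number of candidate covariates; its value here is arbitrary):
p = 10
print("unit information:  g = n           =", n)
print("RIC:               g = p^2         =", p**2)
print("benchmark prior:   g = max(n, p^2) =", max(n, p**2))
```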

3 Mixtures of $g$-priors

The most natural Bayesian approach to dealing with the uncertainty about the parameter $g$ is to put a hyperprior on this parameter:

  • This was implicitly proposed by Zellner and Siow (1980): those authors introduced Cauchy priors on the $\beta_\gamma$'s, which corresponds to a $g$-prior augmented by a Gamma prior on $1/g$.

  • For model $\widetilde{\mathcal{M}}_\gamma$, Liang et al. (2008), Cui and George (2008) and Bottolo and Richardson (2010) use
\[
\beta_\gamma \mid \gamma, \sigma^2, g \sim \mathcal{N}_{q_\gamma}\!\left(0_{q_\gamma},\ g\,\sigma^2\,(Z_\gamma^{\mathsf{T}}Z_\gamma)^{-1}\right)
\]
    and a hyperprior of the form
\[
\pi(g\mid\gamma)=\frac{a-2}{2}\,(1+g)^{-a/2},\qquad g>0,
\]
    with $a>2$. This constraint on $a$ is due to the fact that the hyperprior must be proper (indeed $\int_0^\infty(1+g)^{-a/2}\,\mathrm{d}g=2/(a-2)$ is finite only when $a>2$), in connection with the separate processing of the intercept and the use of a Lebesgue measure as a prior on $\alpha$. We note that $a$ needs to be specified, $a=3$ and $a=4$ being the solutions favored by Liang et al. (2008).

  • For model $\mathcal{M}_\gamma$, Celeux et al. (2006) and Marin and Robert (2007) used
\[
\beta_\gamma \mid \gamma, \sigma^2, g \sim \mathcal{N}_{q_\gamma+1}\!\left(0_{q_\gamma+1},\ g\,\sigma^2\,(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}\right)
\]
    and a hyperprior of the form
\[
\pi(g\mid\gamma)\propto g^{-1}\,\mathbb{I}_{\mathbb{N}^{*}}(g).
\]
    The choice of the integer support is mostly computational, while the Jeffreys-like shape is not formally justified, but the authors argue that it is appropriate for a scale parameter.

For model $\mathcal{M}_\gamma$, a more convincing modelling is possible since the Jeffreys prior is available. Indeed, if
\[
\beta_\gamma \mid \gamma, \sigma^2, g \sim \mathcal{N}_{q_\gamma+1}\!\left(\tilde\beta_\gamma,\ g\,\sigma^2\,(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}\right),
\]
then
\[
y \mid \gamma, \sigma^2, g \sim \mathcal{N}_n\!\left(X_\gamma\tilde\beta_\gamma,\ \sigma^2\,(I_n+g\,P_\gamma)\right),
\]
where $P_\gamma=X_\gamma(X_\gamma^{\mathsf{T}}X_\gamma)^{-1}X_\gamma^{\mathsf{T}}$ is the orthogonal projector on the linear subspace spanned by the columns of $X_\gamma$. Since the Fisher information matrix on $(g,\sigma^2)$ associated with this marginal model is
\[
I(g,\sigma^2)=\frac{1}{2}
\begin{pmatrix}
(q_\gamma+1)/(1+g)^{2} & (q_\gamma+1)/\{\sigma^{2}(1+g)\}\\[2pt]
(q_\gamma+1)/\{\sigma^{2}(1+g)\} & n/\sigma^{4}
\end{pmatrix},
\]
the corresponding Jeffreys prior on $(g,\sigma^2)$ is
\[
\pi(g,\sigma^2\mid\gamma)\propto\sqrt{\det I(g,\sigma^2)}\propto\sigma^{-2}(1+g)^{-1}.
\]

Note that, for model $\widetilde{\mathcal{M}}_\gamma$, Liang et al. (2008) discuss the choice $a=2$, i.e. $\pi(g)\propto(1+g)^{-1}$, as leading to the reference prior and to the Jeffreys prior, presumably also under the marginal model obtained after integrating out $\beta_\gamma$, although details are not given.

For such a prior modelling, there exists a closed-form representation of the posterior quantities of interest: taking $\tilde\beta_\gamma=0_{q_\gamma+1}$, the marginal likelihood of model $\mathcal{M}_\gamma$ is available in closed form and the corresponding model posterior probability is
\[
\pi(\gamma\mid y)\propto\pi(\gamma)\,(q_\gamma+1)^{-1}\,
{}_2F_1\!\left(\frac{n}{2},\,1;\,\frac{q_\gamma+3}{2};\,\frac{y^{\mathsf{T}}P_\gamma y}{y^{\mathsf{T}}y}\right),
\tag{3}
\]
where ${}_2F_1$ is the Gaussian hypergeometric function (Butler and Wood, 2002). We can thus proceed with Bayesian variable selection without resorting at all to numerical integration methods (Marin and Robert, 2007). Moreover, the shrinkage factor $\mathbb{E}[g/(g+1)\mid y,\gamma]$ induced by the Bayesian modelling can also be expressed in closed form as a ratio of Gaussian hypergeometric functions.

This obviously leads to straightforward representations of the Bayes estimates. If $\tilde X$ is a matrix containing new values of the explanatory variables for which we would like to predict the corresponding response $\tilde y$, the Bayesian predictor of $\tilde y$ under model $\mathcal{M}_\gamma$ is given by
\[
\hat{\tilde y}_\gamma=\tilde X_\gamma\,\mathbb{E}[\beta_\gamma\mid y,\gamma],
\]
where $\tilde X_\gamma$ retains the intercept column and the columns of $\tilde X$ corresponding to $\gamma_j=1$. Similarly, the Bayesian model averaging predictor of $\tilde y$ is given by
\[
\hat{\tilde y}=\sum_{\gamma}\pi(\gamma\mid y)\,\tilde X_\gamma\,\mathbb{E}[\beta_\gamma\mid y,\gamma].
\]
This numerical simplification in the derivation of Bayesian estimates and predictors is found in Liang et al. (2008) and exploited further in Bottolo and Richardson (2010). Note that Guo and Speckman (2009) have furthermore established the consistency of the Bayes factors based on such priors.

In contrast with this proposal, the prior of Liang et al. (2008) depends on a tuning parameter $a$. Despite that, there also exist arguments supporting this prior modelling, including the important issue of invariance under location-scale transforms. As seen in the above formulae, the Jeffreys prior associated with model $\mathcal{M}_\gamma$ ensures scale invariance but not location invariance. In order to ensure location invariance for model $\mathcal{M}_\gamma$, it would be necessary to center the response variable $y$ as well as the explanatory variables $x_j$. Obviously, this centering of the data is completely unjustified from a Bayesian perspective and, furthermore, it creates artificial correlations between observations. However, it could be argued that the lack of location invariance only pertains to quite specific and somewhat artificial situations and is negligible in most cases. We will return to this point in the comparison section.

A location-scale invariant alternative consists in using the prior of Liang et al. (2008) and excluding the null model from the set of competitors. This prior leads to the model posterior probability
\[
\pi(\gamma\mid y)\propto\pi(\gamma)\,\frac{a-2}{q_\gamma+a-2}\,
{}_2F_1\!\left(\frac{n-1}{2},\,1;\,\frac{q_\gamma+a}{2};\,R^2_\gamma\right),
\qquad \gamma\neq 0,
\tag{5}
\]
where $R^2_\gamma$ is the coefficient of determination of model $\widetilde{\mathcal{M}}_\gamma$. Equations (3) and (5) are similar. However, in the hypergeometric argument of (5), $y$ is centered, which ensures the location invariance of the selection procedure.
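Since posterior probabilities of the form (5) involve only the Gaussian hypergeometric function, exhaustive model scoring is straightforward for moderate $p$. The sketch below does this with scipy.special.hyp2f1 on a small simulated design and a uniform prior over non-null models; the value of $a$, the dimensions and the simulation settings are illustrative assumptions rather than the paper's choices.

```python
import itertools
import numpy as np
from scipy.special import hyp2f1

rng = np.random.default_rng(1)

# Illustrative data: only x1 and x2 actually enter the response.
n, p = 40, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

a = 3.0                     # hyper-g hyperparameter (a > 2); illustrative choice
yc = y - y.mean()           # centred response, as required for location invariance

def log_weight(subset):
    """Unnormalised log posterior weight of a non-empty subset of covariates,
    based on the closed form (a - 2)/(q + a - 2) * 2F1((n-1)/2, 1; (q+a)/2; R^2)."""
    q = len(subset)
    Z = X[:, subset] - X[:, subset].mean(axis=0)      # centred covariates
    coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    r2 = 1.0 - np.sum((yc - Z @ coef) ** 2) / np.sum(yc ** 2)
    return np.log((a - 2) / (q + a - 2)) + np.log(hyp2f1((n - 1) / 2, 1.0, (q + a) / 2, r2))

# Enumerate every non-null model (the null model is excluded as in the text).
subsets = [list(s) for k in range(1, p + 1) for s in itertools.combinations(range(p), k)]
logw = np.array([log_weight(s) for s in subsets])
prob = np.exp(logw - logw.max())
prob /= prob.sum()

for i in np.argsort(prob)[::-1][:3]:
    print("model", subsets[i], "posterior probability", round(float(prob[i]), 3))
```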

4 Numerical comparisons

We present here the results of numerical experiments aiming at comparing the behavior of Bayesian variable selection with that of some popular (non-Bayesian) regularization methods in regression, when considered from a variable selection point of view. The regularization methods that we consider are the Lasso, the Dantzig selector, and the elastic net, described in Section 4.1. The Bayesian variable selection procedures we consider contrast different strategies for selecting the hyperparameter $g$ in Zellner's $g$-priors. We include in this comparison the intrinsic prior (Casella and Moreno, 2006), which is another default objective prior for the noninformative setting that does not require any tuning parameter and is also invariant under location and scale changes. All procedures under comparison are described in Table 1. We have also included the highly standard AIC and BIC penalized likelihood criteria. Moreover, we refer to the performances of an ORACLE procedure that assumes the true model is known and estimates the regression coefficients by least squares.

4.1 Regularization methods

1) The Lasso:

Introduced by Tibshirani (1996), the Lasso is a shrinkage method for linear regression. It is defined as the solution to the penalized least squares optimization problem
\[
\hat\beta^{\text{lasso}}=\arg\min_{\beta}\ \|y-X\beta\|_2^2+\lambda\|\beta\|_1,
\]
where $\lambda$ is a positive tuning parameter.

2) The Dantzig Selector:

Candes and Tao (2007) introduced the Dantzig selector as an alternative to the Lasso. The Dantzig selector is the solution to the optimization problem
\[
\hat\beta^{\text{DZ}}=\arg\min_{\beta}\ \|\beta\|_1
\quad\text{subject to}\quad
\|X^{\mathsf{T}}(y-X\beta)\|_\infty\le\lambda,
\]
where $\lambda$ is a positive tuning parameter. The constraint can be viewed as a relaxation of the normal equations $X^{\mathsf{T}}(y-X\beta)=0$ of classical linear regression.

3) The Elastic Net (Enet):

The Lasso has at least two limitations: a) it does not encourage grouped selection in the presence of highly correlated covariates and b) when $p>n$ it can select at most $n$ covariates. To overcome these limitations, Zou and Hastie (2005) proposed the elastic net, which combines the ridge and Lasso penalties, i.e.
\[
\hat\beta^{\text{enet}}=\arg\min_{\beta}\ \|y-X\beta\|_2^2+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2,
\]
where $\lambda_1$ and $\lambda_2$ are two positive tuning parameters.
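As an illustration of how these frequentist competitors can be tuned by leave-one-out cross-validation (the criterion used in the experiments below), here is a hedged sketch based on scikit-learn; the use of scikit-learn is an assumption about tooling, since the paper does not specify its implementation, and the Dantzig selector is omitted because it requires a linear-programming solver not available in that library.

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(2)

# Illustrative training data (not one of the paper's benchmarks).
n, p = 60, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

loo = LeaveOneOut()

# Lasso: the penalty (called alpha in scikit-learn) is chosen by minimising
# the leave-one-out cross-validated prediction error.
lasso = LassoCV(cv=loo).fit(X, y)

# Elastic net: both the overall penalty and the l1/l2 mixing proportion are tuned.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=loo).fit(X, y)

print("Lasso selected variables:", np.flatnonzero(lasso.coef_ != 0))
print("Enet  selected variables:", np.flatnonzero(enet.coef_ != 0))
print("Lasso alpha:", round(lasso.alpha_, 4), " Enet alpha:", round(enet.alpha_, 4))
```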

4.2 Numerical experiments on simulated datasets

We have designed six different simulated datasets as benchmarks chosen as follows:

  1. Example 1 (sparse uncorrelated design) corresponds to an uncorrelated covariate setting (), with predictors and where the components of () are iid realizations. The response is simulated as

  2. Example 2 (sparse correlated design) corresponds to a correlated case (), with predictors and , for , , for , and for , the components of () being iid realizations. The use of common terms in the ’s obviously induces a correlation among those ’s: the correlation between variables and is 0.9, as for the variables (, and ), and for the variables (, , , and ). There is no correlation between those three groups of variables. The response is simulated as

  3. Example 3 (sparse noisy correlated design) involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations

    The response is simulated as

  4. Example (saturated correlated design) is the same as Example , except that the response is simulated as

  5. Example 5 involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations

    The response is simulated as

  6. Example 6 (null model) involves predictors. Those variables are generated using a multivariate Gaussian distribution with correlations

    The response is simulated as

Each dataset consists of a training set of size , on which the regression model is fitted, and a test set of size , used for assessing performance. Tuning parameters in the Lasso, the Dantzig selector (DZ), and the elastic net (ENET) have been selected by minimizing the leave-one-out cross-validation prediction error. For each example, independent datasets have been simulated. We use three measures of performance (a toy illustration is given after the list):

  1. The root mean squared error (MSE)
\[
\text{MSE}=\sqrt{\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\left(y_i-\hat y_i\right)^2},
\]
    $\hat y_i$ being the prediction of $y_i$ in the test set;

  2. HITS: the number of correctly identified influential variables;

  3. FP (False Positives): the number of non-influential variables declared as influential.
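The following toy sketch ties the three measures together: it simulates a correlated sparse design, fits one selection method on a training set and reports the root mean squared prediction error on a test set together with HITS and FP. All sizes, correlation levels and coefficients are hypothetical placeholders and do not reproduce any of the benchmarks above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)

# Hypothetical benchmark: p correlated Gaussian covariates, a sparse truth.
n_train, n_test, p = 50, 200, 8
cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-type correlation
X = rng.multivariate_normal(np.zeros(p), cov, size=n_train + n_test)
beta = np.zeros(p)
beta[[0, 1, 4]] = [3.0, 1.5, 2.0]          # true influential variables
y = X @ beta + rng.normal(scale=1.0, size=n_train + n_test)

X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Any selection/prediction method can be plugged in here; the Lasso is one example.
fit = LassoCV(cv=5).fit(X_train, y_train)
selected = set(np.flatnonzero(fit.coef_ != 0))
truth = set(np.flatnonzero(beta != 0))

rmse = np.sqrt(np.mean((y_test - fit.predict(X_test)) ** 2))  # root mean squared error
hits = len(selected & truth)          # correctly identified influential variables
false_pos = len(selected - truth)     # non-influential variables declared influential

print(f"RMSE = {rmse:.3f}, HITS = {hits}, FP = {false_pos}")
```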

Using those six different datasets as benchmarks, we compare the variable selection methods listed in Table 1. The performances of the above selection methods are summarized in Tables 2–13. In the Bayesian approaches, the set of variables is naturally selected according to the maximum posterior probability, and the predictions are obtained via the Bayesian model averaging predictor.

AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
BRIC: $g$-prior with $g=\max(n,p^2)$ (Fernandez et al., 2001)
EB-L: Local empirical Bayes estimate of $g$ in the $g$-prior (Cui and George, 2008)
EB-G: Global empirical Bayes estimate of $g$ in the $g$-prior (Cui and George, 2008)
ZS-N: Zellner–Siow prior, base model in the Bayes factor taken as the null model (Liang et al., 2008)
ZS-F: Zellner–Siow prior, base model in the Bayes factor taken as the full model (Liang et al., 2008)
OVS: Objective variable selection using the intrinsic prior (Casella and Moreno, 2006)
HG-: Hyper-$g$ prior (Liang et al., 2008)
HG-: Hyper-$g$ prior (Liang et al., 2008)
HG-: Hyper-$g$ prior (Liang et al., 2008), null model excluded
NIMS: Jeffreys prior on the non-invariant model $\mathcal{M}_\gamma$
LASSO: Lasso (Tibshirani, 1996)
DZ: Dantzig selector (Candes and Tao, 2007)
ENET: Elastic net (Zou and Hastie, 2005)
Table 1: Acronyms and descriptions of the variable selection methods compared in the numerical experiments. (The blocks separate the methods by their nature.)

In this numerical experiment, the Bayesian procedures are clearly much more parsimonious than the regularization procedures, in that they almost always avoid overfitting. In all examples, the number of false positives FP is smaller for the Bayesian solutions than for the regularization methods. Except for the ZS-F and OVS scenarios, which behave slightly worse than the others, all the Bayesian procedures tested here produce the same selection of predictors. It seems that ZS-F has a slight tendency to select too many variables. The performances of OVS are somewhat disappointing, and this procedure seems to have a tendency to be too parsimonious. From a predictive viewpoint, computing the MSE by model averaging, the Bayesian approaches also perform better than the regularization approaches, except for the saturated correlated example (Example 4). We further note that the classical selection procedures based on AIC and BIC do not easily reject variables and are thus slightly worse than the Bayesian and regularization procedures (a fact that is not surprising for AIC). In all examples, the NIMS and HG-2 approaches lead to optimal performances in that they select the right covariates and only the right covariates, while achieving close to the minimal root mean squared error among all the other Bayesian solutions we considered. They also do almost systematically better than BIC and AIC.

A global remark about this comparison is that all Bayesian procedures have very similar MSEs and thus correspond to the same regularization effect, except for OVS, which does systematically worse. However, it is important to notice that the MSE for OVS has not been computed by model averaging, but by using the best model only. Otherwise, it would be hazardous to recommend one particular prior on the basis of those simulations, since there is no substantial difference between them from either the selection or the prediction point of view.

[Table 2: Example 1: Mean of MSE, HITS and FP. The numbers between parentheses are the corresponding standard errors.]

[Table 3: Example 1: Relative frequencies of the selected variables for methods under comparison.]

[Table 4: Example 2: Mean of MSE, HITS and FP. The numbers between parentheses are the corresponding standard errors.]

[Table 5: Example 2: Relative frequencies of the selected variables for methods under comparison.]

[Table 6: Example 3: Mean of MSE, HITS and FP. The numbers between parentheses are the corresponding standard errors.]

[Table 7: Example 3: Relative frequencies of the selected variables for methods under comparison.]

[Table 8: Example 4: Mean of MSE, HITS and FP. The numbers between parentheses are the corresponding standard errors.]

[Table 9: Example 4: Relative frequencies of the selected variables for methods under comparison.]

[Table 10: Example 5: Mean of MSE, HITS and FP. The numbers between parentheses are the corresponding standard errors.]

[Table 11: Example 5: Relative frequencies of the selected variables for methods under comparison.]