
Bayes variable selection in semiparametric linear models

Suprateek Kundu and David B. Dunson

Abstract: There is a rich literature proposing methods and establishing asymptotic properties of Bayesian variable selection methods for parametric models, with a particular focus on the normal linear regression model and an increasing emphasis on settings in which the number of candidate predictors (p) diverges with the sample size (n). Our focus is on generalizing methods and asymptotic theory established for mixtures of g-priors to semiparametric linear regression models having unknown residual densities. Using a Dirichlet process location mixture for the residual density, we propose a semiparametric g-prior which incorporates an unknown matrix of cluster allocation indicators. For this class of priors, posterior computation can proceed via a straightforward stochastic search variable selection algorithm. In addition, Bayes factor and variable selection consistency is shown to result under various cases including proper and improper priors on g and σ², with the models under comparison restricted to have model dimensions diverging at a rate less than n.

Key words: Asymptotic theory; Bayes factor; g-prior; Large p, small n; Model selection; Posterior consistency; Subset selection; Stochastic search variable selection.

Suprateek Kundu is a doctoral candidate in the Dept. of Biostatistics, UNC Chapel Hill, Chapel Hill, NC 27599, USA (skundu@email.unc.edu).
David B. Dunson is Professor in the Dept. of Statistical Science, Duke University, Durham, NC 27708, USA (dunson@stat.duke.edu).

1. INTRODUCTION
Bayesian variable selection is very widely applied, with a rich literature on alternative priors and computational methods. For a recent review of Bayesian variable selection methods, refer to O’Hara and Sillanpää (2009). Most of the literature has focused on Gaussian linear regression models, with common methods including stochastic search variable selection (SSVS) (George and McCulloch, 1993; 1997), reversible jump MCMC (Green, 1995) and adaptive shrinkage (Tibshirani, 1996; Park and Casella, 2008; Yi and Xu, 2008). Such methods can be applied directly for kernel or basis function selection in nonlinear regression with Gaussian residuals (Smith and Kohn, 1996) and can be adapted to accommodate generalized linear models with outcomes in the exponential family (Raftery and Richardson 1993; Meyer and Laud 2002).

It is well known that Bayesian variable selection can be sensitive to the prior, and there is an increasingly rich literature establishing asymptotic properties that provide support for carefully chosen priors, such as mixtures of g-priors (Zellner and Siow, 1980; Liang et al., 2008), with such priors also having appealing computational properties. This literature is focused essentially entirely on Gaussian linear regression models, and the emphasis of this article is on developing methods that generalize this work to semiparametric regression models having unknown residual distributions.

To set the stage, first consider the well-studied problem of comparison of linear models of the following type:

\[
\mathcal{M}_j : \quad Y = \alpha 1_n + X_j \beta_j + \epsilon_j, \qquad \epsilon_j \sim N_n(0, \sigma^2 I_n), \qquad j = 1, 2, \tag{1}
\]

where Y is an n × 1 vector of responses, α is the common intercept, X_j is an n × p_j design matrix (j = 1, 2) excluding the column of intercepts, and the ε_j's are Gaussian residuals, j = 1, 2. The models may or may not be nested, and the number of candidate predictors is p. Among numerous model selection criteria available for such comparisons, the Bayes factor (Kass and Raftery, 1995) has received substantial attention as the most widely accepted Bayesian measure of the weight of evidence in the data in favor of one model over another. The Bayes factor for comparing M1 versus M2 based on a sample Y is defined as the ratio of the marginal likelihoods under M1 and M2. Assuming one of the models under comparison is true, Bayes factor consistency refers to the phenomenon where the Bayes factor in favor of M1 diverges as n grows under M1 and converges to zero under M2. A stronger form of consistency holds when the convergence occurs almost surely. When comparing the true model pairwise to each model in a list, Bayes factor consistency typically implies that the posterior probability of the true model goes to one.
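In display form (with θ_j denoting the parameters of M_j and m_j its marginal likelihood, notation introduced here only for concreteness),
\[
\mathrm{BF}_{12}(Y) = \frac{m_1(Y)}{m_2(Y)}, \qquad m_j(Y) = \int f(Y \mid \theta_j, \mathcal{M}_j)\, \pi(\theta_j \mid \mathcal{M}_j)\, d\theta_j,
\]
and Bayes factor consistency requires
\[
\mathrm{BF}_{12}(Y) \to \infty \ \text{ under } \mathcal{M}_1 \qquad \text{and} \qquad \mathrm{BF}_{12}(Y) \to 0 \ \text{ under } \mathcal{M}_2, \qquad \text{as } n \to \infty .
\]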

Although priors most commonly used in practice assume a priori independence among the elements of the coefficient vectors (β1 and β2), priors that have been shown to result in Bayes factor consistency typically incorporate dependence. Examples include the intrinsic prior (Berger and Pericchi, 1996; Moreno, Bertolino and Racugno, 1998), and Zellner's g-prior (Zellner, 1986), specified by β_j | σ² ~ N(0, g σ² (X_j'X_j)⁻¹), j = 1, 2. Intrinsic priors have proven to behave very well in multiple testing problems (Casella and Moreno, 2006). Zellner's g-prior allows for a convenient correlation structure and controls the amount of prior information relative to the sample through a single hyperparameter g. Among others, Fernández et al. (2001) investigated Bayes factor consistency under various choices of fixed g, which was allowed to depend on the sample size and/or the number of candidate predictors. To resolve difficulties associated with a fixed choice of g, such as Bartlett's paradox (Bartlett, 1957; Jeffreys, 1961) and the information paradox (Zellner, 1986; Berger and Pericchi, 2001), Zellner and Siow (1980) placed an inverse-gamma prior on g, while Liang et al. (2008) extended the idea of Strawderman (1971) to the regression context by proposing hyper-g and hyper-g/n priors on g, under which they established Bayes factor consistency. The above approaches entail specifying improper priors on common model parameters and proper priors on model parameters unique to any one model, which results in a prior specification for the more complex model depending upon the simpler model. To avoid such pitfalls, Guo and Speckman (2009) adopted the idea of Marin and Robert (2007) and placed mixtures of g-priors on all the elements of both β1 and β2, which leads to tractable Bayes factors as well as Bayes factor consistency.
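The Zellner-Siow specification referred to above corresponds to mixing the g-prior over an inverse-gamma prior; in standard form,
\[
\beta_j \mid \sigma^2, g \;\sim\; N\!\left(0,\; g\,\sigma^2 (X_j^{T}X_j)^{-1}\right), \qquad g \sim \text{Inverse-Gamma}\!\left(\tfrac{1}{2}, \tfrac{n}{2}\right),
\]
so that, marginally over g, β_j has multivariate Cauchy tails, which avoids the information paradox associated with a fixed g.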

There has also been growing interest in model selection procedures for normal linear models when the number of candidate predictors (p) increases with the sample size (n). Such increases occur in a wide variety of applications, such as in nonparametric regression when the number of candidate kernels or basis functions depends on n. Shao (1997) analyzed the consistency of several frequentist and Bayesian approximation criteria for model selection in normal linear models with increasing model dimensions, assuming the true model to be the submodel minimizing the average squared prediction error. Moreno et al. (2010) examined consistency of Bayes factors and the BIC under intrinsic priors for nested normal linear models when the dimension of the parameter space increases with the sample size. Jiang (2007) considered Bayesian variable selection in generalized linear models in high-dimensional settings and provided conditions to obtain near-optimal rates of convergence in estimating the conditional predictive distribution, but did not consider asymptotic properties in selecting the important predictors.

To our knowledge, this literature has focused entirely on parametric models, with a particular emphasis on normal linear regression. Such a parametric assumption on the residual error is rather stringent and may not hold in practice, thus invalidating the assumption that the true model belongs to the class of models under comparison and potentially leading to inconsistent Bayes factors. In Section 5, simulations illustrate that when residuals are generated from a bimodal distribution, Bayesian variable selection under a Gaussian linear regression model tends to have poor performance. With this motivation, our focus is on developing Bayes variable selection methods that do not require Gaussian residuals and that can be shown to be consistent.

There is a limited literature on variable selection in Bayesian regression models having unknown residual distributions. Kuo and Mallick (1997) consider an accelerated failure time model for time-to-event data containing a linear regression component and a mixture of Dirichlet processes for the residual density. To perform variable selection, they add indicator variables to the regression function and implement an MCMC algorithm. Also in the survival analysis setting, Dunson and Herring (2005) proposed a Bayesian approach for selecting predictors in a semiparametric hazards model that allows uncertainty in whether predictors enter in a multiplicative or additive manner. Kim, Tadesse and Vannucci (2006) instead define a Bayesian variable selection approach that uses a Dirichlet process to define clusters in the data, while updating the variable inclusion indicators using a Metropolis scheme. Mostofi and Behboodian (2007) model a symmetric and unimodal residual density using a Dirichlet process scale mixture of uniforms while conducting Bayesian variable selection. Chung and Dunson (2009) modeled the conditional response density given predictors using a flexible probit stick-breaking mixture of Gaussian linear models, allowing variable selection via a Bayesian stochastic search method.

These articles focused on defining methodology and computational algorithms, but without study of theoretical properties, such as consistency. In fact, to our knowledge, there has been no previous work on consistent Bayesian variable selection in semi-parametric models, though there is recent work on consistent non-parametric Bayesian model selection (Ghosal, Lember and van der Vaart, 2008 among others). It is not straightforward to apply such theory directly to the problem of variable selection in semiparametric linear models.

With this motivation, we define a practical, useful and general methodology for Bayesian variable selection in semiparametric linear models, while providing basic theoretical support by showing Bayes factor and variable selection consistency. We accomplish this by generalizing the methods and asymptotic theory for mixtures of g-priors to linear regression models with an unknown residual density characterized via a Dirichlet process (DP) location mixture of Gaussians. We propose a new class of mixtures of semiparametric g-priors, which results in consistent Bayesian variable selection even when there are many more candidate predictors (p) than samples (n), as long as the prior assigns probability zero to models having n or more predictors. Additionally, posterior computation for the proposed method is straightforward via an SSVS algorithm.

Section 2 develops the proposed framework. Section 3 considers asymptotic properties. Section 4 outlines algorithms for posterior computation. Section 5 contains a simulation study. Section 6 applies the approach to a type 2 diabetes data example, and Section 7 discusses the results.


2. MIXTURES OF SEMIPARAMETRIC g-PRIORS
2.1  MODEL FORMULATION
In this section, we propose a new class of priors for Bayesian variable selection in linear regression models with an unknown residual density characterized via a Dirichlet process (DP) location mixture of Gaussians. In particular, let

\[
y_i = x_{\gamma i}^{T} \beta_{\gamma} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} f, \qquad f(\epsilon) = \int N(\epsilon;\, \mu, \sigma^2)\, dP(\mu), \qquad P \sim \mathrm{DP}(\alpha P_0), \qquad i = 1, \ldots, n, \tag{2}
\]

where γ = (γ_1, …, γ_p)' is a vector of variable inclusion indicators, with γ_j = I(jth predictor is included in the model) and β_γ = {β_j : γ_j = 1}, X_γ does not include an intercept, and f is a density with respect to Lebesgue measure on ℝ. For simplicity, we model f as having an unknown mean instead of including an intercept as in (1). The number of candidate predictors p may or may not increase with the sample size n. We can address the prior uncertainty in subset selection by placing a prior on γ, while the prior on β_γ characterizes prior knowledge of the size of the coefficients for the selected predictors.

The DP mixture prior on the residual density f induces clustering of the subjects into groups, with each group having a distinct intercept in the linear regression model. Let A denote an n × k allocation matrix, with A_{ih} = 1 if the ith subject is allocated to the hth cluster and 0 otherwise. The hth column of A then sums to n_h, the number of subjects allocated to cluster h, with Σ_h n_h = n. Following Kyung, Gill and Casella (2009), conditionally on the allocation matrix A, (2) can be represented as the linear model

\[
Y = A\mu + X_{\gamma}\beta_{\gamma} + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I_n), \tag{3}
\]

where μ = (μ_1, …, μ_k)' is the vector of cluster-specific intercepts.
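As a toy illustration of the allocation matrix (an example of ours, not taken from the original text): with n = 4 subjects and k = 2 clusters, where subjects 1, 2 and 4 fall in cluster 1 and subject 3 in cluster 2,
\[
A = \begin{pmatrix} 1 & 0\\ 1 & 0\\ 0 & 1\\ 1 & 0 \end{pmatrix},
\qquad
A^{T} 1_4 = (n_1, n_2)^{T} = (3, 1)^{T},
\qquad
AA^{T} = \begin{pmatrix} 1 & 1 & 0 & 1\\ 1 & 1 & 0 & 1\\ 0 & 0 & 1 & 0\\ 1 & 1 & 0 & 1 \end{pmatrix},
\]
so that the column sums of A recover the cluster sizes and (AA^T)_{ij} = 1 exactly when subjects i and j share a cluster.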

In keeping with the mixtures of g-priors literature, we would like the prior on the regression coefficients to retain the essential elements of Zellner's g-prior, but at the same time to be suitably adapted to reflect the semiparametric nature of the model in question; more specifically, the clustering of responses induced by the DP kernel mixture prior. To this effect, we propose a mixture of semiparametric g-priors, constructed by scaling the covariance matrix in Zellner's g-prior to reflect the clustering phenomenon as follows:

\[
\beta_{\gamma} \mid g, \sigma^2, A \;\sim\; N\!\left(0,\; g\,\sigma^2\,\{X_{\gamma}^{T}(I_n + AA^{T})^{-1}X_{\gamma}\}^{-1}\right), \qquad g \sim \pi(g). \tag{4}
\]

Prior (4) inherits the advantages of the traditional mixtures of g-priors, including computational efficiency in computing marginal likelihoods (conditional on A) and robustness to mis-specification of g. In addition, the prior can be interpreted as having arisen from the analysis of a conceptual sample generated using the scaled design matrix Σ_A^{-1/2} X_γ, with Σ_A = I_n + AA^T, reflecting the clustering induced by the DP kernel mixture prior. Moreover, the proposed prior leads to Bayes factor and variable selection consistency in semiparametric linear models (2), as highlighted in the sequel.

Note that since AA^T is positive semi-definite for any allocation matrix A, the prior variance of Y conditional on (A, g) is higher under the semiparametric prior than under the traditional g-prior. To assess the influence of A on the prior for β_γ, we carried out simulations which revealed that, for fixed (g, σ²), the prior variances of the components of β_γ increase while their covariances decrease as the number of underlying clusters in the data increases. This suggests that as the number of clusters increases, the components of β_γ are likely to be more dispersed, with decreasing association between each other.
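A minimal sketch of the kind of check described above, assuming the semiparametric g-prior covariance gσ²{X'(I_n + AA')⁻¹X}⁻¹ as in (4); the function names and the random allocations below are illustrative and not part of the paper:

```python
import numpy as np

def allocation_matrix(labels, k):
    """n x k allocation matrix: A[i, h] = 1 if subject i belongs to cluster h."""
    A = np.zeros((len(labels), k))
    A[np.arange(len(labels)), labels] = 1.0
    return A

def semiparametric_gprior_cov(X, labels, k, g=1.0, sigma2=1.0):
    """Prior covariance g * sigma2 * {X' (I + AA')^{-1} X}^{-1} of beta_gamma."""
    n = X.shape[0]
    A = allocation_matrix(labels, k)
    Sigma_A = np.eye(n) + A @ A.T
    return g * sigma2 * np.linalg.inv(X.T @ np.linalg.solve(Sigma_A, X))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
for k in (1, 5, 20):                       # increasing number of clusters
    labels = rng.integers(0, k, size=200)  # a random allocation into k clusters
    V = semiparametric_gprior_cov(X, labels, k)
    print(k, np.round(np.diag(V), 3), round(V[0, 1], 3))
```

Running this for increasing k lets one compare the implied prior variances and covariances of the components of β_γ across allocations with different numbers of clusters.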


2.2  Bayes Factor in Semiparametric Linear Models
Throughout the rest of the paper, we will assume that the data Y = (y_1, …, y_n)' are generated from the true model, with residuals i.i.d. from the true residual density, which is a density on ℝ with respect to Lebesgue measure. For modeling purposes, we put a DP location mixture of Gaussians prior on the unknown residual density. For pairwise comparison, we evaluate the evidence in favor of M_{γ1} compared to M_{γ2} using the Bayes factor, where

(5)

where γ_j indexes a model of dimension p_{jn} in the model space (j = 1, 2) and the prior on β_{γj} is defined in (4). Our prior specification philosophy is similar to the one adopted by Guo and Speckman (2009) for normal linear models, in that we assign proper priors to all elements of both coefficient vectors conditional on g, and an improper prior on σ² (for a more objective assessment). However, unlike Guo and Speckman (2009), our focus is on Bayesian variable selection in semiparametric linear models.

Note that the likelihood of the response after marginalizing out μ in (3) turns out to be multivariate normal (Kyung et al., 2009). Thus, conditional on A and γ, we are in the normal linear model set-up:

\[
Y \mid A, \gamma, \beta_{\gamma}, \sigma^2 \;\sim\; N_n\!\left(X_{\gamma}\beta_{\gamma},\; \sigma^2 \Sigma_A\right), \tag{6}
\]

where Σ_A = I_n + AA^T. Under a mixture of semiparametric g-priors, we can directly use expression (17) in Guo and Speckman (2009) to obtain, for j = 1, 2,

(7)

where the matrix appearing above plays the role of the hat matrix in standard linear regression, based on the scaled design matrix Σ_A^{-1/2} X_{γj}. Also, marginalizing over all possible subcluster allocations for a given sample size n, the following form of the marginal likelihood can be obtained (Kyung et al., 2009):

(8)

where the first collection consists of all possible n × k matrices corresponding to different allocations of the n subjects into k subclusters, and the second is the union over k of these collections, i.e. all possible allocation matrices for a sample of size n. In the limiting case as n → ∞, we obtain the class of 'limiting allocation matrices'. Using (7), the Bayes factor in favor of M_{γ1} conditional on the allocation matrix A is given by

(9)

where the conditional marginal likelihoods for j = 1, 2 are given by (7). Finally, using (8), the unconditional Bayes factor in favor of M_{γ1}, marginalizing out A, is

(10)

3.  ASYMPTOTIC PROPERTIES
In this section we focus on asymptotic properties including Bayes factor and variable selection consistency. Before proceeding, note that the standard assumptions made for establishing Bayes factor consistency in linear models (1) are:
(A1) under ,
(A2) If , , under ,
where H_j denotes the hat matrix for model M_j. A necessary condition for assumption (A1) to hold is that X_j has full rank, which is likely to be satisfied for fixed model dimensions but cannot be guaranteed for increasing model dimensions without further assumptions. Conditional on the limiting allocation matrix A, we make similar assumptions. For fixed model dimensions and conditional on A, we assume
(A1) under .
(A2): If , under .
For (j=1,2) with , conditional on A we assume (A1) and
(2): If , under .
Note that (A1)(A1) (as ), so that assumption (A1) is a stronger version of (A1). Further, for the two extreme cases when A= and A=I, (A1)(A1). To see this, note that , where is a vector containing the column means of X. This implies (plugging in X for X in (A1)), where X is the centered version of such that . On the other hand for A=I, we have . Assumptions (A2), (2) can be interpreted as a positive 'limiting distance' between two models corresponding to design matrices X and X in (3) conditional on A, after marginalizing out , i.e. . Such a 'limiting distance' () can be considered as a natural extension of the definition of distance between two normal linear models in Casella et al. (2009) and Moreno et al. (2010).

The following lemma gives the limits of the quantities appearing in the conditional Bayes factor (9) for j = 1, 2, which will be useful for establishing asymptotic properties. The proof follows directly from the fact that, conditional on the allocation matrix A, we are in the normal linear model set-up (6), together with Lemmas 1 and 2 of Guo and Speckman (2009). From here on, we make all probability statements under the model as defined in (5).

Lemma 1.

Suppose assumptions (A1), (A2) and (2) hold.
(i) If , then conditional on A, .
(ii) If , then conditional on A, .

Although the next result establishes Bayes factor consistency in semiparametric linear models (5) under a class of proper priors for g, the result can be extended to improper priors on g. For fixed model dimensions, the dimension of one model can be greater or less than that of the other, while for increasing model dimensions we compare models whose dimensions diverge at a rate slower than n, which involves the special case of a fixed dimension for one model but an increasing dimension for the other. As elaborated in Guo and Speckman (2009), the class of priors considered here includes the hyper-g and hyper-g/n priors (Liang et al., 2008), as well as the Zellner-Siow and beta-prime priors (the hyper-g form is displayed after assumption (A4) below). Let the notation used below indicate that the corresponding relation holds almost surely. We assume the following conditions on π(g):
(A3): There exists a constant k such that for any constant and any sequence .
(A4): There exists a constant k such that k-( k and .
Assumption (A3) ensures that the prior mass in the tail of π(g) decreases exponentially fast, which is a weak and quite reasonable condition.
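For reference, the hyper-g and hyper-g/n priors of Liang et al. (2008) mentioned above have the standard densities (for a fixed hyperparameter a > 2)
\[
\pi(g) = \frac{a-2}{2}\,(1+g)^{-a/2}, \qquad g > 0,
\qquad \text{and} \qquad
\pi(g) = \frac{a-2}{2n}\,\Big(1 + \frac{g}{n}\Big)^{-a/2}, \qquad g > 0,
\]
respectively; both are proper densities on (0, ∞).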

Theorem I.

Suppose assumptions (A1), (A2) and (2) hold .
(I) Suppose and are fixed. If , then under and assumptions (A3), (A4), BF as n and if , BF as n. Further, if , then under and assumption (A3), BF as n.
(II) Suppose and , with . Then under and assumption (A3), BF as n.

REMARK 1. The above result can easily be extended to improper priors on g.

In settings in which there are not two but many models under consideration, it is often of interest to examine whether the posterior probability of the true model goes to 1 as n → ∞. The next theorem gives such a result, making use of a sequence of prior model probabilities depending on n and assuming that the growth rate of the model dimension is known, for increasing model dimensions.

Theorem II.

Suppose the conditions of Theorem I hold. For fixed and under , P() 1 for any . For increasing () and under , P() 1 for , where denotes the lth model having predictors, , with .

REMARK 2. The mode of convergence of P() under is the same as that of the associated conditional Bayes factors.


4.  POSTERIOR COMPUTATION
We propose an MCMC algorithm for posterior computation, which combines a stochastic search variable selection algorithm (George and McCulloch, 1997) with recently proposed methods for efficient computation in Dirichlet process mixture models. In particular, we utilize the slice sampler of Walker (2007), incorporating the modification of Yau et al. (2011). Using Sethuraman's (1994) stick-breaking representation, let
\[
f(\epsilon) = \sum_{h=1}^{\infty} \pi_h\, N(\epsilon;\, \mu_h, \sigma^2), \qquad \pi_h = V_h \prod_{l<h}(1 - V_l), \qquad V_h \sim \mathrm{Beta}(1, \alpha), \qquad \mu_h \sim P_0 .
\]
The slice sampler of Walker (2007) relies on augmentation with uniform latent variables u_i as follows:
\[
f(\epsilon_i, u_i) = \sum_{h=1}^{\infty} 1(u_i < \pi_h)\, N(\epsilon_i;\, \mu_h, \sigma^2),
\]
so that, given u_i, only the finitely many components with π_h > u_i need to be visited.
For sampling the DP precision parameter α, we specify a hierarchical hyperprior, as such a specification is likely to ensure better performance by increasing the support of the prior. Further, we assume equal prior inclusion probabilities for all predictors, i.e. P(γ_k = 1) is the same for each k = 1, …, p. We outline the posterior computation steps briefly below (an illustrative sketch of the DP-mixture residual updates appears after the step list):
Step 1.1: Update the stick-breaking weights after marginalizing out the augmented uniform variables.
Step 1.2: Update the augmented uniform variables from their full conditionals as described in Walker (2007).
Step 2: Update the allocation of atoms to the different subjects from the corresponding full conditionals, h = 1, …, M.
Step 3: Update the precision parameter of the DP from its full conditional, where M is the number of clusters in the current iteration.
Step 4: Letting p_γ denote the dimension of the current model, update the corresponding regression parameters from their full conditional.
Step 5: Using the hyper-g prior and the fact that the resulting full conditional of g is nonstandard, we can subsequently adopt the griddy Gibbs approach (Ritter and Tanner, 1992) to update g.
Step 6: For variable selection, we update the γ_j's one at a time by computing their posterior inclusion probabilities after marginalizing out β_γ and σ², conditional on the inclusion indicators for the remaining predictors as well as g and A. Denoting by γ_{-j} the vector of inclusion indicators for the remaining predictors and by p_γ the number of included predictors, we can sample γ_j from its Bernoulli conditional posterior distribution, with inclusion probability obtained from the corresponding conditional marginal likelihoods.

Step 7: Set and update using ,
where and .
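As a rough illustration of Steps 1-3, the following is a minimal sketch (in Python) of one sweep of a generic Walker-type slice sampler for a DP location mixture of Gaussians applied to the current residuals. It assumes a fixed truncation level and a normal base measure N(mu0, tau2); all function and variable names are ours for illustration, and it omits the Yau et al. (2011) marginalized weight update used in Step 1.1.

```python
import numpy as np

def dp_mixture_slice_update(r, c, mu, V, alpha, sigma2, mu0=0.0, tau2=10.0, rng=None):
    """One Gibbs sweep of a (truncated) Walker-type slice sampler for a DP
    location mixture of Gaussians N(mu_h, sigma2) fitted to residuals r."""
    rng = np.random.default_rng() if rng is None else rng
    n, H = len(r), len(mu)

    # Stick-breaking weights pi_h = V_h * prod_{l<h} (1 - V_l).
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

    # Slice variables u_i ~ Uniform(0, pi_{c_i}).
    u = rng.uniform(0.0, pi[c])

    # Reallocate each subject among the components with pi_h > u_i.
    for i in range(n):
        active = np.where(pi > u[i])[0]
        logw = -0.5 * (r[i] - mu[active]) ** 2 / sigma2
        w = np.exp(logw - logw.max())
        c[i] = rng.choice(active, p=w / w.sum())

    # Update sticks V_h | c ~ Beta(1 + n_h, alpha + sum_{l>h} n_l).
    counts = np.bincount(c, minlength=H)
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0]))
    V = rng.beta(1.0 + counts, alpha + tail)

    # Update atoms mu_h from conjugate normal full conditionals (base N(mu0, tau2)).
    for h in range(H):
        prec = counts[h] / sigma2 + 1.0 / tau2
        mean = (r[c == h].sum() / sigma2 + mu0 / tau2) / prec
        mu[h] = rng.normal(mean, np.sqrt(1.0 / prec))

    return c, mu, V
```

A full sampler would alternate this update with Steps 4-7 (updating the regression coefficients, g, the inclusion indicators and the residual variance) at every iteration.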


5.  SIMULATION STUDY
We present the results of two simulation studies to demonstrate the utility of Bayesian variable selection in semiparametric linear models. For the first case (Case I), the truth was generated from a linear model involving ten predictors with coefficients (3, 2, -1, 0, 1.5, 1, 0, -4, -1.5, 0) and a bimodal residual density specified by 0.5 N(2.5, 1) + 0.5 N(-2.5, 1). For the second case (Case II), the truth was generated from a normal linear model with the same set of regression coefficients and intercept equal to 1. The covariates were generated independently from a uniform(-1, 1) distribution. For each case, we generated 20 replicates for each of the sample sizes 100, 200, 300, 400 and 500, and summarized the results across the replicates (a sketch of this data-generating scheme is shown below).
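The stated design can be reproduced with a short sketch such as the following (the residual variance for Case II is not given in the text, so unit variance is assumed; the function name is illustrative):

```python
import numpy as np

beta = np.array([3, 2, -1, 0, 1.5, 1, 0, -4, -1.5, 0])

def simulate(n, case, rng):
    """Generate one replicate following the simulation design of Section 5."""
    X = rng.uniform(-1, 1, size=(n, 10))          # covariates iid uniform(-1, 1)
    if case == "I":
        # Bimodal residuals: 0.5 N(2.5, 1) + 0.5 N(-2.5, 1).
        comp = rng.integers(0, 2, size=n)
        eps = rng.normal(np.where(comp == 1, 2.5, -2.5), 1.0)
        return X, X @ beta + eps
    # Case II: normal linear model with intercept 1 (unit residual variance assumed).
    return X, 1.0 + X @ beta + rng.normal(0.0, 1.0, size=n)

rng = np.random.default_rng(1)
X, y = simulate(100, "I", rng)
```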

After generating the data in such a manner, we compared the performance of our method, summarized through the marginal inclusion probability of each predictor (given by P(γ_j = 1 | Y), j = 1, …, 10), with that of the normal linear model under a conventional mixture of g-priors. This prior on the regression coefficients is a special case of the SLM prior corresponding to a single-cluster allocation, and is an attempt to assign comparable prior information to both methods. The replicate-averaged marginal inclusion probabilities of each predictor are reported across the different sample sizes. We used the hyperprior of Section 4 for the DP precision parameter. Further, we chose a prior for g corresponding to a = 4 in the hyper-g prior. For the griddy Gibbs approach, we used a grid of 1,000 equally spaced quantiles of the prior distribution of g. We made 50,000 MCMC iterations with a burn-in of 5,000.

As the sample size increases, it is interesting to examine how the marginal probabilities of inclusion for important predictors and the marginal probabilities of exclusion for unimportant predictors change. The marginal inclusion probabilities under both methods were 1.00 for the predictors with the largest coefficients, at all sample sizes. For the remaining predictors, plots of the marginal inclusion probabilities over the different sample sizes are presented (Figures 1 and 2) as a comparison between the two methods. These plots depict a faster rate of increase in the marginal inclusion probabilities of the important predictors for the semiparametric Bayes method when the true residuals are non-Gaussian, and a similar rate of increase under both methods when the true residuals are Gaussian. In contrast, for the unimportant predictors the exclusion probabilities converge to one slowly for both methods, reflecting the well-known tendency for slower accumulation of evidence in favor of the true null.

To take a closer look when the true residual density is non-Gaussian (Case I), we present the results for sample size 100. As a comparison, we also present regression estimates under the lasso (Tibshirani, 1996) and the elastic net (Zou and Hastie, 2005), using the GLMNET package in R with default settings. The average mean squared error for out-of-sample prediction, based on a test sample of size 25, was 7.7 under the semiparametric linear model compared to 15.4 under the normal linear model, implying a 50% reduction. The average out-of-sample MSE was 7.68 for the lasso (L1) and 7.65 for the elastic net (EL). Out of the 20 replicates generated with sample size 100, the normal linear model (NLM) chose the wrong subset of predictors 16 times under the median probability model, whereas the semiparametric linear model (SLM) made incorrect variable selection decisions for 3 out of 20 replicates. The computation time per iteration for the SLM was marginally slower than for the NLM, with the difference increasing as the number of clusters increases. Mixing for the fixed effects was good under both methods. The results for the SLM were largely insensitive to the hyperparameter choices, with only mild sensitivity to some of them for n = 100.

Table 1 summarizes results for the model-averaged regression estimates, including 95% pointwise credible intervals (C.I.) and marginal inclusion probabilities (MIP). The SLM and the NLM correctly identify the important as well as the unimportant predictors. In general, the L1 and EL results appear unstable, with the coefficients shrunk to 0 varying over different replicates. As a result, the replicate-averaged estimates for L1 and EL in Table 1 are inaccurate for some of the coefficients. For the estimation of the fixed effects, the MSE around the true values was 0.015 for the SLM, 0.084 for the NLM, 0.047 for the elastic net and 0.047 for the lasso. Thus, the SLM results in more accurate estimates with narrower credible intervals. From the results, it is clear that when the true residual density is non-Gaussian, the SLM not only has more desirable performance in variable selection and in estimation of the regression coefficients, but also superior out-of-sample predictive performance compared to the NLM.

6.  APPLICATION TO DIABETES DATA
The prevalence of diabetes in the United States is expected to more than double, to 48 million people, by 2050 (Mokdad et al., 2001). Previous medical studies have suggested that Diabetes Mellitus type II (DM II), or adult onset diabetes, could be associated with high levels of total cholesterol (Brunham et al., 2007) and obesity (often characterized by BMI and waist-to-hip ratio) (Schmidt et al., 1992), as well as hypertension (indicated by a high systolic or diastolic blood pressure or both), which is twice as prevalent in diabetics compared to non-diabetic individuals (Epstein and Sowers, 1992). However, most of these results rely on informal treatment of data and lack rigorous statistical analysis to support their conclusions.

We develop a comprehensive variable selection strategy for indicators of DM II based on data obtained from the Department of Biostatistics, Vanderbilt University, website, involving a diabetes study of African-Americans. Our primary focus is to discover important indicators of DM II by modeling the continuous outcome, glycosylated hemoglobin (values above a clinical threshold indicate a positive diagnosis of diabetes), based on predictors such as total cholesterol (TC), stabilized glucose (SG), high density lipoprotein (HDL), age, gender, body mass index (BMI) indicators (overweight and obese, with normal as baseline), systolic blood pressure (SBP), diastolic blood pressure (DBP), waist-to-hip ratio (WHR) and a postprandial time indicator (PPT) (0/1 depending on whether the blood was drawn within 2 hours of a meal). In addition to the factors already noted above (total cholesterol, obesity and hypertension) for DM II, we note that lower levels of HDL are known to be associated with insulin resistance syndrome (often considered a precursor of DM II, with a conversion rate around 30%), and we also expect PPT to be a significant indicator, as blood sugar levels remain high up to 2 hours after a meal.

After removing records containing missing values, the data consisted of 365 subjects, which were split into multiple training and test samples of sizes 330 and 35, respectively. The replicate-averaged fixed effects estimates (multiplied by 100) for the SLM, NLM, lasso (L1) and elastic net (EL) are presented in Table 2, along with the marginal inclusion probabilities (MIP) for the SLM and the NLM. We also evaluate the out-of-sample predictive performance for each training-test split using predictive MSE for the SLM, NLM, L1 and EL in Table 3, and additionally provide the coverage (COV) and width (CIW) of 95% pointwise credible intervals for the SLM and NLM. The same hyperparameter values were used as in Section 5 for the SLM and NLM. For each replicate, we randomized the initial starting points and made 100,000 MCMC runs for the SLM (burn-in = 20,000) and 50,000 runs for the NLM (burn-in = 5,000).

It is interesting to note from Table 2 that the SLM tells a quite different story compared to the NLM in terms of variable selection. In particular, while both models successfully identify total cholesterol, stabilized glucose and postprandial time as important predictors, only the SLM identifies systolic hypertension (MIP = 0.77) and waist-to-hip ratio (MIP = 0.98) as important positively associated indicators, whereas the NLM fails to identify these factors (MIP = 0.18 for SBP and 0.17 for WHR) and instead includes age (MIP = 0.77) as an important predictor. Further, the SLM points to a more significant negative association with HDL (MIP = 0.64) than the NLM (MIP = 0.52). For both methods, the marginal inclusion probabilities for BMI (overweight and obese) were low, which could potentially be attributed to adjusting for other factors such as waist-to-hip ratio. The lasso and elastic net include all predictors except DBP, and hence produce an overly complex model.

Variable selection in this application is clearly influenced by the assumptions on the residual density, with the nonparametric residual density providing a more realistic characterization that should lead to a more accurate selection of the important predictors. Figure 3 shows an estimate of the residual density obtained from the SLM analysis, suggesting a unimodal, right-skewed density with a heavy right tail. The SLM results suggest that a mixture of two Gaussians provides an adequate characterization of this density. The computation time for the SLM is only marginally slower than for the NLM, and in addition the SLM exhibits good mixing for most of the fixed effects (Table 4). These results are robust to the SSVS starting points, and consistency of the results across training-test splits also indirectly suggests adequate computational efficiency of the SSVS algorithm.

In terms of out-of-sample predictive MSE (Table 3), none of the models is a clear winner, with the relative performance varying across training-test splits. The MSEs for the lasso and elastic net are very similar to those for the NLM, except for the second test sample where they have the lowest MSE. Overall, the NLM has narrower 95% pointwise credible intervals than the SLM, often resulting in poorer coverage. Thus, in conclusion, although the competitors yield comparable out-of-sample predictive performance, only the SLM succeeds in choosing the most reasonable model for DM II, consistent with previous medical evidence.


7.  DISCUSSION
We develop mixtures of semiparametric g-priors for linear models with nonparametric residual densities characterized by DP location mixtures of Gaussians. The proposed method addresses the often encountered issue of non-Gaussianity of residuals in variable selection settings, and has attractive asymptotic justifications, namely Bayes factor and variable selection consistency for fixed as well as increasing numbers of candidate predictors (under some restrictions on the model space). Further, the method is essentially no more difficult to implement than SSVS for normal linear models and can lead to substantially different conclusions, as illustrated in the diabetes application. The general topic of semi- and nonparametric Bayesian model selection is understudied, and we hope that this work stimulates additional research of this type in broader model classes, such as generalized linear models and nonparametric regression.

ACKNOWLEDGEMENTS
This work was supported by Award Number R01ES017240 from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health.

APPENDIX: PROOF OF RESULTS
Proof of Theorem I: Using methods similar to those in the proof of Theorem 2 in Guo and Speckman (2009), it can be shown that, conditional on A and under assumptions (A3) and (A4), the upper and lower bounds of the conditional marginal likelihood are

and . Similarly,

Therefore, BF

(11)

Case (I): For fixed model dimensions, directly from the proof of Theorem 3 in Guo and Speckman (2009),

(12)
(13)

In the limiting sense as , the maximum in the upper bound in (13) is computed over A. From (12), under for all A implies max. Dividing both sides of (13) by L(), this implies BF0 under . Further, the mode of convergence of BF is the same as BF, and the rest follows from the proof of Theorem 3 in Guo and Speckman (2009).

Case (II): For increasing model dimensions and with , for we will only assume (A3) so that k. We have using (11)

(14)

Let us consider the following cases under .
Case C1: . We have , j=1,2, and . Using Lemma 1 of Guo and Speckman (2009),