Modular Bayes screening for high-dimensional predictors

Abstract

With the routine collection of massive-dimensional predictors in many application areas, screening methods that rapidly identify a small subset of promising predictors have become commonplace. We propose a new MOdular Bayes Screening (MOBS) approach, which involves several novel characteristics that can potentially lead to improved performance. MOBS first applies a Bayesian mixture model to the marginal distribution of the response, obtaining posterior samples of mixture weights, cluster-specific parameters, and cluster allocations for each subject. Hypothesis tests are then introduced, corresponding to whether or not to include a given predictor, with posterior probabilities for each hypothesis available analytically conditional on unknowns sampled in the first stage and tuning parameters controlling borrowing of information across tests. By marginalizing over the first stage posterior samples, we avoid the under-estimation of uncertainty typical of two-stage methods. We greatly simplify the model specification and reduce computational complexity by using modularization. We provide basic theoretical support for this approach, and illustrate excellent performance relative to competitors in simulation studies, including the ability to capture complex shifts beyond simple differences in means. The method is illustrated with an application to genomics using a very high-dimensional cis-eQTL dataset with roughly 38 million SNPs.

Keywords: Genomics; High-dimensional; Independent screening; Large p small n; Mixture model; Modularization; Nonparametric Bayes; Variable selection.

1 Introduction

In modern scientific research, it has become routine to collect massive-dimensional data for each study subject, leading to a large p, small n statistical problem. In most scientific studies collecting high-dimensional data, the over-arching interest is not in developing a black-box model for prediction; instead, the focus is on variable selection. In particular, scientists would like to select the subset of features, out of a high-dimensional set of candidates x_i = (x_i1, ..., x_ip), that are predictive of a response y_i, with i indexing the subject under study. There is a vast recent literature on high-dimensional feature/variable selection, with the overwhelming focus on one of two strategies. The first is to use a penalized optimization method, such as the Lasso ([26]), SCAD ([5]), the Dantzig selector ([2]) or other variants, to obtain a sparse estimate of the regression coefficient vector. In general, such an approach has adequate performance only when the feature matrix satisfies a number of stringent conditions, which seldom hold in scientific applications even approximately. When these assumptions are violated, performance can degrade badly.

This fact has led to an emphasis on the second strategy, which is to first apply an independent screening algorithm using measures of association between y_i and x_ij separately for j = 1, ..., p, and then select features by thresholding the corresponding p-values, test statistics or association estimates, maintaining a superset containing most of the important features along with a number of false positives. This strategy was proposed by [6] using marginal correlation for sure independence screening (SIS). [7] extended the idea to generalized linear models and [8] considered nonparametric additive models. [28] introduced model-free screening via sure independent ranking and screening (SIRS). Other model-free procedures include distance correlation screening (DCS) in [16] and the fused Kolmogorov filter in [20] and [21].

The focus of this article is on improving high-dimensional screening methodology by using Bayesian nonparametric and hierarchical models to avoid parametric assumptions on the data generating model, while borrowing information across the many tests being conducted. A potentially key disadvantage of most existing screening methods is the lack of any borrowing of information across the different tests. A novel idea we propose, which should have impact beyond the screening case, is to place a nonparametric Bayes model on the marginal distribution of the response, with inclusion tests for each predictor then defined conditionally on unknowns in the marginal distribution. This approach has the dual advantages of avoiding parametric assumptions on the distribution of the response, while also automatically borrowing information across the screening tests through their shared conditioning on a common set of unknowns. In addition to this novel type of borrowing, we place a more conventional Bayesian hierarchical structure on the probability of variable inclusion, related to [24]. A key advantage of the proposed formulation is that we can characterize uncertainty in posterior computation for the marginal response distribution by integrating conditional posterior probabilities of variable inclusion across first stage posterior samples. This is computationally scalable and avoids the under-estimation of uncertainty typical of two-stage procedures.

One caveat about our proposed approach is that it is not a coherent fully Bayes probability model, in that we ignore certain dependencies for simplicity, robustness and computational tractability. In particular, to define a fully Bayes model for the marginal distribution of the response in settings involving predictors x_i, we would require a model for the conditional distribution of the response given all the predictors, along with a model for the joint distribution of the predictors. In very large p settings, computational and complexity considerations force a focus on relatively simple settings involving strong constraints (e.g., Gaussian, linear, highly sparse, etc.). In most settings, such constraints are inconsistent with available prior information, and results are therefore highly questionable. Instead, we take the milder approach of ignoring information on x_i in defining the marginal distribution of y_i. This is a type of modularization ([17]). The idea of modularization is that Bayesian models can be defined in modules, with posterior computation in certain modules not taking into account the model structure and data in other modules. In our case, the model for the marginal density is one module, and we do not attempt to take into account information about the predictors in fitting this module. We suspect that closely related ideas may be a game changer in making (approximately) Bayesian approaches practical in modern massive-dimensional data settings.

The existing literature on nonparametric Bayes variable selection is sparse. There are several methods for two-group comparisons ([4]), but such methods are computationally intensive for each comparison and would need to be applied separately p times. There are also several approaches designed for variable selection in Bayesian nonparametric models for conditional response distributions. For example, [3] proposed a stochastic search variable selection algorithm under a probit stick-breaking mixture model. [13] and [27] used tensor factorizations to characterize the conditional distribution in order to capture complex interactions between the predictors. [12] introduced a nonparametric test using a nonparametric Bayesian sliced inverse regression based on modeling the conditional distribution of a covariate given the discretized response. Their method is less computationally intensive than previous attempts, but is restricted to univariate responses. Finally, [9] tested for pairwise dependence between two variables using Dirichlet process mixtures, improving efficiency by running MCMC for each marginal in parallel.

In Section 2 we propose the general MOdular Bayes Screening (MOBS) framework, provide basic theoretical support, and sketch a general approach to computation. In Section 3 we focus on the case in which the response is continuous and univariate, motivating a location-scale mixture of Gaussians for the marginal distribution. Section 4 contains a simulation study assessing operating characteristics relative to a variety of high-dimensional screening methods. Section 5 applies the screening method to a massive-dimensional cis-eQTL dataset with approximately 38 million SNPs. Section 6 contains a discussion. Technical details are included in an Appendix.

2 Modular Bayes screening framework

2.1 General framework

MOdular Bayes Screening (MOBS) uses a two-stage set-up and starts with a mixture model for the marginal density of the response. For subjects i = 1, ..., n, let y_i denote the response, which can be either univariate or multivariate. Let x_i = (x_i1, ..., x_ip) denote a high-dimensional vector of categorical predictors for subject i, with x_ij having d_j levels. Suppose that marginally we have y_i ~ f independently for i = 1, ..., n; we refer to f as the baseline density. Treating this baseline density as unknown, we apply a finite over-fitted mixture model ([23]):
f(y) = Σ_{h=1}^{k} ν_h K(y; θ_h),   θ_h ~ P_0,  h = 1, ..., k,   (Equation 1)
where K is the kernel density, ν = (ν_1, ..., ν_k) are the weights, θ_1, ..., θ_k the kernel parameters, P_0 is the prior distribution for each θ_h, and k is a conservative upper bound on the number of mixture components needed to produce an accurate approximation of the unknown density. Let z = (z_1, ..., z_n) denote a latent variable vector, with z_i ∈ {1, ..., k} being the cluster membership of subject i and Pr(z_i = h) = ν_h. As k increases, (Equation 1) converges weakly to a Dirichlet process mixture model. This represents a common approximation to the Dirichlet process mixture model ([11]), facilitating practical implementation.
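To make the baseline module concrete, the following sketch draws data from an over-fitted Gaussian mixture of the kind described above. The specific choices here (Dirichlet concentration split as α/k across components, a normal prior on the component means, unit component variances) are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_overfitted_mixture(n, k=10, alpha=1.0):
    """Draw n observations from an over-fitted Gaussian mixture.

    Hypothetical hyperparameters: weights ~ Dirichlet(alpha/k, ..., alpha/k),
    component means ~ N(0, 2^2), component standard deviations fixed at 1.
    """
    nu = rng.dirichlet(np.full(k, alpha / k))   # mixture weights
    mu = rng.normal(0.0, 2.0, size=k)           # component means (assumed P_0)
    z = rng.choice(k, size=n, p=nu)             # latent cluster allocations
    y = rng.normal(mu[z], 1.0)                  # observations
    return y, z, nu, mu

y, z, nu, mu = sample_overfitted_mixture(500)
```

With a small concentration split across many components, most Dirichlet weights are near zero, so only a few components are effectively occupied, which is the over-fitted mixture behavior exploited by the baseline model.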

To generalize the baseline density to allow dependence on the jth predictor x_ij, we let
f(y | x_ij = l) = Σ_{h=1}^{k} ν_{jl,h} K(y; θ_{jl,h}),   l = 1, ..., d_j,   (Equation 2)
where ν_jl = (ν_{jl,1}, ..., ν_{jl,k}) and θ_jl, for l = 1, ..., d_j, represent the weights and component parameters for subjects with x_ij = l. Expression (Equation 2) modifies the baseline density (Equation 1) to allow the weights and kernel parameters to vary with x_ij. Let
H_0j: (ν_jl, θ_jl) = (ν, θ) for all l = 1, ..., d_j,   versus   H_1j: (ν_jl, θ_jl) ≠ (ν, θ) for some l,
correspond to discarding and including the jth predictor, respectively. The alternative hypothesis H_1j is composed of three cases: H_1j^(1), where ν_jl varies with l but θ remains the same; H_1j^(2), where θ_jl varies but ν is the same; and H_1j^(3), where both vary. By testing both the weights and the cluster parameters, MOBS is able to capture deeper and more complex shifts in distribution than traditional methods that focus primarily on the mean. Clearly, H_1j is the union of the three subhypotheses. Let π = (π_0, π_1, π_2, π_3) be the prior probabilities of the four subhypotheses, where π_0 is the prior probability of H_0j and π_1, π_2, π_3 are the prior probabilities of the three alternative subhypotheses, respectively. We suppose that the weight and component parameter vectors specific to each level of each predictor are distributed randomly about their baseline values according to the hierarchical model:
ν_jl ~ Dirichlet(α ν),   θ_jl,h ~ Q(· ; θ_h, τ),   (Equation 3)
where Q(· ; θ_h, τ) is a distribution centered on θ_h with scale parameter τ, and α is the Dirichlet precision controlling variability about the mean ν. Using a Dirichlet prior for ν_jl maintains conjugacy with the multinomial distribution of the component memberships and facilitates marginalizing out ν_jl when calculating the likelihood. For the same reason, Q should ideally be conjugate to the kernel density K. The precisions control how much the conditional density changes with predictor j when an alternative subhypothesis holds. The hierarchical structure favors borrowing of information in learning across different predictors and levels of x_ij.

2.2 Algorithmic details

Using (Equation 3) allows direct calculation of an analytic form for the conditional posterior probability of H_0j given (z, ν, θ), the cluster allocations, baseline weights and component parameters, and the hyperparameters controlling borrowing of information and multiplicity adjustment in hypothesis testing. The key idea in our proposed MOBS approach is to run MCMC for posterior computation only for the baseline nonparametric model for the marginal response density. We do not know the exact kernel memberships, weights or parameters, and there does not exist a tractable, analytic form for the marginal posterior probability of H_0j, so MOBS instead takes the Monte Carlo average of the conditional posterior probability over samples of (z, ν, θ) generated from the posterior distribution under the baseline model.

Let L_0j and L_1j^(l) be the likelihoods conditional on the baseline MCMC output and the hyperparameters controlling borrowing of information and multiplicity adjustment in hypothesis testing, under H_0j and the alternative subhypotheses, respectively. Then, we have

where L_1j^(l) is the conditional likelihood under H_1j^(l) for l = 1, 2, 3. The posterior probabilities of H_0j and H_1j^(l) conditional on the baseline weights and parameters are then, respectively,

where, for l = 1, 2, 3,

is the Bayes factor in favor of H_1j^(l) over H_0j, conditional on the hyperparameters and the baseline weights and parameters.
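Conditional on the baseline unknowns, the posterior null probability therefore follows directly from the prior probabilities and the three Bayes factors. A minimal sketch (function and array names are ours, not the paper's):

```python
import numpy as np

def posterior_null_prob(bf, pi):
    """Posterior probability of H_0j given the three conditional Bayes factors.

    bf : length-3 array of Bayes factors BF_l of H_1j^(l) versus H_0j
    pi : length-4 array of prior probabilities (pi_0, pi_1, pi_2, pi_3)
    """
    bf = np.asarray(bf, float)
    pi = np.asarray(pi, float)
    # Pr(H_0j | ...) = pi_0 / (pi_0 + sum_l pi_l * BF_l)
    return pi[0] / (pi[0] + np.dot(pi[1:], bf))

# Equal prior mass on the four subhypotheses, modest evidence for two alternatives:
p0 = posterior_null_prob([2.0, 1.0, 2.0], [0.25, 0.25, 0.25, 0.25])
```

When all three Bayes factors equal one, the posterior null probability reduces to the prior π_0, as it should.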

We calculate the conditional likelihoods given the baseline unknowns and hyperparameters as follows. First, under H_0j, we have the simple form:

where n_{jl,h} is the number of subjects having x_ij = l and belonging to cluster h, and n_h = Σ_l n_{jl,h} is the total number of subjects allocated to component h. For the three alternative subhypotheses,

These results imply after some calculations that

Here, BF_j^(1) provides a weight of evidence of changes in the mixture weights with x_ij, while BF_j^(2) provides the same for different kernels. BF_j^(3) provides evidence for both weights and kernels being different and, interestingly, ends up being the product of BF_j^(1) and BF_j^(2).
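By Dirichlet-multinomial conjugacy, the weight component of the evidence reduces to ratios of multivariate beta functions of the cluster counts. A sketch, computed on the log scale for numerical stability (the count layout and function names are our assumptions):

```python
import numpy as np
from scipy.special import gammaln

def log_mvbeta(a):
    """Log multivariate beta function: B(a) = prod_h Gamma(a_h) / Gamma(sum_h a_h)."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

def log_bf_weights(counts_by_level, alpha):
    """Log Bayes factor for level-specific versus shared mixture weights.

    counts_by_level : (d_j x k) array of cluster counts n_{jl,h}
    alpha           : length-k Dirichlet concentration vector
    Null: one Dirichlet-multinomial on the pooled counts n_h.
    Alternative: independent Dirichlet-multinomials, one per level l.
    """
    counts = np.asarray(counts_by_level, float)
    alpha = np.asarray(alpha, float)
    log_null = log_mvbeta(alpha + counts.sum(axis=0)) - log_mvbeta(alpha)
    log_alt = sum(log_mvbeta(alpha + c) - log_mvbeta(alpha) for c in counts)
    return log_alt - log_null
```

With a single level the pooled and level-specific counts coincide, so the log Bayes factor is exactly zero, a useful sanity check on any implementation.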

We initialize π at a default value, such as equal probabilities for the four subhypotheses. Estimating π involves iteratively taking the average of the posterior probabilities for each subhypothesis across predictors, which is effectively an empirical Bayes procedure. Our proposed MOBS approach leverages the simplicity of posterior computation for the baseline model and the simple analytic forms shown above via Algorithm 1.

3 Univariate continuous response

3.1 Computation

While the proposed method allows for complex, arbitrary responses, this section provides an example under a simple univariate, continuous setting. The baseline density is represented as a mixture of univariate location-scale Gaussian kernels:
f(y) = Σ_{h=1}^{k} ν_h N(y; μ_h, σ_h²),
where (μ_h, σ_h²) are the mean and variance parameters of the hth component, with prior P_0. Details on the Gibbs sampler for the base distribution can be found in the Appendix. The conditional density can similarly be represented as

where the level-specific parameters (μ_{jl,h}, σ_{jl,h}²) have priors centered on the baseline values (μ_h, σ_h²), with scale parameters controlling shrinkage toward the baseline.

To derive BF_j^(1), note that for each j and l,

where B(·) is the multivariate beta function. Hence

Similarly, to find BF_j^(2) for each j, first note that

and

where

It follows that

Running MOBS requires first generating samples of the cluster allocations, weights and component parameters from the posterior under the baseline mixture model. One then iteratively alternates between estimation of the conditional posterior hypothesis probabilities using (Equation 8) and (Equation 9) over the baseline samples and estimation of π as the average of the estimated posterior probabilities. The strength of MOBS lies in the ease of computing BF_j^(1) and BF_j^(2) in a trivially parallelizable manner. The overall algorithm has computational complexity linear in both n and p. The approach can be trivially modified to accommodate multivariate and discrete settings; in such cases, expression (Equation 9) will take a different form.
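The Monte Carlo averaging step above can be sketched as follows. The array layout (S baseline MCMC draws, p predictors, 3 alternative subhypotheses) is a hypothetical intermediate of ours; the paper's implementation need not store things this way.

```python
import numpy as np

def mobs_null_probs(log_bf_samples, pi):
    """Monte Carlo estimate of the posterior null probabilities.

    log_bf_samples : (S x p x 3) array of per-sample, per-predictor log
                     Bayes factors computed from S baseline MCMC draws
    pi             : length-4 prior probabilities (pi_0, pi_1, pi_2, pi_3)
    Averages the conditional null probabilities over the baseline posterior
    samples, as in the modular scheme described above.
    """
    pi = np.asarray(pi, float)
    bf = np.exp(np.asarray(log_bf_samples, float))
    # conditional Pr(H_0j | sample s) for every (s, j) pair
    cond_null = pi[0] / (pi[0] + np.tensordot(bf, pi[1:], axes=([2], [0])))
    return cond_null.mean(axis=0)   # average over the S MCMC samples

# All Bayes factors equal to one: every null probability collapses to pi_0.
probs = mobs_null_probs(np.zeros((100, 4, 3)), [0.7, 0.1, 0.1, 0.1])
```

Because each predictor's column of `log_bf_samples` is processed independently, the computation parallelizes trivially across predictors.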

3.2 Hyperparameter selection

An important question when running MOBS is the selection of hyperparameters, especially that of the precision parameters. These can be interpreted as precisions controlling the distance between the density of y_i under the null hypothesis and that under the alternative hypothesis. Naturally, higher precision values suggest a smaller distance, as the alternative density converges to the null density as the precision tends to infinity.

We aim to tune the precisions such that the prior signal-to-noise ratio is within a reasonable range, somewhere between 0.05 and 0.1. For fixed baseline weights and component parameters, the densities under the null hypothesis and under each alternative subhypothesis are

Let

be the average squared L2-norm distances between the null density and the densities under the weight and kernel alternatives, respectively. Since both distances shrink as the corresponding precisions grow, we know

While the two average distances are both intractable, we can estimate their values by averaging the squared distances over posterior samples of the baseline weights and parameters. Empirically, we find that, for a given baseline fit, a default choice of the precisions is relatively stable and has a signal-to-noise ratio between 0.05 and 0.1.
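The distance estimate for the weight precision can be approximated numerically along these lines. This is a sketch under simplifying assumptions of ours: Gaussian kernels with fixed means and standard deviations, perturbed weights drawn as Dirichlet(precision × ν), and the squared L2 distance evaluated on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_sq_l2_distance(nu0, mus, sds, precision, n_draws=200):
    """Average squared L2 distance between the baseline mixture density and
    densities whose weights are perturbed as Dirichlet(precision * nu0)."""
    grid = np.linspace(-10, 10, 2001)
    dx = grid[1] - grid[0]
    # (grid x k) matrix of Gaussian kernel evaluations
    kernels = np.exp(-0.5 * ((grid[:, None] - mus) / sds) ** 2) / (
        np.sqrt(2 * np.pi) * sds)
    f0 = kernels @ nu0                     # baseline mixture density
    d2 = 0.0
    for _ in range(n_draws):
        nu = rng.dirichlet(precision * nu0)
        d2 += dx * np.sum((kernels @ nu - f0) ** 2)
    return d2 / n_draws

d2 = avg_sq_l2_distance(np.array([0.5, 0.5]), np.array([-2.0, 2.0]),
                        np.array([1.0, 1.0]), precision=50.0)
```

One could then increase the precision until the estimated distance (and hence the implied prior signal-to-noise ratio) falls in the target range.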

3.3 Theoretical properties

We initially study asymptotic properties of MOBS treating parameters as known, and then consider sure screening consistency under general settings. We first investigate properties of the conditional posterior probabilities. The posterior odds conditioned on the baseline weights and component parameters are

Under each hypothesis, the conditional Bayes factors can be written in terms of the cluster counts defined above. Letting the true values of the weights and component parameters be fixed, the posterior probability of the true hypothesis converges to one, and the Bayes factors in favor of the incorrect hypotheses decay to zero, as n grows.

The above theorem shows rates of convergence of hypothesis probabilities and Bayes factors for the different hypotheses in the event that the true weights, parameters and cluster allocations are known. However, in practice MOBS uses posterior samples for the baseline parameters to account for uncertainty. Now consider the situation where the parameters are unknown and the model may be misspecified. In particular, we extend our above results using misspecification techniques given in [14] and used in [18]. Let F be the set of all convex combinations of k Gaussian distributions and let Π define a prior on F. Let f* be the true distribution and f̃ the closest convex combination in F to f* under Kullback-Leibler divergence. We define a neighborhood of f̃ under the measure induced by the density f*:

and define d(f, f̃) to be the weighted Hellinger distance

This result suggests that our posterior probability is consistent under H_0j, and that consistency holds under weak conditions under H_1j for each j. However, consistency can fail under H_1j when the null and alternative densities have the same closest point under K-L divergence in F. In practice, the two densities would need to be extremely close to share the same closest point, so this is a very mild condition.

Next, we use the previous theorem to establish sure screening consistency for MOBS. Following standard screening conventions ([16]), let F(y | x) denote the conditional distribution function of y given the predictors x. Define

to be the true set of relevant predictors. A good screening method can identify a small subset of predictors that contains the true set with high probability. The existing literature has focused predominantly on frequentist methods that select predictors based on rankings or thresholding using some test statistic, such as marginal correlation in SIS or distance correlation in DCS. MOBS instead relies on thresholding of the estimated posterior null probabilities for j = 1, ..., p. In particular, let

where the estimated posterior null probabilities are sorted in increasing order and d is an integer cutoff. Call the jth predictor marginally important if it is marginally related to y, and unimportant otherwise. We impose the following two conditions:

(C1) All jointly important predictors are also marginally important, and their metrics of marginal importance are bounded away from zero.

(C2) For every jointly important predictor, the null and alternative densities are not closest in K-L divergence to the same element of F.

Condition 1 is similar to conditions found in the screening literature ([6]), restricting metrics of marginal importance for important predictors to be non-zero. Condition 2 ensures the conditions of Theorem 3.3 are satisfied, and is mild as noted above.
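The screening rule itself, keeping the d predictors with the smallest estimated posterior null probabilities (equivalently, thresholding at the d-th order statistic), is a one-line rank-and-threshold step. A minimal sketch with hypothetical names:

```python
import numpy as np

def mobs_select(null_probs, d):
    """Return (sorted) indices of the d predictors with the smallest
    estimated posterior null probabilities."""
    null_probs = np.asarray(null_probs)
    return np.sort(np.argsort(null_probs)[:d])

# Predictors 1 and 3 have the smallest null probabilities, so they are kept.
selected = mobs_select([0.9, 0.01, 0.5, 0.02, 0.99], d=2)
```

Choosing d larger than the expected number of signals keeps the superset property while still discarding the vast majority of the p candidates.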

4 Simulation

4.1 Screening accuracy

In this section, we assess the performance of MOBS against five existing screening methods using simulated datasets. The first competitor is sure independence screening or SIS ([6]), for which we use the SIS R package. We also compare against three frequentist model-free methods: sure independent ranking and screening or SIRS ([28]), using code found at http://users.stat.umn.edu/~wangx346/research/example1b.txt; distance correlation screening or DCS ([16]), using the energy R package; and the fused Kolmogorov filter ([21]), using code provided by the authors. Finally, we compare to the JYL Bayesian nonparametric test ([12]), with code found at http://www.people.fas.harvard.edu/~junliu/BF/bfslice.html.

We simulate 100 replicates for all methods under six different models, fixing the numbers of observations and total predictors across settings and generating the response and predictors as follows. First, we consider the case where the predictors are independent and generate the features by discretizing independent Gaussian draws, for all 2000 predictors. However, in biological applications (e.g. involving SNPs), predictors can often exhibit moderate correlations in blocks. To mimic such settings, we randomly select 600 predictors and draw them in the following fashion with a common pairwise correlation. We first draw independent Gaussian variables for each subject and each predictor in the block. Next, we mix in a shared factor, replacing each draw by a weighted combination of the shared factor and the independent draw, so that each pair of predictors in the block attains the desired correlation:

To convert this continuous data to multinomial form, for each predictor we simply assign categories by quantiles. The remaining 1900 predictors are generated from independent Gaussians without correlation.
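The correlated-block construction above can be sketched with an equicorrelated Gaussian factor model followed by quantile binning. The specific factor form Z = sqrt(ρ)·W0 + sqrt(1−ρ)·W_j, which gives corr(Z_j, Z_j') = ρ for j ≠ j', and the three-level binning are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def correlated_multinomial_block(n, p_block, rho, n_levels=3):
    """Draw n subjects' worth of p_block categorical predictors whose latent
    Gaussian values share pairwise correlation rho, then cut each predictor
    into n_levels categories by within-column quantiles."""
    w0 = rng.normal(size=(n, 1))            # shared factor
    w = rng.normal(size=(n, p_block))       # idiosyncratic noise
    z = np.sqrt(rho) * w0 + np.sqrt(1 - rho) * w
    # within-column ranks 0..n-1, then equal-frequency quantile bins
    ranks = z.argsort(axis=0).argsort(axis=0)
    return (ranks * n_levels) // n          # values in {0, ..., n_levels - 1}

x = correlated_multinomial_block(300, 5, rho=0.5)
```

Quantile binning guarantees balanced category frequencies within each predictor, which mimics common-frequency SNP coding.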

Model 1: the response is generated from a linear regression model in a small set of important predictors, where the error is independent of the predictors and the predictors are uncorrelated.

Model 2: the response is generated the same as in Model 1, but with a block of predictors with the correlation structure described above.

Model 3: the response is generated from a single index model, where the error is independent of the predictors and the predictors are uncorrelated.

Model 4: the response is generated the same as in Model 3, but with a block of predictors with the correlation structure described above.

Model 5: the response is generated using 6 predictors at positions chosen at random among all predictors; these form the true vector of jointly important indices. The resulting marginal density is a mixture of Gaussian distributions, with mean and standard deviation determined by the observed combination of levels of the important predictors; each subject's response is then drawn from the corresponding Gaussian. The predictors are generated to be uncorrelated.

Model 6: the response is generated the same as in Model 5, but with a block of predictors with the correlation structure described above.

To implement MOBS, we use the computational strategy described in Sections 2 and 3 and set the default hyperparameters, using a smaller number of mixture components for the linear case, since we expect the response to be mostly unimodal, and a larger number for the more multimodal single index and mixture models. We run the Markov chain of the baseline model for 6000 iterations and perform Monte Carlo integration using the last 500 samples. If potential issues involving label switching arise in the samples of the cluster allocations, we use relabeling methods found in [25] and [22].

Figure 1: ROC curves for all six methods under linear, single index and mixture settings, both correlated and uncorrelated.

Figure 1 summarizes the ROC curves for the six different methods by comparing the true positive rate against the false positive rate. In each of the simulations, MOBS exhibits the best, or among the best, performance in comparison to SIS, SIRS, DCS, the fused Kolmogorov filter and JYL. In the mixture setting, the nonparametric methods dominate, as expected. In the linear regression and single index cases, the other two nonparametric methods, the fused Kolmogorov filter and JYL, both demonstrate weakness, but MOBS still performs strongly and is comparable to SIS. In a multimodal mixture setting, both the weight and the kernel Bayes factors play large roles in detecting change, whereas in a unimodal setting the weight component is less relevant and MOBS essentially reduces to a t-test.

4.2 Computational efficiency

Understanding the computational efficiency of variable screening is important, as the various methods should be scalable to huge data sets. We measure the total time taken to run each of the six methods under two different settings, using data generated from the uncorrelated linear regression setting of Model 1 but with varying n and p. Our first example fixes the number of subjects at 250 and varies the number of predictors. Next, we fix the number of predictors at 2500 and instead vary the number of subjects.

Figure 2: Computational speed against the number of samples n when p = 2500, and against the number of predictors p when n = 250.

Figure 2 summarizes the results under both settings. As expected, the linear-predictor SIS exhibits significantly faster performance, being a simple parametric method, though MOBS remains competitive against the other nonparametric and model-free screening methods. All six screening methods are simple to parallelize and scale linearly with the number of predictors p. Moreover, in regard to the number of samples n, MOBS scales linearly, unlike slower competitors such as JYL and DCS.

5 Applications

5.1 Data

Histogram of E2F2 (Ensembl ID: ENSG00000007968) with the estimated density from sampling the baseline model with k = 5 clusters.

In this section, we illustrate our approach by analyzing the GEUVADIS cis-eQTL dataset ([15]), publicly available at http://www.ebi.ac.uk/Tools/geuvadis-das/. The data set consists of messenger RNA and microR