Scalable Bayesian regression in high dimensions
with multiple data sources
Current applications of high-dimensional regression in biomedicine often involve multiple sources or types of covariates. We propose methodology for this setting, focusing on the “wide data” regime with large total dimensionality and sample size . As a starting point, we formulate a flexible ridge-type prior with shrinkage levels that are specific to each data type or source. These multiple shrinkage levels are set automatically using empirical Bayes. Importantly, all estimation, including setting of shrinkage levels, can be formulated mainly in terms of inner product matrices of size , rendering computation fast and scalable in the wide data setting, even for millions of predictors, and the resulting procedures are free of user-set tuning parameters. We consider sparse extensions via constrained minimization of a certain Kullback-Leibler divergence. This includes a relaxed variant that scales to large , allows adaptive and source-specific shrinkage and has a closed-form solution. We compare these approaches to standard high-dimensional methods in a simulation study based on biological data.
Keywords: high-dimensional regression, multiple data types, prediction, shrinkage priors, ridge regularization
Advances in data acquisition and storage have meant that current and emerging studies are routinely including multiple sources of covariates, such as different data types, with one or more of the sources being high-dimensional. To fix ideas, consider a biomedical setting in which samples indexed by each have response and covariates of several types (representing say genetic data, imaging, clinical covariates and so on) with respective dimensionalities . We refer to the different types of covariate as sources. The ’s are the source-specific dimensionalities and is the total dimensionality. We consider a specific example of this kind below, in the context of Alzheimer’s disease.
Constructing regression models using such data is challenging, because the relevance of the sources may be quite unequal (and unknown in advance) and the total dimension may be large. This motivates a need for methodology that can cope with multiple sources and that scales to high dimensions.
Methods for high-dimensional regression are now well-established and include penalized least-squares approaches such as the lasso and extensions (tibshirani; tibshirani2005; Yuan_Lin; zou_hastie_2005), SCAD (fan_li_2001) and Bayesian analogues (see e.g. Kyung, for a review). A range of Bayesian approaches have been proposed, notably those based on shrinkage priors, often coupled with variable selection, see for instance Yuan_Lin_2005; park_casella; hans_2010; griffin_brown_2010; carvalho_etal_2010 and armagan_etal_2012, among others. However, in the very large case many available methods become computationally cumbersome or intractable and effective source-specific shrinkage remains hard to achieve.
In this paper we put forward an approach to regression in the multiple-source, high-dimensional setting. Specifically:
We consider a generalized ridge-type prior with shrinkage that adjusts to individual data sources, with the corresponding shrinkage levels estimated from the data.
We show that estimation can be formulated in a way that renders computation efficient for “wide” data, even for large and over sources, and including automatic setting of all tuning parameters.
We introduce sparsifications that achieve competitive prediction performance and that provide a fast yet multivariate technique for discarding non-influential predictors.
Thus, we consider the case of data from multiple sources with source-specific dimensionalities that could differ by many orders of magnitude (e.g. between clinical and genetic data), with total large and a priori unknown source-specific importance.
There has been much interesting work on group selection approaches in regression (reviewed in huang2012). The group lasso (Yuan_Lin) allows specification of covariate subsets that can then be selected as groups; however, applying the group lasso in the current setting (by identifying groups with sources) would not be useful because sources would then simply be either included or excluded (without within-source regularization). The sparse group lasso (simon2013) permits additional regularization via within-group sparsity but its use here would require a nontrivial extension to source-specific penalties whose tuning would be difficult if not intractable in the very high-dimensional, multiple source setting. Dondelinger_Mukherjee consider the case of penalized regression over multiple subgroups of samples; this is quite different from the present setting of sources of covariates (i.e. we focus on the columns, not the rows), also the authors do not tackle the very high-dimensional case.
Ridge-type estimators are among the oldest and best studied regularized regression tools, whether from a penalized likelihood or Bayesian viewpoint. Our results build on these classical tools to deal with multiple source, high-dimensional problems. Our main focus is on prediction and not variable selection per se. The sparse solutions proposed should be regarded mainly as fast and scalable multivariate filtering procedures which can potentially reduce dimensionality enough to open the way for established variable selection methods.
The remainder of this paper is organized as follows. In Section 2 we introduce the scalable Bayesian regression (SBR) approach, describing model formulation, prior specification and tuning of shrinkage levels. Section 3 deals with the sparse extension of the methodology, sparse SBR (SSBR), including a general solution and a relaxed variant. The latter is tractable for large problems and permits an adaptive approach to regulate induced sparsity. Results and comparisons with standard penalized likelihood approaches from a simulation study are presented in Section 4. The paper concludes with a discussion in Section 5.
2 Scalable Bayesian regression
Let be a vector of responses and a set of covariate or feature matrices from data-sources. Each is of size so that the total number of potential predictors is . We consider the normal linear model
where each is a -vector of regression coefficients, denotes a -dimensional multivariate normal density with mean and covariance and is the identity matrix. Without loss of generality we assume throughout that all data are standardized. Let and denote the respective global predictor matrix and -vector of regression coefficients (here, “global” means with all sources taken together). Then, with prior the full model under consideration is
The improper prior for (Jeffreys’ prior) is a common option for linear regression models. The crucial aspect of prior formulation for the multiple-source, high-dimensional setting under consideration is the construction of . This we discuss in detail next.
2.2 The prior on
The SBR approach is based on a natural generalization of the standard ridge prior. Specifically, the prior on is
where , and with , for . Here each is a source-specific shrinkage level on the corresponding . The special case recovers the standard ridge prior with just one shrinkage level (and indeed the solutions presented below could be used to give a scalable implementation of classical ridge with a single ). However, when dealing with multiple data sources one would expect different contributions from the various data-sources. This motivates a need for source-specific penalties that can adjust to account for these differences and additionally provide potentially useful information about the usefulness of specific data sources.
At this point it is useful to define the quantity
All formulas presented in the remainder of this Section are cast in terms of . Importantly, this means that the key computations under SBR can be formulated so as to require only a one-off computation of these individual inner product (Gram) matrices of size (these calculations can be easily implemented in parallel) followed mainly by operations on those matrices. As we show below, for wide data with large , this gives a practical way to implement SBR.
Under the conjugate prior in (1) the posterior distribution of is given by
where and . Calculating the posterior mode directly involves a matrix inversion. For we instead use
where is an -vector whose calculation involves only an matrix inversion. The derivation of (4) is provided in Appendix A. For very large problems the computation of the posterior mode can be done in parallel; additionally, we draw attention to the useful expression
for . Having obtained the posterior mode, prediction from an available of dimensionality is straightforward via . In addition, when interest lies solely in prediction the corresponding calculation can be simplified to
Calculating the posterior covariance matrix can also be simplified through the formula
For details see Appendix A. In practice we are not interested in evaluating the entire covariance matrix (for very large this might in fact be difficult due to memory limitations). However, the methodology considered in Section 3.2 requires the diagonal elements of and in this case the formula in (7) facilitates computation as it allows for fast and parallel block matrix computations. Again, note that , so here the magnitude of each , for , can guide us in determining whether to use block computations or not. Moreover, the posterior distribution of is given by
i.e. an inverse-gamma distribution with shape and scale , where
is the residual sum of squares.
Below we will make use of the marginal likelihood . This is the likelihood obtained by integrating over the parameter space with respect to the joint prior distribution (here, over and ). Using the prior specification and results above we have
2.4 Automatic setting of shrinkage levels
Specification of penalty parameters is often handled through cross-validation (CV) or generalized CV in a frequentist framework (tibshirani), while Bayesian methods typically rely on empirical Bayes (EB) point estimates or data-dependent hyper-priors; see e.g. Yuan_Lin_2005; park_casella; bala_madigan_2010; hans_2010 and griffin_brown_2010. An alternative approach is considered by lykou_ntzoufras_2013 who tune the Bayesian lasso penalty based on Pearson correlations at the limit of significance determined by Bayes factors. Furthermore, fully Bayesian shrinkage methods include the horse-shoe prior (carvalho_etal_2010) and the double generalized Pareto (armagan_etal_2012).
Here, the tuning parameter is vector valued and for the applications we consider we would like fast and efficient approaches by which to set it. We propose three EB point estimators obtained by (i) minimizing the leave-one-out CV error, (ii) maximizing the marginal likelihood and (iii) locating the posterior mode under a data-dependent prior. All three require no user input and are computationally fast. We discuss each in turn.
Leave-one-out cross-validation (CV) estimator: Similar to the case of ordinary least squares under ridge regression the leave-one-out CV error in our case can be computed as
A proof is provided in Appendix B.
Marginal likelihood (ML) estimator: From Eq. (10) the quantity that maximizes the ML is given by
Posterior mode (PM) estimator: We consider a product-exponential data-dependent prior for with prior mode at zero, prior mean equal to as given in (11) and prior variance , i.e. . The rationale is that a smaller individual estimated penalty corresponds to a stronger belief that the corresponding matrix contains useful signal and therefore to a smaller prior variance (especially when ). On the other hand as increases we let the quadratic prior variance account for the chance that there is actually some useful signal in which passes undetected by the leave-one-out CV approach. The resulting posterior mode estimate is
Note that the number of available data sources will typically not be large. Therefore, the aforementioned solutions are easy to find through standard optimization routines.
3 Sparse SBR
The SBR posterior mode in (4) is non-sparse (“dense”) in the sense that the regression coefficients will not be set to exactly zero. In this Section we propose a methodology for “sparsifying” SBR. The idea is to find a sparse approximation to the full (dense) Bayesian solution that is closest to it in a Kullback-Leibler (KL) sense. To do so, we minimize the KL divergence with respect to the posterior distribution of the regression vector, but subject to a lasso-type constraint to ensure sparsity. We show first a general solution that is suitable for small to moderate and then go on to consider a relaxed solution that is applicable to the large case. The solutions presented below bear a resemblance to methods proposed by Antoniadis_Fan2001, in the context of wavelet regression, and by Aseervatham2011 for the special case of ridge logistic regression for text categorization. However, these are rooted in different arguments and not equivalent to the KL-based approach below.
3.1 Sparsification using the KL divergence
Let denote the true posterior over , conditional on , with mode and covariance as in Eqs. (4) and (7), respectively, and let denote an approximate conditional posterior where is the approximate mode (this will provide a sparsification of ). The idea is to minimize the KL divergence from to under an penalty on vector to induce sparsity. It is easy to show that the KL divergence from to is
Note that in (14) is a true distance metric (satisfying non-negativity, symmetry and the triangle inequality). Note also that the presence of the nuisance parameter cannot be ignored when the minimization also involves a penalty on . In principle, one could work with the marginal posterior distribution of (a multivariate distribution) in order to avoid consideration of . However, in this case working with the KL divergence is not straightforward. Another option would be to use a plug-in posterior point estimate in (14) such as the mode or mean of . Instead, here we pursue a tuning-free approach in which is integrated out; specifically, we work with
where the posterior of is given in (8). The constant depends on sample size and the residual sum of squares, i.e. , with RSS as given in Eq. (9). Note that using the posterior mean or mode of results in and , respectively. Using (15), the general solution is
where controls the sparsity of . Clearly, the SSBR solution implies a lasso-type model with the particularity of a saturated design where the analogue to sample size equals . This means that SSBR can include at most predictors (unlike classical lasso which for can include at most predictors).
The solution in (16) involves computations with the inverse covariance matrix and is, therefore, best suited to settings where is non-large. Large cases can be treated through the approach presented next which leads to a closed-form expression for .
3.2 A relaxed solution applicable to the large- case
Instead of the KL divergence to the posterior used above, consider the KL divergence between the quantities and , with . This amounts to setting as target distribution the product of the marginal posterior densities. The use of independent posterior factorizations is common in various settings; for instance, in marginal likelihood estimation (e.g. Botev_2013; perrakis_marglik), in expectation-propagation algorithms (e.g. Minka_2001) and in variational Bayes (Bishop). Working with the diagonal matrix leads to the following minimization
where is the -th element, for , of the main diagonal of . Note that the main diagonal elements are feasible to calculate even for very large ; this can be achieved using Eq. (7) with parallel block computations. Moreover, the minimization in (17) has a closed-form solution which is as follows
Note that for fixed and we obtain which makes sense from an asymptotic perspective. However, when is a constant not depending on either directly or indirectly (e.g. through ), a “non-sparsifying” effect may be triggered even for moderate sample size, which is in contrast to our initial intent. Setting of is discussed next.
3.3 Tuning of
We will mainly focus on the specification of in the relaxed approach leading to the solution in (18), which is of main interest as it is directly applicable to the large case111For the general case we note that one could potentially borrow ideas from Antoniadis_Fan2001 using a universal-type threshold (this would equal in our setting). Additionally, one could incorporate information from the SBR part using source-specific penalties of the form for , with the corresponding estimate used in the initial SBR phase.. For the relaxed solution we propose parameter-specific and source-specific adaptive penalties for each , where and . Specifically, we consider penalties of the form
The rationale in (19) is that the larger the magnitude of , the smaller the corresponding penalty. In addition, we restrict to which guarantees reasonable shrinkage when and avoids extreme shrinkage when (which is the common case). Here the ’s act as further tuning parameters with values close to one or zero encouraging sparse or dense solutions, respectively. At this point we borrow information from the available source-specific ’s and treat the ’s as power-weights, namely setting them equal to
With this approach we take advantage of all available information from the previous SBR step, i.e. parameter-specific shrinkage through and sourse-specific shrinkage through . Arguably, this strategy may result in undesirable non-sparse solutions, but that will be in the rare, and rather unrealistic, case where is large and all sources are equally important in the sense that the ’s will be more or less the same; a setting where in fact a single- SBR approach is more suitable.
As a final comment, we remark that despite the fact that this penalization approach depends indirectly on sample size through the regression coefficients in (19) and the shrinkage parameter in (20), it may still be sensitive to the “non-sparsifying” effect discussed at the end of Section 3.2. Controlling this effect requires scaling the penalty in (19) by a factor ; however, automatic tuning of is not straightforward. Empirical results (see Section 4) suggest that can lead to a reasonable balance between sparsity and predictive performance. We will call this extension “controlled” SSBR or cSSBR (as it “controls” for sample size). Table 1 summarizes the different variants of SBR that we consider.
|Solution||Sparsity control||Shrinkage estimator|
|Cross Validation||Marginal Likelihood||Posterior Mode|
4 Simulation study
In this Section we present a simulation study aimed at mimicking a typical modern biomedical application involving multiple data types. Reflecting the relative ease with which multiple data modalities can now be acquired such designs are becoming common, with examples including the Cancer Genome Atlas222https://cancergenome.nih.gov, the Alzheimer’s Disease Neuroimaging Initiative333http://adni.loni.usc.edu and the Rhineland Study444http://www.rheinland-studie.de among many others.
The problem. We consider a regression problem with covariates from three sources, namely clinical (CL), gene-expression (RNA) and genetic (single nucleotide polymorphism or SNP) data with respective (simulated) feature matrices , and . The number of covariates in each data-source is set equal to , and . Although the methods we propose can cope with larger , we restrict total in this Section to facilitate empirical comparison with standard methods.
Covariates. The covariate matrices for the clinical and gene-expression variables are generated as and , respectively. Here and are covariance matrices estimated from (real) phenotype and gene-expression data from the Drosophila Genetic Reference Panel (DGRP) (drosophila) (data available online at http://dgrp2.gnets.ncsu.edu/data.html). To simulate the genetic data we use a block-diagonal covariance structure (HapMap). We specify , where each is of size (with /B) and generated from a inverse-Wishart with degrees of freedom and identity scale matrix, i.e. for . As dominates in terms of dimensionality the specification of essentially controls the overall correlation level. We consider two simulation scenarios: (i) corresponding to 1000 blocks of size 100 (“low-correlation scenario”) and (ii) corresponding to 100 blocks of size 1000 (“high-correlation scenario”). We first generate and then discretize in correspondence to the common SNP encoding 0/1/2 (homozygous major allele/heterozygous/homozygous minor allele). The discretization is tuned in order to give a reasonable empirical distribution of SNPs; specifically, for we discretize as
Regression coefficients and sparsity. For the (true) regression vectors , and we consider the following levels of sparsity (fraction of non-zero ’s); , and . Varying sparsity of gives three scenarios for overall sparsity : (i) (sparse scenario), (ii) (medium scenario) and (iii) (dense scenario). Let , and denote the respective number of elements in the sub-vectors , and containing the non-zero beta coefficients. The non-zero betas are generated from the generalized normal distribution (GND). Following the parameterization in Mineo_2003 the probability distribution function of a GND() with location , scale and shape is given by
The GND includes as special cases the normal () and the double exponential () distributions. To avoid these particular cases (which could potentially bias the simulation towards ridge or lasso respectively) we set and generate the non-zero effects as , for , , for , and , for . The signal strength is controlled via the scale parameter (this is downscaled by a factor of 1.5 for the SNP coefficients to control the total amount of signal in the SNPs). To complete the specification of the simulation we set this scale parameter by considering the finite-sample risk in a simplified CL-only oracle-like set-up. Specifically we consider the correlation induced between predictions (under the OLS estimate using the low-dimensional CL data only) and out-of-sample test data (with the data-generating mechanism being a linear model with conditional mean and error variance equal to unity). Specifically, we set which results in an average out-of-sample correlation of 0.6 when =100 and =5000.
Given the above configurations (low/high correlation and sparse/medium/dense scenarios) we generate data from the model
where and with . The test sample size always equals 5000. Each simulation scenario is repeated 50 times.
Methods under comparison.
We consider the SBR method under the three EB estimates proposed in Section 2.4 and the corresponding SSBR solutions with the penalty terms in (19) the power-weights in (20) and no control for the effect of sample size. Furthermore, we consider cSSBR approaches with , and . We present results obtained from as this option led to a good balance between sparsity and predictive performance.
We compare to standard ridge and lasso (with set to minimize the mean squared error from 5-fold CV using package
glmnet in R).
Boxplots of out-of-sample correlations between predictions and test data under the low-correlation and high-correlation simulations are presented in Figures 1 and 2, respectively. In general, all proposed SBR methods (CV/ML/PM) demonstrate good and stable predictive performance, without any apparent differences among them. Specifically, we see:
In the sparse scenario SBR methods match or outperform lasso.
In the medium scenario SBR methods generally outperform ridge and lasso.
In the dense scenario SBR methods match or outperform ridge.
Concerning the sparse extensions of the SBR approaches we see the following:
In the low correlation simulations SSBR/cSSBR match more or less the performance of non-sparse methods.
In the high correlation simulations SSBR performs well overall in the medium and dense scenarios for larger sample size. The cSSBR approach leads to more stable predictions in comparison to cSSBR.
Figure 3 shows the resulting values of for (higher values correspond to lower penalty, i.e. higher estimated importance) in the low-correlation simulations, and provides useful insights concerning the behaviour of SBR methods. Evidently, all estimates (CV/ML/PM) adjust well by appropriate source-specific penalization. This key attribute allows SBR to generally outperform ridge and lasso in prediction when dealing with multiple data sources. Note also that the PM estimator tends to penalize less. The corresponding plots from the high-correlation scenarios (not shown) are very similar.
As noted above, the sparse SSBR (without or with control) solutions seem to allow equally good predictive performance, in certain cases, as the dense SBR solutions. In addition, they employ fewer parameters; Table 2 shows the average sparsity (over the 50 repetitions) induced by SSBR methods under the various simulations. As seen, the solutions seem to display an adjustment to the true underlying sparsity. In addition controlling for the effect of sample size yields much sparser models.
|Method||Sample size||Simulation scenario|
4.3 Computational performance
We conclude by examining computational burden as a function of total dimension and in comparison with the lasso. To do so we add a fourth “data source” which is simply Gaussian noise. The number of these additional covariates is set so that the total number of predictors is . Sample size is set equal to 100.
Computations were carried out on a compute server with 128 cores (2.28GHz) and 1TB of RAM.
For lasso we treat the binding of individual matrices into one data matrix (an operation not needed for SBR) as a pre-processing step and do not include this in the reported runtimes for lasso. We use the parallel option in
glmnet for estimation of the penalty parameter via 10-fold CV (the default option).
For our methods we consider the SBR approach, which requires evaluation of both in (11) and in (13) (and is thus in principle the slowest), and also its corresponding sparse extension. We include the formation of transpose matrices and calculation of gram matrices in reported runtime (although these could be regarded as pre-processing steps). We do not consider ridge because it can be seen as a special case of SBR and hence equally fast when implemented as described here.
Figure 4 shows the average runtimes (in minutes) over five fits on a single simulated dataset. SBR and sparse SBR are considerably faster than lasso with the gap increasing with .
We note that by adding random noise variables as described above we in a way “favour” the lasso implementation in
glmnet, as the screening rules that are used by default can relatively easily exclude these covariates.
SBR is the fastest method, with the average runtime for being approximately 5 minutes, net of all steps. Also, running sparse SBR with suitable parallel block-matrix calculations makes this method also very fast; average runtime was approximately 10 minutes for . We note that our current implementations of the methods are certainly not optimal in terms of computational efficiency.
We note also that our methods can fully utilize available cores for parallel computation, hence they should continue to gain in runtime if more cores are available even with no increase in clock speed.
The aim of this paper was to introduce a framework for high-dimensional regression using multiple data-sources that allows efficient and fast computations in the wide data, very large setting. We showed that the SBR method is effective in this setting, while the sparse extension SSBR gives parsimonious solutions that can be attractive in many cases.
Concerning the predictive performance of the various SBR variants, our empirical results suggest that the choice of shrinkage estimator (CV/ML/PM) is not very crucial; all three estimators seem able to adjust to individual data sources. As a consequence, all SBR approaches showed effective and stable prediction performance in both sparse and dense conditions. The corresponding SSBR solutions have the potential to achieve similar predictive performance but with explicit sparsity. In addition, using some control for the effect of sample size on sparsity is desirable as it can lead to enhanced sparsity with no loss in terms of prediction. In this case, cSSBR using the PM shrinkage estimator and as sample size control factor seems to be a promising option.
As shown the proposed methodology is very fast even when the number of predictors is in the millions. We note that SBR/SSBR can take advantage of large numbers of cores (regardless of clock speed) because the parallelization is used for matrix multiplication (this is not the case if using parallelization e.g. for cross-validation as in standard implementations of lasso among others).
R code for SBR/SSBR implementation, together with an illustrative example, and complete documentation are available at https://github.com/mukherjeegroup/SBR.
A The posterior mode
B The leave-one-out CV estimator
The leave-one-out CV estimates are obtained via
Here is the posterior mode from the regression of ( without the -th element) on ( without the -th row). For simplicity henceforth.
First, set and observe that
For the transition from the second to the first line we used the Sherman-Morrison formula (harville_97, p.424). Note that , i.e. the -th element of the main diagonal of the hat matrix . Since from the result in (B.2) we have
Switching to matrix notation we have
As a result
C The relaxed SSBR solution
Let , with , and let for . It is straightforward to see that
Which concludes the proof. For we obtain the solution in (18).