Fast Markov chain Monte Carlo for high dimensional Bayesian regression models with shrinkage priors


Rui Jin, Department of Statistics and Actuarial Science, University of Iowa; Aixin Tan, Department of Statistics and Actuarial Science, University of Iowa
January 1, 2019
Abstract

In the past decade, many Bayesian shrinkage models have been developed for linear regression problems where the number of covariates, p, is large. Computation with the intractable posterior is often done with three-block Gibbs samplers (3BG), based on representing the shrinkage priors as scale mixtures of normal distributions. An alternative computing tool is a state-of-the-art Hamiltonian Monte Carlo (HMC) method, which can be easily implemented in the Stan software. However, we found both existing methods to be inefficient and often impractical for large problems. Following the general idea of Rajaratnam et al. (2018), we propose two-block Gibbs samplers (2BG) for three commonly used shrinkage models, namely, the Bayesian group lasso, the Bayesian sparse group lasso and the Bayesian fused lasso models. We demonstrate with simulated and real data examples that the Markov chains underlying the 2BG's converge much faster than those of the 3BG's, and no more slowly than those of HMC. At the same time, the computing cost per iteration of the 2BG's is as low as that of the 3BG's, and can be several orders of magnitude lower than that of HMC. As a result, the newly proposed 2BG is the only practical computing solution for Bayesian shrinkage analysis of datasets with large p. Further, we provide theoretical justifications for the superior performance of the 2BG's. First, we establish geometric ergodicity (GE) of the Markov chains associated with the 2BG for each of the three Bayesian shrinkage models, and derive quantitative upper bounds for their geometric convergence rates. Second, we show that the Markov operators corresponding to the 2BG of the Bayesian group lasso and of the Bayesian sparse group lasso are trace class, whereas those of the corresponding 3BG are not even Hilbert-Schmidt.




Key words and phrases:  Bayesian fused lasso, Bayesian group lasso, Geometric ergodicity, Gibbs sampler, Hamiltonian Monte Carlo, trace class.

1 Introduction

A linear regression model describes how a response variable is related to potential explanatory variables through a vector of regression coefficients, β. Sparse estimates of β are often desired to achieve simple models and stable predictions, especially when the number of explanatory variables p is larger than the sample size n. One popular set of tools is the penalized regression method, in which, with a proper choice of the penalty function, some components of β will be estimated to be exactly zero. See, e.g., Fan and Li (2001). However, it is generally challenging to quantify the uncertainty of the coefficient estimates and that of the resulting sparse models, though there are important recent works in this direction. See, e.g., Chatterjee and Lahiri (2011); Lockhart et al. (2014); Tibshirani et al. (2016).

Alternatively, Bayesian models quantify uncertainties of parameters and latent variables through posterior distributions. Two major types of Bayesian regression models are used to induce sparsity. First, the spike-and-slab models assign mixtures of the degenerate distribution at zero and continuous distributions as priors for the components of β, to explore exactly sparse solutions (Mitchell and Beauchamp, 1988; George and McCulloch, 1993). Such models are natural for sparse problems and enjoy optimality in various senses (Johnstone and Silverman, 2004; Castillo et al., 2015). But both exact and approximate computing are notoriously difficult for even a moderate number of predictors, as an exploration of sub-models is essentially needed. Models of the second kind, which we will call Bayesian shrinkage models, assign the components of β continuous prior distributions. Bayesian shrinkage models have gained popularity as their approximate computation is more easily achieved, and many such models lead to good inference results. Some notable examples include the horseshoe priors (Carvalho et al., 2010), the generalized double Pareto priors (Armagan et al., 2013), and the Dirichlet-Laplace priors (Bhattacharya et al., 2015).

Bayesian shrinkage methods have been tailored to different regression setups. More often than not, predictors have known structures: for example, they may form groups, or be ordered in some meaningful way. The Bayesian group lasso and the Bayesian sparse group lasso models are designed for the first situation, while the Bayesian fused lasso model is for the latter. See Xu and Ghosh (2015); Lee and Chen (2015); Kyung et al. (2010) for their introduction into the Bayesian framework, and see Yuan and Lin (2006); Huang et al. (2012); Simon et al. (2013); Tibshirani et al. (2005) for their origins in the penalized regression literature. These Bayesian shrinkage models have been applied to diverse areas such as agriculture, genetics, image processing and ecology. Examples can be found in Hefley et al. (2017); Li et al. (2015); Raman et al. (2009); Fan and Peng (2016); Hooten and Hobbs (2015); Greenlaw et al. (2017) and Zhang et al. (2014). Successful applications of the Bayesian group lasso, the Bayesian sparse group lasso and the Bayesian fused lasso models depend on fast and reliable computing, which is the pursuit of this paper.

Since there are usually no closed-form expressions for posterior quantities of Bayesian shrinkage models, one resorts to Markov chain Monte Carlo (MCMC) as the most reliable method for approximations. Many shrinkage priors, including the three that we focus on, enjoy a hierarchical representation. This allows an MCMC scheme called the three-block Gibbs sampler (3BG) to explore the posterior. However, 3BG's often suffer from slow convergence due to high correlation between components of different blocks, especially between the regression coefficients in one block and the variance parameters in another (Rajaratnam and Sparks, 2015). An alternative MCMC solution is to use Hamiltonian Monte Carlo (HMC), a general sampling method for exploring continuous distributions. We are not aware of previous usage of HMC to compute the aforementioned Bayesian shrinkage models, but this is easy to set up using the Stan (Carpenter et al., 2017) software, with no need for the hierarchical representation of the prior. Stan implements a certain leapfrog variant of HMC, and is widely known for producing nearly independent samples in numerous empirical examples for which standard MCMC algorithms like random walk Metropolis do not mix well. In contrast to the empirical success of HMC, there is limited understanding of its theoretical properties, such as the rate of convergence to the target distribution, and whether the resulting Monte Carlo estimators have finite asymptotic variances (Diaconis et al., 2014). Indeed, the performance of HMC depends heavily on the shape of the log posterior density and on the choice of tuning parameters, including the number of leapfrog steps and the step sizes. A rare attempt to identify conditions under which certain variants of HMC converge at a geometric rate is Livingstone et al. (2016), but the conditions therein are non-trivial to check and do not concern exactly the type of HMC implemented in Stan. Despite the lack of theoretical guarantees, we run HMC for the Bayesian shrinkage models in Sections 4 and 5. It turns out that, in high-dimensional regression problems with correlated covariates, the computing cost of each iteration of HMC can be very high, while the resulting chains still exhibit moderate to high autocorrelations.

So far, we have argued that for the three Bayesian shrinkage models of interest, both existing computing solutions, 3BG and HMC, are far from satisfactory. We develop and study two-block Gibbs samplers (2BG) that are reliable and fast in high-dimensional cases, where these models are most needed. Empirically, we demonstrate that the proposed 2BG greatly outperforms both 3BG and HMC in computing efficiency, notably with multiple orders of magnitude improvements in challenging cases where p exceeds n. Here, computing efficiency is measured as the effective sample size per unit time, where the effective sample size is inversely proportional to the asymptotic variance of the Monte Carlo estimator of interest. Looking further into what leads to the improvement, we see that the proposed 2BG mixes much better than 3BG in terms of having lag-one autocorrelations closer to 0, with similar computing cost per iteration; and 2BG mixes similarly to or better than HMC, but costs only a small fraction of the latter per iteration.

Beyond showing the empirical success of the 2BG's for computing the three Bayesian shrinkage models of interest, we prove in theory that they are reliable and have superior spectral properties to those of the corresponding 3BG's. In more detail, there are two commonly used criteria to evaluate Markov chains that explore a target distribution, disregarding computing cost: the rate of convergence (in distribution) to the target, and the asymptotic variance of Monte Carlo estimators of interest. Concerning convergence rates, it is desirable that a Markov chain possess geometric ergodicity (GE), a most commonly used condition to establish the central limit theorem (CLT), which in turn allows standard error evaluations of Monte Carlo estimates. For the 2BG for each of the three variants of the Bayesian lasso model, we establish GE of the respective Markov chain and obtain quantitative upper bounds for its rate of convergence. In addition, both convergence rates and asymptotic variances depend on the spectrum of the corresponding Markov operator. See, e.g., Rosenthal et al. (2015) and Mira (2001). For the Bayesian group lasso and the Bayesian sparse group lasso models, we are able to prove that the Markov operators associated with the 2BG are trace class (the eigenvalues are absolutely summable and hence square-summable), whereas those of the 3BG are not even Hilbert-Schmidt (the eigenvalues of the absolute value of the Markov operator are not square-summable).

The GE and trace class results for the 2BG also provide theoretical foundations for further acceleration of these algorithms. Indeed, the 2BG is a kind of Data Augmentation (DA) algorithm. There is a well-studied “sandwich” trick (Liu and Wu, 1999; Meng and Van Dyk, 1999; Hobert and Marchev, 2008) that allows many ways to insert an extra step in each iteration of a DA algorithm, to achieve a potentially much faster convergence rate. The sandwich trick improves the computational efficiency of DA provided the extra step is inexpensive, and the trick has seen many successful implementations (Liu and Wu, 1999; Meng and Van Dyk, 1999; Van Dyk and Meng, 2001; Marchev and Hobert, 2004; Hobert et al., 2011; Ghosh and Tan, 2015). The design of good sandwich algorithms for the Bayesian shrinkage models is beyond the scope of this paper. Nevertheless, our results that certain 2BG samplers are GE and trace class imply that any sandwich improvements of them are also GE and have better spectral properties, in the sense that the eigenvalues of the improved algorithms are dominated by those of the basic 2BG (Khare and Hobert, 2011).

The 2BG algorithms proposed and analyzed in this paper stem from the work of Rajaratnam et al. (2018), which introduced the two-block strategy for a class of Bayesian models for regression. In the case of the Bayesian lasso model, they analyzed the convergence rates and the spectral properties for the corresponding 2BG and 3BG, and acknowledged that such a theoretical study is “a big and challenging undertaking”. A main contribution of our work is to develop these theoretical results for the aforementioned three variants of the Bayesian lasso models, which complement and support the empirical advantage we observed of using 2BG over other existing computing solutions.

The rest of the paper is organized as follows. In Section 2, we present the three Bayesian shrinkage models and review the existing 3BG algorithms to explore their posterior distributions. The 2BG algorithms for these models are proposed in Section 3. In Sections 4 and 5, we perform empirical comparisons among HMC, 3BG and 2BG in simulation studies and real data applications. In Section 6, we give theoretical justifications for the superior performance of 2BG over 3BG, by studying the convergence properties and the spectra of the corresponding Markov chains.

2 Bayesian Shrinkage Models and Standard Three-block Gibbs Samplers

Let y be the n-dimensional response vector, X the n × p design matrix, β the p-dimensional vector of regression coefficients, and σ² the residual variance in a regression problem. We consider a general framework of Bayesian models as follows:

(2.1)

Here, the covariance matrix in the conditional prior of β is positive definite and indexed by a parameter τ, which is assigned a proper prior; the remaining constants are user-specified hyperparameters. Many Bayesian shrinkage models can be specified using (2.1), e.g., those under the lasso prior (Park and Casella, 2008), the enet prior (Li and Lin, 2010a), the horseshoe prior (Carvalho et al., 2010), and the Dirichlet–Laplace prior (Bhattacharya et al., 2015). Bayesian inference for the model is based on the joint posterior distribution of (β, σ², τ), especially the marginal posterior of β and various summaries of the distribution. These distributions are analytically intractable and difficult to draw independent samples from. We will focus on the MCMC solution for computing. An alternative solution through variational approximation is usually of much lower accuracy (see, e.g., Neville et al. (2014)), and will not be discussed in this paper.
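To fix ideas, the display below gives a minimal sketch of the hierarchy behind (2.1), assuming the usual normal scale-mixture form adopted in the works cited above; the covariance factor Σ_τ and the constants a and b in the prior on σ² are our labels for this sketch and need not match the exact parameterization of (2.1).

```latex
\begin{aligned}
  y \mid \beta, \sigma^2, \tau &\sim N_n\!\left(X\beta,\ \sigma^2 I_n\right),\\
  \beta \mid \sigma^2, \tau &\sim N_p\!\left(0,\ \sigma^2 \Sigma_\tau\right),\\
  \tau &\sim \pi(\tau), \qquad
  \pi(\sigma^2) \propto (\sigma^2)^{-(a+1)} e^{-b/\sigma^2},
\end{aligned}
```

so that the prior on σ² is a proper Inverse-Gamma distribution when a, b > 0 and an improper power-law prior when a = b = 0.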

Note that the model in (2.1) yields the following conditional distributions:

(2.2)

Hence, one can use a Gibbs sampler to update the three blocks of components β, σ², and τ in turn, as long as the full conditional of τ is easy to sample from. This is what we refer to as 3BG, and it has been adopted in Kyung et al. (2010); Griffin and Brown (2010); Pal et al. (2017) and Li and Lin (2010b) to study special cases of model (2.1). The 3BG Markov transition density with respect to Lebesgue measure on the joint state space is given by

(2.3)
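For concreteness, the following Python sketch shows the structure of one 3BG sweep, assuming the standard conditionals implied by the normal scale-mixture form sketched above: a multivariate normal draw for β, an Inverse-Gamma draw for σ², and a model-specific update for τ that is left abstract. The shape and scale expressions are assumptions in the spirit of (2.2), not a transcription of it.

```python
import numpy as np

def three_block_gibbs_sweep(y, X, beta, sigma2, Sigma_tau, draw_tau,
                            a=0.0, b=0.0, rng=None):
    """One sweep of a generic 3BG for model (2.1), updating
    beta | sigma2, tau  ->  sigma2 | beta, tau  ->  tau | beta, sigma2."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    Sigma_tau_inv = np.linalg.inv(Sigma_tau)

    # Block 1: beta | sigma2, tau, y ~ N(A^{-1} X'y, sigma2 * A^{-1}), A = X'X + Sigma_tau^{-1}
    A_inv = np.linalg.inv(X.T @ X + Sigma_tau_inv)
    beta = rng.multivariate_normal(A_inv @ (X.T @ y), sigma2 * A_inv)

    # Block 2: sigma2 | beta, tau, y ~ Inverse-Gamma((n + p)/2 + a, quadratic forms / 2 + b)
    resid = y - X @ beta
    shape = 0.5 * (n + p) + a
    scale = 0.5 * (resid @ resid + beta @ Sigma_tau_inv @ beta) + b
    sigma2 = scale / rng.gamma(shape)          # 1/Gamma draw gives an Inverse-Gamma draw

    # Block 3: tau | beta, sigma2 -- model specific (group, sparse group, or fused lasso)
    Sigma_tau = draw_tau(beta, sigma2, rng)
    return beta, sigma2, Sigma_tau
```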

We next review three special cases of model (2.1) that are useful for different regression setups.

2.1 The Bayesian Group Lasso

In regression problems, explanatory variables often occur in groups. Grouping structures can be implied by knowledge of the area of application. For instance, in modeling health of individuals, variables may belong to different categories such as social and economic environment, physical environment, and individual characteristics. Also, quantitative variables may enter a model as groups of basis functions, and qualitative variables with multiple levels can be represented by groups of dummy variables. See Huang et al. (2012) for an extensive review of problems that involve groups of variables. Suppose there are K groups of explanatory variables. For g = 1, …, K, let m_g denote the size of the gth group, β_g the m_g-dimensional vector of regression coefficients of that group, and X_g the corresponding sub-design matrix. Yuan and Lin (2006) presented the group lasso method for estimating β as follows,

(2.4)

The penalty parameter λ determines the intensity of the shrinkage and is often tuned by cross-validation. An obvious Bayesian analogue of the above method can be obtained by combining the first line of (2.1) with the prior

This construction leads to a conditional posterior distribution of β given σ² and the data, the mode of which matches the group lasso estimator in (2.4). In the Bayesian literature, a variation of the above is more often used to achieve group-level shrinkage, which specifies

(2.5)

Kyung et al. (2010) showed a hierarchical representation of (2.5), and the model is completed with a prior on σ²:

(2.6)

Here, the hyperparameters of the prior on σ² are user specified. Note that (2.6) is indeed a special case of (2.1), with τ played by the group-specific scale parameters (τ_1², …, τ_K²), and the prior covariance factor played by the corresponding block-diagonal matrix diag(τ_1² I_{m_1}, …, τ_K² I_{m_K}).

The model in (2.6) implies the following full conditionals:

(2.7)

These conditionals enable a 3BG to explore the corresponding posterior, which updates the three blocks β, σ², and (τ_1², …, τ_K²) in turn. The samples then form a Markov chain, and its transition density with respect to Lebesgue measure on the joint state space is given by

(2.8)

2.2 The Bayesian Sparse Group Lasso

In the same setup as the previous section, where the explanatory variables form groups, Simon et al. (2013) introduced the sparse group lasso method, which promotes sparsity in the regression coefficients both among groups and within each group. For penalty parameters λ_1 and λ_2, the sparse group lasso estimator is defined as

(2.9)

where λ_1 and λ_2 are often tuned by cross-validation. A Bayesian analogue of (2.9) can be obtained by specifying

(2.10)

The resulting mode of the conditional posterior distribution coincides with the estimator in (2.9). A more popular variation of the prior is given by

(2.11)

We will refer to the model that uses (2.11), completed with an Inverse-Gamma prior on σ², as the Bayesian sparse group lasso model (Xu and Ghosh, 2015). This model is a special case of the Bayesian shrinkage model (2.1), as it has the following hierarchical representation:

(2.12)

where the remaining constants are user-specified hyperparameters. Also, for each group g,

and

The model in (2.12) implies the following full conditionals:

(2.13)

where the prior covariance factor of β is the direct sum of the group-level blocks, and β_{gj} denotes the jth coefficient of the gth group. Accordingly, a 3BG can be established from (2.13) that updates the variables in three blocks: β, σ², and the latent scale parameters, in turn. The samples then form a Markov chain, and its transition density with respect to Lebesgue measure on the joint state space is given by

(2.14)

2.3 The Bayesian Fused Lasso

For regression problems in which explanatory variables can be ordered in some meaningful way, Tibshirani et al. (2005) developed the fused lasso method, which penalizes the ℓ₁ norm of both the coefficients and their successive differences. For fixed tuning parameters λ_1 and λ_2, the fused lasso estimator is defined as

(2.15)

We now present the Bayesian counterpart of the fused lasso method. Let the conditional prior of β given σ² be

(2.16)

Then given any σ², the fused lasso estimator (2.15) can be interpreted as the mode of the conditional posterior of β. A variation of (2.16) that is more often implemented in Bayesian models is

(2.17)

A full Bayesian fused lasso model, based on a hierarchical representation of (2.17) and a prior on σ², was presented in Kyung et al. (2010) and is shown below (a typo in their priors for the two sets of latent scale parameters has been fixed: they should follow Exponential distributions with the means indicated in (2.18)):

(2.18)

where the two sets of latent scale parameters play the role of τ in (2.1), and the remaining constants are hyperparameters specified by the user. The matrix playing the role of the prior covariance factor of β is such that

The following full conditionals can be derived from the model in (2.18):

(2.19)

These conditionals enable a 3BG to explore the corresponding posterior, which updates the three blocks β, σ², and the latent scale parameters in turn. The samples then form a Markov chain, and its transition density with respect to Lebesgue measure on the joint state space is given by

(2.20)

3 Fast MCMC for Bayesian Shrinkage Models

The 3BG algorithms in Section 2 are the most commonly used computing solutions for the aforementioned Bayesian shrinkage models. Despite their straightforward implementation, 3BG converges slowly in high-dimensional settings. For a special case of the Bayesian shrinkage model in (2.1) that assigns a “non-informative” improper prior on σ², Rajaratnam et al. (2018) noted that the conditional distribution of σ² given τ and the data (with β integrated out) is also tractable, and inexpensive to sample from. Hence, (β, σ²) can be updated in one block by drawing σ² from this conditional and then β from its full conditional. We now slightly generalize Lemma 1 of Rajaratnam et al. (2018) by developing a tractable form of this conditional posterior distribution of σ² for the general Bayesian shrinkage model in (2.1), where the prior on σ² can be either improper or proper. The proof can be found in the supplementary material.

Lemma 1

For the Bayesian shrinkage model in (2.1), suppose that the prior on σ² is of the (possibly improper) Inverse-Gamma form. Then, σ² conditional on τ and the data follows an Inverse-Gamma distribution, with shape and scale parameters available in closed form.
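Under the hedged parameterization sketched after (2.1), the conditional in Lemma 1 would take the following form, obtained by integrating β out of the likelihood; the display is our reconstruction of the shape and scale, not a quotation of the lemma.

```latex
\sigma^2 \mid \tau, y \;\sim\; \text{Inverse-Gamma}\!\left(
  \frac{n}{2} + a,\;\;
  \frac{1}{2}\, y^\top \left(I_n + X \Sigma_\tau X^\top\right)^{-1} y + b
\right).
```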

Lemma 1 leads to a two-block Gibbs sampler (2BG) that cycles through the following steps

(3.1)
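A minimal Python sketch of one 2BG cycle is given below, reusing the conditionals assumed in the 3BG sketch of Section 2 and the hedged form of Lemma 1 above. The only structural change from the 3BG is that σ² is drawn from its conditional given τ alone (with β integrated out), so that (σ², β) forms a single block.

```python
import numpy as np

def two_block_gibbs_sweep(y, X, Sigma_tau, draw_tau, a=0.0, b=0.0, rng=None):
    """One sweep of the generic 2BG of (3.1):
    (sigma2, beta) | tau, y drawn as sigma2 | tau, y followed by beta | sigma2, tau, y,
    and then tau | beta, sigma2."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape

    # sigma2 | tau, y ~ Inverse-Gamma (hedged form of Lemma 1, with beta integrated out)
    M = np.eye(n) + X @ Sigma_tau @ X.T
    shape = 0.5 * n + a
    scale = 0.5 * (y @ np.linalg.solve(M, y)) + b
    sigma2 = scale / rng.gamma(shape)

    # beta | sigma2, tau, y ~ N(A^{-1} X'y, sigma2 * A^{-1}), A = X'X + Sigma_tau^{-1}
    A_inv = np.linalg.inv(X.T @ X + np.linalg.inv(Sigma_tau))
    beta = rng.multivariate_normal(A_inv @ (X.T @ y), sigma2 * A_inv)

    # tau | beta, sigma2 -- same model-specific update as in the 3BG
    Sigma_tau = draw_tau(beta, sigma2, rng)
    return beta, sigma2, Sigma_tau
```

Relative to the 3BG sweep, the extra work in this sketch is one n-by-n linear solve, which is modest when n is much smaller than p; this is consistent with the observation that the per-iteration costs of 2BG and 3BG are comparable.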

Convergence properties of the Markov chain that underlies the 2BG are characterized by its marginal sequence, which is itself a Markov chain with transition density

(3.2)

Details of the new 2BG algorithms for the Bayesian group lasso, the Bayesian sparse group lasso and the Bayesian fused lasso models are given in the subsections below.

3.1 The Two-block Gibbs Sampler for the Bayesian Group Lasso

For the Bayesian group lasso model, the 2BG updates described in (3.1) become

(3.3)

The one-step transition density of the corresponding marginal chain is given by

(3.4)
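For the group lasso specifically, the model-specific τ-update shared by the 2BG above and the 3BG of Section 2.1 is typically a set of inverse-Gaussian draws for the reciprocals 1/τ_g². The sketch below assumes the standard conditional from the hierarchical representation of Kyung et al. (2010), with mean sqrt(λ²σ²/‖β_g‖²) and shape λ²; this assumed form should be checked against (2.7).

```python
import numpy as np

def draw_tau_group_lasso(beta, sigma2, group_sizes, lam, rng=None):
    """Assumed tau-update for the Bayesian group lasso:
    1/tau_g^2 | beta, sigma2 ~ Inverse-Gaussian(sqrt(lam^2*sigma2/||beta_g||^2), lam^2).
    Returns the block-diagonal prior covariance factor diag(tau_g^2 I_{m_g})."""
    rng = np.random.default_rng() if rng is None else rng
    tau2 = []
    start = 0
    for m_g in group_sizes:
        beta_g = beta[start:start + m_g]
        mean = np.sqrt(lam**2 * sigma2 / (beta_g @ beta_g))   # assumes ||beta_g|| > 0
        inv_tau2_g = rng.wald(mean, lam**2)                   # inverse-Gaussian draw
        tau2.extend([1.0 / inv_tau2_g] * m_g)
        start += m_g
    return np.diag(tau2)
```

Passing, e.g., draw_tau=lambda b, s2, rng: draw_tau_group_lasso(b, s2, group_sizes, lam, rng) to the generic sweep functions sketched earlier yields group-lasso versions of the 2BG and the 3BG.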

3.2 The Two-block Gibbs Sampler for the Bayesian Sparse Group Lasso

For the Bayesian sparse group lasso model, the 2BG updates the two blocks (β, σ²) and the latent scale parameters in each iteration as follows:

(3.5)

The one-step transition density of the corresponding marginal chain is given by

(3.6)

3.3 The Two-block Gibbs Sampler for the Bayesian Fused Lasso

For the Bayesian fused lasso model, the 2BG updates the two blocks (β, σ²) and the latent scale parameters in each iteration as follows:

(3.7)

The one-step transition density of the corresponding marginal chain is given by

(3.8)

4 Simulation Studies

We use simulation studies to demonstrate the advantage of the newly developed 2BG's over existing MCMC computing solutions for the three Bayesian shrinkage models of interest. The existing methods are 3BG and HMC. Briefly, HMC is a state-of-the-art generic MCMC technique that has shown remarkable empirical success in exploring many difficult Bayesian posteriors. See Neal et al. (2011) for an introduction. All HMC samplers in our study are run using the Stan software, which simply requires specification of the Bayesian models in the Stan language; the software then automatically derives the log posterior density, and tunes and runs the HMC algorithms accordingly. Note that, as with Gibbs samplers, different representations or parameterizations of the same model lead to different HMC algorithms. We experimented with two ways to implement HMC for the Bayesian group lasso model. The first uses the conditional prior in (2.5) directly, while the second uses its hierarchical representation in (2.6). For the sparse group lasso and the fused lasso models, we implemented them in Stan with the conditional priors of β directly specified as (2.11) and (2.17), respectively. Their hierarchical versions are not implemented because they are highly inconvenient to express in the Stan language, and are expected to be much more costly per iteration.

For the Markov chain underlying each algorithm, let ρ_k denote the lag-k autocorrelation of the σ² marginal of the chain (under stationarity). Rajaratnam and Sparks (2015, Section 3) provided comprehensive reasons why ρ_1 is a good proxy for the mixing rate of the joint Markov chain, more so than the autocorrelations of the regression coefficients. Empirically, smaller estimates of ρ_1 signal better mixing of the chain. We also compare the per-unit-time effective sample size of the σ² component of each chain. Specifically, T iterations of the Markov chain produce an effective sample size of

ESS = T / (1 + 2 ∑_{k≥1} ρ_k).    (4.1)

We estimate the ESS of the 3BG and 2BG samples using the effectiveSize function from the R package coda (Plummer et al., 2006), and that of the HMC samples using Stan. Estimates of the ESS are divided by the respective running times in seconds for eventual comparisons.
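The reported values come from coda::effectiveSize and Stan's built-in estimator; purely as an illustration of the quantity in (4.1), the Python sketch below computes a crude autocorrelation-based ESS (truncating at the first non-positive autocorrelation estimate, which is not the exact rule used by either package) and divides it by wall-clock time.

```python
import numpy as np

def ess(x):
    """Crude ESS estimate: T / (1 + 2 * sum of leading positive autocorrelations)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[T - 1:] / (np.arange(T, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, T):
        if acf[k] <= 0:        # truncate at the first non-positive autocorrelation
            break
        rho_sum += acf[k]
    return T / (1.0 + 2.0 * rho_sum)

def ess_per_second(sigma2_draws, seconds):
    """Efficiency measure used in Sections 4 and 5: effective samples per second."""
    return ess(sigma2_draws) / seconds
```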

Data are simulated using the following model:

(4.2)

where ε is a vector of independent standard normal random variables, and β is the vector of true regression coefficients. In the first set of scenarios, we generate grouped explanatory variables for several combinations of the sample size n and the dimension p. We generate the base variables independently from a multivariate normal distribution with mean vector zero and the identity variance-covariance matrix. We then build order-5 polynomials based on each variable to obtain the design matrix X, in which every 5 consecutive columns belong to a group. For each p we experiment with, β is such that only its leading elements are nonzero, and these are drawn independently from a fixed distribution.

In the second set of scenarios, adjacent explanatory variables are assigned regression coefficients close to each other, again over several (n, p) combinations. To generate the design matrix X, the rows are drawn independently, each from a p-dimensional multivariate normal distribution whose mean vector is zero and whose variance-covariance matrix has a common value on the diagonal and a smaller common value everywhere else. The columns of X are then standardized to have mean zero and a common squared Euclidean norm. The first and the third blocks of elements of β are drawn independently from a fixed distribution, and all other elements are set to zero.
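The Python sketch below mimics the scenario 1 mechanism; the sample size, the number of base variables, the number of nonzero groups, and the coefficient distribution used here are placeholder choices for illustration and need not match the values used in the experiments.

```python
import numpy as np

def simulate_scenario1(n=50, n_vars=40, nonzero_groups=2, sigma=1.0, rng=None):
    """Scenario-1-style data: each base variable is expanded into an order-5
    polynomial, so every 5 consecutive columns of X form a group.
    All numeric defaults are illustrative placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal((n, n_vars))           # independent N(0, I) base variables
    # order-5 polynomial expansion, keeping the 5 columns of each variable adjacent
    X = np.column_stack([Z[:, j] ** d for j in range(n_vars) for d in range(1, 6)])
    p = 5 * n_vars
    beta = np.zeros(p)
    k = 5 * nonzero_groups
    beta[:k] = rng.standard_normal(k)              # placeholder coefficient distribution
    y = X @ beta + sigma * rng.standard_normal(n)  # response from model (4.2)
    return y, X, beta
```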

In analyzing the simulated data, both the Bayesian group lasso and the Bayesian sparse group lasso are applied to scenario 1, and the Bayesian fused lasso to scenario 2. Take the comparison of MCMC algorithms for the Bayesian group lasso model as an example. At each (n, p) combination, multiple datasets are generated from scenario 1. For every dataset, each kind of MCMC algorithm is run for a fixed number of iterations, with an initial segment discarded as burn-in. Each point in Figures S1, S2 and S3 represents the average of the lag-one autocorrelation estimates from Markov chains of the same kind over the different datasets. Similarly, each point in Figures 2, 4 and 6 represents the base-10 logarithm of the average ESS per second over those Markov chains. The series of experiments are run on the High Performance Neon Cluster at the University of Iowa, on a standard machine with 64 GB RAM, 16 cores and a 2.6 GHz processor.

Remark 1

Note that we choose a relatively short chain length because the HMC samplers are very costly. While the 2BG and 3BG samplers can easily be run for many more iterations in all scenarios, doing so yields estimates of the lag-one autocorrelation and the ESS per second that are numerically similar to those of the shorter chains reported here. Also, multiple randomly started chains of the same algorithm were run for diagnostic purposes and passed various diagnostic tests, including the Gelman-Rubin (Gelman et al., 1992; Brooks and Gelman, 1998), the Geweke (Geweke, 1992), and the Heidelberg-Welch tests. The estimates of the lag-one autocorrelation and the ESS per second were also very similar across chains with different starts. In the end, only the first chain of each set of randomly started chains was retained for numerical and graphical summaries.

Remark 2

The main goal of our simulation study is to examine the behavior of the different algorithms, not to determine which models or hyperparameter values fit each scenario best. Hence we only present empirical results for the most reasonable models and their default hyperparameter values. We set the hyperparameters of the prior on σ² to values that lead to improper priors on σ². Also, we fix the regularization parameters at one default value for the Bayesian group lasso and another for the other two Bayesian models. The resulting posteriors are proper (Kyung et al., 2010; Xu and Ghosh, 2015). For future work, it is of interest to investigate the impact of hyperparameter values through Bayesian cross-validation and Bayesian sensitivity analysis, or to conduct fully Bayesian analyses that employ hyperpriors. All such endeavors depend heavily on the fast algorithms developed in this article.

4.1 Simulation Results for the Bayesian Group Lasso

Figure S1 displays the average lag-one autocorrelation of the σ²-chain. We can see that the proposed 2BG has dramatically smaller autocorrelations than the original 3BG across all setups of (n, p), indicating better mixing of 2BG. Both HMC samplers have mixing rates comparable to that of 2BG when p is not too large relative to n. But with larger p, the lag-one autocorrelation of the HMC samplers deteriorates, while that of 2BG stays low. Further, Figure 2 takes computing time into consideration by examining the average effective sample size per second of each sampler on the log scale. It is evident that 2BG is more efficient than all others. For example, at the smaller p values, 2BG produces about twice as many effective samples per second as 3BG, and many times more than “Stan” and “Stan_hier”. At the larger p values, the relative efficiency of 2BG over the other three methods increases further. Note that “Stan_hier” is much less efficient than “Stan”, as a result of the similar mixing rates of the two underlying Markov chains combined with the fact that the hierarchical version involves a higher-dimensional posterior that greatly increases the computational cost per iteration.

Figure 1: Empirical lag-one autocorrelation of the σ² component of the four MCMC algorithms for the Bayesian group lasso model, for the two simulation settings (left and right).
Figure 2: Efficiency, measured by the average effective sample size per second and displayed in base-10 log scale, of the four MCMC algorithms for the Bayesian group lasso model, for the two simulation settings (left and right).

4.2 Simulation Results for the Bayesian Sparse Group Lasso

Figure 3: Empirical lag-one autocorrelation of the σ² component of the three MCMC algorithms for the Bayesian sparse group lasso model, for the two simulation settings (left and right).
Figure 4: Efficiency, measured by the average effective sample size per second and displayed in base-10 log scale, of the three MCMC algorithms for the Bayesian sparse group lasso model, for the two simulation settings (left and right).

Figure S2 displays the lag-one autocorrelation of the σ²-chain of the 2BG, the 3BG and the HMC sampler for the Bayesian sparse group lasso model, which shows a similar pattern to that of the group lasso model. Again, 2BG always mixes much faster than 3BG. HMC mixes faster than 2BG when p is small, but 2BG shows an increasingly large advantage over HMC as p grows beyond n. After taking computing time into account, in Figure 4, we see that 2BG produces the most effective samples per second in all setups. For example, at the smaller p values, 2BG produces several times as many effective samples per second as 3BG and HMC, and at the larger p values the relative efficiency of 2BG over the other two samplers grows substantially.

4.3 Simulation Results for the Bayesian Fused Lasso

Figure S3 displays the lag-one autocorrelation of the σ²-chain of the three samplers for the Bayesian fused lasso model. 2BG and HMC always mix much faster than 3BG, while 2BG has an advantage over HMC for large p, when p exceeds n. Now taking computing time into account, Figure 6 shows that 2BG is the most efficient sampler in all situations. For example, at the smaller p values, 2BG produces several times as many effective samples per second as 3BG and “Stan”. At the larger p values, 2BG remains several times as efficient as 3BG and about twice as efficient as HMC, though all three generate less than one effective sample per second.

Figure 5: Empirical lag-one autocorrelation of the σ² component of the three MCMC algorithms for the Bayesian fused lasso model, for the two simulation settings (left and right).
Figure 6: Efficiency, measured by the average effective sample size per second and displayed in base-10 log scale, of the three MCMC algorithms for the Bayesian fused lasso model, for the two simulation settings (left and right).

4.4 Summary of Simulation Results

For all three Bayesian shrinkage models in all setups, 2BG is always more efficient than 3BG and HMC. The advantage of 2BG is most prominent when the number of predictors p is larger than the sample size n. In these most demanding, high-dimensional regression problems, 2BG produces many times as many effective samples per second as its competitors. Contributing to this advantage are (1) the fast mixing rate of 2BG, which is similar to that of the state-of-the-art HMC for small p, and even faster than that of HMC for larger p, and (2) the low computing cost per iteration of 2BG, which is no more than that of 3BG, and much lower than that of HMC.

5 Real Data Analysis

In this section, we observe superior performance of the proposed 2BG over 3BG and HMC in real data applications. Three data sets are analyzed, each using a different Bayesian shrinkage model that is appropriate for the structure of the covariates. For each model, all three computing methods are used and evaluated in terms of mixing rate and computational efficiency, as summarized in Table 1.

Each algorithm is run for a fixed number of iterations with the first one-tenth discarded as “burn-in”. The experiments are run on a 2.2 GHz machine with 16 GB RAM running macOS Sierra. More details of the data and the comparisons in computation are given in the subsections below.

                                      Lag-one autocorrelation       Effective samples per second
Data Set               n      p       2BG      3BG      Stan         2BG      3BG      Stan
Gene Expression Data   120    100     0.057    0.40     0.19         578      245      0.08
Economic Data          72     51      0.014    0.42     0.029        1213     494      8.63
CGH Data               200    200     0.288    0.637    0.232        49       19       1.38
Table 1: Lag-one autocorrelation and effective sample size per second for the σ²-component of 2BG, 3BG and HMC as applied to the three datasets in Section 5.

5.1 Gene Expression Data

We first evaluate the performance of 2BG, 3BG and HMC for a Bayesian group lasso model applied to the gene expression data of Scheetz et al. (2006), which is publicly available as the data set bardet in the R package gglasso (Yang and Zou, 2015). The response variable is the expression level of the gene TRIM32, which causes Bardet-Biedl syndrome. The expression levels of other genes are available and are expanded using B-spline basis functions. The resulting design matrix contains 100 columns, where every 5 consecutive columns correspond to the same gene. Each column is then standardized to have mean zero and a common squared Euclidean norm. The Bayesian group lasso model was implemented with a regularization parameter chosen based on a cross-validation step for its frequentist counterpart.

2BG mixes very fast, with an average lag-one autocorrelation of 0.057 for the σ²-chain, compared to 0.19 for HMC and 0.40 for 3BG. Next, considering the number of effective samples of the σ²-component generated per second, 2BG comes first at around 578, followed by 3BG at 245, and HMC at only 0.08.

5.2 Economic Data

We use the Bayesian sparse group lasso model to analyze economic data from Rose and Spiegel (2010), which are publicly available as the data set Crisis2008BalancedData in the R package BSGS (Lee and Chen, 2015). The response variable is the growth rate of a crisis measure from 2008 to 2009. The design matrix contains 51 explanatory variables related to the 2008 financial crisis for 72 countries. These variables can be classified into groups. All columns of the design matrix are standardized except for those corresponding to dummy variables. The two regularization parameters were fixed at preset values.

Again, using the lag-one autocorrelation of the σ²-chain, 2BG comes in first at 0.014, followed by HMC at 0.029, and 3BG at a substantially larger 0.42. In terms of the number of effective samples per second, 2BG is the best at 1213, followed by 494 for 3BG and 8.63 for HMC.

5.3 Comparative Genomic Hybridization (CGH) Data

We apply the Bayesian fused lasso model to a subset of the CGH data of Tibshirani and Wang (2007), which is publicly available as the data set CGH in the R package cghFLasso. Tibshirani and Wang (2007) analyzed these data with the fused lasso model. The response contains 200 measurements from a CGH array, and the design matrix is the 200 × 200 identity matrix. The two regularization parameters were fixed at preset values.

Concerning the lag-one autocorrelation of the σ²-chain, 2BG and HMC are comparable, at 0.288 and 0.232, respectively, while 3BG falls behind at 0.637. After taking time into account, 2BG is clearly the winner, producing 49 effective samples per second, compared to 19 from 3BG and 1.38 from HMC.

6 Theoretical Properties

In this section, we study the convergence rates and spectral properties of the proposed 2BG and the original 3BG for the three Bayesian shrinkage models. In particular, we establish GE of the 2BG Markov chain for each model, and derive quantitative upper bounds on their geometric convergence rates. We also show that the 2BG Markov chains have better-behaved spectra than their 3BG counterparts.

6.1 Markov Chain Background

Let Π denote the posterior distribution obtained from the model in (2.1), defined on the appropriate state space. Consider a Markov chain that is invariant with respect to Π, with one-step transition function K(x, ·) and m-step transition function K^m(x, ·) for m ∈ ℕ. Let ‖·‖_TV denote the total variation norm. If the Markov chain is aperiodic, irreducible, and Harris recurrent, then for any initial value x, the chain converges to the posterior distribution in the sense that ‖K^m(x, ·) − Π(·)‖_TV → 0 as m → ∞.

One can quantify the rate of this convergence. For example, if there exist a function M : x ↦ M(x) ∈ [0, ∞) and a constant ρ ∈ (0, 1) such that for all x and all m ∈ ℕ,

‖K^m(x, ·) − Π(·)‖_TV ≤ M(x) ρ^m,    (6.1)

then the Markov chain is said to be geometrically ergodic. Further, the infimum of all ρ that satisfy (6.1) for some M is called the geometric rate of convergence, and is denoted by ρ*. GE is a most commonly required condition to establish the central limit theorem (CLT) for Markov chains (see, e.g., Jones et al. (2004)), and establishing the CLT is fundamental to reliable computing for Bayesian inference.

One way to prove the GE of a Markov chain is by constructing drift conditions and the associated minorization conditions, which we will refer to as the d&m method. We say a drift condition holds for the Markov transition function K if there exist a function V : x ↦ V(x) ∈ [0, ∞) and constants γ ∈ (0, 1) and L < ∞, such that for all x,

E[V(X_{m+1}) | X_m = x] ≤ γ V(x) + L.    (6.2)

Next, for any d > 0, let C_d = {x : V(x) ≤ d}. The minorization condition associated with the drift function V holds if there exist ε > 0 and a probability measure Q on the state space such that for all x ∈ C_d,

K(x, ·) ≥ ε Q(·).    (6.3)
Theorem 1 (Rosenthal (1995))

Suppose a Markov chain satisfies the drift condition (6.2) and the minorization condition (6.3) on . Let and . Then, for any ,

The above theorem concerning the d&m method provides an explicit upper bound for the geometric convergence rate ρ*. It can also be shown that, for suitable choices of the constants involved, this upper bound is strictly less than 1. The d&m method has been used to establish GE results for Bayesian shrinkage models under various priors, including Khare and Hobert (2013) for 3BG and Rajaratnam et al. (2018) for 2BG under the lasso prior, Pal and Khare (2014) for 3BG and Zhang et al. (2019) for 2BG under the Dirichlet-Laplace prior and the Normal-Gamma prior, as well as Vats (2017) for 3BG under the same three priors that this paper is concerned with.

Despite the GE results in Vats (2017) and the fact that each 2BG has an operator norm no bigger than that of its 3BG counterpart (Liu et al., 1994), there is no straightforward implication that each 2BG is also GE. This is mostly because 3BG's lead to non-reversible Markov chains, which correspond to non-self-adjoint operators, and it is not well understood how their operator norms relate to their convergence rates. Hence, there is a need to prove GE of each 2BG directly, which we accomplish using the d&m method in the next subsection. In addition, quantitative upper bounds will be provided for the convergence rates.

6.2 Geometric Ergodicity of 2BG

In this section, we establish GE of the 2BG Markov chains for the three Bayesian shrinkage models of interest, by deriving their respective drift and minorization conditions.

We first present two lemmas. The first is a well-known result in matrix algebra, and the second is specific to our regression setup with response y and design matrix X. The proof of the latter can be found in the supplementary material. Notation-wise, for two positive definite matrices A and B, we write A ⪰ B if A − B is nonnegative definite.

Lemma 2

Let A and B be two positive definite matrices. If A ⪰ B, then B⁻¹ ⪰ A⁻¹.

Lemma 3

Let be a diagonal matrix where . Let . Then

6.2.1 Geometric Ergodicity of the Bayesian Group Lasso

For the Markov transition density in (3.4), we define the drift function as

(6.4)

We obtain the following theorem by establishing (6.2) and (6.3) for the drift function defined in (6.4).

Theorem 2

Let , , and . Then, for any ,

(6.5)

for every , where , and .

Proof.

Let denote the initial value for the Markov chain of 2BG in (3.3), and let denote the first iteration of the chain. To establish the drift condition for , observe that