Abstract
In a modern observational study based on healthcare databases, both the number of observations n and the number of predictors p can be very large. Despite the large sample size, the data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression provides a potential solution to this problem. There is a rich literature on the desirable theoretical properties of Bayesian approaches based on shrinkage priors. On the other hand, the development of scalable methods for the required posterior computation has largely been limited to the n ≪ p case. Shrinkage priors make the posterior amenable to Gibbs sampling, but a major computational bottleneck still arises from the need to sample from a high-dimensional Gaussian distribution at each iteration. Despite a closed-form expression for the precision matrix Φ, computing and factorizing such a large matrix is computationally expensive nonetheless. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector b such that the solution to the linear system Φβ = b has the desired Gaussian distribution. We can then find an accurate solution to the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by Φ, without ever explicitly inverting Φ. Practical performance of CG, however, depends critically on appropriate preconditioning of the linear system; we turn CG into a highly effective algorithm for sparse Bayesian regression by developing a theory of prior-preconditioning. We apply our algorithm to a clinically relevant large-scale observational study designed to assess the relative risk of intracranial hemorrhage from two alternative blood anticoagulants. Our algorithm demonstrates an order of magnitude speed-up in the posterior computation.
Prior-preconditioned conjugate gradient for accelerated Gibbs sampling in "large n & large p" sparse Bayesian logistic regression models
Akihiko Nishimura and Marc A. Suchard
1 Introduction
Given an outcome of interest y and a large number of features x_1, …, x_p, the goal of sparse regression (or variable selection) is to find a small subset of these features that captures the principal relationship between the outcome and the features. Such a sparsity assumption is a mathematical necessity when the number of features p exceeds the sample size n; even when p ≤ n, however, the assumption often remains critical for the interpretability and stable estimation of regression coefficients. This is especially true when either

the design matrix X is sparse, i.e. only a small fraction of its entries are nonzero because the features are observed infrequently, or

the outcome is imbalanced, i.e. y_i = 0 (or y_i = 1) for most of the observations i = 1, …, n.
Either of these conditions reduces the amount of information the data provide on the regression coefficients. In a modern observational study based on large-scale healthcare databases, for example, the sparsity of X almost always holds true: a large number of potential pre-existing conditions and available treatments exist, yet only a small subset of these applies to each patient (Schuemie et al., 2018b). Imbalanced outcomes are also common, as many serious diseases of interest are rare in the population.
The particular application considered in this manuscript is a comparative study of two anticoagulants, dabigatran and warfarin, based on observational data from the Truven Health MarketScan Medicare Supplemental and Coordination of Benefits Database. These drugs reduce the risk of blood clot formation but can increase the risk of bleeding. The goal of the study is to quantify which of the two drugs carries a lower risk of intracranial hemorrhage (bleeding inside the skull) among treated patients. The data set contains a large number of both patients and predictors of potential relevance.
An increasingly common approach to sparse regression is the Bayesian method based on continuous shrinkage priors on the regression coefficients β = (β_1, …, β_p). These priors are often represented as a scale mixture of Gaussians
(1.1)  β_j | τ, λ_j ∼ N(0, τ²λ_j²)
with unknown global scale parameter τ and local scale parameters λ_j (Polson and Scott, 2010; Carvalho et al., 2010; Armagan et al., 2013; Polson et al., 2014; Bhattacharya et al., 2015; Bhadra et al., 2017). Compared to more traditional "spike-and-slab" discrete-mixture priors for sparse Bayesian regression, these continuous shrinkage priors are typically more computationally efficient while maintaining highly desirable statistical properties (Bhattacharya et al., 2015; Pal et al., 2014; Datta et al., 2013). Despite this relative computational advantage, however, posterior inference under these priors still encounters a major scalability issue. For instance, for our comparative study of the two anticoagulants, it takes over 50 hours on a 2015 iMac to run 1,000 iterations of the current state-of-the-art Gibbs sampler in our reasonably optimized Python implementation (Section 5).
In this article, we focus on accelerating the state-of-the-art Gibbs sampler for sparse Bayesian logistic regression, but our approach applies whenever the likelihood of the data or of a latent parameter can be expressed as a Gaussian mixture. The Polya-Gamma data augmentation scheme of Polson et al. (2013) makes a posterior under the logistic model amenable to Gibbs sampling as follows. Conditional on a Polya-Gamma auxiliary parameter ω_i, the likelihood contribution of a binary outcome y_i ∈ {0, 1} becomes
(1.2)  L(y_i | β, ω_i) ∝ exp( (y_i − 1/2) x_iᵀβ − ω_i (x_iᵀβ)² / 2 )
Correspondingly, the full conditional distribution of β is given by
(1.3)  β | τ, λ, ω, y ∼ N( Φ⁻¹Xᵀ(y − 1/2·1), Φ⁻¹ ),  Φ = XᵀΩX + τ⁻²Λ⁻²,
where Ω = diag(ω_1, …, ω_n) is a diagonal matrix with entries ω_i and Λ = diag(λ_1, …, λ_p). The main computational bottleneck of the Gibbs sampler is the need to sample from a high-dimensional Gaussian of the form (1.3). The standard algorithm requires O(np² + p³) operations: O(np²) for computing the term XᵀΩX and O(p³) for the Cholesky decomposition of Φ. These operations remain significant computational burdens even when sparsity in X substantially reduces the theoretically necessary number of arithmetic operations. This is because numerical computation on a modern computer architecture is memory-bound, so a reduction in the number of arithmetic operations does not directly translate into reduced computing time (Golub and Van Loan, 2012; Dongarra et al., 2016).
Recently, significant progress has been made in computational techniques for the n ≪ p case. Bhattacharya et al. (2016) propose an algorithm to sample from (1.3) with only O(n²p) operations, a cost that becomes negligible when n ≪ p. The algorithm requires computing an n × n matrix of the form XΛ²Xᵀ for O(n²p) operations, and then solving an n × n linear system for O(n³) operations. Johndrow et al. (2018) reduce the cost by replacing this matrix with an approximation that can be computed with fewer operations. These techniques offer no reduction in computational cost when n is large, however. Hahn et al. (2018) propose a sampling approach for linear regression based on an extensive preprocessing of the design matrix, a trick limited in scope strictly to Gaussian likelihood models. None of these advances addresses the "large n & large p" logistic regression problem considered in this article.
Proposed in this article is a novel algorithm to sample from a high-dimensional distribution of the form (1.3) through the conjugate gradient (CG) method, requiring only a small number of matrix-vector multiplications v → Φv. The vector Φv can be computed through the operations v → Xv and w → Xᵀw along with simple element-wise multiplications, without ever explicitly forming the matrix Φ. This is an important feature when dealing with a large and sparse design matrix X; the matrix XᵀΩX, and hence Φ, may contain a much larger proportion of nonzero entries than X does, making direct handling of Φ much more intensive in memory usage and in the number of arithmetic operations. Note, for example, that it requires 74.5 GB of memory to allocate a dense p × p matrix of double-precision numbers when p = 10⁵. The ability to automatically exploit sparsity structure in X is therefore another major advantage of our algorithm.
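The implicit matrix-vector product described above can be sketched as follows. This is a minimal illustration assuming a SciPy sparse design matrix; the function and variable names are ours, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp

def precision_matvec(v, X, omega, lam, tau):
    """Compute Phi @ v for Phi = X' diag(omega) X + tau^-2 diag(lam)^-2,
    using only the maps v -> X v and w -> X' w plus element-wise products;
    the p x p matrix Phi is never formed."""
    Xv = X @ v                          # costs O(nnz(X)) for a sparse X
    likelihood_part = X.T @ (omega * Xv)
    prior_part = v / (tau * lam) ** 2   # tau^-2 Lambda^-2 v
    return likelihood_part + prior_part
```

For a sparse X, each call costs two sparse matrix-vector products and a few element-wise products, regardless of how dense XᵀΩX would be if formed explicitly.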
Also developed in this article is a theory of an effective preconditioning technique in the context of sparse Bayesian regression. Preconditioning relates a given problem to a modified one to accelerate CG and is critical in achieving the full potential of the algorithm. While a variety of general-purpose preconditioners exist, the design of an effective preconditioner remains problem specific. Exploiting the fact that the shrinkage prior dominates the likelihood for all but a small fraction of the regression coefficients, we develop what we term the prior-preconditioning approach and demonstrate its superiority over general-purpose preconditioners in sparse Bayesian regression applications. It is worth noting that our contribution here is completely distinct from that of Cockayne et al. (2018), who propose an extension of CG that outputs a probability measure. We instead employ the classical CG method in a novel context, demonstrating its ability to significantly accelerate Monte Carlo simulations when applied in the right way.
The rest of the paper is organized as follows. Section 2 begins by describing how to recast the problem of sampling from the distribution (1.3) as that of solving a linear system Φβ = b, eliminating the need to factorize Φ. The rest of the section explains how to apply CG to rapidly solve the linear system while developing the theoretical foundations behind our prior-preconditioner. In Section 3, using simulated data, we demonstrate the effectiveness of the CG-based sampler for sparse regression. Also demonstrated is how the behavior of CG depends on different preconditioning strategies. Section 4 describes practical details needed to successfully apply the CG-based sampler to sparse regression problems. Finally, in Section 5, we apply our algorithm to carry out the dabigatran vs. warfarin comparison, demonstrating an order of magnitude speed-up in the posterior computation. Our CG-accelerated Gibbs sampler for sparse Bayesian logistic regression is implemented as the bayesbridge package available from the Python Package Index (pypi.org). The source code is available at a GitHub repository https://github.com/akinishimura/bayesbridge.
2 Conjugate gradient sampler for sparse regression
2.1 Generating a Gaussian as the solution of a linear system
The standard algorithm for sampling a multivariate Gaussian requires the Cholesky factorization of its precision (or covariance) matrix (Ripley, 1987). When the precision matrix has the specific structure seen in (1.3), however, it turns out that the problem of sampling from the distribution (1.3) can be recast as that of solving a linear system. This in particular obviates the need to factorize the matrix Φ. The key observation is that we can generate a Gaussian vector b with mean Xᵀ(y − 1/2·1) and variance Φ at a computational cost negligible compared to an explicit formation of Φ.
Proposition 2.1
The following procedure generates a sample β from the distribution (1.3):

Generate b by sampling independent Gaussian vectors η ∼ N(0, I_n) and δ ∼ N(0, I_p) and then setting
(2.4)  b = Xᵀ(y − 1/2·1) + XᵀΩ^{1/2}η + τ⁻¹Λ⁻¹δ.
Solve the following linear system for β:
(2.5)  Φβ = b,  where Φ = XᵀΩX + τ⁻²Λ⁻².
The result follows immediately from basic properties of multivariate Gaussians; in particular, the solution β to (2.5) has the required covariance structure because Var(b) = XᵀΩX + τ⁻²Λ⁻² = Φ, so that Var(β) = Φ⁻¹ΦΦ⁻¹ = Φ⁻¹.
The above algorithm complements that of Bhattacharya et al. (2016); their algorithm reduces the task of sampling a multivariate Gaussian to solving an n × n linear system, while ours reduces it to solving a p × p system. While the structures of the two linear systems differ, the techniques developed in this manuscript may, with appropriate modifications, be applicable to the linear system arising in Bhattacharya et al. (2016).
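In code, Proposition 2.1 reads as below. This is a dense NumPy sketch with a direct solve standing in for CG, and the variable names are ours.

```python
import numpy as np

def sample_conditional_beta(X, y, omega, lam, tau, rng):
    """Draw beta ~ N(Phi^-1 X'(y - 1/2), Phi^-1), Phi = X' Omega X + tau^-2 Lambda^-2,
    by generating b with mean X'(y - 1/2) and variance Phi and solving Phi beta = b."""
    n, p = X.shape
    eta = rng.standard_normal(n)    # eta ~ N(0, I_n)
    delta = rng.standard_normal(p)  # delta ~ N(0, I_p)
    b = X.T @ (y - 0.5) + X.T @ (np.sqrt(omega) * eta) + delta / (tau * lam)
    Phi = X.T @ (omega[:, None] * X) + np.diag(1.0 / (tau * lam) ** 2)
    return np.linalg.solve(Phi, b)  # in the paper's algorithm, CG replaces this solve
```

Since Var(b) = XᵀΩX + τ⁻²Λ⁻² = Φ, the solution β = Φ⁻¹b has variance Φ⁻¹ΦΦ⁻¹ = Φ⁻¹, as required.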
2.2 Iterative method for solving a linear system
The utility of Proposition 2.1 stems from the fact that solving the linear system (2.5) can be significantly faster than the standard algorithm for sampling a Gaussian vector. We achieve this speed-up by applying the CG method (Hestenes and Stiefel, 1952; Lanczos, 1952). CG belongs to a family of iterative methods for solving a linear system. Compared to traditional direct methods, iterative methods are more memory-efficient and, if the matrix Φ has certain structures (Section 2.3), can be significantly faster.
Iterative methods have found applications in Gaussian process models, where optimizing the hyperparameters of covariance functions requires solving linear systems involving large covariance matrices (Gibbs and MacKay, 1996). Significant research has gone into how best to apply iterative methods in this specific context; see Stein et al. (2012), Sun and Stein (2016), and Stroud et al. (2017) for example. Outside the Gaussian process literature, Zhou and Guan (2017) use an iterative method to address the bottleneck of having to solve large linear systems when computing Bayes factors in a model selection problem.
The use of CG as a computational tool for Monte Carlo simulation is a novel feature of our work that has not been considered by any of the above works. While we were preparing our manuscript, we were also informed of the work by Zhang et al. (2018), which uses the same idea as in Proposition 2.1. They apply the CG method, apparently without any preconditioners, to generate a posterior sample from a Gaussian process model. Our work is distinguished by a careful development, supported by both theoretical analysis and systematic empirical evaluations, of preconditioning techniques (Section 2.3) tailored toward sparse Bayesian regression problems. The theoretical foundations laid out here also provide practical guidelines on how one may apply CG in other Bayesian computation problems.
The CG method solves a linear system Φx = b involving a positive definite matrix Φ as follows. Given an initial guess x_0, which may be taken as x_0 = 0 for example, CG generates a sequence x_1, x_2, … of increasingly accurate approximations to the solution. The convergence of the CG iterates x_k is intimately tied to the Krylov subspace
(2.6)  K_k(Φ, r_0) = span{ r_0, Φr_0, …, Φ^{k−1}r_0 }
generated from the initial residual r_0 = b − Φx_0. With x_0 + K_k(Φ, r_0) denoting the affine space { x_0 + v : v ∈ K_k(Φ, r_0) }, the approximate solution x_k satisfies the following optimality property in terms of a weighted norm:
(2.7)  x_k = argmin { ‖x − Φ⁻¹b‖_Φ : x ∈ x_0 + K_k(Φ, r_0) },  where ‖v‖_Φ = (vᵀΦv)^{1/2}.
The norm ‖·‖_Φ is often referred to as the Φ-norm. The optimality property (2.7) in particular implies that CG yields the exact solution after at most p iterations. The main computational cost of each update is the matrix-vector operation v → Φv. Therefore, the number of arithmetic operations required to run p iterations of the CG update is comparable to that of a direct method based on the Cholesky factorization of Φ. Through an effective preconditioning strategy described in the next section, however, we can induce rapid convergence of CG for a typical precision matrix arising from the conditional distribution (1.3). In our numerical results, we indeed find that the distribution of the CG iterate x_k even for k ≪ p is indistinguishable from (1.3) for all practical purposes (Section 5.6).
It is worth emphasizing that our CG sampler does not require explicitly forming the precision matrix Φ of the form (1.3), because the vector Φv can be computed via the operations v → Xv and w → Xᵀw along with simple element-wise multiplications. This is critical for our "large n and large p" applications for a couple of reasons. First, computing the term XᵀΩX in Φ often requires more computational effort than the subsequent factorization of Φ when n > p. Secondly, as mentioned earlier, for a large sparse design matrix X, storing an explicitly computed Φ can significantly increase the memory burden compared to storing X itself.
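For concreteness, the CG iteration itself can be written against an abstract matvec, so that Φ enters only through the map v → Φv; the following is the textbook algorithm, not the paper's tuned implementation.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, maxiter=None):
    """Solve A x = b for a positive definite A available only through v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)   # residual; equals b since x = 0
    d = r.copy()        # initial search direction
    rs = r @ r
    for _ in range(maxiter if maxiter is not None else len(b)):
        Ad = matvec(d)
        alpha = rs / (d @ Ad)       # step size minimizing the A-norm error along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        d = r + (rs_new / rs) * d   # next A-conjugate search direction
        rs = rs_new
    return x
```

Only one matvec is performed per iteration, which is what makes the overall cost k × (cost of v → Φv) for k iterations.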
2.3 Convergence behavior of the CG method
When applied directly to a given linear system, CG often displays slow convergence. Section 2.4 covers how to induce more rapid CG convergence for the system (2.5) arising from sparse Bayesian regression. In preparation, here we describe how the convergence behavior of CG is related to the structure of the positive definite matrix Φ.
The convergence behavior of CG can be partially explained by the following well-known error bound, given in terms of the condition number κ(Φ), the ratio of the largest to the smallest eigenvalue of Φ.
Theorem 2.2
Given a positive definite system Φx = b and a starting vector x_0, the k-th CG iterate x_k satisfies the following bound on its Φ-norm distance to the solution x* = Φ⁻¹b:
(2.8)  ‖x_k − x*‖_Φ ≤ 2 ( (√κ(Φ) − 1) / (√κ(Φ) + 1) )^k ‖x_0 − x*‖_Φ.
See Trefethen and Bau (1997) for a proof. Theorem 2.2 guarantees fast convergence of the CG iterates when the condition number is small. On the other hand, a large condition number does not always prevent rapid convergence of CG, because CG can also converge quickly when the eigenvalues of Φ are "clustered." The following theorem quantifies this phenomenon, albeit in an idealized situation in which Φ has only a few distinct eigenvalues.
Theorem 2.3
If the positive definite matrix Φ has only r distinct eigenvalues, then CG yields the exact solution within r iterations. In particular, the result holds if Φ is a rank-(r − 1) perturbation of the identity, i.e. Φ = I + UUᵀ for a matrix U with r − 1 columns.
See Golub and Van Loan (2012) for a proof.
Theorems 2.2 and 2.3 are arguably the two most famous results on the convergence properties of CG, perhaps because their conclusions are clear-cut and easy to understand. However, each theorem captures only one aspect of CG's convergence behavior and falls short of describing typical behavior in practical applications: the assumption behind Theorem 2.3 is unrealistic, while the bound of Theorem 2.2 is too pessimistic. To better capture the behavior of CG in actual applications, we bring together the most useful of the known results scattered around the numerical linear algebra literature and summarize them as the following rule of thumb. We make all the statements contained in the rule of thumb mathematically precise in Appendix B. As we will see, the rule of thumb points us toward the design of an effective preconditioning strategy in the context of sparse Bayesian regression.
Rule of Thumb 2.4
Suppose that the eigenvalues of Φ are clustered in an interval [a, b], except for a small fraction of outlying eigenvalues. Then CG "knocks off" the outlying eigenvalues exponentially quickly, and its convergence subsequently accelerates as if the effective condition number of Φ were b/a rather than the κ(Φ) appearing in (2.8). The largest outlying eigenvalues are effectively removed within a small number of iterations, while the same number of smallest eigenvalues tends to delay the convergence longer.
2.4 Preconditioning the CG method for rapid solutions of (2.5)
A preconditioner is a positive definite matrix M such that the preconditioned system
(2.9)  (M^{-1/2} Φ M^{-1/2}) x̃ = M^{-1/2} b,  with x = M^{-1/2} x̃,
leads to faster convergence of the CG iterations. In practice, we only need the action of M⁻¹ and not that of M^{-1/2}, since the algorithm can be implemented so that only the operation v → M⁻¹v is required to solve the preconditioned system (2.9) via CG (Golub and Van Loan, 2012). This preconditioned CG algorithm still returns a solution in terms of the original system. While a wide range of general techniques has been proposed, finding a good preconditioner for a given linear system remains "a combination of art and science" (Saad, 2003).
In light of Rule of Thumb 2.4, an effective preconditioner should modify the eigenvalue structure of Φ so that the preconditioned matrix has more tightly clustered eigenvalues, except perhaps for a small number of outlying eigenvalues. Larger outlying eigenvalues are preferable to smaller ones, as the smaller ones cause more significant, if not severe, delays in the convergence of CG. In addition to the convergence rate, the choice of a preconditioner must take into account the one-time cost of computing the preconditioner itself, as well as the cost of the v → M⁻¹v operation during each CG iteration.
In the context of sparse Bayesian regression, the linear system (2.5) turns out to admit a deceptively simple yet highly effective preconditioner, obviating the need for more complex and computationally expensive preconditioning strategies. In fact, the choice
(2.10)  M = τ⁻²Λ⁻²
yields a modified system (2.9) with an eigenvalue structure ideally suited to the CG application. With a slight abuse of terminology, we call this matrix the prior preconditioner, since it corresponds to the precision of β under the prior, before observing the data. Note that this choice is different from the widely used Jacobi preconditioner based on the diagonal elements of Φ; despite being a reasonable choice, the Jacobi preconditioner is substantially inferior to the prior preconditioner in typical applications (Sections 3.3 and 5.4).
Our prior preconditioner can be motivated as follows. A shrinkage prior is employed either when 1) we want to impose sparsity through an informative prior on the global shrinkage parameter τ, or 2) we believe that the data support sparser models, say, in terms of the marginal likelihood. In either case, we expect posterior draws of β to satisfy β_j ≈ 0 for all but a relatively small subset of the indices j. (More precisely, by β_j ≈ 0 we mean that the contribution of the term x_ij β_j is small in predicting the outcome.) This observation leads to the following simple heuristic. Note that the prior-preconditioned matrix is given by
(2.11)  M^{-1/2} Φ M^{-1/2} = I + (τΛ) XᵀΩX (τΛ)
and the matrix (τΛ)XᵀΩX(τΛ) has (j, k)-th entry
(2.12)  τ²λ_j λ_k Σ_i ω_i x_ij x_ik,
which is small when either τλ_j ≈ 0 or τλ_k ≈ 0. In other words, the entries of (τΛ)XᵀΩX(τΛ) are small away from the block corresponding to the indices j with non-negligible τλ_j. In general, smaller entries of a matrix contribute less to the eigenvalue structure of the entire matrix (Golub and Van Loan, 2012). This means that the prior-preconditioned matrix (2.11) can be thought of as a perturbation of the identity by a matrix with an approximate low-rank structure. As such, it can be expected to have eigenvalues clustered around 1, except for a relatively small number of larger ones.
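This clustering heuristic is easy to check numerically. The following sketch, with arbitrary synthetic numbers of our own choosing, builds a prior-preconditioned matrix of the form I + (τΛ)XᵀΩX(τΛ) for a case where only a few of the τλ_j are away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 100, 5            # k coordinates escape shrinkage
X = rng.standard_normal((n, p))
omega = rng.uniform(0.2, 0.5, size=n)
tau = 1.0
lam = np.full(p, 1e-3)           # heavily shrunk: tau * lam_j nearly 0
lam[:k] = 1.0                    # a few "signal" coordinates

# Prior-preconditioned matrix I + (tau Lambda) X' Omega X (tau Lambda)
scale = tau * lam
S = scale[:, None] * (X.T @ (omega[:, None] * X)) * scale[None, :]
eigs = np.sort(np.linalg.eigvalsh(np.eye(p) + S))

# All but ~k eigenvalues sit in a tight cluster at 1; the outliers are large
print("eigenvalues > 1.1:", np.sum(eigs > 1.1))
```

With these numbers, the cluster at 1 is extremely tight and exactly the k "signal" coordinates produce large outlying eigenvalues, the favorable configuration per Rule of Thumb 2.4.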
The above heuristic on the approximate low-rank structure is formalized and quantified in Theorem 2.5. It is also worth noting, however, that it would be too naive to conclude from this heuristic that a good approximation to Φ can be obtained by simply zeroing out the τλ_j's below some threshold. Such a thresholding approach is successfully employed by Johndrow et al. (2018) for a related but different problem. In our context, however, numerical results clearly show that such an approximation can be of poor quality; see the supplement Section S4.
Theorem 2.5
Let τλ_{(j)} denote the j-th largest element of {τλ_1, …, τλ_p}. The eigenvalues of the prior-preconditioned matrix (2.11) then satisfy
(2.13) 
for each j. In fact, the following more general bounds hold. Let A_J denote the submatrix of a given matrix A corresponding to the row and column indices in an index set J. With this notation, we have
(2.14) 
for any suitable index set J.
Theorem 2.5 guarantees tight clustering of the eigenvalues of the prior-preconditioned matrix, and hence rapid convergence of CG, when most of the τλ_j's are close to zero. Through its dependence on the eigenvalues of the submatrices, the bound (2.14) further shows that rapid CG convergence is also expected when the eigenvalues of XᵀΩX decay quickly, which tends to happen when there is high collinearity among the predictors.
Theorem 2.5 can also be used to directly relate the CG approximation error under the prior preconditioner to the decay rate of the ordered values τλ_{(j)}:
Theorem 2.6
The prior-preconditioned CG applied to (2.5) yields iterates x_k satisfying the following bound for any k:
(2.15)  
where 
To illustrate the implication of Theorem 2.6 in concrete terms, suppose that the ordered values τλ_{(j)} of a posterior draw decay at a polynomial rate. The bound (2.15) then implies
(2.16)
so that, after a given number of iterations, the CG approximation error in the Φ-norm is guaranteed to be reduced by the corresponding factor relative to the initial error.
We have introduced the prior-preconditioning strategy ultimately for the purpose of efficient posterior computation under sparse regression models. All the results so far, however, have been formulated in linear algebraic language. To provide a useful guideline on the performance of the CG-accelerated Gibbs sampler in practical applications, we now summarize the above discussion in more statistical language.
Rule of Thumb 2.7
The prior preconditioner (2.10) induces rapid convergence of CG applied to the linear system (2.5) when the posterior of β concentrates on sparse vectors. (With continuous shrinkage priors, by sparsity we mean that most of the β_j's are virtually zero in magnitude.) As the sparsity of β increases, so does the average convergence rate of the CG sampler.
The statements above are borne out by the illustrative example of Section 3 using synthetic sparse regression posteriors. Also, as we have seen, the statements can be made more precise in terms of the decay rate of the order statistics of a posterior sample (Rule of Thumb 2.4, Theorem 2.5, and Theorem 2.6).
2.5 General principle behind the prior-preconditioning approach
We close our discussion of preconditioning techniques by providing an alternative heuristic behind the prior preconditioner. While not as quantitative as Theorem 2.5, the following general principle suggests that prior-preconditioning is effective beyond sparse regression settings, for any Bayesian computation involving conditionally Gaussian distributions. In the context of the CG sampler, the preconditioned matrix M^{-1/2}ΦM^{-1/2} represents the precision matrix of the transformed parameter M^{1/2}β. In fact, preconditioning the linear system (2.5) with a preconditioner M is equivalent to applying the parameter transformation β → M^{1/2}β before employing the CG sampler. That is, we can apply either of the two strategies, precondition the linear system or apply the parameter transformation, to achieve exactly the same effect on the speed of the CG sampler.
When we choose the prior precision as the preconditioner, the transformed parameter a priori has the identity precision matrix, before its distribution is modified via the likelihood. This perspective, combined with the fact that the eigenvalues of the preconditioned matrix represent the posterior precisions of the transformed parameter along its principal components, suggests the following principle:
Principle Behind Prior-Preconditioning 2.8
Under a strongly informative prior, the posterior looks like the prior except in a small number of directions along which the data provide significant information. This translates into the eigenvalues of the prior-preconditioned matrix clustering around 1, except for a relatively small number of large eigenvalues.
3 Demonstration of the CG sampler performance on simulated data
In addition to the prior-preconditioning strategy introduced in Section 2.4, a few more important details remain for successfully applying the CG sampler to sparse regression problems in practice. Preconditioning is undoubtedly the most essential ingredient of the CG sampler, however, and we defer the other practical details to Section 4. Instead, we now turn to demonstrating the performance of the CG sampler applied to distributions of the form (1.3) as they arise from actual posterior distributions of sparse Bayesian logistic regression models. We illustrate how the CG convergence rates are affected by different preconditioning strategies and by the corresponding eigenvalue distributions of the preconditioned matrices. We also use simulated data with varying numbers of nonzero coefficients to confirm how sparsity in the regression coefficients translates into faster CG convergence, as predicted by Theorem 2.5 and Rule of Thumb 2.7.
3.1 Choice of shrinkage prior: Bayesian bridge
While a variety of global-local shrinkage priors of the form (1.1) are available, we adopt the Bayesian bridge prior of Polson et al. (2014) in our implementation of the CG-accelerated Gibbs sampler, for the following reasons. First, the Gibbs sampler under the Bayesian bridge tends to demonstrate better mixing in the global shrinkage parameter τ. This is because the Bayesian bridge formulation allows the update of τ from its conditional distribution with the local scales marginalized out (Polson et al., 2014). Secondly, many of the alternative shrinkage priors have extremely heavy tails that can be problematic in the logistic regression context. Under such heavy-tailed priors, the posterior itself ends up with heavy tails when the parameters are only weakly identified by the data, causing issues both in posterior computation and in inference (Ghosh et al., 2018; Piironen and Vehtari, 2017).
Under the Bayesian bridge, the local shrinkage parameters λ_j are given independent priors under which λ_j² follows a positive alpha-stable distribution with index of stability α/2, for 0 < α ≤ 1. This choice induces a prior on β_j, when λ_j is marginalized out, such that
(3.17)  π(β_j | τ) ∝ exp( −|β_j / τ|^α ).
The case α = 1 coincides with the Bayesian lasso. The distribution becomes "spikier" as α → 0, placing greater mass around β_j = 0 while at the same time having heavier tails. While an alpha-stable distribution has no closed-form density, there are algorithms available to efficiently sample from the posterior distribution of λ_j (Polson et al., 2014). In typical applications, the marginal likelihood of the data favors certain values of α but identifies them only very weakly (Polson et al., 2014), so in our implementation we simply fix α rather than sampling it.
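As a small illustration of how the exponent α controls the tails, a density of the form exp(−|x/τ|^α) can be normalized in closed form, since its integral equals 2τΓ(1 + 1/α); the numbers below are ours, not from the paper.

```python
import numpy as np
from math import gamma
from scipy.integrate import quad

def bridge_density(x, tau=1.0, alpha=0.5):
    """Marginal bridge-type prior on beta_j, proportional to exp(-|x/tau|^alpha)."""
    return np.exp(-np.abs(x / tau) ** alpha) / (2.0 * tau * gamma(1.0 + 1.0 / alpha))

# Smaller alpha yields heavier tails; alpha = 1 recovers the double-exponential
# (Bayesian lasso) prior.
for alpha in (1.0, 0.5):
    tail, _ = quad(lambda t: bridge_density(t, alpha=alpha), 5.0, np.inf)
    print(f"alpha = {alpha}: P(|beta_j| > 5) = {2.0 * tail:.4f}")
```

The tail probability grows markedly as α decreases, which is the heavy-tail behavior referenced above.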
3.2 Experimental setup
We generate synthetic data of sample size n with p predictors. To generate a design matrix with correlation among the predictors, we emulate a model from factor analysis (Jolliffe, 2002). We first sample a set of orthonormal vectors u_1, …, u_p uniformly from the Stiefel manifold, comprised of sets of orthonormal vectors. We then set the predictor x_i for the i-th observation to be
(3.18)
This is equivalent to independently sampling x_i ∼ N(0, UDUᵀ), where D is a diagonal matrix of factor variances and U = [u_1, …, u_p] is an orthonormal matrix sampled uniformly from the space of orthonormal matrices. We then center and standardize the predictors, as is commonly done in sparse regression models (Hastie et al., 2009).
The above process yields a design matrix with moderate correlations among the predictors: the distribution of pairwise correlations is approximately Gaussian, centered around 0 with a standard deviation of 0.13. Based on this design matrix X, we simulate three different binary outcome vectors by varying the number of nonzero regression coefficients. More specifically, we consider sparse regression coefficient vectors β whose first k coefficients share a common nonzero value, with k varied across the three scenarios, and whose remaining coefficients are zero. In all three scenarios, the binary outcomes y_i are generated from the logistic model P(y_i = 1 | x_i, β) = 1 / (1 + exp(−x_iᵀβ)) for i = 1, …, n.
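A data-generating sketch in the spirit of this setup follows; the paper's exact factor weights, coefficient magnitudes, and dimensions are not reproduced here, so all numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_signal = 1000, 200, 10   # illustrative sizes

# Factor-structured covariance U D U' with a random orthogonal U and decaying D
U = np.linalg.qr(rng.standard_normal((p, p)))[0]
d = 1.0 / np.arange(1, p + 1) ** 0.5      # illustrative decay of factor variances
A = np.sqrt(d)[:, None] * U.T             # X = Z A gives Cov(x_i) = A'A = U D U'
X = rng.standard_normal((n, p)) @ A
X = (X - X.mean(axis=0)) / X.std(axis=0)  # center and standardize the predictors

# Sparse truth and logistic outcomes
beta = np.zeros(p)
beta[:n_signal] = 1.0                     # illustrative signal magnitude
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)
```

The factor structure induces correlation among the predictors while keeping the covariance well-defined and easy to sample from.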
For each of the simulated data sets, we obtain a posterior sample by running the current state-of-the-art Polya-Gamma augmented Gibbs sampler, using direct linear algebra to sample β from its conditional distribution (1.3). We confirm the convergence of the Markov chain through the traceplot of the posterior log-density of β (with the remaining parameters marginalized out). We can consider this state-of-the-art Gibbs sampler "exact" in the sense that direct linear algebra has none of the potential convergence issues of the CG method. (It is, however, still affected by errors from finite-precision arithmetic, as always.)
Having obtained a posterior sample, we sample a vector b as in (2.4) and apply CG to the linear system (2.5). We then compare the CG iterates x_k to the exact solution x* obtained by solving the same system with the direct method based on the Cholesky factorization. We repeat this process for 8 random replications of the right-hand vector b of the linear system (2.5).
3.3 Results: convergence rates, eigenvalues, and sparsity of coefficients
Figure 3.1 shows the CG approximation error as a function of the number of CG iterations, whose cost, as discussed in Section 2.2, is dominated by matrix-vector operations of the form v → Φv. Here we characterize the error as the average relative error across all the coefficients. For each line on the plot, this error is averaged on the log scale over the 8 random replications of the right-hand vector b. Plots in the supplement Section S1 show, however, that the CG convergence behavior remains qualitatively similar regardless of the choice of error metric, and varies little across the different right-hand vectors. In particular, the superior convergence rate of the prior-preconditioning over the Jacobi one holds consistently.
First, we focus on the results for the prior-preconditioned CG, indicated by the lines with circles. After a modest number of matrix-vector operations, the distance between x_k and x* is already orders of magnitude smaller than typical Monte Carlo error. In fact, with a relatively small number of additional CG iterations, the distance reaches the level of machine precision, manifested as the "plateaus" of the approximation error seen in the blue dash-dot and orange dashed lines with circles.
Along with the approximation errors based on the prior preconditioner, we also compute, as a benchmark, the errors based on the Jacobi preconditioner. The Jacobi preconditioner is the simplest general-purpose preconditioner and usually performs well for linear systems like ours, in which the diagonal entries of the matrix are significantly larger than the off-diagonal ones (Golub and Van Loan, 2012). In the context of the CG sampler, Jacobi preconditioning coincides with the use of the conditional precisions as a preconditioner. While more complex general-purpose preconditioners exist, they require substantially more computational effort to set up before the CG iterations can even get started (Golub and Van Loan, 2012). We therefore do not consider those preconditioners here.
It is evident from Figure 3.1 that the prior preconditioner induces more rapid CG convergence than the Jacobi preconditioner for the purpose of our CG-accelerated Gibbs sampler. The difference in convergence speed is more pronounced with sparser true regression coefficients. Studying the eigenvalue distributions of the respective preconditioned matrices provides further insight into the observed convergence behaviors. Figure 3.2 (a) & (b) show the eigenvalue distributions of the preconditioned matrices based on a posterior sample from the synthetic data with 10 nonzero coefficients. Trimmed versions of the histograms are also given to highlight the tails of the distributions. The prior preconditioner induces a distribution with a tight cluster around 1 (or 0 on the log scale) and a relatively small number of large eigenvalues, confirming the theory we developed in Section 2.4. On the other hand, the Jacobi preconditioner induces a more spread-out distribution, problematically introducing quite a few small eigenvalues that delay the CG convergence (Rule of Thumb 2.4).
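The clustering phenomenon is easy to reproduce on a toy problem. The sketch below (our own construction, with invented sizes) forms Φ = X'ΩX + D for a sparse-signal D and computes the spectra of the prior-preconditioned matrix D^(-1/2) Φ D^(-1/2) and the Jacobi-preconditioned matrix diag(Φ)^(-1/2) Φ diag(Φ)^(-1/2); a histogram of each spectrum reproduces the qualitative shape of Figure 3.2.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 200
X = rng.standard_normal((n, p))
omega = rng.gamma(2.0, 1.0, size=n)
d = np.full(p, 1e4)     # prior precisions: most coefficients strongly shrunk
d[:5] = 1e-2            # a few weakly shrunk "signal" coefficients

Phi = X.T @ (omega[:, None] * X) + np.diag(d)

def precond_spectrum(Phi, m):
    """Eigenvalues of M^{-1/2} Phi M^{-1/2} for a diagonal preconditioner m."""
    s = 1.0 / np.sqrt(m)
    return np.linalg.eigvalsh(s[:, None] * Phi * s[None, :])

eig_prior = precond_spectrum(Phi, d)             # prior preconditioner
eig_jacobi = precond_spectrum(Phi, np.diag(Phi)) # Jacobi preconditioner
```

In this configuration the prior-preconditioned spectrum has a tight cluster near 1 with exactly one large outlier per signal coordinate, matching the behavior described above.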
Finally, we turn our attention to the relationship, seen in Figure 3.1, between the CG convergence rate and the sparsity of the underlying true regression coefficients. The CG convergence is clearly quicker when the true regression coefficients are sparser. To understand this relationship, it is informative to look at the values of the local shrinkage scales drawn from the respective posterior distributions. Figure 3.3 plots these values for the first 250 coefficients. We use two different scales for the two cases, shown on the left and right respectively, to facilitate the qualitative comparison. As expected, the posterior sample from the synthetic data with a larger number of signals has a larger number of shrinkage scales away from zero. These relatively large values contribute to the delayed convergence of CG, as quantified in Theorem 2.5 and Rule of Thumb 2.4.
A more significant cause of the delay, however, is the fact that the shrinkage prior yields weaker shrinkage on the zero coefficients when there are a larger number of signals. On close inspection of Figure 3.3, one can see that the shrinkage scales corresponding to the true signals are not as well separated from the rest in the denser case. In fact, the histograms on the left of Figure 3.4 show that the distribution of the shrinkage scales in the denser case is shifted toward larger values compared to the sparser one. This is mostly due to the posterior distribution of the global shrinkage parameter concentrating around a larger value — its posterior sample is larger in the denser case. It is also worth taking a closer look at the tail of the distribution. The histograms on the right of Figure 3.4 show the distribution of the 250 largest shrinkage scales after rescaling. The figure makes it clear that the scales corresponding to the true signals are much better separated from the rest in the sparser case. Overall, the slower decay in the largest shrinkage scales in the denser case results in the eigenvalues of the prior-preconditioned matrix having a less tight cluster around 1; compare the eigenvalue distributions of Figure 3.2(a) & (b) to those of Figure 3.2(c) & (d).
4 Practical CG sampler details for sparse regression
4.1 Initial vector for CG iterations
As suggested by Theorem 2.2, and by the more thorough analysis of the CG convergence behavior in Appendix B, the CG iterations, generally speaking, decrease the distance between the iterates and the exact solution relative to the initial error. Hence there is a benefit to choosing an initial vector with a small initial error. It should be noted, however, that the choice of preconditioner determines the exponential convergence rate of CG and, comparatively, the choice of initial vector often has a smaller effect on the approximation error. In other words, once the initial vector is chosen within a reasonable range, we should not expect a dramatic gain from further fine-tuning. In fact, when sampling from a sparse regression posterior, we find it difficult to improve much over the simple initialization at zero, which is a reasonable choice as most coefficients are shrunk to near zero. Only small though consistent improvements were observed with one of the other approaches with which we experimented — see the supplement Section S2.1.
4.2 Termination criterion for the CG method
An iterative method must be supplied with a termination criterion to decide when the current iterate is close enough to the exact solution. The CG termination criterion used in most existing linear algebra libraries is based on the norm of the residual. This is mostly for reasons of convenience, as the residual norm is easily computed as a byproduct of the CG iterations. At least, the residual norm can be related to the approximation error as
(4.19) 
For the purpose of sampling a Gaussian vector, however, it is not at all clear when the residual norm can be considered small enough. To address this problem, we develop an alternative metric tailored to our CG sampler for sparse regression.
For our CG sampler, we propose to assess the CG convergence in terms of the norm of the prior-preconditioned residual, obtained by elementwise multiplication of the residual by the prior scales. More specifically, we use the termination criterion
(4.20) 
in terms of the root-mean-squared residual.
In the supplement Section S2.2, we explain the utility of this norm as an approximate upper bound to the following quantity:
(4.21) 
The standardization by the second moment ensures that, when the computed error is small, all the coordinates of the CG iterate are close to those of the exact solution relative to either the means or the variances of the target Gaussian distribution. In all our examples, we find this metric to work very well in quantifying the numerical error for the purpose of CG-accelerated sampling; see Sections 5.4 and 5.6 for illustrations.
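As a concrete sketch of the stopping rule (our own, with an arbitrary tolerance and a synthetic system), one can monitor the root-mean-squared prior-preconditioned residual — the residual rescaled elementwise by stand-in prior standard deviations — inside a plain CG loop:

```python
import numpy as np

rng = np.random.default_rng(11)
p = 30
A = rng.standard_normal((p, p))
Phi = A @ A.T + np.eye(p)               # a synthetic positive definite system
prior_sd = rng.gamma(2.0, 1.0, size=p)  # stand-in for the prior scales
b = rng.standard_normal(p)

def cg_with_prior_residual_stop(Phi, b, prior_sd, delta=1e-8, maxiter=500):
    """Plain CG that stops once the RMS prior-preconditioned residual < delta."""
    x = np.zeros_like(b)
    r = b - Phi @ x
    p_vec = r.copy()
    rr = r @ r
    for it in range(1, maxiter + 1):
        Ap = Phi @ p_vec
        alpha = rr / (p_vec @ Ap)
        x += alpha * p_vec
        r -= alpha * Ap
        rms = np.sqrt(np.mean((prior_sd * r) ** 2))   # the proposed metric
        if rms < delta:
            return x, it, rms
        rr_new = r @ r
        p_vec = r + (rr_new / rr) * p_vec
        rr = rr_new
    return x, maxiter, rms

x, n_iter, rms = cg_with_prior_residual_stop(Phi, b, prior_sd)
```

The tolerance `delta` here is illustrative; the paper's criterion (4.20) supplies the principled choice.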
4.3 Incorporating intercept and predictors with uninformative prior
When fitting a sparse regression model, standard practice is to include an intercept without any shrinkage, often with an improper flat prior (Park and Casella, 2008). Additionally, there may be predictors of particular interest whose regression coefficients are more appropriately inferred under uninformative or weakly informative priors without shrinkage; see Zucknick et al. (2015), as well as the application in Section 5, for examples of such predictors. Our CG-accelerated Gibbs sampler can accommodate such predictors with an appropriate modification to the prior preconditioner proposed above.
For notational convenience, suppose that the regression coefficients are indexed so that the coefficients to be estimated without shrinkage come first. We further assume that the unshrunk coefficients are given independent Gaussian priors, where an infinite prior variance denotes an improper flat prior. The precision matrix of the conditional distribution is then given by
(4.22) 
where we employ the convention that an infinite prior variance corresponds to zero prior precision. From the computational point of view, the unshrunk coefficients are distinguished from the shrunk ones by the fact that their prior scales (before conditioning on the data) typically have little to do with their posterior scales (after conditioning on the data). For this reason, a naively modified preconditioner may not be appropriate, especially for coefficients with uninformative priors.
We propose a modified preconditioner with appropriately chosen diagonal entries for the unshrunk coefficients. Under the proposed form of modified preconditioner, it can be shown that all but a small number of eigenvalues of the preconditioned matrix are “well-behaved” for the purpose of rapid CG convergence. Our goal, therefore, is to choose the new entries to control the behavior of the additional eigenvalues introduced by the unshrunk coefficients. The detailed analysis of how to achieve this goal is somewhat involved and is provided in the supplement Section S2.3, but in the end we show that
(4.23) 
with some multiplicative factor is a good choice. The factor is included since, as explained in Section S2.3, it is better to err on the side of choosing the preconditioner entries larger rather than smaller. In our numerical results, we estimate the required quantities from earlier MCMC iterations.
While we find that a poor choice of these entries can significantly delay CG convergence in practice, we also point out the following: once they are chosen within reasonable ranges, their precise values have a rather small effect on the convergence rate when the number of unshrunk coefficients is small. This is because, under the proposed modified preconditioner, all but a small number of eigenvalues are well-behaved regardless of the choice. As CG has the ability to eventually “knock off” the extreme eigenvalues (Rule of Thumb 2.4), the additional eigenvalues generally have limited effect on its convergence behavior.
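The construction can be illustrated on a toy problem. In the sketch below (our own; the constant `c = 2` and the use of the inverse of diag(X'ΩX) as a posterior-scale proxy are assumptions for illustration, not the paper's exact recipe (4.23)), the preconditioner keeps the prior precisions for the shrunk coefficients and splices in rough posterior-precision proxies for the unshrunk ones:

```python
import numpy as np

rng = np.random.default_rng(13)
n, p, q = 100, 50, 3                 # first q coefficients are unshrunk
X = rng.standard_normal((n, p))
omega = rng.gamma(2.0, 1.0, size=n)
d = np.concatenate([np.zeros(q), np.full(p - q, 1e4)])  # zero precision = flat prior

XtOX = X.T @ (omega[:, None] * X)
Phi = XtOX + np.diag(d)

# Modified diagonal preconditioner: prior precisions for shrunk coefficients;
# for unshrunk ones, a rough posterior-precision proxy diag(X' Omega X) / c.
# Both c and the proxy are hypothetical stand-ins for the paper's (4.23).
c = 2.0
m = d.copy()
m[:q] = np.diag(XtOX)[:q] / c

# Spectrum of the preconditioned matrix M^{-1/2} Phi M^{-1/2}:
s = 1.0 / np.sqrt(m)
eig = np.linalg.eigvalsh(s[:, None] * Phi * s[None, :])
```

In this setup all but (at most) the q eigenvalues associated with the unshrunk block remain tightly clustered near 1, consistent with the discussion above.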
5 Application: comparison of two drugs using healthcare databases
In this section, we demonstrate how CG-acceleration delivers an order of magnitude speedup in the posterior computation for a large-scale observational study. We apply sparse Bayesian logistic regression to conduct a comparative study of two anticoagulants, dabigatran and warfarin. These drugs are administered to reduce the risk of blood clot formation but, as a side effect, can increase the incidence rates of bleeding. The goal of the study is to quantify which of the two drugs has a lower risk of intracranial hemorrhage (bleeding inside the skull). This question has previously been investigated by Graham et al. (2015), and our analysis yields clinical findings consistent with theirs (see Section 5.7).
We are particularly interested in sparse Bayesian regression as a tool for the Observational Health Data Sciences and Informatics (OHDSI) collaborative (Hripcsak et al., 2015). We therefore follow the OHDSI protocol in preprocessing the data as well as in the treatment effect estimation procedure. In particular, sparse regression plays a critical role in eliminating the need to handpick confounding factors; this enables the application of a reproducible and consistent statistical estimation procedure to tens of thousands of observational studies (Schuemie et al., 2018b, a; Tian et al., 2018).
5.1 Data set
We extract patient-level data from the Truven Health MarketScan Medicare Supplemental and Coordination of Benefits Database. In the database, we find patients who were new users of either dabigatran or warfarin after a diagnosis of atrial fibrillation. Among them, 19,768 are treated with dabigatran and the rest with warfarin. The predictors consist of clinical measurements, preexisting conditions, and prior treatments and administered drugs — all measured before exposure to dabigatran or warfarin. Following the OHDSI protocol, we screen out predictors observed in only a small percentage of the cohort, which reduces the number of predictors. The precise definition of the cohort can be found at http://www.ohdsi.org/web/atlas/#/cohortdefinition/{2978,2979,2981}.
Each patient has, or is exposed to, only a small subset of all the possible preexisting conditions, treatments, and drugs. The design matrix is therefore sparse, with only 5% of the entries being nonzero. (The density would have been 1% without the screening of the infrequent predictors.) Another noteworthy feature of the data is the low incidence rate of intracranial hemorrhage: the outcome indicator is nonzero for only 192 out of 72,489 patients.
5.2 Statistical approach: treatment effect estimation via sparse regression
To estimate the effect of treatment by dabigatran over warfarin on the outcome of interest, we use a doubly-robust method for average treatment effect estimation with propensity score stratification. The actual procedure and essential ideas are described below, but we refer readers to Stuart (2010) and the references therein for further details.
Within the framework of propensity score methods, the treatment effect estimation proceeds in two stages. First, the propensity score for treatment assignment to dabigatran is estimated for each individual by the logistic model
(5.24) 
The quantiles of the estimated propensity scores are then used to stratify the population into subgroups of equal sizes. Following a typical recommendation, we choose the number of subgroups to be five. Under suitable assumptions, conditioning on the stratum indicator removes most of the imbalances in the distributions of the predictors between the treatment and control groups.
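Quantile-based stratification takes only a few lines of NumPy. The sketch below (with simulated propensity scores; the variable names are ours) cuts the population into five equal-sized strata at the empirical quintiles:

```python
import numpy as np

rng = np.random.default_rng(3)
ps = rng.beta(2, 5, size=10000)   # hypothetical estimated propensity scores

n_strata = 5                      # quintile strata, as in the text
edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))

# Assign each individual to a stratum using the interior quantile edges;
# clip guards against values exactly at the extreme edges.
strata = np.clip(np.digitize(ps, edges[1:-1]), 0, n_strata - 1)
counts = np.bincount(strata, minlength=n_strata)
```

With continuous scores and no ties, each stratum receives exactly one fifth of the cohort; the `strata` labels then enter the outcome model as a categorical covariate.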
After the propensity score stratification, we estimate the average treatment effect via the logistic model
(5.25) 
where a categorical variable denotes the stratum membership of each individual. (With a slight abuse of notation, we use the same symbol in both models (5.24) and (5.25) to denote the regression coefficients of the predictors.) One stratum indicator is excluded from the model for identifiability. The inclusion of the predictors in the above model is technically not necessary if the distributions of the predictors are perfectly balanced between the treatment and control groups within each stratum. Controlling for the predictors, however, makes the treatment effect estimation procedure more robust to potential misspecification in the propensity score model (5.24), as well as to any predictor imbalances that remain after conditioning on the strata.
Each of the regression models (5.24) and (5.25) involves a large number of features, making reliable estimation virtually impossible without some regularization or sparsity assumption. Sparse Bayesian regression is one promising approach, with an opportunity for future extensions such as hierarchical modeling across different hospitals.
5.3 Prior choice and posterior computation
We fit the models (5.24) and (5.25) using the Bayesian bridge shrinkage prior (see Section 3.1) on the regression coefficients. We place a reference prior on the global shrinkage parameter (Berger et al., 2009). An uninformative prior is used for the intercept. For the average treatment effect and propensity score strata effects, we place weakly informative priors.
For posterior inference, we implemented two Gibbs samplers that differ only in their methods for the conditional updates of the regression coefficients from the distribution (1.3). One of the samplers uses the proposed CG sampler, while the other uses a traditional direct method based on the Cholesky factorization. We refer to them as the CG-accelerated and the direct Gibbs sampler, respectively. The other conditional updates follow the approaches described in Polson et al. (2014).
5.4 Overall speedup from CG acceleration and posterior characteristics
We implement the Gibbs samplers in Python and run them on a 2015 iMac with an Intel Core i7 processor. We run the Markov chains for 1,500 and 2,000 iterations for the propensity score and treatment effect models, respectively, yielding 1,000 post-convergence samples in each case. In total, the direct Gibbs sampler requires 77.4 hours for the propensity score model and 107 hours for the treatment effect model. The CG-accelerated sampler, on the other hand, finishes in 7.04 and 4.36 hours, yielding 11-fold and 25-fold speedups.
We can explain the difference in the magnitudes of CG-acceleration between the two models in terms of the posterior sparsity structures of the regression coefficients (Sections 2.4 and 3.3). For this purpose, we now examine the posteriors focusing only on the sparsity structure; other scientifically important aspects of the posteriors are discussed in Section 5.7. As measures of sparsity, for each regression coefficient we consider the one-sided tail probability, defined as the larger of the posterior probabilities that the coefficient is positive or negative, as well as the magnitude of the posterior mean.
For the propensity score model, 40 and 85 out of the 22,175 regression coefficients have one-sided tail probabilities above 97.5% and 90%, respectively. Only 79 regression coefficients have posterior means with magnitude above 0.1, while 18,161 (81.9%) of the coefficients have magnitude below 0.01.
For the doubly-robust treatment effect model, some of the regression coefficients have bimodal marginal posterior distributions, and their samples occasionally deviate away from zero. Averaged over all the posterior draws, however, no regression coefficient besides the fixed effects comes out as significantly different from 0. This suggests that the propensity score stratification is indeed successful in balancing the distributions of the predictors, and thus it may have been unnecessary to include those predictors in the treatment effect model. Of course, we can only conclude this after actually fitting the doubly-robust model. As the Gibbs sampler requires close to 1,000 iterations before convergence, the burn-in iterations alone would be a significant computational burden if it were not for the CG-acceleration.
5.5 Mechanism behind CG acceleration
For both Gibbs samplers, the conditional updates of the regression coefficients from (1.3) dominate the computational times. To better understand the mechanism behind the CG-acceleration, therefore, we only need to examine the convergence rates of the CG sampler at each Gibbs iteration. Here we focus on a Gibbs update after the burn-in iterations for the propensity score model (5.24). Section S3 in the supplement provides additional analysis, along with the corresponding results for the doubly-robust treatment effect model (5.25).
As in Section 3, we examine how quickly the CG method finds an accurate solution to the linear system (2.5). The CG iterates are compared against the exact solution found by direct linear algebra via the Cholesky decomposition. Figure 5.5 shows the distances between the CG iterates and the exact solution as a function of the number of CG iterations.
We first look at the solid blue line, which tracks the root-mean-squared residual introduced in Section 4.2. The dotted vertical line indicates when the magnitude of the prior-preconditioned residual falls below the termination criterion of (4.20). Thanks to this early termination, the CG sampler spends only a small fraction of the computational time of the direct Gibbs update. The analysis of Section 5.6 confirms that the CG approximation error at termination is so small that it does not affect the stationary distribution of the Gibbs sampler in any significant way. The number of iterations at termination fluctuates from one Gibbs iteration to another since the linear system (2.5) depends on random quantities. This fluctuation is rather small, however; after the Gibbs sampler has converged, we rarely observe a large deviation from the average number of iterations — see the supplement Section S3.
In Section 4.2, we argue that the proposed metric is a surrogate and approximate upper bound for a more easily interpretable error metric. This is empirically confirmed here by comparing the solid and dashed blue lines; the dashed one tracks the root-mean second-moment normalized error
(5.26) 
which is computed as a proxy for (4.21). The standardization by the second moment ensures that, when the computed error is small, all the coordinates of the CG iterate are close to those of the exact solution relative to either their posterior means or variances.
Finally, we compare the blue and orange dashed lines to assess the relative CG convergence rates under the prior and Jacobi preconditioners. It is clear that the advantage of the prior preconditioner, demonstrated in the simulated examples of Section 3, continues to hold in this real-data example. The observed convergence behaviors under the two preconditioners are again well explained by the eigenvalue distributions of the respective preconditioned matrices — see Figure S3 in the supplement.
5.6 Accuracy of CG sampler
Since CG technically does not yield an exact solution when terminated after a finite number of iterations, in this section we assess the accuracy of the samples generated by the CG-accelerated Gibbs sampler. We compare these samples against those generated by the direct Gibbs sampler, which we take as “ground truth.” As we show, the perturbation of the target distribution introduced by the CG approximation error is so small as to be essentially negligible, if present at all.
In comparing the two sets of samples, we encountered one complication. While the global shrinkage parameter generally demonstrates good mixing under the Bayesian bridge prior (see Section 3.1), for our real-world examples it proved difficult to obtain a sufficiently large effective sample size for this parameter within a reasonable amount of time. In order to ensure that the effective sample sizes are large enough to adequately characterize the stationary distribution, therefore, we employ an empirical Bayes approach. We first find a value of the global shrinkage parameter that approximately maximizes the marginal likelihood through a Monte Carlo expectation-maximization (MCEM) algorithm (Casella, 2001). We then run the two samplers conditional on this value. The MCEM algorithm used here is essentially the same as in Park and Casella (2008), with only one difference: after sampling the parameters from their conditionals, we update the global shrinkage parameter by maximizing the log density with the local shrinkage parameters marginalized out.
We check for significant differences between the two sets of samples as follows. We first compute the posterior mean estimates by averaging the samples from the direct Gibbs sampler (used as a benchmark) and from the CG-accelerated one. Figure 5.6(a) graphically compares these two estimators as an informal sanity check. We then estimate the effective sample sizes from the respective samplers using the R CODA package (Plummer et al., 2006). These estimated effective sample sizes can be used to estimate the Monte Carlo standard deviations of the differences between the two estimators (Geyer, 2011). When the two sets of MCMC samples have the same stationary distribution, the standardized differences are approximately distributed as standard Gaussians by the Markov chain central limit theorem (Geyer, 2011).
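The diagnostic can be sketched as follows (our own illustration: the two "chains" are iid draws from the same distribution, so the effective sample size equals the number of draws; in the paper the effective sample sizes come from CODA's autocorrelation estimates):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two samplers targeting the same distribution; iid here, so ESS = n_draws.
n_coef, n_draws = 1000, 500
chain_a = rng.standard_normal((n_draws, n_coef))
chain_b = rng.standard_normal((n_draws, n_coef))

def standardized_diff(a, b, ess_a, ess_b):
    """Difference of posterior-mean estimates, standardized by its MC error."""
    se2 = a.var(axis=0, ddof=1) / ess_a + b.var(axis=0, ddof=1) / ess_b
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(se2)

z = standardized_diff(chain_a, chain_b, n_draws, n_draws)
frac_within = np.mean(np.abs(z) < 1.96)   # should be ~0.95 under the null
```

When the two samplers share a stationary distribution, the standardized differences behave like standard Gaussian draws, so roughly 95% fall within ±1.96, mirroring Figure 5.6(b).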
Figure 5.6(b) confirms that the distribution of the standardized differences closely follows the “null” standard Gaussian distribution. Figures 5.6(a) and 5.6(b) are based on the posterior samples for the propensity score model (5.24), but we obtain essentially the same result under the treatment effect model (5.25). We additionally perform the same diagnostic on the estimators of the posterior second moments and obtain similar results.
5.7 Results and clinical conclusions from sparse regression models
The propensity score model finds that the patients treated with dabigatran and warfarin have substantial differences in their predictor characteristics. More precisely, a significant fraction of the individuals are much more likely to have been treated with one drug over the other, as can be seen from Figure 5.8. (Figure 5.8 shows distributions of the preference score, a transformation of the propensity score that has been suggested as a more interpretable measure of the difference in covariate characteristics between the treated and control groups; see Walker et al., 2013.) With the Bayesian approach, it is also straightforward to obtain uncertainty in the propensity score estimates. For instance, we can easily characterize uncertainty in the inverse probability weight, defined as the reciprocal of the propensity score for the treated group and of one minus the propensity score for the control group. Inverse probability weights are widely used due to their highly desirable theoretical properties, but have known issues of being unstable (Stuart, 2010). In fact, Figure 5.8 shows that the posterior uncertainties of the large inverse probability weights are as large as their posterior means.
Two of the most significant predictors are the year of the treatment and the age group. Both predictors have been encoded as binary indicators in the design matrix for simplicity, but the categorical and ordinal predictors in our model could instead have been estimated with shrinkage priors analogous to the (Bayesian) grouped or fused lasso (Hastie et al., 2009; Kyung et al., 2010; Xu et al., 2015). The posterior means and 95% credible intervals of these regression coefficients are shown in Figure 5.9, which displays the effect sizes relative to the year 2010 and a reference age group; when actually fitting the model, however, we use the most common category as the baseline for categorical variables.
For the treatment effect model, Figure 5.10(a) shows the posterior distribution of the average treatment effect of dabigatran over warfarin. The posterior indicates evidence, though not conclusive, for a lower incidence rate of intracranial hemorrhage attributable to dabigatran. This is consistent with the findings of Graham et al. (2015). The violin plot of Figure 5.10(b) summarizes the posterior distributions of the baseline incidence rates within each propensity score stratum. While the differences in the baseline incidence rates may seem insignificant because of the overlaps in their posterior marginals, there is significant covariation among them. In fact, the baseline for the 2nd quintile has more than 95% posterior probability of being larger than those for the 1st and 5th quintiles.
6 Acknowledgment
We would like to thank Yuxi Tian and Martijn Schuemie for their help in wrangling the data set used in Section 5. We also thank Jianfeng Lu and Ilse Ipsen for useful discussions on linear algebra topics. Finally, this work is partially supported by National Science Foundation grant DMS 1264153 and National Institutes of Health grant U19 AI135995.
Appendix A Proofs
We first derive Theorem 2.6 as a consequence of Theorem 2.5 before we proceed to prove Theorem 2.5.

Proof (Theorem 2.6). By Theorem B.4 below, the CG iterate satisfies the following bound:
(A.27) 
where denotes the th largest eigenvalue of . By Theorem 2.5, we know that
(A.28) 
and hence that . Since the function is increasing in , we can upper bound the right-hand side of (A.27) in terms of , yielding the desired inequality (2.15). \qed
Proof (Theorem 2.5). We prove the more general inequality (2.14). The lower bound is an immediate consequence of Proposition A.1. For the upper bound, first note that by the Poincaré separation theorem (Theorem A.2). From the expression (2.11) for , we have
(A.29)  
where the second equality follows from Proposition A.1. Applying Lemma A.3 with and , we obtain
(A.30) 
Thus we have shown
(A.31) 
where the inequality follows again from the Poincaré separation theorem. \qed
Proposition A.1
Given a symmetric matrix , the eigenvalues of the matrix are given by for .
The result follows immediately from the spectral theorem for normal matrices (Horn and Johnson, 2012).
Theorem A.2 (Poincaré separation theorem)
For a given symmetric matrix, let a submatrix be formed by removing the first rows and columns. Then the eigenvalues of the matrix and of its submatrix satisfy
(A.32) 
for any valid index. Since permuting the rows and columns of a symmetric matrix does not change its eigenvalues, the above inequality in fact holds for any principal submatrix obtained by removing rows and columns corresponding to an arbitrary common set of indices.
See Chapter 4.3 of Horn and Johnson (2012). \qed
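The separation (interlacing) property is easy to verify numerically. The sketch below (sizes arbitrary) checks it for a random symmetric matrix: with eigenvalues sorted in descending order, the i-th eigenvalue of the submatrix lies between the i-th and (i+k)-th eigenvalues of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3                        # remove the first k rows and columns
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                  # symmetrize
B = A[k:, k:]                      # principal submatrix of dimension n - k

eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]   # descending order
eig_B = np.sort(np.linalg.eigvalsh(B))[::-1]
```

The interlacing inequalities lambda_{i+k}(A) <= lambda_i(B) <= lambda_i(A) then hold for every i, up to floating-point rounding.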
Lemma A.3
Let and be symmetric positive definite matrices and suppose that the largest eigenvalue of satisfies . Then we have
(A.33) 
where denotes the th largest eigenvalue of a given matrix.
The result follows immediately from Ostrowski’s theorem (Theorem 4.5.9 in Horn and Johnson (2012)). \qed
Appendix B Theories behind convergence behavior of CG
In this section, we provide the mathematical foundations behind the claims made in Rule of Thumb 2.4. In essence, Rule of Thumb 2.4 is our attempt at describing a phenomenon known as the superlinear convergence of CG in a quantitative yet accessible manner. While this is a well-known phenomenon among researchers in scientific computing, it is rarely explained in canonical textbooks and reference books in numerical linear algebra; for example, the discussions beyond Theorems 2.2 and 2.3 cannot be found in, to name a few, Trefethen and Bau (1997), Demmel (1997), Saad (2003), or Golub and Van Loan (2012). Here we bring together some of the most practically useful results found in the literature and present them in a concise and self-contained manner. Our presentation in Sections B.1 and B.2 is roughly based on Section 5.3 of van der Vorst (2003), with details modified, added, and condensed as needed. A more comprehensive treatment of the known results related to CG is found in Meurant (2006). Kuijlaars (2006) sheds additional light on CG convergence behaviors by studying them from the potential theory perspective.
Section B.1 explains the critical first step in understanding the convergence of CG applied to a positive definite system: relating the CG approximation error to a polynomial interpolation error over the set of eigenvalues of the system matrix. Equipped with this perspective, one can understand Theorem 2.2 as a generic and rather crude bound that ignores the distribution of the eigenvalues in between the largest and smallest ones (Theorem B.3). Theorem 2.3 similarly follows from the polynomial approximation perspective.
The effect of the largest eigenvalues on CG convergence, as stated in Rule of Thumb 2.4, is made mathematically precise in Theorem B.4. Analyzing how the smallest eigenvalues delay CG convergence is more involved and requires a discussion of how the eigenvalues are approximated by the Krylov subspace. The amount of initial delay in CG convergence is closely related to how quickly these eigenvalue approximations converge. A precise statement is given in Theorem B.5.
Section B.3 provides the proofs of all the results stated in this section.
B.1 CG approximation error as polynomial approximation error
The space of polynomials defined below plays a prominent role in the behavior of the worst-case CG approximation error:
(B.34) 
Proposition B.1 below establishes the connection between CG and the polynomial space .
Proposition B.1
The difference between the th CG iterate and the exact solution can be expressed as
(B.35) 
In particular, the following inequality holds for any :