[

# [

[ University of California - Los Angeles, USA
###### Abstract

In a modern observational study based on healthcare databases, the number of observations typically ranges in the order of and that of the predictors in the order of . Despite the large sample size, the data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression provides a potential solution to this problem. There is a rich literature on desirable theoretical properties of Bayesian approaches based on shrinkage priors. On the other hand, the development of scalable methods for the required posterior computation has largely been limited to the case. Shrinkage priors make the posterior amenable to Gibbs sampling, but a major computational bottleneck still arises from the need to sample from a high-dimensional Gaussian distribution at each iteration. Despite a closed-form expression for the precision matrix , computing and factorizing such a large matrix is computationally expensive nonetheless. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector such that the solution to the linear system has the desired Gaussian distribution. We can then find an accurate solution to the linear system by the conjugate gradient (CG) algorithm through the matrix-vector multiplications by , without ever explicitly inverting . Practical performance of CG, however, depends critically on appropriate preconditioning of the linear system; we turn CG into a highly effective algorithm for sparse Bayesian regression by developing a theory of prior-preconditioning. We apply our algorithm to a clinically relevant large-scale observational study with and , designed to assess the relative risk of intracranial hemorrhage from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in the posterior computation.

Bayesian, Markov chain Monte Carlo, conjugate gradient, numerical linear algebra, sparsity, high-dimensional inference
\newcites

supplementReferences for Supplement CG-accelerated Gibbs]Prior-preconditioned conjugate gradient for accelerated Gibbs sampling in “large & large ” sparse Bayesian logistic regression models Nishimura et. al.]Akihiko Nishimura and Marc A. Suchard

## 1 Introduction

Given an outcome of interest and a large number of features , the goal of sparse regression (or variable selection) is to find a small subset of these features that captures the principal relationship between the outcome and features. Such a sparsity assumption is mathematical necessity when exceeds the sample size ; even when , however, the assumption often remains critical in improving the interpretability and stable estimation of regression coefficients . This is especially true when either

1. the design matrix is sparse i.e. only a small fraction of the design matrix contains non-zero entries because the features are observed infrequently.

2. the outcome is imbalanced i.e.  (or ) for most of .

Either of these conditions reduces the amount of information the data provides on the regression coefficients. In a modern observational study based on large-scale healthcare databases, for example, the sparsity of almost always holds true because a large number of potential pre-existing conditions and available treatments exist, yet only a small subset of these applies to each patient (Schuemie et al., 2018b). Imbalanced outcome is also common as many serious diseases of interest are rare among the population.

The particular application considered in this manuscript is a comparative study of two anti-coagulants dabigatran and warfarin, based on observational data from Truven Health MarketScan Medicare Supplemental and Coordination of Benefits Database. These drugs reduce the risk of blood clot formation, but can increase the risk of bleeding. The goal of the study is to quantify which of the two drugs has a lower risk of intracranial hemorrhage (bleeding inside the skull) among treated patients. The data set consists of patients and predictors of potential relevance.

An increasingly common approach to sparse regression is the Bayesian method based on continuous shrinkage priors on the regression coefficients . These priors often are represented as a scale-mixture of Gaussians

 βj|λj,τ∼N(0,τ2λ2j), λj∼πL(⋅), τ∼πT(⋅), (1.1)

with unknown global scale parameter and local scale parameter (Polson and Scott, 2010; Carvalho et al., 2010; Armagan et al., 2013; Polson et al., 2014; Bhattacharya et al., 2015; Bhadra et al., 2017). Compared to more traditional “spike-and-slab” discrete-mixture priors for sparse Bayesian regression, these continuous shrinkage priors are typically more computationally efficient while maintaining highly desirable statistical properties (Bhattacharya et al., 2015; Pal et al., 2014; Datta et al., 2013). Despite the relative computational advantage, however, posterior inference under these priors still encounters a major scalability issue. For instance, for our comparative study of two anti-coagulants, it takes over 50 hours on 2015 iMac to run 1,000 iterations of the current state-of-the-art Gibbs sampler in our reasonably optimized Python implementation (Section 5).

In this article, we focus on accelerating the state-of-the-art Gibbs sampler for sparse Bayesian logistic regression, but our approach applies whenever the likelihood of the data or latent parameter can be expressed as a Gaussian mixture. The Polya-Gamma data augmentation scheme of Polson et al. (2013) makes a posterior under the logistic model amenable to Gibbs sampling as follows. Through a Polya-Gamma auxiliary parameter , the conditional likelihood of a binary outcome becomes

 y′i|X,β,ω∼N(x\raisebox{2.15pt}{⊺}iβ,ω−1i)  for %  y′i:=ω−1i(yi−1/2). (1.2)

Correspondingly, the full conditional distribution of is given by

 β|ω,λ,τ,y,X∼N(Φ−1X\raisebox{2.15pt}{⊺}Ωy′,Φ−1)  for  Φ=X\raisebox{2.15% pt}{⊺}ΩX+τ−2Λ−2, (1.3)

where , a diagonal matrix with entries , and . The main computational bottleneck of the Gibbs sampler is the need to sample from a high-dimensional Gaussian of the form (1.3). The standard algorithm requires operations: for computing the term for and for the Cholesky decomposition of . These operations remain significant computational burdens even when a sparsity in significantly reduces the theoretically necessary number of arithmetic operations. This is because numerical computations in a modern computer architecture is memory bound, and the reduction in the number of arithmetic operations does not directly translate into reduced computing time (Golub and Van Loan, 2012; Dongarra et al., 2016).

Recently, significant progress has been made in computational techniques for cases. Bhattacharya et al. (2016) proposes an algorithm to sample from (1.3) with only operations, where the cost becomes negligible for . The algorithm requires computing the matrix for operations, and then solving an linear system for operations. Johndrow et al. (2018) reduce the cost by replacing the matrix with an approximation that can be computed with operations for . This technique offers no reduction in computational cost in cases, however. Hahn et al. (2018) propose a sampling approach for linear regression based on an extensive pre-processing of the matrix — a trick limited in scope strictly to Gaussian likelihood models. None of these advances addresses the “large & large ” logistic regression problem considered in this article.

Proposed in this article is a novel algorithm to sample from a high-dimensional distribution of the form (1.3) through the conjugate gradient (CG) method, requiring only a small number of the matrix-vector multiplication operations . The vector can be computed through operations and along with simple element-wise multiplications, without ever explicitly forming the matrix . This is an important feature when dealing with a large and sparse design matrix ; the matrix and hence may contain a much larger proportion of non-zero entries than does, making directly handling much more intensive in memory usage and the number of arithmetic operations. Note, for example, that it requires 74.5 GB of memory to allocate a dense matrix in double-precision numbers when . The ability to automatically exploits a sparsity structure in , therefore, is another major advantage of our algorithm.

Also developed in this article is a theory of an effective preconditioning technique in the context of sparse Bayesian regression. Preconditioning relates a given problem to a modified one to accelerate CG and is critical in achieving the full potential of the algorithm. While a variety of general-purpose preconditioners exist, design of an effective preconditioner remains problem specific. Exploiting the fact that the shrinkage priors dominate likelihood for all but a small fraction of the regression coefficients, we develop what we term the prior-preconditioning approach and demonstrate its superiority over general-purpose preconditioners in sparse Bayesian regression applications. It is worth noting that our contribution here is completely distinct from that of Cockayne et al. (2018), who propose an extension of CG that outputs a probability measure. We instead employ the classical CG method in a novel context, demonstrating its ability to significantly accelerate Monte Carlo simulations when applied in a right way.

The rest of the paper is organized as follows. Section 2 begins by describing how to recast the problem of sampling from the distribution (1.3) as that of solving a linear system , eliminating the need to factorize . The rest of the section explains how to apply CG to rapidly solve the linear system while developing theoretical foundations behind our prior-preconditioner. In Section 3, using simulated data, we demonstrate the effectiveness of the CG-based sampler for sparse regressions. Also demonstrated is how the behavior of CG depends on different preconditioning strategies. Section 4 describes practical details needed to successfully apply the CG-based sampler to sparse regression problems. Finally, in Section 5, we apply our algorithm to carry out the dabigatran vs. warfarin comparison, demonstrating an order of magnitude speed-up in the posterior computation. Our CG-accelerated Gibbs sampler for sparse Bayesian logistic regression is implemented as the bayesbridge package available from Python Package Index (pypi.org). The source code is available at a GitHub repository https://github.com/aki-nishimura/bayes-bridge.

## 2 Conjugate gradient sampler for sparse regression

### 2.1 Generating a Gaussian as the solution of a linear system

The standard algorithm for sampling a multivariate-Gaussian requires the Cholesky factorization of its precision (or covariance) matrix (Ripley, 1987). When the precision matrix has a specific structure as in (1.3), however, it turns out that the problem of sampling from the distribution (1.3) can be recast to that of solving a linear system. This in particular obviates the need to factorize the matrix . The key observation is that we can generate a Gaussian vector with for a computational cost negligible compared to an explicit formation of .

###### Proposition 2.1

The following procedure generates a sample from distribution (1.3):

1. Generate by sampling independent Gaussian vectors and and then setting

 b=X\raisebox{2.15pt}{⊺}Ωy′+X\raisebox{2.15pt}{⊺}Ω1/2η+τ−1Λ−1δ. (2.4)
2. Solve the following linear system for :

 Φβ=b(where Φ=X% \raisebox{2.15pt}{⊺}ΩX+τ−2Λ−2). (2.5)

The result follows immediately from basic properties of multivariate Gaussians; in particular, the solution to (2.5) has the required covariance structure because .

The above algorithm complements the algorithm Bhattacharya et al. (2016) propose; their algorithm reduces the task of sampling a multivariate Gaussian to solving a linear system, while our algorithm reduces it to solving a system. While the structures of the two linear systems are different, the techniques developed in this manuscript may be applicable with appropriate modifications to solving the linear system arising in Bhattacharya et al. (2016).

### 2.2 Iterative method for solving a linear system

The utility of Proposition 2.1 stems from the fact that solving the linear system (2.5) can be significantly faster than the standard algorithm for sampling a Gaussian vector. We achieve this speed-up by applying the CG method (Hestenes and Stiefel, 1952; Lanczos, 1952). CG belongs to a family of iterative methods for solving a linear system. Compared to traditional direct methods, iterative methods are more memory efficient and, if the matrix has certain structures (Section 2.3), can be significantly faster.

Iterative methods have found applications in Gaussian process models, where optimizing the hyper parameters of covariance functions requires solving linear systems involving large covariance matrices (Gibbs and MacKay, 1996). Significant research has gone into how best to apply iterative methods in this specific context; see Stein et al. (2012), Sun and Stein (2016), and Stroud et al. (2017) for example. Outside the Gaussian process literature, Zhou and Guan (2017) use an iterative method to address the bottleneck of having to solve large linear systems when computing Bayes factors in a model selection problem.

The use of CG as a computational tool for Monte Carlo simulation is a novel feature of our work that has not been considered by any the above works. While we were preparing our manuscript, we were also informed of the work by Zhang et al. (2018), which uses the same idea as in Proposition 2.1. They apply the CG method, apparently without any preconditioners, to generate a posterior sample from a Gaussian process model. Our work is distinguished by a careful development — supported by both theoretical analysis and systematic empirical evaluations — of preconditioning techniques (Section 2.3) tailored toward sparse Bayesian regression problems. The theoretical foundations laid out here also provide practical guidelines on how one may apply CG in other Bayesian computation problems.

The CG method solves a linear system involving a positive definite matrix as follows. Given an initial guess , which may be taken as for example, CG generates a sequence of increasingly accurate approximations to the solution. The convergence of the CG iterates ’s is intimately tied to the Krylov subspace

 K(Φ,r0,k)=span{r0,Φr0,…,Φk−1r0}, (2.6)

generated from the initial residual . With denoting an affine space , the approximate solution satisfies the following optimality property in terms of a weighted norm:

 (2.7)

The norm is often referred to as the -norm. The optimality property (2.7) in particular implies that CG yields the exact solution after iterations. The main computational cost of each update is the matrix-vector operation . Therefore, the required number of arithmetic operations to run iterations of the CG update is comparable to that of a direct method based on the Cholesky factorization of . Through an effective preconditioning strategy described in the next section, however, we can induce rapid convergence of CG for a typical precision matrix arising from the conditional distribution (1.3). In our numerical results, we indeed find that the distribution of even for is indistinguishable from (1.3) for all practical purposes (Section 5.6).

It is worth emphasizing that our CG sampler does not require explicitly forming the precision matrix of the form (1.3) because the vector can be computed via operations and along with simple element-wise multiplications. This is critical for our “large and large ” applications for a couple of reasons. First, computing the term in often requires more computational efforts than a subsequent factorization of when . Secondly, as mentioned earlier for a large sparse design matrix , storing an explicitly computed can significantly increase memory burden compared to storing itself.

### 2.3 Convergence behavior of the CG method

When directly applied to a given linear system, the iterative solution of CG often displays slow convergence. Section 2.4 covers the topic of how to induce more rapid CG convergence for the system (2.5) arising from sparse Bayesian regression. In preparation, here we describe how the convergence behavior of CG is related to the structure of the positive definite matrix .

Convergence behavior of CG can be partially explained by the following well-known error bound given in terms of the condition number , the ratio of the largest to the smallest eigenvalue of .

###### Theorem 2.2

Given a positive definite system and a starting vector , the -th CG iterate satisfies the following bound in its -norm distance to the solution :

 ∥βk−β∥Φ∥β0−β∥Φ≤2(√κ(Φ)−1√κ(Φ)+1)k. (2.8)

See Trefethen and Bau (1997) for a proof. Theorem 2.2 guarantees fast convergence of the CG iterates when the condition number is small. On the other hand, a large condition number does not always prevent rapid convergence of CG. This is because CG can also converge quickly when the eigenvalues of are “clustered.” The following theorem quantifies this phenomenon, albeit in an idealized situation in which has exactly distinct eigenvalues.

###### Theorem 2.3

If the positive definite matrix has only distinct eigenvalues, then the CG yields an exact solution within iterations. In particular, the result holds if is a rank- perturbation of an identity i.e.  for .

See Golub and Van Loan (2012) for a proof.

Theorem 2.2 and 2.3 are arguably the two most famous results on the convergence property of CG, perhaps because their conclusions are clear-cut and easy to understand. However, each theorem captures only an aspect of CG’s convergence property and falls short of describing typical behavior in practical applications; the assumption behind Theorem 2.3 is unrealistic, while the bound of Theorem 2.2 is too pessimistic. To better capture behavior of CG in actual applications, we bring together the most useful of the known results that have been scattered around the numerical linear algebra literature and summarize them as the following rule of thumb. We make all the statements contained in the rule of thumb mathematically precise in Appendix B. As we will see, the rule of thumb points us to the design of an effective preconditioning strategy in the context of sparse Bayesian regression.

###### Rule of Thumb 2.4

Suppose that the eigenvalues of are clustered in the interval except for a small fraction of them. Then CG “knocks off” the outlying eigenvalues exponentially quickly and its convergence subsequently accelerates as if the effective condition number of is rather than in Eq 2.8. The largest eigenvalues are effectively removed within iterations, while the same number of smallest eigenvalues tends to delay the convergence longer.

### 2.4 Preconditioning the CG method for rapid solutions of (2.5)

A preconditioner is a positive definite matrix such that the preconditioned system

 ~Φ~β=~b for ~Φ=M−1/2ΦM−1/2 and ~b=M−1/2b (2.9)

leads to faster convergence of the CG iterations. In practice, we only need and not since the algorithm can be implemented so that only the operation is required to solve the preconditioned system (2.9) via CG (Golub and Van Loan, 2012). This preconditioned CG algorithm still returns a solution in terms of the original system. While a wide range of general techniques has been proposed, finding a good preconditioner to a given linear system remains “a combination of art and science.” (Saad, 2003)

In light of Rule of Thumb 2.4, an effective preconditioner should modify the eigenvalue structure of so that the preconditioned matrix has more tightly clustered eigenvalues except perhaps for a small number of outlying eigenvalues. Larger outlying eigenvalues are preferable over smaller ones, as smaller ones cause more significant, if not severe, delays in the convergence of CG. In addition to the convergence rate, a choice of a preconditioner must take into account the one-time cost of computing the preconditioner itself as well as the cost of operation during each CG iteration.

In the contexts of sparse Bayesian regression, the linear system (2.5) turns out to admit a deceptively simple yet highly effective preconditioner, obviating the need for more complex and computationally expensive preconditioning strategies. In fact, the choice

 M=τ−2Λ−2 (2.10)

yields a modified system (2.9) with an eigenvalue structure ideally suited to the CG application. With a slight abuse of terminology, we will call this matrix the prior preconditioner since it corresponds to the precision of before observing . Note that this choice is different from the widely-used Jacobi preconditioner based on the diagonal elements of ; despite being a reasonable choice, the Jacobi preconditioner is substantially inferior to the prior preconditioner in typical applications (Section 3.3 and 5.4).

Our prior preconditioner can be motivated as follows. The shrinkage prior is employed either when 1) we want to impose a sparsity through an informative prior on the global shrinkage parameter or 2) we believe that the data support sparser models, say, in terms of the marginal likelihood. In either case, we expect posterior draws of to satisfy except for a relatively small subset of . (More precisely, we mean by that the contribution of the term is small in predicting the outcome.) This observation leads to the following simple heuristic. Note that the prior-preconditioned matrix is given by

 ~Φ=τ2ΛX\raisebox{2.15pt}{⊺}ΩXΛ+Ip (2.11)

and the matrix has the -th entry

 (2.12)

which is small when either or . In other words, the entries of are small away from the block corresponding to the indices . In general, smaller entries of a matrix have less contributions to the eigenvalue structures of the entire matrix (Golub and Van Loan, 2012). This means that the prior-preconditioned matrix (2.11) can be thought of as a perturbation of the identity with a matrix with approximate low-rank structure. As such, can be expected to have eigenvalues clustered around 1, except for a relatively small number of larger ones.

The above heuristic on the approximate low-rank structure of is formalized and quantified in Theorem 2.5. It is also worth noting, however, that it is too naive to conclude from the above heuristic that a good approximation to can be obtained by simply zeroing out ’s below some threshold. Such thresholding approach is successfully employed in Johndrow et al. (2018) for a related but different problem. In our context, however, numerical results clearly show that such approximation can be of a poor quality — see the supplement Section S4.

###### Theorem 2.5

Let denote the -th largest element of . The eigenvalues of the prior-preconditioned matrix (2.11) then satisfies

 1≤νk+1(~Φ)≤1+τ2λ2(k+1)ν1(X\raisebox{2.15pt}{⊺}ΩX) (2.13)

for . In fact, the following more general bounds hold. Let denote the submatrix of a given matrix corresponding to the row and column indices . With this notation, we have

 1≤νk+ℓ(~Φ)≤1+τ2λ2(k+1)νℓ+1((X\raisebox{2.15pt}{⊺}ΩX)(−k))≤1+τ2λ2(k+1)νℓ+1(X\raisebox{2.15pt}{⊺}ΩX) (2.14)

for any such that .

Theorem 2.5 guarantees tight clustering of the eigenvalues of the prior/preconditioned matrix — and hence rapid convergence of the CG — when most of ’s are close to zero. Through its dependence on , the bound (2.14) further shows that rapid CG convergence is also expected when the eigenvalues of decays quickly, which tends to happen when there is high-collinearity among the predictors.

Theorem 2.5 can also be used to directly relate the CG approximation error under the prior preconditioner to the decay rate in ’s:

###### Theorem 2.6

The prior-preconditioned CG applied to (2.5) yields iterates satisfying the following bound for any :

 ∥∥βm+m′−β∥∥Φ∥β0−β∥Φ ≤2(˜κ1/2m−1˜κ1/2m+1)m′ (2.15) where ˜κm=1+mink+ℓ=mτ2λ2(k+1)νℓ+1((X\raisebox{2.15pt}{⊺}ΩX)(−k)).

See Appendix A for proofs of Theorem 2.5 and 2.6.

To illustrate the implication of Theorem 2.6 in concrete terms, suppose that a posterior draw satisfies for some . In this case, we have . So the bound (2.15) implies

 ∥∥βm+m′−β∥∥Φ∥β0−β∥Φ≤2⋅10−0.086m′. (2.16)

For instance, after iterations, the CG approximation error in the -norm is guaranteed to be reduced by a factor of relative to the initial error.

We have introduced the prior-preconditioning strategy ultimately for the purpose of efficient posterior computation under sparse regression models. All the results have so far been formulated in linear algebraic languages, however. To provide a useful guideline on the performance of the CG-accelerated Gibbs sampler in practical applications, we now summarize the above discussions in a more statistical language.

###### Rule of Thumb 2.7

The prior preconditioner (2.10) induces rapid convergence of the CG applied to the linear system (2.5) when the posterior of concentrates on sparse vectors. (With continuous shrinkage priors, by sparsity we mean that most of ’s are virtually zero in their magnitudes.) As the sparsity of increases, the average convergence rate of the CG sampler also increases.

The statements above are born out by an illustrative example of Section 3 using synthetic sparse regression posteriors. Also, as we have seen, the statements can be made more precise in terms of the decay rate in the ordered statistics of a posterior sample (Rule of Thumb 2.4, Theorem 2.5, and Theorem 2.6).

### 2.5 General principle behind prior-preconditioning approach

We close our discussion of preconditioning techniques by providing alternative heuristics behind the prior preconditioner. While not as quantitative as Theorem 2.5, the following general principle suggests that the prior-preconditioning is effective beyond sparse regression settings for any Bayesian computation involving conditionally Gaussian distributions. In the context of the CG sampler, the preconditioned matrix represents the precision matrix of the transformed parameter . In fact, preconditioning the linear system (2.5) with a preconditioner is equivalent to applying a parameter transformation before employing the CG sampler. That is, we can apply one of the two strategies — precondition the linear system or apply the parameter transformation — to achieve exactly the same effect on the speed of the CG sampler.

When we choose the prior precision as the preconditioner, the transformed parameter a priori has the identity precision matrix, before its distribution is modified via the likelihood. This perspective, combined with the fact that the eigenvalues of represents the posterior precisions of along its principal components, suggests the following principle:

###### Principle Behind Prior-preconditioning 2.8

Under a strongly informative prior, the posterior looks like the prior except in a small number of directions along which the data provide significant information. This translates into the eigenvalues of the prior-preconditioned matrix clustering around 1 except for a relatively small number of large eigenvalues.

The eigenvalue structure of the prior-preconditioned matrix as predicted above is indeed observed across all of our numerical examples — see Figure 3.2, S3, and S6.

## 3 Demonstration of the CG sampler performance on simulated data

In addition to the prior preconditioning strategy introduced in Section 2.4, there remain a few more important details to successfully apply the CG sampler to sparse regression problems in practice. Preconditioning is undoubtedly the most essential ingredient of the CG sampler, however, and we defer other practical details to Section 4. Instead, we now turn to demonstrating the performance of the CG sampler applied to distributions of the form (1.3) as they arise from actual posterior distributions of sparse Bayesian logistic regression models. We illustrate how the CG convergence rates are affected by different preconditioning strategies and by corresponding eigenvalue distributions of the preconditioned matrices. Also, we use simulated data with varying numbers of non-zero coefficients and confirm how sparsity in regression coefficients translates into faster CG convergence as predicted by Theorem 2.5 and Rule of Thumb 2.7.

### 3.1 Choice of shrinkage prior: Bayesian bridge

While a variety of global-local shrinkage priors of the form (1.1) are available, we adopt the Bayesian bridge prior of Polson et al. (2014) in our implementation of the CG-accelerated Gibbs sampler. We make this choice for the following reasons. First, the Gibbs sampler under the Bayesian bridge tends to demonstrate better mixing in the global shrinkage parameter . This is because the Bayesian bridge formulation allows the update of from the distribution of with marginalized out (Polson et al., 2014). Secondly, many of the alternative shrinkage priors have extremely heavy tails that can be problematic in the logistic regression context. Under such heavy-tailed priors, the posterior ends up having heavy-tails when the parameters are only weakly identified from the data, causing issues both in terms of posterior computation and inference (Ghosh et al., 2018; Piironen and Vehtari, 2017).

Under the Bayesian bridge, the local shrinkage parameter ’s are given independent alpha-stable distributions with index of stability with . This choice induces to a prior on , when is marginalized out, such that

 π(βj|τ)∝τ−1exp(−|βj/τ|α). (3.17)

The case coincides with the Bayesian lasso. The distribution of becomes “spikier” as , placing greater mass around while at the same time having heavier tails. While an alpha-stable distribution has no closed-form expression, there are algorithms available to efficiently sample from the posterior distribution of (Polson et al., 2014). In typical applications, the marginal likelihood of data favors the values but only very weakly identifies it (Polson et al., 2014), so in our implementation we simply set .

### 3.2 Experimental set-up

We generate synthetic data of sample size with the number of predictors . To generate a design matrix with correlation among the predictors, we emulate a model from factor analysis (Jolliffe, 2002). We first sample a set of orthonormal vectors with uniformly from a Stiefel manifold, comprised of sets of orthonormal vectors. We then set the predictor for the -th observation to be

 xi=99∑ℓ=1fi,ℓuℓ+ϵi % for fi,ℓ∼N(0,(100−ℓ)2) and ϵi∼N(0,Ip). (3.18)

This is equivalent to independently sampling , where is a diagonal matrix with and is an orthonormal matrix sampled uniformly from the space of orthonormal matrices. We then center and standardize the predictors as is commonly done in sparse regression models (Hastie et al., 2009).

The above process yields a design matrix with moderate correlations among the predictors — the distribution of pairwise correlations is approximately Gaussian centered around 0 with the standard deviation of 0.13. Based on this design matrix , we simulate three different binary outcome vectors by varying the number of non-zero regression coefficients. More specifically, we consider a sparse regression coefficient with for each , and . In all three scenarios, the binary outcome ’s are generated from the logistic model as for .

For each of the simulated data sets, we obtain a posterior sample of by running the current state-of-the-art Polya-Gamma augmented Gibbs sampler using the direct linear algebra to sample from its conditional distribution (1.3). We confirm the convergence of the Markov chain through the traceplot of the posterior log density of (with and marginalized out). We can consider the state-of-the-art Gibbs sampler “exact” in a sense that the direct linear algebra has no potential convergence issues of the CG method. (It is, however, still affected by errors from finite precision arithmetic as always.)

Having obtained a posterior sample , we sample a vector as in (2.4) and apply CG to the linear system (2.5). We then compare the CG iterates to the exact solution obtained by solving the same system with the direct linear method based on the Cholesky factorization. We repeat this process for 8 random replications of the right-hand vector for the linear system (2.5).

### 3.3 Results: convergence rates, eigenvalues, and sparsity of coefficients

Figure 3.1 shows the CG approximation error as a function of the number of CG iterations, whose cost as we discuss in Section 2.2 is dominated by matrix-vector operations of the form . Here we characterize the error as the average of the relative error across all the coefficients. For each line on the plot, this error is averaged in the log-scale over the 8 random replications of the right-hand vector . Plots in the supplement Section S1 show, however, that the CG convergence behavior remains qualitatively similar regardless of choice of the error metric and varies little across the different right-hand vectors. In particular, the superior convergence rate of the prior-preconditioning over the Jacobi one holds consistently.

First, we focus on the results corresponding to the prior-preconditioned CG, indicated by the lines with circles. After matrix-vector operations, the distance between and is already orders of magnitudes smaller than typical Monte Carlo error, say, in estimating . In fact, with a relatively small number of additional CG iterations, the distance reaches the level of machine precision manifested as the “plateaus” of the approximation error seen in the blue dash-dot and orange dashed lines with circles.

Along with the approximation errors based on the prior preconditioner, as a benchmark we also compute the errors based on the Jacobi preconditioner . The Jacobi preconditioner is the simplest general-purpose preconditioner and usually performs well for linear systems like ours when the diagonals of is significantly larger than the off-diagonals (Golub and Van Loan, 2012). In the context of the CG sampler, the Jacobi preconditioning coincides with the use of the conditional precisions as a preconditioner. While more complex general-purpose preconditioners exist, they require substantially more computational efforts to compute them before the CG iterations can even get started (Golub and Van Loan, 2012). We therefore do not consider those preconditioners here.

It is evident from Figure 3.1 that the prior preconditioner induces more rapid CG convergence than the Jacobi preconditioner for the purpose of our CG-accelerated Gibbs sampler. The difference in convergence speed is more pronounced with sparser true regression coefficients. Studying the eigenvalue distributions of the respective preconditioned matrices provides further insight into the observed convergence behaviors. Figure 3.2 (a) & (b) show the eigenvalue distributions of the preconditioned matrices based on a posterior sample from the synthetic data with 10 non-zero coefficients. The trimmed version of the histograms are also given to highlight the tails of the distributions. The prior preconditioner induces the distribution with a tight cluster around 1 (or 0 in the scale) with a relatively small number of large ones, confirming the theory we developed in Section 2.4. On the other hand, the Jacobi preconditioner induces a more spread-out distribution, problematically introducing quite a few small eigenvalues that delay the CG convergence (Rule of Thumb 2.4).

Finally, we turn our attention to the relationship as seen in Figure 3.1 between the CG convergence rate and the sparsity in the underlying true regression coefficients. The CG convergence is clearly quicker when the true regression coefficients are sparser. To understand this relationship, it is informative to look at the values of drawn from the respective posterior distributions. Figure 3.3 plots the values of for corresponding to the first 250 coefficients. We use two different -scales for and , shown on the left and right respectively, to facilitate the qualitative comparison between the two cases. As expected, the posterior sample from the synthetic data with a larger number of signals has a larger number of away from zero. These relatively large ’s contribute to the delayed convergence of CG as quantified in Theorem 2.5 and Rule of Thumb 2.4.

A more significant cause of the delay, however, is the fact that the shrinkage prior yields weaker shrinkage on the zero coefficients when there are a larger number of signals. With a close look at Figure 3.3, one can see that corresponding to the true signals are not as well separated from the rest of ’s when . In fact, the histograms on the left of Figure 3.4 shows that the distribution of ’s for are shifted toward larger values compared to that for . This is mostly due to the posterior distribution of concentrating around a larger value — the value of the posterior sample is for the case while for the case. It is also worth taking a closer look at the tail of the distribution of ’s. The histograms on the right of Figure 3.4 show the distribution of the 250 largest ’s after scaling them by . The figure makes it clear that ’s corresponding to the true signals are much more well separated from the rest when . Overall, the slower decay in the largest values of ’s when results in the eigenvalues of the prior preconditioner having a less tight cluster around 1; compare the eigenvalue distributions of Figure 3.2.(a) & (b) to those of Figure 3.2.(c) & (d).

## 4 Practical CG sampler details for sparse regression

### 4.1 Initial vector for CG iterations

As suggested by Theorem 2.2 and by more thorough analysis of the CG convergence behavior in Appendix B, generally speaking the CG iterations decrease the distance between the iterates ’s and the exact solution relative to the initial error . Hence there is a benefit to choosing the initial vector with a small initial error . It should be noted, however, that choice of a preconditioner determines the exponential convergence rate of CG and, comparatively, choice of an initial vector often has a smaller effect on the approximation error. In other words, once the initial vector is chosen within a reasonable range, we should not expect a dramatic gain from further fine-tuning. In fact, when sampling from a sparse regression posterior, we find it difficult to improve much over a simple initialization , which is a reasonable choice as most coefficients are shrunken to near zero. Only small (%) though consistent improvements were observed by one of the other approaches with which we experimented — see the supplement Section S2.1.

### 4.2 Termination criteria for CG method

An iterative method must be supplied with a termination criteria to decide when the current iterate is close enough to the exact solution. A CG termination criteria used in most of the existing linear algebra libraries is based on the norm of the residual . This is mostly for convenience reasons as the residual norm is easily computed as a bi-product of the CG iterations. At least, the residual norm can be related to as

 ∥βk−β∥2=∥Φ−1rk∥2≤∥Φ−1∥2∥rk∥2. (4.19)

For the purpose of sampling a Gaussian vector , however, it is not at all clear when or can be considered small enough. To address this problem, we develop an alternative metric tailored toward our CG sampler for sparse regression.

For our CG sampler, we propose to assess the CG convergence in terms of the norm of the prior-preconditioned residual , where denotes the element-wise multiplication of two vectors. More specifically, we use the termination criteria

 p−1/2∥~rk∥2={p−1∑pj=1(~rk)2j}1/2≤10−6, (4.20)

in terms of the root-mean-squared residual .

In the supplement Section S2.2, we explain the utility of the norm as an approximate upper bound to the following quantity:

 ∥∥ξ−1⊙(βk−β)∥∥2  with  ξ2j=E[β2j|ω,λ,τ,y,X]. (4.21)

The standardization by second moment ensures that, when the computed error is small, all the coordinates of are close to those of either in terms of their means or variances of the target Gaussian distribution. In all our examples, we find this metric to work very well in quantifying the numerical error for the purpose of the CG-accelerated sampling; see Section 5.4 and 5.6 for illustrations.

### 4.3 Incorporating intercept and predictors with uninformative prior

When fitting a sparse regression model, standard practice is to include an intercept without any shrinkage, often with the improper flat prior (Park and Casella, 2008). Additionally, there may be predictors of particular interests, inference for whose regression coefficients is more appropriately carried out with uninformative or weakly-informative priors without shrinkage; see Zucknick et al. (2015) as well as the application in Section 5 for examples of such predictors. Our CG-accelerated Gibbs sampler can accommodate such predictors with an appropriate modification to the prior preconditioner proposed above.

For notational convenience, suppose that the regression coefficients are indexed so that the first -th coefficients are to be estimated without shrinkage. We further assume that the unshrunk coefficients are given independent Gaussian priors for where denotes an improper prior . The precision matrix of then is given by

 Φ=X\raisebox{2.15pt}{⊺}ΩX+[diag(σ)−200τ−2Λ−2] (4.22)

for where we employ the convention if . From the computational point of view, the unshrunk coefficients are distinguished from the shrunk ones by the fact that their prior scales (before conditioning on and ) typically have little to do with their posterior scales (after conditioning on and ). For this reason, a naively modified preconditioner may not be appropriate, especially for coefficients with corresponding to uninformative priors.

We propose a modified preconditioner of the form for appropriately chosen . Under the proposed form of a modified preconditioner, it can be shown that all but eigenvalues of the preconditioned matrix are “well-behaved” for the purpose of rapid CG convergence. Our goal, therefore, is to choose ’s to control the behavior of the additional eigenvalues introduced by the unshrunk coefficients. The detailed analysis of how to achieve this goal is somewhat involved and is provided in the supplement Section S2.3, but in the end we show that

 γj=cˆηj  for  ˆη2j≈var(βj|y,X) (4.23)

with some is a good choice. The factor is included since, as explained in Section S2.3, it is better to err on the side of choosing ’s larger than smaller. In our numerical results, we use and estimate from earlier MCMC iterations.

While we find a poor choice of ’s to significantly delay CG convergence in practice, we also point out the following: once ’s are chosen within reasonable ranges, their precise values have rather small effect on the convergence rate when is small. This is because, under the proposed modified preconditioner, all but eigenvalues are well-behaved regardless of the choice of . As CG has an ability to eventually “knock off” the extreme eigenvalues (Rule of Thumb 2.4), the additional eigenvalues will generally have limited effects in its convergence behavior.

## 5 Application: comparison of two drugs using healthcare databases

In this section, we demonstrate how CG-acceleration delivers an order of magnitude speed-up in the posterior computation for a large-scale observational study. We apply sparse Bayesian logistic regression to conduct a comparative study of two anti-coagulants dabigatran and warfarin. These drugs are administered to reduce the risk of blood clot formation, but as a side effect can increase the incidence rates of bleeding. The goal of the study is to quantify which of the two drugs have a lower risk of intracranial hemorrhage (bleeding inside the skull). This question has previously been investigated by Graham et al. (2015) and our analysis yields clinical findings consistent with theirs (see Section 5.7).

We are particularly interested in sparse Bayesian regression as a tool for the Observational Health Data Sciences and Informatics (OHDSI) collaborative (Hripcsak et al., 2015). Therefore, we follow the OHDSI protocol in pre-processing of the data as well as in the treatment effect estimation procedure. In particular, sparse regression plays a critical role in eliminating the need to hand-pick confounding factors; this enables the application of a reproducible and consistent statistical estimation procedure to tens of thousands of observational studies (Schuemie et al., 2018b, a; Tian et al., 2018).

### 5.1 Data set

We extract patient-level data from Truven Health MarketScan Medicare Supplemental and Coordination of Benefits Database. In the database, we find patients who were new users of either dabigatran or warfarin after diagnosis of atrial fibrillation. Among them, 19,768 are treated with dabigatran and the rest with warfarin. There are predictors, consisting of clinical measurements, pre-existing conditions, as well as prior treatments and administered drugs — all measured before exposure to dabigatran or warfarin. Following the OHDSI protocol, we screen out the predictors observed in less than % of the cohort. This reduces the number of predictors to . The precise definition of the cohort can be found at http://www.ohdsi.org/web/atlas/#/cohortdefinition/{2978,2979,2981}.

Each patient has or is exposed to only a small subsets of all the possible pre-existing conditions, treatments, and drugs. The design matrix therefore is sparse, with only 5% of the entries being non-zero. (The density of would have been 1% without the screening of the infrequent predictors.) Another noteworthy feature of the data is the low incidence rates of intracranial hemorrhage, so that the outcome indicator has non-zero entries for only 192 out of 72,489 patients.

### 5.2 Statistical approach: treatment effect estimation via sparse regression

To estimate the effect of treatment by dabigatran over warfarin on the outcome of interest, we use a doubly-robust method for average treatment effect estimation with propensity score stratification. The actual procedure and essential ideas are described below, but we refer the readers to Stuart (2010) and the references therein for further details.

Within the framework of propensity score methods, the treatment effect estimation proceeds in two stages. First, the propensity score of the treatment assignment to dabigatran of the -th individual is estimated by the logistic model

 logit{P(Ti=1|xi)}=β0+x\raisebox{2.15pt}{⊺}iβ. (5.24)

The quantiles of the estimated propensity scores are then used to stratify the population into subgroups of equal sizes. Following a typical recommendation, we choose the number of subgroups as . Under suitable assumptions, conditioning on the strata indicator removes most of imbalances in the distributions of the predictors between the treatment () and control () groups.

After the propensity score stratification, we estimate the average treatment effect via the logistic model

 logit{P(yi=1|α,β,si,xi)}=β0+α0Ti+M∑m=2αm\mathds1{si=m}+x\raisebox{2.15pt}{⊺}iβ, (5.25)

where a categorical variable denotes the strata membership of the -th individual. (With a slight abuse of notation, we use in both of the models (5.24) and (5.25) to denote the regression coefficients of predictors.) The indicator of is excluded from the model for identifiability. The inclusion of the predictor in the above model is technically not necessary if the distributions of predictors are perfectly balanced between the treatment and control groups within each strata . Controlling for the predictor , however, makes the treatment estimation procedure more robust to potential misspecification in the propensity score model (5.24) as well as to any predictor imbalances that remain after conditioning on .

Each of the regression models (5.24) and (5.25) involves a large number of features, making reliable estimation virtually impossible without some regularization or sparsity assumption. Sparse Bayesian regression is one promising approach, with an opportunity for future extensions such as hierarchical modeling across different hospitals.

### 5.3 Prior choice and posterior computation

We fit the models (5.24) and (5.25) using the Bayesian bridge shrinkage prior (see Section 3.1) on the regression coefficients. We place a reference prior on the global shrinkage parameter (Berger et al., 2009). An uninformative prior is used for the intercept. For the average treatment effect and propensity score strata effects, we place weakly informative priors.

For posterior inference, we implemented two Gibbs samplers that differ only in their methods for the conditional updates of from the distribution (1.3). One of the samplers uses the proposed CG sampler, while the other uses a traditional direct method based on the Cholesky factorization. We refer to the respective sampler as the CG-accelerated and direct Gibbs sampler. The other conditional updates follow the approaches described in Polson et al. (2014).

### 5.4 Overall speed-up from CG acceleration and posterior characteristics

We implement the Gibbs samplers in Python and run on 2015 iMac with Intel Core i7 processors. We run the Markov chains for 1,500 and 2,000 iterations for the propensity score and treatment effect model respectively, yielding 1,000 post-convergence samples. In total, the direct Gibbs sampler requires 77.4 hours for the propensity score model and 107 hours for the treatment effect model. On the other hand, the CG-accelerated sampler finishes in 7.04 and 4.36 hours, yielding 11-fold and 25-fold speed-ups.

We can explain the difference in the magnitudes of CG-acceleration between the two models in terms of the posterior sparsity structures of the regression coefficients (Section 2.4 and 3.3). For this purpose, we now examine the posteriors focusing only on the sparsity structure; other scientifically important aspects of the posteriors are discussed in Section 5.7. As a measure of sparsity, for each regression coefficient we consider one-sided tail probability, the larger of the posterior probabilities of or , as well as the magnitude of posterior means.

For the propensity score model, 40 and 85 out of the 22,175 regression coefficients have one-sided tail probabilities above 97.5% and 90% respectively. Only 79 regression coefficients have the magnitude of their posterior means above 0.1, while 18,161 (81.9%) of the coefficients have the magnitude below 0.01.

For the doubly-robust treatment effect model, some of the regression coefficients have bi-modal marginal posterior distributions and their samples occasionally deviate away from zero. Averaged over all the posterior draws, however, except for the fixed effects no regression coefficient comes out as significantly different from 0. This suggests that the propensity score stratification is indeed successful in balancing the distribution of ’s and thus it may have been unnecessary to include those predictors in the treatment effect model. Of course, we can only conclude this after actually fitting the doubly-robust model. As the Gibbs sampler requires close to 1,000 iterations before convergence, the burn-in iterations alone would be a significant computational burden if it were not for the CG-acceleration.

### 5.5 Mechanism behind CG acceleration

For both Gibbs samplers, the conditional updates of from (1.3) dominate the computational times. To better understand the mechanism behind the CG-acceleration, therefore, we only need to examine the convergence rates of the CG sampler at each Gibbs iteration. Here we focus on a Gibbs update of after the burn-in iterations for the propensity score model (5.24). Section S3 in the supplement provides additional analysis along with the corresponding results for the doubly-robust treatment effect model (5.25).

As done in Section 3, we examine how quickly the CG method finds an accurate solution to the linear system (2.5). The CG iterates are compared against the exact solution found by a direct linear algebra via the Cholesky decomposition. Figure 5.5 shows the distances between and as a function of the number of CG iterations .

We first look at the solid blue line which tracks the root mean squared residual as introduced in Section 4.2. The dotted vertical line indicates when the magnitude of the prior-preconditioned residual falls below the termination criteria of (4.20). The termination occurs at the iterations and the CG sampler consequently spends only of the computational time relative to the direct Gibbs update. Analysis of Section 5.6 confirms that the CG approximation error at termination is so small that it does not affect the stationary distribution of the Gibbs sampler in any significant way. The number of iterations at termination fluctuates from one iteration to another of the Gibbs sampler since the linear system (2.5) depends on the random quantities , , , and . This fluctuation is rather small, however. After the Gibbs sampler has converged, we rarely observe a deviation of more than from the average number of iterations — see the supplement Section S3.

In Section 4.2, we argue that is a surrogate and approximate upper-bound for a more easily interpretable error metric. This is empirically confirmed here, by comparing the solid to dashed blue line; the dashed one tracks the root mean second-moment normalized error

 {p−1∑j^ξ−2j(βk−βdirect)2j}1/2  with  ^ξ2j≈E[β2j|y,X] (5.26)

which is computed as a proxy for (4.21). The standardization by second moment ensures that, when the computed error is small, all the coordinates of are close to those of either in terms of their posterior means or variances.

Finally, we compare the blue and orange dashed lines to assess the relative CG convergence rates under the prior and Jacobi preconditioner. It is clear that the advantage of the prior preconditioner, as demonstrated in the simulated examples of Section 3, continues to hold in this real data example. The observed convergence behaviors under the two preconditioners are again well explained by the eigenvalue distributions of the respective preconditioned matrices — see Figure S3 in the supplement.

### 5.6 Accuracy of CG sampler

Since CG technically does not yield an exact solution when terminated after iterations, in this section we assess the accuracy of the samples generated by the CG-accelerated Gibbs. We compare these samples against those generated by the direct Gibbs, which we take as “ground truth.” As we show, perturbation of the target distribution introduced by the CG approximation error is, if any, so small that it is essentially negligible.

In comparing the two sets of samples, we encountered one complication. While the global shrinkage parameter generally demonstrates good mixing under the Bayesian bridge prior (see Section 3.1), for our real world examples it proved difficult to obtain a sufficiently large effective sample size for within a reasonable amount of time. In order to ensure that the effective sample sizes for are large enough to adequately characterize the stationary distribution, therefore, we employ an empirical Bayes approach. We first find a value which approximately maximizes the marginal likelihood through Monte Carlo expectation-maximization (MCEM) algorithm (Casella, 2001). We then run the two samplers conditional on this value . The MCEM algorithm used here is essentially the same as in Park and Casella (2008) with only one difference — after sampling and from their conditionals, we update by maximizing the log density of with the local shrinkage parameter marginalized out.

We check for significant differences between the two sets of samples as follows. We first set and to be the posterior means estimated by averaging the samples from the direct Gibbs (used as a benchmark) and CG-accelerated Gibbs. Figure 5.6(a) graphically compares these two estimators as an informal sanity check. We then estimate the effective sample sizes of from the respective samplers using the R CODA package (Plummer et al., 2006). These estimated effective sample sizes can be used to estimate the Monte Carlo standard deviations of the differences (Geyer, 2011). When the two sets of MCMC samples have the same stationary distribution, the standardized differences are approximately distributed as the standard Gaussians by the Markov chain central limit theorem (Geyer, 2011).

Figure 5.6(b) confirms that the distribution of the standardized differences closely follows the “null” distribution. Figure 5.6(a) and 5.6(b) are based on the posterior samples for the propensity score model (5.24), but we obtained essentially the same result under the treatment effect model (5.25). We additionally perform the same diagnostic on the estimators of the posterior second moment of and obtained similar results.

### 5.7 Results and clinical conclusions from sparse regression models

The propensity score model finds that the patients treated by dabigatran and warfarin have substantial differences in their predictor characteristics. More precisely, a significant fraction of the individuals are much more likely to have been treated by one drug over the other as can be seen from Figure 5.8.111Figure 5.8 shows distributions of preference score, a transformation of propensity score that has been suggested as a more interpretable measure of the difference in covariate characteristics between the treated and control groups (Walker et al., 2013). With the Bayesian approach, it is also straightforward to obtain uncertainty in the propensity score estimates. For instance, we can easily characterize uncertainty in the inverse probability weight, defined as a function of the propensity score as if in the treated group and if in the control group. Inverse probability weights are widely used due to their highly desirable theoretical properties, but have known issues of being unstable (Stuart, 2010). In fact, Figure 5.8 shows that the posterior uncertainties of the large inverse probability weights are as large as their posterior means.

Two of the most significant predictors are the year of the treatment and age group. Both predictors have been encoded as binary indicators in the design matrix for simplicity, but the categorical and ordinal predictors in our model could have been estimated with shrinkage priors analogous to (Bayesian) grouped or fused lasso (Hastie et al., 2009; Kyung et al., 2010; Xu et al., 2015). The posterior mean and 95% credible intervals of these regression coefficients are shown in Figure 5.9. Figure 5.9 shows the effect sizes relative to the year 2010 and the age group ; when actually fitting the model, however, we use the most common category as a baseline for categorical variables.

For the treatment effect model, Figure 5.10(a) shows the posterior distribution of the average treatment effect of dabigatran over warfarin. The posterior indicates an evidence, though not a conclusive one, for the lower incidence rate of intracranial hemorrhage attributable to dabigatran. This is consistent with findings of Graham et al. (2015). The violin plot of Figure 5.10(b) summarizes the posterior distributions of the baseline incident rates within each propensity score strata. While the differences in the baseline incident rates may seem insignificant because of the overlaps in their posterior marginals, there is significant covariation among the baseline incident rates. In fact, the baseline for the 2nd quintle has more than 95% posterior probability of being larger than those for the 1st and 5th quintile.

## 6 Acknowledgment

We would like to thank Yuxi Tian and Martijn Schuemie for their help in wrangling the data set used in Section 5. We also thank Jianfeng Lu and Ilse Ipsen for useful discussions on linear algebra topics. Finally, this work is partially supported by National Science Foundation grant DMS 1264153 and National Institutes of Health grant U19 AI135995.

## Appendix A Proofs

We first derive Theorem 2.6 as a consequence of Theorem 2.5 before we proceed to prove Theorem 2.5. {proof}[Theorem 2.6] By Theorem B.4 below, the -th CG iterate satisfies the following bound:

 ∥βm+m′−β∥Φ∥β0−β∥Φ≤2(√νm+1/νp−1√νm+1/νp+1)m′, (A.27)

where denotes the -th largest eigenvalue of . By Theorem 2.5, we know that

 1≤νp≤νm+1≤1+mink+ℓ=mτ2λ2(k+1)νℓ+1((X\raisebox{2.15pt}{⊺}ΩX)(−k))=˜κm (A.28)

and hence that . Since the function is increasing in , we can upper bound the right-hand side of (A.27) in terms of , yielding the desired inequality (2.15). \qed

{proof}

[Theorem 2.5] We prove the more general inequality (2.14). The lower bound is an immediate consequence of Proposition A.1. For the upper bound, first note that by the Poincaré separation theorem (Theorem A.2). From the expression (2.11) for , we have

 νℓ(~Φ(−k)) =νℓ(Ik+τ2Λ(−k)(X\raisebox{2.15pt}{⊺}ΩX)(−k)Λ(−k)) (A.29) =1+τ2νℓ(Λ(−k)(X% \raisebox{2.15pt}{⊺}ΩX)(−k)Λ(−k)),

where the second equality follows from Proposition A.1. Applying Lemma A.3 with and , we obtain

 νℓ(~Φ(−k))≤1+τ2λ2(k+1)νℓ((X\raisebox{2.15pt}{⊺}ΩX)(−k)). (A.30)

Thus we have shown

 νk+ℓ(~Φ)≤1+τ2λ2(k+1)νℓ((X\raisebox{2.15pt}{⊺}ΩX)(−k))≤1+τ2λ2(k+1)νℓ(X% \raisebox{2.15pt}{⊺}ΩX), (A.31)

where the inequality follows again from the Poincaré separation theorem. \qed

###### Proposition A.1

Given a symmetric matrix , the eigenvalues of the matrix are given by for .

{proof}

The result follows immediately from the spectral theorem for normal matrices (Horn and Johnson, 2012).

###### Theorem A.2 (Poincaré separation theorem)

For a given symmetric matrix , let denote a sub-matrix with the first rows and columns removed from . Then the eigenvalues of a symmetric matrix and its sub-matrix satisfies

 νk+ℓ(A)≤νℓ(A(−k))≤νℓ(A) (A.32)

for any . Since permuting the rows and columns of does not change its eigenvalues, the above inequality in fact holds for any sub-matrix of obtained by removing rows and columns of corresponding to an arbitrary set of common row and column indices .

{proof}

See Chapter 4.3 of Horn and Johnson (2012). \qed

###### Lemma A.3

Let and be symmetric positive definite matrices and suppose that the largest eigenvalue of satisfies . Then we have

 νk(B1/2AB1/2)≤νk(A)  for  k=1,…,p (A.33)

where denotes the -th largest eigenvalue of a given matrix.

{proof}

The result follows immediately from Ostrowski’s theorem (Theorem 4.5.9 in Horn and Johnson (2012)). \qed

## Appendix B Theories behind convergence behavior of CG

In this section, we provide mathematical foundations behind the claims made in Rule of Thumb 2.4. In essence, Rule of Thumb 2.4 is our attempt at describing a phenomenon known as the super-linear convergence of CG in a quantitative yet accessible manner. While this is a well-known phenomenon among the researchers in scientific computing, it is rarely explained in canonical textbooks and reference books in numerical linear algebra.222For example, the discussions beyond Theorem 2.2 and 2.3 cannot be found in, to name a few, \citesupplementtrefethen1997numerical_linalg, \citesupplementdemmel1997numerical_linalg, \citesupplementsaad2003iterative-methods, and \citesupplementgolub2012matrix. Here we bring together some of the most practically useful results found in the literature and present them in a concise and self-contained manner. Our presentation in Section B.1 and B.2 is roughly based on Section 5.3 of \citesupplementvorst2003iterative with details modified, added, and condensed as needed. More comprehensive treatment of the known results related to CG is found in \citesupplementmeurant2006lancoz-and-cg. \citesupplementkuijlaars2006krylov_method_and_potential_theory sheds additional light on CG convergence behaviors by studying them from the potential theory perspective.

Section B.1 explains the critical first step in understanding the convergence of CG applied to a positive definite system — relating the CG approximation error to a polynomial interpolation error over the set consisting of the eigenvalues of . Equipped with this perspective, one can understand Theorem 2.2 as a generic and rather crude bound, ignoring the distributions of ’s in-between the largest and smallest eigenvalues (Theorem B.3). Theorem 2.3 similarly follows from the polynomial approximation perspective.

The effects of the largest eigenvalues on CG convergence, as stated in Rule of Thumb 2.4, is made mathematically precise in Theorem B.4. Analyzing how the smallest eigenvalues delay CG convergence is more involved and requires a discussion of how the eigenvalues of are approximated by the Krylov subspace. The amount of initial delay in CG convergence is closely related to how quickly these eigenvalue approximations converge. A precise statement is given in Theorem B.5.

Section B.3 provides the proofs of all the results stated in this section.

### b.1 CG approximation error as polynomial approximation error

The space of polynomials as defined below plays a prominent role in the behavior of a worst-case CG approximation error:

 Pk={Qk(ν):Qk is a polynomial of degree k with % Qk(0)=1}. (B.34)

Proposition B.1 below establishes the connection between CG and the polynomial space .

###### Proposition B.1

The difference between the -th CG iterate and the exact solution can be expressed as

 βk−β=Rk(Φ)(β0−β)  for Rk=argminQk∈Pk∥Qk(Φ)(β0−β)∥Φ. (B.35)

In particular, the following inequality holds for any :

 ∥βk−β∥Φ≤∥Qk(Φ)(β0−β<