# Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm

###### Abstract

Bayesian variable selection regression (BVSR) is able to jointly analyze genome-wide genetic datasets, but the slow computation via Markov chain Monte Carlo (MCMC) hampered its wide-spread usage. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of the BVSR model-fitting by ten fold. The iterative method hinges on the complex factorization of the sum of two matrices and the solution path resides in the complex domain (instead of the real domain). Compared to the Gauss-Seidel method, the complex factorization converges almost instantaneously and its error is several magnitude smaller than that of the Gauss-Seidel method. More importantly, the error is always within the pre-specified precision while the Gauss-Seidel method is not. For large problems with thousands of covariates, the complex factorization is 10 – 100 times faster than either the Gauss-Seidel method or the direct method via the Cholesky decomposition. In BVSR one needs to repetitively solve large penalized regression systems whose design matrix only changes slightly between adjacent MCMC steps. This slight change in design matrix enables the adaptation of the iterative complex factorization method. The computational innovation will facilitate wide-spread use of BVSR in reanalyzing the genome-wide association datasets.

Keywords: fastBVSR, complex factorization, Gauss-Seidel, exchange algorithm.

## 1 Introduction

Bayesian variable selection regression (BVSR) can jointly analyze genome-wide genetic data to produce posterior probability of association for each covariate and estimate hyperparameters such as heritability and the number of covariates that have nonzero effects [Guan and Stephens, 2011]. But the slow computation due to model averaging using Markov chain Monte Carlo (MCMC) hampered its otherwise warranted wide-spread usage. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of the BVSR model-fitting by ten fold.

Model and priors. We first introduce briefly the BVSR before we dive into computational issues. Our model, prior specification, and notations follow closely those of Guan and Stephens [2011]. Consider the linear regression model

(1) |

where is an column-centered matrix, denotes an identity matrix of proper dimension, and are -vectors, is an -vector, and MVN stands for multivariate normal. Let be an indicator of the -th covariate having a nonzero effect and denote . A spike-and-slab prior for (the -th component of ) is specified below

(2) | ||||

where is the proportion of covariates that have non-zero effects and is the variance of prior effect size. We will specify priors for both later. We use noninformative priors on parameters and ,

(3) | |||||

where Gamma is in the shape-rate parameterization. As pointed out in Guan and Stephens [2011], prior (3) is equivalent to , which is known as Jeffreys’ prior [Ibrahim and Laud, 1991, O’Hagan and Forster, 2004]. Some may favor a simpler form , which makes no practical difference [Berger et al., 2001, Liang et al., 2008]. One advantage of priors (3) is that its (null-based) Bayes factor is invariant to the shifting and scaling of .

For a given and , Bayes factor can be computed in closed form

(4) |

where denotes the submatrix of with columns for which , denotes the matrix determinant, and is the posterior mean for given by

(5) |

Note in (4) parameters , , and are integrated out. The null-based Bayes factor is proportional to the marginal likelihood with respect to parameters and , for which we specify priors below.

To specify prior for , Guan and Stephens [2011] introduced a hyperparameter (which stands for heritability) such that

(6) |

where denotes the variance of the -th covariate. Conditional on , specifying a prior on will induce a prior on , so later we may write to emphasize is a function of and . Since is motivated by the narrow sense heritability, its prior is easy to specify and we use

(7) |

by default to reflect our lack of knowledge on heritability, although one can impose a strong prior by specifying a uniform distribution on a narrow range. A bonus of specifying prior on through is that when the induced prior on is heavy-tailed [Guan and Stephens, 2011].

Up till now, we follow faithfully the model and the prior specification of Guan and Stephens [2011]. To specify the prior on is equivalent to specifying the prior on , and Guan and Stephens [2011] specified prior on as uniform on its log scale, , which is equivalent to for . Guan and Stephens [2011] samples , but here we do something slightly different by integrating out analytically. This is sensible because is very informative on . Specifically, we integrate over such that to obtain the marginal prior on

(8) |

where the finite integral is related to truncated Beta distribution and can be evaluated analytically. If we let go to and go to we have an improper prior and the marginal prior on becomes

(9) |

where we recall is the number of total markers, is the number of selected variates in the model, and denotes the Gamma function. Even for an improper prior on , is proper simply because it is defined on a finite set.

Computation. We have the posterior

(10) |

We use Markov chain Monte Carlo (MCMC) to sample this posterior to make inference, and our sampling scheme follows closely that of Guan and Stephens [2011]. To evaluate (10) for each parameter pair , we need to compute Bayes factor (4). Two time-consuming calculations are matrix determinant and defined in (5), both of which have cubic complexity (in ). The computation of the determinant can be avoided by using a MCMC sampling trick which we will discuss later. The main focus of the paper is a novel algorithm to evaluate (5), which reduces its complexity from cubic to quadratic.

The rest of the paper is structured as follows. Section 2 introduces the iterative complex factorization (ICF) algorithm. Section 3 describes how to incorporate the ICF into BVSR. Both sections contain numerical examples, including a real dataset from genome-wide association studies. A short discussion concludes the paper.

## 2 The iterative complex factorization

Iterative methods can be used to solve a linear system. If the convergence is quick, iterative methods can be significantly faster than direct solving the linear system using the Cholesky decomposition. Before introducing our algorithm, we first briefly review existing iterative methods.

### 2.1 Existing iterative methods

We are concerned with solving the following linear system

(11) |

where is a diagonal matrix with positive (but not necessarily identical) entries on the diagonal, and is . Clearly (5) is a special case of (11). In Guan and Stephens [2011], (11) was solved using the Cholesky decomposition of , which requires flops [Trefethen and Bau III, 1997, Lec. 23]. In fact, the most time-consuming part is to compute which requires flops, but when only changes by a few columns updating only requires flops.

Define

(12) |

Now write where () is the strictly lower (upper) triangular component and contains only the diagonals. Then three popular iterative procedures can be summarized below

The successive over-relaxation (SOR) method is a generalization of the Gauss-Seidel method, where is called the relaxation parameter. When is positive definite, the Gauss-Seidel method always converges and the SOR method converges for [Golub and Van Loan, 2012, Chap. 10.1.2]. For all three iterative methods, each iteration requires flops. Thus, whether an iterative method is more efficient than the Cholesky decomposition depends on how many iterations it takes to converge. Another notable class of iterative methods is called Krylov subspace method. Two famous examples are the steepest descent and the conjugate gradient [Trefethen and Bau III, 1997, Lec. 38].

### 2.2 The iterative complex factorization

Let be the Cholesky decomposition of , where is upper triangular. In the context of BVSR, if the Cholesky decomposition of is known, then the Cholesky decomposition of a new matrix can be obtained efficiently if differs from only by one or a few columns (Section 3.1). We want to solve

(13) |

We perform the following decomposition

(14) | ||||

where is the imaginary unit. Then we have the update where the right-hand side is a complex vector, and can be obtained by a forward and a backward substitution involving two complex triangular matrices. This update, however, diverges from time to time. Examining the details of the observed divergence reveals that the culprit is the imaginary part of . Because the solution is real, discarding the imaginary part of at the end of each iteration will not affect the fixed point to which the iterative method converges. Denoting the real part of a complex entity (scalar or vector) by Re, the general update of the ICF becomes

(15) |

where we have also introduced a relaxation parameter . Intuitively makes the update lazy to avoid over-shooting. The revised update converges almost instantaneously, in a few iterations, compared to a few dozen to a few hundred iterations with the Gauss-Seidel method. Each iteration of (15) requires flops, twice that required by a Gauss-Seidel iteration, because ICF operates complex (instead of real) matrices and vectors. The right-hand side of (15) can be reorganized as Note that can be computed via forward and backward substitutions and does not require matrix inversion.

### 2.3 ICF converges to the right target

###### Proposition 1.

Denote Then ICF in (15) converges for any starting point if where denotes the spectral radius.

###### Proof.

The statement follows faithfully Theorem 10.1.1 of Golub and Van Loan [2012]. ∎

###### Theorem 2.

There exists such that ICF detailed in (15) converges to the true solution.

###### Proof.

Using the notations defined in (12) and (14), we have . The imaginary part of can be expressed by

Both and are real matrices, and by the Woodbury Identity we have

(16) |

Immediately, for a fixed , the spectrum of is fully determined by the spectrum of . Because is skew-symmetric, we have . Since is symmetric, so does Then we have , which means is skew-symmetric. The eigenvalues of the matrix , which are identical to those of , are conjugate pairs of pure imaginaries or zero. Let be such a pair with and be the eigenvector corresponding to the eigenvalue . Thus we have , which produces where denotes the conjugate transpose of . is a Hermitian positive definite matrix, which yields

Since is also positive definite, we have

(17) |

From (16), the eigenvalues of are identical pairs equal to (and the expression includes the situation ). Proposition 1 requires , or equivalently

(18) |

holds for all . The existence of follows from (17) and (18). ∎

Next we provide a theory-guided procedure to tune the relaxation parameter . The following proposition connects spectral radius of with and .

###### Proposition 3.

###### Proof.

From the proof of Theorem 2, the smallest and the largest eigenvalue of are and with . Adjusting for their signs, we obtain (19). The right-hand side of (19) is a function of , with the first item decreasing in and the second item increasing in . The minimum of (19) is attained when the two quantities in the braces are equal, which proves (20). ∎

By the property of skew-symmetric matrix, when is odd. When is even, it is usually extremely small for a moderate sample size. Therefore can be safely assumed as . We start with . Suppose at the -th iteration we can produce an estimate for the spectral radius of . Then using (19), can be estimated by Plugging this into (20) we obtain the update for

Note the update does not involve , but only and . Note also that is a decreasing function of . Both observations make intuitive sense. Finally, to estimate the spectral radius of we use

where denotes the -norm. The procedure works well in our numerical studies. To take care of boundary conditions, we use for .

### 2.4 ICF outperforms other methods

Our numerical comparisons were based on real datasets of genome-wide association studies downloaded from dbGaP. The details of the datasets can be found in Section 3.3. Because the convergence of iterative methods is sensitive to the collinearity in the design matrix , our comparison studies used two datasets: the first one contains K SNPs sampled across the whole genome, and the other contains K SNPs that are physically adjacent. The first dataset has little or no collinearity (henceforth referred to as IND), and the second has collinearity due to linkage disequilibrium (henceforth referred to as DEP). The sample size is for both datasets. Our numerical studies compared different methods (detailed below) for their speed and accuracy of solving (11). For a given , for one experiment we sampled without replacement columns from IND (or DEP) dataset to obtain , simulated under the null , and solved (11) using different methods with which corresponds to For each we conducted independent experiments.

Our initial studies compared ICF with six other methods: the Cholesky decomposition (Chol), the Jacobi method, the Gauss-Seidel method (GS), the successive over-Relaxation (SOR) method, the steepest descent method, and the conjugate gradient (CG) method. We excluded the Jacobi method and the steepest descent method due to their poor performance. To ensure the fair comparison in the context of BVSR, the starting point for Chol, GS, and SOR is being obtained, and the starting point for ICF is that such that is obtained. For SOR, we have to choose a value for the relaxation parameter , which is known to be very difficult. A solution was provided by Young [1954] (see also Yang and Matthias [2007]), but we observed its difficulty when . After trial and error, we settled on using , which appeared to be optimal in our numerical studies. For all iterative methods, we start from and stop if

(21) |

or the number of iterations exceeds . The computer code for different methods was written in C++ and was run in the same environment. The Cholesky decomposition was implemented using GSL (GNU Scientific Library) [Gough, 2009], and GS and SOR were implemented in a manner that accounted for sparsity in to obtain maximum efficiency [Golub and Van Loan, 2012, Chap. 10.1.2].

Dataset | p | Time (in seconds) | Convergence failures | |||||||

Chol | ICF | GS | SOR | CG | ICF | GS | SOR | CG | ||

IND | 50 | 0.034 | 0.019 | 0.016 | 0.025 | 0.104 | 0 | 0 | 0 | 0 |

100 | 0.20 | 0.07 | 0.07 | 0.09 | 0.45 | 0 | 0 | 0 | 0 | |

200 | 1.38 | 0.30 | 0.34 | 0.39 | 2.29 | 0 | 1 | 1 | 0 | |

500 | 21.0 | 2.7 | 3.6 | 2.9 | 19.6 | 0 | 2 | 1 | 0 | |

1000 | 161 | 14 | 36 | 25 | 136 | 0 | 11 | 7 | 0 | |

DEP | 50 | 0.035 | 0.020 | 0.029 | 0.031 | 0.107 | 0 | 8 | 6 | 0 |

100 | 0.20 | 0.08 | 0.23 | 0.19 | 0.51 | 0 | 29 | 20 | 0 | |

200 | 1.39 | 0.45 | 2.13 | 1.69 | 2.66 | 0 | 125 | 93 | 0 | |

500 | 21.0 | 5.5 | 36.1 | 32.8 | 30.1 | 0 | 621 | 514 | 0 | |

1000 | 160 | 36 | 183 | 180 | 244 | 0 | 979 | 951 | 0 | |

The results are summarized in Table 1. For IND dataset, three iterative methods (ICF, GS and SOR) appear on par with each other for smaller . For ICF outperforms the other two, and is 10 times faster than the Cholesky decomposition. On average it took ICF iterations to converge, and it never failed to converge in all numerical studies. On the other hand, both GS and SOR failed to converge, at least once, for large . For DEP dataset, we note that ICF is fastest among all the methods compared. For larger () ICF is times faster than Cholesky decomposition, and times faster than the other three methods. Both GS and SOR had difficulty in converging within iterations for large . We also tweaked simulation conditions to check whether the results were stable. For example, we tried to simulate under the alternative instead of the null, and to initialize in different manners, such as using unpenalized linear estimates. The results remained essentially unchanged under different tweaks.

Finally, we compared the accuracy of different iterative methods, where the truth was obtained from the Cholesky decomposition method. In this experiment we used IND dataset and compared for and . Each method was repeated experiments. The maximum entry-wise deviation (denoted by ) was obtained for each method, each experiment, under each simulation condition. Only those converged experiments were included in the comparison. Figure 1 shows the distribution of for different methods. Clearly, ICF outperforms CG and SOR by a large margin. (GS is omitted since its accuracy underperformed that of SOR in almost every experiment.) Because in real applications we are oblivious to the truth (or the truth is expensive to obtain), a method is more desirable if its deviation from the truth is within the pre-specified precision. Figure 1 shows that ICF always achieves the pre-specified precision () while the other two methods do not. Moreover, the deviation of ICF is orders of magnitude smaller than that of SOR, and orders of magnitude smaller than that of the CG method. We repeated the experiments for DEP dataset and made the same observations.

## 3 ICF dramatically increases speed of BVSR

### 3.1 Incorporating ICF into BVSR

As we alluded to in the introduction, the most time-consuming step in MCMC for variable selection is the computation of the Bayes factor (4) for every proposed (superscript denotes the proposed value). To successfully incorporate ICF into BVSR, we need to overcome two hurdles: computing the determinant term in (4) (or avoid it), and efficiently obtaining the Cholesky decomposition of .

Computing matrix determinant has a cubic complexity. If this is not avoided, using ICF to compute becomes pointless. When evaluating (4), we need to compute the determinant which takes operations. Identifying that the determinant is a normalization constant, avoiding computation of the determinant becomes a well studied problem in the MCMC literature which deals with the so-called “doubly-intractable distributions,” meaning two nested unknown normalization constants [Møller et al., 2006, Murray et al., 2012]. Recall that we want to sample

(22) | ||||

where , and it is the computation of that we want to avoid. For a naive Metropolis-Hastings algorithm, if the current state is and the proposed move is , we need to evaluate the Hastings ratio

(23) |

where

(24) | ||||

and is the compound proposal distribution of proposing from The proposed move is accepted with the probability which is called the Metropolis rule. Apparently, both and need to be evaluated in a naive MCMC implementation.

To avoid computing , note that

(25) |

holds for all , and if is sampled from , the ratio becomes a one-sample importance-sampling estimate of , and we can plug in to replace leaving the posterior invariant. This is the essence of the ingenious exchange algorithm of Murray et al. [2012].

To tackle the second difficulty, notice that in each MCMC iteration, usually only one or a few entries of are flipped between and . Such proposals are often called “local” proposals, and Markov chains using local proposals usually have high acceptance ratios, which indicate the chains are “mixing” well [Guan and Krone, 2007]. When one covariate is added into the model, the new Cholesky decomposition, , can be obtained from the previous decomposition by a forward substitution. When one covariate is deleted from the model, we simply delete the corresponding column from and introduce new zeros by Givens rotation [Golub and Van Loan, 2012, Chap. 5.1], which requires flops if the -th predicator is removed. We provide a toy example in the Appendix to demonstrate our update on the Cholesky decomposition.

### 3.2 Numerical Examples

We developed a software package, fastBVSR, to fit the BVSR model by incorporating ICF and the exchange algorithm into the MCMC procedure. The algorithm is summarized in Appendix. The software was written in C++ and available at http://www.haplotype.org/software.html. In fastBVSR, we also implemented Rao-Blackwellization, described in Guan and Stephens [2011], to reduce the variance of the estimates for and . By default, Rao-Blackwellization is done every 1000 iterations.

To check the performance of fastBVSR, we performed simulation studies based on a real dataset described in Yang et al. [2010] (henceforth the Height dataset). The Height dataset contains subjects and common SNPs (minor allele frequency ) after routine quality control [c.f. Xu and Guan, 2014]. We sampled SNPs across the genome to perform simulation studies. Our aim was to check whether fastBVSR can reliably estimate heritability in the simulated phenotypes. To simulate phenotypes of different heritability, we randomly selected causal SNPs (out of ) to obtain . For each selected SNP we drew its effect size from the standard normal distribution to obtain . Then we simulated the standard normal error term to obtain , and scaled the simulated effect sizes simultaneously using such that and the heritability, , is (taking values from to ). This was the same procedure used in Guan and Stephens [2011]. For each simulated phenotype, we ran fastBVSR to obtain the posterior estimate of . We compared fastBVSR with GCTA [Yang et al., 2011], which is a software package to estimate heritability and do prediction using the linear mixed model. For heritability estimation, GCTA has been shown to be unbiased and accurate in a wide range of settings [Yang et al., 2010, Lee et al., 2011]. The result is shown in Figure 2. Both fastBVSR and GCTA can estimate the heritability accurately; the mean absolute error is for fastBVSR and for GCTA. Noticeably, fastBVSR has a smaller variance but a slight bias when the true heritability is large.

Next we compare the predictive performance of fastBVSR and GCTA. We first define the mean squared error as a function of ,

Then following Guan and Stephens [2011] we define relative prediction gain (RPG) to measure the predictive performance of

The advantage of RPG is that the scaling of does not contribute to RPG, so that simulations with different heritability can be compared fairly. Clearly when , ; when , . Figure 3 shows that fastBVSR has much better predictive performance than GCTA, which reflects the advantage of BVSR over the linear mixed model. This advantage owes to the model averaging used in BVSR [Raftery et al., 1997, Broman and Speed, 2002]. Besides, note that BVSR with Rao-Blackwellization performs slightly better than BVSR alone.

Lastly we check the calibration of the posterior inclusion probability (PIP), . We pool the results of the sets of simulations with different heritability (from to at increment of ). In each set there are estimated PIPs, one for each SNP, and in total there are PIPs. We group these PIPs into bins for , and for each bin we compute the fraction of true positives. If the PIP estimates are well-calibrated, we expect the average PIP roughly equals the fraction of true positives in each bin. Figure 4 shows that the PIP estimated by BVSR is conservative, which agrees with a previous study [Guan and Stephens, 2011], and the PIP estimated by Rao-Blackwellization is well cablibrated.

### 3.3 Analysis of real GWAS Datasets

We applied fastBVSR to analyze GWAS datasets concerning intraocular pressure (IOP). We applied and downloaded two GWAS datasets from the database of Genotypes and Phenotypes (dbGaP), one for glaucoma (dbGaP accession number: phs000238.v1.p1) and the other for intraocular hypertension (dbGaP accession number: phs000240.v1.p1). The combined dataset contains autosomal SNPs and subjects, and was used previously by Zhou and Guan [2017] to examine the relationship between p-values and Bayes factors.

We conducted five independent runs with different random seeds. The sparsity parameters for BVSR were set as and , which reflected a prior of marginally associated SNPs. The posterior mean model size in each run ranges from to (Table 2). The average number of iterations for ICF to converge is (% interval = ), which suggests that ICF is effective in analyzing real datasets. The speed improvement of fastBVSR over piMASS (a BVSR implementation that uses the Cholesky decomposition to fit Bayesian linear regression, a companion software package of Guan and Stephens [2011]) is dramatic: with model size around and more than individuals, fastBVSR took hours to run million MCMC iterations, compared to more than hours used by piMASS on problems of matching size. Out of concern for the cumulative error in obtaining by updating the Cholesky decomposition (detailed in Section 3.1), we compared every steps the updated Cholesky decomposition with the directly computed Cholesky decomposition, and results suggested that the entry-wise difference was negligible.

We examine the inference of the hyperparameters, namely, the narrow sense heritability , the model size , and the sparsity parameter . Table 2 shows the inference on these three parameters are more or less consistent across five independent runs, and thus we combine the results from the five independent runs in the following discussion. The posterior mean for is with credible interval This estimate of the heritability is smaller than those reported in the literature: 0.62 from a twin study [Carbonaro et al., 2008], 0.29 from a sibling study [Chang et al., 2005], and 0.35 from an extended pedigree study [van Koolwijk et al., 2007]. For comparison, using the same dataset, heritability estimated by GCTA is with standard error . The underestimation of using fastBVSR is perhaps due to over-shrinking of the effect sizes (the posterior mean of prior effect size is ). The posterior mean for is , which suggests that IOP is a very much polygenic phenotype, agreeing with the complex etiology of glaucoma [RN et al., 2014], to which high IOP is a precursor phenotype.

Combined | Run1 | Run2 | Run3 | Run4 | Run5 | |

0.26 | 0.25 | 0.24 | 0.29 | 0.26 | 0.24 | |

474 | 446 | 395 | 622 | 476 | 430 | |

0.0016 | 0.0015 | 0.0013 | 0.0021 | 0.0016 | 0.0014 |

Lastly, we examine the top marginal signals detected by fastBVSR. Table 3 lists top SNPs based on PIP inferred from fastBVSR, at an arbitrary PIP threshold of . Table 3 also includes PPA (posterior probability of association) based on the single SNP test for two choices of prior effect sizes: which is the posterior mean of inferred from fastBVSR and which is a popular choice for single SNP analysis. We use the posterior mean of inferred from fastBVSR as the prior odds of association. Multiplying a Bayes factor and the prior odds we obtain the posterior odds, from which we obtain the First the two columns of PPAs are highly similar in both ranks and magnitudes. This observation agrees with our experience and theoretical studies [Zhou and Guan, 2017] that modest prior effect sizes produce similar evidence for association. Second, PPA and PIP have significantly different rankings (Wilcox rank test ). There are two related explanations for this observation: if two SNPs are correlated, the PIPs tend to split between the two; conditioning on the other associated SNPs, the marginal associations tend to change significantly. Three known GWAS signals, rs12150284, rs2025751, rs7518099/rs4656461, which were replicated in our previous single-SNP analysis [Zhou and Guan, 2017], all show appreciable PIPs. Moreover, we note a potential novel association first reported in Zhou and Guan [2017]: two SNPs located in PEX14, rs12120962, rs12127400, have PIPs of and respectively.

SNP | Chr | Pos (Mb) | |||
---|---|---|---|---|---|

rs12120962 | 1 | 10.53 | 0.85 | 0.84 | 0.57 |

rs12127400 | 1 | 10.54 | 0.74 | 0.74 | 0.27 |

rs4656461 | 1 | 163.95 | 1.0 | 0.96 | 0.36 |

rs7518099 | 1 | 164.00 | 1.0 | 0.98 | 0.42 |

rs7645716 | 3 | 46.31 | 0.62 | 0.56 | 0.35 |

rs2025751 | 6 | 51.73 | 0.81 | 0.82 | 0.68 |

rs10757601 | 9 | 26.18 | 0.47 | 0.55 | 0.31 |

rs9783190 | 10 | 106.76 | 0.39 | 0.31 | 0.32 |

rs1381143 | 12 | 86.76 | 0.27 | 0.37 | 0.31 |

rs4984577 | 15 | 93.76 | 0.43 | 0.49 | 0.42 |

rs12150284 | 17 | 9.97 | 0.99 | 0.97 | 0.92 |

## 4 Discussion

We developed a novel algorithm, iterative complex factorization, to solve a class of penalized linear systems, proved its convergence, and demonstrated its effectiveness and efficiency through simulation studies. The novel algorithm can dramatically increase the speed of BVSR, which we demonstrated by analyzing a real dataset. One limitation of the ICF is that it requires Cholesky decomposition to obtain the effective complex factorization. Nevertheless, ICF still has a range of potential applications. The general ridge regression [Draper and Van Nostrand, 1979] is an obvious one, and here we point out two other examples. The first is the Bayesian sparse linear mixed model (BSLMM) of Zhou et al. [2013], which can be viewed as a generalization of BVSR. BSLMM has two components, one is the linear component that corresponds to and the other is the random effect component. The linear component requires one to solve a system that is similar to (11) [Text S2 Zhou et al., 2013]; therefore ICF can be applied to BSLMM to make it more efficient. Another example to apply ICF is using the variational method to fit the BVSR [Carbonetto and Stephens, 2012, Huang et al., 2016]. In particular, Huang et al. [2016] estimated using an iterative method. In each iteration, the posterior mean for is updated by solving a linear system that has the same form as (11), where the dimension is equal to the number of covariates with nonzero PIP estimates, and the diagonal matrix depends on the PIP estimates. Applying ICF to the variational method might be more beneficial compared to BVSR because the dimension of the linear system is exceedingly larger in the variational method.

The iterative complex factorization method converges almost instantaneously, and it is far more accurate than other iterative algorithms such as the Gauss-Seidel method. Why does it converge so fast and at the same time so accurate? Our intuition is that the imaginary part of the update in (15) , which involves the skew-symmetric matrix , whose Youla decomposition [Youla, 1961] suggests that it rotates , scales and flips its coordinates, and rotates back, and which then appends itself to be the imaginary part of , provides a “worm hole” through the imaginary dimension in an otherwise purely real, unimaginary, solution path. Understanding how this works mathematically will have a far-reaching effect on Bayesian statistics, statistical genetics, machine learning, and computational biology.

Finally, we hope our new software fastBVSR, which is 10 – 100 fold faster than the previous implementation of BVSR, will facilitate the wide-spread use of BVSR in analyzing or reanalyzing datasets from genome-wide association (GWA) and expression quantitative trait loci (eQTL) studies.

## References

- Berger et al. [2001] James O Berger, Luis R Pericchi, JK Ghosh, Tapas Samanta, Fulvio De Santis, JO Berger, and LR Pericchi. Objective bayesian methods for model selection: introduction and comparison. Lecture Notes-Monograph Series, pages 135–207, 2001.
- Broman and Speed [2002] Karl W Broman and Terence P Speed. A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):641–656, 2002.
- Carbonaro et al. [2008] F Carbonaro, T Andrew, DA Mackey, TD Spector, and CJ Hammond. Heritability of intraocular pressure: a classical twin study. British Journal of Ophthalmology, 92(8):1125–1128, 2008.
- Carbonetto and Stephens [2012] Peter Carbonetto and Matthew Stephens. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108, 2012.
- Chang et al. [2005] Ta C Chang, Nathan G Congdon, Robert Wojciechowski, Beatriz Muñoz, Donna Gilbert, Ping Chen, David S Friedman, and Sheila K West. Determinants and heritability of intraocular pressure and cup-to-disc ratio in a defined older population. Ophthalmology, 112(7):1186–1191, 2005.
- Draper and Van Nostrand [1979] Norman R Draper and R Craig Van Nostrand. Ridge regression and james-stein estimation: review and comments. Technometrics, 21(4):451–466, 1979.
- Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
- Gough [2009] Brian Gough. GNU scientific library reference manual. Network Theory Ltd., 2009.
- Guan and Krone [2007] Y. Guan and S. M. Krone. Small-world mcmc and convergence to multi-modal distributions: From slow mixing to fast mixing. Annals of Applied Probability, 17:284–304, 2007.
- Guan and Stephens [2011] Yongtao Guan and Matthew Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, pages 1780–1815, 2011.
- Huang et al. [2016] Xichen Huang, Jin Wang, and Feng Liang. A variational algorithm for bayesian variable selection. arXiv preprint arXiv:1602.07640, 2016.
- Ibrahim and Laud [1991] Joseph G Ibrahim and Purushottam W Laud. On bayesian analysis of generalized linear models using jeffreys’s prior. Journal of the American Statistical Association, 86(416):981–986, 1991.
- Lee et al. [2011] Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Estimating missing heritability for disease from genome-wide association studies. The American Journal of Human Genetics, 88(3):294–305, 2011.
- Liang et al. [2008] Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 2008. ISSN 0162-1459.
- Møller et al. [2006] Jesper Møller, Anthony N Pettitt, R Reeves, and Kasper K Berthelsen. An efficient markov chain monte carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.
- Murray et al. [2012] Iain Murray, Zoubin Ghahramani, and David MacKay. Mcmc for doubly-intractable distributions. arXiv preprint arXiv:1206.6848, 2012.
- O’Hagan and Forster [2004] Anthony O’Hagan and Jonathan J Forster. Kendall’s advanced theory of statistics, volume 2B: Bayesian inference, volume 2. Arnold, 2004.
- Raftery et al. [1997] Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179–191, 1997.
- RN et al. [2014] Weinreb RN, Aung T, and Medeiros FA. The pathophysiology and treatment of glaucoma: A review. JAMA, 311(18):1901–1911, 2014. doi: 10.1001/jama.2014.3192. URL +http://dx.doi.org/10.1001/jama.2014.3192.
- Trefethen and Bau III [1997] Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.
- van Koolwijk et al. [2007] Leonieke ME van Koolwijk, Dominiek DG Despriet, Cornelia M van Duijn, Luba M Pardo Cortes, Johannes R Vingerling, Yurii S Aulchenko, Ben A Oostra, Caroline CW Klaver, and Hans G Lemij. Genetic contributions to glaucoma: heritability of intraocular pressure, retinal nerve fiber layer thickness, and optic disc morphology. Investigative ophthalmology & visual science, 48(8):3669–3676, 2007.
- Xu and Guan [2014] Hanli Xu and Yongtao Guan. Detecting local haplotype sharing and haplotype association. Genetics, 197(3):823–838, 2014.
- Yang et al. [2010] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, et al. Common snps explain a large proportion of the heritability for human height. Nature genetics, 42(7):565–569, 2010.
- Yang et al. [2011] Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
- Yang and Matthias [2007] Shiming Yang and K Gobbert Matthias. The optimal relaxation parameter for the sor method applied to a classical model problem. Technical report, Technical Report number TR2007-6, Department of Mathematics and Statistics, University of Maryland, Baltimore County, 2007.
- Youla [1961] D. C. Youla. A normal form for a matrix under the unitary congruence group. Canad. J. Math., 13:694–704, 1961.
- Young [1954] David Young. Iterative methods for solving partial difference equations of elliptic type. Transactions of the American Mathematical Society, 76(1):92–111, 1954.
- Zhou and Guan [2017] Quan Zhou and Yongtao Guan. On the null distribution of bayes factors in linear regression. Journal of American Statistical Association, page to appear, 2017.
- Zhou et al. [2013] Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet, 9(2):e1003264, 2013.

## Appendix

### Examples for updating the Cholesky Decomposition

Suppose we start with

such that

#### Add a covariate

To add one covariate we attach it to the last column of and denote the new matrix by , and

To compute new Cholesky decomposition we solve

which requires only one forward substitution to get

#### Remove a covariate

Remove the second covariate of and denote the new matrix by to get,

is obtained by removing the second column from . Note that . To make zero, we introduce Givens rotation matrix

Note , and . The new Cholesky decomposition is then given by

and the bottom row of zeros can be removed without affecting subsequent calculations.