# Gene-proximity models for Genome-Wide Association Studies

## Abstract

Motivated by the important problem of detecting association between genetic markers and binary traits in genome-wide association studies, we present a novel Bayesian model that establishes a hierarchy between markers and genes by defining weights according to gene lengths and distances from genes to markers. The proposed hierarchical model uses these weights to define unique prior probabilities of association for markers based on their proximities to genes that are believed to be relevant to the trait of interest. We use an expectation-maximization algorithm in a filtering step to first reduce the dimensionality of the data and then sample from the posterior distribution of the model parameters to estimate posterior probabilities of association for the markers. We offer practical and meaningful guidelines for the selection of the model tuning parameters and propose a pipeline that exploits a singular value decomposition on the raw data to make our model run efficiently on large data sets. We demonstrate the performance of the model in simulation studies and conclude by discussing the results of a case study using a real-world dataset provided by the Wellcome Trust Case Control Consortium.

## 1Introduction

A genome-wide association study (GWAS) aims to determine the subset of genetic markers that is most relevant to a particular trait of interest. From a statistical perspective, this task is usually framed as a regression problem where the response variables are measurements of either qualitative traits, e.g., a binary value indicating the presence or absence of a disease, or quantitative traits, e.g., a person’s blood pressure, and the explanatory variables are the number of reference alleles present at each marker, or single nucleotide polymorphism (SNP), as well as other covariates of interest such as age or smoking status. Many linear models [3] have been developed to detect associations between SNPs and traits, but they generally suffer from the “large , small ” problem where the ratio of the number of predictors, , to the sample size, , is on the order of hundreds to thousands [42]. Moreover, other issues such as collinearity in the covariates due to linkage disequilibrium [LD, [36]], rare variants, and population stratification result in inefficient estimation of model parameters and a loss in statistical power to detect significant associations [41].

A common strategy to overcome the large--small- problem in GWAS is to forgo analyzing the SNPs jointly and to model instead them independently. Although successful GWAS have employed this strategy [7], multiple hypothesis testing leads to an increase in the Type I error and the necessary correction for this may lead to an overly conservative threshold for statistical significance. Strategies such as grouping SNPs based on proximities to genes [45] or moving windows [46] have been proposed to allow for an increase in power by modeling SNPs jointly, but there is no universal agreement on how to define such windows or groupings. Similarly, other strategies include replacing a group of highly correlated SNPs with only one of its members [23], and removing or collapsing rare variants within a window into a score statistic [4], but again there is no agreement on how to choose which SNPs to retain or group. Recent approaches aim at gaining more power by pooling information across studies through meta-analysis [11].

Although significant progress has been made on GWAS since 2000, it is still a relevant and challenging problem with goals such as modeling interactions between SNPs, genes, and environment effects that await beyond the obstacles already mentioned [20]. In order to move towards a unifying framework for GWAS that allows for the large--small- problem and the SNP-specific issues to be addressed simultaneously in a principled manner, we propose a novel hierarchical Bayesian model that exploits spatial relationships on the genome to define SNP-specific prior distributions on regression parameters. More specifically, in our proposed setting we model markers *jointly*, but we explore a variable selection approach that uses marker proximity to relevant genomic regions, such as genes, to help identify associated SNPs. Our contributions are:

We focus on binary traits which are arguably more common to GWAS, e.g., case control studies, but more difficult to model due to lack of conjugacy. To circumvent the need for a Metropolis-Hastings step when sampling from the posterior distribution on model parameters, we use a recently proposed data augmentation strategy for logistic regression based on latent Pólya-Gamma random variables [35].

We perform variable selection by adopting a spike-and-slab prior [14] and propose a principled way to control the separation between the spike and slab components using a Bayesian false discovery rate similar to [43].

We use a novel weighting scheme to establish a relationship between SNPs and genomic regions and allow for SNP-specific prior distributions on the model parameters such that the prior probability of association for each SNP is a function of its location on the chromosome relative to neighboring regions. Moreover, we allow for the “relevance” of a genomic region to contribute to the effect it has on its neighboring SNPs and consider “relevance” values calculated based on previous GWAS results in the literature, e.g. see [28].

Before sampling from the posterior space using Gibbs sampling, we use an expectation-maximization [EM, [10]] algorithm in a filtering step to reduce the number of candidate markers in a manner akin to distilled sensing [19]. By investigating the update equations for the EM algorithm, we suggest meaningful values to tune the hyperprior parameters of our model and illustrate the induced relationship between SNPs and genomic regions.

We derive a more flexible centroid estimator [8] for SNP associations that is parameterized by a sensitivity-specificity trade-off. We discuss the relation between this parameter and the prior specification when obtaining estimates of model parameters.

We start by describing previous work and stating our contributions in Section 2. In Section 3 we define our Spatial Boost model and the novel relationship between SNPs and genes. In Section 4 we describe how to fit the model using a combination of a filtering step that exploits an EM filtering step and Gibbs sampling. We provide guidelines for the selection of model tuning parameters in Section 5. We then illustrate the performance of the model on simulated data using real SNPs in Section 6 and apply the model to a real-world GWAS data set provided by the Wellcome Trust Case Control Consortium (WTCCC) in Section 7. Finally, we conclude with a discussion on future extensions to this work in Section 8.

## 2Previous and Related Work

A common solution to large--small- problems is to use penalized regression models such as ridge regression [21], LASSO [39], or elastic net [48]. These solutions can be shown to be equivalent, from a Bayesian perspective, to maximum *a posteriori* (MAP) estimators under appropriate prior specifications. For instance, for LASSO, the penalty can be translated into a Laplace prior. However, since LASSO produces biased estimates of the model parameters and tends to select only one parameter in a group of correlated parameters [48], it is not suitable for GWAS.

Techniques like group LASSO, fused LASSO [40], or sparse group LASSO [12] further attempt to account for the structure of genes and markers or linkage disequilibrium by assigning SNPs to groups based on criteria such as gene membership and then placing additional penalties on the norm of the vector of coefficients for each group, or on the norm of the difference in coefficients of consecutive SNPs. However, it is difficult to define gene membership universally since genes have varying lengths and may overlap with each other; moreover, the penalty on the norm of the difference in consecutive SNPs neglects any information contained in the genomic distance between them.

It may be possible to develop additional, effective penalty terms within models, such as and , to address the issues present in GWAS data in a penalized regression framework, but because genotypes are more correlated for markers that are close in genomic space due to linkage disequilibrium, the most effective penalties would need to capture the relevance of a particular SNP as a function of its location on the genome. Moreover, since it is typically easier to study the biological function of genes, we are particularly interested in SNPs that lie close to genes [25]; as a result, the most desirable penalties would likely be SNP-specific. We accomplish this by exploiting biological knowledge about the structure of the genome to set SNP-specific prior distributions on the model parameters in a hierarchical Bayesian model.

Researchers have considered hierarchical Bayesian models for variable selection in GWAS and other large scale problems (e.g. [16]). Some recent models exploit Bayesian methods in particular to allow for data-driven SNP-specific prior distributions [17] which depend on a random variable that describes the proportion of SNPs to be selected. These approaches have adopted a continuous spike-and-slab prior distribution [14] on the model parameters, set an inverse-gamma prior distribution on the variance of the spike component of the prior, and control the difference in the variance of the spike and slab components of the prior using a tuning parameter.

To incorporate external information in a hierarchical Bayesian model, researchers analyzing a different kind of data, gene expression levels, have recently considered relating a linear combination of a set of predictor-level covariates that quantify the relationships between the genes to their prior probabilities of association through a probit link function [33]. This formulation leads to a second-stage probit regression on the probability that any gene is associated with a trait of interest using a set of predictor-level covariates that could be, for instance, indicator variables of molecular pathway membership. In our model, we propose a special case of this formulation tailored for GWAS data where: (i) we use the logit link instead of the probit link, (ii) the predictor-level covariates are spatial weights that quantify a SNP’s position on the genome relative to neighboring genes, and (iii) the coefficients of each of the predictor-level covariates are numerical scores that quantify the relevance of a particular gene to the trait of interest.

Fitting a penalized model to a large data set (e.g. ) is computationally intense and thus so is the process of selecting an optimal value for any tuning parameters. Side-stepping this problem, some researchers have had success in applying a suite of penalty terms (e.g. LASSO, Adaptive LASSO, NEG, MCP, LOG) to a *pre-screened* subset of markers [22] and investigating the concordance of significant markers across each of the final models. Although a pre-screening of markers from a marginal regression would ideally retain almost all of the relevant variables, penalized models such as LASSO could likely be improved by using a larger number of SNPs than those which pass an initial screening step (e.g. a genome-wide significance threshold) [26].

## 3Spatial Boost Model

We perform Bayesian variable selection by analyzing binary traits and using the structure of the genome to dynamically define the prior probabilities of association for the SNPs. Our data are the binary responses for individuals and genotypes for markers per individual, where codes the number of *minor* alleles in the -th individual for the -th marker. For the likelihood of the data, we consider the logistic regression:

We note that GWA studies are usually *retrospective*, i.e. cases and controls are selected irrespectively of their history or genotypes; however, as [29] point out, coefficient estimates for are not affected by the sampling design under a logistic regression. Thus, from now on, to alleviate the notation we extend to incorporate the intercept, , and also set .

We use latent variables and a continuous spike-and-slab prior distribution for the model parameters with the positive constant denoting the separation between the variance of the spike and the slab components:

For the intercept, we set or, equivalently, we define and include in . In the original spike-and-slab prior distribution, the slab component is a normal distribution centered at zero with a large variance or even a diffuse uniform distribution and the spike component is a degenerate distribution at zero [32]. This setup results in exact variable selection through the use of the ’s, since would imply that the -th SNP’s coefficient is exactly equal to zero. Here we use the continuous version of the spike-and-slab distribution [14] that became more popular by avoiding the spike discontinuities at zero and thus allowing for a relaxed form of variable selection that lends itself easily to an EM algorithm (see Section 4.1).

For the variance of the spike component in we adopt an inverse Gamma prior distribution, . We expect to be reasonably small with high probability in order to enforce the desired regularization that distinguishes selected markers from non-associated markers. Thus, we recommend choosing and so that the prior expected value of is small.

In the prior distribution for , we incorporate information from relevant genomic regions. The most common instance of such regions are *genes*, and so we focus on these regions in what follows. Thus, given a list of genes with gene *relevances* (see Section 3.2 for some choices of definitions), , and weights, , the prior on is

The weights are defined using the structure of the SNPs and genes and aim to account for gene lengths and their proximity to markers as a function of a spatial parameter , as we see in more detail next.

### 3.1Gene Weights

To control how much a gene can contribute to the prior probability of association for a SNP based on the gene’s length and distance to that SNP we introduce a *range* parameter . Consider a gene that spans genomic positions to , and the -th marker at genomic position ; the gene weight is then

Generating gene weights for a particular SNP is equivalent to centering a Gaussian curve at that SNP’s position on the genome with standard deviation equal to and computing the area under that curve between the start and end points of each gene. Figure 1 shows an example. As , the weight that each gene contributes to a particular SNP becomes an indicator function for whether or not it covers that SNP;as , the weights decay to zero. Intermediate values of allow then for a variety of weights in that encode *spatial* information about gene lengths and gene proximities to SNPs. In Section 5.1 we discuss a method to select .

According to , it might be possible for multiple, possibly overlapping, genes that are proximal to SNP to boost . To avoid this effect, we take two precautions. First, we break genes into non-overlapping genomic blocks and define the relevance of a block as the mean gene relevance of all genes that cover the block. Second, we normalize the gene weight contributions to in , , such that . This way, it is possible to compare estimates of across different gene weight and relevance schemes.

### 3.2Gene Relevances

We allow for the further strengthening or diminishing of particular gene weights using gene relevances . If we set and allow for all genes to be uniformly relevant, then we have a “non-informative” case. Alternatively, if we have some reason to believe that certain genes are more relevant to a particular trait than others, for instance on the basis of previous research or prior knowledge from an expert, then we can encode these beliefs through . In particular, we recommend using either text-mining techniques, e.g. [2], to quantify the relevance of a gene to a particular disease based on citation counts in the literature, or relevance scores compiled from search hits and citation linking the trait of interest to genes, e.g. [28].

## 4Model Fitting and Inference

The ultimate goal of our model is to perform inference on the posterior probability of association for SNPs. However, these probabilities are not available in closed form, and so we must resort to Markov chain Monte Carlo techniques such as Gibbs sampling to draw samples from the posterior distributions of the model parameters and use them to estimate . Unfortunately, these techniques can be slow to iterate and converge, especially when the number of model parameters is large [9]. Thus, to make our model more computationally feasible, we propose first filtering out markers to reduce the size of the original dataset in a strategy similar to distilled sensing [19], and then applying a Gibbs sampler to only the remaining SNPs.

To this end, we design an EM algorithm based on the hierarchical model above that uses all SNP data simultaneously to quickly find an approximate mode of the posterior distribution on and while regarding as missing data. Then, for the filtering step, we iterate between (1) removing a fraction of the markers that have the lowest conditional probabilities of association and (2) refitting using the EM procedure until the predictions of the filtered model degrade. In our analyses we filtered of the markers at each iteration to arrive at estimates and stopped if . Next, we discuss the EM algorithm and the Gibbs sampler, and offer guidelines for selecting the other parameters of the model in Section 5.

### 4.1EM algorithm

We treat as a latent parameter and build an EM algorithm accordingly. If then for the M-steps on and we maximize the expected log joint . The log joint distribution , up to a normalizing constant, is

and so, at the -th iteration of the procedure, for the E-step we just need . But since

then

for and .

To update and we employ conditional maximization steps [30], similar to cyclic gradient descent. From we see that the update for follows immediately from the mode of an inverse gamma distribution conditional on :

The terms in that depend on come from the log likelihood of and from the expected prior on , , where

Since updating is equivalent here to fitting a ridge regularized logistic regression, we exploit the usual iteratively reweighted least squares (IRLS) algorithm [27]. Setting as the vector of expected responses with and as the variance weights, the update for is then

where we substitute for in the definition of .

#### Rank truncation of design matrix

Computing and storing the inverse of the -by- matrix in is expensive since is large. To alleviate this problem, we replace with a rank truncated version based on its singular value decomposition . More specifically, we take the top singular values and their respective left and right singular vectors, and so, if and and are the -th left and right singular vectors respectively,

where is the -th order diagonal matrix with the top singular values and (-by-) and (-by-) contain the respective left and right singular vectors. We select by controlling the mean squared error: should be large enough such that .

Since , we profit from the rank truncation by defining the (upper) Cholesky factor of and so that

by the Kailath variant of the Woodbury identity [34]. Now we just need to store and compute the inverse of the -th order square matrix to obtain the updated in .

### 4.2Gibbs sampler

After obtaining results from the EM filtering procedure, we proceed to analyze the filtered dataset by sampling from the joint posterior using Gibbs sampling. We iterate sampling from the conditional distributions

until assessed convergence.

We start by taking advantage of the conjugate prior for and draw each new sample from

Sampling is also straightforward: since the are independent given ,

with as in . Sampling , however, is more challenging since there is no closed-form distribution based on a logistic regression, but we use a data augmentation scheme proposed by [35]. This method has been noted to perform well when the model has a complex prior structure and the data have a group structure and so we believe it is appropriate for the Spatial Boost model.

Thus, to sample conditional on , , and we first sample latent variables from a Pólya-Gamma distribution,

and then, setting , , and , sample

We note that the same rank truncation used in the EM algorithm from the previous section works here, and we gain more computational efficiency by using an identity similar to when computing and storing .

### 4.3Centroid estimation

To conduct inference on we follow statistical decision theory [6] and define an estimator based on a generalized Hamming loss function ,

We assume that has symmetric error penalties, and that , that is, the loss for a false positive or negative is higher than for a true positive and true negative. In this case, we can define a *gain* function by subtracting each entry in from and dividing by :

Gain represents a sensitivity-specificity trade-off; if , that is, if true positives and negatives have the same relevance, then .

Let us define the marginal posteriors . The above estimator is then equivalent to

which can be obtained position-wise,

The estimator in is known as the *centroid estimator*; in contrast to maximum *a posteriori* (MAP) estimators that simply identify the highest peak in a posterior distribution, centroid estimators can be shown to be closer to the mean than to a mode of the posterior space, and so offer a better summary of the posterior distribution [8]. Related formulations of centroid estimation for binary spaces in have been proposed in many bioinformatics applications in the context of maximum expected accuracy [18]. Moreover, if then is simply a consensus estimator and coincides with the median probability model estimator of [5].

Finally, we note that the centroid estimator can be readily obtained from MCMC samples since we just need to estimate the marginal posterior probabilities and apply them to .

## 5Guidelines for Selecting Prior Parameters

Since genome-wide association is a large--small- problem, we rely on adequate priors to guide the inference and overcome ill-posedness. In this section we provide guidelines for selecting hyperpriors in the slab variance of , and , , and in the prior for .

### 5.1Selecting

Biologically, some locations within a chromosome may be more prone to recombination events and consequently to relatively higher linkage disequilibrium. LD can be characterized as correlation in the genotypes, and since we analyze the entire genome, high correlation in markers within a chromosome often results in poor coefficient estimates for the logistic regression model in Equation 1. To account for potentially varying spatial relationships across the genome, we exploit the typical correlation pattern in GWAS data sets to suggest a value for that properly encodes the spatial relationship between markers and genes in a particular region as a function of genomic distance. To this end, we propose the following procedure to select :

Divide each chromosome into regions such that the distance between the SNPs in adjacent regions is at least the average length of a human gene, or base pairs [38]. The resulting regions will be, on average, at least a gene’s distance apart from each other and may possibly exhibit different patterns of correlation.

Merge together any adjacent regions that cover the same gene. Although the value of depends on each region, we want the meaning of the weights assigned from a particular gene to SNPs in the Spatial Boost model to be consistent across regions. As a practical example, by applying the first two steps of the pre-processing procedure on chromosome 1, we obtain 1,299 windows of varying sizes ranging from 1 to 300 markers.

Iterate over each region and select a value of that best fits the magnitude of the genotype correlation between any given pair of SNPs as a function of the distance between them. We propose using the normal curve given in the definition of the gene weights to first fit the magnitudes, and then using the mean squared error between the magnitudes in the sample correlation matrix of a region and the magnitudes in the fitted correlation matrix as a metric to decide the optimal value of . In particular, given two SNPs located at positions and , we relate the magnitude of the correlation between SNPs and to the area

where is the standard normal cumulative function.

Figure 2 shows an example of application to chromosome 1 based on data from the case study discussed in Section 6. We note that the mean squared error criterion places more importance on fitting relatively larger magnitudes close to the diagonal of the image matrix, and so there is little harm in choosing a moderate value for that best fits the magnitudes of dense groups of correlated SNPs in close proximity.

### 5.2Selecting and

According to the centroid estimator in , the -th SNP is identified as associated if . Following a similar criterion, but with respect to the conditional posteriors, we have , and so, using ,

After some rearrangements, we see that, in terms of , this criterion is equivalent to with

that is, we select the -th marker if is more than “spike” standard deviations away from zero.

This interpretation based on the EM formulation leads to a meaningful criterion for defining and : we just require that , that is, that the smallest number of standard deviations is at least . Since ,

and so,

For a more stringent criterion, we can take the minimum over in the right-hand side of by setting . When setting it is also important to keep in mind that is the largest allowable gene boost, or better, increase in the log-odds of a marker being associated to the trait.

Since is related to the prior probability of a SNP being associated, we can take to be simply the logit of the fraction of markers that we expect to be associated *a priori*. However, for consistency, since we want , we also require that the right hand side of be non-negative, and so

Equation constraints and jointly, but we note that the two parameters have different uses: captures our prior belief on the probability of association and is thus part of the model specification, while defines the sensitivity-specificity trade-off that is used to identify associated markers, and is thus related to model inference.

As an example, if and we set , then the bound in with is . If we expect in markers to be associated, we have and the bound is respected. The upper bound for in is thus .

### 5.3Selecting

We propose using a metric similar to the Bayesian false discovery rate [BFDR, [43]] to select . The BFDR of an estimator is computed by taking the expected value of the false discovery proportion under the marginal posterior distribution of :

Since, as in the previous section, we cannot obtain estimates of just by running our EM algorithm, we consider instead an alternative metric that uses the conditional posterior probabilities of association given the fitted parameters, . We call this new metric EMBFDR:

Moreover, by the definition of the centroid estimator in , we can parameterize the centroid EMBFDR using :

We can now analyze a particular data set using a range of values for and subsequently make plots of the EMBFDR metric as a function of the threshold or as a function of the proportion of SNPs retained after the EM filter step. Thus, by setting an upper bound for a desired value of the EMBFDR we can investigate these plots and determine an appropriate choice of and an appropriate range of values of . In Figure 3 we illustrate an application of this criterion. We note that the EMBFDR has broader application to Bayesian variable selection models and can be a useful metric to guide the selection of tuning parameters, in particular the spike-and-slab variance separation parameter .

### 5.4Visualizing the relationship between SNPs and genes

For a given configuration of , , and , we can plot the bounds on and inspect the effect of parameters , , and . SNPs that are close to relevant genes have thresholds that are relatively lower in magnitude; they need a relatively smaller (in magnitude) coefficient to be selected for the final model. With everything else held fixed, as the boost received from the relevant genes will decrease to zero and our model will coincide with a basic version of Bayesian variable selection where . We demonstrate this visualization on a mock chromosome in Figure 4.

## 6Empirical Studies

We conduct two simulation studies. First, we compare the performance of our method to other well-known methods including single SNP tests, LASSO, fused LASSO, group LASSO, the penalized unified multiple-locus association (PUMA) suite of [22], and the Bayesian sparse linear mixed model (BSLMM) of [47]. Then we assess the robustness of our method to misspecifications of the range parameter, , and gene relevances. We describe each study in detail below, but we first explain how the data is simulated in each scenario.

### 6.1Simulation Study Details

To provide a fair comparison across methods and to realistically assess the robustness of our method to misspecifications, we designed our simulation study based on real-life genotypical data and current gene and marker annotations. Specifically, to keep a representative LD structure we sample whole-chromosome individual genotypes by subsampling individual data provided by the [1]; gene weights are computed based on gene lengths and positions in the hg19 reference, while marker positions are taken from actual SNP array designs in the WTCCC studies. We consider two studies: a “non-informative” setup where the gene relevances are uniformly set to one and , so that marker relevance scores are small and close to uniform, and a mild boost effect of ; and an “informative” study with gene relevances taken as search hit scores from [28] and , a frequent value when adopting the procedure in Section 5.1, and a stronger boost effect, . These two studies are extreme with respect to marker relevance scores—a function of gene relevances and genomic range—and spatial boost effects and aim at assessing how robust are our model and recommended guidelines. For instance, when fitting all scenarios we take an informative approach by considering the same gene relevances and genomic range from the “informative” scenario, but adopt a conservative approach by setting as in the “non-informative” scenario.

For both studies we simulated two scenarios for the number of markers : the “small” scenario comprised chromosome 19 with markers, and the “large” scenario containing all markers in chromosome 2. Chromosomes 19 and 2 are the smallest and largest in terms of number of markers, respectively. We kept the ratio of to the number of individuals at , representative of real-life studies, so in the small scenario and in the large scenario. In all simulations we fix the number of causal markers and set the baseline log-odds . For each simulation batch, we first sample uniformly at random individuals from the 1000 Genomes dataset, taking their whole chromosome genotypes according to the small or large scenario, and filter out markers with sampled MAF , deemed as rare variants, or . Next, we sample causal markers following , with marker relevance scores and taken according to a non-informative or informative scenario. Effect sizes are then sampled to reflect the challenging nature of GWAS: , that is, small effect sizes for causal markers and relatively large coefficients for noisier non-causal effects. Finally, for each replicate within a batch we sample phenotypes according to .

In each simulation scenario and dataset below we fit the model as follows: we adopt informative gene relevances from MalaCards and , start with conservative values for the baseline log-odds and the gene boost effect , and run the EM filtering process until either the predictive performance starts degrading or at most markers remain. We measure predictive performance using a metric similar to posterior predictive loss [13]: if, at the -th EM iteration, is the -th predicted response, the PPL measure under squared error loss is approximated by

The right panel in Figure 5 shows an example of how the relative PPL (rPPL) typically varies as the EM filter advances. We define the rPPL at the -th EM iteration as the ratio between PPL and the PPL of the null model, that is, the model with only the intercept and no marker genotypes as predictors. At the end of the filtering stage we run the Gibbs sampler with , where is the number of markers retained at the end of the EM filter. Parameter is actually elicited at each EM filtering iteration using EMBFDR.

### 6.2Comparison Simulation Study

In this study, we generated 10 batches of simulated data, each containing 5 replicates, for a total of 50 simulated data sets for each cross configuration of small and large scenarios by non-informative and informative studies. After simulating the data, we fit our model and compared its performance in terms of area under the receiver operating characteristic (ROC) curve (AUC) to the usual single SNP tests, LASSO, fused LASSO, group LASSO, PUMA, and BSLMM methods. We used the **penalized** package in R to fit the LASSO and fused LASSO models; we used ten-fold cross-validation to determine the optimal values for the penalty terms. Similarly, we used the **gglasso** package in R to fit the group LASSO model where we defined the groups such that any two adjacent SNPs belonged to the same group if they were within base pairs of each other; we used ten-fold cross validation to determine the optimal value for the penalty term. Finally, we used the authors’ respective software packages to fit the PUMA and BSLMM models.

To calculate the AUC for any one of these methods, we took a final ranking of SNPs based on an appropriate criterion (see more about this below), determined the points on the ROC curve using our knowledge of the true positives and the false positives from the simulated data, and then calculated the area under this curve. For our model, we used either the ranking (in descending order) of for a particular EM filtering step or using the samples obtained by the Gibbs sampler; for the single SNP tests we used the ranking (in ascending order) of the p-values for each marker’s test; for LASSO, fused LASSO and group LASSO we used the ranking (in descending order) of the magnitude of the effect sizes of the SNPs in the final model; for the other penalized regression models given by the PUMA program, we used the provided software to compute p-values for each SNP’s significance in the final model and used the ranking (in ascending order) of these p-values; for BSLMM we used the ranking (in descending order) of the final estimated posterior probabilities of inclusion for each SNP in the final model.

The left panel in Figure 5 shows examples of ROC curves from a random simulation replicate. Since the interest in GWA studies is focused on low false positive rates, we also evaluate the performance of the methods with respect to *relative* AUC (rAUC) at some false positive rate , defined simply as the normalized AUC up to FPR , that is, rAUC = AUC. According to this criterion, the solid ROC curve in Figure 5 has a clearer advantage over the dashed curve at a 10% FPR.The right panel illustrates how the AUC changes with rPPL and as the EM filtering iterations increase.

We summarize the results in Figure 9. With respect to AUC (top four panels), our methods perform better on average than the other methods in the informative scenario, but only comparably in the non-informative scenario. Note that all models have a modest performance due to the challenging nature of the simulation (but our model has improved performance under less stringent scenarios; see ? for more details). Most of the gain in our methods comes at the beginning of the ROC curves, i.e. at low false positive rates, as exemplified in Figure 5. This becomes more evident if we compare relative AUC in the bottom four panels. We note that the gains are present even in the “null” models with , so they stem from the joint modeling of markers. Additional gains in rAUC are more apparent in the informative scenario (bottom row).

The seemingly bad performance of our model in the non-informative scenario indicates that the model can be sensitive to poor marker relevance scores, arising either from meaningless gene relevances or a non-representative genomic range . This is not surprising given that the inference relies on good prior information in the large--small- regimen; however, as the informative scenario suggests, the model is more robust to misspecification of , which can be seen as the overall prior strength of relevance scores. The null model, for instance, often offers the best performances in both AUC and relative AUC, which recommends conservative, low values for in initial analyses. Moreover, the EM filtering procedure contributes to further gains in performance even in non-informative scenarios, that is, with misspecified relevance scores. These gains are even more pronounced with larger causal effect sizes relative to non-causal effects, as we show in ?; that is, we observe that the EM filter becomes more robust to these misspecifications, especially on large scenarios and under low false positive rates, as shown here.

### 6.3Relevance Robustness Simulation Study

To investigate how robust our model is to misspecifications of gene relevances and genomic range, we focus on their joint effect in defining marker relevance scores and again consider the small (chromosome 19) and large (chromosome 2) scenarios under the informative study, where the gene relevances are more varied. We randomly select one batch from the comparison study to be the reference in each scenario, with its replicates serving as ground truth. We then vary three sampling percentages , and, for each , we randomly select markers uniformly from each reference replicate and sample their relevance scores from the empirical distribution of relevance scores in each scenario. This simulation is replicated 50 times.

Hyper-prior parameters , , and were elicited as in Section 6.1. For each simulated replication we then fit our model and assess performance using the AUC, as in the previous study. Figure 12 illustrates the distribution of relevance scores for the markers in chromosomes 19 and 2 and how the performance of the model varies at each sample percentage. The AUC at initial EM iterations degrades with higher percentages for both scenarios, as expected; however, small scenario replicates continue to show a similar pattern at their last EM iterations, as opposed to large scenario replicates that show better and stable results across all percentages. We attribute this discrepancy in robustness to the distribution of relevance scores. As the left panel in Figure 12 shows, the large scenario has a bimodal distribution and low mean score and so a few markers are relevant; in contrast, the small scenario score distribution has more spread and higher mean and so many markers can be relevant and influence negatively the fit by calling more false positives. We thus recommend to analyze the resulting distribution of marker relevance scores when eliciting gene relevances and .

## 7Case Study

Using data provided by the WTCCC, we analyzed the entire genome (342,502 SNPs total) from a case group of 1,999 individuals with rheumatoid arthritis (RA) and a control group of 1,504 individuals from the 1958 National Blood Bank dataset. For now we addressed the issues of LD, rare variants, and population stratification by analyzing only the SNPs in Hardy-Weinberg Equilibrium [44] with minor allele frequency greater than 5%. There are 15 SNPs that achieve genome-wide significance when using a Bonferroni multiple testing procedure on the results from a single SNP analysis. Table 1 in ? provides a summary of these results for comparison to those obtained when using the spatial boost model.

When fitting the spatial boost model, we broke each chromosome into blocks and selected an optimal value of for each block using our proposed method metric, . We used the EMBFDR to select a choice for from the set at each step of our model fitting pipeline so that the BFDR was no greater than 0.05 while retaining no larger than 5% of the total number of SNPs. With a generous minimum standard deviation we have that trivially from , but we set to encode a prior belief that around 100 markers would be associated to the trait on average *a priori*. The bound on is then , but we consider log odds-ratio boost effects of . A value of is more representative of low power GWA studies; however, the larger boost effects offer more weight to our prior information. For comparison, we also fit a model without any gene boost by setting (the “null” model), and also fit two models for each possible value of trying both a non-informative gene relevance vector and a vector based on text-mining scores obtained from [28].

To accelerate the EM algorithm, we rank-truncate using singular vectors; the mean squared error between and this approximation is less than 1%. We apply the EM filtering 29 times and use PPL to decide when to start the Gibbs sampler. As Figure 13 shows, in all of our fitted models, the PPL decreases slowly and uniformly for the first twenty or so iterations, and then suddenly decreases more sharply for the next five iterations until it reaches a minimum and then begins increasing uniformly until the final iteration, similarly to the behavior depicted in Figure 5. For comparison to the 15 SNPs that achieve genome-wide significance in the single marker tests, Tables 2-15 in ? list, for each spatial boost model, the top 15 SNPs at the optimal EM filtering step, i.e. the step with the smallest PPL, and the top 15 SNPs based on the posterior samples from our Gibbs sampler when using only the corresponding set of retained markers.

We observe the most overlap with the results of the single SNP tests in our null model where and in our models that use informative priors based on relevance scores from MalaCards. Although there is concordance between these models in terms of the top 15 SNPs, it is noteworthy that we select only a fraction of these markers after running either the EM algorithm or the Gibbs sampler. Based on the results from our simulation study where we observe superior performances for the spatial boost model at low false positive rates, we believe that an advantage of our method is this ability to highlight a smaller set of candidate markers for future investigation.

Indeed, after running our complete analysis, we observe that the usual threshold of on would result in only the null spatial boost model (), the low gene boost non-informative model (), and the informative models selecting SNPs for inclusion in their respective final models. The SNPs that occur the most frequently in these final models are the first four top hits from the single SNP tests: rs4718582, rs10262109, rs6679677, and rs664893. The SNP with the highest minor allele frequency in this set is rs6679677; this marker has appeared in several top rankings in the GWAS literature (e.g. [7]) and is in high LD with another SNP in gene PTPN22 which has been linked to RA [31].

If we consider only the final models obtained after running the EM filter, we see another interesting SNP picked up across the null and informative models: rs1028850. In Figure 15, we show a closer look at the region around this marker and compare the trace of the Manhattan plot with the traces of each spatial boost model’s values at the first iteration of the EM filter. To the best of our knowledge this marker has not yet been identified as being associated to RA;moreover, it is located inside a non-protein coding RNA gene, LINC00598, and is close to another gene that has been linked to RA, FOXO1 [15].

As we increase the strength of the gene boost term with a non-informative relevance vector, the relatively strong prior likely leads to a mis-prioritization of all SNPs that happen to be located in regions rich in genes. In the supplementary tables (2-15) of ? we list the lengths of the genes that contain each SNP and we see that indeed the non-informative gene boost models tend to retain SNPs that are near large genes that can offer a generous boost. Perhaps due to prioritizing the SNPs incorrectly in these models, we do not actually select any markers at either the optimal EM filtering step or after running the Gibbs sampler. However, some of the highest ranking SNPs for these models, rs1982126 and rs6969220, are located in gene PTPRN2 which is interestingly a paralog of PTPN22.

## 8Conclusions

We have presented a novel hierarchical Bayesian model for GWAS that exploits the structure of the genome to define SNP-specific prior distributions for the model parameters based on proximities to relevant genes. While it is possible that other “functional” regions are also very relevant—e.g. regulatory and highly conserved regions—and that mutations in SNPs influence regions of the genome much farther away—either upstream, downstream, or, through a complex interaction of molecular pathways, even on different chromosomes entirely—we believe that incorporating information about the genes in the immediate surroundings of a SNP is a reasonable place to start.

By incorporating prior information on relevant genomic regions, we focus on well annotated parts of the genome and were able to identify, in real data, markers that were previously identified in large studies and highlight at least one novel SNP that has not been found by other models. Clearly, validation via other studies of the importance of this marker for RA should be investigated. In addition, as shown in a simulation study, while logistic regression under large--small- regimen is challenging, the spatial boost model often outperforms simpler models that either analyze SNPs independently or employ a uniform penalty term on the norm of their coefficients.

Our main point is that we regard a fully joint analysis of all markers as essential to overcome genotype correlations and rare variants. This approach, however, entails many difficulties. From a statistical point of view, the problem is severely ill-posed so we rely on informative, meaningful priors to guide the inference. From a computational perspective, we also have the daunting task of fitting a large scale logistic regression, but we make it feasible by reducing the dimension of both data—intrinsically through rank truncation—and parameters—through EM filtering. Moreover, from a practical point of view, we provide guidelines for selecting hyper-priors, reducing dimensionality, and implement the proposed approach using parallelized routines.

From the simulation studies in Section 6 we can further draw two conclusions. First, as reported by other methods such as PUMA, filtering is important; our EM filtering procedure seems to focus on effectively selecting true positives at low false positive rates. This feature of our method is encouraging, since practitioners are often interested in achieving higher sensitivity by focusing on lower false positive rates. Second, because we depend on good informative priors to guide the selection of associated markers, we rely on a judicious choice of hyper-prior parameters, in particular of the range parameter and how it boosts markers within neighboring genes that are deemed relevant. It is also important to elicit gene relevances from well curated databases, e.g. MalaCards, and to calibrate prior strength according to how significant these scores are.

We have shown that our model performs better than most variable selection methods, but that it can suffer in case of severe model misspecification. As a way to flag misspecification we suggest to check monotonicity in a measure of model fit such as PPL as we filter markers using EM.In addition, refining the EM filtering by using a lower threshold () at each iteration can help increase performance, especially at lower false positive rates.

When applying the spatial boost model to a real data set, we were able to confidently isolate at least one marker that has previously been linked to the trait as well as find another novel interesting marker that may be related to the trait. This shows that although we can better explore associations jointly while accounting for gene effects, the spatial boost model still might lack power to detect associations between diseases and SNPs due to the high correlation induced by linkage disequilibrium. In the future we plan to increase the power even further by extending the model to include a data pre-processing step that attempts to formally correct for the collinearity between SNPs.

## Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive comments that helped shape a much improved final version of the paper. We would also like to thank the editors for their generous comments and support during the review process. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from `wtccc.org.uk`

. This work was also supported in part by the ICR-KU International Short-term Exchange Program for Young Researchers, the JSPS Summer Program 2013 (SP13024), and the NSF EAPSI Program 2013.

### References

**An integrated map of genetic variation from 1,092 human genomes.**

1000 Genomes Project Consortium et al. (2012).*Nature**491*(7422), 56–65.**A text-mining technique for extracting gene-disease associations from the biomedical literature.**

Al-Mubaid, H. and R. K. Singh (2010).*International journal of bioinformatics research and applications**6*(3), 270–286.**A tutorial on statistical methods for population association studies.**

Balding, D. J. (2006).*Nature Reviews Genetics**7*(10), 781–791.**Statistical analysis strategies for association studies involving rare variants.**

Bansal, V., O. Libiger, A. Torkamani, and N. J. Schork (2010).*Nature Reviews Genetics**11*(11), 773–785.**Optimal predictive model selection.**

Barbieri, M. and J. Berger (2004).*The Annals of Statistics**32*(3), 870–897.*Statistical decision theory and Bayesian analysis*.

Berger, J. (1985). Springer.**Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.**

Burton, P. R., D. G. Clayton, L. R. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. P. Kwiatkowski, M. I. McCarthy, W. H. Ouwehand, N. J. Samani, et al. (2007).*Nature**447*(7145), 661–678.**Centroid estimation in discrete high-dimensional spaces with applications in biology.**

Carvalho, L. and C. Lawrence (2008).*Proceedings of the National Academy of Sciences**105*(9), 3209–3214.**Markov chain monte carlo convergence diagnostics: a comparative review.**

Cowles, M. K. and B. P. Carlin (1996).*Journal of the American Statistical Association**91*(434), 883–904.**Maximum likelihood from incomplete data via the EM algorithm.**

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977).*Journal of the Royal Statistical Society. Series B (Methodological)**39*(1), 1–38.**Meta-analysis methods for genome-wide association studies and beyond.**

Evangelou, E. and J. P. Ioannidis (2013).*Nature Reviews Genetics**14*(6), 379–389.**A note on the group lasso and a sparse group lasso.**

Friedman, J., T. Hastie, and R. Tibshirani (2010, Jan). Technical Report arXiv:1001.0736, Stanford University.**Model choice: A minimum posterior predictive loss approach.**

Gelfand, A. E. and S. K. Ghosh (1998).*Biometrika**85*(1), 1–11.**Variable selection via gibbs sampling.**

George, E. I. and R. E. McCulloch (1993).*Journal of the American Statistical Association**88*(423), 881–889.**JNK-dependent downregulation of FoxO1 is required to promote the survival of fibroblast-like synoviocytes in rheumatoid arthritis.**

Grabiec, A. M., C. Angiolilli, L. M. Hartkamp, L. G. van Baarsen, P. P. Tak, and K. A. Reedquist (2014).*Annals of the Rheumatic Diseases**74*(9), annrheumdis–2013.**Bayesian variable selection regression for genome-wide association studies and other large-scale problems.**

Guan, Y. and M. Stephens (2011).*The Annals of Applied Statistics**5*(3), 1780–1815.**Extension of the Bayesian alphabet for genomic selection.**

Habier, D., R. Fernando, K. Kizilkaya, and D. Garric (2011).*BMC bioinformatics**12*, 186.**A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA).**

Hamada, M. and K. Asai (2012).*Journal of Computational Biology**19*(5), 532–549.**Distilled sensing: Adaptive sampling for sparse detection and estimation.**

Haupt, J., R. M. Castro, and R. Nowak (2011).*Information Theory, IEEE Transactions on**57*(9), 6222–6235.**Ten years of genetics and genomics: what have we achieved and where are we heading?**

Heard, E., S. Tishkoff, J. A. Todd, M. Vidal, G. P. Wagner, J. Wang, D. Weigel, and R. Young (2010).*Nature Reviews Genetics**11*(10), 723–733.**Ridge regression - applications to nonorthogonal problems.**

Hoerl, A. and R. Kennard (1970).*Technometrics**12*, 69–82.**Puma: a unified framework for penalized multiple regression analysis of gwas data.**

Hoffman, G. E., B. A. Logsdon, and J. G. Mezey (2013).*PLoS computational biology**9*(6), e1003101.**Validating, augmenting and refining genome-wide association signals.**

Ioannidis, J. P., G. Thomas, and M. J. Daly (2009).*Nature Reviews Genetics**10*(5), 318–329.**Spike and slab variable selection: Frequentist and Bayesian strategies.**

Ishwaran, H. and J. S. Rao (2005).*Annals of Statistics**33*(2), 730–773.**A gene-centric approach to genome-wide association studies.**

Jorgenson, E. and J. S. Witte (2006).*Nature Reviews Genetics**7*(11), 885–891.**Risk prediction using genome-wide association studies.**

Kooperberg, C., M. LeBlanc, and V. Obenchain (2010).*Genetic epidemiology**34*(7), 643–652.*Generalized linear models*, Volume 37.

MacCullagh, P. and J. A. Nelder (1989). CRC press.**Genes related to rheumatoid arthritis.**

MalaCards (2014). http://www.malacards.org/card/rheumatoid_arthritis.*Generalized Linear Models*.

McCullagh, P. and J. A. Nelder (1989). London England Chapman and Hall 1983.**Maximum likelihood estimation via the ECM algorithm: A general framework.**

Meng, X.-L. and D. B. Rubin (1993).*Biometrika**80*(2), 267–278.**Linkage proof for ptpn22, a rheumatoid arthritis susceptibility gene and a human autoimmunity gene.**

Michou, L., S. Lasbleiz, A.-C. Rat, P. Migliorini, A. Balsa, R. Westhovens, P. Barrera, H. Alves, C. Pierlot, E. Glikmans, et al. (2007).*Proceedings of the National Academy of Sciences**104*(5), 1649–1654.**Bayesian variable selection in linear regression.**

Mitchell, T. J. and J. J. Beauchamp (1988).*Journal of the American Statistical Association**83*(404), 1023–1032.**An integrative framework for bayesian variable selection with informative priors for identifying genes and pathways.**

Peng, B., D. Zhu, B. P. Ander, X. Zhang, F. Xue, F. R. Sharp, and X. Yang (2013).*PloS one**8*(7), e67672.**The matrix cookbook.**

Petersen, K. B. and M. S. Pedersen (2008).**Bayesian inference for logistic models using Pólya-Gamma latent variables.**

Polson, N., J. Scott, and J. Windle (2013).*Journal of the American Statistical Association**108*(504), 1339–1349.**Linkage disequilibrium in humans: Models and data.**

Pritchard, J. and M. Przeworski (2001).*American Journal of Human Genetics**69*, 1–14.**Bayesian statistical methods for genetic association studies.**

Stephens, M. and D. J. Balding (2009).*Nature Reviews Genetics**10*(10), 681–690.*The Handy Science Answer Book*.

Technology Department Carnegie Library of Pittsburgh (2002). Visible Ink Press.**Regression shrinkage and selection via the lasso.**

Tibshirani, R. (1996).*Journal of the Royal Statistical Society**58*, 267–288.**Sparsity and smoothness via the fused lasso.**

Tibshirani, R. and M. Saunders (2005).*Journal of the Royal Statistical Society**67*, 91–108.**Genome-wide association studies: theoretical and practical concerns.**

Wang, W. Y., B. J. Barratt, D. G. Clayton, and J. A. Todd (2005).*Nature Reviews Genetics**6*(2), 109–118.**Bayesian factor regression models in the “large p, small n” paradigm.**

West, M. (2003).*Bayesian Statistics**7*(2003), 723–732.**A bayesian false discovery rate for multiple testing.**

Whittemore, A. S. (2007).*Journal of Applied Statistics**34*(1), 1–9.**A note on exact tests of hardy-weinberg equilibrium.**

Wigginton, J. E., D. J. Cutler, and G. R. Abecasis (2005).*The American Journal of Human Genetics**76*(5), 887–893.**Powerful snp-set analysis for case-control genome-wide association studies.**

Wu, M. C., P. Kraft, M. P. Epstein, D. M. Taylor, S. J. Chanock, D. J. Hunter, and X. Lin (2010).*The American Journal of Human Genetics**86*(6), 929–942.**Rare-variant association testing for sequencing data with the sequence kernel association test.**

Wu, M. C., S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin (2011).*The American Journal of Human Genetics**89*(1), 82–93.**Polygenic modeling with bayesian sparse linear mixed models.**

Zhou, X., P. Carbonetto, and M. Stephens (2013).*PLoS genetics**9*(2), e1003264.**Regularization and variable selection via the elastic net.**

Zou, H. and T. Hastie (2005).*Journal of the Royal Statistical Society**67*, 301–320.