# Set-Based Tests for Genetic Association Using the Generalized Berk-Jones Statistic

###### Abstract

Studying the effects of groups of Single Nucleotide Polymorphisms (SNPs), as in a gene, genetic pathway, or network, can provide novel insight into complex diseases, above that which can be gleaned from studying SNPs individually. Common challenges in set-based genetic association testing include weak effect sizes, correlation between SNPs in a SNP-set, and scarcity of signals, with single-SNP effects often ranging from extremely sparse to moderately sparse in number. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) test for the association between a SNP-set and outcome. The GBJ extends the Berk-Jones (BJ) statistic by accounting for correlation among SNPs, and it provides advantages over the Generalized Higher Criticism (GHC) test when signals in a SNP-set are moderately sparse. We also provide an analytic p-value calculation procedure for SNP-sets of any finite size. Using this p-value calculation, we illustrate that the rejection region for GBJ can be described as a compromise of those for BJ and GHC. We develop an omnibus statistic as well, and we show that this omnibus test is robust to the degree of signal sparsity. An additional advantage of our method is the ability to conduct inference using individual SNP summary statistics from a genome-wide association study. We evaluate the finite sample performance of the GBJ though simulation studies and application to gene-level association analysis of breast cancer risk.

Keywords: Correlated tests; Generalized higher criticism; Omnibus test; Sparse alternative; Summary statistics.

## 1 Introduction

Genome-Wide Association Studies (GWAS) have been successful in identifying the associations between thousands of Single Nucleotide Polymorphisms (SNPs) and a variety of complex traits (Manolio et al. 2009). A traditional GWAS analysis tests for the effect of each individual SNP, and this approach has shown that single-SNP effects are often weak across the genome (Visscher et al. 2012). Recently, set-based tests which jointly analyze a group of SNPs - e.g. SNPs in a gene, pathway, or network - have become increasingly popular as complementary tools which can boost analysis power in GWAS (Wu et al. 2010). These tests are also standard for rare variant analysis in whole genome sequencing studies (Lee et al. 2014).

SNPs can be aggregated into sets based on a variety of genomic features. For example, they can be grouped by physical position, such as location in a gene or Linkage Disequilibrium (LD) block, or similar biological functions, such as membership in a genetic pathway or protein network. Set-based analyses then allow for some natural advantages over individual SNP methods. Besides reducing the number of multiple comparisons across the genome, SNP-set methods can increase power by pooling sparse and weak effects into a stronger aggregated signal, as well as by incorporating biological information into the test (Wu et al. 2011). In addition, set-based interpretations of association may be more meaningful than their single-marker counterparts, such as in a gene-level or pathway-level analysis.

A number of set-based tests for genetic association studies have been developed in recent years, including burden tests (Li and Leal 2008), the Sequence Kernel Association Test (SKAT) (Wu et al. 2011), the minimum p-value test (MinP) (Conneely and Boehnke 2007), and most recently the Generalized Higher Criticism (GHC) (Barnett et al. 2017). SKAT and burden tests are examples of methods more suitable for detecting dense signals. If the signals reside in only a few SNPs that are not correlated with noise SNPs, then the power of SKAT and burden tests will suffer.

While certain SNP-sets may contain a large number of signals, it is more common that genomic constructs formed with GWAS data will have only a few signal SNPs. Interestingly, within tests designed for this sparse alternative setting, there are still subtle differences in performance. Under extreme sparsity, as in the case of only one or very few signals in the entire set, the minimum p-value test and the GHC have good power for detecting a SNP-set effect. However GHC and MinP can lose power under moderate sparsity settings, which are relatively common in gene and pathway level analyses of GWAS data. For example, in the Cancer Genetic Markers of Susceptibility (CGEMS) GWAS for breast cancer risk (Hunter et al. 2007), four out of 42 SNPs in the FGFR2 gene, a known breast cancer risk loci, showed strong evidence of association without reaching genome-wide significance. It is hence of substantial interest to develop testing procedures that can reliably detect associations across a range of alternatives in the sparse signal regime.

When factors in a set are independent, several goodness-of-fit type methods have been proposed to perform set-based tests in the presence of sparse signals (Donoho and Jin 2004; Jager and Wellner 2007; Walther 2013). These methods test for the effect of a set by aggregating evidence from marginal test statistics, and they have been shown to possess attractive asymptotic properties when the size of a set goes to infinity. Specifically, they reach a so-called detection boundary when signals are sparse. In a certain sense, they are able to detect the weakest signals detectable by any statistical procedure under the sparse alternative. The class of tests with this ability includes the Higher Criticism (HC) (Donoho and Jin 2004) and the Berk-Jones (BJ) (Berk and Jones 1979). Compared to the HC and BJ tests, the minimum p-value test, for example, is known to attain the detection boundary over a smaller portion of the sparse regime. In terms of finite sample performance, it has been demonstrated through simulation that Berk-Jones outperforms Higher Criticism over a range of moderately sparse alternatives when marginal test statistics are independent (Walther 2013; Li and Siegmund 2015). Donoho and Jin (2004) provide an explanation for this result by showing that HC disproportionately weights evidence from the most extreme observed marginal test statistic, at the cost of losing sensitivity to detect signals in other locations.

However, a direct application of BJ to SNP-set testing is not desirable, as standard p-value calculations for BJ rely on independence between elements in a set (Moscovich-Eiger and Nadler 2017). This assumption is violated by LD-induced correlation between neighboring SNPs in a SNP-set. In addition, we will see that even if we correct the inference procedure of the Berk-Jones, the power of BJ can fall dramatically in high LD settings.

To overcome the challenges posed by correlated SNPs, this paper proposes the Generalized Berk-Jones (GBJ) statistic for testing the association between a SNP-set and outcome. GBJ accounts for LD among SNPs in a set while still preserving the ability to detect moderately sparse and weak signals in finite samples. In fact, GBJ reduces to the Berk-Jones statistic when all SNPs in a set are independent. GBJ can also be applied to SNP-set tests using both individual-level genotype data or GWAS summary statistics from single SNP analysis. To facilitate use, we additionally provide an analytic p-value calculation for GBJ. Our method is more computationally efficient than permutation and is shown to be accurate even at the extremely small levels required for genome-wide significance.

Additional insight into the strengths and weaknesses of GBJ is provided by studying the rejection regions of SNP-set tests developed for sparse alternatives. The rejection regions allow us to quantitatively describe how the power of each test is susceptible to changes in parameters such as the amount of correlation between SNPs or the size of the SNP-set. Since in practice we never have knowledge of the type of alternative, we also propose an omnibus test that combines GBJ, GHC, MinP, and SKAT for added robustness to different degrees of sparsity. An extensive simulation study demonstrates that GBJ outperforms alternative methods in testing SNP-set effects when signals are weak and moderately sparse, and we also show that the omnibus test is robust to a wide range of sparsity levels.

The remainder of the paper is organized as follows. Section 2 discusses the SNP-set testing framework using both individual-level data and GWAS summary statistics. In Section 3 we propose the Generalized Berk-Jones statistic for testing the association between a SNP-set and outcome. We also provide an analytic p-value calculation for GBJ and develop the omnibus test. Section 4 compares the rejection regions of GBJ and other tests designed for the sparse regime. In Section 5 we demonstrate the finite sample performance of GBJ through simulation. Section 6 presents an analysis of the CGEMS data, and we conclude with a discussion in Section 7.

## 2 The SNP-Set Testing Framework

### 2.1 Individual-Level Genotype Data

We begin by describing the SNP-set testing framework using individual-level data on genotype, outcome, and other covariates for total subjects. Suppose for subject , , we observe the outcome , a genotype vector of SNPs in a SNP-set, and a vector of covariates . Assume that conditional on follows a distribution in the exponential family (McCullagh and Nelder 1989) with the density function , where , , and , are known functions, is a canonical parameter, and is a dispersion parameter. Let denote the conditional mean of and assume it follows the Generalized Linear Model (GLM)

where ) is a differentiable monotone link function. We here only consider canonical link functions for simplicity. In matrix notation, the data take the form , , and

The null hypothesis of no association between a SNP-set and outcome, after controlling for covariates, is given by . Both the size of the set and the sparsity of signals can vary greatly between sets, e.g. from gene to gene, and the number of nonzero is unknown. Our aim is to develop a test suitable for different levels of sparsity while also accounting for the correlation among individual SNP test statistics.

The marginal score statistic for , , is

where denotes the th column vector of , is the projection matrix, , , is the MLE of under the null hypothesis, and is the variance function.

Under the null, the marginal score test statistics have an asymptotic multivariate normal distribution

where for all , and for we can estimate

(1) |

### 2.2 GWAS Summary Statistics

Many GWAS may not release individual-level data due to logistical challenges or data confidentiality agreements. Instead it is much more likely that a marginal test statistic for association with the outcome is available for each individual SNP (Pasaniuc and Price 2016). It is hence of great interest to be able to perform SNP-set testing using precomputed from across the genome. To test a set of precomputed with GBJ, we require estimation of their correlation matrix using external information.

Assume we have a panel of reference genotypes from subjects of the same ethnicity as those used to construct the summary statistics. For example, this could come from the publicly available 1000 Genomes dataset (1000 Genomes Project Consortium 2015). We estimate using equation (1) but replace and with and , where is the genotype vector of SNP from the reference panel, , are the first principal component vectors calculated from the reference panel, and is the same number of principal components as was used to control for population stratification (Price et al. 2006) in the original GWAS analysis of the data. Additionally for each subject we estimate by setting equal to the sample mean of the outcome. For a normally distributed outcome, this is exact as . For a binary outcome, since population stratification is the primary confounder of the SNP-outcome relationship in GWAS, and because generally varies slowly with the principal components, this approximation is practically reasonable. Ultimately we are approximating in (1) by up to a scale parameter, with the scale parameter eventually cancelled out in .

## 3 The Generalized Berk-Jones Test for SNP-Set Effects

### 3.1 The Berk-Jones Statistic

We briefly review the Berk-Jones statistic in this section to help introduce the Generalized Berk-Jones statistic in Section 3.2. The BJ statistic is designed to test for against the alternative that a nonempty subset of the are nonzero, assuming the marginal test statistics are independent. Let denote the survival function of a standard normal random variable and denote its inverse. Let denote the order statistics of the vector that results from applying the absolute value operator to each element of , so that is the smallest value of in magnitude.

Set , which is the number of marginal test statistics with a magnitude greater than or equal to some threshold . For a fixed , and if for all , then has a binomial distribution with size and mean parameter . This viewpoint motivates the Berk-Jones statistic for independent observations (Donoho and Jin 2004) as:

where , solves the equation

(3) |

and the second line uses the characterization of as a binomial random variable.

We see that BJ can roughly be explained as the maximum of a one-sided likelihood ratio test on the mean parameter of , performed over the larger half of observed test statistic magnitudes. At we have under the binomial likelihood null, and we have under the binomial likelihood alternative. We say binomial likelihood null and alternative to make clear that we are talking about an interpretation of the Berk-Jones statistic and to distinguish from the actual set-based null and alternative hypotheses being tested.

Note that the Higher Criticism test differs from the Berk-Jones by replacing the likelihood ratio statistic in (3.1) with the Pearson Chi-square statistic . Let be the number of causal SNPs in a set. The sparse regime is designated as , and we call moderately sparse, with referred to as extremely sparse. Donoho and Jin (2004) showed that, when the are all mutually independent, both HC and BJ are able to reach the detection boundary over the entire sparse signal regime as . Walther (2013) and Li and Siegmund (2015) showed that the BJ statistic generally has better power than HC when the size of the set is finite and the signals are moderately sparse.

If are correlated, as they will be for test statistics arising from neighboring SNPs in a gene, then no longer has a binomial distribution under the null. In this case, the standard Berk-Jones statistic no longer has a meaningful interpretation, and we may expect it to lose efficiency. In fact, we will show later that the rejection region of the Berk-Jones has a less desirable shape under various correlation structures, leading to a significant loss in power when the test statistics are not independent. Therefore we are interested in developing a modified BJ statistic that can account for correlation among the marginal test statistics in a set and thus possesses rejection regions which are more robust to arbitrary correlation structures.

### 3.2 The Generalized Berk-Jones Statistic

We now propose the Generalized Berk-Jones statistic for testing the association between a SNP-set and outcome. Following the spirit of Berk-Jones, GBJ considers a likelihood ratio type statistic on the mean parameter of , but the key difference is GBJ explicitly accounts for the correlation structure of . More precisely, we define the GBJ statistic as:

When the are correlated, follows either an underdispersed or overdispersed binomial distribution instead of the standard binomial. However finding the exact distribution of when is difficult. For a general , computing requires iterating through choose terms and is very time consuming. In special cases, such as when has an exchangeable correlation structure with , the calculation is much easier. However these scenarios occur rarely, if ever, in practice.

We propose to approximate the full distribution of using an Extended Beta-Binomial (EBB) distribution (Prentice 1986). The Extended Beta-Binomial is a reparameterization and extension of the standard Beta-Binomial distribution with the standard Beta-Binomial being a special case of the EBB. A random variable has probability mass function

(4) |

where we follow the convention for . The mean of is given by and the variance is .

The Extended Beta-Binomial distribution reduces to the Beta-Binomial distribution if we set and for . Because the standard Beta-Binomial distribution requires , it cannot accommodate underdispersion and never reduces to the binomial distribution. In contrast, the EBB allows for both overdispersion and underdispersion, and it reduces exactly to the binomial distribution when . This mechanism allows our GBJ statistic to reduce to the Berk-Jones when there is no correlation among the observations.

### 3.3 Calculation of the Generalized Berk-Jones Statistic

We now describe more precisely the mechanics of calculating the GBJ statistic. To begin, check if the condition is satisfied at any . If the condition is never satisfied, then the observed value of GBJ is 0 and we do not need to perform any more computation. The following steps should only be taken on indices where the condition is satisfied.

At each qualifying , we approximate the distribution of by an Extended Beta-Binomial random variable under both the binomial likelihood null and the binomial likelihood alternative. Denote these two variables by and . We solve for through moment matching equations

where and denote the expectation and variance conditional on . Similarly, we solve for using the same equations except with and , which are the expectation and variance conditional on .

The first moment matching equation is simple to solve, since clearly and . The variance term in the second equation is more difficult. We can use Theorem 1 of Barnett et al. (2017) for . For , we need the following theorem:

###### Theorem 1.

Let , and take be the probabilists’ Hermite polynomials. Define , where is the element of , and let for all . Then the variance of is:

The proof of this theorem is given in the Supplementary Materials. The terms in the infinite sum shrink very quickly, and in practice we see good accuracy using only the first ten.

After matching all four parameters , we calculate

The maximum value of among all qualifying is then the observed Generalized Berk-Jones statistic.

### 3.4 Analytic P-value Calculation

Let be a general supremum-based global statistic such as the GBJ statistic. Suppose is constructed using independent marginal test statistics . Denote the observed value of this statistic by , where higher values of indicate more evidence for the alternative. As noted by Moscovich-Eiger and Nadler (2017), the p-value for can often be written

where are ’boundary points’ that come from inversion of the test statistic. The points will depend on and and will be different for different choices of a global statistic, but for the sake of presentation, we will suppress this dependency in the notation. Moscovich-Eiger and Nadler (2017) proposed a method that can calculate the p-value of very quickly if are independent. However when are correlated, their techniques for a fast calculation are not applicable.

An exact p-value for GBJ, and for any global test applied to correlated observations, must take into account the covariance structure of . The p-value for these tests is then

(5) |

where is understood to additionally depend on as well. We are unaware of any computationally feasible expressions to calculate the joint distribution of the order statistics when is moderate or large. The Supplementary Materials provides a procedure to compute this probability by partitioning the region into separate sections, but the method is very computationally intensive and not feasible for use with .

However, an alternative way to write the rejection region of equation (5) is

(6) |

The right hand side of (6) suggests that the quantity can be calculated recursively. Indeed, define and . The quantity in (6) is just and can be calculated recursively as

(7) | |||||

for , with . We use an EBB approximation for the distribution of conditional on , with the equations

to match the moments. Finally, set where . Evaluation of follows from steps similar to the proof of Theorem 1.

Note that we can generalize the scheme described above to calculate
p-values for many different supremum-based global tests by adopting the
general approach of Moscovich-Eiger and Nadler (2017).
As long as the test statistic can be inverted to create the bounds ,
we can use equation (7) to calculate its p-value when applied to correlated observations.
In particular, we can use this procedure to perform p-value calculations for the
HC, GHC, BJ, and GBJ statistics.
Both the calculation and the Generalized Berk-Jones test are implemented in the
R package `GBJ`

, available on the `CRAN`

repository.

### 3.5 The Omnibus Test

While we will show that the Generalized Berk-Jones test possesses an attractive finite sample rejection region when signals are moderately sparse, GBJ may also lose power in the presence of very sparse or dense signals. As SNP-set inference involves testing for a composite alternative , there is no uniformly optimal test for both sparse and dense alternatives. Since signal sparsity varies between genes, the best test will also change from gene to gene, but it is unknown prior to scanning the genome. Thus we propose an omnibus test that offers robust power over a range of different sparsity levels.

The omnibus test is constructed by combining the SKAT, GBJ, GHC, and minimum p-value statistics, which have been described above. The motivation for choosing these four methods is to combine tests that are known to have good power when signals are dense, moderately sparse, very sparse, and the sparsest possible, respectively. The MinP method uses the set’s largest marginal test statistic in magnitude as a test statistic. When the are independent, Donoho and Jin (2004) showed that MinP asymptotically reaches the same detection boundary as HC and BJ in the very sparse regime but not the moderately sparse regime . In finite samples, MinP can have better power than the other three tests when there are only one or two causal SNPs. In contrast, SKAT is known to have high power when signals in a SNP-set are dense.

The omnibus test first applies each of the four tests to the same SNP-set, and then it carries forward the smallest p-value from the four tests as a test statistic. Specifically, the omnibus test statistic is defined as:

where , and denote the p-values of the four tests applied on the same SNP-set. As these tests are applied to the same data, the four p-values will be correlated.

Calculations of the p-value for OMNI must again account for the correlation between tests. We employ a Gaussian copula approximation for the joint distribution of the inverse-normal transformed p-values:

where denotes the joint cumulative distribution function of a multivariate normal distribution with mean vector zero and correlation matrix . The correlation matrix of the four component test statistics is estimated through parametric bootstrap under the null. For each subject in the study, we simulate a new outcome based on the null mean . When individual-level data are not available, we take to be the same constant for all subjects as an approximation. Then each of the four tests are applied with the simulated outcome instead of the original one. The original design matrix, or approximated design matrix if working with summary statistics, is used each time. Each of the four p-values is subtracted from 1 and then inverse-normal transformed; under the null hypothesis the four transformed values have marginal normal distributions with mean zero. As we only need to estimate the correlation matrix , only a small number of parametric bootstrap samples are needed. In practice, this procedure is repeated 100 times, and then we set equal to the sample correlations of the inverse-normal transformed statistics. We will see that this omnibus test performs well across a variety of settings.

## 4 Rejection Region Analysis of Different SNP-Set Tests

We study in this section the finite sample rejection regions for the BJ, GBJ, HC, and GHC tests, and we advocate for viewing these statistics as boundary-defining algorithms. Consider a fixed SNP-set with size and correlation structure , and suppose we wish to conduct inference at level . Using the p-value calculation from above, we can employ standard root-finding routines to find the observed value which would result in a GBJ p-value of 0.01 for a SNP-set with these characteristics. Then inverting to find boundary points as in Section 3.4 constructs a rejection region in terms of in terms of . That is, if the observed value of were larger than for any , then the GBJ p-value for this set would be less than . Plotting the bounds for various tests using different values of , , and shows us exactly what types of signals a given test is well-powered to detect at level . For the same setting, a test with smaller bounds is preferred, as it will provide more finite sample power.

To numerically compare the rejection boundaries, consider a simplified model of SNP-set correlation structure where the set is partitioned into only two sections. Let one section be the independence section, where all SNPs in this portion are completely independent of all other SNPs in the set. Let the other section be the correlated section, where all SNPs in this portion have common pairwise correlation with other SNPs in the section. For our numerical study, for the correlated section. We investigate SNP-sets of size and , correlated sections which contain 50% and 75% of the SNPs, and tests at size . These parameters are chosen to represent reasonable boundaries on the correlation structures seen in common GWAS data; Dawson et al. (2002) estimated that the average between SNPs separated by 100kb is around 0.1.

The rejection regions for each SNP-set are plotted in Figure 1. At the th coordinate on the x-axis, if the observed lies above the boundary of a particular test at that coordinate, then we would reject the null for that test at level . The lines on the graph are added to aid in visualization, but there should be no interpretation of interpolation between the points. It does not make sense to think of the boundary at , for example. While standard methods for inference on HC and BJ are invalid in the presence of correlation, valid p-values for these tests can be computed with the same ideas we have introduced for GBJ inference, specifically following equations (5)-(7). Thus we can show that HC and BJ sometimes have much less desirable rejection regions when SNPs in a set are correlated.

One of the clearest trends from Figure 1 is that the HC and GHC boundaries are lower for a small region around , and then the BJ and GBJ boundaries quickly become smaller as we move left. This behavior indicates that HC and GHC are better at detecting the sparsest alternatives with only a very few signals, as those signals would almost always manifest as the test statistics with the largest magnitude. In contrast, the plots demonstrate that BJ and GBJ can have more power to detect weaker, less sparse signals which may be more easily found by examining the test statistic which is, say, fifth or tenth largest in magnitude. The boundaries of HC and GHC can drop below BJ and GBJ again for the smallest observed magnitudes, but signals would only be found in these observations if they are particularly dense, a setting which is not the focus of our efforts. The intuition we can glean from this figure is closely aligned with the theoretical development of Donoho and Jin (2004) and the simulations of Li and Siegmund (2015) when the marginal test statistics are independent. These authors showed that HC is attuned to detect sparse signals arising at the very tail of the observed distribution, while BJ has more power as the number of signals rises.

These results also show why BJ is likely to have low power for detecting sparse signals when the level of correlation is high. When 75% of the SNPs are correlated, the rejection boundary for BJ at the largest few observations is the highest by multiple orders of magnitude on the p-value scale. It would not be desirable to apply BJ in these types of settings, as the test loses an extremely large amount of sensitivity to detect signals in the most outlying values. BJ is still likely to be suitable for detecting dense signals in these situations. Here, GBJ acts as a compromise between BJ and GHC under high correlation. GBJ provides a much lower boundary than BJ at the tail in exchange for slightly higher boundaries near the middle. Thus, GBJ can detect both sparse and dense signals in this example. On the other hand, GBJ provides a slightly higher boundary than GHC at the tail in exchange for lower boundaries past the tail, so it trades some power in the extremely sparse regime for more power to detect moderately sparse signals.

We see that choosing a different statistic is essentially choosing a different boundary-setting algorithm, and this choice should ideally be informed by parameters such as the amount of correlation and estimated sparsity level. Ultimately these plots illustrate that there is no single best global test for all types of alternatives. A genome-wide analysis strategy using the omnibus test will be likely to have robust power across different sparsity settings, correlation structures, and SNP-set sizes.

## 5 Simulation Results

### 5.1 Type I Error of the Generalized Berk-Jones Test

We first illustrate that our p-value calculations for the GBJ and omnibus tests are accurate enough to control the Type I error rate at levels required to declare genome-wide significance of a SNP-set. To replicate the setting of traditional GWAS data, we perform the size simulation on random regions across chromosome 5 which correspond to known gene sizes. We also conduct the size simulation on a high-LD subset of the FGFR2 gene and a low-LD subset of the FGFR2 gene to parse LD-related effects. We choose FGFR2 because it contains both high and low LD regions, and because it will later be the most significant gene in our analysis of the CGEMS data. All SNP-sets contain genotypes simulated with HAPGEN2 (Su et al. 2011) using the CEU population from HapMap3 as a reference panel.

In all simulations the outcome is generated as , and we fit the linear regression model (1) with and . Each simulation is repeated 20 million times, and we report the Type I error down to . Table 1 shows that our GBJ p-value calculation is accurate and protects the correct size for correlation structures seen in actual data. The p-value calculation for the omnibus test is similarly accurate at the most stringent significance levels, but it is somewhat conservative at larger significance levels.

Nominal | GBJ, WLD | GBJ, SLD | GBJ, Chr5 | OMNI, WLD | OMNI, SLD | OMNI, Chr5 |
---|---|---|---|---|---|---|

### 5.2 Power of GBJ Under Varying Hypothetical Correlation Structures and Sparsity Settings

To study the power of GBJ, we first conduct simulations under a variety of hypothetical correlation structures and sparsity settings. The performance of the GBJ is compared to the minimum p-value test, GHC, SKAT, and the omnibus test described in Section 3.5. The MinP test p-value is calculated by casting MinP as a boundary-defining test with for all , and then computation proceeds through the methods described in Section 3.4. GBJ and GHC similarly use the p-value calculation described above. For SKAT, we run the corresponding R package.

To study how power is impacted by different correlation structures between the SNPs, we utilize block correlation structures which are slightly more complex than those used for the rejection region analysis in Section 4. Specifically, consider a set of causal SNPs that are correlated amongst themselves with common pairwise correlation . All other SNPs are then non-causal, and we allow half of them to have an exchangeable correlation structure with correlation ; the other half of the non-causal SNPs are completely independent of all other non-causal SNPs. Finally the pairwise correlation between a causal SNP and a non-causal SNP is set at . The three correlations will vary between 0 and 0.3. All SNPs are generated to have minor allele frequency of 0.3.

We demonstrate the effects of signal sparsity by using a large SNP-set of =100 SNPs and varying the number of causal SNPs from to . This allows us to examine power profiles in the very sparse regime (one to three causal SNPs), in the moderately sparse regime (four to nine causal SNPs), and at the edge of the dense regime (ten or more causal SNPs). The true disease model is

(8) |

where all the are the same and depend on the number of causal SNPs . We reduce slightly as increases in order to keep the power of each test below one throughout the entire sparse regime. The full details on effect size for each simulation are available in the Supplementary Materials. Figure 2 considers the case where the noise SNPs are independent, and Figure 3 considers the case where the noise SNPs are correlated. We perform 500 simulations at each different value of and test at . All the power curves are smoothed to show empirical power.

The first significant trend appearing in Figure 2 is the effect of sparsity on power. Among the non-omnibus tests, we see that GHC and MinP perform well when the number of causal SNPs is low, as these tests often have the most power in the very sparse regime. In both panels of Figure 2, the transition to GBJ having more power than GHC and minP occurs in the moderately sparse regime. Then as the number of causal SNPs increases into the dense regime, SKAT begins to catch up and eventually becomes the most powerful test. This behavior matches our intuition as well as previously published simulation results. GHC and minimum p-value place excess weight on the most outlying observations, so they are well-tuned to detect the very sparse signals. The rejection region of GBJ is better-suited to find moderately sparse signals, and SKAT is known to perform well with dense signals.

The relationship between sparsity and power can be modified by the total amount of correlation. In the left panel of Figure 3 we set , which corresponds to the situation where causal SNPs and non-causal SNPs are correlated within themselves, but the two groups are independent of each other. In this case, MinP and GHC become the top-performing tests for a larger range of sparsity settings, with GBJ losing some of its advantage in the moderately sparse regime. SKAT has almost no power in these situations, as the signals are sparse and there is no correlation between causal and non-causal SNPs. It appears that a large amount of correlation between the non-causal SNPs is detrimental to the performance of GBJ. An explanation for this behavior can be found in the rejection region analysis of Figure 1. We see that the bounds of GBJ appear less favorable compared to GHC when the amount of correlation is high. Since over half of the SNPs in Figure 3 are correlated with , these settings represent a much larger amount of total correlation than was present in Figure 2.

In the right panel of Figure 3 we investigate the setting of by using an exchangeable correlation structure, and SKAT dominates as the most powerful test across almost all sparsity levels. Here we break slightly from the above framework by using exchangeable correlation to accommodate while still ensuring the correlation matrix of the SNPs is positive definite. GBJ is a close second to SKAT under most sparsity settings with these parameters. SKAT is known to have good performance in the presence of LD between causal SNPs and noise SNPs, which makes signals appear to be dense. The increased density of signals also buoys the performance of GBJ compared to GHC and minimum p-value, which perform the worst under exchangeable correlation.

Never losing too much power compared to the best test, the omnibus test appears to be robust to LD structure and sparsity. This behavior is expected as our omnibus approach integrates information from tests which perform well across multiple sparsity settings. Thus we would anticipate that OMNI is more resilient than any single test.

GBJ also demonstrates good power in a variety of situations, and its overall strength and robustness across the entire sparse regime are attractive properties. In particular, GBJ is the best-performing test when the level of sparsity is moderate and there is weak correlation among the noise SNPs. GBJ is disadvantaged against minP and GHC when signals are extremely sparse or there is excess correlation among noise SNPs, but it outperforms these tests as signals become more dense. In contrast, GBJ provides slightly less power than SKAT as signals reach the dense regime or when there is moderate correlation between causal and non-causal SNPs, but GBJ is also much more robust than SKAT when signals are sparse and not correlated with noise SNPs. Similar to OMNI, GBJ does not fall far behind the best test in any given situation, which suggests that GBJ is a good choice to use when the signal sparsity is unknown. Power for the standard BJ is not plotted in the interest of space, but it behaves like a dense test, similar to SKAT, under correlation. This behavior again matches Figure 1, which showed that BJ is more suited to detect dense signals as the amount of correlation increases.

### 5.3 Power of GBJ under Actual Chromosome 5 Correlation Structures

We conduct one final simulation to investigate the power of Generalized Berk-Jones under the unstructured LD patterns found in real GWAS data. In this simulation, we choose blocks of 40 SNPs at random locations on chromosome 5, and then genotype data are generated using HAPGEN2. We choose 40 to again approximately match the characteristics of FGFR2. There are 2000 blocks chosen for each sparsity level, and 10 simulations are performed on each block, for a total of 20000 at each number of causal SNPs. Again, the effect size is decreased as the number of causal SNPs increases, with the outcome still generated from equation (8). Testing is performed at to mimic a practical analysis.

We see in the left panel of Figure 4 that GBJ, GHC, and the omnibus test all have very similar power curves in this setting, while SKAT and minP lag slightly behind. As the number of causal SNPs increases, GBJ demonstrates the best power by a small amount. These results are rather homogenous because sparsity levels are more coarse and because the parameters are a mix of the values defined in Figures 2 and 3.

Recall that GBJ extended its advantage over GHC when there was less correlation among noise SNPs. When restricting our analysis to the blocks which have median , we see a more substantial superiority of GBJ over GHC in the moderately sparse regime, similar to Figure 2. Thus it is possible to recover the patterns of the structured simulation in real genotype data. Obviously median is not a perfect summary measure, as it cannot single-handedly capture all the parameters in a correlation matrix. Further parsing of the data would be necessary to see larger differences in performance. In a practical setting, we might switch between tests based on certain SNP-set characteristics, such as applying GBJ when the set is large and likely to have moderately sparse signals. These results do again demonstrate the robustness of GBJ across multiple situations, as it provides the most power across a large portion of the sparse regime.

## 6 Gene-Level Analysis of the CGEMS GWAS Data

The CGEMS breast cancer dataset contains a case-control sample of 1145 breast cancer cases, all postmenopausal women with European ancestry, and 1142 controls recruited from the Nurses’ Health Study. These women were genotyped at approximately 550000 SNPs with the Illumina HumanHap500 array. The dataset was originally analyzed by Hunter et al. (2007) in the single-marker GWAS approach. The authors did not find any individual SNPs to reach the genome-wide significance level of , but they highlighted FGFR2 as a strong candidate for future studies based on four SNPs in the gene that showed suggestive association with breast cancer. Such a situation succinctly illustrates the burden of adjusting for multiple comparisons when testing individual SNPs. Gene-level analysis provides an attractive alternative strategy that can reduce the number of comparisons and also aggregate evidence of signals across multiple SNPs in a gene. Here we perform a gene-level analysis to study the association between genes and breast cancer risk.

Since individual-level genotype data were available for this study, we first calculated the marginal test statistics for each SNP using the model in Section 2.1. Specifically, we fit a logistic regression model with four covariates - age and the first three genotype principal components to control for population structure (Price et al. 2006). Then, for each of 14991 genes, we collected the marginal test statistics for all SNPs located within the region defined by that gene. Each gene with more than one marginal SNP test statistic was analyzed with GBJ, GHC, SKAT, MinP, and the omnibus test.

In Table 2, we rank the top ten genes according to the smallest p-value produced by any of the five tests. In this sample, GBJ provides the strongest evidence of association for the top four genes and five of the top ten. Most of these genes are ranked highly by multiple other methods, however no other method provides the lowest p-value for more than two of the top ten genes. In fact, GHC and MinP produce the smallest p-value only once between the two of them. One possible explanation for the underperformance of GHC and MinP is that there may be multiple tagged SNPs surrounding the true causal loci for each of these genes, which could create a lack of extremely sparse alternatives.

Gene | GHC | GBJ | MinP | SKAT | OMNI | |
---|---|---|---|---|---|---|

FGFR2 | ||||||

CNGA3 | ||||||

PTCD3 | ||||||

POLR1A | ||||||

ZNF263 | ||||||

VWA3B | ||||||

TBK1 | ||||||

ABCA1 | ||||||

MMRN1 | ||||||

TIGD7 |

The lowest p-value for any gene over all five tests is produced by testing FGFR2 with GBJ, supporting the conclusions of Hunter et al. (2007). Since FGFR2 appears to have signals coming from at least four different SNPs and contains 35 SNPs in total, it would seem to fall into the category of moderate signal sparsity, where GBJ has good performance. Thus we may have expected beforehand that GBJ would be the most powerful test for this gene. FGFR2 has been further validated as a breast cancer associated locus in multiple follow-up studies (Meyer et al. 2008; Liang et al. 2008).

Besides FGFR2, genes such as PTCD3 and POLR1A have also been implicated as risk loci in independent investigations (Boehm et al. 2007; Jia et al. 2011). The overlap of our findings with other studies and other statistics provides a level of reassurance that GBJ performs well in identifying truly significant genes and not simply spurious associations. Alternately, ABCA1 is an example of a gene that may not have received further scrutiny if we were not utilizing the GBJ test. ABCA1 expression has been linked with breast cancer risk (Smith and Land 2012), but MinP and GHC do not provide the same strength of evidence that GBJ does. It seems likely that there are more than a few signal SNPs in ABCA1, especially since ABCA1 contains a relatively large number of SNPs compared to the other genes in this dataset.

Perhaps due to the limited sample size, no test produces a p-value low enough to be declared significant after Bonferroni correction for 14991 genes. Still, this analysis highlights the advantages of Generalized Berk-Jones compared to alternative tests. The GBJ p-value for FGFR2 does come very close to the Bonferroni-corrected level (), and it certainly provides more evidence of association than the single SNP statistics. Additionally, GBJ often gives the highest measure of significance, and never the lowest, in the genes displayed, demonstrating its robustness across different set sizes and LD patterns.

## 7 Discussion

We have proposed the Generalized Berk-Jones statistic to test for association between a SNP-set and an outcome. Our GBJ generalizes the standard Berk-Jones by modifying the BJ statistic to directly account for the correlation between individual SNPs. This modification results in a test that is more powerful when SNPs are in LD. We also provide an analytic p-value calculation for GBJ and generalize it to a class of supremum-based global tests, allowing valid inference for HC, GHC, BJ, and other methods when these procedures are applied as SNP-set tests using correlated marginal test statistics. Rejection region analysis demonstrates that GBJ can be described as a compromise between Berk-Jones and Higher Criticism-type tests in terms of finite sample performance.

While our numerical analysis shows situations where GBJ does not set the lowest boundary at either or , GBJ generally comes very close to the lowest boundary at both locations, which affords it both robustness to signal sparsity and power to detect moderately sparse signals. GHC and HC often set the lowest boundary around , but in return they concede a large amount of volume past the first few most extreme observations, which lowers power in the moderately sparse regime. BJ frequently sets the lowest boundary past the tail, but its tail boundary can be orders of magnitude larger than that of GBJ, HC, and GHC. Bounds in the expected signal regions must be viewed holistically, so slightly lower bounds at a few locations are not necessarily desirable if the price is much higher bounds in other signal locations, as in the case of BJ. Thus GBJ offers good power to detect moderately sparse effects without losing too much power when single-SNP signals are extremely sparse.

Simulation results reinforce the conclusions we find from examining the rejection regions of GHC and GBJ. Additionally we see that the MinP test performance is quite good when signals are very sparse, similar to GHC, but MinP does not perform as well as GHC when signals become more dense. SKAT has a unique power profile, as it can be particularly powerful when signals are dense or there is correlation between causal and non-causal SNPs, but it is also not robust to different correlation structures and will often have very little power in sparse settings where there is no correlation between causal and non-causal SNPs. The omnibus test offers robust power across different sparsity levels, and while it is rarely the best test, it also never has the worst power. When applied to data from the CGEMS study, we see that GBJ often produces the most significant p-values, perhaps owing to its versatility across different parameter settings.

In demonstrating that the BJ statistic can be adapted for increased robustness to correlation, we have also demonstrated that these types of boundary-defining algorithms can be modified to increase finite sample power under specific set-level parameters. It would be of interest to develop different boundary-defining methods that offer more favorable rejection regions in narrow but well-defined settings. For example, it may be possible to define tests which outperform GHC or minP over finer partitions of the extremely sparse regime.

In a similar vein, it would be interesting to understand the boundary shapes for other previously proposed boundary algorithms (Jager and Wellner 2007) in the class of Berk-Jones and Higher Criticism. While many of these algorithms share the same asymptotic guarantees of BJ and HC, little is known about their comparative finite sample performance, especially when observations in a set are correlated. These other methods might also have great value in specific settings as mentioned above. It would additionally be very valuable to develop SNP-set tests that are optimal in certain senses for arbitrary sparsity and correlation.

As genomic data collection techniques continue to evolve, it may be necessary to adapt the GBJ as well. In particular, the rise of whole genome sequencing and fine mapping studies is leading to the discovery of more SNPs with extremely rare minor alleles. Marginal test statistics generated from these SNPs are known to be non-Gaussian in finite samples, and thus they will not have the distribution we assume for GBJ. GBJ will need to account for null distributions that are not standard normal before SNP-sets containing rare variants can be tested.

SUPPLEMENTARY MATERIAL

## References

- 1000 Genomes Project Consortium (2015) 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526(7571):68–74.
- Barnett et al. (2017) Barnett, I., Mukherjee, R., and Lin, X. (2017). The generalized higher criticism for testing SNP-set effects in genetic association studies. Journal of the American Statistical Association, 112(517):64–76.
- Berk and Jones (1979) Berk, R. H. and Jones, D. H. (1979). Goodness-of-fit test statistics that dominate the kolmogorov statistics. Probability Theory and Related Fields, 47(1):47–59.
- Boehm et al. (2007) Boehm, J. S., Zhao, J. J., Yao, J., Kim, S. Y., Firestein, R., Dunn, I. F., Sjostrom, S. K., Garraway, L., Weremowicz, S., and Richardson, A. (2007). Integrative genomic approaches identify IKBKE as a breast cancer oncogene. Cell, 129:1065.
- Conneely and Boehnke (2007) Conneely, K. and Boehnke, M. (2007). So many correlated tests, so little time! Rapid adjustment of p-values for multiple correlated tests. The American Journal of Human Genetics, 81:1158.
- Dawson et al. (2002) Dawson, E., Abecasis, G. R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D. M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., Carter, D., Papaspyridonos, M., Livingstone, S., Ganske, R., LÃµhmussaar, E., Zernant, J., TÃµonisson, N., Remm, M., MÃ¤gi, R., Puurand, T., Vilo, J., Kurg, A., Rice, K., Deloukas, P., Mott, R., Metspalu, A., Bentley, D. R., Cardon, L. R., and Dunham, I. (2002). A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418(6897):544–548.
- Donoho and Jin (2004) Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, 32(3):962–994.
- Hunter et al. (2007) Hunter, D., Kraft, P., Jacobs, K., Cox, D., Yeager, M., Hankinson, S., Wacholder, S., Wang, Z., Welch, R., Hutchinson, A., and Wang, J. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39(7):870–874.
- Jager and Wellner (2007) Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. The Annals of Statistics, 35(5):2018–2053.
- Jia et al. (2011) Jia, P., Zheng, S., Long, J., Zheng, W., and Zhao, Z. (2011). dmGWAS: dense module searching for genome-wide association studies in proteinprotein interaction networks. Bioinformatics, 27(1):95–102.
- Lee et al. (2014) Lee, S., Abecasis, G. R., Boehnke, M., and Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23.
- Li and Leal (2008) Li, B. and Leal, S. M. (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83:311.
- Li and Siegmund (2015) Li, J. and Siegmund, D. (2015). Higher criticism: -values and criticism. Annals of Statistics, 43(3):1323–1350.
- Liang et al. (2008) Liang, J., Chen, P., Hu, Z., Zhou, X., Chen, L., Li, M., Wang, Y., Tang, J., Wang, H., and Shen, H. (2008). Genetic variants in fibroblast growth factor receptor 2 (FGFR2) contribute to susceptibility of breast cancer in chinese women. Carcinogenesis, 29(12):2341–2346.
- Manolio et al. (2009) Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., and McCarthy, M. I. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.
- McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. CRC press.
- Meyer et al. (2008) Meyer, K. B., Maia, A.-T., O’Reilly, M., Teschendorff, A. E., Chin, S.-F., Caldas, C., and Ponder, B. A. (2008). Allele-specific up-regulation of FGFR2 increases susceptibility to breast cancer. PLoS Biology, 1(5):e108.
- Moscovich-Eiger and Nadler (2017) Moscovich-Eiger, A. and Nadler, B. (2017). Fast calculation of boundary crossing probabilities for poisson processes. Statistics & Probability Letters, 123:177–182.
- Pasaniuc and Price (2016) Pasaniuc, B. and Price, A. L. (2016). Dissecting the genetics of complex traits using summary association statistics. Nature Genetics Reviews, 18:117–127.
- Prentice (1986) Prentice, R. L. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81(394):321–327.
- Price et al. (2006) Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909.
- Smith and Land (2012) Smith, B. and Land, H. (2012). Anticancer activity of the cholesterol exporter ABCA1 gene. Cell Reports, 2.3:580–590.
- Su et al. (2011) Su, Z., Marchini, J., and Donnelly, P. (2011). HAPGEN2: simulation of multiple disease snps. Bioinformatics, 27(16):2304–2305.
- Visscher et al. (2012) Visscher, P. M., Brown, M. A., McCarthy, M. I., and Yang, J. (2012). Five years of GWAS discovery. The American Journal of Human Genetics, 90(1):7–24.
- Walther (2013) Walther, G. (2013). The average likelihood ratio for large-scale multiple testing and detecting sparse mixtures. In From Probability to Statistics and Back: High-Dimensional Models and Processes, volume 9, pages 317–326, Beachwood, OH. IMS.
- Wu et al. (2010) Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J., and Lin, X. (2010). Powerful SNP-set analysis for case-control genome-wide association studies. The American Journal of Human Genetics, 86(6):929–942.
- Wu et al. (2011) Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93.