Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies

Wei Jiang and Weichuan Yu (e-mail: eeyu@ust.hk), Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

Abstract

In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze data sets from multiple GWASs.

In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous data sets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical data sets of four phenotypes. The R-package is available at: http://bioinformatics.ust.hk/Jlfdr.html.

1 Introduction

Understanding the genetic mechanisms of common diseases and traits is important in biological and medical research. The goal of genome-wide association studies (GWASs) is to discover single nucleotide polymorphisms (SNPs) associated with susceptibility to common diseases/traits (Altshuler et al., 2008). Due to decreasing genotyping costs (Perkel, 2008), constantly emerging success stories (Klein et al., 2005; Kraft and Haiman, 2010) and the efforts of GWAS consortia (Wellcome Trust Case Control Consortium, 2007; Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), more and more GWASs have been conducted for common phenotypes (Welter et al., 2014).

Analyses of GWAS results show that the identified associations can explain only a small part of the additive genetic variance. This is referred to as the “missing heritability” problem (Manolio et al., 2009). The hints of hidden heritability (Gibson, 2010; Yang et al., 2010) and the estimated distribution of common SNPs’ effect sizes (Park et al., 2010) suggest that common diseases/traits are influenced by thousands of SNPs with small effects. To discover these genetic variants with small effects, we need to improve the power of our studies. Jointly analyzing data sets from multiple GWASs on the same disease in the same population provides an opportunity to improve the power.

There are two kinds of joint analysis methods: individual-level joint analysis and summary-statistics-based joint analysis. Individual-level joint analysis uses individual-level genotype data from all studies. One such example is mega-analysis (Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium, 2013), which pools all data together. Summary-statistics-based joint analysis only uses summary statistics from different studies. Since individual-level genotype data is difficult to access, summary-statistics-based analysis is widely used in joint analysis. The most commonly used method of summary-statistics-based joint analysis is meta-analysis (Evangelou and Ioannidis, 2013), which derives a new statistic for each SNP using summary statistics from multiple studies.

Our focus in this paper is on summary-statistics-based joint analysis methods. More specifically, we would like to study which joint analysis method provides the highest power at a given false discovery rate level. Figure 1 illustrates our motivation.

Figure 1: Rejection boundaries determined by different summary-statistics-based joint analysis methods: the optimal method and the meta-analysis method. Assume we jointly analyze data from two GWASs. For simplicity, we assume the tests are one-sided. We plot the test statistic pair of each SNP in the coordinate plane. A SNP in the upper right corner shows a more significant association than a SNP in the bottom left corner. The truly associated SNPs are plotted as blue circles, and the non-associated SNPs are plotted as yellow triangles. For each rejection boundary, the SNPs in the upper right region are discovered. Both analysis methods have the same false discovery proportion (10%). The optimal method has higher empirical power (red solid line, 72%) than the meta-analysis method (purple dashed line, 36%).

Our major contribution in this paper is a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). The Jlfdr generalizes the concept of the local false discovery rate (Efron, 2005) from the analysis of a single study to the joint analysis of multiple studies. We prove that our method is the most powerful summary-statistics-based joint analysis method at a given false discovery rate level. In particular, the Jlfdr-based method is more powerful than commonly used meta-analysis methods when analyzing heterogeneous data sets from multiple GWASs.

The rest of this paper is organized as follows. In Section 2, we first give the mathematical formulation of summary-statistics-based joint analysis methods. We prove that the most powerful summary-statistics-based joint analysis method should control the Jlfdr. We then give implementation details of the Jlfdr-based method under the Gaussian mixture model and discuss the relationship between the Jlfdr-based method and meta-analysis methods. In Section 3, we use simulation experiments to demonstrate that the Jlfdr-based method outperforms meta-analysis methods in terms of power, and we show empirical results on four different data sets. In Section 4, we discuss limitations of our current method. Section 5 concludes the paper.

2 Method

2.1 Notations and criteria

Our method deals with a multiple-GWAS setting. For simplicity, we illustrate the concepts with a two-GWAS setting. We use a parenthesized superscript $(k)$ to denote the study index. For example, the sample sizes in studies 1 and 2 are $n^{(1)}$ and $n^{(2)}$, respectively. We use the subscript $i$ ($i = 1, \dots, m$, where $m$ is the total number of genotyped SNPs) to denote the SNP index.

To detect associations, we construct a null hypothesis for each SNP, under which no association is assumed to exist. Assume we use a $z$-value scheme to detect associations between SNPs and the phenotype, i.e., the test statistics follow a standard normal distribution under the null hypothesis. We use $\hat{\beta}_i^{(k)}$ to denote the observed effect size of SNP $i$ in study $k$. The asymptotic standard error of $\hat{\beta}_i^{(k)}$ is $s_i^{(k)}$. Correspondingly, the test statistic in study $k$ is $z_i^{(k)} = \hat{\beta}_i^{(k)} / s_i^{(k)}$. The underlying expected effect size is $\beta_i^{(k)}$. The expected effect size of the same SNP may vary in different studies due to heterogeneity. The test statistic $Z_i^{(k)}$ (an uppercase letter indicates a random variable) follows an $N(\beta_i^{(k)} / s_i^{(k)}, 1)$ distribution. We use $\mathbf{z}_i$ to represent the vector of test statistics in all studies, i.e., $\mathbf{z}_i = (z_i^{(1)}, z_i^{(2)})^T$. Similarly, we use $\boldsymbol{\beta}_i$ to represent the vector of expected effect sizes in all studies, i.e., $\boldsymbol{\beta}_i = (\beta_i^{(1)}, \beta_i^{(2)})^T$.

We further assume that $m_0$ SNPs have no association with the phenotype and $m_1$ SNPs have associations. Thus, the null proportion reads $\pi_0 = m_0 / m$ ($m = m_0 + m_1$). We use $H_i = 0$ and $H_i = 1$ to denote the null hypothesis and the alternative hypothesis for SNP $i$, respectively.

In the joint analysis of summary statistics from multiple GWASs, we assume that $R$ of the hypotheses are rejected. There are $V$ false positives and $S$ true positives (i.e., $R = V + S$). Table 1 summarizes the numbers of hypotheses in the different categories.

                         Null is true ($H_i = 0$)    Null is false ($H_i = 1$)    Total
Null is rejected                  $V$                          $S$                 $R$
Null is not rejected           $m_0 - V$                    $m_1 - S$             $m - R$
Total                            $m_0$                        $m_1$                 $m$
Table 1: The status of all hypotheses in the joint analysis. The letter in each cell denotes the count of the hypotheses in that category.

When testing multiple hypotheses, it is very easy to obtain false positives by random chance. This problem is known as the “multiplicity” problem. Many criteria have been proposed to address the multiplicity problem; we present an incomplete list in Table 2(a). Let us define the false discovery proportion (Fdp) as $\mathrm{Fdp} = V / (R \vee 1)$, with “$\vee$” denoting the maximum operation. The Fdp is an unknown quantity in real cases. The classical false discovery rate (FDR) is the expectation of the Fdp. Controlling the FDR is more powerful than controlling the family-wise error rate (FWER). The Bayesian false discovery rate (Fdr) is the expected value of $V/R$ given $R > 0$. Compared to the FDR, the Fdr is conditional on $R > 0$ since we are only interested in controlling false positives when at least one hypothesis is rejected. We adopt the Fdr in this paper as the criterion to avoid a plethora of false positives.

Criteria                                 Mathematical Definitions                    References
Family-wise error rate (FWER)            $P(V \ge 1)$                                Tukey (1953)
False discovery rate (FDR)               $E[V / (R \vee 1)]$                         Benjamini and Hochberg (1995)
Bayesian false discovery rate (Fdr)      $E[V / R \mid R > 0]$                       Storey (2003)
(a) Different criteria for controlling false positives in the multiple testing scenario.
Criteria           Mathematical Definitions                                          References
Power              $P(\mathbf{Z} \in \mathcal{R} \mid \boldsymbol{\beta})$           Neyman and Pearson (1933)
Bayesian power     $E[S / m_1]$                                                      Kruschke (2010)
(b) Different criteria for measuring the amount of true positives in the multiple testing scenario.
Table 2: Different criteria for evaluating a rejection region in the multiple testing scenario. Here $\mathcal{R}$ is the rejection region in the analysis, $\boldsymbol{\beta}$ denotes effect sizes, and $V$, $R$, $S$ and $m_1$ as well as other notations are explained in Table 1. “$\vee$” denotes the maximum operation.

In addition to controlling false positives, we also need a criterion to measure the amount of true positives when evaluating a rejection region. A direct concept is power. The classical definition of power is a function of a given effect size, as shown in the first row of Table 2(b). Since the effect sizes of associated SNPs are different and unobserved, the actual power values are unknown. The Bayesian power removes the dependence of power on effect size by taking the expectation of the empirical power, which is defined as $S / m_1$ (the proportion of truly associated SNPs that are discovered). We list the definitions of the power and the Bayesian power in Table 2(b). In this paper, we use the Bayesian power as the criterion to measure the amount of true positives.
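To make these criteria concrete, here is a small sketch (in Python, with function names and toy data of our own choosing; they are not part of the paper or its R-package) that computes the false discovery proportion and the empirical power for a labeled set of hypotheses.

```python
import numpy as np

def fdp(rejected, is_null):
    """False discovery proportion: V / max(R, 1)."""
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    R = rejected.sum()
    V = (rejected & is_null).sum()
    return V / max(R, 1)

def empirical_power(rejected, is_null):
    """Empirical power: S / m1, the fraction of truly associated SNPs rejected."""
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    m1 = (~is_null).sum()
    S = (rejected & ~is_null).sum()
    return S / max(m1, 1)

# Toy example: 10 hypotheses, the last 4 are truly associated.
is_null = np.array([True] * 6 + [False] * 4)
rejected = np.array([False, True, False, False, False, False,
                     True, True, True, False])
print(fdp(rejected, is_null))              # 1 false positive among 4 rejections -> 0.25
print(empirical_power(rejected, is_null))  # 3 of 4 associated SNPs discovered -> 0.75
```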

Both the Fdr and the Bayesian power are functions of the rejection region $\mathcal{R}$. For two different rejection regions with the same Fdr level, we prefer the region with the higher Bayesian power because it finds more true associations without increasing the proportion of false positives among the findings. Thus, we propose a joint analysis method that determines the optimal rejection region when controlling the Fdr at a certain threshold $q$, i.e.,

$\mathcal{R}_{opt} = \arg\max_{\mathcal{R}} \mathrm{Pwr}(\mathcal{R}) \quad \text{subject to} \quad \mathrm{Fdr}(\mathcal{R}) \le q.$    (1)

Here $\mathrm{Pwr}(\mathcal{R})$ denotes the Bayesian power. Actually, when controlling the Fdr at the same threshold, meta-analysis methods can also be regarded as solutions to the above optimization problem with a further constraint on the form of $\mathcal{R}$, i.e.,

$\mathcal{R} = \{ \mathbf{z} : g(z^{(1)}, z^{(2)}) \ge c \}.$    (2)

Here $\mathbf{z} = (z^{(1)}, z^{(2)})^T$, and $g$ is a combining function which takes different forms in different meta-analysis methods. We will give the explicit forms of the combining function used in meta-analysis methods in Subsection 2.4. We will also discuss the relationship between our proposed method and meta-analysis methods in detail in that subsection. In the next subsection, we present the solution to the optimization problem in Eq. (1).

2.2 Jlfdr and optimal rejection region

To derive the solution to the optimization problem in Eq. (1), we first need to introduce the concept of the joint local false discovery rate (Jlfdr). The Jlfdr is a simple extension of the local false discovery rate (Efron, 2005) from the analysis of a single study to the joint analysis of multiple studies. It reads as

$\mathrm{Jlfdr}(\mathbf{z}_i) = P(H_i = 0 \mid \mathbf{Z}_i = \mathbf{z}_i),$    (3)

which is the posterior probability of the null hypothesis, given the observed summary statistic vector $\mathbf{z}_i$.

The relationship between Jlfdr and Fdr is (see the Supplementary Note for details)

$\mathrm{Fdr}(\mathcal{R}) = E[\mathrm{Jlfdr}(\mathbf{Z}) \mid \mathbf{Z} \in \mathcal{R}].$    (4)

In other words, the Fdr is the expectation of the Jlfdr, given that the test statistic vector falls in the rejection region $\mathcal{R}$.

Let us define a rejection region $\mathcal{R}_{opt}(t) = \{ \mathbf{z} : \mathrm{Jlfdr}(\mathbf{z}) \le t \}$, where $t$ is a threshold such that $\mathrm{Fdr}(\mathcal{R}_{opt}(t)) = q$. We have the following theorem:

Theorem 1.

For any rejection region $\mathcal{R}$ with $\mathrm{Fdr}(\mathcal{R}) \le q$, we have $\mathrm{Pwr}(\mathcal{R}) \le \mathrm{Pwr}(\mathcal{R}_{opt}(t))$.

We show the proof of Theorem 1 in the Supplementary Note. Theorem 1 shows that $\mathcal{R}_{opt}(t)$ is the most powerful rejection region when controlling the Fdr at $q$. This suggests that we can improve the power of summary-statistics-based joint analysis by controlling the Jlfdr. In the next subsection, we provide details of the implementation of the Jlfdr-based method under the Gaussian mixture model.

2.3 Implementation of Jlfdr-based method under the Gaussian mixture model

The hints of hidden heritability (Gibson, 2010; Yang et al., 2010) and the estimated distribution of common SNPs’ effect sizes (Park et al., 2010) suggest that thousands of common SNPs with small effect sizes are associated with complex diseases. A natural prior to depict this “infinitesimal model” (Gibson, 2012) is a Gaussian distribution with mean $0$ and variance $\sigma_0^2$. We assume the effect sizes of associated SNPs have this prior distribution. Since we do not know which SNPs are associated with the disease, we propose the following two-component mixture model to describe the prior distribution of effect sizes:

$\beta_i \sim \pi_0 \delta_0 + (1 - \pi_0) N(0, \sigma_0^2),$    (5)

where $\delta_0$ is the unit point mass distribution at zero.

There may be some heterogeneity between studies: the effect sizes of the same SNP may vary from study to study. We assume the effect sizes of the same associated SNP in different studies are normally distributed around a SNP-specific mean $\mu_i$ with variance $\sigma_h^2$, i.e., $\beta_i^{(k)} \mid \mu_i \sim N(\mu_i, \sigma_h^2)$ with $\mu_i \sim N(0, \sigma_0^2)$. The distribution of the effect size vector is then

$\boldsymbol{\beta}_i \sim \pi_0 \boldsymbol{\delta}_0 + (1 - \pi_0) N_2(\mathbf{0}, \Sigma_{\beta}), \qquad \Sigma_{\beta} = \begin{pmatrix} \sigma_0^2 + \sigma_h^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 + \sigma_h^2 \end{pmatrix},$    (6)

where $\boldsymbol{\delta}_0$ is the bivariate unit point mass distribution at the origin, and $N_2(\mathbf{0}, \Sigma_{\beta})$ denotes the bivariate Gaussian distribution with expectation $\mathbf{0}$ and covariance matrix $\Sigma_{\beta}$.

Since the observed effect size $\hat{\beta}_i^{(k)}$ asymptotically follows the Gaussian distribution $N(\beta_i^{(k)}, (s_i^{(k)})^2)$, the test statistic vector with prior (6) follows a two-component Gaussian mixture distribution:

$\mathbf{Z}_i \sim \pi_0 N_2(\mathbf{0}, I) + (1 - \pi_0) N_2(\mathbf{0}, \Sigma_i + I), \qquad \Sigma_i = D_i \Sigma_{\beta} D_i, \quad D_i = \mathrm{diag}(1 / s_i^{(1)}, 1 / s_i^{(2)}).$    (7)

Here, $I$ is the $2 \times 2$ identity matrix.

In order to capture the global behavior of all SNPs, we need the marginal distribution of the test statistic vectors over all SNPs. Overall, the test statistic vector follows

$\mathbf{Z} \sim \pi_0 N_2(\mathbf{0}, I) + \frac{1 - \pi_0}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} N_2(\mathbf{0}, \Sigma_i + I).$    (8)

Here $\mathcal{A}$ is the index set of all associated SNPs, and $|\mathcal{A}| = m_1$ is the corresponding cardinality of $\mathcal{A}$. $\mathcal{A}$ is normally unknown.

The above Gaussian mixture model is computationally difficult due to the large number of components ($m_1$ is normally in the range of hundreds to thousands). To simplify the model, we use a $K$-component Gaussian mixture model to approximate the non-null components, i.e.,

$\frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} N_2(\mathbf{0}, \Sigma_i + I) \approx \sum_{k=1}^{K} \frac{\pi_k}{1 - \pi_0} N_2(\mathbf{0}, \Sigma_k + I), \qquad \sum_{k=1}^{K} \pi_k = 1 - \pi_0.$    (9)

Then we reduce the distribution of $\mathbf{Z}$ to a $(K+1)$-component Gaussian mixture model:

$\mathbf{Z} \sim \pi_0 N_2(\mathbf{0}, I) + \sum_{k=1}^{K} \pi_k N_2(\mathbf{0}, \Sigma_k + I).$    (10)

There are unknown parameters $\boldsymbol{\pi} = (\pi_0, \pi_1, \dots, \pi_K)$ and $\Sigma_1, \dots, \Sigma_K$ in the above mixture model. Dempster et al. (1977) proposed an EM-algorithm to estimate parameters in the presence of unobserved latent variables. With the observed vectors of summary statistics $\mathbf{z}_i$ ($i = 1, \dots, m$), we use the EM-algorithm to estimate the parameters $\boldsymbol{\pi}$ and $\Sigma_1, \dots, \Sigma_K$ in the Gaussian mixture model (10). Please note that $\pi_0$ is always much larger than any other entry of $\boldsymbol{\pi}$ in the GWAS setting. Hence, a Dirichlet prior is added for the proportions $\boldsymbol{\pi}$. This is the same penalty strategy as proposed by Muralidharan (2010). Our experiments show that the rejection regions are not sensitive to the penalization parameter or to the number of mixture components for the associated SNPs, $K$; our default setting fixes both at small values.
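As an illustration of this estimation step, the following is a minimal Python sketch of an EM fit for the $(K+1)$-component mixture in Eq. (10); it is not the authors' R implementation. The null component is pinned at $N_2(\mathbf{0}, I)$, the non-null covariances are free (they play the role of $\Sigma_k + I$), and a simple symmetric Dirichlet-style penalty shrinks the proportions. All function names, default values and the convergence rule are our own illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_jlfdr_mixture(z, K=3, penalty=1.0, n_iter=200, tol=1e-8, seed=0):
    """Sketch of an EM fit for the (K+1)-component mixture in Eq. (10).

    z : (m, 2) array of test-statistic vectors, one row per SNP.
    Component 0 is the null N(0, I) with fixed covariance; components 1..K
    have free zero-mean covariances (playing the role of Sigma_k + I).
    `penalty` is a symmetric Dirichlet-style regularizer on the proportions.
    """
    rng = np.random.default_rng(seed)
    m, d = z.shape
    pi = np.full(K + 1, 1.0 / (K + 1))
    covs = [np.eye(d)] + [np.eye(d) * (1.5 + rng.random()) for _ in range(K)]

    loglik_old = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities of each component for each SNP.
        dens = np.column_stack([
            multivariate_normal.pdf(z, mean=np.zeros(d), cov=covs[k])
            for k in range(K + 1)
        ])
        weighted = dens * pi
        total = weighted.sum(axis=1, keepdims=True)
        resp = weighted / total

        # M step: Dirichlet-penalized update of the proportions and
        # zero-mean covariance updates for the non-null components.
        counts = resp.sum(axis=0)
        pi = (counts + penalty) / (m + (K + 1) * penalty)
        for k in range(1, K + 1):
            w = resp[:, k][:, None]
            covs[k] = (z * w).T @ z / max(counts[k], 1e-12) + 1e-6 * np.eye(d)

        loglik = np.log(total).sum()
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return pi, covs
```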

Denote the probability density function (pdf) of the bivariate normal distribution $N_2(\mathbf{0}, \Sigma)$ as $\phi(\mathbf{z}; \Sigma)$ and the pdf of $\mathbf{Z}$ as $f(\mathbf{z})$. The Jlfdr reads

$\mathrm{Jlfdr}(\mathbf{z}_i) = \frac{\pi_0 \, \phi(\mathbf{z}_i; I)}{f(\mathbf{z}_i)} = \frac{\pi_0 \, \phi(\mathbf{z}_i; I)}{\pi_0 \, \phi(\mathbf{z}_i; I) + \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{z}_i; \Sigma_k + I)}.$    (11)

After calculating the Jlfdr, we approximate the Fdr as

$\mathrm{Fdr}(\mathcal{R}) \approx \frac{1}{\#\{i : \mathbf{z}_i \in \mathcal{R}\}} \sum_{i : \mathbf{z}_i \in \mathcal{R}} \mathrm{Jlfdr}(\mathbf{z}_i).$    (12)

We determine the optimal rejection region by Jlfdr-thresholding, i.e., we reject the SNPs whose Jlfdr is smaller than a threshold $t$. To determine the threshold $t$, we first sort the calculated Jlfdr values of all SNPs in ascending order. Denote the $j$-th smallest Jlfdr value as $\mathrm{Jlfdr}_{(j)}$. We can approximate the Fdr of the region $\mathcal{R}_j = \{ \mathbf{z} : \mathrm{Jlfdr}(\mathbf{z}) \le \mathrm{Jlfdr}_{(j)} \}$ as

$\mathrm{Fdr}(\mathcal{R}_j) \approx \frac{1}{j} \sum_{l=1}^{j} \mathrm{Jlfdr}_{(l)}.$    (13)

We use $j^*$ to denote the largest $j$ such that $\mathrm{Fdr}(\mathcal{R}_j) \le q$, namely

$j^* = \max \left\{ j : \frac{1}{j} \sum_{l=1}^{j} \mathrm{Jlfdr}_{(l)} \le q \right\}.$    (14)

Then the Jlfdr threshold is $t = \mathrm{Jlfdr}_{(j^*)}$. We reject all SNPs with $\mathrm{Jlfdr}(\mathbf{z}_i) \le t$.
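A compact sketch of Eqs. (11)–(14) in the same vein (our own helper names, not the R-package interface): `jlfdr` evaluates the posterior null probability under the fitted mixture, and `jlfdr_threshold` applies the running-mean rule of Eqs. (13)–(14); the level `q=0.05` is only an example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def jlfdr(z, pi, covs):
    """Eq. (11): posterior null probability for each SNP under the fitted mixture."""
    d = z.shape[1]
    dens = np.column_stack([
        multivariate_normal.pdf(z, mean=np.zeros(d), cov=covs[k])
        for k in range(len(covs))
    ])
    f = dens @ np.asarray(pi)           # marginal density f(z_i)
    return pi[0] * dens[:, 0] / f       # pi_0 * phi(z_i; I) / f(z_i)

def jlfdr_threshold(jl, q=0.05):
    """Eqs. (13)-(14): largest set whose running mean of sorted Jlfdr stays <= q."""
    order = np.argsort(jl)
    running_mean = np.cumsum(jl[order]) / np.arange(1, len(jl) + 1)
    passed = np.where(running_mean <= q)[0]
    reject = np.zeros(len(jl), dtype=bool)
    if passed.size == 0:
        return reject                   # nothing can be rejected at this level
    j_star = passed[-1]                 # 0-based index of the largest admissible j
    reject[order[:j_star + 1]] = True
    return reject
```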

We present the detailed steps of the Jlfdr-based method in Algorithm 1.

Inference using the EM-algorithm:
Initialize $\boldsymbol{\pi}$ and $\Sigma_1, \dots, \Sigma_K$
repeat
     E Step: compute the posterior probability that each SNP belongs to each mixture component, given the current $\boldsymbol{\pi}$ and $\Sigma_1, \dots, \Sigma_K$
     M Step: update $\boldsymbol{\pi}$ (with the Dirichlet penalty) and $\Sigma_1, \dots, \Sigma_K$ using these posterior probabilities
until  $\boldsymbol{\pi}$ and $\Sigma_1, \dots, \Sigma_K$ converge
Jlfdr-thresholding:
Initialize $j^* = m$
Calculate the Jlfdr of each SNP using Eq. (11) with the inferred $\boldsymbol{\pi}$ and $\Sigma_1, \dots, \Sigma_K$.
Sort the calculated Jlfdr values in ascending order
for $j = 1$ to $m$ do
     Calculate $\mathrm{Fdr}(\mathcal{R}_j)$ using Eq. (13).
     if $\mathrm{Fdr}(\mathcal{R}_j) > q$ then
         $j^* = j - 1$; break
Output: the SNPs with $\mathrm{Jlfdr}(\mathbf{z}_i) \le \mathrm{Jlfdr}_{(j^*)}$
Algorithm 1 Jlfdr-based method for summary-statistics-based joint analysis
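Putting the sketches above together, a hypothetical end-to-end run of Algorithm 1 on simulated statistics could look like this; it reuses the `fit_jlfdr_mixture`, `jlfdr` and `jlfdr_threshold` helpers defined earlier, and the data-generating numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, m1 = 20_000, 500
mu = np.zeros((m, 1))
mu[:m1] = rng.normal(scale=2.0, size=(m1, 1))   # shared effects for associated SNPs
z = mu + rng.normal(size=(m, 2))                # two studies, unit-variance noise

pi, covs = fit_jlfdr_mixture(z, K=3, penalty=1.0)   # EM step of Algorithm 1
jl = jlfdr(z, pi, covs)                             # Eq. (11)
reject = jlfdr_threshold(jl, q=0.05)                # Eqs. (13)-(14)
print(reject.sum(), "SNPs rejected")
```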

2.4 Relationship between Jlfdr-based method and meta-analysis methods

We have the following theorem about the rejection region of the Jlfdr-based method under the Gaussian mixture model:

Theorem 2.

In the Gaussian mixture model (8), the rejection region of the Jlfdr-based method is

$\mathcal{R}_{opt} = \left\{ \mathbf{z} : \frac{(1 - \pi_0) \, f_1(\mathbf{z})}{\pi_0 \, \phi(\mathbf{z}; I)} \ge c \right\},$    (15)

where $f_1(\mathbf{z}) = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \phi(\mathbf{z}; \Sigma_i + I)$ is the non-null mixture density in model (8) and $c$ is a constant determined by $q$. If no heterogeneity exists between studies, the rejection region is

$\mathcal{R}_{opt} = \left\{ \mathbf{z} : \left( w^{(1)} z^{(1)} + w^{(2)} z^{(2)} \right)^2 \ge c \right\},$    (16)

where $w^{(k)} \propto \sqrt{n^{(k)}}$, and $c$ is a constant determined by $q$.

We present the proof of the above theorem in the Supplementary Note.

Meta-analysis methods are the most commonly used summary-statistics-based joint analysis methods. In meta-analysis, we usually calculate a weighted average of the effect sizes in the different studies. Dividing the weighted average effect size by its standard error yields a new $z$-value-based test statistic. There are two kinds of models in meta-analysis: the fixed-effects model and the random-effects model.

In the fixed-effects model, we assume that the underlying true effect sizes in the different studies are identical. This corresponds to $\sigma_h^2 = 0$ in the Gaussian mixture model (8). The optimal weighting strategy is inverse-variance weighting, since it minimizes the variance of the weighted average. Each effect size is weighted by the inverse of its variance, i.e.,

$\hat{\beta}_{f,i} = \frac{\sum_{k} \hat{\beta}_i^{(k)} / (s_i^{(k)})^2}{\sum_{k} 1 / (s_i^{(k)})^2}.$    (17)

Here $\hat{\beta}_{f,i}$ is the weighted average effect size. Its standard error is

$s_{f,i} = \left( \sum_{k} \frac{1}{(s_i^{(k)})^2} \right)^{-1/2}.$    (18)

Dividing $\hat{\beta}_{f,i}$ by $s_{f,i}$ yields the new test statistic $z_{f,i} = \hat{\beta}_{f,i} / s_{f,i}$. We have the following theorem about the rejection region of the fixed-effects meta-analysis method (see the Supplementary Note for the detailed proof):

Theorem 3.

The rejection region of the fixed-effects meta-analysis method is asymptotically

$\mathcal{R}_{f} = \left\{ \mathbf{z} : \left( w^{(1)} z^{(1)} + w^{(2)} z^{(2)} \right)^2 \ge c \right\},$    (19)

where $w^{(k)} \propto \sqrt{n^{(k)}}$, and $c$ is a constant determined by $q$.

This kind of rejection region is illustrated in Figure 1. The region coincides with the rejection region of the Jlfdr-based method when no heterogeneity exists between studies. Hence, the Jlfdr-based method and the fixed-effects meta-analysis method will have the same performance. In contrast, if heterogeneity exists between studies, the rejection regions determined by the Jlfdr-based method and the fixed-effects meta-analysis method are different. According to Theorem 1, the rejection region determined by the Jlfdr-based method can achieve the highest power among all summary-statistics-based joint analysis methods when controlling the Fdr. In other words, the Jlfdr-based method is more powerful than the fixed-effects meta-analysis method.
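For reference, a direct sketch of the inverse-variance combination in Eqs. (17)–(18), vectorized over SNPs; the function name and the two-sided p-value convention are our additions.

```python
import numpy as np
from scipy.stats import norm

def fixed_effects_meta(beta, se):
    """Inverse-variance fixed-effects combination (Eqs. 17-18).

    beta, se : (m, K) arrays of per-study effect sizes and standard errors.
    Returns the combined effect, its standard error, the z statistic and a
    two-sided p-value for each SNP.
    """
    w = 1.0 / se ** 2
    beta_f = (w * beta).sum(axis=1) / w.sum(axis=1)
    se_f = np.sqrt(1.0 / w.sum(axis=1))
    z_f = beta_f / se_f
    p_f = 2 * norm.sf(np.abs(z_f))
    return beta_f, se_f, z_f, p_f
```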

In the random-effects model, we assume that the true effect sizes in the different studies are not identical but follow a common distribution with between-study variance $\tau_i^2$. We then adjust the weights by incorporating the between-study variance. The weighted average effect size is

$\hat{\beta}_{r,i} = \frac{\sum_{k} \hat{\beta}_i^{(k)} / \left( (s_i^{(k)})^2 + \tau_i^2 \right)}{\sum_{k} 1 / \left( (s_i^{(k)})^2 + \tau_i^2 \right)}.$    (20)

Its standard error is

$s_{r,i} = \left( \sum_{k} \frac{1}{(s_i^{(k)})^2 + \tau_i^2} \right)^{-1/2}.$    (21)

Dividing $\hat{\beta}_{r,i}$ by $s_{r,i}$ yields the new test statistic $z_{r,i} = \hat{\beta}_{r,i} / s_{r,i}$. We have the following theorem about the rejection region of the random-effects meta-analysis method (see the Supplementary Note for the detailed proof):

Theorem 4.

The rejection region of the random-effects meta-analysis method is asymptotically

$\mathcal{R}_{r} = \left\{ \mathbf{z} : \left( \mathbf{v}^T \mathbf{z} \right)^2 \ge c \right\}.$    (22)

Here $\mathbf{v} = \mathbf{s} \oslash \left( \mathbf{s}^{\circ 2} + \tau^2 \mathbf{1} \right)$, where $\mathbf{s}$ is the vector of standard errors, $\mathbf{s}^{\circ 2}$ is the 2nd Hadamard power of $\mathbf{s}$, $\oslash$ denotes element-wise division, $\tau^2$ is the between-study variance, and $c$ is a constant determined by $q$.

If no heterogeneity exists between studies, the random-effects meta-analysis method is less powerful than the fixed-effects meta-analysis method and the Jlfdr-based method. If heterogeneity exists between studies, we usually need a large number of studies to estimate the between-study variance precisely in the random-effects meta-analysis. Since we usually have only a few GWASs of the same disease in the same population, the random-effects meta-analysis is not powerful enough. The Jlfdr-based method overcomes this problem by borrowing information from all genotyped SNPs. In any case, the rejection regions determined by the Jlfdr-based method and the random-effects meta-analysis method are different. According to Theorem 1, the Jlfdr-based method is more powerful than the random-effects meta-analysis method.
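A corresponding sketch for the random-effects combination in Eqs. (20)–(21). The paper does not say how the between-study variance is estimated, so the DerSimonian–Laird moment estimator used here is just one common choice, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def random_effects_meta(beta, se):
    """Random-effects combination (Eqs. 20-21), vectorized over SNPs (rows).

    beta, se : (m, K) arrays of per-study effect sizes and standard errors.
    tau^2 is estimated per SNP with the DerSimonian-Laird moment estimator.
    """
    w = 1.0 / se ** 2
    beta_f = (w * beta).sum(axis=1) / w.sum(axis=1)      # fixed-effects mean
    K = beta.shape[1]
    Q = (w * (beta - beta_f[:, None]) ** 2).sum(axis=1)  # Cochran's Q
    denom = w.sum(axis=1) - (w ** 2).sum(axis=1) / w.sum(axis=1)
    tau2 = np.maximum(0.0, (Q - (K - 1)) / denom)        # DL estimate of tau^2
    w_r = 1.0 / (se ** 2 + tau2[:, None])
    beta_r = (w_r * beta).sum(axis=1) / w_r.sum(axis=1)
    se_r = np.sqrt(1.0 / w_r.sum(axis=1))
    z_r = beta_r / se_r
    return beta_r, se_r, z_r, 2 * norm.sf(np.abs(z_r))
```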

3 Results

3.1 Simulation experiments

We use simulation experiments to demonstrate that the Jlfdr-based method is more powerful than the commonly used meta-analysis methods in analyzing summary statistics from multiple GWASs.

In our simulation experiments, we fix the sample size of study 1 at 10000. We conduct experiments with sample sizes of 5000, 10000 and 15000 in study 2, so the sample size ratios are 0.5, 1 and 1.5, respectively. The numbers of individuals in the control group and the case group are the same in both studies, and the total number of simulated SNPs is fixed. We simulate the minor allele frequency of each SNP from a uniform distribution, and a small proportion of the SNPs are set to be associated. For associated SNPs, the expected log-odds ratio in each study is simulated according to the following model:

$\beta_i^{(k)} = \mu_i + \epsilon_i^{(k)}, \qquad \mu_i \sim N(0, \sigma_0^2), \quad \epsilon_i^{(k)} \sim N(0, \sigma_h^2),$    (23)

where $\epsilon_i^{(k)}$ captures the between-study heterogeneity. In the homogeneous setting, $\sigma_h = 0$. In the heterogeneous setting, $\sigma_h > 0$. For non-associated SNPs, the expected log-odds ratio is $0$. The prevalence of the disease is fixed. We use the log-odds ratio test to detect associations in our experiments.
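The simulation scheme can be sketched at the summary-statistics level as below. The specific parameter values (MAF range, proportion of associated SNPs, $\sigma_0$, $\sigma_h$) are illustrative placeholders rather than the values used in the paper, and the standard error of the log-odds ratio is approximated from allele counts for a balanced case-control design.

```python
import numpy as np

def simulate_two_studies(m=100_000, prop_assoc=0.01, n1=10_000, n2=10_000,
                         sigma0=0.15, sigma_h=0.05, seed=0):
    """Sketch of the Section 3.1 simulation (placeholder parameter values).

    Effect sizes follow Eq. (23): beta^(k) = mu + eps^(k), mu ~ N(0, sigma0^2),
    eps^(k) ~ N(0, sigma_h^2); sigma_h = 0 gives the homogeneous setting.
    Observed log-odds ratios are drawn as beta_hat ~ N(beta, se^2), with se
    approximated by sqrt(2 / (n * maf * (1 - maf))) for n/2 cases and n/2 controls.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, size=m)                  # assumed MAF range
    m1 = int(m * prop_assoc)
    mu = np.zeros(m)
    mu[:m1] = rng.normal(0.0, sigma0, size=m1)            # associated SNPs only

    studies = []
    for n in (n1, n2):
        eps = np.zeros(m)
        if sigma_h > 0:
            eps[:m1] = rng.normal(0.0, sigma_h, size=m1)  # between-study heterogeneity
        beta = mu + eps
        se = np.sqrt(2.0 / (n * maf * (1 - maf)))         # approximate SE of log-OR
        beta_hat = rng.normal(beta, se)
        studies.append((beta_hat, se, beta_hat / se))
    is_null = np.arange(m) >= m1
    return studies, is_null
```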

We use the Jlfdr-based method, the fixed-effects meta-analysis method and the random-effects meta-analysis method to jointly analyze the summary statistics from study 1 and study 2. The Fdr is controlled at the same nominal level $q$ for all methods. In the fixed-effects meta-analysis and the random-effects meta-analysis, we use the one-dimensional mixture method (Muralidharan, 2010) to control the Fdr at this level.

In the homogeneous setting ($\sigma_h = 0$), each SNP shares the same expected effect size between the two studies. Figure 2 presents the average empirical power and the average Fdp over 10 experimental runs for the different methods. The average Fdp is well controlled by all methods. In this setting, the average empirical powers of the Jlfdr-based method and the fixed-effects meta-analysis method are almost the same. The subtle differences are due to the random initial choices of the EM-algorithm and the Fdr approximation used in Eq. (13). This verifies the previous statement about the equivalence between the Jlfdr-based method and the fixed-effects meta-analysis method in the homogeneous setting.

Figure 2: (a) The average empirical power and (b) the average Fdp in the homogeneous setting ($\sigma_h = 0$) of the simulation experiment. The experiments are repeated 10 times for each sample size ratio (0.5, 1 and 1.5). The average Fdp values of the three methods (the Jlfdr-based method (Jlfdr), the fixed-effects meta-analysis method (MetaF) and the random-effects meta-analysis method (MetaR)) are all close to the nominal level. When controlling the Fdr at the same level, the Jlfdr-based method and the fixed-effects meta-analysis method have almost the same average empirical power. The subtle differences are due to the random initial choices of the EM-algorithm and the Fdr approximation used in Eq. (13).

In the heterogeneous setting ($\sigma_h > 0$), the expected effect sizes of each SNP vary between studies. Figure 3 plots the discovered associations using the Jlfdr-based method and the fixed-effects meta-analysis method in one run with sample size ratio 1. Although the Jlfdr-based method misses some associations detected by the fixed-effects meta-analysis method, it identifies more associations overall than the meta-analysis method. We ran the simulation experiments 10 times for each sample size ratio. Figure 4 shows the average empirical power and the average Fdp. The average Fdp of all three methods is close to the nominal level in all sample size ratio settings. From the figure, we can see that the Jlfdr-based method achieves higher power than the other methods when controlling the Fdr at the same threshold.

Figure 3: The discovered associations in the heterogeneous setting ($\sigma_h > 0$) of the simulation experiment. Both the first and the second study have 10000 individuals. For each SNP, the pair of summary statistics is plotted after transformation. We use light grey circles to represent the associations discovered by both the Jlfdr-based method and the fixed-effects meta-analysis method. We use black upward-pointing triangles and dark grey downward-pointing triangles to represent the associations discovered only by the Jlfdr-based method and only by the fixed-effects meta-analysis method, respectively. The rejection boundary of the Jlfdr-based method is plotted as the solid curve. The rejection boundary of the fixed-effects meta-analysis method is plotted as the dashed straight line. The Jlfdr-based method discovers more associations overall than the meta-analysis method, although it also misses some associations identified by the meta-analysis method.
Figure 4: (a) The average empirical power and (b) the average Fdp in the heterogeneous setting ($\sigma_h > 0$) of the simulation experiment. We ran experiments 20 times with different sample size ratios (0.5, 1 and 1.5). The average Fdp values of the three methods are all close to the nominal level. When controlling the Fdr at the same level, our proposed Jlfdr-based method achieves higher power than the other methods in every sample size ratio setting.

3.2 Real data applications

3.2.1 SCZ data from PGC

We jointly analyze the summary statistics from schizophrenia (SCZ) studies conducted by the Psychiatric Genomics Consortium (PGC). The summary statistics from two SCZ studies, Sweden+SCZ1 (Ripke et al., 2013) and SCZ2 (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), are available from the PGC. Sweden+SCZ1 is a large-scale meta-analysis of Swedish and mixed-European-ancestry individuals that comprises 13,833 schizophrenia cases and 18,310 controls. We use it as Study 1. SCZ2 is a larger-scale meta-analysis that comprises 36,989 schizophrenia cases and 113,075 controls, and its analysis includes the individuals already analyzed in Sweden+SCZ1. Using the following inverse meta-analysis formula, we obtain the summary statistics corresponding to the individuals analyzed only in SCZ2:

$\hat{\beta}^{(2)} = \frac{\hat{\beta}_{SCZ2} / s_{SCZ2}^2 - \hat{\beta}^{(1)} / (s^{(1)})^2}{1 / s_{SCZ2}^2 - 1 / (s^{(1)})^2}, \qquad s^{(2)} = \left( \frac{1}{s_{SCZ2}^2} - \frac{1}{(s^{(1)})^2} \right)^{-1/2}.$    (24)

We use $\hat{\beta}^{(2)}$ and $s^{(2)}$ as the summary statistics of Study 2. We remove the SNPs with significant $p$-values in the test of homogeneity, and the remaining SNPs are used in the joint analysis.
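A sketch of this 'subtraction' step, assuming the combined SCZ2 statistics behave like an inverse-variance fixed-effects meta-analysis of the overlapping study and the remaining samples; the function name and argument layout are ours.

```python
import numpy as np

def subtract_study(beta_meta, se_meta, beta1, se1):
    """Recover summary statistics of the non-overlapping samples (cf. Eq. 24),
    assuming inverse-variance fixed-effects weighting in the combined analysis.

    beta_meta, se_meta : combined (e.g. SCZ2) effect sizes and standard errors
    beta1, se1         : effect sizes and standard errors of the overlapping study
    """
    w_meta = 1.0 / se_meta ** 2
    w1 = 1.0 / se1 ** 2
    w2 = w_meta - w1                      # weight carried by the remaining samples
    beta2 = (w_meta * beta_meta - w1 * beta1) / w2
    se2 = np.sqrt(1.0 / w2)
    return beta2, se2, beta2 / se2        # effect size, SE, z statistic
```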

We use the Jlfdr-based method, the fixed-effects meta-analysis method and the random-effects meta-analysis method to jointly analyze the summary statistics from the two studies. The Fdr is controlled at the same nominal level in all analyses. We adopt the one-dimensional mixture method (Muralidharan, 2010) to control the Fdr in the meta-analysis methods.

Figure 5(a) plots the discovered associations using the Jlfdr-based method and the fixed-effects meta-analysis method. The Jlfdr-based method identifies more associations. Table 3 shows the numbers of discovered associations and the rejection criteria of the different analysis methods. Besides the loci discovered by the meta-analysis methods, there are eight novel loci discovered by the Jlfdr-based method. Neighboring loci are separated by at least 500 kilobases (kb) or are in weak linkage disequilibrium (low $r^2$). The SNPs showing the most significant association with SCZ in these novel loci are presented in Supplementary Table 1.
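Grouping the rejected SNPs into loci by physical distance can be sketched as follows. This distance-only version is our simplification: the LD ($r^2$) part of the criterion needs genotype or reference-panel data and is omitted, and the column names are hypothetical.

```python
import pandas as pd

def group_loci(hits, gap=500_000):
    """Assign a locus label to each significant SNP: a SNP starts a new locus
    when it is on a new chromosome or more than `gap` base pairs from the
    previous hit.

    hits : DataFrame with columns 'chrom' and 'pos' for the rejected SNPs.
    """
    hits = hits.sort_values(['chrom', 'pos']).reset_index(drop=True)
    new_locus = (hits['chrom'] != hits['chrom'].shift()) | \
                (hits['pos'] - hits['pos'].shift() > gap)
    hits['locus'] = new_locus.cumsum()
    return hits

# Example: three hits on chromosome 1, the first two within 500 kb of each other.
df = pd.DataFrame({'chrom': [1, 1, 1], 'pos': [1_000_000, 1_200_000, 2_500_000]})
print(group_loci(df)['locus'].tolist())   # [1, 1, 2]
```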

Figure 5: The rejection regions determined in the empirical data sets: (a) SCZ data from the PGC; (b) SLE data from dbGaP; (c) BMI data from the GIANT consortium; (d) WHRadjBMI data from the GIANT consortium. The descriptions of the four data sets are presented in the main text. For each SNP, the vector of summary statistics is plotted after transformation. We use light grey circles to represent the associations discovered by both the Jlfdr-based method and the fixed-effects meta-analysis method. We use black upward-pointing triangles and dark grey downward-pointing triangles to represent the associations discovered only by the Jlfdr-based method and only by the fixed-effects meta-analysis method, respectively.
Method                          Rejection Criterion          #{Identified SNPs}
Jlfdr-based method              Jlfdr $\le t$                13405
Fixed-effects meta-analysis     $p_f \le$ (threshold)        13014
Random-effects meta-analysis    $p_r \le$ (threshold)        8348
Table 3: The rejection criterion and the number of identified associations in the SCZ data from the PGC. $p_f$ and $p_r$ are the combined $p$-values in the fixed-effects and random-effects meta-analyses, respectively.

3.2.2 SLE data from dbGaP

We conduct a summary-statistics-based joint analysis of systemic lupus erythematosus (SLE) data from studies phs000122.v1.p1 and phs000216.v1.p1 in dbGaP (Mailman et al., 2007; Tryka et al., 2014). We use the study phs000122.v1.p1, which contains 1,311 SLE cases and 3,340 controls, as Study 1, and the study phs000216.v1.p1, which contains 706 cases and 353 controls, as Study 2. The individuals in the first study are all North Americans of European descent, and those in the second study are all females of European ancestry. We apply the following quality control procedures to both studies:

  1. Missing data control: SNPs with a missing data rate above the chosen threshold are discarded.

  2. Minor allele frequency control: SNPs whose minor allele frequency falls below the chosen threshold in either the case group or the control group are discarded.

  3. Hardy-Weinberg equilibrium control: SNPs whose $p$-values in the Hardy-Weinberg equilibrium test fall below the chosen threshold in either the case group or the control group are discarded.

  4. Homogeneity control: SNPs whose $p$-values in the homogeneity test fall below the chosen threshold are discarded.

After the quality control steps, the remaining autosomal SNPs are used in the joint analysis. A sketch of these filters is given below.
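The sketch applies the four filters to a per-SNP summary table; the thresholds are illustrative placeholders (the values actually used are not given here), and the column names are hypothetical.

```python
import pandas as pd

def quality_control(df, miss_thresh=0.05, maf_thresh=0.01,
                    hwe_thresh=1e-6, het_thresh=1e-5):
    """Apply the four QC filters to a per-SNP summary table.

    Expected columns: 'missing_rate', 'maf_case', 'maf_control',
    'hwe_p_case', 'hwe_p_control', 'het_p' (homogeneity-test p-value).
    Thresholds are placeholders, not the values used in the paper.
    """
    keep = (
        (df['missing_rate'] <= miss_thresh) &
        (df[['maf_case', 'maf_control']].min(axis=1) >= maf_thresh) &
        (df[['hwe_p_case', 'hwe_p_control']].min(axis=1) >= hwe_thresh) &
        (df['het_p'] >= het_thresh)
    )
    return df[keep]
```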

We use the same Fdr threshold in all analyses. Figure 5(b) plots the associations discovered by the Jlfdr-based method and the fixed-effects meta-analysis method. The Jlfdr-based method discovers more associations than the meta-analysis methods. Table 4 lists the numbers of associations discovered by the different methods. Besides the loci discovered by the meta-analysis methods, there are three novel loci discovered by the Jlfdr-based method. The loci are separated by at least 500 kb or a weak linkage disequilibrium (low $r^2$). The most significant associations in these novel loci are listed in Supplementary Table 2.

Method                          Rejection Criterion          #{Identified SNPs}
Jlfdr-based method              Jlfdr $\le t$                106
Fixed-effects meta-analysis     $p_f \le$ (threshold)        94
Random-effects meta-analysis    $p_r \le$ (threshold)        54
Table 4: The rejection criterion and the number of identified associations in the SLE data from dbGaP. $p_f$ and $p_r$ are the combined $p$-values in the fixed-effects and random-effects meta-analyses, respectively.

3.2.3 BMI data from GIANT

We jointly analyze summary statistics from body mass index (BMI) studies conducted by the Genetic Investigation of ANthropometric Traits (GIANT) consortium (Locke et al., 2015). We use the joint GWAS and Metabochip meta-analysis of 152,893 European men as Study 1, and the joint GWAS and Metabochip meta-analysis of 171,977 European women as Study 2. The autosomal SNPs passing the homogeneity control ($p$-value above the chosen threshold in the homogeneity test) are retained for the joint analysis.

We use the same Fdr threshold in all analyses. Figure 5(c) plots the associations discovered by the Jlfdr-based method and the fixed-effects meta-analysis method. The Jlfdr-based method discovers more associations than the meta-analysis methods. Table 5 shows the number of discovered associations and the corresponding rejection criterion of each method. There are six novel loci discovered by the Jlfdr-based method. The SNPs showing the most significant associations in these novel loci are listed in Supplementary Table 3.

Method                          Rejection Criterion          #{Identified SNPs}
Jlfdr-based method              Jlfdr $\le t$                2717
Fixed-effects meta-analysis     $p_f \le$ (threshold)        2667
Random-effects meta-analysis    $p_r \le$ (threshold)        2186
Table 5: The rejection criterion and the number of identified associations in the BMI data from the GIANT consortium. $p_f$ and $p_r$ are the combined $p$-values in the fixed-effects and random-effects meta-analyses, respectively.

3.2.4 WHRadjBMI data from GIANT

We conduct a joint analysis of waist-to-hip ratio adjusted for BMI (WHRadjBMI) studies from the GIANT consortium (Shungin et al., 2015). We use the joint GWAS and Metabochip meta-analysis of 93,480 European men as Study 1, and the joint GWAS and Metabochip meta-analysis of 116,742 European women as Study 2. The autosomal SNPs passing the homogeneity control ($p$-value above the chosen threshold in the homogeneity test) are retained for the joint analysis.

Figure 5(d) highlights the associations discovered by the Jlfdr-based method and the fixed-effects meta-analysis method. The Jlfdr-based method identifies more associations than the meta-analysis methods when controlling the Fdr at the same level. Table 6 shows the number of discovered associations and the corresponding rejection criterion of each method. Besides the loci discovered by the meta-analysis methods, there are four novel loci discovered by the Jlfdr-based method. The details of the most significant SNPs in these loci are listed in Supplementary Table 4.

Method                          Rejection Criterion          #{Identified SNPs}
Jlfdr-based method              Jlfdr $\le t$                452
Fixed-effects meta-analysis     $p_f \le$ (threshold)        420
Random-effects meta-analysis    $p_r \le$ (threshold)        192
Table 6: The rejection criterion and the number of identified associations in the WHRadjBMI data from the GIANT consortium. $p_f$ and $p_r$ are the combined $p$-values in the fixed-effects and random-effects meta-analyses, respectively.

4 Discussion

Both the Jlfdr-based method and meta-analysis methods jointly analyze summary statistics from multiple GWASs. Meta-analysis methods collapse the test statistics of all studies into a single weighted average for each SNP, which is simpler than the Jlfdr-based method. When no heterogeneity exists between studies, the Jlfdr-based method degenerates to the fixed-effects meta-analysis method. This can be understood by the fact that no information is lost during the collapsing when all studies are homogeneous. When heterogeneity exists between studies, however, the Jlfdr-based method can achieve higher power than the fixed-effects meta-analysis method. This is understandable, as information about heterogeneity is lost during the collapsing in the meta-analysis method. Since heterogeneity exists widely in practice, we suggest using the Jlfdr-based method instead of meta-analysis methods to jointly analyze summary statistics from multiple GWASs.

This paper proves that the Jlfdr-based method is the most powerful summary-statistics-based joint analysis method when the underlying distribution of the test statistics is known. In reality, we only know the theoretical distribution under the null hypothesis; the distribution under the alternative hypotheses is usually unknown. Hence, in the implementation of the Jlfdr-based method, we assume the test statistics follow the Gaussian mixture model and use the EM-algorithm to infer the parameters of the mixture model. Violation of the model assumptions and inaccurate parameter estimation will decrease the performance of the Jlfdr-based method.

We assume independence between SNPs in the Gaussian mixture model. However, correlations between nearby SNPs often exist, a phenomenon known as linkage disequilibrium. We may further improve the Jlfdr-based method by taking advantage of the dependency information between SNPs.

5 Conclusion

Jointly analyzing data sets from multiple GWASs is a common strategy to discover associations. Since it is usually difficult to access individual-level genotyping data, summary-statistics-based joint analysis has become popular for jointly analyzing data sets from multiple GWASs. Among different summary-statistics-based joint analysis methods, we prefer the method with higher Bayesian power when Fdr is controlled at the same level, because it can discover more associations. With this criterion, we propose the Jlfdr-based method. It is the most powerful summary-statistics-based method. Simulation and empirical experiments demonstrate its superior performance over traditional meta-analysis methods.

Acknowledgement

This paper was partially supported by a grant under the Theme-based Research Scheme (project T12-402/13N) of the Hong Kong Research Grant Council (RGC).

References

  • Altshuler et al. (2008) Altshuler, D., Daly, M. J., and Lander, E. S. (2008). Genetic mapping in human disease. Science, 322(5903):881–888.
  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–38.
  • Efron (2005) Efron, B. (2005). Local false discovery rates. Technical Report 2005-20B, Department of Statistics, Stanford University.
  • Evangelou and Ioannidis (2013) Evangelou, E. and Ioannidis, J. P. (2013). Meta-analysis methods for genome-wide association studies and beyond. Nature Reviews Genetics, 14(6):379–389.
  • Gibson (2010) Gibson, G. (2010). Hints of hidden heritability in GWAS. Nature Genetics, 42(7):558–560.
  • Gibson (2012) Gibson, G. (2012). Rare and common variants: twenty arguments. Nature Reviews Genetics, 13(2):135–145.
  • Klein et al. (2005) Klein, R. J., Zeiss, C., Chew, E. Y., Tsai, J.-Y., Sackler, R. S., Haynes, C., Henning, A. K., SanGiovanni, J. P., Mane, S. M., Mayne, S. T., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720):385–389.
  • Kraft and Haiman (2010) Kraft, P. and Haiman, C. A. (2010). GWAS identifies a common breast cancer risk allele among BRCA1 carriers. Nature Genetics, 42(10):819–820.
  • Kruschke (2010) Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in cognitive sciences, 14(7):293–300.
  • Locke et al. (2015) Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam, S., Buchkovich, M. L., Yang, J., et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197–206.
  • Mailman et al. (2007) Mailman, M. D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L., et al. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature Genetics, 39(10):1181–1186.
  • Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2013) Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2013). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry, 18(4):497–511.
  • Manolio et al. (2009) Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.
  • Muralidharan (2010) Muralidharan, O. (2010). An Empirical Bayes mixture method for effect size and false discovery rate estimation. The Annals of Applied Statistics, 4(1):422–438.
  • Neyman and Pearson (1933) Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, 231:289–337.
  • Park et al. (2010) Park, J.-H., Wacholder, S., Gail, M. H., Peters, U., Jacobs, K. B., Chanock, S. J., and Chatterjee, N. (2010). Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature Genetics, 42(7):570–575.
  • Perkel (2008) Perkel, J. (2008). SNP genotyping: six technologies that keyed a revolution. Nature Methods, 5(5):447–453.
  • Ripke et al. (2013) Ripke, S., O’Dushlaine, C., Chambert, K., Moran, J. L., Kähler, A. K., Akterin, S., Bergen, S. E., Collins, A. L., Crowley, J. J., Fromer, M., et al. (2013). Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nature Genetics, 45(10):1150–1159.
  • Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510):421–427.
  • Shungin et al. (2015) Shungin, D., Winkler, T. W., Croteau-Chonka, D. C., Ferreira, T., Locke, A. E., Mägi, R., Strawbridge, R. J., Pers, T. H., Fischer, K., Justice, A. E., et al. (2015). New genetic loci link adipose and insulin biology to body fat distribution. Nature, 518(7538):187–196.
  • Storey (2003) Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31(6):2013–2035.
  • Tryka et al. (2014) Tryka, K. A., Hao, L., Sturcke, A., Jin, Y., Wang, Z. Y., Ziyabari, L., Lee, M., Popova, N., Sharopova, N., Kimura, M., et al. (2014). NCBI’s database of genotypes and phenotypes: dbGaP. Nucleic Acids Research, 42(D1):D975–D979.
  • Tukey (1953) Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948–1983. Chapman and Hall, New York.
  • Wellcome Trust Case Control Consortium (2007) Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.
  • Welter et al. (2014) Welter, D., MacArthur, J., Morales, J., Burdett, T., Hall, P., Junkins, H., Klemm, A., Flicek, P., Manolio, T., Hindorff, L., et al. (2014). The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 42(D1):D1001–D1006.
  • Yang et al. (2010) Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7):565–569.