LSMM: A statistical approach to integrating functional annotations with genome-wide association studies

LSMM: A statistical approach to integrating functional annotations with genome-wide association studies

Jingsi Ming, Mingwei Dai, Mingxuan Cai,
Xiang Wan, Jin Liu and Can Yang
Correspondence should be addressed to Can Yang (macyang@ust.hk) and Jin Liu (jin.liu@duke-nus.edu.sg)
Abstract

Thousands of risk variants underlying complex phenotypes (quantitative traits and diseases) have been identified in genome-wide association studies (GWAS). However, there are still two major challenges towards deepening our understanding of the genetic architectures of complex phenotypes. First, the majority of GWAS hits are in the non-coding region and their biological interpretation is still unclear. Second, accumulating evidence from GWAS suggests the polygenicity of complex traits, i.e., a complex trait is often affected by many variants with small or moderate effects, whereas a large proportion of risk variants with small effects remains unknown. The availability of functional annotation data enables us to address the above challenges. In this study, we propose a latent sparse mixed model (LSMM) to integrate functional annotations with GWAS data. Not only does it increase statistical power of the identification of risk variants, but also offers more biological insights by detecting relevant functional annotations. To allow LSMM scalable to millions of variants and hundreds of functional annotations, we developed an efficient variational expectation-maximization (EM) algorithm for model parameter estimation and statistical inference. We first conducted comprehensive simulation studies to evaluate the performance of LSMM. Then we applied it to analyze 30 GWAS of complex phenotypes integrated with 9 genic category annotations and 127 tissue-specific functional annotations from the Roadmap project. The results demonstrate that our method possesses more statistical power over conventional methods, and can help researchers achieve deeper understanding of genetic architecture of these complex phenotypes. The LSMM software is available at https://github.com/mingjingsi/LSMM.

Department of Mathematics, Hong Kong Baptist University, Hong Kong

School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China

Department of Computer Science, Hong Kong Baptist University, Hong Kong

Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore

Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong

1 Introduction

Since the success of the first GWAS on age-related macular degeneration (Klein et al., 2005), more than 40,000 single-nucleotide polymorphisms (SNPs) have been reported in about 3,100 GWAS at the genome-wide significance level (see GWAS Catalog http://www.ebi.ac.uk/gwas/) (Welter et al., 2014). Despite these fruitful discoveries, the emerging evidence from GWAS presents great challenges towards deeper understanding of the genetic architectures of complex phenotypes. First, more than 85 genome-wide significant hits are located in the non-coding region (Welter et al., 2014) and thus their functional roles are still largely elusive. Second, complex phenotypes are often highly polygenic, i.e., they are affected by a vast number of risk variants with individually small effects. For example, 70-80 of the variation in human height can be attributed to genetics (Visscher et al., 2008). However, Wood et al. (2014) collected more than 250,000 samples and identified 697 variants at genome-wide significance level, and all these variants together can only explain 20 of heritability. A recent estimate (Boyle et al., 2017) suggests that about 100,000 variants may be associated with human height. Given current sample sizes, a large proportion of risk variants underlying complex phenotypes remain unknown yet.

Fortunately, an increasing number of reports suggest that the functional importance of SNPs may not be equal (Schork et al., 2013), which provides a direction to address the above challenges. On one hand, SNPs in or near genic regions can explain more heritability of complex phenotypes (Yang et al., 2011; Smith et al., 2011). For example, the partition of genic category annotations for SNPs have revealed that SNPs in 5’ UTR, exon and 3’ UTR are significantly enriched across diverse complex traits (Schork et al., 2013). On the other hand, tissue-specific functional annotations can provide information that is complementary to genic category annotations, for dissecting genetic contribution to complex diseases in a tissue-specific manner. To name a few, genetic variants related to functions of immune cells are significantly enriched for immune diseases, such as rheumatoid arthritis, coeliac disease and type 1 diabetes; variants with liver functions are enriched for metabolic traits, such as LDL, HDL and total cholesterol; variants with pancreatic islet functions are enriched for fasting glucose (Kundaje et al., 2015). Additionally, SNPs in genes that are preferentially expressed in the central nervous system are significantly enriched in psychiatric disorders (e.g., schizophrenia and bipolar disorder) (Chung et al., 2014).

A large amount of functional annotation data has become publicly available and the volume is still expanding. The Encyclopedia of DNA Elements (ENCODE) project (The ENCODE Project Consortium, 2012) have conducted more than 1,650 experiments on 147 cell lines to access functional elements across the human genome, such as DNase I hypersensitive sites and transcription factor binding. The NIH Roadmap Epigenomics Mapping Consortium (Kundaje et al., 2015) is generating high-quality genome-wide human epigenomic maps of histone modifications, chromatin accessibility, DNA methylation and mRNA expression across more than one hundred of human cell types and tissues.

With the availability of rich functional annotations, we aim to (1) integrate genic category annotations and tissue-specific functional annotations with GWAS to increase the statistical power of the identification of risk SNPs, and (2) detect relevant tissue-specific functional annotations among a large amount of available annotation data to have a more biologically insightful interpretation of GWAS results. Statistical methods to incorporate genic category annotations have been proposed, e.g., stratified FDR methods (Schork et al., 2013), cmfdr (Zablocki et al., 2014), GPA (Chung et al., 2014) and EPS (Liu et al., 2016). However, these methods were designed to handle a few number of functional annotations and can not be scalable to a large-scale integrative analysis.

In this study, we propose a Latent Sparse Mixed Model (LSMM) to integrate genic category annotations and tissue-specific functional annotations with GWAS data. The “latent" statuses are used to connect the observed summary statistics from GWAS with functional annotations. “Mixed" models are designed to simultaneously consider both genic category and tissue-specific annotations, where genic category annotations are put into the design matrix of fixed effects, and tissue-specific annotations are encoded in the design matrix of random effects. We further impose a “sparse" structure on the random effects to adaptively select relevant tissue-specific annotations. We conducted comprehensive simulations to investigate the properties of LSMM and then applied LSMM to real data. We integrated summary statistics from 30 GWAS with 9 genic category annotations and 127 tissue-specific functional annotations from the Roadmap project. Compared with conventional methods, our method is able to increase the statistical power in the identification of risk variants and detection of tissue-specific functional annotations and providing a deeper understanding of genetic architecture of complex phenotypes.

2 Latent Sparse Mixed Model (LSMM)

2.1 Model

Suppose we have the summary statistics (-values) of SNPs from GWAS. Consider the two-groups model (Efron, 2008), i.e., SNPs either belong to null or non-null group. Let be the latent variable indicating the membership of the -th SNP, i.e., or indicates the -th SNP from null or non-null group, respectively. We further denote the proportion of null and non-null group as and , respectively. Then we model the observed -values as (Chung et al., 2014),

(1)

where denotes the uniform distribution on [0,1] and is the beta distribution with parameter . We constrain to model the fact that -values from the non-null group tend to be closer to 0 rather than 1.

Suppose that we have collected not only the -values of SNPs from GWAS, but also functional annotations of these SNPs. To incorporate information from functional annotations for prioritization of risk variants and detection of tissue-specific functions for a complex phenotype, we consider the following latent sparse mixed model:

(2)

where is the design matrix for fixed effects, comprised of an intercept and covariates, is the vector of fixed effects, is the design matrix for random effects, is the vector of random effects, and is the number of random effects. Both the -th row of (i.e., ) and (i.e., ) corresponds to the -th SNP. Note that is a latent variable in model (2) but its corresponding is observed. This makes our model different from the standard generalized linear mixed model.

Now we partition functional annotations into two categories: genic category annotations and tissue-specific annotations. According to (Schork et al., 2013), genomic regions, such as exon, intron, 5’UTR and 3’UTR, are considered as genic category annotations. For tissue-specific annotations, we used epigenetic markers (H3k4me1, H3k4me3, H3k36me3, H3k27me3, H3k9me3, H3k27ac, H3k9ac, and DNase I Hypersensitivity) of multiple tissues from the Roadmap project. As we are more interested in the detection of tissue-specific results, we put genic category annotation data into and tissue-specific annotation data into , where each column of corresponds to a genic functional category and each column of corresponds to a tissue-specific functional category. In the simplest case, the entries in and are binary. For example, means that the -th SNP has a function in the -th genic category and otherwise. Our model also allows the entries in and to be continuous variables, e.g., a score between 0 and 1 can be used to indicate the degree that the -the SNP has a function in the -th category. The closer to 1, the more likely it has a functional role. The entries in are defined in the same way as those of .

To adaptively select tissue-specific annotations, we assign a spike-slab prior on :

(3)

where denotes the Gaussian distribution with mean and variance , denotes the Dirac delta function at zero, or means the -th annotation is relevant or irrelevant to the given phenotype, respectively. Here is a Bernoulli variable with probability being 1:

(4)

where can be interpreted as the proportion of relevant annotations corresponding to this phenotype.

Let be the collection of model parameters. The logarithm of the marginal likelihood can be written as

(5)

where

(6)

Our goal is to maximize the marginal likelihood to obtain the estimation of and compute the posterior

(7)

Then we can infer the risk SNPs and relevant tissue-specific functional annotations for this phenotype and calculate the false discovery rate.

2.2 Algorithm

Exact evaluation of posterior (7) is intractable. One difficulty is due to the sigmoid function resulting from the logistic model. The other comes from the spike-slab prior. To address this issue, we propose a variational EM algorithm for parameter estimation and posterior approximation.

Before starting the derivation of our algorithm, we first re-parametrize the spike-slab prior (3) by introducing a new Gaussian variable , then the product has the same distribution with in model (3). So model (2) can be written as

(8)

Hence the complete-data likelihood can be re-written as

(9)

where

(10)
(11)
(12)

where is the sigmoid function and . With this reparameterization, we get rid of the Dirac delta function.

Due to the intractability caused by the sigmoid function inside integration (5), we consider the JJ bound (Jaakkola and Jordan, 2000):

(13)

where and the right-hand-side of the inequality (13) is the JJ bound. Clearly, the JJ bound is in the exponential of a quadratic form. Applying this bound to (11), we can get a tractable lower bound of , denoted as , where is variational parameter. Let . The lower bound of the complete-data likelihood is defined as

(14)

Next we derive the variational EM algorithm. Let be an approximation of the posterior . We can obtain a lower bound of the logarithm of the marginal likelihood

(15)

where is the lower bound. The first inequality is based on the JJ bound. The second inequality follows Jensen’s inequality. To make it feasible to evaluate the lower bound, we use the mean-field theory and assume that can be factorized as

(16)

where . It turns out that can be obtained analytically and thus the lower bound can be exactly evaluated. By setting the derivative of with respect to the parameters in be zero, we can obtain the updating equations for parameter estimation. The detailed derivation of the algorithm can be found in Section 1 of Supplementary Document.

We note that LSMM covers two special cases: (1) Two-groups model only (denoted as TGM), when all the coefficients in (except the intercept term) and are zero; (2) Two-groups model plus fixed effects model only (denoted as LFM for the abbreviation of latent fixed effect model), when all coefficients in are zero. This motivates us developing a four-stage algorithm based on warm starts. More specifically, in the first stage, we run an EM algorithm to obtain the two parameters ( and the proportion of non-null group ) in the TGM. Then we use the estimated parameters as the starting point to run the second stage variational EM algorithm to fit the LFM and obtain the parameter , and the posterior probability of . In the third stage, we treat the obtained posterior as the value of and fit the logistic sparse mixed model to obtain the required initial value for the parameters in the next stage. Finally, in the fourth stage we run the above variational EM algorithm with the obtained parameters at the second and third stage until convergence. Since all the iterations are built upon the framework of EM algorithm, the lower bound is guaranteed to increase at each iteration. The details of the algorithm design are provided in Section 2 of Supplementary Document.

2.3 Identification of risk SNPs and Detection of relevant tissue-specific functional annotations

After the convergence of the variational EM algorithm, the approximated posterior of latent variables and can be obtained. Using this information, we are able to prioritize risk SNPs and relevant tissue-specific functional annotations.

Risk SNPs are identified based on , an approximation of the posterior probability that the -th SNP is associated with this phenotype. Accordingly, we can calculate the approximated local false discovery rate . To control the global false discovery rate (FDR), we sort SNPs by from the smallest to the largest and regard the -th re-ordered SNP as a risk SNP if

(17)

where is the -th ordered , is the corresponding global FDR, and is the threshold of global FDR. In simulations, we chose .

Relevant tissue-specific functional annotations are inferred from , an approximation of the posterior probability that annotation is relevant to this phenotype. Similarly, we can calculate the approximated local false discovery rate and convert it into the global false discovery rate. We can either control the local false discovery rate (e.g., ) or global false discovery rate with .

3 Results

3.1 Simulation

We conducted simulations to evaluate the performance of the proposed LSMM. The simulation data was generated as follows. The numbers of SNPs, fixed effects (genic category annotations) and random effects (tissue-specific functional annotations) were set to be , and respectively. The entries in design matrices and were generated from , , and . Given the proportion of relevant tissue-specific functional annotations , was drawn from and the corresponding nonzero entries of random effects were simulated from . The first entry of the coefficients of fixed effects , i.e., the intercept in the logistic model, was fixed at and other entries were generated from and then kept fixed in multiple replications. After that, we simulated from Bernoulli distribution with probability , and then generated from if and otherwise.

We first evaluated the performance of LSMM in the identification of risk SNPs. We compared LSMM with two special cases, LFM (with fixed effects only) and TGM (without fixed effects and random effects). After prioritizing the risk SNPs using these methods, we made a comparison upon their empirical FDR, power, area under the receiver operating characteristic curve (AUC) and partial AUC. We varied the proportion of relevant random effects at . Figure 1 shows the performance of these three models with and (results for other scenarios are shown in Figures S2-S9 in Supplementary Document). As shown in Figure 1, the empirical FDRs are indeed controlled at the nominal level () for all these models. For TGM and LFM, the powers increase as the proportion of relevant functional annotations increases. This is because a larger could result in an increasing proportion of non-null group for SNPs. However, the AUC and partial AUC of LFM slightly decrease because the estimates of fixed effects using LFM would become less accurate when the impact of functional annotations becomes larger. LSMM can adaptively select relevant functional annotations to improve its performance. As expected, it outperforms both TGM and LFM in terms of the power, AUC and partial AUC. One may wonder what if we do not do variable selection and simply treat the effects of all covariates as fixed effects. We evaluated this approach and found that, without variable selection, the FDR would be inflated when the GWAS signal is relatively weak (See Figure S10 in Supplementary Document). In addition, LSMM assumed independence among SNPs, which greatly facilitates the computation and inference of LSMM. We evaluated the impact of this assumption on LSMM. The details of the simulations are given in Section 3 of Supplementary Document. Because GWAS only aim to identify the local genomic region in LD with true risk genetic variants, it is reasonable to consider the identified SNPs not as false positives if they are in the flanking region of the true risk SNPs. In this sense, the results (Figure S1 in Supplementary Document) suggest that LSMM can provide a satisfactory FDR control.

Figure 1: FDR, power, AUC and partial AUC of LSMM, LFM and TGM for identification of risk SNPs with and . We controlled global FDR at 0.1 to evaluate empirical FDR and power. The results are summarized from 50 replications.

Next we evaluated the performance of LSMM in the detection of relevant tissue-specific functional annotations in terms of the FDR, power, AUC and partial AUC. We varied the proportion of relevant tissue-specific functional annotations at . The results with and are given in Figure 2 (results for other scenarios are shown in Figures S11-S18 in Supplementary Document). The empirical FDR is controlled at 0.1 with conservativeness. This is because the variational approach is adopted to approximate the posterior, e.g., the JJ bound and mean-field approximation. The performance of LSMM in the detection of relevant functional annotations depends on the signal strength of the GWAS data. When the signal of the GWAS data is relatively strong, i.e., is relatively small, LSMM has a very good performance of detecting relevant functional annotations, as indicated by power, AUC and partial AUC. We also conducted the following simulations to examine the role of adjusting covariates (i.e., genic category annotations) using fixed effects for detecting relevant tissue-specific annotations. We consider the case that genic category annotations and some tissue-specific annotations are correlated and , the vector of coefficients corresponding to genic category annotations, is nonzero. Without adjusting genic category annotations, some irrelevant tissue-specific annotations will be falsely included in the model due to their correlation with genic category annotations. To verify this, we simulated a case that 10 genic category annotations and first 50 tissue-specific annotations are correlated with correlation coefficient varied at and the remaining annotations are generated independently. To simulate the design matrices for genic category and tissue-specific annotations, we first simulated samples from a multivariate normal distribution with the correlation matrix among annotations and then made a cutoff so that 10 of the entries would be 1 and the others be 0. The results are shown in Figure S19 in Supplementary Document. In the presence of correlation, as expected, a larger FDR of detecting relevant tissue-specific annotations is observed without adjusting genic category annotations.

Figure 2: FDR, power, AUC and partial AUC of LSMM for detection of relevant annotations with and . We controlled global FDR at 0.1 to evaluate empirical FDR and power. The results are summarized from 50 replications.

Regarding parameter estimation, LSMM provides a satisfactory estimate of , the parameter in Beta distribution (See Figures S31-S33 in Supplementary Document). When the signal strength of GWAS data is not very weak, the estimated fixed effects (Figures S34-S44 in Supplementary Document) and the proportion of non-zero random effects (Figure S45 in Supplementary Document) are relatively accurate.

The computational time of LSMM depends on the strength of GWAS signal, the number of SNPs and the number of random effects. The left panel of Figure 3 shows that the computational time is nearly linear with respect to and with . In the right panel, we fixed and varied and . When the GWAS signal is relatively weak, e.g., , the timings of LSMM remain the same for different scales of random effects. This is because LSMM adopts a warm-start strategy and its last two stages start from the estimates at the second stage (i.e., fixed effects only) and converge in a few iterations because the GWAS signal is too weak to provide information for updating the random effects.

Figure 3: Computational time of LSMM. Left panel: We varied the number of SNPs and the number of random effects , with . Right panel: We varied the number of random effects and the strength of GWAS signal with . The results are summarized from 10 replications.

To test the robustness of LSMM, instead of using generative model (2), we conducted simulations based on probit model:

(18)

where . And we set if , if . The first entry of the coefficients of fixed effects , i.e. the intercept term, was fixed at and other entries were generated from and fixed during multiple replications. We set and varied the signal-noise ratio . Figure 4 shows the performance in identification of risk SNPs when . We note that FDRs are all well-controlled at the nominal level and LSMM shows the best performance in power, AUC and partial AUC. The advantages of LSMM over LFM and TGM is more apparent as the signal-noise ratio increases. The performance of LSMM in the detection of relevant functional annotations is provided in Figure 5. Results for other scenarios are shown in Figures S20-S23 in Supplementary Document. Furthermore, we simulated the underlying distribution of -values in non-null group from other distributions rather than the Beta distribution. The experimental results indicate that the FDR of LSMM is still well controlled at the nominal level, suggesting the robustness of LSMM and its potentially wide usage (results are shown in Figures S24-S26 in Supplementary Document).

Figure 4: FDR, power, AUC and partial AUC of LSMM, LFM and TGM for identification of risk SNPs based on probit model (18) with . We controlled global FDR at 0.1 to evaluate empirical FDR and power. The results are summarized from 50 replications.
Figure 5: FDR, power, AUC and partial AUC of LSMM, LFM and TGM for detection of relevant annotations based on probit model with . We controlled global FDR at 0.1 to evaluate empirical FDR and power. The results are summarized from 50 replications.

We compared LSMM with GPA in the identification of risk variants and detection of tissue-specific annotations. As LSMM can integrate both genic category and functional annotations, we compared GPA with LSMM without fixed effects (integrate functional annotations only) for a fair comparison. From the model setup, one main difference between GPA and LSMM is that GPA assumes conditional independence among annotations, whereas in LSMM we do not make this assumption. To check the influence of correlated functional annotations, we simulated a case that the first 10 functional annotations were correlated and all the others were independent. We set and varied the correlation among annotations at . To simulate the design matrices for correlated functional annotations, we first simulated samples from a multivariate normal distribution with the correlation matrix among annotations and then made a cutoff so that 10 of the entries would be 1 and the others be 0. Figure 6 shows the results with (results for other scenarios are shown in Figures S27-S29 in Supplementary Document). We observe that the empirical FDRs of LSMM and LSMM without fixed effects are indeed controlled at 0.1, but the FDR of GPA inflates very much when annotations are correlated. As the FDR of GPA is not controlled, the power of GPA is not comparable to the other two models. According to the AUC and partial AUC, the performance of GPA becomes worse as the correlation among annotations increase, while the performance of LSMM is still stable and outstanding. It implies that LSMM is able to identify true relevant annotations among correlated misleading ones. We also conducted simulations to compare LSMM with cmfdr, a fully Bayesian approach to incorporate genic category annotations in GWAS using MCMC sampling algorithm. We find that cmfdr is not able to handle a large number of annotations and the MCMC sampling algorithm is very time-consuming. The result is shown in Figure S30 in Supplementary Document. Besides the computational time, we observe the empirical FDR of cmfdr is slightly inflated and its performance for prioritization of risk variants is inferior to LSMM in terms of AUC and partial AUC.

Figure 6: FDR, power, AUC and partial AUC of LSMM, LSMM without fixed effects and GPA for identification of risk SNPs with . We controlled global FDR at 0.1 to evaluate empirical FDR and power. The results are summarized from 50 replications.

3.2 Real Data Analysis

We applied LSMM to analyze 30 GWAS of complex phenotypes. The source of the 30 GWAS is given in Table S2 in Supplementary Document. We used ANNOVAR (Wang et al., 2010) to provide the genic category annotations: upstream, downstream, exonic, intergenic, intronic, ncRNA_exonic, ncRNA_intronic, UTR3 and UTR5, where ncRNA means variant overlaps a transcript without coding annotation in the gene definition. We obtained 127 tissue-specific functional annotations from GenoSkylinePlus (Lu et al., 2017) (http://genocanyon.med.yale.edu/GenoSkyline). To avoid unusually large GWAS signals in the MHC region (Chromosome 6, 25Mb - 35Mb), we excluded the SNPs in this region.

We compared the number of identified risk SNPs using TGM, LFM and LSMM for 30 GWAS. Using LSMM as a reference, we calculated the ratio of the number of risk SNPs each method identified to that from LSMM under FDR thresholds and . The results are shown in Figure 7. For detecting the relevant tissue-specific functional annotations, we controlled the local fdr at . Figure 8 shows the approximated posterior probability for annotations and phenotypes, where the darkness of the red entry implies the level of relevance between the corresponding tissue-specific functional annotation and the phenotype, the darker the more relevant.


Figure 7: The number of risk variants identified by TGM, LFM and LSMM for 30 GWAS, under the same level of global FDR control (0.05 and 0.1). For visualization purpose, these numbers are normalized by dividing the corresponding number of variants identified by LSMM.
Figure 8: Relevant tissue-specific functional annotations for 30 GWAS.

Figure 7 shows that LSMM can identify more risk variants than TGM and LFM, under the same level of FDR control. The differences between TGM and LFM is due to the impact of genic category annotations and the differences between LFM and LSMM can be attributed to tissue-specific functional annotations. For HIV and bipolar disorder, a clear improvement in the identification of risk SNPs can be found from TGM to LFM, reflecting a large enrichment of genic category annotations. The contribution of tissue specific annotations can be clearly seen with the improvement from LFM to LSMM in several GWAS analyses, such as multiple sclerosis and coronary artery disease (CAD). For multiple sclerosis, genic category annotations do not show huge contributions, however, the contributions of tissue-specific annotations are substantial. As shown in Figure 8, its relevant tissue-specific annotations are related with immune system, GM12878 lymphoblastoid cells and primary B cells from peripheral blood. For CAD, both enrichment of genic category and tissue-specific annotations are estimated and its relevant cells are from a few different tissues, including blood, heart, lung and skin (See Figure 8). As a cardiovascular disease, it is reasonable to discover the relevance of these cells to CAD, and Fernández-Ruiz (2016) has shown its relationship with immune system. The annotations in lung and skin we detected may provide some new insights about the disease.

Among the 30 GWAS, we analyzed four GWAS of schizophrenia with different sample sizes, Schizophrenia1 (9,379 cases and 7,736 controls), Schizophrenia2 (9,394 cases and 12,462 controls), Schizophrenia3 (13,833 cases and 18,310 controls ) and Schizophrenia4 (36,989 cases and 113,075 controls). The detailed results are summarized in Table S3 in Supplementary Document. The Manhattan plots using TGM and LSMM are provided in Figure S46 in Supplementary Document. Clearly, LSMM steadily improves over TGM and LFM in the analysis of schizophrenia, a highly polygenic trait, with different sample sizes. In particular, for Schizophrenia3, LSMM identified 1,492 risk variants which could not be identified by TGM. Interestingly, the majority of them (872 variants) can be re-identified in Schizophrenia4 using TGM. This indicates that LSMM has a better power in prioritizing risk variants than TGM. For Schizophrenia4, four tissue-specific functional annotations are detected. In our analysis, both genetic variants related to functions of brain cells (brain angular gyrus) and blood cells (K562 leukemia cells) are detected to be relevant. This evidence not only connects Schizophrenia with brain, but also suggests the biological link between Schizophrenia and immune system (Ripke et al., 2014). We also analyzed two GWAS of years of education (Years of Education 1 and 2). Compared with Years of Eduction 1, the GWAS data set for Years of Education 2 is based on a larger sample size, and thus it enables LSMM to detect relevant functional annotations in brain and immune system. Our results are consistent with Finucane et al. (2015).

More findings about the relevance between tissue-specific annotations and GWAS are shown in Figure 8. Some are concordant with previous GWAS analyses. For example, we detect the functional annotation in liver to be relevant to the lipid-related phenotypes, including low-density lipoprotein, high-density lipoprotein, triglycerides and total cholesterol. Similar functional enrichment has been found by Kundaje et al. (2015); Finucane et al. (2015) and Lu et al. (2017). For height, more than 40 tissue-specific functional annotations are detected to be relevant using LSMM, which reflects its highly polygenic genetic architecture. These relevant annotations include cells in bone, vascular and skeletal muscle which were also shown significant enrichments for height by Finucane et al. (2015). Recent research has linked some neurodegenerative diseases, which were believed to be more related to brain and neural system, to the immune system, such as Alzheimer’s disease (Sims et al., 2017) and Parkinson’s disease (Sulzer et al., 2017). For Alzheimer’s disease, similar results have been found using LSMM. The relevant functional annotations are from blood cells, including monocytes-CD14+ and K562 leukemia cells. For autoimmune diseases including Crohn’s disease, ulcerative colitis, inflammatory bowel disease, rheumatoid arthritis, lupus, menopause, multiple sclerosis and primary biliary cirrhosis, the detected relevant functional annotations are mainly from the immune system and have many overlaps. Our results also provide the genomic level supports to previous medical literature, such as the relevance between spleen and inflammatory bowel disease (Muller et al., 1993), between liver and menopause (Mucci et al., 2001). The result also provides several new insights. Lipid-related phenotypes including high-density lipoprotein and total cholesterol are also relevant to functional annotations in immune system and brain. Additionally, annotations in immune system are considered relevant to blood-related phenotypes including red cell count, mean cell haemoglobin and mean cell volume. The foreskin fibroblast primary cells in skin are relevant to ulcerative colitis, four lipid-related phenotypes and red cell count.

Regarding the computational time, LSMM takes less than six minutes to handle each of the 30 GWAS datasets. We also recorded timings of cmfdr as a comparison. As cmfdr is not scalable to a large number of covariates, we only integrated the 9 genic category annotations in cmfdr. The MCMC algorithm was suggested (Zablocki et al., 2014) to run with 5,000 burn-in and 20,000 main iterations. According to our estimates, cmfdr takes more than ten days for most phenotypes. The detailed timing results are shown in Figure S47 of Supplementary Document.

If we did not adjust the genic category annotation, more relevant tissue-specific functional annotations would be detected (results are shown in Figure S48 in Supplementary Document). It indicates that LSMM could adjust covariates’ effects and provide a more reliable identification of relevant functional annotations.

4 Conclusion

We have presented a statistical approach, LSMM, to integrate genic category annotations and a large amount of tissue-specific functional annotations with GWAS data. LSMM can not only improve the statistical power in the identification of risk SNPs, but also infer relevant tissue-specific functional annotations to the phenotype, offering new insights to explore the genetic architecture of complex traits or diseases. Through comprehensive simulations and real data analysis of 30 GWAS, LSMM is shown to be statistically efficient and computationally scalable. As more annotation data will become publicly available in the future, we believe LSMM is widely useful for integrative analysis of genomic data.

Acknowledgement

This work was supported in part by grant NO. 61501389 from National Science Funding of China, grants NO. 22302815, NO. 12316116 and NO. 12301417 from the Hong Kong Research Grant Council, startup grant R9405 from The Hong Kong University of Science and Technology, and Duke-NUS Medical School WBS: R-913-200-098-263, and MOE2016-T2-2-029 from Ministry of Eduction, Singapore.

Supplementary Document

1 The variational EM algorithm

E-step

Let be the collection of model parameters. The logarithm of the marginal likelihood is

Using the signoid function denoted as , the complete-data likelihood can be written as

.where

We can use JJ bound (Jaakkola and Jordan, 2000) to get the tractable lower bound of which is denoted by :

where

Let . Then

is a lower bound of complete-data likelihood.

Next, let be an approximation of the posterior . Then we can obtain a lower bound of the logarithm of the marginal likelihood:

where is the lower bound. The second inequality follows Jensen’s inequality. And

To make it feasible to evaluate the lower bound, we assume that can be factorized as

where ,, .

We can obtain an approximation according to the mean-field method:

where the expectation is taken under the distribtion and .

When , we have

where denotes the expectation under , and the constant doesn’t depend on . Because is a quadratic form,

where

When , we have

So

Therefore we have

Now we evaluate the variational lower bound .

We set the partial derivative of the lower bound w.r.t to and be 0 to get the variational parameters and :

The variational lower bound is

M-step

Now we update , , , . We set the partial derivative of w.r.t the parameters to be 0 and get

and use Newton’s method to update :

where

Implementation

  • Initialize , , , , , . Let .

  • E-step: For , first obtain , and then update