REMI: Regression with marginal information and its application in genome-wide association studies
In this study, we consider the problem of variable selection and estimation in high-dimensional linear regression models when the complete data are not accessible, but only certain marginal information or summary statistics are available. This problem is motivated from the Genome-wide association studies (GWAS) that have been widely used to identify risk variants underlying complex human traits/diseases. With a large number of completed GWAS, statistical methods using summary statistics become more and more important because of restricted accessibility to individual-level data sets. Theoretically guaranteed methods are highly demanding to advance the statistical inference with a large amount of available marginal information. Here we propose an penalized approach, REMI, to estimate high dimensional regression coefficients with marginal information and external reference samples. We establish an upper bound on the error of the REMI estimator, which has the same order as that of the minimax error bound of Lasso with complete individual-level data. In particular, when marginal information is obtained from a large number of samples together with a small number of reference samples, REMI yields good estimation and prediction results, and outperforms the Lasso because the sample size of accessible individual-level data can be limited. Through simulation studies and real data analysis of the NFBC1966 GWAS data set, we demonstrate that REMI can be widely applicable. The developed R package and the codes to reproduce all the results are available at https://github.com/gordonliu810822/REMI
Keywords: Genome-wide association studies, marginal information, high dimensional regression.
High dimensional regression has been widely applied in various fields, such as medicine, biology, finance and marketing . Consider the linear regression model that relates a response variable to a vector of predictors :
where is the vector of regression coefficients and is the random error term with mean zero and noise level In most applications, the data set is comprised of an matrix for variables in and a vector for response collected from individuals. Given the individual-level data , there exist convex [31, 6] and nonconvex [10, 44] penalized methods for estimating with theoretical guarantee [47, 23, 45, 2, 46], just name a few. Also see the monographs [4, 15] and the references therein.
Motivated from the applications in human genetics, we consider the problem of estimating when the individual-level data is not accessible but the marginal information is available, such as and , , where is the -th column of . For this reason, we refer to our problem formulation as ”regression with marginal information” (REMI). To make our formulation feasible, we also assume that information of the covariance structure of variables in can be estimated via a reference panel data set in the form of an data matrix , where is the number of samples from the reference panel and . A natural question arises: Without accessing the individual-level data, can we use marginal information together with the reference data to estimate , assuming observations in and are from the same distribution?
In particular, our problem arises in genome-wide association studies (GWAS), which have been conducted over the past decade to study the genetic basis of human complex phenotypes, including both quantitative traits and complex diseases [16, 39, 38]. As of April, 2018, more than 59,000 unique phenotype-variant (typically Single Nucleotide Polymorphism, or SNP in short) associations have been reported in about 3,300 publications of GWAS (see the GWAS Catalog database https://www.ebi.ac.uk/gwas/). An important lesson from GWAS [42, 36, 38] is that complex phenotypes are highly polygenic, that is, they are often affected by many genetic variants with small effects. Well-known examples include human height , psychiatric disorders , as well as diabetes . Due to the polygenicity, variants with small effects largely remain undiscovered yet and large sample sizes are required in exploring genetic architectures of complex phenotypes. Researchers world-wide are forming large genomic consortia, such as the Genetic Investigation of ANthropometric Traits (GIANT) Consortium and the psychiatric genomic consortium (PGC), to maximize sample size, aiming at a deeper understanding of the genetic architecture of complex phenotypes.
Although much efforts have been made for data sharing, it is still very difficult for a research group to fully access the individual-level genotype data available in a consortium. For example, a core research group from the GAINT consortium reported that they can only access genotype data from about 44,000 individuals  while the total sample size is more than 250,000 for the consortium . There are several reasons for the restricted access to the individual-level data. First, privacy protection is always a big concern in sharing individual-level genotype data. Second, it is often time-consuming to achieve an agreement on data-sharing among different research groups. Third, many practical issues arise in data transportation and storage. In contrast, summary statistics from GWAS are widely available through many public gateways , e.g., the download session at the GWAS Catalog https://www.ebi.ac.uk/gwas/downloads/summary-statistics. Because these summary statistics (e.g., estimated effect sizes, standard errors, and -values) are often generated by simple linear regression analysis, summary statistics are essentially marginal information.
To meet the great demand of data analysis in GWAS, various statistical methods have been proposed to utilize marginal information. Using a few hundreds of human genome data from the 1000 Genome Project as a reference panel, information on the correlation structure of genetic variants (typically using “linkage disequilibrium” in genetics, or LD for short) becomes available. This allows these methods to bypass the individual-level data but only use marginal information. Here we roughly divide these methods into three categories: (a) Methods for heritability estimation. Heritability of a phenotype quantifies the relative importance of genetics and environment to the phenotype . When individual-level data is accessible, linear mixed model (LMM)-based approaches (e.g., GCTA [42, 43]) are widely used for heritability estimation . In the absence of individual-level data, Bulik et al.  first introduced the LD score regression, named LDScore, for heritability estimation only using summary statistics and the reference data from the 1000 Genome Project. Based on the minimal norm quadratic unbiased estimation criteria, Zhou  proposed a novel method of moments, MQS, for variance component estimation with summary statistics. (b) Methods for association mapping. Heritability estimation provides a global measure which quantifies the overall contribution from genetic factors while association mapping is to localize genetic variants associated with a given phenotype. Recently, a few statistical methods have been proposed for association mapping based on summary statistics, including FGWAS , PAINTOR , CAVIAR , and CAVIARBF . Although these methods are very useful for performing association mapping on summary statistics, they still have their limitations. On one hand, they adopted some ad-hoc ways to reduce computational cost. For example, to avoid a combinatorial search, FGWAS assumes that there is only one causal signal in an LD block and PAINTOR searches no more than two causal variants in its default setting. On the other hand, statistical analysis is oversimplified to overcome estimation difficulties. For example, the non-centrality parameter in PAINTOR and the variance components in CAVIAR and CAVIARBF are pre-fixed rather than adaptively estimated from data. (c) Methods for effect size estimation and risk prediction. Recently, Vilhjálmsson et al.  proposed a Bayesian method, LDpred, for effect size estimation and risk prediction by accounting for LD. Along this line, Hu et al.  further introduced AnnoPred to improve LDpred by incorporating functional information in human genome. However, neither LDpred or AnnonPred should be considered as a marginal-information-based method because it requires individual-level data as validation data for its parameter tuning.
Although the existing statistical methods have shown a good empirical performance in GWAS data analysis, there are a number of open questions on REMI. First, the sample size of the reference panel is often very small. For example, there are only about 370 samples from the 1000 Genome Project that can be used as reference for analyzing GWAS data in European ancestry. It remains unclear why such a small sample size is often good enough for exploring the correlation structure of a large number of variables (i.e., SNPs). Second, the theoretical properties of the existing methods for effect size estimation and prediction error are not clear. Third, the sampling-based algorithms are often time-consuming as they need to run thousands of Markov Chain Monte Carlo (MCMC) iterations . In this paper, we propose a unified framework to address the above open questions. The rest of this paper is organized as follows: In Section 2, we introduce our REMI model and discuss REMI in GWAS. In Section 3, we present an efficient coordinate descent algorithm following by discussion on some practical issues. In Section 4, we establish the error bound and prediction error of the proposed method. In particular, our theoretical results explain why a small number of samples (i.e., ) from the reference panel can be good enough for effect size estimation and risk prediction. In Section 5, we show the results from both simulation studies and real data analysis.
2 The REMI model
2.1 The REMI model
where is the norm and is a regularization parameter. In our problem, however, the individual-level data is not accessible. Hence, direct application of the Lasso is not feasible here. We note that several other important penalized methods have been proposed, including SCAD  and MCP . We will focus on the Lasso penalty below, although our proposed approach can also be based on the other penalties.
We now describe our proposed REMI model with the Lasso penalty. Rewrite (2) as
where the second term only involves the inner product of the optimization variable and marginal information, say, , which we assume is available. The difficulty comes from the first term, where is unknown since is not observed. Motivated by the application in GWAS, we assume that there exists a reference data matrix , where the rows of are i.i.d. and have the same distribution with covariance matrix as the rows of . Therefore, both and can be viewed as estimators of . So we propose to solve the following optimization problem to estimate :
where denotes the estimator using the reference covariance matrix. Clearly, the above model (4) only uses the marginal correlation between and , with the covariance matrix estimated by an external reference panel .
2.2 REMI in GWAS
In the context of GWAS, the available marginal information may not be but summary statistics from univariate linear regression:
where superscript is used to denote marginal information. and are the estimated effect size and its variance for SNP , respectively. Due to the polygenicity of many complex phenotypes, the standard errors can be well approximated by (Zhu and Stephens 2016). Let , be the vectors collecting estimated effect sizes and estimated variance, respectively, and be a diagonal matrix with being its -th diagonal element. Further, we introduce a diagonal matrix with its -th diagonal element being the sample standard deviation of , i.e., , and correlation matrix with . Noticing that and , the REMI formulation (2.1) becomes
where and the approximation holds in the case of polygenicity. Because is a tuning parameter that scales with a constant factor (), we slightly abuse for and propose to solve the following optimization problem
where , and denotes the estimates using correlation information. Similar to REMI (4) in which covariance matrix needs to be estimated, here correlation matrix needs to be estimated by samples from the reference panel . We refer (4) as REMI-C and (5) as the REMI-R, respectively.
3 Algorithm and Practical Issues
Here we adopt the widely used coordinate descent algorithm, which updates one parameter at a time, say , keeping all other parameters fixed at their current values. Thus the sub-problem for parameter can be written as
where is an element in . An efficient path algorithm can be developed based on the warm start and some other tricks as described in . In particular, we generate a sequence of equally spaced in logarithm with and , where is the minimum that shrinks all parameters to zero and is usually set to 0.05. For each , we use the solution of (4) from the last value as warm start. The path algorithm is described in Algorithm 1.
3.2 Reference panel
In REMI-R model (5), it involves the cohort-based estimated correlation matrix. Based on the nature of the correlation patterns of the SNPs, can be approximated by a block diagonal matrix. Specifically, we first partition the whole genome into blocks ( for European ancestry and for Asian ancestry, respectively ). Then we calculate empirical correlation matrix for each LD-block. To ensure a stable numerical result, we apply a simple shrinkage estimator to obtain within each block , where we used as default (the estimate of is insensitive to ). Thus, similar to , REMIs and its individual-level-data counterpart will produce approximately the same inferential results within a region. After plugging and in (5), we can use the coordinate-descent algorithm to obtain (Algorithm 2).
3.3 Choice of regularization parameter
The REMIs have one regularization parameter . Here we briefly show how to choose this parameter for REMI-R and it is straightforward to develop the same strategy for REMI-C. Similarly to the Lasso solver , we generate a sequence of from to , where is the minimum value of that shrinks all parameters to zero and is pre-specified with the default value at 0.05. Note that . We search for optimal value value using BIC,
Zou et al.  showed that the number of nonzero coefficients is an unbiased estimate for the degrees of freedom of Lasso. We choose to be the number of nonzero coefficients given . To make fair comparison of REMIs with Lasso, we also use BIC to select the regularization parameter when individual-level-data is accessible.
4 Theoretical properties
In this section, we give nonasymptotic bounds on the estimation error and the prediction error , where denotes index of significant entries of and denote the vector supported on .
Since in real applications of genetic data the underlying signal is not exactly sparse but with many small components. Here, we assume that the target is weak sparse, i.e., in addition to some significant components indexed by , there may be many nonzero entries in with very small magnitude, as indexed by . Let be the vector supported on . It is reasonable to assume that
since the signal whose magnitude smaller than this order is undetectable. Then
where, Let , , Recall the restrict eigenvalue  of is defined as
Assume the rows of and are i.i.d sub-Gaussian samples drawn from population with mean 0 and covariance matrix , and satisfying restricted eigenvalue condition with and the noise vector is mean zero sub-Gaussian with noise level , and Take .
(i) With probability at least we have
(ii) Suppose we observe , whose rows are sampled from the same distribution as that of ’s. Then with probability at least , the prediction error satisfies
The assumption that are i.i.d sub-Gaussian samples drawn from population with mean 0 and covariance matrix implies the restricted eigenvalue condition holds for some positive with high probability as long as [32, 34, 20]. As shown in Theorem 1, with the help of reference panel, we can also get an accurate estimator by (4) even if we only have marginal information in high-dimension setting as long as and . Furthermore, the estimation error of REMI model (4) achieve the minimax optimal rate as that of the Lasso  if the number of samples of reference penal is at the same order of the number of individual-level samples . Moreover if the magnitude of the significant entries larger than , the estimated support coincide with the true significant set .
5 Numerical Studies
5.1 Simulation studies
In simulation studies, we compare REMI-C (4), REMI-R (5) and Lasso using individual-level data. To avoid unrealistic LD pattern in simulation, we used the genotype data from the GERA data set . The GERA data set provided 657,184 genotyped SNPs for 62,313 European individuals. We performed strict quality control on data using PLINK . We excluded SNPs with a minor allele frequency less than 1, having missing values in more than 1 of the individuals or with a Hardy-Weinberg equilibrium -value below 0.0001. Moreover, we removed one member of pairs with genetic relatedness larger than 0.05. Finally, there remained 53,940 samples for 550,482 SNPs.
As individual-level-based analyses often suffer from limited sample sizes due to the restricted access of individual-level data, summary-level-based analyses may have advantages because the sample sizes are often much larger. To simulate this situation, we prefixed the sample size for individual-level-based analyses at . Specifically we randomly selected samples from 53,940 individuals in the GERA data set to form the genotype matrix , where was the total number of the genotyped SNP on chromosomes 16, 17 and 18. Then the phenotype vector was generated as , where and the heritability () was controlled at 0.2, 0.3, 0.4 and 0.5. Here was the vector of true effect size with sparsity , i.e., entries in were nonzero and they were sampled from . In our simulation study, we varied in . With at hand, the standard Lasso can be applied, serving as a reference of individual-level data analysis.
To generate summary-level data, we varied sample size from 3,000 to 50,000. We generated individual-level data as we described above and then we ran simple linear regression on , to obtain . After that, we pretended that we did not have individual-level data and then only used and as the input for REMI-C and REMI-R, respectively. We used 379 European samples from the 1000 Genome Project data as the reference panel to estimate covariance matrix (REMI-C) or correlation matrix (REMI-R), as detailed in Section 3.2. For each replication, we held out 200 independent samples to evaluate prediction accuracy. In total, we summarized our results based on 50 replications for each setting.
We compared the performance of REMI-C, REMI-R and the Lasso using individual-level data in terms of variable selection and prediction. Specifically, we used partial area under the receiver operating characteristic (ROC) curve (partial AUC) for variable selection performance and the Pearson’s correlation coefficient between predicted and observed phenotypes for prediction performance. The results of this simulation study are shown in Figure 1 and Figure 2. First, we can observe that the difference between REMI-C and REMI-R is nearly invisible. This justifies the approximation made in REMI-R. Second, when the sample size ( = 3,000 or 5,000) in summary-level data is similar to that of individual-level data (), the performance of variable selection and prediction for REMIs (both REMI-C and REMI-R) is similar to that of Lasso. Third, REMIs gradually outperform the Lasso as the sample size increases from 5,000 to 50,000, for both variable selection and prediction. This clearly indicates that REMIs can have big advantages over the Lasso when the sample sizes of summary-level data become much larger.
5.2 Real data analysis
To demonstrate the utility of REMIs, we first compared Lasso and REMIs based on the GWAS data set from the Northern Finland Birth Cohorts program (NFBC1966) . The NFBC1966 data set contains information for 5,402 individuals with a selected list of phenotypic data related to cardiovascular disease including high-density lipoprotein (HDL), low-density lipoprotein (LDL), total cholesterol (TC), triglycerides (TG), C-reactive protein (CRP), glucose, insulin, body mass index (BMI), systolic (SysBP) and diastolic (DiaBP) blood pressure. For each individual, 364,590 SNPs have been genotyped. We performed strict quality control on data using PLINK . We first excluded individuals having discrepancies between reported sex and sex determined from the X chromosome. We also excluded SNPs with a minor allele frequency less than 1, having missing values in more than 1 of the individuals or with a Hardy-Weinberg equilibrium -value below 0.0001. In particular, we selected well-imputed variants from HapMap 3 reference panel . After the strict quality control, 5,123 individuals with 310,975 SNPs in NFBC1966 were remaining for the further analysis. As we have the individual-level data, it is possible to run REMI-C, REMI-R and Lasso for all these traits. The solution paths using Lasso, REMI-R, and REMI-C for the ten metabolic traits in NFBC1966 data set are presented in Figures 3. The dotted vertical bars in these two figures indicate the corresponding selected tuning parameters based on BIC. One can see that the differences among solution paths for the Lasso using individual-level data, REMI-C and REMI-R using summary statistics are very minor, which is consistent with results from our simulation studies in Section 5.1.
In the released GWAS summary-level data sets, it is often the case that rather than the inner product is made available. Therefore, we applied REMI-R to analyze summary statistics for ten GWASs of complex phenotypes. The source of the GWASs is given in Table 1. Because the individuals of summary-level data sets were all from European ancestry, we used 379 European-ancestry samples in 1000 Genome Project  as a reference panel to estimate correlation matrix. Due to the quality of SNPs in the summary statistics, we restricted our analysis to a set of common and well-imputed variants from the HapMap 3 reference panel , which included 1,197,724 SNPs in total. Figure 4 shows the Manhattan plots of summary statistics for height (Ht) including (-value), and . The Manhattan plots of the absolute effect sizes from REMI-R for all other nine traits are shown in Figure 5.
Besides the effect size estimation, we evaluated prediction performance using 5,123 samples from the NFBC1966 . To make a fair comparison with Lasso, we first split all 5,123 samples into ten folds. On the one hand, we applied REMI-R on the summary statistics for these lipid traits listed in Table 1. Again, we used 379 European-ancestry samples from the 1000 Genome Project as a reference panel. For each of 10 folds in NFBC1966 data set, we calculated the predicted phenotypic values and evaluated the Pearson’s correlation coefficients between the predicted phenotypic values and the observed ones. On the other hand, we fitted the Lasso on the individual-level NFBC1966 data using the same ten-fold data for cross-validation. Specifically, we randomly selected nine folds of individual-level data as the training set to fit the Lasso, and evaluated prediction accuracy of the fitted model using the remaining one fold. Note that we used the same remaining fold to evaluate the prediction accuracy of the fitted REMI-R model. The prediction performance for REMI-R and Lasso is shown in Figure 6. Clearly, the prediction performance of REMI-R outperforms the standard Lasso as the sample size in the summary statistics for these lipid traits are around 100,000 but the individual-level data contains only 5,123 samples. These real data results indicate the great advantage of REMI over the Lasso for risk prediction.
In this study, we proposed a novel approach for high-dimensional regression analysis when only marginal regression information and an external reference panel data set are available. Our work is motivated from combining information from multiple GWAS. To date, a large number of GWAS have been conducted to find genetic factors associated with complex traits. Due to the need for privacy protection and issues in data-sharing of individual-level data, it is important to be able to effectively make full use of the summary statistics from separate studies. In contrast to the limited sample size in individual-level data based GWAS analysis, a prominent feature of summary-level data analysis is that it can effectively make use of multiple data sets, which leads to a much larger combined sample size.
Under mild conditions, we prove that the REMI estimator (4) based on the marginal information and the reference penal achieves the minimax optimal rate estimation error under reasonable conditions. In particular, the requirement on the size of the reference panel data is quite mild, it is only in the order of the logarithm of the model dimension. Our theoretical result successfully explains why a relatively small reference sample can be good enough for accurate estimation and prediction in real applications. We have conducted comprehensive simulations and real data analysis to demonstrate the utility of REMI. The experimental results show that the performance of REMI can be very similar to the Lasso when the sample sizes of summary-level data and individual-level data are the same. In genetic analysis, summary-level data sets are much easier to access and their sample sizes are often orders of magnitude larger than that of individual-level data sets. Consequently, REMI can be superior to the existing methods requiring complete data by taking advantages of the larger sample sizes, as demonstrated in our real data example.
This work was supported in part by grants No. 11501579 and NO. 61501389 from National Science Funding of China, grants NO. 22302815, NO. 12316116 and NO. 12301417 from the Hong Kong Research Grant Council, and Initiation Grant NO. IGN17SC02 from University Grants Committee, startup grant R9405 from The Hong Kong University of Science and Technology, Shenzhen Fundamental Research Fund under Grant No. QTD2015033114415450 and grant R-913-200-098-263 from Duke-NUS Graduate Medical School, and AcRF Tier 2 (MOE2016-T2-2-029) from Ministry of Education, Singapore.
We recall some simple properties of subgaussian and subexponential random variables.
We state the Bernstein-type inequality for the sum of independent and mean 0 sub-exponential random random variables
(Corollary 5.17 of ) Let be independent centered sub-exponential random variables. Then for every one has
where, is a absolute constant and .
Suppose the rows of and are i.i.d sub-Gaussian samples drawn from population with mean 0 and covariance matrix . Then, with probability at least , we have
as long as and .
Proof of Lemma 5. Since the proof of these two results are similar, we give one of them. Let be the th row of , , and denote the th entry of . Define which is sub-exponential with by Lemma 3. Therefore,
where the first inequality is due to union bound, and the second one follows from Lemma 4 and the last inequality is because of restricting . Then Lemma 5 follows from setting and the assumption that .
Under the same assumption as Lemma 5, we have
holds with probability at least .
Proof of Lemma 6.
where the first inequality is due to triangle inequality, and the second inequality follows from Cauchy-Schwartz inequality, and the last one holds with probability larger than due to Lemma 5. This finishes the proof of Lemma 6.
Suppose the rows of are i.i.d sub-Gaussian samples drawn from population with mean 0 and covariance matrix , and the entries of noise are i.i.d centered sub-Gaussian with noise level . With probability at least , we have
Proof of Lemma 7. We have,
the first inequality is due to union bound, and the second one follows from Lemma 3 and Lemma 4, where we use , and the last two inequality follows from by setting and the assumption that , i.e., with probability at least we have
Under the same assumption as Lemma 7, we have,
with probability larger than .
Proof of Lemma 8.
where first inequality is due to triangle inequality, and the second one follows from Cauchy-Schwartz inequality, the third inequality holds with probability larger than by using (8) and Lemma 5. This completes the proof of Lemma 8.
Now we are ready to prove Theorem 1.
Proof of Theorem 1. (i). Let . Define the event
The optimality of implies that
where, (eq1) and (eq2) and (eq6) are due to some algebra, and (eq3) follows from (9), and (eq4) uses Cauchy-Schwartz inequality, and (eq5) holds by conditioning and the assumption It follow from (12) that
Then, by the restricted eigenvalue condition on and (12) we deduce,
The above induction is conditioning on We need give a lower bound on . Indeed,
(ii). Let Then,