Analysis of CaseControl Association Studies: SNPs, Imputation and Haplotypes
Abstract
Although prospective logistic regression is the standardmethod of analysis for casecontrol data, it has been recently noted that in genetic epidemiologic studies one can use the “retrospective” likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy–WeinbergEquilibrium (HWE), gene–gene and gene–environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype–environment interactions in the presence of haplotypephase ambiguity. We provide novel insights to existing methods by construction of various scoretests and pseudolikelihoods. In addition, we describe a novel twostage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data.
24 \issue4 2009 \firstpage489 \lastpage502 \doi10.1214/09STS297
SNPs, Haplotypes and Imputation
a]\fnmsNilanjan \snmChatterjeelabel=e1]chattern@mail.nih.gov\corref, b]\fnmsYiHau \snmChenlabel=e2]yhchen@stat.sinica.edu.tw, c]\fnmsSheng \snmLuolabel=e3]Sheng.T.Luo@uth.tmc.edu and d]\fnmsRaymond J. \snmCarrolllabel=e4]carroll@stat.tamu.edu
Casecontrol studies \kwdEmpiricalBayes \kwdgenetic epidemiology \kwdhaplotypes \kwdmodel averaging \kwdmodel robustness \kwdmodel selection \kwdretrospective studies \kwdshrinkage.
1 Introduction
Casecontrol study designs are now widely used to study the role of genetic susceptibility in the etiology of rare complex diseases. Typically, a casecontrol study involves recruiting all or a large fraction of the diseased subjects (cases) that arise in an underlying study base and then sampling a comparable number of healthy subjects (controls), ideally from the exact same study base, and possibly matched with the cases by some sociodemographic characteristics such as race, age and gender. Biological samples and questionnaire data collected on the sampled subjects are then used to determine their genetic susceptibility, such as SNP genotypes and history of some nongenetic (environmental) exposures. For rare diseases such as cancers, casecontrol studies are costefficient compared to a crosssectional or prospective cohort studies because they dramatically reduce the number of nondiseased subjects to study.
In general, the standard method for analysis of casecontrol data is the prospective logistic regression ignoring the retrospective nature of the underlying design. The validity of this approach relies on the classic results by Cornfield (1956) who showed the equivalence of prospective and retrospectiveodds ratios. The efficiency of the approach was established in two other classic papers by Andersen (1970) and Prentice and Pyke (1979), who showed that the prospective analysis of casecontrol data yields the proper maximumlikelihood estimates of the odds ratio parameters of the logistic model under a “semiparametric” setup that allows the distribution of the underlying covariates to remain completely unrestricted. More recently, it has been shown that even in the presence of missing data and measurement error in covariates, the “prospective” treatment of casecontrol data can yield proper maximumlikelihood estimates as long as the distribution of the underlying covariates is allowed to remain unrestricted (Roeder, Carroll and Lindsay, 1996).
A special feature for studies in genetic epidemiology is that it is often reasonable to assume certain models for the population distribution of the genetic and environmental covariates of interest. The Hardy–WeinbergEquilibrium (HWE) law, for example, which implies a simple relationship between allele and genotype frequencies at a given chromosomal locus, is a natural model for a random mating, large, stable population in the absence of new genetic mutations, inbreeding and selective survivorship among genotypes (see Hartl and Clark (2007), Chapter 3). Genes which are physically apart and hence are not expected to be in linkage disequilibrium (LD) are also expected to be independently distributed in a homogeneous population. It is often also natural to assume a subject’s genetic susceptibility, a factor which is determined at birth, is independent of his/her subsequent environmental exposures. A pertinent question then is what is the most appropriate method for analysis of casecontrol data in genetic epidemiology where some natural model assumptions exist for the distribution of genetic and environmental factors in the underlying population.
We will assume data on some genetic () and environmental () exposures are collected in a casecontrol study involving controls () and cases (). If one ignores the retrospective nature of the casecontrol design, one can conduct inference based on the prospectivelikelihood
(1) 
where . The fundamental likelihood for casecontrol data, however, known as the “retrospective” likelihood, is given by
(2) 
In the absence of any missing data, it is evident from the classical theory that the prospectivelikelihood (1) provides a valid way of testing and estimation of the odds ratio association parameters of the underlying logistic regression model. In fact, the prospectivelikelihood yields the same maximumlikelihood estimates for the odds ratio association parameters that could be obtained by maximization of the proper retrospective likelihood (2) while allowing , the joint distribution of and , to remain completely nonparametric. Under constraints on , however, the retrospective likelihood would not yield the same maximumlikelihood estimator as that from the prospective likelihood. More importantly, the retrospectivelikelihood can exploit various population genetics model assumptions such as HWE, gene–gene and gene–environment independence to gain major efficiency over the prospectivelikelihood for inference on various association and interaction parameters. At the same time, if the underlying model assumptions are violated, then the use of the retrospective likelihood can lead to serious bias for both testing and estimation procedures. In the presence of missing data, a further complication is that the use of the prospective likelihood may not be even strictly valid in certain settings, such as that described in Section 4 for estimation of haplotype effects, where for the purpose of identifiability also requires some modeling assumptions, thus destroying its equivalence with that is known to hold under unspecified covariate distribution. Thus, to date, a debate remains about the most appropriate method of analysis of casecontrol studies in genetic epidemiology.
In this article we will review some modern developments for analysis of casecontrol studies in genetic epidemiology using the prospective andretrospectivelikelihoods. We will describe the methods primarily through two different types of applications: (a) association testing for genotyped and imputed single nucleotide polymorphisms (SNP) (Sections 2 and 3) and (b) estimation of haplotype effects and haplotype–environment interactions in the presence of phase ambiguity (Section 4). In each section we aim to provide new intuitive insights into the alternative methods by constructions of various score tests (Sections 2 and 3) and pseudolikelihoods (Section 4). As a byproduct, in Section 3 we also propose a novel “retrospective” method for association testing for untyped SNPs which can easily use any external algorithm for imputation of genotypes. In each section we will use numerical examples to illustrate the bias and efficiency tradeoff between the alternative methods. We will conclude the article with a discussion and recommendations for practical data analysis.
2 Association Analysis for Single Nucleotide Polymorphisms (SNPs)
2.1 The Prospective Approach
The genotype information for an individual SNP in a casecontrol study can be represented by the contingency table defined by crosstabulation of casecontrol and genotype status. Let be the indicator of case () or control () status and let be the number of minor alleles carried by an individual (). Let denote the number of subjects with genotype and disease status observed in the casecontrol sample. Suppose we are interested in testing the association of the disease outcome with a SNPgenotype using a population logistic regression model of the form
(3) 
where the function is chosen in a suitable way to reflect an assumed mode of genetic effect. If, for example, denotes the count for the minor allele at a SNP locus, then one can chose , or to model the effect of the minor allele as additive (in the logistic scale), dominant or recessive. One can also consider a saturated model by allowing to be a vector of two dummy variables associated with heterozygous () and homozygous variant () genotypes and to be the corresponding logoddsratios. The prospective analysis of casecontrol data yields an asymptotically unbiased estimate for the genotypeoddsratio parameters , but not for the intercept parameter .
The score function for under the prospectivelikelihood (1) can be written as
Under the null hypothesis, , we can estimate as since under that hypothesis, does not influence . Then the score function can be written as
which is proportional to the difference between the empirical means of in the cases () and in the controls (). We suppose without loss of generality that the indices for the cases are and those for the controls are . If, for example, we assume , that is, the additive effect, then corresponds to the numerator of the Cochran–Armitage trend test (van Belle et al., 2004, Chapter 7) that is widely used for singleSNP association testing. More generally, a “prospective” scoretest can be constructed under any genetic model based on and its variance under the null hypothesis of no association that be estimated by
where is the pooledsample variance of .
2.2 Retrospective Approach
The retrospective likelihood, , for the genotype data of a singleSNP can be written as the product of two sets of multinomial probabilities:
where , and , denotes the population genotype frequencies for the controls and the cases, respectively. Given the genotype probabilities for the controls, we can characterize the genotype probabilities for the cases according to the formula (Satten and Kupper, 1993)
(4) 
where denotes the odds ratio associated with the genotype as specified by the logistic model (3). Thus, the retrospective likelihood can be parameterized in terms of the genotype probabilities of the controls and the diseaseoddsratio parameters . The maximization of the retrospective likelihood , without imposing any further constraints on the genotype probabilities for the controls, will lead to the same estimator for that would be obtained by maximization of (Prentice and Pyke, 1979). In fact, it can be shown that the retrospective and prospectiveprofile likelihoods of become identical after maximization of the corresponding likelihoods with respect to the associated nuisance parameters (Roeder, Carroll and Lindsay, 1996). Thus, the associated tests, including score, Wald and likelihoodratio tests, are identical under the retrospective and prospective likelihoods.
Now suppose we are willing to assume that HWE holds in the underlying population and that the disease is rare so that HWE also holds approximately in the control population. In the retrospective likelihood , we can write the genotype probabilities for the controls as a function of the frequency, , of the minor allele as
It is easy to show that the score function for associated with the retrospective likelihood can be written as
which under the null hypothesis of no association reduces to
(5) 
Moreover, under the null hypothesis, the allele frequency can be substituted for by its maximumlikelihood estimate
(6) 
where denotes the frequency for genotype in the pooled sample of cases and controls. Thus, corresponds to the difference between the empirical mean of the function in cases and its expected value under HWE and the null hypothesis of no association. In contrast, note that corresponds to the difference between the empirical mean of the function in cases and the empirical mean for the same function in the controls. If the expectation in the retrospective score function (5) is estimated empirically without assuming HWE, then, as expected, it can be easily shown that the retrospective and prospective scores are the same. If, however, we assume HWE to evaluate the retrospective score function, then it would have smaller variance than that for the prospective score. In particular, this can be seen from the estimate of the variance estimate given by
where
By the Cauchy–Schwarz inequality, asymptotically, implying that the retrospective score test is asymptotically more powerful than its prospective counterpart when the assumption of HWE is valid.
Chen and Chatterjee (2007) compared the performance of 2 d.f. Waldtests of association based on the retrospective and prospective likelihoods. They observed major gains in power for the test based on the retrospectivelikelihood for the detection of nonmultiplicative effects, for example, recessive effects. Notice that if we assume an additive model, that is, , then the prospective and retrospective score functions and become identical because in this case . The larger the departure of the effect of a SNP from the additive form, the greater the gain in efficiency for the retrospective method. Application of retrospective methods for association testing, however, requires caution because of their sensitivity to the underlying model assumption. In particular, it can be seen from the formula of that the unbiasedness of that score function crucially depends on the assumption of HWE being correct for the underlying population. Satten and Epstein (2004) and Chen and Chatterjee (2007) have noted that even modest violation of HWE can cause serious inflation in TypeI error in association tests based on the retrospective likelihood.
2.3 EmpiricalBayes Methods
Luo et al. (2009) considered an empiricalBayes type shrinkage estimation approach to develop a 2 d.f. singleSNP association test that can gain power by exploiting the model assumptions of HWE for the underlying population and yet is resistant to bias when the model assumptions are violated. The method involves estimation of genotypespecific disease odds ratio parameters by dataadaptive “shrinkage” of a “prospective” modelfree estimator that does not require the HWE assumption toward a “retrospective” modelbased estimator that directly exploits the HWE constraints. The amount of“shrinkage” is samplesize and dataadaptive, so that in large samples the method has no bias whether the assumption of HWE holds or not and yet the method can gain efficiency by shrinking the analysis toward HWE, but only to the extent that the data validate the assumptions. In what follows we provide some insight into the empiricalBayes method through the construction of a scoretest. For numerical illustration, however, we will focus on the Wald test as originally developed in Luo et al. (2009).
Let and denote the sample mean and variance for the function , respectively. Further, let denote the difference between the empirical and expected means of when the latter quantity is computed assuming HWE and under the estimate of allele frequency given in (6). Intuitively, can be viewed as an estimate of the bias in estimation of the population mean of under the assumption of HWE. An empiricalBayes type score function can be now defined as
(7) 
where is the empiricalBayes estimate for the mean of the function under , given by
Thus, corresponds to a weighted average of the empirical mean of and its expected mean under HWE, with the weights defined by an estimate of the bias for the estimate of the population mean of under HWE and an estimate of the variance of the empirical mean of . As decreases, that is, the evidence of bias due to the violation of HWE becomes smaller, gives more weight to the more precise HWEbased estimator of the population mean of . Conversely, as decreases, that is, the sample mean of becomes more precise, then puts more weight to the robust modelfree estimator . The original perspective for constructing suchweighted combinations of modelbased and model free estimators from an empiricalBayes point of view can be found in Mukherjee and Chatterjee (2008). Simple methods for variance estimation for such estimators have been also described in that article.
2.4 The Cancer Genetics Markers of Susceptibility (CGEMS) Study
We illustrate the performance of alternative 2 d.f. single SNP association tests using data from the Cancer Genetics Markers of Susceptibility (CGEMS) study (Yeager et al., 2007; Hunter et al., 2007;Thomas et al., 2008), an NCI enterprize initiative to conduct multistage wholegenome association studies to identify susceptibility genes giving rise to increased risks of prostate and breast cancers. In this article we will focus on data from the initial scan for the prostate cancer study, involving genotype data on about SNPs from cases and controls. The details of the CGEMS study design and the results from the initial scan and subsequent replication studies can be found at the web site https://caintegrator.nci.nih.gov/cgems/.
Prospective  Retrospective  EmpiricalBayes  

5e–2  5.01e–2  5.66e–2  4.49e–2 
1e–2  0.98e–2  1.43e–2  0.87e–2 
1e–3  1.05e–3  3.85e–3  1.00e–3 
1e–4  1.27e–4  2.24e–3  1.31e–4 
1e–5  2.67e–5  1.76e–3  3.34e–5 
1e–6  2.22e–6  1.47e–3  4.45e–6 
We consider SNPs from 22 nonsex chromosomes with minor allele frequencies larger than . Table 1 displays the empirical proportions of the number of SNPs that are found to be significant at different nominal significance levels using 2 d.f. tests based on three different methods: (a) prospective, (b) retrospective and (c) empiricalBayes [see Luo et al. (2009) for more details]. For a welldesigned study and a robust analytic method, the empirical proportions are expected to be fairly close to the nominal significant levels, given that the vast majority of the SNPs are likely to be not associated with the disease. In Table 1, we observe that the empirical proportions of significant SNPs found by the prospective method closely follows the nominal significance levels. In contrast, the corresponding proportions for the retrospective test deviate severely from the nominal values in the range of , indicating significantly inflated typeI error due to the violation of HWE for many SNPs. The last column of Table 1 shows that the empiricalBayes procedure essentially corrects for all the bias of the retrospective method due to the violation of the HWE assumption.
Next, we conducted a simulation study to investigate the performance of various tests in ranking a true susceptibility locus in a genomewide association study (GWAS) that include hundreds of thousands of “null” SNPs. To generate realistic linkage disequilibrium patterns, we simulated GWAS data mimicking the CGEMS study itself. Given minor allele frequency among controls and the diseasegenotype odds ratio parameters for a chosen susceptibility locus, we simulate genotype data at that locus for the cases and controls separately from the corresponding multinomial distributions. Given the genotype data at the susceptibility locus for a case or a control, we simulate genotype data for the remainder of the SNPs by assigning the whole genotype profile for a randomly selected subject from the controls of the CGEMS study who have the same genotype data at the given susceptibility locus as the sampled subject in our simulation study. This algorithm, as originally described by Yu et al. (2009), assumes that given the genotypes for the susceptibility locus, the risk of the disease is independent of all the remaining SNPs. We simulated data sets with approximately cases and controls. For each data set we tested for association for each of the approximately 450,000 SNPs using the prospective, retrospective and empiricalBayes methods. The rank of the diseaseassociated SNP is obtained by sorting all the values in ascending order.
MAF  Prospective  Retrospective  EmpiricalBayes 

0.1  112163  8117  44319 
0.2  1888  203  52 
0.3  656  210  27 
0.4  15  82  2 
Table 2 displays the median ranks obtained by three methods for a true diseaseassociated SNP that has a recessive effect with a logoddsratio of . As expected, the ranks of all tests decrease as the minor allele frequency increases. Comparing the ranks of different tests at a specific minor allele frequency, we can see that the standard prospective method generally has the lowest power in the sense that it assigns much higher rank to the susceptibility SNP than the two other tests. When minor allele frequency is , we observe that the pure retrospective method performs the best in the sense that it assigns the lowest rank to the susceptibility SNPs among all the methods. In contrast, when minor allele frequency is greater than or equal to , we observe that the empiricalBayes procedure assigns considerable lower rank to the susceptibility SNP than the pure retrospective method. Intuitively, the results can be explained from the fact that the retrospective method yields low values for many null SNPs due to the violation of the HWE assumption (see Table 1) and thus dilutes the rank of the real susceptibility SNP.
3 Association Analysis for Imputed SNPs
The forms of the prospective and retrospectivescores suggest how they can be modified easily for SNPs that may not have been directly genotyped, but can be “imputed” conditional on neighboring SNPs and estimates of linkage disequilibrium from HapMap or other similar databases. Let denote the neighboring genotype information for an untyped SNPlocus with unobserved genotype . The prospective score for such an untyped SNP can be defined by taking the conditional expectation of the “complete data” score function given the observed data, that is, the neighboring genotype information. More formally, the prospective score for an untyped SNP can be written as
where the conditional expectations are taken with respect to a suitable imputation model such as those described by Nicolae (2006), Marchini et al. (2007) and others. The retrospective score for an untyped SNP can be similarly defined by the conditional expectation of the “complete data” retrospective score function given the observed data in the form
Notice that in the retrospective score function, the contribution of the term is a constant term given the allele frequency . The estimation of the allele frequency for an untyped SNP, however, requires imputation. In particular, under the “complete data” model we can write the estimate of the allele frequency under the null hypothesis of no association as
Thus, given an imputation model, we can estimate the allele frequency as
We further need the variances for and under the null hypothesis to obtain the corresponding score tests. The variance of can be estimated as
where is the pooledsample variance of . The prospectivescore test is based on the test statistic given by
where the superscripts T and—denote transpose and generalized inverse, respectively. Asymptotically, this statistic follows a chisquared distribution under the null hypothesis of , with the degrees of freedom given by the dimension of . The variance of the retrospective score , after adjusting for the estimation of the allele frequency by given by (3), can be estimated by
where is the pooledsample covariance between and . The variance of can also be alternatively estimated by the robust sandwichtype estimate given as
where the efficient score
The retrospectivescore test is then based on the test statistic given by
which again follows a chisquared distribution asymptotically under the null hypothesis, with the degrees of freedom given by the dimension of . In both the prospective and retrospectivescore tests given above, we obtain the conditional probability directly from some external reference database, for example, HapMap, a strategy similar to the proposal of Nicolae (2006).
We now demonstrate the potential power advantages that might be achieved by imputing the untyped SNP, using numerical studies following two scenarios as in Tables 1 and 2 of Nicolae (2006). In Scenario 1 the untyped SNP can be perfectly predicted by the genotypes of the typed SNPs, namely, the (see Stram et al., 2004, for a definition), while in Scenario 2 the untyped SNP is moderately predicted by the genotypes of the typed SNPs with . The SNP profiles together with the haplotype frequencies estimated from HapMap CEU samples in the two scenarios are summarized in Tables 3 and 4. Also listed in Tables 3 and 4 are the haplotype frequencies we actually used to simulate the SNP data for the casecontrol sample, which moderately deviate from those seen in the HapMap CEU sample to reflect the potential discrepancy between the HapMap and study samples. The haplotype pair for each person is generated according to HWE.
Haplotype of SNPs  Frequency  

Frequency  from HapMap  
1–0–0–0–0  0.158  0.058 
0–1–0–1–0  0.400  0.300 
1–1–0–1–0  0.050  0.050 
1–1–1–0–1  0.358  0.558 
0–1–1–0–1  0.022  0.017 
1–1–0–0–1  0.012  0.017 
Haplotype of SNPs  Frequency  

Frequency  from HapMap  
0–0–0–0  0.088  0.058 
0–0–1–1  0.027  0.017 
0–1–0–0  0.302  0.342 
0–1–1–0  0.008  0.008 
1–0–1–0  0.242  0.142 
1–0–1–1  0.333  0.433 
We simulated the casecontrol status by the logistic regression model (3), where the genetic determinant is given by the minor allele count of the untyped SNP, and the function is given by the recessive, dominant or additive genetic mode. The intercept , which yields an overall disease rate around 5%. Each analysis is based on a casecontrol sample with 1000 cases and 1000 controls. The simulation results are based on 1000 (3000) repetitions for evaluation of test power (size). All the tests are performed at a significance level of 0.01. The score tests are performed using the correct genetic model, and the retrospectivescore tests are based on the robust sandwichtype variance estimates; results based on modelbased variance estimates are quite similar and are omitted. When performing the prospective and retrospectivescore tests with imputed genotypes for the untyped SNP, we use the haplotype frequency estimates fromHapMap to obtain the conditional probabilities, even though the casecontrol sample is actually from a population with moderately different haplotype frequencies. To see the degree of recovery of missing information achieved by imputation, we also perform the prospective andretrospectivescore tests based on the true genotypes at the untyped SNP. In addition, we perform the multimarker Hotelling’s test based on genotypes at typed SNPs (Xiong, Zhao and Berwinkle, 2002; Chapman et al., 2003), which is equivalent to the prospectivescore test derived from the logistic regression model (3) with the covariates given as the vector of genotypes for all the typed SNPs.
Results for this simulation study are presented in Tables 5 (Scenario 1) and 6 (Scenario 2). It is seen that the score tests with imputed genotypes have size matching reasonably well with the nominal value of 1%, even though the imputation is based on haplotype frequencies that are obtained from the HapMap data and are different from the true frequencies. From the results regarding power, we see that imputing the untyped SNP in either the prospective or the retrospectivescore test can achieve substantial power gains as compared with the Hotelling’s test based only on genotyped SNPs. The relative power improvement gained by imputation can still be quite remarkable even when the accuracy for predicting the untyped SNP using the genotyped SNPs is only of a moderate level (Scenario 2, where ). On the other hand, the prediction accuracy does affect the degree of recovery of the missing information that may be achieved by imputation: in Scenario 1, with perfect prediction of the untyped SNP, the tests using imputed genotypes do attain the full power we would obtain if the tests were based on the true genotype of the untyped SNP. In Scenario 2, with moderate prediction of the untyped SNP, imputation of the untyped SNP can recover partial but not full power. It is worth remembering that, with exact data, the retrospectivescore test is usually more powerful than the prospectivescore under the dominant or recessive model, and the two tests are essentially equivalent under the additive model. Here we observe the same phenomena when the prospective and retrospectivescore tests are based on imputed genotypes.
Prospective score  Retrospective score  Hotelling’s  
imputed (true)  imputed (true)  
Recessive model  
0  1.1 (1.1)  1.1 (1.1)  0.9 
0.5  26.1 (26.1)  33.7 (33.7)  3.6 
0.6  40.1 (40.1)  55.3 (55.3)  5.6 
Dominant model  
0  1.0 (1.3)  1.0 (1.3)  0.9 
0.3  68.6 (68.6)  72.9 (72.9)  39.0 
0.4  96.0 (96.0)  96.7 (96.7)  79.3 
Additive model  
0  1.2 (1.2)  1.2 (1.2)  0.9 
0.2  43.0 (43.0)  43.0 (43.0)  24.2 
0.3  86.4 (86.4)  86.4 (86.4)  65.5 
Prospective score  Retrospective score  Hotelling’s  
imputed (true)  imputed (true)  
Recessive model  
0  1.4 (1.2)  1.2 (1.2)  1.1 
0.5  42.6 (92.2)  47.0 (97.6)  17.6 
0.6  59.4 (99.1)  66.4 (99.9)  24.9 
Dominant model  
0  0.8 (1.1)  0.9 (1.0)  1.1 
0.4  48.5 (95.6)  54.3 (98.2)  23.8 
0.5  71.6 (99.6)  77.2 (100)  41.5 
Additive model  
0  1.0 (1.3)  1.0 (1.3)  1.1 
0.3  60.2 (97.6)  60.1 (97.6)  40.6 
0.4  92.5 (99.9)  92.4 (99.9)  77.4 
As we noted earlier, when exact genotype data are available, the retrospectivescore test is more sensitive to violation of the HWE assumption than the prospectivescore test; that is, the former is usually biased while the latter still remains unbiased when HWE does not hold. To assess the robustness properties for the prospective and retrospectivescore tests with imputed genotype data, we performed a further simulation study where the SNP haplotypes are still given as in Tables 3 and 4, but the haplotype pair for each person is given by the model with for and for , where is the frequency for haplotype , and , the fixation index quantifying the departure from HWE, is set to 0.05. We can see from the results listed in Table 7 that, with imputed genotype data, the prospectivescore test, like its exactdata counterpart, still shows greater robustness in maintaining the typeI error rates than the retrospectivescore test. In particular, the retrospectivescore test, based on the recessive or dominant model, may yield high typeI error rates under violation of HWE, no matter whether exact or imputed genotype data are used. Thus, an empiricalBayes type shrinkage method that can adapt between prospective and retrospective methods depending on biasvariance tradeoff could be useful for analysis of both typed and untyped SNPs.
Prospective score  Retrospective score  

imputed (true)  imputed (true)  
Recessive model  
Scenario 1  0.8 (0.8)  1.7 (1.7) 
Scenario 2  1.2 (1.2)  5.9 (7.7) 
Dominant model  
Scenario 1  0.9 (0.9)  1.4 (1.4) 
Scenario 2  1.0 (0.8)  3.2 (5.1) 
Additive model  
Scenario 1  1.0 (1.0)  1.0 (1.0) 
Scenario 2  0.7 (0.8)  0.7 (0.8) 
We conclude this section with a discussion on the two types of association analyses recently developed for untyped SNPs: the full likelihood approach(Lin, Hu and Huang, 2008) and the twostage approach (Nicolae, 2006; Marchini et al., 2007). The full likelihood approach uses a retrospective likelihood for the casecontrol sample and a likelihood for the external (such as HapMap) data, by which the imputation and association analysis are simultaneously performed in a onestage manner. Conversely, the twostage approach performs the imputation and association analysis separately: imputing missing genotypes in the first stage and then performing association analysis in the second stage. In the imputation stage of the twostage approach, one can apply existing powerful external imputation algorithms such as Nicolae (2006) and Marchini et al. (2007), and, hence, the twostage approach is convenient to implement. There has been some debate on the efficiency difference between the two approaches (Marchini and Howie, 2008; Lin and Hu, 2008). Our simulation results (Tables 5 and 6) suggest that some of the efficiency difference between the full likelihood and the twostage approaches may be due to the use of different likelihoods (prospective vs. retrospective) and not so much due to the use of onestage vs. twostage analysis. In this section we have shown that one can still use a retrospective likelihood even in a twostage approach with powerful imputation performed at the first stage.
4 Haplotypes
4.1 Definitions, Background and Missing Data
Although singleSNP association tests are often the primary methods for genomewide association scans, many secondary or “downstream” analyses are often useful for detailed characterization of the risk of the disease associated with specific genomic regions of interest. One popular technique is haplotypebased association analysis, which involves studying the association of a disease with a genomic region in terms of the underlying “haplotypes,” the combination of alleles at multiple loci along individual homologous chromosomes. Originally, haplotypebased association analysis was considered a powerful technique for “indirect” association testing in situations where a causal SNP may not have been genotyped, but the haplotypes defined by multiple typed SNPs could serve as a good “surrogate” for the causal variant. With the advent of various imputation methods, although haplotype analysis has become less relevant for such indirect association testing, it remains a useful tool for parsimonious characterization of disease risk associated with multiple, possibly interacting, loci within a given region. Moreover, it is conceivable that for some regions, the haplotypes, and not the individual SNPs, are functional units and, thus, for these regions stronger signals of associations could be detected by performing haplotypebased regression analysis.
A technical problem for haplotypebased regression analysis is that typically the haplotype information for the study subjects is not directly observable. Instead, locusspecific genotype data are observed, which contain information on the pair of alleles a subject carries, but does not provide the “phase information,” that is, which combinations of alleles appear across multiple loci along the individual chromosomes. In general, the genotype data of a subject will be phaseambiguous whenever the subject is heterozygous at two or more loci. Statistically, the lack of phase information can be viewed as a special missing data problem.
For example, suppose A/a and B/b denote the major/minor alleles in two biallelic loci. A particular haplotype pair, called a diplotype, is the pair of alleles that are inherited from one’s parents. One such haplotype pair would be , and disease risk can be associated with the number of copies of particular haplotypes that one inherits. Unfortunately, the diplotypes are not observable directly, but instead we can observe only the unordered or combined genotypes, in this case at the first locus and at the second locus, that is, . However, when observing only the genotypes, the actual haplotype pair is unknown, or “phase ambiguous,” because the haplotype pair has the same set of unordered genotypes. Confronted with the unordered set of genotypes , we know that the actual haplotype pair is either or , but we must use probability models to take into account the phase ambiguity when performing statistical inference.
In Section 2 we described “modelfree” prospective and “modelbased” efficient retrospective methods for analyzing SNP data, and we also described empiricalBayes methods that dataadaptively move between the two. Just as in SNP data, for haplotype data there are also modelfree and modelbased methods, and accompanying empiricalBayes methods.
A variety of methods have been developed forhaplotypebased analysis of casecontrol data using the logistic regression model (Zhao, Li and Khalid, 2003; Lake et al., 2003; Epstein and Satten, 2003; Satten and Epstein, 2004; Spinka, Carroll and Chatterjee, 2005; Lin and Zeng, 2006; Chatterjee et al., 2006; Chen, Chatterjee and Carroll, 2009). Consider a general risk model similar to (3) but with the addition of environmental factors and written in terms of the diplotypes, denoted as :
(11)  
where the function is chosen in a suitable way to reflect an assumed mode of genetic effect. For example, suppose we are interested in the particular haplotype . A model that assumes an additive effect of this haplotype would have linear in the number of copies of the haplotype .
4.2 ModelBased and ModelFree Methods
4.2.1 Identifiability
The data setup then is that we have observations on environmental exposure , genotypes and cases and controls . What is missing is the underlying diplotype . The retrospective likelihood is still (2), but the risk of disease depends on the diplotype and not otherwise on the genotype.
While models such as (4.1) seem straightforward enough for random samples, in retrospective samples a problem arises because of the phase ambiguity. In particular, all components of may not be identifiable if the distribution of is left completely unrestricted (Epstein and Satten, 2003; Lin and Zeng, 2006). Thus, to make progress, some type of distributional assumptions are needed. Here we will distinguish between two approaches, both of them retrospective in nature but with different distributional assumptions. The first we call “modelfree” in that very little is actually assumed about the haplotype distribution. If haplotypes were observable, this method reduces to ordinary prospective logistic regression, while in the rare disease case with phase ambiguity, the method reduces to that of Zhao, Li and Khalid (2003). The second approach, which we call “modelbased,” makes much stronger assumptions about the haplotype distribution, and reduces to the efficient retrospective approach of Chatterjee and Carroll (2005) if haplotypes were observable. The modelfree method will thus be more robust but less efficient than the modelbased method.
4.2.2 Modelbased method
The modelbasedmethod (Spinka, Carroll and Chatterjee, 2005) has three aspects:

[(A.1)]

Haplotypes and the environment are assumed independent in the population.

The diplotypes are assumed to be in HWE in the population, so that
where denotes the population frequency for the haplotype .

The distribution of the environmental variable is left completely nonparametric.
The methodology Spinka, Carroll and Chatterjee (2005) used to construct their profile likelihood was a nonparametric maximum likelihood estimator over the unknown distribution of . However, there is an alternative derivation, one that is both more intuitive and much easier to work out. Indeed, it is a not sufficiently wellknown fact that for most purposes a casecontrol study can be viewed as a prospective study with missing data. Consider a sampling scenario where each subject from the underlying population is selected into the casecontrol study using a Bernoulli sampling scheme where the selection probability for a subject given his/her disease status is proportional to . Inference with the actual casecontrol data can then be based on the pseudolikelihood derived for such an alternative sampling scenario. Let denote that a subject is selected in the casecontrol sample under this Bernoulli sampling scheme and hence has been observed. Then in this alternative sampling scheme, and with the assumptions stated above,Spinka, Carroll and Chatterjee (2005) compute. This calculation is simple and in the rare disease case the resulting efficient modelbased likelihood function reduces to
(12)  
where , , , , and is the set of diplotypes consistent with the observed genotypes .
4.2.3 Modelfree method
The two important model assumptions in the modelbased estimator are (A.1) and (A.2). Although because of identifiability some model assumptions must be made, they can be weakened tremendously, as follows (Chen, Chatterjee and Carroll, 2009):

[(B.1)]

The haplotype and the environment are independent in the population given the genotype .

There population distribution for the diplotypes given the genotype , called , can be derived assuming HWE.
Following the same alternative sampling scheme as described in Section 4.2.2, or by doing a nonparametric maximum likelihood analysis, we can compute under assumptions (B.1), (B.2) and (A.3) to be
(13)  
To see why the likelihood requires far weaker assumptions than , note that requires the haplotype–environment independence and HWE assumption only to specify the conditional distribution while requires the same assumption to specify the entire joint distribution . As a result, requires the haplotype–environment independence and HWE only to resolve the phase ambiguous genotypes. The likelihood contribution for the subjects with phase unambiguous genotypes, that is, , is the same as that for the standard prospective logistic regression. In contrast, depends on the assumptions (A.1) and (A.2) irrespective of whether a subject has a missing phase or not.
Note that will contain little information on since it conditions on . Thus, when implementing methods based on this likelihood,Chen, Chatterjee and Carroll (2009) proposed to replace the score function for by the estimating function for based on the genotype data from the controls and assuming that the haplotypes are in HWE in the population.
4.3 EmpiricalBayes
In Section 4.2.2 we constructed a profile likelihood under strong assumptions leading to an efficient method that will not be robust to violations of the two major assumptions. Conversely, in Section 4.2.3 we computed a profile likelihood leading to much more robust inference, but at the cost of a steep loss of efficiency. Similarly to Section 2.3, here we briefly review a fully sample size and dataadaptive empiricalBayes method thatChen, Chatterjee and Carroll (2009) described for gaining efficiency when warranted but is still robust.
Let and be the modelfree and modelbased estimates, with components and . Let be the covariance matrix of , with the diagonal element of being : a sandwich estimator can be computed, although a nonparametric bootstrap can also be used. Then one can define the empiricalBayes estimator
(14)  
The intuition behind (14) is that if the model fails, will be large relative to , which as a variance is proportional to , hence, , and, hence, the empiricalBayes method will effectively become the modelfree estimator. If, however, the model assumption holds, then and are proportional to one another, so that and the empiricalBayes estimate goes part way toward the modelbased estimator, and hence gains efficiency over the modelfree estimate.Chen, Chatterjee and Carroll (2009) describe thelimiting distribution of (14) and how to compute an estimate of its variance.
Chen, Chatterjee and Carroll (2009) illustrate application of the different methods in two casecontrol data examples. The examples were chosen in such a way that from a priori biologic grounds one would expect the gene–environment independence assumption to hold in one case, but not in the other. The two examples together illustrate how the different shrinkage estimators adapt to alternative scenarios of gene–environment distribution.
5 Discussion
Researchers now increasingly use the Cochran–Armitage trend test as the primary method for singleSNP association testing in the GWAS. The test is known to have robust power for the detection of effect of susceptibility SNPs under a range of realistic modes of inheritance that give rise to some sort of monotone relationship between disease risk and allele count. As noted in Section 2, the retrospective and prospective methods have very similar, if not identical, power under the trend model and thus either could be used as the primary method for analysis of GWAS data. The trend test, however, can perform very poorly for the detection of SNPs for which the minor allele has a recessive effect. Thus, it is often recommended that a test under the recessive mode of inheritance be conducted as a secondary step to detect SNPs with recessive effects that may be missed by the primary trend test of association. The use of the retrospective method can be potentially beneficial at this stage. One, however, has to be cautious about creation of false positive results due to the violation of the HWE assumption. We recommend that if a retrospective method is to be used for potential power gain, then it should be used in conjunction with the empiricalBayes type shrinkage estimation. Our numerical investigations suggest that such a method can indeed be more powerful than the conventional “prospective” methods without creating excess false positives; see Tables 1 and 2.
In this article, although we focus on association tests involving biallelic SNPs, the same issues are relevant for genetic association tests involving loci with more than two alleles. In particular, one can gain efficiency for analysis of casecontrol data by assuming HWE or other natural populationgenetic models (Satten and Epstein, 2004; Lin and Zeng,2006) to specify multiallelic genotype frequency for the underlying population. The sensitivity of the methods to underlying model assumption can be reduced by appropriate shrinkage estimation techniques.
The impact of population stratification (PS) can be very different for prospective and retrospective methods. As it is well known, the presence ofpopulation stratification, that is, the existence of hidden ethnic substructures in the population, can create confounding bias in all of the methods when both genefrequency and disease risks vary across the underlying strata. The presence of PS can also cause large scale violation of the HWE assumption, thus making the retrospective method more susceptible to bias than its prospective counterpart. Our application of different methods to the CGEMSgenomewide association study data illustrated that the empiricalBayes type procedure can correct for inflated typeI error that may exist for retrospective methods due to large scale violation of the underlying HWE assumption.
The difference between prospective and retrospective methods becomes more relevant for studies of gene–gene and gene–environment interactions, a topic that we have not directly addressed in this article. In particular, retrospective methods, such as the caseonly analysis (Piegorsch, Weinberg and Taylor, 1994), which assumes gene–gene or/and gene–environment independence for the underlying population, can gain dramatic power for testing and estimation of odds ratio interaction parameters in the logistic regression model. Given that standard casecontrol analyses often have poor power for detection of multiplicative interactions due to small numbers of cases or controls in cells of crossing exposures, practitioners often find it is tempting to use the more powerful retrospective methods. The assumption of gene–environment independence, however, can be violated, either due to direct casual association between gene and environment or indirect association due to effects of family history and hidden population stratification. The assumption of gene–gene independence between physically distant genes can also be violated due to population stratification. Thus, we believe the development of shrinkage (Mukherjee and Chatterjee, 2008; Chen, Chatterjee and Carroll, 2009) and other types of dataadaptive techniques(Li and Conti, 2009) has been valuable for robust inference in casecontrol studies of genetic epidemiology.
Acknowledgments
Chatterjee’s research was supported by a gene–environment initiative grant from the National Heart Lung and Blood Institute (RO1HL09117201) and by the Intramural Research Program of the National Cancer Institute. Chen’s research was supported by the National Science Council of ROC (NSC 952118M001022MY3). Carroll’s research was supported by a grant from the National Cancer Institute(CA57030) and by Award Number KUSCI01604, made by King Abdullah University of Science and Technology (KAUST).
References
 Andersen (1970) Andersen, E. B. (1970). Asymptotic properties of conditional maximumlikelihood estimators. J. Roy. Statist. Soc. Ser. B 32 283–301. \MR0273723
 Chapman et al. (2003) Chapman, J. M., Cooper, J. D., Todd, J. A. and Clayton, D. G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity 56 18–31.
 Chatterjee and Carroll (2005) Chatterjee, N. and Carroll, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene–environment independence in casecontrol studies. Biometrika 92 399–418. \MR2201367
 Chatterjee et al. (2006) Chatterjee, N., Spinka, C., Chen, J. and Carroll, R. J. (2006). Likelihood based inference on haplotype effects in genetic association studiesComment. J. Amer. Statist. Assoc. 101 108–111.
 Chen and Chatterjee (2007) Chen, J. and Chatterjee, N. (2007). Exploiting Hardy–Weinberg equilibrium for efficient screening of single SNP associations from casecontrol studies. Human Heredity 63 196–204.
 Chen, Chatterjee and Carroll (2009) Chen, Y. H., Chatterjee, N. and Carroll, R. J. (2009). Shrinkage estimators for robust and efficient inference in haplotypebased casecontrol studies. J. Amer. Statist. Assoc. 104 220–233.
 Cornfield (1956) Cornfield, J. (1956). A statistical problem arising from retrospective studies. In Proceedings of the Third Berkeley Sympos. Math. Statist. Probab. 135–148. Univ. California Press, Berkeley. \MR0084935
 Epstein and Satten (2003) Epstein, M. P. and Satten, G. A. (2003). Inference on haplotype effects in casecontrol studies using unphased genotype data. American Journal of Human Genetics 73 1316–1329.
 Hartl and Clark (2007) Hartl, D. L. and Clark, A. G. (2007). Principles of Population Genetics, 4th ed. Sinauer Associates, Sunderland, MA.
 Hunter et al. (2007) Hunter, D. J., Kraft, P., Jacobs, K. B., Cox, D. G., Yeager, M., Hankinson, S. E., Wacholder, S., Wang, Z., Welch, R., Hutchinson, A., Wang, J., Yu, K., Chatterjee, N. et al. (2007). A genomewide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics 39 870–874.
 Lake et al. (2003) Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M. and Schaid, D. J. (2003). Estimation and tests of haplotype–environment interaction when linkage phase is ambiguous. Human Heredity 55 56–65.
 Li and Conti (2009) Li, D. and Conti, D. V. (2009). Detecting gene–environment interactions using a combined caseonly and casecontrol approach. American Journal of Epidemiology 169 497–504.
 Lin and Hu (2008) Lin, D. Y. and Hu, Y. (2008). Reply to Marchini and Howie. American Journal of Human Genetics 83 539–540.
 Lin, Hu and Huang (2008) Lin, D. Y., Hu, Y. and Huang, B. E. (2008). Simple and efficient analysis of disease association with missing genotype data. American Journal of Human Genetics 82 444–445.
 Lin and Zeng (2006) Lin, D. Y. and Zeng, D. (2006). Likelihoodbased inference on haplotype effects in genetic association studies. J. Amer. Statist. Assoc. 101 89–104. \MR2268031
 Luo et al. (2009) Luo, S., Mukherjee, B., Chen, J. and Chatterjee, N. (2009). Shrinkage estimation for robust and efficient screening of singleSNP sssociation from casecontrol genomewide association studies. Genetic Epidemiology Online.
 Marchini and Howie (2008) Marchini, J. and Howie, B. (2008). Comparing algorithms for genotype imputation. American Journal of Human Genetics 83 535–539.
 Marchini et al. (2007) Marchini, J., Howie, B., Myers, S., McVean, G. and Donnelly, P. (2007). A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genetics 39 906–913.
 Mukherjee and Chatterjee (2008) Mukherjee, B. and Chatterjee, N. (2008). Exploiting gene–environment independence for analysis of casecontrol studies: An empirical Bayes approach to trade off between bias and efficiency. Biometrics 64 685–694.
 Nicolae (2006) Nicolae, D. L. (2006). Testing untyped alleles (TUNA)applications to genomewide association studies. Genetic Epidemiology 30 718–727.
 Piegorsch, Weinberg and Taylor (1994) Piegorsch, W. W., Weinberg, C. R. and Taylor, J. A. (1994). Nonhierarchical logistic models and caseonly designs for assessing susceptibility in populationbased casecontrol studies. Statist. Med. 13 153–162.
 Prentice and Pyke (1979) Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and casecontrol studies. Biometrika 66 403–412. \MR0556730
 Roeder, Carroll and Lindsay (1996) Roeder, K., Carroll, R. J. and Lindsay, B. G. (1996). A semiparametric mixture approach to casecontrol studies with errors in covariables. J. Amer. Statist. Assoc. 91 722–732. \MR1395739
 Satten and Epstein (2004) Satten, G. A. and Epstein, M. P. (2004). Comparison of prospective and retrospective methods for haplotype inference in casecontrol studies. Genetic Epidemiology 27 192–201.
 Satten and Kupper (1993) Satten, G. A. and Kupper, L. L. (1993). Conditional regression analysis of the exposuredisease odds ratio using known probabilityofexposure values. Biometrics 49 429–440.
 Spinka, Carroll and Chatterjee (2005) Spinka, C., Carroll, R. J. and Chatterjee, N. (2005). Analysis of casecontrol studies of genetic and environmental factors with missing genetic information and haplotypephase ambiguity. Genetic Epidemiology 29 108–127.
 Thomas et al. (2008) Thomas, G., Jacobs, K. B., Yeager, M., Kraft, P., Wacholder, S., Orr, N., Yu, K., Chatterjee, N., Welch, R., Hutchinson, A. et al. (2008). Multiple novel loci identified in a genomewide association study of prostate cancer. Nature Genetics 40 310–315.
 van Belle et al. (2004) van Belle, G., Heagerty, P. J., Fisher, L. D. and Lumley, T. S. (2004). Biostatistics: A Methodology for the Health Sciences. Wiley, Hoboken, NJ.
 Xiong, Zhao and Berwinkle (2002) Xiong, M., Zhao, J. and Berwinkle, E. (2002). Generalized T2 test for genome association studies. American Journal of Human Genetics 70 1257–1268.
 Yeager et al. (2007) Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S., Minichiello, M. J., Fearnhead, P., Yu, K., Chatterjee, N. et al. (2007). Genomewide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics 39 645–649.
 Yu et al. (2009) Yu, K., Li, Q., Bergen, A. W., Pfeiffer, R., Rosenberg, P., Caporaso, N., Kraft, P. and Chatterjee, N. (2009). Pathway analysis by adaptive combination of Pvalues. Genetic Epidemiology 33 700–709.
 Zhao, Li and Khalid (2003) Zhao, L. P., Li, S. S. and Khalid, N. A. (2003). Method for the assessment of disease associations with singlenucleotide polymorphism haplotypes and environmental variables in casecontrol studies. American Journal of Human Genetics 72 1231–1250.