A Bayesian Semiparametric Approach to Learning About GeneGene Interactions in CaseControl Studies
Abstract
Genegene interactions are often regarded as playing significant roles in influencing variabilities of complex traits. Although much research has been devoted to this area, till date a comprehensive statistical model that adequately addresses the highly dependent structures associated with the interactions between the genes, multiple loci of every gene, various and unknown number of subpopulations that the subjects arise from, seem to be lacking.
In this paper, we propose and develop a novel Bayesian semiparametric approach composed of finite mixtures based on Dirichlet processes and a hierarchical matrixnormal distribution that can comprehensively account for the unknown number of subpopulations and genegene interactions. Then, by formulating novel, and suitable Bayesian tests of hypotheses using a suitable metric for comparing clusterings in conjunction with the interaction parameters of the matrixnormal, we attempt to single out the roles of the genes, individually, and in interaction with other genes, in casecontrol studies. Quite importantly, we also attempt to identify the DPL.
Since the mixtures are conditionally independent of each other given the interaction structure, for model fitting we update them simultaneously in parallel processors using an efficient Gibbs sampling procedure. Given the updated mixture parameters the interaction parameters are then updated in a single block using the recently developed fast and efficient Transformation based MCMC (TMCMC) methodology. Thus, our model facilitates a highly efficient parallel computing methodology.
Application of our model and methodologies to biologically realistic data sets simulated from an
existing, realistic population genetics
model associated with casecontrol study, revealed quite encouraging performance.
Quite importantly, we also applied our ideas to a real, myocardial infarction dataset, and obtained
interesting results that partly agree with, but also complement, the existing works in this area, and reveal the importance
of sophisticated and realistic modeling of genegene interactions.
Keywords: Casecontrol study; Dirichlet process; Genegene interaction;
Myocardial infarction; Parallel processing; Transformation based MCMC.
Contents
 1 Introduction
 2 A new Bayesian semiparametric model for genegene interactions
 3 Detection of the roles of genes in casecontrol studies
 4 Validation of our model and methodologies with biologically realistic simulated data sets

5 Application of our model and methodologies to a real, casecontrol dataset on Myocardial Infarction
 5.1 Data description
 5.2 Remarks on model implementation

5.3 Results of the real data analysis
 5.3.1 Influential genes obtained from our analysis
 5.3.2 Disease predisposing loci detected by our Bayesian analysis
 5.3.3 Roles of genegene and SNPSNP interactions discriminating our gene findings with influential genes reported in the literature
 5.3.4 Posteriors of the number of distinct mixture components distinguishing important sets of genes
 5.4 Discussion of our Bayesian methods and GWAS in light of our findings
 6 Concluding remarks
 S1 A rule of thumb for choosing and
 S2 Choices of and
 S3 Elucidation that the th loci of any two different genes can be independent with positive probability
 S4 A parallel MCMC algorithm for model fitting
 S5 Hellinger distance for hypothesis testing and associated computational challenge
 S6 Clustering and Euclidean metrics
 S7 Simulation studies
 S8 Explanation of the issue that even small correlations between SNPwise casecontrol Euclidean distances determine the DPL
 S9 Some other significant SNPs in the real data analysis
1 Introduction
Isolated evaluation of individual single nucleotide polymorphisms (SNPs) for their association with complex diseases in GWAS have succeeded in explaining only small proportions of genetic inheritances; see ? and the references therein. It is now wellknown that some genes interact with one another in complex networks, and that such interactions may influence genetic variations of complex traits (see ?, ?, ?, ?, ?). It is hence anticipated that the want of significant success of GWAS may perhaps be due to the lack of a sophisticated statistical model incorporating the scientific understanding about genegene interactions (see ?) into genomic profiling, that may help in understanding the biological and biochemical pathways behind complex diseases; see ?.
One of the main challenges faced at the very onset of investigating genetic interactions is that of nonuniqueness in the definition of epistasis or genegene interactions. According to ?, biologically, epistasis can be categorised as functional epistasis and compositional epistasis, both of which differ widely from the statistical definition of epistasis. While functional epistasis indicates proteinprotein interactions at molecular level, any disruption of which is explained by a genetic consequence, compositional epistasis refers to blocking of one allelic effect by an allele at another locus. ? (see also ?) defined statistical interaction among the genes as deviations from additive marginal effects of individual genes. Although ? derived some strong empirical conditions under which statistical interactions correspond to compositional epistasis, a prevailing opinion reflected in both genetic and epidemiological literature suggests limitations of the tests based on Fisher’s statistical definition of epistasis in explaining the genegene interactions in the biological sense of the term. According to ? and ?, although statistical and biological interpretations of interaction need not be compatible with each other, quantification of biological interaction should be based on statistical concepts of interactions. We have formulated a statistical model composed of mixture distributions based on Dirichlet processes, incorporating the effects due to complex genetic interactions through hierarchical matrixnormalinverseWishart distribution.
Several casecontrol studies, defining genegene interactions via SNPSNP interactions have been performed (?), assuming that interaction between any two genes are in fact caused by interactions between their respective SNPs. Apart from not considering the genes as functional units, these linear modelbased statistical analyses suffer from very high testing dimensionality and become computationally burdensome because of the very large number of marginal and pairwise interaction effects to be incorporated in the linear model when dealing with a large number of SNPs. The relevant statistical interaction models existing in the casecontrol literature point towards the following tradeoff: dimension reduction by working at the genelevel makes computation feasible, but at the cost of useful SNPlevel information (see ?), while working at SNPlevel promises all the necessary information but at the cost of enormous computational burden (see also ?).
The aforementioned difficulties in the forms of tradeoffs can be traced back to additive modeling strategies. Indeed, even with a small number of SNPs, the traditional linear model is constituted of a very large number of terms consisting of the marginal and interaction effects at the SNP level. Attempts to incorporate the genes as functional units in the model necessarily calls for sacrifice of useful SNPlevel information, while principal components analysis for dimension reduction make genetic interpretation difficult. The linear modeling strategy based on Fisher’s statistical definition of epistasis can also be questioned on the ground of oversimplicity, since functionally, genegene interactions may involve very complex physical interplay among proteins as gene products (?). Moreover, in linear models the main effects and the interaction effects are estimated from the genotype data and then onwards assumed to be nonrandom covariates. More holistic approaches should be concerned with postulating highly structured joint distributions of the complex genotype data.
A further drawback of the existing interaction models is that they often ignore multiple subpopulations that the genotype data usually arise from. Indeed, for different subpopulations, the genes (or SNPs) may interact differently, which adds further complexity to the complicated functional form of epistasis. ? empirically demonstrate that methods ignoring population substructures can incur severe bias leading to largescale false positives. The fact that the number of subpopulations is not usually known is a further challenging issue that needs to be considered, as one must coherently and carefully account for the uncertainty associated with the unknown number of subpopulations.
The criticisms of the interaction models existing in the casecontrol literature motivated us to propose a new and general Bayesian model for genegene interactions and novel Bayesian hypotheses testing procedures and associated methodologies to investigate the effects of genes on complex diseases in the context of casecontrol dataset arising from a possibly stratified population of genotypes. In what follows we investigate only the genetic effects on complex diseases, without taking into account the environmental effects (but see ? and ?).
The rest of our paper is structured as follows. In Section 2 we introduce our proposed Bayesian semiparametric model, and in Section 3 we propose and develop a novel Bayesian hypothesis testing procedure for detecting the roles of genes in casecontrol studies. In Section 4 we present a brief discussion on validation of our model and methodologies with biologically realistic simulated data sets, the details of which are provided in the supplement, described below. In Section 5 we conduct a detailed analysis of casecontrol data on early onset of Myocardial Infarction obtained from dbGap, comparing and contrasting our findings with the existing results. We summarize our work and make concluding remarks in Section 6.
Additional details are provided in the supplement, whose sections and figures have the prefix “S” when referred to in this paper.
2 A new Bayesian semiparametric model for genegene interactions
Before we introduce our proposed model, we first detail the type of genotype and phenotype data that we are interested in.
2.1 Genotype data
For denoting the two chromosomes, let and indicate the presence and absence of the minor allele of the th individual, th gene, th locus, and the th group (either control or case), for , with denoting case; ; and .
In this paper, we shall concern ourselves with data sets of the aforementioned type. However, for our model, which we introduce below, it is obvious that data sets consisting of only minor allele counts at each locus contains exactly the same information as the above described data type.
2.2 Mixture models driven by Dirichlet processes
Given any , let , and . We assume that for every triplet , are independently distributed with mixture probability mass function with a maximum of components, given by
(2.1) 
where is the probability mass function of independent Bernoulli distributions, given by
(2.2) 
Using allocation variables , with probability distribution
(2.3) 
for and , (2.1) can be represented as
(2.4) 
We may assume appropriate Dirichlet distribution priors on for ; . However, as investigated in ?, the Dirichlet distribution often yields very small values of the probabilities , thereby tending to underestimate the true number of mixture components. On the other hand, setting exhibited much better performance. Therefore, in this work, we set , for , and for all .
Letting , we further assume that
(2.5)  
(2.6) 
where stands for Dirichlet process with expected probability measure having precision parameter . We assume that under , for and ,
(2.7) 
Thus, given a particular pair , our mixture model has the same structure as adopted by ? for inference on population structure. Discreteness of Dirichlet processes cause coincidences among the parameter vectors of with positive probability, so that, with positive probability, the actual number of mixture components in (2.1) falls below , the maximum number of components, the mixing probabilities taking the form , where . See ?, ?, ?, ?, for the details. In fact, we marginalize over to arrive at the wellknown Polya urn distribution of :
(2.8) 
where denotes point mass at . The property of coincidences among the parameter vectors is clearly preserved by the Polya urn scheme.
Observe that, after coincidences among the mixture components, the pairs come to be associated with different mixtures, with different numbers of components. This is reasonable, because, the distributions of the genotypes for the gene of any two individuals belonging to the same subpopulation but with different casecontrol status, that is, and are expected to correspond to different mixtures under significant genetic effect on the disease (see ?).
We have discussed a rule of thumb for the choice of as well as in Section S1 of the supplement, following whch we set and in our applications.
2.3 Incorporating the genegene and SNPSNP interactions through appropriate modeling of the parameters
2.3.1 Modeling the parameters of
Taking into consideration the SNPSNP dependence, which may exist within each gene and also among the genes, we model the Beta parameters and of (2.7) as follows:
For , where , and for every ,
(2.9)  
(2.10) 
Allowing and to be different ensures that the mean of under depends upon the th SNP. We further assume that for ,
(2.11)  
(2.12) 
We found that the Gaussian priors on and with other means and variances did not yield significantly different results, thus pointing towards inbuilt prior robustness in our modeling strategy.
Subsequently, using matrixnormal distribution as a prior on we incorporate the SNPwise dependence in a gene. Moreover, the SNPs associated with different genes are also dependent through the dependence structure among the genes imposed by the matrixnormal prior.
Note that allowing and to be shared by all the genes creates the impression that the labels of the loci are not exchangeable. However, our matrixnormalinverseWishart prior ensures that this is not the case, and that for any two genes and , and , (or ) and (or ), and hence their exponentiated versions, are independent with positive probability, given the data. This is elucidated in Section S3 of the supplement in light of the matrixnormalinverseWishart prior. We vindicated this mathematical argument with simulation studies; see Section 4 (details provided in Section S7 of the supplement). Specifically, we conducted extra simulation studies after randomly permuting the labels of the loci of each gene and reanalyzed each such data set. The results, provided in Section S7 of the supplement, are consistent with those obtained wthout permuting the labels.
2.3.2 Matrix normal prior for
We consider the following model for :
(2.13) 
Rewriting the dimensional vector as a matrix , (2.13) can be represented as a matrix normal distribution with mean matrix , left covariance matrix and right covariance matrix , having probability density function
(2.14) 
We note that the th column of , which we denote by , follows the multivariate normal distribution:
(2.15) 
where is the th column of . The covariance matrix between and is given by
(2.16) 
Similarly, the th row of , which we denote by , has the following multivariate normal distribution:
(2.17) 
being the th row of . Also,
(2.18) 
In our applications we chose .
The essence of the matrix normal structure is to offer a dependence structure among the genes. Given casecontrol status , the dependence structure associated with the genes is provided by , while the matrix represents the dependence between the genotype distribution of the cases and controls, given any particular gene.
2.3.3 Priors on and
We assume that
(2.19) 
where stands for InverseWishart distribution with degrees of freedom and positive definite scale matrix . The density function is given by
(2.20) 
We further assume that
(2.21) 
where the degrees of freedom satisfies and is a positive definite matrix; the density function is given by
(2.22) 
For our applications, we set and . These choices are the minimum values such that the prior expectations of and are welldefined. Choices of and are detailed in Section S2 of the supplement.
A schematic representation of our model and the parallel processing algorithm is provided in Figure 2.1. Details of our parallel processing algorithm are provided in Section S4 of the supplement.
3 Detection of the roles of genes in casecontrol studies
3.1 Formulation of a Bayesian hypothesis testing procedure
In order to investigate if genes have any significant effect on casecontrol, it is pertinent to test
(3.1) 
versus
(3.2) 
where
(3.3)  
(3.4) 
If and are not significantly different, then it is plausible to conclude that the role of genes is not significant in the casecontrol study.
In a nutshell, testing the hypothesis in (3.1)(3.4) requires some appropriate divergence measure between and and if denotes the divergence, then is to be accepted for appropriately large posterior probability of the event that is small.
In our situation, simultaneous consideration of a large number of genes, involving thousands of SNPs renders the existing measures of divergence practically infeasible to compute. For details, see Section S5 of the supplement. This compels us to seek alternative measures of divergence that are also amenable to efficient computation. In the mixture context, a natural measure is the discrepancy between the clusterings associated with the two mixture distributions (?, ?). However, since clusterings do not account for the magnitudes of the parameters, insignificant difference between the clusterings does not necessarily imply insignificant difference between the associated mixture densities. It is worth noting that even the Euclidean metric alone is not appropriate – since mixture densities are invariant with respect to permutations of the parameter components, large Euclidean distances between parameter vectors need not imply large difference between the densities. Thus, we propose to use the clustering based ideas in conjunction with ideas based on the Euclidean metric, appropriately modified for mixture densities. Indeed, if the clusterings associated with the mixture densities are known to be of insignificant difference, then insignificant Euclidean divergence between the parameter vectors does imply insignificant difference between the mixture densities. Details of all these issues are provided in Section S6 of the supplement.
3.2 Formal Bayesian hypothesis testing procedure integrating the above developments
With the clustering metric provided in Section S6 of the supplement, let us define
where
is the distance between the clusterings and , for . With this, we first test
(3.5) 
for reasonably small choice of (). Acceptance of indicates that clusterings associated with and have insignificant difference, for every . If is rejected, then it follows that for at least one , the clustering difference is significant.
If is rejected, then it entails that the clusterings associated with the mixture densities for case and control are significantly different. This implies that the mixture densities themselves are significantly different, so that rejection of leads to rejection of given by (3.1).
But whenever is accepted based on the “” loss and the clustering metric, as already argued, this does not necessarily imply that the mixture densities have insignificant difference. All one can infer in this case is that differences between the associated clusterings are insignificant. Hence, when is accepted, we consider a second test of the form
(3.6) 
where ; here is the Euclidean distance between
and
with , and ; see Section S6 of the supplement.
If is also accepted, then one can safely accept . If is rejected, we then consider a third test of the form
(3.7) 
where , with . Here is a pseudometric based on the Euclidean distance; see Section S6 for details.
If is accepted, then this implies acceptance of given by (3.1). Else, must be rejected. For clarity, we present a schematic diagram of the hierarchy of the hypotheses tests in Figure 3.1.
3.2.1 Choice of loss function for the Bayesian tests
Recall that the “” loss function (see, for example, ?, for details) entails zero loss under the correct decisions, loss under false acceptance of the null hypothesis and loss under false rejection of the null. Thus, we accept under the “” loss function if
(3.8) 
Since usually there is no clearcut way of specifying , we shall generally select the default value , reducing the “” loss function to the wellknown “” loss function.
However, for the tests associated with and , which are to be considered only if is accepted, we shall select much larger than 1. This is because it makes sense to provide greater protection to the null hypothesis given that the clustering test has already provided partial support to , indicating that at least the clusterings are not significantly different (see also Section S6 of the supplement).
Note that, under the “” loss, the null hypothesis is to be accepted if its posterior probability exceeds , while under the “” loss, the threshold posterior probability is . For the hypotheses involving and we shall set so that . This choice is motivated by the level of significance of classical significance tests. Some other choices will also be briefly touched upon.
3.2.2 Choice of
Choices of are expected to be problem specific. In Section S7 of the supplement, we discuss in detail the choices of in our applications. Briefly, we first consider an appropriate null model, for instance the same model as ours but with and set to identity matrix to reflect the null hypotheses of “no interaction” and same mixture distributions under cases and controls for each gene for no genetic effect. We then generate casecontrol genotype data from the null model and fit our general Bayesian model to the “null data” and set as the th percentile of the relevant posterior distribution.
3.3 Bayesian tests for individual genetic effects when is rejected
If given by (3.1) is finally accepted then we may conclude that there is no significant evidence to claim that the genes, individually, or in interaction with the other genes, are important factors in the casecontrol study.
On the other hand, if is rejected, then we check for significances of the individual genes by applying our Bayesian testing procedure on the hypotheses
(3.9) 
For each , we adopt the same procedure for testing the hypothesis as for testing versus ; only , and are to be replaced with , and , respectively. Note that at each stage associated with , and , the Bayesian hypotheses testing framework is equivalent to a Bayesian multiple testing paradigm. Specifically, testing versus for using our Bayesian methods is equivalent to minimizing the Bayes risk of the additive “01” loss function, and the subsequent Bayesian tests for the hypotheses versus and versus , for relevant indices , are Bayesian multiple testing procedures that minimize the Bayes risk of the additive “01c” loss function, where we choose .
If is accepted, then it is possible that the th gene is not individually influential, but some interaction effect involving the th gene may be significant. To check which interactions are significant (we may check this even if is rejected, since the th gene may be marginally significant as well as interactive with the other genes), one may conduct the tests versus , for , being the th element of . Acceptance of for some (or many) , indicates which of the genes interact with the th gene to contribute significantly to the underlying casecontrol study.
4 Validation of our model and methodologies with biologically realistic simulated data sets
We evaluate our model and methodologies on data sets generated from the GENS2 software designed by ?. In a nutshell, the software creates large, biologically realistic data sets having realistic LD patterns, where risks of complex diseases are influenced by known genegene and geneenvironment interactions. We consider two simulated data sets for our experiments – in the first experiment, we generate a casecontrol data set under the effect of genegene interaction, fit our model to the data set, and test the relevant hypotheses. We show that our model and methodology successfully captures the relevant information regarding the effects of the individual genes, genegene interaction, and the number of subpopulations. We also show that, in spite of LD, our model succeeds in capturing the close neighborhoods of the actual disease predisposing loci (DPL) of the genes.
In the second experiment we generate a data set where disease risk is devoid of any genetic effect and is influenced only by some environmental exposure. Application of our model and methods to this data set again successfully captures the correct situation, clearly indicating lack of genetic influence.
For both the simulated data sets we perform simulation experiments by randomly permuting the labels of the loci of each gene. For both cases we obtain results associated with the permuted labels that completely support those obtained from the original simulated data sets.
Details are provided in Section S7 of the supplement.
5 Application of our model and methodologies to a real, casecontrol dataset on Myocardial Infarction
Application of our ideas to a casecontrol dataset on earlyonset of myocardial infarction (MI) from MI Gen study, obtained from the dbGaP database (http://www.ncbi.nlm.nih.gov/gap), led to some interesting findings.
MI (more commonly, heart attack), is a complex disease and is a leading cause of death and disability all over the world. Much investigation has been carried out for detecting the genetic causes of myocardial infarction, all of which are based on the assumption that the main contributory factors for the disease are the mutations in the proteins associated with the pathophysiology of atherosclerosis (see ?).
Although the GWA studies have revealed a lot of genetic information regarding MI (an overview of the main results can be found in ?), only a very few of the detected genes are related to traditional risk factors (LDLcholesterol, diabetes and LP[a] etc.), and the other genes increase the risk by pathogenetic mechanisms that are not yet properly understood. Despite much success in deciphering the marginal effects of many SNPs, not much has been achieved in the genegene interaction front. According to ?, burden of multiple testing renders the standard GWAS samples underpowered to detect such effects, while ? blame the complexity of the epistatic effects as a reason behind the difficulty in detecting them.
5.1 Data description
The MI Gen data obtained from dbGaP consists of observations on presence/absence of minor alleles at SNP markers associated with 22 autosomes and the sex chromosomes of cases of earlyonset myocardial infarction, age and sex matched controls. The average age at the time of MI was 41 years among the male cases and 47 years among the female cases. The data broadly represents a mixture of four subpopulations: Caucasian, Han Chinese, Japanese and Yoruban. Since the names of the genes were not provided in the dataset, SNPs were mapped on to the corresponding genes using the Ensembl human genome database (http://www.ensembl.org/). However, technical glitches prevented us from obtaining information on the genes associated with all the markers. As such, we could categorize markers out of with respect to genes.
For our analysis, we considered a set of SNPs that are found to be individually associated with different cardiovascular end points like LDL cholesterol, smoking, blood pressure, body mass etc. in various GWA studies published in NHGRI catalogue and augmented this set further with another set of SNPs found to be marginally associated with MI in the MIGen study (see ?). Our study also includes SNPs that are reported to be associated with MI in various other studies, see ?, ? and ?. In all, we obtained 271 SNPs. Unfortunately, only 33 of them turned out to be common to the SNPs of our original MI dataset on genotypes, which has been mapped on to the genes using the Ensembl human genome database. However, we included in our study all the SNPs associated with the genes containing the 33 common SNPs. Specifically, our study involves the genotypic information on 32 genes covering 1251 loci, including the 33 previously identified loci for all the individuals available in our dataset.
Categorization of the casecontrol genotype data into the four subpopulations, each of which are likely to represent several further and rather varied subpopulations genetically, implies that the maximum number of mixture components must be fixed at some value much higher than . As before, we set and for every , to facilitate datadriven inference. Interestingly, the distributions of the number of distinct components for (so that the prior mean and variance are approximately ) were not significantly different from those of , indicating prior robustness.
We chose a similar setup for the null model. That is, we chose the same number of genes and the same number of loci for each gene, the same number of cases and controls, the same value , but for every , as in our simulation studies. We use the same priors as in the real data setup except that we set and to be identity matrices to ensure that the genetic interaction is not present and set the same mixture distribution under cases and controls for each gene to ensure the absence of genetic effects. For details see Section 4.1.2 of the supplement.
5.2 Remarks on model implementation
We implemented our parallel MCMC algorithm detailed in Section S4 of the supplement for posterior simulation on a VMware consisting of doublethreaded, bit physical cores, each running at GHz; such cores were available to us. The mixture components associated with , with , are updated in parallel, on of the total available threads. This is followed by updating the interaction parameters on a separate processor using a mixture of additive and additivemultiplicative TMCMC (see Section S4.1 of the supplement). However, in this problem, the interaction matrix is of order , and the associated Cholesky decomposition (see Section S4.1 of the supplement) then consists of parameters. Furthermore, here is a dimensional vector, , where , consists of parameters and , with its Cholesky decomposition, consists of unknowns. Hence, in all, there are interaction parameters to be updated.
Updating too many parameters in a single block, even with TMCMC, need not guarantee automatic efficiency. Here we consider updating subblocks of parameters at a time using additive TMCMC. Specifically, we update by updating the dimensional separately for ; we also update the blocks and and separately. Since consists of parameters, at each iteration we update only randomly chosen nonzero elements of the Cholesky factor of using additive TMCMC. The latter is certainly a valid TMCMC strategy, which is theoretically a mixture of TMCMC strategies (see, for example, ? in the context of MetropolisHastings), and maintains very reasonable acceptance rate in our application.
The above parallel MCMC algorithm takes about hours to yield iterations in our aforementioned VMware machine. We discard the first iterations as burnin. Informal convergence diagnostics such as trace plots exhibited adequate mixing properties of our parallel algorithm.
5.3 Results of the real data analysis
5.3.1 Influential genes obtained from our analysis
Our Bayesian hypotheses testing using both clustering metric and the Euclidean distance reveal that there is very significant overall genetic influence on MI. Indeed, it turned out that and , where and are the th percentiles of the null distributions of and . Furthermore, testing for the effects of the genes individually using the clustering metric showed that apart from only genes, namely, AP006216.10, AP006216.5, APOC1 and OR4A48P and AP00621.5, all other genes have significant effect on MI, while with the Euclidean metric, all the genes considered for study turned out to be significant. The posterior probabilities of the null hypotheses (of no significant genetic influence) are shown in Figure S4 of the supplement.
Interestingly, with respect to the Euclidean metric, all the five posterior probabilities of the null hypotheses associated with the aforementioned 5 genes, turned out to be empirically zero. Thus, even though the clustering metric accepts 5 null hypotheses, the confirmation tests with the Euclidean distances suggest rejection of all of them. We hence conclude that all the genes considered in the study have significant effect on MI. This is in keeping with the fact that the genes considered in our study were found to be associated with different cardiovascular endpoints in various GWA studies or have been confirmed to play important roles in causing MI in earlier studies.
5.3.2 Disease predisposing loci detected by our Bayesian analysis
We now show that the most influential SNPs corresponding to the maximum Euclidean distance in each of the significant genes in our study, which we continue to refer to as the DPLs, are usually close to, and sometimes exactly the same as the SNPs, already flagged by the earlier studies as influential.
Figure S5 of the supplement shows the index plots of the posterior medians of the clustering and Euclidean distances between case and control, with respect to the corresponding genes. In terms of the clustering metric, genes , , and are associated with the largest medians of the clustering distances, ranging between and . These genes consist of number of loci , , and , respectively. On computing the averaged Euclidean distances , of the loci in each such Gene, where the averages are taken over the TMCMC samples, we found that loci in , in , in and in have the largest distances among all the loci of the respective genes. These are depicted in Figure 5.1. Note that for the genes , , and to some extent, the significant SNPs from our study are not only close to the SNPs found significant in the existing studies (see ?), with respect to the Euclidean distance, but also lie in their close neighborhoods, suggesting relative agreement between the SNPs found significant in our study and the loci considered to be influential for MI in the literature. On the other hand, on which has been reported by ? to be associated with LDL cholesterol and total cholesterol (TC) does not turn out to be a significant SNP for MI according to our analysis.
We now focus attention to the genes that turned out to be more influential than the remaining in the sense that the medians of the Euclidean distances exceed . These are the genes , and with corresponding median Euclidean distances , and . These three genes clearly stand out in Figure S5 (panel (b)). Each of them consists of a single locus, and are yet highly influential.
Figures S6, S7 and S8 of the supplement analyze the SNPs of some of the other influential genes and point out the significant ones. Except for the genes and , the SNPs found significant in our study closely agree with the SNPs that are considered in the literature as influential.
In the next section we argue that genegene interaction plays a vital role in explaining the discrepancies between our findings and the existing results based on previous studies.
5.3.3 Roles of genegene and SNPSNP interactions discriminating our gene findings with influential genes reported in the literature
The actual genegene correlations based on medians of the posterior covariances, are shown in Figure 5.2, while Figure S9 of the supplement depicts the results on interaction testing. The color intensities correspond to the absolute values of the correlations. Note that the correlation structure involves both positive and negative values where negative correlations occur in more than 45% of the cases. A hierarchical clustering of the genes based on the absolute values of the correlations, are provided in Figure 5.3. The vertical axis of the diagram represents , standing for the correlations shown in Figure 5.2. In a nutshell, lower the order of the hierarchy, stronger are the correlations between the genes. For instance, Figure 5.3 shows that the correlation between genes and is the strongest; moreover, the correlation between and , for example, is stronger than the correlation between and .
Figures 5.2 and 5.3 depict complex interplay among the genes, many of which negatively influence each other with respect to the Euclidean distances. These genegene interactions also seem to influence the SNPwise Euclidean distances between cases and controls, revealing the effect of some new SNPs that failed to make their presence felt in the previous studies due to lack of adequate dependence structure. There is another important issue to point towards in this context. As will be seen, the correlations with respect to our current dataset are usually of small magnitude. This is not because the covariances are of small magnitude; indeed, they are all significantly bounded away from zero, but the variances are of very large order, so that the covariances scaled by the square roots of the variances, are small. These correlations, depending upon positive or negative signs, and in conjunction with the very complex interplay among the genes and the SNPs, are instrumental in deciding whether a SNP appears as the most influential one among a set of SNPs in any gene. The issue is explained mathematically in Section S8 of the supplement.
We elucidate the aforementioned issue with respect to the genes , , and ; see Figure 5.1. Indeed, among the SNPs of , the casecontrol Euclidean distance associated with , which has turned out to be influential in our analysis has maximum negative correlation of with that of , the SNP pointed significant by ?. Hence, it is not surprising that failed to be close to in terms of the Euclidean distance between case and control. On the other hand, of is positively correlated with the literaturebased important SNP , the correlation being , supporting the closeness of the Euclidean distances between and . For gene , the correlation between , the influential SNP by our study and the literaturebased , is . The consequence of this small, albeit negative correlation, is wellreflected in panel (c) of Figure 5.1; the corresponding Euclidean distances are close but there are other SNPs, positively correlated with , that have larger Euclidean distances compared to that of . For gene , however, the influential SNP by our study, is positively correlated with the literaturebased SNP , the correlation being . The fact that in spite of the positive correlation the two SNPs are not adequately close in terms of casecontrol Euclidean distances, begs some explanation, which we provide next.
Interestingly, the key to this phenomenon lies in intergenetic SNPSNP interaction. Indeed, , the most influential gene in terms of the maximum Euclidean distance, and consisting of the single locus , exerts negative influence on through the negative correlation , but has small and positive correlation, , with . This gene, through its only SNP, also influences the other genes via positive or negative correlations. For instance, its correlations with and of are and , respectively. In other words, the former receives lot more weight compared to the latter, so that becomes influential with respect to our model. For gene , its correlations with the relevant two loci and , are and , respectively, so that both SNPs are relatively close but the former takes precedence, becoming DPL, thanks to its larger correlation with . For gene , the correlations of and with are and , respectively. Hence, the former locus becomes the DPL because of its smaller negative correlation. Thus, there is a very complex interplay among the different genes, their SNPs and among the SNPs of different genes. In general it is infeasible to keep track of these complex dependencies and provide simple explanations for the different DPLs and their differences with the SNPs believed to be important by the scientific community.
The above elucidations attempt to point out that unless genegene and SNPSNP interactions are taken into account through a sophisticated, nonparametric framework, such complex interaction effects might have been missed, which would perhaps lead to declaration of some truly influential SNPs as nonsignificant, and some noninfluential SNPs as influential.
5.3.4 Posteriors of the number of distinct mixture components distinguishing important sets of genes
Figures S10, S11 and S12 show the posteriors of the number of distinct components associated with genes from the sets , , , , , , , and , , , respectively. The posteriors confirm our expectation that the four broad subpopulations composed of Caucasians, Han Chinese, Japanese and Yoruban admit further subdivisions in general. Indeed, although for genes , and , the number of subpopulations turned out to be less than with high posterior probabilities, for the other genes the number of subpopulations have exceeded with almost full posterior probabilities. The shown posteriors are negligibly different for case and control, for all the three sets of genes, which is to be expected because of the high positive correlations between and .
It is interesting to observe that the posteriors of the number of components of the first set of genes , , , are roughly stochastically dominated by the second set , , , , which, in turn, are dominated by those of the set , , . The implication is that the third set consists of more genetic variations, followed by the second set, while the first set consists of least genetic variations. Thus, the last set of genes, consisting of only one locus each, seems to be most likely to affect the disease, while the first set seems to be the least influential on MI.
5.4 Discussion of our Bayesian methods and GWAS in light of our findings
Our Bayesian analysis yielded results that are broadly in agreement with those obtained by GWA investigations reported in the literature. However, the fact that some of the SNPs which are flagged by the literature as important, did not show up as the most significant ones, deserves attention. The main issue that emerged in our investigation is that the genegene interactions are responsible for suppression of the socalled important SNPs via implicit induction of negative correlations among Euclidean distances between cases and controls for the associated genes. Had there been no such negative correlations, it is plausible that these SNPs would turn out to be the most influential ones.
Apart from a few agreements, the literature based SNPs are different from the SNPs detected significant in our analysis, many of which lie in the intronic regions and have not been thoroughly explored and hence need further investigation. As per our investigation, sophisticated, nonparametric modeling of genegene interactions plays a very crucial role in imparting significance to the overall effect of the individual genes. Since the GWAS did not incorporate the complex intra and intergenetic interactions into the model, it is perhaps not very unreasonable to question if the same genes would emerge as significant if realistic modeling of genegene interactions is taken into account.
For the current MI study, we summarize our findings in Table 5.1, where we present the 32 genes ordered with respect to the median casecontrol Euclidean distances, the SNPs flagged by the literature as significant, the corresponding SNPs detected by our Bayesian model and methods, and the phenotypes of the reportedly significant SNPs and our SNPs.
Chr  Literature  Reported  Bayesian  Reported  

Genes  No.  based SNPs  Phenotype  SNPs  Phenotype 
OR4A48P  11  rs7395662  LDL, HDL,  rs7395662  LDL, HDL, 
tryglycerides  tryglycerides  
AP006216.10  11  rs964184  triglycerides, LDL,  rs964184  triglycerides, LDL, 
HDL cholesterol  HDL cholesterol  
APOC1  19  rs4420638  LDL, HDL  rs4420638  LDL, HDL 
cholesterol  cholesterol  
TFAP2B  6  rs987237  BMI  rs2011201  
AP006216.5  11  rs7396835  Body weight, BMI  rs1263172  
triglyceride  
RAB11B  19  rs2967605  Carotid Artery  rs2913973  
heart disease  
BUD13  11  rs28927680  HDL, cholesterol  rs10488699  LDL, cholesterol 
triglycerides  HDL cholesterol  
CELSR2  1  rs599839  CHD, CAD,  rs14000  CHD 
LDL cholesterol  
RP199E18.2  6  rs6922269  CHD  rs11155760  
WDR12  2  rs6725887  CHD, CAD, MI  rs10205697  
C6orf106  6  rs2814944  HDL, LDL, BMI  rs1201872  
cholesterol  
HLADRA  6  rs3177928  cholesterol, LDL  rs1051336  
RP11306G20.1  7  rs11556924  rs10265116  
PCSK9  1  rs2479409  cholesterol,  rs2182833  cholesterol 
LDL cholesterol  
ADCY5  3  rs2877716  Carbohydrate metabolism  rs10934643  
HMGCR  5  rs3846662  LDL cholesterol  rs12654264  
GALNT2  1  rs4846914  cholesterol, HDL  rs1474925  
Triglycerides  
MRAS  3  rs9818870  CAD  rs1199335  
ZNF652  17  rs16948048  Diastolic blood  rs12940887  
pressure  
ANKS1A  6  rs17609940  CAD  rs17647222  
SLC22A1  6  rs1564348  LDL  rs1564348  
RP11136O12.2  8  rs17321515  Triglycerides  rs16900615  
GPAM  10  rs1129555  LDL  rs10885315  
RBMS1  2  rs7593730  Type 2 diabetes  rs11694165  
BDNFAS  11  rs1013442  Smoking  rs1013442  
SMARCA4  19  rs1122608  MI(early onset)  rs10415811  
FTO  16  rs1121980  BMI  rs10521303  
MIA3  1  rs17465637  MI(early onset)  rs17163303  
CDKAL1  6  rs10946398  Type 2 diabetes  rs1012625  
ADAMTS9AS2  3  rs4607103  Type 2 diabetes  rs10510917  
COL4A1  13  rs3742207  Arterial stiffness  rs1000989  
PHACTR1  6  rs12526453  MI (early onset)  rs1014342 
6 Concluding remarks
In this paper, we were compelled to consider a small part of the available real dataset consisting of SNPs cited in the literature as important. This small dataset, however, has the added advantage of alleviating computational burden. Indeed, since only twothreaded cores were available to us for implementation of our ideas, it is anyway imperative for us to confine attention to a (relatively small) subset of the available dataset. We are, however, expecting to expand our current parallel computing infrastructure, which would be of immense help in analysing the complete dataset, which is our actual goal.
In this work we have focused exclusively on genegene interaction. Recently, ? and ? have extended this model to incorporate geneenvironment interactions in our model, and developed tests for the effects of geneenvironment interactions as well as genegene interactions on casecontrol. They have successfully applied the ideas to various simulated datasets generated from the GENS2 software, and to the MI dataset, considering sex as the environmental variable. The results they obtained are broadly in agreement with the results on this MI dataset already existing in the literature.
Acknowledgment
We are grateful to Dr. Arunabha Majumdar for providing useful feedback on an earlier version of our manuscript.
Supplementary Material
S1 A rule of thumb for choosing and
Following ?, ?, ?, ?, we set in our applications. It follows from ? that the mean and variance of the distinct parameter vectors in the set are both given by approximately . When prior information regarding the true number of mixture components is lacking, it may be reasonable to specify the expected number of distinct components to be close to half of the maximum number of components possible, namely, close to . With , we fix , so that about distinct mixture components are to be expected a priori. Apart from this choice, we also considered the possibilities , , that is, the gamma distribution with mean and variance , and (so that the mean and variance are and , respectively); however, the choice for all outperformed the other choices with regard to capturing the true number of mixture components. Hence, in this work, we report all our results associated with and . According to this specification, the prior mean and variance of the number of distinct components are approximately . Thus, compared to smaller values of , this choice ensures greater variability so that datadriven inference on the number of components receives greater weight.
S2 Choices of and
For ; and , here we denote by the count of the minor allele at the th locus of the th gene and th casecontrol status. In other words, . With this notation we define
(S2.1) 
Also let
(S2.2) 
With these, we specify the th element of as
(S2.3) 
For the specification of , we first consider
(S2.4) 
Then, letting , we specify the th element of as
(S2.5) 
S3 Elucidation that the th loci of any two different genes can be independent with positive probability
Our assumption that and of each locus is shared by all the genes does not imply that the labels of the loci of the genes are not exchangeable. Indeed, the th loci of two different genes may be independent, given the data. To understand this, note that the th loci of any gene does not only have the effect and , but also , for given . In other words, all the loci of gene share the common effect . Hence, for any two genes denoted by and , and , and considering only , the th loci have the effects and . By our distributional assumptions it follows that the covariance between these effects is , given and , which is equal to zero if . Independence follows due to normality of our specified distributions. Now note that by our inverseWishart priors on and , the event gets positive probability for any , so that the covariance can be in any neighborhood of zero with positive probability, if connoted by the data.
S4 A parallel MCMC algorithm for model fitting
Recall that the mixtures associated with gene and casecontrol status are conditionally independent of each other, given the interaction parameters. This allows us to update the mixture components in separate parallel processors, conditionally on the interaction parameters. Once the mixture components are updated, we update the interaction parameters using a specialized form of TMCMC, in a single processor. The details of updating the mixture components in parallel are as follows.

Split the pairs in the available parallel processors.

During each MCMC iteration, for each in each available parallel processor, do the following

For , update the allocation variables by simulating from the full conditional distribution of , given by
(S4.1) for .

Let denote the distinct elements in . Also let denote the configuration vector, where if and only if .
Now let denote the number of distinct elements in and let denote the distinct parameter vectors. Further, let occur times.
Then update using Gibbs steps, where the full conditional distribution of is given by
(S4.2) where
(S4.3) (S4.4) In (S4.3) and (S4.4), and denote the number of and alleles, respectively, at the th locus of the th gene associated with the th mixture component. In other words, and . The function in the above equations is the Beta function such that for any , ; being the Gamma function.

Let and . Then, for ; ; and , update
