A Bayesian Semiparametric Approach to Learning About Gene-Gene Interactions in Case-Control Studies

# A Bayesian Semiparametric Approach to Learning About Gene-Gene Interactions in Case-Control Studies

Durba Bhattacharya and Sourabh Bhattacharya Durba Bhattacharya is an Assistant Professor in St. Xavier’s College, Kolkata, pursuing PhD in Interdisciplinary Statistical Research Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata 700108. Sourabh Bhattacharya is an Assistant Professor in Interdisciplinary Statistical Research Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata 700108. Corresponding e-mail: sourabh@isical.ac.in.
###### Abstract

Gene-gene interactions are often regarded as playing significant roles in influencing variabilities of complex traits. Although much research has been devoted to this area, till date a comprehensive statistical model that adequately addresses the highly dependent structures associated with the interactions between the genes, multiple loci of every gene, various and unknown number of sub-populations that the subjects arise from, seem to be lacking.

In this paper, we propose and develop a novel Bayesian semiparametric approach composed of finite mixtures based on Dirichlet processes and a hierarchical matrix-normal distribution that can comprehensively account for the unknown number of sub-populations and gene-gene interactions. Then, by formulating novel, and suitable Bayesian tests of hypotheses using a suitable metric for comparing clusterings in conjunction with the interaction parameters of the matrix-normal, we attempt to single out the roles of the genes, individually, and in interaction with other genes, in case-control studies. Quite importantly, we also attempt to identify the DPL.

Since the mixtures are conditionally independent of each other given the interaction structure, for model fitting we update them simultaneously in parallel processors using an efficient Gibbs sampling procedure. Given the updated mixture parameters the interaction parameters are then updated in a single block using the recently developed fast and efficient Transformation based MCMC (TMCMC) methodology. Thus, our model facilitates a highly efficient parallel computing methodology.

Application of our model and methodologies to biologically realistic data sets simulated from an existing, realistic population genetics model associated with case-control study, revealed quite encouraging performance. Quite importantly, we also applied our ideas to a real, myocardial infarction dataset, and obtained interesting results that partly agree with, but also complement, the existing works in this area, and reveal the importance of sophisticated and realistic modeling of gene-gene interactions.
Keywords: Case-control study; Dirichlet process; Gene-gene interaction; Myocardial infarction; Parallel processing; Transformation based MCMC.

## 1 Introduction

Isolated evaluation of individual single nucleotide polymorphisms (SNPs) for their association with complex diseases in GWAS have succeeded in explaining only small proportions of genetic inheritances; see ? and the references therein. It is now well-known that some genes interact with one another in complex networks, and that such interactions may influence genetic variations of complex traits (see ?, ?, ?, ?, ?). It is hence anticipated that the want of significant success of GWAS may perhaps be due to the lack of a sophisticated statistical model incorporating the scientific understanding about gene-gene interactions (see ?) into genomic profiling, that may help in understanding the biological and biochemical pathways behind complex diseases; see ?.

One of the main challenges faced at the very onset of investigating genetic interactions is that of non-uniqueness in the definition of epistasis or gene-gene interactions. According to ?, biologically, epistasis can be categorised as functional epistasis and compositional epistasis, both of which differ widely from the statistical definition of epistasis. While functional epistasis indicates protein-protein interactions at molecular level, any disruption of which is explained by a genetic consequence, compositional epistasis refers to blocking of one allelic effect by an allele at another locus. ? (see also ?) defined statistical interaction among the genes as deviations from additive marginal effects of individual genes. Although ? derived some strong empirical conditions under which statistical interactions correspond to compositional epistasis, a prevailing opinion reflected in both genetic and epidemiological literature suggests limitations of the tests based on Fisher’s statistical definition of epistasis in explaining the gene-gene interactions in the biological sense of the term. According to ? and ?, although statistical and biological interpretations of interaction need not be compatible with each other, quantification of biological interaction should be based on statistical concepts of interactions. We have formulated a statistical model composed of mixture distributions based on Dirichlet processes, incorporating the effects due to complex genetic interactions through hierarchical matrix-normal-inverse-Wishart distribution.

Several case-control studies, defining gene-gene interactions via SNP-SNP interactions have been performed (?), assuming that interaction between any two genes are in fact caused by interactions between their respective SNPs. Apart from not considering the genes as functional units, these linear model-based statistical analyses suffer from very high testing dimensionality and become computationally burdensome because of the very large number of marginal and pairwise interaction effects to be incorporated in the linear model when dealing with a large number of SNPs. The relevant statistical interaction models existing in the case-control literature point towards the following trade-off: dimension reduction by working at the gene-level makes computation feasible, but at the cost of useful SNP-level information (see ?), while working at SNP-level promises all the necessary information but at the cost of enormous computational burden (see also ?).

The aforementioned difficulties in the forms of trade-offs can be traced back to additive modeling strategies. Indeed, even with a small number of SNPs, the traditional linear model is constituted of a very large number of terms consisting of the marginal and interaction effects at the SNP level. Attempts to incorporate the genes as functional units in the model necessarily calls for sacrifice of useful SNP-level information, while principal components analysis for dimension reduction make genetic interpretation difficult. The linear modeling strategy based on Fisher’s statistical definition of epistasis can also be questioned on the ground of oversimplicity, since functionally, gene-gene interactions may involve very complex physical interplay among proteins as gene products (?). Moreover, in linear models the main effects and the interaction effects are estimated from the genotype data and then onwards assumed to be non-random covariates. More holistic approaches should be concerned with postulating highly structured joint distributions of the complex genotype data.

A further drawback of the existing interaction models is that they often ignore multiple sub-populations that the genotype data usually arise from. Indeed, for different sub-populations, the genes (or SNPs) may interact differently, which adds further complexity to the complicated functional form of epistasis. ? empirically demonstrate that methods ignoring population sub-structures can incur severe bias leading to large-scale false positives. The fact that the number of sub-populations is not usually known is a further challenging issue that needs to be considered, as one must coherently and carefully account for the uncertainty associated with the unknown number of sub-populations.

The criticisms of the interaction models existing in the case-control literature motivated us to propose a new and general Bayesian model for gene-gene interactions and novel Bayesian hypotheses testing procedures and associated methodologies to investigate the effects of genes on complex diseases in the context of case-control dataset arising from a possibly stratified population of genotypes. In what follows we investigate only the genetic effects on complex diseases, without taking into account the environmental effects (but see ? and ?).

The rest of our paper is structured as follows. In Section 2 we introduce our proposed Bayesian semiparametric model, and in Section 3 we propose and develop a novel Bayesian hypothesis testing procedure for detecting the roles of genes in case-control studies. In Section 4 we present a brief discussion on validation of our model and methodologies with biologically realistic simulated data sets, the details of which are provided in the supplement, described below. In Section 5 we conduct a detailed analysis of case-control data on early onset of Myocardial Infarction obtained from dbGap, comparing and contrasting our findings with the existing results. We summarize our work and make concluding remarks in Section 6.

Additional details are provided in the supplement, whose sections and figures have the prefix “S-” when referred to in this paper.

## 2 A new Bayesian semiparametric model for gene-gene interactions

Before we introduce our proposed model, we first detail the type of genotype and phenotype data that we are interested in.

### 2.1 Genotype data

For denoting the two chromosomes, let and indicate the presence and absence of the minor allele of the -th individual, -th gene, -th locus, and the -th group (either control or case), for , with denoting case; ; and .

In this paper, we shall concern ourselves with data sets of the aforementioned type. However, for our model, which we introduce below, it is obvious that data sets consisting of only minor allele counts at each locus contains exactly the same information as the above described data type.

### 2.2 Mixture models driven by Dirichlet processes

Given any , let , and . We assume that for every triplet , are independently distributed with mixture probability mass function with a maximum of components, given by

 [Xijk]=M∑m=1πmjkLj∏r=1f(xijkr|pmjkr), (2.1)

where is the probability mass function of independent Bernoulli distributions, given by

 f(xijkr|pmjkr)={pmjkr}x1ijkr+x2ijkr{1−pmjkr}2−(x1ijkr+x2ijkr). (2.2)

Using allocation variables , with probability distribution

 [zijk=m]=πmjk, (2.3)

for and , (2.1) can be represented as

 [Xijk|zijk]=Lj∏r=1f(xijkr|pzijkjkr). (2.4)

We may assume appropriate Dirichlet distribution priors on for ; . However, as investigated in ?, the Dirichlet distribution often yields very small values of the probabilities , thereby tending to underestimate the true number of mixture components. On the other hand, setting exhibited much better performance. Therefore, in this work, we set , for , and for all .

Letting , we further assume that

 p1jk,p2jk,…,pMjk \lx@stackreliid∼Gjk; (2.5) Gjk ∼DP(αjkG0,jk), (2.6)

where stands for Dirichlet process with expected probability measure having precision parameter . We assume that under , for and ,

 pmjkr\lx@stackreliid∼Beta(ν1jkr,ν2jkr). (2.7)

Thus, given a particular pair , our mixture model has the same structure as adopted by ? for inference on population structure. Discreteness of Dirichlet processes cause coincidences among the parameter vectors of with positive probability, so that, with positive probability, the actual number of mixture components in (2.1) falls below , the maximum number of components, the mixing probabilities taking the form , where . See ?, ?, ?, ?, for the details. In fact, we marginalize over to arrive at the well-known Polya urn distribution of :

 [pmjk|PMjk∖{pmjk}]∼αjkαjk+M−1G0,jk(pmjk)+1αjk+M−1M∑m′≠m=1δpm′jk(pmjk), (2.8)

where denotes point mass at . The property of coincidences among the parameter vectors is clearly preserved by the Polya urn scheme.

Observe that, after coincidences among the mixture components, the pairs come to be associated with different mixtures, with different numbers of components. This is reasonable, because, the distributions of the genotypes for the gene of any two individuals belonging to the same subpopulation but with different case-control status, that is, and are expected to correspond to different mixtures under significant genetic effect on the disease (see ?).

We have discussed a rule of thumb for the choice of as well as in Section S-1 of the supplement, following whch we set and in our applications.

### 2.3 Incorporating the gene-gene and SNP-SNP interactions through appropriate modeling of the parameters

#### 2.3.1 Modeling the parameters of G0,jk

Taking into consideration the SNP-SNP dependence, which may exist within each gene and also among the genes, we model the Beta parameters and of (2.7) as follows:

For , where , and for every ,

 ν1jkr =exp(ur+λjk); (2.9) ν2jkr =exp(vr+λjk). (2.10)

Allowing and to be different ensures that the mean of under depends upon the -th SNP. We further assume that for ,

 ur \lx@stackreliid∼N(0,1); (2.11) vr \lx@stackreliid∼N(0,1). (2.12)

We found that the Gaussian priors on and with other means and variances did not yield significantly different results, thus pointing towards in-built prior robustness in our modeling strategy.

Subsequently, using matrix-normal distribution as a prior on we incorporate the SNP-wise dependence in a gene. Moreover, the SNPs associated with different genes are also dependent through the dependence structure among the genes imposed by the matrix-normal prior.

Note that allowing and to be shared by all the genes creates the impression that the labels of the loci are not exchangeable. However, our matrix-normal-inverse-Wishart prior ensures that this is not the case, and that for any two genes and , and , (or ) and (or ), and hence their exponentiated versions, are independent with positive probability, given the data. This is elucidated in Section S-3 of the supplement in light of the matrix-normal-inverse-Wishart prior. We vindicated this mathematical argument with simulation studies; see Section 4 (details provided in Section S-7 of the supplement). Specifically, we conducted extra simulation studies after randomly permuting the labels of the loci of each gene and re-analyzed each such data set. The results, provided in Section S-7 of the supplement, are consistent with those obtained wthout permuting the labels.

#### 2.3.2 Matrix normal prior for λ

We consider the following model for :

 λ∼N(μ,A⊗Σ). (2.13)

Re-writing the -dimensional vector as a matrix , (2.13) can be represented as a matrix normal distribution with mean matrix , left covariance matrix and right covariance matrix , having probability density function

 (2.14)

We note that the -th column of , which we denote by , follows the multivariate normal distribution:

 Λcol,k∼NJ(μcol,k,σkkA), (2.15)

where is the -th column of . The covariance matrix between and is given by

 cov(Λcol,k1,Λcol,k2)=σk1k2A. (2.16)

Similarly, the -th row of , which we denote by , has the following multivariate normal distribution:

 Λrow,j∼N2(μrow,j,ajjΣ), (2.17)

being the -th row of . Also,

 cov(Λrow,j1,Λrow,j2)=aj1j2Σ. (2.18)

In our applications we chose .

The essence of the matrix normal structure is to offer a dependence structure among the genes. Given case-control status , the dependence structure associated with the genes is provided by , while the matrix represents the dependence between the genotype distribution of the cases and controls, given any particular gene.

#### 2.3.3 Priors on A and Σ

We assume that

 A∼IW(ξ,A0), (2.19)

where stands for Inverse-Wishart distribution with degrees of freedom and positive definite scale matrix . The density function is given by

 π(A)∝|A|−(ξ+J+12)×exp{−12tr(A0A−1)}. (2.20)

We further assume that

 Σ∼IW(ζ,Σ0), (2.21)

where the degrees of freedom satisfies and is a positive definite matrix; the density function is given by

 π(Σ)∝|Σ|−(ζ+32)×exp{−12tr(Σ0Σ−1)}. (2.22)

For our applications, we set and . These choices are the minimum values such that the prior expectations of and are well-defined. Choices of and are detailed in Section S-2 of the supplement.

A schematic representation of our model and the parallel processing algorithm is provided in Figure 2.1. Details of our parallel processing algorithm are provided in Section S-4 of the supplement.

## 3 Detection of the roles of genes in case-control studies

### 3.1 Formulation of a Bayesian hypothesis testing procedure

In order to investigate if genes have any significant effect on case-control, it is pertinent to test

 H0:h0j=h1j; j=1,…,J, (3.1)

versus

 H1:not H0, (3.2)

where

 h0j(⋅) =M∑m=1πmjk=0Lj∏r=1f(⋅|prmjk=0) (3.3) h1j(⋅) =M∑m=1πmjk=1Lj∏r=1f(⋅|prmjk=1). (3.4)

If and are not significantly different, then it is plausible to conclude that the role of genes is not significant in the case-control study.

In a nutshell, testing the hypothesis in (3.1)-(3.4) requires some appropriate divergence measure between and and if denotes the divergence, then is to be accepted for appropriately large posterior probability of the event that is small.

In our situation, simultaneous consideration of a large number of genes, involving thousands of SNPs renders the existing measures of divergence practically infeasible to compute. For details, see Section S-5 of the supplement. This compels us to seek alternative measures of divergence that are also amenable to efficient computation. In the mixture context, a natural measure is the discrepancy between the clusterings associated with the two mixture distributions (?, ?). However, since clusterings do not account for the magnitudes of the parameters, insignificant difference between the clusterings does not necessarily imply insignificant difference between the associated mixture densities. It is worth noting that even the Euclidean metric alone is not appropriate – since mixture densities are invariant with respect to permutations of the parameter components, large Euclidean distances between parameter vectors need not imply large difference between the densities. Thus, we propose to use the clustering based ideas in conjunction with ideas based on the Euclidean metric, appropriately modified for mixture densities. Indeed, if the clusterings associated with the mixture densities are known to be of insignificant difference, then insignificant Euclidean divergence between the parameter vectors does imply insignificant difference between the mixture densities. Details of all these issues are provided in Section S-6 of the supplement.

### 3.2 Formal Bayesian hypothesis testing procedure integrating the above developments

With the clustering metric provided in Section S-6 of the supplement, let us define

 d∗=max1≤j≤Jdj,

where

 dj=^d(PMjk=0,PMjk=1)

is the distance between the clusterings and , for . With this, we first test

 (3.5)

for reasonably small choice of (). Acceptance of indicates that clusterings associated with and have insignificant difference, for every . If is rejected, then it follows that for at least one , the clustering difference is significant.

If is rejected, then it entails that the clusterings associated with the mixture densities for case and control are significantly different. This implies that the mixture densities themselves are significantly different, so that rejection of leads to rejection of given by (3.1).

But whenever is accepted based on the “” loss and the clustering metric, as already argued, this does not necessarily imply that the mixture densities have insignificant difference. All one can infer in this case is that differences between the associated clusterings are insignificant. Hence, when is accepted, we consider a second test of the form

 (3.6)

where ; here is the Euclidean distance between

and

with , and ; see Section S-6 of the supplement.

If is also accepted, then one can safely accept . If is rejected, we then consider a third test of the form

 H0d∗E,min: d∗E,min<εversus% H1d∗E,min: d∗E,min≥ε, (3.7)

where , with . Here is a pseudo-metric based on the Euclidean distance; see Section S-6 for details.

If is accepted, then this implies acceptance of given by (3.1). Else, must be rejected. For clarity, we present a schematic diagram of the hierarchy of the hypotheses tests in Figure 3.1.

#### 3.2.1 Choice of loss function for the Bayesian tests

Recall that the “” loss function (see, for example, ?, for details) entails zero loss under the correct decisions, loss under false acceptance of the null hypothesis and loss under false rejection of the null. Thus, we accept under the “” loss function if

 P(d∗<ε|Data)≥11+c. (3.8)

Since usually there is no clear-cut way of specifying , we shall generally select the default value , reducing the “” loss function to the well-known “” loss function.

However, for the tests associated with and , which are to be considered only if is accepted, we shall select much larger than 1. This is because it makes sense to provide greater protection to the null hypothesis given that the clustering test has already provided partial support to , indicating that at least the clusterings are not significantly different (see also Section S-6 of the supplement).

Note that, under the “” loss, the null hypothesis is to be accepted if its posterior probability exceeds , while under the “” loss, the threshold posterior probability is . For the hypotheses involving and we shall set so that . This choice is motivated by the level of significance of classical significance tests. Some other choices will also be briefly touched upon.

#### 3.2.2 Choice of ε

Choices of are expected to be problem specific. In Section S-7 of the supplement, we discuss in detail the choices of in our applications. Briefly, we first consider an appropriate null model, for instance the same model as ours but with and set to identity matrix to reflect the null hypotheses of “no interaction” and same mixture distributions under cases and controls for each gene for no genetic effect. We then generate case-control genotype data from the null model and fit our general Bayesian model to the “null data” and set as the -th percentile of the relevant posterior distribution.

### 3.3 Bayesian tests for individual genetic effects when H0 is rejected

If given by (3.1) is finally accepted then we may conclude that there is no significant evidence to claim that the genes, individually, or in interaction with the other genes, are important factors in the case-control study.

On the other hand, if is rejected, then we check for significances of the individual genes by applying our Bayesian testing procedure on the hypotheses

 H0j: h0j=h1j versus H1j: h0j≠h1j, %for j=1,…,J. (3.9)

For each , we adopt the same procedure for testing the hypothesis as for testing versus ; only , and are to be replaced with , and , respectively. Note that at each stage associated with , and , the Bayesian hypotheses testing framework is equivalent to a Bayesian multiple testing paradigm. Specifically, testing versus for using our Bayesian methods is equivalent to minimizing the Bayes risk of the additive “0-1” loss function, and the subsequent Bayesian tests for the hypotheses versus and versus , for relevant indices , are Bayesian multiple testing procedures that minimize the Bayes risk of the additive “0-1-c” loss function, where we choose .

If is accepted, then it is possible that the -th gene is not individually influential, but some interaction effect involving the -th gene may be significant. To check which interactions are significant (we may check this even if is rejected, since the -th gene may be marginally significant as well as interactive with the other genes), one may conduct the tests versus , for , being the -th element of . Acceptance of for some (or many) , indicates which of the genes interact with the -th gene to contribute significantly to the underlying case-control study.

## 4 Validation of our model and methodologies with biologically realistic simulated data sets

We evaluate our model and methodologies on data sets generated from the GENS2 software designed by ?. In a nutshell, the software creates large, biologically realistic data sets having realistic LD patterns, where risks of complex diseases are influenced by known gene-gene and gene-environment interactions. We consider two simulated data sets for our experiments – in the first experiment, we generate a case-control data set under the effect of gene-gene interaction, fit our model to the data set, and test the relevant hypotheses. We show that our model and methodology successfully captures the relevant information regarding the effects of the individual genes, gene-gene interaction, and the number of sub-populations. We also show that, in spite of LD, our model succeeds in capturing the close neighborhoods of the actual disease predisposing loci (DPL) of the genes.

In the second experiment we generate a data set where disease risk is devoid of any genetic effect and is influenced only by some environmental exposure. Application of our model and methods to this data set again successfully captures the correct situation, clearly indicating lack of genetic influence.

For both the simulated data sets we perform simulation experiments by randomly permuting the labels of the loci of each gene. For both cases we obtain results associated with the permuted labels that completely support those obtained from the original simulated data sets.

Details are provided in Section S-7 of the supplement.

## 5 Application of our model and methodologies to a real, case-control dataset on Myocardial Infarction

Application of our ideas to a case-control dataset on early-onset of myocardial infarction (MI) from MI Gen study, obtained from the dbGaP database (http://www.ncbi.nlm.nih.gov/gap), led to some interesting findings.

MI (more commonly, heart attack), is a complex disease and is a leading cause of death and disability all over the world. Much investigation has been carried out for detecting the genetic causes of myocardial infarction, all of which are based on the assumption that the main contributory factors for the disease are the mutations in the proteins associated with the pathophysiology of atherosclerosis (see ?).

Although the GWA studies have revealed a lot of genetic information regarding MI (an overview of the main results can be found in ?), only a very few of the detected genes are related to traditional risk factors (LDL-cholesterol, diabetes and LP[a] etc.), and the other genes increase the risk by pathogenetic mechanisms that are not yet properly understood. Despite much success in deciphering the marginal effects of many SNPs, not much has been achieved in the gene-gene interaction front. According to ?, burden of multiple testing renders the standard GWAS samples underpowered to detect such effects, while ? blame the complexity of the epistatic effects as a reason behind the difficulty in detecting them.

### 5.1 Data description

The MI Gen data obtained from dbGaP consists of observations on presence/absence of minor alleles at SNP markers associated with 22 autosomes and the sex chromosomes of cases of early-onset myocardial infarction, age and sex matched controls. The average age at the time of MI was 41 years among the male cases and 47 years among the female cases. The data broadly represents a mixture of four sub-populations: Caucasian, Han Chinese, Japanese and Yoruban. Since the names of the genes were not provided in the dataset, SNPs were mapped on to the corresponding genes using the Ensembl human genome database (http://www.ensembl.org/). However, technical glitches prevented us from obtaining information on the genes associated with all the markers. As such, we could categorize markers out of with respect to genes.

For our analysis, we considered a set of SNPs that are found to be individually associated with different cardiovascular end points like LDL cholesterol, smoking, blood pressure, body mass etc. in various GWA studies published in NHGRI catalogue and augmented this set further with another set of SNPs found to be marginally associated with MI in the MIGen study (see ?). Our study also includes SNPs that are reported to be associated with MI in various other studies, see ?, ? and ?. In all, we obtained 271 SNPs. Unfortunately, only 33 of them turned out to be common to the SNPs of our original MI dataset on genotypes, which has been mapped on to the genes using the Ensembl human genome database. However, we included in our study all the SNPs associated with the genes containing the 33 common SNPs. Specifically, our study involves the genotypic information on 32 genes covering 1251 loci, including the 33 previously identified loci for all the individuals available in our dataset.

Categorization of the case-control genotype data into the four sub-populations, each of which are likely to represent several further and rather varied sub-populations genetically, implies that the maximum number of mixture components must be fixed at some value much higher than . As before, we set and for every , to facilitate data-driven inference. Interestingly, the distributions of the number of distinct components for (so that the prior mean and variance are approximately ) were not significantly different from those of , indicating prior robustness.

We chose a similar set-up for the null model. That is, we chose the same number of genes and the same number of loci for each gene, the same number of cases and controls, the same value , but for every , as in our simulation studies. We use the same priors as in the real data set-up except that we set and to be identity matrices to ensure that the genetic interaction is not present and set the same mixture distribution under cases and controls for each gene to ensure the absence of genetic effects. For details see Section 4.1.2 of the supplement.

### 5.2 Remarks on model implementation

We implemented our parallel MCMC algorithm detailed in Section S-4 of the supplement for posterior simulation on a VMware consisting of double-threaded, -bit physical cores, each running at GHz; such cores were available to us. The mixture components associated with , with , are updated in parallel, on of the total available threads. This is followed by updating the interaction parameters on a separate processor using a mixture of additive and additive-multiplicative TMCMC (see Section S-4.1 of the supplement). However, in this problem, the interaction matrix is of order , and the associated Cholesky decomposition (see Section S-4.1 of the supplement) then consists of parameters. Furthermore, here is a -dimensional vector, , where , consists of parameters and , with its Cholesky decomposition, consists of unknowns. Hence, in all, there are interaction parameters to be updated.

Updating too many parameters in a single block, even with TMCMC, need not guarantee automatic efficiency. Here we consider updating sub-blocks of parameters at a time using additive TMCMC. Specifically, we update by updating the -dimensional separately for ; we also update the blocks and and separately. Since consists of parameters, at each iteration we update only randomly chosen non-zero elements of the Cholesky factor of using additive TMCMC. The latter is certainly a valid TMCMC strategy, which is theoretically a mixture of TMCMC strategies (see, for example, ? in the context of Metropolis-Hastings), and maintains very reasonable acceptance rate in our application.

The above parallel MCMC algorithm takes about hours to yield iterations in our aforementioned VMware machine. We discard the first iterations as burn-in. Informal convergence diagnostics such as trace plots exhibited adequate mixing properties of our parallel algorithm.

### 5.3 Results of the real data analysis

#### 5.3.1 Influential genes obtained from our analysis

Our Bayesian hypotheses testing using both clustering metric and the Euclidean distance reveal that there is very significant overall genetic influence on MI. Indeed, it turned out that and , where and are the th percentiles of the null distributions of and . Furthermore, testing for the effects of the genes individually using the clustering metric showed that apart from only genes, namely, AP006216.10, AP006216.5, APOC1 and OR4A48P and AP00621.5, all other genes have significant effect on MI, while with the Euclidean metric, all the genes considered for study turned out to be significant. The posterior probabilities of the null hypotheses (of no significant genetic influence) are shown in Figure S-4 of the supplement.

Interestingly, with respect to the Euclidean metric, all the five posterior probabilities of the null hypotheses associated with the aforementioned 5 genes, turned out to be empirically zero. Thus, even though the clustering metric accepts 5 null hypotheses, the confirmation tests with the Euclidean distances suggest rejection of all of them. We hence conclude that all the genes considered in the study have significant effect on MI. This is in keeping with the fact that the genes considered in our study were found to be associated with different cardiovascular endpoints in various GWA studies or have been confirmed to play important roles in causing MI in earlier studies.

#### 5.3.2 Disease predisposing loci detected by our Bayesian analysis

We now show that the most influential SNPs corresponding to the maximum Euclidean distance in each of the significant genes in our study, which we continue to refer to as the DPLs, are usually close to, and sometimes exactly the same as the SNPs, already flagged by the earlier studies as influential.

Figure S-5 of the supplement shows the index plots of the posterior medians of the clustering and Euclidean distances between case and control, with respect to the corresponding genes. In terms of the clustering metric, genes , , and are associated with the largest medians of the clustering distances, ranging between and . These genes consist of number of loci , , and , respectively. On computing the averaged Euclidean distances , of the loci in each such Gene-, where the averages are taken over the TMCMC samples, we found that loci in , in , in and in have the largest distances among all the loci of the respective genes. These are depicted in Figure 5.1. Note that for the genes , , and to some extent, the significant SNPs from our study are not only close to the SNPs found significant in the existing studies (see ?), with respect to the Euclidean distance, but also lie in their close neighborhoods, suggesting relative agreement between the SNPs found significant in our study and the loci considered to be influential for MI in the literature. On the other hand, on which has been reported by ? to be associated with LDL cholesterol and total cholesterol (TC) does not turn out to be a significant SNP for MI according to our analysis.

We now focus attention to the genes that turned out to be more influential than the remaining in the sense that the medians of the Euclidean distances exceed . These are the genes , and with corresponding median Euclidean distances , and . These three genes clearly stand out in Figure S-5 (panel (b)). Each of them consists of a single locus, and are yet highly influential.

Figures S-6, S-7 and S-8 of the supplement analyze the SNPs of some of the other influential genes and point out the significant ones. Except for the genes and , the SNPs found significant in our study closely agree with the SNPs that are considered in the literature as influential.

In the next section we argue that gene-gene interaction plays a vital role in explaining the discrepancies between our findings and the existing results based on previous studies.

#### 5.3.3 Roles of gene-gene and SNP-SNP interactions discriminating our gene findings with influential genes reported in the literature

The actual gene-gene correlations based on medians of the posterior covariances, are shown in Figure 5.2, while Figure S-9 of the supplement depicts the results on interaction testing. The color intensities correspond to the absolute values of the correlations. Note that the correlation structure involves both positive and negative values where negative correlations occur in more than 45% of the cases. A hierarchical clustering of the genes based on the absolute values of the correlations, are provided in Figure 5.3. The vertical axis of the diagram represents , standing for the correlations shown in Figure 5.2. In a nutshell, lower the order of the hierarchy, stronger are the correlations between the genes. For instance, Figure 5.3 shows that the correlation between genes and is the strongest; moreover, the correlation between and , for example, is stronger than the correlation between and .

Figures 5.2 and 5.3 depict complex interplay among the genes, many of which negatively influence each other with respect to the Euclidean distances. These gene-gene interactions also seem to influence the SNP-wise Euclidean distances between cases and controls, revealing the effect of some new SNPs that failed to make their presence felt in the previous studies due to lack of adequate dependence structure. There is another important issue to point towards in this context. As will be seen, the correlations with respect to our current dataset are usually of small magnitude. This is not because the covariances are of small magnitude; indeed, they are all significantly bounded away from zero, but the variances are of very large order, so that the covariances scaled by the square roots of the variances, are small. These correlations, depending upon positive or negative signs, and in conjunction with the very complex interplay among the genes and the SNPs, are instrumental in deciding whether a SNP appears as the most influential one among a set of SNPs in any gene. The issue is explained mathematically in Section S-8 of the supplement.

We elucidate the aforementioned issue with respect to the genes , , and ; see Figure 5.1. Indeed, among the SNPs of , the case-control Euclidean distance associated with , which has turned out to be influential in our analysis has maximum negative correlation of with that of , the SNP pointed significant by ?. Hence, it is not surprising that failed to be close to in terms of the Euclidean distance between case and control. On the other hand, of is positively correlated with the literature-based important SNP , the correlation being , supporting the closeness of the Euclidean distances between and . For gene , the correlation between , the influential SNP by our study and the literature-based , is . The consequence of this small, albeit negative correlation, is well-reflected in panel (c) of Figure 5.1; the corresponding Euclidean distances are close but there are other SNPs, positively correlated with , that have larger Euclidean distances compared to that of . For gene , however, the influential SNP by our study, is positively correlated with the literature-based SNP , the correlation being . The fact that in spite of the positive correlation the two SNPs are not adequately close in terms of case-control Euclidean distances, begs some explanation, which we provide next.

Interestingly, the key to this phenomenon lies in inter-genetic SNP-SNP interaction. Indeed, , the most influential gene in terms of the maximum Euclidean distance, and consisting of the single locus , exerts negative influence on through the negative correlation , but has small and positive correlation, , with . This gene, through its only SNP, also influences the other genes via positive or negative correlations. For instance, its correlations with and of are and , respectively. In other words, the former receives lot more weight compared to the latter, so that becomes influential with respect to our model. For gene , its correlations with the relevant two loci and , are and , respectively, so that both SNPs are relatively close but the former takes precedence, becoming DPL, thanks to its larger correlation with . For gene , the correlations of and with are and , respectively. Hence, the former locus becomes the DPL because of its smaller negative correlation. Thus, there is a very complex interplay among the different genes, their SNPs and among the SNPs of different genes. In general it is infeasible to keep track of these complex dependencies and provide simple explanations for the different DPLs and their differences with the SNPs believed to be important by the scientific community.

The above elucidations attempt to point out that unless gene-gene and SNP-SNP interactions are taken into account through a sophisticated, nonparametric framework, such complex interaction effects might have been missed, which would perhaps lead to declaration of some truly influential SNPs as non-significant, and some non-influential SNPs as influential.

#### 5.3.4 Posteriors of the number of distinct mixture components distinguishing important sets of genes

Figures S-10, S-11 and S-12 show the posteriors of the number of distinct components associated with genes from the sets , , , , , , , and , , , respectively. The posteriors confirm our expectation that the four broad sub-populations composed of Caucasians, Han Chinese, Japanese and Yoruban admit further sub-divisions in general. Indeed, although for genes , and , the number of subpopulations turned out to be less than with high posterior probabilities, for the other genes the number of subpopulations have exceeded with almost full posterior probabilities. The shown posteriors are negligibly different for case and control, for all the three sets of genes, which is to be expected because of the high positive correlations between and .

It is interesting to observe that the posteriors of the number of components of the first set of genes , , , are roughly stochastically dominated by the second set , , , , which, in turn, are dominated by those of the set , , . The implication is that the third set consists of more genetic variations, followed by the second set, while the first set consists of least genetic variations. Thus, the last set of genes, consisting of only one locus each, seems to be most likely to affect the disease, while the first set seems to be the least influential on MI.

### 5.4 Discussion of our Bayesian methods and GWAS in light of our findings

Our Bayesian analysis yielded results that are broadly in agreement with those obtained by GWA investigations reported in the literature. However, the fact that some of the SNPs which are flagged by the literature as important, did not show up as the most significant ones, deserves attention. The main issue that emerged in our investigation is that the gene-gene interactions are responsible for suppression of the so-called important SNPs via implicit induction of negative correlations among Euclidean distances between cases and controls for the associated genes. Had there been no such negative correlations, it is plausible that these SNPs would turn out to be the most influential ones.

Apart from a few agreements, the literature based SNPs are different from the SNPs detected significant in our analysis, many of which lie in the intronic regions and have not been thoroughly explored and hence need further investigation. As per our investigation, sophisticated, nonparametric modeling of gene-gene interactions plays a very crucial role in imparting significance to the overall effect of the individual genes. Since the GWAS did not incorporate the complex intra and inter-genetic interactions into the model, it is perhaps not very unreasonable to question if the same genes would emerge as significant if realistic modeling of gene-gene interactions is taken into account.

For the current MI study, we summarize our findings in Table 5.1, where we present the 32 genes ordered with respect to the median case-control Euclidean distances, the SNPs flagged by the literature as significant, the corresponding SNPs detected by our Bayesian model and methods, and the phenotypes of the reportedly significant SNPs and our SNPs.

## 6 Concluding remarks

In this paper, we were compelled to consider a small part of the available real dataset consisting of SNPs cited in the literature as important. This small dataset, however, has the added advantage of alleviating computational burden. Indeed, since only two-threaded cores were available to us for implementation of our ideas, it is anyway imperative for us to confine attention to a (relatively small) subset of the available dataset. We are, however, expecting to expand our current parallel computing infrastructure, which would be of immense help in analysing the complete dataset, which is our actual goal.

In this work we have focused exclusively on gene-gene interaction. Recently, ? and ? have extended this model to incorporate gene-environment interactions in our model, and developed tests for the effects of gene-environment interactions as well as gene-gene interactions on case-control. They have successfully applied the ideas to various simulated datasets generated from the GENS2 software, and to the MI dataset, considering sex as the environmental variable. The results they obtained are broadly in agreement with the results on this MI dataset already existing in the literature.

## Acknowledgment

We are grateful to Dr. Arunabha Majumdar for providing useful feedback on an earlier version of our manuscript.

Supplementary Material

## S-1 A rule of thumb for choosing M and α

Following ?, ?, ?, ?, we set in our applications. It follows from ? that the mean and variance of the distinct parameter vectors in the set are both given by approximately . When prior information regarding the true number of mixture components is lacking, it may be reasonable to specify the expected number of distinct components to be close to half of the maximum number of components possible, namely, close to . With , we fix , so that about distinct mixture components are to be expected a priori. Apart from this choice, we also considered the possibilities , , that is, the gamma distribution with mean and variance , and (so that the mean and variance are and , respectively); however, the choice for all outperformed the other choices with regard to capturing the true number of mixture components. Hence, in this work, we report all our results associated with and . According to this specification, the prior mean and variance of the number of distinct components are approximately . Thus, compared to smaller values of , this choice ensures greater variability so that data-driven inference on the number of components receives greater weight.

## S-2 Choices of A0 and Σ0

For ; and , here we denote by the count of the minor allele at the -th locus of the -th gene and -th case-control status. In other words, . With this notation we define

 ¯wijk=1LjLj∑r=1wrijk. (S-2.1)

Also let

 ¯w⋅j⋅=1N0+N11∑k=0Nk∑i=1¯wijk. (S-2.2)

With these, we specify the -th element of as

 a0,j1j2=1N0+N11∑k=0Nk∑i=1(¯wij1k−¯w⋅j1⋅)(¯wij2k−¯w⋅j2⋅). (S-2.3)

For the specification of , we first consider

 ¯w⋅⋅k=1NkJNk∑i=1J∑j=1¯wijk. (S-2.4)

Then, letting , we specify the -th element of as

 σ0,k1k2=1NJN∑i=1J∑j=1(¯wijk1−¯w⋅⋅k1)(¯wijk2−¯w⋅⋅k2). (S-2.5)

## S-3 Elucidation that the r-th loci of any two different genes can be independent with positive probability

Our assumption that and of each locus is shared by all the genes does not imply that the labels of the loci of the genes are not exchangeable. Indeed, the -th loci of two different genes may be independent, given the data. To understand this, note that the -th loci of any gene does not only have the effect and , but also , for given . In other words, all the loci of gene share the common effect . Hence, for any two genes denoted by and , and , and considering only , the -th loci have the effects and . By our distributional assumptions it follows that the covariance between these effects is , given and , which is equal to zero if . Independence follows due to normality of our specified distributions. Now note that by our inverse-Wishart priors on and , the event gets positive probability for any , so that the covariance can be in any neighborhood of zero with positive probability, if connoted by the data.

## S-4 A parallel MCMC algorithm for model fitting

Recall that the mixtures associated with gene and case-control status are conditionally independent of each other, given the interaction parameters. This allows us to update the mixture components in separate parallel processors, conditionally on the interaction parameters. Once the mixture components are updated, we update the interaction parameters using a specialized form of TMCMC, in a single processor. The details of updating the mixture components in parallel are as follows.

• Split the pairs in the available parallel processors.

• During each MCMC iteration, for each in each available parallel processor, do the following

1. For , update the allocation variables by simulating from the full conditional distribution of , given by

 [zijk=m|⋯]∝πmjkLj∏r=1f(xijkr|pmjkr); (S-4.1)

for .

2. Let denote the distinct elements in . Also let denote the configuration vector, where if and only if .

Now let denote the number of distinct elements in and let denote the distinct parameter vectors. Further, let occur times.

Then update using Gibbs steps, where the full conditional distribution of is given by

 [cmjk=ℓ|⋯]∝⎧⎨⎩q∗ℓ,mjk% ifℓ=1,…,τ(m)jk;q0,mjkifℓ=τ(m)jk+1, (S-4.2)

where

 q0,mjk =αjkLj∏r=1β(n1mjr+ν1jkr,n2mjr+ν2jkr)β(ν1jkr,ν2jkr); (S-4.3) q∗ℓ,mjk (S-4.4)

In (S-4.3) and (S-4.4), and denote the number of and alleles, respectively, at the -th locus of the -th gene associated with the -th mixture component. In other words, and . The function in the above equations is the Beta function such that for any , ; being the Gamma function.

3. Let and . Then, for ; ; and , update