A Bayesian Generalized CAR Model for Correlated Signal Detection
Over the last decade, large-scale multiple testing has found itself at the forefront of modern data analysis. In many applications data are correlated, so that the observed test statistic used for detecting a non-null case, or signal, at each location in a dataset carries some information about the chances of a true signal at other locations. Brown, Lazar, Datta, Jang, and McDowell (2014) proposed in the neuroimaging context a Bayesian multiple testing model that accounts for the dependence of each volume element on the behavior of its neighbors through a conditional autoregressive (CAR) model. Here, we propose a generalized CAR model that allows for inclusion of points with no neighbors at all, something that is not possible under conventional CAR models. We consider also neighborhoods based on criteria other than physical location, such as genetic pathways in microarray determined from existing biological knowledge. This generalization provides a unified framework for the simultaneous modeling of dependent and independent cases, resulting in stronger Bayesian learning in the posterior and increased precision in the estimates of interesting signals. We justify the selected prior distribution and prove that the resulting posterior distribution is proper. We illustrate the effectiveness and applicability of our proposed model by using it to analyze both simulated and real microarray data in which the genes exhibit dependence that is determined by physical adjacency on a chromosome or predefined gene pathways.
Key Words: conditional autoregressive model, enrichment, microarray, multiple testing, significance analysis of microarrays, spike-and-slab prior
The analysis of high throughput data presents many challenges to researchers across a variety of disciplines. Many of the problems that must be dealt with are ubiquitous in the sciences but are exacerbated when the datasets are massive in size. Often, the goal is to detect the presence or absence of a signal over an extremely large number of cases, creating a massive multiple testing problem. Prior to the last two decades, most multiple testing procedures were constructed to control an overall error rate for a relatively small number of simultaneous tests (Efron, 2010). The advent of high throughput technology quickly revealed, however, that classical procedures can be inappropriate in the presence of thousands of simultaneous tests (Benjamini, 2010).
Suppose our data consist of cases, each of which arises independently from a normal distribution with case-specific mean, . For example, in gene microarray or next-generation RNA sequencing, may be the test statistic quantifying the differential expression of gene between cancerous and healthy tissue (Efron and Tibshirani, 2002; Li and Tibshirani, 2011). In functional magnetic resonance imaging (fMRI), signals are observed over time at thousands of points in the brain and a test statistic is calculated at each point to summarize the observed difference in that area’s signal between some stimulus condition versus a baseline (Friston, Holmes, Worsley, Poline, Frith, and Frackowiak, 1995). Frequently, a question of interest is whether or not at each ; i.e., a hypothesis test is conducted at each of thousands of locations to determine which of the are non-zero, indicating an interesting signal. The problem is to find a statistical procedure which corrects for multiple testing but does not sacrifice too much sensitivity for the sake of preventing false positives.
A similar issue arises in variable selection, where one is interested in determining which (of possibly many) variables contribute in a meaningful way to the observed response. A Bayesian approach is to assume the coefficient corresponding to each variable, , belongs to one of two distinct classes, the null class in which , or the non-null class, (e.g., Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Scott and Berger, 2006; Efron, 2008). Each is assigned an a priori probability of belonging to the null class. This value may be regarded as the proportion of all possible cases which would be null so that models very sparse signals, while models abundant signals. This mixing proportion, , is usually unknown, but it can be assigned a prior distribution to reflect the researcher’s beliefs about the level of sparsity in the data, or it can be estimated via empirical Bayes. The posterior inclusion probabilities (Barbieri and Berger, 2004) can subsequently be estimated from the posterior distribution. Placing a Beta prior on induces a multiplicity adjustment in that the model automatically penalizes for the number of tests in a posteriori probability statements. *ScottBerger10 discuss this issue and the conditions under which multiplicity correction can be induced.
Much of the work thus far developed is based on the assumption of independent hypothesis tests. This assumption is untenable in many applications. For example, nontrivial dependence structures are known to exist in neuroimaging data (Lee, Jones, Caffo, and Bassett, 2014; Zhang, Guindani, Versace, and Vannucci, 2014), syndromic surveillance (Banks, Datta, Karr, Lynch, Niemi, and Vera, 2012), gene microarray (Zhao, Kang, and Yu, 2014), and RNA sequencing (RNA-seq; Love, Huber, and Anders, 2014). Correlation can cause the null distribution of the observed test statistics to be over- or under-dispersed relative to the theoretical null under independence. Consequently, either too few or too many test statistics may be declared significant. The deleterious impact correlation can have on empirical Bayes methods and false discovery rate control (FDR; Benjamini and Hochberg, 1995) was investigated in *QiuEtAl05. *Efron07 focused on the effects of dependence on the distribution of test statistics. While some work has been done on incorporating known dependence structure into Bayesian models for identification of interesting cases (e.g., Smith and Fahrmeir, 2007; Li and Zhang, 2010; Stingo, Chen, Tadesse, and Vannucci, 2011; Lee, Jones, Caffo, and Bassett, 2014; Zhao, Kang, and Yu, 2014; Zhang, Guindani, Versace, and Vannucci, 2014), it has been limited, particularly with respect to exploring the multiple testing adjustments incurred through data-dependent estimation of inclusion probabilities.
Many datasets include isolated observations. For example, genes in microarray data share common pathways, but many genes are in no pathway at all. It is important to include as many cases as possible when evaluating the posterior distribution. A standard CAR structure assumes every observation has at least one neighbor, so that one is forced to either use an inappropriate neighborhood structure or exclude isolated points altogether. In the current paper, we extend a model proposed by *BrownEtAl14 for Bayesian multiple hypothesis testing and discuss more general neighborhood structures to provide a unified treatment of cases with at least one neighbor and with no neighbor. We use a less restrictive improper prior distribution for the variance components and establish the propriety of the posterior distribution of our model, ensuring that inferences are valid. In addition to allowing the newly proposed model to identify isolated non-null cases, we demonstrate that the inclusion of isolated points results in stronger Bayesian learning and improved estimation of the signal strengths of the selected cases.
We briefly motivate a common model used in Bayesian signal detection in Section 2. This leads to our proposed extension of the so-called spike-and-slab prior to accommodate local dependence. We prove the propriety of the posterior distribution of our proposed model and discuss computation. In Section 3, we simulate correlated microarray data to study our model’s performance against a prior assuming independence and assuming a conventional CAR structure. We also compare it against results obtained from the significance analysis for microarrays (SAM) procedure (Efron, Tibshirani, Storey, and Tusher, 2001; Tusher, Tibshirani, and Chu, 2001; Efron, 2010). We apply our procedure to two real gene microarray datasets in Section 4, one using dependence determined by adjacency on a chromosome, and the other with gene pathways defining the neighborhoods. Section 5 concludes with a discussion of the results.
2.1 Mixture Priors for Multiplicity Adjustment
To facilitate a Bayesian multiple testing correction, we postulate a “two groups model” (Efron, 2008). That is, we assume two possible cases for each of the observed test quantities, reflected in a Bayesian model through the prior on ,
where is the Dirac delta spike at zero and is the Gaussian density function with mean zero and variance . So-called “spike-and-slab” or “spike-and-bell” priors of this form are standard in the Bayesian variable selection framework. They were introduced by Mitchell and Beauchamp (1988) for variable selection in linear regression. The mixture model was used in Geweke (1996), who provided a procedure for selecting models subject to order constraints among the variables included in each model. A similar approach was taken in George and McCulloch (1993), who treated each regression coefficient as arising from a mixture of two continuous distributions with different variances for stochastic search variable selection. Literature on Bayesian variable selection was reviewed in Clyde and George (2004) and O’Hara and Sillanpää (2009). Scott and Berger (2006) explored Model (1) and ways in which it induces multiplicity correction, along with graphical displays and decision rules for subsequent inferences.
By allowing to be determined by the data, the joint posterior distributions obtained from spike-and-slab priors adapt to the number of tests, resulting in the posterior inclusion probabilities, , being penalized to account for the multiple tests (Scott and Berger, 2006, 2010). Specifically, Scott and Berger (2006) used a prior density on , where is a specified value, and as a prior density for the variance components. Under this model, Scott and Berger (2006, Lemma 3) showed that
where the expectation is taken with respect to the joint posterior distribution of , , and .
2.2 Incorporation of Local Dependence
Posterior inference can be sharpened if we exploit correlation among potential predictors in the search for interesting signals. *BrownEtAl14 proposed allowing the continuous component of (1) to share information across observations by reparameterizing as , where and is Gaussian, so that , where and . The lattice structure of datasets such as those arising from fMRI and gene microarray, makes a conditional autoregressive model (CAR; Besag, 1974) a natural choice for incorporating local dependence into the prior on . Since the potential non-null signals are expected to be as much positive as negative, a priori it is reasonable to assume such signals have zero means. Thus, we consider prior distributions of the form , where , , and except when cases and are neighbors. The intrinsic autoregressive model (IAR; Besag, York, and Mollié, 1991) emerges by taking and , where if and only if sites and are neighbors and . Under an IAR model for , the prior density is given by
where and . Note that so that the precision matrix has a nontrvial nullspace and hence the IAR is improper. However, a “propriety parameter”, , can be used in the conditional distributions such that with precision matrix . In general, the precision matrix will be nonsingular if , where and are the smallest and largest eigenvalues of , respectively (Banerjee, Carlin, and Gelfand, 2015).
Any data that have a lattice structure with known or suspected correlations occurring along predefined networks can be modeled with a CAR model. For instance, genes in microarray are known to express themselves in clusters along a chromosome (e.g., Xiao, Reilly, and Khodursky, 2009), or to behave in concert along specific gene pathways (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005). Thus, neighborhoods can be defined in terms of adjacency on a chromosome or based on genes sharing certain predefined pathways determined from prior knowledge. Care should be taken in defining neighborhoods, though, as microarray / RNA-seq datasets often include genes that are not members of any known pathway and thus are isolated. Including isolated points in the IAR induces zero rows in the precision matrix, a problem that cannot be fixed with a propriety parameter. In response, we adjust the neighborhood weights to allow for inclusion of the isolated points while avoiding a singular precision matrix.
We modify the usual IAR model by defining the neighborhood weights about to be with conditional variance , where . The consequent precision matrix is . Then , with strict inequality for . Thus, with , we are able to include isolated points in the model while maintaining the propriety of the distribution. If we take , then for any isolated point , so that , , and independently of other points. Hence, we can facilitate conditional independence of while allowing all points to share information about plausible values of the hypervariance through the prior distribution on the variance components. Taking this view, we can express the traditional IAR model as a special case in which every point in a dataset has at least one neighbor and .
Let the joint density of the data and parameters be given by , where contains nuisance parameters modeled in the prior distribution. When for all , does not appear in the resulting likelihood and thus is Bayesianly unidentified (Gelfand and Sahu, 1999; Eberly and Carlin, 2000). This means that , so that we must have a proper prior on for the posterior distribution to be proper, making the inclusion of necessary when . See McLachlan and Peel (2000, Chapter 4) for further discussion of prior distributions in finite mixture models.
Usually, there is little direct information available about , so estimating it may be difficult. Previous work has shown that appreciable interaction between adjacent points only occurs when is close to its upper bound under the model (Banerjee, Carlin, and Gelfand, 2015). To give the data more freedom in determining the spatial association without specific regard for interpretability, we consider the prior . It is important to note that inclusion of is still possible when , provided that is bounded between the reciprocals of the smallest and largest eigenvalues of .
An additional advantage of including in the joint model for is that, under positive spatial association, the posterior distribution becomes insensitive to the choice of in the neighborhood weights. This is because as grows, is allowed to increase as well. In other words, if , then , where is the maximum eigenvalue of . A proof of this fact may be found in the Supplementary Material.
We also wish to avoid strong information about either the noise variance, , or the hypervariance, . *Gelman06 suggested that the priors specified for the variance parameters in hierarchical models may have a disproportionate effect in that they can impose strong restrictions on posterior inference. Conversely, priors used for scale hyperparameters that are intended to be noninformative may, in fact, be too weak by placing considerable probability on unreasonable extreme values in the posterior. To address this possibility, *Gelman06 proposed the use of a weakly informative (vague but proper) prior on the scale hyperparameter such as the folded- distribution. *ScottBerger06 argued for the use of a joint prior on and with density so that a standard improper prior on can be used while scaling by . The prior on is similar to the prior on which results from placing a folded- prior on with scale . Supplementary Figure 6 illustrates a slight difference between the two priors for small values of so that the Scott-Berger (SB) prior on is slightly less informative than a folded-. However, the SB prior and the folded--based prior are tail equivalent in the sense that the ratio of the two densities is as . Both priors attempt to balance the freedom of the data to inform about this parameter against the need to protect against unreasonable values. We thus follow the precedent set by Scott and Berger (2006) and use the same joint prior distribution on and .
This leads to the following model:
where and are the smallest and largest eigenvalues of , respectively. Since we are using an improper prior in (4), the posterior density is not guaranteed to be integrable. We provide a proof in the Appendix that the posterior distribution is indeed proper.
The practical effect of having room to increase along with in our proposed model can be seen through simulation. We simulate a array of observations arising from both null and non-null distributions. The activation pattern is created by drawing from an Ising distribution (e.g., Higdon, 1994), , with interaction parameter . The null cases () are drawn from and the non-null () cases are drawn from . The binary activation pattern and simulated data array are displayed in Supplementary Figures 7 and 8. We use these data to estimate the posterior distributions under model (4) with . See Subsection 2.3 for implementation details. A descriptive measure of spatial association is Moran’s , , where values away from zero are evidence of spatial association according to the predefined neighborhood structure (e.g., Banerjee, Carlin, and Gelfand, 2015, Sec. 4.1). Figure 1 displays smoothed histograms of realizations of from 2,000 replications each from the three respective posterior predictive distributions, , along with the approximate marginal posterior densities of . For each value of , the posterior of tends to concentrate near its upper bound and the posterior predictive densities of are nearly indistinguishable. Regardless of the value of , adjusts accordingly and the overall spatial association is consistent with the data.
Generally speaking, including as many cases as possible when evaluating the posterior distribution is important. It is often the case that the dataset to be analyzed contains subsets of correlated observations among many independent observations. A standard CAR structure assumes every observation has at least one neighbor so that the only possibilities for dealing with both types of data are to force every data point into the model via an inappropriate neighborhood structure or to exclude isolated cases. Using an inappropriate dependence structure is clearly undesirable, and excluding isolated points can result in discarding a large proportion of the data. With our proposed approach, adjusting the weights with in the denominator allows for the inclusion of all cases when evaluating the posterior distribution while simultaneously preserving the dependence among the cases sharing common networks as well as the independence of the isolated cases. In the sequel, we consider the performance of our model when or . These are admittedly ad hoc values, and may not be appropriate for all situations.
2.3 Computational Implementation
We facilitate the Gibbs sampling algorithm (Geman and Geman, 1984) by reparameterizing the model as . For ease of notation, define . Then
The full conditional distribution of is , where (Carlin and Louis, 2009). To avoid matrix inversion with extremely large , we update element-wise. The full conditional distributions are given in the Supplementary Material.
We use rejection sampling to draw from the full conditional distribution of . However, the importance ratio determining the acceptance probability is as . This means that the iterations can slow down on this step with candidate densities that concentrate on extremely small values of . Possible alternatives are an adaptive Metropolis algorithm (Haario, Saksman, and Tamminen, 2005) or adaptive rejection sampling (Gilks and Wild, 1992).
Since has bounded support, we follow the suggestion of *CarlinBanerjee03 and use slice sampling (Neal, 2003) to sample from the full conditional distribution of . Our experience is that the algorithm performs better with the “doubling” procedure outlined by *Neal03 to adaptively determine good proposal intervals. Note that the determinant in the conditional density of can be quickly computed using the eigenvalues of , which do not depend on any unknown parameters. These eigenvalues will have already been computed if using the proposed model in (4). Matrix computations can also be eased by calculating and before searching for an acceptable update for . These only need to be calculated once for each Gibbs iteration.
From (10) in the Supplementary Material, we can see the strong dependence between and in their full conditional distributions. On the iteration of a Gibbs sampling routine, if the sample draw is extremely close to 1, then most of the draws , will be zero. But then will be close to zero so that the conditional Beta density will concentrate close to 1, leading to another high value of , and so on. Thus, in spite of the computationally convenient conditional conjugacy, an MCMC routine can get stuck in the region of the parameter space with most , which slows. This situation can be ameliorated by reparameterizing the mixing proportion to eliminate boundary constraints, for instance a logit transformation , and using Langevin-Hastings proposals for to push the chain toward the posterior mode (Gilks et al., 1996, Ch. 6). Occasionally mixing in ordinary Metropolis proposals in place of Langevin-Hastings proposals offers further improvements when the chain is far from the mode (Carlin and Louis, 2009, Ch. 3).
The quantities of interest are the marginal posterior inclusion probabilities for each signal , . To estimate this from the posterior sample draws, we recognize that in Model (4), , where the expectation is taken with respect to , and is given by
This quantity can be estimated by , where is the plug-in estimate of evaluated with the draws from the approximate joint posterior, and is the Monte Carlo sample size. Similarly, we can use (2) to estimate the inclusion probabilities under the Scott-Berger model using drawn from the appropriate posterior. Both of these estimators are “Rao-Blackwellized” in the sense of *GelfandSmith90 and thus have smaller Monte Carlo variance than other more naive estimators (Carlin and Louis, 2009, Ch. 3).
3 Simulation Studies
To evaluate the performance of our model, we simulate a dataset in a manner similar to that which was carried out in *XiaoEtAl09, resulting in similar correlation structure as sometimes arises between genes on chromosomes. We consider a situation in which 1,000 gene expressions are recorded for ten subjects. For the gene on the subject, , the observed expression level is drawn from . Five of the subjects are taken as controls with baseline (i.e., null case) expression levels over all 1,000 genes, so that for . The remaining five “treatment” subjects’ data are simulated so that 100 genes are differentially expressed. That is, for , and . For each control subject, we generate the gene expression levels by drawing the vector of observations . For the treatment group, the test statistics belonging to the null class are again simulated as i.i.d. standard normal. We model correlation among the differentially expressed (non-null) cases by drawing each group of twenty test statistics as , where is the cluster of non-null cases, (i.e., ), and . This results in null cases that are independent of each other, but correlated non-null cases. Pooled statistics, , are then calculated between the control and treatment conditions for each gene and subsequently normalized via probit transformation of the -values (Efron, 2010), yielding test statistics , where with being the distribution function of the statistics.
We analyze the simulated data using both our proposed model and the Scott-Berger (SB) model assuming independence. In our model, we consider the sharing of information across genes using three different neighborhood structures. These neighborhoods are displayed graphically in Supplementary Figure 9. The associated adjacency matrices for these neighborhoods are
For further comparison, we implement also the Significance Analysis for Microarrays (SAM) procedure as outlined in *Efron10. SAM is a popular method for analyzing microarray data designed to approximately control the false discovery rate (FDR). For this procedure, we vary the FDR criterion between 0.05 and 0.15 to study performance across a range of FDR levels commonly used in practice. See Efron (2010, Chapter 4) for further details of the SAM procedure.
Each neighborhood is one in which every location has at least one neighbor so that, in (4), for all . We take , reducing to the usual IAR model like the one considered in Brown, Lazar, Datta, Jang, and McDowell (2014). To enforce sparsity a priori, we choose a high value of in the prior on so that its prior probability mass is concentrated close to 1. We take here. In the SB model, we find that the best results are obtained using a uniform prior on , with higher values of the shape parameter negatively affecting the model’s ability to detect activation. Lastly, note that the data generating mechanism here is different from what is assumed under either our proposed model or the SB model, allowing us to study robustness, as well.
For both Bayesian models, we simulate the posterior distributions via MCMC. We implement the SB model using Gibbs sampling with nested rejection sampling steps for the non-standard distributions. To draw from the posterior of our proposed model, we use the conditional distributions given in the Supplementary Material for a Gibbs sampling procedure with nested rejection and slice sampling steps as described in Subsection 2.3. The algorithms are coded entirely in
R (R Core Team, 2015). For our proposed model, a single chain uses a burn-in period of 5,000 iterations followed by an additional 10,000 sampling iterations, thinning to every fifth draw for a final sample of size 2,000. We run three chains in parallel and assess convergence with trace plots and scale reduction factors for selected parameters (Gelman and Rubin, 1992). Upon attaining approximate convergence, the retained draws from each of the three chains are combined for a final Monte Carlo sample size of 6,000. Similarly, we run parallel chains, assess convergence, and combine the draws to simulate the posterior distribution of the SB model. The posterior inclusion probabilities are estimated using (2) for the SB model and (5) for our proposed model. To select simulated genes as differentially expressed, we threshold the estimated posterior inclusion probabilities at 0.95.
Table 1 displays the empirical error rates for the SB model, the SAM procedure, and our model using , , and above as neighborhoods. Of these models, we observe that the SB model results in the highest overall misclassification proportion, due to the false non-discoveries. In this case, the SB model is overcorrecting for multiplicity to the point that no cases are selected at all (hence the identically zero false discovery proportion). The SAM procedure performs slightly better in terms of non-discoveries and overall misclassification proportion, at the expense of false positives accounting for 13% - 16% of all discoveries. In contrast, our proposed model performs better, regardless of the selected neighborhood structure. The first-order neighborhood with unit weights (; top illustration in Supplementary Figure 9) performs best both in terms of non-discoveries and false discoveries, but all of the error rates are very close when compared to the other two approaches.
|SB||SAM(0.05)||SAM(0.10)||SAM(0.15)||CAR ()||CAR (||CAR ()|
Figure 2 displays the empirical receiver operating characteristic (ROC) curves for the SB, SAM, and CAR() models. The ROC curves for the CAR() and CAR() models are virtually indistinguishable from the CAR() curve and so are not displayed. We see that the SB testing model and the SAM procedure are comparable in terms of overall discriminatory power. The approximate areas under the curves (AUC) are given in Table 2. While both of these procedures yield sensitivity slightly superior to that of our model at lower levels of specificity (less than about 0.40), ours still captures more area under its curve. In particular, note that our model attains a very sharp increase in sensitivity at very high levels of specificity.
Insight into the reasons for the discrepancies between our model and the SB model can be gained by examining the smoothed approximate posterior densities of , , and , displayed in Figure 3. Incorporating the dependence results in much stronger Bayesian learning about these parameters. In this simulation, 100 out of 1,000 cases are non-null, so that we expect to be large, though not exactly 0.9 since correlation among the cases reduces the effective sample size (Carlin and Louis, 2009, Ch. 3). That is indeed the case under our model. The SB model, on the other hand, results in considerably lower estimates of , along with weakly identified distributions of and . This weak identifiability contributes to the error rates observed in Table 1.
Supplementary Figure 10 plots the estimated inclusion probabilities versus the observed test statistics. All of the test statistics are assigned relatively high probabilities of being non-null under the SB model, but the lack of information about and prevent distinctions being drawn that are strong enough to pass a 0.95 inclusion probability threshold. Our approach, on the other hand, results in stronger statements about the likelihoods of cases being non-null. The ‘jagged’ quality of the curve corresponding to the CAR() model is due to the estimated inclusion probabilities being not a function of the values alone, but also of their location with respect to nearby observed values. To see this, consider the circled point in Supplementary Figure 10, which corresponds to the identified statistic in the graphical depiction of the test statistics in Supplementary Figure 11. This relatively extreme observation is in the middle of uninteresting test statistics. This is a truly null case, so the incorporation of a local dependence structure helps to prevent a false discovery.
In addition to neighborhoods determined by physical adjacency, it may be desirable to facilitate the sharing of information within gene sets such as those used in enrichment analysis (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005). To investigate this scenario, we simulate an additional microarray dataset with expression characteristics similar to the simulation carried out in *EfronTib07. We again consider a collection of 1,000 genes, with genes 11-20, 111-130, 211-230, 311-330, 411-430 defining five different gene sets. The set consisting of genes 111-130 is simulated as differentially expressed by drawing them independently from ; similarly, the genes 411-430 are drawn independently from . The remaining genes, including those in the remaining gene sets, are considered null cases and all drawn from . To distinguish dependence determined by pathway membership from dependence determined by physical adjacency, the genes are labeled and randomly permuted so that genes sharing common pathways are not adjacent in the physical sense. Note that many genes are members of no pathway at all and thus are isolated.
Suppose a researcher were to mistakenly assume the dependence structure for these data followed the same pattern as the previous example in which every gene is correlated with its physical neighbors, meaning the usual proper IAR with adjacency matrix can be used with in (4) as before. We compare the performance of this CAR structure to our proposed approach with neighborhoods determined by pathway membership. In the latter case, the adjacency matrix is determined by defining genes that are in the same set to be neighbors. To include all observations without reducing the rank of the precision matrix, we set in (4) so that the marginal distributions of isolated become . We implement both models using MCMC with the same burn-in and sampling settings as the previous simulation.
Supplementary Figure 12 displays the empirical ROC curves from our model under both neighborhood assumptions, in which we can see that making incorrect assumptions about the neighborhood structure severely inhibits the model’s discriminatory power. In fact, thresholding the posterior inclusion probabilities at 0.95 as before results in no cases being identified at all under the physical adjacency assumption. The misspecification of the correlation results in the model overestimating the noise variability in the data, as is clear upon examination of the histogram in Supplementary Figure 13. Superimposed on the histogram are two mean-zero normal densities with variances equal to the posterior means of under both models. The overestimation of the spread of the null distribution results in poor identification of the non-null cases, indicated with the dark tick marks along the -axis. Incorporating a more appropriate neighborhood structure results in improved estimation of the variance components and thus superior discrimination among cases.
Even with knowledge of an approximately correct dependence structure among pathways, a temptation might be to use a standard CAR model by discarding the isolated cases. While this is obviously better than using an entirely incorrect neighborhood assumption, the isolated points still provide information about the parameters common to both the null and non-null components of the data distribution and hence potentially useful information can be needlessly discarded. Consider the smoothed marginal posterior densities of , , and resulting from both approaches, displayed in Figure 4. Eliminating the isolated cases reduces in (4), so we obtain considerably less posterior concentration about the error variance, which in turn affects the amount of information available to estimate the second variance component, . As is the case in any hypothesis testing scenario, the operating characteristics are directly affected by the amount of information we have about the error variability. This results in the “borderline” cases being misclassified as noise at a 0.95 inclusion probability threshold, thus increasing the false non-discovery proportion. (See Supplementary Figure 14.) The false discovery, false non-discovery, and misclassification proportions at the 0.95 threshold with and without isolated cases are given in Table 3.
|With Isolated Cases||Without Isolated Cases|
In addition to detection, there is often an interest in estimating the true signal strengths of the non-null cases. Figure 5 plots the smoothed approximate posterior densities and approximate 95% credible intervals about the signals, , for two typical non-null cases in the simulated gene pathway data. Again, the posteriors are evaluated with and without the isolated cases using neighborhoods defined by pathway membership. The true signal strengths (i.e., the means of the distributions of the test statistics) for these two cases are (top panel) and (bottom panel), indicated in the plots by vertical lines. The Figure illustrates the reduction in posterior uncertainty that can be obtained by including all of the data points. For the top panel, the approximate 95% credible intervals with and without the isolated cases are and , respectively. For the bottom panel, the intervals are and with and without the isolated cases, respectively. Table 4 displays the average widths of the approximate 95% credible intervals over all of the cases in both non-null pathways with and without the isolated observations. By including the isolated points with an appropriate neighborhood structure afforded by our proposed model, we attain an approximate four-fold increase in the precisions of the signal estimates.
|Pathway||All Cases||Cases Excluded|
|2||0.9603 (0.0137)||1.708 (0.0604)|
|5||0.9338 (0.0105)||1.700 (0.0433)|
In practice, the most appropriate neighborhood structure to use in the CAR model may not be known. For gene microarray, though, only biologically meaningful dependence structures would usually be considered so that one would not be faced with completely unguided choices. Even among interpretable neighborhoods, we still might be forced to compare competing neighborhood assumptions in an effort to determine which is best. Consider again the simulated pathway data, only we do not know whether the dependence is among biologically-determined pathways, or if it is a function of physical adjacency. In this case, we can gauge the spatial dependence by considering Moran’s statistic under different neighborhood assumptions, whence the competing models can be examined by looking at the strength of estimated spatial association according to each. For the simulated pathway data, we have for the (incorrect) adjacency assumption, and for the (correct) pathway assumption. The lack of association under the adjacency structure is indicative of the invalid assumption, since we would suspect there to be some kind of association (otherwise there’s no need for a CAR structure at all). Further, we can investigate competing models’ predictive capabilities through the use of root mean square predictive error (RMSPE) and the Wantanabe-Akaike information criterion (WAIC; Wantanabe, 2010), the latter of which is asymptotically equivalent to Bayesian leave-one-out cross validation but averages over the posterior distribution instead of relying on point estimates, unlike AIC or DIC (Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, 2014). Table 5 displays RMSPE and WAIC for both models, where we see that the correct neighborhood structure is favored.
The true correlation structure among gene expression data is complex. The implementation of our proposed model (and similar CAR models), however, requires neighborhood structures to be specified a priori, and this specification can potentially be incomplete or simplistic. Therefore, to study the performance of our proposed model under partially incorrect neighborhood assumptions, we simulate 1,000 expression levels over ten subjects (five treatment, five control) with genes 11-20, 111-130, 211-230, 311-330, 411-430 defining five different gene sets as before. We again suppose that the second and fifth sets are enriched in the treatment group. In contrast to the previous simulation, assume for each treatment subject that the correlation among signals is induced according to a directed graph, as might occur in metabolic pathways in which a signal originates from a single gene and cascades via several networks to other genes downstream (Stingo, Chen, Tadesse, and Vannucci, 2011). A “parent” gene is selected at random from each of the two active gene sets, and the signals for these parent genes are simulated as . Within each pathway, seven additional child nodes are selected at random and their signals are taken to be . The remaining signals are drawn independently from . Within each gene set, the observed expression levels are taken to be , where indexes over all genes, including the parent nodes and child nodes, and indexes the subjects. To represent incompleteness, suppose the neighborhood structure determining omits two genes that are actually members of one of the active sets, and likewise for one of the inactive sets. Lastly, we randomly select 30 isolated genes and draw their expression levels from for all subjects, treatment and control, where . Our proposed model thus incorrectly assumes uniform correlation within each pathway, uses incomplete pathway definitions, and ignores correlation among a subset of genes with no known pathway membership.
Table 6 displays the empirical false nondiscovery proportions, false discovery proportions, and overall misclassification proportions for these simulated data under the Bayesian independence testing model and our proposed pathway-determined CAR model with 0.95 posterior probability threshold, as well the SAM procedure with three common thresholds. Despite the model misspecification, our model performs comparably to the Bayesian approach assuming (incorrectly) independence and the generally applicable SAM procedure. In fact, our approach still yields the smallest empirical misclassification proportion, though they are all very close. Even though we have partially incorrect assumptions about the correlation structure, our approach is still superior to that of assuming complete independence in terms of predictive capability, as evident in the lower RMSPE and WAIC values, also displayed in Table 6. Further, the assumed neighborhood structure in our model predicts non-zero spatial correlations that are fairly consistent with those observed in the actual data under the same structure. This is evident in Supplementary Figure 15, which plots a histogram of realized values from the posterior predictive distribution along with the value observed in the actual data. Under the assumed neighborhood structure, .
These simulation results indicate that overall performance of the mixture prior can be improved by using common information across local neighborhoods, when it is available. This improvement is due to the induced penalty on the inclusion probabilities of statistics surrounded by uninteresting observations, and to improved estimation of the mixing proportion and variance components in the data. By choosing the neighborhood weights appropriately, we demonstrate how our model can accommodate isolated genes that have no neighbors. Our proposed approach is evidently superior to an approach using a conventional CAR model, even when the correct pathway information is used to define the neighborhoods but isolated points are discarded. Incorporating spatial dependence and isolated cases results in much sharper Bayesian learning in the posterior distribution, improving inference and reducing uncertainty. Simple diagnostics such as Moran’s under different assumed neighborhood structures can be helpful when competing neighborhood structures are available, as well as considering predictive capabilities through measures such as WAIC, which approximates out-of-sample predictive error while averaging over the posterior distribution. We illustrate also that even simplistic correlation assumptions in the Bayesian testing approach still perform competitively with alternatives such as the SAM procedure while predicting dependence features that are consistent with the observed data. Judging from model fit criteria, partially incorrect correlation assumptions are better than ignoring them altogether in an independence model.
4.1 E. Coli Data
As one application of our proposed model, we consider the microarray expression data from *XiaoEtAl09. This dataset contains transcriptional activity on the Escherichia coli chromosome measured as log ratios of transcript abundances between a control and various chemical, physiological, and genetic perturbations comprising 53 experimental conditions. The observed gene expression levels are the average log ratios across conditions. Here, we have 4 replicate measurements on 4,276 genes. Test statistics are calculated by dividing the mean difference by the standard error plus a small constant to reduce the effect of extreme observations, as is done in the SAM procedure. These statistics are probit transformed to yield equivalent statistics.
The E. coli chromosome has been shown to have correlated expression levels according to gene location, but with a circular structure so that the first and last genes on the chromosome are considered neighbors (Xiao, Reilly, and Khodursky, 2009). This structure leads us to use the adjacency matrix obtained by replacing the last elements of the first row and first column of in Section 3 by 1. We take since each point has two neighbors.
We simulate the posterior distributions of the independence model and our CAR model using the MCMC algorithms described in Sections 2 and 3. The resulting posterior activation probabilities are thresholded at 0.99 to select genes as being differentially expressed. To evaluate performance, we compare the genes selected as differentially expressed to a list of 41 genes identified in *Macnab92 as having a known or suspected function in the E. coli chromosome. This list serves as a reference with which we calculate approximate false discovery and false non-discovery proportions.
To illustrate the effect of the shape parameter in the prior for in our model, we repeatedly simulate the posterior distribution while varying over selected values between 1 and 1775. For the independence model, we again take . Table 7 gives the consequent empirical error rates. For lower values of , the sensitivity results in generally higher error rates compared to that from assuming independence. However, for higher , we are able to attain uniformly better performance, with all three error rates outperforming the independence model. The effect of on the marginal posterior distributions of and can also be seen in Supplementary Figure 16. Higher values result in higher estimated values of , as expected, but they also result in sharper posterior inferences about both and . We remark that the false discovery proportions are all quite high here. As this analysis is based on a real dataset, though, there is no way of knowing which of these are true false discoveries. Indeed, the high FDP could be a consequence of *Macnab92 listing the most interesting genes, not necessarily all interesting genes.
4.2 Male vs. Female Lymphoblastoid Cell Data
For an additional application under a different neighborhood structure, we consider mRNA expression profile data collected from lymphoblastoid cell lines derived from fifteen males and seventeen females. This dataset was analyzed in *SubramanianEtAl05 with gene set enrichment analysis (GSEA), who sought to identify gene sets enriched in males (male female) and enriched in females (female male). Each cell line contains measurements on 22,283 genes. The existing catalog for this dataset includes 319 cytogenetic gene sets, 24 for each of the 24 human chromosomes, and 295 associated with cytogenetic bands. For our analysis, we again calculate two-sample -statistics and normalize to obtain -statistics.
We consider two variants of our proposed CAR testing approach, both of which use the 319 gene sets to define the neighborhoods, but with one model including the isolated points and the other excluding isolated points so that a conventional CAR model is applicable. With the isolated cases included, we set in (4); when there are no isolated cases. Otherwise, both models are the same. We take in the prior on so that it becomes uniform. The MCMC algorithm, identical under both models, uses a burn in of 25,000 iterations, followed by 10,000 sampling iterations, of which every fifth draw is retained. The posterior inclusion probabilities are calculated using (5) and thresholded at 0.99 to identify differentially expressed genes.
The estimated posterior inclusion probabilities under both models are quite similar, though not exactly the same. This can be seen in Supplementary Figure 17, which plots the posterior inclusion probabilities versus the test statistics for both models. The differences result in a couple of disagreements on the selection of differentially expressed genes, listed in Table 8.
The disagreements between the two approaches can partly be explained by the reliability of the estimates themselves. Table 8 lists the effective sample sizes (ESS) of the MCMC draws of the signals for each identified gene, (Kass, Carlin, Gelman, and Neal, 1998). The ESS is an approximation of the number of independent pieces of information about a parameter produced by an MCMC algorithm, where lower numbers reflect higher autocorrelation in the chain and hence slower convergence. The differences between the two approaches considered here are substantial. With a couple of exceptions, the retained values from the Markov chain under the conventional CAR assumption are more highly correlated than in the gCAR, thus reducing the amount of available information about these parameters from a fixed number of iterations. On the other hand, including the isolated cases in this instance generally results in the Markov chain samples being almost as good as an i.i.d. sample from the target distribution.
Table 8 also indicates the pathway membership for each case. There are six gene sets in which the identified individual genes appear, and the clustering of genes along the X and Y chromosome is apparent. If one were to use the approach of simply declaring as enriched the pathways including the individual differentially expressed genes, our results would agree closely with those found in the GSEA. In particular, the GSEA identified the Y chromosome (
chrY) and two Y bands with at least 15 genes (
chrYp11, chrYq11) as being associated with higher expression levels in males. Two of the genes selected by both models appear in the
chrXq13 gene sets. These genes are associated with the set of X chromosome inactivation genes, which is expected to be enriched in females. Note that our proposed gCAR enabled us to identify an isolated gene (
203974_at) that does not appear in any of the predefined gene sets. This gene would have been missed entirely if we were to use a conventional CAR structure inside of our Bayesian testing model, since it would have been discarded from the analysis. We notice also a particular gene selected only by the conventional CAR that appears in both the X and Y chromosome. This curious case could be a false positive, though, as the slower MCMC convergence under the conventional CAR model makes the posterior inference less reliable than its gCAR counterpart.
The applications presented here illustrate two different approaches to defining neighborhoods across which information may be shared when searching for non-null cases under our Bayesian testing model. While the results are sensitive to the choice of hyperparameters as well as the threshold, it is apparent that “good” choices can lead to desirable operating characteristics. Applying our proposed approach to these data yields results consistent with past analyses. These results demonstrate our model’s ability to harness shared information between cases without sacrificing the possibility of identifying independent cases, something that would not be possible under the conventional CAR assumption. In this analysis, we found that including the isolated cases substantially reduced the autocorrelation. This results in quicker convergence and much greater sampling efficiency than is obtained from a conventional CAR model. This is an especially important consideration when performing MCMC in a large-scale setting, where the computational burden limits the feasible number of iterations. Our results suggest that there could be a positive effect on the sampling efficiency of an MCMC algorithm by including isolated, independent cases in our generalized CAR structure. This is possibly an interesting topic for future research that we do not pursue here.
We present a unified approach to correlated Bayesian testing whereby isolated cases and neighboring cases can be included in the same analysis. This allows for improved estimation of the signal strengths, the possibility of identifying isolated cases, and sharper posterior inferences about parameters of interest. We suggest some simple diagnostics that can aid a researcher in determining the most appropriate neighborhood structure when choosing among plausible models. When very little prior information is available concerning correlation structure, we note that there exist in the literature proposed techniques for discovering structure in high dimensional data such as sparse factor modeling (West, 2003) and independent components analysis (Comon, 1994). We demonstrate the robustness of our approach to model misspecification by applying it to simulated data with a complex correlation structure in which the assumptions are partially incorrect. It performs competitively with well-established procedures.
The results presented here are seen to be quite sensitive to the choice of the shape parameter in the prior for the mixing proportion, , in our proposed model. This is in part because large values result in both prior and posterior concentration of about large values. These large values mean that the Gibbs sampler tends to visit sparser models more often, and thus parameters that only appear in non-null cases are updated less frequently. Certain applications necessitate the use of a prior that enforces known sparsity (e.g., West, 2003; Carvalho, Chang, Lucas, Nevins, Wang, and West, 2008). While the choice of shape hyperparameter does have a considerable effect on subsequent inferences, we demonstrate how finding a good value leads to desirable operating characteristics. In working with our model, we found that an acceptable value of the shape parameter seems to depend upon the strength of the correlation across neighboring observations. The best way to choose this value or otherwise tune the prior to approximately match the true a priori non-null probability in the data is still an open problem worthy of further investigation.
A related point is the thresholding of the location-specific a posteriori inclusion probabilities. We use throughout this paper an informal 0.95 decision rule for selecting non-null cases. Decision rules in the Bayesian testing paradigm have been proposed through average risk optimization and consideration of the so-called Bayes false discovery rate (bFDR) (e.g., Efron and Tibshirani, 2002; Tadesse, Ibrahim, Vannucci, and Gentleman, 2005; Bogdan, Ghosh, and Tokdar, 2008). However, most results concerning the relationship between bFDR and frequentist error measures are based on assumed independence in the data, which we are not considering. Performance also is determined in part by specification of the prior probabilities of the hypotheses. However, the approach of Scott and Berger (2006) on which our model is based enjoys the virtue of inducing an automatic multiplicity adjustment, even in the correlated case (Brown, Lazar, Datta, Jang, and McDowell, 2014). While expression (5) allows our calculations to viewed as a fully Bayesian treatment of local false discovery rates (Efron, 2010, Ch. 5), much work remains to be done on establishing optimal decision rules for controlling different error rates under dependence.
This paper provides a glimpse at the possibility of facilitating more reliable inference by capturing (or at least approximating) the true dependence structures that are inherent in modern high-dimensional data. While this issue has garnered more interest in the recent literature (e.g., Smith and Fahrmeir, 2007; Li and Zhang, 2010; Stingo, Chen, Tadesse, and Vannucci, 2011; Lee, Jones, Caffo, and Bassett, 2014; Zhang, Guindani, Versace, and Vannucci, 2014; Zhao, Kang, and Yu, 2014) relatively limited work has been done on modeling nontrivial dependence in Bayesian models for signal detection, particularly in the high-dimensional setting where classical multiple testing approaches are no longer appropriate. Here we propose using Markov-type dependence structures in the continuous component of the spike-and-slab mixture with a Gaussian CAR model. An avenue of future work could be the exploration of other dependence structures. Work has been done on modeling complex measures of distance and covariance structures *[e.g.,]DrydenKoloZhou09, but there is a need for much further research toward building a flexible class of multiple testing models capable of dealing with a wide variety of dependence types.
Appendix: Proof of Posterior Propriety of the Proposed Model
Let denote the parameter vector and let be the joint distribution of the data and the parameters. Then for Model (4), it suffices to show that
First, note that for any ,
where . Hence, the integral inside the summation in (6) is
since . Also, the summation over has terms, so it suffices to establish that
We begin with the case when for all and define to simplify notation. We have that
Now, since implies that , where the notation means the matrix is positive definite (Banerjee, Carlin, and Gelfand, 2015), the prior on . We can thus integrate with respect to as the convolution of two normal densities. The marginal density of is then that of a distribution. So, we know the integration yields
Now, we reparameterize the variance components by defining and integrate over using the inverse gamma integral to obtain
We can rewrite the quadratic form in the denominator of the last expression as
where and . Let Then, after substantial simplification (see Supplementary Material) it can be shown that, for all ,
where is constant.
Similarly, after substantial simplification (see Supplementary Material), we can show that