False discovery rate smoothing
Abstract
We present false discovery rate smoothing, an empirical-Bayes method for exploiting spatial structure in large multiple-testing problems. FDR smoothing automatically finds spatially localized regions of significant test statistics. It then relaxes the threshold of statistical significance within these regions, and tightens it elsewhere, in a manner that controls the overall false-discovery rate at a given level. This results in increased power and cleaner spatial separation of signals from noise. The approach requires solving a nonstandard high-dimensional optimization problem, for which an efficient augmented-Lagrangian algorithm is presented. In simulation studies, FDR smoothing exhibits state-of-the-art performance at modest computational cost. In particular, it is shown to be far more robust than existing methods for spatially dependent multiple testing. We also apply the method to a data set from an fMRI experiment on spatial working memory, where it detects patterns that are much more biologically plausible than those detected by standard FDR-controlling methods. All code for FDR smoothing is publicly available in Python and R at https://github.com/tansey/smoothfdr.
1 Introduction
1.1 Spatial smoothing in the two-groups model
The traditional problem of multiple testing concerns a group of related null hypotheses that are to be tested simultaneously. In the simplest form of the problem, a summary statistic $z_i$ is observed for each test. The goal is to decide which are signals ($h_i = 1$) and which are nulls ($h_i = 0$) while ensuring that some maximal error rate is not exceeded. Standard approaches to multiple testing include Bonferroni correction, which controls the overall probability of one or more false positives; and the Benjamini–Hochberg procedure, which controls the false discovery rate, or the expected fraction of false positives among all discoveries (Benjamini and Hochberg, 1995). These techniques have been successfully applied across many fields of science, most notably in the analysis of DNA microarrays and other sources of genomic data.
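For concreteness, the Benjamini–Hochberg step-up procedure can be sketched in a few lines. This is a generic illustration rather than code from our package; the function name and the use of NumPy are our own choices.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of discoveries under the BH procedure.

    Sort the p-values, find the largest k with p_(k) <= alpha * k / n,
    and reject the hypotheses with the k smallest p-values.
    """
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index passing the threshold
        reject[order[:k + 1]] = True
    return reject
```

Note that all hypotheses up to the largest passing index are rejected, even if some intermediate sorted p-values exceed their thresholds; this is what makes BH a step-up procedure.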
But many large-scale multiple testing problems exhibit spatial patterns that traditional methods do not account for. Examples include: (1) fMRI studies, where significant test statistics cluster in anatomically relevant regions of the brain; (2) studies of allele frequencies in biological populations, where genetic loci correspond to physical locations on the chromosome; (3) studies of variation in DNA methylation fraction across specific genomic regions; (4) neural spike-train data recorded from a multi-electrode array, in which electrodes fall at known locations on a two-dimensional lattice; and (5) environmental sensor networks designed to detect spatially localized anomalies.
This paper presents a new method called false discovery rate smoothing that can learn and exploit the underlying spatial structures in these multiple-testing problems. FDR smoothing finds spatially localized regions of significant test statistics by solving a specific optimization problem involving the $\ell_1$ penalty. It then relaxes the threshold of statistical significance within these regions in a manner that controls the global false-discovery rate at a specified level.
Throughout the paper, we emphasize three key advantages of FDR smoothing:
 Local adaptivity. FDR smoothing finds localized spatial structure "out of the box," with automated hyperparameter tuning. This results in increased power and cleaner spatial separation of signals from noise versus standard methods.
 Robustness. We highlight the failure modes of existing techniques for spatially dependent multiple testing and show that, in contrast, FDR smoothing is much more robust.
 Computational efficiency. FDR smoothing utilizes modern techniques for convex optimization, yielding modest runtimes even for very large spatial problems.
1.2 Connections with existing work
Our approach incorporates spatial smoothing into the "two-groups" model, a popular empirical-Bayes approach for controlling the false-discovery rate that has been advocated by Bradley Efron and many others (e.g. Scott and Berger, 2006; Efron, 2008; Bogdan et al., 2008; Scott and Berger, 2010; Martin and Tokdar, 2012). This strategy leads to a nonstandard high-dimensional optimization problem. To solve this problem, we exploit recent advances in convex optimization for objective functions involving composite regularizers (e.g. the fused lasso of Tibshirani et al., 2005). Efron (2008) provides a recent review on multiple testing under the two-groups model, while Tibshirani and Taylor (2011) describe a wide class of composite regularizers which they call "generalized lasso" problems. We recommend these two papers to readers who wish to get a deeper sense of these two areas of the literature.
Many authors have considered the problem of multiple testing when the test statistics have a complicated dependence structure (e.g. Leek and Storey, 2008; Clarke and Hall, 2009). In contrast, FDR smoothing explicitly uses known spatial structure to inform the outcome of each test. It therefore differs significantly from these approaches, whose goal is to yield robust FDRcontrolling methods in the presence of arbitrarily strong dependence among the test statistics.
Specific to fMRI analysis, much effort has gone into improving FDR-based procedures. Perone Pacifico et al. (2004) assume that test statistics are distributed according to a smooth Gaussian random field, and their FDR procedure requires additional approximations for practical computation of the hypothesis sets. Benjamini and Heller (2007) propose a method geared toward so-called Region of Interest (ROI) studies, where a partitioning is provided a priori, mapping the test locations into distinct clusters. They then use a robust variogram to estimate correlation of test statistics within each cluster and assume no dependence across cluster boundaries. Furthermore, they assume that the primary focus of the testing procedure is at the cluster level, with individual location testing a secondary concern.
But as very recent work by Eklund et al. (2016) demonstrates, many common approaches for cluster-level inference are highly non-robust to deviations from the underlying assumption that brain activity can be described using a Gaussian random field. Our approach differs: it requires no such assumptions about Gaussianity, and no a priori clustering. And while it is capable of generating clusters ex post via a partitioning effect that arises naturally from the $\ell_1$ penalty, it is fundamentally concerned with maintaining the nominal FDR for voxel-level inference rather than cluster-level inference. Moreover, our technique generalizes far beyond fMRI studies.
Additionally, Schwartzman et al. (2008) perform a smoothing of the test statistics via local averaging. But their approach operates under fundamentally different assumptions than ours. For example, they assume that a multivariate observation is available at each spatial location, and they place restrictive structural conditions on its density. In the examples we consider, neither assumption is appropriate.
Our work is most directly comparable to two recent methods for spatially aware multiple testing, both of which were inspired by neuroimaging analysis. First, Zhang et al. (2011) proposed the FDR$_L$ procedure, which uses spatially smoothed $p$-values to improve power. Second, Shu et al. (2015) proposed a hidden Markov random field (HMRF) model, which generalizes previous work on one-dimensional spatially dependent multiple testing (Sun and Cai, 2009) to the multidimensional case. We benchmark our approach against these methods in Section 5.
FDR smoothing is also conceptually related to a recent proposal by Scott et al. (2014), called FDR regression. The two procedures share the goal of improving statistical power for the multiple-testing problem by leveraging external information. But they operate in very different domains—regression on covariates for FDR regression, versus spatial smoothing here—and they differ from each other in the same way that any spatial smoothing problem differs from any regression problem. In principle, it is possible to perform spatial smoothing through regression, by introducing a suitable class of basis functions that "featurize" space. But this is rarely done in applied spatial modeling: it is inelegant, computationally burdensome at scale, sensitive to the choice of basis, and unnecessary. In the Appendix, we benchmark FDR smoothing against a version of FDR regression with spatial features, and show that the latter is not competitive, despite the additional difficulties it poses.
2 FDR smoothing: the basic approach
2.1 The two-groups model
The FDR smoothing algorithm builds upon the two-groups model for multiple testing (Berry, 1988; Efron et al., 2001). Suppose that we have test statistics $z_1, \ldots, z_n$ that arise from the mixture
(1) $z_i \sim f(z) = c\, f_1(z) + (1 - c)\, f_0(z),$
where $c \in (0, 1)$ is an unknown mixing fraction, and where $f_0(z)$ and $f_1(z)$ describe the null ($h_i = 0$) and alternative ($h_i = 1$) distributions of the test statistics. In many scientific applications, the $z_i$'s themselves are often the product of a lengthy preprocessing and modeling pipeline. For example, in many fMRI applications (including ours), the $z_i$'s arise from a complicated voxel-level regression model, which we do not detail here (see, e.g. Poldrack et al., 2011).
To summarize inferences, we report for each site $i$ the quantity
(2) $w_i \equiv \mathrm{P}(h_i = 1 \mid z_i) = \frac{c\, f_1(z_i)}{c\, f_1(z_i) + (1 - c)\, f_0(z_i)}.$
This quantity has both a Bayesian and a frequentist interpretation. To a Bayesian, $w_i$ is the posterior probability that $z_i$ is a signal (e.g. Scott and Berger, 2006; Muller et al., 2006). To a frequentist, $1 - w_i$ is a local false discovery rate. The two are related by the following equation, from Efron et al. (2001). Let $S$ index any subset of the test statistics. Define the quantity
(3) $\mathrm{FDR}(S) = \frac{1}{|S|} \sum_{i \in S} (1 - w_i).$
Efron et al. (2001) called this the "Bayesian FDR," and showed that under mild conditions it provides a conservative estimate of frequentist FDR.
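To make the use of the Bayesian FDR concrete, the following sketch selects the largest discovery set whose estimated Bayesian FDR stays below a target level, given posterior signal probabilities; the function name and interface are our own hypothetical choices, not the package's API.

```python
import numpy as np

def bfdr_select(w, level=0.05):
    """Select the largest set of discoveries whose estimated Bayesian FDR
    stays below `level`, given posterior signal probabilities w_i.

    Sorting by descending w, the Bayesian FDR of the top-k set is the
    running mean of (1 - w); we keep the largest k where it is <= level.
    """
    w = np.asarray(w, dtype=float)
    order = np.argsort(-w)                      # most likely signals first
    running_fdr = np.cumsum(1.0 - w[order]) / np.arange(1, w.size + 1)
    ok = np.nonzero(running_fdr <= level)[0]
    selected = np.zeros(w.size, dtype=bool)
    if ok.size:
        selected[order[:ok[-1] + 1]] = True
    return selected
```

Because the running average of $1 - w$ is not monotone, the rule takes the largest passing index, mirroring the step-up logic of frequentist FDR procedures.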
An important modeling choice in the two-groups model is how to specify the null hypothesis $f_0$. Efron (2004) distinguishes two important scenarios: the theoretical null, in which $f_0$ is known; and the empirical null, in which $f_0$ is known only up to its functional form (e.g. Gaussian), and must be estimated jointly with $c$ and $f_1$. As we show in Section 4.1, FDR smoothing can accommodate either scenario in a simple, modular way.
2.2 Spatial smoothing
Formulating the model.
FDR smoothing involves a conceptually simple modification of the two-groups model (1) that leads to a nonstandard high-dimensional optimization problem. In this section, we provide a high-level overview of the model. Technical details are deferred to Section 4.
Let each test statistic $z_i$ be associated with a vertex in an undirected graph $\mathcal{G}$ with edge set $E$. For example, in an fMRI problem, each site is a voxel, and $\mathcal{G}$ is a three-dimensional grid. Suppose that the prior probability $c$ in (1) changes from site to site:
(4) $z_i \sim c_i\, f_1(z_i) + (1 - c_i)\, f_0(z_i),$
(5) $c_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}.$
Thus $e^{\beta_i}$ is the prior odds that site $i$ has a signal. Section 4 describes how $f_0$ and $f_1$ are estimated, but for now we assume that they are fixed.
Let $\beta = (\beta_1, \ldots, \beta_n)^\top$ be the vector of log odds, and let $l(\beta)$ be the negative log likelihood with $f_0$ and $f_1$ fixed:
$l(\beta) = -\sum_{i=1}^n \log\left\{ c_i\, f_1(z_i) + (1 - c_i)\, f_0(z_i) \right\}, \qquad c_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}.$
We estimate $\beta$ by penalizing the likelihood: that is, by minimizing the function $l(\beta) + \lambda P(\beta)$ for some $\lambda > 0$ and penalty function $P(\beta)$. From a Bayesian perspective, this corresponds to the maximum a posteriori (MAP) estimate under the prior distribution $p(\beta) \propto \exp\{-\lambda P(\beta)\}$.
While there are many reasonable choices for $P(\beta)$, in this paper we penalize the unweighted total variation of $\beta$ over the graph $\mathcal{G}$. This leads to the problem
(6) $\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^n} \; l(\beta) + \lambda \sum_{(r, s) \in E} |\beta_r - \beta_s|,$
which enforces spatial smoothness by penalizing differences in log odds across edges in the graph, and which is closely related to intrinsic conditional autoregressive (CAR) priors.
This choice is motivated by an analogy with image segmentation in computer vision, a problem domain in which total-variation penalties like that in (6) have proven successful and computationally efficient (e.g. Rudin et al., 1992). In our setup, $\beta$ represents an "image" of prior log odds, and an "image segment" corresponds to a region of interest where the proportion of signals is elevated versus the background. The key difference is that the object being smoothed ($\beta$) parametrizes a set of weights in a mixture model, rather than a set of means for each pixel in an image. (To see why ordinary image segmentation does not address the multiple-testing problem, see the examples in Section 3.1. Total-variation denoising of the $z$-scores can easily find the region of interest for Example 1, but will fail badly for Example 2, where the signals are not mean-shifted compared to the null.)
Following Tibshirani and Taylor (2011), we rewrite (6) in the following way. Let $m = |E|$ be the size of the edge set, and let $D$ be the oriented incidence matrix of the graph $\mathcal{G}$, which is the $m \times n$ matrix defined as follows. If $(r, s)$ is the $k$th edge in $E$, then the $k$th row of $D$ has a $1$ in position $r$, a $-1$ in position $s$, and a $0$ everywhere else. Thus the vector $D\beta$ encodes the set of pairwise first differences between adjacent sites in the log odds of being a signal, and $\sum_{(r, s) \in E} |\beta_r - \beta_s| = \|D\beta\|_1$. We can therefore express the optimization problem as
(7) $\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^n} \; l(\beta) + \lambda \|D\beta\|_1.$
This clarifies that the penalty is a composition of two functions applied to $\beta$: a linear transformation composed with the $\ell_1$ norm.
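The matrix $D$ is easy to construct explicitly for small graphs. The sketch below is our own illustration (a dense array is used for clarity; realistic problems would use a sparse matrix) and builds $D$ for a short chain graph, confirming that $\|D\beta\|_1$ counts the jumps in $\beta$.

```python
import numpy as np

def oriented_incidence(n_vertices, edges):
    """Build the m x n matrix D whose k-th row has +1 and -1 at the
    endpoints of the k-th edge, so (D @ beta)[k] = beta_r - beta_s."""
    D = np.zeros((len(edges), n_vertices))
    for k, (r, s) in enumerate(edges):
        D[k, r] = 1.0
        D[k, s] = -1.0
    return D

# A 1D chain graph on 4 sites: edges (0,1), (1,2), (2,3).
D = oriented_incidence(4, [(0, 1), (1, 2), (2, 3)])
beta = np.array([1.0, 1.0, 3.0, 3.0])
# ||D beta||_1 penalizes only the single jump between sites 1 and 2.
```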
Using the solution.
The solution to this optimization problem yields an estimate $\hat{\beta}$ for the log odds at all sites. Because the $\ell_1$ penalty encourages sparsity in the first differences of $\beta$ across the edges of the graph, the estimate will partition the nodes of the graph into regions where the log odds are locally constant. Thus the solution to (6) builds spatial structure directly into the estimated prior probabilities in Equations (4) and (5). These site-dependent prior probabilities $\hat{c}_i$ are then used to compute the posterior probability
(8) $w_i = \frac{\hat{c}_i\, f_1(z_i)}{\hat{c}_i\, f_1(z_i) + (1 - \hat{c}_i)\, f_0(z_i)}.$
Just as with the output of the ordinary two-groups model (2), we use these posterior probabilities directly in Equation (3) to find the largest set of discoveries whose estimated Bayesian FDR satisfies the desired level.
3 Examples
In this section we provide a snapshot of FDR smoothing's performance on both a toy problem and a real fMRI data set from an experiment on spatial working memory. In both examples, we compare FDR smoothing against a different multiple testing strategy in order to highlight its key advantages relative to other methods. These examples are meant to be illustrative rather than exhaustive; in Section 5 we report the results of a more extensive set of simulation experiments that benchmark FDR smoothing against various other methods.
3.1 A toy one-dimensional problem
We simulated $z$-scores along a one-dimensional grid of sites from the model $z_i \sim c_i\, f_1(z) + (1 - c_i)\, f_0(z)$, with $c_i = c_1$ inside a 500-site region of interest and $c_i = c_0$ elsewhere, so that $c_1$ is the fraction of sites that are signals in the 500-site region of interest and $c_0$ the fraction of sites that are signals elsewhere. In most practical problems $c_0$ is rarely 0, if only for the reason that outliers due to technical or experimental artifacts cannot be eliminated entirely.
This is a stylized version of a multipletesting problem that might come up in analyzing allele frequencies or DNA methylation fraction across adjacent sites in the genome (e.g. Jaffe et al., 2012). Two specific versions of the model were considered:
 Example 1: a well-separated alternative, with $c_1 = 1$ and $c_0 \approx 0$. This is a relatively easy problem: the signals are well separated from the null hypothesis, the region of interest is pure signal (i.e. 100% of the $z$-scores there are from $f_1$), and there are almost no signals/outliers outside the region of interest. The top left panel of Figure 1 shows an example data set simulated from this model.
 Example 2: an overdispersed, zero-centered alternative, with $c_1 = 0.5$ and a small but nonzero $c_0$. This is a much harder problem than the first example. The signals are overdispersed compared to the null, but also centered at 0; the region of interest is impure signal, since only 50% of the $z$-scores there are from $f_1$; and there are scattered signals outside the region of interest ($c_0 > 0$). In a real example, these scattered signals might be actual genes of interest or simply technical artifacts. The top right panel of Figure 1 shows an example data set simulated from this model.
Figure 1 shows the results of applying both FDR$_L$ and FDR smoothing (FDRS) to these two examples. The middle row of two panels shows the true versus reconstructed prior probabilities from FDR smoothing (Example 1 on the left, Example 2 on the right). The true prior probability $c_i$ as a function of site $i$ is shown as a solid black curve, and the FDR smoothing estimate as a grey curve. For reference, the estimate of the global mixing weight from the ordinary two-groups model is shown as a dashed line. Compared with the global estimate, the FDR smoothing estimate shows a favorable blend of adaptability and stability. For the region of interest (sites 2251–2750), the FDR smoothing estimate of $c_i$ is higher than the global estimate, though not as high as the truth; it is shrunk downwards to the mean. For all other sites, the estimate is lower than the global estimate, though not as low as the truth; it is shrunk upwards to the mean.
In our simulation studies described in Section 5, FDR smoothing consistently exhibits both of these features: it adapts automatically to spatial patterns in the data, but it also shrinks toward the global mixing weight estimated by the two-groups model. The site-level adaptation yields improved power. The shrinkage yields stability, preventing the model from being too aggressive in isolating spurious groups of signals.
This stability turns out to be a very desirable property in light of the evidence in the bottom four panels of Figure 1, which show the realized false discovery rate and true positive rate (TPR) across 150 simulated data sets from each of the two examples (bottom left two panels for Example 1, bottom right two panels for Example 2). For comparison, we also show the realized error rates of the FDR$_L$ procedure from Zhang et al. (2011). Several lessons can be drawn from these panels, especially about an important failure mode of FDR$_L$:
 FDR smoothing stays below the nominal FDR level of 5% on both examples.
 FDR$_L$ stays below the nominal FDR of 5% on Example 1, albeit with higher variance and lower power (TPR) than FDR smoothing.
 FDR$_L$ exceeds the nominal FDR on Example 2, yet still has less power than FDR smoothing.
 Despite these advantages, computation time for FDR smoothing is minimal: roughly 1 second to fit the entire solution path across a grid of $\lambda$ values (see Section 4.4).
As Section 5 will demonstrate more systematically, FDR$_L$ is not robust to situations where the region of interest is not 100% signal, and where there are outliers outside the region of interest. (We will see that the HMRF model of Shu et al. (2015) also has this problem.) FDR smoothing, on the other hand, is very robust: the model performs well even under misspecification (see Section 5.3) and we have only been able to identify one pathological (and easily corrected) scenario where the procedure systematically exceeds the nominal FDR level (see Section 5.4).
3.2 Finding significant regions in fMRI
To show the performance of FDR smoothing in a more realistic scenario, we analyzed data from an fMRI experiment on spatial working memory. The experiment and analysis protocol are described in detail in Appendix A.
The upper left panel of Figure 2 shows an image of $z$-scores, often called the statistical parametric map, arising from the experiment and model. Signals correspond to regions of the brain that exhibit systematically different levels of activity across the two experimental conditions (difficult versus easy spatial working memory tasks). The full 3D image has on the order of a million voxels. The left panel of Figure 2 shows a single horizontal section in which the large $z$-scores exhibit an obvious pattern of spatial clustering. The shapes of these clusters suggest the underlying brain regions associated with the specific cognitive task under study.
The upper-right panel shows the results of applying the Benjamini–Hochberg procedure to these $z$-scores at a 5% false discovery rate. The procedure clearly finds regions of adjacent points that are all significant. However, the edges of these regions are indistinct, and there are many spatially isolated discoveries that presumably represent technical or experimental artifacts. (It is not scientifically plausible that they are real discoveries for a spatial working memory task.)
The middle two panels show two previous approaches in the literature for leveraging spatial structure. The middle left shows the result of applying local averaging to the $z$-scores and rescaling to unit variance, as suggested for fMRI data by Efron (2012); the middle right shows the output of the FDR$_L$ procedure. Both methods are overly aggressive in their reporting, likely yielding several more false discoveries than the specified 5% threshold in the areas away from the top of the image. We conclude this on the basis of two facts: (1) these two methods exhibit exactly these deficiencies on the simulated examples, in Sections 3.1 and 5; and (2) more importantly, the extra regions ostensibly "discovered" by FDR$_L$ and smoothed BH are not scientifically plausible based on what is known about spatial reasoning. Our interpretation is that FDR$_L$ (and smoothed BH, which is very similar) is suffering from some of the non-robustness problems exhibited in Section 3.1.
Now consider the bottom two panels. The bottom left panel shows our procedure's estimated partition of the raw data shown in the upper left panel. Darker greys correspond to signal-dense areas containing a locally higher fraction of significant $z$-scores; lighter areas correspond to signal-sparse areas containing a locally lower fraction of significant $z$-scores. The bottom right panel then shows the final output: the discovered signals at the 5% FDR level. Compared with the other procedures, the image reveals regions of significant signals that are more biologically plausible, detecting regions in the bilateral prefrontal cortex that are commonly associated with working memory function (Owen et al., 2005). This reflects the local adaptivity of FDR smoothing: it loosens the threshold for significance in the apparently interesting regions and tightens it in the uninteresting regions.
The partitions shown in the right panel of Figure 1 and the bottom left panel of Figure 2 are estimated by our specialized adaptation of an edge-detection algorithm used to denoise images of natural scenes. However, we emphasize that FDR smoothing is not simply denoising the $z$-scores. That is, the estimated partition does not merely pick out areas in which the actual $z$-scores (raw pixel values) are locally constant or locally homoscedastic. Rather, it picks out areas in which the unknown true fraction of signals is locally constant. Unlike the pixel values, these local enrichment fractions are not actually observed by the experimenter. This is a significant complication not present in image denoising and is the fundamental statistical problem addressed by our approach.
4 Details of model fitting
In this section, we describe our model-fitting process in much more detail. Section 4.1 describes how the null and alternative densities are estimated. Section 4.2 derives an expectation-maximization algorithm for the FDR-smoothing optimization problem. Section 4.3 presents an efficient algorithm for the convex subproblem that must be solved repeatedly in the M step. Section 4.4 describes a path-based procedure for choosing the regularization parameter $\lambda$.
4.1 Estimating the null and alternative densities
Our overall fitting approach emphasizes modularity. We estimate the null and alternative densities $f_0$ and $f_1$ separately from $\beta$, before solving the FDR-smoothing optimization problem (6). We do so by fitting the ordinary two-groups model from Section 2.1, thereby ignoring the spatial information contained in the graph $\mathcal{G}$. This gives us estimates for $f_0$ and $f_1$, which we then fix and use to estimate $\beta$ using the model described in Section 2.2.
This two-stage procedure may sound ad hoc, but in fact it has a simple justification. Consider the underlying site-level mixture model $z_i \sim c_i\, f_1(z) + (1 - c_i)\, f_0(z)$, where $c_i$ is the prior probability of a signal at site $i$. Under this model, the marginal distribution of the $z_i$'s across sites is a "mixture of mixtures." Thus if $z$ is a randomly selected test statistic from the data set, then marginally
$z \sim \frac{1}{n} \sum_{i=1}^n \left\{ c_i\, f_1(z) + (1 - c_i)\, f_0(z) \right\}.$
This appears at first glance to be a mixture model with $n$ mixture components, one per site. But in fact, because each component is itself a mixture of $f_0$ and $f_1$, the marginal for $z$ is actually a mixture with only two components. These are exactly the components we want to estimate:
(9) $z \sim \bar{c}\, f_1(z) + (1 - \bar{c})\, f_0(z), \qquad \bar{c} = \frac{1}{n} \sum_{i=1}^n c_i,$
where the mixing weight $\bar{c}$ is the average prior probability of a signal across all sites.
This has a striking implication: we can estimate $f_0$ and $f_1$ for our spatially varying model even if we assume that the ordinary two-groups model is true, that is, by ignoring spatial information and using the marginal distribution of test statistics alone. Thus we follow a two-step procedure: estimate $f_0$ and $f_1$ from (9) using standard tools, and then fix these to estimate the spatially varying $c_i$'s. (The first step also recovers $\bar{c}$, although this information is not used directly.)
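A quick Monte Carlo check illustrates the point. We choose site-level priors arbitrarily for illustration (an elevated region plus a low background) along with assumed N(0,1) null and N(2,1) alternative densities; marginally, the fraction of signals matches the average prior, which is exactly the weight a two-groups fit to the pooled statistics would target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Hypothetical site-level priors: an elevated region plus a low background.
c = np.full(n, 0.02)
c[: n // 10] = 0.60
h = rng.random(n) < c                      # latent signal indicators h_i
z = np.where(h, rng.normal(2.0, 1.0, n),   # assumed alternative f1 = N(2, 1)
                rng.normal(0.0, 1.0, n))   # assumed null f0 = N(0, 1)
c_bar = c.mean()
# Marginally, z is a two-component mixture with signal weight c_bar,
# even though the priors c_i vary strongly from site to site.
```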
We now describe our use of standard tools to estimate $f_0$ and $f_1$ directly from (9). An open question is whether using spatial information would sharpen these estimates; this is plausible, but we leave it for future work.
Estimating $f_0$.
In some cases, $f_0$ is taken directly from the distributional theory of the test statistic in question (e.g. standard Gaussian) and therefore need not be estimated at all. For such problems, where a theoretical null describes the data well, this step can be skipped.
However, as Efron (2004) argues, in many multiple-testing problems the data are poorly described by the theoretical null. In such cases, an empirical null hypothesis with known parametric form but unknown parameters must be estimated in order to produce reasonable results. For the problems considered in this paper, the null hypothesis is assumed to be a Gaussian with unknown mean and variance: $f_0 = N(\mu_0, \sigma_0^2)$. But in principle the same matching technique can be used in any exponential family.
To estimate $\mu_0$ and $\sigma_0$, we apply the central-matching method of Efron (2004), which uses the shape of the histogram of test statistics near zero (which come mostly or exclusively from the null distribution). Specifically, let $\mathcal{Z}$ be the subset comprising some central fraction (we use 1/3) of the $z$-scores. Central matching proceeds in three steps:
1. Construct a smooth estimate $\hat{\phi}(z)$ of the log density of the test statistics in $\mathcal{Z}$, and let $z_{\max}$ be the point where $\hat{\phi}$ attains its maximum.
2. Form a second-order Taylor approximation of $\hat{\phi}$ about its maximum:
$\hat{\phi}(z) \approx \hat{\phi}(z_{\max}) + \frac{1}{2}\, \hat{\phi}''(z_{\max})\, (z - z_{\max})^2.$
3. The quadratic approximation on the log scale corresponds to a Gaussian on the original scale. Therefore set the mean and standard deviation of the empirical null using the location and curvature of $\hat{\phi}$:
$\hat{\mu}_0 = z_{\max}, \qquad \hat{\sigma}_0^2 = -\frac{1}{\hat{\phi}''(z_{\max})}.$
For an alternative approach to estimating an empirical null, see Martin and Tokdar (2012).
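The three central-matching steps can be sketched as follows, using a histogram-based log-density estimate and a quadratic polynomial fit; the window, bin count, and fitting details here are our own simplifications of Efron's procedure, not the implementation in our package.

```python
import numpy as np

def central_matching(z, central_frac=1 / 3.0):
    """Estimate an empirical null N(mu0, sigma0^2) from the center of the
    z-score histogram via a quadratic fit on the log scale.

    A quadratic log density a + b*z + c*z^2 (with c < 0) corresponds to a
    Gaussian with mean -b/(2c) and variance -1/(2c).
    """
    z = np.asarray(z, dtype=float)
    lo, hi = np.quantile(z, [0.5 - central_frac / 2, 0.5 + central_frac / 2])
    counts, edges = np.histogram(z[(z >= lo) & (z <= hi)], bins=25)
    mids = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0                       # avoid log(0) in sparse bins
    c2, c1, _ = np.polyfit(mids[keep], np.log(counts[keep]), deg=2)
    mu0 = -c1 / (2.0 * c2)
    sigma0 = np.sqrt(-1.0 / (2.0 * c2))
    return mu0, sigma0
```

On pure-null data the recovered mean and standard deviation should track the generating parameters, up to histogram noise.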
Estimating $f_1$.
Having fixed an estimate for $f_0$, the alternative density $f_1$ can be estimated by any of several existing methods for one-dimensional Gaussian deconvolution, including finite mixture models or Dirichlet-process models (Do et al., 2005). We use and recommend the predictive-recursion algorithm of Newton (2002) because it is fast, flexible, and enjoys strong guarantees of accuracy (see Tokdar et al., 2009). Predictive recursion generates a nonparametric estimate for the marginal density under the alternative after a small number of passes through the data. (In our examples we use 50 passes, although in our experience 10 passes is virtually always sufficient to yield stable estimates.) For further details, see Martin and Tokdar (2012); for pseudocode, see Scott et al. (2014).
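As a rough illustration of the flavor of predictive recursion (not the exact variant we use; see Martin and Tokdar (2012) for the real algorithm), the sketch below maintains a null weight and a gridded mixing density over alternative means, assuming a standard normal null and a N(theta, 1) likelihood; the step-size schedule and grid are arbitrary illustrative choices.

```python
import numpy as np

def norm_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def predictive_recursion(z, grid, n_sweeps=10, pi0=0.95):
    """Sketch of a two-groups predictive recursion: blend the current null
    weight pi0 and mixing density g with their posterior contributions,
    one observation at a time, with a decaying step size."""
    dx = grid[1] - grid[0]                     # uniform grid spacing
    g = np.full(grid.size, (1.0 - pi0) / (grid[-1] - grid[0]))
    step = 0
    rng = np.random.default_rng(0)
    for _ in range(n_sweeps):
        for zi in rng.permutation(z):
            step += 1
            gamma = (step + 1.0) ** -0.67      # decaying step size
            joint = norm_pdf(zi, grid) * g     # alternative part of predictive
            m = pi0 * norm_pdf(zi, 0.0) + joint.sum() * dx
            pi0 = (1.0 - gamma) * pi0 + gamma * pi0 * norm_pdf(zi, 0.0) / m
            g = (1.0 - gamma) * g + gamma * joint / m
    return pi0, g
```

After a few sweeps over data with signals centered at a positive mean, the estimated mixing density should place more mass in the positive tail than the negative one.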
4.2 An expectation-maximization algorithm
We now turn to the details of solving the optimization problem in (6). This problem is hard for at least two reasons: the likelihood term is non-convex, and the penalty term is non-separable in $\beta$. We are not aware of any algorithm that is guaranteed to find the global minimum efficiently, even for fixed $\lambda$, and the method we describe below finds only a local minimum. Nonetheless, this paper marshals evidence that the local solutions actually found by our algorithm yield good reconstructions of underlying spatial patterns and better power than existing FDR-controlling methods.
We handle the likelihood term with a simple data-augmentation step that leads to an expectation-maximization (EM) algorithm. For now, we assume that $\lambda$ is fixed; we describe our method for choosing this hyperparameter in Section 4.4. Introduce binary latent variables $h_i \in \{0, 1\}$ such that
$\mathrm{P}(h_i = 1) = c_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}, \qquad (z_i \mid h_i = 1) \sim f_1, \qquad (z_i \mid h_i = 0) \sim f_0.$
Marginalizing out the $h_i$ clearly gives us the original model (4). Treating $h = (h_1, \ldots, h_n)$ as fixed gives the complete-data negative log likelihood (up to terms not involving $\beta$):
(10) $l(\beta; h) = -\sum_{i=1}^n \left\{ h_i \beta_i - \log(1 + e^{\beta_i}) \right\}.$
With $h$ fixed, this is a convex function of $\beta$ and is equivalent to the negative log likelihood of a logistic-regression model with identity design matrix.
Therefore, a stationary point of (7) may be found via a conceptually simple EM algorithm. Suppose that the $t$th-step estimate for the underlying image of log odds is $\beta^{(t)}$. In the E step, we compute $q_i = \mathrm{E}(h_i \mid z_i, \beta^{(t)})$. Because the complete-data log likelihood is separable and linear in the $h_i$, we simply plug the conditional expected value for $h_i$, given the current guess for $\beta$, into (10). Since $h_i$ is a binary random variable, this is just the conditional probability that $h_i = 1$:
(11) $q_i = \frac{c_i^{(t)}\, f_1(z_i)}{c_i^{(t)}\, f_1(z_i) + (1 - c_i^{(t)})\, f_0(z_i)},$
where $c_i^{(t)}$ is the prior probability that site $i$ produces a signal, given by the inverse-logit transform (5) of the current estimate $\beta_i^{(t)}$.
In the M step, we maximize the complete-data log likelihood. This requires solving the convex subproblem
(12) $\beta^{(t+1)} = \arg\min_{\beta \in \mathbb{R}^n} \; -\sum_{i=1}^n \left\{ q_i \beta_i - \log(1 + e^{\beta_i}) \right\} + \lambda \|D\beta\|_1,$
where the $q_i$ are the complete-data sufficient statistics.
To solve this subproblem, we expand $l(\beta; q)$ in a second-order Taylor approximation at the current iterate $\beta_0$. This turns the M step into a weighted least-squares problem with a generalized-lasso penalty. Thus, up to a constant term not depending on $\beta$, the intermediate problem to be solved is
(13) $\mathrm{minimize}_{\beta} \;\; \nabla l(\beta_0)^\top (\beta - \beta_0) + \frac{1}{2} (\beta - \beta_0)^\top H(\beta_0) (\beta - \beta_0) + \lambda \|D\beta\|_1,$
where $\nabla l(\beta_0)$ and $H(\beta_0)$ are the gradient and Hessian with respect to the first argument of the complete-data negative log likelihood $l(\beta; q)$, evaluated at the current iterate $\beta_0$. These are simple to evaluate:
$[\nabla l(\beta_0)]_i = c_i^0 - q_i, \qquad H_{ii}(\beta_0) = c_i^0 (1 - c_i^0), \qquad c_i^0 = \frac{e^{\beta_{0i}}}{1 + e^{\beta_{0i}}}.$
The Hessian matrix is diagonal because the log likelihood is separable in $\beta$. Ignoring terms that are constant in $\beta$, the solution to (13) can be expressed as the solution of a penalized, weighted least-squares problem:
(14) $\mathrm{minimize}_{\beta} \;\; \frac{1}{2} \sum_{i=1}^n \omega_i (y_i - \beta_i)^2 + \lambda \|D\beta\|_1,$
with working responses and weights given by
$y_i = \beta_{0i} + \frac{q_i - c_i^0}{c_i^0 (1 - c_i^0)}, \qquad \omega_i = c_i^0 (1 - c_i^0).$
Recall that $\beta_0$ is the point at which the Taylor expansion for the complete-data log likelihood is computed. In our EM algorithm, this is the current estimate $\beta^{(t)}$.
Thus the overall steps of the algorithm can be summarized as follows.
 1) E step: compute the posterior signal probabilities $q_i$ in (11) under the current iterate $\beta^{(t)}$.
 2) Quadratic approximation: expand the complete-data negative log likelihood in a second-order Taylor series about the current iterate $\beta^{(t)}$, thereby forming the "quadratic plus $\ell_1$ penalty" surrogate subproblem in (14).
 3) Penalized weighted least squares: solve (14) to obtain the next iterate.
In principle, a full M step requires that steps 2 and 3 be iterated until local convergence after each E step. In practice, we take a partial M step by iterating steps 2 and 3 only once. This speeds the algorithm up: step 3 is by far the most computationally expensive, and we want it to be using sufficient statistics that are as up-to-date as possible. Moreover, as long as the complete-data objective function is improved at each step, the resulting sequence of iterates still converges to a stationary point of (7).
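The E step and the construction of the surrogate data for (14) can be sketched directly from the formulas above. The N(0,1) null and N(2,1) alternative below are illustrative stand-ins, and the graph-TV solve of step 3 is left to a solver such as the one described in Section 4.3.

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def norm_pdf(x, mu, sd=1.0):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def e_step_and_surrogate(z, beta, f0, f1):
    """One E step plus the quadratic surrogate data for the M step.

    Returns the posterior signal probabilities q, and the working
    responses y and weights omega of the penalized weighted
    least-squares subproblem that a graph-TV solver would consume."""
    c = sigmoid(beta)                                # prior probabilities, eq. (5)
    q = c * f1(z) / (c * f1(z) + (1.0 - c) * f0(z))  # E step, eq. (11)
    omega = c * (1.0 - c)                            # diagonal Hessian weights
    y = beta + (q - c) / omega                       # working responses
    return q, y, omega

# Illustrative densities: standard normal null, N(2, 1) alternative.
f0 = lambda z: norm_pdf(z, 0.0)
f1 = lambda z: norm_pdf(z, 2.0)
```

A large positive $z$-score pulls its working response upward (higher log odds of signal), while a near-zero score pulls it downward, which is exactly the behavior the TV penalty then smooths across the graph.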
4.3 Solving the M step via graph-based TV denoising
The most computationally expensive part of the FDR smoothing algorithm is the need to repeatedly solve the graph-fused lasso (GFL) problem in (14). Even though the GFL is convex, the massive size and graph structure of fMRI scans make classical techniques inefficient. Solving (14) therefore requires a highly efficient, scalable method.
Many special cases of this problem have been studied in the literature, each with highly efficient, specialized solutions. For example, when $\mathcal{G}$ is a one-dimensional chain graph, (14) is the ordinary (1D) fused lasso (Tibshirani et al., 2005), solvable in linear time via dynamic programming (Johnson, 2013). When $\mathcal{G}$ is a $d$-dimensional grid graph, (14) is often referred to as total-variation denoising (Rudin et al., 1992), for which several efficient solutions have been proposed (Chambolle and Darbon, 2009; Barbero and Sra, 2011, 2014). In contrast to these methods, we next develop a method to solve the GFL over a general graph.
The core idea of our algorithm is to decompose a graph into a set of trails. We note two preliminaries. First, every graph has an even number of odd-degree vertices (West, 2001). Second, if $\mathcal{G}$ is not connected, then the objective function is separable across the connected components of $\mathcal{G}$, each of which can be solved independently. Therefore, for the rest of the section we assume that the edge set $E$ forms a connected graph.
We also remind the reader of some basic terminology from graph theory. A walk is a sequence of vertices in which each consecutive pair is joined by an edge. A trail is a walk in which all the edges are distinct. An Eulerian trail is a trail that visits every edge in a graph exactly once. A tour (also called a circuit) is a trail that begins and ends at the same vertex. An Eulerian tour is a circuit in which every edge in the graph is visited exactly once.
The following theorem from West (2001) states that any connected graph can be decomposed into a set of trails T on which our optimization algorithm can operate.
Theorem 1.
(West, 2001, Thm 1.2.33). The edges of a connected graph with exactly 2k odd-degree vertices can be partitioned into k trails if k > 0. If k = 0, there is an Eulerian tour. Furthermore, the minimum number of trails that can partition the graph is max(k, 1).
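The proof of this theorem is constructive: pair up the odd-degree vertices with dummy edges, extract an Eulerian tour of the augmented graph, and cut the tour wherever a dummy edge appears. A minimal sketch of that construction (our own illustration, not the paper's reference implementation):

```python
from collections import defaultdict

def decompose_trails(edges):
    """Partition the edges of a connected graph into trails.

    Pair up the odd-degree vertices with dummy edges, extract an
    Eulerian tour of the augmented graph (Hierholzer's algorithm),
    then split the tour wherever a dummy edge is traversed.
    """
    adj = defaultdict(list)                 # vertex -> [(neighbor, edge id)]
    for eid, (a, b) in enumerate(edges):
        adj[a].append((b, eid))
        adj[b].append((a, eid))
    n_real = len(edges)
    odd = [v for v in adj if len(adj[v]) % 2 == 1]
    eid = n_real
    for a, b in zip(odd[::2], odd[1::2]):   # the theorem says there are 2k of these
        adj[a].append((b, eid))
        adj[b].append((a, eid))
        eid += 1
    used, ptr = [False] * eid, defaultdict(int)
    start = next(iter(adj))
    stack, tour = [(start, None)], []       # (vertex, edge used to reach it)
    while stack:
        v, _ = stack[-1]
        while ptr[v] < len(adj[v]) and used[adj[v][ptr[v]][1]]:
            ptr[v] += 1                     # skip edges consumed from the far end
        if ptr[v] == len(adj[v]):
            tour.append(stack.pop())        # vertex exhausted: emit it
        else:
            w, e = adj[v][ptr[v]]
            used[e] = True
            stack.append((w, e))
    tour.reverse()
    segs, cur = [], [tour[0][0]]
    for v, e in tour[1:]:
        if e is not None and e >= n_real:   # dummy edge: cut the tour here
            segs.append(cur)
            cur = [v]
        else:
            cur.append(v)
    if segs:                                # the circuit wraps around the start vertex
        segs[0] = cur + segs[0][1:]
    else:
        segs = [cur]
    return [s for s in segs if len(s) > 1]
```

For a star with four leaves (four odd-degree vertices, so k = 2), this yields two trails that together cover every edge exactly once.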
This theorem enables us to rewrite the penalty term in (14) as a summation over trails τ ∈ T, where each trail contains the list of (start, end) vertices for each edge in its walk:
(15)  λ ∑_{(r,s) ∈ E} |β_r − β_s| = λ ∑_{τ ∈ T} ∑_{(r,s) ∈ τ} |β_r − β_s|
Denoting the weighted squared-loss portion of the objective as l(β), slack variables z_r^(τ) are then introduced for each β_r appearing in the penalty term, which results in the following equivalent problem:
(16)  minimize_{β, z}  l(β) + λ ∑_{τ ∈ T} ∑_{(r,s) ∈ τ} |z_r^(τ) − z_s^(τ)|
subject to  z_r^(τ) = β_r and z_s^(τ) = β_s,
where the constraints hold for all pairs (r, s) ∈ τ, for all τ ∈ T. We can then solve this problem efficiently via an ADMM routine (Boyd et al., 2011) with the following updates:
(17)  β^{k+1} = argmin_β  l(β) + (α/2) ‖Aβ − z^k + u^k‖₂²
(18)  z^{k+1} = argmin_z  λ ∑_{τ ∈ T} ∑_{(r,s) ∈ τ} |z_r^(τ) − z_s^(τ)| + (α/2) ‖Aβ^{k+1} − z + u^k‖₂²
(19)  u^{k+1} = u^k + Aβ^{k+1} − z^{k+1}
where u is the scaled dual variable, α is the scalar ADMM penalty parameter, z is the stacked vector of slack variables, and A is a sparse binary matrix used to encode the appropriate β_r for each z_r^(τ). We point out that solving (18) corresponds to solving a weighted one-dimensional fused lasso problem on each trail, which can be done in linear time via an efficient dynamic programming routine (Johnson, 2013; Arnold et al., 2014). We iterate the updates in (17)–(19) in order until the dual and primal residuals have sufficiently small norms.
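These updates can be sketched in a few dozen lines. The sketch below uses an unweighted squared-error loss for simplicity and substitutes a simple projected-gradient (Chambolle-style dual) solver for the 1-D TV prox in place of the dynamic-programming fused-lasso routine of Johnson (2013); the function names are our own, and this is an illustration rather than the released implementation:

```python
import numpy as np

def tv_prox(y, lam, n_iter=1500):
    # prox of lam * TV on a chain: min_x 0.5||x - y||^2 + lam * sum|x[i+1] - x[i]|,
    # solved by projected gradient on the dual; step 0.25 <= 1/||D D^T||.
    y = np.asarray(y, dtype=float)
    if len(y) < 2 or lam <= 0:
        return y.copy()
    w = np.zeros(len(y) - 1)
    for _ in range(n_iter):
        x = y - (np.r_[0.0, w] - np.r_[w, 0.0])            # x = y - D^T w
        w = np.clip(w + 0.25 * np.diff(x), -lam, lam)      # projected ascent
    return y - (np.r_[0.0, w] - np.r_[w, 0.0])

def gfl_admm(y, trails, lam, alpha=2.0, n_iter=300):
    # ADMM for the graph-fused lasso with (unweighted) squared-error loss,
    # following the structure of updates (17)-(19) in the text.
    y = np.asarray(y, dtype=float)
    trails = [np.asarray(t) for t in trails]
    m = np.zeros(len(y))                    # slack copies per node: diag(A^T A)
    for t in trails:
        np.add.at(m, t, 1.0)
    z = [y[t].copy() for t in trails]
    u = [np.zeros(len(t)) for t in trails]
    beta = y.copy()
    for _ in range(n_iter):
        acc = np.zeros(len(y))              # accumulate A^T (z - u)
        for zt, ut, t in zip(z, u, trails):
            np.add.at(acc, t, zt - ut)
        beta = (y + alpha * acc) / (1.0 + alpha * m)       # closed-form (17)
        for i, t in enumerate(trails):
            z[i] = tv_prox(beta[t] + u[i], lam / alpha)    # (18): 1-D TV per trail
            u[i] += beta[t] - z[i]                         # (19): dual update
    return beta
```

On a chain graph (a single trail), the ADMM fixed point coincides with ordinary 1-D TV denoising, which provides a convenient correctness check.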
This trail-based GFL algorithm is highly efficient and can solve the GFL for an fMRI scan locally on a single machine in seconds. In our preliminary experiments, we found the method to be substantially more efficient than any other general GFL method, and nearly as efficient as the specialized methods mentioned earlier for multidimensional grid graphs. The key update steps in our ADMM algorithm are also embarrassingly parallel: each coordinate of β in (17) and each trail subproblem in (18) can be solved independently of the other nodes and trails, respectively. For even more massive problems, where the data may not fit in memory, our algorithm will continue to scale well due to its distributed nature. Finally, we note that our algorithm extends to other smooth, convex loss functions l(β), though we do not explore such extensions here. A more detailed presentation of this algorithm is available as a supplemental file, “A fast and flexible algorithm for the graph-fused lasso.”
4.4 Choosing the regularization parameter
Once the null and alternative densities have been estimated, the only remaining tuning parameter in FDR smoothing is λ, the amount of regularization applied to the vector of first differences in log odds across the edges in G. We now describe our method for choosing λ in a data-adaptive way.
Figure 4 illustrates the importance of choosing an appropriate value of λ. The top left panel depicts an underlying image of prior probabilities. We used this image to generate z-scores according to the spatially varying two-groups model in equations (4) and (5), under a specific choice of the alternative density. (For details, see the “small signal” experiment in Section 5.) Choosing λ too small, as in the bottom left panel, produces a grainy reconstruction that overfits the data. Choosing λ too large, as in the bottom right panel, results in oversmoothing and the loss of interesting spatial structure. Our procedure yields the choice of λ shown in the top right panel. The true regions are recovered with reasonable accuracy, and the graininess of the bottom left panel is avoided.
We avoid having to hand-tune λ in an ad-hoc fashion by adopting the following approach, based on the same solution-path idea that is often used to set λ in ℓ1-penalized problems (e.g. Tibshirani and Taylor, 2011).

1. Calculate the FDR-smoothing solution across a decreasing grid of regularization parameters λ_1 > λ_2 > ⋯ > λ_K, using the solution for λ_{k−1} as a warm start when computing the λ_k solution.
2. For each solution corresponding to point λ_k on the grid, calculate a relative quality measure.
3. Choose λ to be the point in the grid where the quality measure is smallest.
The choice of quality measure should enforce a compromise between the fit and complexity of the reconstructed image. Perhaps the two most common approaches are AIC and BIC. Let df(λ) be the degrees of freedom of the estimator and l̂(λ) the maximized value of the log likelihood. Then, up to constants not depending on λ,
(20)  AIC(λ) = −2 l̂(λ) + 2 df(λ)
(21)  BIC(λ) = −2 l̂(λ) + log(n) df(λ)
For simple one-dimensional problems under squared-error loss, calculating the degrees of freedom of the generalized lasso equates to counting the number of change points along the axis. The two-dimensional extension of this result appeals to Stein’s lemma, and involves counting the number of distinct contiguous two-dimensional regions, or plateaus, in the estimate (Tibshirani and Taylor, 2012); these results extend analogously to arbitrary graphs, where a plateau equates to a connected subgraph. For example, the true prior in the upper left panel of Figure 4 has three plateaus: the two darker squares, and the white background.
Unfortunately, this remarkable result on the degrees of freedom of the generalized lasso applies only to problems with a squared-error loss. We are aware of no analogous results in more complicated situations involving mixture models such as ours. Therefore, we cannot plug the true degrees of freedom into AIC and BIC, because it is not known. In the absence of a better alternative, we use the number of plateaus as a surrogate for the degrees of freedom. This is a heuristic solution, but one that seems to yield good performance in practice. The upshot is that if a good estimator for the true degrees of freedom could be found, a smarter λ could likely be chosen automatically, and our overall method could be improved.
Figure 5 shows a typical solution-path trace for the log likelihood, surrogate degrees of freedom, AIC, and BIC. In FDR smoothing problems, the number of plateaus is typically much smaller than the number of data points, and the penalty that AIC places on the degrees of freedom is dominated by the log likelihood. As a result, AIC is a disaster in practice, producing images that are far too grainy. On the other hand, BIC achieves a much better balance across a range of problems, and we recommend it as the criterion for choosing λ.
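As a toy illustration of the selection rule (with made-up path numbers, not data from the paper), the criteria in (20) and (21) can be computed over a λ grid and minimized; note how AIC's weaker complexity penalty pulls it toward a smaller, grainier λ:

```python
import numpy as np

# Hypothetical solution path: as lambda decreases, the log likelihood
# rises while the surrogate df (plateau count) grows.
lambdas = np.array([2.0, 1.0, 0.5, 0.25, 0.1])
loglik  = np.array([-520.0, -480.0, -455.0, -415.0, -410.0])
df      = np.array([1, 3, 5, 40, 200])
n = 1000                                     # number of test sites

aic = -2.0 * loglik + 2.0 * df               # equation (20)
bic = -2.0 * loglik + np.log(n) * df         # equation (21)
lam_aic = lambdas[np.argmin(aic)]            # AIC under-penalizes complexity
lam_bic = lambdas[np.argmin(bic)]            # BIC selects a smoother solution
```

With these illustrative numbers, BIC selects a larger λ than AIC, mirroring the behavior described above.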
An additional practical complication is that it is nontrivial to compute the number of plateaus efficiently for large-scale problems. The naïve approach of counting the number of distinct values in the estimate can fail badly if the fit has multiple spatially separated plateaus with the same estimated prior probability (up to the precision of the ADMM convergence criterion). Pseudocode for our plateau-counting method is provided in the appendix.
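A simple way to count plateaus that avoids the distinct-value pitfall is a union-find pass over the graph edges, merging endpoints whose fitted values agree to within a numerical tolerance. This is a sketch of the idea; the paper's exact pseudocode is in its appendix:

```python
def count_plateaus(beta, edges, tol=1e-4):
    """Count connected regions of (numerically) constant fitted values.

    Spatially separated plateaus that happen to share the same value are
    counted separately, unlike naive distinct-value counting.
    """
    parent = list(range(len(beta)))

    def find(i):
        # find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        if abs(beta[a] - beta[b]) < tol:   # endpoints belong to one plateau
            parent[find(a)] = find(b)
    return len({find(i) for i in range(len(beta))})
```

On the chain 1, 1, 2, 2, 1, 1 this returns three plateaus even though there are only two distinct values.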
5 Simulation experiments
5.1 Setup and protocol
To demonstrate the effectiveness of FDR smoothing, we conducted simulation experiments across eight different scenarios defined by a full cross of three factors:

- two different configurations of the site-specific prior signal probability c_s in a region of heightened signal probability. In one configuration, the true prior probability in this region is saturated at c_s = 1; that is, every site in the signal region is a signal. In the other configuration, the signal region is more mixed, with 0 < c_s < 1.
- two different configurations of c_s in the “background” region. We consider one case where the background is purely null, c_s = 0, and another where the background contains some signal, with c_s small but positive. Figure 7 (top left panel) shows an example of the site-specific prior probabilities.
- two different choices for f_1, the distribution of test statistics under the alternative hypothesis. Each is defined as the Gaussian convolution of some “noiseless” signal distribution. We consider a “well-separated” alternative, which has its modes far from the mode of the null distribution, and a “poorly-separated” alternative, which has the same mean as the null but with fatter tails.
In all eight scenarios, the spatial structure was a two-dimensional grid graph, as in the fMRI example. We simulated 30 data sets in each scenario and set the desired false discovery rate at 10%.
For each data set, we simulated z-scores as follows. Let c be the true image of prior probabilities, let g be the true (noiseless) distribution of signals, and let δ_0 be a Dirac measure at zero. Then z_s is drawn from the mixture model
m_s ∼ c_s g + (1 − c_s) δ_0 ,  z_s | m_s ∼ N(m_s, 1).
The null hypothesis is that m_s = 0, in which case z_s ∼ N(0, 1). The alternative hypothesis is that m_s ≠ 0, in which case z_s is drawn from the Gaussian convolution g ⋆ N(0, 1).
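Drawing from this model is a two-stage procedure: a Bernoulli(c_s) draw selects the mixture component at each site, and the z-score is the component mean plus standard normal noise. A toy sketch on a small grid (the region geometry, prior values, and signal distribution here are illustrative, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Site-specific prior signal probabilities: a low-probability background
# with one elevated square region (illustrative values).
c = np.full((32, 32), 0.05)
c[8:16, 8:16] = 0.95

# Stage 1: Bernoulli(c_s) selects signal vs. null at each site.
signal = rng.random(c.shape) < c

# Stage 2: signal sites get a mean drawn from a toy "noiseless" signal
# distribution g (point masses at +/-2 here); nulls have mean zero.
mu = np.where(signal, rng.choice([-2.0, 2.0], size=c.shape), 0.0)

# Gaussian convolution: z ~ N(mu, 1) at every site.
z = mu + rng.standard_normal(c.shape)
```

The resulting field of z-scores has the elevated-prior square densely populated with signals and only sporadic signals in the background.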
Figure 6 shows the true f_1 (orange curve) in the two alternative-hypothesis scenarios; the nonparametric estimates of f_1 are shown as thin gray lines. We show all 120 estimates from our benchmarks to convey a sense of the variance of the predictive recursion procedure across a wide array of scenarios.
We compare our approach against three other techniques for multiple testing: 1) the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995); 2) the FDR_L procedure (Zhang et al., 2011); and 3) the hidden Markov random field (HMRF) method of Shu et al. (2015). The first method is well known; we briefly explain the second and third. FDR_L estimates the false discovery rate from locally smoothed p-values, obtained by taking the median of the p-values in a local neighborhood on the graph. The HMRF method models the dependence of the site-specific priors using an Ising model. The model is fit via an EM routine that relies on Gibbs sampling to compute the E step. Finally, to establish a baseline, we also present the results of an “oracle” two-groups model in which both the true f_1 and the true underlying vector of prior probabilities are assumed known. The oracle represents the theoretical limit of performance of the two-groups model. Details on how we ran FDR_L and the HMRF model are provided in Appendix C.
We also considered two other spatial smoothing approaches, but ultimately did not include them as benchmarks. The first method, FDR regression (Scott et al., 2014), requires introducing a large set of basis functions to model spatial dependence. (It also requires that the underlying graph be embedded in a metric space.) We did benchmark against FDR regression, but the resulting comparison added little to the overall presentation and insight: FDR smoothing soundly outperformed FDR regression in all of our comparisons, and we therefore report those results in Appendix D for completeness rather than in the main paper. The second method (Sun et al., 2015) was determined to be unsuitable for inclusion for several reasons: 1) it focuses on continuous spaces with a valid distance metric, whereas our space has only an adjacency structure; 2) it relies on testing an interval null, which is not necessarily the same as testing a point null (Berger and Delampady, 1987); and 3) its implementation relies on Gaussian processes. In practice, their code was not able to scale even to our synthetic data set on a grid, and would certainly be unable to scale to a full fMRI data set of 750,000 voxels. We also note that although this benchmark example focuses on a single plateau of elevated prior probability, we conducted a suite of other experiments with varying plateau sizes, counts, and prior probabilities; we found that FDR smoothing performs similarly in all of these cases, and we thus chose to focus on the simplest experiment that still highlights the key differences among the available methods.
5.2 Results and interpretation
The benchmark results (Table 1) reveal a failure of both FDR_L and the HMRF: they are not robust to regions of interest that are not saturated (i.e. where fewer than 100% of the test statistics are drawn from f_1). Consequently, both methods substantially exceed the 10% false discovery rate threshold (bottom subtable; columns 3, 4, 7, and 8). FDR smoothing, in contrast, obeys the FDR threshold in all scenarios.
From a power perspective, FDR smoothing also provides clear advantages. In six of the eight examples, FDR smoothing has the highest true positive rate (TPR) among all methods that did not violate the FDR threshold. Furthermore, in each of these scenarios the FDR smoothing method was within 2–4% of the oracle, leaving little room for increased power. The only scenario in which FDR smoothing is slightly underpowered is when the signals are poorly separated and the signal regions are saturated. To better understand why the HMRF performed well in this scenario, we give a typical example of the results in a mixed-signal-region scenario in Figure 7.
As shown in the upper right panel of Figure 7, the signals are clearly mixed within the signal region. Nevertheless, the HMRF model essentially estimates the entire signal region to be pure signal. This is a typical result in both the mixed and the saturated signal-region scenarios. The difference is that under the saturated regime, estimating the entire signal region as 100% signal happens to be the correct strategy. The implication is that HMRFs are only preferable in the special case where the data contain plateaus of saturated signal with minimal noise in the other regions. However, in such cases, it seems likely that one could segment the data more effectively by leveraging this prior knowledge directly. For additional details, see Appendix E.
True positive rate (TPR)  
Well-Separated  Poorly-Separated  
Signal Region  Saturated  Mixed  Saturated  Mixed  
Background  Pure  Noisy  Pure  Noisy  Pure  Noisy  Pure  Noisy 
BH  0.500  0.516  0.416  0.451  0.415  0.428  0.370  0.388 
FDRL  0.920  0.810  0.461  0.375  0.747  0.661  0.289  0.235 
HMRF  0.990  0.924  0.996  0.795  0.989  0.910  0.926  0.553 
FDRS  0.999  0.925  0.678  0.597  0.776  0.686  0.510  0.460 
Oracle  1.000  0.945  0.696  0.607  1.000  0.929  0.553  0.495 


False discovery rate (FDR)  
Well-Separated  Poorly-Separated  
Signal Region  Saturated  Mixed  Saturated  Mixed  
Background  Pure  Noisy  Pure  Noisy  Pure  Noisy  Pure  Noisy 
BH  0.079  0.073  0.089  0.085  0.078  0.074  0.087  0.085 
FDRL  0.102  0.150  0.399  0.441  0.103  0.135  0.384  0.421 
HMRF  0.097  0.054  0.550  0.452  0.102  0.047  0.481  0.254 
FDRS  0.059  0.059  0.099  0.100  0.002  0.009  0.079  0.076 
Oracle  0.100  0.100  0.100  0.100  0.100  0.100  0.100  0.100 
The FDR_L procedure assumes that p-values are spatially dependent, with adjacent pixels having similarly distributed p-values. Given this view of the problem, FDR_L aims to increase power by taking the median p-value over each test site and its neighbors. It then corrects for multiplicity by appealing to facts about the order statistics of the uniform distribution. This modeling assumption is fundamentally different from that of FDR smoothing, which assumes that the dependence lies in the prior weights of the two-groups model. As can be seen in the bottom left panel of Figure 7, the consequence of the FDR_L approach is that discoveries tend to clump together into small local clusters. This results in overestimating the number of discoveries in high-signal-density regions, as well as sensitivity to outlier neighborhoods.
Overall, FDR smoothing strikes an appealing balance of reliability, power, and speed, especially relative to the other available methods. Across a variety of experimental settings, FDR smoothing strictly adheres to the FDR threshold, providing important reassurance to the scientist. Finally, all of our FDR smoothing scenarios could easily have been run on a laptop. Running the full solution path for an example like the one shown in Figure 7 takes 14 minutes on a single node in our cluster environment, with a third of that time spent performing 50 sweeps of predictive recursion; in practice, 5 sweeps is generally sufficient and would speed up FDR smoothing even more. For comparison, the HMRF model takes over two days to run on the same example.
5.3 Robustness to Misspecification
The two-groups model we rely on makes several assumptions about the distribution of the observed z-scores. We investigate the robustness of FDR smoothing when two of these assumptions are relaxed: a spatially invariant alternative distribution, and uncorrelated errors. At their core, these experiments are designed to probe how well our algorithm performs when the z-scores are not conditionally independent as specified by the two-groups model, but rather are related to their location and to the values of their neighbors.
The alternative distribution estimated by our predictive recursion routine is assumed to be globally fixed and spatially invariant. However, it is reasonable to believe that some experiments may have a spatially dependent alternative distribution. For instance, a significant difference in one region of the brain may be due to inhibition of neuronal activity, while another region may experience excitation. To examine the performance of FDR smoothing under a spatially dependent alternative distribution, we conducted 30 independent trials on a grid where the alternative distribution was Gaussian with a spatially dependent mean function (Figure 8, left panel). In each trial, we randomly generated three plateaus by sampling contiguous 1000-point regions with replacement, and drew the test statistics at each site from the corresponding distribution. Figure 8 (middle) shows each globally estimated alternative distribution. In many trials, the model incorrectly estimates the alternative distribution to be roughly unimodal, centered near one of the two peak activation levels (−6 and 6) of the true distribution. Nonetheless, as Figure 8 (right) shows, the model maintains strong performance, achieving high power while only slightly exceeding the 10% FDR threshold on average (mean FDR: 11.54%).
Another fundamental assumption of our model is that each z-score is an independent draw from its corresponding mixture component. We investigate our method’s robustness to violations of this assumption by drawing z-scores from a multivariate Gaussian, defined as follows. For noise-only nodes, the mean was zero; for signal nodes, the mean was drawn from the noiseless signal distribution. The covariance matrix was that of a Gaussian process with a squared-exponential kernel κ(s, s′) = exp(−‖s − s′‖² / (2b²)) + γ 1[s = s′], where b is the bandwidth parameter, 1[·] is the indicator function, and γ is a small nugget term added for numerical stability. We generated signal regions as in the previous experiment (1000-point plateaus of connected vertices) and varied the strength of covariance between neighbors on the graph by increasing the bandwidth parameter. For each value of b, we simulated 10 independent data sets. As shown in Figure 9 (left), the FDR smoothing model maintains the 10% FDR threshold until the bandwidth grows large. We note that a bandwidth above 1 here produces visibly “clumpy” regions of correlated noise. This would be easy to detect in real data, and is well beyond what one might expect in a low-to-moderate correlation setting. See Appendix F for an example visualization of the data at the first bandwidth above the threshold.
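The correlated-noise draws can be reproduced in a few lines: build the squared-exponential covariance over site coordinates, add the nugget, and multiply a standard normal vector by its Cholesky factor. A sketch under the kernel form stated above (the grid size and parameter values are illustrative, not the paper's settings):

```python
import numpy as np

def sq_exp_cov(coords, b, nugget=1e-6):
    # Squared-exponential covariance over site coordinates, with a small
    # nugget on the diagonal for numerical stability.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * b ** 2)) + nugget * np.eye(len(coords))

# Draw a correlated "null" noise field on a small 8x8 grid.
rng = np.random.default_rng(1)
coords = np.array([(i, j) for i in range(8) for j in range(8)], dtype=float)
Sigma = sq_exp_cov(coords, b=1.5)
L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
z = L @ rng.standard_normal(len(coords))   # z ~ N(0, Sigma)
```

Larger bandwidths b push neighboring entries of Sigma toward 1, producing the visibly clumpy noise fields described above.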
Overall, the FDR smoothing approach appears to be very robust to model misspecification. In most situations, misspecification leads to increasingly conservative estimation of the discoveries rather than a substantial excess over the FDR threshold. Only in the presence of noise with visible (and easily detected) spatial correlation does the model begin to violate the FDR threshold, and it does so gradually as the spatial correlation increases. These results should give further confidence to the practitioner: FDR smoothing may be useful even in scenarios where the assumed model does not completely fit the experimental design.
5.4 Pathological Scenarios
The one pathological example we have found where FDR smoothing fails catastrophically is when the number of test sites drawn from the alternative distribution is nearly as large as, or larger than, the number drawn from the null distribution. When this happens, the predictive recursion procedure may actually interpret the alternative distribution as the null and vice versa. The result is a complete inversion of the estimated prior regions and corresponding posteriors.
To illustrate this point, we simulated along a continuum from entirely-null to entirely-alternative samples. At each point on the grid of mixing proportions, we ran 10 independent trials and measured the TPR and FDR of the model. The null distribution was a standard normal, and the alternative distribution was a normal distribution centered at 2. Figure 9 (right) shows the performance of FDR smoothing as the proportion of test sites drawn from the alternative distribution varies. After the pathological threshold of roughly 50% is crossed, the TPR and FDR “flip,” as the model now estimates all null sites to be discoveries.
Fortunately, this pathological scenario is easy to detect in practice. As a sanity check, one should simply examine the estimated null and alternative distributions. The purpose of the empirical-null procedure is to correct for slight systematic bias in the experimental procedure that generated the data. One should therefore expect the estimated null to be reasonably close to a standard normal. In the pathological case above, however, the empirical mean of the estimated null is very far from zero; similarly, the estimated alternative is almost precisely a standard normal. If this scenario arises, we advise the practitioner simply to use the theoretical null, as the benefit from the empirical null is much smaller than that from the spatial smoothing procedure of FDR smoothing.
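This sanity check is easy to automate: flag a fitted empirical null whose mean or standard deviation strays too far from N(0, 1). The thresholds below are heuristics of our own choosing, not values from the paper:

```python
def empirical_null_suspect(null_mean, null_sd, mean_tol=1.0, sd_range=(0.5, 2.0)):
    """Flag a fitted empirical null that is implausibly far from N(0, 1).

    A flagrant departure suggests the null and alternative roles may have
    flipped, in which case falling back to the theoretical null is advised.
    The tolerance values are heuristic.
    """
    return abs(null_mean) > mean_tol or not (sd_range[0] < null_sd < sd_range[1])
```

A fitted null with mean 2 would be flagged, while one with mean 0.05 and standard deviation 1.02 would pass.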
6 Discussion
Modern scientific analyses often involve large-scale multiple hypothesis testing. In many cases, such as fMRI experiments, these analyses exhibit spatial structure that is ignored by traditional methods for multiplicity adjustment. As we have shown, exploiting this spatial structure via FDR smoothing offers a simple way to increase the overall power of an experiment while maintaining control of the false discovery rate. Our method achieves this performance by automatically identifying spatial regions where the local fraction of signals is enriched versus the background.
While our results show strong statistical and computational performance, there are many areas in which our approach could be improved. We call attention to a few of these areas and suggest them as subjects for future research.
First, the choice of a constant penalty λ on the first differences across edges in the graph leads to some slight overshrinkage in the estimated prior probabilities. This is most evident in Figure 1, where the estimated priors are shrunk back toward the grand mean (or, equivalently, toward the estimate of the ordinary two-groups model) relative to the truth. This reflects the well-known “non-diminishing bias” feature of the ℓ1 (lasso) penalty, and is often mitigated in linear regression by using the adaptive lasso (Zou, 2006) or a concave penalty (Mazumder et al., 2011; Polson and Scott, 2012). Translating these ideas to the FDR smoothing problem presents a formidable algorithmic challenge and is an important area for future work. Nevertheless, even with this noticeable overshrinkage, FDR smoothing achieves state-of-the-art performance in our synthetic experiments. Moreover, it is possible that the overshrinkage is a feature rather than a deficit, in that it prevents the method from being too aggressive in hunting for very small regions of signals.
Second, our method for choosing the regularization parameter λ is effective but ad hoc. Our path-based approach would benefit from new theory on the degrees of freedom of the generalized lasso in mixture-model settings, or from an entirely different principle for choosing the tuning parameter in sparse estimation.
Third, we have presented FDR smoothing as a general method and provided examples that suggest its wide potential for application. Perhaps the most obvious area in which it could be useful is in the analysis of fMRI data. The literature on fMRI data analysis is large and mature, including the literature on multiplicity correction (e.g. Hayasaka and Nichols, 2003; Poldrack, 2007; Nichols, 2012). We have not attempted to benchmark FDR smoothing against some of these specialized methods, which exploit specific features of fMRI problems that may not hold more generally. This comparison would be out of place in a paper intended for a general statistical audience, but we intend to pursue it in our future work.
Finally, both the lasso and the two-groups model sit (independently of one another) at the center of a large body of theoretical work. We cannot hope to summarize this literature and merely refer the reader to Bickel et al. (2009) for the lasso and Bogdan et al. (2011) for the two-groups model. Combining these two lines of work to produce a theoretical analysis of FDR smoothing represents a major research effort and is beyond the scope of this paper. Nonetheless, given the strong empirical performance of the method, we are hopeful that such an analysis will someday bear fruit.
All code for FDR smoothing is publicly available in Python and R at https://github.com/tansey/smoothfdr.
References
 Arnold et al. (2014) T. Arnold, V. Sadhanala, and R. J. Tibshirani. glmgen: Fast generalized lasso solver. https://github.com/statsmaths/glmgen, 2014. R package version 0.0.2.
 Barbero and Sra (2011) A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. In L. Getoor and T. Scheffer, editors, ICML, pages 313–320. Omnipress, 2011. URL http://dblp.unitrier.de/db/conf/icml/icml2011.html#JimenezS11.
 Barbero and Sra (2014) A. Barbero and S. Sra. Modular proximal optimization for multidimensional totalvariation regularization. 2014. URL http://arxiv.org/abs/1411.0589.
 Benjamini and Heller (2007) Y. Benjamini and R. Heller. False discovery rates for spatial signals. Journal of the American Statistical Association, 102(480):1272–1281, 2007.
 Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.
 Berger and Delampady (1987) J. O. Berger and M. Delampady. Testing precise hypotheses. Statistical Science, pages 317–335, 1987.
 Berry (1988) D. Berry. Multiple comparisons, multiple tests, and data dredging: A Bayesian perspective. In J. Bernardo, M. DeGroot, D. Lindley, and A. Smith, editors, Bayesian Statistics 3, pages 79–94. Oxford University Press, 1988.
 Bickel et al. (2009) P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37:1705–32, 2009.
 Bogdan et al. (2008) M. Bogdan, J. K. Ghosh, and S. T. Tokdar. A comparison of the BenjaminiHochberg procedure with some Bayesian rules for multiple testing. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, volume 1, pages 211–30. Institute of Mathematical Statistics, 2008.
 Bogdan et al. (2011) M. Bogdan, A. Chakrabarti, F. Frommlet, and J. K. Ghosh. Asymptotic Bayesoptimality under sparsity of some multiple testing procedures. The Annals of Statistics, 39(3):1551–79, 2011.
 Boyd et al. (2011) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 Chambolle and Darbon (2009) A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum flows. International journal of computer vision, 84(3):288–307, 2009.
 Clarke and Hall (2009) S. Clarke and P. Hall. Robustness of multiple testing procedures against dependence. The Annals of Statistics, 37:332–58, 2009.
 Do et al. (2005) K.A. Do, P. Muller, and F. Tang. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society, Series C, 54(3):627–44, 2005.
 Efron (2004) B. Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99:96–104, 2004.
 Efron (2008) B. Efron. Microarrays, empirical Bayes and the two-groups model (with discussion). Statistical Science, 23(1):1–22, 2008.
 Efron (2012) B. Efron. LargeScale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press, 2012.
 Efron et al. (2001) B. Efron, R. Tibshirani, J. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of American Statistical Association, 96:1151–60, 2001.
 Eklund et al. (2016) A. Eklund, T. E. Nichols, and H. Knutsson. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 113(28):7900–05, 2016.
 Fedorenko et al. (2013) E. Fedorenko, J. Duncan, and N. Kanwisher. Broad domain generality in focal regions of frontal and parietal cortex. Proc Natl Acad Sci U S A, 110(41):16616–21, Oct 2013. doi: 10.1073/pnas.1315235110.
 Hayasaka and Nichols (2003) S. Hayasaka and T. Nichols. Validating cluster size inference: random field and permutation methods. Neuroimage, 20(4):2343–56, 2003.
 Jaffe et al. (2012) A. E. Jaffe, A. P. Feinberg, R. A. Irizarry, and J. T. Leek. Significance analysis and statistical dissection of variably methylated regions. Biostatistics, 13(1):166–78, 2012.
 Johnson (2013) N. A. Johnson. A dynamic programming algorithm for the fused lasso and ℓ0-segmentation. Journal of Computational and Graphical Statistics, 22(2):246–260, 2013.
 Leek and Storey (2008) J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105(48):18718–23, 2008.
 Martin and Tokdar (2012) R. Martin and S. Tokdar. A nonparametric empirical Bayes framework for largescale multiple testing. Biostatistics, 13(3):427–39, 2012.
 Mazumder et al. (2011) R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–38, 2011.
 Moeller et al. (2010) S. Moeller, E. Yacoub, C. A. Olman, E. Auerbach, J. Strupp, N. Harel, and K. Uğurbil. Multiband multislice GE-EPI at 7 Tesla, with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magn Reson Med, 63(5):1144–53, May 2010. doi: 10.1002/mrm.22361.
 Muller et al. (2006) P. Muller, G. Parmigiani, and K. Rice. FDR and Bayesian multiple comparisons rules. In Proceedings of the 8th Valencia World Meeting on Bayesian Statistics. Oxford University Press, 2006.
 Newton (2002) M. A. Newton. A nonparametric recursive estimator of the mixing distribution. Sankhya, Series A, 64:306–22, 2002.
 Nichols (2012) T. Nichols. Multiple testing corrections, nonparametric methods, and random field theory. Neuroimage, 62(2):811–5, 2012.
 Owen et al. (2005) A. M. Owen, K. M. McMillan, A. R. Laird, and E. Bullmore. Nback working memory paradigm: A metaanalysis of normative functional neuroimaging studies. Human brain mapping, 25(1):46–59, 2005.
 Perone Pacifico et al. (2004) M. Perone Pacifico, C. Genovese, I. Verdinelli, and L. Wasserman. False discovery control for random fields. Journal of the American Statistical Association, 99(468):1002–1014, 2004.
 Poldrack (2007) R. A. Poldrack. Region of interest analysis for fMRI. Social Cognitive & Affective Neuroscience, 2(1):67–70, 2007.
 Poldrack et al. (2011) R. A. Poldrack, J. A. Mumford, and T. E. Nichols. Handbook of Functional MRI Data Analysis. Cambridge University Press, New York, NY, 2011.
 Polson and Scott (2012) N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 74(2):287–311, 2012.
 Power et al. (2014) J. D. Power, A. Mitra, T. O. Laumann, A. Z. Snyder, B. L. Schlaggar, and S. E. Petersen. Methods to detect, characterize, and remove motion artifact in resting state fMRI. Neuroimage, 84:320–41, Jan 2014. doi: 10.1016/j.neuroimage.2013.08.048.
 Rudin et al. (1992) L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–68, 1992.
 Schwartzman et al. (2008) A. Schwartzman, R. F. Dougherty, and J. E. Taylor. False discovery rate analysis of brain diffusion direction maps. The Annals of Applied Statistics, pages 153–175, 2008.
 Scott and Berger (2006) J. G. Scott and J. O. Berger. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7):2144–2162, 2006.
 Scott and Berger (2010) J. G. Scott and J. O. Berger. Bayes and empiricalBayes multiplicity adjustment in the variableselection problem. The Annals of Statistics, 38(5):2587–2619, 2010.
 Scott et al. (2014) J. G. Scott, R. C. Kelly, M. A. Smith, P. Zhou, and R. E. Kass. False discoveryrate regression: an application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association, 2014. to appear.
 Shu et al. (2015) H. Shu, B. Nan, and R. Koeppe. Multiple testing for neuroimaging via hidden Markov random field. Biometrics, 2015.
 Sun and Cai (2009) W. Sun and T. Cai. Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):393–424, 2009.
 Sun et al. (2015) W. Sun, B. J. Reich, T. T. Cai, M. Guindani, and A. Schwartzman. False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(1):59–83, 2015.
 Tibshirani et al. (2005) R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (Series B), 67:91–108, 2005.
 Tibshirani and Taylor (2011) R. J. Tibshirani and J. Taylor. The solution path of the generalized lasso. Annals of Statistics, 39:1335–71, 2011.
 Tibshirani and Taylor (2012) R. J. Tibshirani and J. Taylor. Degrees of freedom in lasso problems. The Annals of Statistics, 40(2):1198–1232, 2012.
 Tokdar et al. (2009) S. Tokdar, R. Martin, and J. Ghosh. Consistency of a recursive estimate of mixing distributions. The Annals of Statistics, 37(5A):2502–22, 2009.
 West (2001) D. B. West. Introduction to Graph Theory, volume 2. Prentice Hall, Upper Saddle River, NJ, 2001.
 Zhang et al. (2011) C. Zhang, J. Fan, and T. Yu. Multiple testing via FDR_L for large-scale imaging data. The Annals of Statistics, 39(1):613, 2011.
 Zou (2006) H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Appendix A Details of fMRI data set
The fMRI data set analyzed in Section 3 was acquired and processed as follows. A spatial working memory localizer (Fedorenko et al., 2013) was performed by a single subject. On each trial, a 4x2 spatial grid is presented, and locations in that grid are presented sequentially (1000 ms per location), followed by a forced-choice probe between two grids, one of which contained all of the locations presented in the preceding series. In the easy condition, one location is presented on each presentation, whereas in the hard condition two locations are presented on each presentation. Twelve 32-second experimental blocks were interspersed with four 16-second fixation blocks (acquisition time = 7:28). The contrast presented in Figure 1 compares the hard versus easy conditions.
fMRI acquisition was performed using a multiband EPI (MB-EPI) sequence (Moeller et al., 2010) (TR = 1.16 ms, TE = 30 ms, flip angle = 63 degrees, voxel size = 2.4 mm x 2.4 mm x 2 mm, distance factor = 20%, 64 slices, oriented 30 degrees back from AC/PC, 96x96 matrix, 230 mm FOV, MB factor = 4, 10:00 scan length). fMRI data were preprocessed according to a pipeline developed at Washington University, St. Louis (Power et al., 2014), including realignment for motion correction, distortion correction using a field map, and registration to a 3 mm isotropic atlas space. Preprocessed task fMRI data were analyzed at the first level using the FSL Expert Analysis Tool (FEAT, version 5.0.6), using pre-whitening and high-pass temporal filtering (100-second cutoff).
Appendix B Finding plateaus in 2D images
Algorithm 1 outlines our approach to finding plateaus, which is needed in our path-based algorithm for choosing the regularization parameter λ. Note that each point in the grid is touched at most d times, where d is the number of neighbors of that point. Thus the algorithm runs in O(dn) time, which is effectively linear since d ≪ n. The algorithm is mildly sensitive to underlying numerical inaccuracies in the ADMM solution for β. It is well known that finite-precision ADMM solutions tend to slightly “round off” sharp edges in the underlying image. This produces some slight numerical noise in the degrees-of-freedom estimate. In our experience, this is rarely a practical concern, and it can always be corrected by tightening the convergence criterion for ADMM below the plateau tolerance in Algorithm 1.
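At its core, this kind of plateau finder is a tolerance-based flood fill over the grid. The following is a minimal sketch, not the paper's Algorithm 1 itself: the function name, the default tolerance, and the use of a 2-d NumPy array for the fitted values are all our own illustrative choices.

```python
import numpy as np
from collections import deque

def find_plateaus(beta, tol=1e-4):
    """Group grid cells into connected plateaus of (numerically) equal value.

    beta : 2-d array of fitted values (e.g. an ADMM solution).
    tol  : plateau tolerance; adjacent cells within tol are merged.
    Returns an integer label array of the same shape as beta; the number
    of distinct labels is the number of plateaus. Each cell is examined
    once per neighbor, so the cost is O(d*n) for d neighbors per cell.
    """
    labels = -np.ones(beta.shape, dtype=int)
    next_label = 0
    for start in zip(*np.nonzero(labels < 0)):  # row-major scan of all cells
        if labels[start] >= 0:
            continue  # already assigned to a plateau
        # Breadth-first flood fill from an unlabeled cell.
        queue = deque([start])
        labels[start] = next_label
        while queue:
            i, j = queue.popleft()
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if (0 <= ni < beta.shape[0] and 0 <= nj < beta.shape[1]
                        and labels[ni, nj] < 0
                        and abs(beta[ni, nj] - beta[i, j]) <= tol):
                    labels[ni, nj] = next_label
                    queue.append((ni, nj))
        next_label += 1
    return labels
```

Counting the distinct labels returned by such a routine yields the plateau count used in the degrees-of-freedom estimate.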
Appendix C Benchmark setup
As described in Section 5, all methods were run across a suite of scenarios, with 30 independent trials per scenario and a 10% FDR threshold. This appendix describes the method-specific settings for the two main competing methods: FDR_L and the HMRF model.
For FDR_L, we used “Method 1” from Zhang et al. (2011), as this was the variant suggested for fMRI-type data. We set the null cutoff λ higher than the value used by Zhang et al. (2011); however, they also used a 1% FDR threshold. Since λ controls the proportion above which we expect almost all p-values to be true nulls, a larger λ is more reasonable with an FDR threshold of 10%. Preliminary experiments confirmed the authors’ claim that FDR_L is not very sensitive to the setting of λ.
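The intuition behind the null cutoff can be made concrete with a Storey-type estimator of the null proportion: p-values above λ are assumed to come almost entirely from true nulls. This is a generic sketch, not code from the FDR_L implementation; the function name and the illustrative default cutoff are ours.

```python
import numpy as np

def estimate_null_proportion(p_values, lam=0.2):
    """Storey-type estimate of the null proportion pi0.

    Assumes p-values above the cutoff `lam` come almost entirely from
    true nulls. Their fraction, rescaled by 1/(1 - lam) to account for
    the nulls falling below the cutoff, estimates pi0.
    """
    p = np.asarray(p_values)
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # pi0 is a proportion; clip at 1
```

A larger λ trades variance for robustness: fewer p-values fall above the cutoff, but fewer signals contaminate the estimate.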
The HMRF model has several tunable parameters and required tweaks to run the code provided in the supplementary materials of Shu et al. (2015):

- In order to compile the C++ code, we needed to replace all calls to floor(x) with (double)((int)x).
- 2-d grids and edge points are not supported in their implementation. To process the entire grid, we had to embed it within the center of a larger array. This should have no effect on the result, as we specified the original lattice as the region of interest.
- Their alternative-density estimation procedure is parametric (as opposed to our nonparametric approach) and requires specifying the number of components in a Gaussian mixture model. We specified the correct number of components in each case, to give their model the best possible estimate (i.e., 2 for the well-separated scenarios and 1 for the poorly-separated scenarios).
- We ran with the default parameters, which correspond to 5000 iterations of the main Gibbs sampler with a 1000-iteration burn-in. These settings are identical to those used in the HMRF paper.
We made every effort to be as generous as possible to both methods. This is the main reason for choosing to include the “saturated” signal regions: these cases highlight where FDR_L and HMRFs perform well, even though we expect them to be rare in practice, as evidenced by the various prior plateaus discovered by FDR smoothing in Figure 2.
Appendix D Comparisons with FDR Regression
Benchmark performance results against FDR regression (FDRR) are presented in Table 2. We performed 100 independent trials for each of eight different scenarios, corresponding to two different plateau setups (large regions vs. small regions) and four different alternative distributions (Alt 1–4 in Table 2).
FDR regression using a 100-dimensional spline basis comes close to the performance of FDR smoothing, but it also has many conceptual and computational disadvantages. These are essentially the same disadvantages that one would face in treating any spatial smoothing problem in a regression framework. For example, to handle a smoothing problem using FDR regression, one must choose the basis set and the number of basis elements. This is implicitly a choice about the smoothness of the underlying prior image, and it is not straightforward in large problems or problems over an arbitrary graph structure. FDR smoothing, on the other hand, has no tunable parameters once our path-based method for choosing λ is used. Moreover, FDR regression cannot localize sharp edges in the underlying image of prior probabilities unless these edges happen to coincide with edges present in the basis set. FDR smoothing finds these edges automatically, without requiring a clever choice of basis and without having to tolerate undersmoothing in other parts of the image. Finally, at an algorithmic level, the important matrix operations in FDR smoothing involve very sparse matrices and benefit enormously from pre-caching. This is not true of FDR regression, which involves dense matrices and linear systems that change at every iteration.
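The sparse structure in question can be illustrated with the first-difference (oriented edge-incidence) matrix of a grid graph, which encodes the fused-lasso penalty edges. This is a hypothetical sketch using scipy.sparse, not the paper's implementation; the function name is ours.

```python
import numpy as np
from scipy import sparse

def grid_difference_matrix(nrows, ncols):
    """Oriented edge-incidence (first-difference) matrix D for a 2-d grid.

    Each row of D has one +1 and one -1, so D @ beta stacks the
    differences beta_i - beta_j over all horizontal and vertical edges.
    D has exactly two nonzeros per row, so products like D @ x and
    D.T @ y are cheap, and D can be built once and reused across
    ADMM iterations.
    """
    n = nrows * ncols
    idx = np.arange(n).reshape(nrows, ncols)
    # Endpoints of horizontal edges (left-right) and vertical edges (up-down).
    heads = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    tails = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    m = heads.size  # total number of edges
    rows = np.repeat(np.arange(m), 2)
    cols = np.column_stack([heads, tails]).ravel()
    vals = np.tile([1.0, -1.0], m)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(m, n))
```

Because D annihilates constant vectors (D @ 1 = 0), the fused-lasso penalty it induces is zero exactly on flat images, which is what drives the plateau structure exploited by FDR smoothing.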
As the table shows, FDR regression with 100 basis functions does provide sensible answers and good FDR performance. However, FDR smoothing benefits greatly from exploiting the spatial structure of the problem, resulting in better power and more interpretable summaries at lower computational cost.
True positive rate (TPR)

         |           Large Regions           |           Small Regions
         |  Alt 1   Alt 2   Alt 3   Alt 4    |  Alt 1   Alt 2   Alt 3   Alt 4
BH       |  0.364   0.215   0.128   0.366    |  0.212   0.123   0.090   0.194
2G       |  0.394   0.229   0.134   0.403    |  0.211   0.123   0.091   0.196
FDRR     |  0.559   0.334   0.167   0.610    |  0.242   0.141   0.097   0.232
FDRS     |  0.592   0.352   0.168   0.645    |  0.264   0.144   0.093   0.257
Oracle   |  0.688   0.524   0.332   0.718    |  0.298   0.193   0.139   0.292

False discovery rate (FDR)

         |           Large Regions           |           Small Regions
         |  Alt 1   Alt 2   Alt 3   Alt 4    |  Alt 1   Alt 2   Alt 3   Alt 4
BH       |  0.072   0.070   0.073   0.070    |  0.090   0.093   0.093   0.092
2G       |  0.089   0.083   0.083   0.089    |  0.092   0.096   0.098   0.096
FDRR     |  0.075   0.058   0.050   0.086    |  0.102   0.106   0.109   0.105
FDRS     |  0.072   0.057   0.054   0.079    |  0.092   0.095   0.098   0.096
Oracle   |  0.101   0.100   0.100   0.101    |  0.097   0.101   0.101   0.098
Appendix E HMRF details and improvements
The HMRF model, while following the prior-dependence philosophy of FDR smoothing, makes a different distributional assumption by placing an Ising model on the priors. This has two important side effects. First, the model will not necessarily discover constant regions of prior probability. This is clear from the “local index of significance” (LIS) statistics produced by the HMRF, shown in Figure 10: while the LIS surface is substantially smoothed, it is not constant across plateaus, as it is in FDR smoothing. Second, the model’s substantial complexity makes it very difficult to fit. The implementation provided by the authors performs an EM algorithm with Gibbs sampling and required more than three days to run the examples with the suggested number of iterations, compared to minutes for FDR smoothing on the same examples and the same compute cluster. More to the point, the final fit shows a clear bias toward local optima that overestimate the strength of the signal region. The result is a model that performs well only when the regions are clearly segmented and the signal region is saturated, and that otherwise fails to adhere to the specified FDR threshold. Below we describe the suggestions we received from the HMRF authors on ways to improve the runtime and fit of the model; we did not incorporate these suggestions in our benchmarks, as they were either purely computational speedups or intuitive suggestions that would require developing entirely new methods.
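Once LIS-type statistics (each site's posterior probability of being null) are in hand, the rejection rule is a step-up procedure on their sorted values. The following is a generic sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def lis_threshold(lis, alpha=0.10):
    """Step-up rule on local significance statistics.

    `lis` holds each site's posterior probability of being null
    (local-FDR / LIS values). Rejecting the k sites with the smallest
    values yields an estimated FDR equal to the running mean of the
    sorted values; we take the largest k keeping that mean <= alpha.
    Returns a boolean rejection mask aligned with the input.
    """
    lis = np.asarray(lis)
    order = np.argsort(lis)
    # Running mean of sorted LIS values = estimated FDR after k rejections.
    running_mean = np.cumsum(lis[order]) / np.arange(1, lis.size + 1)
    k = np.searchsorted(running_mean, alpha, side='right')
    reject = np.zeros(lis.size, dtype=bool)
    reject[order[:k]] = True
    return reject
```

The running mean of a sorted nondecreasing sequence is itself nondecreasing, so the largest admissible k can be found by a single binary search.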
In an effort to provide a fair evaluation, we contacted the first and second authors of the HMRF paper (Shu et al., 2015). They provided several suggestions for improving the speed of the algorithm and its performance. The speedup suggestions were:

- Reduce the number of burn-in iterations.
- Monitor the stability of the parameter estimates in order to stop before reaching the maximum number of iterations.
- Cap the backtracking line search at a fixed number of steps within the Newton's-method step.
- Use an updated pseudorandom number generator, as the code relies on an outdated generator that may be slower than the most up-to-date version.
Note that all of the above suggestions would reduce the running time of the algorithm, but would likely not result in an improved fit or better performance on the benchmarks. The main performance improvement suggested was to preprocess the z-scores so as to detect the different regions first, then run separate HMRFs on each region. One way to do this would be to run FDR smoothing, then treat each plateau as a different region and fit a separate HMRF to each. It is unclear whether this approach would truly address the underlying issues we observed in the benchmarks. Thus, while this is an interesting idea and may well be effective, it would constitute an entirely new method, and we therefore leave it to future work.