On spike and slab empirical Bayes multiple testing

# On spike and slab empirical Bayes multiple testing

\fnmsIsmaël \snmCastillolabel=e1]ismael.castillo@upmc.fr [    \fnmsÉtienne \snmRoquainlabel=e2]etienne.roquain@upmc.fr [ Sorbonne Université, Laboratoire de Probabilités, Statistique et Modélisation, LPSM,
4, Place Jussieu, 75252 Paris cedex 05, France
Sorbonne Université
###### Abstract

This paper explores a connection between empirical Bayes posterior distributions and false discovery rate (FDR) control. In the Gaussian sequence model, this work shows that empirical Bayes-calibrated spike and slab posterior distributions allow a correct FDR control under sparsity. Doing so, it offers a frequentist theoretical validation of empirical Bayes methods in the context of multiple testing. Our theoretical results are illustrated with numerical experiments.

[
\kwd
\runtitle

Spike and slab empirical Bayes multiple testing \thankstextT1Work partly supported by grants of the ANR, projects ANR-16-CE40-0019 (SansSouci) and ANR-17-CE40-0001 (BASICS)

{aug}

and

class=AMS] \kwd[Primary ]62C12, 62G10 Frequentist properties of Bayesian procedures \kwdFalse discovery rate \kwdsparsity \kwdmultiple testing

## 1 Introduction

### 1.1 Context

In modern high dimensional statistical models, several aims are typically pursued, often at the same time: testing of hypotheses on the parameters of interest, estimation and uncertainty quantification, among others. Due to their flexibility, in particular in the choice of the prior, Bayesian posterior distributions are routinely used to provide solutions to a variety of such inference problems. However, although practitioners may often directly read off quantities such as the posterior mean or credible sets once they have simulated posterior draws, the question of mathematical justification of the use of such quantities, in particular from a frequentist perspective, has recently attracted a lot of attention. While the seminal papers [ggv], [sw01] set the stage for the study of posterior estimation rates in general models, the case of estimation in high dimensional models has been considered only recently from the point of view of estimation, see [js04], [cv12], [vkv14] among others, while results on frequentist coverage of credible sets are just starting to emerge, see e.g. [belnur15], [vsvv17d]. Some of the previous approaches rely on automatic data-driven calibration of the prior parameters, following the so-called empirical Bayes approach, notably [js04], estimating the proportion of significant parameters, and [jiangzhang09], where the full distribution function of the unknowns is estimated.

Our interest here is on the issue of multiple testing of hypotheses. Typically, the problem is to identify the active variables among a large number of candidates. This task appears in a wide variety of applied fields as genomics, neuro-imaging, astrophysics, among others. Such data typically involve more than thousands of variables with only a small part of them being significant (sparsity).

In this context, a typical aim is to control the false discovery rate (FDR), see (8) below, that is, to find a selection rule that ensures that the averaged proportion of errors among the selected variables is smaller than some prescribed level . This multiple testing type I error rate, introduced in [BH1995], became quickly popular with the development of high-throughput technologies because it is “scalable” with respect to the dimension: the more rejections are possible, the more false positives are allowed. A common way to achieve this goal is to compute the -values (probability under the null that the test statistic is larger than the observed value) and to run the Benjamini-Hochberg (BH) procedure [BH1995], which is often considered as a benchmark procedure. In the last decades, an extensive literature aimed at studying the BH method, by showing that it (or versions of it) controls the FDR in various frameworks, see [BY2001, BKY2006, Sar2007, FDR2007], among others.

In a fundamental work [fdr06], Abramovich, Benjamini, Donoho and Johnstone proved that a certain hard thresholding rule deduced from the BH procedure – keeping only observations with significant -values – satisfies remarkable risk properties: it is minimax adaptive simultaneously for a range of losses and sparsity classes over a broad range of sparsity parameters. In addition, similar results hold true for the misclassification risks, see [BCFG2011, NR2012]. These results in particular suggest a link between FDR controlling procedures and adaptation to sparsity. Further results in this direction include [bogdanslope15], [sucandes16] for the SLOPE estimate. Here, we shall follow a questioning that can be seen as ‘dual’ to the former one: starting from a commonly used Bayesian procedure that is known to optimally adapt to the sparsity in terms of risk over a broad range of sparsity classes (and even, under appropriate self-similarity type conditions, to produce adaptive confidence sets), we ask whether a uniform FDR control can be guaranteed.

### 1.2 Setting

In this paper, we consider the Gaussian sequence model. One observes, for ,

 Xi=θ0,i+εi, (1)

for an unknown -dimensional vector and i.i.d. . This model can be seen as a stylized version of an high-dimensional model. The problem is to test

 H0,i:‘‘θ0,i=0" against H1,i:‘‘θ0,i≠0",

simultaneously over . We also introduce the assumption that the vector is -sparse, that is, is supposed to belong to the set

 ℓ0[sn]={θ∈Rn:#{1≤i≤n:θi≠0}≤sn}, (2)

for some sequence , typically much smaller than , measuring the sparsity of the vector.

### 1.3 Bayesian multiple testing methodology

From the point of view of posterior distributions, one natural approach for testing is simply based on comparing posterior probabilities of the hypotheses under consideration. Yet, to do so, a choice of prior needs to be made, and for this reason it is important to carefully design a prior that is flexible enough to adapt to the unknown underlying structure (and, here, sparsity) of the model. This is one of the reasons behind the use of empirical Bayes approaches, that aim at calibrating the prior in a fully automatic, data-driven, way. Empirical Bayes methods for multiple testing have been in particular advocated by Efron (see e.g. [Efron2008] and references therein) in a series of works over the last 10-15 years, reporting excellent behaviour of such procedures – we describe two of them in more detail in the next paragraphs – in practice. Fully Bayes methods, that bring added flexibility by putting prior on sensible hyperparameters, are another alternative. In the sequel Bayesian multiple testing procedures will be referred to as BMT for brevity.

Several popular BMT procedures rely on two quantities that can be seen as possible Bayesian counterparts of standard -values:

• the -value: the probability that the null is true conditionally on the fact that the test statistics is equal to the observed value, see e.g. [ETST2001];

• the -value: the probability that the null is true conditionally on the fact that the test statistics is larger than the observed value, introduced in [Storey2003].

(Note that the -value is usually called “local FDR”. Here, we used another terminology to avoid any confusion between the procedure and the FDR.) Obviously, these quantities are well defined only if the trueness/falseness of a null hypothesis is random, which is obtained by introducing an appropriate prior distribution.

Once the prior is calibrated (in a data-driven way or not), the -values (resp. -values) can be computed and combined to produce BMT procedures. For instance, existing strategies reject null hypotheses with:

• a -value smaller than a fixed cutoff [Efron2007];

• a -value smaller than the nominal level [Efron2008];

• averaged -values smaller than the nominal level [muelleretal04, SC2007, SC2009].

For alternatives see, e.g., [sarkaretal08]. In particular, one popular fact is that the use of Bayesian quantities “automatically corrects for the multiplicity of the tests”, see, e.g., [Storey2003]; while using -values requires to use a cutoff that decreases with the dimension , using -values/-values can be used with a cutoff close to the nominal level , without any further correction. This is well known to be valid from a decision theoretic perspective for the Bayes FDR, that is, for the FDR integrated w.r.t. the prior distribution, as we recall in Proposition 1 below. When the hyper-parameters are estimated from the data within the BMT, the Bayes FDR is still controlled to some extent, as proved in [SC2007, SC2009]. However, controlling the Bayes FDR does not give theoretical guarantees for the usual frequentist FDR, that is, for the FDR at the true value of the parameter, as the pointwise FDR may deviate from an integrated version thereof.

### 1.4 Frequentist control of BMT

In this paper, our main aim is to study whether BMT procedures have valid frequentist multiple testing properties.

A first hint has already been given in [Storey2003, Efron2008]: it turns out that the BH procedure can loosely be seen as a “plug-in version” of the procedure rejecting the -values smaller than (namely, the theoretical c.d.f. of the -values is estimated by its empirical counterpart). Since the BH procedure controls the (frequentist) FDR, this might suggest a possible connection between BMT and successful frequentist multiple testing procedures.

In regard to the rapidly increasing literature on frequentist validity of Bayesian procedures from the estimation perspective, the multiple testing question for BMT procedures has been less studied so far from the theoretical, frequentist, point of view. This is despite a number of very encouraging simulation performance results, see e.g. [muelleretal04, caoetal09, guindanietal09, martintokdar12]. A recent exception is the interesting preprint [salomond17] that shows a frequentist FDR control for a BMT based on a continuous shrinkage prior; yet, this control holds under a certain signal-strength assumption only. One main question we ask in the present work is whether a fully uniform control (over sparse vectors) of the frequentist FDR is possible for some posterior-based BMT procedures. Also, we would like to clarify whether the final FDR control is made at, or close to, the required level .

### 1.5 Spike and slab prior distributions and sparse priors

Let be a fixed hyper-parameter. Let us define the prior distribution on as

 Πw,γ=((1−w)δ0+wG)⊗n, (3)

where is a distribution with a symmetric density on . Such a prior is a tensor product of a mixture of a Dirac mass at (spike), that reflects the sparsity assumption, and of an absolutely continuous distribution (slab), that models nonzero coefficients. This is arguably one of the most natural priors on sparse vectors and has been considered in many key contributions on Bayesian sparse estimation and model selection, see, e.g., [MitchellBeauchamp], [GeorgeFoster].

Of course, an important question is that of the choice of and . A popular choice of is data-driven and based on a marginal maximum likelihood empirical Bayes method (to be described in more details below). The idea is to make the procedure learn the intrinsic sparsity while also incorporating some automatic multiplicity correction, as discussed e.g. in [scottberger10, bgt08]. Following such an approach in a fundamental paper, Johnstone and Silverman [js04] show that, provided has tails at least as heavy as Laplace, the posterior median of the empirical Bayes posterior is rate adaptive for a wide range of sparsity parameters and classes, is fast to compute and enjoys excellent behaviour in simulations (the corresponding R–package EBayesThresh [js05] is widely used). Namely, if denotes the euclidian norm and is the coordinate-wise median of the empirical Bayes posterior distribution, there exists such that

 supθ0∈ℓ0[sn]Eθ0∥^θ−θ0∥2 ≤c1snlog(n/sn). (4)

Thus, asymptotically (in the regime , ), it matches up to a constant the minimax risk for this problem ([djhs92]). In the recent work [cm17], the convergence of the empirical Bayes full posterior distribution (not only aspects such as median or mean) is considered, and similar results can be obtained, under stronger conditions on the tails of (for instance Cauchy works). More precisely, for the empirical Bayes posterior, one can find a constant such that

 supθ0∈ℓ0[sn]Eθ0∫∥θ−θ0∥2dΠ(θ|X) ≤C1snlog(n/sn). (5)

Further, under some conditions, one can show that certain credible sets from the posterior distributions are also adaptive confidence sets in the frequentist sense [cs18]. Alternatively, one can also follow a hierarchical approach and put a prior on . The paper [cv12] obtains adaptive rates for such a fully Bayes procedure over a variety of sparsity classes, and presents a polynomial time algorithm to compute certain aspects of the posterior.

Empirical Bayes approaches have also been successfully applied to a variety of different sparse priors such as empirically recentered Gaussian slabs as in [belnur15, belghos16], or the horseshoe [vsvv17r, vsvv17d], both studied in terms of estimation and the possibility to construct adaptive confidence sets. In [jiangzhang09], an empirical Bayes approach based on the ‘empirical’ cdf of the s is shown to allow for optimal adaptive estimation over various sparsity classes. For an overview on the rapidly growing literature on sparse priors, we refer to the discussion paper [vsvv17d].

Yet, most of the previous results are concerned with estimation or confidence sets, although a few of them report empirical false discoveries, e.g. [vsvv17d], Figure 7, without theoretical analysis though.

### 1.6 Aim and results of the paper

Here we wish to find – if this is at all possible – a posterior-based procedure using a prior (possibly an empirical Bayes one i.e. ), that can perform simultaneous inference in that a) it behaves optimally up to constants in terms of the quadratic risk in the sense of (4) (or (5)), b) its frequentist FDR at any sparse vector is bounded from above by (a constant times) a given nominal level. More precisely, given a nominal level and a multiple testing procedure deduced from (-values or -values procedure, as listed in Section 1.3) we want to validate its use in terms of a uniform control of its false discovery rate , see (8) below, over the whole parameter space. That is, we ask whether we can find independent of such that, for large enough,

 supθ0∈ℓ0[sn]FDR(θ0,φt) ≤C2t. (6)

Our main results are as follows: for a sparsity with ,

• Theorem 1 shows that (6) holds with arbitrary small for the BMT procedure rejecting the nulls whenever the corresponding -value is smaller than .

• Theorem 2 shows that (6) holds for some for the BMT procedure rejecting the nulls whenever the corresponding -value is smaller than (with a slight modification if only few signals are detected).

These results hold for spike and slab priors, for being Laplace or Cauchy, or even for slightly more general heavy-tailed distributions. The hyperparameter is chosen according to a certain empirical Bayes approach to be specified below (with minor modifications with respect to the choice of [js04]).

In addition, from a pure multiple testing perspective, it is important to evaluate the amplitude of in (6). Our numerical experiments support the fact that, roughly, . Furthermore, Theorem 3 shows that for some subset (containing strong signals), we have for the -value BMT, for any (sequence) ,

 limn FDR(θ0,φt)=t, (7)

so the FDR control is exactly achieved asymptotically in that case.

It follows from these results (combined with previous results of [js04, cm17]) that the posterior distribution associated to a spike and slab prior, with Cauchy and a suitably empirical Bayes–calibrated , is appropriate to perform several tasks: (6) (multiple testing), (5)–(4) (posterior concentration in -distance). The posterior can also be used to build honest adaptive confidence sets ([cs18]). The present work, focusing on the multiple testing aspect, then completes the inference picture for spike and slab empirical Bayes posteriors, confirming their excellent behaviour in simulations.

### 1.7 Organisation of the paper

In Section 2, we introduce Bayesian multiple testing procedures associated to spike and slab posterior distributions as well as the considered empirical Bayes choice of . In Section 3, our main results are stated, while Section 4 contains numerical experiments and Section 5 a short discussion. Preliminaries for the proofs are given in Section 6, while the proof of the main results can be found in Section 7. The supplement gathers a number of useful lemmas used in the proofs, as well as the proofs of Propositions 12 and Theorem 3.

### 1.8 Notation

In this paper, we use the following notation:

• for a cdf, we set

• and

• means that there exists constants such that for large enough;

• means that there exists constants such that for large enough;

• , for means that there exists constants such that for all , ;

• , as means that there exists constants such that for large enough;

• means .

Also, for , the symbols (resp. ) denotes the expectation (resp. probability) under in the model (1). The support of is denoted by or sometimes for simplicity. The cardinality of the support is denoted by .

## 2 Preliminaries

### 2.1 Procedure and FDR

A multiple testing procedure is a measurable function of the form , where each (resp. ) codes for accepting (resp. rejecting ). For any such procedure , we let

 FDR(θ0,φ)=Eθ0[∑ni=11{θ0,i=0}φi(X)1∨∑ni=1φi(X)]. (8)

A procedure is said to control the FDR at level if for any in . Note that under , we have , which means that an –FDR controlling procedure provides in particular a (single) test of level of the full null “ for all ”. As already mentioned, in the framework of this paper, our goal is a control of the FDR around the pre-specified target level, as in (6) or (7) (where ).

### 2.2 Prior, posterior, ℓ-values and q-values

Recall the definition of the prior distribution from (3) and let

 g(x)=∫γ(x−u)ϕ(u)du. (9)

The posterior distribution of is then explicitly given by

 θ|X ∼n⨂i=1 ℓi(X)δ0+(1−ℓi(X))GXi (10)

where is the distribution with density and

 ℓi(X) =ℓ(Xi;w,g); (11) ℓ(x;w,g) =Π(θ1=0|X1=x)=(1−w)ϕ(x)(1−w)ϕ(x)+wg(x). (12)

The quantities , , given by (11) are called the -values. Note that, although we do not emphasize it in the notation for short, the -values depend also on and . The -value measures locally, for a given observation , the probability that the latter comes from pure noise. This is why it is sometimes called ‘local-FDR’, see [ETST2001].

If one has in mind a range of values –i.e. those that exceed a given amplitude–, a different measure is given by the q-values defined by

 qi(X) =q(Xi;w,g); (13) q(x;w,g) =Π(θ1=0||X1|≥|x|)=(1−w)¯¯¯¯Φ(|x|)(1−w)¯¯¯¯Φ(|x|)+w¯¯¯¯G(|x|); (14) ¯¯¯¯G(s) =∫+∞sg(x)dx. (15)

The identity (14) relating the -value to is proved in Section 10.

### 2.3 Assumptions

We follow throughout the paper assumptions similar to those of [js04]. The prior is assumed to be unimodal, symmetric and so that

 |logγ(x)−logγ(y)| ≤Λ|x−y|,x,y∈R; (16) γ(y)−1∫∞yγ(u)du ≍yκ−1,as y→∞,κ∈[1,2]; (17) y∈R →y2γ(y) is bounded. (18)

Conditions (16), (17) and (18) above are for instance true when is Cauchy (, ) or Laplace (, is the scaling parameter). As we show in Remark 5, explicit expressions exist for , see (9), in the Laplace case. In the Cauchy case, the integral is not explicit, but in practice (to avoid approximating the integral) one can work with the quasi-Cauchy prior, see [js05], that satisfies the above conditions and corresponds to

 γ(x) =(2π)−1/2(1−|x|¯¯¯¯Φ(x)/ϕ(x)); (19) g(x) =(2π)−1/2x−2(1−e−x2/2). (20)

### 2.4 Bayesian Multiple Testing procedures (BMT)

We define the multiple procedures defined from the -values/-values in the following way:

 φ\tinyℓ-vali(t;w,g) =\mathds1{ℓi(X)≤t},1≤i≤n; (21) φ\tinyq-vali(t;w,g) =\mathds1{qi(X)≤t},1≤i≤n, (22)

where is some threshold, that possibly depends on . As we will see in Section 6.2, these two procedures, denoted , for brevity, simply correspond to (hard) thresholding procedures that select the ’s larger than some (random) threshold. The value of the threshold is driven by the posterior distribution in a very specific way: it depends on , , and on the whole data vector through the empirical Bayes choice of the hyper-parameter , that automatically “scales” the procedure according to the sparsity of the data.

### 2.5 Controlling the Bayes FDR

If the aim is to control the FDR at some level , a first result indicates that choosing in and may be appropriate, because the corresponding procedures control the Bayes FDR, that is, the FDR where the parameter has been integrated with respect to the prior distribution (see, e.g., [sarkaretal08]). More formally, for any multiple testing procedure , and hyper-parameters and , define

 BFDR(φ;w,γ) =∫RnFDR(θ,φ)dΠw,γ(θ). (23)

Then the following result holds.

###### Proposition 1.

Let and and consider any density satisfying the assumptions of Section 2.3. Let as defined in (21) and as defined in (22). Then we have

 BFDR(φℓ;w,γ) ≤αP(∃i:ℓi(X)≤α) (24) ≤αP(∃i:qi(X)≤α)=BFDR(φq;w,γ)≤α. (25)

This result can be certainly considered as well known, as (24) (resp. (25)) is similar in essence to Theorem 4 of [SC2009] (resp. Theorem 1 of [Storey2003]). It is essentially a consequence of Fubini’s theorem, see Section 9.1 for a proof. While Proposition 1 justifies the use of /-values from the purely Bayesian perspective, it does not bring any information about and at an arbitrary sparse vector .

### 2.6 Marginal maximum likelihood

In order to choose the hyper-parameter , we explore now the choice made in [js04], following the popular marginal maximum likelihood method. Let us introduce the auxiliary functions

 β(x) =gϕ(x)−1;β(x,w)=β(x)1+wβ(x). (26)

A useful property is that is increasing on from to infinity, see Section 6.1. The marginal likelihood for is by definition the marginal density of , given , in the Bayesian setting. Its logarithm is equal to

 L(w)=n∑i=1logϕ(Xi)+n∑i=1log(1+wβ(Xi)),

which is a differentiable function on . The derivative of , the score function, can be written as

 S(w)=n∑i=1β(Xi)1+wβ(Xi)=n∑i=1β(Xi,w). (27)

The function is (a.s.) decreasing and thus is (a.s.) strictly concave. Hence, almost surely, the maximum of the function on a compact interval exists, is unique, and we can define the marginal maximum likelihood estimator by

 ^w = argmaxw∈[1n,1] L(w) (a.s.). (28)

This choice of is close to the one in [js04]. The only difference is in the lower bound, here , of the maximisation interval, which differs from the choice in [js04] by a slowly varying term. This difference is important for multiple testing in case of weak or zero signal (in contrast to the estimation task, for which this different choice does not modify the results). Another slightly different choice of interval, still close to , will also be of interest below.

In addition, if , it solves the equation in . However, note that the maximiser can be at the boundary and thus may not be a zero of .

## 3 Main results

Let us first describe the -value algorithm.

Algorithm EBayesL Input: , slab prior , target confidence Output: BMT procedure Find the maximiser given by (28); Compute given by (12); Return, for , (29)

###### Theorem 1.

Consider the parameter space given by (2) with sparsity for some . Let be a unimodal symmetric slab density that satisfies (16)–(18) with as in (17). Then the algorithm EBayesL produces as output the BMT defined in (29) that satisfies the following: there exists a constant such that for any , there exists an integer such that, for any ,

 supθ0∈ℓ0[sn]FDR(θ0,φ\tinyℓ-val)≤Cloglogn(logn)κ/2. (30)

Theorem 1 is proved in Section 7. The proof relies mainly on two different arguments: first, a careful analysis of the concentration of , which requires to distinguish between two regimes (signal weak/moderate or strong, basically). Second, study the FDR behavior of the -value procedure taken at some sparsity parameter (not random but depending on ) in each of these two regimes. This requires to analyse the mathematical behavior of a number of functions of , uniformly over a wide range of possible sparsities, which is one main technical difficulty of our results. In particular, the concentration of is obtained uniformly over all sparse vectors with polynomial sparsity, without any strong-signal or self-similarity-type assumption, as would typically be the case for obtaining adaptive confidence sets. Such assumptions would of course simplify the analysis significantly, but the point here is precisely that a uniform FDR control is possible for rate-adaptive procedures without any assumption on the true sparse signal. The uniform concentration of is expressed implicitly and requires sharp estimates, contrary to rate results for which a concentration in a range of values is typically sufficient. In particular, some of our lemmas in the supplement are refined versions of lemmas in [js04].

As a corollary, (30) entails

 ¯¯¯¯¯¯¯¯limnsupθ0∈ℓ0[sn]FDR(θ0,φ\tinyℓ-val) =0,

and this for any chosen threshold in . From a pure -FDR controlling point of view, while making a vanishing small proportion of errors is obviously desirable, it implies that is, as far as the FDR is concerned, somewhat conservative, in the sense that it does not spend all the allowed type I errors ( instead of ) and thus will make too few (true) discoveries at the end. It turns out that in the present setting -values are not quite on the “exact” scale for FDR control. An alternative is to consider the -value scale, as we now describe.

Algorithm EBayesq Input: , slab prior , target confidence Output: BMT procedure Find the maximiser given by (28). Compute Return, for , (31)

We also consider the following variant of the procedure EBayesq, which is mostly the same, except that it does not allow for too small estimated weight . Set, for tending slowly to infinity,

 ωn=Lnn¯¯¯¯G(√2.1logn). (32)

For instance, for Cauchy or quasi-Cauchy, we have while for Laplace(1) we have .

Algorithm EBayesq Input: , slab prior , target confidence , sequence Output: BMT procedure Same as for EBayesq, returning Return, for , and as in (32), (33)

###### Theorem 2.

Consider the same setting as Theorem 1. Then the algorithm EBayesq produces the BMT procedure in (31) that satisfies the following: there exists a constant such that for any , there exists an integer such that, for any ,

 supθ0∈ℓ0[sn]FDR(θ0,φ\tinyq-val) ≤Ctlog(1/t). (34)

In addition, the algorithm EBayesqproduces the BMT procedure in (33) that satisfies, for as in (32) with , , and as before (but with possibly different numerical values), for any ,

 supθ0∈ℓ0[sn]FDR(θ0,φ\tinyq-val.0) ≤Ct. (35)

The proof of Theorem 2 is technically close to that of Theorem 1 and is given in Section 7, see also Section 7.2 for an informal heuristic that serves as guidelines for the proof. The statements of Theorem 2 are however of different nature, because the -value threshold appears explicitly in the bounds (34)-(35), that do not vanish as .

The two bounds (34) and (35) differ from a term, which may become significant for small . This term appears in the case where the signal is weak (only few rejected nulls), for which the calibration is slightly too large. This may not be the case using a different type of sparsity–adaptation, or a different estimate . Indeed, this phenomenon disappears when using EBayesq, since is then set to when it is not large enough, in which case the FDR control is shown to be guaranteed, and we retrieve a dependence in terms of a constant times the target level .

A consequence of Theorem 2 is that an –FDR control can be achieved with EBayesq/EBayesqprocedures by taking sufficiently small (although not tending to zero). Again, it is important to known how small the constant can be taken in (34) and (35). When the signal is strong enough, the following result shows that and the factor can be removed in (34).

Let us first introduce a set of ‘large’ signals, for arbitrary ,

 L0[sn]={θ∈ℓ0[sn]:|θi|≥ a√2log(n/sn) for i∈Sθ,  |Sθ|=sn}. (36)
###### Theorem 3.

Consider defined by (36) with an arbitrary , for and for some . Assume that is a unimodal symmetric slab density that satisfies (16)–(18) with as in (17). Then, for any pre-specified level , EBayesq produces the BMT procedure in (31) such that

 limnsupθ0∈L0[sn]FDR(θ0,φ\tinyq-val)=limninfθ0∈L0[sn]FDR(θ0,φ\tinyq-val)=t. (37)

In addition, EBayesqwith , satisfies the same property whenever , for as in (32), which is in particular the case if grows faster than a given power of and .

Theorem 3, although focused on a specific regime, shows that empirical Bayes procedures are able to produce an asymptotically exact FDR control. Again, this may look surprising at first, as the prior slab density is not particularly linked to the true value of the parameter in (37). This puts forward a strong adaptive property of the spike and slab prior for multiple testing.

###### Remark 4.

Our three results can be extended to the case where is not of the form (9) (that is, not necessarily of the form of a convolution with the standard gaussian), but satisfies some weaker properties, see Section 6.1. This extended setting corresponds to a “quasi-Bayesian” approach where the -values (resp. -values) are directly given by the formulas (11) (resp. (13)), without specifying a slab prior .

## 4 Numerical experiments

In this section, our theoretical findings are illustrated via numerical experiments. A motivation here is also to evaluate how the parameters , , and the hyper-parameter (or ) affect the FDR control, in particular the value of the constant in the bound of Theorem 2.

For this we consider , and the following two possible scenarios for :

• constant alternatives: if and otherwise;

• randomized alternatives: i.i.d. uniformly distributed on if and otherwise.

The parameter range for is taken equal to . The marginal likelihood estimator given by (28) is computed by using a modification of the function wfromx of the package EbayesThresh [js05], that accommodates the lower bound in our definition (instead of , see (49), in the original version). The parameter is either given by the quasi-Cauchy prior (19)-(20) or by the Laplace prior of scaling parameter (see Remark 5 for more details). For any of the above parameter combinations, the FDR of the procedures EBayesL, EBayesq (defined in Section 3) is evaluated empirically via replications.

Figure 1 displays the FDR of the procedures EBayesL (-values) and EBayesq (-values). Concerning EBayesL, in all situations, the FDR is small while not exactly equal to the value , which seems to indicate that the bound found in Theorem 1 is not too conservative. Moreover, the quasi-Cauchy version seems more conservative than the Laplace version, which corroborates our theoretical findings (in our bound (30), we have the factor for quasi-Cauchy and for Laplace). As for EBayesq, when the signal is large, the FDR curves are markedly close to the threshold value when is small, which is in line with Theorem 3. However, for a weak sparsity , the FDR values are slightly inflated (above the threshold ), which seems to indicate that the asymptotical regime is not yet reached for this value. Looking now at the whole range of signal strengths, one notices the presence of a ‘bump‘ in the regime of intermediate values of , especially for the Laplace prior. However, this bump seems to disappear when decreases. We do not known presently whether this bump is vanishing with or if this corresponds to a necessary additional constant (or ) in the achieved FDR level, but we suspect that this is related to the fact that the intermediate regime was the most challenging part of our proofs. Overall, the Cauchy slab prior seems to have a particularly suitable behavior. This was not totally surprising for us as it already showed more stability than the Laplace prior in the context of estimation with the full empirical Bayes posterior distribution, as seen in [cm17].

Finally, we provide additional experiments in the supplement, see Section 14. The findings can be summarized as follows:

• the curves behave qualitatively similarly for randomized alternatives (second scenario);

• at least in the considered framework, the classical BMT procedure based on averaged -values [SC2007] seems to fail to control the frequentist FDR for large signals, which is markedly different from EBayesq;

• the procedure EBayesq(with ) has a global behavior similar to EBayesq, with more conservativeness for weak signal (as expected).

• it is possible to uniformly improve EBayesqby considering the following modification (named EBayesq.hybridbelow): if , instead of rejecting no null, EBayesq.hybridperforms a standard Bonferroni correction, that is, rejects the ’s such that . Note that a careful inspection of the proof of Theorem 2 (EBayesqpart) shows that the bound (35) is still valid for EBayesq.hybrid.

## 5 Discussion

Our results show that spike and slab priors produce posterior distributions with particularly suitable multiple testing properties. One main challenge in deriving the results was to build bounds that are uniform over sparse vectors. We demonstrate that such a uniform control is possible up to a constant term away from the target control level. This constant is very close to in simulations, and can even be shown to be asymptotically for some subclass of sparse vectors.

The results of the paper are meant as a theoretical validation of the common practical use of posterior-based quantities for (frequentist) FDR control. While the main purpose here was validation, and as such the proposed procedures were not particularly meant to improve upon the classical FDR-controlling procedures (which specifically target the FDR control, while here the starting point is the posterior, which is rather a global inference object over the whole vector ), it is remarkable that a uniform control of the FDR very close to the target level can be obtained for the spike and slab BMT procedure in the present unstructured sparse high-dimensional model.

While many studies focused on controlling the Bayes FDR with Bayesian multiple testing procedures, this work paves the way for a frequentist FDR analysis of Bayesian multiple testing procedures in different settings. In our study, the perhaps most surprising fact is how well marginal maximum likelihood estimation combines with FDR control under sparsity: as shown in our proof (and summarized in our heuristic) the score function is linked to a peculiar equation that makes perfectly the link between the numerator and the denominator in the FDR of the -value–based multiple testing procedure. This phenomenon has not been noticed before to the best of our knowledge. We suspect that this link is only part of a more general picture, in which the concentration of the score process in general sparse high dimensional models plays a central role. While this exceeds the scope of this paper, generalizing our results to such settings is a very interesting direction for future work.

## 6 Preliminaries for the proofs

### 6.1 Working with general g

As noted in Remark 4, the results of Theorems 1, 2 and 3 are also true under slightly more general assumptions, that do not impose that is coming from a by a convolution product. Namely, let us assume

 g is a positive, symmetric, differentiable density that decreases on a vicinity of +∞ (38)

( decreasing on a vicinity of means that is decreasing for , for a suitably large constant ). Assume moreover that

 |(logg)′(y)| ≤Λ, for all y∈R,Λ>0; (39) ¯¯¯¯G(y) ≍g(y)yκ−1,as y→∞, % for some κ∈[1,2]; (40) y∈R →(1+y2)g(y) is bounded; (41) g/ϕ is increasing on [0,∞) from (g/ϕ)(0)<1 to ∞; (42)

By Lemma 9, it is worth to note that (42) implies

 ¯¯¯¯G/¯¯¯¯Φ is increasing on [0,∞) from 1 to ∞. (43)

In the case where is of the form of a convolution with , see (9), conditions (39), (40) and (41) are easy consequences of the fact when and condition (42) follows from the fact that for all fixed , the function is increasing, see Lemma 1 of [js04] for a detailed derivation.

A consequence of (39) is that and have at least Laplace tails

 g(y) ≥g(0)e−Λy,y≥0; (44) ¯¯¯¯G(y) ≥g(0)Λ−1e−Λy,y≥0. (45)

### 6.2 BMT as thresholding-based procedures

Recall the definitions (21) and (22). Let, for any and in ,

 r(w,t)=wt(1−w)(1−t). (46)

The following quantity plays the role of threshold for –values,

 ξ=(ϕ/g)−1:(0,(ϕ/g)(0)]→[0,∞), (47)

i.e. is the decreasing continuous invert of (that exists thanks to (42)). Simple algebra shows that for with ,

 ℓi(X)≤t ⇔|Xi|≥ξ(r(w,t)). (48)

When becomes small, the order magnitude of is given in Lemma 12: slightly exceeds but not by much, which comes from the fact that has heavy tails.

Another quantity close to we shall use in the sequel is the threshold introduced in [js04] and defined as, for any ,

 ζ(w)=β−1(w−1). (49)

Combining the definitions leads, see (84) for details, to and . Similarly, let us introduce a threshold for –values as

 χ=(¯¯¯¯Φ/¯¯¯¯G)−1:(0,1]→[0,∞), (50)

which is the decreasing continuous invert of (that exists thanks to (43)). For all and with ,

 qi(X)≤t ⇔|Xi|≥χ(r(w,t)). (51)

Lemma 13 shows that, for small , the order of magnitude of is slightly more than but not by much, which comes from the fact that has heavy tails. Also, Lemma 10 together with (48)-(51) imply

 χ(u)≤ξ(u),for u≤1. (52)

### 6.3 Single type I error rates

The single type I error rates of our procedures are evaluated by the following result (proved in Section 9.2).

###### Proposition 2.

Consider any function satisfying the assumptions of Section 6.1. Consider as in (46), as in (47) and as in (50). Then the following bounds hold. For all such that ,

 Pθ0=0(ℓi(X)≤t) ≤2r(w,t)g(ξ(r(w,t)))ξ(r(w,t)); (53) Pθ0=0(ℓi(X)≤t) ≥r(w,t)g(ξ(r(w,t)))ξ(r(w,t)) if r(w,t) % is small enough. (54)

For -values, we have, for all such that ,

 Pθ0=0(qi(X)≤t) =r(w,t)2¯¯¯¯G(χ(r(w,t))). (55)

As a result, for a fixed , we see that the more is heavily tailed, the more the type I error rate is large. This is well–expected, as the heavier the tails of , the more mass the prior puts on large values.

## 7 Proof of of the main results

### 7.1 Notation

The following moments are useful when studying the score function . Let us set

 ~m(w)=−E0β(X,w)=∫∞−∞β(t,w)ϕ(t)dt (56)

and further denote

 m1(τ,w) =Eτ[β(X,w)]=∫∞−∞β(t,w)ϕ(t−τ)dt. (57) m2(τ,w) =Eτ[β(X,w)2]=∫∞−∞(β(t,w))2ϕ(t−τ)dt. (58)

These expectations are well defined and studied in detail in Appendix 12, refining previous results established in [js04].

In order to study the FDR of a procedure , we introduce the notation

 V(φ) =∑i: θ0,i=0φi,S(φ)=∑i: θ0,i≠0φi, (59)

counting for the number of false and true discoveries, respectively.

### 7.2 Heuristic

Why should the marginal empirical Bayes choice of lead to a correct control of the FDR? Here is an informal argument that will give a direction for our proofs. We consider the case of here as it is expected to reject more nulls than and thus to have a larger FDR.

First, let us note that, when there is enough signal, one can expect to be approximately equal to the solution of the score equation in expectation , that is, by using (27),

 ∑i:θ0,i≠0m1(θ0,i,w⋆)=(n−sn)~m(w⋆),

where and are defined by (56) and (57), respectively, if there has exactly nonzero coordinates. As seen in Section 12, up to -terms,

 ∑i:θ0,i≠0m1(θ0,i,w⋆) ≈∑i:θ0,i≠0¯¯¯¯Φ(ζ(w⋆)−θ0,i)+¯¯¯¯Φ(ζ(w⋆)+θ0,i)w⋆; ~m(w⋆) ≈2¯¯¯¯G(ζ(w⋆)).

Now consider the FDR and assume that all quantities are well concentrated (in particular, take the expectation both in the numerator and denominator in (8)). Then, by using (55), we have

 FDR(θ0,φ\tinyq-val(α;^w,g)) ≈FDR(θ0,φ\tinyq-val(α;w⋆,g)) ≈∑i:θ0,i=0Pθ0,i(q⋆i(X)≤α)∑i:θ0,i=0Pθ0,i(q⋆i(X)≤α)+∑i:θ0,i≠0Pθ0,i(q⋆i(X)≤α) ≈(n−sn)r(w⋆,α)2¯¯¯¯G(ζ(w⋆))(n−sn)r(w⋆,α)2¯¯¯¯G(ζ(w⋆))+∑i:θ0,i≠0Pθ0,i(q⋆i(X)≤α),

where we denoted and we used that is close to , as seen in Section 11. Now, by using the definition of ,

 ∑i:θ0,i≠0Pθ0,i(q⋆i(X)≤α)= ∑i:θ0,i≠0¯¯¯¯Φ(χ(r(w⋆,α))−θ0,i)+¯¯¯¯Φ(χ(r(w⋆,α))+θ0,i) ≈ ∑i:θ0,i≠0¯¯¯¯Φ(ζ(w⋆)−θ0,i)+¯¯¯¯Φ(ζ(w⋆)+θ0,i),

where we used again . Now using the above properties of , the latter is

 ≈w<