Asymptotic Properties of Bayes Risk of a General Class of Shrinkage Priors in Multiple Hypothesis Testing Under Sparsity
Consider the problem of simultaneous testing for the means of independent normal observations. In this paper, we study some asymptotic optimality properties of certain multiple testing rules induced by a general class of one-group shrinkage priors in a Bayesian decision theoretic framework, where the overall loss is taken as the number of misclassified hypotheses. We assume a two-groups normal mixture model for the data and consider the asymptotic framework adopted in Bogdan et al. (2011) who introduced the notion of asymptotic Bayes optimality under sparsity in the context of multiple testing. The general class of one-group priors under study is rich enough to include, among others, the families of three parameter beta, generalized double Pareto priors, and in particular the horseshoe, the normal-exponential-gamma and the Strawderman-Berger priors. We establish that within our chosen asymptotic framework, the multiple testing rules under study asymptotically attain the risk of the Bayes Oracle up to a multiplicative factor, with the constant in the risk close to the constant in the Oracle risk. This is similar to a result obtained in Datta and Ghosh (2013) for the multiple testing rule based on the horseshoe estimator introduced in Carvalho et al. (2009, 2010). We further show that under very mild assumption on the underlying sparsity parameter, the induced decision rules based on an empirical Bayes estimate of the corresponding global shrinkage parameter proposed by van der Pas et al. (2014), attain the optimal Bayes risk up to the same multiplicative factor asymptotically. We provide a unifying argument applicable for the general class of priors under study. In the process, we settle a conjecture regarding optimality property of the generalized double Pareto priors made in Datta and Ghosh (2013). Our work also shows that the result in Datta and Ghosh (2013) can be improved further.
Multiple hypothesis testing has become a topic of growing importance in statistics, particularly for the analysis of high-dimensional data. Its applications extend over various scientific fields such as genomics, bioinformatics, medicine, economics and finance, to name a few. For example, in microarray experiments, thousands of tests are performed simultaneously to identify the differentially expressed genes, that is, genes whose expression levels are associated with some biological trait of interest. Microarray experiments are just one of many examples where one needs to analyze sparse high-dimensional data, the main objective being the detection of a few signals amidst a large body of noise. Multiple hypothesis testing is one convenient and fruitful approach towards this end. The biggest impetus to research in multiple hypothesis testing came from the classic paper of Benjamini and Hochberg (1995). Since then the topic has received considerable attention from both frequentists and Bayesians.
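In this paper, we consider simultaneous testing for means of independent normal observations. Suppose we have independent observations $X_1, \dots, X_n$ such that $X_i \sim N(\mu_i, \sigma^2)$ for $i = 1, \dots, n$. The unknown parameters $\mu_1, \dots, \mu_n$ represent the effects under investigation, while $\sigma^2$ is the variance of the random noise. We wish to test $H_{0i}: \mu_i = 0$ against $H_{Ai}: \mu_i \neq 0$, for $i = 1, \dots, n$. Our focus is on situations when $n$ is large and the fraction of non-zero $\mu_i$'s is small. For each $i$, $\mu_i$ is assumed to be a random variable whose distribution is determined by the latent binary random variable $\nu_i$, where $\nu_i = 0$ denotes the event that $H_{0i}$ is true while $\nu_i = 1$ corresponds to the event that $H_{0i}$ is false. Here the $\nu_i$'s are assumed to be i.i.d. $\mathrm{Bernoulli}(p)$ random variables, for some $p$ in $(0,1)$. Under $H_{0i}$, $\mu_i \sim \delta_{\{0\}}$, i.e., the distribution having mass 1 at 0, while under $H_{Ai}$, $\mu_i \neq 0$ and it is assumed to follow a $N(0, \psi^2)$ distribution with $\psi^2 > 0$. Thus

$$\mu_i \overset{\text{i.i.d.}}{\sim} (1-p)\,\delta_{\{0\}} + p\,N(0, \psi^2), \quad i = 1, \dots, n. \tag{1.1}$$

The marginal distributions of the $X_i$'s are then given by the following two-groups model:

$$X_i \overset{\text{i.i.d.}}{\sim} (1-p)\,N(0, \sigma^2) + p\,N(0, \sigma^2 + \psi^2), \quad i = 1, \dots, n. \tag{1.2}$$

Our testing problem is now equivalent to testing simultaneously

$$H_{0i}: \nu_i = 0 \quad \text{versus} \quad H_{Ai}: \nu_i = 1, \quad \text{for } i = 1, \dots, n. \tag{1.3}$$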
It is assumed that $p$, $\psi^2$ and $\sigma^2$ depend on the number of hypotheses $n$. The parameter $p$ is the theoretical proportion of non-nulls in the population. In sparse situations, where most of the $\mu_i$'s are zero or very small in magnitude, it is natural to assume that $p$ is small and converges to $0$ as the number of hypotheses tends to infinity. The variance component $\psi^2$ is typically assumed to be large to identify the true signals. Such a model is very natural where one has few potentially large signals among a large pool of noise terms and has been very popular in the literature. See, for example, Mitchell and Beauchamp (1988), for an early use of modeling of this kind in Bayesian variable selection where a uniform prior is used for the absolutely continuous part in place of the normal prior as in (1.1) above. The two-groups model has the advantage of capturing information across different tests through learning about the common hyperparameters based on information from all the data points. Fully Bayesian approaches towards multiple testing based on the two-groups model by placing hyperpriors on the underlying model parameters are available in the literature; see, for example, Scott and Berger (2006) and Bogdan et al. (2008). Empirical Bayes approaches using the two-groups formulation have been considered, for example, in Efron (2004, 2008), Storey (2007) and Bogdan et al. (2008), to name a few. Under model (1.2) and the usual additive loss function, Bogdan et al. (2011) provided conditions under which the optimal Bayes risk (that is, the risk corresponding to the Bayes rule) can be attained asymptotically under sparsity by a multiple testing procedure as the number of tests grows to infinity. They referred to this property as Asymptotic Bayes Optimality under Sparsity (ABOS). In particular, they showed that the procedures of Benjamini and Hochberg (1995) and Bonferroni attain the ABOS property under mild conditions.
The optimal Bayes rule is also referred to as a Bayes Oracle in Bogdan et al. (2008) and Bogdan et al. (2011) and will be discussed in detail further in Section 3.
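The two-groups model and the Bayes Oracle can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: it assumes $\sigma^2 = 1$, uses the hypothetical names `p` (proportion of non-nulls) and `psi2` (signal variance), and takes the oracle cutoff obtained by solving "likelihood ratio of $N(0, 1+\psi^2)$ to $N(0,1)$ exceeds $(1-p)/p$" for $X_i^2$, which is the Bayes rule under a symmetric 0-1 loss.

```python
import math
import random

def simulate_two_groups(n, p, psi2, seed=0):
    """Draw (nu_i, X_i), i = 1..n, from the two-groups model with sigma^2 = 1 (assumed)."""
    rng = random.Random(seed)
    nu, x = [], []
    for _ in range(n):
        is_signal = rng.random() < p                      # nu_i ~ Bernoulli(p)
        sd = math.sqrt(1.0 + psi2) if is_signal else 1.0  # marginal sd of X_i given nu_i
        nu.append(int(is_signal))
        x.append(rng.gauss(0.0, sd))
    return nu, x

def oracle_threshold_sq(p, psi2):
    """Squared cutoff c^2: the likelihood ratio exceeds (1 - p) / p exactly when X_i^2 > c^2."""
    f = (1.0 - p) / p
    return (1.0 + psi2) / psi2 * (math.log(1.0 + psi2) + 2.0 * math.log(f))

def misclassified(nu, x, c2):
    """Number of hypotheses for which the decision 1{X_i^2 > c2} disagrees with nu_i."""
    return sum(int((xi * xi > c2) != bool(nui)) for nui, xi in zip(nu, x))

# Illustrative parameter values only (not taken from the paper).
nu, x = simulate_two_groups(n=2000, p=0.05, psi2=25.0, seed=1)
c2 = oracle_threshold_sq(0.05, 25.0)
print(misclassified(nu, x, c2), "of", len(x), "hypotheses misclassified by the oracle")
```

The oracle uses the true `p` and `psi2`, which is exactly why it serves only as a benchmark: a practical rule has to do without them.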
In contrast to the above two-groups formulation, there are proposals to model the unknown parameters in sparse situations through hierarchical one-group “shrinkage” priors. Such priors can be expressed as scale-mixtures of normals and their use requires substantially less computational effort than the two-groups model, especially in high-dimensional problems as well as in complex parametric frameworks. These priors capture sparsity by assigning large probabilities to means close to zero while at the same time giving non-trivial probabilities to large means. This is achieved by employing two levels of parameters to express the prior variances of the $\mu_i$'s, namely, the “local shrinkage parameters”, which control the degree of shrinkage at the individual level, and a “global shrinkage parameter”, common to all the $\mu_i$'s, which causes an overall shrinking effect. If the mixing density corresponding to the local shrinkage parameters is appropriately heavy tailed, the large observations are left almost unshrunk, which is often referred to as the “tail robustness” property. The choice of the global shrinkage parameter varies across specifications and will be discussed in greater detail in Section 2.
Some early examples of one-group shrinkage priors are the $t$-prior (Tipping (2001)), the Laplace prior in the context of Bayesian Lasso (Park and Casella (2008) and Hans (2009)) and the family of normal-exponential-gamma priors (Griffin and Brown (2005)). More recently, Carvalho et al. (2009, 2010) introduced a hierarchical Bayesian one-group prior called the horseshoe prior. Various new one-group shrinkage priors have been proposed in the literature and studied since then. Armagan et al. (2011) introduced the class of “three parameter beta normal” mixture priors while the class of “generalized double Pareto” priors was introduced in Armagan et al. (2012). The family of three parameter beta normal mixture priors generalizes some well known shrinkage priors such as the horseshoe, Strawderman-Berger and normal-exponential-gamma priors. See also Polson and Scott (2011, 2012), Scott (2011) and Griffin and Brown (2010, 2012, 2013) in this context. Many of these one-group priors, including the horseshoe, employ local shrinkage parameters with priors having the aforesaid tail robustness property.
The horseshoe prior has acquired an important place in the literature on “shrinkage” priors and it has been used in estimation as well as in multiple testing and variable selection problems. Carvalho et al. (2010) proposed a new multiple testing procedure for the normal means problem based on the horseshoe prior. They observed through numerical findings that under sparsity of the true normal means, the procedure based on the horseshoe prior performs closely to the Bayes rule when the true data comes from a two-groups model and the loss of a testing procedure is taken as the number of misclassified hypotheses. Datta and Ghosh (2013) theoretically established this optimality by showing that the ratio of the Bayes risk for this procedure to that of the Bayes Oracle under the two-groups model (1.2) is within a constant factor asymptotically. Moreover, it was numerically shown in their paper, that priors having exponential or lighter tails, such as the Laplace or the normal prior, fail to achieve such optimality property.
As commented in Carvalho et al. (2009), a carefully chosen two-groups model can be considered a “gold standard” for sparse problems. Therefore, it may be used as a benchmark against which the “shrinkage” priors can be judged. Motivated by this and inspired by the results in Carvalho et al. (2010) and Datta and Ghosh (2013), we want to study in this paper asymptotic optimality properties of multiple testing procedures induced by a very general class of “shrinkage” priors which are heavy tailed and yet handle sparsity well. This class contains the “three parameter beta normal” mixture priors as well as the “generalized double Pareto” priors. We consider multiple testing rules based on these priors and apply them on data generated from a two-groups model. We establish that these rules achieve the same Bayesian optimality property as shown in Datta and Ghosh (2013) for the testing rule based on the horseshoe prior, assuming that the global shrinkage parameter is appropriately chosen based on the theoretical proportion of true alternatives. In case this proportion is unknown, we consider an empirical Bayes version of this test procedure, where the global shrinkage parameter is estimated using the data as in van der Pas et al. (2014). We show that the resulting empirical Bayes testing procedure also attains the optimal Bayes risk asymptotically up to the same multiplicative factor. We also study the performance of such rules on simulated data and our theoretical results are corroborated by the simulations.
The highlight of this paper is a unified treatment of the question of Bayesian optimality in multiple testing under sparsity based on a very general class of one-group priors, taking the same loss function as in Datta and Ghosh (2013). In the process, we not only generalize their results for a very broad class of tail robust shrinkage priors, but also strengthen their optimality result by deriving a sharper asymptotic upper bound to the corresponding Bayes risk. We have a new unifying argument that enables us to establish asymptotic bounds to the risk for this whole class of priors. Datta and Ghosh (2013) conjectured that for the present multiple testing problem, the generalized double Pareto prior should enjoy similar optimality property like the horseshoe prior. We settle this conjecture by showing that the generalized double Pareto is indeed a member of this general class of tail robust priors under consideration. Further, our general technique of proof shows that some of the arguments in Datta and Ghosh (2013) can be simplified.
The organization of this paper is as follows. In Section 2, we describe the general class of one-group priors under study and define the multiple testing procedure based on them. In Section 3, we present our main theoretical results after describing the optimal Bayes rule under the two-groups model and the asymptotic framework under which these theoretical results are derived. The main results of Section 3 crucially depend on some key inequalities involving the posterior distribution of the underlying shrinkage coefficients, that help us in deriving important asymptotic bounds to the type I and type II error probabilities. These inequalities and the bounds on both types of error probabilities are presented in Section 4. Section 5 contains the simulation results followed by a discussion in Section 6. Proofs of the theoretical results are given in the Appendix.
1.1 Notations and Definition
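Given any two sequences of positive real numbers $\{a_n\}$ and $\{b_n\}$, with $b_n \neq 0$ for all $n$, we write $a_n \sim b_n$ to denote $\lim_{n \to \infty} a_n / b_n = 1$. For any two sequences of real numbers $\{a_n\}$ and $\{b_n\}$, with $b_n \neq 0$ for all $n$, we write $a_n = O(b_n)$ if $|a_n / b_n| \le M$ for all $n$, for some positive real number $M$ independent of $n$, and $a_n = o(b_n)$ to denote $\lim_{n \to \infty} a_n / b_n = 0$. Thus $a_n = o(1)$ if $\lim_{n \to \infty} a_n = 0$. Moreover, given any two positive real valued functions $f(x)$ and $g(x)$, both having a common domain of definition $(A, \infty)$, $A \ge 0$, we write $f(x) \sim g(x)$ as $x \to \infty$ to denote $\lim_{x \to \infty} f(x) / g(x) = 1$.
By a standard normal random variable we mean a $N(0,1)$ random variable having cumulative distribution function $\Phi(\cdot)$ and probability density function $\phi(\cdot)$.
A positive measurable function $L$ defined over some $(A, \infty)$, $A \ge 0$, is said to be slowly varying or is said to vary slowly (in Karamata’s sense) if for every fixed $\alpha > 0$, $L(\alpha x) / L(x) \to 1$ as $x \to \infty$.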
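The definition of slow variation can be explored numerically. The sketch below evaluates the ratio $L(\alpha x)/L(x)$ at increasing $x$ for three illustrative functions chosen here (they are not taken from the paper): two are slowly varying, while the power function $t^{0.1}$ is not, since its ratio stays at $\alpha^{0.1}$.

```python
import math

def ratio(L, alpha, x):
    """Return L(alpha * x) / L(x); for a slowly varying L this tends to 1 as x grows."""
    return L(alpha * x) / L(x)

# Illustrative candidates (assumptions for this sketch).
candidates = {
    "t/(1+t)": lambda t: t / (1.0 + t),   # bounded, converges to 1: slowly varying
    "log(1+t)": lambda t: math.log1p(t),  # unbounded, yet slowly varying
    "t**0.1": lambda t: t ** 0.1,         # regularly varying with index 0.1: not slowly varying
}

for name, L in candidates.items():
    print(name, [round(ratio(L, 5.0, 10.0 ** k), 4) for k in (2, 6, 12)])
```

Convergence can be very slow, as the logarithm shows, so a numerical check of this kind is only suggestive and does not replace a limit argument.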
2 The one-group priors and the corresponding induced multiple testing procedures
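As mentioned in the introduction, our aim in this paper is to study, through theoretical investigations and simulations, asymptotic risk properties of the multiple testing rules induced by a very broad class of one-group shrinkage priors, when applied to data that come from the two-groups model in (1.2). The class of one-group priors we study is inspired by a class of priors suggested in Polson and Scott (2011) which can be represented through the following hierarchical formulation:

$$
\begin{aligned}
X_i \mid \mu_i &\sim N(\mu_i, \sigma^2), \quad \text{independently for } i = 1, \dots, n,\\
\mu_i \mid (\lambda_i^2, \sigma^2, \tau^2) &\sim N(0, \lambda_i^2 \sigma^2 \tau^2), \quad \text{independently for } i = 1, \dots, n,\\
\lambda_i^2 &\sim \pi_1(\lambda_i^2), \quad \text{independently for } i = 1, \dots, n,\\
(\tau, \sigma^2) &\sim \pi_2(\tau, \sigma^2).
\end{aligned}
$$

Our specific choices of $\pi_1$ and $\pi_2$ are described and explained below. Note that the above hierarchy differs slightly from that of Polson and Scott (2011) in that we bring $\sigma^2$ in earlier in the sequence, while in their formulation $\sigma^2$ enters later through the conditional prior of $\tau$ given $\sigma$. But both formulations produce the same marginal prior distribution for the $\mu_i$'s.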
The above one-group formulation is often referred to as a global-local scale mixture of normals. The parameter $\tau$ is called a “global” shrinkage parameter, while the parameters $\lambda_i^2$ are called the “local” shrinkage parameters. The corresponding posterior mean of $\mu_i$ is given by

$$E(\mu_i \mid X_1, \dots, X_n) = \{1 - E(\kappa_i \mid X_1, \dots, X_n)\}\, X_i, \tag{2.1}$$

where $\kappa_i = 1/(1 + \lambda_i^2 \tau^2)$ is called the $i$-th shrinkage coefficient. It is observed in Carvalho et al. (2010) and Polson and Scott (2011) that under the two-groups model (1.2), for large $\psi^2$, the posterior mean of $\mu_i$ can be approximated as

$$E(\mu_i \mid X_i) \approx \omega_i X_i, \tag{2.2}$$

where $\omega_i = P(H_{Ai} \text{ is true} \mid X_i)$ denotes the posterior probability that $H_{Ai}$ is true. It may be noted further that when $p$ is close to zero, most of the $\omega_i$'s are expected to be very close to zero unless $|X_i|$ is sufficiently large, in which case the corresponding $\omega_i$ is expected to be close to 1, provided $\psi^2$ is large enough. This ensures that the noise observations are mostly shrunk towards zero, while the large $X_i$'s are left mostly unshrunk. Here the parameter $p$ is responsible for achieving an overall shrinkage, while the large $\psi^2$ is helpful in discovering the true signals.
Using the above observations, for the one-group model, Polson and Scott (2011) argued that in sparse problems, the global shrinkage parameter $\tau$ (whose role is analogous to that of $p$ in the two-groups prior) should be small and its prior should have substantial mass near zero, whereas the prior for the local shrinkage parameters $\lambda_i^2$ should have thick tails. This ensures that the resulting prior for the $\mu_i$'s is highly peaked near zero but also heavy tailed enough to accommodate large signals. In this sense, the one-group priors can be thought of as approximately similar to a two-groups prior with an appropriately heavy-tailed absolutely continuous part.
Motivated by the preceding discussion and the work of Polson and Scott (2011), we take $\pi_1(\lambda_i^2)$ to be of the form

$$\pi_1(\lambda_i^2) = K\,(\lambda_i^2)^{-a-1} L(\lambda_i^2) \tag{2.3}$$

in our hierarchical formulation. Here $K > 0$ is the constant of proportionality, $a$ is a positive real number and $L$ is a positive measurable, non-constant, slowly varying function over $(0, \infty)$. It follows from Theorem 1 of Polson and Scott (2011) that the above general class of one-group priors achieves the desired “tail robustness” property in the sense that for any given $\tau$ and $\sigma^2$, $E(\mu_i \mid X_i, \tau, \sigma^2) \approx X_i$ for large $X_i$'s. Since $\pi_1$ is assumed to be proper, the possibility of $L$ being a constant function is ruled out.
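As a worked check of membership in this class, consider the horseshoe prior, whose local scale satisfies $\lambda_i \sim C^{+}(0,1)$ (Carvalho et al. (2010)). The sketch below assumes the class is written as $\pi_1(\lambda_i^2) = K(\lambda_i^2)^{-a-1}L(\lambda_i^2)$ and uses only a change of variables from $\lambda_i$ to $t = \lambda_i^2$.

```latex
\begin{align*}
\pi(\lambda_i) &= \frac{2}{\pi}\,\frac{1}{1+\lambda_i^2}, \qquad \lambda_i > 0, \\
\pi_1(t) &= \pi(\sqrt{t})\,\frac{1}{2\sqrt{t}}
          = \frac{1}{\pi}\,\frac{t^{-1/2}}{1+t}
          = \frac{1}{\pi}\; t^{-\frac12 - 1}\;\frac{t}{1+t}, \qquad t = \lambda_i^2 > 0, \\
\text{so}\quad K &= \frac{1}{\pi}, \qquad a = \frac12, \qquad L(t) = \frac{t}{1+t}, \\
\frac{L(\alpha t)}{L(t)} &= \frac{\alpha\,(1+t)}{1+\alpha t} \longrightarrow 1
          \quad \text{as } t \to \infty, \text{ for every fixed } \alpha > 0 .
\end{align*}
```

Here $L$ is non-constant, bounded above by 1, and converges to 1, so it is slowly varying with a finite positive limit.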
It will be proved in Section 2.2 and Section 2.3 that a very broad class of one-group priors, such as the generalized double Pareto and the three parameter beta normal mixtures, actually fall inside the general class of shrinkage priors under consideration. It is worth pointing out here that in some one-group formulations, like the original form of the generalized double Pareto in Armagan et al. (2012), the global shrinkage parameter is not explicitly mentioned or, equivalently, it is kept fixed at 1. In some other cases, like the three parameter beta normal mixtures, a shared global shrinkage parameter is explicitly given. Armagan et al. (2011) opined that it is reasonable to put a prior on the global shrinkage parameter, but this parameter may also be kept fixed at a certain value which reflects the prior knowledge about sparsity if such information is available. In case such prior knowledge is unavailable, one can consider either of two approaches, namely, (i) a full Bayes approach by placing a further hyperprior over $\tau$ and (ii) an empirical Bayes approach by learning about $\tau$ through the data. In line with the recommendation of Polson and Scott (2011), for a full Bayes treatment of the present multiple testing problem, we consider the following joint prior distribution of $(\tau, \sigma^2)$,
which will be used later in our simulation study. It should be noted here that Gelman (2006) strongly recommended the use of a half-Cauchy (or more generally, a folded non-central-$t$) distribution as a prior for the global variance component in a hierarchical Bayesian formulation. Although his original recommendation was to use a half-Cauchy prior for $\tau$ scaled by the error variance, we take $\tau \sim C^{+}(0,1)$ since the error variance term appears earlier in our hierarchical formulation. We also consider an empirical Bayes approach to be discussed in detail shortly.
We describe below the multiple testing rules considered in this paper. We first consider two rules (defined in (2.5) and (2.7) below) for which asymptotic optimality results have been derived theoretically. For this we assume $\sigma^2$ to be known and equal to 1. For the first rule, we treat $\tau$ as a tuning parameter to be chosen freely depending on the value of $p$, while the second one is based on an empirical Bayes estimate of $\tau$. Note that a comparison between the expressions in (2.1) and (2.2) for the posterior mean of $\mu_i$, together with the previous discussion, suggests that the posterior shrinkage weights $1 - E(\kappa_i \mid X_1, \dots, X_n)$ based on tail robust shrinkage priors should behave like the posterior inclusion probability $\omega_i$ in the two-groups model. Using this observation, Carvalho et al. (2010) proposed a natural classification rule based on the posterior shrinkage weights under a symmetric 0-1 loss for the horseshoe prior. Borrowing the same idea, we consider the following multiple testing procedure based on our chosen class of tail-robust one-group shrinkage priors, given by:

$$\text{Reject } H_{0i} \text{ if } E(1 - \kappa_i \mid X_i, \tau) > \tfrac{1}{2}, \quad i = 1, \dots, n. \tag{2.5}$$

As mentioned in the introduction, Datta and Ghosh (2013) considered the multiple testing rule defined in (2.5) based on the horseshoe prior. They showed that it asymptotically attains the optimal Bayes risk up to a multiplicative factor. It will be seen later that the Oracle optimality property of the decision rule in (2.5) based on our general class of one-group priors critically depends on an appropriate choice of $\tau$ depending on $p$. This choice plays a significant role in the limiting value of the type II error measure and in controlling the rate of the overall contribution from type I error in the risk function. This is similar to the observations made in Datta and Ghosh (2013) for the above multiple testing rule based on the horseshoe prior.
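For the horseshoe prior, the shrinkage-weight rule can be evaluated by one-dimensional numerical integration. The following is a minimal sketch under stated assumptions rather than the procedure used in the paper's simulations: it takes $\sigma^2 = 1$, a fixed $\tau$, the local prior $\lambda_i \sim C^{+}(0,1)$, and the fact that $X_i \mid \lambda_i, \tau \sim N(0, 1 + \lambda_i^2\tau^2)$. The function names and the midpoint rule are choices made here for illustration.

```python
import math

def horseshoe_shrinkage_weight(x, tau, grid=20000):
    """Approximate E(1 - kappa | x, tau) under the horseshoe prior with sigma^2 = 1.

    The substitution lambda = tan(theta) turns the half-Cauchy prior on lambda
    into a constant weight on (0, pi/2), which is integrated by the midpoint rule.
    """
    h = (math.pi / 2.0) / grid
    num = den = 0.0
    for k in range(grid):
        theta = (k + 0.5) * h
        s = (tau * math.tan(theta)) ** 2      # lambda^2 * tau^2
        kappa = 1.0 / (1.0 + s)               # shrinkage coefficient
        # N(0, 1/kappa) density of x, up to a constant factor
        w = math.sqrt(kappa) * math.exp(-0.5 * kappa * x * x)
        num += (1.0 - kappa) * w
        den += w
    return num / den

def reject(x, tau):
    """Decision of the shrinkage-weight rule: reject H_0i when the weight exceeds 1/2."""
    return horseshoe_shrinkage_weight(x, tau) > 0.5

# Illustrative values only: tau = 0.05 is an arbitrary small global parameter.
for x in (0.0, 2.0, 4.0, 6.0):
    print(x, round(horseshoe_shrinkage_weight(x, 0.05), 3), reject(x, 0.05))
```

With a small $\tau$, observations near zero receive a weight close to 0 while large observations receive a weight close to 1, which is the behaviour that lets the weight stand in for a posterior inclusion probability.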
In a recent article, van der Pas et al. (2014) considered the problem of estimating an $n$-dimensional multivariate normal mean vector which is sparse in the nearly black sense, that is, the number of non-zero entries is of a smaller order than $n$ as $n \to \infty$. They modeled the mean vector through the horseshoe prior and estimated it by the corresponding posterior mean, namely, the horseshoe estimator. They showed that for $\tau$ suitably chosen depending on the proportion of non-zero elements of the mean vector, the horseshoe estimator asymptotically attains the corresponding minimax risk, possibly up to a multiplicative constant, and the corresponding posterior distribution contracts at this optimal rate. But in practice this proportion is usually unknown. A natural approach in such situations is to learn about $\tau$ from the data and then plug this choice into the corresponding posterior mean. For the case of an unknown proportion, van der Pas et al. (2014) proposed a natural estimator of $\tau$ and showed that the horseshoe estimator based on this estimate attains the corresponding minimax risk up to some multiplicative factor. Inspired by this, we consider the following estimator of $\tau$ due to van der Pas et al. (2014) in case $p$ is unknown:

$$\hat{\tau} = \max\left\{\frac{1}{n},\ \frac{1}{c_2 n}\sum_{j=1}^{n} 1\left\{|X_j| > \sqrt{c_1 \log n}\right\}\right\}, \tag{2.6}$$

where $c_1$ and $c_2$ are some predetermined finite positive constants. Note that the above estimator of $\tau$ is truncated below by $1/n$ and hence it is not susceptible to collapsing to zero, which is a major concern for the use of such empirical Bayes approaches as mentioned in Carvalho et al. (2009), Scott and Berger (2010), Bogdan et al. (2008) and Datta and Ghosh (2013). We refer to van der Pas et al. (2014) for a detailed discussion on this point. Let $E(1 - \kappa_i \mid X_i, \hat{\tau})$ denote the posterior shrinkage weight $E(1 - \kappa_i \mid X_i, \tau)$ evaluated at $\tau = \hat{\tau}$. We consider the following empirical Bayes procedure based on $E(1 - \kappa_i \mid X_i, \hat{\tau})$, $i = 1, \dots, n$, given by

$$\text{Reject } H_{0i} \text{ if } E(1 - \kappa_i \mid X_i, \hat{\tau}) > \tfrac{1}{2}, \quad i = 1, \dots, n. \tag{2.7}$$
For the simulations we consider two cases. The first is a full Bayes treatment with $(\tau, \sigma^2)$ given the joint prior distribution as in (2.4), for which the corresponding rule is defined as

$$\text{Reject } H_{0i} \text{ if } E(1 - \kappa_i \mid X_1, \dots, X_n) > \tfrac{1}{2}, \quad i = 1, \dots, n, \tag{2.8}$$

where $E(1 - \kappa_i \mid X_1, \dots, X_n)$ denotes the $i$-th posterior shrinkage weight after integrating with respect to the joint posterior density of $(\tau, \sigma^2)$. The second uses the empirical Bayes decisions as in (2.7), where we fix $\sigma^2 = 1$. We apply the above decision rules in (2.8) or (2.7) induced by these priors in the multiple testing problem (1.3), where the true data are generated from the two-groups mixture model (1.2) described before. We show in this paper, through theoretical analysis and simulations, that the aforesaid decision rules enjoy a similar optimality property to that shown for the horseshoe prior in Datta and Ghosh (2013).
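The empirical Bayes estimate of the global shrinkage parameter is simple to compute. The sketch below assumes the estimator of van der Pas et al. (2014) in the form "the larger of $1/n$ and the fraction of observations exceeding $\sqrt{c_1 \log n}$ in absolute value, divided by $c_2$". The default values `c1=2.0` and `c2=1.0` are illustrative assumptions, since the text only requires predetermined positive constants.

```python
import math

def tau_hat(x, c1=2.0, c2=1.0):
    """Plug-in estimate of the global shrinkage parameter, truncated below by 1/n.

    x  : sequence of observations X_1, ..., X_n
    c1 : constant in the cutoff sqrt(c1 * log n)      (illustrative default)
    c2 : constant dividing the exceedance proportion  (illustrative default)
    """
    n = len(x)
    cutoff = math.sqrt(c1 * math.log(n))
    exceed = sum(1 for xi in x if abs(xi) > cutoff)
    return max(1.0 / n, exceed / (c2 * n))

# Toy data: 98 exact zeros and two large observations.
sample = [0.0] * 98 + [10.0, -10.0]
print(tau_hat(sample))
```

The lower truncation at $1/n$ is what keeps the estimate from collapsing to zero when no observation crosses the cutoff.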
2.1 Some well known one-group shrinkage priors
In this section, we demonstrate that some popular shrinkage priors actually fall within the general class of one-group priors considered in this paper. This follows from observing that the mixing density corresponding to the local shrinkage parameter can be expressed in the form (2.3) where $L$ is a slowly varying function over $(0, \infty)$. This in turn can be shown by proving that the corresponding $L(t)$ converges to a finite positive limit as $t$ goes to infinity. We also show that for each of these priors the corresponding $L$ is uniformly bounded by some finite positive constant. The boundedness property of $L$ is important, as it makes the proofs of the theoretical results of this paper much simpler. This will become clear in Section 4 of this paper.
2.2 Three Parameter Beta Normal Mixtures
Let us consider the following global-local scale mixture formulation of one-group priors:
for . The mixing density given in (2.9), in fact, corresponds to an inverted-beta density (or, beta density of the second kind) with parameters and . The prior density corresponding to the shrinkage coefficients is then given by,
which corresponds to an density. Therefore, the above hierarchical one-group formulation can alternatively be represented as
This gives the three parameter beta normal mixture priors introduced by Armagan et al. (2011) and is denoted by . The TPBN family of priors is rich enough to generalize some well known shrinkage priors, such as the horseshoe prior with , the Strawderman-Berger prior with , and and the normal-exponential-gamma priors with , .
Note that the prior in (2.9) can also be written as,
where and . Clearly, , thereby implying that the TPBN family of priors falls within our general class of global-scale mixture normals. Also, note that which shows that the associated function is bounded as mentioned earlier.
2.3 Generalized Double Pareto Priors
Let us consider the hierarchical one-group global-local scale mixture formulation as described at the beginning of Section 2, where is defined as,
for some fixed and . It follows that has the density
The density in (2.10) above, corresponds to a generalized double Pareto density with shape parameter and scale parameter and is denoted by . Equivalently, it may also be interpreted as the density of a random variable multiplied by When and a distribution is known as the standard double Pareto distribution. We refer to this hierarchical global-local scale mixture formulation with defined as above, as the generalized double Pareto prior introduced by Armagan et al. (2012). For simulations in Section 5, in our hierarchical global-local scale mixture formulation, when we talk about the standard double Pareto prior, we mean that , and we mix further with respect to the joint density of for a full Bayes treatment or use an empirical Bayes estimate of taking to be fixed, as mentioned before.
Now we demonstrate that the generalized double Pareto prior falls within our chosen class of tail robust shrinkage priors. Towards this end, we first observe that the mixing density corresponding to the generalized double Pareto prior can be written as,
where and .
Now applying Lebesgue’s Dominated Convergence Theorem, we obtain,
which means that defined above slowly varies over This also shows that the mixing density given in (2.11) can be expressed in the form given by equation (2.3) with as above and . Thus, the generalized double Pareto prior falls within our general class of tail-robust shrinkage priors. Moreover, using the monotone convergence theorem, it follows that , which means the function defined above, is bounded as well.
3 Asymptotic framework and the main results
In this section we present our major theoretical results about asymptotic optimality of the multiple testing rules (2.5) and (2.7) under study. In Section 3.1, we first describe the decision theoretic setting and the optimal Bayes rule under this setting. We then describe the asymptotic framework under which our theoretical results are derived. Section 3.2 presents the main theoretical results of this paper involving asymptotic bounds to the Bayes risk of the induced decisions (2.5) and (2.7) under study. The Oracle optimality properties of these decision rules, up to a multiplicative constant, then follow immediately.
3.1 Optimal Bayes Rule and the Asymptotic Framework
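Suppose $X_1, \dots, X_n$ are independently distributed according to the two-groups model (1.2), with $\sigma^2 = 1$. We are interested in the multiple testing problem (1.3). We assume a symmetric 0-1 loss for each individual test and the total loss of a multiple testing procedure is assumed to be the sum of the individual losses incurred in each test. Letting $t_{1i}$ and $t_{2i}$ denote the probabilities of type I and type II errors respectively of the $i$-th test, the Bayes risk $R$ of a multiple testing procedure under the two-groups model (1.2) is given by

$$R = \sum_{i=1}^{n} \{(1-p)\,t_{1i} + p\,t_{2i}\}.$$

The multiple testing rule which minimizes this Bayes risk rejects $H_{0i}$ if

$$\frac{f_A(X_i)}{f_0(X_i)} > \frac{1-p}{p}, \quad \text{that is, if} \quad X_i^2 > c^2,$$

where $f_A$ denotes the marginal density of $X_i$ under $H_{Ai}$ while $f_0$ denotes that under $H_{0i}$ and $c^2 = \frac{1+\psi^2}{\psi^2}\{\log(1+\psi^2) + 2\log f\}$, with $f = (1-p)/p$. The above rule is called the Bayes Oracle since it makes use of the unknown parameters $\psi$ and $p$, and hence is not attainable in finite samples. By introducing two new parameters $u = \psi^2$ and $v = u f^2$, the above threshold becomes

$$c^2 = \left(1 + \frac{1}{u}\right)\left\{\log v + \log\left(1 + \frac{1}{u}\right)\right\}.$$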
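Bogdan et al. (2011) considered the following asymptotic scheme:

Assumption 3.1. The sequence of vectors $(\psi_n^2, p_n)$ satisfies the following conditions: $p_n \to 0$, $u_n = \psi_n^2 \to \infty$ and $\log(v_n)/u_n \to C \in (0, \infty)$ as $n \to \infty$, where $v_n = u_n (1-p_n)^2 / p_n^2$.

Under this scheme, the optimal Bayes risk, that is, the risk of the Bayes Oracle, is given by

$$R_{Opt}^{BO} = n p \left\{2\Phi(\sqrt{C}) - 1\right\}(1 + o(1)). \tag{3.4}$$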
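We want to study asymptotic optimality properties of the multiple testing rules (2.5) and (2.7), induced by our general class of one-group tail robust shrinkage priors when applied to data generated from the two-groups model (1.2), where the hyperparameters of the two-groups model satisfy Assumption 3.1. For simplicity of notation, henceforth we drop the subscript $n$ from the parameters of the two-groups model. For the sake of completeness, we describe below the one-group prior specification for our theoretical analysis:

$$
\begin{aligned}
X_i \mid \mu_i &\sim N(\mu_i, 1), \quad \text{independently for } i = 1, \dots, n,\\
\mu_i \mid (\lambda_i^2, \tau^2) &\sim N(0, \lambda_i^2 \tau^2), \quad \text{independently for } i = 1, \dots, n,\\
\lambda_i^2 &\sim \pi_1(\lambda_i^2) = K\,(\lambda_i^2)^{-a-1} L(\lambda_i^2), \quad \text{independently for } i = 1, \dots, n,
\end{aligned}
\tag{3.5}
$$

where $a > 0$, $K > 0$ and $L$ is a non-constant slowly varying function over $(0, \infty)$. Under (3.5), the shrinkage coefficients $\kappa_i = 1/(1 + \lambda_i^2 \tau^2)$ are independently distributed given $(X_1, \dots, X_n, \tau)$, with the posterior of $\kappa_i$ depending only on $X_i$ and $\tau$. Since $X_i \mid \kappa_i \sim N(0, 1/\kappa_i)$, a change of variables from $\lambda_i^2$ to $\kappa_i$ gives

$$\pi(\kappa_i \mid X_i, \tau) \propto \kappa_i^{a + \frac{1}{2} - 1}\,(1 - \kappa_i)^{-a-1}\, L\!\left(\frac{1}{\tau^2}\left(\frac{1}{\kappa_i} - 1\right)\right) \exp\!\left(-\frac{\kappa_i X_i^2}{2}\right), \quad 0 < \kappa_i < 1.$$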
3.2 Main Theoretical Results
In this section, we present in Theorem 3.1 and Theorem 3.2 the main theoretical findings of this paper. Theorem 3.1 gives asymptotic upper and lower bounds to the Bayes risk of the multiple testing procedure (2.5) under study, when the global shrinkage parameter is treated as a tuning parameter, while Theorem 3.2 gives asymptotic upper bounds to the Bayes risk of the empirical Bayes procedure defined in (2.7). Proofs of Theorem 3.1 and Theorem 3.2 are based on some asymptotic bounds for the corresponding type I and type II error probabilities of the individual decisions in (2.5) and (2.7), which, in turn, depend on a set of concentration and moment inequalities. We present these inequalities and the asymptotic bounds on both kinds of error probabilities in Section 4 of this paper. Proofs of Theorem 3.1 and Theorem 3.2 are given in the Appendix.
Suppose , are i.i.d. observations having the two-groups normal mixture distribution in (1.2) with , and we wish to test the hypotheses vs , for , simultaneously, using the decision rule (2.5) induced by the one-group priors (3.5). Suppose Assumption 3.1 is satisfied by the sequence of parameters . Further assume that as such that , and is such that
and as .
Then, as , the Bayes risk of the multiple testing rules in (2.5), denoted , satisfies
for every fixed and . The terms above are not necessarily the same, tend to zero as and depend on the choice of and .
As a consequence of Theorem 3.1, for a very large class of priors covered by (I) or (II) of the said theorem, the ratio of the Bayes risk of the induced decisions in (2.5) to that of the Bayes Oracle (see (3.4) in Section 3.1) is asymptotically bounded by,
for every fixed and every fixed . That is,
For small values of and appropriately chosen and the ratios in (3.7) given above, can be made close to 1. Therefore, we see that, in sparse situations, when the global shrinkage parameter is asymptotically of the same order as that of the proportion of true alternatives the decision rules (2.5), imposed by a very broad class of tail robust one-group priors satisfying (I) or (II) of Theorem 3.1, asymptotically attain the optimal Bayes risk up to a multiplicative constant, the constant being close to 1. It may be seen that the condition (II) of Theorem 3.1 is satisfied if, in the prior on the local shrinkage parameter in (3.5), one has and is, say, uniformly bounded or . It has already been shown in Section 2 that the horseshoe prior, the Strawderman-Berger prior and members from the families of normal-exponential-gamma priors and generalized double Pareto priors with appropriate choice of , satisfy these conditions.
The theoretical results of the forthcoming sections of this paper suggest that, for the above Oracle optimality property to hold, the optimal choice of $\tau$ is such that it is asymptotically of the same order as $p$, that is, $\tau/p$ has a finite, positive limit as the number of tests grows to infinity. It will be shown further that there are other choices of $\tau$ depending on $p$ for which the desired Oracle optimality up to a multiplicative constant may no longer hold. These will be discussed later in greater detail in Section 4.2 of this paper.
The next theorem gives an asymptotic upper bound for the Bayes risk of the empirical Bayes procedure defined in (2.7) under the asymptotic framework of Bogdan et al. (2011) together with the assumption that for . As a consequence, the Oracle optimality property of the empirical Bayes procedure (2.7) follows immediately. Note that the condition , where , is very mild and covers most of the cases of theoretical and practical interest.
Suppose , are i.i.d. observations having the two-groups mixture distribution described in (1.2) with , and we wish to test the hypotheses vs , , simultaneously, using the decision rule (2.7) induced by the one-group priors (3.5). Suppose Assumption 3.1 is satisfied by with , for some . Further assume that in the prior for the local shrinkage parameter in (3.5) satisfies:
and as .
Then, the Bayes risk of the multiple testing rules in (2.7), denoted , is bounded above by,
for every fixed and , where the term above tends to zero as and depends on the choice of and .
Now using Theorem 3.2 it follows immediately that
As before, for small values of , and appropriately chosen and , the ratio of risk can be made close to 1.
Using the techniques employed for deriving asymptotic upper bounds for the type I and type II error probabilities of the empirical Bayes decisions in (2.7), one can easily show that the empirical Bayes estimate defined in (2.6) consistently estimates the unknown degree of sparsity up to some multiplicative factor. This will be made more precise in Remark 4.4. As mentioned already, the desired Bayesian optimality property presented in Theorem 3.1 holds when is asymptotically of the same order as , which seems to be an optimal choice of in case is known. This perhaps explains the good performance of our proposed empirical Bayes procedure using the estimate and gives strong theoretical support in favor of using such a plug-in estimate of .
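The plug-in estimate can be sketched in a few lines of Python. The sketch assumes that the estimate in (2.6) has the form proposed by van der Pas et al. (2014): the larger of 1/n and the fraction of observations whose absolute value exceeds the square root of c1 log n, divided by c2, with constants c1 at least 2 and c2 at least 1. The exact form and the constants should be checked against (2.6); the function name and default constants below are hypothetical.

```python
import math

def empirical_bayes_tau(observations, c1=2.0, c2=1.0):
    """Plug-in estimate of the global shrinkage parameter.

    Assumed form (after van der Pas et al., 2014):
        max{ 1/n, #{ i : |x_i| > sqrt(c1 * log n) } / (c2 * n) }.
    The lower bound 1/n keeps the estimate away from zero when no
    observation exceeds the cutoff.
    """
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    cutoff = math.sqrt(c1 * math.log(n))
    exceed = sum(1 for x in observations if abs(x) > cutoff)
    return max(1.0 / n, exceed / (c2 * n))

if __name__ == "__main__":
    # Hypothetical data: 95 pure-noise values at zero and 5 clear signals.
    data = [0.0] * 95 + [10.0] * 5
    print(empirical_bayes_tau(data))
```

In this toy data set the cutoff is about 3.03, five observations exceed it, and the estimate equals the proportion of signals, 0.05, which matches the intuition that the estimate tracks the degree of sparsity.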
3.3 A comparison with the work of Datta and Ghosh (2013)
A careful inspection of the proof of Theorem 3.4 of Datta and Ghosh (2013) reveals the following. Under Assumption 3.1, when , the Bayes risk of the decision rules (2.5) induced by the horseshoe prior, denoted , satisfies,
for every fixed and . A comparison between the upper bounds in (3.6) and (3.9) shows that our results not only generalize the theoretical finding concerning the asymptotic Bayes optimality of the horseshoe prior, but at the same time sharpen the upper bound on the Bayes risk of the induced decisions under study, for across the general class of priors given in (3.5) and satisfying conditions (I) or (II) of Theorem 3.1, including the horseshoe, in particular.
Although a few ideas employed in the proofs of this paper are similar to those in Datta and Ghosh (2013), our arguments hinge heavily upon appropriate use of properties of slowly varying functions. It will be observed later in this paper that application of well-known properties of slowly varying functions often leads to exact asymptotic orders of certain integrals, without relying mainly on algebraic upper and lower bounds that leave room for improvement. In fact, using this technique, we obtain a sharper asymptotic bound on the probability of type II errors and hence on the overall risk (in Theorem 3.1) as compared to that in Datta and Ghosh (2013). See Remark 4.2 in this context.
4 Some key inequalities and bounds on probabilities of type I and type II errors
In Section 4.1, we present some concentration and moment inequalities involving the posterior distributions of the shrinkage coefficients ’s. These inequalities are essential for deriving asymptotic bounds for probabilities of type I and type II errors of the multiple testing procedures (2.5) and (2.7) under study, presented in Section 4.2 and Section 4.3, respectively. Proofs of all these results are given in the Appendix.
4.1 Concentration and Moment Inequalities
Before presenting the theoretical results of this section, let us first briefly describe how they can be useful in studying the error probabilities of two kinds. Let and denote respectively the probabilities of type I and type II errors of the -th individual decision in (2.5). Then, by definition, and . It seems that finding the exact asymptotic orders of and is infeasible. Therefore, one convenient and fruitful approach to study their asymptotic behaviors is to find non-trivial asymptotic bounds for them. One way of accomplishing this is to obtain appropriate bounds for either of and (since can be bounded above by , for any ), followed by some judicious applications of these bounds.
The following theorem is our first step towards this and gives the first concentration inequality involving the posterior distribution of ’s. Using this theorem, one can derive an upper bound to in a very simple way in case the function in (3.5) is bounded above, as indicated in Remark 4.1 below.
Suppose independently for . Consider the one-group prior given in (3.5) and let . Then, for any fixed and any fixed ,
where the term above is independent of both the index and the data point , but depends on in such a way that .
In case the function is bounded above by some , then using Corollary 4.1 one can readily obtain the following upper bound on :
It has already been shown in Section 2.2 and Section 2.3 that for many of the commonly used shrinkage priors including the horseshoe, the corresponding is bounded above by some constant . Use of the upper bound from Theorem 4.1 makes the task of finding an upper bound for very simple in such cases. Finding an upper bound for in case of a general , as given in Theorem 4.2 below, is non-trivial and requires delicate arguments based on properties of slowly varying functions.
where the term above is independent of both the index and the data point , but depends on in such a way that . Here is a constant depending on , such that, is bounded in every compact subset of .
The next theorem gives the second concentration inequality of this paper involving the term .
Under the setup of Theorem 4.1, for any fixed , and each fixed and ,
It is to be observed in this context that, for , the upper bound in Theorem 4.3 of the present article is of a smaller order compared to that derived in Theorem 3.2 of Datta and Ghosh (2013). In particular, using properties of slowly varying functions (see the Appendix), it can be easily established that the ratio of the former to the latter tends to zero as . The sharper asymptotic bound in Theorem 4.3 results in a sharper asymptotic upper bound to the probability of type-II error, and hence on the overall risk (in Theorem 3.1) of the procedure (2.5) as compared to that in Datta and Ghosh (2013).
Several important features of the posterior distribution of the shrinkage coefficients ’s based on our general class of tail robust shrinkage priors, now become clear from Theorem 4.1 through Theorem 4.3. These are listed in Corollary 4.2 - Corollary 4.5 given below. While Corollary 4.2 and Corollary 4.3 are derived using Theorem 4.1 and Theorem 4.2, respectively, the rest follow from Theorem 4.3. Proofs of these results are trivial and hence are omitted. It should however be remembered that these corollaries have no direct use in proving the main theoretical results of this paper.
Under the assumptions of Theorem 4.1, as for any fixed uniformly in
Thus, for each fixed , the posterior distribution of ’s, based on the tail robust priors under consideration, tends to concentrate near 1 for small values of .
Under the assumptions of Theorem 4.1, as for any fixed uniformly in .
Corollary 4.3 above says that for small values of , noise observations will be squelched towards the origin by the kind of one-group priors considered in this paper.
Under the assumptions of Theorem 4.1, as for any fixed and every fixed .
Under the assumptions of Theorem 4.1, as for any fixed .
Corollary 4.5 above shows that, for each of the heavy tailed shrinkage priors under consideration, even if the global variance component is very small, the amount of posterior shrinkage will be negligibly small for large ’s, thus leaving the large observations almost unshrunk.
4.2 Asymptotic bounds on probabilities of type I and type II errors when is treated as a tuning parameter
Theorem 4.4 and Theorem 4.5 below give asymptotic upper bounds to the probability of type I error and the probability of type II error , respectively, of the -th decision in (2.5), while Theorem 4.6 and Theorem 4.7 give asymptotic lower bounds for and , respectively. As mentioned before, these results lead to the asymptotic bounds on the Bayes risk () of the multiple testing procedure in (2.5).
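The way in which bounds on the two error probabilities translate into bounds on the Bayes risk can be illustrated with a short Python sketch. It assumes the two-groups model of Bogdan et al. (2011), in which null observations are N(0, 1), non-null observations are N(0, 1 + psi^2), and each hypothesis is non-null with probability p, so that under the misclassification loss the Bayes risk of n identical decisions is n[(1 - p) t1 + p t2]. The fixed-threshold rule used below is only a stand-in that admits closed-form error probabilities; it is not the rule (2.5), and all names are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def threshold_rule_errors(c, psi2):
    """Type I and type II error probabilities of the rule 'reject if |x| > c'.

    Assumed model: x ~ N(0, 1) under the null and
    x ~ N(0, 1 + psi2) under the alternative.
    """
    t1 = 2.0 * (1.0 - std_normal_cdf(c))
    t2 = 2.0 * std_normal_cdf(c / math.sqrt(1.0 + psi2)) - 1.0
    return t1, t2

def bayes_risk(n, p, t1, t2):
    """Expected number of misclassified hypotheses among n tests."""
    return n * ((1.0 - p) * t1 + p * t2)

if __name__ == "__main__":
    n, p, psi2 = 1000, 0.05, 15.0   # hypothetical settings
    for c in (0.0, 1.0, 2.0, 3.0, 4.0):
        t1, t2 = threshold_rule_errors(c, psi2)
        print(f"c = {c:.1f}  t1 = {t1:.4f}  t2 = {t2:.4f}  "
              f"risk = {bayes_risk(n, p, t1, t2):.2f}")
```

Any upper bounds on t1 and t2 plug directly into the same linear combination, which is why the error-probability bounds of this section carry over to the risk.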