
# Nearly optimal Bayesian Shrinkage for High Dimensional Regression

Qifan Song (qfsong@purdue.edu) and Faming Liang (faliang@ufl.edu), Purdue University
###### Abstract

During the past decade, shrinkage priors have received much attention in Bayesian analysis of high-dimensional data. In this paper, we study the problem for high-dimensional linear regression models. We show that if the shrinkage prior has a heavy and flat tail and allocates a sufficiently large probability mass in a very small neighborhood of zero, then its posterior properties are as good as those of the spike-and-slab prior. While enjoying its efficiency in Bayesian computation, the shrinkage prior can lead to the same nearly optimal contraction rate and selection consistency as the spike-and-slab prior. Our numerical results show that under posterior consistency, Bayesian methods can yield much better results in variable selection than regularization methods such as Lasso and SCAD. We also establish a Bernstein-von Mises type result comparable to Castillo et al. (2015); this result leads to a convenient way to quantify uncertainties of the regression coefficient estimates, which has been beyond the ability of regularization methods.



Q. Song was supported in part by grant CTSI-208597 from Indiana Clinical Translational Sciences Institute. F. Liang was supported in part by the NSF grants DMS-1545202 and DMS-1545738, and the NIH grant R01GM117597.

MSC subject classifications: Primary 62J07; secondary 62F15.

Keywords: Bayesian Variable Selection; Absolutely Continuous Shrinkage Prior; Heavy Tail; Posterior Consistency; High Dimensional Inference.

## 1 Introduction

The dramatic improvement in data collection and acquisition technologies during the last two decades has enabled scientists to collect a great amount of high-dimensional data. Due to their intrinsic nature, many of the high-dimensional data, such as omics data and SNP data, have a much smaller sample size than their dimension (a.k.a. small-n-large-p). Toward an appropriate understanding of the system underlying the small-n-large-p data, variable selection plays a vital role. In this paper, we consider the problem of variable selection for the high-dimensional linear regression

 y=Xβ+σε, (1.1)

where y is an n-dimensional response vector, X is an n by p design matrix, β is the vector of regression coefficients, σ is the standard deviation, and ε follows N(0, In). This problem has received much attention in the recent literature. Methods have been developed from both frequentist and Bayesian perspectives. The frequentist methods are usually regularization-based, enforcing model sparsity through imposing a penalty on the negative log-likelihood function. For example, Lasso [51] employs an l1-penalty, elastic net [66] employs a combination of l1 and l2 penalties, [17] employs a smoothly clipped absolute deviation (SCAD) penalty, [62] employs a minimax concave penalty (MCP), and [49] employs a reciprocal l1-penalty. In general, these penalty functions encourage model sparsity and tend to shrink the coefficients of false predictors to exactly zero. Under appropriate conditions, consistency can be established for both variable selection and parameter estimation.
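To make the shrinkage-to-zero behavior concrete, here is a minimal coordinate-descent Lasso solver; this is a didactic sketch in which the data, penalty level, and function name are our own choices, not taken from any cited paper.

```python
# Minimal coordinate-descent Lasso (soft-thresholding), illustrating how an
# l1 penalty shrinks coefficients of false predictors to exactly zero.
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]            # partial residual excluding j
            rho = X[:, j] @ r_j / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(6)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = [3.0, -2.0, 1.5]
y = X @ beta_star + rng.normal(0, 0.5, n)

b = lasso_cd(X, y, lam=0.2)
print(np.flatnonzero(b))   # typically only the three true signals survive
```

With a moderate penalty, the soft-thresholding update sets the coefficients of noise predictors exactly to zero while keeping the strong signals (with some shrinkage bias).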

The Bayesian methods encourage sparsity of the posterior model through choosing appropriate prior distributions. A classical choice is the spike-and-slab prior, under which each coefficient is a priori drawn from a mixture of a degenerate “spike distribution” at zero and an absolutely continuous “slab distribution”, with a prior mixing proportion controlling the expected model size. Generally, it can be equivalently represented as the following hierarchical prior,

 ξ∼π(ξ),βξ∼hξ(βξ),βξc≡0, (1.2)

for some multivariate density function hξ, where ξ denotes a subset model, and βξ and βξc denote the coefficient vectors of the covariates included in and excluded from the model ξ, respectively. The theoretical properties of prior (1.2) have been thoroughly investigated [46, 32, 31, 35, 41, 48, 61, 10, 39]. Under proper choices of π(ξ) and hξ, the spike-and-slab prior achieves a (near-) optimal contraction rate and model selection consistency.

As an alternative to the hierarchical priors, some shrinkage priors have been proposed for (1.1), motivated by the equivalence between the regularization estimator and the maximum a posteriori (MAP) estimator; see e.g. the discussion in [51]. Examples of such priors include the Laplace prior [43, 28], horseshoe prior [9], structuring shrinkage prior [25], double Pareto shrinkage prior [3] and Dirichlet Laplace prior [6]. Compared to the hierarchical priors, the shrinkage prior is conceptually much simpler. The former involves specification of the priors for a large set of models, while the latter avoids this issue, as only a single model is considered. Consequently, for the hierarchical priors, a trans-dimensional MCMC sampler is required for simulating the posterior from a huge space of submodels, and this has constituted the major obstacle for the use of Bayesian methods in high-dimensional variable selection. For the shrinkage priors, since only a single model is considered in posterior simulations, some gradient-based MCMC algorithms, such as Hamiltonian Monte Carlo [15, 42] and its Riemann manifold variant [24], can potentially be used to accelerate the simulation.

Despite the popularity and potential advantages of shrinkage priors, little work has been done to study their theoretical properties. It is still unclear whether, or which types of, shrinkage priors can lead to posterior consistency. The Bayesian community has already realized that Bayesian Lasso is not a good choice. [6, 10] showed that the contraction rate for Bayesian Lasso is suboptimal, and we also find that under regularity conditions, the posterior of Bayesian Lasso is actually inconsistent (this result is presented in the supplementary document). Therefore, many other choices of shrinkage priors have been proposed [1, 2, 6, 9, 23, 25, 26]. But the theoretical investigation in these works focuses on either the case of slowly increasing dimension [2, 7, 20, 65], or the normal means models [6, 23, 54, 55]. When the dimension exceeds the sample size, the non-invertibility and the eigen-structure of the design matrix complicate the analysis, and the results derived from normal means models don’t trivially apply to regression problems. Furthermore, most of the Bayesian works for normal means models [54, 6, 11] aimed to obtain a minimax contraction rate. A recent preprint [47] shows that for the normal means problem, any estimator which asymptotically guarantees no false discovery has, at best, a suboptimal estimation rate. This frequentist assertion implies that these existing Bayesian approaches cannot consistently recover the underlying sparsity structure [also refer to Theorem 3 of [56] and Theorem 3.4 of [6]]. But under the regression framework, selection consistency is an important issue.

In this work, we lay down a theoretical foundation for Bayesian high-dimensional linear regression with shrinkage priors. Instead of focusing on certain types of shrinkage priors, we investigate sufficient conditions of posterior consistency for general shrinkage priors. We show that if the prior density has a dominating peak around zero, and a heavy and flat tail, then its theoretical properties are as good as those of the spike-and-slab prior: its contraction rate is near-optimal, selection is consistent, and the posterior exhibits a certain BvM-type phenomenon. We verify two appropriate classes of shrinkage priors for high dimensional regression models: one is the polynomially decaying prior distribution, and the other is the two-Gaussian-mixture prior [19]. Our numerical results indicate that the Bayesian method with a consistent shrinkage prior can lead to more accurate results in variable selection than regularization methods.

We would also like to mention some other related Bayesian works on high dimensional shrinkage priors. [13] used the DL prior for high dimensional factor models, but their results only allow the magnitude of the true parameters to increase very slowly with the sample size. [5] studied the prediction risk under high dimensional shrinkage priors, rather than the posterior properties of β. During the revision of this work, a thesis [59] studied shrinkage priors for high dimensional spline regression and logistic regression, following the technical tools developed in this paper.

The rest of the paper is organized as follows. Section 2 presents the main theoretical results of this paper, where we lay down the theory of posterior consistency for high-dimensional linear regression with shrinkage priors. Section 3 applies these theorems to several general classes of continuous prior distributions, and establishes the posterior consistency for several types of commonly used shrinkage priors. Section 4 discusses some important practical issues in Bayesian high dimensional shrinkage, and demonstrates the performance of shrinkage priors using a toy example. Section 5 presents simulation studies, as well as a real data application with shrinkage priors. Section 6 concludes the paper with a brief discussion. The Appendix gives the proofs of the main theorems.

## 2 Main Theoretical Results

Notation. In what follows, we rewrite the dimension of the model (1.1) as pn to indicate that the number of covariates can increase with the sample size n. We use the superscript ∗ to indicate true parameter values, e.g. β∗ and σ∗. For simplicity, we assume that the true standard deviation σ∗ is unknown but fixed, and it doesn’t change as n grows. For vectors, we let ∥⋅∥2 denote the l2-norm; let ∥⋅∥1 denote the l1-norm; let ∥⋅∥∞ denote the l∞ norm, i.e. the maximum absolute value among all entries of the vector; and let ∥⋅∥0 denote the l0 norm, i.e. the number of non-zero entries. As in (1.2), we let ξ denote a subset model, and let |ξ| denote the size of the model ξ. We let s denote the size of the true model, i.e., s=|ξ∗|. We let Xξ denote the sub-design matrix corresponding to the model ξ, and let λmax(⋅) and λmin(⋅) denote the largest and smallest eigenvalues of a square matrix, respectively. We let 1{⋅} denote the indicator function. For two positive sequences an and bn, an≺bn means an/bn→0, an≻bn means bn/an→0, and an≍bn means an=O(bn) and bn=O(an). We use ϵn to denote the Bayesian contraction rate, which satisfies ϵn→0 and nϵn2→∞.

### 2.1 Posterior Consistency

The posterior distribution for model (1.1) follows a general form:

 π(β,σ2|Dn)∝f(β,σ2;Dn)π(β,σ2),

where f(β,σ2;Dn) is the likelihood function of the observed data Dn=(X,y), and π(β,σ2) denotes the prior density of β and σ2. Consider a general shrinkage prior: σ2 is subject to an inverse-gamma prior IG(a0,b0), where a0 and b0 denote the prior hyperparameters; and conditional on σ2, β has an independent prior for each entry, with an absolutely continuous density function of the form

 π(β|σ2)=∏j[gλ(βj/σ)/σ], (2.1)

where λ is some tuning parameter(s). Then it is easy to derive

 logπ(β,σ2|Dn)=C+∑pnj=1loggλ(βj/σ)−(n/2+pn/2+a0+1)log(σ2)−(2b0+∥y−Xβ∥22)/(2σ2). (2.2)
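As a concrete sketch, the log-posterior (2.2) can be evaluated directly once gλ is specified; below we take gλ to be a Cauchy density with a small scale λ. All hyperparameter values and function names here are illustrative, not prescriptions from the paper.

```python
# Evaluate the unnormalized log-posterior (2.2) for a Cauchy shrinkage density.
import numpy as np

def log_g_cauchy(x, lam):
    """log g_lambda(x) for a Cauchy (heavy-tailed) shrinkage density."""
    return np.log(lam / np.pi) - np.log(lam ** 2 + x ** 2)

def log_posterior(beta, sigma2, X, y, a0=1.0, b0=1.0, lam=0.01):
    """Unnormalized log pi(beta, sigma2 | D_n) following (2.2), constant C dropped."""
    n, p = X.shape
    resid = y - X @ beta
    return (log_g_cauchy(beta / np.sqrt(sigma2), lam).sum()
            - (n / 2 + p / 2 + a0 + 1) * np.log(sigma2)
            - (2 * b0 + resid @ resid) / (2 * sigma2))

# tiny check: moving beta away from the data-generating value lowers (2.2)
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
beta0 = np.zeros(5)
y = X @ beta0 + 0.1 * rng.standard_normal(30)
print(log_posterior(beta0, 1.0, X, y) > log_posterior(beta0 + 1.0, 1.0, X, y))
```

Both the likelihood term and the concentrated prior peak penalize the perturbed coefficient vector, so the comparison prints True.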

The shape and scale of the pdf gλ play a crucial role for posterior consistency. Intuitively, we may decompose the parameter space into three subsets: a neighborhood set of β∗, an “overfitting” set (parameter values that fit the data at least as well as β∗ but correspond to overly complex models), and the rest. Heuristically, the likelihood for any β in the “overfitting” set is as large as that of β∗. Therefore, to drive the posterior mass toward the neighborhood set, it is sufficient to require that the prior mass of the “overfitting” set be negligible, and that the ratio of the prior mass of the neighborhood set to that of the “overfitting” set not be too tiny. In other words, the prior should 1) assign at least a minimum probability mass around β∗, and 2) assign tiny probability mass over the “overfitting” set. However, under high dimensional settings, the “overfitting” set is geometrically intractable (and it extends to infinity) due to the arbitrariness of the eigen-structure of the design matrix. Therefore, analytically, it is difficult to directly study the prior on the “overfitting” set. One possible way to control the prior on the “overfitting” set is to impose a strong prior concentration for each coefficient, such that most of the prior mass is allocated on the “less-complicated” models under a certain complexity measure. Under regular identifiability conditions, overfitting models are always complicated, hence the prior on “overfitting” models is small. But it is worth noting that overfitting models are a subset of all complicated models, and strong prior concentration is only a sufficient condition. When the geometry of the overfitting set is easier to handle, e.g. under fixed or slowly increasing dimension, or under normal means models, the overfitting set is still a neighborhood set of β∗ (potentially annulus-shaped); then it is unnecessary to require a strong prior concentration, and one only needs to impose conditions on the local shape of the prior around β∗; see [20, 12, 56]. This is also the key difference between high dimensional models and slowly increasing models/normal means models.

Before rigorously studying the properties of the posterior distribution, we first state some regularity conditions on the eigen-structure of the design matrix X:

1. A1: All the covariates are uniformly bounded. For simplicity, we assume they are all bounded by 1.

2. A2: The dimensionality is high: pn>n.

3. A3: There exist an integer p̄ (depending on n and pn) and a fixed constant λ0, such that p̄≻s, and λmin(XTξXξ/n)≥λ0 for any subset model ξ with |ξ|≤p̄.

Remarks: A1 implies that the column norms of X are bounded by √n. A3 has often been used in the literature to overcome the non-identifiability of β, see e.g., [41, 62, 49], and this condition is equivalent to the lower bounded compatibility number condition used in [10]. In general, p̄ should be much smaller than n. For example, for an n×n design matrix with all entries iid distributed, the Marchenko-Pastur law states that the empirical distribution of the eigenvalues of the corresponding sample covariance matrix converges to the Marchenko-Pastur distribution, whose support touches zero; hence the smallest eigenvalue is not bounded away from zero when the model size is comparable to n. The random matrix theory typically allows p̄ to be of order n/log pn with high probability when the rows of X are independent isotropic sub-Gaussian random vectors (refer to Lemma 6.1 of [41] and Theorem 5.39 of [58]).
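The sparse-eigenvalue condition above can be checked numerically for a given design; the sketch below does so for a random Gaussian design over random submodels. The sizes and the 0.2 cutoff are illustrative choices, not values from the paper.

```python
# Check that lambda_min(X_xi^T X_xi / n) stays bounded away from zero over
# submodels xi of moderate size, for a random Gaussian design.
import numpy as np

rng = np.random.default_rng(0)
n, p, size = 200, 500, 10
X = rng.standard_normal((n, p))

min_eigs = []
for _ in range(50):
    xi = rng.choice(p, size=size, replace=False)   # a random submodel
    G = X[:, xi].T @ X[:, xi] / n                  # Gram matrix of the submodel
    min_eigs.append(np.linalg.eigvalsh(G)[0])      # its smallest eigenvalue
print(min(min_eigs) > 0.2)
```

For size/n this small, random matrix theory predicts the smallest eigenvalue concentrates near (1 - sqrt(size/n))^2 ≈ 0.6, so the check passes comfortably.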

The next set of assumptions concerns the sparsity of β∗, and the magnitude of the nonzero entries of β∗.

1. B1: s≤γn/log pn, where s is the size of the true model, and γ is some fixed constant.

2. B2: ∥β∗∥∞≤γ′En for some fixed γ′∈(0,1), and En is nondecreasing with respect to n.

Remarks: Condition B1 is regularly used in the high dimensional literature; it restricts the growth of the sparsity to be of order O(n/log pn). Condition B2 essentially constrains the asymptotic growth of the nonzero true coefficients, such that their magnitudes are bounded by a multiple of En.

In the next theorem, we provide sufficient conditions for the posterior consistency. From now on, we let the contraction rate be ϵn=M√(s log pn/n) for some fixed constant M.

###### Theorem 2.1 (Posterior Consistency).

Consider the linear regression model (1.1) with the design matrix X and true coefficients β∗ satisfying conditions A1-A3 and B1-B2. The σ2 is subject to an inverse-Gamma prior IG(a0,b0), and the prior of β follows (2.1). If gλ satisfies

 1−∫an−angλ(x)dx≤p−(1+u)n,−log(infx∈[−En,En]gλ(x))=O(logpn), (2.3)

where u>0 is a constant, an is a sequence tending to zero sufficiently fast, and the constant M in ϵn is large enough, then the posterior consistency holds as follows,

 P∗(π(∥β−β∗∥2≥c1σ∗ϵn|Dn)≥e−c2nϵ2n)≤e−c3nϵ2n,P∗(π(∥β−β∗∥1≥c1σ∗√sϵn|Dn)≥e−c2nϵ2n)≤e−c3nϵ2n, (2.4)

for some positive constants c1, c2 and c3.

The proof of this theorem is given in the Appendix (Theorem A.1).

The results (2.4) imply that π(∥β−β∗∥2≥c1σ∗ϵn|Dn)→0 and π(∥β−β∗∥1≥c1σ∗√sϵn|Dn)→0 in probability. These show that the l2 and l1 contraction rates for the posterior of β are ϵn and √sϵn, respectively. These contraction rates are near-optimal (the minimax rate is √(s log(pn/s)/n) [44]), and no worse than the convergence rates achieved by spike-and-slab Bayesian approaches [10]. In other words, there is no performance loss due to Bayesian shrinkage.

The conditions (2.3) in the above theorem are consistent with our heuristic arguments in the previous paragraphs. The first inequality of (2.3) concerns the prior concentration; it requires that the prior density of βj/σ have a dominating peak inside a tiny interval [−an,an]. Such a steep prior peak plays the same role as the “spike” in spike-and-slab prior modeling. In the literature, [10] assigned a spike prior that down-weights large models exponentially (their equation (2.2)), [41] employed an SSVS-type prior [19] with a vanishingly small prior inclusion probability, and [61] assigned a similarly concentrated spike prior. These prior specifications are comparable to our condition on the prior mass of [−an,an]. Note that although [41] and [61] seem to require less prior concentration, they both impose an additional prior condition to bound the model size. The second inequality of (2.3) essentially requires that the prior density around the true nonzero regression coefficients be at least polynomially small in pn, i.e. gλ(x)≥p−cn on [−En,En] for some positive constant c. Similar conditions, which require that the prior is “thick” around the true parameter value, are regularly used in the Bayesian literature [31, 33, 21, 22].

We also want to mention that this prior concentration condition is only sufficient; in practice, a moderate degree of concentration can also lead to satisfactory results.
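As an illustration of condition (2.3), the following sketch checks both inequalities for a Cauchy-type gλ with a very small scale. The values of pn, u, an, En are illustrative choices, not recommendations from the paper.

```python
# Numerical sanity check of condition (2.3) for a Cauchy shrinkage density.
import numpy as np

p_n, u = 1000, 1.0
a_n, E_n = 0.01, 5.0
lam = a_n / p_n ** (1 + u)        # a tiny scale gives strong concentration

# Cauchy(scale=lam): 1 - int_{-a_n}^{a_n} g = 1 - (2/pi) arctan(a_n/lam)
spike_deficit = 1 - (2 / np.pi) * np.arctan(a_n / lam)
print(spike_deficit <= p_n ** (-(1 + u)))        # first part of (2.3)

# heavy tail: -log g(E_n) grows only like a constant multiple of log p_n
log_g_En = np.log(lam / np.pi) - np.log(lam ** 2 + E_n ** 2)
print(-log_g_En / np.log(p_n))                   # moderate, i.e., O(log p_n)
```

The polynomial tail is what makes both parts hold at once: the mass outside [−an, an] shrinks linearly in the scale, while the log-density at En decays only logarithmically in the scale.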

A similar theorem can be derived for the fitting error ∥Xβ−Xβ∗∥2.

###### Theorem 2.2.

If the conditions of Theorem 2.1 hold, then

 P∗(π(∥Xβ−Xβ∗∥≥c1σ∗√nϵn|Dn)≥e−c2nϵ2n)≤e−c3nϵ2n, (2.5)

for some positive constants c1, c2 and c3.

To conclude this subsection, we claim that under proper prior specifications, the shrinkage prior can lead to almost the same posterior accuracy as the spike-and-slab prior.

### 2.2 Variable Selection Consistency

In this subsection, we perform a theoretical study on how to retrieve the sparsity structure of β∗ using shrinkage priors. In order to do so, it is necessary to “sparsify” the continuous posterior distribution induced by the continuous prior. In the literature, this is usually done by 1) hard (or adaptive) thresholding on the coefficients or on the shrinkage weights [34, 50, 9, 30], or 2) decoupling shrinkage and selection methods [27, 60]. Note that the second class of approaches intends to incorporate the dependencies between covariates into the sparse posterior summary. All these approaches depend solely on the posterior magnitude, and take no account of the degree of prior concentration.

We propose to use a prior-dependent hard thresholding rule βj1{|βj|>t} for some thresholding value t. This induces a sparse pseudo posterior, which can thereafter be used to assess the model uncertainty and conduct variable selection, as if it were induced by a spike-and-slab prior. The correlation structure of the pseudo posterior also reflects the dependence structure among the covariates.

First of all, the convergence result of Theorem 2.1 trivially implies that the posterior of each |βj−β∗j| is of order at most c1σ∗ϵn. Therefore, if the threshold t is of larger order than ϵn, and the minimal nonzero |β∗j| is of larger order than t, then the thresholded posterior can consistently select the true model. However, one potential issue of using such a threshold is that it greatly alters the theoretical characteristics of the posterior, in the sense that the contraction rate of the thresholded posterior can be as large as t, rather than ϵn.

Therefore, we consider another choice of threshold. As discussed previously, (2.3) implies a “spike” on [−an, an] for the prior of βj/σ, which plays the same role as the Dirac measure in the spike-and-slab prior. Hence, from the view of prior specification, an distinguishes zero and nonzero coefficients, and it is natural to threshold at an. The posterior thus implies the selection rule ξ(β,σ2)={j:|βj/σ|>an}. The following theorem establishes its consistency.
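The selection rule can be applied directly to posterior draws; the sketch below uses synthetic draws as stand-ins for actual MCMC output, and the signal sizes and the value of an are illustrative.

```python
# Apply the selection rule xi(beta, sigma^2) = {j : |beta_j/sigma| > a_n}
# draw-by-draw, then summarize the induced posterior over models.
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
a_n = 0.05
# synthetic posterior draws: 3 strong signals, 7 near-zero coefficients
draws_beta = np.concatenate([
    rng.normal([2.0, -1.5, 1.0], 0.05, size=(500, 3)),
    rng.normal(0.0, 0.005, size=(500, 7)),
], axis=1)
draws_sigma = rng.uniform(0.9, 1.1, size=500)

models = Counter(
    tuple(np.flatnonzero(np.abs(b / s) > a_n))
    for b, s in zip(draws_beta, draws_sigma)
)
best, count = models.most_common(1)[0]
print(best, count / 500)   # posterior mode model and its probability
```

Because the near-zero draws sit many standard deviations below the threshold, essentially every draw maps to the true model {0, 1, 2}.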

###### Theorem 2.3.

(Variable selection consistency) Suppose that the conditions of Theorem 2.1 hold with a sufficiently small an and a sufficiently large u. Let ln be a measure of the flatness of the function gλ,

 ln=maxj∈ξ∗supx1,x2∈β∗j/σ∗±c0ϵngλ(x1)/gλ(x2),

where c0 is some large constant. If ln is bounded, with sufficiently large u and c0, then

 P∗{π[ξ(β,σ2)=ξ∗|Dn]>1−p−c3n}>1−p−c4n, (2.6)

for some positive constants c3 and c4.

This theorem is a simple corollary of Theorem A.3 in the appendix. It requires a smaller value of an and a larger u, i.e. a narrower and more concentrated prior peak, compared with Theorem 2.1. Besides the prior concentration and tail thickness, the condition also requires tail flatness, such that the prior density around the true values is not rugged. This flatness facilitates the analytic study of the posterior of ξ(β,σ2). Generally speaking, for smooth gλ, the flatness measure ln is governed by the magnitude of the derivative g′λ in neighborhoods of the true values, where g′λ is the first derivative of gλ. In the extreme situation, we can utilize an exactly flat tail, such that ln=1. An example is a prior whose negative log-density has the shape of a non-concave penalty function such as SCAD. If g′λ is not exactly 0 in the tail, then the boundedness of ln also imposes an additional constraint on the sparsity other than B1. Other discussion on ln can be found in Section 3.

The result of this theorem also implies a stronger posterior contraction for the false covariates: their posterior magnitudes are bounded by the thresholding value σan with probability tending to one.

### 2.3 Shape Approximation of the Posterior Distribution

Another important aspect of Bayesian asymptotics is the shape of the posterior distribution. The general theory on the posterior shape is the Bernstein-von Mises (BvM) theorem. It claims that the posterior distribution of the parameter in a regular finite dimensional model is approximately a normal distribution as n→∞, i.e.,

 ||π(⋅|Dn)−N(⋅;^θMLE,(n^I)−1)||TV→0, (2.7)

regardless of the choice of the prior, where π(⋅|Dn) is the posterior distribution given data Dn, N(⋅;μ,Σ) denotes the (multivariate) normal distribution, ^θMLE stands for the maximum likelihood estimator, ^I is Fisher’s information matrix, and ∥⋅∥TV denotes the total variation difference between two measures. The BvM theorem provides an important link between the frequentist limiting distribution and the posterior distribution, and it can be viewed as a frequentist justification for Bayesian credible intervals. To be specific, the Bayesian credible intervals are asymptotically equivalent to the Wald confidence intervals, and also have the long-run relative frequency interpretation.

Usually, the BvM theorem holds for fixed dimensional problems. For linear regression (with known σ2), the posterior normality always holds under an (improper) uniform prior, as π(β|Dn)=N(^β,σ2(XTX)−1), as long as p≤n and the matrix X is of full rank.
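This fixed-dimensional statement is exact under the flat prior and easy to verify numerically; the following is a self-contained sketch with simulated data.

```python
# With a flat prior on beta and known sigma^2, the posterior is exactly
# N(beta_hat, sigma^2 (X^T X)^{-1}); beta_hat is the MLE.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 50, 3, 1.0
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + rng.normal(0.0, np.sqrt(sigma2), n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y      # MLE; also the posterior mean under a flat prior
cov = sigma2 * XtX_inv            # exact posterior covariance

print(np.allclose(X.T @ (y - X @ beta_hat), 0.0, atol=1e-8))  # normal equations
print(np.all(np.linalg.eigvalsh(cov) > 0))                    # valid covariance
```

The normal-equations check confirms that beta_hat is the posterior mode/mean, and the positive-definiteness check confirms the Gaussian posterior is well defined when X has full rank.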

Under pn≫n, to investigate the posterior shape intuitively, we note that the posteriors of all false coefficients are bounded by σan by Theorem 2.3. Combining this with the fact that the likelihood is insensitive to β(ξ∗)c when its magnitude is sufficiently small, we have that

 π(β|Dn)∝L(βξ∗,β(ξ∗)c|X,y)π(βξ∗,β(ξ∗)c)≈L(βξ∗;Xξ∗,y)π(βξ∗)π(β(ξ∗)c).

If π(βξ∗) is sufficiently flat around β∗ξ∗ and acts like a uniform prior, then the low dimensional term L(βξ∗;Xξ∗,y)π(βξ∗) admits a normal BvM approximation. More rigorously, we have the next theorem.

###### Theorem 2.4.

(Shape Approximation) Assume the conditions of Theorem 2.3 hold, with an further decreased and the flatness measure ln sufficiently close to 1. Let ^βξ∗ and ^σ2 denote the MLEs of βξ∗ and σ2 under the model ξ∗; then with dominating probability,

 π(β,σ2|Dn) converges in total variation to ϕ(βξ∗;^βξ∗,σ2(XTξ∗Xξ∗)−1)∏j∉ξ∗π(βj|σ2)ig(σ2;(n−s)/2,^σ2(n−s)/2), (2.8)

where ϕ(⋅;μ,Σ) is the density of the multivariate normal distribution, ig(⋅;a,b) is the density of the inverse-gamma distribution, and π(βj|σ2) is the conditional prior distribution of βj. ^βξ∗ and ^σ2 are the MLEs of βξ∗ and σ2 given data (Xξ∗,y).

Refer to Theorem A.4 for the proof of this theorem. Its condition is slightly stronger than that of Theorem 2.3; it requires that an be smaller and that the prior density be almost constant around the true values of the nonzero coefficients. The following corollary can be easily derived from the above theorem.

###### Corollary 2.1.

Under the conditions of Theorem 2.4, for any j∈ξ∗, the marginal posterior of βj converges to a normal distribution whose mean is the jth entry of ^βξ∗ and whose variance is the corresponding diagonal entry of ^σ2(XTξ∗Xξ∗)−1. Furthermore, under an additional sparsity condition, the posterior of βξ∗ converges in total variation to N(^βξ∗,^σ2(XTξ∗Xξ∗)−1) with probability approaching 1. In other words, the BvM theorem holds for the parameter component βξ∗.

Theorem 2.4 is comparable to the BvM result developed under the spike-and-slab prior [10]. Under a spike-and-slab prior, the posterior density of β can be rewritten as a mixture,

 π(β|Dn)=∑ξ⊂{1,…,p}π(ξ|Dn)π(βξ|Xξ,y)1{βξc=0}, (2.9)

where π(ξ|Dn) is the posterior probability of model ξ, and π(βξ|Xξ,y) is the conditional posterior under the slab prior hξ defined in (1.2). If π(ξ∗|Dn)→1, the mixture converges to π(βξ∗|Xξ∗,y)1{β(ξ∗)c=0}. Furthermore, if hξ∗ is sufficiently flat, and BvM holds for the low-dimensional term π(βξ∗|Xξ∗,y), then it leads to a posterior normal approximation as

 π(β|Dn)≈N(βξ∗;^βξ∗,(XTξ∗Xξ∗)−1)⊗δ0(β(ξ∗)c), (2.10)

where ⊗ denotes the product of measures.

Theorem 2.4 and Corollary 2.1 extend the BvM type results from the spike-and-slab prior to shrinkage prior specifications. They show that the marginal posterior distribution for the true covariates follows the BvM theorem as if under a low dimensional setting, while the marginal posterior for the false covariates can be approximated by their prior distribution. Note that the prior distribution is already highly concentrated; hence the posterior of a false covariate being almost the same as its prior doesn’t contradict our contraction results. Note that a Bayesian procedure can be viewed as a process of updating the probabilistic knowledge of parameters. The concentrated prior distribution reflects our prior belief that almost all the predictors are inactive, and (2.8) can be interpreted as follows: the Bayesian procedure correctly identifies the true model and updates the distribution of βξ∗ using the data, but it obtains no evidence supporting βj≠0 for any j∉ξ∗ and thus doesn’t update their concentrated prior distributions.

Let CIi(α) denote the posterior 1−α credible interval of the ith coefficient, formed by the α/2 and 1−α/2 posterior quantiles. If gλ is a symmetric distribution, then Corollary 2.1 implies that

 limP∗(β∗i∈CIi(α))=1−α,if i∈ξ∗ and limP∗(0∈CIi(α))=1,if i∉ξ∗, (2.11)

for any 0<α<1. This result implies that for the false covariates, the Bayesian credible interval is super-efficient: asymptotically, it can be very narrow (as the prior is highly concentrated), but always has 100% probability coverage. This is much different from the confidence interval.
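The super-efficiency phenomenon in (2.11) is easy to visualize with equal-tailed credible intervals computed from posterior draws; the draws below are synthetic stand-ins for MCMC output, with illustrative spreads.

```python
# Equal-tailed credible intervals: wide and calibrated for an active
# coefficient, tiny but still covering 0 for an inactive one.
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.05
draws_true = rng.normal(2.0, 0.1, size=2000)      # active coefficient
draws_false = rng.normal(0.0, 1e-4, size=2000)    # inactive, prior-dominated

ci_true = np.quantile(draws_true, [alpha / 2, 1 - alpha / 2])
ci_false = np.quantile(draws_false, [alpha / 2, 1 - alpha / 2])

print(ci_true[0] < 2.0 < ci_true[1])              # covers the true value
print(ci_false[0] < 0.0 < ci_false[1])            # covers 0 ...
print(ci_false[1] - ci_false[0] < 1e-3)           # ... and is super-narrow
```

The interval for the inactive coefficient inherits the width of the concentrated prior, which is exactly the super-efficiency described above.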

It is important to note that Theorem 2.4, and its counterpart (2.10) under the spike-and-slab prior, both rely on selection consistency (and a beta-min condition), and thus deliver post-selection Bayesian inference. Therefore, the frequentist coverage of the Bayesian credible interval (first equation of (2.11)) does not hold uniformly for all nonzero values, but only for those bounded away from 0. If the beta-min condition is violated, one can rewrite the posterior under a shrinkage prior as a mixture distribution similar to (2.9); hence, the corresponding posterior inference will be model-average-based.

These asymptotic studies are completely different from frequentist sampling-distribution-based inference tools such as the debiased Lasso [53, 64]. The debiased Lasso method established asymptotic normality as

 √n(^β−β∗)→dN(0,σ∗2SXTXST/n), (2.12)

for any β∗, even when its entries are arbitrarily close to zero, where S is some surrogate inverse matrix of the sample covariance. Different from our posterior consistency result, the asymptotic distribution on the right hand side of (2.12) is a divergent distribution when pn diverges. Our coverage result (2.11) is also different from the Bayesian results delivered in [56], which don’t depend on selection consistency.

We conjecture that if consistent point estimation and inference via credible intervals are made simultaneously, the credible intervals will be super-efficient for the false covariates due to the sparsity constraints (i.e. the prior distribution) imposed on the regression coefficients. These constraints ensure the consistency of the posterior distribution, and thus reduce the variability of the coefficients of false covariates. Based on this understanding, in the framework of consistent high dimensional Bayesian analysis, it seems that a separate post-selection inference procedure (without sparsity constraints) is necessary to induce correct second-order inference. For example, it can be done in a sequential manner (see [37] and [52] for related ideas): attempting to add each of the unselected variables to the selected model, and calculating the corresponding credible interval for the unselected variable.

## 3 Consistent Shrinkage Priors

In the previous section, we established some general theory for shrinkage priors based on abstract conditions. In this section, we apply these results to several particular types of shrinkage priors, and study their corresponding posterior asymptotics.

The condition (2.3) requires a certain balance between prior concentration and tail thickness. First of all, it is easy to see that the Laplace prior fails to satisfy condition (2.3) unless the scale parameter is extremely small and the true coefficients are of correspondingly tiny magnitude. Therefore, we first consider prior specifications that have heavier tails than the exponential distribution.
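A quick calculation shows the tension for the Laplace prior: once its scale is small enough to satisfy the spike part of (2.3), the density at a moderate En becomes super-polynomially small. The numerical values below are illustrative.

```python
# For Laplace(scale=lam), the mass outside [-a_n, a_n] is exp(-a_n/lam),
# so the spike condition forces lam <= a_n / ((1+u) log p_n); we then check
# how fast -log g(E_n) grows relative to log p_n.
import numpy as np

p_n, u, a_n, E_n = 1000, 1.0, 0.01, 5.0

lam = a_n / ((1 + u) * np.log(p_n))           # largest scale meeting the spike condition
tail = np.exp(-a_n / lam)                     # equals p_n^{-(1+u)}
neg_log_density_ratio = (E_n / lam) / np.log(p_n)   # ~ -log g(E_n) / log p_n

print(tail <= p_n ** (-(1 + u)) * 1.001)      # spike condition just holds ...
print(neg_log_density_ratio)                  # ... but this is huge, not O(1)
```

The ratio comes out near En(1+u)/an (here about 1000), so the second part of (2.3) fails by a wide margin unless the true coefficients shrink toward the order of an.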

### 3.1 Polynomial-tailed Distributions

We assume that the prior density of βj/σ has the form gλ(x)=g(x/λ)/λ, where λ is a scaling hyperparameter, and the density g is symmetric and polynomial-tailed, i.e. g(x)≍x−(1+r) as x→∞ for some positive r. Under the above prior choice, we adapt Theorem 2.1 as follows:

###### Theorem 3.1.

Assume conditions A1-A3 and B1-B2 hold for the linear regression model, and a polynomial-tailed prior is used. If En grows at most polynomially in n, and the scaling parameter λ decreases polynomially in pn at an appropriate rate, then

• the posterior consistency (2.4) holds when λ is sufficiently small;

• the model selection consistency (2.6) holds when λ is further decreased and the sparsity s satisfies an additional bound;

• the posterior approximation (2.8) holds when λ is decreased yet further and s satisfies a stronger bound.

Note that most commonly used polynomially decaying distributions satisfy g(x)=cx−(1+r)L(x), where L(x)→1 with the rate

 |L(x)−1|=O(x−t),for some t≥0, (3.1)

Then it is easy to see that if λ decreases polynomially in pn at a fast enough rate, the conditions of Theorem 3.1 are satisfied. Therefore, Theorem 3.1 can be refined as

###### Corollary 3.1.

Consider polynomial-tailed prior distributions satisfying (3.1). Assume conditions A1-A3 and B1-B2 hold, and En grows at most polynomially in n. Let the scaling parameter λ decrease polynomially in pn; then

• If the polynomial rate of decrease of λ is sufficiently large, then posterior consistency holds with a nearly optimal contraction rate;

• If the rate is further enlarged and s satisfies an additional bound, then the variable selection consistency holds;

• If the rate is enlarged again and s satisfies a stronger bound, then the posterior shape approximation holds.

Theorem 3.1 and Corollary 3.1 show that a nearly optimal contraction rate can be achieved by polynomial-tailed prior specifications for high dimensional linear regression with an appropriate value of λ. As suggested by Corollary 3.1, it is sufficient to choose the scale parameter λ to decrease polynomially in pn. Compared with the common choices under normal means models [54, 23], we comment that a stronger prior concentration is required for regression models. Our results allow the maximum magnitude of the nonzero coefficients to increase up to a polynomial of n. In contrast, the DL prior only allows it to increase with a logarithmic order [6]. It is worth noting that the boundedness condition on the true coefficients is not always necessary when a polynomially decaying prior is used. In particular, under normal means models, one can show that such a condition is redundant [54, 23]. But under general regression settings, such a condition may be necessary due to the dependencies among covariates. One should also notice that the selection consistency and posterior normality results require an additional condition on the true sparsity s: if s grows polynomially in n, selection consistency (and posterior normality) requires a correspondingly stronger bound on its growth. The reason we need this unpleasant condition is that the polynomially decaying prior modeling utilizes only one scaling hyperparameter. Although this simplifies the modeling part, we lose control over the shape, or tail flatness, of the prior distribution. If we utilize both scaling and shape hyperparameters in prior modeling, the conditions can be improved, as seen in Section 3.2.

For the convenience of posterior sampling, one way to construct a polynomially decaying gλ is to design a hierarchical scale mixture of Gaussian distributions as

 βj∼N(0,λ2jσ2),λ2j∼πsn(λ2j),independently for all j, (3.2)

where sn is the scaling hyperparameter of the mixing distribution πsn, i.e., πsn(λ2)=π(λ2/sn)/sn. Equivalently, √sn is the scaling parameter for the marginal prior of βj/σ. The scale mixture of Gaussian distributions can also be viewed as a local-global shrinkage prior, where λj is the local shrinkage parameter, and sn is a deterministic global shrinkage parameter. As shown in the next lemma, the tail behavior of the marginal distribution of βj is determined by the tail behavior of the mixing distribution.
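The local-global representation (3.2) is easy to sample; the sketch below uses an inverse-gamma mixing distribution, which yields a Student-t marginal with a polynomial tail. The hyperparameter values are illustrative.

```python
# Sample from the scale-mixture prior (3.2): beta_j | lambda_j^2 ~ N(0, lambda_j^2 * sigma^2),
# with an inverse-gamma(1, s_n) mixing distribution for lambda_j^2.
import numpy as np

rng = np.random.default_rng(3)
s_n, sigma2 = 0.01, 1.0
m = 100_000

lam2 = s_n / rng.gamma(shape=1.0, scale=1.0, size=m)   # inverse-gamma local scales
beta = rng.normal(0.0, np.sqrt(lam2 * sigma2))         # marginal: a scaled t_2

print(np.mean(np.abs(beta) < 0.1))    # sizeable mass concentrated near zero ...
print(np.abs(beta).max() > 1.0)       # ... yet the polynomial tail yields large draws
```

A purely Gaussian prior with the same small scale would essentially never produce draws of magnitude above 1; the polynomial tail of the mixture is what preserves the large true signals.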

###### Lemma 3.1.

If the mixing distribution π is polynomial-tailed and satisfies a condition of the form (3.1), then the marginal prior distribution of βj/σ induced by (3.2) is polynomial-tailed with the corresponding order and also satisfies a condition of the form (3.1).

The proof of this lemma is trivial and hence omitted in this paper.

Combining the above lemma and Corollary 3.1, it is sufficient to assign a polynomial-tailed mixing distribution and properly choose the scale parameter sn such that it is decreasing and satisfies the conditions in Corollary 3.1. [23] also studied the scale mixture Gaussian prior (3.2) for normal means models and achieved rate-minimax contraction. But their results only apply to the case where the polynomial order of the mixing distribution is between 1.5 and 2, while our result is more general and valid for any polynomial order.

In what follows, we list some examples of polynomially decaying prior distributions that can be represented as scale mixtures of Gaussians; all of these priors satisfy condition (3.1):

• Student’s $t$-distribution, for which the mixing distribution of $\lambda_j^2$ is an inverse gamma distribution.

• Normal-exponential-gamma (NEG) distribution [26], for which the mixing distribution is an exponential distribution whose rate parameter follows a gamma distribution.

• Generalized double Pareto distribution [2], for which the mixing distribution can be represented as a gamma mixture of exponential distributions.

• Generalized Beta mixture of Gaussian distributions [1], for which the mixing distribution is an inverted Beta distribution. Note that the horseshoe prior is a special case of the generalized Beta mixture of Gaussian distributions with both inverted-Beta shape parameters equal to 1/2.
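The heavy-tail behavior behind these examples is easy to check by simulation. The sketch below (inverse-gamma mixing as in the Student's $t$ example; the threshold is an arbitrary illustrative choice) compares the tail mass of scale-mixture draws with that of a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Inverse-gamma mixing on lambda^2 (shape 1) makes the marginal of beta
# a Student-t-type distribution with a polynomial tail.
lam2 = 1.0 / rng.gamma(shape=1.0, scale=1.0, size=n)
beta_heavy = rng.normal(0.0, np.sqrt(lam2))
beta_gauss = rng.normal(0.0, 1.0, size=n)

# A polynomial tail puts vastly more mass beyond a large threshold.
t = 10.0
p_heavy = np.mean(np.abs(beta_heavy) > t)   # on the order of 1e-2
p_gauss = np.mean(np.abs(beta_gauss) > t)   # essentially 0
```

The exponentially decaying Gaussian tail yields virtually no exceedances at this threshold, while the scale mixture retains appreciable tail mass.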

In addition, Theorem 3.1 implies a simple way to remedy the inconsistency of the Bayesian Lasso by imposing a heavy-tailed prior on the hyperparameter: model $\beta_j$ as double exponential with a scale hyperparameter whose mixing distribution has a polynomial tail.
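The effect of such a heavy-tailed hyperprior can be seen in a quick simulation. The sketch below uses an inverse-gamma-distributed Laplace scale as one assumed, illustrative polynomially-tailed choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Plain Bayesian Lasso: beta_j ~ DoubleExp with a fixed scale.
beta_lasso = rng.laplace(0.0, 1.0, size=n)

# Remedied version: the Laplace scale gets a polynomially-tailed
# hyperprior (inverse-gamma here, for illustration), which thickens
# the marginal tail of beta_j.
b = 1.0 / rng.gamma(shape=1.0, scale=1.0, size=n)
beta_heavy = rng.laplace(0.0, b, size=n)

t = 20.0
p_lasso = np.mean(np.abs(beta_lasso) > t)   # ~ exp(-20), essentially 0
p_heavy = np.mean(np.abs(beta_heavy) > t)
```

With a fixed scale, the Laplace tail decays exponentially and over-shrinks large true coefficients; the heavy-tailed hyperprior restores a polynomial marginal tail.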

In the above analysis, we choose the scale parameters to be deterministically decreasing; hence, in practice, certain tuning procedures are recommended as described in Section 4. Such hyperparameter tuning occurs in most Bayesian procedures under spike-and-slab modeling as well. In the literature, an adaptive Bayesian alternative is to assign a hyper-prior to the scale parameter. [55] studied the horseshoe prior for normal means models and showed that posterior consistency is retained if the global scale is subject to a suitably truncated hyper-prior. However, results derived for normal means models may not be trivially applicable to regression models: the two settings differ in the norm of the columns of the design matrix, and the result of [55] suggests truncating the prior of the global scale on a suitably chosen interval for regression models. A toy example (Figure 4) shows that such truncation still leads to many false discoveries. Besides, our posterior shape approximation result relies on the prior being independent across the $\beta_j$'s conditional on $\sigma^2$; if a hyper-prior on the global scale is used, this conditional independence no longer holds and the BvM result (2.8) fails.

### 3.2 Two-Component Mixture Gaussian Distributions

Another prior that has been widely used in Bayesian linear regression analysis is the two-component mixture Gaussian distribution; see, e.g., [19] and [41]:

$$\beta_j/\sigma \sim (1-\xi_j)\,N(0, \sigma_0^2) + \xi_j\,N(0, \sigma_1^2), \qquad \xi_j \sim \text{Bernoulli}(m_1). \tag{3.3}$$

The component $N(0, \sigma_0^2)$ has a very small $\sigma_0$ and can be viewed as an approximation to the point mass at 0. In the literature, interest in this prior has focused only on the consistency of variable selection. Here, we treat it as an absolutely continuous prior and study the posterior properties of $\beta$ in the next theorem:
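For concreteness, drawing from (3.3) can be sketched as follows (Python; the function name and hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_two_component(p, sigma0, sigma1, m1, sigma=1.0):
    """Draw beta_j/sigma from (1 - xi_j) N(0, sigma0^2) + xi_j N(0, sigma1^2)
    with xi_j ~ Bernoulli(m1), as in prior (3.3)."""
    xi = rng.binomial(1, m1, size=p)           # slab indicators
    sd = np.where(xi == 1, sigma1, sigma0)     # per-coordinate sd
    return sigma * rng.normal(0.0, sd), xi

beta, xi = draw_two_component(p=1000, sigma0=1e-3, sigma1=10.0, m1=0.05)
```

The spike component with tiny $\sigma_0$ pins most coordinates near zero, approximating the point mass, while the slab component generates the few large coefficients.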

###### Theorem 3.2.

Suppose that the two-component mixture Gaussian prior (3.3) is used for the high-dimensional linear regression model (1.1), and that the following conditions hold: condition , condition , , and for some . Then

• the posterior consistency (2.4) holds when ;

• the model selection consistency (2.6) holds when , , for sufficiently large and ;

• the posterior approximation (2.8) holds when , , for sufficiently large and .

The two-component mixture Gaussian distribution has three hyperparameters, $\sigma_0$, $\sigma_1$, and $m_1$, hence we have more control over the prior shape than with the polynomially decaying priors, and the theoretical properties improve slightly. Specifically, Theorem 3.2 allows us to choose the hyperparameters so that no additional condition on the upper bound of the sparsity is required.

## 4 Bayesian Computation and an Illustration Example

In this section, we will first discuss some important practical issues, including posterior computation, model selection, and hyperparameter tuning. Then we use some toy examples to illustrate the practical performance of shrinkage priors. For convenience, in what follows we refer to a Bayesian method using a shrinkage prior whose consistency is guaranteed by Theorem 2.1 as a Bayesian consistent shrinkage (BCS) method. In particular, we use the Student's $t$ prior, as an example of shrinkage priors, and compare it with the Laplace prior.

The scale mixture Gaussian priors (3.2), under a proper hierarchical representation, usually admit conjugate Gibbs updates. For example, for the Student's $t$ prior, the posterior distribution can be updated in the following way:

$$\sigma^2 \mid \beta, \lambda_1, \ldots, \lambda_{p_n} \sim \mathrm{IG}\left(a_0 + \frac{n + p_n}{2},\; b_0 + \frac{\|y - X\beta\|^2 + \sum_j \beta_j^2/\lambda_j^2}{2}\right)$$
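A minimal sketch of this conjugate $\sigma^2$ update in Python (the hyperparameters $a_0$, $b_0$ and the synthetic data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_sigma2(y, X, beta, lam2, a0=1.0, b0=1.0):
    """One conjugate Gibbs draw of sigma^2 given beta and the local scales
    lambda_j^2, under beta_j | sigma^2, lambda_j^2 ~ N(0, lambda_j^2 sigma^2):
    sigma^2 ~ IG(a0 + (n + p)/2,
                 b0 + (||y - X beta||^2 + sum_j beta_j^2 / lambda_j^2) / 2)."""
    n, p = X.shape
    shape = a0 + 0.5 * (n + p)
    rate = b0 + 0.5 * (np.sum((y - X @ beta) ** 2) + np.sum(beta ** 2 / lam2))
    return rate / rng.gamma(shape=shape)   # IG(shape, rate) via 1/Gamma

# Illustrative synthetic data
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)
sigma2_draw = gibbs_sigma2(y, X, beta_true, lam2=np.ones(p))
```

The inverse-gamma draw is obtained as the reciprocal of a gamma draw, a standard trick that avoids any dependence on non-standard-library samplers.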