Adaptive Bayesian Shrinkage Estimation Using Log-Scale Shrinkage Priors
Abstract
Global-local shrinkage hierarchies are an important, recent innovation in Bayesian estimation of regression models. In this paper we propose to use log-scale distributions as a basis for generating families of flexible prior distributions for the local shrinkage hyperparameters within such hierarchies. An important property of the log-scale priors is that by varying the scale parameter one may vary the degree to which the prior distribution promotes sparsity in the coefficient estimates, all the way from the simple proportional shrinkage ridge regression model up to extremely heavy tailed, sparsity inducing prior distributions. By examining the class of distributions over the logarithm of the local shrinkage parameter that have log-linear, or sub-log-linear tails, we show that many of the standard prior distributions for local shrinkage parameters can be unified in terms of the tail behaviour and concentration properties of their corresponding marginal distributions over the coefficients $\beta_j$. We use these results to derive upper bounds on the rate of concentration around $|\beta_j| = 0$, and the tail decay as $|\beta_j| \to \infty$, achievable by this class of prior distributions. We then propose a new type of ultra-heavy tailed prior, called the log-$t$ prior, which exhibits the property that, irrespective of the choice of associated scale parameter, the induced marginal distribution over $\beta_j$ always diverges at $\beta_j = 0$, and always possesses super-Cauchy tails. Finally, we propose to incorporate the scale parameter in the log-scale prior distributions into the Bayesian hierarchy and derive an adaptive shrinkage procedure. Simulations show that in contrast to a number of standard prior distributions, our adaptive log-$t$ procedure appears to always perform well, irrespective of the level of sparsity or signal-to-noise ratio of the underlying model.
1 Introduction
The multiple means problem has been studied extensively since the introduction of the first shrinkage estimator by James and Stein [1961]. In the multiple means problem we observe a vector of samples $\mathbf{y} = (y_1, \ldots, y_p)$ from the model
(1) $y_j \mid \beta_j \sim N(\beta_j, \sigma^2), \quad j = 1, \ldots, p,$
and are required to estimate the unknown coefficient (mean) vector $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)$. Part of the appeal of the multiple means problem is that it serves as an analogue for the more complex general linear model, while being substantially more amenable to analysis. The original work of James and Stein [1961] showed that proportional shrinkage can be used to construct estimators that dominate least-squares in terms of squared-error risk for $p \geq 3$, and these results were later extended to Bayesian proportional shrinkage estimators [Strawderman, 1971, Zellner, 1986], and more generally to Bayesian ridge regression. More recently there has been a focus on the sparse setting in which the majority of the entries of $\boldsymbol{\beta}$ are exactly zero. In this setting the use of methods that promote sparsity, such as the lasso [Tibshirani, 1996], can lead to substantial improvements in estimation risk over conventional shrinkage estimators. An important contribution to the Bayesian regression literature was the introduction of the general global-local shrinkage hierarchy proposed in Polson and Scott [2010], which models the coefficients by
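(2) $\beta_j \mid \lambda_j, \tau, \sigma^2 \sim N(0, \lambda_j^2 \tau^2 \sigma^2)$
(3) $\lambda_j \sim \pi(\lambda_j)\, d\lambda_j$
(4) $\tau \sim \pi(\tau)\, d\tau$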
In the global-local shrinkage hierarchy the local shrinkage hyperparameters $\lambda_j$ control the degree of shrinkage applied to individual coefficients, while the global shrinkage hyperparameter $\tau$ controls the overall degree of shrinkage applied globally to all coefficients. The choice of the prior distributions $\pi(\lambda_j)$ and $\pi(\tau)$ controls the behaviour of the resulting shrinkage estimators; for example, if $\pi(\lambda_j)$ is a Dirac point mass the hierarchy reduces to the standard ridge regression prior. Well known techniques that fall under the umbrella of global-local shrinkage priors include the Bayesian lasso [Park and Casella, 2008], the normal-gamma prior [Griffin and Brown, 2010], the horseshoe [Carvalho et al., 2010], the horseshoe+ [Bhadra et al., 2016], the beta mixture of Gaussians [Armagan et al., 2011] and the R2D2 prior [Zhang et al., 2016]. As there exists such a wide array of potential global-local shrinkage priors, it is valuable to determine general properties that may or may not be beneficial for estimating $\boldsymbol{\beta}$ in the multiple means problem. Carvalho et al. [2010] proposed two such properties that a good sparsity inducing prior should possess:

Property I: The prior $\pi(\lambda_j)$ should concentrate sufficient probability mass near $\lambda_j = 0$ to ensure that the marginal distribution $\pi(\beta_j) \to \infty$ as $|\beta_j| \to 0$;

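Property II: The prior $\pi(\lambda_j)$ should decay sufficiently slowly as $\lambda_j \to \infty$ so that the marginal distribution $\pi(\beta_j)$ has Cauchy, or super-Cauchy tails, to ensure that
$E[\beta_j \mid y_j] = y_j + o(1)$ as $|y_j| \to \infty$.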
The first property ensures that the Bayes estimation risk of a procedure using such a prior will be low when the underlying coefficient vector is sparse. The second property ensures that very large effects are not over-shrunk by the resulting procedure. A number of shrinkage priors, including the horseshoe and horseshoe+, satisfy both of these properties, and also possess a number of other favourable theoretical properties (see for example, van der Pas et al. [2017]). More classical shrinkage priors, such as the Bayesian lasso and Bayesian ridge, do not satisfy either Property I or II; despite this, it is not difficult to propose configurations of the underlying coefficients for which sparse estimation methods possess much greater squared-error risk than the humble Bayesian ridge. Generally the coefficient configurations that are problematic for sparse estimators are dense in the sense that a large number of the coefficients are non-zero. While it can be argued that sparsity inducing priors are inappropriate for such settings, it is difficult to know with certainty whether a problem is best modelled a priori as sparse or dense, particularly when dealing with complex natural phenomena.
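The contrast between a ridge-type prior and a sparsity inducing prior in the global-local hierarchy (2)–(4) can be illustrated by simulation. The following is a minimal sketch, not part of the original analysis: it assumes fixed illustrative values $\tau = \sigma = 1$, uses a point mass at $\lambda_j = 1$ for the ridge case and a standard half-Cauchy for the horseshoe case, and the helper names (`draw_beta`, `summarize`) and the thresholds 0.1 and 5 are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
tau, sigma = 1.0, 1.0  # assumed fixed values, chosen only for illustration

# Ridge-type prior: every local scale lambda_j is fixed at one (point mass).
lam_ridge = np.ones(n_draws)
# Horseshoe prior: lambda_j follows a standard half-Cauchy distribution.
lam_hs = np.abs(rng.standard_cauchy(n_draws))


def draw_beta(lam):
    """Draw beta_j | lambda_j ~ N(0, lambda_j^2 tau^2 sigma^2)."""
    return rng.normal(0.0, lam * tau * sigma)


def summarize(beta):
    """Fraction of prior draws very close to zero and very far from zero."""
    return {
        "near_zero": float(np.mean(np.abs(beta) < 0.1)),
        "far": float(np.mean(np.abs(beta) > 5.0)),
    }


beta_ridge = draw_beta(lam_ridge)
beta_hs = draw_beta(lam_hs)
ridge_summary = summarize(beta_ridge)
hs_summary = summarize(beta_hs)
print("ridge    :", ridge_summary)
print("horseshoe:", hs_summary)
```

Under these assumptions the horseshoe draws place more prior mass both very close to zero and far out in the tails than the ridge draws, which is the qualitative behaviour that Properties I and II formalise.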
Statistical genetics problems have been an important application, and source of inspiration, for much of the recent work on sparsity inducing priors for high dimensional regression models. A key assumption driving much of this original work was that only a small number of the many genomic variants that exist are associated with any given disease. However, there is substantial evidence that there can be large numbers of variants associated with diseases [Boyle et al., 2017], though the levels of association, and therefore the signal-to-noise ratio, will be low. Thus, there is a potential need for shrinkage priors that can adapt to the sparsity of the underlying coefficient vector, and bridge the gap between extreme sparsity inducing behaviour, and proportional shrinkage type priors, as the problem demands. Many of the existing sparsity inducing shrinkage priors have shape hyperparameters that can be tuned to adjust the degree to which they expect the underlying coefficient vector to be sparse, and some work has examined the tuning, and estimation, of these types of shape parameters (see for example, Bhattacharya et al. [2015], Griffin and Brown [2017] and Huber and Feldkircher [2017]). However, in general, the shape parameters are complex to sample, their interpretation is potentially difficult, and the tuning has often been focused on increasing the degree of a priori expected sparsity above and beyond that of a method such as the horseshoe.
In this paper we propose a framework for specifying prior distributions for local shrinkage hyperparameters based on log-scale distributions. These log-scale distributions have a single scale hyperparameter that can be used to vary the log-scale shrinkage priors from highly sparsity promoting through to almost ridge regression-like in behaviour, and are straightforward to integrate into an MCMC procedure. Apart from providing a degree of robustness to our prior assumptions regarding sparsity of the underlying coefficients, we believe that these types of adaptive shrinkage priors will have particular application in regression problems in which variables can be formed into logical groupings. In such situations, it is highly conceivable that the coefficient vectors associated with some groups will be dense while the coefficient vectors associated with other groups could be very sparse. Some obvious examples of this include additive models based around polynomial expansions in which one of the input variables may be related to the target through a discontinuous non-linearity, or statistical genomics, in which some genes or pathways may have large numbers of associations with disease, such as the HLA region [Kennedy et al., 2017], while other genes may have only one or two strongly associated variants.
1.1 Our Contribution
In this paper we show that viewing prior distributions in terms of the logarithm of the local shrinkage parameters, $\xi_j = \log \lambda_j$, has several distinct advantages. Our work was motivated by the observation that those shrinkage priors which strongly promoted sparsity spread their probability mass more thinly across the $\xi$ space. By viewing the standard prior distributions in $\xi$ space and introducing a scale parameter $\psi$ it becomes possible to vary the degree to which a prior promotes sparsity in the coefficient estimates, all the way from the simple proportional shrinkage ridge regression model up to extremely heavy tailed distributions. We call these log-scale prior distributions.
Using this approach we show that many of the standard local shrinkage parameter prior distributions can be unified in terms of tail behaviour and concentration properties of the resulting marginal distribution over $\beta_j$. In particular, we consider the class of distributions over $\xi_j$ that have log-linear, or sub-log-linear tails. We derive upper bounds on the rate of concentration around $|\beta_j| = 0$, and tail decay as $|\beta_j| \to \infty$, achievable by this class of prior distributions. Further, we show that by introduction of a scale parameter $\psi$, all of the common prior distributions can be made to behave equivalently to each other, irrespective of the specific shape parameters they may possess, in the sense that for a sufficiently small choice of scale $\psi$ the induced marginal distribution can be made to lose Properties I and II. We then propose a new class of ultra-heavy tailed priors, called the log-$t$ priors, which exhibit the property that irrespective of the choice of $\psi$, the induced marginal distribution over $\beta_j$ never loses Properties I and II.
Finally, we utilise the simple interpretation of $\psi$ as the scale for the $\xi_j$ hyperparameters to derive an adaptive shrinkage procedure. We incorporate $\psi$ into the full Bayesian hierarchy and use this to estimate the degree of sparsity in the data generating model. This yields a prior that is able to vary from highly sparsity promoting through to ridge-like, depending on the configuration of the true regression coefficients. However, by using the log-$t$ prior distribution the resulting prior distribution over the coefficients never loses Properties I and II no matter how much mass is concentrated near $\lambda_j = 1$.
2 Log-Scale Hyperprior Distributions
Let $f(z)$ be a unimodal distribution over $\mathbb{R}$. If $z \sim f(z)\,dz$, then the translated and scaled random variate $\xi = \psi z + \eta$ follows the probability distribution:
(5) $\pi(\xi \mid \eta, \psi) = \frac{1}{\psi}\, f\!\left(\frac{\xi - \eta}{\psi}\right)$
Distributions of the form (5) are known as location-scale distributions, in which $\eta$ is the location parameter and $\psi > 0$ is the scale parameter. Let $\xi = \log \lambda$ be the natural logarithm of the local shrinkage parameter $\lambda$ in the global-local shrinkage hierarchy (2)–(4). The primary motivation for studying distributions over $\xi$ is the fact that, if $\xi$ follows a location-scale distribution of the form (5), then:

location transformations of $\xi$ induce scale transformations on $\lambda$;

scale transformations of $\xi$ induce power-transformations on $\lambda$.
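Both facts follow from a one-line calculation. As a worked step, write $\xi = \psi z + \eta$ for the logarithm of the local shrinkage parameter $\lambda$, with location $\eta$ and scale $\psi$; the symbol $\lambda_0 = e^{z}$, denoting the local scale implied by the unscaled variate $z$, is introduced here purely for illustration.

```latex
\lambda \;=\; e^{\xi} \;=\; e^{\eta + \psi z}
        \;=\; e^{\eta}\,\bigl(e^{z}\bigr)^{\psi}
        \;=\; e^{\eta}\,\lambda_0^{\psi} .
```

Shifting the location $\eta$ therefore multiplies $\lambda$ by the constant $e^{\eta}$, while changing the scale $\psi$ raises the baseline $\lambda_0$ to the power $\psi$.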
The first fact is of less interest, as scale transformations of $\lambda$ are generally taken care of by the presence of the global shrinkage hyperparameter $\tau$ in the standard global-local shrinkage prior hierarchy. The second fact is far more interesting, as it reveals a simple way in which we can control the behaviour of the prior distribution when $\lambda \to 0$ and $\lambda \to \infty$. If we further restrict attention to the class of log-location-scale priors in which $f(\cdot)$ is symmetric around zero, the prior distribution $\pi(\lambda)$ is symmetric around $\lambda = e^{\eta}$ in the following sense:
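(6) $P\!\left(\lambda \in (e^{\eta}/k,\, e^{\eta})\right) = P\!\left(\lambda \in (e^{\eta},\, k\,e^{\eta})\right), \quad k > 1$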
A property of the global-local shrinkage hierarchy for the multiple means problem is that the posterior mean of the coefficients can be written as
$E[\beta_j \mid y_j, \lambda_j] = (1 - \kappa_j)\, y_j$,
where $\kappa_j$ is the degree of shrinkage towards zero being applied to coefficient $\beta_j$. Given $\lambda_j$, the corresponding degree of shrinkage (taking $\tau = 1$) is $\kappa_j = 1/(1 + \lambda_j^2)$; when $\kappa_j$ is close to zero very little shrinkage is performed, and when $\kappa_j$ is close to one, the corresponding coefficient is almost entirely shrunk to zero. This interpretation motivated the original horseshoe prior distribution, which placed a horseshoe-shaped prior over $\kappa_j$ to promote either aggressive shrinkage or little shrinkage of the coefficients. The quantity $1 - \kappa_j$ can be interpreted as the degree of evidence in the data to support $\beta_j \neq 0$ against $\beta_j = 0$, and the thresholding rule $1 - \kappa_j \geq 1/2$ is frequently used as a variable selection criterion [Carvalho et al., 2010, Tang et al., 2016].
Placing a log-scale distribution over $\xi_j$ that is symmetric around $\eta$ implies a distribution over $\kappa_j$ with a median at $1/(1 + e^{2\eta})$. In the particular case that $\eta = 0$, property (6) implies that the resulting distribution models the prior belief that coefficients are just as likely to be shrunken towards zero as they are to be left untouched, and that a priori, a variable has a marginal prior probability of being selected of $1/2$. These properties, coupled with the fact that scale transformations of $\xi_j$ result in power-transformations of $\lambda_j$, suggest that specification of symmetric priors over $\xi_j$ may offer a fruitful approach to generate novel, adjustable priors for local shrinkage parameters that imply reasonable prior beliefs about the model coefficients.
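The effect of a symmetric log-scale prior on the shrinkage coefficient can be checked numerically. The sketch below is an illustration under stated assumptions rather than a procedure from the paper: it takes the baseline distribution to be the logarithm of a standard half-Cauchy variate (the horseshoe case), writes the log of the local shrinkage parameter as `eta + psi * z`, uses the shrinkage coefficient `1 / (1 + lambda^2)`, and the function name `kappa_quartiles` is hypothetical.

```python
import numpy as np


def kappa_quartiles(psi, eta=0.0):
    """Quartiles of kappa = 1/(1 + lambda^2) when log(lambda) = eta + psi * z
    and z is the log of a standard half-Cauchy variate (assumed baseline)."""
    probs = np.array([0.25, 0.5, 0.75])
    # Quantile function of the half-Cauchy is tan(pi * p / 2); take logs.
    z = np.log(np.tan(np.pi * probs / 2.0))
    lam = np.exp(eta + psi * z)
    kappa = 1.0 / (1.0 + lam**2)
    # kappa is decreasing in lambda, so sort to get (Q1, median, Q3).
    return np.sort(kappa)


for psi in (0.25, 1.0, 3.0):
    q1, med, q3 = kappa_quartiles(psi)
    print(f"psi={psi:4.2f}: Q1={q1:.3f}, median={med:.3f}, Q3={q3:.3f}")
```

With location zero the median of the shrinkage coefficient stays at one half for every scale, while the interquartile interval widens towards the endpoints zero and one as the scale grows; for scale one the quartiles are approximately 0.146 and 0.854.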
2.1 Behaviour of Standard Shrinkage Priors in $\xi$ Space
It is of interest to examine the prior distributions implied by a number of standard shrinkage prior distributions for $\xi$. The Bayesian lasso (double exponential) prior distribution [Park and Casella, 2008] over $\beta_j$ induces an exponential distribution over $\lambda_j^2$, and a distribution of the form
(7) 
over $\xi_j$. This is an asymmetric distribution over $\xi_j$, with the left-hand tail (that controls shrinkage less than ) being much heavier than the right-hand tail (which controls shrinkage greater than ). The interquartile interval (first and third quartiles) for this distribution is approximately . Positive values of $\xi_j$ induce little shrinkage on coefficients, which is desirable for modeling very large effects. The skew towards negative values of $\xi_j$ exhibited by the Bayesian lasso demonstrates why it introduces bias in estimation when the underlying model coefficients are large. In terms of the coefficient of shrinkage, $\kappa_j$, where $\kappa_j = 1/(1 + \lambda_j^2)$, the interquartile interval of the prior induced by (7) is approximately , demonstrating a clear preference towards shrinkage below .
The horseshoe prior [Carvalho et al., 2010] is often considered a default choice for sparse regression problems. The horseshoe prior places a standard half-Cauchy distribution over $\lambda_j$, which is known to induce an unstandardised unit hyperbolic secant distribution over $\xi_j$, with probability distribution given by
(8) $\pi(\xi_j) = \frac{1}{\pi}\, \mathrm{sech}(\xi_j)$
where $\mathrm{sech}(\cdot)$ denotes the hyperbolic secant function. This distribution is symmetric around $\xi_j = 0$, and has an interquartile range of approximately $(-0.88, 0.88)$ for $\xi_j$, and $(0.15, 0.85)$ for $\kappa_j$. In contrast to the Bayesian lasso, the horseshoe prior clearly spreads its probability mass more thinly across the $\xi$ space, encodes a much wider range of a priori plausible shrinkage for the coefficients and is symmetric around $\xi_j = 0$. The horseshoe+ (HS+) prior distribution [Bhadra et al., 2016] models $\lambda_j$ as the product of two half-Cauchy random variables, which leads to a prior distribution on $\xi_j$ of the form
(9) $\pi(\xi_j) = \frac{2}{\pi^2}\, \xi_j\, \mathrm{csch}(\xi_j)$
where $\mathrm{csch}(\cdot)$ is the hyperbolic cosecant function. This distribution is also symmetric around $\xi_j = 0$, which is straightforward to verify from the fact that $\xi_j$ is modelled as the sum of two hyperbolic secant random variables, which are themselves symmetric. The HS+ prior has an interquartile range of approximately for $\xi_j$, which translates to an interquartile interval of on $\kappa_j$. The horseshoe+ prior more strongly promotes sparsity in the estimates of $\boldsymbol{\beta}$, and this is evident from the fact that its probability mass is spread over a larger region of the $\xi$ space than for either the Bayesian lasso or the horseshoe. More generally, the beta prime class of hyperpriors, which begin by modelling the shrinkage factor as $\kappa_j \sim \mathrm{Beta}(a, b)$ [Armagan et al., 2011], imply a distribution of the form
(10) $\pi(\xi_j \mid a, b) = \frac{2\, e^{2 b \xi_j}}{B(a, b)\, \left(1 + e^{2 \xi_j}\right)^{a + b}}$
over $\xi_j$. The density (10) can be identified as a z-distribution with zero mean, a scale of $1/2$ and shape parameters $a$ and $b$. This class of distributions is symmetric if and only if $a = b$, and generalizes a number of standard shrinkage hyperpriors. For example, the standard horseshoe is recovered by taking $a = b = 1/2$, while the Strawderman-Berger prior is recovered by taking $a = 1/2$ and $b = 1$, which is asymmetric. The normal-exponential-gamma prior is found by taking $a = c$ and $b = 1$, with $c > 0$, which is asymmetric for all $c \neq 1$. In all these cases the hyperparameter $b$ controls the behaviour of the prior on $\lambda_j$ as $\lambda_j \to 0$, and the hyperparameter $a$ controls the behaviour of the tail of the prior on $\lambda_j$ as $\lambda_j \to \infty$. By adjusting the $a$ and $b$ hyperparameters, the prior mass can be controlled to be spread more or less densely across $\xi$ space, controlling the degree to which the prior induces sparsity on the estimated coefficients.
For example, when $a = b = 1/2$ the interquartile interval for $\xi_j$ is identical to the interval obtained for the horseshoe, while for and the interquartile range on $\xi_j$ expands to which is wider than the interquartile range for the horseshoe+ prior. Taking , leads to a concentration of prior mass near $\xi_j = 0$, which can be used to approximate ridge regression. However, a potentially unwanted side effect of this is that when , the marginal prior distribution for $\beta_j$ loses both Properties I and II. The effect of $a$ and $b$ on the prior distribution has been used to attempt to adaptively estimate the degree of sparsity required from the data (for example, Griffin and Brown [2017] and Huber and Feldkircher [2017]). However, the functional form of the prior distribution, and the way in which the hyperparameters $a$ and $b$ control the prior, has the consequence that both the interpretation of the hyperparameters, and the practical implementation of efficient sampling algorithms for them, are difficult.
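As a check on the horseshoe case discussed in this section, the change of variables from the local shrinkage parameter to its logarithm can be written out explicitly. The derivation below is a worked illustration; it writes $\lambda$ for the local shrinkage parameter, $\xi = \log\lambda$ for its logarithm, and $p_\lambda$, $p_\xi$ for the two densities, notation chosen here for clarity.

```latex
p_\lambda(\lambda) = \frac{2}{\pi\,(1+\lambda^{2})}, \quad \lambda > 0
\qquad\Longrightarrow\qquad
p_\xi(\xi) = p_\lambda(e^{\xi})\,\frac{d\lambda}{d\xi}
           = \frac{2\,e^{\xi}}{\pi\,(1+e^{2\xi})}
           = \frac{2}{\pi\,(e^{-\xi}+e^{\xi})}
           = \frac{1}{\pi}\,\operatorname{sech}(\xi).
```

The half-Cauchy quantile function is $\tan(\pi p/2)$, so the quartiles of $\lambda$ are $\tan(\pi/8) = \sqrt{2}-1$ and $\tan(3\pi/8) = \sqrt{2}+1$, giving quartiles of $\pm\log(\sqrt{2}+1) \approx \pm 0.88$ for $\xi$.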
2.2 Log-Scale Priors as Shrinkage Priors on $\xi$
The log-scale interpretation of standard shrinkage priors offers an alternative way of understanding how both the tails, and behaviour near the origin, of a prior distribution for $\lambda_j$ model prior beliefs regarding sparsity of $\boldsymbol{\beta}$, and the type of shrinkage behaviour introduced by the prior distribution. The standard prior distributions discussed in Section 2.1 all induce unimodal distributions on $\xi_j$ which tail off to zero as $|\xi_j| \to \infty$. They differ in how thinly they spread their prior probability across the $\xi$ space. In the standard global-local shrinkage hierarchy (2)–(4), the prior distribution for $\beta_j$ is
$\beta_j \mid \lambda_j, \tau, \sigma^2 \sim N(0, \lambda_j^2 \tau^2 \sigma^2)$.
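If we assume that , we see that conditional on $\tau$, the local shrinkage parameter $\lambda_j$ is modelled as a random variable scaled by $\tau$. This scale transformation of $\lambda_j$ induces a location transformation on $\xi_j$; i.e., if $f(\xi_j)$ is the density implied by the prior for $\lambda_j$ over $\xi_j$, and $\eta = \log \tau$, then
(11) $\pi(\xi_j \mid \eta) = f(\xi_j - \eta)$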
The global scale parameter $\tau$ determines the location of the prior over $\xi_j$. The standard shrinkage priors on $\lambda_j$ can therefore be viewed as shrinking the hyperparameters $\xi_j$ towards $\eta = \log \tau$, with the more sparsity promoting prior distributions resulting in less shrinkage of the hyperparameters. Clearly (11) is of the form (5) with $\eta = \log \tau$ and $\psi = 1$. A natural generalization is then to allow the scale $\psi$ to take on an arbitrary value. The scale $\psi$ of the log-scale prior (5) can be interpreted as modelling the a priori plausible range of $\xi_j$ values around the location $\eta$. The smaller the scale parameter, the more prior probability is concentrated around $\eta$, and the less variability is implied in the values of $\xi_j$, with the result that most of the shrinkage coefficients $\kappa_j$ will be concentrated around a common value. In the limiting case that $\psi \to 0$, the prior (5) concentrates all of its mass at $\xi_j = \eta$, allowing for no variation in shrinkage between coefficients, and the prior hierarchy reduces to the Bayesian ridge. In contrast, the larger the scale parameter becomes, the more a priori variability in the $\xi_j$ hyperparameters is implied, with the caveat that less prior mass is placed around the neighbourhood of any particular $\xi_j$. In the limiting case that $\psi \to \infty$, we recover the (improper) normal-Jeffreys prior $\pi(\lambda_j) \propto 1/\lambda_j$, which is a uniform distribution over $\xi_j$.
This interpretation motivates us to propose the introduction of a scale parameter to the prior distributions over $\xi$ space as a method to provide a hyperparameter that can be used to control the amount by which a prior promotes sparsity. Practically, this type of scale hyperparameter is easier to deal with than shape hyperparameters that control the tail behaviour of priors such as the beta prime prior (10), both in terms of interpretation as well as implementation within a sampling hierarchy. Scale hyperparameters also provide a unified form of hyperparameter that controls the behaviour of a shrinkage prior in the same, standard way, irrespective of the initial prior distribution $f(\cdot)$ that we start with.
3 Three Log-Scale Prior Distributions
In this section we examine three potential choices of log-scale prior distribution for global-local shrinkage hierarchies of the form (2)–(4). The first is the log-hyperbolic secant prior, which is itself a generalization of the regular horseshoe prior distribution. The second distribution we consider is the asymmetric log-Laplace prior distribution, which has the advantage of being amenable to analysis, while also exhibiting the same tail properties as the log-hyperbolic secant prior. Furthermore, the log-Laplace prior distribution can be used to derive upper bounds on the concentration and tail behaviours of a large class of prior distributions, which includes most of the common shrinkage priors. The final distribution we consider is the log-$t$, which is formed by modelling the $\xi_j$ hyperparameters using the Student-$t$ distribution. The resulting density appears to be part of an entirely new class of prior distributions, and exhibits a special property that is, to the authors' knowledge, not shared by any other known shrinkage prior.
3.1 Log-Hyperbolic Secant Prior
The horseshoe prior is generally considered a default choice of prior distribution for the local shrinkage parameters $\lambda_j$, and therefore forms a suitable starting point for generalisation through the introduction of a scale parameter on the $\xi$ space. Our starting point is the density (8), which after introduction of a location parameter $\eta$ and scale parameter $\psi$ becomes
(12) $\pi(\xi_j \mid \eta, \psi) = \frac{1}{\pi \psi}\, \mathrm{sech}\!\left(\frac{\xi_j - \eta}{\psi}\right)$
This density is known in the literature as the hyperbolic secant distribution. The horseshoe prior (8) is a special case of (12) for $\eta = 0$ and $\psi = 1$. Without any loss of generality we can let $\eta = 0$, as the location of the density will be determined by the value of the global shrinkage parameter $\tau$ as discussed in Section 2.2. Allowing $\psi$ to vary leads to a prior distribution for $\lambda_j$ of the form
(13) $\pi(\lambda_j \mid \psi) = \frac{2\, \lambda_j^{1/\psi - 1}}{\pi \psi \left(1 + \lambda_j^{2/\psi}\right)}$
Examining (13) clearly shows that the scale parameter $\psi$ controls the tail and concentration behaviour of the induced distribution over $\lambda_j$. The larger the scale $\psi$, the heavier the tail as $\lambda_j \to \infty$, and the greater the concentration of prior mass around $\lambda_j = 0$. For $\psi = 1$ this prior reduces to the half-Cauchy distribution; for $\psi < 1$ the prior tends to zero as $\lambda_j \to 0$ and for $\psi > 1$ the prior exhibits a pole at $\lambda_j = 0$.
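The behaviour described for the log-hyperbolic secant prior can be verified numerically. The following sketch assumes the density over the local shrinkage parameter takes the form obtained by transforming a hyperbolic secant density with location zero and scale `psi` from the log scale, namely `2 * lam**(1/psi - 1) / (pi * psi * (1 + lam**(2/psi)))`; the function names and the integration grid are hypothetical choices for the example.

```python
import numpy as np


def log_hs_density(lam, psi):
    """Assumed log-hyperbolic secant prior density over lambda with scale psi."""
    lam = np.asarray(lam, dtype=float)
    return 2.0 * lam ** (1.0 / psi - 1.0) / (np.pi * psi * (1.0 + lam ** (2.0 / psi)))


def half_cauchy_density(lam):
    """Standard half-Cauchy density, the horseshoe prior on lambda."""
    lam = np.asarray(lam, dtype=float)
    return 2.0 / (np.pi * (1.0 + lam**2))


def total_mass(psi, half_width=40.0, n=80_001):
    """Integrate the density in xi = log(lambda) space with the trapezoid rule.

    Working on the log scale avoids the singularity at lambda = 0 when psi > 1.
    """
    xi = np.linspace(-half_width * psi, half_width * psi, n)
    lam = np.exp(xi)
    vals = log_hs_density(lam, psi) * lam  # Jacobian d(lambda)/d(xi) = lambda
    dx = xi[1] - xi[0]
    return dx * (vals.sum() - 0.5 * (vals[0] + vals[-1]))


for psi in (0.5, 1.0, 2.0):
    near_zero = float(log_hs_density(1e-8, psi))
    print(f"psi={psi}: mass={total_mass(psi):.6f}, density at 1e-8 = {near_zero:.3e}")
```

Under these assumptions the density integrates to one for each scale, coincides with the half-Cauchy at scale one, vanishes near the origin for scales below one and grows without bound near the origin for scales above one.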
It is of interest to compare the prior (13) to the prior over $\lambda_j$ one would obtain by starting with a z-distribution with shape parameters $a$ and $b$, mean of zero and a scale of and transforming this to $\lambda_j$, which yields
(14) 
Comparing (14) with (13) we see that the $a$ and $b$ shape parameters play exactly the same role as the scale parameter $\psi$, the primary difference being the ability to vary the tail or concentration behaviour individually by appropriate choice of $a$ and $b$. If we consider the case in which the shape parameters are the same, i.e., $a = b$, for which the distribution is symmetric, we have the following result.
Proposition 1.
There exists a such that
where is a shape parameter.
The proof follows in a straightforward manner by application of the bound . From the monotone convergence theorem, Proposition 1 tells us that controlling the tails of the beta prime prior distribution over $\lambda_j$ by variation of the shape parameters cannot lead to a heavier tailed marginal distribution over $\beta_j$, or one with a greater concentration of mass at $\beta_j = 0$, than controlling the tails of the prior by varying the scale parameter of the log-z distribution (10) alone. This result is also confirmed by the fact that the tails of the z-distribution are log-linear with an absolute log-gradient of (Barndorff-Nielsen et al. [1982], p. 150).
3.2 Log-Laplace Priors
We now examine a specific choice of log-scale prior based on the Laplace (double exponential) distribution. The primary usefulness of this distribution is its ability to provide simple bounds for the entire class of log-location-scale prior distributions over $\lambda_j$ with log-linear, or sub-log-linear tails, which itself includes the important sub-class of log-concave densities. The log-Laplace prior distribution for a local shrinkage hyperparameter $\lambda_j$ is given by
where denotes an asymmetric Laplace distribution with a median of zero, a left-scale of , a right-scale of and probability density function
(15) 
where . The asymmetric Laplace distribution is essentially equivalent to two back-to-back exponential distributions with different scale parameters for each of the exponential distributions, and leads to a piecewise probability density function over $\lambda_j$ of the form:
(16) 
This distribution has a non-differentiable point at , and is discontinuous at if . The piece of the function for is proportional to a beta distribution, , and the piece of the function for is a Pareto distribution with shape parameter . In the special case that the distribution reduces to the usual symmetric double exponential distribution which we denote by . An important property of the Laplace distribution is that it provides an upper bound for the entire class of log-concave probability distributions.
Proposition 2.
Let be a log-concave distribution with mode at , and let and be any two values of on either side of . Then, there exists a constant depending on , , such that
where is the derivative of the negative logarithm of the density .
This is simply a restatement of a well known result regarding log-concave functions [Gilks and Wild, 1992]. This result provides a useful upper bound which we use in Section 4 to provide results regarding the concentration properties and tail behaviour of the entire class of log-concave prior distributions over $\xi_j$. For the specific case of the hyperbolic secant prior (13) we can construct the following upper and lower bound based on the symmetric Laplace distribution.
Proposition 3.
The log-hyperbolic secant distribution satisfies
for all .
The fact that the Laplace distribution on which these bounds are based has the same scale as the hyperbolic secant distribution it is bounding can be used to demonstrate that the log-Laplace distribution with scale $\psi$ and the log-HS distribution with scale $\psi$ lead to marginal distributions for $\beta_j$ that have identical concentration properties and tail behaviour. More generally, we have the following result.
Proposition 4.
Let be a distribution over that is bounded from above. If satisfies
(17) 
where , , then there exists a such that
If satisfies
(18) 
where , , there exists a such that
This proposition tells us that the log-Laplace distribution can provide an upper bound for any distribution over $\xi_j$ which is log-linear, or sub-log-linear, in its tails, and can provide a lower bound if the distribution is log-linear, or super-log-linear, in its tails. The advantage of these bounds is that the form of the log-Laplace distribution allows for relatively simple analysis of concentration and tail properties of the marginal distribution $\pi(\beta_j)$, which we can use to derive bounds on the behaviour of the entire class of bounded prior densities over $\xi_j$ which have log-linear tails.
3.3 The log-$t$ prior
The log-Laplace prior discussed in Section 3.2 is important as it offers an upper bound on the entire class of prior distributions on $\xi_j$ with log-linear tails, and through the monotone convergence theorem, an upper bound on the marginal distributions over $\beta_j$ that they induce. It is of some interest then to examine an example of a prior distribution that cannot be bounded by the log-Laplace distribution. Specifically, we examine the log-$t$ prior distribution for $\lambda_j$:
where denotes a Student-$t$ distribution centered at zero, with degrees-of-freedom , scale $\psi$ and probability density
(19) 
Transforming the density (19) to a density on $\lambda_j$ yields
(20) 
The density (20) is of the form , where
(21) 
is a function of slow variation (Barndorff-Nielsen et al. [1982], p. 155). In this sense, the density (20) can be thought of as the normal-Jeffreys prior multiplied by a factor that slows its growth as $\lambda_j \to 0$, and increases the rate at which it decays as $\lambda_j \to \infty$, by an amount sufficient to ensure the resulting prior density is proper. The log-$t$ density dominates the log-Laplace density in the following sense.
Proposition 5.
This result shows that the log-$t$ density, irrespective of the choice of degrees-of-freedom or scale parameter, always concentrates more probability mass near $\lambda_j = 0$, and decays more slowly as $\lambda_j$ becomes large, than any prior density for $\lambda_j$ derived from a density on $\xi_j$ with log-linear tails. The log-$t$ density over $\lambda_j$ implies a density over the shrinkage factor, $\kappa_j$, of the form
For all and degrees-of-freedom this density satisfies
(22) 
which shows that regardless of the choice of degrees-of-freedom parameter , or the scale parameter $\psi$, the log-$t$ prior distribution leads to a prior distribution over $\kappa_j$ that is always infinite at “no shrinkage” ($\kappa_j = 0$) and “complete shrinkage to zero” ($\kappa_j = 1$). However, despite this property, the log-$t$ density can concentrate as much probability mass around $\kappa_j = 1/2$ as desired by an appropriate choice of $\psi$, as formalised by the following result.
Proposition 6.
For all degrees-of-freedom , and there exists a such that
This result implies that by choosing a small enough scale $\psi$, the log-$t$ prior distribution can become more and more similar to the ridge regression prior by allowing less a priori variation in shrinkage between model coefficients. However, the property (22) guarantees that regardless of how much prior probability mass is concentrated around $\kappa_j = 1/2$ the density always tends to infinity as $\kappa_j \to 0$ and $\kappa_j \to 1$.
4 Discussion and Theoretical Results
In this section we examine the theoretical behaviour of the log-scale prior distributions for $\lambda_j$, when used within the hierarchy
where is a unimodal distribution over . Define the marginal distribution of $\beta_j$, relative to the prior distribution $\pi(\lambda_j)$, by
(23) 
As discussed in Section 1, two desirable properties of a prior distribution over $\lambda_j$ are that the corresponding marginal distribution $\pi(\beta_j)$: (I) tends to infinity as $|\beta_j| \to 0$, and (II) has Cauchy or super-Cauchy tails as $|\beta_j| \to \infty$. We will now show that for an appropriate choice of scale parameters the asymmetric log-Laplace prior distribution results in a marginal distribution that possesses both of these properties. First, we examine the form of the marginal distribution when $\pi(\lambda_j)$ is an asymmetric log-Laplace distribution.
Theorem 1.
Let $\lambda_j$ follow a log-Laplace distribution with left-scale and right-scale , and let $\beta_j$ follow a normal distribution with variance . The marginal distribution (23) for the regression coefficient $\beta_j$ is
(24) 
where is the generalized exponential integral and is the lower incomplete gamma function.
Using Theorem 1 we can examine the concentration properties of the marginal distribution $\pi(\beta_j)$ when $\pi(\lambda_j)$ is an asymmetric log-Laplace distribution.
Theorem 2.
Let $\lambda_j$ follow a log-Laplace distribution with left-scale and right-scale , and let $\beta_j$ follow a normal distribution with variance . Then, for all , the marginal density satisfies

if ;

if ;

if .
as .
We also have the following theorem, which characterises the tail behaviour of the marginal distribution $\pi(\beta_j)$ when $\pi(\lambda_j)$ is an asymmetric log-Laplace distribution.
Theorem 3.
Let $\lambda_j$ follow a log-Laplace distribution with left-scale and right-scale , and let $\beta_j$ follow a normal distribution with variance . Then, for all
as
Theorems 2 and 3, combined with Proposition 3 and the monotone convergence theorem, can be used to obtain identical results for the log-hyperbolic secant hyperprior. More generally, when combined with Proposition 4, Theorems 2 and 3 provide upper bounds on prior concentration at $\beta_j = 0$ for the entire class of prior distributions over $\xi_j$ with log-linear, or sub-log-linear tails. An important aspect of these results is that both priors exhibit a pole at $\beta_j = 0$, and have Cauchy or super-Cauchy tails as $|\beta_j| \to \infty$, if and only if . If , insufficient mass is concentrated at $\lambda_j = 0$ to produce a pole at $\beta_j = 0$. If the tail of the prior distribution over $\lambda_j$ is too light, and the marginal distribution of $\beta_j$ decays faster than a Cauchy distribution. In contrast, when $\xi_j$ follows a Student-$t$ distribution, as per Section 3.3, the log-Laplace prior density no longer provides an upper bound, and the resulting marginal distribution for $\beta_j$ exhibits very different behaviour, as characterised by the following result.
Theorem 4.
Let follow a log distribution with scale and degreesoffreedom parameter , and let follow a normal distribution with variance . Then, the resulting marginal distribution over satisfies
for all , and
for all and .
This result demonstrates a very interesting property of the log prior distribution: namely, that irrespective of the choice of the degreesoffreedom parameter , or the scale parameter , the resulting marginal distribution over always possesses Properties I and II. To the best of the authors’ knowledge, this property appears to be unique amongst all the known prior distributions for .
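The qualitative content of Theorems 2 to 4 can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the procedure proposed in this paper. It assumes that the hyperparameter is the logarithm of the local standard deviation, so that the coefficient is normal with variance exp(2 xi), with the global scale and noise variance fixed to one. It also assumes the asymmetric Laplace density exp(-|xi|/b)/(b1 + b2), with b = b1 for xi < 0 and b = b2 otherwise. The function names, the integration grid and the particular scale values are illustrative choices. The script estimates the slope of the log marginal density against log |beta| near zero and in the tail: a negative slope near zero is consistent with a pole (Property I), and a tail slope above -2 is consistent with tails heavier than Cauchy (Property II).

```python
import math

def log_laplace_pdf(xi, b1, b2):
    """Asymmetric Laplace density over xi; b1 is the left scale, b2 the right scale."""
    b = b1 if xi < 0.0 else b2
    return math.exp(-abs(xi) / b) / (b1 + b2)

def log_t_pdf(xi, scale, dof):
    """Student t density over xi with the given scale and degrees of freedom."""
    log_c = (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c) / scale * (1.0 + (xi / scale) ** 2 / dof) ** (-(dof + 1.0) / 2.0)

def marginal(beta, prior, lo=-40.0, hi=40.0, n=8000):
    """Trapezoid rule for p(beta) = integral of N(beta | 0, exp(2 xi)) prior(xi) d xi."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        xi = lo + i * h
        # Normal density with standard deviation exp(xi), written with a single
        # exponent so that extreme xi underflows to zero instead of overflowing.
        normal = math.exp(-xi - 0.5 * beta * beta * math.exp(-2.0 * xi)) / math.sqrt(2.0 * math.pi)
        total += (0.5 if i in (0, n) else 1.0) * normal * prior(xi)
    return total * h

def slope(prior, beta_a, beta_b):
    """Average slope of log p(beta) against log beta between two points."""
    return ((math.log(marginal(beta_b, prior)) - math.log(marginal(beta_a, prior)))
            / (math.log(beta_b) - math.log(beta_a)))

# Illustrative (assumed) hyperparameter settings.
priors = {
    "log-Laplace, b1 = b2 = 2.0": lambda xi: log_laplace_pdf(xi, 2.0, 2.0),
    "log-Laplace, b1 = b2 = 0.5": lambda xi: log_laplace_pdf(xi, 0.5, 0.5),
    "Student t, scale 0.25, dof 1": lambda xi: log_t_pdf(xi, 0.25, 1.0),
}
for name, prior in priors.items():
    near_zero = slope(prior, 1e-8, 1e-6)
    tail = slope(prior, 1e6, 1e8)
    print(f"{name}: slope near zero {near_zero:+.2f}, tail slope {tail:+.2f}")
```

Under these assumptions the logLaplace slopes should approach 1/b1 - 1 near zero and -1/b2 - 1 in the tail, so both properties are lost once the scales drop below one. The Student t density over xi should retain a negative slope near zero and a tail slope above -2 even for a small scale.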
4.1 Comparison with Standard Shrinkage Priors
It is interesting to compare the new logLaplace and log prior distributions discussed in Sections 3.2 and 3.3 with the standard shrinkage priors proposed in the literature. As a consequence of Proposition 2, the logLaplace density is useful because it serves as an upperbound for the entire class of probability densities that are logconcave on ; therefore, no prior density over that is logconcave can achieve greater concentration of marginal prior probability around , or heavier tails as , than some member of the logLaplace family. It is therefore natural to ask which of the standard shrinkage prior distributions from the literature fall into this class.
The Bayesian lasso prior (7) and the regular horseshoe prior (8) over are easily verified to be logconcave by examination of their second derivatives. More generally, the beta prime family of prior densities over are characterised by the distribution (10) over , which is logconcave. This implies that regardless of how the shape hyperparameters and are chosen, the beta prime prior cannot result in a marginal distribution for with greater concentration near , or heavier tails as , than a logLaplace density (16) with appropriately chosen scale parameters.
A recent trend has been to propose prior densities for of the form
The horseshoe+ prior, the R2D2 prior [Zhang et al., 2016] and the inverse gammagamma (IGG) prior [Bai and Ghosh, 2017] can all be represented in this manner. To prove that the density induced by this prior over is logconcave, it suffices to show that and are both distributed as per logconcave densities; the preservation of logconcavity under convolution then guarantees that the distribution of their sum will also be logconcave (Saumard and Wellner [2014], pp. 60–61). In the case of horseshoe+, and are both distributed as per hyperbolic secant distributions (8), which are logconcave. The IGG prior is given by
Irrespective of the choice of hyperparameters , and both and are distributed as per logconcave densities, and it follows again that the IGG shrinkage prior for is logconcave, and its behaviour is also bounded by the logLaplace shrinkage prior. Finally, the R2D2 prior (see Equation 7, Zhang et al. [2016]) is built as a scale mixture of doubleexponential prior distributions over :
is the betaprime density, and and are hyperparameters that control the tails of the density over . Using the standard scalemixture of normals representation of the doubleexponential distribution and we can rewrite this hierarchy as
from which it immediately follows that both and are distributed as per logconcave probability distributions. The interesting result here is that the choice to use a doubleexponential kernel for in place of a normal kernel made in Zhang et al. [2016] was motivated by the aim of producing a marginal prior density for with greater concentration at the origin, and heavier tails. However, as the implied density over is logconcave it is clear that the use of the doubleexponential kernel does not lead to a prior with any different asymptotic properties than simply using a normal kernel with an appropriate logLaplace prior over .
In contrast to the above prior distributions, the log distribution does not have (sub) loglinear tails, and therefore does not fit into the class of prior distributions upperbounded by the logLaplace prior. However, our work is not the first to propose a larger, unifying class of shrinkage priors. In particular, recent work by Ghosh and Chakrabarti [2017] has explored the consistency properties of a class of globallocal shrinkage priors that can be written in a decomposition of the form
(25) 
where and is a slowlyvarying function that satisfies . To show that the log prior distribution also falls outside of this class of priors, we can transform the log distribution (20) over to a distribution over , yielding
(26) 
where is given by (21). To express (26) in the form (25) we require that . Further, we note that tends to zero as . The log prior therefore violates both of the conditions required for a prior distribution to fall into the particular class studied by Ghosh and Chakrabarti [2017].
4.2 Estimation of
The discussion and results in Sections 2 and 4 suggest that introducing a scale parameter into a shrinkage density over provides a simple, unified method for controlling the concentration and tail behaviour of the resulting marginal distribution over the coefficient . By making the scale parameter suitably small we can concentrate probability mass near and obtain nearridge regression like behaviour. Conversely, making the scale parameter sufficiently large will spread the probability mass over the space more thinly, and allow for greater variation in the shrinkage coefficients. To build a shrinkage prior that has the ability to adapt to the degree of sparsity in the underlying coefficient vector we could put an appropriate prior over and incorporate into the globallocal prior hierarchy to allow for its estimation along with the other hyperparameters. An advantage of this approach, in comparison to earlier attempts to adaptively control tails by varying shape parameters, is that has the same interpretation as a scale parameter irrespective of the shrinkage prior we begin with. A possible prior for might be
though there exists a large variety of priors for scale parameters in the literature that we can draw upon (see for example, Gelman [2006]).
In the case that is logconcave there exists a potential problem when varying the scale . Theorem 2 suggests that if is logconcave, and the scale parameter is too small, the resulting marginal distribution can lose Properties I and II discussed in Section 1. In fact, the following result shows that for all logconcave densities this is always a possibility.
Proposition 7.
If is a logconcave density over with a maximum at , and is the scaledensity
then there always exists a such that
where .
This result shows that if the density we use to create our scaledensity is logconcave, then there will always exist a choice of scale parameter such that the gradients of the logdensity exceed one on either side of the mode. We can use this in conjunction with Proposition 2 to show that we can always find a such that is upperbounded by a logLaplace density with and . Then, application of Theorem 2 shows the corresponding marginal density will not have a pole at , and application of Theorem 3 shows that the tails of will be subCauchy. The implication of this result is that if we allow estimation of the scale parameter from data, we can only obtain ridgelike behaviour at the cost of losing the ability to estimate large signals without bias and to aggressively shrink away small signals. We can also use the properties of the logLaplace prior to establish a sufficient condition for a prior distribution over to lead to a marginal distribution over that possesses Properties I and II; in particular, from Proposition 4 and Theorems 2 and 3 we know that if we can find an asymmetric Laplace distribution that lowerbounds and has scale parameters , , then the resulting marginal distribution over will possess Properties I and II.
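As a concrete instance of Proposition 7, consider the hyperbolic secant density, which is logconcave and is the density induced over the logarithm of a standard halfCauchy random variable. The worked example below is a sketch: the symbols $\xi$ for the logscale hyperparameter and $\psi$ for the scale are notation assumed here for illustration.

```latex
p(\xi \mid \psi) = \frac{1}{\pi \psi}\,\operatorname{sech}\!\left(\frac{\xi}{\psi}\right),
\qquad
\frac{d}{d\xi} \log p(\xi \mid \psi) = -\frac{1}{\psi}\tanh\!\left(\frac{\xi}{\psi}\right),
\\[6pt]
\left|\frac{d}{d\xi} \log p(\xi \mid \psi)\right| > 1
\;\iff\;
|\xi| > \psi\,\operatorname{artanh}(\psi),
\qquad 0 < \psi < 1 .
```

For any $\psi < 1$ the logdensity therefore falls more steeply than a unitscale Laplace density once $|\xi|$ exceeds $\psi\,\operatorname{artanh}(\psi)$, whereas for $\psi \geq 1$ the gradient never exceeds one in magnitude.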
If we move outside of the class of logconcave scale prior distributions for we see that the results can be strikingly different. For example, if we consider prior distributions for built from the log distribution then Theorem 4 shows that these priors always result in marginal distributions that possess Properties I and II, irrespective of the value of . Thus, the log priors may place as much mass around , or conversely , as desired, to restrict the variation in shrinkage coefficients while still providing the possibility of either heavily shrinking coefficients close to zero, or leaving very large coefficients virtually unshrunk. The log distribution would therefore appear to offer a family of shrinkage priors that smoothly transition from ultrasparse to ultradense prior beliefs, while being safe in the sense that they always provide an “out” for very large coefficients, or coefficients that are exactly zero. This suggests that this class of prior distributions is potentially a strong candidate around which to build adaptive shrinkage estimators that are minimax, or close to minimax, in terms of squared error risk. To the best of our knowledge, these are the only priors constructed so far that have this particular property.
5 Posterior Computation
Ideally, in addition to possessing favourable theoretical properties, a prior distribution should also result in a posterior distribution from which efficient simulation is possible. An interesting aspect of working in terms of , rather than directly in terms of , is that the conditional distributions are frequently both unimodal and logconcave, which allows for the use of simple and efficient rejection samplers. To see this, we note that for the globallocal shrinkage hierarchy (2)–(4) the conditional distribution of can be written in the form
(27) 
where . The first term in (27) is logconcave in , and provided that the prior is logconcave, the conditional distribution will also be logconcave. As discussed in Section 4.1 this condition is satisfied by many of the standard shrinkage priors (e.g., horseshoe, horseshoe+, Bayesian lasso). However, even in the case of nonlogconcave distributions, such as the Student, many symmetric locationscale distributions can be expressed as a scalemixture of normal distributions, i.e.,
where is a suitable mixing density. The normal distribution is logconcave, so that any distribution that can be expressed as a scalemixture of normals admits logconcave conditional distributions for , conditional on the latent variable . A further advantage of this representation is that the scale parameter , which controls the tails of the induced prior distribution over , appears as a simple variance parameter. This means that sampling , and adaptively controlling the tail weight of the prior distribution over as discussed in Section 4.2, becomes straightforward.
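The scalemixture representation can be illustrated with a short Monte Carlo check. The sketch below assumes the standard inverse gamma mixing density for the Student t distribution, with shape and scale both equal to half the degrees of freedom; the function and variable names are illustrative. Conditional on the latent variable the draw is normal, and the scale parameter enters only through the variance. With one degree of freedom the result is a Cauchy distribution, whose quartiles lie at plus and minus the scale, so about half of the draws should fall inside that interval.

```python
import math
import random

def sample_t_via_mixture(scale, dof, rng):
    """Draw from a Student t distribution using a scale mixture of normals."""
    # Latent variance: omega2 ~ InverseGamma(dof / 2, dof / 2),
    # generated as (dof / 2) divided by a Gamma(dof / 2, 1) draw.
    omega2 = (dof / 2.0) / rng.gammavariate(dof / 2.0, 1.0)
    # Conditional on omega2 the draw is normal with variance scale^2 * omega2.
    return rng.gauss(0.0, scale * math.sqrt(omega2))

rng = random.Random(0)
scale, dof, n = 0.5, 1.0, 20000
draws = [sample_t_via_mixture(scale, dof, rng) for _ in range(n)]
frac_inside = sum(abs(x) < scale for x in draws) / n
print(f"fraction of draws with |x| < scale: {frac_inside:.3f}")
```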
5.1 LogScale Prior Hierarchy
In this section we present the steps required to implement a Gibbs sampler for both the symmetric logLaplace prior density (16) and the log prior density (20), both with unknown scale parameters . We use the fact that both of these densities can be written as scalemixtures of normals in the space. We present the hierarchy for the multiple means model (1) with known noise variance , though adaptation to the general linear regression model with unknown variance is straightforward as the conditional distributions for the hyperparameters remain the same (see for example Makalic and Schmidt [2016a]). Our hierarchy is
(28)  
where , and denotes a standard halfCauchy distribution truncated to the interval , as recommended by van der Pas et al. [2017]. The choice of density for the latent variables determines the particular logscale prior to be used. For the logLaplace prior we use
and for the log prior we use
where is the degreesoffreedom parameter, which we assume to be known. The distribution can also be represented as a scale mixture, which would allow us to easily extend our hierarchy to an adaptive horseshoe procedure [BarndorffNielsen et al., 1982]. In practice, we use the following inverse gammainverse gamma mixture representation of the halfCauchy prior distribution [Makalic and Schmidt, 2016b]
This latent variable representation leads to simpler conditional distributions in a Gibbs sampling implementation than the alternative gammagamma representation commonly used [Armagan et al., 2011].
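The inverse gammainverse gamma representation can be checked by simulation. The sketch below assumes the form given in Makalic and Schmidt [2016b]: if the square of a variable is inverse gamma with shape 1/2 and scale 1/nu, and nu is inverse gamma with shape 1/2 and scale 1/A^2, then the variable is halfCauchy with scale A. The truncation used in the hierarchy above is ignored here, and the names `inv_gamma`, `nu` and `half_cauchy_scale` are illustrative. For a halfCauchy distribution with scale A, the probability of falling below q is (2/pi) arctan(q/A), which the empirical fractions are compared against.

```python
import math
import random

def inv_gamma(shape, scale, rng):
    """Inverse gamma draw, generated as scale / Gamma(shape, 1)."""
    return scale / rng.gammavariate(shape, 1.0)

def sample_half_cauchy_via_mixture(half_cauchy_scale, rng):
    """Half-Cauchy draw through two nested inverse gamma latent variables."""
    nu = inv_gamma(0.5, 1.0 / half_cauchy_scale ** 2, rng)
    x_squared = inv_gamma(0.5, 1.0 / nu, rng)
    return math.sqrt(x_squared)

rng = random.Random(1)
half_cauchy_scale, n = 2.0, 20000
draws = [sample_half_cauchy_via_mixture(half_cauchy_scale, rng) for _ in range(n)]

for multiple in (1.0, 3.0):
    q = multiple * half_cauchy_scale
    empirical = sum(x < q for x in draws) / n
    exact = 2.0 / math.pi * math.atan(q / half_cauchy_scale)
    print(f"P(x < {q:.1f}): empirical {empirical:.3f}, half-Cauchy {exact:.3f}")
```

Both conditional distributions in this representation are inverse gamma, which is what makes the corresponding Gibbs updates simple.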
5.2 Gibbs Sampling Procedure
Given observations , we can sample from the posterior distribution using the following Gibbs sampling procedure. The coefficients , can be sampled from
where . The hyperparameters can be sampled using the rejection sampler presented in Appendix II. This sampler is highly efficient, requiring approximately draws per accepted sample in the worst case setting. The latent variables are sampled according to the particular logscale prior distribution we have chosen. For the logLaplace prior is conditionally distributed as per