Adaptive Bayesian Shrinkage Estimation Using Log-Scale Shrinkage Priors


Abstract

Global-local shrinkage hierarchies are an important recent innovation in Bayesian estimation of regression models. In this paper we propose the use of log-scale distributions as a basis for generating families of flexible prior distributions for the local shrinkage hyperparameters within such hierarchies. An important property of the log-scale priors is that by varying the scale parameter one may vary the degree to which the prior distribution promotes sparsity in the coefficient estimates, all the way from the simple proportional shrinkage ridge regression model up to extremely heavy tailed, sparsity inducing prior distributions. By examining the class of distributions over the logarithm of the local shrinkage parameter that have log-linear, or sub-log-linear tails, we show that many standard prior distributions for local shrinkage parameters can be unified in terms of the tail behaviour and concentration properties of their corresponding marginal distributions over the coefficients $\beta_j$. We use these results to derive upper bounds on the rate of concentration around $\beta_j = 0$, and the rate of tail decay as $|\beta_j| \to \infty$, achievable by this class of prior distributions. We then propose a new type of ultra-heavy tailed prior, called the log-$t$ prior, which exhibits the property that, irrespective of the choice of associated scale parameter, the induced marginal distribution over $\beta_j$ always diverges at $\beta_j = 0$, and always possesses super-Cauchy tails. Finally, we propose to incorporate the scale parameter in the log-scale prior distributions into the Bayesian hierarchy and derive an adaptive shrinkage procedure. Simulations show that in contrast to a number of standard prior distributions, our adaptive log-$t$ procedure appears to always perform well, irrespective of the level of sparsity or signal-to-noise ratio of the underlying model.

1 Introduction

The multiple means problem has been studied extensively since the introduction of the first shrinkage estimator by James and Stein [1961]. In the multiple means problem we observe a vector $y = (y_1, \ldots, y_p)$ of samples from the model

$y_j = \beta_j + \varepsilon_j, \quad \varepsilon_j \sim N(0, \sigma^2), \quad j = 1, \ldots, p, \quad (1)$

and are required to estimate the unknown coefficient (mean) vector $\beta = (\beta_1, \ldots, \beta_p)$. Part of the appeal of the multiple means problem is that it serves as an analogue for the more complex general linear model, while being substantially more amenable to analysis. The original work of James and Stein [1961] showed that proportional shrinkage can be used to construct estimators that dominate least-squares in terms of squared-error risk for $p > 2$, and these results were later extended to Bayesian proportional shrinkage estimators [Strawderman, 1971, Zellner, 1986], and more generally to Bayesian ridge regression. More recently there has been a focus on the sparse setting, in which the majority of the entries of $\beta$ are exactly zero. In this setting the use of methods that promote sparsity, such as the lasso [Tibshirani, 1996], can lead to substantial improvements in estimation risk over conventional shrinkage estimators. An important contribution to the Bayesian regression literature was the introduction of the general global-local shrinkage hierarchy proposed in Polson and Scott [2010], which models the coefficients by

$\beta_j \mid \lambda_j, \tau \sim N\!\left(0, \lambda_j^2 \tau^2\right), \quad (2)$
$\lambda_j \sim \pi(\lambda_j)\, d\lambda_j, \quad (3)$
$\tau \sim \pi(\tau)\, d\tau. \quad (4)$

In the global-local shrinkage hierarchy the local shrinkage hyperparameters $\lambda_j$ control the degree of shrinkage applied to individual coefficients, while the global shrinkage hyperparameter $\tau$ controls the overall degree of shrinkage applied globally to all coefficients. The choice of the prior distributions $\pi(\lambda_j)$ and $\pi(\tau)$ controls the behaviour of the resulting shrinkage estimators; for example, if $\pi(\lambda_j)$ is a Dirac point-mass the hierarchy reduces to the standard ridge regression prior. Well known techniques that fall under the umbrella of global-local shrinkage priors include the Bayesian lasso [Park and Casella, 2008], the normal-gamma prior [Griffin and Brown, 2010], the horseshoe [Carvalho et al., 2010], the horseshoe+ [Bhadra et al., 2016], the beta mixture of Gaussians [Armagan et al., 2011] and the R2-D2 prior [Zhang et al., 2016]. As there exists such a wide array of potential global-local shrinkage priors, it is valuable to determine general properties that may or may not be beneficial for estimating $\beta$ in the multiple means problem. Carvalho et al. [2010] proposed two such properties that a good sparsity inducing prior should possess:

  • Property I: The prior $\pi(\lambda_j)$ should concentrate sufficient probability mass near $\lambda_j = 0$ to ensure that the marginal distribution $m(\beta_j) \to \infty$ as $\beta_j \to 0$;

  • Property II: The prior $\pi(\lambda_j)$ should decay sufficiently slowly as $\lambda_j \to \infty$ so that the marginal distribution $m(\beta_j)$ has Cauchy, or super-Cauchy tails, to ensure that

    $E[\beta_j \mid y_j] \to y_j$ as $|y_j| \to \infty$.

The first property ensures that the Bayes estimation risk of a procedure using such a prior will be low when the underlying coefficient vector is sparse. The second property ensures that very large effects are not over-shrunk by the resulting procedure. A number of shrinkage priors, including the horseshoe and horseshoe+, satisfy both of these properties, and also possess a number of other favourable theoretical properties (see for example, van der Pas et al. [2017]). More classical shrinkage priors, such as the Bayesian lasso and Bayesian ridge, do not satisfy either Property I or II; despite this, it is not difficult to propose configurations of the underlying coefficients for which sparse estimation methods possess much greater squared-error risk than the humble Bayesian ridge. Generally, the coefficient configurations that are problematic for sparse estimators are dense, in the sense that a large number of the coefficients are non-zero. While it can be argued that sparsity inducing priors are inappropriate for such settings, it is difficult to know with certainty whether a problem is best modelled a priori as sparse or dense, particularly when dealing with complex natural phenomena.

Statistical genetics problems have been an important application, and source of inspiration, for much of the recent work on sparsity inducing priors for high dimensional regression models. A key assumption driving much of this original work was that only a small number of the many genomic variants that exist are associated with any given disease. However, there is substantial evidence that there can be large numbers of variants associated with diseases [Boyle et al., 2017], though the levels of association, and therefore the signal-to-noise ratio, will be low. Thus, there is a potential need for shrinkage priors that can adapt to the sparsity of the underlying coefficient vector, and bridge the gap between extreme sparsity inducing behaviour, and proportional shrinkage type priors, as the problem demands. Many of the existing sparsity inducing shrinkage priors have shape hyperparameters that can be tuned to adjust the degree to which they expect the underlying coefficient vector to be sparse, and some work has examined the tuning, and estimation, of these type of shape parameters (see for example, Bhattacharya et al. [2015], Griffin and Brown [2017] and Huber and Feldkircher [2017]). However, in general, the shape parameters are complex to sample, their interpretation is potentially difficult, and the tuning has often been focused on increasing the degree of a priori expected sparsity above and beyond that of a method such as the horseshoe.

In this paper we propose a framework for specifying prior distributions for local shrinkage hyperparameters based on log-scale distributions. These log-scale distributions have a single scale hyperparameter that can be used to vary the log-scale shrinkage priors from highly sparsity promoting through to almost ridge regression-like in behaviour, and are straightforward to integrate into an MCMC procedure. Apart from providing a degree of robustness to our prior assumptions regarding sparsity of the underlying coefficients, we believe that these types of adaptive shrinkage priors will have particular application in regression problems in which variables can be formed into logical groupings. In such situations, it is highly conceivable that the coefficient vectors associated with some groups will be dense while the coefficient vectors associated with other groups could be very sparse. Some obvious examples of this include additive models based around polynomial expansions in which one of the input variables may be related to the target through a discontinuous non-linearity, or statistical genomics, in which some genes or pathways may have large numbers of associations with disease, such as the HLA region [Kennedy et al., 2017], while other genes may have only one or two strongly associated variants.

1.1 Our Contribution

In this paper we show that viewing prior distributions in terms of the logarithm of the local shrinkage parameters, $\xi_j = \log \lambda_j$, has several distinct advantages. Our work was motivated by the observation that those shrinkage priors which strongly promote sparsity spread their probability mass more thinly across the $\xi$ space. By viewing the standard prior distributions in $\xi$ space, and introducing a scale parameter, it becomes possible to vary the degree to which a prior promotes sparsity in the coefficient estimates, all the way from the simple proportional shrinkage ridge regression model up to extremely heavy tailed distributions. We call these log-scale prior distributions.

Using this approach we show that many standard local shrinkage parameter prior distributions can be unified in terms of the tail behaviour and concentration properties of the resulting marginal distribution over $\beta_j$. In particular, we consider the class of distributions over $\xi_j$ that have log-linear, or sub-log-linear tails. We derive upper bounds on the rate of concentration around $\beta_j = 0$, and the rate of tail decay as $|\beta_j| \to \infty$, achievable by this class of prior distributions. Further, we show that by the introduction of a scale parameter, all of the common prior distributions can be made to behave equivalently to each other, irrespective of the specific shape parameters they may possess, in the sense that for a sufficiently small choice of scale the induced marginal distribution can be made to lose Properties I and II. We then propose a new class of ultra-heavy tailed priors, called the log-$t$ priors, which exhibit the property that, irrespective of the choice of scale, the induced marginal distribution over $\beta_j$ never loses Properties I and II.

Finally, we utilise the simple interpretation of the scale of the prior distribution over the $\xi_j$ hyperparameters to derive an adaptive shrinkage procedure. We incorporate this scale parameter into the full Bayesian hierarchy and use it to estimate the degree of sparsity in the data generating model. This yields a prior that is able to vary from highly sparsity promoting through to ridge-like, depending on the configuration of the true regression coefficients. However, by using the log-$t$ prior distribution, the resulting prior distribution over the coefficients never loses Properties I and II, no matter how much mass is concentrated near the origin.

2 Log-scale Hyperprior Distributions

Let $f(\cdot)$ be a unimodal distribution over $z \in \mathbb{R}$. If $z \sim f$, then the translated and scaled random variate $\xi = \mu + \psi z$ follows the probability distribution:

$p(\xi \mid \mu, \psi) = \frac{1}{\psi}\, f\!\left(\frac{\xi - \mu}{\psi}\right) \quad (5)$

Distributions of the form (5) are known as location-scale distributions, in which $\mu$ is the location parameter and $\psi > 0$ is the scale parameter. Let $\xi_j = \log \lambda_j$ denote the natural logarithm of the local shrinkage parameter $\lambda_j$ in the global-local shrinkage hierarchy (2)-(4). The primary motivation for studying distributions over $\xi_j$ is the fact that, if $\xi_j$ follows a location-scale distribution of the form (5), then:

  1. location transformations of $\xi_j$ induce scale transformations on $\lambda_j$;

  2. scale transformations of $\xi_j$ induce power-transformations on $\lambda_j$ (since $\lambda_j = e^{\xi_j}$, the transformation $\xi_j \mapsto \mu + \psi\,\xi_j$ corresponds to $\lambda_j \mapsto e^{\mu}\lambda_j^{\psi}$).

The first fact is of less interest, as scale transformations of $\lambda_j$ are generally taken care of by the presence of the global shrinkage hyperparameter $\tau$ in the standard global-local shrinkage prior hierarchy. The second fact is far more interesting, as it reveals a simple way in which we can control the behaviour of the prior distribution as $\lambda_j \to 0$ and $\lambda_j \to \infty$. If we further restrict attention to the class of log-location-scale priors in which $f(\cdot)$ is symmetric around zero, the prior distribution over $\lambda_j$ is symmetric around $\lambda_j = 1$ in the following sense:

$\pi(1/\lambda_j) = \lambda_j^{2}\, \pi(\lambda_j) \quad (6)$

A property of the global-local shrinkage hierarchy for the multiple means problem is that the posterior mean of the coefficients can be written as

$E[\beta_j \mid y_j, \lambda_j, \tau] = (1 - \kappa_j)\, y_j,$

where $\kappa_j \in (0, 1)$ is the degree of shrinkage towards zero being applied to coefficient $j$. Given $\lambda_j$ and $\tau$ (and taking $\sigma^2 = 1$), the corresponding degree of shrinkage is $\kappa_j = 1/(1 + \tau^2 \lambda_j^2)$; when $\kappa_j$ is close to zero very little shrinkage is performed, and when $\kappa_j$ is close to one, the corresponding coefficient is almost entirely shrunk to zero. This interpretation motivated the original horseshoe prior distribution, which placed a horseshoe-shaped prior over $\kappa_j$ to promote either aggressive shrinkage or little shrinkage of the coefficients. The quantity $1 - \kappa_j$ can be interpreted as the degree of evidence in the data to support $\beta_j \neq 0$ against $\beta_j = 0$, and the thresholding rule $1 - \kappa_j > 1/2$ is frequently used as a variable selection criterion [Carvalho et al., 2010, Tang et al., 2016].

Placing a log-scale distribution over $\xi_j$ that is symmetric around zero implies a distribution over $\lambda_j$ with a median of $\lambda_j = 1$. In the particular case that $\tau = 1$, property (6) implies that the resulting distribution models the prior belief that coefficients are just as likely to be shrunken towards zero as they are to be left untouched, and that, a priori, a variable has a marginal prior probability of being selected of $1/2$. These properties, coupled with the fact that scale transformations of $\xi_j$ result in power-transformations of $\lambda_j$, suggest that the specification of symmetric priors over $\xi_j$ may offer a fruitful approach to generating novel, adjustable priors for local shrinkage parameters that imply reasonable prior beliefs about the model coefficients. A small simulation illustrating these properties is given below.
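The following is a minimal Monte Carlo sketch (ours, not from the paper) that checks these claims, assuming the hyperbolic secant prior over $\xi_j$ discussed in Section 2.1, sampled via its inverse CDF:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw xi = log(lambda) from a symmetric log-scale prior; here the
# hyperbolic secant density (the horseshoe in xi space), whose CDF is
# F(xi) = (2/pi) * arctan(exp(xi)), inverted below.
u = rng.uniform(size=1_000_000)
xi = np.log(np.tan(np.pi * u / 2.0))
lam = np.exp(xi)                       # local shrinkage parameter lambda
kappa = 1.0 / (1.0 + lam**2)           # shrinkage factor (tau = 1)

print(np.median(kappa))                # ~0.5: median shrinkage of one half
print(np.mean(1.0 - kappa > 0.5))      # ~0.5: prior selection probability

# Scale transformations of xi are power transformations of lambda:
psi = 0.25
print(np.allclose(np.exp(psi * xi), lam**psi))   # True
```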

2.1 Behaviour of Standard Shrinkage Priors in $\xi$ Space

It is of interest to examine the distributions over $\xi_j = \log\lambda_j$ implied by a number of standard shrinkage priors. The Bayesian lasso (double exponential) prior distribution [Park and Casella, 2008] over $\beta_j$ induces an exponential distribution over $\lambda_j^2$, and a distribution of the form

$p(\xi_j) = 2 \exp\!\left(2\xi_j - e^{2\xi_j}\right) \quad (7)$

over $\xi_j$. This is an asymmetric distribution over $\xi_j$, with the left-hand tail (which controls the behaviour for $\lambda_j < 1$) being much heavier than the right-hand tail (which controls the behaviour for $\lambda_j > 1$). The interquartile interval (first and third quartiles) for this distribution is approximately $(-0.62, 0.16)$. Positive values of $\xi_j$ induce little shrinkage on coefficients, which is desirable for modelling very large effects. The skew towards negative values of $\xi_j$ exhibited by the Bayesian lasso demonstrates why it introduces bias in estimation when the underlying model coefficients are large. In terms of the coefficient of shrinkage $\kappa_j$, where $\kappa_j = 1/(1 + \lambda_j^2)$, the first and third quartiles of the prior induced by (7) are approximately $0.42$ and $0.78$, respectively, demonstrating a clear a priori preference towards shrinkage.
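These quartiles are easily verified numerically; the sketch below (an illustration of ours, not code from the paper) inverts the closed-form CDF of the lasso-induced distribution over $\xi_j$:

```python
import numpy as np

# If lambda^2 ~ Exp(1), then xi = log(lambda) has CDF F(x) = 1 - exp(-exp(2x)).
def xi_quantile(q):
    return 0.5 * np.log(-np.log1p(-q))

q1, q3 = xi_quantile(0.25), xi_quantile(0.75)
print(q1, q3)                          # approx -0.62 and 0.16

# Map the xi quartiles to the shrinkage factor kappa = 1/(1 + exp(2*xi)).
kappa = lambda x: 1.0 / (1.0 + np.exp(2.0 * x))
print(kappa(q3), kappa(q1))            # approx 0.42 and 0.78
```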

The horseshoe prior [Carvalho et al., 2010] is often considered a default choice for sparse regression problems. The horseshoe prior places a standard half-Cauchy distribution over $\lambda_j$, which is known to induce an unstandardised unit hyperbolic secant distribution over $\xi_j$, with probability density given by

$p(\xi_j) = \frac{1}{\pi}\, \operatorname{sech}(\xi_j) \quad (8)$

where $\operatorname{sech}(\cdot)$ denotes the hyperbolic secant function. This distribution is symmetric around $\xi_j = 0$, and has an interquartile interval of approximately $(-0.88, 0.88)$ for $\xi_j$, and $(0.15, 0.85)$ for $\kappa_j$. In contrast to the Bayesian lasso, the horseshoe prior clearly spreads its probability mass more thinly across the $\xi$ space, encodes a much wider range of a priori plausible shrinkage for the coefficients, and is symmetric around $\xi_j = 0$. The horseshoe+ (HS+) prior distribution [Bhadra et al., 2016] models $\lambda_j$ as the product of two half-Cauchy random variables, which leads to a prior distribution on $\xi_j$ of the form

$p(\xi_j) = \frac{2\,\xi_j}{\pi^2}\, \operatorname{csch}(\xi_j) \quad (9)$

where $\operatorname{csch}(\cdot)$ is the hyperbolic cosecant function. This distribution is also symmetric around $\xi_j = 0$, which is straightforward to verify from the fact that $\xi_j$ is modelled as the sum of two hyperbolic secant random variables, which are themselves symmetric. The HS+ prior has an interquartile interval of approximately $(-1.36, 1.36)$ for $\xi_j$, which translates to an interquartile interval of $(0.06, 0.94)$ on $\kappa_j$. The horseshoe+ prior more strongly promotes sparsity in the estimates of $\beta_j$, and this is evident from the fact that its probability mass is spread over a larger region of the $\xi$ space than for either the Bayesian lasso or the horseshoe. More generally, the beta prime class of hyperpriors, which begin by modelling the shrinkage factor as $\kappa_j \sim \operatorname{Be}(a, b)$ [Armagan et al., 2011], imply a distribution of the form

$p(\xi_j \mid a, b) = \frac{2}{B(a, b)}\, \frac{e^{2b\,\xi_j}}{\left(1 + e^{2\xi_j}\right)^{a+b}} \quad (10)$

over $\xi_j$. The density (10) can be identified as a z-distribution with zero location, a scale of $1/2$ and shape parameters $a$ and $b$. This class of distributions is symmetric if and only if $a = b$, and generalizes a number of standard shrinkage hyperpriors. For example, the standard horseshoe is recovered by taking $a = b = 1/2$, while the Strawderman-Berger prior is recovered by taking $a = 1/2$ and $b = 1$, which is asymmetric. The negative-exponential-gamma prior is found by taking $a = c$ and $b = 1$, with $c > 0$, which is asymmetric for all $c \neq 1$. In all these cases the hyperparameter $b$ controls the behaviour of the prior on $\lambda_j$ as $\lambda_j \to 0$, and the hyperparameter $a$ controls the behaviour of the tail of the prior on $\lambda_j$ as $\lambda_j \to \infty$. By adjusting the $a$ and $b$ hyperparameters, the prior mass can be spread more or less densely across the $\xi$ space, controlling the degree to which the prior induces sparsity on the estimated coefficients.

For example, when $a = b = 1/2$ the interquartile interval for $\xi_j$ is identical to the interval obtained for the horseshoe, while for $a = b = 1/4$ the interquartile interval on $\xi_j$ expands to approximately $(-1.5, 1.5)$, which is wider than the interquartile interval for the horseshoe+ prior. Taking $a = b$ to be large leads to a concentration of prior mass near $\xi_j = 0$, which can be used to approximate ridge regression. However, a potentially unwanted side effect of this is that when $a, b > 1/2$, the marginal prior distribution for $\beta_j$ loses both Properties I and II. The effect of $a$ and $b$ on the prior distribution has been used to attempt to adaptively estimate the degree of sparsity required from the data (for example, Griffin and Brown [2017] and Huber and Feldkircher [2017]). However, the functional form of the prior distribution, and the way in which the hyperparameters $a$ and $b$ control the prior, has the consequence that both the interpretation of the hyperparameters, and the practical implementation of efficient sampling algorithms for them, is difficult. The quartile calculations above can be reproduced with the short sketch below.
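This sketch (ours, using SciPy's Beta quantile function) maps Beta quantiles for $\kappa_j$ into $\xi$ space:

```python
import numpy as np
from scipy.stats import beta

# Quartiles of xi = log(lambda) when kappa ~ Beta(a, b), using
# lambda^2 = (1 - kappa)/kappa, i.e., xi = 0.5 * log((1 - kappa)/kappa).
def xi_iqr(a, b):
    k1, k3 = beta.ppf([0.25, 0.75], a, b)
    to_xi = lambda k: 0.5 * np.log((1.0 - k) / k)
    return sorted((to_xi(k1), to_xi(k3)))

print(xi_iqr(0.5, 0.5))    # horseshoe: approx (-0.88, 0.88)
print(xi_iqr(0.25, 0.25))  # heavier tails: approx (-1.5, 1.5)
print(xi_iqr(5.0, 5.0))    # mass concentrates near 0: ridge-like behaviour
```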

2.2 Log-Scale Priors as Shrinkage Priors on $\xi$

The log-scale interpretation of standard shrinkage priors offers an alternative way of understanding how both the tails, and the behaviour near the origin, of a prior distribution for $\lambda_j$ model prior beliefs regarding the sparsity of $\beta$, and the type of shrinkage behaviour introduced by the prior distribution. The standard prior distributions discussed in Section 2.1 all induce unimodal distributions on $\xi_j$ which tail off to zero as $|\xi_j| \to \infty$. They differ in how thinly they spread their prior probability across the $\xi$ space. In the standard global-local shrinkage hierarchy (2)-(4), the prior distribution for $\beta_j$ is

$\beta_j \mid \lambda_j, \tau \sim N\!\left(0, (\tau\lambda_j)^2\right).$

If we assume that $\lambda_j \sim \pi(\lambda_j)\, d\lambda_j$, we see that, conditional on $\tau$, the effective local shrinkage parameter $\tau\lambda_j$ is modelled as a random variable $\lambda_j$ scaled by $\tau$. This scale transformation of $\lambda_j$ induces a location transformation on $\xi_j$; i.e., if $f(\xi_j)$ is the density implied by the prior for $\lambda_j$ over $\xi_j = \log\lambda_j$, and $\mu = \log\tau$, then

$p(\xi_j \mid \tau) = f(\xi_j - \mu). \quad (11)$

The global scale parameter $\tau$ determines the location of the prior over $\xi_j$. The standard shrinkage priors on $\lambda_j$ can therefore be viewed as shrinking the $\xi_j$ hyperparameters towards $\mu = \log\tau$, with the more sparsity promoting prior distributions resulting in less shrinkage of the $\xi_j$ hyperparameters. Clearly (11) is of the form (5) with location $\mu = \log\tau$ and scale $\psi = 1$. A natural generalization is then to allow the scale to take on an arbitrary value $\psi > 0$. The scale of the log-scale prior (5) can be interpreted as modelling the a priori plausible range of values of $\xi_j$ around the location $\mu$. The smaller the scale parameter, the more prior probability is concentrated around $\mu$, and the less variability is implied in the values of $\lambda_j$, with the result that most of the shrinkage coefficients $\kappa_j$ will be concentrated around $1/(1 + \tau^2)$. In the limiting case that $\psi \to 0$, the prior (5) concentrates all of its mass at $\mu$, allowing for no variation in shrinkage between coefficients, and the prior hierarchy reduces to the Bayesian ridge. In contrast, the larger the scale parameter becomes, the more a priori variability in the $\xi_j$ hyperparameters is implied, with the caveat that less prior mass is placed around the neighbourhood of any particular $\xi_j$. In the limiting case that $\psi \to \infty$, we recover the (improper) normal-Jeffreys prior $\pi(\lambda_j) \propto 1/\lambda_j$, which is a uniform distribution over $\xi_j$.

This interpretation motivates us to propose the introduction of a scale parameter to the prior distributions over the $\xi$ space as a method to provide a hyperparameter that can be used to control the amount by which a prior promotes sparsity. Practically, this type of scale hyperparameter is easier to deal with than shape hyperparameters that control the tail behaviour of priors such as the beta prime prior (10), both in terms of interpretation as well as implementation within a sampling hierarchy. It also provides a unified form of hyperparameter that controls the behaviour of a shrinkage prior in the same, standard way, irrespective of the initial prior distribution $f(\cdot)$ that we start with.

3 Three Log-Scale Prior Distributions

In this section we examine three potential choices of log-scale prior distribution for global-local shrinkage hierarchies of the form (2)-(4). The first is the log-hyperbolic secant prior, which is itself a generalization of the regular horseshoe prior distribution. The second distribution we consider is the asymmetric log-Laplace prior distribution, which has the advantage of being amenable to analysis, while also exhibiting the same tail properties as the log-hyperbolic secant prior. Furthermore, the log-Laplace prior distribution can be used to derive upper-bounds on the concentration and tail behaviours of a large class of prior distributions, which includes most of the common shrinkage priors. The final distribution we consider is the log-$t$, which is formed by modelling the $\xi_j$ hyperparameters using the Student-$t$ distribution. The resulting density appears to be part of an entirely new class of prior distributions, and exhibits a special property that is, to the authors' knowledge, not shared by any other known shrinkage prior.

3.1 Log-Hyperbolic Secant Prior

The horseshoe prior is generally considered a default choice of prior distribution for the local shrinkage parameters $\lambda_j$, and therefore forms a suitable starting point for generalisation through the introduction of a scale parameter on the $\xi$ space. Our starting point is the density (8), which after the introduction of a location parameter $\mu$ and scale parameter $\psi$ becomes

$p(\xi_j \mid \mu, \psi) = \frac{1}{\pi\psi}\, \operatorname{sech}\!\left(\frac{\xi_j - \mu}{\psi}\right) \quad (12)$

This density is known in the literature as the hyperbolic secant distribution. The horseshoe prior (8) is a special case of (12) with $\mu = 0$ and $\psi = 1$. Without any loss of generality we can let $\mu = 0$, as the location of the density will be determined by the value of the global shrinkage parameter $\tau$ as discussed in Section 2.2. Allowing $\psi \neq 1$ leads to a prior distribution for $\lambda_j$ of the form

$p(\lambda_j \mid \psi) = \frac{2}{\pi\psi}\, \frac{\lambda_j^{1/\psi - 1}}{1 + \lambda_j^{2/\psi}} \quad (13)$

Examining (13) clearly shows that the scale parameter $\psi$ controls the tail and concentration behaviour of the induced distribution over $\lambda_j$. The larger the scale $\psi$, the heavier the tail as $\lambda_j \to \infty$, and the greater the concentration of prior mass around $\lambda_j = 0$. For $\psi = 1$ this prior reduces to the half-Cauchy distribution; for $\psi < 1$ the prior tends to zero as $\lambda_j \to 0$, and for $\psi > 1$ the prior exhibits a pole at $\lambda_j = 0$.
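The three regimes are easy to see numerically; the sketch below (our illustration, not code from the paper) evaluates (13) directly and confirms it remains a proper density for each choice of $\psi$:

```python
import numpy as np
from scipy.integrate import quad

# Density (13): the prior induced on lambda by a hyperbolic secant
# distribution over xi = log(lambda) with location 0 and scale psi.
def log_hsec(lam, psi):
    return (2.0 / (np.pi * psi)) * lam**(1.0/psi - 1.0) / (1.0 + lam**(2.0/psi))

lam = np.array([1e-4, 1.0, 1e4])
print(log_hsec(lam, 1.0))    # half-Cauchy: 2/(pi*(1 + lam^2))
print(log_hsec(lam, 0.5))    # tends to zero as lam -> 0
print(log_hsec(lam, 2.0))    # pole at lam = 0 and a heavier tail

# Normalization check, integrating in xi space for numerical stability.
for psi in (0.5, 1.0, 2.0):
    val = quad(lambda x, p=psi: 1.0 / (np.pi * p * np.cosh(x / p)),
               -np.inf, np.inf)[0]
    print(val)               # ~1.0 for each psi
```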

It is of interest to compare the prior (13) to the prior over $\lambda_j$ one would obtain by starting with a z-distribution with shape parameters $a$ and $b$, a mean of zero and a scale of $\psi/2$, and transforming this to a distribution over $\lambda_j$, which yields

$p(\lambda_j \mid a, b, \psi) = \frac{2}{B(a, b)\,\psi}\, \frac{\lambda_j^{2b/\psi - 1}}{\left(1 + \lambda_j^{2/\psi}\right)^{a+b}} \quad (14)$

Comparing (14) with (13) we see that the $a$ and $b$ shape parameters play exactly the same role as the scale parameter $\psi$, the primary difference being the ability to vary the tail and concentration behaviour individually by appropriate choice of $a$ and $b$. If we consider the case in which the shape parameters are the same, i.e., $a = b$, for which the z-distribution is symmetric, we have the following result.

Proposition 1.

There exists a constant $C_a > 0$ such that

$p_z(\xi \mid a, a) \leq C_a\, p_{\mathrm{hs}}\!\left(\xi \,\middle|\, 0, \frac{1}{2a}\right) \quad \text{for all } \xi \in \mathbb{R},$

where $a$ is the shape parameter of the symmetric z-distribution (10), and $p_{\mathrm{hs}}(\cdot \mid \mu, \psi)$ denotes the hyperbolic secant density (12).

The proof follows in a straightforward manner by application of the bound $\operatorname{sech}(x) \geq e^{-|x|}$. From the monotone convergence theorem, Proposition 1 tells us that controlling the tails of the beta prime prior distribution over $\lambda_j$ by variation of the shape parameters cannot lead to a heavier tailed marginal distribution over $\beta_j$, or one with a greater concentration of mass at $\beta_j = 0$, than controlling the tails by varying the scale parameter of the log-hyperbolic secant prior (13) alone. This result is also confirmed by the fact that the tails of the z-distribution are log-linear, with an absolute log-gradient of $2a$ in the symmetric case (Barndorff-Nielsen et al. [1982], p. 150).

3.2 Log-Laplace Priors

We now examine a specific choice of log-scale prior based on the Laplace (double exponential) distribution. The primary usefulness of this distribution is its ability to provide simple bounds for the entire class of log-location-scale prior distributions over $\xi_j$ with log-linear, or sub-log-linear tails, which itself includes the important sub-class of log-concave densities. The log-Laplace prior distribution for a local shrinkage hyperparameter $\lambda_j$ is given by

$\xi_j = \log\lambda_j \sim \mathcal{AL}(a, b),$

where $\mathcal{AL}(a, b)$ denotes an asymmetric Laplace distribution with a median of zero, a left-scale of $a$, a right-scale of $b$ and probability density function

$p(\xi \mid a, b) = \begin{cases} \dfrac{1}{2a}\, e^{\xi/a}, & \xi \leq 0 \\[4pt] \dfrac{1}{2b}\, e^{-\xi/b}, & \xi > 0 \end{cases} \quad (15)$

where $a, b > 0$. The asymmetric Laplace distribution is essentially equivalent to two back-to-back exponential distributions with different scale parameters for each of the exponential distributions, and leads to a piecewise probability density function over $\lambda_j$ of the form:

$p(\lambda_j \mid a, b) = \begin{cases} \dfrac{1}{2a}\, \lambda_j^{1/a - 1}, & \lambda_j \leq 1 \\[4pt] \dfrac{1}{2b}\, \lambda_j^{-1/b - 1}, & \lambda_j > 1 \end{cases} \quad (16)$

This distribution has a non-differentiable point at $\lambda_j = 1$, and is discontinuous at $\lambda_j = 1$ if $a \neq b$. The piece of the function for $\lambda_j \leq 1$ is proportional to a beta distribution, $\operatorname{Be}(1/a, 1)$, and the piece of the function for $\lambda_j > 1$ is a Pareto distribution with shape parameter $1/b$. In the special case that $a = b = \psi$ the distribution over $\xi_j$ reduces to the usual symmetric double exponential distribution, which we denote by $\operatorname{La}(0, \psi)$. An important property of the Laplace distribution is that it provides an upper-bound for the entire class of log-concave probability distributions.
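Because the two pieces of (16) are a Beta and a Pareto distribution, sampling from the log-Laplace prior is straightforward; the following sketch (ours) draws from the prior by inverse-CDF sampling of each piece:

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw lambda from the log-Laplace prior (16): with probability 1/2 take
# the Be(1/a, 1) piece on (0, 1], otherwise the Pareto(1/b) piece on (1, inf).
def sample_log_laplace(a, b, size):
    side = rng.uniform(size=size) < 0.5
    v = rng.uniform(size=size)
    return np.where(side, v**a, v**(-b))

lam = sample_log_laplace(2.0, 2.0, 1_000_000)
xi = np.log(lam)
# In the symmetric case a = b = 2, xi is Laplace with scale 2:
print(np.mean(np.abs(xi)))   # ~2.0 (mean absolute value of a Laplace)
print(np.median(xi))         # ~0.0 (median of zero)
```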

Proposition 2.

Let $f(\xi)$ be a log-concave density with mode at $\xi_0$, and let $\xi_1 < \xi_0 < \xi_2$ be any two values of $\xi$ on either side of the mode. Then, there exists a constant $C_{\xi_1, \xi_2} > 0$, depending on $\xi_1$ and $\xi_2$, such that

$f(\xi) \leq C_{\xi_1, \xi_2}\; p\!\left(\xi \,\middle|\, -\frac{1}{g(\xi_1)},\; \frac{1}{g(\xi_2)}\right) \quad \text{for all } \xi \in \mathbb{R},$

where $g(\xi) = -\,\mathrm{d}\log f(\xi)/\mathrm{d}\xi$ is the derivative of the negative logarithm of the density $f(\xi)$, and $p(\cdot \mid a, b)$ is the asymmetric Laplace density (15).

This is simply a restatement of a well known result regarding log-concave functions [Gilks and Wild, 1992]. This result provides a useful upper bound which we use in Section 4 to derive results regarding the concentration properties and tail behaviour of the entire class of log-concave prior distributions over $\xi_j$. For the specific case of the hyperbolic secant prior (12) we can construct the following upper and lower bounds based on the symmetric Laplace distribution.

Proposition 3.

The log-hyperbolic secant distribution (12), with $\mu = 0$, satisfies

$\frac{2}{\pi}\, p_{\mathrm{La}}(\xi \mid \psi) \;\leq\; p_{\mathrm{hs}}(\xi \mid 0, \psi) \;\leq\; \frac{4}{\pi}\, p_{\mathrm{La}}(\xi \mid \psi)$

for all $\xi \in \mathbb{R}$, where $p_{\mathrm{La}}(\cdot \mid \psi)$ denotes the symmetric Laplace density $\operatorname{La}(0, \psi)$.

The fact that the Laplace distribution on which these bounds are based has the same scale as the hyperbolic secant distribution it is bounding can be used to demonstrate that the log-Laplace distribution with scale $\psi$ and the log-HS distribution with scale $\psi$ lead to marginal distributions for $\beta_j$ that have identical concentration properties and tail behaviour. More generally, we have the following result.

Proposition 4.

Let $f(\xi)$ be a distribution over $\xi$ that is bounded from above. If $f$ satisfies

$f(\xi) = O\!\left(e^{\xi/a}\right) \text{ as } \xi \to -\infty, \quad f(\xi) = O\!\left(e^{-\xi/b}\right) \text{ as } \xi \to \infty, \quad (17)$

where $a > 0$, $b > 0$, then there exists a $C > 0$ such that

$f(\xi) \leq C\, p(\xi \mid a, b) \quad \text{for all } \xi \in \mathbb{R}.$

If $f$ satisfies

$f(\xi) = \Omega\!\left(e^{\xi/a}\right) \text{ as } \xi \to -\infty, \quad f(\xi) = \Omega\!\left(e^{-\xi/b}\right) \text{ as } \xi \to \infty, \quad (18)$

where $a > 0$, $b > 0$, there exists a $c > 0$ such that

$f(\xi) \geq c\, p(\xi \mid a, b) \quad \text{for all } \xi \in \mathbb{R},$

where $p(\cdot \mid a, b)$ is the asymmetric Laplace density (15).

This proposition tells us that the log-Laplace distribution can provide an upper bound for any distribution over $\xi_j$ which is log-linear, or sub-log-linear, in its tails, and can provide a lower-bound if the distribution is log-linear, or super-log-linear, in its tails. The advantage of these bounds is that the form of the log-Laplace distribution allows for relatively simple analysis of the concentration and tail properties of the marginal distribution $m(\beta_j)$, which we can use to derive bounds on the behaviour of the entire class of bounded prior densities over $\xi_j$ which have log-linear tails.

3.3 The log-$t$ prior

The log-Laplace prior discussed in Section 3.2 is important as it offers an upper-bound on the entire class of prior distributions on $\xi_j$ with log-linear tails, and through the monotone convergence theorem, an upper-bound on the marginal distributions over $\beta_j$ that they induce. It is of some interest then to examine an example of a prior distribution that cannot be bounded by the log-Laplace distribution. Specifically, we examine the log-$t$ prior distribution for $\lambda_j$:

$\xi_j = \log\lambda_j \sim t_\nu(0, \psi),$

where $t_\nu(0, \psi)$ denotes a Student-$t$ distribution centered at zero, with degrees-of-freedom $\nu > 0$, scale $\psi > 0$ and probability density

$p(\xi \mid \psi, \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\,\psi}\left(1 + \frac{\xi^2}{\nu\psi^2}\right)^{-\frac{\nu+1}{2}} \quad (19)$

Transforming the density (19) to a density on $\lambda_j$ yields

$p(\lambda_j \mid \psi, \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\,\psi}\, \frac{1}{\lambda_j}\left(1 + \frac{(\log \lambda_j)^2}{\nu\psi^2}\right)^{-\frac{\nu+1}{2}} \quad (20)$

The density (20) is of the form $p(\lambda_j \mid \psi, \nu) \propto \lambda_j^{-1} L(\lambda_j)$, where

$L(\lambda_j) = \left(1 + \frac{(\log \lambda_j)^2}{\nu\psi^2}\right)^{-\frac{\nu+1}{2}} \quad (21)$

is a function of slow variation (Barndorff-Nielsen et al. [1982], p. 155). In this sense, the density (20) can be thought of as the normal-Jeffreys prior $\pi(\lambda_j) \propto 1/\lambda_j$ multiplied by a factor that slows its growth as $\lambda_j \to 0$, and increases the rate at which it decays as $\lambda_j \to \infty$, by an amount sufficient to ensure the resulting prior density is proper. The log-$t$ density dominates the log-Laplace density in the following sense.
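The sketch below (our illustration) confirms numerically that (20) is a proper density and that its ratio against a log-Laplace density grows without bound in both tails, anticipating Proposition 5:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

# Log-t density (20) over lambda, with xi = log(lambda) ~ Student-t.
def log_t(lam, psi, nu):
    c = np.exp(gammaln((nu + 1) / 2) - gammaln(nu / 2)) / (np.sqrt(nu * np.pi) * psi)
    return c / lam * (1.0 + np.log(lam)**2 / (nu * psi**2))**(-(nu + 1) / 2)

psi, nu = 0.5, 2.0
# Normalization check, integrating in xi space: p(xi) = p(lam) * lam.
print(quad(lambda x: log_t(np.exp(x), psi, nu) * np.exp(x),
           -np.inf, np.inf)[0])               # ~1.0

# Symmetric log-Laplace density (16) with a = b = 2.
def log_laplace(lam, a=2.0, b=2.0):
    return np.where(lam <= 1.0, lam**(1/a - 1) / (2*a), lam**(-1/b - 1) / (2*b))

for lam in (1e-30, 1e30):
    print(log_t(lam, psi, nu) / log_laplace(lam))   # diverges in both tails
```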

Proposition 5.

For all $\nu > 0$, $\psi > 0$, $a > 0$ and $b > 0$, the log-$t$ density (20) satisfies

$\lim_{\lambda \to 0^{+}} \frac{p_{\log t}(\lambda \mid \psi, \nu)}{p(\lambda \mid a, b)} = \infty \quad \text{and} \quad \lim_{\lambda \to \infty} \frac{p_{\log t}(\lambda \mid \psi, \nu)}{p(\lambda \mid a, b)} = \infty,$

where $p(\cdot \mid a, b)$ is the log-Laplace density (16).

This result shows that the log-$t$ density, irrespective of the choice of degrees-of-freedom or scale parameter, always concentrates more probability mass near $\lambda_j = 0$, and decays more slowly as $\lambda_j$ becomes large, than any prior density for $\lambda_j$ derived from a density on $\xi_j$ with log-linear tails. The log-$t$ density over $\xi_j$ implies a density over the shrinkage factor $\kappa_j = 1/(1 + \lambda_j^2)$ of the form

$p(\kappa_j \mid \psi, \nu) = \frac{K(\nu)}{2\psi\,\kappa_j (1 - \kappa_j)}\left(1 + \frac{1}{4\nu\psi^2}\left[\log\!\left(\frac{1-\kappa_j}{\kappa_j}\right)\right]^2\right)^{-\frac{\nu+1}{2}}, \quad K(\nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}.$

For all $\psi > 0$ and degrees-of-freedom $\nu > 0$ this density satisfies

$\lim_{\kappa_j \to 0^{+}} p(\kappa_j \mid \psi, \nu) = \infty \quad \text{and} \quad \lim_{\kappa_j \to 1^{-}} p(\kappa_j \mid \psi, \nu) = \infty \quad (22)$

which shows that regardless of the choice of the degrees-of-freedom parameter $\nu$, or the scale parameter $\psi$, the log-$t$ prior distribution leads to a prior distribution over $\kappa_j$ that is always infinite at “no shrinkage” ($\kappa_j = 0$) and “complete shrinkage to zero” ($\kappa_j = 1$). However, despite this property, the log-$t$ density can concentrate as much probability mass around $\kappa_j = 1/2$ as desired by an appropriate choice of $\psi$, as formalised by the following result.

Proposition 6.

For all degrees-of-freedom $\nu > 0$, $\varepsilon > 0$ and $\delta \in (0, 1)$ there exists a $\psi > 0$ such that

$\int_{1/2 - \varepsilon}^{1/2 + \varepsilon} p(\kappa_j \mid \psi, \nu)\, d\kappa_j > 1 - \delta.$

This result implies that by choosing a small enough scale $\psi$, the log-$t$ prior distribution can be made more and more similar to the ridge regression prior, allowing less a priori variation in shrinkage between model coefficients. However, property (22) guarantees that regardless of how much prior probability mass is concentrated around $\kappa_j = 1/2$, the density always tends to infinity as $\kappa_j \to 0$ and $\kappa_j \to 1$.

4 Discussion and Theoretical Results

In this section we examine the theoretical behaviour of the log-scale prior distributions for $\lambda_j$, when used within the hierarchy

$\beta_j \mid \lambda_j \sim N\!\left(0, \lambda_j^2\right), \quad \xi_j = \log\lambda_j \sim \pi(\xi_j)\, d\xi_j,$

where $\pi(\cdot)$ is a unimodal distribution over $\xi_j$. Define the marginal distribution of $\beta_j$, relative to the prior distribution $\pi(\cdot)$, by

$m(\beta_j) = \int_0^\infty N\!\left(\beta_j \mid 0, \lambda_j^2\right) \pi(\lambda_j)\, d\lambda_j \quad (23)$

As discussed in Section 1, two desirable properties of a prior distribution over $\lambda_j$ are that the corresponding marginal distribution $m(\beta_j)$: (I) tends to infinity as $\beta_j \to 0$, and (II) has Cauchy or super-Cauchy tails as $|\beta_j| \to \infty$. We will now show that for an appropriate choice of scale parameters the asymmetric log-Laplace prior distribution results in a marginal distribution that possesses both of these properties. First, we examine the form of the marginal distribution when $\pi(\xi_j)$ is an asymmetric Laplace distribution.

Theorem 1.

Let $\lambda_j$ follow a log-Laplace distribution (16) with left-scale $a$ and right-scale $b$, and let $\beta_j \mid \lambda_j$ follow a normal distribution with variance $\lambda_j^2$. The marginal distribution (23) for the regression coefficient $\beta_j$ is

$m(\beta_j) = \frac{1}{4\sqrt{2\pi}}\left[\frac{1}{a}\, E_{\frac{1}{2}+\frac{1}{2a}}\!\left(\frac{\beta_j^2}{2}\right) + \frac{1}{b}\left(\frac{\beta_j^2}{2}\right)^{-\frac{1}{2}-\frac{1}{2b}} \gamma\!\left(\frac{1}{2b}+\frac{1}{2},\, \frac{\beta_j^2}{2}\right)\right] \quad (24)$

where $E_s(x) = \int_1^\infty t^{-s} e^{-xt}\, dt$ is the generalized exponential integral and $\gamma(s, x)$ is the incomplete lower-gamma function.

Using Theorem 1 we can examine the concentration properties of the marginal distribution $m(\beta_j)$ when $\pi(\xi_j)$ is an asymmetric Laplace distribution.

Theorem 2.

Let $\lambda_j$ follow a log-Laplace distribution (16) with left-scale $a$ and right-scale $b$, and let $\beta_j \mid \lambda_j$ follow a normal distribution with variance $\lambda_j^2$. Then, for all $b > 0$, the marginal density $m(\beta_j)$ satisfies:

  1. $m(\beta_j) = O(1)$ if $a < 1$;

  2. $m(\beta_j) \asymp \log\!\left(1/\beta_j^2\right)$ if $a = 1$;

  3. $m(\beta_j) \asymp |\beta_j|^{1/a - 1}$ if $a > 1$,

as $\beta_j \to 0$.

We also have the following theorem, which characterises the tail behaviour of the marginal distribution when $\pi(\xi_j)$ is an asymmetric Laplace distribution.

Theorem 3.

Let $\lambda_j$ follow a log-Laplace distribution (16) with left scale $a$ and right scale $b$, and let $\beta_j \mid \lambda_j$ follow a normal distribution with variance $\lambda_j^2$. Then, for all $a, b > 0$,

$m(\beta_j) \asymp |\beta_j|^{-1/b - 1}$

as $|\beta_j| \to \infty$.

Theorems 2 and 3, combined with Proposition 3 and the monotone convergence theorem, can be used to obtain identical results for the log-hyperbolic secant hyperprior. More generally, when combined with Proposition 4, Theorems 2 and 3 provide upper bounds on prior concentration at $\beta_j = 0$, and tail decay as $|\beta_j| \to \infty$, for the entire class of prior distributions over $\xi_j$ with log-linear, or sub-log-linear tails. An important aspect of these results is that both priors exhibit a pole at $\beta_j = 0$, and have Cauchy or super-Cauchy tails as $|\beta_j| \to \infty$, if and only if $a \geq 1$ and $b \geq 1$. If $a < 1$, insufficient mass is concentrated at $\lambda_j = 0$ to produce a pole at $\beta_j = 0$. If $b < 1$ the tail of the prior distribution over $\lambda_j$ is too light, and the marginal distribution of $\beta_j$ decays at a faster-than-Cauchy rate. In contrast, when $\xi_j$ follows a Student-$t$ distribution, as per Section 3.3, the log-Laplace prior density no longer provides an upper-bound, and the resulting marginal distribution for $\beta_j$ exhibits very different behaviour, as characterised by the following result.
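The rates in Theorems 2 and 3 can be checked by brute-force quadrature; the following sketch (ours, for the symmetric case $a = b = 2$) shows that $m(\beta_j)\,|\beta_j|^{1/2}$ stabilises near the origin and $m(\beta_j)\,|\beta_j|^{3/2}$ stabilises in the tails, as predicted:

```python
import numpy as np
from scipy.integrate import quad

# Marginal m(beta) under the log-Laplace prior (16), computed by
# quadrature over xi = log(lambda).
def marginal(beta, a, b):
    def integrand(xi):
        lam2 = np.exp(2.0 * xi)
        prior = np.exp(xi / a) / (2*a) if xi <= 0 else np.exp(-xi / b) / (2*b)
        return np.exp(-beta**2 / (2.0 * lam2)) / np.sqrt(2.0 * np.pi * lam2) * prior
    return quad(integrand, -40.0, 40.0, limit=200)[0]

a, b = 2.0, 2.0
# Theorem 2: m(beta) ~ |beta|^(1/a - 1) = |beta|^(-1/2) as beta -> 0.
for beta in (1e-4, 1e-6, 1e-8):
    print(marginal(beta, a, b) * beta**0.5)    # roughly constant

# Theorem 3: m(beta) ~ |beta|^(-1/b - 1) = |beta|^(-3/2) as |beta| -> inf.
for beta in (1e2, 1e3, 1e4):
    print(marginal(beta, a, b) * beta**1.5)    # roughly constant
```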

Theorem 4.

Let $\lambda_j$ follow a log-$t$ distribution (20) with scale $\psi$ and degrees-of-freedom parameter $\nu$, and let $\beta_j \mid \lambda_j$ follow a normal distribution with variance $\lambda_j^2$. Then, the resulting marginal distribution over $\beta_j$ satisfies

$\lim_{\beta_j \to 0}\, m(\beta_j) = \infty$

for all $\psi > 0$ and $\nu > 0$, and

$\lim_{|\beta_j| \to \infty}\, \beta_j^2\, m(\beta_j) = \infty$

for all $\psi > 0$ and $\nu > 0$.

This result demonstrates a very interesting property of the log-$t$ prior distribution: namely, that irrespective of the choice of the degrees-of-freedom parameter $\nu$, or the scale parameter $\psi$, the resulting marginal distribution over $\beta_j$ always possesses Properties I and II. To the best of the authors' knowledge, this property appears to be unique amongst all the known prior distributions for $\lambda_j$.

4.1 Comparison with Standard Shrinkage Priors

It is interesting to compare the new log-Laplace and log-$t$ prior distributions discussed in Sections 3.2 and 3.3 with the standard shrinkage priors proposed in the literature. As a consequence of Proposition 2, the log-Laplace density is useful because it serves as an upper-bound for the entire class of probability densities that are log-concave on $\xi$; therefore, no prior whose density over $\xi_j$ is log-concave can achieve greater concentration of marginal prior probability around $\beta_j = 0$, or heavier tails as $|\beta_j| \to \infty$, than some member of the log-Laplace family. It is therefore interesting to determine which of the standard shrinkage prior distributions from the literature fall into this class.

The Bayesian lasso prior (7) and the regular horseshoe prior (8) over $\xi_j$ are easily verified to be log-concave by examination of their second derivatives. More generally, the beta prime family of prior densities over $\lambda_j$ is characterised by the z-distribution (10) over $\xi_j$, which is log-concave. This implies that regardless of how the shape hyperparameters $a$ and $b$ are chosen, the beta prime prior cannot result in a marginal distribution for $\beta_j$ with greater concentration near $\beta_j = 0$, or heavier tails as $|\beta_j| \to \infty$, than a log-Laplace density (16) with appropriately chosen scale parameters.

A recent trend has been to propose prior densities for $\lambda_j$ that are built multiplicatively, of the form

$\lambda_j = \lambda_{j,1}\,\lambda_{j,2}, \quad \text{so that} \quad \xi_j = \xi_{j,1} + \xi_{j,2}.$

The horseshoe+, the R2-D2 prior [Zhang et al., 2016] and the inverse gamma-gamma (IGG) prior [Bai and Ghosh, 2017] can all be represented in this manner. To prove that the density induced by such a prior over $\xi_j$ is log-concave, it suffices to show that $\xi_{j,1}$ and $\xi_{j,2}$ are both distributed as per log-concave densities; the preservation of log-concavity under convolution then guarantees that the distribution of their sum will also be log-concave (Saumard and Wellner [2014], pp. 60-61). In the case of the horseshoe+, $\xi_{j,1}$ and $\xi_{j,2}$ are both distributed as per hyperbolic secant distributions (8), which are log-concave. The IGG prior is given by

$\lambda_{j,1}^2 \sim \operatorname{Ga}(a, 1), \quad \lambda_{j,2}^2 \sim \operatorname{IG}(b, 1).$

Irrespective of the choice of hyperparameters $a$ and $b$, both $\xi_{j,1}$ and $\xi_{j,2}$ (the logarithms of gamma and inverse-gamma random variables) are distributed as per log-concave densities, and it follows again that the IGG shrinkage prior for $\xi_j$ is log-concave, and its behaviour is also bounded by the log-Laplace shrinkage prior. Finally, the R2-D2 prior (see Equation 7, Zhang et al. [2016]) is built as a scale mixture of double-exponential prior distributions over $\beta_j$:

$\beta_j \mid \lambda_j \sim \operatorname{DE}(\lambda_j), \quad \lambda_j \sim \operatorname{BP}(a, b),$

where $\operatorname{BP}(a, b)$ is the beta-prime density, and $a$ and $b$ are hyperparameters that control the tails of the density over $\beta_j$. Using the standard scale-mixture of normals representation of the double-exponential distribution we can rewrite this hierarchy as

$\beta_j \mid \lambda_j, w_j \sim N\!\left(0, w_j \lambda_j^2\right), \quad w_j \sim \operatorname{Exp}(1/2), \quad \lambda_j \sim \operatorname{BP}(a, b),$

from which it immediately follows that both $\log\lambda_j$ and $\log\sqrt{w_j}$ are distributed as per log-concave probability distributions. The interesting result here is that the choice to use a double-exponential kernel for $\beta_j$ in place of a normal kernel, made in Zhang et al. [2016], was motivated by the aim of producing a marginal prior density for $\beta_j$ with greater concentration at the origin, and heavier tails. However, as the implied density over $\xi_j$ is log-concave, it is clear that the use of the double-exponential kernel does not lead to a prior with any different asymptotic properties than simply using a normal kernel with an appropriate log-Laplace prior over $\xi_j$.

In contrast to the above prior distributions, the log-$t$ distribution does not have (sub-)log-linear tails, and therefore does not fit into the class of prior distributions upper-bounded by the log-Laplace prior. However, our work is not the first to propose a larger, unifying class of shrinkage priors. In particular, recent work by Ghosh and Chakrabarti [2017] has explored the consistency properties of a class of global-local shrinkage priors whose densities admit a decomposition of the form

$\pi(\lambda_j^2) = K \left(\lambda_j^2\right)^{-a-1} L\!\left(\lambda_j^2\right) \quad (25)$

where $a > 0$, $K > 0$ is a normalizing constant, and $L(\cdot)$ is a slowly-varying function that satisfies $L(t) \geq c_0 > 0$ for all $t$ sufficiently large. To show that the log-$t$ prior distribution also falls outside of this class of priors, we can transform the log-$t$ distribution (20) over $\lambda_j$ to a distribution over $\lambda_j^2$, yielding

$\pi(\lambda_j^2 \mid \psi, \nu) = \frac{K(\nu)}{2\psi}\left(\lambda_j^2\right)^{-1} L(\lambda_j) \quad (26)$

where $L(\cdot)$ is given by (21). To express (26) in the form (25) we require that $a = 0$. Further, we note that $L(\lambda_j)$ tends to zero as $\lambda_j \to \infty$. The log-$t$ prior therefore violates both of the conditions required for a prior distribution to fall into the particular class studied by Ghosh and Chakrabarti [2017].

4.2 Estimation of $\psi$

The discussion and results in Sections 2 and 4 suggest that introducing a scale parameter $\psi$ into a shrinkage density over $\xi_j$ provides a simple, unified method for controlling the concentration and tail behaviour of the resulting marginal distribution over the coefficient $\beta_j$. By making the scale parameter suitably small we can concentrate probability mass near $\xi_j = 0$ and obtain near-ridge regression like behaviour. Conversely, making the scale parameter sufficiently large will spread the probability mass over the $\xi$ space more thinly, and allow for greater variation in the shrinkage coefficients. To build a shrinkage prior that has the ability to adapt to the degree of sparsity in the underlying coefficient vector we can place an appropriate prior over $\psi$ and incorporate $\psi$ into the global-local prior hierarchy, allowing for its estimation along with the other hyperparameters. An advantage of this approach, in comparison to earlier attempts to adaptively control tails by varying shape parameters, is that $\psi$ has the same interpretation as a scale parameter irrespective of the shrinkage prior we begin with. A possible prior for $\psi$ might be

$\psi \sim C^{+}(0, 1),$

though there exists a large variety of priors for scale parameters in the literature that we can draw upon (see for example, Gelman [2006]).

In the case that $f(\cdot)$ is log-concave there exists a potential problem when varying the scale $\psi$. Theorem 2 suggests that if $f(\cdot)$ is log-concave, and the scale parameter is too small, the resulting marginal distribution can lose Properties I and II discussed in Section 1. In fact, the following result shows that for all log-concave densities this is always a possibility.

Proposition 7.

If $f(\xi)$ is a log-concave density over $\xi$ with a maximum at $\xi = 0$, and $f_\psi(\xi)$ is the scale-density

$f_\psi(\xi) = \frac{1}{\psi}\, f\!\left(\frac{\xi}{\psi}\right),$

then there always exists a $\psi' > 0$ such that, for all $\psi < \psi'$, there exist points $\xi_1 < 0 < \xi_2$ with

$g_\psi(\xi_1) < -1 \quad \text{and} \quad g_\psi(\xi_2) > 1,$

where $g_\psi(\xi) = -\,\mathrm{d}\log f_\psi(\xi)/\mathrm{d}\xi$.

This result shows that if the density we use to create our scale-density is log-concave, then there will always exist a choice of scale parameter such that the gradients of the log-density exceed one on either side of the mode. We can use this in conjunction with Proposition 2 to show that we can always find a $\psi$ such that $f_\psi$ is upper-bounded by a log-Laplace density with $a < 1$ and $b < 1$. Then, application of Theorem 2 shows the corresponding marginal density will not have a pole at $\beta_j = 0$, and application of Theorem 3 shows that the tails of $m(\beta_j)$ will be sub-Cauchy. The implication of this result is that if we allow estimation of the scale parameter from the data, we can only obtain ridge-like behaviour at the cost of losing the ability to estimate large signals without bias and to aggressively shrink away small signals. We can also use the properties of the log-Laplace prior to establish a sufficient condition for a prior distribution over $\xi_j$ to lead to a marginal distribution over $\beta_j$ that possesses Properties I and II; in particular, from Proposition 4 and Theorems 2 and 3, we know that if we can find an asymmetric Laplace distribution that lower-bounds the prior and has scale parameters $a \geq 1$, $b \geq 1$, then the resulting marginal distribution over $\beta_j$ will possess Properties I and II.

If we move outside of the class of log-concave scale prior distributions for $\xi_j$ we see that the results can be strikingly different. For example, if we consider prior distributions for $\xi_j$ built from the log-$t$ distribution, then Theorem 4 shows that these priors always result in marginal distributions that possess Properties I and II, irrespective of the value of $\psi$. Thus, the log-$t$ priors may place as much mass around $\kappa_j = 1/2$ as desired, to restrict the variation in shrinkage coefficients, while still providing the possibility of either heavily shrinking coefficients close to zero, or leaving very large coefficients virtually unshrunk. The log-$t$ distribution would therefore appear to offer a family of shrinkage priors that smoothly transition from ultra-sparse to ultra-dense prior beliefs, while being safe in the sense that they always provide an “out” for very large coefficients, or coefficients that are exactly zero. This suggests that this class of prior distributions is potentially a strong candidate around which to try and build adaptive shrinkage estimators that are minimax, or close to minimax, in terms of squared error risk. To the best of our knowledge, these are the only priors constructed so far that have this particular property.

5 Posterior Computation

Ideally, in addition to possessing favourable theoretical properties, a prior distribution should also result in a posterior distribution from which efficient simulation is possible. An interesting aspect of working in terms of $\xi_j$, rather than directly in terms of $\lambda_j$, is that the conditional distributions are frequently both unimodal and log-concave, which allows for the use of simple and efficient rejection samplers. To see this, we note that for the global-local shrinkage hierarchy (2)-(4) the conditional distribution of $\xi_j$ can be written in the form

$p(\xi_j \mid \beta_j, \tau) \propto \exp\!\left(-\xi_j - m_j\, e^{-2\xi_j}\right) \pi(\xi_j) \quad (27)$

where $m_j = \beta_j^2/(2\tau^2)$. The first term in (27) is log-concave in $\xi_j$, and provided that the prior $\pi(\xi_j)$ is log-concave, the conditional distribution will also be log-concave. As discussed in Section 4.1 this condition is satisfied by many of the standard shrinkage priors (i.e., horseshoe, horseshoe+, Bayesian lasso). However, even in the case of non-log-concave distributions, such as the Student-$t$, many symmetric location-scale distributions can be expressed as a scale-mixture of normal distributions, i.e.,

$p(\xi_j \mid \psi) = \int_0^\infty N\!\left(\xi_j \mid 0, \omega_j^2 \psi^2\right) g(\omega_j^2)\, d\omega_j^2,$

where $g(\cdot)$ is a suitable mixing density. The normal distribution is log-concave, so any distribution that can be expressed as a scale-mixture of normals admits a log-concave conditional distribution for $\xi_j$, conditional on the latent variable $\omega_j^2$. A further advantage of this representation is that the scale parameter $\psi$, which controls the tails of the induced prior distribution over $\lambda_j$, appears as a simple variance parameter. This means that sampling $\psi$, and adaptively controlling the tail weight of the prior distribution over $\xi_j$ as discussed in Section 4.2, becomes straightforward.
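Both log-scale priors used in this paper fit this representation; the sketch below (ours) checks the two standard mixture identities by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, psi = 1_000_000, 0.7

# Laplace as a scale mixture of normals: omega^2 ~ Exp(1/2) gives
# xi | psi ~ Laplace(0, psi).
omega2 = rng.exponential(scale=2.0, size=n)       # Exp with rate 1/2
xi_lap = rng.normal(0.0, psi * np.sqrt(omega2))
print(np.mean(np.abs(xi_lap)))                    # ~psi (Laplace mean |xi|)

# Student-t as a scale mixture of normals: omega^2 ~ IG(nu/2, nu/2) gives
# xi | psi ~ t_nu(0, psi).
nu = 4.0
omega2 = 1.0 / rng.gamma(shape=nu/2, scale=2.0/nu, size=n)
xi_t = rng.normal(0.0, psi * np.sqrt(omega2))
print(np.var(xi_t), psi**2 * nu / (nu - 2.0))     # should roughly agree
```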

5.1 Log-Scale Prior Hierarchy

In this section we present the steps required to implement a Gibbs sampler for both the symmetric log-Laplace prior density (16) and the log-$t$ prior density (20), both with unknown scale parameter $\psi$. We use the fact that both of these densities can be written as scale-mixtures of normals in the $\xi$ space. We present the hierarchy for the multiple means model (1) with known noise variance $\sigma^2$, though adaptation to the general linear regression model with unknown variance is straightforward, as the conditional distributions for the hyperparameters remain the same (see for example Makalic and Schmidt [2016a]). Our hierarchy is

$y_j \mid \beta_j \sim N\!\left(\beta_j, \sigma^2\right),$
$\beta_j \mid \xi_j, \tau \sim N\!\left(0, \tau^2 e^{2\xi_j}\right),$
$\xi_j \mid \omega_j^2, \psi \sim N\!\left(0, \omega_j^2 \psi^2\right),$
$\psi \sim C^{+}(0, 1), \quad \tau \sim C^{+}_{[1/p,\,1]}(0, 1), \quad (28)$

where $j = 1, \ldots, p$, and $C^{+}_{[1/p,\,1]}(0, 1)$ denotes a standard half-Cauchy distribution truncated to the interval $[1/p, 1]$, as recommended by van der Pas et al. [2017]. The choice of density for the latent variables $\omega_j^2$ determines the particular log-scale prior to be used. For the log-Laplace prior we use

$\omega_j^2 \sim \operatorname{Exp}(1/2),$

and for the log-$t$ prior we use

$\omega_j^2 \sim \operatorname{IG}\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$

where $\nu > 0$ is the degrees-of-freedom parameter, which we assume to be known. The z-distribution can also be represented as a scale mixture of normals, which would allow us to easily extend our hierarchy to an adaptive horseshoe procedure [Barndorff-Nielsen et al., 1982]. In practice, we use the following inverse gamma-inverse gamma mixture representation of the half-Cauchy prior distribution [Makalic and Schmidt, 2016b]:

$\psi^2 \mid \chi \sim \operatorname{IG}\!\left(\frac{1}{2}, \frac{1}{\chi}\right), \quad \chi \sim \operatorname{IG}\!\left(\frac{1}{2}, 1\right).$

This latent variable representation leads to simpler conditional distributions in a Gibbs sampling implementation than the alternative gamma-gamma representation commonly used (see Armagan et al. [2011]).
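The sketch below (our illustration) verifies the inverse gamma-inverse gamma representation by comparing simulated quantiles against the half-Cauchy quantile function:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Half-Cauchy via the inverse gamma-inverse gamma mixture:
#   psi^2 | chi ~ IG(1/2, 1/chi),  chi ~ IG(1/2, 1)  =>  psi ~ C+(0, 1).
def inv_gamma(shape, rate, size):
    return rate / rng.gamma(shape=shape, size=size)

chi = inv_gamma(0.5, 1.0, n)
psi = np.sqrt(inv_gamma(0.5, 1.0 / chi, n))

# Compare sample quantiles with the half-Cauchy quantile function
# F^{-1}(q) = tan(pi * q / 2).
for q in (0.25, 0.5, 0.75):
    print(np.quantile(psi, q), np.tan(np.pi * q / 2.0))
```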

5.2 Gibbs Sampling Procedure

Given observations $y = (y_1, \ldots, y_p)$, we can sample from the posterior distribution using the following Gibbs sampling procedure. The coefficients $\beta_j$, $j = 1, \ldots, p$, can be sampled from

$\beta_j \mid y_j, \xi_j, \tau \sim N\!\left(k_j\, y_j,\; \sigma^2 k_j\right),$

where $k_j = \tau^2 e^{2\xi_j}/(\sigma^2 + \tau^2 e^{2\xi_j})$. The hyperparameters $\xi_j$ can be sampled using the rejection sampler presented in Appendix II; this sampler is highly efficient, requiring only a small number of draws per accepted sample even in the worst case setting. The latent variables $\omega_j^2$ are sampled according to the particular log-scale prior distribution we have chosen. For the log-Laplace prior, $1/\omega_j^2$ is conditionally distributed as per an inverse Gaussian distribution with mean $\psi/|\xi_j|$ and shape parameter equal to one.
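To make the full sweep concrete, the following is a minimal end-to-end sketch of a sampler for the adaptive log-$t$ prior under the multiple means model. It is not the paper's implementation: the rejection sampler of Appendix II is replaced by a generic univariate slice sampler (Neal, 2003), $\tau$ is held fixed rather than sampled, and $\sigma^2$ and $\nu$ are assumed known.

```python
import numpy as np

rng = np.random.default_rng(4)

def slice_sample(x0, logf, w=1.0, m=50):
    """Univariate slice sampler with stepping out (Neal, 2003)."""
    logy = logf(x0) + np.log(rng.uniform())
    L = x0 - w * rng.uniform()
    R = L + w
    for _ in range(m):
        if logf(L) < logy: break
        L -= w
    for _ in range(m):
        if logf(R) < logy: break
        R += w
    while True:
        x1 = rng.uniform(L, R)
        if logf(x1) >= logy:
            return x1
        if x1 < x0: L = x1
        else: R = x1

def gibbs_log_t(y, sigma2=1.0, tau=0.1, nu=2.0, iters=2000):
    p = len(y)
    beta, xi = np.zeros(p), np.zeros(p)
    omega2, psi2, chi = np.ones(p), 1.0, 1.0
    samples = []
    for it in range(iters):
        # 1. beta_j | y_j, xi_j, tau: conjugate normal update.
        v = tau**2 * np.exp(2.0 * xi)
        k = v / (sigma2 + v)
        beta = k * y + np.sqrt(sigma2 * k) * rng.normal(size=p)
        # 2. xi_j | beta_j, omega_j^2, psi: log-concave target, slice sampled
        #    in place of the paper's Appendix II rejection sampler.
        for j in range(p):
            m_j = beta[j]**2 / (2.0 * tau**2)
            logf = lambda x: -x - m_j * np.exp(-2.0 * x) \
                             - x**2 / (2.0 * omega2[j] * psi2)
            xi[j] = slice_sample(xi[j], logf)
        # 3. omega_j^2 | xi_j, psi ~ IG((nu+1)/2, (nu + xi_j^2/psi^2)/2).
        omega2 = ((nu + xi**2 / psi2) / 2.0) / rng.gamma((nu + 1.0) / 2.0, size=p)
        # 4. psi^2 and its auxiliary variable: IG-IG updates implementing
        #    the half-Cauchy prior on the log-scale parameter psi.
        psi2 = (1.0 / chi + np.sum(xi**2 / omega2) / 2.0) / rng.gamma((p + 1.0) / 2.0)
        chi = (1.0 + 1.0 / psi2) / rng.gamma(1.0)
        samples.append(beta.copy())
    return np.array(samples)

# Example: 90% zeros and a few large signals.
y = np.r_[rng.normal(size=90), rng.normal(loc=8.0, size=10)]
draws = gibbs_log_t(y)
print(draws[1000:].mean(axis=0).round(2))   # posterior means after burn-in
```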