
# Inverse stable prior for exponential models

## Abstract

We consider a class of non-conjugate priors as a mixing family of distributions for a parameter (e.g., Poisson or gamma rate, inverse scale or precision of an inverse-gamma, inverse variance of a normal distribution) of an exponential subclass of discrete and continuous data distributions. The prior class is proper, nonzero at the origin (unlike the gamma prior with shape parameter less than one), and is easy to generate random numbers from. The prior class also provides flexibility (it includes the Airy and the half-normal) in capturing different prior beliefs, as modulated by a bounded parameter. The resulting posterior family in the single-parameter case can be expressed in closed form and is proper, making calibration unnecessary. The mixing induced by the inverse stable family results in a marginal prior distribution in the form of a generalized Mittag-Leffler function, which covers a broad array of distributional shapes. We derive closed-form expressions of some properties like the moment generating function and moments. We propose algorithms to generate samples from the posterior distribution and calculate the Bayes estimators for real data analysis. We formulate the predictive prior and posterior distributions. We test the proposed Bayes estimators using Monte Carlo simulations. The extension to hierarchical modeling and inverse variance components models is straightforward. We illustrate the methodology using a real data set, introduce a hyperprior density for the hyperparameters, and extend the model to a heavy-tailed distribution.

Keywords: rate, inverse scale, inverse variance, precision, Jeffreys, Mittag-Leffler

## 1 Introduction

We illustrate some motivations of this paper using common practical applications of the widely used conjugate gamma$(\nu_0,\lambda_0)$ prior density

$$p(\lambda\mid\nu_0,\lambda_0)=\frac{\lambda_0^{\nu_0}}{\Gamma(\nu_0)}\,\lambda^{\nu_0-1}e^{-\lambda_0\lambda},\qquad\lambda>0,\ \nu_0>0,\ \lambda_0>0.\tag{1}$$

If $\nu_0<1$, then as $\lambda\to 0^{+}$ the limiting density fails to exist (the density tends to infinity). Therefore, an infinite mass at the origin introduces bias toward values near the origin.

Application 1. If a discrete dataset $x=(x_1,\ldots,x_n)$ is a random sample from Poisson$(\lambda)$ with probability mass function $f(x\mid\lambda)=\lambda^{x}e^{-\lambda}/x!$, then we obtain the Poisson$(\lambda)$-gamma$(\nu_0,\lambda_0)$ model, where the posterior density is proportional to

$$\lambda^{\sum_{j=1}^{n}x_j-(1-\nu_0)}\,e^{-(n+\lambda_0)\lambda}.\tag{2}$$

As $\lambda\to 0^{+}$ with $\nu_0<1$, the posterior diverges (with infinite limit) whenever $\sum_{j=1}^{n}x_j=0$. This posits a limitation on inference for the Poisson-gamma model. A true zero rate (or infinite variance) cannot be ignored, but an infinite mass cannot be assigned only at/near the origin either. Note that observing $\sum_{j=1}^{n}x_j=0$ in practice, for instance, is not uncommon, especially if a relatively small dataset is sampled from zero-inflated populations.
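A quick numerical sketch makes the divergence concrete (the hyperparameter values below are illustrative, not from the paper): with an all-zero sample and $\nu_0<1$, the kernel (2) grows without bound as $\lambda\to 0^{+}$.

```python
import math

def poisson_gamma_kernel(lam, sum_x, n, nu0, lam0):
    """Unnormalized Poisson-gamma posterior kernel (2):
    lam^(sum_x - (1 - nu0)) * exp(-(n + lam0) * lam)."""
    return lam ** (sum_x - (1 - nu0)) * math.exp(-(n + lam0) * lam)

# Illustrative values (ours): shape nu0 < 1 and a sample with sum_x = 0.
nu0, lam0, n, sum_x = 0.5, 1.0, 10, 0
for lam in (1e-2, 1e-4, 1e-6):
    print(lam, poisson_gamma_kernel(lam, sum_x, n, nu0, lam0))
```

The printed kernel values increase without bound as $\lambda$ decreases, illustrating the infinite limit at the origin.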

Furthermore, the Jeffreys prior $p(\lambda)\propto\lambda^{-1/2}$ for the Poisson distribution also tends to infinity as $\lambda\to 0^{+}$. The corresponding posterior kernel is

$$\lambda^{\sum_{j=1}^{n}x_j-1/2}\,e^{-n\lambda},\tag{3}$$

which again suffers from the same drawback as $\lambda\to 0^{+}$ whenever $\sum_{j=1}^{n}x_j=0$.

Application 2. If $X$ is continuous and $x=(x_1,\ldots,x_n)$ is a random sample from gamma$(\nu,\lambda)$ with probability density function $f(x\mid\nu,\lambda)=\lambda^{\nu}x^{\nu-1}e^{-\lambda x}/\Gamma(\nu)$, then we have the gamma$(\nu,\lambda)$-gamma$(\nu_0,\lambda_0)$ model for the rate $\lambda$. The posterior density kernel is

$$\lambda^{\nu_0+n\nu-1}\,e^{-[\lambda_0+\sum_{j=1}^{n}x_j]\lambda}.\tag{4}$$

If both shape parameters ($\nu_0$ and $\nu$) above are relatively small such that $\nu_0+n\nu<1$, then the limit also ceases to exist as $\lambda\to 0^{+}$.

Application 3. If the sample data come from inverse-gamma$(\nu,\lambda)$ with probability density function $f(x\mid\nu,\lambda)=\lambda^{\nu}x^{-\nu-1}e^{-\lambda/x}/\Gamma(\nu)$, then we obtain the inverse-gamma$(\nu,\lambda)$-gamma$(\nu_0,\lambda_0)$ model for the inverse scale parameter $\lambda$, which yields another gamma posterior kernel

$$\lambda^{\nu_0+n\nu-1}\,e^{-[\lambda_0+\sum_{j=1}^{n}1/x_j]\lambda}.\tag{5}$$

If the same conditions as in Application 2 are satisfied, then the posterior suffers from the same drawback as well. Clearly, finding a prior that is proper and robust against the restriction or bias introduced near the origin in the above models is one motivation of this paper, besides providing informative and proper priors for a rate parameter $\lambda$, say, of frequently used exponential models.

The rest of the paper is organized as follows. The inverse stable density is introduced in Section 2. The single parameter formulation is presented in Section 3. The computational algorithms are given in Section 4. Predictive distributions are discussed in Section 5. The extensions to the hierarchical setting are in Section 6. Illustrations using real data are in Section 7. The extension to heavy-tailed prior specification and some concluding remarks are presented in Section 8.

## 2 Inverse stable density

The inverse stable (IS) density has become increasingly popular in several areas of study, particularly in physics and mathematics. It is a probability model for time-fractional differential equations, which leads to closed-form solutions (Piryatinska et al., 2005; Mainardi et al., 2010; Meerschaert et al., 2011; Meerschaert and Straka, 2013; Iksanov et al., 2017). It is also used as a subordinator (as the operational time rather than the physical time) for time-fractional diffusions and for Poisson-type processes (Beghin and Orsingher, 2010).

The inverse stable density is related to the strictly $\alpha$-stable density $g_\alpha$ (see all references above) as follows: if $\Theta\sim\mathrm{IS}_{\alpha,\rho}$, then

$$\mathrm{IS}_{\alpha,\rho}(\theta)=\frac{\rho^{1/\alpha}}{\alpha}\,\theta^{-1-1/\alpha}\,g_\alpha\!\left(\rho^{1/\alpha}\theta^{-1/\alpha}\right),\qquad\theta>0,\tag{6}$$

where the Laplace transform of the $\alpha$-stable density is $\int_{0}^{\infty}e^{-sy}g_\alpha(y)\,dy=e^{-s^{\alpha}}$, $0<\alpha<1$. Note that the Airy ($\alpha=1/3$) and half-normal ($\alpha=1/2$) distributions are special cases. As $\alpha\to 1^{-}$, the distribution becomes degenerate, i.e.,

$$\mathrm{IS}_{\alpha,\rho}(\theta)\longrightarrow\delta(\theta-\rho).\tag{7}$$

The prior family (indexed by $\alpha\in(0,1)$) does not vanish at the origin, i.e., $\mathrm{IS}_{\alpha,\rho}(0^{+})$ is finite and positive (unlike the gamma prior with shape parameter less than one). The large-argument behavior (Mainardi et al., 2010; Meerschaert and Straka, 2013) of $\mathrm{IS}_{\alpha,1}$ can be simply expressed as

$$\mathrm{IS}_{\alpha,1}(\theta/\alpha)\sim c_1(\alpha)\,\theta^{(\alpha-1/2)/(1-\alpha)}\exp\!\left[-c_2(\alpha)\,\theta^{1/(1-\alpha)}\right],\qquad\theta\to\infty,\tag{8}$$

where $c_1(\alpha)$ and $c_2(\alpha)$ are positive constants that depend only on $\alpha$. Moreover,

$$\mathbb{E}\,\Theta^{k}=\rho^{k}\,\frac{\Gamma(1+k)}{\Gamma(1+\alpha k)},\qquad\phi_\Theta(\beta)=\mathbb{E}\,e^{-\beta\Theta}=E_\alpha(-\beta\rho),\qquad\int_{0}^{\infty}e^{-\beta\rho}\,\mathrm{IS}_{\alpha,\rho}(\theta)\,d\rho=\beta^{\alpha-1}e^{-\theta\beta^{\alpha}},\tag{9}$$

where $E_\alpha(w)=\sum_{j=0}^{\infty}w^{j}/\Gamma(\alpha j+1)$ is the Mittag-Leffler function (Bingham, 1971). Random variates from (6) can be generated using the structural representation

$$\Theta\stackrel{d}{=}\rho\,S^{-\alpha},\qquad S\sim g_\alpha.\tag{10}$$

The random variable $S$ can be generated using the following formula (Kanter, 1975; Chambers et al., 1976):

$$S\stackrel{d}{=}\frac{\sin(\alpha\pi U_1)\,\left[\sin\!\left((1-\alpha)\pi U_1\right)\right]^{1/\alpha-1}}{\left[\sin(\pi U_1)\right]^{1/\alpha}\,\left|\ln U_2\right|^{1/\alpha-1}},\tag{11}$$

where $U_1$ and $U_2$ are independently and uniformly distributed in $(0,1)$.
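The representations (10)-(11) translate directly into a sampler. Below is a minimal sketch (function names are ours); the moment formula in (9), $\mathbb{E}\,\Theta=\rho/\Gamma(1+\alpha)$, provides a convenient correctness check.

```python
import math
import random

def sample_positive_stable(alpha, rng=random):
    """Kanter's representation (11) of a positive alpha-stable variate S
    with Laplace transform exp(-s**alpha), 0 < alpha < 1."""
    u, v = rng.random(), rng.random()
    return (math.sin(alpha * math.pi * u)
            * math.sin((1 - alpha) * math.pi * u) ** (1 / alpha - 1)
            / (math.sin(math.pi * u) ** (1 / alpha)
               * abs(math.log(v)) ** (1 / alpha - 1)))

def sample_inverse_stable(alpha, rho=1.0, rng=random):
    """Structural representation (10): Theta = rho * S**(-alpha)."""
    return rho * sample_positive_stable(alpha, rng) ** (-alpha)
```

For example, with $\alpha=1/2$ and $\rho=1$, the sample mean of many draws should approach $1/\Gamma(3/2)\approx 1.128$, and the second moment should approach $\Gamma(3)/\Gamma(2)=2$.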

Figure 1 below reveals some members of the prior family as a function of the hyperparameter $\alpha$. Apparently, $\alpha$ controls the shape of $\mathrm{IS}_{\alpha,\rho}$. For smaller values of $\alpha$, $\mathrm{IS}_{\alpha,\rho}$ assigns finite mass near the origin, and the density becomes increasingly concentrated around $\rho$ as $\alpha\to 1^{-}$. Clearly, the class of priors allows considerable flexibility in capturing different prior beliefs.

## 3 Single parameter formulation

Proposition 1. Let $x=(x_1,\ldots,x_n)$ be a random sample from a population belonging to an exponential family of distributions with the likelihood kernel

$$L(\theta\mid x)\propto e^{-a\theta+b\log\theta},\tag{12}$$

where $a$ and $b$ are functions of the data (see Table 1). Using the non-conjugate inverse stable density as a prior for $\theta$ yields the proper or normalized posterior distribution

$$p(\theta\mid x)=\frac{e^{-a\theta+b\log\theta}\,\mathrm{IS}_{\alpha,\rho}(\theta)}{\Gamma(b+1)\,\rho^{b}\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho)},\tag{13}$$

where $\alpha\in(0,1)$ and $\rho>0$ are the hyperparameters,

$$E^{\tau}_{\eta,\nu}(w)=\sum_{j=0}^{\infty}\frac{(\tau)_j}{j!\,\Gamma(\eta j+\nu)}\,w^{j}\tag{14}$$

is the generalized Mittag-Leffler function (Prabhakar, 1971), and $(\tau)_j=\tau(\tau+1)\cdots(\tau+j-1)$ is the Pochhammer symbol.

Proof. See Appendix.

For the Poisson model in Section 1, as $\theta\to 0^{+}$ with $\sum_{j=1}^{n}x_j=0$ (so $b=0$), the posterior (13) tends to the finite limit $\mathrm{IS}_{\alpha,\rho}(0^{+})/E_\alpha(-a\rho)$, i.e., the limit exists. For both the gamma-gamma and inverse-gamma-gamma models above, $b>0$, so the posterior (13) tends to zero as $\theta\to 0^{+}$.

Corollary 1. The marginal density of the data $x$ given the hyperparameters $\alpha$ and $\rho$ is

$$p(x\mid\alpha,\rho)\propto\Gamma(b+1)\,\rho^{b}\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho).\tag{15}$$

Proof. It is trivial and is omitted. ∎

The generalized Mittag-Leffler function in (15) above is absolutely convergent for all $w$ (see Shukla and Prajapati, 2007). If $\tau=1$, then we obtain the two-parameter Mittag-Leffler function $E_{\eta,\nu}(w)$.
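For numerical work with (13)-(17), the series (14) can be evaluated by direct truncation. Below is a sketch (the truncation rule is ours), adequate for moderate arguments; the identity $E^{1}_{1,1}(w)=e^{w}$ gives a quick correctness check.

```python
import math

def gen_mittag_leffler(tau, eta, nu, w, tol=1e-12, max_terms=150):
    """Truncated series (14) for the generalized Mittag-Leffler function
    E^tau_{eta,nu}(w) = sum_j (tau)_j w^j / (j! Gamma(eta*j + nu)).
    Suitable for moderate |w|; large arguments need asymptotic methods."""
    total = 0.0
    poch = 1.0       # Pochhammer (tau)_j, starting at (tau)_0 = 1
    fact = 1.0       # j!, starting at 0! = 1
    power = 1.0      # w^j
    for j in range(max_terms):
        term = poch * power / (fact * math.gamma(eta * j + nu))
        total += term
        if j > 0 and abs(term) < tol:
            break
        poch *= tau + j
        fact *= j + 1
        power *= w
    return total
```

For example, the posterior normalizing constant in (13) is $\Gamma(b+1)\rho^{b}\cdot$`gen_mittag_leffler(b + 1, alpha, alpha*b + 1, -a*rho)`.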

Corollary 2. The $k$th posterior moment is

$$\mathbb{E}\!\left(\Theta^{k}\mid x\right)=\frac{\Gamma(k+b+1)\,\rho^{k}\,E^{k+b+1}_{\alpha,\alpha(k+b)+1}(-a\rho)}{\Gamma(b+1)\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho)}.\tag{16}$$

Proof. It is trivial and thus is not provided. ∎

Setting $k=0$ in (16) gives one, which checks that the formula in (13) is a valid probability density function.

Corollary 3. The moment generating function is straightforward to calculate as

$$\mathbb{E}\!\left(e^{\beta\Theta}\mid x\right)=\frac{E^{b+1}_{\alpha,\alpha b+1}(-(a-\beta)\rho)}{E^{b+1}_{\alpha,\alpha b+1}(-a\rho)}.\tag{17}$$

Proof. The proof follows from the property of the inverse stable distribution and is trivial.∎

Corollary 4. As $\alpha\to 1^{-}$,

$$p(\theta\mid x)\longrightarrow\frac{e^{-a\theta}\,\theta^{b}\,\delta(\theta-\rho)}{e^{-a\rho}\,\rho^{b}}\tag{18}$$

because $\mathrm{IS}_{\alpha,\rho}(\theta)\to\delta(\theta-\rho)$.

Proof. The proof follows from the property of the inverse stable density and is omitted. ∎

The preceding corollary suggests that the Bayes estimator $\mathbb{E}(\Theta\mid x)\to\rho$ as $\alpha\to 1^{-}$. It also indicates how $\alpha$ controls the shape and/or variability of the posterior distribution, from a non-degenerate family ($\alpha<1$) to a degenerate one ($\alpha\to 1^{-}$). This shows the relevance of $\rho$ as a prior guess of $\theta$ when $\alpha$ is close to one.

Table 1 below shows some likelihood models with their corresponding parameterizations following the above formulation and notation. The inverse stable density can also be a prior for a rate or an inverse scale parameter of two-parameter models like the inverse-Weibull, lognormal and normal-inverse densities conditional on the other parameter (see e.g. Miller, 1980; Kundu, 2008).

## 4 Estimation algorithms

### 4.1 Monte Carlo integration

From (88) of Appendix A, we have

$$E^{\omega+1}_{\alpha,\alpha\omega+1}(-a\rho)=\frac{1}{\rho^{\omega}\,\Gamma(\omega+1)}\int_{\mathbb{R}^{+}}e^{-ay}\,y^{\omega}\,\mathrm{IS}_{\alpha,\rho}(y)\,dy\approx\frac{\sum_{j=1}^{J}e^{-aY_j}\,Y_j^{\omega}}{J\,\rho^{\omega}\,\Gamma(\omega+1)},\tag{19}$$

where $Y_1,\ldots,Y_J\sim\mathrm{IS}_{\alpha,\rho}$. Thus, the $k$th moment can be estimated using the following algorithm.

Algorithm 1

Step 1. Generate $Y_1,\ldots,Y_J\sim\mathrm{IS}_{\alpha,\rho}$.

Step 2. Let

$$V_{\alpha,\rho,k}=\frac{\sum_{j=1}^{J}e^{-aY_j}\,Y_j^{\,b+k}}{J\,\rho^{\,b+k}\,\Gamma(k+b+1)}.\tag{20}$$

Step 3. Set

$$\hat{\mathbb{E}}\!\left(\Theta^{k}\mid x\right)=\frac{\Gamma(b+k+1)\,\rho^{k}\,V_{\alpha,\rho,k}}{\Gamma(b+1)\,V_{\alpha,\rho,0}}.\tag{21}$$

In particular, the Bayes estimator of $\theta$ with respect to the squared error loss criterion can be directly computed as

$$\hat{\theta}_1=\frac{(b+1)\,\rho\,V_{\alpha,\rho,1}}{V_{\alpha,\rho,0}}\tag{22}$$

while the posterior variance can be estimated as

$$\widehat{\mathrm{Var}}(\Theta\mid x)=\frac{(b+1)\,\rho^{2}}{V_{\alpha,\rho,0}}\left[(b+2)\,V_{\alpha,\rho,2}-\frac{(b+1)\left(V_{\alpha,\rho,1}\right)^{2}}{V_{\alpha,\rho,0}}\right].\tag{23}$$
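Algorithm 1 can be sketched as follows (the sampler repeats Kanter's formula (11) from Section 2; variable names are ours). As a sanity check, with $a=b=0$ the posterior reduces to the prior, whose mean is $\rho/\Gamma(1+\alpha)$ by (9).

```python
import math
import random

def sample_inverse_stable(alpha, rho=1.0, rng=random):
    # Kanter's representation (11): positive stable S, then Theta = rho * S**(-alpha).
    u, v = rng.random(), rng.random()
    s = (math.sin(alpha * math.pi * u)
         * math.sin((1 - alpha) * math.pi * u) ** (1 / alpha - 1)
         / (math.sin(math.pi * u) ** (1 / alpha)
            * abs(math.log(v)) ** (1 / alpha - 1)))
    return rho * s ** (-alpha)

def algorithm1_posterior_moment(k, a, b, alpha, rho, J=100000, rng=random):
    """Monte Carlo estimate (21) of E(Theta^k | x) for the likelihood
    kernel exp(-a*theta + b*log(theta))."""
    ys = [sample_inverse_stable(alpha, rho, rng) for _ in range(J)]

    def V(kk):  # V_{alpha,rho,kk} of (20)
        return sum(math.exp(-a * y) * y ** (b + kk) for y in ys) / (
            J * rho ** (b + kk) * math.gamma(kk + b + 1))

    return math.gamma(b + k + 1) * rho ** k * V(k) / (math.gamma(b + 1) * V(0))
```

The Bayes estimate (22) is then `algorithm1_posterior_moment(1, a, b, alpha, rho)`.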

### 4.2 Posterior simulation

Let the target density be the posterior $p(\theta\mid x)$ and the candidate density be $\mathrm{IS}_{\alpha,\rho}(\theta)$, where $a$ and $b$ are both known. Then

$$\frac{p(\theta\mid x)}{\mathrm{IS}_{\alpha,\rho}(\theta)}=\frac{e^{-a\theta}\,\theta^{b}}{\Gamma(b+1)\,\rho^{b}\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho)}.\tag{24}$$

The preceding expression is maximized at $\theta=b/a$, and

$$\frac{e^{-a\theta}\,\theta^{b}}{\Gamma(b+1)\,\rho^{b}\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho)}\le\frac{(b/a)^{b}\,e^{-b}}{\Gamma(b+1)\,\rho^{b}\,E^{b+1}_{\alpha,\alpha b+1}(-a\rho)}=:M.\tag{25}$$

Below is an accept-reject algorithm for generating observations from the posterior distribution (13).

Algorithm 2

Step 1. Generate $\theta\sim\mathrm{IS}_{\alpha,\rho}$ and $u\sim\mathrm{Uniform}(0,1)$.

Step 2. Accept $\theta$ if

$$u<\frac{p(\theta\mid x)}{M\,\mathrm{IS}_{\alpha,\rho}(\theta)}=\frac{e^{-a\theta}\,\theta^{b}}{(b/a)^{b}\,e^{-b}},\tag{26}$$

or equivalently, if $\ln u<b-a\theta+b\ln(a\theta/b)$; otherwise, go back to Step 1.

Step 3. Repeat Steps 1-2 until $N$ values $\theta_1,\ldots,\theta_N$ are accepted.

The acceptance ratio in Step 2 equals one if and only if $\theta=b/a$. Thus, we expect high acceptance rates (close to 1) of Algorithm 2 when the candidate draws concentrate near $b/a$, e.g., when $\alpha$ is close to one and $\rho\approx b/a$.
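Algorithm 2 in code (a sketch; names are ours), with the acceptance test applied on the log scale, $\ln u<b-a\theta+b\ln(a\theta/b)$, to avoid underflow:

```python
import math
import random

def sample_inverse_stable(alpha, rho=1.0, rng=random):
    # Kanter's representation (11): positive stable S, then Theta = rho * S**(-alpha).
    u, v = rng.random(), rng.random()
    s = (math.sin(alpha * math.pi * u)
         * math.sin((1 - alpha) * math.pi * u) ** (1 / alpha - 1)
         / (math.sin(math.pi * u) ** (1 / alpha)
            * abs(math.log(v)) ** (1 / alpha - 1)))
    return rho * s ** (-alpha)

def algorithm2_posterior_sample(N, a, b, alpha, rho, rng=random):
    """Accept-reject draws from posterior (13) with IS_{alpha,rho} candidate.
    The envelope is the maximum of exp(-a*t)*t**b, attained at t = b/a (b > 0)."""
    log_m = b * math.log(b / a) - b if b > 0 else 0.0
    draws = []
    while len(draws) < N:
        t = sample_inverse_stable(alpha, rho, rng)
        logratio = -a * t + (b * math.log(t) if b > 0 else 0.0) - log_m
        if math.log(1.0 - rng.random()) < logratio:  # 1-u lies in (0, 1]
            draws.append(t)
    return draws
```

As a consistency check, the accept-reject sample mean should agree with a self-normalized importance-sampling estimate that weights prior draws by $e^{-a\theta}\theta^{b}$.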

We generated 500 observations from the different posterior distributions in 1000 replications, based on several observed data sizes, to test both algorithms above. We calculated the mean point estimate, the median absolute deviation (MAD), and the maximum likelihood estimates (MLE); the results are in Table 4 of Appendix B. Both algorithms are generally more precise (with smaller MAD) than the MLE, and Algorithm 2 tends to be the most precise (with the smallest MAD) for small values of $\alpha$ and the sample size. The point estimates from both algorithms are very close but can be considerably different from the MLE even for relatively large data sizes. It is observed that both bias and variability decrease as the sample size increases in all cases, as expected.

## 5 Predictive distributions

Consider the exponential data model

$$f(x\mid\theta)=c\cdot e^{-a\theta+b\log\theta},\tag{27}$$

where $c$ is a normalizing constant or function free of $\theta$. The prior predictive distribution of a conditionally independent new observation $x^{*}$ is

$$p(x^{*})=\int_{\theta}f(x^{*}\mid\theta)\,\mathrm{IS}_{\alpha,\rho}(\theta)\,d\theta\propto c^{*}\,\Gamma(b^{*}+1)\,\rho^{\,b^{*}}E^{b^{*}+1}_{\alpha,\alpha b^{*}+1}(-a^{*}\rho),\tag{28}$$

where $a^{*}$, $b^{*}$, and $c^{*}$ are the functions $a$, $b$, and $c$ evaluated at the single new observation $x^{*}$, with $c^{*}$ an appropriate normalizing constant or function.

where , , and is an appropriate normalizing constant or function. We now propose a class of prior predictive densities for a family of transformations that provides a way of simulating by generating or inverting , and hence allows testing of prior beliefs. Note that these predictive distributions can be easily plotted using the maximum likelihood estimates of the hyperparameters and the Monte Carlo approximation (19).

Proposition 2. Let $b>0$ be free of $x^{*}$ (see Table 1) and $a^{*}=a(x^{*})$ be a one-to-one and invertible mapping. Then the prior predictive density of $A^{*}=a(X^{*})$ is

$$p(a^{*})=b\,\rho\,\Gamma(1+\alpha(b-1))\,E^{b+1}_{\alpha,\alpha b+1}(-a^{*}\rho).\tag{29}$$

Proof. The proof follows from the well-known Mellin transform of the generalized Mittag-Leffler function (Shukla and Prajapati, 2007) below, valid for $0<z<\tau$:

$$\int_{\mathbb{R}^{+}}w^{z-1}\,E^{\tau}_{\eta,\nu}(-\rho w)\,dw=\frac{\Gamma(z)\,\Gamma(\tau-z)}{\Gamma(\tau)\,\rho^{z}\,\Gamma(\nu-z\eta)}.\tag{30}$$

The above family of proper density functions can easily be plotted using the approximation in (19).

Corollary 5. The $k$th moment can be calculated using formula (30) as

$$\mathbb{E}\,(A^{*})^{k}=\frac{\Gamma(\alpha b+1-\alpha)\,\Gamma(k+1)\,\Gamma(b-k)}{\Gamma(b)\,\rho^{k}\,\Gamma(\alpha b+1-(k+1)\alpha)}.\tag{31}$$

Proof. It is trivial and is omitted. ∎

Formula (29) can be checked to be proper by setting $k=0$ in the preceding corollary. The $k$th moment exists provided $k<b$.

Example 1. Let $X^{*}\mid\theta\sim\theta e^{-x^{*}\theta}$ (exponential), so that $a^{*}=x^{*}$ and $b=1$. Then,

$$p(a^{*})=\rho\,E^{2}_{\alpha,\alpha+1}(-a^{*}\rho);\qquad\mathbb{E}\,(A^{*})^{k}=\mathbb{E}\,(X^{*})^{k}=\frac{\Gamma(k+1)\,\Gamma(1-k)}{\Gamma(1-\alpha k)\,\rho^{k}},\quad k\notin\mathbb{N}^{+}.\tag{32}$$

Example 2. Let $X^{*}\mid\theta\sim N(0,1/(2\theta))$, so that $a^{*}=(x^{*})^{2}$ and $b=1/2$. Then,

$$p(a^{*})=\frac{1}{2}\left(\frac{\rho}{a^{*}}\right)^{1/2}E^{3/2}_{\alpha,\alpha/2+1}(-a^{*}\rho);\qquad\mathbb{E}\,(A^{*})^{k}=\mathbb{E}\,(X^{*})^{2k}=\frac{\Gamma(1-k)\,\Gamma(k+1/2)}{\sqrt{\pi}\,\Gamma(1-\alpha k)\,\rho^{k}},\quad k\notin\mathbb{N}^{+}.\tag{33}$$

Setting $k=0$ shows that the preceding $p(a^{*})$'s are proper prior predictive density functions.

After observing the data $x$, the posterior predictive distribution of a conditionally independent new observation is

$$p(a^{*}\mid x)=\int_{\theta}f(a^{*}\mid\theta)\,p(\theta\mid x)\,d\theta\propto c^{*}\rho^{\,b^{*}}E^{b'+1}_{\alpha,\alpha b'+1}(-(a+a^{*})\rho),\qquad b'=b^{*}+b,\tag{34}$$

where $a^{*}$ and $b^{*}$ correspond to the new observation and $c^{*}$ is some normalizing function or constant.

Example 3. Let $X^{*}\mid\theta\sim\theta e^{-x^{*}\theta}$ with $n$ observed data points, so that $b'=n+1$. Then,

$$p(a^{*}\mid x)\propto\rho\,E^{n+2}_{\alpha,\alpha(n+1)+1}(-(a+a^{*})\rho).\tag{35}$$

Example 4. Let $X^{*}\mid\theta\sim N(0,1/(2\theta))$, so that $b'=(n+1)/2$. Then,

$$p(a^{*}\mid x)\propto\rho^{1/2}\,E^{(n+1)/2+1}_{\alpha,\alpha(n+1)/2+1}(-(a+a^{*})\rho).\tag{36}$$

The above posterior predictive distributions can directly be plotted using (19) also.

## 6 Hierarchical models

Recall that $p(\lambda)=\int p(\lambda\mid\theta)\,\mathrm{IS}_{\alpha,\rho}(\theta)\,d\theta$, where the marginal prior follows from Corollary 1, with the inverse stable density as the mixing distribution. The marginal prior family can be plotted using the Monte Carlo algorithm above and can be made proper using the Mellin transform property (30). Below are some applications of the inverse stable prior in hierarchical contexts.

### Normal-normal model

Let $\theta$ be the inverse variance (precision) and consider

$$\bar{X}\mid\lambda\sim N(\lambda,1/n),\qquad\Lambda\mid\theta\sim N(0,1/\theta),\qquad\Theta\sim\mathrm{IS}_{\alpha,\rho}(\theta),\qquad\alpha,\rho\ \text{known}.\tag{37}$$

The marginal prior of $\lambda$ can be shown to be

$$p(\lambda)\propto\rho^{1/2}\,E^{3/2}_{\alpha,\alpha/2+1}(-\lambda^{2}\rho/2).\tag{38}$$

Also, the hyperparameters $\alpha$ and $\rho$ can be estimated by maximizing the marginal likelihood

$$p(\text{data}\mid\alpha,\rho)\propto\mathbb{E}_{\Theta}\!\left[\left(\frac{2\pi(n+\Theta)}{n\Theta}\right)^{-1/2}\exp\!\left(-\frac{n\Theta\,\bar{x}^{2}}{2(n+\Theta)}\right)\right].\tag{39}$$

The Gibbs sampler for this model can be performed as

$$\Lambda\mid\text{data},\theta\sim N\!\left(\frac{\bar{x}}{1+\theta/n},\,(n+\theta)^{-1}\right),\tag{40}$$

and

$$\Theta\mid\lambda,\text{data}\propto\theta^{1/2}\exp\!\left(-\frac{\lambda^{2}\theta}{2}\right)\mathrm{IS}_{\alpha,\rho}(\theta).\tag{41}$$
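The two conditional draws (40)-(41) combine into the following Gibbs sketch (names and the synthetic inputs are ours). The $\Theta$-step uses accept-reject against the $\mathrm{IS}_{\alpha,\rho}$ candidate, with envelope $\theta^{1/2}e^{-\theta\lambda^{2}/2}\le|\lambda|^{-1}e^{-1/2}$ attained at $\theta=1/\lambda^{2}$ (adequate when $|\lambda|$ is not too small).

```python
import math
import random

def sample_inverse_stable(alpha, rho=1.0, rng=random):
    # Kanter's representation (11).
    u, v = rng.random(), rng.random()
    s = (math.sin(alpha * math.pi * u)
         * math.sin((1 - alpha) * math.pi * u) ** (1 / alpha - 1)
         / (math.sin(math.pi * u) ** (1 / alpha)
            * abs(math.log(v)) ** (1 / alpha - 1)))
    return rho * s ** (-alpha)

def gibbs_normal_normal(xbar, n, alpha, rho, iters=3000, rng=random):
    """Gibbs sampler (40)-(41): Gaussian lambda-step, accept-reject theta-step."""
    lam, draws = xbar, []
    for _ in range(iters):
        # (41): Theta | lambda ~ theta^(1/2) exp(-theta*lambda^2/2) IS_{alpha,rho}(theta)
        log_m = -math.log(abs(lam)) - 0.5  # log envelope max, at theta = 1/lambda^2
        while True:
            t = sample_inverse_stable(alpha, rho, rng)
            if math.log(1.0 - rng.random()) < 0.5 * math.log(t) - t * lam * lam / 2 - log_m:
                theta = t
                break
        # (40): Lambda | data, theta ~ N(xbar/(1 + theta/n), 1/(n + theta))
        lam = rng.gauss(xbar / (1 + theta / n), (n + theta) ** -0.5)
        draws.append((lam, theta))
    return draws
```

With a positive $\bar{x}$, the chain's $\Lambda$ draws should average slightly below $\bar{x}$, reflecting the shrinkage toward the prior mean zero.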

### Poisson-exponential model

Let

$$\text{data}\mid\lambda\sim\mathrm{Poisson}(\lambda),\qquad\Lambda\mid\theta\sim\theta e^{-\lambda\theta},\qquad\Theta\sim\mathrm{IS}_{\alpha,\rho}(\theta),\qquad\alpha,\rho\ \text{known}.\tag{42}$$

The marginal prior of $\lambda$ can be obtained as

$$p(\lambda)\propto\rho\,E^{2}_{\alpha,\alpha+1}(-\rho\lambda).\tag{43}$$

Moreover, maximizing the marginal likelihood

$$p(\text{data}\mid\alpha,\rho)\propto\mathbb{E}_{\Theta}\!\left[\frac{\Theta\,\Gamma\!\left(\sum_{j=1}^{n}x_j+1\right)}{(n+\Theta)^{\sum_{j=1}^{n}x_j+1}\,\prod_{j=1}^{n}x_j!}\right]\tag{44}$$

gives estimates of the hyperparameters $\alpha$ and $\rho$. The Gibbs sampler for this model can be accomplished as

$$\Lambda\mid\text{data},\theta\sim\mathrm{Gamma}\!\left(\sum_{j=1}^{n}x_j+1,\;n+\theta\right),\tag{45}$$

and

$$\Theta\mid\lambda,\text{data}\propto\theta\,e^{-\lambda\theta}\,\mathrm{IS}_{\alpha,\rho}(\theta).\tag{46}$$

### Poisson-exponential with unequal intensity rates

Let

$$X_i\mid\lambda_i\sim\mathrm{Poisson}(\lambda_i),\qquad\Lambda_i\mid\theta\sim\theta e^{-\lambda_i\theta},\qquad\Theta\sim\mathrm{IS}_{\alpha}(\theta)=\mathrm{IS}_{\alpha,1}(\theta),\qquad\alpha\ \text{known}.\tag{47}$$

The marginal prior of $\lambda=(\lambda_1,\lambda_2,\ldots,\lambda_n)'$ can be calculated as

$$p(\lambda)\propto E^{n+1}_{\alpha,\alpha n+1}\!\left(-\sum_{i=1}^{n}\lambda_i\right).\tag{48}$$

Then

$$\Lambda_i\mid\text{data},\theta\sim\mathrm{Gamma}(x_i+1,\,1+\theta),\tag{49}$$

and

$$\mathbb{E}(\Lambda_i\mid\text{data},\theta)=\frac{x_i+1}{1+\theta}=(1-\kappa)(x_i+1),\tag{50}$$

where $\kappa=\theta/(1+\theta)$. Hence,

$$\mathbb{E}(\Lambda_i\mid\text{data})=(x_i+1)\,\mathbb{E}_{\Theta\mid\text{data}}\!\left[(1+\theta)^{-1}\right],\tag{51}$$

where

$$\Theta\mid\text{data}\sim K\left(\frac{1}{1+\theta}\right)^{\sum_{j=1}^{n}x_j}\left(1-\frac{1}{1+\theta}\right)^{n}\mathrm{IS}_{\alpha}(\theta)=K\,(1-\kappa)^{\sum_{j=1}^{n}x_j}\,\kappa^{n}\,\mathrm{IS}_{\alpha}(\theta).\tag{52}$$

The normalizing constant $K$ can be straightforwardly calculated using Monte Carlo integration. Therefore,

$$\mathbb{E}(\Lambda_i\mid\text{data})=(x_i+1)\,\mathbb{E}_{\Theta}\!\left[K\,(1-\kappa)^{\sum_{j=1}^{n}x_j+1}\,\kappa^{n}\right].$$

### Global shrinkage in normal-normal model

Let $\theta$ be the inverse variance. Consider

$$X_i\mid\lambda_i\sim N(\lambda_i,1),\qquad\Lambda_i\mid\theta\sim N(0,1/\theta),\qquad\Theta\sim\mathrm{IS}_{\alpha}(\theta),\qquad\alpha\ \text{known}.\tag{53}$$

The marginal prior for $\lambda=(\lambda_1,\ldots,\lambda_n)'$ is

$$p(\lambda)\propto E^{n/2+1}_{\alpha,\alpha n/2+1}\!\left(-\|\lambda\|_2^{2}/2\right).\tag{54}$$

Then

$$\Lambda_i\mid\text{data},\theta\sim N\!\left(\frac{x_i}{1+\theta},\,\frac{1}{1+\theta}\right),\tag{55}$$

and

$$\mathbb{E}(\Lambda_i\mid\text{data},\theta)=\frac{x_i}{1+\theta}=(1-\kappa)\,x_i.\tag{56}$$

Hence,

$$\mathbb{E}(\Lambda_i\mid\text{data})=x_i\cdot\mathbb{E}_{\Theta\mid\text{data}}\!\left[(1+\theta)^{-1}\right],\tag{57}$$

where

$$\Theta\mid\text{data}\sim K\left(\frac{1}{1+\theta}\right)^{n/2}\exp\!\left(-\frac{\theta\,\|x\|_2^{2}}{2(1+\theta)}\right)\mathrm{IS}_{\alpha}(\theta)\tag{58}$$
$$=K\,(1-\kappa)^{n/2}\exp\!\left(-\kappa\,\|x\|_2^{2}/2\right)\mathrm{IS}_{\alpha}(\theta).\tag{59}$$

Thus,

$$\mathbb{E}(\Lambda_i\mid\text{data})=x_i\cdot\mathbb{E}_{\Theta}\!\left[K\,(1-\kappa)^{n/2+1}\exp\!\left(-\kappa\,\|x\|_2^{2}/2\right)\right].$$
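The shrinkage weight $\mathbb{E}_{\Theta\mid\text{data}}[(1+\theta)^{-1}]$ in (57) can be computed by self-normalized importance sampling with prior $\mathrm{IS}_{\alpha}$ draws, which absorbs the normalizing constant $K$. A minimal sketch (names are ours):

```python
import math
import random

def sample_inverse_stable(alpha, rho=1.0, rng=random):
    # Kanter's representation (11).
    u, v = rng.random(), rng.random()
    s = (math.sin(alpha * math.pi * u)
         * math.sin((1 - alpha) * math.pi * u) ** (1 / alpha - 1)
         / (math.sin(math.pi * u) ** (1 / alpha)
            * abs(math.log(v)) ** (1 / alpha - 1)))
    return rho * s ** (-alpha)

def global_shrinkage_estimates(x, alpha, J=100000, rng=random):
    """E(Lambda_i | data) = x_i * E_{Theta|data}[1/(1+theta)], where the Theta
    posterior (58) is handled by weighting prior IS_alpha draws by
    (1+theta)^(-n/2) * exp(-theta*||x||^2 / (2*(1+theta)))."""
    n, ss = len(x), sum(v * v for v in x)
    thetas = [sample_inverse_stable(alpha, 1.0, rng) for _ in range(J)]
    ws = [(1 + t) ** (-n / 2) * math.exp(-t * ss / (2 * (1 + t))) for t in thetas]
    # self-normalized estimate of E[(1+theta)^-1 | data]
    shrink = sum(w / (1 + t) for w, t in zip(ws, thetas)) / sum(ws)
    return shrink, [shrink * v for v in x]
```

Since $(1+\theta)^{-1}\in(0,1)$, the estimated weight always lies strictly between 0 and 1, so every $x_i$ is pulled toward zero.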

### Local shrinkage in normal-normal model

Let $\theta_i$ be the local inverse variance, i.e., $\Lambda_i\mid\theta_i\sim N(0,1/\theta_i)$ and $\Theta_i\sim\mathrm{IS}_{\alpha}(\theta_i)$ independently for each $i$ in the model above. Then the marginal prior for $\lambda_i$ is

$$p(\lambda_i)\propto E^{3/2}_{\alpha,\alpha/2+1}\!\left(-\lambda_i^{2}/2\right).\tag{60}$$

Furthermore,

$$\Lambda_i\mid x_i,\theta_i\sim N\!\left(\frac{x_i}{1+\theta_i},\,\frac{1}{1+\theta_i}\right),\tag{61}$$

and

$$\mathbb{E}(\Lambda_i\mid x_i,\theta_i)=\frac{x_i}{1+\theta_i}=(1-\kappa_i)\,x_i,\qquad\kappa_i=\frac{\theta_i}{1+\theta_i}.\tag{62}$$

Therefore,

$$\mathbb{E}(\Lambda_i\mid x_i)=x_i\cdot\mathbb{E}_{\Theta_i\mid x_i}\!\left[(1+\theta_i)^{-1}\right],\tag{63}$$

where

$$\Theta_i\mid x_i\sim K\left(\frac{1}{1+\theta_i}\right)^{1/2}\exp\!\left(-\frac{\theta_i\,x_i^{2}}{2(1+\theta_i)}\right)\mathrm{IS}_{\alpha}(\theta_i)\tag{64}$$
$$=K\,(1-\kappa_i)^{1/2}\exp\!\left(-\kappa_i\,x_i^{2}/2\right)\mathrm{IS}_{\alpha}(\theta_i).\tag{65}$$

Hence,

$$\mathbb{E}(\Lambda_i\mid x_i)=x_i\cdot\mathbb{E}_{\Theta_i}\!\left[K\,(1-\kappa_i)^{3/2}\exp\!\left(-\kappa_i\,x_i^{2}/2\right)\right].$$