Trace class Markov chains for the Normal-Gamma Bayesian shrinkage model
Abstract
High-dimensional data, where the number of variables exceeds or is comparable to the sample size, is now pervasive in many scientific applications. In recent years, Bayesian shrinkage models have been developed as effective and computationally feasible tools for analyzing such data, especially in the context of linear regression. In this paper, we focus on the Normal-Gamma shrinkage model developed by Griffin and Brown [5]. This model subsumes the popular Bayesian lasso model, and a three-block Gibbs sampling algorithm to sample from the resulting intractable posterior distribution has been developed in [5]. We consider an alternative two-block Gibbs sampling algorithm, and rigorously demonstrate its advantage over the three-block sampler by comparing specific spectral properties. In particular, we show that the Markov operator corresponding to the two-block sampler is trace class (and hence Hilbert-Schmidt), whereas the operator corresponding to the three-block sampler is not even Hilbert-Schmidt. The trace class property for the two-block sampler implies geometric convergence for the associated Markov chain, which justifies the use of Markov chain CLTs to obtain practical error bounds for MCMC-based estimates. Additionally, it facilitates theoretical comparisons of the two-block sampler with sandwich algorithms, which aim to improve performance by inserting inexpensive extra steps between the two conditional draws of the two-block sampler.
Trace class Markov chains for the Normal-Gamma model
Kshitij Khare is Associate Professor, Department of Statistics, University of Florida. Liyuan Zhang is Graduate Student, Department of Statistics, University of Florida.
University of Florida
MSC subject classifications: Primary 60J05, 60J20; secondary 33C10.
Keywords: Data Augmentation, Markov chain Monte Carlo, Normal-Gamma model, Bayesian shrinkage, trace class operators.
1 Introduction
In recent years, the explosion of data, due to advances in science and information technology, has left almost no field untouched. The availability of high-throughput data from genomic, finance, environmental, marketing (among other) applications has created an urgent need for methodology and tools for analyzing high-dimensional data. In particular, consider the linear model $Y = X\beta + \sigma \epsilon$, where $Y$ is an $n \times 1$ real valued response variable, $X$ is a known $n \times p$ matrix, $\beta$ is an unknown $p \times 1$ vector of regression coefficients, $\sigma$ is an unknown scale parameter and the entries of $\epsilon$ are independent standard normals. In the high-dimensional datasets mentioned above, often $p > n$, and classical least squares methods fail. The lasso [22] was developed to provide sparse estimates of the regression coefficient vector in these sample-starved settings (several adaptations/alternatives have been proposed since then). It was observed in [22] that the lasso estimate is the posterior mode obtained when one puts i.i.d. Laplace priors on the elements of $\beta$ (conditional on $\sigma$). This observation has led to a flurry of recent research concerning the development of prior distributions for $\beta$ that yield posterior distributions with high (posterior) probability around sparse values of $\beta$, i.e., values of $\beta$ that have many entries equal to $0$. Such prior distributions are referred to as “continuous shrinkage priors” and the corresponding models are referred to as “Bayesian shrinkage models”. Bayesian shrinkage methods have gained popularity and have been extensively used in a variety of applications including ecology, finance, image processing and neuroscience (see, for example, [24, deloscampos2009, 3, jacquemin2014, xing2012, mishchenko2012, gu2013, 19, 20]).
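As a quick numerical illustration of the sample-starved regime (ours, not from the paper): when $p > n$, the Gram matrix $X^T X$ is rank deficient, so the classical least squares estimate $(X^T X)^{-1} X^T Y$ does not exist.

```python
import numpy as np

# Illustrative sketch: with p > n the p x p Gram matrix X^T X has rank at
# most n, so it is singular and ordinary least squares breaks down.
rng = np.random.default_rng(0)
n, p = 10, 50                       # sample-starved regime: p > n
X = rng.standard_normal((n, p))
gram = X.T @ X                      # p x p, rank at most n

print(np.linalg.matrix_rank(gram))  # rank 10 < 50: not invertible
```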
In this paper, we focus on the well-known Normal-Gamma shrinkage model introduced in Griffin and Brown [5]. The model is specified as follows:
(1) 
where denotes the multivariate normal density, and is a diagonal matrix with diagonal entries given by . Also, Inverse-Gamma and Gamma denote the Inverse-Gamma and Gamma densities with shape parameters and , and rate parameters and , respectively. The marginal density of given in the above model is given by
where is the modified Bessel function of the second kind. The popular Bayesian lasso model of Park and Casella [17] is a special case of the Normal-Gamma model above with , where the marginal density of simplifies to
In this case, the marginal density for each (given ) is the double exponential density. The Normal-Gamma family offers a wider choice of tail behavior (as decreases, the marginal distribution becomes more peaked at zero, but has heavier tails), and thereby a more flexible shrinkage mechanism.
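The Laplace special case can be checked numerically. A sketch (symbol names `lam`, `gam` are ours): the Normal-Gamma marginal of a single coefficient is proportional to $|b|^{\lambda - 1/2} K_{\lambda - 1/2}(|b|/\gamma)$, and since $K_{1/2}(z) = \sqrt{\pi/(2z)}\, e^{-z}$, taking $\lambda = 1$ should reduce this to the double exponential kernel $e^{-|b|/\gamma}$ up to a constant.

```python
import numpy as np
from scipy.special import kv

# Unnormalized Normal-Gamma marginal kernel for one coefficient,
# proportional to |b|^(lam - 1/2) * K_{lam - 1/2}(|b| / gam),
# where K_nu is the modified Bessel function of the second kind.
def unnormalized_marginal(b, lam, gam):
    z = np.abs(b) / gam
    return np.abs(b) ** (lam - 0.5) * kv(lam - 0.5, z)

gam = 2.0
b = np.linspace(0.1, 5.0, 50)
ratio = unnormalized_marginal(b, 1.0, gam) / np.exp(-b / gam)

# For lam = 1 the ratio is constant in b (equal to sqrt(pi * gam / 2)),
# confirming the double exponential (Laplace) special case.
print(np.allclose(ratio, np.sqrt(np.pi * gam / 2)))
```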
The posterior density of for the Normal-Gamma model is intractable in the sense that closed form computation or direct sampling is not feasible. Griffin and Brown [5] note that the full conditional densities of and are easy to sample from, and develop a three-block Gibbs sampling Markov chain to generate samples from the desired posterior density. This Markov chain, denoted by (on the state space ), is driven by the Markov transition density (Mtd)
(2) 
Here denotes the conditional density of the first group of arguments given the second group of arguments. The one-step dynamics of this Markov chain, to move from the current state, , to the next state, , can be described as follows:

Draw from .

Draw from .

Draw from .
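The three-block scan above can be sketched structurally as follows. This is a runnable skeleton only: the placeholder draws are ours (the model's actual full conditionals are given in Section 2), and only the blocking pattern is the point.

```python
import numpy as np

# Structural sketch of one three-block Gibbs scan: three conditional
# draws per iteration, each given the current values of the other blocks.
rng = np.random.default_rng(1)

def draw_sigma2(beta, tau):      # placeholder for the sigma^2 conditional
    return 1.0 / rng.gamma(shape=2.0, scale=1.0)

def draw_beta(sigma2, tau):      # placeholder for the beta conditional
    return rng.normal(0.0, np.sqrt(sigma2 * tau))

def draw_tau(beta, sigma2):      # placeholder for the tau conditional
    return rng.gamma(shape=1.0, scale=1.0, size=beta.shape)

p, n_iter = 3, 100
beta, tau = np.zeros(p), np.ones(p)
for _ in range(n_iter):
    sigma2 = draw_sigma2(beta, tau)   # block 1
    beta = draw_beta(sigma2, tau)     # block 2
    tau = draw_tau(beta, sigma2)      # block 3
print(beta.shape, sigma2 > 0)
```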
In [15], the authors show that the distribution of the Markov chain converges to the desired posterior distribution at a geometric rate (as the number of steps converges to $\infty$).
As mentioned previously, the Bayesian lasso Markov chain of [17] is a special case of the Normal-Gamma Markov chain when . In recent work [21], it was shown that a two-block version of the Bayesian lasso chain (and of a variety of other chains for Bayesian regression) can be developed. The authors in [21] then focus their theoretical investigations on the Bayesian lasso, and show that the two-block Bayesian lasso chain has a better behaved spectrum than the original three-block Bayesian lasso chain in the following sense: the Markov operator corresponding to the two-block Bayesian lasso chain is trace class (its eigenvalues are countable and summable, and hence in particular square-summable), while the Markov operator corresponding to the original three-block Bayesian lasso chain is not Hilbert-Schmidt (the corresponding absolute value operator either does not have a countable spectrum, or has a countable set of eigenvalues that are not square-summable).
Based on the method outlined in [21], a two-block version of the three-block Normal-Gamma Markov chain can be constructed as follows. The two-block Markov chain, denoted by (on the state space ), is driven by the Markov transition density (Mtd)
(3) 
The one-step dynamics of this Markov chain, to move from the current state, , to the next state, , can be described as follows:

Draw from .

Draw from and draw from .
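For contrast with the three-block scan, the two-block version can be sketched as follows (again a runnable skeleton with placeholder draws that are ours, not the model's actual conditionals): the second block draws $\sigma^2$ given $\tau$ alone, with $\beta$ integrated out, and then draws $\beta$ given $(\sigma^2, \tau)$.

```python
import numpy as np

# Structural sketch of one two-block Gibbs scan: tau is one block, and
# (sigma^2, beta) are drawn together as the second block.
rng = np.random.default_rng(2)
p = 3

def draw_tau(beta, sigma2):          # placeholder for the tau conditional
    return rng.gamma(1.0, 1.0, size=p)

def draw_sigma2_given_tau(tau):      # placeholder: sigma^2 | tau only
    return 1.0 / rng.gamma(2.0, 1.0)

def draw_beta(sigma2, tau):          # placeholder: beta | sigma^2, tau
    return rng.normal(0.0, np.sqrt(sigma2 * tau))

beta, sigma2 = np.zeros(p), 1.0
for _ in range(100):
    tau = draw_tau(beta, sigma2)             # block 1
    sigma2 = draw_sigma2_given_tau(tau)      # block 2, part (a)
    beta = draw_beta(sigma2, tau)            # block 2, part (b)
print(beta.shape)
```

The key structural difference from the three-block sampler is that $\sigma^2$ is not conditioned on $\beta$, which is what makes the chain a genuine two-block Data Augmentation algorithm.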
The goal of this paper is to investigate whether the theoretical results for the Bayesian lasso in [21] hold in the more general and complex setting of the Normal-Gamma model. In particular, we establish that the Markov operator corresponding to the two-block chain is trace class when (Theorem 1). On the other hand, the Markov operator corresponding to the three-block chain is not Hilbert-Schmidt for any value of (Theorem 2). These results hold for all values of the sample size and the number of independent variables . Since the Bayesian lasso is a special case with , our results subsume the spectral results in [21]. We note that establishing the results in the Normal-Gamma setup is much harder than in the Bayesian lasso setting. This is in part because the modified Bessel function does not in general have a closed form when , and a heavy dose of various identities and bounds for these Bessel functions is needed to analyze the appropriate integrals in the proofs of Theorem 1 and Theorem 2.
We now discuss further some of the implications of establishing the trace class property for . Note that the Markov chain arises from a two-block Data Augmentation (DA) algorithm, with as the parameter block of interest and as the augmented parameter block. Hence the corresponding Markov operator, denoted by , is a positive, self-adjoint operator (see [6]). Establishing that a positive self-adjoint operator is trace class implies that it has a discrete spectrum, and that its (countably many, nonnegative) eigenvalues are summable. The trace class property implies compactness, which further implies geometric ergodicity of the underlying Markov chain (see [16, Section 2], for example). Geometric ergodicity, in turn, facilitates the use of Markov chain central limit theorems to provide error bounds for Markov chain based estimates of relevant posterior expectations. The DA interpretation of also enables us to use the Haar PX-DA technique from [6] and construct a “sandwich” Markov chain by adding an inexpensive extra step in between the two conditional draws involved in one step of (see Section 5 for details). The trace class property for , along with results in [9], implies that the sandwich chain is also trace class, and that each ordered eigenvalue of the sandwich chain is dominated by the corresponding ordered eigenvalue of (with at least one strict domination). Recent work in [2] provides a rigorous approach to approximating eigenvalues of trace class Markov chains whose Mtd is available in closed form. These results are not applicable to the two-block sampler, as its Mtd is not available in closed form, and extending the results of [2] to such settings is a topic of ongoing research.
The rest of the paper is organized as follows. In Section 2, we provide the form of the relevant conditional densities for the Markov chains and . In Section 3, we establish the trace class property for the two-block Markov chain . In Section 4, we show that the three-block Markov chain is not Hilbert-Schmidt. In Section 5, we derive the Haar PX-DA sandwich chain corresponding to the two-block DA chain. Finally, in Section 6, we compare the performance of the two-block, three-block and Haar PX-DA based chains on simulated and real datasets.
2 Form of relevant densities
In this section, we present expressions for various densities corresponding to the Normal-Gamma model in (1). These densities appear in the Mtds for the Markov chains and .
The joint density for the parameter vector conditioned on the data vector is given by the following:
(4) 
Based on the joint density in (4), the following conditional distributions can be derived in a straightforward fashion.
In particular,
(5) 
for
In particular,
(6) 
for
In particular,
(7) 
for

Given $\beta$ and $\sigma^2$, the variables $\tau_1, \tau_2, \ldots, \tau_p$ are conditionally independent, and the conditional density of $\tau_j$ given $\beta$ and $\sigma^2$ is GIG
In particular,
(8) 
for
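Since the coordinates of $\tau$ are conditionally independent GIG draws, this step is straightforward to carry out with standard software. A minimal sketch, assuming the three-parameter GIG density proportional to $x^{p-1}\exp(-(ax + b/x)/2)$ for $x > 0$; the parameter values below are illustrative, not the model's actual conditional parameters.

```python
import numpy as np
from scipy.special import kv
from scipy.stats import geninvgauss

# scipy's geninvgauss uses the two-parameter form geninvgauss(p, c) with
# density proportional to x^(p-1) * exp(-c * (x + 1/x) / 2); the general
# (a, b) case is recovered with c = sqrt(a*b) and scale sqrt(b/a).
def sample_gig(p, a, b, size, rng):
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(b / a),
                           size=size, random_state=rng)

p, a, b = 0.5, 2.0, 3.0
rng = np.random.default_rng(3)
draws = sample_gig(p, a, b, size=200_000, rng=rng)

# Sanity check against the known GIG mean:
#   sqrt(b/a) * K_{p+1}(w) / K_p(w),  with w = sqrt(a*b).
w = np.sqrt(a * b)
exact_mean = np.sqrt(b / a) * kv(p + 1, w) / kv(p, w)
print(abs(draws.mean() - exact_mean) < 0.02)
```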
3 Properties of the two-block Gibbs sampler
In this section, we show that the operator associated with the two-block Gibbs sampler with Markov transition density specified in (3) is trace class when and is not trace class when .
Theorem 1.
For all values of and , the Markov operator corresponding to the two-block Markov chain is trace class (and hence Hilbert-Schmidt) when , and is not trace class when .
Proof. In the current setting, the trace class property is equivalent to the finiteness of the integral (see [16, Section 2], for example)
(9) 
We will consider five separate cases: , , , and . In the first three cases, we will show that the integral in (9) is finite, and in the last two cases we will show that the integral in (9) is infinite. The proof is a lengthy and intricate algebraic exercise involving careful upper/lower bounds for modified Bessel functions and conditional densities, and we will try to provide a roadmap/explanation whenever possible. We will start with the case .
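As a toy illustration (ours, not part of the proof) of the criterion just stated: for a Markov operator with transition density $k(x, x')$, the trace is the integral of $k(x, x)$ over the state space, and for a kernel with known spectrum this integral should match the eigenvalue sum.

```python
import numpy as np
from scipy.integrate import quad

# The Gaussian AR(1) kernel k(x, x') = N(x'; rho*x, 1 - rho^2), a DA-type
# chain with stationary distribution N(0, 1), has eigenvalues rho^j for
# j >= 0, so the trace integral should equal sum_j rho^j = 1 / (1 - rho).
rho = 0.6

def k_diag(x):
    var = 1.0 - rho ** 2
    return np.exp(-(x - rho * x) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

trace, _ = quad(k_diag, -np.inf, np.inf)
print(abs(trace - 1.0 / (1.0 - rho)) < 1e-8)
```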
Case 1:
By the definition of , we have
(10)  
As a first step, we will gather all the terms with , and then focus on finding an upper bound for the inner integral with respect to . Using (5), (7) and (8), we get,
where and . Note that follows from
and follows from
We now focus on the inner integral in (11), defined by
(12) 
Let denote the largest eigenvalue of . Using the definition of , it follows that
where
We now examine a generic term of the sum in (13). Note that and are both (unnormalized) GIG densities. Hence, for any subset of , using the form of the GIG density, we get
(14)  
First, by [11, Page 266], we get that
for all . Next, using the fact that if then is an increasing function for (again, see [11, Page 266]), we get
Hence, from (14), we get that
(15) 
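The Bessel function bounds above rest on standard monotonicity and symmetry properties. As an illustrative numerical spot-check (ours, not a substitute for the cited references):

```python
import numpy as np
from scipy.special import kv

# Spot-check standard modified Bessel function facts of the kind invoked
# in the bounds above:
#   (i)   K_nu(x) = K_{-nu}(x)                 (symmetry in the order),
#   (ii)  K_nu(x) is decreasing in x on (0, inf),
#   (iii) x^nu * K_nu(x) is decreasing in x for nu > 0, since
#         d/dx [x^nu K_nu(x)] = -x^nu K_{nu-1}(x) < 0.
x = np.linspace(0.1, 10.0, 400)
nu = 0.75

sym_ok = np.allclose(kv(nu, x), kv(-nu, x))
dec_ok = np.all(np.diff(kv(nu, x)) < 0)
xnuk_dec_ok = np.all(np.diff(x ** nu * kv(nu, x)) < 0)
print(sym_ok, dec_ok, xnuk_dec_ok)
```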
By (11) and (12), the trace class property will be established if we show that for every , the integral
is finite. We proceed to show this by first simplifying the inner integral with respect to . Using the form of the Gamma density, we get
(16)  
It follows by (16) that
As discussed above, this establishes the trace class property in the case .
Case 2:
In this case, we first note that all arguments in Case 1 go through verbatim until (14). Next, we note that
(17) 
If then , and by [11, Page 266], we get
(18) 
Using the property that (see [1], Page 375), we obtain
If then . Since is increasing in for (see [11, Page 266]), it follows that for . Also, by the integral formula (see [1], Page 376)
Since for any ( is increasing on ), we get
for . In particular, Hence for all we have
(19) 