Trace class Markov chains for the Normal-Gamma Bayesian shrinkage model


Liyuan Zhang (zhangliyuan@stat.ufl.edu) is Graduate Student and Kshitij Khare (kdkhare@stat.ufl.edu) is Associate Professor, Department of Statistics, University of Florida.
Abstract

High-dimensional data, where the number of variables exceeds or is comparable to the sample size, is now pervasive in many scientific applications. In recent years, Bayesian shrinkage models have been developed as effective and computationally feasible tools to analyze such data, especially in the context of linear regression. In this paper, we focus on the Normal-Gamma shrinkage model developed by Griffin and Brown [5]. This model subsumes the popular Bayesian lasso model, and a three-block Gibbs sampling algorithm to sample from the resulting intractable posterior distribution has been developed in [5]. We consider an alternative two-block Gibbs sampling algorithm, and rigorously demonstrate its advantage over the three-block sampler by comparing specific spectral properties. In particular, we show that the Markov operator corresponding to the two-block sampler is trace class (and hence Hilbert-Schmidt), whereas the operator corresponding to the three-block sampler is not even Hilbert-Schmidt. The trace class property for the two-block sampler implies geometric convergence for the associated Markov chain, which justifies the use of Markov chain CLTs to obtain practical error bounds for MCMC-based estimates. Additionally, it facilitates theoretical comparisons of the two-block sampler with sandwich algorithms which aim to improve performance by inserting inexpensive extra steps in between the two conditional draws of the two-block sampler.



MSC 2010 subject classifications: Primary 60J05, 60J20; Secondary 33C10.

Keywords: Data Augmentation, Markov chain Monte Carlo, Normal-Gamma model, Bayesian shrinkage, trace class operators.

1 Introduction

In recent years, the explosion of data, due to advances in science and information technology, has left almost no field untouched. The availability of high-throughput data from genomic, finance, environmental, marketing (among other) applications has created an urgent need for methodology and tools for analyzing high-dimensional data. In particular, consider the linear model $Y = X\beta + \sigma \epsilon$, where $Y$ is an $n \times 1$ real valued response vector, $X$ is a known $n \times p$ design matrix, $\beta$ is an unknown $p \times 1$ vector of regression coefficients, $\sigma$ is an unknown scale parameter, and the entries of $\epsilon$ are independent standard normals. In the high-dimensional datasets mentioned above, often $p > n$, and classical least squares methods fail. The lasso [22] was developed to provide sparse estimates of the regression coefficient vector $\beta$ in these sample-starved settings (several adaptations/alternatives have been proposed since then). It was observed in [22] that the lasso estimate is the posterior mode obtained when one puts i.i.d. Laplace priors on the elements of $\beta$ (conditional on $\sigma$). This observation has led to a flurry of recent research concerning the development of prior distributions for $\beta$ that yield posterior distributions with high (posterior) probability around sparse values of $\beta$, i.e., values of $\beta$ that have many entries equal to $0$. Such prior distributions are referred to as “continuous shrinkage priors” and the corresponding models are referred to as “Bayesian shrinkage models”. Bayesian shrinkage methods have gained popularity and have been extensively used in a variety of applications including ecology, finance, image processing and neuroscience (see, for example, [24, deloscampos2009, 3, jacquemin2014, xing2012, mishchenko2012, gu2013, 19, 20]).

In this paper, we focus on the well-known Normal-Gamma shrinkage model introduced in Griffin and Brown [5]. The model is specified as follows:

(1)
$$Y \mid \beta, \sigma^2, \tau \sim N_n\left(X\beta, \sigma^2 I_n\right), \qquad \beta \mid \sigma^2, \tau \sim N_p\left(0_p, \sigma^2 D_\tau\right),$$
$$\sigma^2 \sim \mbox{Inverse-Gamma}(\alpha, \xi), \qquad \tau_i \stackrel{\mbox{\scriptsize i.i.d.}}{\sim} \mbox{Gamma}(\lambda, \delta), \quad i = 1, 2, \ldots, p,$$

where $N_q$ denotes the $q$-variate normal density, and $D_\tau$ is a diagonal matrix with diagonal entries given by $(D_\tau)_{ii} = \tau_i$. Also, Inverse-Gamma$(\alpha, \xi)$ and Gamma$(\lambda, \delta)$ denote the Inverse-Gamma and Gamma densities with shape parameters $\alpha$ and $\lambda$, and rate parameters $\xi$ and $\delta$ respectively. The marginal density of $\beta_i$ given $\sigma^2$ in the above model is given by

$$\pi(\beta_i \mid \sigma^2) \propto |\beta_i|^{\lambda - \frac{1}{2}}\, K_{\lambda - \frac{1}{2}}\left(\frac{\sqrt{2\delta}\, |\beta_i|}{\sigma}\right),$$

where $K_\nu$ is the modified Bessel function of the second kind. The popular Bayesian lasso model of Park and Casella [17] is a special case of the Normal-Gamma model above with $\lambda = 1$, where the marginal density of $\beta_i$ simplifies to

$$\pi(\beta_i \mid \sigma^2) \propto \exp\left(-\frac{\sqrt{2\delta}\, |\beta_i|}{\sigma}\right).$$

In this case, the marginal density for each $\beta_i$ (given $\sigma$) is the double exponential density. The Normal-Gamma family offers a wider choice for the tail behavior (as $\lambda$ decreases, the marginal distribution becomes more peaked at zero, but has heavier tails), and thereby a more flexible mechanism for model shrinkage.
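To make the Bessel-function form concrete, here is a short numerical check (a minimal sketch, not from the original paper; the values of $\lambda$, $\delta$ and $\sigma$ are arbitrary illustrations). It evaluates the unnormalized marginal above with SciPy's modified Bessel function of the second kind, and verifies that for $\lambda = 1$ the marginal is proportional to a double exponential density.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind, K_nu

def ng_marginal_unnorm(beta, lam, delta, sigma=1.0):
    """Unnormalized Normal-Gamma marginal of beta_i given sigma^2:
    |beta|^(lam - 1/2) * K_{lam - 1/2}(sqrt(2*delta)*|beta|/sigma)."""
    b = np.abs(beta)
    return b ** (lam - 0.5) * kv(lam - 0.5, np.sqrt(2.0 * delta) * b / sigma)

beta = np.linspace(0.1, 5.0, 50)
# At lam = 1 and delta = 2, sqrt(2*delta) = 2, so the marginal should be
# proportional to exp(-2*|beta|); the ratio below is constant in beta.
ratio = ng_marginal_unnorm(beta, lam=1.0, delta=2.0) / np.exp(-2.0 * beta)
assert np.allclose(ratio, ratio[0])
```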

The posterior density of $(\beta, \sigma^2, \tau)$ for the Normal-Gamma model is intractable in the sense that closed form computation or direct sampling is not feasible. Griffin and Brown [5] note that the full conditional densities of $\beta$, $\sigma^2$ and $\tau$ are easy to sample from, and develop a three-block Gibbs sampling Markov chain to generate samples from the desired posterior density. This Markov chain, denoted by $\Phi = \{(\beta_m, \sigma^2_m, \tau_m)\}_{m \geq 0}$ (on the state space $\mathbb{R}^p \times \mathbb{R}_+ \times \mathbb{R}_+^p$), is driven by the Markov transition density (Mtd)

(2)
$$k\left((\tilde{\beta}, \tilde{\sigma}^2, \tilde{\tau}), (\beta, \sigma^2, \tau)\right) = \pi(\tau \mid \tilde{\beta}, \tilde{\sigma}^2, Y)\, \pi(\sigma^2 \mid \tilde{\beta}, \tau, Y)\, \pi(\beta \mid \sigma^2, \tau, Y).$$

Here $\pi(\cdot \mid \cdot, Y)$ denotes the conditional posterior density of the first group of arguments given the second group of arguments (and the data $Y$). The one-step dynamics of this Markov chain to move from the current state, $(\tilde{\beta}, \tilde{\sigma}^2, \tilde{\tau})$, to the next state, $(\beta, \sigma^2, \tau)$, can be described as follows (a code sketch is provided after the list):

  • Draw $\tau$ from $\pi(\tau \mid \tilde{\beta}, \tilde{\sigma}^2, Y)$.

  • Draw $\sigma^2$ from $\pi(\sigma^2 \mid \tilde{\beta}, \tau, Y)$.

  • Draw $\beta$ from $\pi(\beta \mid \sigma^2, \tau, Y)$.
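For concreteness, the following Python sketch implements one iteration of the three-block sampler, assuming the conditional distributions derived in Section 2 below; the hyperparameter names (alpha, xi, lam, delta) mirror the notation in (1), and SciPy's two-parameter geninvgauss is rescaled to the three-parameter GIG used here. This is an illustrative sketch, not code from the paper.

```python
import numpy as np
from scipy.stats import geninvgauss, invgamma

def three_block_step(beta, sigma2, X, Y, alpha, xi, lam, delta, rng):
    """One step of the three-block Gibbs sampler (illustrative sketch).
    Assumes every entry of beta is nonzero, so the GIG draws are well defined."""
    n, p = X.shape
    # Step 1: tau_i | beta, sigma^2 ~ GIG(lam - 1/2, 2*delta, beta_i^2/sigma^2).
    # If W ~ geninvgauss(nu, sqrt(a*b)), then sqrt(b/a)*W ~ GIG(nu, a, b).
    b = beta ** 2 / sigma2
    tau = np.sqrt(b / (2.0 * delta)) * geninvgauss.rvs(
        lam - 0.5, np.sqrt(2.0 * delta * b), random_state=rng)
    # Step 2: sigma^2 | beta, tau ~ Inverse-Gamma with shape alpha + (n+p)/2
    # and rate xi + (||Y - X beta||^2 + beta' D_tau^{-1} beta)/2.
    resid = Y - X @ beta
    rate = xi + 0.5 * (resid @ resid + np.sum(beta ** 2 / tau))
    sigma2 = invgamma.rvs(alpha + 0.5 * (n + p), scale=rate, random_state=rng)
    # Step 3: beta | sigma^2, tau ~ N_p(A^{-1} X'Y, sigma^2 A^{-1}), A = X'X + D_tau^{-1}.
    A = X.T @ X + np.diag(1.0 / tau)
    mean = np.linalg.solve(A, X.T @ Y)
    beta = rng.multivariate_normal(mean, sigma2 * np.linalg.inv(A))
    return beta, sigma2, tau
```

Iterating this update and discarding the $\tau$ coordinate yields, after burn-in, approximate draws from the posterior of $(\beta, \sigma^2)$.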

In [15], the authors show that the distribution of the Markov chain $\Phi$ converges to the desired posterior distribution at a geometric rate (as the number of steps converges to $\infty$).

As mentioned previously, the Bayesian lasso Markov chain of [17] is a special case of the Normal-Gamma Markov chain when $\lambda = 1$. In recent work [21], it was shown that a two-block version of the Bayesian lasso chain (and a variety of other chains for Bayesian regression) can be developed. The authors in [21] then focus their theoretical investigations on the Bayesian lasso, and show that the two-block Bayesian lasso chain has a better behaved spectrum than the original three-block Bayesian lasso chain in the following sense: the Markov operator corresponding to the two-block Bayesian lasso chain is trace class (eigenvalues are countable and summable, and hence in particular square-summable), while the Markov operator corresponding to the original three-block Bayesian lasso chain is not Hilbert-Schmidt (the corresponding absolute value operator either does not have a countable spectrum, or has a countable set of eigenvalues that are not square-summable).

Based on the method outlined in [21], a two-block version of the three-block Normal-Gamma Markov chain can be constructed as follows. The two-block Markov chain, denoted by $\widetilde{\Phi} = \{(\beta_m, \sigma^2_m)\}_{m \geq 0}$ (on the state space $\mathbb{R}^p \times \mathbb{R}_+$), is driven by the Markov transition density (Mtd)

(3)
$$\tilde{k}\left((\tilde{\beta}, \tilde{\sigma}^2), (\beta, \sigma^2)\right) = \int_{\mathbb{R}_+^p} \pi(\beta, \sigma^2 \mid \tau, Y)\, \pi(\tau \mid \tilde{\beta}, \tilde{\sigma}^2, Y)\, d\tau.$$

The one-step dynamics of this Markov chain to move from the current state, $(\tilde{\beta}, \tilde{\sigma}^2)$, to the next state, $(\beta, \sigma^2)$, can be described as follows (a code sketch follows the list):

  • Draw $\tau$ from $\pi(\tau \mid \tilde{\beta}, \tilde{\sigma}^2, Y)$.

  • Draw $\sigma^2$ from $\pi(\sigma^2 \mid \tau, Y)$ and draw $\beta$ from $\pi(\beta \mid \sigma^2, \tau, Y)$.
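A matching sketch of one two-block iteration is given below (again illustrative, under the same assumptions and notation as the three-block sketch); the only change is that $\sigma^2$ is now drawn from $\pi(\sigma^2 \mid \tau, Y)$ in (7), with $\beta$ integrated out, before $\beta$ is drawn.

```python
import numpy as np
from scipy.stats import geninvgauss, invgamma

def two_block_step(beta, sigma2, X, Y, alpha, xi, lam, delta, rng):
    """One step of the two-block Gibbs (DA) sampler (illustrative sketch)."""
    n, p = X.shape
    # Block 1: tau | beta, sigma^2, exactly as in the three-block sampler.
    b = beta ** 2 / sigma2
    tau = np.sqrt(b / (2.0 * delta)) * geninvgauss.rvs(
        lam - 0.5, np.sqrt(2.0 * delta * b), random_state=rng)
    # Block 2: (beta, sigma^2) | tau.  First sigma^2 | tau, with beta integrated
    # out: Inverse-Gamma(alpha + n/2, xi + (Y'Y - Y'X A^{-1} X'Y)/2).
    A = X.T @ X + np.diag(1.0 / tau)
    m = np.linalg.solve(A, X.T @ Y)          # posterior mean of beta given tau
    rate = xi + 0.5 * (Y @ Y - Y @ (X @ m))
    sigma2 = invgamma.rvs(alpha + 0.5 * n, scale=rate, random_state=rng)
    # Then beta | sigma^2, tau ~ N_p(A^{-1} X'Y, sigma^2 A^{-1}).
    beta = rng.multivariate_normal(m, sigma2 * np.linalg.inv(A))
    return beta, sigma2
```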

The goal of this paper is to investigate whether the theoretical results for the Bayesian lasso in [21] hold for the more general and complex setting of the Normal-Gamma model. In particular, we establish that the Markov operator corresponding to the two-block chain is trace class when $\lambda > \frac{1}{2}$ (Theorem 1). On the other hand, the Markov operator corresponding to the three-block chain is not Hilbert-Schmidt for all values of $\lambda$ (Theorem 2). These results hold for all values of the sample size $n$ and the number of independent variables $p$. Since the Bayesian lasso is a special case with $\lambda = 1$, our results subsume the spectral results in [21]. We note that establishing the results in the Normal-Gamma setup is much harder than the Bayesian lasso setting. This is in part due to the fact that the modified Bessel function $K_{\lambda - \frac{1}{2}}$ does not in general have a closed form when $\lambda \neq 1$, and a heavy dose of various identities and bounds for these Bessel functions is needed to analyze the appropriate integrals in the proofs of Theorem 1 and Theorem 2.

We now discuss further some of the implications of establishing the trace class property for $\widetilde{\Phi}$. Note that the Markov chain $\widetilde{\Phi}$ arises from a two-block Data Augmentation (DA) algorithm, with $(\beta, \sigma^2)$ as the parameter block of interest and $\tau$ as the augmented parameter block. Hence the corresponding Markov operator, denoted by $\widetilde{K}$, is a positive, self-adjoint operator (see [6]). Establishing that a positive self-adjoint operator is trace class implies that it has a discrete spectrum, and that its (countably many, non-negative) eigenvalues are summable. The trace class property implies compactness, which further implies geometric ergodicity of the underlying Markov chain (see [16, Section 2], for example). Geometric ergodicity, in turn, facilitates use of Markov chain central limit theorems to provide error bounds for Markov chain based estimates of relevant posterior expectations. The DA interpretation of $\widetilde{\Phi}$ also enables us to use the Haar PX-DA technique from [6] and construct a “sandwich” Markov chain by adding an inexpensive extra step in between the two conditional draws involved in one step of $\widetilde{\Phi}$ (see Section 5 for details; a schematic sketch is provided below). The trace class property for $\widetilde{K}$, along with results in [9], implies that the sandwich chain is also trace class, and that each ordered eigenvalue of the sandwich chain is dominated by the corresponding ordered eigenvalue of $\widetilde{K}$ (with at least one strict domination). Recent work in [2] provides a rigorous approach to approximate eigenvalues of trace class Markov chains whose Mtd is available in closed form. These results are not applicable to the two-block sampler, as its Mtd is not available in closed form, and extending the results in [2] to such settings is a topic of ongoing research.
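The exact distribution of the extra univariate draw for the Normal-Gamma model is derived in Section 5; purely as a schematic, the sketch below shows where such a step slots into the two-block update. The function draw_g is a hypothetical placeholder for that Section 5 draw, not something specified here.

```python
import numpy as np
from scipy.stats import geninvgauss, invgamma

def sandwich_step(beta, sigma2, X, Y, alpha, xi, lam, delta, rng, draw_g):
    """One Haar PX-DA ('sandwich') iteration: the two blocks of two_block_step
    with an inexpensive rescaling of the augmented block tau in between.
    draw_g(tau, rng) must return a positive scalar from the appropriate Haar
    PX-DA density (derived in Section 5); it is a hypothetical placeholder."""
    n, p = X.shape
    b = beta ** 2 / sigma2
    tau = np.sqrt(b / (2.0 * delta)) * geninvgauss.rvs(
        lam - 0.5, np.sqrt(2.0 * delta * b), random_state=rng)
    tau = draw_g(tau, rng) * tau             # the extra "sandwich" step
    A = X.T @ X + np.diag(1.0 / tau)
    m = np.linalg.solve(A, X.T @ Y)
    rate = xi + 0.5 * (Y @ Y - Y @ (X @ m))
    sigma2 = invgamma.rvs(alpha + 0.5 * n, scale=rate, random_state=rng)
    beta = rng.multivariate_normal(m, sigma2 * np.linalg.inv(A))
    return beta, sigma2
```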

The rest of the paper is organized as follows. In Section 2, we provide the form of the relevant conditional densities for the Markov chains $\Phi$ and $\widetilde{\Phi}$. In Section 3, we establish the trace class property for the two-block Markov chain $\widetilde{\Phi}$. In Section 4, we show that the three-block Markov chain $\Phi$ is not Hilbert-Schmidt. In Section 5, we derive the Haar PX-DA sandwich chain corresponding to the two-block DA chain $\widetilde{\Phi}$. Finally, in Section 6 we compare the performance of the two-block, three-block and the Haar PX-DA based chains on simulated and real datasets.

2 Form of relevant densities

In this section, we present expressions for various densities corresponding to the Normal-Gamma model in (1). These densities appear in the Mtds for the Markov chains $\Phi$ and $\widetilde{\Phi}$.

The joint density of the parameter vector $(\beta, \sigma^2, \tau)$ conditioned on the data vector $Y$ is given by the following:

(4)
$$\pi(\beta, \sigma^2, \tau \mid Y) \propto (\sigma^2)^{-\frac{n+p}{2} - \alpha - 1} \exp\left(-\frac{\|Y - X\beta\|^2 + \beta^T D_\tau^{-1} \beta + 2\xi}{2\sigma^2}\right) \prod_{i=1}^p \tau_i^{\lambda - \frac{3}{2}}\, e^{-\delta \tau_i}.$$

Based on the joint density in (4), the following conditional distributions can be derived in a straightforward fashion.

  • Given $\sigma^2$ and $\tau$, the conditional distribution of $\beta$ is multivariate normal. In particular,

(5)
$$\beta \mid \sigma^2, \tau, Y \sim N_p\left(A_\tau^{-1} X^T Y, \; \sigma^2 A_\tau^{-1}\right)$$

for $A_\tau := X^T X + D_\tau^{-1}$.

  • Given $\beta$ and $\tau$, the conditional distribution of $\sigma^2$ is Inverse-Gamma. In particular,

(6)
$$\sigma^2 \mid \beta, \tau, Y \sim \mbox{Inverse-Gamma}\left(\alpha + \frac{n+p}{2}, \; \xi + \frac{\|Y - X\beta\|^2 + \beta^T D_\tau^{-1} \beta}{2}\right).$$

  • Given $\tau$ alone (with $\beta$ integrated out), the conditional distribution of $\sigma^2$ is again Inverse-Gamma. In particular,

(7)
$$\sigma^2 \mid \tau, Y \sim \mbox{Inverse-Gamma}\left(\alpha + \frac{n}{2}, \; \xi + \frac{Y^T Y - Y^T X A_\tau^{-1} X^T Y}{2}\right).$$

  • Given $\beta$ and $\sigma^2$, the variables $\tau_1, \tau_2, \ldots, \tau_p$ are conditionally independent, and the conditional density of $\tau_i$ given $\beta$ and $\sigma^2$ is GIG$\left(\lambda - \frac{1}{2}, \, 2\delta, \, \frac{\beta_i^2}{\sigma^2}\right)$. In particular,

(8)
$$\pi(\tau_i \mid \beta, \sigma^2, Y) \propto \tau_i^{\lambda - \frac{3}{2}} \exp\left(-\delta \tau_i - \frac{\beta_i^2}{2\sigma^2 \tau_i}\right)$$

for $\tau_i > 0$, where GIG$(\nu, a, b)$ denotes the Generalized Inverse Gaussian density proportional to $x^{\nu - 1} e^{-(a x + b/x)/2}$ for $x > 0$.
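Since the GIG draws in (8) drive all of the samplers above, the following sketch (illustrative; the numerical values are arbitrary) makes explicit the rescaling that maps SciPy's two-parameter geninvgauss onto the three-parameter GIG$(\nu, a, b)$ used here.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(nu, a, b, rng):
    """Draw from GIG(nu, a, b), with density proportional to
    x^(nu - 1) * exp(-(a*x + b/x)/2) for x > 0.  SciPy's geninvgauss(nu, c)
    has density proportional to x^(nu - 1) * exp(-c*(x + 1/x)/2), so
    if W ~ geninvgauss(nu, sqrt(a*b)) then sqrt(b/a)*W ~ GIG(nu, a, b)."""
    return np.sqrt(b / a) * geninvgauss.rvs(nu, np.sqrt(a * b), random_state=rng)

rng = np.random.default_rng(0)
# Conditional draw of tau_i in (8) with illustrative values
# lam = 2.0, delta = 1.0, beta_i = 0.3, sigma^2 = 1.0:
tau_i = sample_gig(2.0 - 0.5, 2.0 * 1.0, 0.3 ** 2 / 1.0, rng)
```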

3 Properties of the two-block Gibbs sampler

In this section, we show that the operator associated with the two-block Gibbs sampler with Markov transition density specified in (3) is trace class when $\lambda > \frac{1}{2}$ and is not trace class when $0 < \lambda \leq \frac{1}{2}$.

Theorem 1.

For all values of $n$ and $p$, the Markov operator $\widetilde{K}$ corresponding to the two-block Markov chain $\widetilde{\Phi}$ is trace class (and hence Hilbert-Schmidt) when $\lambda > \frac{1}{2}$, and is not trace class when $0 < \lambda \leq \frac{1}{2}$.

Proof. In the current setting, the trace class property is equivalent to the finiteness of the integral (see [16, Section 2], for example)

(9)
$$\int_{\mathbb{R}_+} \int_{\mathbb{R}^p} \tilde{k}\big((\beta, \sigma^2), (\beta, \sigma^2)\big)\, d\beta\, d\sigma^2.$$

We will consider five separate cases: $\lambda > 1$, $\frac{1}{2} < \lambda < 1$, $\lambda = 1$, $\lambda = \frac{1}{2}$ and $0 < \lambda < \frac{1}{2}$. In the first three cases, we will show that the integral in (9) is finite, and in the last two cases we will show that the integral in (9) is infinite. The proof is a lengthy and intricate algebraic exercise involving careful upper/lower bounds for modified Bessel functions and conditional densities, and we will try to provide a road-map/explanation whenever possible. We will start with the case $\lambda > 1$.

Case 1: $\lambda > 1$

By the definition of $\tilde{k}$, we have

(10)
$$\int_{\mathbb{R}_+} \int_{\mathbb{R}^p} \tilde{k}\big((\beta, \sigma^2), (\beta, \sigma^2)\big)\, d\beta\, d\sigma^2 = \int_{\mathbb{R}_+} \int_{\mathbb{R}^p} \int_{\mathbb{R}_+^p} \pi(\beta, \sigma^2 \mid \tau, Y)\, \pi(\tau \mid \beta, \sigma^2, Y)\, d\tau\, d\beta\, d\sigma^2.$$

As a first step, we will gather all the terms with , and then focus on finding an upper bound for the inner integral with respect to . Using (5), (7) and (8), we get,

where and . Note that follows from

and follows from

We now focus on the inner integral in (11), defined by

(12)
$$E(\beta, \sigma^2) := \int_{\mathbb{R}_+^p} \prod_{i=1}^p \left[(1 + \omega \tau_i)^{\frac{1}{2}}\, \tau_i^{-\frac{1}{2}}\, e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\right] \pi(\tau \mid \beta, \sigma^2, Y)\, d\tau.$$

Using the inequality $(1 + \omega \tau_i)^{\frac{1}{2}} \leq 1 + \omega^{\frac{1}{2}} \tau_i^{\frac{1}{2}}$ and expanding the product over $i$, it follows that

(13)
$$E(\beta, \sigma^2) \leq \sum_{S \subseteq \{1, 2, \ldots, p\}} \omega^{\frac{|S|}{2}} \prod_{i \notin S} \int_{\mathbb{R}_+} \tau_i^{-\frac{1}{2}}\, e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i \prod_{i \in S} \int_{\mathbb{R}_+} e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i,$$

where $|S|$ denotes the cardinality of the subset $S$.

We now examine a generic term of the sum in (13). Note that $\tau_i^{-\frac{1}{2}}\, e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)$ and $e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)$ are both (unnormalized) GIG densities. Hence, for any subset $S$ of $\{1, 2, \ldots, p\}$, using the form of the GIG density and writing $z_i := \frac{\sqrt{2\delta}\, |\beta_i|}{\sigma}$, we get

(14)
$$\int_{\mathbb{R}_+} \tau_i^{-\frac{1}{2}}\, e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i = 2^{\frac{2\lambda - 1}{4}} \left(\frac{\delta \sigma^2}{\beta_i^2}\right)^{\frac{1}{4}} \frac{K_{\lambda - 1}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)}, \qquad \int_{\mathbb{R}_+} e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i = 2^{\frac{2\lambda - 1}{4}}\, \frac{K_{\lambda - \frac{1}{2}}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)}.$$

First, by [11, Page 266], we get that $K_{\lambda - 1}(x) \leq K_{\lambda - \frac{1}{2}}(x)$ for all $x > 0$ (here we use $\lambda > 1$, so that both indices are non-negative, along with the fact that $K_\nu(x)$ is an increasing function of $\nu$ for $\nu \geq 0$). Next, using the fact that $\frac{K_{\lambda - \frac{1}{2}}(\sqrt{2}\, x)}{K_{\lambda - \frac{1}{2}}(x)}$ is continuous on $(0, \infty)$, converges to $2^{-\frac{2\lambda - 1}{4}}$ as $x \to 0$, and decays like a multiple of $e^{-(\sqrt{2} - 1)x}$ as $x \to \infty$ (again, see [11, Page 266]), we get

$$\frac{K_{\lambda - \frac{1}{2}}(\sqrt{2}\, x)}{K_{\lambda - \frac{1}{2}}(x)} \leq c_2\, e^{-\frac{x}{4}} \quad \mbox{for all } x > 0,$$

for a finite constant $c_2$ depending only on $\lambda$. Hence, from (14), we get that

(15)
$$\int_{\mathbb{R}_+} \tau_i^{-\frac{1}{2}}\, e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i \leq c_3 \left(\frac{\sigma^2}{\beta_i^2}\right)^{\frac{1}{4}} e^{-\frac{z_i}{4}}, \qquad \int_{\mathbb{R}_+} e^{-\frac{\beta_i^2}{2\sigma^2 \tau_i}}\, \pi(\tau_i \mid \beta, \sigma^2, Y)\, d\tau_i \leq c_3\, e^{-\frac{z_i}{4}},$$

where $c_3$ is a finite constant depending only on $\lambda$ and $\delta$.

It follows from (12), (13) and (15) that

$$E(\beta, \sigma^2) \leq c_3^p \sum_{S \subseteq \{1, 2, \ldots, p\}} \omega^{\frac{|S|}{2}} \prod_{i \notin S} \left(\frac{\sigma^2}{\beta_i^2}\right)^{\frac{1}{4}} \prod_{i=1}^p e^{-\frac{\sqrt{2\delta}\, |\beta_i|}{4\sigma}}.$$

By (11) and (12), the trace class property will be established if we show that for every $S \subseteq \{1, 2, \ldots, p\}$, the integral

$$\int_{\mathbb{R}_+} \int_{\mathbb{R}^p} (\sigma^2)^{-\frac{n+p}{2} - \alpha - 1}\, e^{-\frac{\xi}{\sigma^2}} \prod_{i \notin S} \left(\frac{\sigma^2}{\beta_i^2}\right)^{\frac{1}{4}} \prod_{i=1}^p e^{-\frac{\sqrt{2\delta}\, |\beta_i|}{4\sigma}}\, d\beta\, d\sigma^2$$

is finite. We proceed to show this by first simplifying the inner integral with respect to $\beta$. Using the form of the Gamma density, we get

(16)
$$\int_{\mathbb{R}} |\beta_i|^{-\frac{1}{2}}\, e^{-\frac{\sqrt{2\delta}\, |\beta_i|}{4\sigma}}\, d\beta_i = 2\, \Gamma\left(\tfrac{1}{2}\right) \left(\frac{4\sigma}{\sqrt{2\delta}}\right)^{\frac{1}{2}} \quad \mbox{and} \quad \int_{\mathbb{R}} e^{-\frac{\sqrt{2\delta}\, |\beta_i|}{4\sigma}}\, d\beta_i = \frac{8\sigma}{\sqrt{2\delta}}.$$

It follows by (16) that the inner integral with respect to $\beta$ is a constant multiple of $\sigma^p$, so that the integral above reduces to a constant multiple of

$$\int_{\mathbb{R}_+} (\sigma^2)^{-\frac{n}{2} - \alpha - 1}\, e^{-\frac{\xi}{\sigma^2}}\, d\sigma^2 = \frac{\Gamma\left(\frac{n}{2} + \alpha\right)}{\xi^{\frac{n}{2} + \alpha}} < \infty.$$

As discussed above, this establishes the trace class property in the case $\lambda > 1$.

Case 2: $\frac{1}{2} < \lambda < 1$

In this case, we first note that all arguments in Case 1 go through verbatim until (14). Next, we note that the Bessel index $\lambda - 1$ appearing in (14) is now negative, and that

(17)
$$\frac{K_{\lambda - 1}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)} = \frac{K_{1 - \lambda}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)},$$

using the property that $K_{-\nu}(x) = K_\nu(x)$ (see [1], Page 375).

If $\frac{3}{4} \leq \lambda < 1$ then $0 < 1 - \lambda \leq \lambda - \frac{1}{2}$, and by [11, Page 266] ($K_\nu(x)$ is an increasing function of $\nu$ for $\nu \geq 0$), we get

(18)
$$K_{1 - \lambda}(x) \leq K_{\lambda - \frac{1}{2}}(x) \quad \mbox{for all } x > 0,$$

so the generic term in (14) can be bounded exactly as in Case 1.

If $\frac{1}{2} < \lambda < \frac{3}{4}$ then $1 - \lambda > \lambda - \frac{1}{2}$, and the monotonicity argument above no longer applies. Since $K_\nu(x)$ is increasing in $\nu$ for $\nu \geq 0$ (see [11, Page 266]), it follows that $K_{\lambda - \frac{1}{2}}(x) \leq K_{\frac{1}{2}}(x)$ for all $x > 0$. Also, by the integral formula (see [1], Page 376)

$$K_\nu(x) = \int_0^\infty e^{-x \cosh t} \cosh(\nu t)\, dt.$$

Since $\cosh((1 - \lambda) t) \leq \cosh\left(\frac{t}{2}\right)$ for any $t \geq 0$ ($\cosh$ is increasing on $[0, \infty)$, and $1 - \lambda < \frac{1}{2}$), we get

$$K_{1 - \lambda}(x) \leq \int_0^\infty e^{-x \cosh t} \cosh\left(\frac{t}{2}\right) dt = K_{\frac{1}{2}}(x) = \sqrt{\frac{\pi}{2x}}\, e^{-x}$$

for $x > 0$. In particular, $K_{1 - \lambda}(\sqrt{2}\, z_i) \leq K_{\frac{1}{2}}(\sqrt{2}\, z_i)$. Hence for all $z_i > 0$ we have

(19)
$$\frac{K_{1 - \lambda}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)} \leq \frac{K_{\frac{1}{2}}(\sqrt{2}\, z_i)}{K_{\lambda - \frac{1}{2}}(z_i)}.$$

It follows from (17), (18) and (19) that the generic term in (14) can again be bounded by a quantity whose integral over $(\beta, \sigma^2)$ is finite: the behavior of $K_{\lambda - \frac{1}{2}}(z_i)$ as $z_i \to 0$ now produces an additional factor of order $|\beta_i|^{\lambda - 1}$, whose integral near zero is finite precisely because $\lambda > \frac{1}{2}$.