Efficient Marginalization-based MCMC Methods for Hierarchical Bayesian Inverse Problems

Arvind K. Saibaba Department of Mathematics, North Carolina State University, Raleigh, NC asaibab@ncsu.edu    Johnathan Bardsley Department of Mathematical Sciences, University of Montana, Missoula, MT bardsleyj@mso.umt.edu    D. Andrew Brown School of Mathematical and Statistical Sciences, Clemson University, Clemson, SC ab7@clemson.edu    Alen Alexanderian Department of Mathematics, North Carolina State University, Raleigh, NC alexanderian@ncsu.edu
Abstract

Hierarchical models in Bayesian inverse problems are characterized by an assumed prior probability distribution for the unknown state and measurement error precision, and hyper-priors for the prior parameters. Combining these probability models using Bayes’ law often yields a posterior distribution that cannot be sampled from directly, even for a linear model with Gaussian measurement error and Gaussian prior. Gibbs sampling [3] can be used to sample from the posterior, but problems arise when the dimension of the state is large. This is because the Gaussian sample required for each iteration can be prohibitively expensive to compute, and because the statistical efficiency of the Markov chain degrades as the dimension of the state increases. The latter problem can be mitigated using marginalization-based techniques, such as those found in [13, 20, 25], but these can be computationally prohibitive as well. In this paper, we combine the low-rank techniques of [7] with the marginalization approach of [25]. We consider two variants of this approach: delayed acceptance and pseudo-marginalization. We provide a detailed analysis of the acceptance rates and computational costs associated with our proposed algorithms, and compare their performance on two numerical test cases: image deblurring and the inverse heat equation.

1 Introduction

Inverse problems arise in a wide range of applications and typically involve estimating unknown parameters in a physical model from noisy, indirect measurements. In the applications that we are interested in, the unknown parameter vector results from the discretization of a continuous function defined on the computational domain and hence is high-dimensional. Additionally, the inverse problem of recovering the unknown parameters from the data is ill-posed: a solution may not exist, may not be unique, or may be sensitive to the noise in the data.

To set the context of our work and some notation, we consider a discrete measurement error model of the form

(1) $\mathbf{b} = \mathbf{A}\mathbf{x} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}\left(\mathbf{0}, \lambda^{-1}\mathbf{I}\right),$

where $\mathbf{b} \in \mathbb{R}^m$ denotes the measurements, $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the discretized forward model, $\mathbf{x} \in \mathbb{R}^n$ is the unknown estimand, and $\lambda > 0$ is the error precision. The inverse problem here seeks to recover the unknown $\mathbf{x}$ from the measurement $\mathbf{b}$.

The statistical model eq. 1 implies that the probability density function for $\mathbf{b}$ given $\mathbf{x}$ and $\lambda$ is

(2) $\pi(\mathbf{b} \mid \mathbf{x}, \lambda) \propto \lambda^{m/2} \exp\left(-\frac{\lambda}{2}\left\|\mathbf{A}\mathbf{x} - \mathbf{b}\right\|_2^2\right),$

where ‘$\propto$’ denotes proportionality. To address the ill-posedness, we also assume a Gaussian prior probability density function (akin to choosing a quadratic regularization function) of the form

(3) $\pi(\mathbf{x} \mid \delta) \propto \delta^{n/2} \exp\left(-\frac{\delta}{2}\,\mathbf{x}^\top \mathbf{P}\,\mathbf{x}\right),$

where $\delta > 0$ is the prior precision and $\mathbf{P} \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix.

For convenience, we take the prior to have zero mean, but a nonzero mean can also be easily incorporated into our framework. Through Bayes’ law we then obtain the posterior density function of $\mathbf{x}$ conditioned on $\mathbf{b}$, $\lambda$, and $\delta$:

(4) $\pi(\mathbf{x} \mid \mathbf{b}, \lambda, \delta) \propto \pi(\mathbf{b}\mid\mathbf{x},\lambda)\,\pi(\mathbf{x}\mid\delta) \propto \lambda^{m/2}\delta^{n/2}\exp\left(-\frac{\lambda}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2 - \frac{\delta}{2}\mathbf{x}^\top\mathbf{P}\mathbf{x}\right).$

The maximizer of $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ is known as the maximum a posteriori (MAP) estimator, which is also the minimizer of the negative log-posterior $\frac{\lambda}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2 + \frac{\delta}{2}\mathbf{x}^\top\mathbf{P}\mathbf{x}$. This objective, in turn, has the form of a Tikhonov regularization with regularization parameter $\delta/\lambda$, thus establishing a connection between Bayesian and classical inverse problems.
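To make this connection concrete, the following minimal sketch computes the MAP point by solving the regularized normal equations. It is not code from the paper; the names `A`, `b`, `P`, `lam`, and `delta` simply mirror the notation above, and the dense solver is purely illustrative.

```python
# A minimal sketch (illustrative, not the paper's implementation): the MAP
# estimate under the Gaussian likelihood/prior pair coincides with a
# Tikhonov-regularized least-squares solution.
import numpy as np

def map_estimate(A, b, P, lam, delta):
    """Solve (lam * A^T A + delta * P) x = lam * A^T b for the MAP point."""
    H = lam * A.T @ A + delta * P   # posterior precision of x | b, lam, delta
    return np.linalg.solve(H, lam * A.T @ b)

# Toy usage: a small random forward operator with an identity prior precision,
# in which case the MAP point is the classical Tikhonov solution with
# regularization parameter delta / lam.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.05 * rng.standard_normal(20)
x_map = map_estimate(A, b, np.eye(10), lam=1.0 / 0.05**2, delta=1.0)
```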

The scaling terms in eq. 4 involving $\lambda^{m/2}$ and $\delta^{n/2}$ arise from the normalizing constants of the Gaussian measurement error and prior models. In the hierarchical Bayesian approach we will also treat $\lambda$ and $\delta$ as unknown. The a priori uncertainty about plausible values of $(\lambda,\delta)$ is quantified in the prior distribution with density $\pi(\lambda,\delta)$. For convenience, we assume that the precision parameters $\lambda$ and $\delta$ are independent and Gamma-distributed. Specifically, we assume

(5) $\pi(\lambda,\delta) = \pi(\lambda)\,\pi(\delta) \propto \lambda^{\alpha_\lambda-1}e^{-\beta_\lambda\lambda}\;\delta^{\alpha_\delta-1}e^{-\beta_\delta\delta}.$

Other choices are also possible and are discussed in [15, 26, 7] and elsewhere. Applying Bayes’ law again yields the posterior density function of $\mathbf{x}$, $\lambda$, and $\delta$ conditioned on $\mathbf{b}$:

(6) $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b}) \propto \pi(\mathbf{b}\mid\mathbf{x},\lambda)\,\pi(\mathbf{x}\mid\delta)\,\pi(\lambda)\,\pi(\delta).$

The marginal distribution $\pi(\lambda,\delta\mid\mathbf{b})$ can be derived by integrating over $\mathbf{x}$; i.e., $\pi(\lambda,\delta\mid\mathbf{b}) = \int_{\mathbb{R}^n}\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})\,d\mathbf{x}$. Explicit expressions for these distributions are provided in section 2.

The focus of this paper is on the problem of sampling from the posterior distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. In general, the full posterior is not Gaussian, and the prevalent approach to exploring this distribution is to use Markov chain Monte Carlo (MCMC) methods [14]. Several MCMC algorithms have been treated in the context of inverse problems in recent literature, which we briefly review. Specifically, in [3], conjugacy relationships are exploited to define a Gibbs sampler, in which samples from the conditional densities $\pi(\lambda,\delta\mid\mathbf{b},\mathbf{x})$ and $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ (which are Gamma- and Gaussian-distributed, respectively) are cyclically computed. There are two major drawbacks to this approach as $n\to\infty$ [1]: first, the MCMC chain becomes increasingly correlated; and second, sampling from the conditional $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ becomes prohibitively expensive. Thus as $n$ increases, due to the first drawback, longer MCMC chains are needed in order to obtain a sufficient number of effectively independent samples, and due to the second drawback, each MCMC sample is more expensive to compute.

To address the issue of degeneracy, an alternative to the Gibbs sampler is presented in [25], where a proposed state is computed in two stages by first drawing $(\lambda^*,\delta^*)$ from a proposal distribution and then drawing $\mathbf{x}^*\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$. The proposal is accepted or rejected jointly using a Metropolis-Hastings step to obtain an approximate draw from the posterior $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. This approach, called the one-block algorithm [25], does not have the same degeneracy issues as the Gibbs sampler as $n\to\infty$. However, it can be expensive to implement when evaluating the marginal density $\pi(\lambda,\delta\mid\mathbf{b})$ is computationally demanding, and one still has to compute a sample from $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ at every iteration. An alternative approach based on the partially collapsed Gibbs sampler, developed in [20], also suffers from a high computational cost per iteration.

Several approaches exist for handling the computational cost. The approach taken in [7] is to precompute a low-rank approximation of the so-called prior-preconditioned data-misfit part of the Hessian. This low-rank representation allows efficient sampling from the conditional distribution, reducing the overall computational cost. When the forward operator and the prior covariance matrix are diagonalized by the Fourier transform, algorithms such as the fast Fourier transform can be used effectively [5]. Other methods based on Krylov subspace solvers, e.g., conjugate gradient, have also been developed [17]. All of these approaches still suffer from the degeneracy issue as $n\to\infty$.

The contributions of this paper are as follows. To tackle the two drawbacks of the Gibbs sampler, we combine the low-rank approach of [7] with the one-block algorithm of [25]. The use of a low-rank approximation defines an approximate posterior density function $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. In our first algorithm, which we call approximate one-block, $\hat{\pi}$ is used to generate proposals for Metropolis-Hastings, with the proposal samples computed using the one-block algorithm. We propose two other variants of the one-block algorithm that make use of $\hat{\pi}$. Specifically, we embed one-block applied to $\hat{\pi}$ within both the delayed acceptance [8] and pseudo-marginal [2] frameworks to obtain samples from the full posterior $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. The proposed algorithms are fairly general in that any appropriate approximate posterior distribution can be used. We present theoretical results that provide insight into the acceptance rates and the performance of the algorithms. The main takeaway is that when the low-rank approximation is sufficiently close (in a sense that we make precise in section 4), the algorithms behave similarly to the one-block algorithm.

The paper is organized as follows. In section 2, we present the hierarchical Gibbs sampler and discuss the infinite-dimensional limit, presenting the result of [1] showing the degeneracy of hierarchical Gibbs as $n\to\infty$. We then present the one-block algorithm of [25], which does not have the same degeneracy issues. In section 3, we present the new algorithms, making only limited assumptions regarding the approximate full posterior distribution. In section 4, we provide a theoretical analysis of our proposed algorithms, making explicit use of low-rank structure. The numerical experiments in section 5 include a model 1D deblurring problem, which allows us to compare and contrast the various algorithms, and a PDE-based inverse heat equation example that demonstrates the computational benefits of our approaches. We summarize our work and discuss future research in section 6.

2 Review of MCMC Algorithms

In this section, we present two MCMC algorithms for background. The first method is known as hierarchical Gibbs and is standard. Its convergence characteristics serve as motivation for the second algorithm, which is known as one-block.

2.1 The Hierarchical Gibbs Sampler

Under our assumed model, we provide explicit expressions for the posterior and the marginal distribution. To obtain the expression for the posterior distribution, combine eq. 4 with eq. 5 via eq. 6 to obtain

(7) $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b}) \propto \lambda^{m/2+\alpha_\lambda-1}\,\delta^{n/2+\alpha_\delta-1}\exp\left(-\frac{\lambda}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2 - \frac{\delta}{2}\mathbf{x}^\top\mathbf{P}\mathbf{x} - \beta_\lambda\lambda - \beta_\delta\delta\right),$

where $\boldsymbol{\Gamma}(\lambda,\delta) = \left(\lambda\mathbf{A}^\top\mathbf{A}+\delta\mathbf{P}\right)^{-1}$ and $\boldsymbol{\mu}(\lambda,\delta) = \lambda\,\boldsymbol{\Gamma}(\lambda,\delta)\,\mathbf{A}^\top\mathbf{b}$. It follows that the marginal distribution, $\pi(\lambda,\delta\mid\mathbf{b})$, is given by

(8) $\pi(\lambda,\delta\mid\mathbf{b}) \propto \lambda^{m/2+\alpha_\lambda-1}\,\delta^{n/2+\alpha_\delta-1}\det\left(\boldsymbol{\Gamma}(\lambda,\delta)\right)^{1/2}\exp\left(-\beta_\lambda\lambda-\beta_\delta\delta-\frac{\lambda}{2}\mathbf{b}^\top\mathbf{b}+\frac{\lambda^2}{2}\,\mathbf{b}^\top\mathbf{A}\boldsymbol{\Gamma}(\lambda,\delta)\mathbf{A}^\top\mathbf{b}\right).$

Observe that the joint posterior density satisfies $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b}) = \pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)\,\pi(\lambda,\delta\mid\mathbf{b})$, where $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta) = \mathcal{N}\left(\boldsymbol{\mu}(\lambda,\delta),\,\boldsymbol{\Gamma}(\lambda,\delta)\right)$.

We begin with the hierarchical Gibbs sampler of [3]. Our choice of prior eq. 3 for $\mathbf{x}$, and the hyper-prior eq. 5 for $(\lambda,\delta)$, respectively, were made with conjugacy relationships in mind [16]; i.e., so that the ‘full conditional’ densities have the same form as the corresponding priors:

(9) $\pi(\lambda\mid\mathbf{b},\mathbf{x},\delta) \propto \lambda^{m/2+\alpha_\lambda-1}\exp\left(-\left(\frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2+\beta_\lambda\right)\lambda\right),$
(10) $\pi(\delta\mid\mathbf{b},\mathbf{x},\lambda) \propto \delta^{n/2+\alpha_\delta-1}\exp\left(-\left(\frac{1}{2}\mathbf{x}^\top\mathbf{P}\mathbf{x}+\beta_\delta\right)\delta\right),$
(11) $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta) \propto \exp\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^\top\boldsymbol{\Gamma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right).$

Note that eqs. 9 and 10 are Gamma densities, while eq. 11 is the density of a Gaussian distribution. algorithm 1 follows immediately from eqs. 9, 10 and 11 and is precisely the hierarchical Gibbs sampling algorithm of [3].

Input: Set $\mathbf{x}_0$ and $\delta_0$, and define the chain length $N$ and burn-in period $N_b$.
Output: Approximate samples $\{(\mathbf{x}_k,\lambda_k,\delta_k)\}_{k=N_b+1}^{N}$ from the posterior distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$.
1 for $k=1$ to $N$ do
2       Compute $\lambda_k \sim \pi(\lambda\mid\mathbf{b},\mathbf{x}_{k-1},\delta_{k-1})$. Compute $\delta_k \sim \pi(\delta\mid\mathbf{b},\mathbf{x}_{k-1},\lambda_k)$. Compute $\mathbf{x}_k \sim \mathcal{N}(\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k)$, where $\boldsymbol{\Gamma}_k = \boldsymbol{\Gamma}(\lambda_k,\delta_k)$ and $\boldsymbol{\mu}_k = \boldsymbol{\mu}(\lambda_k,\delta_k)$.
3 end for
Algorithm 1 Hierarchical Gibbs Sampler
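For concreteness, a minimal dense-matrix sketch of algorithm 1 follows. It is illustrative rather than the paper's implementation; the hyper-parameter names and the Cholesky-based draw of $\mathbf{x}$ are standard choices, and the latter is exactly the step that becomes prohibitive for large $n$.

```python
# A compact sketch of the hierarchical Gibbs sampler, assuming the notation
# introduced above; hyperparameters and the dense Cholesky-based conditional
# sampler are illustrative assumptions.
import numpy as np

def hierarchical_gibbs(A, b, P, n_iter, alpha_lam=1.0, beta_lam=1e-4,
                       alpha_del=1.0, beta_del=1e-4, rng=None):
    rng = rng or np.random.default_rng()
    m, n = A.shape
    x = np.zeros(n)
    chain = []
    for _ in range(n_iter):
        # lam | b, x ~ Gamma(m/2 + alpha_lam, rate = ||Ax - b||^2 / 2 + beta_lam)
        lam = rng.gamma(m / 2 + alpha_lam,
                        1.0 / (0.5 * np.sum((A @ x - b) ** 2) + beta_lam))
        # delta | x ~ Gamma(n/2 + alpha_del, rate = x' P x / 2 + beta_del)
        delta = rng.gamma(n / 2 + alpha_del, 1.0 / (0.5 * x @ P @ x + beta_del))
        # x | b, lam, delta ~ N(mu, Gamma), with Gamma^{-1} = lam A'A + delta P
        Gam_inv = lam * A.T @ A + delta * P
        L = np.linalg.cholesky(Gam_inv)
        mu = np.linalg.solve(Gam_inv, lam * A.T @ b)
        x = mu + np.linalg.solve(L.T, rng.standard_normal(n))  # cov = Gam_inv^{-1}
        chain.append((lam, delta, x.copy()))
    return chain
```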

The values of $m$ and $n$ are determined by the number of measurements and the size of the numerical mesh, respectively, making the problem discrete. Since $m$ is the dimension of our measurement vector, we assume that it is a fixed value. However, we are free to choose $n$ as we please, and the behavior of our approaches as $n\to\infty$ is an important question. In what follows, we briefly discuss the infinite-dimensional limit, pointing the interested reader to the extensive treatments found in [28, 12] for more details.

Consider the linear inverse problem, which typically arises from the discretization of a Fredholm integral equation of the first kind, for example

(12) $b(s) = \int_{\Omega} a(s,t)\,x(t)\,dt,$

where $b$ is the model output function, $\Omega$ is the computational domain, $a$ is the integral kernel or point spread function, and $x$ is the unknown which we seek to estimate. We define $\mathcal{A}$ to be the forward operator discretized only in its range, so that $(\mathcal{A}x)_i = \int_\Omega a(s_i,t)\,x(t)\,dt$ for $i=1,\ldots,m$, where $s_1,\ldots,s_m$ are the measurement locations, and $\Omega \subset \mathbb{R}^d$ is the spatial domain. For example, in one-dimensional deconvolution (with $\Omega = [0,1]$) one can have

$a(s,t) = \frac{1}{\sqrt{2\pi\gamma^2}}\exp\left(-\frac{(s-t)^2}{2\gamma^2}\right).$

Discretizing this integral in the $t$ variable, e.g., using a uniform mesh on $\Omega$ with $n$ grid elements and midpoint quadrature, then yields $\mathbf{A}\in\mathbb{R}^{m\times n}$. Then we have that

(13) $\pi(\mathbf{b}\mid\mathbf{x},\lambda) \propto \lambda^{m/2}\exp\left(-\frac{\lambda}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2\right).$

Note that here $\mathbf{x}$ denotes the discretized version of $x$. For the prior eq. 3, it is typical to choose $\mathbf{P}$ to be the numerical discretization of the inverse of a differential operator. That is, letting $\mathcal{C}$ denote the infinite-dimensional prior covariance operator, we define $\mathcal{C} = \mathcal{L}^{-1}$, where $\mathcal{L}$ is a differential operator. A basic requirement on $\mathcal{C}$ is that it has to be trace-class on $L^2(\Omega)$. That is, for any orthonormal basis $\{e_i\}_{i=1}^\infty$ of $L^2(\Omega)$, $\sum_{i=1}^\infty \langle \mathcal{C}e_i, e_i\rangle < \infty$; see [28]. Moreover, if $\langle\cdot,\cdot\rangle$ is the standard $L^2(\Omega)$ inner product, we can use midpoint quadrature to obtain $\langle u, v\rangle \approx h\,\mathbf{u}^\top\mathbf{v}$, where $h$ is the mesh width and the boldface letters indicate discretized versions of the variables $u$ and $v$. Using this notation,

(14) $\pi(\mathbf{x}\mid\delta) \propto \delta^{n/2}\exp\left(-\frac{\delta}{2}\,\mathbf{x}^\top\mathbf{P}\mathbf{x}\right).$

Combining eq. 13 and eq. 14 yields

(15) $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta) \propto \lambda^{m/2}\delta^{n/2}\exp\left(-\frac{\lambda}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2-\frac{\delta}{2}\,\mathbf{x}^\top\mathbf{P}\mathbf{x}\right).$

There are two issues that arise when we consider the limit as $n\to\infty$: the first is mathematical, and the second is computational. A question that immediately arises is whether or not one can define an infinite-dimensional limit of the posterior density function. We cannot define a probability density function on a function space. The reason for this is that in finite dimensions, the posterior density is simply the Radon–Nikodym derivative of the posterior probability law of the inference parameter with respect to the Lebesgue measure, but one cannot define a Lebesgue measure on an infinite-dimensional function space. However, as shown in [28], given that the prior covariance operator $\mathcal{C}$ is trace-class and the forward operator is a continuous linear transformation, the posterior law of $x$ is a Gaussian measure on $L^2(\Omega)$, whose mean and covariance are given by the following expressions:

$\bar{x} = \lambda\,\mathcal{C}_{\mathrm{post}}\mathcal{A}^*\mathbf{b}, \qquad \mathcal{C}_{\mathrm{post}} = \left(\lambda\,\mathcal{A}^*\mathcal{A} + \delta\,\mathcal{C}^{-1}\right)^{-1},$

where $\mathcal{A}^*$ denotes the adjoint of $\mathcal{A}$.
As for the construction of the prior covariance operator, as mentioned before, a common approach is to define it as the inverse of a differential operator. For example, we can define

$\mathcal{C} = \left(\kappa I - \Delta\right)^{-s}, \qquad \kappa > 0,$

with Neumann boundary conditions. To ensure that the covariance operator so defined is trace-class, we require $s > d/2$, where $d$ is the spatial dimension of the problem. For the one-dimensional example in eq. 12, $s = 1$ would suffice. For problems with $d = 2$ or $d = 3$, a convenient option is $s = 2$. Note this assumption on $s$ also ensures that the prior draws are almost surely continuous. For further details on the definition of Gaussian measures on infinite-dimensional Hilbert spaces, see [10, 11].
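As an illustration of this construction (not taken from the paper), the sketch below assembles a discretized prior precision $(\kappa I - \Delta)^s$ on a uniform 1-D mesh with Neumann boundary conditions and draws a prior sample; the values $\kappa = 10^{-2}$ and $s = 2$ are arbitrary illustrative choices.

```python
# Sketch: discretized prior precision (kappa I - Laplacian)^s with Neumann
# boundary conditions on a uniform 1-D mesh; kappa and s are assumptions.
import numpy as np

def prior_precision_1d(n, kappa=1e-2, s=2):
    h = 1.0 / n
    # Finite-difference Laplacian with Neumann boundary corrections.
    Lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / h**2
    Lap[0, 0] = Lap[-1, -1] = -1.0 / h**2
    Lop = kappa * np.eye(n) - Lap            # kappa I - Laplacian
    return np.linalg.matrix_power(Lop, s)    # prior precision matrix P

# A prior draw x ~ N(0, P^{-1}) (delta = 1) via a Cholesky factor of P.
rng = np.random.default_rng(1)
P = prior_precision_1d(200)
L = np.linalg.cholesky(P)
x_draw = np.linalg.solve(L.T, rng.standard_normal(200))
```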

A second question that arises is whether or not the performance of the hierarchical Gibbs sampler is dependent upon $n$. For inverse problems in which the infinite-dimensional limit is well-defined, MCMC methods whose performance is independent of the discretization ($n$ in this case) are desirable. In line 3 of algorithm 1, we see that $n$ appears in the Gamma full conditional density for $\delta$; thus it should not be surprising to find that the $\delta$-chain is dependent on $n$. The exact nature of this dependence is the subject of [1, Theorem 3.4], where under reasonable assumptions it is shown that the expected step in the $\delta$-chain scales like $n^{-1}$. Specifically, for any $\delta > 0$,

$\mathbb{E}\left[\delta_{k+1}-\delta_k \mid \delta_k = \delta\right] = \frac{f(\delta)}{n} + o\!\left(n^{-1}\right),$

where $\mathbb{E}$ denotes expectation and $f$ is bounded uniformly in $\delta$. Moreover, the variance of the step also scales like $n^{-1}$; for any $\delta > 0$,

$\operatorname{Var}\left[\delta_{k+1}-\delta_k \mid \delta_k = \delta\right] = \frac{g(\delta)}{n} + o\!\left(n^{-1}\right).$

A consequence of these results, as is noted in [1], is that the expected squared jumping distance of the Markov chain for $\delta$ is $O(n^{-1})$. Moreover, it is noted that the lag-1 autocorrelation of the $\delta$-chain behaves like $1 - c/n$ for some constant $c > 0$, so that the integrated autocorrelation time grows like $n$. Hence, the Monte Carlo error associated with $N$ draws in stationarity is $O(\sqrt{n/N})$. Thus, the $\delta$-chain becomes increasingly correlated as $n\to\infty$. This phenomenon is illustrated in fig. 1, which displays the empirical autocorrelation functions for the $\lambda$- and $\delta$-chains generated by algorithm 1 for a one-dimensional image deblurring test problem. Note that as $n$ increases by a factor of $2$, so does the integrated autocorrelation time (IACT).
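The IACT values reported in fig. 1 can be estimated from a chain with a few lines of code. The following sketch (our own, using a simple truncate-at-first-negative rule rather than any particular windowing scheme from the paper) computes the empirical autocorrelation function and the resulting IACT.

```python
# Sketch: empirical autocorrelation function and integrated autocorrelation
# time (IACT) of a scalar chain; the truncation heuristic is an assumption.
import numpy as np

def autocorr(chain, max_lag):
    c = np.asarray(chain, dtype=float) - np.mean(chain)
    var = np.dot(c, c) / len(c)
    return np.array([np.dot(c[:len(c) - k], c[k:]) / (len(c) * var)
                     for k in range(max_lag + 1)])

def iact(chain, max_lag=200):
    # max_lag should be well below the chain length for a stable estimate.
    r = autocorr(chain, max_lag)
    first_neg = np.argmax(r < 0) if np.any(r < 0) else len(r)
    return 1.0 + 2.0 * np.sum(r[1:first_neg])
```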

Figure 1: Plots of the empirical autocorrelation functions for the chains in the one-dimensional deblurring example: (left) Block-Gibbs and (right) Approximate One-Block discussed in section 3.1. See section 5.1 for details regarding the experimental setup.

2.2 One-Block MCMC

An alternative algorithm for computing a sample $(\mathbf{x},\lambda,\delta)\sim\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ is to first compute a sample $(\lambda,\delta)$ from the marginal density $\pi(\lambda,\delta\mid\mathbf{b})$, defined in eq. 8, and then compute a sample from the conditional density $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$. In principle, we can use this procedure as an alternative to algorithm 1. In practice, it is often not possible to sample directly from $\pi(\lambda,\delta\mid\mathbf{b})$. In the one-block algorithm of [25], a Markov chain is generated in which, if $(\lambda_{k-1},\delta_{k-1})$ is the current element of the chain, a state $(\lambda^*,\delta^*)$ is proposed from some proposal distribution with density $q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$, then $\mathbf{x}^*\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$ is drawn, and the pair $(\mathbf{x}^*,\lambda^*,\delta^*)$ is accepted with probability

(16) $\alpha = \min\left\{1,\;\frac{\pi(\lambda^*,\delta^*\mid\mathbf{b})\;q(\lambda_{k-1},\delta_{k-1}\mid\lambda^*,\delta^*)}{\pi(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1})}\right\}.$

The resulting MCMC method is given by algorithm 2.

Input: Set $\lambda_0$ and $\delta_0$, and define the chain length $N$ and burn-in period $N_b$.
Output: Approximate samples $\{(\mathbf{x}_k,\lambda_k,\delta_k)\}_{k=N_b+1}^{N}$ from the posterior distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$.
1 for $k=1$ to $N$ do
2       Compute $(\lambda^*,\delta^*)\sim q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$. Compute $\mathbf{x}^*\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$. Set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}^*,\lambda^*,\delta^*)$ with probability $\alpha$, defined in eq. 16; else set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$.
3 end for
Algorithm 2 One-Block MCMC

Note that the conditional sample $\mathbf{x}^*\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$ effectively marginalizes (i.e., integrates out) $\mathbf{x}$ from the posterior, as is illustrated by the acceptance ratio eq. 16, which does not depend upon $\mathbf{x}$. If $q$ is used as a Metropolis-Hastings proposal for sampling from the marginal density $\pi(\lambda,\delta\mid\mathbf{b})$ alone, without going through $\mathbf{x}$ (this is the MCMC method proposed in [13]), the acceptance ratio is also given by eq. 16. Thus, the $(\lambda,\delta)$-chain generated by algorithm 2 converges in distribution to $\pi(\lambda,\delta\mid\mathbf{b})$. The marginalization circumvents the degeneracy exhibited by the $\delta$-chain generated by algorithm 1.
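A schematic of algorithm 2 in code may help fix ideas. The sketch below is a hypothetical implementation: it assumes user-supplied callables `log_marginal` (evaluating $\log\pi(\lambda,\delta\mid\mathbf{b})$ up to a constant, parameterized here in $(\log\lambda,\log\delta)$ so that the change-of-variables Jacobian is already included) and `sample_conditional` (drawing $\mathbf{x}\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$), and uses a Gaussian random walk on $(\log\lambda,\log\delta)$.

```python
# Sketch of one-block MCMC under stated assumptions; not the paper's code.
import numpy as np

def one_block(log_marginal, sample_conditional, n_iter, step=0.1, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.zeros(2)                      # (log lam, log delta)
    lp = log_marginal(*np.exp(theta))
    x = sample_conditional(*np.exp(theta))
    chain = []
    for _ in range(n_iter):
        theta_star = theta + step * rng.standard_normal(2)
        lp_star = log_marginal(*np.exp(theta_star))
        x_star = sample_conditional(*np.exp(theta_star))
        # Symmetric random walk in (log lam, log delta); the Jacobian of the
        # log transform is assumed absorbed into log_marginal in this sketch.
        if np.log(rng.uniform()) < lp_star - lp:
            theta, lp, x = theta_star, lp_star, x_star
        chain.append((np.exp(theta), x))
    return chain
```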

3 Algorithms

For some applications, evaluating the marginal density $\pi(\lambda,\delta\mid\mathbf{b})$ is not computationally tractable, nor is computing exact samples $\mathbf{x}\sim\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$. In this case, implementation of algorithm 2 is infeasible. However, suppose that we have an approximate posterior density function $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ for which the marginal density $\hat{\pi}(\lambda,\delta\mid\mathbf{b})$ can be evaluated, and exact samples $\mathbf{x}\sim\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ can be drawn efficiently.

In this section, we propose three MCMC algorithms, each of which uses $\hat{\pi}$ to approximate algorithm 2. We note that Rue and Held [25] also consider approximate marginalization of $\mathbf{x}$ in the one-block algorithm. However, the difference between our approach and that of [25] is the nature of the approximate posterior distribution. Specifically, motivated by non-Gaussian conditional distributions that arise in, e.g., Poisson counts for disease mapping, [25] use an approximation based on a Taylor expansion of the log-likelihood about the mode of the full conditional distribution, resulting in a Gaussian proposal in the Metropolis-Hastings algorithm. By contrast, we are concerned with cases in which the full conditional is still Gaussian, but drawing realizations from the distribution and evaluating the density are computationally prohibitive because of the dimensionality of the unknown parameters. We discuss in section 4.4 our construction of the approximation and its effect on the proposed algorithms.

3.1 Approximate One-Block MCMC

Suppose we generate a proposal by first drawing $(\lambda^*,\delta^*)$ from a proposal distribution with density $q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$, followed by drawing $\mathbf{x}^*\sim\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$. Using this in a Metropolis-Hastings sampler with target distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ is simply the one-block algorithm with $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ in place of $\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$. The proposal density is then given by

$q(\mathbf{x}^*,\lambda^*,\delta^*\mid\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}) = \hat{\pi}(\mathbf{x}^*\mid\mathbf{b},\lambda^*,\delta^*)\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1}),$

and hence, the acceptance probability is

(17) $\alpha = \min\left\{1,\;\frac{\pi(\mathbf{x}^*,\lambda^*,\delta^*\mid\mathbf{b})\;\hat{\pi}(\mathbf{x}_{k-1}\mid\mathbf{b},\lambda_{k-1},\delta_{k-1})\;q(\lambda_{k-1},\delta_{k-1}\mid\lambda^*,\delta^*)}{\pi(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;\hat{\pi}(\mathbf{x}^*\mid\mathbf{b},\lambda^*,\delta^*)\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1})}\right\}.$

The full procedure is summarized in algorithm 3.

Input: Initialize $\lambda_0$, $\delta_0$. Define the length of the chain $N$ and burn-in period $N_b$.
Output: Approximate samples $\{(\mathbf{x}_k,\lambda_k,\delta_k)\}_{k=N_b+1}^{N}$ from the posterior distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$.
1 Compute $\mathbf{x}_0\sim\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda_0,\delta_0)$. for $k=1$ to $N$ do
2       Draw $(\lambda^*,\delta^*)$ from the distribution with density $q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$. Draw $\mathbf{x}^*$ from the distribution with density $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$. Set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}^*,\lambda^*,\delta^*)$ with probability $\alpha$ defined in eq. 17; else set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$.
3 end for
Algorithm 3 Approximate One-Block MCMC.

There are two observations to make about algorithm 3, each of which motivates the MCMC methods that follow. First, computing the ratio eq. 17 requires evaluating the true posterior density $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. In cases in which this is computationally burdensome, the MCMC method presented next seeks to avoid extraneous evaluations of $\pi$ using a technique known as delayed acceptance [8]. Second, algorithm 3 only approximately integrates $\mathbf{x}$ out of the posterior, since $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)\neq\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$; this is not an exact marginalization, and thus some dependence between the $(\lambda,\delta)$ and $\mathbf{x}$ chains may remain, slowing convergence of the algorithm. However, as we discuss below, algorithm 3 is a special case of the so-called pseudo-marginal MCMC algorithm [2]. By generalizing the approximate one-block algorithm, we can improve the approximation to $\pi(\lambda,\delta\mid\mathbf{b})$ and thus improve the mixing of the Markov chains.

3.2 Approximate One-Block MCMC with Delayed Acceptance

It will often be the case that evaluating $\pi$ is computationally prohibitive and/or significantly more expensive than evaluating $\hat{\pi}$, in which case one would like to generate samples from $\pi$ while minimizing the number of times its density is evaluated. This motivates the use of the delayed acceptance framework of [8] to improve the computational efficiency of algorithm 3. In this approach, one step of algorithm 2 is used with $\hat{\pi}$ taken as the target distribution. Only if $(\mathbf{x}^*,\lambda^*,\delta^*)$ is accepted as a sample from $\hat{\pi}$ is it proposed as a sample from $\pi$. The idea is to only evaluate $\pi$ for proposed states that have a high probability of being accepted as draws from the true distribution; the approximate distribution in the first stage acts as a computationally cheap “filter” that screens out proposals likely to be rejected, preventing us from wasting computational effort on them. The resulting approximate one-block MCMC with delayed acceptance (ABDA) procedure is given in algorithm 4.

Input: Initialize $\lambda_0$, $\delta_0$. Define the length of the chain $N$ and burn-in period $N_b$.
Output: Approximate samples $\{(\mathbf{x}_k,\lambda_k,\delta_k)\}_{k=N_b+1}^{N}$ from the posterior distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$.
1 Compute $\mathbf{x}_0\sim\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda_0,\delta_0)$. for $k=1$ to $N$ do
       // Stage 1: apply the one-block algorithm to $\hat{\pi}$.
       Draw $(\lambda^*,\delta^*)$ from the distribution with density $q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$. Draw $\mathbf{x}^*$ from the distribution with density $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)$. Compute $\hat{\alpha}$ defined in eq. 19. Set $(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger) = (\mathbf{x}^*,\lambda^*,\delta^*)$ and promote to Stage 2 with probability $\hat{\alpha}$; else set $(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger) = (\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$ and promote. // Stage 2: accept/reject $(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger)$ as a proposal for $\pi$.
2      Define
(18) $\alpha = \min\left\{1,\;\frac{\pi(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger\mid\mathbf{b})\;q^{\mathrm{DA}}(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger)}{\pi(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;q^{\mathrm{DA}}(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger\mid\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})}\right\},$
where $q^{\mathrm{DA}}$ is defined in eq. 20. Set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger)$ with probability $\alpha$, else set $(\mathbf{x}_k,\lambda_k,\delta_k) = (\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$.
3 end for
Algorithm 4 Approximate One-Block MCMC with delayed acceptance.

Let $\hat{\alpha}$ be given by eq. 16, but with $\pi$ replaced by $\hat{\pi}$; i.e.,

(19) $\hat{\alpha} = \min\left\{1,\;\frac{\hat{\pi}(\lambda^*,\delta^*\mid\mathbf{b})\;q(\lambda_{k-1},\delta_{k-1}\mid\lambda^*,\delta^*)}{\hat{\pi}(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1})}\right\}.$

Then, Stage 2 of algorithm 4 is a Metropolis-Hastings algorithm with target distribution $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ and proposal density given by

(20) $q^{\mathrm{DA}}(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger\mid\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}) = \hat{\alpha}\,\hat{\pi}(\mathbf{x}^\dagger\mid\mathbf{b},\lambda^\dagger,\delta^\dagger)\,q(\lambda^\dagger,\delta^\dagger\mid\lambda_{k-1},\delta_{k-1}) + (1-r)\,\delta_{(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})}(\mathbf{x}^\dagger,\lambda^\dagger,\delta^\dagger),$

where $r$ is the total probability that Stage 1 promotes a composite proposal and $\delta_{(\cdot)}$ denotes the point mass at the current state. Note that there is never a need to evaluate $r$, since when the promoted state is the current state $(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$,

$\alpha = \min\left\{1,\;\frac{\pi(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})}{\pi(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})}\right\} = 1,$

and the chain remains at the same point. Conversely, when the composite sample $(\mathbf{x}^*,\lambda^*,\delta^*)$ is promoted, the $\hat{\alpha}$ and proposal factors cancel, and

$\alpha = \min\left\{1,\;\frac{\pi(\mathbf{x}^*,\lambda^*,\delta^*\mid\mathbf{b})\;\hat{\pi}(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})}{\pi(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;\hat{\pi}(\mathbf{x}^*,\lambda^*,\delta^*\mid\mathbf{b})}\right\}.$
We discuss in section 4 the conditions under which the acceptance rate in the second stage is high, thus preventing wasted computational effort on rejected proposals.

For the ABDA algorithm, suppose that the support of the target density ($\pi$) is completely contained in the support of the approximating density ($\hat{\pi}$), and that the support of the Metropolis-Hastings proposal distribution for $(\lambda,\delta)$ completely contains the support of the marginal distribution of $(\lambda,\delta)$, given any current state of the chain. Then the support of $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ is contained in the support of $q^{\mathrm{DA}}$, for all $(\lambda,\delta)$ such that $\pi(\lambda,\delta\mid\mathbf{b}) > 0$. But the approximate distributions we consider are such that $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b}) > 0$ if and only if $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b}) > 0$. Hence, provided that an acceptable proposal distribution is used in the first stage (i.e., one that would be used in the ordinary one-block algorithm), the properties of the Metropolis-Hastings algorithm guarantee that the ABDA transition kernel satisfies detailed balance with respect to the target joint distribution of $(\mathbf{x},\lambda,\delta)$. In other words, $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ is a stationary distribution of the Markov chain produced by ABDA. This can also easily be gleaned by recognizing that the ABDA algorithm is simply a special case of the delayed acceptance algorithm in [8], with first-stage proposal given by $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda^*,\delta^*)\,q(\lambda^*,\delta^*\mid\lambda,\delta)$, the approximating density given by $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b})$, and target density $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$. This latter observation allows us to apply Theorem 1 of [8] to establish irreducibility and aperiodicity of the ABDA Markov chain. Standard ergodic theory (e.g., [24]) then provides that $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ is, in fact, the limiting distribution.
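The two-stage logic can be summarized in a short sketch. The function below is schematic, not the paper's implementation: it assumes a symmetric proposal on $(\lambda,\delta)$, and `log_marg_hat`, `log_joint_hat`, and `log_joint` are assumed user-supplied log-densities for $\hat{\pi}(\lambda,\delta\mid\mathbf{b})$, $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b})$, and $\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$, respectively. Note that $\pi$ is only evaluated after the cheap first stage accepts.

```python
# Schematic ABDA step under stated assumptions; function names are assumed.
import numpy as np

def abda_step(state, propose_theta, sample_x_hat, log_marg_hat,
              log_joint_hat, log_joint, rng):
    x, theta = state
    theta_star = propose_theta(theta)        # symmetric proposal assumed
    x_star = sample_x_hat(theta_star)        # x* ~ pi-hat(x | b, theta*)
    # Stage 1: one-block step targeting pi-hat, via its tractable marginal.
    if np.log(rng.uniform()) >= min(0.0, log_marg_hat(theta_star)
                                    - log_marg_hat(theta)):
        return state                         # filtered out; pi never evaluated
    # Stage 2: Christen-Fox correction using the exact joint posterior.
    log_a2 = (log_joint(x_star, theta_star) - log_joint(x, theta)) \
             - (log_joint_hat(x_star, theta_star) - log_joint_hat(x, theta))
    return (x_star, theta_star) if np.log(rng.uniform()) < min(0.0, log_a2) \
        else state
```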

3.3 Pseudo-Marginal MCMC

As noted at the end of section 2.2, the $(\lambda,\delta)$-chain generated by algorithm 2 evolves independently of the $\mathbf{x}$-chain. Hence, it does not suffer from the same degeneracy issues as the $\delta$-chain generated by algorithm 1. When it is not computationally feasible to implement algorithm 2, and an approximate posterior $\hat{\pi}$ is used as in algorithms 3 and 4, the marginalization is only approximate, so that there still remains some dependence between the $(\lambda,\delta)$ and $\mathbf{x}$ chains. This dependence can be mitigated as the approximation to the marginal distribution of $(\lambda,\delta)$ improves. In particular, we can use importance sampling, with importance density $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$, to approximate the integration over $\mathbf{x}$. Specifically, we have that

(21) $\pi(\lambda,\delta\mid\mathbf{b}) = \int \frac{\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})}{\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)}\,\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)\,d\mathbf{x} \approx \frac{1}{N_s}\sum_{i=1}^{N_s} w(\mathbf{x}_i,\lambda,\delta) \eqqcolon \hat{\pi}_{N_s}(\lambda,\delta\mid\mathbf{b}),$

where $w(\mathbf{x},\lambda,\delta) = \pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})/\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$ and $\mathbf{x}_i \overset{\text{iid}}{\sim} \hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$. The idea behind pseudo-marginal MCMC [2], in our setting, is to generalize algorithm 3 by using $\hat{\pi}_{N_s}(\lambda,\delta\mid\mathbf{b})$ as an approximation to $\pi(\lambda,\delta\mid\mathbf{b})$. The resulting algorithm is given in algorithm 5.

Input: Initialize $\lambda_0$, $\delta_0$. Define the length of the chain $N$ and burn-in period $N_b$.
Output: Approximate samples $\{(\lambda_k,\delta_k)\}_{k=N_b+1}^{N}$ from the posterior distribution $\pi(\lambda,\delta\mid\mathbf{b})$.
1 Compute $\hat{\pi}_{N_s}(\lambda_0,\delta_0\mid\mathbf{b})$ with $\hat{\pi}_{N_s}$ as in eq. 21. for $k=1$ to $N$ do
2       Compute $(\lambda^*,\delta^*)\sim q(\lambda,\delta\mid\lambda_{k-1},\delta_{k-1})$. Compute $\hat{\pi}_{N_s}(\lambda^*,\delta^*\mid\mathbf{b})$ using eq. 21. Compute
(22) $\alpha = \min\left\{1,\;\frac{\hat{\pi}_{N_s}(\lambda^*,\delta^*\mid\mathbf{b})\;q(\lambda_{k-1},\delta_{k-1}\mid\lambda^*,\delta^*)}{\hat{\pi}_{N_s}(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1})}\right\}.$
With probability $\alpha$, set $(\lambda_k,\delta_k) = (\lambda^*,\delta^*)$ and $\hat{\pi}_{N_s}(\lambda_k,\delta_k\mid\mathbf{b}) = \hat{\pi}_{N_s}(\lambda^*,\delta^*\mid\mathbf{b})$; else set $(\lambda_k,\delta_k) = (\lambda_{k-1},\delta_{k-1})$ and retain $\hat{\pi}_{N_s}(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})$.
3 end for
Algorithm 5 Pseudo-Marginal MCMC.

When $N_s = 1$, algorithm 5 simply reduces to algorithm 3. Conversely, as $N_s\to\infty$, $\hat{\pi}_{N_s}(\lambda,\delta\mid\mathbf{b})\to\pi(\lambda,\delta\mid\mathbf{b})$, and hence the Markov chains produced by algorithms 2 and 5 become indistinguishable. Consequently, as $N_s$ increases, the dependence between the $(\lambda,\delta)$-chain and the $\mathbf{x}$-chain dissipates, as desired. It is shown in [2] that the value of $\hat{\pi}_{N_s}(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})$ computed at iteration $k-1$ can be reused at iteration $k$, so that a new set of importance samples does not need to be computed in eq. 21 for the current state. Furthermore, the same paper showed that the corresponding Markov chain will converge in distribution to the target density, in our case $\pi(\lambda,\delta\mid\mathbf{b})$.
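To illustrate eq. 21 and the recycling of the marginal estimate, here is a schematic sketch (ours, not the paper's implementation); `log_joint` and `log_cond_hat` are assumed callables for $\log\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})$ and $\log\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$, `sample_x_hat` draws from $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$, and the proposal is assumed symmetric.

```python
# Sketch of pseudo-marginal MCMC under stated assumptions.
import numpy as np

def log_marginal_estimate(theta, N_s, sample_x_hat, log_joint, log_cond_hat):
    """log of (1/N_s) sum_i pi(x_i, theta | b) / pi-hat(x_i | b, theta)."""
    log_w = np.array([log_joint(x, theta) - log_cond_hat(x, theta)
                      for x in (sample_x_hat(theta) for _ in range(N_s))])
    return np.logaddexp.reduce(log_w) - np.log(N_s)   # stable log-mean-exp

def pseudo_marginal(theta0, N_s, n_iter, propose, sample_x_hat, log_joint,
                    log_cond_hat, rng=None):
    rng = rng or np.random.default_rng()
    theta, chain = theta0, []
    lp = log_marginal_estimate(theta, N_s, sample_x_hat, log_joint, log_cond_hat)
    for _ in range(n_iter):
        theta_star = propose(theta)
        lp_star = log_marginal_estimate(theta_star, N_s, sample_x_hat,
                                        log_joint, log_cond_hat)
        if np.log(rng.uniform()) < lp_star - lp:   # symmetric proposal assumed
            theta, lp = theta_star, lp_star        # recycle lp for next step
        chain.append(theta)
    return chain
```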

4 Analysis

algorithms 3, 4 and 5 are all approximations of the one-block algorithm (algorithm 2), and they are meant to be used in cases in which implementing one-block is computationally expensive. All three algorithms require an approximation of the posterior, $\hat{\pi}(\mathbf{x},\lambda,\delta\mid\mathbf{b})$, which we assume has the form eq. 7, but with the conditional covariance $\boldsymbol{\Gamma}(\lambda,\delta)$ replaced by an approximation $\hat{\boldsymbol{\Gamma}}(\lambda,\delta)$. We will use the following acronyms in what follows: AOB for Approximate One-Block (algorithm 3) and ABDA for Approximate One-Block with Delayed Acceptance (algorithm 4).

All three algorithms require the computation of samples from the approximate conditional $\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)$, which is of the form

$\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta) = \mathcal{N}\left(\hat{\boldsymbol{\mu}}(\lambda,\delta),\,\hat{\boldsymbol{\Gamma}}(\lambda,\delta)\right),$

where $\hat{\boldsymbol{\mu}}(\lambda,\delta) = \lambda\,\hat{\boldsymbol{\Gamma}}(\lambda,\delta)\,\mathbf{A}^\top\mathbf{b}$. In section 4.4, we tailor these results to a specific choice of $\hat{\boldsymbol{\Gamma}}$. The following quantity will be important in what follows:

(23) $\rho(\mathbf{x};\lambda,\delta) = \frac{\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)}{\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)}.$

An expression for the moments of $\rho$ can be computed analytically by using properties of Gaussian integrals and was established in [7]: for positive integers $k$, the moment $\mathbb{E}_{\mathbf{x}\sim\hat{\pi}(\cdot\mid\mathbf{b},\lambda,\delta)}\left[\rho(\mathbf{x};\lambda,\delta)^k\right]$ admits a closed-form expression (eqs. 24 and 25) in terms of the covariances $\boldsymbol{\Gamma}(\lambda,\delta)$ and $\hat{\boldsymbol{\Gamma}}(\lambda,\delta)$; in particular, $\mathbb{E}_{\hat{\pi}}[\rho] = 1$. Further results for $\rho$ can be derived if $\hat{\boldsymbol{\Gamma}}$ is known explicitly. When $\hat{\boldsymbol{\Gamma}}$ is constructed using the low-rank approach outlined in section 4.4 below, these expressions simplify further.
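As an illustration of how such an approximate conditional can be used in practice, the following sketch builds a rank-$k$ approximation of the prior-preconditioned data-misfit Hessian in the spirit of [7] and samples from the resulting Gaussian; it assumes the factorization $\hat{\boldsymbol{\Gamma}}^{-1} = \mathbf{G}\left(\delta\mathbf{I} + \lambda\mathbf{V}_k\boldsymbol{\Lambda}_k\mathbf{V}_k^\top\right)\mathbf{G}^\top$ with $\mathbf{P} = \mathbf{G}\mathbf{G}^\top$, and the dense factorizations are for illustration only, not the paper's implementation.

```python
# Sketch: rank-k approximate conditional sampler, under stated assumptions.
import numpy as np

def lowrank_factors(A, P, k):
    """Precompute factors for the approximate conditional N(mu_hat, Gamma_hat)."""
    G = np.linalg.cholesky(P)
    Ginv = np.linalg.inv(G)                  # dense inverse for illustration only
    H = Ginv @ (A.T @ A) @ Ginv.T            # prior-preconditioned misfit Hessian
    eigvals, eigvecs = np.linalg.eigh(H)
    idx = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenpairs
    return Ginv, eigvecs[:, idx], eigvals[idx]

def sample_conditional_lowrank(Ginv, V, lam_k, A, b, lam, delta, rng):
    n = Ginv.shape[0]
    d = 1.0 / (delta + lam * lam_k)          # spectrum of (delta I + lam V L V')^{-1}
    # Apply M^{-1} and M^{-1/2} for M = delta I + lam V Lam V' via the
    # spectral identity f(M) = f(delta) (I - V V') + V f(delta + lam Lam) V'.
    Minv = lambda v: v / delta + V @ ((d - 1.0 / delta) * (V.T @ v))
    Minv_sqrt = lambda v: (v / np.sqrt(delta)
                           + V @ ((np.sqrt(d) - 1.0 / np.sqrt(delta)) * (V.T @ v)))
    mu_hat = Ginv.T @ Minv(Ginv @ (lam * A.T @ b))
    return mu_hat + Ginv.T @ Minv_sqrt(rng.standard_normal(n))
```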

4.1 Analysis of the Approximate One-Block acceptance ratio

Our first result derives a simplified version of the acceptance ratio of AOB (algorithm 3) that is more amenable to interpretation.

Proposition 1.

In algorithm 3, let $(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$ denote the current state of the AOB chain and let $(\mathbf{x}^*,\lambda^*,\delta^*)$ be the proposed state. Then, the acceptance ratio simplifies to

$\alpha = \min\left\{1,\; r_{\mathrm{OB}}\;\frac{\rho(\mathbf{x}^*;\lambda^*,\delta^*)}{\rho(\mathbf{x}_{k-1};\lambda_{k-1},\delta_{k-1})}\right\},$

where $r_{\mathrm{OB}} = \dfrac{\pi(\lambda^*,\delta^*\mid\mathbf{b})\;q(\lambda_{k-1},\delta_{k-1}\mid\lambda^*,\delta^*)}{\pi(\lambda_{k-1},\delta_{k-1}\mid\mathbf{b})\;q(\lambda^*,\delta^*\mid\lambda_{k-1},\delta_{k-1})}$ is the one-block acceptance ratio from eq. 16.

Proof.

Consider the ratio between the full posterior distribution and the approximate conditional,

$\frac{\pi(\mathbf{x},\lambda,\delta\mid\mathbf{b})}{\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)} = \pi(\lambda,\delta\mid\mathbf{b})\,\frac{\pi(\mathbf{x}\mid\mathbf{b},\lambda,\delta)}{\hat{\pi}(\mathbf{x}\mid\mathbf{b},\lambda,\delta)} = \pi(\lambda,\delta\mid\mathbf{b})\,\rho(\mathbf{x};\lambda,\delta).$

From [7, Proposition 1], the ratio $\rho$ of the exact and the approximate conditional densities simplifies to an explicitly computable expression (eq. 26). Substituting eq. 26 into eq. 17 and cancelling the common factors yields the desired result. ∎

The meaning of this result is clear if we combine eq. 24 and [7, Proposition 2] to obtain

(27) $\mathbb{E}_{\mathbf{x}^*\sim\hat{\pi}(\cdot\mid\mathbf{b},\lambda^*,\delta^*)}\left[\rho(\mathbf{x}^*;\lambda^*,\delta^*)\right] = 1.$

In other words, given the current state $(\mathbf{x}_{k-1},\lambda_{k-1},\delta_{k-1})$ and a proposed state $(\lambda^*,\delta^*)$, the acceptance ratio in algorithm 3, when averaged over $\mathbf{x}^*\sim\hat{\pi}(\cdot\mid\mathbf{b},\lambda^*,\delta^*)$, is that of the one-block algorithm. Observe that if one takes $\hat{\boldsymbol{\Gamma}} = \boldsymbol{\Gamma}$, so that $\hat{\pi} = \pi$, then