Likelihood-informed dimension reduction for nonlinear inverse problems
The intrinsic dimensionality of an inverse problem is affected by prior information, the accuracy and number of observations, and the smoothing properties of the forward operator. From a Bayesian perspective, changes from the prior to the posterior may, in many problems, be confined to a relatively low-dimensional subspace of the parameter space. We present a dimension reduction approach that defines and identifies such a subspace, called the “likelihood-informed subspace” (LIS), by characterizing the relative influences of the prior and the likelihood over the support of the posterior distribution. This identification enables new and more efficient computational methods for Bayesian inference with nonlinear forward models and Gaussian priors. In particular, we approximate the posterior distribution as the product of a lower-dimensional posterior defined on the LIS and the prior distribution marginalized onto the complementary subspace. Markov chain Monte Carlo sampling can then proceed in lower dimensions, with significant gains in computational efficiency. We also introduce a Rao-Blackwellization strategy that de-randomizes Monte Carlo estimates of posterior expectations for additional variance reduction. We demonstrate the efficiency of our methods using two numerical examples: inference of permeability in a groundwater system governed by an elliptic PDE, and an atmospheric remote sensing problem based on Global Ozone Monitoring System (GOMOS) observations.
Keywords: Inverse problem, Bayesian inference, dimension reduction, low-rank approximation, Markov chain Monte Carlo, variance reduction
Inverse problems arise from indirect observations of parameters of interest. The Bayesian approach to inverse problems formalizes the characterization of these parameters through exploration of the posterior distribution of parameters conditioned on data [1, 2, 3]. Computing expectations with respect to the posterior distribution yields not only point estimates of the parameters (e.g., the posterior mean), but a complete description of their uncertainty via the posterior covariance and higher moments, marginal distributions, quantiles, or event probabilities. Uncertainty in parameter-dependent predictions can also be quantified by integrating over the posterior distribution.
The parameter of interest in inverse problems is often a function of space or time, and hence an element of an infinite-dimensional function space . In practice, the parameter field must be discretized, and the resulting inference problem acquires a high but finite dimension. The computation of posterior expectations then proceeds via posterior sampling, most commonly using Markov chain Monte Carlo (MCMC) methods [4, 5, 6]. The computational cost and efficiency of an MCMC scheme can be strongly affected by the parameter dimension, however. The convergence rates of standard MCMC algorithms usually degrade with parameter dimension [7, 8, 9, 10, 11]; one manifestation of this degradation is an increase in the mixing time of the chain, which in turn leads to higher variance in posterior estimates. Some recent MCMC algorithms, formally derived in the infinite-dimensional setting [12, 13], do not share this scaling problem. Yet even in this setting, we will argue that significant variance reduction can be achieved through explicit dimension reduction and through de-randomization of posterior estimates, explained below.
This paper proposes a method for dimension reduction in Bayesian inverse problems. We reduce dimension by identifying a subspace of the parameter space that is likelihood-informed; this notion will be precisely defined in a relative sense, i.e., relative to the prior. Our focus is on problems with nonlinear forward operators and Gaussian priors, but builds on low-rank approximations  and optimality results  developed for the linear-Gaussian case. Our dimension reduction strategy will thus reflect the combined impact of prior smoothing, the limited accuracy or number of observations, and the smoothing properties of the forward operator. Identification of the likelihood-informed subspace (LIS) will let us write an approximate posterior distribution wherein the distribution on the complement of this subspace is taken to be independent of the data; in particular, the posterior will be approximated as the product of a low-dimensional posterior on the LIS and the marginalization of the prior onto the complement of the LIS. The key practical benefit of this approximation will be variance reduction in the evaluation of posterior expectations. First, Markov chain Monte Carlo sampling can be restricted to coordinates in the likelihood-informed space, enabling greater sampling efficiency—i.e., more independent samples in a given number of MCMC steps or a given computational time. Second, the product form of the approximate posterior will allow sampling in the complement of the likelihood-informed space to be avoided altogether, thus producing Rao-Blackwellized or analytically conditioned estimates of certain posterior expectations.
Dimension reduction for inverse problems has been previously pursued in several ways.  constructs a low dimensional representation of the parameters by using the truncated Karhunen-Lòeve expansion [17, 18] of the prior distribution. A different approach, combining prior and likelihood information via low-rank approximations of the prior-preconditioned Hessian of the log-likelihood, is used in  to approximate the posterior covariance in linear inverse problems. In the nonlinear setting, low-rank approximations of the prior-preconditioned Hessian are used to construct proposal distributions in the stochastic Newton MCMC method  or to make tractable Gaussian approximations at the posterior mode in —either as a Laplace approximation, as the proposal for an independence MCMC sampler, or as the fixed preconditioner for a stochastic Newton proposal. We note that these schemes bound the tradeoff between evaluating Hessian information once (at the posterior mode) or with every sample (in local proposals). In all cases, however, MCMC sampling proceeds in the full-dimensional space.
The dimension reduction approach explored in this paper, by contrast, confines sampling to a lower-dimensional space. We extend the posterior approximation proposed in  to the nonlinear setting by making essentially a low-rank approximation of the posterior expectation of the prior-preconditioned Hessian, from which we derive a projection operator. This projection operator then yields the product-form posterior approximation discussed above, which enables variance reduction through lower-dimensional MCMC sampling and Rao-Blackwellization of posterior estimates.
We note that our dimension reduction approach does not depend on the use of any specific MCMC algorithm, or even on the use of MCMC. The low-dimensional posterior defined on coordinates of the LIS is amenable to a range of posterior exploration or integration approaches. We also note that the present analysis enables the construction of dimension-independent analogues of existing MCMC algorithms with essentially no modification. This is possible because in inverse problems with formally discretization-invariant posteriors—i.e., problems where the forward model converges under mesh refinement and the prior distribution satisfies certain regularity conditions [21, 3]—the LIS can also be discretization invariant. We will demonstrate these discretization-invariance properties numerically.
The rest of this paper is organized as follows. In Section 2, we briefly review the Bayesian formulation for inverse problems. In Section 3, we introduce the likelihood-informed dimension reduction technique, and present the posterior approximation and reduced-variance Monte Carlo estimators based on the LIS. We also present an algorithm for constructing the likelihood-informed subspace. In Section 4, we use an elliptic PDE inverse problem to demonstrate the accuracy and computational efficiency of our posterior estimates and to explore various properties of the LIS, including its dependence on the data and its discretization invariance. In Section 5, we apply our variance reduction technique to an atmospheric remote sensing problem. Section 6 offers concluding remarks.
2 Bayesian formulation for inverse problems
Here is a random variable representing noise and/or model error, which appears additively, and is a known mapping from the parameters to the observables. In a Bayesian setting, we model the parameters as a random variable and, for simplicity, assume that the range of this random variable is a finite dimensional space . Then the parameter of interest is characterized by its posterior distribution conditioned on a realization of the data, :
We assume that all distributions have densities with respect to Lebesgue measure. The posterior probability density function above is the product of two terms: the prior density , which models knowledge of the parameters before the data are observed, and the likelihood function , which describes the probability distribution of for any value of .
We assume that the prior distribution is a multivariate Gaussian , where the covariance matrix can be also defined by its inverse, , commonly referred to as the precision matrix. We model the additive noise with a zero mean Gaussian distribution, i.e., . This lets us define the data-misfit function
such that the likelihood function is proportional to .
3.1 Optimal dimension reduction for linear inverse problems
Consider a linear forward model, , with a Gaussian likelihood and a Gaussian prior as defined in Section 2. The resulting posterior is also Gaussian, , with mean and covariance given by
where is the Hessian of the data-misfit function (3). Without loss of generality we can assume zero prior mean and a positive definite prior covariance matrix.
Now consider approximations to the posterior distribution of the form
where is a rank- projector and is an approximation to the original likelihood function . Approximations of this form can be computationally advantageous when operations involving the prior (e.g., evaluations or sampling) are less expensive than those involving the likelihood. As described in , they are also the natural form with which to approximate a Bayesian update, particularly in the inverse problem setting with high-dimensional . In the deterministic case, inverse problems are ill-posed; the data cannot inform all directions in the parameter space. Equivalently, the spectrum of is compact or decays quickly. Thus one should be able to explicitly project the argument of the likelihood function onto a low-dimensional space without losing much information in the process. The posterior covariance remains full rank, but the update from prior covariance to posterior covariance will be low rank. The challenge, of course, is to find the best projector for any given . The answer will involve balancing the influence of the prior and the likelihood. In the following theorem, we introduce the optimal projector and characterize its approximation properties.
Let be a symmetric decomposition of the prior covariance matrix and let be the eigenvalue-eigenvector pairs of the prior-preconditioned Hessian such that . Define the directions and together with the matrices and . Then, the projector given by:
yields an approximate posterior density of the form and is optimal in the following sense:
minimizes the Förstner distance  from the exact posterior covariance over the class of positive definite matrices that can be written as rank negative semidefinite updates of the prior covariance.
minimizes the Bayes risk over the class of all linear transformations of the data with .
We refer the reader to  for a proof and detailed discussion. ∎
The vectors span the range of the optimal projector; we call this range the likelihood-informed subspace of the linear inverse problem. Note that the are eigenvectors of the pencil . Hence, the th basis vector maximizes the Rayleigh quotient
over the subspace . This Rayleigh quotient helps interpret the as directions where the data are most “informative” relative to the prior. For example, consider a direction representing a rough mode in the parameter space. If the prior is smoothing, then the denominator of (6) will be large; also, if the forward model output is relatively insensitive to variation in the direction, the numerator of (6) will be small. Thus the Rayleigh quotient will be small and is not particularly data-informed relative to the prior. Conversely, if is smooth then the prior variance in this direction may be large and the likelihood may be relatively constraining; this direction is then data-informed. Of course, there are countless intermediate cases, but in general, directions for which (6) are large will lie in the range of .
Note also that diagonalizes both and . We are particularly interested in the latter property: the modes are orthogonal (and can be chosen orthonormal) with respect to the inner product induced by the prior precision matrix. This property will be preserved later in the nonlinear case, and will be important to our posterior sampling schemes.
For nonlinear inverse problems, we seek an approximation of the posterior distribution in the same form as (5). In particular, the range of the projector will be determined by blending together local likelihood-informed subspaces from regions of high posterior probability. The construction of the approximation will be detailed in the following section.
3.2 LIS construction for nonlinear inverse problems
When the forward model is nonlinear, the Hessian of the data-misfit function varies over the parameter space, and thus the likelihood-informed directions are embedded in some nonlinear manifold. We aim to construct a global linear subspace to capture the majority of this nonlinear likelihood-informed manifold.
Let the forward model be first-order differentiable. The linearization of the forward model at a given parameter value , where , provides the local sensitivity of the parameter-to-observable map. Inspired by the dimension reduction approach for the linear inverse problem, we use the linearized forward model to construct the Gauss-Newton approximation of the Hessian of the data-misfit function, . Now consider a local version of the Rayleigh quotient (6),
Introducing the change of variable , we can equivalently use
to quantify the local impact of the likelihood relative to the prior. As in the linear problem, this suggests the following procedure for computing a local LIS given some truncation threshold :
Problem 1 (Construction of the local likelihood-informed subspace).
Given the Gauss-Newton Hessian of the data misfit function at a given , find the eigendecompostion of the prior-preconditioned Gauss-Newton Hessian (ppGNH)
Given a truncation threshold , the local LIS is spanned by , where corresponds to the leading eigenvalues such that .
For a direction with , the local impact of the likelihood and the prior are balanced. Thus, to retain a comprehensive set of likelihood-informed directions, we typically choose a truncation threshold less than 1.
To extend the pointwise criterion (7) into a global criterion for likelihood-informed directions, we consider the expectation of the Rayleigh quotient over the posterior
where is the expected ppGNH over the posterior,
Then we can naturally construct the global LIS through the eigendecomposition of as in the linear case. We consider approximating using the Monte Carlo estimator
where , , are posterior samples. Since the local Hessian is usually not explicitly available and is not feasible to store for large-scale problems, we use its prior-preconditioned low-rank approximation as defined in Problem 1. Thus the global LIS can be constructed by the following procedure:
Problem 2 (Construction of global likelihood-informed subspace).
Suppose we have a set of posterior samples , , where for each sample , the ppGNH is approximated by the truncated low rank eigendecomposition
The global LIS has the non-orthogonal basis , where the eigenvectors correspond to the leading eigenvalues of (10), , for some truncation threshold . Here we choose to be equal to the threshold in Problem 1.
In many applications we can only access the Gauss-Newton Hessian by computing its action on vectors, which involves one forward model evaluation and one adjoint model evaluation. In such a case, the ppGNH can be approximated by finding the eigendecomposition (8) using either Krylov subspace algorithms  or randomized algorithms [24, 25].
The number of samples required to construct the global LIS depends on the degree to which (or its dominant eigenspace) varies over the posterior. In Section 3.5, we present an adaptive construction procedure that automatically explores the directional landscape of the likelihood.
3.3 Posterior approximation
By projecting the likelihood function onto the global likelihood-informed subspace (LIS), we obtain the approximate posterior
where is a projector onto the global LIS. This projector is self-adjoint with respect to the inner product induced by the prior precision matrix, and leads to a natural decomposition of the parameter space as , where is the LIS and is the complement subspace (CS). This choice leads to a factorization of the prior distribution into the product of two distributions, one defined on the low-dimensional LIS and the other on the CS. This factorization is the key to our dimension reduction technique.
We define the projectors and , and a corresponding parameter decomposition, as follows:
Suppose the LIS basis computed in Problem 2 is , where is orthonormal. Define the matrix such that . The projector has the form
Choose such that forms a complete orthonormal system in . Then the projector can be written as
where and .
Naturally, the parameter can be decomposed as
where each projection can be represented as the linear combination of the corresponding basis vectors. Consider the “LIS parameter” and the “CS parameter” , which are the weights associated with the LIS basis and CS basis , respectively. Then we can define the following pair of linear transformations between the parameter and :
Figure 1 illustrates the transformations between the parameter projected onto the LIS, , and the LIS parameter . The same relation holds for the transformations between and the CS parameter . And as Definition 1 makes clear, is an oblique projector.
Suppose we have as defined in Definition 1(b). Then the prior distribution can be decomposed as
where and .
Applying the linear transformation from to as defined in Equation (12), we can rewrite the approximate posterior for the parameters as
which is the product of the reduced posterior
and the complement prior . To compute a Monte Carlo estimate of the expectation of a function over the approximate posterior distribution (13), we only need to sample the reduced posterior , since properties of the Gaussian complement prior are known analytically.
One can also combine MCMC samples from the reduced posterior with independent samples from the complement prior to provide samples that are approximately drawn from the full posterior . By correcting these samples via importance weights or a Metropolis scheme, one would then obtain a sampling algorithm for the original full-space posterior. This idea is not pursued further here, and in the rest of this work we will emphasize the analytical properties of the complement prior , using them to reduce the variance of Monte Carlo estimators.
3.4 Reduced-variance estimators
Suppose we have a function for which the conditional expectation over the approximate posterior (11)
can be calculated either analytically or through some high-precision numerical quadrature scheme. Then variance reduction can be achieved as follows:
Subspace MCMC. Use MCMC in the LIS to simulate a “subspace Markov chain” with target distribution (13). Any number of MCMC algorithms developed in the literature can be applied off-the-shelf, e.g., adaptive MCMC [26, 27, 28, 29, 30], the stochastic Newton algorithm of , or the Riemannian manifold algorithms of . Since the dimension of the LIS can be quite small relative to the original parameter space, the subspace MCMC approach can yield lower sample correlations (better mixing) than applying any of these MCMC algorithms directly to the full posterior (2).
Rao-Blackwellization. We approximate by the expectation of the function over the approximate posterior , i.e., . Given a set of subspace MCMC samples where , a Monte Carlo estimator of is given by
This procedure mitigates many of the difficulties of posterior exploration in high dimensions, provided that the prior-to-posterior update is reasonably low rank. Variance reduction is achieved not only by increasing the effective sample size per MCMC iteration (via subspace MCMC), but also by reducing the variance of Monte Carlo estimators using Rao-Blackwellization. In effect, we argue that the Gaussian CS can be explored separately and via a calculation (15) that does not involve sampling.
Note that while this procedure necessarily reduces the variance of a Monte Carlo estimator, it introduces bias since we replace the expectation over the full posterior with an expectation over the approximate posterior . Thus this variance reduction is particularly useful in situations where the variance of the estimator (17) derived from full-space MCMC samples is large compared with the bias, which is often the case for high-dimensional inverse problems.
Beyond variance reduction, subspace MCMC offers several additional computational advantages over MCMC methods applied to the full posterior directly: (i) The storage requirement for saving subspace MCMC samples is much lower than that of an MCMC scheme that samples the full posterior. (ii) For MCMC methods where the proposal distribution involves operations with square root of the prior covariance matrix (e.g., the stochastic Newton  and preconditioned Crank-Nicolson [12, 13] techniques) the computational cost of handling the full prior covariance can be much higher than the computational cost of handling the reduced prior , which has identity covariance.
The Monte Carlo estimator (16) can be further simplified if the function of interest can be expressed as either the product or the sum of two separate functions, and , defined on the LIS and CS, respectively. In the multiplicative case , the conditional expectation (15) can be written as
In the additive case , it can be written as
Thus the expectation can be decomposed either into the product (in the multiplicative case) or the sum (in the additive case) of the pair of expectations
which are associated with the LIS and CS, respectively. The expectation in (18) can be computed by the subspace MCMC methods described above, whereas the expectation in (19) is computed analytically or through high-order numerical integration.
Now we give two particularly useful examples of the analytical treatment of the complement space.
Example 1 (Reduced variance estimator of the posterior mean).
Suppose we have obtained the empirical posterior mean of the reduced parameter using subspace MCMC. The resulting reduced-variance estimator of the posterior mean is
Example 2 (Reduced variance estimator of the posterior covariance).
Suppose we have the empirical posterior covariance of the reduced parameter , estimated using subspace MCMC. The resulting reduced-variance estimator of the posterior covariance is
3.5 Algorithms for the LIS
Constructing the global LIS requires a set of posterior samples. Since the computational cost of solving Problem 1 for any sample is much greater than the cost of evaluating the forward model, we wish to limit the number of samples used in Problem 2 while ensuring that we adequately capture the posterior variation of the ppGNH. Thus we choose samples using the following adaptive procedure.
Algorithm 1 (Global LIS construction using subspace MCMC).
First, compute the posterior mode . Set the initial sample set for Problem 2 to . Solve Problem 2 to find , the initial LIS basis , and its left-inverse .
Subchain simulation. Simulate the -dimensional subspace MCMC chain for iterations, so that the last state of this chain, denoted by , is uncorrelated with its initial state. Then transformed back to the original parameter space, , is used as the next sample point. Enrich the sample set to .
LIS construction. Solve Problem 2 with the sample set . Then update the LIS basis to and . Set the initial state of the next subspace MCMC chain to .
The convergence criterion in step (3) is based on an incremental distance between likelihood-informed subspaces. The distance penalizes changes in the dominant directions (those with large eigenvalues ) more heavily than changes in the less important directions (those with smaller ).
Definition 2 (Weighted subspace distance).
At iteration , define the basis/weights pair , where is the orthonormal LIS basis from Problem 2 and is a diagonal matrix consisting of normalized weights
Note that in Step (i) of Algorithm 8, we construct the global LIS by always sampling in an adaptively enriched subspace. This offers computational benefits, since the MCMC exploration is always confined to a lower dimensional space. However, a potential problem with this approach is that it might ignore some directions that are also data-informed. A more conservative approach would be to introduce a conditional update at the end of each subchain simulation: perform Metropolized independence sampling in the current CS using the complement prior as proposal. This would enable the subchain to explore the full posterior, but would result in higher-dimensional sampling when constructing the LIS. In our numerical examples, described below, no conditional updates were required for good performance; constructing the LIS using samples from the full posterior and using the subspace approach gave essentially the same results. Of course, one could also simply employ a standard MCMC algorithm to sample the full posterior, and then construct the LIS using the resulting posterior samples. However, the efficiency of the MCMC algorithm in this case will be affected by the dimensionality of the problem.
4 Example 1: Elliptic PDE
Our first example is an elliptic PDE inverse problem used to demonstrate (i) construction of the LIS and the impact of mesh refinement; (ii) the application of low-rank posterior mean and variance estimators; and (iii) changes in the LIS with varying amounts of observational data.
4.1 Problem setup
Consider the problem domain , with boundary . We denote the spatial coordinate by . Consider the permeability field , the pressure field , and sink/source terms . The pressure field for a given permeability and source/sink configuration is governed by the Poisson equation
where is the outward normal vector on the boundary. To make a well-posed boundary value problem, a further boundary condition
is imposed. The source/sink term is defined by the superposition of four weighted Gaussian plumes with standard deviation (i.e., spatial width) , centered at four corners , , and , with weights . The system of equations (21) is solved by the finite element method with bilinear elements.
The discretized permeability field is endowed with a log-normal prior distribution, i.e.,
where the covariance matrix is defined through an anisotropic exponential covariance kernel
for . In this example, we set the anisotropic correlation tensor to
the prior standard deviation to , and the correlation length to . The “true” permeability field is a realization from the prior distribution. The true permeability field, the sources/sinks, the simulated pressure field, and the synthetic data are shown in Figure 2.
Partial observations of the pressure field are collected at measurement sensors as shown by the black dots in Figure 2(c). The observation operator is simply the corresponding “mask” operation. This yields observed data as
with additive error . The standard deviation of the measurement noise is prescribed so that the observations have signal-to-noise ratio 10, where the signal-to-noise ratio is defined as . The noisy data are shown in Figure 2(d).
Figure 3 shows two draws from the prior distribution, the eigenspectrum of the prior covariance, and the cumulative prior variance integrated over (i.e., the running sum of the prior covariance eigenvalues). In order to keep percent of the energy in the prior, eigenmodes are required. Because of this slow decay of the prior covariance spectrum, a priori dimension reduction based on a truncated eigendecomposition of the prior covariance (as described in ) would be very inefficient for this problem. Information carried in high-frequency eigenfunctions cannot be captured unless an enormous number of prior modes are retained; thus, a better basis is required.
4.2 LIS construction
Now we demonstrate the process of LIS construction using Algorithm 1, and the structure of the LIS under mesh refinement. To compute the LIS, we run Algorithm 1 for iterations, using adaptive MALA  to simulate each subchain with length . We choose the truncation thresholds . To explore the dimensionality and structure of the LIS versus mesh refinement, we carry out the same numerical experiment on a coarse grid, a intermediate grid, and a fine grid. The dimension of the LIS versus number of iterations, the evolution of the convergence diagnostic (20), and the generalized eigenvalues after 500 iterations—for each level of grid refinement—are shown in Figure 4. Also, Figure 5 shows the first five LIS basis vectors for each level of discretization.
As shown in Figure 4(a), the dimension of the LIS changes rapidly in the first 100 iterations, then it stabilizes. Change in the dimension reflects the fact that the log-likelihood Hessian varies locally in this non-Gaussian problem. We also observe that the grid has a slightly larger final LIS dimension than the two refined grids: at the end of the adaptive construction, the LIS of the grid has dimension , while the grid and the grid yield LIS dimensions of . This effect may be ascribed to larger discretization errors in the grid.
The weighted distance (20) between each adjacent pair of likelihood-informed subspaces is used as the convergence diagnostic during the construction process. With any of the three discretizations, the weighted subspace distance at the end of adaptive construction is several orders of magnitude lower than at the beginning, as shown in Figure 4(b). We also observe that the rates of convergence of this diagnostic are comparable for all three levels of discretization. These figures suggest that while local variation of the Hessian is important in this problem (e.g., the dimension of the LIS doubles over the course of the iterations), much of this variation is well-explored after 100 or 200 iterations of Algorithm 1.
Since the forward model converges with grid refinement, we expect that the associated LIS should also converge. The generalized eigenvalues for all three grids are shown in Figure 4(c), where the spectra associated with all three subspaces have very similar values. And as shown in Figure 5, the leading LIS basis vectors have similar shapes for all three levels of grid refinement. Refinement leads to slightly more structure in , but the overall mode shapes are very close.
4.3 Estimation of the posterior mean and variance
With an LIS in hand, we apply the variance reduction procedure described in Section 3.3 to estimate the posterior mean and variance of the permeability field. Calculations in this subsection use the discretization of the PDE and inversion parameters.
We first demonstrate the sampling performance of subspace MCMC, where we use adaptive MALA  to sample the LIS-defined reduced posterior (14). We compare the results of subspace MCMC with the results of Hessian-preconditioned Langevin MCMC applied to the full posterior (2) (referred to as “full-space MCMC” hereafter). The latter MCMC scheme results from an explicit discretization of the Langevin SDE, preconditioned by the inverse of the log-posterior Hessian evaluated at the posterior mode (see  for details). Note that we cannot precondition the full-dimensional Langevin SDE by the empirical posterior covariance as in adaptive MALA because of the high parameter dimension (). In this setup, subspace MCMC and full-space MCMC require the same number of forward model and gradient evaluations for a given number of MCMC iterations.
To examine sampling performance, the autocorrelation of the log-likelihood function and the autocorrelations of the parameters projected onto the first, third, and fifth LIS basis vectors are used as benchmarks. These results are shown in Figure 6. We run both algorithms for iterations and discard the first half of the chains as burn-in. The top row of Figure 6 shows these benchmarks for both samplers. For all four benchmarks, subspace MCMC produces a faster decay of autocorrelation as a function of sample lag—i.e., a lower correlation between samples after any given number of MCMC steps.
Furthermore, as discussed in Section 3.3, even though the same number of forward model evaluations are required by subspace MCMC and full-space MCMC for a given number of samples, the computational cost of operations involving the square root of the prior covariance—used in sampling and evaluating the proposal distribution—can be much higher for full-space MCMC than subspace MCMC. In this test case, running subspace MCMC for iterations cost seconds of CPU time, while running full-space MCMC for the same number of iterations took seconds. To incorporate this cost difference, the second row of Figure 6 shows the autocorrelation of the four benchmark quantities as a function of CPU time rather than sample lag. Here, we immediately observe that the autocorrelation per CPU time is further reduced by using subspace MCMC.
Of course, recall that to construct the LIS we simulated Algorithm 1 for iterations. This costs roughly seconds of CPU time, which is only of the time required to run full-space MCMC for steps. Therefore subspace MCMC, including the cost of LIS construction, takes less time to produce a given number of samples than full-space MCMC and these samples are less correlated—i.e., of higher quality.
We now compare reduced-variance estimates of the posterior mean and variance (obtained with subspace MCMC) with estimates obtained via full-space MCMC. The results are shown in Figures 7 and 8. Full-space MCMC and subspace MCMC yield very similar mean and variance estimates. Figures 8(a) and (b) distinguish the two components of the Rao-Blackwellized variance estimates described in Example 2. Variance in the LIS, shown in Figure 8(a), is estimated from MCMC samples, while variance in the CS, shown in Figure 8(b), is calculated analytically from the prior and the LIS projector. The sum of these two variance fields is shown in Figure 8(c), and it is nearly the same as the full-space result in Figure 8(d). In the central part of the domain where measurement sensors are not installed, we can observe that the variance is larger in the CS than in the LIS, and hence this part of the domain is prior-dominated. In the right part of the domain, the variance is less prior-dominated, since this region is covered by observations.
4.4 The influence of data
The amount of information carried in the data affects the dimension and structure of the LIS. To demonstrate the impact of the data, we design a case study where different likelihood-informed subspaces are constructed under various observational scenarios. The same stationary groundwater problem defined in Section 4.1 is employed here. For the sake of computational efficiency, the problem domain is discretized by a slightly coarser mesh. And to provide a stronger impulse to the groundwater system, the source/sink terms used in this example are different from those used in Sections 4.1–4.3. Along the boundary of the domain , we evenly distribute a set of sources with a distance of between the source centers. Two sinks are placed in the interior of the domain at locations and . Each source has weight , while each sink has weight . We distributed sensors evenly over the domain ; starting with an inter-sensor spacing of , we incrementally refine the sensor distribution with spacings of , , and . This results in four different data sets, containing the noisy readings of , , , and sensors, respectively. The true permeability field, the sources/sinks, the simulated pressure field, and sensor distributions are shown in Figure 9.
As in Section 4.2, we run Algorithm 1 for iterations to construct the LIS, using subchains of length . For data sets 1–4, the resulting LISs have dimension , , , and , respectively. The generalized eigenvalue spectrum for each LIS is shown in Figure 10. We note that the eigenvalues decay more slowly with increasing amounts of data. This behavior is expected; since the generalized eigenvalues reflect the impact of the likelihood, relative to the prior, more data should lead to more directions where the likelihood dominates the prior.
Since the sensors for all four data sets occupy the same area of the spatial domain, we expect that the four likelihood-informed subspaces should share a similar low frequency structure. However, the high frequency structures in each LIS might differ from each other under refinement of the sensor distribution. Thus the LIS basis vectors corresponding to the largest eigenvalues should share a similar pattern, while the LIS basis vectors corresponding to the relatively small eigenvalues might have different patterns. We observe this effect in the numerical experiments carried out here; Figure 11 shows the first and fifteenth LIS basis vector for each of the data sets.
5 Example 2: atmospheric remote sensing
In this section, we apply the dimension reduction approach to a realistic atmospheric satellite remote sensing problem. The problem is to invert the concentrations of various gases in the atmosphere using the measurement system applied in the GOMOS satellite instrument, which stands for Global Ozone MOnitoring System.
GOMOS is an instrument on board ESA’s Envisat satellite, and was operational for about 10 years before the connection with the satellite was lost in May 2012. The GOMOS instrument performs so-called star occultation measurements; it measures, at different wavelengths, the absorption of starlight as it travels through the atmosphere. Different gases in the atmosphere (such as ozone, nitrogen dioxide and aerosols) leave fingerprints in the measured intensity spectra. The task of the inversion algorithm is to infer the concentrations of these gases based on the measurements.
The GOMOS inverse problem is known to be ill-posed; the intensity spectra may contain strong information about the major gases (like O) at some altitudes, whereas some minor gases (like aerosols) at some altitudes may be practically unidentifiable and totally described by the prior. Thus, the GOMOS problem is a good candidate for our approach: the dimension of the likelihood informed subspace is expected to be small and the prior contribution large.
Next, we briefly present the GOMOS theory and the inverse problem setup. For more details about the GOMOS instrument and the Bayesian treatment of the inverse problem, see  and the references therein.
5.1 The GOMOS model
The GOMOS instrument repeatedly measures light intensities at different wavelengths . First, a reference intensity spectrum is measured above the atmosphere. The so-called transmission spectrum is defined as . The transmissions measured at wavelength along the ray path are modelled using Beer’s law:
where is the density of a gas (unknown parameter) at tangential height . The so called cross-sections , known from laboratory measurements, define how much a gas absorbs light at a given wavelength.
To approximate the integrals in (25), the atmosphere is discretized. The geometry used for inversion resembles an onion: the gas densities are assumed to be constant within spherical layers around the Earth. The GOMOS measurement principle is illustrated in Figure 12 below.
Here, we assume that the cross-sections do not depend on height. In the inverse problem we have gases, wavelengths, and the atmosphere is divided into layers. The discretisation is fixed so that number of measurement lines is equal to the number of layers. Approximating the integrals by sums in the chosen grid, and combining information from all lines and all wavelengths, we can write the model in matrix form as follows:
where are the modelled transmissions, contains the cross-sections, contains the unknown densities and is the geometry matrix that contains the lengths of the lines of sight in each layer.
Computationally, it is convenient to deal with vectors of unknowns. We vectorize the above model using the identity , where denotes the Kronecker product and is the standard vectorization obtained by stacking the columns of the matrix argument on top of each other. Thus, the likelihood model is written in vector form as follows:
where is the measurement error, for which we apply an independent Gaussian model with known variances.
Note that, in principle, the model (27) could be linearized by taking logarithms of both sides, which is usually done for such tomography problems (like X-ray computerized tomography). For this problem, linearisation can cause problems, since the signal from the star is often smaller compared to the background noise in the measurement.
5.2 Data and prior
Here, we generate synthetic data by solving the forward model (27) with known gas densities . In the example, we have 4 gas profiles to be inverted. The atmosphere is discretized into 50 layers, and the total dimension of the problem is thus 200. The simulated data are illustrated in Figure 13.
We estimate the log-profiles of the gases instead of the densities directly. We set a Gaussian process prior for the profiles, which yields , where denotes the elements of vector corresponding to gas . The elements of the 50 50 covariance matrices are calculated based on the squared exponential covariance function
where the parameter values are , , , , and for all . The priors are chosen to promote smooth profiles and to give a rough idea about the magnitude of the density values. The prior is illustrated in Figure 14.
5.3 Inversion results
In this particular synthetic example, we know that gas 1 is very well identified by the data. The data also contain information about gases 2 and 3 at some altitudes. Gas 4, on the other hand, is totally unidentified by the data.
The LIS is constructed using samples—i.e., 200 iterations of Algorithm 1—starting with the Hessian at the posterior mode. The subspace convergence diagnostic and the generalized eigenvalues are shown in Figure 15. We choose the truncation thresholds . The dimension of the LIS in the end was 22.
We compute samples in both the LIS and in the full 200-dimensional space using the Hessian-preconditioned MALA algorithm. In Figure 16, the first two columns show the mean gas profile and the mean 1 and 2 standard deviations in the LIS and in the complement space (CS). The third column shows the combined posterior from the LIS and the CS; for comparison, results from full-space MCMC are shown in the fourth column. Note the different scales on the horizontal axes throughout the figure. We observe that the subspace approach, where MCMC is applied only in a 22-dimensional space, yields results very similar to those of full MCMC. In addition, comparing the contributions of the LIS and CS indicates that gas 1 is dominated by the likelihood, whereas the posterior distribution of gas 4 is entirely determined by the prior. Note that the CS contribution for gas 1 is tiny (check the scale), while the LIS contribution for gas 4 is also very small. For gases 2 and 3, the lower altitudes are likelihood-dominated, while the higher altitudes have more contribution from the prior. The full-space MCMC results for gas 4 show a slightly non-uniform mean, but this appears to be the result of sampling variance. By avoiding sampling altogether in the CS, the subspace approach most likely yields a more accurate posterior in this case.
To further illustrate the approach, we plot the first six basis vectors of the LIS in Figure 17. One can see that the first basis vectors mainly include features of gas 1, which is most informed by the data. The first basis vectors also contain some features of gases 2 and 3 in lower altitudes. Gas 4 is not included in the LIS at all.
The dimension reduction obtained via the subspace approach is expected to yield better mixing than the full-space MCMC. For the GOMOS case, the chain autocorrelations for subspace and full-space MCMC are compared in Figure 18. The subspace sampler shows much faster decay of the autocorrelations than full-space MCMC.
In this test case, the subspace MCMC also has lower computational cost compared to full-space MCMC. To simulate a Markov chain for iterations, the subspace MCMC consumed about seconds of CPU time, while the full-space MCMC cost CPU seconds. We note that the CPU time reduction is not as significant as the elliptic example, because the prior covariance is a dimensional matrix, which is much smaller than the covariance matrix used in the elliptic example. To construct the LIS, we simulated Algorithm 1 for iterations. This cost about seconds of CPU time, which is only about of the CPU time used to run full-space MCMC for steps.
In this paper, we present a new approach for dimension reduction in nonlinear inverse problems with Gaussian priors. Our approach is based on dividing the parameter space into two subspaces: a likelihood-informed subspace (LIS) where the likelihood has a much greater influence on the posterior than the prior distribution, and the complement to the LIS where the Gaussian prior dominates. We explore the posterior projected onto the LIS (the “difficult” and non-Gaussian part of the problem) with Markov chain Monte Carlo while treating the complement space as exactly Gaussian. This approximation allows us to analytically integrate many functions over the complement space when estimating their posterior expectations; the result is a Rao-Blackwellization or de-randomization procedure that can greatly reduce the variance of posterior estimates. Particularly in inverse problems—where information in the data is often limited and the solution of the problem relies heavily on priors—the dimension of the LIS is expected to be small, and the majority of the directions in the parameter space can be handled analytically.
The dimension reduction approach is based on theory developed for the linear case; in  it is shown that in linear-Gaussian problems, the eigendecomposition of the prior-preconditioned log-likelihood Hessian yields an optimal low-rank update from the prior to the posterior, which can be interpreted in terms of a projector whose range is the LIS. Here, we generalize the approach to nonlinear problems, where the log-likelihood Hessian varies over the parameter space. Our solution is to construct many local likelihood-informed subspaces over the support of the posterior and to combine them into a single global LIS. We show how the global LIS can be constructed efficiently in an adaptive manner, starting with the LIS computed at the posterior mode and iteratively enriching the global LIS until a weighted subspace convergence criterion is met.
We demonstrate the approach with two numerical examples. First is an elliptic PDE inverse problem, based on a simple model of subsurface flow. Though the dimension of the parameter space in our experiments ranges from 1200 to 10800, the dimension of the LIS remains only around 20 and is empirically discretization-invariant. Exploring the LIS by MCMC and analytically treating the Gaussian complement produces mean and variance fields very similar to those computed via MCMC in the full space. Yet the mixing properties and the computational cost of MCMC in the LIS are dramatically improved over those of full-space MCMC. Our second demonstration is an atmospheric remote sensing problem, where the goal is to infer the concentrations of chemical species in the atmosphere using star occultation measurements, as on the satellite-borne GOMOS instrument. The dimension of the full problem used here was 200 (four gaseous species and 50 altitudes for each), while the dimension of the LIS was 22. Again, dimension reduction significantly improves the mixing properties of MCMC without sacrificing accuracy.
To conclude, our dimension reduction approach appears to offer an efficient way to probe and exploit the structure of nonlinear inverse problems in order to perform Bayesian inference at a large scale, where standard algorithms are plagued by the curse of dimensionality. The approach also opens up interesting further research questions: it may be useful, for instance, to apply reduced-order and surrogate modeling techniques in the LIS, making them applicable to much larger problems than before.
- The dimension of the global LIS can vary at each iteration. Let denote the dimension of the global LIS at iteration . To be precise, we should then write and , but for brevity we will simplify notation to and when possible.
- A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial Mathematics, Philadelphia, 2005.
- J. P. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer, New York, 2004.
- A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
- W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in practice, volume 2. CRC press, 1996.
- J. S. Liu. Monte Carlo strategies in Scientific Computing. Springer, New York, 2001.
- S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors. Handbook of Markov Chain Monte Carlo. Taylor & Francis, 2011.
- G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.
- G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60:255–268, 1998.
- G. O. Roberts and J. S. Rosenthal. Optimal scaling of various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
- J. C. Mattingly, N. Pillai, and A. M. Stuart. Diffusion limits of the random walk Metropolis algorithm in high dimensions. Annals of Applied Probability, 22:881–930, 2012.
- N. S. Pillai, A. M. Stuart, and A. H. Thiery. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions,. Annals of Applied Probability, 22:2320–2356, 2012.
- A. Beskos, G. O. Roberts, A. M. Stuart, and J. Voss. MCMC methods for diffusion bridges. Stochastic Dynamics, 8(3):319–350, 2008.
- S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28:424–446, 2013.
- H. P. Flath, L. C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas. Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial hessian approximations. SIAM Journal on Scientific Computing, 33(1):407–432, 2011.
- A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximation of linear Bayesian inverse problems. arXiv preprint, arXiv:1407.3463, 2014.
- Y. M. Marzouk and H. N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228:1862–1902, 2009.
- K. Karhunen. Über lineare methoden in der wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys, 37:1–79, 1947.
- M. Loève. Probability theory, Vol. II, volume 46 of Graduate Texts in Mathematics. Springer-Verlag, Berlin, 4 edition, 1978.
- J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
- Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, to appear, 2014. arXiv:1308.6221.
- M. Lassas, E. Saksman, and S. Siltanen. Discretization invariant Bayesian inversion and Besov space priors. Inverse Problems and Imaging, 3(1):87–122, 2009.
- W. Förstner and M. Boudewijn. A metric for covariance matrices. In Geodesy-The Challenge of the 3rd Millennium, pages 299–309. Springer Berlin Heidelberg, 2003.
- G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 2012.
- N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
- E. Liberty, F. Woolfe, P. G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
- H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
- C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
- Y. F. Atchade. An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8(2):235–254, 2006.
- G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475, 2007.
- G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
- M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
- F. Li, Q. Dai, W. Xu, and G. Er. Weighted subspace distance and its applications to object recognition and retrieval with image sets. IEEE Signal Processing Letters, 16(3):227–230, 2009.
- T. Cui, K. J. H. Law, and Y. M. Marzouk. Dimension–independent likelihood–informed MCMC. arXiv preprint, 2014.
- H. Haario, M. Laine, M. Lehtinen, E. Saksman, and J. Tamminen. Markov chain Monte Carlo methods for high dimensional inversion in remote sensing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66:591–608, 2004.