Likelihood-informed dimension reduction for nonlinear inverse problems
Abstract
The intrinsic dimensionality of an inverse problem is affected by prior information, the accuracy and number of observations, and the smoothing properties of the forward operator. From a Bayesian perspective, changes from the prior to the posterior may, in many problems, be confined to a relatively low-dimensional subspace of the parameter space. We present a dimension reduction approach that defines and identifies such a subspace, called the "likelihood-informed subspace" (LIS), by characterizing the relative influences of the prior and the likelihood over the support of the posterior distribution. This identification enables new and more efficient computational methods for Bayesian inference with nonlinear forward models and Gaussian priors. In particular, we approximate the posterior distribution as the product of a lower-dimensional posterior defined on the LIS and the prior distribution marginalized onto the complementary subspace. Markov chain Monte Carlo sampling can then proceed in lower dimensions, with significant gains in computational efficiency. We also introduce a Rao-Blackwellization strategy that derandomizes Monte Carlo estimates of posterior expectations for additional variance reduction. We demonstrate the efficiency of our methods using two numerical examples: inference of permeability in a groundwater system governed by an elliptic PDE, and an atmospheric remote sensing problem based on Global Ozone Monitoring by Occultation of Stars (GOMOS) observations.
Keywords: Inverse problem, Bayesian inference, dimension reduction, low-rank approximation, Markov chain Monte Carlo, variance reduction
1 Introduction
Inverse problems arise from indirect observations of parameters of interest. The Bayesian approach to inverse problems formalizes the characterization of these parameters through exploration of the posterior distribution of parameters conditioned on data [1, 2, 3]. Computing expectations with respect to the posterior distribution yields not only point estimates of the parameters (e.g., the posterior mean), but a complete description of their uncertainty via the posterior covariance and higher moments, marginal distributions, quantiles, or event probabilities. Uncertainty in parameter-dependent predictions can also be quantified by integrating over the posterior distribution.
The parameter of interest in inverse problems is often a function of space or time, and hence an element of an infinite-dimensional function space [3]. In practice, the parameter field must be discretized, and the resulting inference problem acquires a high but finite dimension. The computation of posterior expectations then proceeds via posterior sampling, most commonly using Markov chain Monte Carlo (MCMC) methods [4, 5, 6]. The computational cost and efficiency of an MCMC scheme can be strongly affected by the parameter dimension, however. The convergence rates of standard MCMC algorithms usually degrade with parameter dimension [7, 8, 9, 10, 11]; one manifestation of this degradation is an increase in the mixing time of the chain, which in turn leads to higher variance in posterior estimates. Some recent MCMC algorithms, formally derived in the infinite-dimensional setting [12, 13], do not share this scaling problem. Yet even in this setting, we will argue that significant variance reduction can be achieved through explicit dimension reduction and through derandomization of posterior estimates, explained below.
This paper proposes a method for dimension reduction in Bayesian inverse problems. We reduce dimension by identifying a subspace of the parameter space that is likelihood-informed; this notion will be precisely defined in a relative sense, i.e., relative to the prior. Our focus is on problems with nonlinear forward operators and Gaussian priors, but builds on low-rank approximations [14] and optimality results [15] developed for the linear-Gaussian case. Our dimension reduction strategy will thus reflect the combined impact of prior smoothing, the limited accuracy or number of observations, and the smoothing properties of the forward operator. Identification of the likelihood-informed subspace (LIS) will let us write an approximate posterior distribution wherein the distribution on the complement of this subspace is taken to be independent of the data; in particular, the posterior will be approximated as the product of a low-dimensional posterior on the LIS and the marginalization of the prior onto the complement of the LIS. The key practical benefit of this approximation will be variance reduction in the evaluation of posterior expectations. First, Markov chain Monte Carlo sampling can be restricted to coordinates in the likelihood-informed space, enabling greater sampling efficiency, i.e., more independent samples in a given number of MCMC steps or a given computational time. Second, the product form of the approximate posterior will allow sampling in the complement of the likelihood-informed space to be avoided altogether, thus producing Rao-Blackwellized or analytically conditioned estimates of certain posterior expectations.
Dimension reduction for inverse problems has been previously pursued in several ways. [16] constructs a low-dimensional representation of the parameters by using the truncated Karhunen-Loève expansion [17, 18] of the prior distribution. A different approach, combining prior and likelihood information via low-rank approximations of the prior-preconditioned Hessian of the log-likelihood, is used in [14] to approximate the posterior covariance in linear inverse problems. In the nonlinear setting, low-rank approximations of the prior-preconditioned Hessian are used to construct proposal distributions in the stochastic Newton MCMC method [19] or to make tractable Gaussian approximations at the posterior mode in [20]: either as a Laplace approximation, as the proposal for an independence MCMC sampler, or as the fixed preconditioner for a stochastic Newton proposal. We note that these schemes span the tradeoff between evaluating Hessian information once (at the posterior mode) and re-evaluating it with every sample (in local proposals). In all cases, however, MCMC sampling proceeds in the full-dimensional space.
The dimension reduction approach explored in this paper, by contrast, confines sampling to a lower-dimensional space. We extend the posterior approximation proposed in [15] to the nonlinear setting by making essentially a low-rank approximation of the posterior expectation of the prior-preconditioned Hessian, from which we derive a projection operator. This projection operator then yields the product-form posterior approximation discussed above, which enables variance reduction through lower-dimensional MCMC sampling and Rao-Blackwellization of posterior estimates.
We note that our dimension reduction approach does not depend on the use of any specific MCMC algorithm, or even on the use of MCMC. The low-dimensional posterior defined on coordinates of the LIS is amenable to a range of posterior exploration or integration approaches. We also note that the present analysis enables the construction of dimension-independent analogues of existing MCMC algorithms with essentially no modification. This is possible because in inverse problems with formally discretization-invariant posteriors, i.e., problems where the forward model converges under mesh refinement and the prior distribution satisfies certain regularity conditions [21, 3], the LIS can also be discretization invariant. We will demonstrate these discretization-invariance properties numerically.
The rest of this paper is organized as follows. In Section 2, we briefly review the Bayesian formulation for inverse problems. In Section 3, we introduce the likelihood-informed dimension reduction technique, and present the posterior approximation and reduced-variance Monte Carlo estimators based on the LIS. We also present an algorithm for constructing the likelihood-informed subspace. In Section 4, we use an elliptic PDE inverse problem to demonstrate the accuracy and computational efficiency of our posterior estimates and to explore various properties of the LIS, including its dependence on the data and its discretization invariance. In Section 5, we apply our variance reduction technique to an atmospheric remote sensing problem. Section 6 offers concluding remarks.
2 Bayesian formulation for inverse problems
This section provides a brief overview of the Bayesian framework for inverse problems, as introduced in [1, 2, 3]. Consider the inverse problem of estimating parameters $x$ from data $d$, where
(1) $d = G(x) + e.$
Here $e$ is a random variable representing noise and/or model error, which appears additively, and $G$ is a known mapping from the parameters to the observables. In a Bayesian setting, we model the parameters as a random variable and, for simplicity, assume that the range of this random variable is a finite-dimensional space $\mathbb{R}^n$. Then the parameter of interest is characterized by its posterior distribution conditioned on a realization of the data, $d$:
(2) $\pi(x \mid d) \propto L(x; d)\, \pi_0(x).$
We assume that all distributions have densities with respect to Lebesgue measure. The posterior probability density function above is the product of two terms: the prior density $\pi_0(x)$, which models knowledge of the parameters before the data are observed, and the likelihood function $L(x; d)$, which describes the probability distribution of $d$ for any value of $x$.
We assume that the prior distribution is a multivariate Gaussian $\mathcal{N}(\mu_0, \Gamma_{\mathrm{pr}})$, where the covariance matrix $\Gamma_{\mathrm{pr}}$ can also be defined by its inverse, $\Gamma_{\mathrm{pr}}^{-1}$, commonly referred to as the precision matrix. We model the additive noise with a zero-mean Gaussian distribution, i.e., $e \sim \mathcal{N}(0, \Gamma_{\mathrm{obs}})$. This lets us define the data-misfit function
(3) $\eta(x) = \frac{1}{2} \left\| \Gamma_{\mathrm{obs}}^{-1/2} \big( G(x) - d \big) \right\|^2,$
such that the likelihood function is proportional to $\exp(-\eta(x))$.
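To make the setup concrete, here is a minimal sketch of the resulting unnormalized log-posterior. The forward map `G`, the data, and all covariances below are hypothetical stand-ins, not the models used later in the paper:

```python
import numpy as np

def log_posterior(x, d, G, Gamma_obs, mu0, Gamma_pr):
    """Unnormalized log-posterior: minus data-misfit (3) minus prior misfit."""
    r = G(x) - d
    misfit = 0.5 * r @ np.linalg.solve(Gamma_obs, r)
    dx = x - mu0
    prior = 0.5 * dx @ np.linalg.solve(Gamma_pr, dx)
    return -(misfit + prior)

# Hypothetical mildly nonlinear forward map and data
G = lambda x: np.array([x[0] + 0.1 * x[1] ** 2, x[1]])
d = np.array([1.0, 0.5])
lp = log_posterior(np.zeros(2), d, G, 0.01 * np.eye(2), np.zeros(2), np.eye(2))
assert abs(lp + 62.5) < 1e-9   # misfit 62.5, zero prior misfit at the prior mean
```

For large-scale problems one would of course never form or factor the dense covariance matrices; this sketch only fixes the notation used in the remainder of the section.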
3 Methodology
3.1 Optimal dimension reduction for linear inverse problems
Consider a linear forward model, $G(x) = Gx$, with a Gaussian likelihood and a Gaussian prior as defined in Section 2. The resulting posterior is also Gaussian, $\mathcal{N}(\mu_{\mathrm{pos}}, \Gamma_{\mathrm{pos}})$, with mean and covariance given by
(4) $\mu_{\mathrm{pos}} = \Gamma_{\mathrm{pos}}\, G^\top \Gamma_{\mathrm{obs}}^{-1}\, d, \qquad \Gamma_{\mathrm{pos}} = \left( H + \Gamma_{\mathrm{pr}}^{-1} \right)^{-1},$
where $H = G^\top \Gamma_{\mathrm{obs}}^{-1} G$ is the Hessian of the data-misfit function (3). Without loss of generality we can assume a zero prior mean and a positive definite prior covariance matrix.
Now consider approximations to the posterior distribution of the form
(5) $\tilde{\pi}(x \mid d) \propto \tilde{L}\left( P_r\, x; d \right) \pi_0(x),$
where $P_r$ is a rank-$r$ projector and $\tilde{L}$ is an approximation to the original likelihood function $L$. Approximations of this form can be computationally advantageous when operations involving the prior (e.g., evaluations or sampling) are less expensive than those involving the likelihood. As described in [15], they are also the natural form with which to approximate a Bayesian update, particularly in the inverse problem setting with high-dimensional parameters. In the deterministic case, inverse problems are ill-posed; the data cannot inform all directions in the parameter space. Equivalently, the spectrum of $H$ is compact or decays quickly. Thus one should be able to explicitly project the argument of the likelihood function onto a low-dimensional space without losing much information in the process. The posterior covariance remains full rank, but the update from prior covariance to posterior covariance will be low rank. The challenge, of course, is to find the best projector for any given rank $r$. The answer will involve balancing the influence of the prior and the likelihood. In the following theorem, we introduce the optimal projector and characterize its approximation properties.
Theorem 1.
Let $\Gamma_{\mathrm{pr}} = S S^\top$ be a symmetric decomposition of the prior covariance matrix and let $(\lambda_i, v_i)$ be the eigenvalue-eigenvector pairs of the prior-preconditioned Hessian $S^\top H S$ such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Define the directions $w_i = S v_i$ and $\tilde{w}_i = S^{-\top} v_i$, together with the matrices $W_r = [w_1, \ldots, w_r]$ and $\tilde{W}_r = [\tilde{w}_1, \ldots, \tilde{w}_r]$. Then the projector given by
$$P_r = W_r\, \tilde{W}_r^\top$$
yields an approximate posterior density of the form (5) and is optimal in the following sense:

(a) The resulting approximate posterior covariance minimizes the Förstner distance [22] from the exact posterior covariance over the class of positive definite matrices that can be written as rank-$r$ negative semidefinite updates of the prior covariance.

(b) The resulting approximate posterior mean minimizes the Bayes risk over the class of all linear transformations $Ad$ of the data with $\operatorname{rank}(A) \le r$.
Proof.
We refer the reader to [15] for a proof and detailed discussion. ∎
The vectors $w_1, \ldots, w_r$ span the range of the optimal projector; we call this range the likelihood-informed subspace of the linear inverse problem. Note that the $w_i$ are eigenvectors of the pencil $\left( H, \Gamma_{\mathrm{pr}}^{-1} \right)$. Hence, the $i$th basis vector maximizes the Rayleigh quotient
(6) $\mathcal{R}(w) = \frac{w^\top H\, w}{w^\top \Gamma_{\mathrm{pr}}^{-1}\, w}$
over the subspace $\mathbb{R}^n \setminus \operatorname{span}\{w_1, \ldots, w_{i-1}\}$. This Rayleigh quotient helps interpret the $w_i$ as directions where the data are most "informative" relative to the prior. For example, consider a direction $w$ representing a rough mode in the parameter space. If the prior is smoothing, then the denominator of (6) will be large; also, if the forward model output is relatively insensitive to variation in the $w$ direction, the numerator of (6) will be small. Thus the Rayleigh quotient will be small and $w$ is not particularly data-informed relative to the prior. Conversely, if $w$ is smooth then the prior variance in this direction may be large and the likelihood may be relatively constraining; this direction is then data-informed. Of course, there are countless intermediate cases, but in general, directions for which (6) is large will lie in the range of $P_r$.
Note also that the basis $\{w_i\}$ diagonalizes both $H$ and $\Gamma_{\mathrm{pr}}^{-1}$. We are particularly interested in the latter property: the modes $w_i$ are orthogonal (and can be chosen orthonormal) with respect to the inner product induced by the prior precision matrix. This property will be preserved later in the nonlinear case, and will be important to our posterior sampling schemes.
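For small dense problems, the construction in Theorem 1 can be carried out directly. The following sketch uses an arbitrary random linear model (all matrices hypothetical) to form the optimal rank-$r$ projector and verify two of its structural properties, idempotency and self-adjointness with respect to the prior-precision inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 10, 6, 3

# Hypothetical linear model and covariances
Gmat = rng.standard_normal((m, n))
Gamma_obs = np.eye(m)
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T + n * np.eye(n)              # SPD prior covariance

H = Gmat.T @ np.linalg.solve(Gamma_obs, Gmat)   # data-misfit Hessian
S = np.linalg.cholesky(Gamma_pr)                # decomposition Gamma_pr = S S^T

# Eigenpairs of the prior-preconditioned Hessian S^T H S, sorted descending
lam, V = np.linalg.eigh(S.T @ H @ S)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

W = S @ V[:, :r]                                # directions w_i = S v_i
Wt = np.linalg.solve(S.T, V[:, :r])             # dual directions S^{-T} v_i
P = W @ Wt.T                                    # optimal rank-r projector

assert np.allclose(P @ P, P)                    # idempotent
Gpr_inv = np.linalg.inv(Gamma_pr)
assert np.allclose(Gpr_inv @ P, (Gpr_inv @ P).T)  # self-adjoint w.r.t. prior precision
```

The two assertions reflect exactly the properties quoted above: $P_r$ is a (generally oblique) projector, and the $w_i$ are orthogonal in the inner product induced by $\Gamma_{\mathrm{pr}}^{-1}$.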
For nonlinear inverse problems, we seek an approximation of the posterior distribution in the same form as (5). In particular, the range of the projector will be determined by blending together local likelihood-informed subspaces from regions of high posterior probability. The construction of the approximation will be detailed in the following section.
3.2 LIS construction for nonlinear inverse problems
When the forward model is nonlinear, the Hessian of the data-misfit function varies over the parameter space, and thus the likelihood-informed directions are embedded in some nonlinear manifold. We aim to construct a global linear subspace to capture the majority of this nonlinear likelihood-informed manifold.
Let the forward model $G$ be first-order differentiable. The linearization of the forward model at a given parameter value $x$, $J(x) = \nabla G(x)$, provides the local sensitivity of the parameter-to-observable map. Inspired by the dimension reduction approach for the linear inverse problem, we use the linearized forward model to construct the Gauss-Newton approximation of the Hessian of the data-misfit function, $H(x) = J(x)^\top \Gamma_{\mathrm{obs}}^{-1}\, J(x)$. Now consider a local version of the Rayleigh quotient (6),
$$\mathcal{R}(w; x) = \frac{w^\top H(x)\, w}{w^\top \Gamma_{\mathrm{pr}}^{-1}\, w}.$$
Introducing the change of variable $v = \Gamma_{\mathrm{pr}}^{-1/2}\, w$, we can equivalently use
(7) $\tilde{\mathcal{R}}(v; x) = \frac{v^\top\, \Gamma_{\mathrm{pr}}^{1/2}\, H(x)\, \Gamma_{\mathrm{pr}}^{1/2}\, v}{v^\top v}$
to quantify the local impact of the likelihood relative to the prior. As in the linear problem, this suggests the following procedure for computing a local LIS given some truncation threshold $\tau_{\mathrm{loc}}$:
Problem 1 (Construction of the local likelihoodinformed subspace).
Given the Gauss-Newton Hessian $H(x)$ of the data-misfit function at a given $x$, find the eigendecomposition of the prior-preconditioned Gauss-Newton Hessian (ppGNH):
(8) $\Gamma_{\mathrm{pr}}^{1/2}\, H(x)\, \Gamma_{\mathrm{pr}}^{1/2}\, v_i = \lambda_i\, v_i, \qquad \lambda_1 \ge \lambda_2 \ge \cdots.$
Given a truncation threshold $\tau_{\mathrm{loc}}$, the local LIS is spanned by $\left\{ \Gamma_{\mathrm{pr}}^{1/2} v_1, \ldots, \Gamma_{\mathrm{pr}}^{1/2} v_l \right\}$, where $\{v_1, \ldots, v_l\}$ corresponds to the $l$ leading eigenvalues such that $\lambda_i \ge \tau_{\mathrm{loc}}$ for $i = 1, \ldots, l$.
For a direction with $\lambda_i = 1$, the local impacts of the likelihood and the prior are balanced. Thus, to retain a comprehensive set of likelihood-informed directions, we typically choose a truncation threshold $\tau_{\mathrm{loc}}$ less than 1.
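A minimal sketch of Problem 1 for a small dense problem, assuming the Jacobian `J` of the forward model is available explicitly (the function name and test matrices below are illustrative):

```python
import numpy as np

def local_lis(J, Gamma_obs_inv, Gamma_pr_half, tau=0.1):
    """Local LIS from the Gauss-Newton Hessian H(x) = J^T Gamma_obs^{-1} J.

    Returns the eigenpairs of the ppGNH whose eigenvalues exceed the threshold tau.
    Gamma_pr_half is a symmetric square root of the prior covariance.
    """
    H = J.T @ Gamma_obs_inv @ J
    ppgnh = Gamma_pr_half @ H @ Gamma_pr_half
    lam, V = np.linalg.eigh(ppgnh)
    order = np.argsort(lam)[::-1]           # descending eigenvalues
    lam, V = lam[order], V[:, order]
    keep = lam >= tau
    return lam[keep], V[:, keep]

# Hypothetical test: few observations of many parameters -> low-rank ppGNH
rng = np.random.default_rng(1)
J = rng.standard_normal((4, 20))            # local Jacobian of G at some x
lam, V = local_lis(J, np.eye(4), np.eye(20), tau=0.1)
assert len(lam) <= 4                        # at most rank(J) informed directions
```

The final assertion illustrates the point made in Section 3.1: with only four observations, at most four directions can be likelihood-informed, regardless of the parameter dimension.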
To extend the pointwise criterion (7) into a global criterion for likelihood-informed directions, we consider the expectation of the Rayleigh quotient over the posterior,
$$\mathbb{E}\left[ \tilde{\mathcal{R}}(v; x) \right] = \frac{v^\top S\, v}{v^\top v},$$
where $S$ is the expected ppGNH over the posterior,
(9) $S = \int_{\mathbb{R}^n} \Gamma_{\mathrm{pr}}^{1/2}\, H(x)\, \Gamma_{\mathrm{pr}}^{1/2}\, \pi(x \mid d)\, dx.$
Then we can naturally construct the global LIS through the eigendecomposition of $S$, as in the linear case. We consider approximating $S$ using the Monte Carlo estimator
$$\widehat{S} = \frac{1}{m} \sum_{k=1}^{m} \Gamma_{\mathrm{pr}}^{1/2}\, H\big(x^{(k)}\big)\, \Gamma_{\mathrm{pr}}^{1/2},$$
where $x^{(k)}$, $k = 1, \ldots, m$, are posterior samples. Since the local Hessian $H\big(x^{(k)}\big)$ is usually not explicitly available and is not feasible to store for large-scale problems, we use its prior-preconditioned low-rank approximation as defined in Problem 1. Thus the global LIS can be constructed by the following procedure:
Problem 2 (Construction of global likelihoodinformed subspace).
Suppose we have a set of posterior samples $x^{(k)}$, $k = 1, \ldots, m$, where for each sample $x^{(k)}$, the ppGNH is approximated by the truncated low-rank eigendecomposition
$$\Gamma_{\mathrm{pr}}^{1/2}\, H\big(x^{(k)}\big)\, \Gamma_{\mathrm{pr}}^{1/2} \approx V^{(k)}\, \Lambda^{(k)}\, \big(V^{(k)}\big)^\top,$$
obtained by solving Problem 1. We have $\lambda_i^{(k)} \ge \tau_{\mathrm{loc}}$ for all $i$ and all $k$. To construct the global LIS, we consider the eigendecomposition of the Monte Carlo estimator of the expected Hessian in (9), which takes the form
(10) $\widehat{S} = \frac{1}{m} \sum_{k=1}^{m} V^{(k)}\, \Lambda^{(k)}\, \big(V^{(k)}\big)^\top = \Psi\, \Sigma\, \Psi^\top.$
The global LIS has the non-orthogonal basis $\Theta_r = \Gamma_{\mathrm{pr}}^{1/2}\, \Psi_r$, where the eigenvectors $\Psi_r = [\psi_1, \ldots, \psi_r]$ correspond to the $r$ leading eigenvalues of (10), $\sigma_i \ge \tau_{\mathrm{g}}$ for $i = 1, \ldots, r$, for some truncation threshold $\tau_{\mathrm{g}}$. Here we choose $\tau_{\mathrm{g}}$ to be equal to the threshold $\tau_{\mathrm{loc}}$ in Problem 1.
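Problem 2 amounts to averaging the truncated local approximations and re-eigendecomposing. A dense-matrix sketch (illustrative only; a large-scale implementation would never form these matrices explicitly):

```python
import numpy as np

def global_lis(local_bases, tau=0.1):
    """Assemble the global LIS from truncated local ppGNH eigendecompositions.

    local_bases: list of (lam_k, V_k) pairs from the Problem-1 solves; the
    averaged low-rank approximations are re-eigendecomposed as in (10).
    """
    n = local_bases[0][1].shape[0]
    S_hat = np.zeros((n, n))
    for lam, V in local_bases:
        S_hat += V @ np.diag(lam) @ V.T
    S_hat /= len(local_bases)
    sig, Psi = np.linalg.eigh(S_hat)
    order = np.argsort(sig)[::-1]
    sig, Psi = sig[order], Psi[:, order]
    keep = sig >= tau
    return sig[keep], Psi[:, keep]

# Three identical rank-one local Hessians sharing one informed direction
n = 10
e1 = np.zeros((n, 1))
e1[0, 0] = 1.0
sig, Psi = global_lis([(np.array([5.0]), e1)] * 3)
assert np.isclose(sig[0], 5.0) and np.isclose(abs(Psi[0, 0]), 1.0)
```

In the toy check, all local subspaces agree, so the global LIS is one-dimensional and recovers the shared direction with its eigenvalue intact.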
In many applications we can only access the Gauss-Newton Hessian through its action on vectors, where each Hessian-vector product involves one forward model evaluation and one adjoint model evaluation. In this case, the ppGNH can be approximated by computing the truncated eigendecomposition (8) using either Krylov subspace algorithms [23] or randomized algorithms [24, 25].
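When only Hessian-vector products are available, a basic randomized range-finder in the spirit of [24, 25] can recover the leading eigenpairs. This is a generic sketch, not the authors' implementation:

```python
import numpy as np

def randomized_eigh(matvec, n, k, p=5, seed=0):
    """Approximate top-k eigenpairs of a symmetric PSD operator, given only
    its action v -> A v, via a randomized range finder with oversampling p."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, k + p))
    Y = np.column_stack([matvec(Omega[:, j]) for j in range(k + p)])
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis for the sampled range
    T = Q.T @ np.column_stack([matvec(Q[:, j]) for j in range(Q.shape[1])])
    lam, U = np.linalg.eigh(T)
    order = np.argsort(lam)[::-1][:k]
    return lam[order], Q @ U[:, order]

# Check on a small PSD operator with a known spectrum
A = np.diag([10.0, 5.0, 1.0, 0.1, 0.01])
lam, V = randomized_eigh(lambda v: A @ v, n=5, k=2)
assert np.allclose(lam, [10.0, 5.0], atol=1e-8)
```

Each call to `matvec` stands in for one forward plus one adjoint solve, so the cost of the sketch is $k + p$ model-pair evaluations per sample point.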
The number of samples required to construct the global LIS depends on the degree to which $H(x)$ (or its dominant eigenspace) varies over the posterior. In Section 3.5, we present an adaptive construction procedure that automatically explores the directional landscape of the likelihood.
3.3 Posterior approximation
By projecting the likelihood function onto the global likelihood-informed subspace (LIS), we obtain the approximate posterior
(11) $\tilde{\pi}(x \mid d) \propto L\left( P_r\, x; d \right) \pi_0(x),$
where $P_r$ is a projector onto the global LIS. This projector is self-adjoint with respect to the inner product induced by the prior precision matrix, and leads to a natural decomposition of the parameter space as $\mathbb{R}^n = \mathcal{X}_r \oplus \mathcal{X}_\perp$, where $\mathcal{X}_r$ is the LIS and $\mathcal{X}_\perp$ is the complement subspace (CS). This choice leads to a factorization of the prior distribution into the product of two distributions, one defined on the low-dimensional LIS and the other on the CS. This factorization is the key to our dimension reduction technique.
Definition 1.
We define the projectors $P_r$ and $P_\perp$, and a corresponding parameter decomposition, as follows:

(a) Suppose the LIS basis computed in Problem 2 is $\Theta_r = \Gamma_{\mathrm{pr}}^{1/2}\, \Psi_r$, where $\Psi_r$ is orthonormal. Define the matrix $\Xi_r = \Gamma_{\mathrm{pr}}^{-1/2}\, \Psi_r$ such that $\Xi_r^\top \Theta_r = I_r$. The projector $P_r$ has the form
$$P_r = \Theta_r\, \Xi_r^\top.$$

(b) Choose $\Psi_\perp$ such that $[\Psi_r\ \Psi_\perp]$ forms a complete orthonormal system in $\mathbb{R}^n$. Then the projector $P_\perp = I - P_r$ can be written as
$$P_\perp = \Theta_\perp\, \Xi_\perp^\top,$$
where $\Theta_\perp = \Gamma_{\mathrm{pr}}^{1/2}\, \Psi_\perp$ and $\Xi_\perp = \Gamma_{\mathrm{pr}}^{-1/2}\, \Psi_\perp$.

Naturally, the parameter $x$ can be decomposed as
$$x = P_r\, x + P_\perp\, x,$$
where each projection can be represented as a linear combination of the corresponding basis vectors. Consider the "LIS parameter" $x_r \in \mathbb{R}^r$ and the "CS parameter" $x_\perp \in \mathbb{R}^{n-r}$, which are the weights associated with the LIS basis $\Theta_r$ and CS basis $\Theta_\perp$, respectively. Then we can define the following pair of linear transformations between the parameter $x$ and $(x_r, x_\perp)$:
(12) $x_r = \Xi_r^\top\, x, \quad x_\perp = \Xi_\perp^\top\, x; \qquad x = \Theta_r\, x_r + \Theta_\perp\, x_\perp.$
Figure 1 illustrates the transformations between the parameter projected onto the LIS, $P_r\, x$, and the LIS parameter $x_r$. The same relation holds for the transformations between $P_\perp\, x$ and the CS parameter $x_\perp$. As Definition 1 makes clear, $P_r$ is an oblique projector.
Lemma 1.
Suppose we have $(x_r, x_\perp)$ as defined in Definition 1(b). Then the prior distribution can be decomposed as
$$\pi_0(x) = \pi_r(x_r)\, \pi_\perp(x_\perp),$$
where $\pi_r(x_r) = \mathcal{N}(0, I_r)$ and $\pi_\perp(x_\perp) = \mathcal{N}(0, I_{n-r})$, assuming (without loss of generality) a zero prior mean.
Following Definition 1(b) and Lemma 1, the approximate posterior distribution (11) can be reformulated as
$$\tilde{\pi}(x \mid d) \propto L\left( \Theta_r\, x_r; d \right) \pi_r(x_r)\, \pi_\perp(x_\perp).$$
Applying the linear transformation from $x$ to $(x_r, x_\perp)$ defined in Equation (12), we can rewrite the approximate posterior for the parameters $(x_r, x_\perp)$ as
(13) $\tilde{\pi}(x_r, x_\perp \mid d) = \hat{\pi}(x_r \mid d)\, \pi_\perp(x_\perp),$
which is the product of the reduced posterior
(14) $\hat{\pi}(x_r \mid d) \propto L\left( \Theta_r\, x_r; d \right) \pi_r(x_r)$
and the complement prior $\pi_\perp(x_\perp)$. To compute a Monte Carlo estimate of the expectation of a function over the approximate posterior distribution (13), we only need to sample the reduced posterior $\hat{\pi}(x_r \mid d)$, since properties of the Gaussian complement prior are known analytically.
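The factorization underlying Lemma 1 can be checked numerically: in the coordinates defined by a prior square root and an orthonormal basis, a zero-mean Gaussian prior has identity covariance, so the LIS and CS blocks are independent. A sketch with a random (hypothetical) prior covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T + n * np.eye(n)           # SPD prior covariance (hypothetical)
L = np.linalg.cholesky(Gamma_pr)             # a square root with Gamma_pr = L L^T

Psi = np.linalg.qr(rng.standard_normal((n, n)))[0]   # orthonormal [Psi_r, Psi_perp]
Xi = np.linalg.solve(L.T, Psi)               # dual basis, L^{-T} Psi

# Coordinates (x_r, x_perp) = Xi^T x of prior samples x ~ N(0, Gamma_pr)
X = (L @ rng.standard_normal((n, 100000))).T
Z = X @ Xi
C = np.cov(Z.T)

# Identity covariance in the joint coordinates: the LIS and CS blocks decouple
assert np.allclose(C, np.eye(n), atol=0.05)
```

Since the joint coordinates are uncorrelated Gaussians, any split of the columns of `Psi` into an LIS block and a CS block yields the product form used in (13).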
One can also combine MCMC samples from the reduced posterior with independent samples from the complement prior to provide samples that are approximately drawn from the full posterior $\pi(x \mid d)$. By correcting these samples via importance weights or a Metropolis scheme, one would then obtain a sampling algorithm for the original full-space posterior. This idea is not pursued further here; in the rest of this work we will emphasize the analytical properties of the complement prior $\pi_\perp(x_\perp)$, using them to reduce the variance of Monte Carlo estimators.
3.4 Reduced-variance estimators
Suppose we have a function $h(x)$ for which the conditional expectation over the approximate posterior (11),
(15) $\bar{h}(x_r) := \mathbb{E}\left[ h \mid x_r \right] = \int h\left( \Theta_r\, x_r + \Theta_\perp\, x_\perp \right) \pi_\perp(x_\perp)\, dx_\perp,$
can be calculated either analytically or through some highprecision numerical quadrature scheme. Then variance reduction can be achieved as follows:

Subspace MCMC. Use MCMC in the LIS to simulate a "subspace Markov chain" with target distribution (13). Any number of MCMC algorithms developed in the literature can be applied off-the-shelf, e.g., adaptive MCMC [26, 27, 28, 29, 30], the stochastic Newton algorithm of [19], or the Riemannian manifold algorithms of [31]. Since the dimension of the LIS can be quite small relative to the original parameter space, the subspace MCMC approach can yield lower sample correlations (better mixing) than applying any of these MCMC algorithms directly to the full posterior (2).

Rao-Blackwellization. We approximate the posterior expectation of $h$ by the expectation of the function $\bar{h}$ over the approximate posterior (13), i.e., $\mathbb{E}[h \mid d] \approx \mathbb{E}[\bar{h} \mid d]$. Given a set of subspace MCMC samples $\big\{ x_r^{(k)} \big\}_{k=1}^{N}$, where $x_r^{(k)} \sim \hat{\pi}(x_r \mid d)$, a Monte Carlo estimator of $\mathbb{E}[\bar{h} \mid d]$ is given by
(16) $\widehat{I}_N(\bar{h}) = \frac{1}{N} \sum_{k=1}^{N} \bar{h}\big( x_r^{(k)} \big).$ As an application of the Rao-Blackwellization principle (see [32] and references therein), the estimator (16) has a lower variance than the standard estimator
(17) $I_N(h) = \frac{1}{N} \sum_{k=1}^{N} h\big( x^{(k)} \big),$ where $x^{(k)} = \Theta_r\, x_r^{(k)} + \Theta_\perp\, x_\perp^{(k)}$ with $x_\perp^{(k)} \sim \pi_\perp(x_\perp)$.
This procedure mitigates many of the difficulties of posterior exploration in high dimensions, provided that the prior-to-posterior update is reasonably low rank. Variance reduction is achieved not only by increasing the effective sample size per MCMC iteration (via subspace MCMC), but also by reducing the variance of Monte Carlo estimators using Rao-Blackwellization. In effect, we argue that the Gaussian CS can be explored separately and via a calculation (15) that does not involve sampling.
Note that while this procedure necessarily reduces the variance of a Monte Carlo estimator, it introduces bias, since we replace the expectation over the full posterior with an expectation over the approximate posterior $\tilde{\pi}(x \mid d)$. Thus this variance reduction is particularly useful in situations where the variance of the estimator (17) derived from full-space MCMC samples is large compared with the bias, which is often the case for high-dimensional inverse problems.
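The variance reduction can be illustrated on a toy product-form target where the complement-space expectation is available exactly. This is a schematic stand-in for estimators (16) and (17), not the paper's PDE example:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy product-form target: reduced posterior x_r ~ N(1, 0.5^2) (a stand-in for
# subspace MCMC output) and complement prior x_perp ~ N(0, 1).
# Target quantity: E[x_r + x_perp] = 1 (additive case, with g the identity).
reps, N = 200, 500
std_err, rb_err = [], []
for _ in range(reps):
    xr = rng.normal(1.0, 0.5, N)
    xp = rng.standard_normal(N)
    std_err.append(np.mean(xr + xp) - 1.0)   # estimator (17): samples both blocks
    rb_err.append(np.mean(xr) - 1.0)         # estimator (16): uses E[x_perp] = 0 exactly

# Removing the complement-space sampling noise lowers the estimator variance
assert np.var(rb_err) < np.var(std_err)
```

Here both estimators are unbiased for the toy target, so the comparison isolates the variance effect; in the actual method the Rao-Blackwellized estimator also carries the projection bias discussed above.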
Beyond variance reduction, subspace MCMC offers several additional computational advantages over MCMC methods applied to the full posterior directly: (i) The storage requirement for saving subspace MCMC samples is much lower than that of an MCMC scheme that samples the full posterior. (ii) For MCMC methods whose proposal distributions involve operations with the square root of the prior covariance matrix (e.g., the stochastic Newton [19] and preconditioned Crank-Nicolson [12, 13] techniques), the computational cost of handling the full prior covariance can be much higher than the cost of handling the reduced prior $\pi_r(x_r)$, which has identity covariance.
The Monte Carlo estimator (16) can be further simplified if the function of interest can be expressed as either the product or the sum of two separate functions, $f$ and $g$, defined on the LIS and CS, respectively. In the multiplicative case $h(x) = f(P_r\, x)\, g(P_\perp\, x)$, the conditional expectation (15) can be written as
$$\bar{h}(x_r) = f(\Theta_r\, x_r)\; \mathbb{E}\left[ g(\Theta_\perp\, x_\perp) \right].$$
In the additive case $h(x) = f(P_r\, x) + g(P_\perp\, x)$, it can be written as
$$\bar{h}(x_r) = f(\Theta_r\, x_r) + \mathbb{E}\left[ g(\Theta_\perp\, x_\perp) \right].$$
Thus the expectation can be decomposed either into the product (in the multiplicative case) or the sum (in the additive case) of the pair of expectations
(18) $\mathbb{E}\left[ f \mid d \right] = \int f(\Theta_r\, x_r)\, \hat{\pi}(x_r \mid d)\, dx_r,$
(19) $\mathbb{E}\left[ g \right] = \int g(\Theta_\perp\, x_\perp)\, \pi_\perp(x_\perp)\, dx_\perp,$
which are associated with the LIS and CS, respectively. The expectation in (18) can be computed by the subspace MCMC methods described above, whereas the expectation in (19) is computed analytically or through high-order numerical integration.
Now we give two particularly useful examples of the analytical treatment of the complement space.
Example 1 (Reduced-variance estimator of the posterior mean).
Suppose we have obtained the empirical posterior mean $\hat{\mu}_r$ of the reduced parameter $x_r$ using subspace MCMC. Since the complement prior has zero mean, the resulting reduced-variance estimator of the posterior mean is
$$\hat{\mu} = \Theta_r\, \hat{\mu}_r.$$
Example 2 (Reduced-variance estimator of the posterior covariance).
Suppose we have the empirical posterior covariance $\widehat{\Sigma}_r$ of the reduced parameter $x_r$, estimated using subspace MCMC. Since the complement prior has identity covariance in the CS coordinates, the resulting reduced-variance estimator of the posterior covariance is
$$\widehat{\Gamma}_{\mathrm{pos}} = \Theta_r\, \widehat{\Sigma}_r\, \Theta_r^\top + \Theta_\perp\, \Theta_\perp^\top.$$
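The two estimators in Examples 1 and 2 are one-liners given the bases of Definition 1. The sketch below assumes a zero prior mean and dense (hypothetical) matrices; the sanity check verifies that an identity reduced covariance recovers the full prior covariance, as expected when the likelihood is uninformative:

```python
import numpy as np

def reduced_mean(Theta_r, mu_r):
    """Example 1: posterior mean from LIS samples only (zero-mean complement prior)."""
    return Theta_r @ mu_r

def reduced_cov(Theta_r, Sigma_r, Theta_p):
    """Example 2: posterior covariance = LIS part plus exact complement-prior part."""
    return Theta_r @ Sigma_r @ Theta_r.T + Theta_p @ Theta_p.T

# Sanity check on hypothetical bases built as in Definition 1
rng = np.random.default_rng(4)
n, r = 6, 2
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T + n * np.eye(n)
L = np.linalg.cholesky(Gamma_pr)
Psi = np.linalg.qr(rng.standard_normal((n, n)))[0]
Theta_r, Theta_p = L @ Psi[:, :r], L @ Psi[:, r:]

assert reduced_mean(Theta_r, np.ones(r)).shape == (n,)
# An identity reduced covariance recovers the full prior covariance exactly
assert np.allclose(reduced_cov(Theta_r, np.eye(r), Theta_p), Gamma_pr)
```

Only $\widehat{\Sigma}_r$, an $r \times r$ matrix, needs to be estimated by sampling; the $\Theta_\perp \Theta_\perp^\top$ term is fixed once the LIS is built.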
3.5 Algorithms for the LIS
Constructing the global LIS requires a set of posterior samples. Since the computational cost of solving Problem 1 for any sample is much greater than the cost of evaluating the forward model, we wish to limit the number of samples used in Problem 2 while ensuring that we adequately capture the posterior variation of the ppGNH. Thus we choose samples using the following adaptive procedure.
Algorithm 1 (Global LIS construction using subspace MCMC).
First, compute the posterior mode $x_{\mathrm{MAP}}$. Set the initial sample set for Problem 2 to $\{x_{\mathrm{MAP}}\}$. Solve Problem 2 to find $\Psi_r$, the initial LIS basis $\Theta_r$, and its left-inverse $\Xi_r^\top$.

Subchain simulation. Simulate the subspace MCMC chain in the current LIS for enough iterations that the last state of the chain, denoted by $x_r^{(j)}$, is uncorrelated with its initial state. Then this state transformed back to the original parameter space, $x^{(j)} = \Theta_r\, x_r^{(j)}$, is used as the next sample point. Enrich the sample set with $x^{(j)}$.

LIS construction. Solve Problem 2 with the enriched sample set. Then update the LIS basis $\Theta_r$ and its left-inverse $\Xi_r^\top$. Set the initial state of the next subspace MCMC chain to the projection $\Xi_r^\top\, x^{(j)}$ of the latest sample onto the updated LIS, and repeat from the subchain simulation step until the convergence criterion below is satisfied.
The convergence criterion for Algorithm 1 is based on an incremental distance between successive likelihood-informed subspaces. This distance penalizes changes in the dominant directions (those with large eigenvalues $\sigma_i$) more heavily than changes in the less important directions (those with smaller $\sigma_i$).
Definition 2 (Weighted subspace distance).
At iteration $j$, define the basis/weights pair $\big( \Psi_r^{(j)}, D^{(j)} \big)$, where $\Psi_r^{(j)}$ is the orthonormal LIS basis from Problem 2 and $D^{(j)}$ is a diagonal matrix consisting of the normalized weights
$$d_i^{(j)} = \frac{\sigma_i^{(j)}}{\sum_k \sigma_k^{(j)}},$$
computed from the eigenvalues $\sigma_i^{(j)}$ of Problem 2. For two adjacent steps $j$ and $j+1$, we compute the weighted subspace distance of [33], which has the form
(20) $\delta(j, j+1) = \left\| \Psi_r^{(j)}\, D^{(j)}\, \big( \Psi_r^{(j)} \big)^\top - \Psi_r^{(j+1)}\, D^{(j+1)}\, \big( \Psi_r^{(j+1)} \big)^\top \right\|_F.$
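One concrete way to evaluate a weighted distance of this form is as the Frobenius distance between eigenvalue-weighted projectors; the sketch below is consistent with Definition 2 but is not necessarily the exact formula of [33]:

```python
import numpy as np

def weighted_subspace_distance(Psi_a, w_a, Psi_b, w_b):
    """Frobenius distance between eigenvalue-weighted projectors onto two subspaces.

    Psi_*: orthonormal bases (columns); w_*: normalized weights d_i.
    A sketch consistent with Definition 2, not necessarily the formula of [33].
    """
    Pa = Psi_a @ np.diag(w_a) @ Psi_a.T
    Pb = Psi_b @ np.diag(w_b) @ Psi_b.T
    return np.linalg.norm(Pa - Pb, "fro")

# Identical subspaces with identical weights are at distance zero
Psi = np.linalg.qr(np.random.default_rng(5).standard_normal((8, 3)))[0]
w = np.array([0.6, 0.3, 0.1])
assert weighted_subspace_distance(Psi, w, Psi, w) == 0.0
```

Because the weights scale each rank-one term, a rotation within the dominant directions moves the distance far more than the same rotation within weakly informed directions, which is exactly the behavior the convergence diagnostic needs.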
Note that in the subchain simulation step of Algorithm 1, we construct the global LIS by always sampling in an adaptively enriched subspace. This offers computational benefits, since the MCMC exploration is always confined to a lower-dimensional space. However, a potential problem with this approach is that it might ignore some directions that are also data-informed. A more conservative approach would be to introduce a conditional update at the end of each subchain simulation: perform Metropolized independence sampling in the current CS using the complement prior as the proposal. This would enable the subchain to explore the full posterior, but would result in higher-dimensional sampling when constructing the LIS. In our numerical examples, described below, no conditional updates were required for good performance; constructing the LIS using samples from the full posterior and using the subspace approach gave essentially the same results. Of course, one could also simply employ a standard MCMC algorithm to sample the full posterior, and then construct the LIS using the resulting posterior samples. However, the efficiency of the MCMC algorithm in this case will be affected by the dimensionality of the problem.
4 Example 1: Elliptic PDE
Our first example is an elliptic PDE inverse problem, used to demonstrate (i) construction of the LIS and the impact of mesh refinement; (ii) the application of low-rank posterior mean and variance estimators; and (iii) changes in the LIS with varying amounts of observational data.
4.1 Problem setup
Consider the problem domain $\Omega$, with boundary $\partial\Omega$. We denote the spatial coordinate by $s \in \Omega$. Consider the permeability field $\kappa(s)$, the pressure field $p(s)$, and the sink/source term $f(s)$. The pressure field for a given permeability and source/sink configuration is governed by the Poisson equation
(21) $\nabla \cdot \big( \kappa(s)\, \nabla p(s) \big) = f(s), \quad s \in \Omega, \qquad \big( \kappa(s)\, \nabla p(s) \big) \cdot \vec{n}(s) = 0, \quad s \in \partial\Omega,$
where $\vec{n}(s)$ is the outward normal vector on the boundary. To make a well-posed boundary value problem, a further boundary condition
(22) $\int_{\partial\Omega} p(s)\, ds = 0$
is imposed. The source/sink term $f(s)$ is defined by the superposition of four weighted Gaussian plumes with prescribed spatial widths, centered at the four corners of the domain. The system of equations (21) is solved by the finite element method with bilinear elements.
The discretized permeability field is endowed with a log-normal prior distribution, i.e.,
(23) $\kappa(s) = \exp\big( x(s) \big), \qquad x \sim \mathcal{N}(0, \Gamma_{\mathrm{pr}}),$
where the covariance matrix $\Gamma_{\mathrm{pr}}$ is defined through an anisotropic exponential covariance kernel
(24) $c(s, s') = \sigma_0^2 \exp\left( - \frac{\sqrt{(s - s')^\top A^{-1} (s - s')}}{l_0} \right)$
for $s, s' \in \Omega$, where $A$ is the anisotropic correlation tensor, $\sigma_0$ is the prior standard deviation, and $l_0$ is the correlation length; all three are set to fixed values in this example. The "true" permeability field is a realization from the prior distribution. The true permeability field, the sources/sinks, the simulated pressure field, and the synthetic data are shown in Figure 2.
Partial observations of the pressure field are collected at measurement sensors, shown as black dots in Figure 2(c). The observation operator $C$ is simply the corresponding "mask" operation. This yields observed data
$$d = C\, p + e,$$
with additive error $e \sim \mathcal{N}(0, \sigma_e^2 I)$. The standard deviation $\sigma_e$ of the measurement noise is prescribed so that the observations have a signal-to-noise ratio of 10. The noisy data are shown in Figure 2(d).
Figure 3 shows two draws from the prior distribution, the eigenspectrum of the prior covariance, and the cumulative prior variance (i.e., the running sum of the prior covariance eigenvalues). A large number of eigenmodes is required to retain most of the energy in the prior. Because of this slow decay of the prior covariance spectrum, a priori dimension reduction based on a truncated eigendecomposition of the prior covariance (as described in [16]) would be very inefficient for this problem. Information carried in high-frequency eigenfunctions cannot be captured unless an enormous number of prior modes are retained; thus, a better basis is required.
4.2 LIS construction
Now we demonstrate the process of LIS construction using Algorithm 1, and the structure of the LIS under mesh refinement. To compute the LIS, we run Algorithm 1 using adaptive MALA [28] to simulate each subchain. We choose equal truncation thresholds $\tau_{\mathrm{loc}} = \tau_{\mathrm{g}}$. To explore the dimensionality and structure of the LIS versus mesh refinement, we carry out the same numerical experiment on a coarse grid, an intermediate grid, and a fine grid. The dimension of the LIS versus the number of iterations, the evolution of the convergence diagnostic (20), and the generalized eigenvalues after 500 iterations, for each level of grid refinement, are shown in Figure 4. Also, Figure 5 shows the first five LIS basis vectors for each level of discretization.
As shown in Figure 4(a), the dimension of the LIS changes rapidly in the first 100 iterations and then stabilizes. This change in dimension reflects the fact that the log-likelihood Hessian varies locally in this non-Gaussian problem. We also observe that the coarsest grid has a slightly larger final LIS dimension than the two refined grids. This effect may be ascribed to larger discretization errors on the coarsest grid.
The weighted distance (20) between each adjacent pair of likelihood-informed subspaces is used as the convergence diagnostic during the construction process. With any of the three discretizations, the weighted subspace distance at the end of the adaptive construction is several orders of magnitude lower than at the beginning, as shown in Figure 4(b). We also observe that the rates of convergence of this diagnostic are comparable for all three levels of discretization. These figures suggest that while local variation of the Hessian is important in this problem (e.g., the dimension of the LIS doubles over the course of the iterations), much of this variation is well-explored after 100 or 200 iterations of Algorithm 1.
Since the forward model converges under grid refinement, we expect the associated LIS to converge as well. The generalized eigenvalues for all three grids are shown in Figure 4(c); the spectra associated with the three subspaces have very similar values. As shown in Figure 5, the leading LIS basis vectors also have similar shapes for all three levels of grid refinement. Refinement introduces slightly more fine-scale structure into the basis vectors, but the overall mode shapes are very close.
4.3 Estimation of the posterior mean and variance
With an LIS in hand, we apply the variance reduction procedure described in Section 3.3 to estimate the posterior mean and variance of the permeability field. Calculations in this subsection use a fixed discretization of the PDE and the inversion parameters.
We first demonstrate the sampling performance of subspace MCMC, where we use adaptive MALA [28] to sample the LIS-defined reduced posterior (14). We compare the results of subspace MCMC with those of Hessian-preconditioned Langevin MCMC applied to the full posterior (2) (referred to hereafter as "full-space MCMC"). The latter scheme results from an explicit discretization of the Langevin SDE, preconditioned by the inverse of the log-posterior Hessian evaluated at the posterior mode (see [34] for details). Note that we cannot precondition the full-dimensional Langevin SDE by the empirical posterior covariance, as in adaptive MALA, because of the high parameter dimension. In this setup, subspace MCMC and full-space MCMC require the same number of forward model and gradient evaluations for a given number of MCMC iterations.
To examine sampling performance, we use four benchmarks: the autocorrelation of the log-likelihood function and the autocorrelations of the parameters projected onto the first, third, and fifth LIS basis vectors. These results are shown in Figure 6. We run both algorithms for the same number of iterations and discard the first half of each chain as burn-in. The top row of Figure 6 shows these benchmarks for both samplers. For all four benchmarks, subspace MCMC produces a faster decay of autocorrelation as a function of sample lag, i.e., a lower correlation between samples after any given number of MCMC steps.
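The autocorrelation benchmarks can be computed with the standard empirical estimator; a minimal sketch:

```python
import numpy as np

def autocorr(x, max_lag):
    """Empirical autocorrelation of a scalar MCMC trace x,
    for lags 0..max_lag (normalized so that lag 0 equals 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)          # lag-0 autocovariance
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * c0)
                     for k in range(max_lag + 1)])
```

Applied to the log-likelihood trace, or to the chain projected onto a LIS basis vector, faster decay of this function indicates better mixing.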
Furthermore, as discussed in Section 3.3, even though subspace MCMC and full-space MCMC require the same number of forward model evaluations for a given number of samples, the cost of operations involving the square root of the prior covariance (used in sampling and in evaluating the proposal distribution) can be much higher for full-space MCMC than for subspace MCMC. In this test case, running subspace MCMC cost substantially less CPU time than running full-space MCMC for the same number of iterations. To incorporate this cost difference, the second row of Figure 6 shows the autocorrelation of the four benchmark quantities as a function of CPU time rather than sample lag. Here, we immediately observe that the autocorrelation per unit of CPU time is further reduced by using subspace MCMC.
Of course, recall that constructing the LIS with Algorithm 1 carries its own cost, but this is only a small fraction of the time required to run full-space MCMC. Therefore subspace MCMC, including the cost of LIS construction, takes less time to produce a given number of samples than full-space MCMC, and these samples are less correlated, i.e., of higher quality.
We now compare reduced-variance estimates of the posterior mean and variance (obtained with subspace MCMC) with estimates obtained via full-space MCMC. The results are shown in Figures 7 and 8. Full-space MCMC and subspace MCMC yield very similar mean and variance estimates. Figures 8(a) and (b) distinguish the two components of the Rao-Blackwellized variance estimates described in Example 2. Variance in the LIS, shown in Figure 8(a), is estimated from MCMC samples, while variance in the CS, shown in Figure 8(b), is calculated analytically from the prior and the LIS projector. The sum of these two variance fields is shown in Figure 8(c), and it is nearly identical to the full-space result in Figure 8(d). In the central part of the domain, where no measurement sensors are installed, the variance is larger in the CS than in the LIS, and hence this part of the domain is prior-dominated. In the right part of the domain, which is covered by observations, the variance is less prior-dominated.
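The decomposition of the pointwise variance into a sampled LIS part and an analytic CS part can be sketched as follows. This sketch assumes an orthonormal LIS basis and a Euclidean projector; the paper's construction uses prior-weighted quantities, so this illustrates the decomposition only:

```python
import numpy as np

def rb_variance(theta_lis, Phi, Gamma):
    """Sketch of the Rao-Blackwellized pointwise variance estimate.
    theta_lis: (m, r) MCMC samples of the LIS coefficients
    Phi:       (n, r) LIS basis, orthonormal columns assumed here
    Gamma:     (n, n) prior covariance
    Returns the LIS and CS contributions to the variance field."""
    S = np.atleast_2d(np.cov(theta_lis, rowvar=False))  # r x r sample cov
    var_lis = np.einsum('ij,jk,ik->i', Phi, S, Phi)     # diag(Phi S Phi^T)
    P = Phi @ Phi.T                                     # projector onto LIS
    I = np.eye(len(Gamma))
    var_cs = np.diag((I - P) @ Gamma @ (I - P).T)       # analytic prior part
    return var_lis, var_cs      # total pointwise variance: var_lis + var_cs
```

Only the first term carries Monte Carlo error; the complement-space term is exact, which is the source of the variance reduction.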
4.4 The influence of data
The amount of information carried in the data affects the dimension and structure of the LIS. To demonstrate the impact of the data, we design a case study in which different likelihood-informed subspaces are constructed under various observational scenarios. The same stationary groundwater problem defined in Section 4.1 is employed here. For computational efficiency, the problem domain is discretized with a slightly coarser mesh, and to provide a stronger impulse to the groundwater system, the source/sink terms differ from those used in Sections 4.1–4.3. Along the boundary of the domain, we evenly distribute a set of sources; two sinks are placed in the interior of the domain. We distribute sensors evenly over the domain and incrementally refine the sensor distribution, which results in four different data sets containing noisy readings from an increasing number of sensors. The true permeability field, the sources/sinks, the simulated pressure field, and the sensor distributions are shown in Figure 9.
As in Section 4.2, we run Algorithm 1 to construct an LIS for each data set. For data sets 1–4, the dimension of the resulting LIS grows with the number of sensors. The generalized eigenvalue spectrum for each LIS is shown in Figure 10. We note that the eigenvalues decay more slowly with increasing amounts of data. This behavior is expected: since the generalized eigenvalues reflect the impact of the likelihood relative to the prior, more data should lead to more directions in which the likelihood dominates the prior.
Since the sensors for all four data sets occupy the same area of the spatial domain, we expect the four likelihood-informed subspaces to share a similar low-frequency structure, while the high-frequency structure of each LIS may differ under refinement of the sensor distribution. Thus the LIS basis vectors corresponding to the largest eigenvalues should share a similar pattern, while those corresponding to relatively small eigenvalues might differ. We observe this effect in the numerical experiments carried out here; Figure 11 shows the first and fifteenth LIS basis vectors for each of the data sets.
5 Example 2: atmospheric remote sensing
In this section, we apply the dimension reduction approach to a realistic atmospheric satellite remote sensing problem: inverting the concentrations of various gases in the atmosphere from measurements made by the GOMOS (Global Ozone Monitoring System) satellite instrument.
GOMOS is an instrument on board ESA's Envisat satellite, and was operational for about 10 years before contact with the satellite was lost in May 2012. The GOMOS instrument performs so-called star occultation measurements: it measures, at different wavelengths, the absorption of starlight as it travels through the atmosphere. Different gases in the atmosphere (such as ozone, nitrogen dioxide, and aerosols) leave fingerprints in the measured intensity spectra. The task of the inversion algorithm is to infer the concentrations of these gases from the measurements.
The GOMOS inverse problem is known to be ill-posed: the intensity spectra may contain strong information about major gases (such as ozone) at some altitudes, whereas some minor constituents (such as aerosols) at some altitudes may be practically unidentifiable and entirely described by the prior. Thus the GOMOS problem is a good candidate for our approach: the dimension of the likelihood-informed subspace is expected to be small and the prior contribution large.
Next, we briefly present the GOMOS model and the inverse problem setup. For more details about the GOMOS instrument and the Bayesian treatment of the inverse problem, see [35] and the references therein.
5.1 The GOMOS model
The GOMOS instrument repeatedly measures light intensities $I(\lambda)$ at different wavelengths $\lambda$. First, a reference intensity spectrum $I_{\mathrm{ref}}(\lambda)$ is measured above the atmosphere. The so-called transmission spectrum is then defined as $T(\lambda) = I(\lambda)/I_{\mathrm{ref}}(\lambda)$. The transmission measured at wavelength $\lambda$ along a ray path $\ell$ is modelled using Beer's law:
$T_{\lambda,\ell} = \exp\big(-\sum_{g} \int_{\ell} C^{g}(\lambda, z(s))\, \rho^{g}(z(s))\, ds\big)$  (25)
where $\rho^{g}(z)$ is the density of gas $g$ (the unknown parameter) at tangential height $z$. The so-called cross-sections $C^{g}(\lambda, z)$, known from laboratory measurements, define how much a gas absorbs light at a given wavelength.
To approximate the integrals in (25), the atmosphere is discretized. The geometry used for inversion resembles an onion: the gas densities are assumed to be constant within spherical layers around the Earth. The GOMOS measurement principle is illustrated in Figure 12 below.
Here, we assume that the cross-sections do not depend on height. In the inverse problem we have $n_{\mathrm{gas}}$ gases, $n_{\lambda}$ wavelengths, and the atmosphere is divided into $n_{\mathrm{alt}}$ layers. The discretization is fixed so that the number of measurement lines equals the number of layers. Approximating the integrals in (25) by sums over the chosen grid, and combining information from all lines and all wavelengths, we can write the model in matrix form as follows:
$T = \exp(-G X C)$  (26)
where $T \in \mathbb{R}^{n_{\mathrm{alt}} \times n_{\lambda}}$ are the modelled transmissions (with the exponential applied elementwise), $C \in \mathbb{R}^{n_{\mathrm{gas}} \times n_{\lambda}}$ contains the cross-sections, $X \in \mathbb{R}^{n_{\mathrm{alt}} \times n_{\mathrm{gas}}}$ contains the unknown densities, and $G \in \mathbb{R}^{n_{\mathrm{alt}} \times n_{\mathrm{alt}}}$ is the geometry matrix that contains the lengths of the lines of sight in each layer.
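A direct implementation of the matrix model (26) is straightforward; the symbol names and shapes below are our reading of the text:

```python
import numpy as np

def modelled_transmissions(G, X, C):
    """Discretized Beer's law: elementwise exponential of the summed
    optical depths along each line of sight.
    G: (n_lines, n_layers) path lengths of each line in each layer
    X: (n_layers, n_gas)   unknown gas densities
    C: (n_gas, n_lambda)   absorption cross-sections
    Returns T with shape (n_lines, n_lambda)."""
    return np.exp(-G @ X @ C)
```
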
Computationally, it is convenient to deal with vectors of unknowns. We vectorize the above model using the identity $\mathrm{vec}(G X C) = (C^{\top} \otimes G)\,\mathrm{vec}(X)$, where $\otimes$ denotes the Kronecker product and $\mathrm{vec}(\cdot)$ is the standard vectorization obtained by stacking the columns of the matrix argument on top of each other. Thus, the likelihood model is written in vector form as follows:
$y = \exp\big(-(C^{\top} \otimes G)\,\mathrm{vec}(X)\big) + \varepsilon$  (27)
where $\varepsilon$ is the measurement error, for which we use an independent Gaussian model with known variances.
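The vectorization identity underlying (27) is easy to verify numerically; the dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 4))   # geometry matrix (lines x layers)
X = rng.standard_normal((4, 3))   # densities (layers x gases)
C = rng.standard_normal((3, 2))   # cross-sections (gases x wavelengths)

# vec(G X C) = (C^T kron G) vec(X), with column-major (Fortran) stacking
lhs = (G @ X @ C).flatten(order='F')
rhs = np.kron(C.T, G) @ X.flatten(order='F')
assert np.allclose(lhs, rhs)
```

Note that the column-stacking convention matters: with row-major flattening the identity would instead involve $G \otimes C^{\top}$.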
Note that, in principle, the model (27) could be linearized by taking logarithms of both sides, which is usually done for such tomography problems (like Xray computerized tomography). For this problem, linearisation can cause problems, since the signal from the star is often smaller compared to the background noise in the measurement.
5.2 Data and prior
Here, we generate synthetic data by solving the forward model (27) with known gas densities. In this example, we have 4 gas profiles to be inverted. The atmosphere is discretized into 50 layers, so the total dimension of the problem is 200. The simulated data are illustrated in Figure 13.
We estimate the log-profiles of the gases instead of the densities directly. We place a Gaussian process prior on the log-profile of each gas $g$, so that $\log \rho^{g} \sim N(\mu_{g}, \Sigma_{g})$, where $\rho^{g}$ denotes the elements of the density vector corresponding to gas $g$. The elements of the $50 \times 50$ covariance matrices $\Sigma_{g}$ are calculated from the squared exponential covariance function
$[\Sigma_{g}]_{ij} = \sigma_{g}^{2} \exp\big(-\frac{(z_i - z_j)^2}{2 L_{g}^{2}}\big)$  (28)
where the hyperparameters (prior variances and correlation lengths) are chosen to promote smooth profiles and to give a rough idea of the magnitude of the density values. The prior is illustrated in Figure 14.
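A sketch of building one of the prior covariance blocks from the squared exponential covariance function (28); the hyperparameter values are illustrative, since the values used in the paper are not reproduced here:

```python
import numpy as np

def se_covariance(z, sigma2, length, nugget=1e-10):
    """Squared exponential covariance matrix on altitude grid z.
    sigma2 is the prior variance, length the correlation length.
    A tiny nugget keeps the matrix numerically positive definite."""
    d = z[:, None] - z[None, :]
    K = sigma2 * np.exp(-0.5 * d**2 / length**2)
    return K + nugget * np.eye(len(z))

z = np.linspace(0.0, 50.0, 50)          # 50 atmospheric layers
Sigma = se_covariance(z, sigma2=1.0, length=10.0)
L = np.linalg.cholesky(Sigma)           # usable for prior sampling
```

Larger correlation lengths enforce smoother profiles, which is how the prior promotes smoothness here.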
5.3 Inversion results
In this particular synthetic example, we know that gas 1 is very well identified by the data. The data also contain information about gases 2 and 3 at some altitudes. Gas 4, on the other hand, is totally unidentified by the data.
The LIS is constructed with 200 iterations of Algorithm 1, starting from the Hessian at the posterior mode. The subspace convergence diagnostic and the generalized eigenvalues are shown in Figure 15. With the chosen truncation thresholds, the final dimension of the LIS was 22.
We draw samples both in the LIS and in the full 200-dimensional space using the Hessian-preconditioned MALA algorithm. In Figure 16, the first two columns show the mean gas profiles and the mean ±1 and ±2 standard deviations in the LIS and in the complement space (CS). The third column shows the combined posterior from the LIS and the CS; for comparison, results from full-space MCMC are shown in the fourth column. Note the different scales on the horizontal axes throughout the figure. We observe that the subspace approach, where MCMC is applied only in a 22-dimensional space, yields results very similar to those of full-space MCMC. In addition, comparing the contributions of the LIS and the CS indicates that gas 1 is dominated by the likelihood, whereas the posterior distribution of gas 4 is entirely determined by the prior: the CS contribution for gas 1 is tiny (note the scale), while the LIS contribution for gas 4 is also very small. For gases 2 and 3, the lower altitudes are likelihood-dominated, while the higher altitudes receive a larger contribution from the prior. The full-space MCMC results for gas 4 show a slightly nonuniform mean, but this appears to be the result of sampling variance. By avoiding sampling altogether in the CS, the subspace approach most likely yields a more accurate posterior in this case.
To further illustrate the approach, we plot the first six basis vectors of the LIS in Figure 17. The first basis vectors mainly include features of gas 1, which is the gas most informed by the data, along with some features of gases 2 and 3 at lower altitudes. Gas 4 does not appear in the LIS at all.
The dimension reduction obtained via the subspace approach is expected to yield better mixing than full-space MCMC. For the GOMOS case, the chain autocorrelations for subspace and full-space MCMC are compared in Figure 18. The subspace sampler shows much faster decay of the autocorrelations than full-space MCMC.
In this test case, subspace MCMC also has a lower computational cost than full-space MCMC: for the same number of iterations, it consumed less CPU time. The CPU time reduction is not as significant as in the elliptic example, because the prior covariance here is a $200 \times 200$ matrix, much smaller than the covariance matrix used in the elliptic example. The cost of constructing the LIS with Algorithm 1 is again only a small fraction of the CPU time used to run full-space MCMC.
6 Conclusions
In this paper, we present a new approach for dimension reduction in nonlinear inverse problems with Gaussian priors. Our approach is based on dividing the parameter space into two subspaces: a likelihood-informed subspace (LIS), where the likelihood has a much greater influence on the posterior than the prior distribution, and the complement to the LIS, where the Gaussian prior dominates. We explore the posterior projected onto the LIS (the "difficult" and non-Gaussian part of the problem) with Markov chain Monte Carlo while treating the complement space as exactly Gaussian. This approximation allows us to analytically integrate many functions over the complement space when estimating their posterior expectations; the result is a Rao-Blackwellization or derandomization procedure that can greatly reduce the variance of posterior estimates. Particularly in inverse problems, where information in the data is often limited and the solution relies heavily on priors, the dimension of the LIS is expected to be small, and the majority of the directions in the parameter space can be handled analytically.
The dimension reduction approach is based on theory developed for the linear case; in [15] it is shown that in linear-Gaussian problems, the eigendecomposition of the prior-preconditioned log-likelihood Hessian yields an optimal low-rank update from the prior to the posterior, which can be interpreted in terms of a projector whose range is the LIS. Here, we generalize the approach to nonlinear problems, where the log-likelihood Hessian varies over the parameter space. Our solution is to construct many local likelihood-informed subspaces over the support of the posterior and to combine them into a single global LIS. We show how the global LIS can be constructed efficiently in an adaptive manner, starting with the LIS computed at the posterior mode and iteratively enriching the global LIS until a weighted subspace convergence criterion is met.
We demonstrate the approach with two numerical examples. The first is an elliptic PDE inverse problem, based on a simple model of subsurface flow. Though the dimension of the parameter space in our experiments ranges from 1200 to 10800, the dimension of the LIS remains only around 20 and is empirically discretization-invariant. Exploring the LIS by MCMC and treating the Gaussian complement analytically produces mean and variance fields very similar to those computed via MCMC in the full space, yet the mixing properties and computational cost of MCMC in the LIS are dramatically improved over those of full-space MCMC. Our second demonstration is an atmospheric remote sensing problem, where the goal is to infer the concentrations of chemical species in the atmosphere using star occultation measurements, as on the satellite-borne GOMOS instrument. The dimension of the full problem used here was 200 (four gaseous species and 50 altitudes for each), while the dimension of the LIS was 22. Again, dimension reduction significantly improves the mixing properties of MCMC without sacrificing accuracy.
To conclude, our dimension reduction approach appears to offer an efficient way to probe and exploit the structure of nonlinear inverse problems in order to perform Bayesian inference at large scale, where standard algorithms are plagued by the curse of dimensionality. The approach also opens up interesting further research questions: it may be useful, for instance, to apply reduced-order and surrogate modeling techniques in the LIS, making them applicable to much larger problems than before.
Footnotes
The dimension of the global LIS can vary at each iteration. To be precise, the notation for the LIS basis and projector should therefore carry an iteration index, but for brevity we suppress it when possible.
References
 A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, Philadelphia, 2005.
 J. P. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer, New York, 2004.
 A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
 W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice, volume 2. CRC Press, 1996.
 J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, New York, 2001.
 S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors. Handbook of Markov Chain Monte Carlo. Taylor & Francis, 2011.
 G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.
 G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60:255–268, 1998.
 G. O. Roberts and J. S. Rosenthal. Optimal scaling of various MetropolisHastings algorithms. Statistical Science, 16(4):351–367, 2001.
 J. C. Mattingly, N. Pillai, and A. M. Stuart. Diffusion limits of the random walk Metropolis algorithm in high dimensions. Annals of Applied Probability, 22:881–930, 2012.
 N. S. Pillai, A. M. Stuart, and A. H. Thiery. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Annals of Applied Probability, 22:2320–2356, 2012.
 A. Beskos, G. O. Roberts, A. M. Stuart, and J. Voss. MCMC methods for diffusion bridges. Stochastic Dynamics, 8(3):319–350, 2008.
 S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28:424–446, 2013.
 H. P. Flath, L. C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas. Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations. SIAM Journal on Scientific Computing, 33(1):407–432, 2011.
 A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximation of linear Bayesian inverse problems. arXiv preprint, arXiv:1407.3463, 2014.
 Y. M. Marzouk and H. N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228:1862–1902, 2009.
 K. Karhunen. Über lineare methoden in der wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.Phys, 37:1–79, 1947.
 M. Loève. Probability Theory, Vol. II, volume 46 of Graduate Texts in Mathematics. Springer-Verlag, Berlin, 4th edition, 1978.
 J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for largescale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
 N. Petra, J. Martin, G. Stadler, and O. Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, to appear, 2014. arXiv:1308.6221.
 M. Lassas, E. Saksman, and S. Siltanen. Discretization invariant Bayesian inversion and Besov space priors. Inverse Problems and Imaging, 3(1):87–122, 2009.
 W. Förstner and B. Moonen. A metric for covariance matrices. In Geodesy: The Challenge of the 3rd Millennium, pages 299–309. Springer, Berlin Heidelberg, 2003.
 G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 2012.
 N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
 E. Liberty, F. Woolfe, P. G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the lowrank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
 H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
 C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
 Y. F. Atchade. An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8(2):235–254, 2006.
 G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475, 2007.
 G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
 M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
 G. Casella and C. P. Robert. RaoBlackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
 F. Li, Q. Dai, W. Xu, and G. Er. Weighted subspace distance and its applications to object recognition and retrieval with image sets. IEEE Signal Processing Letters, 16(3):227–230, 2009.
 T. Cui, K. J. H. Law, and Y. M. Marzouk. Dimension-independent likelihood-informed MCMC. arXiv preprint, 2014.
 H. Haario, M. Laine, M. Lehtinen, E. Saksman, and J. Tamminen. Markov chain Monte Carlo methods for high dimensional inversion in remote sensing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66:591–608, 2004.