Fast computation of Uncertainty Quantification measures in the Geostatistical approach to solve Inverse Problems
Abstract
We consider the computational challenges associated with uncertainty quantification in parameter estimation problems, such as the estimation of seismic slowness and hydraulic transmissivity fields. The reconstruction of these parameters can be mathematically described as an inverse problem, which we tackle using the geostatistical approach. The quantification of uncertainty in the geostatistical approach involves computing the posterior covariance matrix, which is prohibitively expensive to compute and store in full. We consider an efficient representation of the posterior covariance matrix at the maximum a posteriori (MAP) point as the sum of the prior covariance matrix and a low-rank update that contains information from the dominant generalized eigenmodes of the data-misfit part of the Hessian and the inverse prior covariance matrix. The rank of the low-rank update is typically independent of the dimension of the unknown parameter. The cost of our method scales nearly linearly with $m$, the dimension of the unknown parameter space. Furthermore, we show how to efficiently compute measures of uncertainty that are based on scalar functions of the posterior covariance matrix. The performance of our algorithms is demonstrated by application to model problems in synthetic travel-time tomography and steady-state hydraulic tomography. We explore the dependence of the accuracy of the posterior covariance approximation on different experimental parameters, and show that the cost of approximating the posterior covariance matrix depends on the problem size but is not sensitive to the other experimental parameters.
1 Introduction
One of the central challenges in the geosciences is to develop computationally efficient statistical methods for optimizing the use of limited and noisy environmental data to accurately estimate heterogeneous subsurface geological properties, and to quantify the corresponding predictive uncertainty. Mathematically, imaging can be performed using the theory of inverse problems, which uses measurements to infer system parameters. Efficient algorithms for inverse problems are necessary to solve problems of realistic size, quantified by the spatial resolution of the reconstructed parameters and the number of measurements available for reconstruction. Using such algorithms, scientists can gain better knowledge of soil moisture content, the porosity of geologic formations, distributions of dissolved pollutants, and the locations of oil deposits or buried liquid contaminants. These detailed images can then be used to better locate natural resources, treat pollution, and monitor underground networks associated with geothermal plants, nuclear waste repositories, and carbon dioxide sequestration sites. We solve these problems by employing the geostatistical approach, which stochastically models the unknowns as random fields and uses Bayes' rule to infer the unknown parameters by conditioning on measurements. However, these methods are computationally demanding when small-scale features must be identified: solving inverse problems requires multiple expensive simulations of partial differential equations, as well as representing high-dimensional random fields, especially on irregular grids and complicated domains. Additional details about the geostatistical approach are provided in Section 2.
Uncertainty in the context of Bayesian inverse problems is represented by the posterior probability density function. For linear inverse problems, if the measurement noise is additive Gaussian and the prior model is specified by a Gaussian random field, then the resulting posterior probability density function (PDF) is also Gaussian and is fully specified by the mean (which coincides with the maximum a posteriori, or MAP, estimate) and the posterior covariance matrix. Computing the mean leads to a weighted linear least-squares optimization problem, which can be tackled by several efficient numerical algorithms. For nonlinear inverse problems, a linearization of the measurement operator yields a local Gaussian approximation to the posterior PDF. The MAP point can be computed by solving a weighted, regularized nonlinear least-squares problem, and the posterior covariance matrix can be approximated by the inverse of the Hessian of the negative log-likelihood of the posterior PDF evaluated at the MAP estimate. Although considerable effort has been devoted to computing the MAP estimate (see, for example, [25, 36, 35]), relatively few works have addressed the computation of the posterior covariance matrix. In this work, we focus on an efficient representation of the posterior covariance matrix, which is a measure of the uncertainty associated with the reconstruction of the parameters of interest.
Computing and storing the posterior covariance matrix is computationally infeasible because the prior covariance matrices arising from finely discretized fields and certain covariance kernels are dense, and computing the dense measurement operator requires solving many forward PDE problems, which can be computationally intractable. In ill-posed inverse problems, the data are informative only about a low-dimensional manifold in the parameter space. This property has been exploited previously in developing efficient approximate representations of the posterior covariance matrix as the sum of the prior covariance matrix and a low-rank update that contains combined information from both the prior and the data-misfit part of the Hessian (see, for example, [11, 18, 12]). The low-rank modification is computed by the solution of a large-scale eigenvalue problem involving the prior-preconditioned Hessian. The prior covariance matrices can be modeled as discrete representations of operators of the form $\mathcal{A}^{-\alpha}$, where $\mathcal{A}$ is a partial differential operator (e.g., the Laplacian) and $\alpha$ is a parameter chosen such that the infinite-dimensional formulation is well posed [39]. Another choice for prior covariance matrices is the Spartan Gibbs random field [23]. In this work, we focus on the Matérn class of covariance kernels [40].
The ability to compute measures of uncertainty is extremely important for the field of optimal experimental design (OED), which seeks to determine the experimental setups that maximize the amount of information that can be gained about the parameters of interest. The design variables that control the accuracy of the parameter reconstructions could be the measurements or measurement types, their number, the locations of sources and/or detectors, and other experimental conditions. A prevalent approach to OED optimizes an objective function based on a scalar measure of uncertainty associated with the parameter reconstruction (defined in terms of the posterior covariance matrix), minimizing this objective with respect to the design parameters. Since, in the context of optimizing the experimental setup, the inverse problem has to be solved several times and the resulting uncertainty needs to be estimated at each iteration of the optimization routine, an efficient method for computing the objective function (i.e., the measure of uncertainty) is essential; we provide efficient methods for computing several such uncertainty measures. A good review of optimal experimental design in the Bayesian context is provided in [13]. Common optimality criteria that can be used as objective functions include the alphabetic criteria, for example, the A-, C-, D-, E-, and T-optimality criteria (these will be defined in Section 6). The definitions of the optimality criteria in the geostatistical context, along with a discussion of the physical and statistical significance of these criteria and their applicability in non-Gaussian settings, are available in [32].
Contributions: We model the prior covariance matrix with entries arising from covariance kernels, as is common practice [25]. Although the resulting covariance matrices are dense, in our previous work [36, 4] we have shown that the best estimate can be obtained using techniques (such as FFT-based methods and the hierarchical matrix approach) that reduce the storage and computational cost from $\mathcal{O}(m^2)$ to $\mathcal{O}(m \log m)$, where $m$ is the number of unknown parameters. However, except when the number of measurements is small, computing the entries of the posterior covariance matrix is computationally impractical. We show how to compute an efficient representation of the posterior covariance matrix as a low-rank modification of the prior covariance matrix, where the low-rank update is computed efficiently using a randomized algorithm. A major advantage of our approach is the flexibility to experiment with several covariance kernels, since the prior covariance matrix computations are handled in a black-box fashion. A major difference between our work and [11, 18, 12] is that we directly model the entries of the prior covariance matrix using the Matérn class of covariance kernels, instead of modeling the prior as the inverse of a discretized differential operator (such as the Laplacian). Although these two approaches appear disparate, their equivalence has been established in [28].
A second contribution of this paper is an algorithm for approximating the posterior covariance matrix that does not require forming the square root (or, equivalently, the Cholesky factorization) or the inverse of the prior covariance matrix. The approach in [11, 18, 12] considers the prior-preconditioned Hessian, obtained by symmetrically preconditioning the Gauss-Newton Hessian of the data-misfit term with the square root of the prior covariance matrix. Computing the square root of a matrix is an expensive operation for finely discretized grids arising from large-scale 3D problems. The work in [11, 18, 12] avoids this issue by considering priors for which the square root is explicitly known. Since the matrix square root is not explicitly known for arbitrary covariance matrices, this assumption is very restrictive from a modeling standpoint. The algorithm we propose only requires forming matrix-vector products (henceforth referred to as matvecs). The key idea is to consider an equivalent generalized eigenvalue problem whose left-hand operator is the Hessian corresponding to the data misfit and whose right-hand operator is the inverse prior covariance matrix. The randomized algorithm that we propose for approximating the posterior covariance matrix is simple to implement, is computationally efficient, and comes with error bounds established in our previous work [37].
Another important contribution of our work is the efficient computation of various measures of uncertainty, which leverages the efficient representation of the posterior covariance matrix as a low-rank correction to the prior covariance matrix. A second computational burden occurs when the number of measurements is large, because of operations on a dense cross-covariance matrix whose cost grows rapidly with the number of measurements. While some of the criteria (A and C) can be evaluated when the number of measurements is small, other criteria (such as D and E) are altogether computationally infeasible. However, using an efficient representation of the approximate posterior covariance together with matrix-free techniques, we show how several of these optimality criteria can be computed more efficiently. We note a further advantage of using covariance kernels to model the prior: computing the posterior variance requires the diagonal of the prior covariance matrix, which is easily obtained when the covariance kernel is specified explicitly. The application of these uncertainty measures in the context of optimal experimental design has also been considered in [1, 3, 2].
2 Geostatistical approach to inverse problems
The geostatistical approach (described in [25, 27, 26]) is one of the prevalent approaches to solving inverse problems. It represents the unknown field to be estimated as a random field composed of the sum of a deterministic term, typically a low-order polynomial, and a stochastic term that models small-scale variability. The structure of the stochastic term is represented through the prior probability density function, which in practical applications is often parameterized through variograms and generalized covariance functions. The method has found several applications because it is practical and has the ability to quantify uncertainty. It can generate best estimates, determined in a Bayesian framework as posterior mean values or most probable values; measures of uncertainty, such as posterior variances or credibility intervals; and conditional realizations, which are sample functions from the ensemble of the posterior probability distribution.
In this section, we briefly review the geostatistical approach. Let $s$, the function to be estimated, be modeled by a Gaussian random field. After discretization, we can write $s \sim \mathcal{N}(X\beta, Q)$, where $X$ is a matrix of low-order polynomials known as the drift matrix, $\beta$ is a set of drift coefficients to be determined, and $Q$ is a covariance matrix with entries $Q_{ij} = \kappa(x_i, x_j)$, where $\kappa(\cdot,\cdot)$ is a generalized covariance kernel [15, 30]. We have $s \in \mathbb{R}^m$, where $m$ refers to the grid size. In several instances, the measurements are corrupted by noise, which we model as Gaussian random variables. The measurement equation can be written as
(1) $y = h(s) + v, \qquad v \sim \mathcal{N}(0, R),$
where $y \in \mathbb{R}^n$ represents the noisy measurements, $h(\cdot)$ is known as the measurement operator or parameter-to-observation map, and $v$ is a random vector of observation error with mean zero and covariance matrix $R$. The matrices $X$, $Q$, and $R$ are part of the modeling choice, and more details on how to choose them can be obtained from [25]. The parameters to be reconstructed are the values of the function $s$ at the grid locations and the drift coefficients $\beta$.
We can write down the following expressions for the probability density functions (PDFs) of the prior and the measurements:
(2) $p(s \mid \beta) \propto \exp\left(-\tfrac{1}{2}\, \| s - X\beta \|^2_{Q^{-1}}\right)$
and
(3) $p(y \mid s) \propto \exp\left(-\tfrac{1}{2}\, \| y - h(s) \|^2_{R^{-1}}\right),$
where we have used the weighted vector norm $\|x\|^2_M = x^T M x$, defined for an arbitrary symmetric positive definite matrix $M$. Inference of the parameters from the measurements is obtained by invoking Bayes' theorem, through the posterior PDF, which is the product of two parts: the likelihood of the measurements and the prior distribution of the parameters. Employing Bayes' rule and assuming that the prior PDF for $\beta$ is uniform, i.e., $p(\beta) \propto 1$, we can write the expression for the PDF of the posterior distribution as
(4) $p(s, \beta \mid y) \propto \exp\left(-\tfrac{1}{2}\| y - h(s) \|^2_{R^{-1}} - \tfrac{1}{2}\| s - X\beta \|^2_{Q^{-1}}\right).$
The MAP estimate is computed by minimizing the negative log-likelihood of the posterior displayed in Equation (4), which leads to the following optimization problem:
(5) $\min_{s,\, \beta}\; \tfrac{1}{2}\| y - h(s) \|^2_{R^{-1}} + \tfrac{1}{2}\| s - X\beta \|^2_{Q^{-1}}.$
Our choice of prior covariance kernel is the Matérn family of covariance kernels [40]:
(6) $\kappa(r) = \dfrac{1}{2^{\nu-1}\Gamma(\nu)}\, (\alpha r)^{\nu}\, K_{\nu}(\alpha r),$
where $r = \|x - x'\|_2$, $\Gamma$ is the Gamma function, $\alpha$ is a scaling factor, and $K_\nu$ is the modified Bessel function of the second kind of order $\nu$. Equation (6) takes special forms for certain values of the parameter $\nu$. For example, $\nu = 1/2$ corresponds to the exponential covariance function; when $\nu = p + 1/2$ and $p$ is an integer, the kernel is the product of an exponential covariance and a polynomial of order $p$. In the limit as $\nu \to \infty$, and for an appropriate scaling of $\alpha$, the kernel converges to the Gaussian covariance kernel. For a more detailed discussion of permissible covariance kernels, we refer the reader to [30, 15]. Another possible choice for modeling the unknown fields is the Spartan Gibbs random field model [23]. Estimating the covariance parameters can be accomplished using the restricted maximum likelihood approach outlined in [25]. The choice of the parameter $\nu$ depends on the a priori information available about the smoothness of the field we wish to reconstruct; the effect of this parameter is studied in Section 4.2. Additional information about the Matérn class can be found in [40].
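For concreteness, the Matérn kernel (6) can be evaluated directly with SciPy's special functions; this is a minimal sketch (function name and default parameters are our own, not the paper's code), which also checks the $\nu = 1/2$ special case against the exponential kernel:

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, alpha=1.0):
    """Matern kernel (1 / (2^{nu-1} Gamma(nu))) (alpha r)^nu K_nu(alpha r)."""
    r = np.asarray(r, dtype=float)
    out = np.ones_like(r)               # kernel tends to 1 as r -> 0
    nz = r > 0
    z = alpha * r[nz]
    out[nz] = (z ** nu) * kv(nu, z) / (2 ** (nu - 1) * gamma(nu))
    return out

r = np.linspace(0.0, 3.0, 7)
# nu = 1/2 reduces to the exponential kernel exp(-alpha r)
print(np.allclose(matern(r, 0.5), np.exp(-r)))
```

Analogous checks hold for the half-integer cases, e.g. $\nu = 3/2$ gives $(1 + \alpha r)\,e^{-\alpha r}$.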
We now discuss the computational cost of the matrix-vector products $Qx$ and $Q^{-1}x$. For stationary or translation-invariant covariance kernels with points located on a regular equispaced grid, the computational cost of forming $Qx$ can be reduced from $\mathcal{O}(m^2)$, using the naive approach, to $\mathcal{O}(m \log m)$ by exploiting the connection between the Toeplitz structure in 1D (or block-Toeplitz structure in 2D, etc.) and the fast Fourier transform (FFT) [31]. For irregular grids, it can be shown that the cost of approximate matrix-vector products (matvecs) with the prior covariance matrix can be reduced to $\mathcal{O}(m \log m)$ using hierarchical matrices [36], or to $\mathcal{O}(m)$ using $\mathcal{H}^2$-matrices or the fast multipole method (FMM) [4]. To form the matvecs $Q^{-1}x$, we use an iterative solver such as GMRES with a preconditioner that employs approximate cardinal functions based on local centers and special points [7]. The cost of constructing the preconditioner is $\mathcal{O}(m \log m)$ or $\mathcal{O}(m)$. Assuming that the number of iterations $n_{\text{it}}$ is independent of the size of the system, the cost of forming $Q^{-1}x$ is also $\mathcal{O}(n_{\text{it}}\, m \log m)$ or $\mathcal{O}(n_{\text{it}}\, m)$, where $n_{\text{it}}$ is the number of iterations required to converge to the desired tolerance. In conclusion, both $Qx$ and $Q^{-1}x$ can be formed at a cost that is nearly linear in $m$.
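For intuition, here is a minimal 1-D sketch of the FFT-based Toeplitz matvec described above (illustrative kernel and grid size, not the paper's setup): the first column of $Q$ is embedded in a circulant vector, and the product then reduces to FFTs:

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec(c, x):
    """y = T x in O(m log m), where T is symmetric Toeplitz with first column c."""
    m = len(c)
    circ = np.concatenate([c, c[-2:0:-1]])       # circulant embedding, size 2m - 2
    xpad = np.concatenate([x, np.zeros(m - 2)])
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(xpad)).real
    return y[:m]

m = 64
grid = np.arange(m) / m
c = np.exp(-5.0 * grid)                          # exponential covariance, first column
x = np.random.default_rng(0).standard_normal(m)
print(np.allclose(toeplitz_matvec(c, x), toeplitz(c) @ x))   # matches dense product
```

The dense comparison is included only as a correctness check; in practice only the first column of $Q$ is ever stored.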
When the operator $h(\cdot)$ is linear, i.e., $h(s) = Hs$, the resulting posterior PDF in Equation (4) is also Gaussian. The MAP estimate (5) can be calculated by introducing auxiliary variables $\xi$ and $\beta$ that satisfy the following system of equations:
(7) $\begin{pmatrix} HQH^T + R & HX \\ (HX)^T & 0 \end{pmatrix} \begin{pmatrix} \xi \\ \beta \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix}.$
The MAP estimate can then be calculated as $\hat s = X\beta + QH^T\xi$. Further details on the formulation and on the procedure to compute the solution of the linear system in Equation (7) can be found in [36, 35]. Furthermore, the posterior covariance matrix is given by the expression
(8) $V = Q - \begin{pmatrix} QH^T & X \end{pmatrix} \begin{pmatrix} HQH^T + R & HX \\ (HX)^T & 0 \end{pmatrix}^{-1} \begin{pmatrix} HQ \\ X^T \end{pmatrix}.$
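The equivalence between the saddle-point system (7) and the optimization problem (5) can be checked on a small dense example. The following sketch (illustrative sizes, kernel, and noise level; not the paper's configuration) solves system (7) and verifies that $\hat s = X\beta + QH^T\xi$ coincides with the minimizer of (5) computed from its normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 20, 12, 2
grid = np.linspace(0, 1, m)
H = rng.standard_normal((n, m))                  # linear measurement operator
X = np.column_stack([np.ones(m), grid])          # drift: constant + linear trend
Q = np.exp(-np.abs(grid[:, None] - grid[None, :]) / 0.3)   # exponential prior
R = 0.01 * np.eye(n)
y = rng.standard_normal(n)

# Saddle-point system (7), then s_hat = X beta + Q H^T xi
A = np.block([[H @ Q @ H.T + R, H @ X],
              [(H @ X).T,       np.zeros((p, p))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(p)]))
xi, beta = sol[:n], sol[n:]
s_hat = X @ beta + Q @ H.T @ xi

# Cross-check: stationarity (normal) equations of the objective in (5)
Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
K = np.block([[H.T @ Ri @ H + Qi, -Qi @ X],
              [-X.T @ Qi,          X.T @ Qi @ X]])
z = np.linalg.solve(K, np.concatenate([H.T @ Ri @ y, np.zeros(p)]))
print(np.allclose(s_hat, z[:m]))
```

Note that the saddle-point route never forms $Q^{-1}$; the explicit inverses above appear only in the cross-check.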
3 Posterior Covariance approximation
In [18, 11], the approach taken to approximate the posterior covariance matrix is to compute the dominant modes of the prior-preconditioned data-misfit Hessian $Q^{1/2} H_{\text{red}}\, Q^{1/2}$, which combines information from the data-misfit part of the Hessian and the prior covariance matrix $Q$. Here, $H_{\text{red}} = H^T R^{-1} H$ is the Gauss-Newton Hessian of the negative log-likelihood of the measurements conditioned on the parameters. This matrix often has a rapidly decaying spectrum for several ill-posed inverse problems [18, 11, 9, 10]. Computing the dominant modes of this eigenvalue problem can be accomplished at a cost that is a small multiple of the cost of simulating the forward (or adjoint) problem, and this multiple is independent of the parameter dimension. Once this low-rank approximation is computed, an approximation to the posterior covariance matrix can be obtained as a low-rank update to the prior covariance matrix using the Woodbury matrix identity. This results in a scalable algorithm for estimating uncertainty in large-scale statistical inverse problems.
3.1 Avoiding $Q^{1/2}$ and $Q^{-1}$
As mentioned earlier, the key drawback of [18, 11] is that the algorithms for approximating the posterior covariance matrix involve computing (matvecs with) the square root of the prior covariance matrix. When the prior covariance matrix is represented by a discrete representation of a (possibly fractional) differential operator [39], the square root can be computed using a sparse Cholesky factorization, which scales as $\mathcal{O}(m^{3/2})$ for 2D problems and $\mathcal{O}(m^2)$ for 3D problems.
Several matrix-free techniques exist in the literature for computing the matvecs $Q^{1/2}x$ and $Q^{-1/2}x$, based on polynomial approximation [14, 17] or on rational approximations and contour integrals [20]. However, the convergence of polynomial approximations is only algebraic when the smallest eigenvalue is close to zero. Rational approximation and contour integral based methods do not suffer from this problem; however, they require the solution of a number of shifted systems that grows with $\log \kappa(Q)$, where the condition number is defined for a matrix as $\kappa(Q) = \|Q\|_2 \|Q^{-1}\|_2$. Although the number of systems to be solved is often small, even for ill-conditioned problems, solving each system can be expensive in practice. Recent developments also include symmetric square-root factorizations of Cholesky type [8]; however, these are complicated to implement and have low accuracy for a reasonable computational cost. Therefore, it is highly desirable to develop square-root-free algorithms for approximating the posterior covariance matrix.
Our approach, instead, is to consider the dominant eigenmodes of the generalized Hermitian eigenvalue problem (henceforth referred to as GHEP) $H_{\text{red}}\, u = \lambda\, Q^{-1} u$, which has the same spectrum as the prior-preconditioned data-misfit Hessian $Q^{1/2} H_{\text{red}}\, Q^{1/2}$. However, as we shall show in Section 3.2, the dominant eigenmodes can be computed without relying on square roots of the matrix $Q$. We consider the generalized eigenvalue problem
(9) $H_{\text{red}}\, u = \lambda\, Q^{-1} u, \qquad H_{\text{red}} = H^T R^{-1} H.$
Recall that $H_{\text{red}} = H^T R^{-1} H$. Since both matrices $H_{\text{red}}$ and $Q^{-1}$ are symmetric and $Q^{-1}$ is symmetric positive definite, we have the following eigendecomposition
(10) $H_{\text{red}} = Q^{-1} U \Lambda U^T Q^{-1}, \qquad U^T Q^{-1} U = I,$
where $U$ is the matrix of eigenvectors obtained by the solution of Equation (9) and $\Lambda$ is a diagonal matrix containing the eigenvalues. Equation (9) is free of square roots but still involves $Q^{-1}$. We therefore make the change of variables $u = Qx$, under which Equation (9) becomes
(11) $H_{\text{red}}\, Q\, x = \lambda\, x.$
Note that this transformation preserves the eigenvalues $\lambda$. After solving for an eigenpair $(\lambda, x)$ of Equation (11), the corresponding eigenvector of Equation (9) can be recovered by the transformation $u = Qx$. As a result, we have completely removed the need for inverting $Q$ or forming its square roots. We now consider efficient solvers for computing the dominant eigenmodes of the GHEP (9).
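The spectral equivalence between the GHEP (9) and the transformed problem (11) can be verified on a small dense example (illustrative sizes and kernel; the dense $Q^{-1}$ appears only in the reference computation, which is exactly what the transformation avoids):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
m, n = 30, 8
H = rng.standard_normal((n, m))
grid = np.linspace(0, 1, m)
Q = np.exp(-np.abs(grid[:, None] - grid[None, :]))   # exponential prior covariance
Hred = H.T @ H                                       # R = I for simplicity

# Reference: GHEP (9), Hred u = lam Q^{-1} u, solved with an explicit inverse
lam_ref = eigh(Hred, np.linalg.inv(Q), eigvals_only=True)[::-1]

# Transformed problem (11): Hred Q x = lam x, no inverse and no square root
lam = np.sort(np.linalg.eigvals(Hred @ Q).real)[::-1]
print(np.allclose(lam[:n], lam_ref[:n]))             # identical nonzero spectra
```

Here `Hred` has rank $n$, so only the leading $n$ eigenvalues are nonzero.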
3.2 A randomized algorithm for approximating posterior covariance
In Section 3.1, we described an efficient representation of the posterior covariance matrix based on the dominant eigenmodes of the eigenvalue problem in Equation (9). Equation (9) is a GHEP: a special case of a generalized eigenvalue problem $Au = \lambda Bu$ in which $A$ is Hermitian and $B$ is Hermitian positive definite. Several efficient methods exist for solving GHEPs, including approaches based on power and inverse iteration, Lanczos-based methods, and the Jacobi-Davidson method. For a good review of this material, we refer the reader to [6, Chapter 5] and [34]; a review of existing software for solving eigenvalue problems is available in [22].
In this paper, we choose to use the square-root-free randomized algorithms that we recently developed in [37]. Randomized algorithms are gaining popularity because (1) they are easy to implement, (2) they are well suited to computationally intensive problems and modern computing environments, and (3) they have well-established error bounds and computational costs. The error analysis suggests that the randomized algorithm developed in [37] is most accurate when the generalized singular values of the matrix decay rapidly. In Section 3.1, we argued that the generalized eigenvalues of the GHEP (9) decay rapidly; therefore, we expect the randomized algorithm to be fairly accurate. The randomized algorithms that we employ are also efficient because they avoid forming expensive matvecs with $Q^{-1}$ and $Q^{\pm 1/2}$. In addition, they produce a symmetric low-rank representation with $B$-orthonormal eigenvectors. This symmetry in the low-rank representation is exploited in deriving an efficient, symmetric representation of the posterior covariance matrix in Section 3.3.
We briefly review the randomized algorithm described in [37] for computing the dominant eigenmodes of the GHEP $Au = \lambda Bu$. In the context of solving problem (9), we have $A = H_{\text{red}} = H^T R^{-1} H$ and $B = Q^{-1}$, so that products with $B^{-1} = Q$ are readily available.
The key observation is that the matrix $C = B^{-1}A$ is symmetric with respect to the inner product $\langle x, y \rangle_B = x^T B y$. Suppose we want to compute the $k$ largest generalized eigenpairs of $(A, B)$. The randomized Algorithm 1 calculates a matrix $\hat Q$ that is $B$-orthonormal, i.e., $\hat Q^T B \hat Q = I$, and approximately spans the column space of $C$, in the sense that the residual $C - \hat Q \hat Q^T B C$ is small in norm. Given such a matrix $\hat Q$, it can be shown that $A \approx B \hat Q (\hat Q^T A \hat Q) \hat Q^T B$. As a result, a symmetric low-rank approximation can be computed, from which the approximate eigendecomposition can be obtained.
To produce a symmetric low-rank approximation, the algorithm proceeds as follows. First, we sample a random matrix $\Omega$ with entries drawn independently from the standard normal distribution $\mathcal{N}(0,1)$. We take $\ell = k + p$ columns, where $p$ is a small oversampling factor; numerical evidence for this choice of oversampling factor has been provided in [21, 37]. We form $Y_A = A\Omega$ and $Y = B^{-1} Y_A$. Then, we compute a QR factorization $Y = \hat Q R$ in which $\hat Q$ is $B$-orthonormal; this can be accomplished by the modified Gram-Schmidt algorithm with $B$-inner products, and since $BY = Y_A$ has already been computed, the required inner products are available without further applications of $B$. Next, we form $T = \hat Q^T A \hat Q$ and compute its eigenvalue decomposition $T = S \Lambda S^T$. We then have the approximate generalized eigendecomposition
$A \approx B\, U \Lambda U^T B, \qquad U = \hat Q S.$
Here, $U$ is also $B$-orthonormal. The cost of a second round of matvecs with $A$ while computing $T$ can be avoided by an approximation that reuses the previously computed products $Y_A = A\Omega$; see [37] for details. The results are compactly summarized in Algorithm 1. For an efficient implementation, the algorithm is expressed entirely in terms of the products $A\Omega$, applications of $B^{-1}$, and dense operations on $\ell \times \ell$ matrices.
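The steps above can be sketched in NumPy as follows. This is a two-pass variant (it recomputes products with $A$ for the Rayleigh-Ritz projection) rather than the single-pass approximation of [37], and all sizes and kernels are illustrative. The $B$-inner products in the Gram-Schmidt step reuse the saved products $Z = A\Omega$, so no products with $B = Q^{-1}$ are ever formed:

```python
import numpy as np
from scipy.linalg import eigh

def randomized_ghep(matvec_A, matvec_Binv, m, k, p=10, rng=None):
    """Approximate top-k eigenpairs of A u = lam B u using only products
    with A and with B^{-1} (never with B, B^{1/2}, or B^{-1/2})."""
    rng = np.random.default_rng(rng)
    ell = k + p
    Om = rng.standard_normal((m, ell))
    Z = np.stack([matvec_A(Om[:, j]) for j in range(ell)], axis=1)     # Z = A Om
    Y = np.stack([matvec_Binv(Z[:, j]) for j in range(ell)], axis=1)   # Y = B^{-1} A Om
    # Modified Gram-Schmidt in the B-inner product. The invariant Z = B Y is
    # maintained, so y_i^T B y_j = y_i^T z_j needs no application of B.
    for j in range(ell):
        r = np.sqrt(Y[:, j] @ Z[:, j])
        Y[:, j] /= r
        Z[:, j] /= r
        for i in range(j + 1, ell):
            c = Y[:, j] @ Z[:, i]
            Y[:, i] -= c * Y[:, j]
            Z[:, i] -= c * Z[:, j]
    T = Y.T @ np.stack([matvec_A(Y[:, j]) for j in range(ell)], axis=1)
    lam, S = np.linalg.eigh((T + T.T) / 2)         # Rayleigh-Ritz projection
    idx = np.argsort(lam)[::-1][:k]
    return lam[idx], Y @ S[:, idx]                 # U = Y S is B-orthonormal

# Small dense check: A = Hred has rank 15; with ell equal to that rank the
# dominant eigenpairs are recovered essentially exactly.
rng = np.random.default_rng(3)
m, n, k = 40, 15, 5
H = np.diag(1.0 / (1.0 + np.arange(n))) @ rng.standard_normal((n, m))
grid = np.linspace(0, 1, m)
Q = np.exp(-np.abs(grid[:, None] - grid[None, :]) / 0.5)
Hred = H.T @ H                                     # R = I for simplicity
lam, U = randomized_ghep(lambda v: Hred @ v, lambda v: Q @ v, m, k=k, p=n - k)
lam_ref = eigh(Hred, np.linalg.inv(Q), eigvals_only=True)[::-1][:k]
print(np.allclose(lam, lam_ref))
```

The dense reference solve with $Q^{-1}$ appears only for verification; the algorithm itself touches $Q$ and $H_{\text{red}}$ exclusively through matvecs.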
The efficiency and accuracy of this algorithm have been studied in detail in [37]; here we summarize the main conclusions. The error in the low-rank approximation satisfies a bound of the form
$\| A - B U \Lambda U^T B \| \le c_{k,p}\, \sigma_{k+1},$
where $c_{k,p}$ is a constant that depends on $k$ and $p$ and is independent of the spectrum of the matrices, and $\sigma_{k+1}$ is the $(k+1)$st generalized singular value of the matrix $C = B^{-1}A$. Since the generalized singular values are not known in advance, a randomized estimator for the error in the low-rank representation is also proposed and analyzed in [37]. Given an error $\varepsilon$ in the low-rank representation, it can be shown that the error in approximating a true eigenvalue is bounded by a multiple of $\varepsilon$, whereas the angle between the true and approximate eigenvector is bounded by a multiple of $\varepsilon / \text{gap}$,
where gap denotes the separation between the approximate eigenvalue and any other eigenvalue. This result states that the accuracy of the eigenvalue/eigenvector calculations depends not only on the accuracy of the low-rank representation but also on the spectral gap. When the eigenvalues are clustered, the spectral gap is small; the eigenvalue calculations remain accurate as long as the error in the low-rank representation is small, but the resulting eigenvector calculations may be inaccurate, because the gap appears in the denominator of the bound on the angle between the true and approximate eigenvectors. This result has consequences for the accuracy of the approximation of the posterior covariance, and will be discussed in Section 4.2.
3.3 Posterior covariance approximation
Let us define the matrix $\Gamma \equiv \left(H_{\text{red}} + Q^{-1}\right)^{-1}$. We follow the approach in [18, 29] to derive an approximation to $\Gamma$ as a low-rank update to the prior covariance matrix $Q$. Plugging the approximate eigendecomposition of $H_{\text{red}}$ from Equation (10) into this definition, we obtain
$\Gamma \approx \left(Q^{-1} + Q^{-1} U \Lambda U^T Q^{-1}\right)^{-1}.$
Using the Sherman-Morrison-Woodbury formula, we can derive the following expression
(12) $\Gamma \approx \hat\Gamma \equiv Q - U D U^T, \qquad D = \operatorname{diag}(d_1, \dots, d_k),$
where $d_i = \lambda_i / (1 + \lambda_i)$ for $i = 1, \dots, k$.
(13) $\|\Gamma - \hat\Gamma\|_{Q^{-1}} \le \dfrac{\lambda_{k+1}}{1 + \lambda_{k+1}}, \qquad \|\Gamma^{-1} - \hat\Gamma^{-1}\|_{Q} \le \lambda_{k+1}.$
The inequalities bound the error both in the approximation of the matrix and in that of its inverse. Here, the induced matrix norm is defined as $\|A\|_M = \|M^{1/2} A M^{1/2}\|_2$, defined for symmetric positive definite matrices $M$.
For many ill-posed inverse problems, the approximate numerical rank of the prior-preconditioned Hessian is small and independent of the problem size, i.e., the number of unknowns. Since this matrix has the same eigenvalues as the GHEP (9), we are justified in retaining only the eigenmodes corresponding to the largest eigenvalues. We can exploit this low-rank structure to develop scalable algorithms to approximate the posterior covariance matrix. The generalized eigendecomposition combines information from the prior and the reduced Hessian and takes advantage of the eigenvalue decay in one (or both) matrices: when the reduced Hessian has rapidly decaying eigenvalues, or when the prior is of smoothing type. Analytical evidence for the eigenvalue decay of the reduced Hessian is provided in [18] in the context of advection-diffusion based inverse problems, and in [9, 10] for inverse scattering problems. For prior covariance matrices, the eigenvalue spectrum is known to decay rapidly when the covariance kernels are smooth [38]. The retained eigenvectors are the modes along which the parameter field is informed by a combination of the data and the prior. Typically, the data and the prior are informative about the low-frequency modes, and as a result localized, fine-scale information is hard to recover from the data. The inequalities in (13) suggest choosing $k$ such that $\lambda_{k+1} \ll 1$.
Finally, the posterior covariance matrix can be constructed by using the following decomposition of Equation (8):
(14) $V = \Gamma + W M W^T,$
where $W = \Gamma Q^{-1} X$ and $M = \left(X^T Q^{-1} X - X^T Q^{-1} \Gamma Q^{-1} X\right)^{-1}$.
An approximation to the posterior covariance matrix (14) can be constructed by plugging in the approximation to $\Gamma$ derived in Equation (12). It should be noted that computing and storing the posterior covariance matrix or its inverse is infeasible for very large problem sizes; however, neither computing nor storing this matrix is necessary, nor recommended. Computing $W$ can be performed as
$W = \Gamma Q^{-1} X \approx \left(Q - U D U^T\right) Q^{-1} X = X - U D \left(U^T Q^{-1} X\right).$
By the same argument, forming $W$ and $M$ costs approximately as much as applying $Q^{-1}$ to the few columns of the drift matrix $X$. Having computed $W$ and $M$, the matrix-vector product involving $V$ is dominated by the cost of computing $Qx$ and can be computed efficiently in $\mathcal{O}(m \log m)$.
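In particular, the posterior variance (the diagonal of the approximate $\Gamma$) never requires forming the full matrix: the diagonal of $Q$ is known explicitly from the covariance kernel, and the low-rank correction contributes row sums at $\mathcal{O}(mk)$ cost. A hedged sketch with illustrative sizes (the drift correction of (14) is omitted here for brevity):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
m, n, k = 50, 8, 8
H = rng.standard_normal((n, m))
grid = np.linspace(0, 1, m)
Q = np.exp(-np.abs(grid[:, None] - grid[None, :]) / 0.4)
Qinv = np.linalg.inv(Q)
Hred = H.T @ H                               # R = I for simplicity

lam, U = eigh(Hred, Qinv)                    # reference GHEP solve
lam, U = lam[::-1][:k], U[:, ::-1][:, :k]
d = lam / (1 + lam)

# Posterior variance: diag(Q) - sum_i d_i (U_{:,i})^2, at O(mk) cost.
# diag(Q) is the kernel at distance zero, available without storing Q.
var_post = np.diag(Q) - np.einsum('ij,j,ij->i', U, d, U)
print(np.allclose(var_post, np.diag(np.linalg.inv(Hred + Qinv))))
```

Here $k$ equals the rank of $H_{\text{red}}$, so the comparison with the dense inverse is exact; for truncated $k$ the variance is overestimated by at most the discarded terms.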
4 Application: Ray Tomography
4.1 Setup
In this application, we consider a synthetic setup of cross-well tomography [4]. The goal is to image the slowness in the medium, where slowness is defined as the reciprocal of seismic velocity. As a first-order approximation, the seismic wave is modeled as traveling along a straight line from the sources to the receivers, without reflections or refractions. Each source-receiver pair generates one measurement; therefore, there are $n = n_{\text{src}} n_{\text{rec}}$ measurements, where $n_{\text{rec}}$ is the number of receivers and $n_{\text{src}}$ is the number of sources. The domain is discretized into cells and, within each cell, the slowness is assumed to be constant. Therefore, the travel time from a source to a receiver is a weighted sum of the slowness over the cells, weighted by the length of the ray within each cell. The inverse problem can be stated as follows: given measurements of the travel time between each source-receiver pair, what is the slowness of the medium? We assume that the travel times are corrupted by Gaussian noise, so that the measurement equation takes the following form
(15) $y = Hs + v, \qquad v \sim \mathcal{N}(0, R),$
where $y$ are the observed (synthetic) travel times, $s$ is the slowness that we are interested in imaging, and $H$ is the measurement operator, whose rows correspond to the source-receiver pairs and are constructed such that their inner product with the slowness vector yields the travel time. Constructed as above, $H$ is a sparse matrix: each row has as many nonzero entries as cells crossed by the corresponding ray, and there are $n$ rows. As a "true field", we take a realization from a Gaussian random field with constant mean, with the covariance matrix constructed according to the covariance kernel
(16) 
with fixed kernel parameters. The realization of the stationary Gaussian process is generated using the FFT-based approach in [16], by embedding the Toeplitz covariance matrix in a circulant matrix. The matvecs with $Q$ are also performed with FFT-based methods in $\mathcal{O}(m \log m)$.
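A minimal 1-D sketch of the circulant-embedding sampler of [16] is shown below (illustrative kernel, length scale, and grid size, not the values used in the experiment): the Toeplitz covariance is embedded in a circulant matrix, whose FFT diagonalization yields an exact square root for sampling.

```python
import numpy as np

m = 256
grid = np.arange(m) / m
c = np.exp(-10.0 * grid)                      # first column of the Toeplitz Q
circ = np.concatenate([c, c[-2:0:-1]])        # circulant embedding, size 2m - 2
ev = np.fft.fft(circ).real                    # eigenvalues of the circulant
assert ev.min() > -1e-8                       # embedding must be nonneg. definite

rng = np.random.default_rng(6)
N = len(circ)
w = rng.standard_normal(N) + 1j * rng.standard_normal(N)
# Real part of F (sqrt(ev / N) * w) has covariance exactly equal to circ;
# restricting to the first m entries gives a sample with covariance Q.
sample = np.fft.fft(np.sqrt(np.maximum(ev, 0) / N) * w).real[:m]
```

Each draw costs one FFT of length $2m-2$, and the covariance of the restricted sample is exact (no truncation error) whenever the embedding is nonnegative definite.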
To simulate experimental error, we add Gaussian noise of 0.1 to the measurements. The system of equations (7) is solved using an iterative solver, chosen to be restarted GMRES(50), and the matvecs can be formed in a matrix-free manner [36, 35]. As a preconditioner, we use the method developed in [36, Section 3.3]. The preconditioner requires computing a low-rank decomposition, which can be obtained using any eigenvalue solver; we use the randomized GHEP algorithm described in Section 3.2, with a set of eigenpairs used to form the low-rank decomposition and a small oversampling factor. The other steps in the construction of the preconditioner are identical. Figure 1 shows the true field, which is a realization of a Gaussian process with constant mean [s/m] and covariance kernel (16), together with the reconstructed field obtained by solving Equation (7); the relative reconstruction error is reported in the $\ell_2$ sense. Figure 2 visualizes the posterior variance, obtained as the diagonal of the approximate posterior covariance matrix, together with the error in the uncertainty estimate relative to the exact posterior covariance (see Equation (8)). A visualization of a few selected eigenvectors of the generalized eigenvalue problem (10), computed using Algorithm 1, can be seen in Figure 3.
4.2 Variance approximation
In this section, we study the effect of truncating the spectrum and of the choice of prior on the accuracy of the variance of the reconstruction, that is, the diagonal of the posterior covariance matrix. We consider three different covariance kernels chosen from the Matérn class, corresponding to parameters , i.e.
(17) 
Figure 4 compares the rates of decay of the eigenvalues of the matrices , , and the GHEP. For a given measurement setup, the spectrum of the generalized eigenvalue problem is affected by the choice of the covariance kernel used in the reconstruction. Here we illustrate this by choosing a ray tomography setup with measurements and unknowns discretized on a grid. For the Matérn class of covariance kernels, increasing the parameter makes the underlying stochastic process smoother and increases the rate of decay of the eigenvalues [40, 38]. This affects the number of eigenvalues retained to maintain the accuracy of the posterior covariance matrix: the smoother the kernel, the fewer eigenvalues need to be retained. The number of retained eigenvalues is , where is chosen such that the retained eigenvalues satisfy the cutoff . A typical choice is . This is illustrated in Figure 5. We see that for the covariance kernel corresponding to , the eigenvalue decay is not rapid enough to satisfy the cutoff and, as a result, all the nonzero eigenvalues, which correspond to the number of measurements , are retained. Physical parameters also control the decay of the eigenvalues and, as a result, the number of eigenmodes that are retained. Setups with more measurements, due to an increased number of sources and receivers, have a slower rate of eigenvalue decay because more information is propagated through the rows of the measurement operator. This is also illustrated in Figure 5.
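The effect of the smoothness parameter on the number of retained eigenvalues can be reproduced in a few lines. The sketch below uses the closed-form Matérn kernels for parameter values 1/2, 3/2, and 5/2 on a small 1D grid, with an illustrative absolute cutoff; the grid, length scale, and cutoff value are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def matern(r, ell, nu):
    """Matérn covariance at distance r; closed forms exist for nu in {0.5, 1.5, 2.5}."""
    s = r / ell
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1.0 + np.sqrt(3.0) * s) * np.exp(-np.sqrt(3.0) * s)
    if nu == 2.5:
        return (1.0 + np.sqrt(5.0) * s + 5.0 * s**2 / 3.0) * np.exp(-np.sqrt(5.0) * s)
    raise ValueError("no closed form implemented for this nu")

def n_retained(eigvals, tol=1e-2):
    """Number of eigenvalues at or above an (illustrative) absolute cutoff tol."""
    return int(np.sum(np.asarray(eigvals) >= tol))

# Smoother kernels (larger nu) -> faster eigenvalue decay -> fewer retained modes.
x = np.linspace(0.0, 1.0, 200)
r = np.abs(np.subtract.outer(x, x))
counts = {nu: n_retained(np.linalg.eigvalsh(matern(r, 0.2, nu))) for nu in (0.5, 1.5, 2.5)}
```

Running this, the retained count decreases monotonically as the kernel becomes smoother, mirroring the behavior described above.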
4.2.1 Accuracy of variance approximation
In Figure 6, we plot the relative error in the computation of the approximate variance, measured against the exact variance, as a function of the number of eigenvalues retained for different prior covariance kernels. The error decreases rapidly with an increasing number of retained eigenvalues, so including ever more eigenvalues yields diminishing returns in the accuracy of the variance calculations. Furthermore, the accuracy improves dramatically for kernels with higher values of , since eigenvalues of covariance matrices arising from the Matérn family corresponding to larger decay more rapidly. In our computations, the number of measurements was only ; this makes computing the exact variance feasible using the method described in [4]. For larger numbers of measurements, the exact variance cannot be computed easily and we must resort to the methods described in this paper.
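The variance approximation whose accuracy is studied here can be sketched as follows. Given the generalized eigenpairs of the data misfit Hessian and the inverse prior covariance, the posterior covariance is the prior minus a low-rank update, so its diagonal needs only diag(Q) and the k retained eigenvectors. The dense Cholesky route used below to solve the small GHEP is for illustration only; the paper's randomized algorithm deliberately avoids factorizations of the prior.

```python
import numpy as np

def lowrank_posterior_variance(Q, H, k):
    """Approximate diag(P_post), with P_post = (Q^{-1} + H)^{-1}, via the k dominant
    generalized eigenpairs of H w = lam Q^{-1} w (normalized so w^T Q^{-1} w = 1):

        P_post ~= Q - sum_i lam_i / (1 + lam_i) * w_i w_i^T,

    so only diag(Q) and k vectors are needed.
    """
    # Solve the GHEP through a Cholesky transformation: with Q = L L^T, the
    # eigenpairs (lam, u) of L^T H L give w = L u with the right normalization.
    L = np.linalg.cholesky(Q)
    lam, U = np.linalg.eigh(L.T @ H @ L)             # ascending order
    lam, W = lam[::-1][:k], (L @ U)[:, ::-1][:, :k]  # keep the k largest
    return np.diag(Q) - (W**2 * (lam / (1.0 + lam))).sum(axis=1)
```

When k equals the rank of the misfit Hessian the update is exact, which is the regime exploited when the number of measurements is small.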
4.2.2 Cost of computing variance
Two factors are necessary to ensure that the cost of computing the variance scales linearly with the dimension of the unknown parameters: the number of retained eigenvalues must be independent of the mesh size, and the cost of capturing the dominant eigenmodes must scale linearly with the number of unknowns.
We first address the claim that the number of retained eigenvalues is independent of the mesh size. Figure 7 shows the dependence of the spectrum of the generalized eigenvalue problem on the dimension of the input parameter space. We use the same setup described above, with measurements. The unknown parameter, in this case the slowness of the medium, is discretized on grids ranging from to . The prior is chosen as a member of the Matérn covariance class with ; see Equation (17). As can be seen from the figure, the number of unknowns does not affect the spectrum of the generalized eigenvalue problem and thus does not affect the number of retained eigenvalues required to approximate the posterior covariance matrix. We next address the second issue, the computational cost of the eigendecomposition. The dominant component of this calculation is the matvecs involving , and . In Section 3, we argued that the costs of these matvecs scale linearly or almost linearly with the number of unknowns. Therefore, we expect the cost of the eigendecomposition, and as a result the variance calculation, to scale similarly. This is confirmed in Figure 7: the computational costs scale linearly with the number of unknowns.
5 Application: Steady-state Hydraulic Tomography
Hydraulic Tomography (HT) is a technique for estimating subsurface parameters, such as transmissivity, from a series of pumping tests in which the pressure (head) response is measured; this data is used to reconstruct the parameters of interest using suitable inversion algorithms. The governing equations are the groundwater flow equations, which in 2D take the following form
(18)  
where are the pumping locations, and homogeneous Dirichlet boundary conditions are applied everywhere on the boundary .
The inverse problem is to reconstruct the transmissivity field from discrete measurements of the head obtained by repeated pumping tests at several pumping locations. For the system of Equations (18) to be well posed, the reconstruction of needs to be positive; therefore, we apply a log-transformation to ensure positivity. This transformation makes the reconstruction a quasi-linear inverse problem. The setup of the pumping tests is shown in Figure 8; the black squares indicate the locations at which water is pumped, and the head response is computed at the receiver locations (indicated by red asterisks).
The reconstruction of the transmissivity field from discrete measurements of the head is a nonlinear, ill-posed inverse problem. For nonlinear inverse problems, unlike linear inverse problems, the posterior distribution is no longer Gaussian. The measurement operator is linearized about the current estimate to obtain
The MAP point is the minimizer of the negative log posterior PDF described in Equation (5) and can be computed using the quasi-linear geostatistical approach [25]. Essentially, this involves solving a sequence of regularized linear least-squares problems until the algorithm converges. The system of Equations (19) is solved using restarted GMRES. The prior covariance matrix and the Jacobian are not constructed explicitly. Matvecs involving the prior covariance matrix are handled using the FFT-based approach or using hierarchical matrices, whereas matvecs involving the Jacobian are handled using the approach described in [19]. To accelerate the convergence of GMRES, we use the preconditioner developed in [36]. The algorithm for computing the MAP estimate is summarized in Algorithm 2.
(19) 
(20) 
Once we have the MAP estimate, we linearize the measurement operator about the MAP point so as to approximate the posterior distribution by a Gaussian. This is a reasonable approximation when the measurement operator is nearly linear. It is useful even when the posterior distribution is highly non-Gaussian, since the Gaussian approximation is frequently used as a proposal distribution for MCMC algorithms that explore the posterior [33]. The approximate posterior distribution can be expressed as
(21) 
where and are the parameters evaluated at the MAP point, and is the Jacobian calculated at the MAP point .
Computing the posterior covariance matrix or its inverse explicitly is computationally infeasible for large-scale problems. This is because (1) constructing the Jacobian explicitly requires the solution of several (possibly time-dependent) forward or adjoint partial differential equations, (2) the prior covariance matrix is typically dense, so matrix products and inverses involving it are nearly impossible in terms of storage and computational cost, and (3) the resulting posterior covariance matrix is also dense and its storage can be infeasible for large problems. Instead, the posterior covariance matrix is approximated using the same randomized approach described in Section 3.
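For background, the randomized machinery referenced here follows the range-finder idea of Halko et al. [21]: sketch the operator with Gaussian test vectors, orthogonalize, and solve a small dense eigenproblem. The sketch below is the standard symmetric variant; the paper's GHEP algorithm additionally avoids square roots of the prior covariance, which this illustration does not attempt. Names and the oversampling default are assumptions.

```python
import numpy as np

def randomized_eigh(matvec, n, k, p=10, rng=None):
    """Randomized approximation of the k dominant eigenpairs of a symmetric
    PSD matrix A of size n, available only through matvecs."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((n, k + p))  # Gaussian test matrix with oversampling p
    Y = np.column_stack([matvec(Omega[:, j]) for j in range(k + p)])
    Qb, _ = np.linalg.qr(Y)                  # orthonormal basis for the sketched range
    # Project A onto the basis and solve the small dense eigenproblem.
    T = Qb.T @ np.column_stack([matvec(Qb[:, j]) for j in range(k + p)])
    lam, S = np.linalg.eigh((T + T.T) / 2.0)  # symmetrize against roundoff
    return lam[::-1][:k], (Qb @ S)[:, ::-1][:, :k]
```

The cost is 2(k + p) matvecs plus O(n(k + p)^2) dense work, which is the linear-in-n scaling relied on throughout.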
We now discuss the implementation of these ideas in a concrete application of steady-state hydraulic tomography. For the reconstruction, we use a covariance kernel from the Matérn family, given by
(22) 
To simulate experimental error, we add Gaussian noise of 0.1 to the measurements. The MAP estimate is computed by solving the optimization problem (5) using the Gauss-Newton algorithm described in Algorithm 2. At each iteration, the system of equations is solved using restarted GMRES(50) with the preconditioner described in [36]. The low-rank decomposition is computed in the same fashion described in Section 4, with terms in the low-rank approximation. The low-rank decomposition is computed only once; however, the preconditioner is rebuilt at every Gauss-Newton iteration. The iterative solver took about iterations on average to converge to a relative tolerance of . The Gauss-Newton solver took about iterations for the relative difference between subsequent iterates to drop below . The error in the reconstruction was , and the reconstruction is plotted in Figure 9.
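The Gauss-Newton iteration used to compute the MAP estimate can be sketched generically as follows, with dense linear solves standing in for the preconditioned GMRES solves used in the paper. All names and the convergence test are illustrative, not the paper's Algorithm 2.

```python
import numpy as np

def gauss_newton_map(h, jac, y, mu, Qinv, Rinv, maxit=50, tol=1e-8):
    """Gauss-Newton iteration for the MAP point of a nonlinear inverse problem
    with Gaussian prior N(mu, Q) and Gaussian noise covariance R:

        min_s 0.5 (y - h(s))^T Rinv (y - h(s)) + 0.5 (s - mu)^T Qinv (s - mu).
    """
    s = mu.copy()
    for _ in range(maxit):
        J = jac(s)                                      # Jacobian at current iterate
        g = J.T @ Rinv @ (y - h(s)) - Qinv @ (s - mu)   # negative gradient
        Hgn = J.T @ Rinv @ J + Qinv                     # Gauss-Newton Hessian
        step = np.linalg.solve(Hgn, g)                  # GMRES + preconditioner in the paper
        s = s + step
        if np.linalg.norm(step) < tol * max(1.0, np.linalg.norm(s)):
            break
    return s
```

For a linear forward model the iteration reduces to the usual regularized least-squares solution in a single step, which is a convenient sanity check.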
The generalized eigenvalue problem in Equation (9) is solved with the Jacobian computed at the MAP estimate, and the eigenpairs corresponding to the largest eigenvalues are computed using the randomized GHEP algorithm described in Section 3.2. The posterior covariance matrix is computed using Equation (21). The variance, computed as the diagonal of the posterior covariance matrix, is visualized in Figure 10, and the generalized eigenvalues are plotted in Figure 11. Since the true variance requires sampling from the true posterior distribution, it is hard to compute, and a comparison of accuracy against it is difficult to establish. However, from the analysis in Section 4.2, computing using the formula in (12) with a cutoff of is a good approximation to the true computed at the MAP estimate. As can be seen in Figure 11, the number of retained eigenvalues of the GHEP defined in Equation (9) that satisfy the cutoff is far smaller than the number of unknowns or the number of measurements . Finally, in Figure 12, we report the cost of computing the dominant eigenmodes of the GHEP defined in Equation (9). The dominant component of this calculation is the matvecs involving , and . In Section 3, we argued that the costs of these matvecs scale linearly or almost linearly with the number of unknowns. Therefore, we expect the cost of the eigendecomposition, and as a result the variance calculation, to scale similarly. This is confirmed in Figure 12: the computational costs scale linearly with the number of unknowns.
6 Measures of uncertainty
An experimentalist designing experiments to estimate parameters wants to obtain as much information as possible from laborious or expensive experiments. In the context of Bayesian experimental design, one optimizes criteria based on scalar measures of conditional uncertainty. Several studies have focused on optimal experimental design for regression-like problems. However, the large computational costs associated with storing and computing the covariance matrices limit their utility in optimal design for the geostatistical approach. Moreover, in contrast to regression-based methods, the inverse problems arising in the environmental sciences are highly underdetermined. As a result, several uncertainty measures defined in the geostatistical context have different theoretical properties and significance compared to the (typically) overdetermined regression-based problems. For a good review of uncertainty measures in the context of the geostatistical approach, see [32]. Several measures of uncertainty are based on the alphabetic optimality criteria prevalent in the regression literature. Because of the high computational complexity of storing and computing with posterior covariance matrices, several of the optimality criteria are either computationally intensive or completely infeasible (for example, the D-optimality criterion). Several of these optimality measures can be computed easily when the number of measurements is small, even when the unknowns are discretized on a very fine grid. However, as the number of measurements grows, the computational cost of estimating the optimality criteria can grow rapidly.
The optimality criteria are based on invariants of the posterior covariance matrix, such as the trace, the log determinant, etc. When the posterior PDF is Gaussian, the posterior covariance matrix is given by Equation (8). When the posterior distribution is strongly non-Gaussian, the approximation in (21) is no longer a good approximation to the posterior covariance and one has to resort to Monte Carlo methods for computing these optimality criteria [24]. Our approach is to approximate the posterior covariance matrix by computing the dominant eigenmodes of the GHEP in Equation (9), using the approximation described in (14) and the procedure described in Section 3. Here, we describe the calculation of a few measures of uncertainty based on the posterior covariance matrix and efficient ways to compute them. The notation follows [32].

A-optimality
(23) For , the A-optimality criterion takes the special form . To compute , we start with the observation that
Since is of size we can compute and its trace explicitly in . Since is , can be computed in . Finally, can be computed as
(24) which can be computed in , and is known from the covariance kernel. In summary, the calculation of costs . In practice, is , so the terms involving it make a negligible contribution to the overall computational cost.
For , can be approximated using the Hutchinson trace estimator. Essentially,
where the entries of are chosen as with equal probability. This yields an unbiased estimator for the trace of . However, the variance of the estimator can be large. To remedy this, more sophisticated sampling schemes for estimating the trace of a matrix, based only on the availability of fast matvecs, have been proposed in [5].
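A minimal sketch of the Hutchinson estimator described above: with Rademacher probe vectors z, the expectation of z^T A z equals tr(A), so averaging z^T A z over a few probes gives an unbiased trace estimate using only matvecs. The function name and sample count are illustrative.

```python
import numpy as np

def hutchinson_trace(matvec, n, n_samples=200, rng=None):
    """Unbiased Hutchinson estimate of tr(A): E[z^T A z] = tr(A) when the
    entries of z are i.i.d. Rademacher (+1 or -1 with equal probability)."""
    rng = np.random.default_rng(rng)
    est = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe
        est += z @ matvec(z)
    return est / n_samples
```

Note that for a diagonal matrix the Rademacher estimator is exact, since the cross terms vanish identically; the variance comes entirely from off-diagonal entries.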

C-optimality
(25) where is the sensitivity of a scalar model prediction with respect to the estimated parameters, yielding the prediction variance for . This criterion extends easily to vector-valued model predictions; for details, see [32]. is a special case of . In particular,
Given the vector , can be calculated in . Furthermore, can be computed efficiently using adjoint-based techniques [32].
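With the low-rank representation of the posterior covariance, the C-optimality criterion c^T P_post c reduces to one matvec with the prior and k inner products, along the following lines (the interface is illustrative; lam and W denote the retained generalized eigenpairs, normalized so that W^T Q^{-1} W = I):

```python
import numpy as np

def c_optimality(c, Q_matvec, lam, W):
    """phi_C = c^T P_post c using P_post ~= Q - W diag(lam/(1+lam)) W^T.

    Costs one matvec with the prior Q plus O(nk) work for the k retained modes.
    """
    Wc = W.T @ c
    return c @ Q_matvec(c) - (lam / (1.0 + lam)) @ (Wc**2)
```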

D-optimality
For numerical reasons, the logarithm of the determinant is considered instead of the determinant itself. Computing the determinant directly is computationally infeasible. However, since is a symmetric positive definite matrix
Then, in a procedure similar to that for the A-optimality criterion, we can compute the trace of the matrix using the Hutchinson trace estimator
As before, the entries of are chosen as with equal probability. Calculating the entries of explicitly is also computationally infeasible. However, matvecs of the form can be formed efficiently using the matrix-free techniques described in [14, 20]. When the posterior PDF is Gaussian, i.e. for linear inverse problems, the D-optimality criterion is closely related to maximizing the entropy of the random variable. The entropy of a random variable with PDF is defined as , where denotes expectation. For Gaussian distributions , the entropy can be calculated as . Therefore, .
An alternative way to compute the D-optimality criterion is to use the fact that the determinant of a product of matrices is the product of the determinants. Applying this result to Equation (14), we have
(26) This result follows because
where the equality in the second line follows from a straightforward application of Sylvester's determinant lemma together with the fact that the generalized eigenvectors are orthonormal. Further, we can treat as a constant because it does not change with the measurements.
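The identity above makes the D-optimality criterion essentially free once the generalized eigenvalues are available: the determinant ratio between posterior and prior is a product over 1/(1 + lam_i). A sketch, with the constant prior term dropped as noted:

```python
import numpy as np

def logdet_update(lam):
    """D-optimality up to a constant: since det(P_post) = det(Q) * prod_i 1/(1 + lam_i),

        log det(P_post) - log det(Q) = -sum_i log(1 + lam_i),

    where lam_i are the generalized eigenvalues of the (misfit Hessian, Q^{-1}) pair.
    The prior term log det(Q) does not depend on the measurement setup."""
    return -np.log1p(np.asarray(lam)).sum()
```

Eigenvalues below the cutoff contribute log(1 + lam_i) ~ lam_i, so truncating the spectrum perturbs the criterion by at most the sum of the discarded eigenvalues.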

E-optimality
This is related to the C-optimality criterion as . The vector that maximizes the C-optimality criterion is the direction of largest variation. The largest eigenvalue (and eigenvector) can be computed with a few matvecs involving , using a few iterations of either the power iteration or the Lanczos method. It can also be estimated efficiently using the randomized GHEP algorithm that we describe in Section 3.2 with and .
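A minimal power-iteration sketch for the E-optimality criterion, assuming a symmetric positive semidefinite operator available only through matvecs; the iteration count and tolerance are illustrative choices, and in practice the Lanczos method converges faster.

```python
import numpy as np

def power_iteration(matvec, n, maxit=500, tol=1e-10, rng=None):
    """Largest eigenvalue (and eigenvector) of a symmetric PSD operator via
    power iteration; each step costs one matvec."""
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(maxit):
        w = matvec(v)
        lam_new = v @ w            # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)  # renormalize the iterate
        if abs(lam_new - lam) < tol * max(1.0, abs(lam_new)):
            lam = lam_new
            break
        lam = lam_new
    return lam, v
```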
To illustrate these ideas, we consider again the ray tomography example of Section 4. We choose two different experimental setups to compare the uncertainty measures described in Section 6. Both setups have the same number of sources and the same number of receivers, discretized on a domain of size . However, they differ in the locations of the sources and as a result produce different reconstructions of the same field; see Figure 13. A natural question to ask is: which experimental setup is better? Intuitively, one would expect that when the sources cover a larger area, the reconstruction would be better over a larger area and hence the resulting uncertainty (measured as a function of the posterior covariance) would be lower over the entire domain. Indeed, this is the case: the variance of setup is larger than that of setup , and this is reflected in all the measures of uncertainty listed in Table 1.

For the computation of , we chose , whereas for computing we chose to be a vector of ones. We observe that is the most sensitive measure of uncertainty, because the difference in between the two setups is the largest, even in relative magnitude. The least sensitive is the E-optimality criterion, which only takes into account the largest eigenmode. measures the trace of the posterior covariance matrix. Both and account for large-scale and small-scale variations, since they take into account all the eigenmodes of the posterior GHEP in Equation (9). Computing requires , whereas requires a weighted sum of the . Therefore, small changes in the lowest eigenmodes (corresponding to fine-scale variations) have a higher impact on the calculation of than on . By the special choice of as a vector of ones, is proportional to the variance of predicting the global mean and can also be interpreted as proportional to the average conditional integral scale [32]. Since the sources in setup 1 are spread over a larger area, it captures the large-scale variability well but has difficulty with small-scale features, leading to a smaller integral scale and a lower . By similar reasoning, , which represents the uncertainty in the parameters, is larger for setup .
7 Conclusions and Future Work
Several papers have focused on solving inverse problems. Quantifying the uncertainty associated with the estimate is challenging because of the high dimensionality of the space of unknowns and the high computational costs of representing unknown random fields and of computing the solution to the forward problem. We consider the problem of quantifying the associated uncertainty by computing an approximation to the posterior covariance matrix, which is composed of two terms: the prior covariance matrix and a low-rank term that contains the dominant eigenmodes of a generalized Hermitian eigenvalue problem (GHEP) involving the data misfit portion of the Hessian and the inverse prior covariance matrix. For several inverse problems, the spectrum of this eigenproblem decays rapidly and we are justified in retaining only the largest eigenvalues satisfying a cutoff . We have provided an efficient algorithm for computing this low-rank representation using a randomized algorithm that deliberately avoids expensive computations involving the square root (or its inverse) of the prior covariance matrix. The accuracy of the low-rank representation and the parameters controlling the number of retained eigenvalues have been explored. Finally, we have shown how to approximately compute measures of uncertainty based on the approximate representation of the posterior covariance matrix. The resulting algorithms are highly scalable, and their performance has been demonstrated on two challenging applications: ray-based tomography and steady-state hydraulic tomography.
For nonlinear problems, we have used an approximate posterior distribution obtained by linearizing the measurement operator at the MAP point, which yields a Gaussian approximation. For highly nonlinear problems this approximation may not be reasonable; in such cases, one may have to resort to sampling from the full posterior in order to quantify the uncertainty [29]. However, such a process would be challenging for much the same reasons: the high dimensionality of the input space and the high computational cost of the forward problem. A limitation of our analysis is that it is restricted to priors that can be represented as Gaussian random fields. Future work is necessary to extend our analysis to other priors.
In the future, we would like to explore using the uncertainty measures described in Section 6 to optimize the control variables of an experimental setup in order to obtain the greatest amount of information from laborious or expensive experiments. Finally, another challenging application that would benefit from an efficient representation of the posterior covariance is data assimilation using the Kalman filter. Implementing the Kalman filter can be computationally expensive because of the high cost of updating the posterior distribution. For a certain class of problems, we anticipate that the posterior covariance matrix can be represented as the combination of the prior and a low-rank term that can be updated recursively and efficiently. This could lead to a highly scalable implementation of the Kalman filter, which will be discussed in subsequent papers.
8 Acknowledgements
The major work on this paper was completed when the first author was a PhD candidate at ICME, Stanford University, and he would like to thank Ivy Huang for all her help and support during this process and beyond. The authors were supported by NSF Award 0934596, Subsurface Imaging and Uncertainty Quantification. We thank Tania Bakhos for a careful reading of the manuscript and for valuable suggestions to improve its readability.
References
 [1] Alen Alexanderian, Philip Gloor, and Omar Ghattas. On Bayesian A- and D-optimal experimental designs in infinite dimensions. arXiv preprint arXiv:1408.6323, 2014.
 [2] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas. A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized ℓ0-sparsification. SIAM Journal on Scientific Computing, 36(5):A2122–A2148, 2014.
 [3] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas. A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems. arXiv preprint arXiv:1410.5899, 2014.
 [4] Sivaram Ambikasaran, Judith Yue Li, Peter K Kitanidis, and Eric Darve. Large-scale stochastic linear inversion using hierarchical matrices. Computational Geosciences, 17(6):913–927, 2013.
 [5] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semidefinite matrix. Journal of the ACM (JACM), 58(2):8, 2011.
 [6] Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst. Templates for the solution of algebraic eigenvalue problems: a practical guide, volume 11. Society for Industrial and Applied Mathematics, 2000.
 [7] R.K. Beatson, J.B. Cherrie, and C.T. Mouat. Fast fitting of radial basis functions: Methods based on preconditioned GMRES iteration. Advances in Computational Mathematics, 11(2):253–270, 1999.
 [8] M. Bebendorf. Hierarchical LU decompositionbased preconditioners for BEM. Computing, 74(3):225–247, 2005.
 [9] T. BuiThanh and O. Ghattas. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.
 [10] T. BuiThanh and O. Ghattas. Analysis of the Hessian for inverse scattering problems: II. Inverse medium scattering of acoustic waves. Inverse Problems, 28(5):055002, 2012.
 [11] Tan BuiThanh, Carsten Burstedde, Omar Ghattas, James Martin, Georg Stadler, and Lucas C Wilcox. Extremescale UQ for Bayesian inverse problems governed by PDEs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 3. IEEE Computer Society Press, 2012.
 [12] Tan BuiThanh, Omar Ghattas, James Martin, and Georg Stadler. A computational framework for infinitedimensional Bayesian inverse problems part i: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
 [13] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
 [14] Jie Chen, Mihai Anitescu, and Yousef Saad. Computing f(A)b via least squares polynomial approximations. SIAM Journal on Scientific Computing, 33(1):195–222, 2011.
 [15] G. Christakos. On the problem of permissible covariance and variogram models. Water Resources Research, 20(2):251–265, 1984.
 [16] CR Dietrich and Garry Neil Newsam. Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing, 18(4):1088–1107, 1997.
 [17] CR Dietrich and GN Newsam. Efficient generation of conditional simulations by chebyshev matrix polynomial approximations to the symmetric square root of the covariance matrix. Mathematical geology, 27(2):207–228, 1995.
 [18] H.P. Flath, L.C. Wilcox, V. Akçelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas. Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations. SIAM Journal on Scientific Computing, 33(1):407–432, 2011.
 [19] E. Haber, U.M. Ascher, and D. Oldenburg. On optimization techniques for solving nonlinear inverse problems. Inverse problems, 16(5):1263, 2000.
 [20] Nicholas Hale, Nicholas J Higham, and Lloyd N Trefethen. Computing A^α, log(A), and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46(5):2505–2523, 2008.
 [21] N. Halko, P.G. Martinsson, and J.A. Tropp. Finding structure with randomness: Pr