Scalable optimization-based sampling on function space


Zheng Wang Center for Computational Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (zheng_w@mit.edu)    Tiangang Cui School of Mathematical Sciences, Monash University, Victoria 3800, Australia (tiangang.cui@monash.edu)    Johnathan M. Bardsley Department of Mathematical Sciences, University of Montana, Missoula, MT 59812 USA (bardsleyj@mso.umt.edu)    Youssef M. Marzouk Center for Computational Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (ymarz@mit.edu)
Abstract

Optimization-based samplers provide an efficient and parallelizable approach to solving large-scale Bayesian inverse problems. These methods solve randomly perturbed optimization problems to draw samples from an approximate posterior distribution. “Correcting” these samples, either by Metropolization or importance sampling, enables characterization of the original posterior distribution. This paper presents a new geometric interpretation of the randomize-then-optimize (RTO) method [1] and a unified transport-map interpretation of RTO and other optimization-based samplers, namely implicit sampling [19] and randomized maximum likelihood [20]. We then introduce a new subspace acceleration strategy that makes the computational complexity of RTO scale linearly with the parameter dimension. This subspace perspective suggests a natural extension of RTO to a function space setting. We thus formalize a function-space version of RTO and establish sufficient conditions for it to produce a valid Metropolis–Hastings proposal, yielding dimension-independent sampling performance. Numerical examples corroborate the dimension-independence of RTO and demonstrate sampling performance that is also robust to small observational noise.

Key words. Markov chain Monte Carlo, Metropolis independence sampling, Bayesian inference, infinite-dimensional inverse problems

AMS subject classifications. 15A29, 65C05, 65C60

1 Introduction

The Bayesian statistical approach is widely used for uncertainty quantification in inverse problems—i.e., inferring parameters of mathematical models given indirect, limited, and/or noisy data [15, 27]. In a Bayesian setting, the parameters are described as random variables and endowed with prior distributions. Conditioning on a realized value of the data yields the posterior distribution of these parameters, which characterizes uncertainty in possible parameter values. Solving the inverse problem amounts to computing posterior expectations, e.g., posterior means, variances, marginals, or other quantities of interest.

Sampling methods—in particular, Markov chain Monte Carlo (MCMC) algorithms—provide an extremely flexible way of estimating posterior expectations [4]. The design of effective MCMC methods, however, rests on the careful construction of proposal distributions: efficiency demands proposal distributions that reflect the geometry of the posterior [12], e.g., anisotropy, strong correlations, and even non-Gaussianity [21]. Another significant challenge in applying MCMC is parameter dimensionality. In many inverse problems governed by partial differential equations, the “parameter” is in fact a function of space and/or time that, for computational purposes, must be represented in a discretized form. Discretizations that sufficiently resolve the spatial or temporal heterogeneity of this function are often high dimensional. Yet, as analyzed in [17, 18, 24, 25], the performance of many common MCMC algorithms may degrade as the dimension of the discretized parameter increases, meaning that more MCMC iterations are required to obtain an effectively independent sample. One can design MCMC algorithms that do not degrade in this manner by formulating them in function space and ensuring that the proposal distribution satisfies a certain absolute continuity condition [8, 27]. These samplers are called dimension independent [8, 9]. Yet another core challenge is that MCMC algorithms are, in general, intrinsically serial: sampling amounts to simulating a discrete-time Markov process. The literature has seen many attempts at parallelizing the evaluation of proposed points [5] or sharing information across multiple chains [7, 14], but none of these is embarrassingly parallel.

A promising approach to many of these challenges is to convert optimization methods into samplers (i.e., Monte Carlo methods). This idea has been proposed in many forms: key examples include randomize-then-optimize (RTO) [1], Metropolized randomized maximum likelihood (RML) [20, 28], and implicit sampling [6, 19]. In its most basic form, RTO requires Bayesian inverse problems with Gaussian priors and noise models, although it can extend to problems with non-Gaussian priors via a change of variables [29]. Metropolized RML has problem requirements similar to those of RTO, but requires evaluating second-order derivatives of the forward model. Implicit sampling applies to target densities whose contours enclose star-convex regions, in that any ray starting from the mode (or some other nominal point) intersects each contour only once; each proposal sample can then be generated cheaply via a one-dimensional line search.

In general, each of these algorithms solves randomly-perturbed realizations of an optimization problem to generate samples from a probability distribution that is “close to” the posterior. The probability density function of this distribution is computable, and thus the distribution can be used as an independent proposal within MCMC or as a biasing distribution in importance sampling. For non-Gaussian targets, these proposal distributions are non-Gaussian. In general, they are adapted to the target distribution. The computational complexity and dimension-scalability of the resulting sampler can be linked to the structure of the corresponding optimization problem. In addition, these sampling methods are embarrassingly parallel, and are easily implemented with existing optimization tools developed for solving deterministic inverse problems.
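For illustration, here is a minimal Python sketch of the perturbed-optimization idea in a hypothetical linear-Gaussian setting, where the perturbed least-squares minimizers are known to be exact posterior samples; all names (`A`, `sigma`, `perturbed_optimization_sample`) are illustrative placeholders, and the nonlinear case requires the Metropolization or importance-sampling corrections discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian inverse problem: y = A u + e,
# with prior u ~ N(0, I) and noise e ~ N(0, sigma^2 I).
n, m, sigma = 20, 10, 0.1
A = rng.standard_normal((m, n))
u_true = rng.standard_normal(n)
y = A @ u_true + sigma * rng.standard_normal(m)

def perturbed_optimization_sample():
    """Minimize 0.5*||A u - (y + e)||^2 / sigma^2 + 0.5*||u - p||^2
    with freshly drawn perturbations e and p (here via the normal equations)."""
    e = sigma * rng.standard_normal(m)
    p = rng.standard_normal(n)
    H = A.T @ A / sigma**2 + np.eye(n)          # Hessian of the objective
    g = A.T @ (y + e) / sigma**2 + p            # perturbed right-hand side
    return np.linalg.solve(H, g)

# Each sample is an independent optimization problem: embarrassingly parallel.
samples = np.array([perturbed_optimization_sample() for _ in range(2000)])

# Compare with the exact Gaussian posterior mean and covariance.
cov_post = np.linalg.inv(A.T @ A / sigma**2 + np.eye(n))
mean_post = cov_post @ (A.T @ y / sigma**2)
print(np.abs(samples.mean(axis=0) - mean_post).max())
print(np.abs(np.cov(samples.T) - cov_post).max())
```

Because each sample is an independent optimization problem, the loop above could be distributed across workers with no communication.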

This paper considers optimization-based sampling in high dimensions. We focus on the analysis and scalable implementation of the RTO method. To begin, we present a new geometric interpretation of RTO that provides intuition for the method and its regime of applicability (Section LABEL:sec:geo_rto). We also interpret RTO and other optimization-based samplers as ways of realizing the actions of particular transport maps [16] (Section LABEL:sec:map_rto). Using these two interpretations, we next motivate and construct a subspace-accelerated version of RTO whose computational complexity scales linearly with parameter dimension (Section LABEL:sec:scalable). This approach significantly accelerates RTO in high-dimensional settings. Subspace acceleration reveals that RTO’s mapping acts differently on different subspaces of the parameter space. Exploiting the separation of the parameter subspaces allows us to cast the transport map generated by RTO in an infinite-dimensional (i.e., function space) setting [27] (Section LABEL:sec:proof). As our main theoretical result, we establish sufficient conditions for the probability distribution induced by RTO’s mapping to be absolutely continuous with respect to the posterior. This result justifies RTO’s observed dimension-independent sampling behavior: the acceptance rate and autocorrelation time of an MCMC chain using RTO as its proposal do not degrade as the parameter dimension increases. Similarly, the performance of importance sampling using RTO as a biasing distribution will stabilize in high dimensions. This result is analogous to the arguments in [3, 27, 9, 2, 26] showing that (generalized) preconditioned Crank–Nicolson (pCN), dimension-independent likelihood-informed (DILI) MCMC, and other infinite-dimensional geometric MCMC methods are dimension-independent. However, our MCMC construction relies on non-Gaussian proposals in a Metropolis independence setting, where the Markov chain can be run at essentially zero cost after the computationally costly step of drawing proposal samples and evaluating the proposal density. Because the latter step is embarrassingly parallel, the overall MCMC scheme is immediately parallelizable, unlike the above-mentioned MCMC samplers that rely on the iterative construction of Markov chains.

In Section LABEL:sec:numerics, we provide a numerical illustration of our algorithms, exploring the factors that influence RTO’s sampling efficiency. We observe that neither the parameter dimension nor the magnitude of the observational noise influences RTO’s performance per MCMC step, though they both impact the computational cost of each step. Despite its more costly steps, RTO outperforms simple pCN in this example. Overall, our results show that RTO can tackle inverse problems with large parameter dimensions and arbitrarily small observational noise.

2 RTO and its geometric interpretation

RTO generates samples from an approximation to the target distribution in two steps. First, it repeatedly solves perturbed optimization problems to generate independent proposal samples. Second, it uses this collection of samples to define an independent proposal within Metropolis–Hastings (MH) or a biasing distribution in self-normalized importance sampling. In this section, we will describe the target distributions to which RTO can be applied and detail the two steps above. We will also present a geometric interpretation in which the RTO proposal can be viewed as a projection. This interpretation helps explain the sampling efficiency of RTO and clarifies the technical requirements for the RTO procedure to be valid.

2.1 Target distribution

As formulated in [1], RTO applies to target distributions whose densities can be written as

(1)

where is a vector-valued function of the parameters with an output dimension of , for any . This structure is found in Bayesian inverse problems and other similar problems with parameters, observations, a Gaussian prior, and additive Gaussian observational noise. To illustrate, let

where is the data, is the forward model, is the unknown parameter, and is the additive noise, assumed independent of . Here, is the prior mean, and and are the covariance matrices of the observational noise and prior. We can simplify the problem via an affine change of variables that transforms the covariance matrices to identity matrices. Let us define the matrix square roots,

and new whitened variables,

where is the whitened unknown parameter, is the whitened forward model, and is the whitened observational noise. The inverse problem becomes

The data is shifted to the origin after whitening. The posterior density of the whitened variable is then

which is in the required form (LABEL:eq:lsform) with defined as

(2)

Given a sample from the target density , we can obtain a posterior sample of the original parameter by applying the transformation

Notice that the form of in (LABEL:eq:lsform) is identical to the probability density function of an –dimensional standard normal distribution, , evaluated at . This paints the following geometric picture of the required target distribution: the target density , up to a normalizing constant, is the same as the density of the –dimensional Gaussian distribution evaluated on the –dimensional manifold parameterized by .
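As a concrete sketch of this whitening step, the following Python fragment (with illustrative placeholder names such as `Gamma_pr`, `Gamma_obs`, and `forward_model`, and using Cholesky factors rather than symmetric square roots) builds the whitened forward model, in which the data are shifted to the origin, and the map from the whitened parameter back to the original one.

```python
import numpy as np

def whiten_problem(forward_model, y, mu_pr, Gamma_pr, Gamma_obs):
    """Return a whitened forward model (data shifted to the origin) and the
    map from the whitened parameter back to the original parameter.
    Square roots are taken here as lower Cholesky factors."""
    L_pr = np.linalg.cholesky(Gamma_pr)      # Gamma_pr = L_pr L_pr^T
    L_obs = np.linalg.cholesky(Gamma_obs)    # Gamma_obs = L_obs L_obs^T

    def to_original(v):
        # original parameter corresponding to a whitened sample v ~ N(0, I)
        return mu_pr + L_pr @ v

    def whitened_forward(v):
        # noise-whitened misfit of the forward model, so the "data" in the
        # whitened problem is the zero vector
        return np.linalg.solve(L_obs, forward_model(to_original(v)) - y)

    return whitened_forward, to_original

# Tiny illustrative use with a made-up nonlinear forward model:
rng = np.random.default_rng(1)
forward_model = lambda u: np.array([u[0]**2 + u[1], np.sin(u[1])])
Gamma_pr, Gamma_obs = np.diag([1.0, 0.5]), 0.01 * np.eye(2)
mu_pr, y = np.zeros(2), np.array([0.8, 0.1])
G, to_original = whiten_problem(forward_model, y, mu_pr, Gamma_pr, Gamma_obs)
print(G(rng.standard_normal(2)))
```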

2.2 The RTO algorithm

The RTO algorithm requires an orthonormal basis for an -dimensional subspace of . Let this basis be collected in a matrix ; in principle, can be any matrix with orthonormal columns. One common choice of follows from first finding a linearization point, which is often (but not necessarily) taken to be the mode of the target distribution, i.e.,

(3)

Then one can compute from a thin QR factorization of ; this sets the basis to span the range of .

Using this matrix, we obtain proposal samples by repeatedly drawing independent –dimensional standard normal vectors and solving the nonlinear system of equations

(4)

which is equivalent to solving the optimization problem

(5)

if the minimum of (LABEL:eq:optprob) is zero. To ensure that the system of equations (LABEL:eq:rtoequation) has a unique solution and that the probability density of the resulting samples can be calculated explicitly, we require additional conditions as follows.

Assumption 1 (Sufficient conditions for valid RTO [1]).
  1. The function is continuously differentiable with Jacobian .

  2. The Jacobian has full column rank for every .

  3. The map is invertible.

Under these conditions, the RTO samples are distributed according to the probability density

(6)

Most of these assumptions follow from the inverse problem setup; for example, it is natural for the forward model to be continuous and the prior to be a proper distribution. The final assumption, however, is more delicate and may not always hold. We will use the geometric interpretation of RTO to elaborate on this assumption.

(a) Top view
(b) Front view
Figure 1: Interpretation of RTO as a projection, in the case . The –dimensional Gaussian samples (green circles), are projected onto the manifold (red line), to determine the proposal samples (green x’s). The projection is orthogonal to the range of . In the front view, the proposal samples (green x’s) are shown to be distributed according to a proposal density (green line) that is close to the target density (red line).

Proposal samples generated via RTO can be interpreted as a projection of –dimensional Gaussian samples onto the –dimensional manifold . The samples are projected along the directions orthogonal to the range of . Figure LABEL:fig:proj depicts the steps of RTO’s proposal for the case . This geometric interpretation also illustrates conditions for the RTO procedure to be valid: there should exist only one point that is both on the manifold and along the projection directions of . Under these conditions, the nonlinear system of equations (LABEL:eq:rtoequation) has a unique solution. If a projection of in the direction onto does not exist, or if there exists more than one intersection point, then the RTO procedure breaks down.

As shown in [1], RTO’s proposal is exact (i.e., is the target) when the forward model is linear, and its proposal is expected to be close to the target when the forward model is close to linear. For weakly nonlinear problems, the proposal can be a good approximation to the posterior and hence can be used in MCMC and importance sampling.

These proposal samples are used either as an independent proposal in Metropolis–Hastings (MH) or as a biasing distribution in self-normalized importance sampling. For the former case, the Metropolis–Hastings acceptance ratio can be written as

where the weight is defined as

(7)
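To make the proposal and weighting steps concrete, the sketch below solves the nonlinear system (LABEL:eq:rtoequation) as the least-squares problem (LABEL:eq:optprob) with a generic solver and evaluates the corresponding unnormalized log-weight; `H`, `jac_H`, and `Q` stand for the quantities defined above, and the density and weight expressions follow the standard RTO formulas of [1], to be read against (LABEL:eq:RTOproposal) and (LABEL:eq:w(theta)).

```python
import numpy as np
from scipy.optimize import least_squares

def rto_proposal_sample(H, jac_H, Q, xi, v0):
    """Solve Q^T H(v) = Q^T xi by minimizing ||Q^T (H(v) - xi)||^2."""
    res = least_squares(lambda v: Q.T @ (H(v) - xi),
                        v0,
                        jac=lambda v: Q.T @ jac_H(v))
    assert res.cost < 1e-10, "minimum not (numerically) zero: RTO assumptions may be violated"
    return res.x

def rto_log_weight(H, jac_H, Q, v):
    """log w(v) = log pi(v) - log q(v), up to an additive constant,
    using the standard RTO proposal density from [1]."""
    Hv = H(v)
    QtHv = Q.T @ Hv
    _, logdet = np.linalg.slogdet(Q.T @ jac_H(v))
    log_target = -0.5 * Hv @ Hv
    log_proposal = logdet - 0.5 * QtHv @ QtHv
    return log_target - log_proposal
```

In practice, the optimizer would be warm-started (e.g., at the linearization point) and the assertion replaced by a tolerance appropriate to the problem.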

The resulting method, called RTO–MH, is summarized in Algorithm LABEL:alg:rto.

For importance sampling, since the normalizing constant of the target density is unknown, the weights must be normalized as

where is the number of samples and the sum of weights is thus one. The proposal samples and weights can then be used to compute posterior expectations of some quantity of interest using the self-normalized importance sampling formula:
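As a concrete illustration of this estimator, the following sketch normalizes a set of (hypothetical) unnormalized RTO log-weights and forms the self-normalized estimate of a quantity of interest; working with log-weights avoids overflow, and the effective sample size is a standard diagnostic for the quality of the proposal.

```python
import numpy as np

def self_normalized_estimate(samples, log_w, quantity_of_interest):
    """Self-normalized importance sampling with RTO proposal samples.

    samples : array of shape (N, n), proposal samples
    log_w   : array of shape (N,), unnormalized log-weights log w(theta_i)
    """
    log_w = log_w - log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    w_bar = w / w.sum()                      # normalized weights sum to one
    values = np.array([quantity_of_interest(s) for s in samples])
    estimate = np.sum(w_bar * values)
    ess = 1.0 / np.sum(w_bar**2)             # effective sample size diagnostic
    return estimate, ess
```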

1:Find using (LABEL:eq:opmode)
2:Determine
3:Compute , whose columns are an orthonormal basis for the range of
4:for  do in parallel
5:     Sample from an –dimensional standard normal distribution
6:     Solve for a proposal sample using (LABEL:eq:optprob)
7:     Compute from (LABEL:eq:w(theta))
8:Set
9:for  do in series
10:     Sample from a uniform distribution on [0,1]
11:     if  then
12:          =
13:     else
14:          =      
Algorithm 1 RTO Metropolis–Hastings (RTO-MH)
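A minimal sketch of the serial Metropolis pass in Algorithm LABEL:alg:rto (steps 8–14) is given below, assuming the proposal samples and their unnormalized log-weights have already been computed in parallel; for an independence proposal, the acceptance ratio reduces to the ratio of weights.

```python
import numpy as np

def rto_metropolis_hastings(samples, log_w, rng=None):
    """Run the cheap, serial independence-Metropolis pass over precomputed
    RTO proposal samples and log-weights, returning the chain and its indices."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(log_w)
    chain = np.empty(N, dtype=int)
    chain[0] = 0
    for k in range(1, N):
        # accept with probability min(1, w_proposed / w_current)
        if np.log(rng.uniform()) < log_w[k] - log_w[chain[k - 1]]:
            chain[k] = k
        else:
            chain[k] = chain[k - 1]
    return samples[chain], chain
```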

3 Optimization-based samplers as transport maps

One can also interpret RTO as a means of pushing forward a Gaussian reference distribution to some approximation of the target distribution. In other words, RTO realizes the action of a particular transport map [21]. This interpretation also extends to other optimization-based samplers.

Recall that RTO solves the following nonlinear system of equations

Since is a standard Gaussian and the columns of are orthonormal, the projection of , denoted by , is also a standard Gaussian. This way, the nonlinear system of equations can be expressed as

(8)

This equation describes a deterministic coupling between the target random variable and the standard Gaussian “reference” random variable . The coupling is defined by the forward model, the data, the observational noise, and the prior, through the function and the matrix .

The left hand side of (LABEL:eq:RTOmap), which we write compactly as

defines a transport map from to . By solving the nonlinear system (LABEL:eq:RTOmap) for a given to obtain a proposal , the RTO procedure implicitly inverts the transport map ; that is, it evaluates on each , . The inverse map pushes forward the reference distribution to an approximation of . In particular, the normalized probability density of generated by RTO is given by the standard formula for pushforward densities:

This transport map interpretation explains why Assumption LABEL:assumoverall.LABEL:assum3 in the previous section is necessary for the validity of the RTO procedure. The transport map interpretation and the geometric interpretation both describe taking a Gaussian random variable and transforming it through a projection to obtain the proposal random variable .
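As a small sanity check on this pushforward point of view (a hypothetical one-dimensional example, not one of the samplers above), the sketch below inverts a monotone map at standard normal draws and compares a histogram of the resulting samples with the change-of-variables density evaluated along the map.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# A monotone nonlinear map T and its derivative (illustrative choice).
T = lambda theta: theta + 0.2 * theta**3
dT = lambda theta: 1.0 + 0.6 * theta**2

rng = np.random.default_rng(2)
nus = rng.standard_normal(50_000)

# Invert the map numerically: theta = T^{-1}(nu) (the "optimization" step).
thetas = np.array([brentq(lambda t, v=v: T(t) - v, -10.0, 10.0) for v in nus])

# Pushforward density of N(0,1) through T^{-1}, by change of variables.
q = lambda theta: norm.pdf(T(theta)) * np.abs(dT(theta))

# Compare a histogram of the samples against q on a grid.
hist, edges = np.histogram(thetas, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
for g in np.linspace(-2, 2, 5):
    i = np.argmin(np.abs(centers - g))
    print(f"theta={g:+.1f}  histogram={hist[i]:.3f}  q(theta)={q(g):.3f}")
```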

Other optimization-based sampling algorithms such as the random-map implementation of implicit sampling [19] and Metropolized RML [20] similarly yield deterministic couplings of two random variables. A summary of each algorithm’s mapping is given in Table LABEL:tab:maps. Each algorithm describes a different map and uses computation (i.e., solves an optimization problem) to evaluate . Detailed derivations of these mappings are provided in Appendix LABEL:sec:app_imp.

Algorithm            Target distribution            Transport map
RTO
Implicit sampling
Metropolized RML

Table 1: Transport map interpretation of the three optimization-based samplers. In RTO, with default settings, the matrix comes from a thin QR factorization, and in implicit sampling, the matrix comes from any matrix square root.

All three algorithms use a standard normal as the reference distribution and the pushforward of this reference through as a proposal distribution. The proposal distribution is then used as an independent proposal in Metropolis–Hastings or as a biasing distribution in importance sampling. By specifying different relationships between and —that is, different choices of the mapping —we can derive all of these optimization-based samplers in a unified framework.

4 Scalable implementation of RTO

Solving the optimization problem (LABEL:eq:optprob) and computing the RTO probability density (LABEL:eq:RTOproposal) can be computationally costly when the parameter vector is high-dimensional. For example, the matrix–vector product with in each evaluation of (LABEL:eq:optprob) costs floating point operations, and computing the determinant in the proposal density (LABEL:eq:RTOproposal) scales as , where is the number of parameters. In this section, we introduce a subspace acceleration strategy to make these operations of RTO scale linearly with the parameter dimension.

This scalable implementation avoids computing and storing the QR factorization of the full-rank matrix . Instead, it constructs (and stores) a singular value decomposition (SVD) of the smaller linearized forward model . In this way, the computational complexity of evaluating the determinant in (LABEL:eq:RTOproposal) and of the matrix–vector product with in (LABEL:eq:optprob) can scale linearly with the parameter dimension .

To begin, we note from the definition (LABEL:eq:H) of that

Recall from RTO’s mapping (LABEL:eq:RTOmap) that the RTO proposal samples are found by

where the columns of form an orthonormal basis for the range of and is computed from the thin QR decomposition of . To avoid computing with the dense matrix , we use instead

which can be shown to have orthonormal columns and the same range as . If the SVD of , which we assume has rank , is given by , then

where and are the identity matrices of size and , respectively. Using the last identity, we obtain, after some algebraic manipulation,

which we use to obtain

(9)

Thus the nonlinear system solved in Step 6 of Algorithm LABEL:alg:rto can be rewritten as

(10)

Equation (LABEL:eq:scale) separates into two parts: one in the column space of and another in its orthogonal complement. Suppose we have a basis for the complement space (we never compute this basis explicitly; it is used only to represent the complement space mathematically) such that form a complete orthonormal basis for . We can solve the nonlinear system of equations (LABEL:eq:scale) by defining

setting the components in the orthogonal complement of to

(11)

and then solving the –dimensional optimization problem

(12)

Equations (LABEL:eq:svd1) and (LABEL:eq:svd2) replace the –dimensional optimization problem in (LABEL:eq:optprob). In addition, at Step 7 of Algorithm LABEL:alg:rto, we need to calculate the expression in (LABEL:eq:w(theta)). First, note that (LABEL:eq:svdet1) implies

Hence, the determinant term is given by

(13)

where we use Sylvester’s determinant identity in the first step and, in the second step, the fact that . This formulation replaces the determinant calculation with one of size .
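The determinant reduction can be checked numerically in the linear case: for a whitened linear forward model with Jacobian A, the full determinant computed from the thin QR factorization agrees with a product over the singular values of A, consistent with Sylvester’s identity det(I_n + A^T A) = det(I_m + A A^T). This is an illustrative check rather than a reproduction of the exact expression in (LABEL:eq:det).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 12                                   # many parameters, few data

A = rng.standard_normal((m, n))                  # Jacobian of a whitened linear forward model
J_H = np.vstack([np.eye(n), A])                  # Jacobian of the stacked map v -> (v, A v)
Q, _ = np.linalg.qr(J_H)                         # thin QR, Q is (n+m) x n

# Direct n x n determinant (what the naive implementation would compute).
_, logdet_direct = np.linalg.slogdet(Q.T @ J_H)

# Reduced computation from the singular values of A, via Sylvester's identity:
# det(I_n + A^T A) = det(I_m + A A^T) = prod_i (1 + s_i^2).
s = np.linalg.svd(A, compute_uv=False)
logdet_reduced = 0.5 * np.sum(np.log1p(s**2))

print(logdet_direct, logdet_reduced)             # agree to roundoff
```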

To summarize, we have presented an implementation of RTO that avoids the QR factorization of and instead stores the SVD of the smaller matrix . As a result, we reduce both the size of the optimization problem and the size of the matrix determinant calculation for each proposal. The new scalable implementation reduces the computational complexity of taking a step in MCMC from to . This is advantageous when the size of the parameter vector is much larger than the number of observations. This implementation is outlined in Algorithm LABEL:alg:scale.

1:Find using (LABEL:eq:opmode).
2:Determine the Jacobian matrix of the forward model, .
3:Compute , and , which is the SVD of .
4:for  do in parallel
5:     Sample from an –dimensional standard normal distribution.
6:     Solve for a proposal sample using (LABEL:eq:svd1) and (LABEL:eq:svd2).
7:     Compute from (LABEL:eq:w(theta)) using the determinant from (LABEL:eq:det).
8:Set .
9:for  do in series
10:     Sample from a uniform distribution on [0,1].
11:     if  then
12:          = .
13:     else
14:          = .      
Algorithm 2 Scalable implementation of RTO–MH
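The following sketch illustrates the structure of steps 5–7 of Algorithm LABEL:alg:scale for a whitened toy problem in which the target has the stacked form of Section LABEL:sec:target. The splitting used here is a plausible reconstruction consistent with the derivation above rather than a verbatim transcription of (LABEL:eq:svd1)–(LABEL:eq:det), so the final check verifies that the computed sample indeed solves the full RTO system; only an r-dimensional optimization problem and an r-by-r determinant are required per proposal.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)

# Whitened toy problem: prior N(0, I_n), data misfit G(v) in R^m, m << n.
n, m = 200, 5
B = 0.3 / np.sqrt(n) * rng.standard_normal((m, n))
G = lambda v: B @ v + 0.1 * (B @ v)**3                 # weakly nonlinear whitened misfit
jac_G = lambda v: (1.0 + 0.3 * (B @ v)**2)[:, None] * B

# Linearization point (the prior mean here, for simplicity) and its thin SVD.
v_ref = np.zeros(n)
Phi, lam, PsiT = np.linalg.svd(jac_G(v_ref), full_matrices=False)
Psi = PsiT.T                                           # n x r right singular vectors
r = lam.size

def scalable_rto_sample(xi_v, xi_d):
    """One proposal sample via the subspace split: the component of v outside
    range(Psi) is set directly from xi_v; only r coefficients are optimized."""
    v_perp = xi_v - Psi @ (Psi.T @ xi_v)               # complement component, no optimization
    rhs = Psi.T @ xi_v + lam * (Phi.T @ xi_d)          # r-dimensional right-hand side

    def residual(a):
        v = Psi @ a + v_perp
        return a + lam * (Phi.T @ G(v)) - rhs

    a = least_squares(residual, np.zeros(r)).x
    v = Psi @ a + v_perp

    # r x r determinant entering the proposal density, up to a v-independent
    # constant (via Sylvester's identity).
    _, logdet_r = np.linalg.slogdet(np.eye(r) + (Phi.T @ jac_G(v) @ Psi) * lam)
    return v, logdet_r

# Draw one proposal and verify it solves the full RTO system
# v + J^T G(v) = xi_v + J^T xi_d, with J the Jacobian at the linearization point.
xi_v, xi_d = rng.standard_normal(n), rng.standard_normal(m)
v, logdet_r = scalable_rto_sample(xi_v, xi_d)
J = jac_G(v_ref)
print(np.linalg.norm(v + J.T @ G(v) - (xi_v + J.T @ xi_d)))   # ~ 0
```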
(a)–(c) RTO proposals for three different SVD truncation ranks.
Figure 2: Truncating the SVD in a two-dimensional toy example with a nonlinear forward model and standard normal prior. Top: contours of the prior, posterior and RTO’s proposal density. Bottom: contours of the prior and posterior densities, and samples from RTO’s proposal.
Remark 2.

We can truncate the rank of the SVD in cases where data is abundant, i.e., when is a large vector. This will change RTO’s map and the resulting proposal distribution. Figure LABEL:fig:trunc_nonlin shows the effect on RTO’s proposal of truncating the SVD, in a toy example with a nonlinear forward model and a standard normal prior. Truncation restricts the role of the data misfit term in the construction of the proposal distribution. As the rank decreases, the proposal distribution becomes broader. In the extreme case, when the rank is truncated to zero, RTO’s proposal reverts to the prior. Note, however, the non-Gaussianity of the proposal for in this nonlinear example.

5 Dimension independence of RTO

In numerical experiments (see Section LABEL:sec:numerics and [29]), we observe that the acceptance rate and the effective sample size (for a fixed number of MCMC iterations) of RTO are independent of the parameter dimension. In this section, we will justify this behavior by showing that RTO yields a valid MCMC proposal in the function space setting [27].

Suppose the parameter space is a separable Hilbert space endowed with a Gaussian prior measure such that . We denote the probability measure induced by the RTO mapping as . The following theorem establishes conditions on such that the MH algorithm using RTO as its independence proposal will be well defined on .

Theorem 3.

Suppose the target measure is equivalent to the prior measure , i.e., . If the Radon–Nikodym derivative of the target measure with respect to the RTO measure,

(14)

is positive almost surely with respect to , then the acceptance probability of the MH algorithm using RTO as its independence proposal is positive almost surely with respect to .

Proof.

The proof can be viewed as a special case of Theorem 5.1 in [27]. Let denote the probability measure of the MH proposal starting from the point . We can define the following pair of measures on :

The acceptance probability of MH is then given by the Radon–Nikodym derivative

Since RTO is an independent proposal, the resulting MH proposal measure becomes

and the pair of transition measures become

This way, the acceptance probability can be expressed as

Thus the acceptance probability is positive –almost surely if and only if is positive –almost surely. ∎

If the acceptance probability is positive almost surely with respect to , the posterior is invariant for the Markov chain of the resulting MH algorithm. Hence, as shown in [27], the algorithm is well-posed in function space and has sampling performance independent of the parameter discretization dimension.

In the rest of this section, we will focus on establishing sufficient conditions for the Radon–Nikodym derivative to be positive –almost surely. We outline the broad steps of the derivation as follows. In Section LABEL:sec:unwhite_map, we derive the unwhitened mapping for RTO in a finite-dimensional (discrete) setting. Because the unwhitened prior corresponds to a discretized Gaussian process, we can naturally generalize RTO to an infinite-dimensional Hilbert space setting in Section LABEL:sec:inf_setup. Motivated by Section LABEL:sec:scalable, we introduce a particular rank– projection operator and use it to define RTO’s inverse map. In Section LABEL:sec:inf_rnd, we derive RTO’s proposal measure as the pullback of the prior measure through this inverse map. This construction allows us to establish conditions ensuring that is positive –almost surely.

5.1 Unwhitened mapping in finite dimensions

In this subsection, we derive the unwhitened mapping for RTO in a finite-dimensional setting. In Figure LABEL:fig:quadrv, we specify the relationship between four finite-dimensional random variables: , , , and , where is a newly defined random variable distributed according to the prior. Note that here we use and to denote random variables distributed according to the unwhitened and whitened proposal distributions, respectively, rather than the corresponding posterior distributions. Recall that and were defined in Section LABEL:sec:target.

Figure 3: Relationship between four random variables.

The RTO mapping (LABEL:eq:scale) in the unwhitened coordinates takes the form

Defining the oblique projectors

and multiplying both sides of the above equation by , we can express the unwhitened RTO mapping in the following split form:

(15)

where

(16)

is an affine transformation of the nonlinear forward model .

Define the matrix and let be its columns. These columns are orthonormal with respect to the –inner product (i.e., ). (Throughout, we use the notation for any and symmetric positive definite .) Let us define a second matrix such that the columns of form a complete orthonormal basis for with respect to the –inner product. Then, is the –orthogonal projector that projects any vector onto the first basis vectors (the columns of ). Now, let denote the map that takes a vector and returns its coefficients in the basis. Specifically, we define this map via inner products as:

By this definition, is a vector of size . Let be defined in a similar manner.

The mapping (LABEL:eq:oblique) can thus be rewritten in terms of these coefficients:

(17)

where the function can be correspondingly expressed as

Note that the two equations (LABEL:eq:oblique) and (LABEL:eq:coef) are equivalent. They both describe RTO’s mapping in finite dimensions: the first in terms of the vectors in the range and null space of , and the second in terms of coefficients in our chosen basis (the columns of and ).

5.2 RTO mapping in function space

In this section, we define an RTO mapping in function space. Until now, we have worked with finite-dimensional random vectors and . Here and in Section LABEL:sec:inf_rnd, we will use the same symbols and to denote random variables taking values in a separable Hilbert space . Let be distributed according to the prior, which we assume to be a Gaussian measure on with trace-class covariance operator and mean function . (In general, we will use the same notation in the function space setting as we do in the finite-dimensional setting, except that we will now use to denote the square root of the prior covariance operator. This overloading of notation is intended to preserve interpretability.)

Recall that the Cameron–Martin space associated with , , is equipped with the inner product

for any . Suppose that we are given a finite set of orthonormal functions , and also suppose that each . Let be defined as . Then, by construction, are orthonormal with respect to the –inner product. In addition, since , we have that . We can extend the definition of the –inner product, which is typically used only for elements of the Cameron–Martin space, to inputs of type . Namely, for any and , we write

where the inner product on the right is defined in the usual manner (i.e., between two elements in ). In other words, we are treating as the dual of with respect to the –inner product. This allows us to compute –inner products of with any function in .

Let us define the linear subspace to be

where denotes the closure of the span. Analogous to the finite-dimensional case, we define the operators and as

for any function and any vector of coefficients . Let the projection operator be defined as

for any and let the subspace be defined as

Then, one can show that maps and maps . Any function can be uniquely split into two parts using the projection .
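A discrete stand-in for this construction can be sketched as follows (a hypothetical finite-dimensional example: the prior covariance is a matrix C, the L2-orthonormal functions are the columns of Psi, and the transformed basis vectors are the columns of Phi = C^{1/2} Psi). The numerical checks confirm that the resulting operator is a projection, is self-adjoint in the C^{-1}-weighted inner product, and splits a prior draw into two components that are orthogonal in that inner product.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(5)
n, r = 50, 4

# Discrete stand-in for the prior covariance operator (an exponential kernel).
x = np.linspace(0, 1, n)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
C_half = np.real(sqrtm(C))
Cinv = np.linalg.inv(C)

# L2-orthonormal psi_i, and phi_i = C^{1/2} psi_i, orthonormal in the C^{-1} inner product.
Psi, _ = np.linalg.qr(rng.standard_normal((n, r)))
Phi = C_half @ Psi

# Projection onto span{phi_i}, orthogonal with respect to <u, w> = u^T C^{-1} w.
P = Phi @ Phi.T @ Cinv

print(np.linalg.norm(P @ P - P))                          # P is a projection
print(np.linalg.norm(Cinv @ P - (Cinv @ P).T))            # self-adjoint in the C^{-1} inner product
u = C_half @ rng.standard_normal(n)                       # a draw from N(0, C)
print(abs((P @ u) @ Cinv @ ((np.eye(n) - P) @ u)))        # the two components are C^{-1}-orthogonal
```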

Now, the infinite-dimensional analogue to RTO’s unwhitened mapping (LABEL:eq:oblique) and (LABEL:eq:coef) between the random variables and can be expressed as:

(18)

where

Remark 4.

Above we assumed that are in or, in other words, that are in . This is true when we obtain from the linearization of the forward model evaluated at the MAP. Note that

The operator has rank with