Scalable optimizationbased sampling on function space
Abstract
Optimizationbased samplers provide an efficient and parallellizable approach to solving largescale Bayesian inverse problems. These methods solve randomly perturbed optimization problems to draw samples from an approximate posterior distribution. “Correcting” these samples, either by Metropolization or importance sampling, enables characterization of the original posterior distribution. This paper presents a new geometric interpretation of the randomizethenoptimize (RTO) method [1] and a unified transportmap interpretation of RTO and other optimizationbased samplers, i.e., implicit sampling [19] and randomizedmaximumlikelihood [20]. We then introduce a new subspace acceleration strategy that makes the computational complexity of RTO scale linearly with the parameter dimension. This subspace perspective suggests a natural extension of RTO to a function space setting. We thus formalize a functionspace version of RTO and establish sufficient conditions for it to produce a valid Metropolis–Hastings proposal, yielding dimensionindependent sampling performance. Numerical examples corroborate the dimensionindependence of RTO and demonstrate sampling performance that is also robust to small observational noise.
Key words. Markov chain Monte Carlo, Metropolis independence sampling, Bayesian inference, infinitedimensional inverse problems
AMS subject classifications. 15A29, 65C05, 65C60
1 Introduction
The Bayesian statistical approach is widely used for uncertainty quantification in inverse problems—i.e., inferring parameters of mathematical models given indirect, limited, and/or noisy data [15, 27]. In a Bayesian setting, the parameters are described as random variables and endowed with prior distributions. Conditioning on a realized value of the data yields the posterior distribution of these parameters, which characterizes uncertainty in possible parameter values. Solving the inverse problem amounts to computing posterior expectations, e.g., posterior means, variances, marginals, or other quantities of interest.
Sampling methods—in particular, Markov chain Monte Carlo (MCMC) algorithms—provide an extremely flexible way of estimating posterior expectations [4]. The design of effective MCMC methods, however, rests on the careful construction of proposal distributions: efficiency demands proposal distributions that reflect the geometry of the posterior [12], e.g., anisotropy, strong correlations, and even nonGaussianity [21]. Another significant challenge in applying MCMC is parameter dimensionality. In many inverse problems governed by partial differential equations, the “parameter” is in fact a function of space and/or time that, for computational purposes, must be represented in a discretized form. Discretizations that sufficiently resolve the spatial or temporal heterogeneity of this function are often high dimensional. Yet, as analyzed in [17, 18, 24, 25], the performance of many common MCMC algorithms may degrade as the dimension of the discretized parameter increases, meaning that more MCMC iterations are required to obtain an effectively independent sample. One can design MCMC algorithms that do not degrade in this manner by formulating them in function space and ensuring that the proposal distribution satisfies a certain absolute continuity condition [8, 27]. These samplers are called dimension independent [8, 9]. Yet another core challenge is that MCMC algorithms are, in general, intrinsically serial: sampling amounts to simulating a discretetime Markov process. The literature has seen many attempts at parallelizing the evaluation of proposed points [5] or sharing information across multiple chains [7, 14], but none of these is embarrassingly parallel.
A promising approach to many of these challenges is to convert optimization methods into samplers (i.e., Monte Carlo methods). This idea has been proposed in many forms: key examples include randomizethenoptimize (RTO) [1], Metropolized randomizedmaximumlikelihood (RML) [20, 28], and implicit sampling [6, 19]. In its most basic form, RTO requires Bayesian inverse problems with Gaussian priors and noise models, although it can extend to problems with nonGaussian priors via a change of variables [29]. Metropolized RML has problem requirements similar to those of RTO, but requires evaluating secondorder derivatives of the forward model. Implicit sampling applies to target densities whose contours enclose starconvex regions, in that any ray starting from the mode (or some other nominal point) intersects each contour only once; each proposal sample can then be generated cheaply by solving a line search.
In general, each of these algorithms solves randomlyperturbed realizations of an optimization problem to generate samples from a probability distribution that is “close to” the posterior. The probability density function of this distribution is computable, and thus the distribution can be used as an independent proposal within MCMC or as a biasing distribution in importance sampling. For nonGaussian targets, these proposal distributions are nonGaussian. In general, they are adapted to the target distribution. The computational complexity and dimensionscalability of the resulting sampler can be linked to the structure of the corresponding optimization problem. In addition, these sampling methods are embarrassingly parallel, and are easily implemented with existing optimization tools developed for solving deterministic inverse problems.
This paper considers optimizationbased sampling in high dimensions. We focus on the analysis and scalable implementation of the RTO method. To begin, we present a new geometric interpretation of RTO that provides intuition for the method and its regime of applicability (Section LABEL:sec:geo_rto). We also interpret RTO and other optimizationbased samplers as ways of realizing the actions of particular transport maps [16] (Section LABEL:sec:map_rto). Using these two interpretations, we next motivate and construct a subspaceaccelerated version of RTO whose computational complexity scales linearly with parameter dimension (Section LABEL:sec:scalable). This approach significantly accelerates RTO in highdimensional settings. Subspace acceleration reveals that RTO’s mapping acts differently on different subspaces of the parameter space. Exploiting the separation of the parameter subspaces allows us to cast the transport map generated by RTO in an infinitedimensional (i.e., function space) setting [27] (Section LABEL:sec:proof). As our main theoretical result, we establish sufficient conditions for the probability distribution induced by RTO’s mapping to be absolutely continuous with respect to the posterior. This result justifies RTO’s observed dimensionindependent sampling behavior: the acceptance rate and autocorrelation time of an MCMC chain using RTO as its proposal do not degrade as the parameter dimension increases. Similarly, the performance of importance sampling using RTO as a biasing distribution will stabilize in high dimensions. This result is analogous to the arguments in [3, 27, 9, 2, 26] showing that (generalized) preconditioned Crank–Nicolson (pCN), dimensionindependent likelihoodinformed (DILI) MCMC, and other infinitedimensional geometric MCMC methods are dimensionindependent. However, our MCMC construction relies on nonGaussian proposals in a Metropolis independence setting, where the Markov chain can be run at essentially zero cost after the computationally costly step of drawing proposal samples and evaluating the proposal density. Because the latter step is embarrassingly parallel, the overall MCMC scheme is immediately parallelizable, unlike the abovementioned MCMC samplers that rely on the iterative construction of Markov chains.
In Section LABEL:sec:numerics, we provide a numerical illustration of our algorithms, exploring the factors that influence RTO’s sampling efficiency. We observe that neither the parameter dimension nor the magnitude of the observational noise influence RTO’s performance per MCMC step, though they both impact the computational cost of each step. Despite its more costly steps, RTO outperforms simple pCN in this example. Overall, our results show that RTO can tackle inverse problems with large parameter dimensions and arbitrarily small observational noise.
2 RTO and its geometric interpretation
RTO generates samples from an approximation to the target distribution in two steps. First, it repeatedly solves perturbed optimization problems to generate independent proposal samples. Second, it uses this collection of samples to describe an independent proposal within Metropolis–Hastings (MH) or a biasing distribution in selfnormalized importance sampling. In this section, we will describe the target distributions to which RTO can be applied and detail the two steps above. We will also present a geometric interpretation in which the RTO proposal can be viewed as a projection. This interpretation helps understand the sampling efficiency of RTO and the technical requirements for the RTO procedure to be valid.
2.1 Target distribution
As formulated in [1], RTO applies to target distributions on whose densities can be written as
(1) 
where is a vectorvalued function of the parameters with an output dimension of , for any . This structure is found in Bayesian inverse problems and other similar problems with parameters, observations, a Gaussian prior, and additive Gaussian observational noise. To illustrate, let
where is the data, is the forward model, is the unknown parameter, and is the additive noise, assumed independent of . Here, is the prior mean, and and are the covariance matrices of the observational noise and prior. We can simplify the problem via an affine change of variables that transforms the covariance matrices to identity matrices. Let us define the matrix square roots,
and new whitened variables,
where is the whitened unknown parameter, is the whitened forward model, and is the whitened observational noise. The inverse problem becomes
The data is shifted to the origin after whitening. The posterior density of the whitened variable is then
which is in the required form (LABEL:eq:lsform) with defined as
(2) 
Given a sample from the target density , we can obtain a posterior sample of the original parameter by applying the transformation
Notice that the form of in (LABEL:eq:lsform) is identical to the probability density function of an –dimensional standard normal distribution, , evaluated at . This paints the following geometric picture of the required target distribution: the target density , up to a normalizing constant, is the same as the density of the –dimensional Gaussian distribution evaluated on the –dimensional manifold parameterized by .
2.2 The RTO algorithm
The RTO algorithm requires an orthonormal basis for an dimensional subspace of . Let this basis be collected in a matrix ; in principle, can be any matrix with orthonormal columns. One common choice of follows from first finding a linearization point, which is often (but not necessarily) taken to be the mode of the target distribution, i.e.,
(3) 
Then one can compute from a thin QR factorization of ; this sets the basis to span the range of .
Using this matrix, we obtain proposal samples by repeatedly drawing independent –dimensional standard normal vectors and solving the nonlinear system of equations
(4) 
which is equivalent to solving the optimization problem
(5) 
if the minimum of (LABEL:eq:optprob) is zero. To ensure that the system of equations (LABEL:eq:rtoequation) has a unique solution and that the probability density of the resulting samples can be calculated explicitly, we require additional conditions as follows.
Assumption 1 (Sufficient conditions for valid RTO [1]).

The function is continuously differentiable with Jacobian .

The Jacobian has full column rank for every .

The map is invertible.
Under these conditions, the RTO samples are distributed according to the probability density
(6) 
Most of these assumptions follow from the inverse problem setup; for example, it is natural for the forward model to be continuous and the prior to be a proper distribution. The final assumption, however, is more delicate and may not always hold. We will use the geometric interpretation of RTO to elaborate on this assumption.
Proposal samples generated via RTO can be interpreted as a projection of –dimensional Gaussian samples onto the –dimensional manifold . The samples are projected along the directions orthogonal to the range of . Figure LABEL:fig:proj depicts the steps of RTO’s proposal for the case . This geometric interpretation also illustrates conditions for the RTO procedure to be valid: there should exist only one point that is both on the manifold and along the projection directions of . Under these conditions, the nonlinear system of equations (LABEL:eq:rtoequation) has a unique solution. If a projection of in the direction onto does not exist, or if there exists more than one intersection point, then the RTO procedure breaks down.
As shown in [1], RTO’s proposal is exact (i.e., is the target) when the forward model is linear, and its proposal is expected to be close to the target when the forward model is close to linear. For weakly nonlinear problems, the proposal can be a good approximation to the posterior and hence can be used in MCMC and importance sampling.
These proposal samples are used either as an independent proposal in Metropolis–Hastings (MH) or as a biasing distribution in selfnormalized importance sampling. For the former case, the Metropolis–Hastings acceptance ratio can be written as
where the weight is defined as
(7) 
The resulting method, called RTO–MH, is summarized in Algorithm LABEL:alg:rto.
For importance sampling, since the normalizing constant of the target density is unknown, the weights must be normalized as
where is the number of samples and the sum of weights is thus one. The proposal samples and weights can then be used to compute posterior expectations of some quantity of interest using the selfnormalizing importance sampling formula:
3 Optimizationbased samplers as transport maps
One can also interpret RTO as a means of pushing forward a Gaussian reference distribution to some approximation of the target distribution. In other words, RTO realizes the action of a particular transport map [21]. This interpretation also extends to other optimizationbased samplers.
Recall that RTO solves the following nonlinear system of equations
Since is a standard Gaussian and the columns of are orthonormal, the projection of , denoted by , is also a standard Gaussian. This way, the nonlinear system of equations can be expressed as
(8) 
This equation describes a deterministic coupling between the target random variable and the standard Gaussian “reference” random variable . The coupling is defined by the forward model, the data, the observational noise, and the prior, through the function and the matrix .
The left hand side of (LABEL:eq:RTOmap), which we write compactly as
defines a transport map from to . By solving the nonlinear system (LABEL:eq:RTOmap) for a given to obtain a proposal , the RTO procedure implicitly inverts the transport map ; that is, it evaluates on each , . The inverse map pushes forward the reference distribution to an approximation of . In particular, the normalized probability density of generated by RTO is given by the standard formula for pushforward densities:
This transport map interpretation explains why Assumption LABEL:assumoverall.LABEL:assum3 in the previous section is necessary for the validity of the RTO procedure. The transport map interpretation and the geometric interpretation both describe taking a Gaussian random variable and transforming it through a projection to obtain the proposal random variable .
Other optimizationbased sampling algorithms such as the randommap implementation of implicit sampling [19] and Metropolized RML [20] similarly yield deterministic couplings of two random variables. A summary of each algorithm’s mapping is given in Table LABEL:tab:maps. Each algorithm describes a different map and uses computation (i.e., solves an optimization problem) to evaluate . Detailed derivations of these mappings are provided in Appendix LABEL:sec:app_imp.
Algorithm 
Target distribution 
Transport map 
RTO  
Implicit sampling 


Metropolized RML 


All three algorithms use a standard normal as the reference distribution and the pushforward of this reference through as a proposal distribution. The proposal distribution is then used as an independent proposal in Metropolis–Hastings or as a biasing distribution in importance sampling. By specifying different relationships between and —that is, different choices of the mapping —we can derive all of these optimizationbased samplers in a unified framework.
4 Scalable implementation of RTO
Solving the optimization problem (LABEL:eq:optprob) and computing the RTO probability density (LABEL:eq:RTOproposal) can be computationally costly when the parameter vector is highdimensional. For example, the matrix–vector product with in each evaluation of (LABEL:eq:optprob) costs floating point operations, and computing the determinant in the proposal density (LABEL:eq:RTOproposal) scales as , where is the number of parameters. In this section, we introduce a subspace acceleration strategy to make these operations of RTO scale linearly with the parameter dimension.
This scalable implementation avoids computing and storing the QR factorization of the fullrank matrix . Instead, it opts to construct (and store) a singular value decomposition (SVD) of the smaller linearized forward model . This way, the computational complexity of the evaluation of the determinant in (LABEL:eq:RTOproposal) and the matrix–vector product with in (LABEL:eq:optprob) can scale linearly with the parameter dimension .
To begin, we note from the definition (LABEL:eq:H) of that
Recall from RTO’s mapping (LABEL:eq:RTOmap) that the RTO proposal samples are found by
where the columns of form an orthonormal basis for the range of and is computed from the thin QR decomposition of . To avoid computing with the dense matrix , we use instead
which can be shown to have orthonormal columns and the same range as . If the SVD of , which we assume has rank , is given by , then
where and are the identity matrices of size and , respectively. Using the last identity, we obtain, after some algebraic manipulation,
which we use to obtain
(9) 
Thus the nonlinear system solved in Step 6 of Algorithm LABEL:alg:rto can be rewritten as
(10) 
Equation (LABEL:eq:scale) separates into two parts: one in the column space of and another in its orthogonal complement. Suppose we have a basis for the complement space ^{1}^{1}1We do not actually compute . It is used to represent the complement space mathematically. such that form a complete orthonormal basis for . We can solve the nonlinear system of equations (LABEL:eq:scale) by defining
setting the components in the orthogonal complement of to
(11) 
and then solving the –dimensional optimization problem
(12) 
Equations (LABEL:eq:svd1) and (LABEL:eq:svd2) replace the –dimensional optimization problem in (LABEL:eq:optprob). In addition, at Step 7 of Algorithm LABEL:alg:rto, we need to calculate the expression in (LABEL:eq:w(theta)). First, note that (LABEL:eq:svdet1) implies
Hence, the determinant term is given by
(13) 
where in line above we use Sylvester’s determinant identity, and in line we use the fact that . This formulation replaces the determinant calculation with one of size .
To summarize, we presented an implementation of RTO that avoids the QR factorization of and opts to store the SVD of the smaller matrix . As a result, we reduce the size of the optimization problem and reduce the size of the matrix determinant calculation for each proposal. The new scalable implementation reduces the computational complexity of taking a step in MCMC from to . This is advantageous when the size of the parameter vector is much larger than the number of observations. This implementation is outlined in Algorithm LABEL:alg:scale.



Remark 2.
We can truncate the rank of the SVD in cases where data is abundant, i.e., when is a large vector. This will change RTO’s map and the resulting proposal distribution. Figure LABEL:fig:trunc_nonlin shows the effect on RTO’s proposal of truncating the SVD, in a toy example with a nonlinear forward model and a standard normal prior. Truncation restricts the role of the data misfit term in the construction of the proposal distribution. As the rank decreases, the proposal distribution becomes broader. In the extreme case, when the is truncated to zero, RTO’s proposal reverts to the prior. Note, however, the nonGaussianity of the proposal for in this nonlinear example.
5 Dimension independence of RTO
In numerical experiments (see Section LABEL:sec:numerics and [29]), we observe that the acceptance rate and the effective sample size (for a fixed number of MCMC iterations) of RTO is independent of the parameter dimension. In this section, we will justify this behavior by showing that RTO yields a valid MCMC proposal in the function space setting [27].
Suppose the parameter space is a separable Hilbert space and endowed with a Gaussian prior measure such that . We denote the probability measure induced by the RTO mapping as . The following theorem establishes conditions on such that the MH algorithm using RTO as its independence proposal will be well defined on .
Theorem 3.
Suppose the target measure is equivalent to the prior measure , i.e., . If the Radon–Nikodym derivative of the target measure with respect to the RTO measure,
(14) 
is positive almost surely with respect to , then the acceptance probability of the MH algorithm using RTO as its independence proposal is positive almost surely with respect to .
Proof.
The proof can be viewed as a special case of Theorem 5.1 in [27]. Let denote the probability measure of the MH proposal starting from the point . We can define the following pair of measures on :
The acceptance probability of MH is then given by the Radon–Nikodym derivative
Since RTO is an independent proposal, the resulting MH proposal measure becomes
and the pair of transition measures become
This way, the acceptance probability can be expressed as
If and only if is positive –almost surely, the acceptance probability is positive –almost surely.
If the acceptance probability is positive almost surely with respect to , the posterior is invariant for the Markov chain of the resulting MH algorithm. Hence, as shown in [27], the algorithm is wellposed in function space and has sampling performance independent of the parameter discretization dimension.
In the rest of this section, we will focus on establishing sufficient conditions for the Radon–Nikodym derivative to be positive –almost surely. We outline the broad steps of the derivation as follows. In Section LABEL:sec:unwhite_map, we derive the unwhitened mapping for RTO in a finitedimensional (discrete) setting. Because the unwhitened prior corresponds to a discretized Gaussian process, we can naturally generalize RTO to an infinitedimensional Hilbert space setting in Section LABEL:sec:inf_setup. Motivated by Section LABEL:sec:scalable, we introduce a particular rank– projection operator and use it to define RTO’s inverse map. In Section LABEL:sec:inf_rnd, we derive RTO’s proposal measure as the pullback of the prior measure through this inverse map. This construction allows us to establish conditions ensuring that is positive –almost surely.
5.1 Unwhitened mapping in finite dimensions
In this subsection, we derive the unwhitened mapping for RTO in a finitedimensional setting. In Figure LABEL:fig:quadrv, we specify the relationship between four finitedimensional random variables: , , , and , where is a newly defined random variable distributed according to the prior. Note that here we use and to denote random variables distributed according to the unwhitened and whitened proposal distributions, respectively, rather than the corresponding posterior distributions. Recall that and were defined in Section LABEL:sec:target.
The RTO mapping (LABEL:eq:scale) in the unwhitened coordinates takes the form
Defining the oblique projectors
and multiplying both sides of the above equation by , we can express the unwhitened RTO mapping in the following split form:
(15) 
where
(16) 
is an affine transformation of the nonlinear forward model .
Define the matrix and let be its columns. These columns are orthonormal with respect to the –inner product (i.e., ).^{2}^{2}2We use the notation for any and symmetric positive definite . Let us define a second matrix such that the columns of form a complete orthonormal basis for with respect to the –inner product. Then, is the –orthogonal projector that projects any vector onto the first basis vectors (the columns of ). Now, let denote the map that takes a vector and returns its coefficients in the basis. Specifically, we define this map via inner products as:
By this definition, is a vector of size . Let be defined in a similar manner.
The mapping (LABEL:eq:oblique) can thus be rewritten in terms of these coefficients:
(17) 
where the function can be correspondingly expressed as
Note that the two equations (LABEL:eq:oblique) and (LABEL:eq:coef) are equivalent. They both describe RTO’s mapping in finite dimensions: the first in terms of the vectors in the range and null space of , and the second in terms of coefficients in our chosen basis (the columns of and ).
5.2 RTO mapping in function space
In this section, we define an RTO mapping in function space. Until now we worked with finite dimensional random vectors and . Here and in Section LABEL:sec:inf_rnd, we will use the same symbols and to denote random variables taking values in a separable Hilbert space . Let be distributed according to the prior, which we assume to be a Gaussian measure on with traceclass covariance operator and mean function . (In general, we will use the same notation in the function space setting as we do in the finite dimensional setting, except that we will now use to denote the square root of the prior covariance operator. This overloading of notation is intended to preserve interpretability.)
Recall that the Cameron–Martin space associated with , , is equipped with the inner product
for any . Suppose that we are given a finite set of orthonormal functions , and also suppose that each . Let be defined as Then, by construction, are orthonormal with respect to the –inner product. In addition, since , we have that . We can overwrite the definition of the –inner product, typically only used for elements in the Cameron–Martin space, for inputs of type . Namely, for any and , we write
where the inner product on the right is defined in the usual manner (i.e., between two elements in ). In other words, we are treating as the dual of with respect to the –inner product. This allows us to compute –inner products of with any function in .
Let us define the linear subspace to be
where denotes the closure of the span. Analogous to the finitedimensional case, we define the operators and as
for any function and any vector of coefficients . Let the projection operator be defined as
for any and let the subspace be defined as
Then, one can show that maps and maps . Any function can be uniquely split into two parts using the projection .
Now, the infinitedimensional analogue to RTO’s unwhitened mapping (LABEL:eq:oblique) and (LABEL:eq:coef) between the random variables and can be expressed as:
(18) 
where
Remark 4.
Above we assumed that are in or, in other words, that are in . This is true when we obtain from the linearization of the forward model evaluated at the MAP. Note that
The operator has rank with