
Preprint of final article which will appear as Stochastic Processes and their Applications Volume 123, Issue 10, October 2013, Pages 3828-3860
Posterior Contraction Rates for the Bayesian Approach to Linear Ill-Posed Inverse Problems

Abstract

We consider a Bayesian nonparametric approach to a family of linear inverse problems in a separable Hilbert space setting with Gaussian noise. We assume Gaussian priors, which are conjugate to the model, and present a method of identifying the posterior using its precision operator. Working with the unbounded precision operator enables us to use partial differential equations (PDE) methodology to obtain rates of contraction of the posterior distribution to a Dirac measure centered on the true solution. Our methods assume a relatively weak relation between the prior covariance, noise covariance and forward operator, allowing for a wide range of applications.

Sergios Agapiou$^{a,\ast}$, Stig Larsson$^{b}$, Andrew M. Stuart$^{a}$

$\ast$ Corresponding author.


$^{a}$Mathematics Institute, University of Warwick, Coventry CV4 7AL, United Kingdom

$^{b}$Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, SE-412 96 Gothenburg, Sweden


 

Keywords: posterior consistency, posterior contraction, Gaussian prior, posterior distribution, inverse problems

2000 MSC: 62G20, 62C10, 35R30, 45Q05

 

1 Introduction

The solution of inverse problems provides a rich source of applications of the Bayesian nonparametric methodology. It encompasses a broad range of applications in partial differential equations (PDEs) Banks and Kunisch (1989), where there is a well-developed theory of classical, non-statistical regularization Engl et al. (1996). On the other hand, the area of nonparametric Bayesian statistical estimation, and in particular the problem of posterior consistency, has attracted a lot of interest in recent years; see for instance Ghosal et al. (2000); Shen and Wasserman (2001); Rousseau (2010); van der Vaart and van Zanten (2008); van der Vaart and van Zanten (2007); Giné and Nickl (2011); Diaconis and Freedman (1986). Despite this, the formulation of many of these PDE inverse problems using the Bayesian approach is in its infancy Stuart (2010). Furthermore, the development of a theory of Bayesian posterior consistency, analogous to the theory for classical regularization, is under-developed, with the primary contribution being the recent paper Knapik et al. (2011). This recent paper provides a roadmap for what is to be expected regarding Bayesian posterior consistency, but is limited in terms of applicability by the assumption of simultaneous diagonalizability of the three linear operators required to define Bayesian inversion. Our aim in this paper is to make a significant step in the theory of Bayesian posterior consistency for linear inverse problems by developing a methodology which sidesteps the need for simultaneous diagonalizability. The central idea underlying the analysis is to work with precision operators rather than covariance operators, and thereby to enable use of powerful tools from PDE theory to facilitate the analysis.

Let $\mathcal{H}$ be a separable Hilbert space, with norm $\|\cdot\|$ and inner product $\langle\cdot,\cdot\rangle$, and let $A$ be a known self-adjoint and positive-definite linear operator with bounded inverse. We consider the inverse problem to find $u$ from $y$, where $y$ is a noisy observation of $A^{-1}u$. We assume the model

$y = A^{-1}u + \frac{1}{\sqrt{n}}\,\xi, \qquad (1.1)$

where $\xi$ is an additive noise. We will be particularly interested in the small-noise limit where $n \to \infty$.

A popular method in the deterministic approach to inverse problems is the generalized Tikhonov-Phillips regularization method, in which $u$ is approximated by the minimizer of a regularized least-squares functional: define the Tikhonov-Phillips functional

$J_0(u) := \frac{1}{2}\big\|\mathcal{C}_1^{-\frac{1}{2}}(y - A^{-1}u)\big\|^2 + \frac{\lambda}{2}\big\|\mathcal{C}_0^{-\frac{1}{2}}u\big\|^2, \qquad (1.2)$

where $\mathcal{C}_0, \mathcal{C}_1$ are bounded, possibly compact, self-adjoint positive-definite linear operators. The parameter $\lambda$ is called the regularization parameter, and in the classical non-probabilistic approach the general practice is to choose it as an appropriate function of the noise size $n^{-\frac{1}{2}}$, which shrinks to zero as $n \to \infty$, in order to recover the unknown parameter $u$ Engl et al. (1996).

In this paper we adopt a Bayesian approach for the solution of problem (1.1), which will be linked to the minimization of $J_0$ via the posterior mean. We assume that the prior distribution is Gaussian, $u \sim \mu_0 := \mathcal{N}(0, \tau^2\mathcal{C}_0)$, where $\tau > 0$ and $\mathcal{C}_0$ is a self-adjoint, positive-definite, trace-class linear operator on $\mathcal{H}$. We also assume that the noise is Gaussian, $\xi \sim \mathcal{N}(0, \mathcal{C}_1)$, where $\mathcal{C}_1$ is a self-adjoint, positive-definite, bounded, but not necessarily trace-class, linear operator; this allows us to include the case of white observational noise. We assume that the, generally unbounded, operators $\mathcal{C}_0^{-1}$ and $\mathcal{C}_1^{-1}$ have been maximally extended to self-adjoint positive-definite operators on appropriate domains. The unknown parameter $u$ and the noise $\xi$ are considered to be independent; thus the conditional distribution of the observation $y$ given the unknown parameter $u$ (termed the likelihood) is also Gaussian, with distribution $y\,|\,u \sim \mathcal{N}\big(A^{-1}u, \tfrac{1}{n}\mathcal{C}_1\big)$.

Define $\lambda := \frac{1}{n\tau^2}$ and let

$J(u) := \frac{n}{2}\big\|\mathcal{C}_1^{-\frac{1}{2}}(y - A^{-1}u)\big\|^2 + \frac{1}{2\tau^2}\big\|\mathcal{C}_0^{-\frac{1}{2}}u\big\|^2. \qquad (1.3)$

In finite dimensions the probability density of the posterior distribution, that is, the distribution of the unknown given the observation, with respect to the Lebesgue measure is proportional to $\exp(-J(u))$. This suggests that, in the infinite-dimensional setting, the posterior is Gaussian $\mu^y = \mathcal{N}(m, \mathcal{C})$, where we can identify the posterior covariance $\mathcal{C}$ and mean $m$ by the equations

$\mathcal{C}^{-1} = nA^{-1}\mathcal{C}_1^{-1}A^{-1} + \frac{1}{\tau^2}\,\mathcal{C}_0^{-1} \qquad (1.4)$

and

$\mathcal{C}^{-1}m = nA^{-1}\mathcal{C}_1^{-1}y, \qquad (1.5)$

obtained by completing the square. We present a method of justifying these expressions in Section 5. We define

$B_\lambda := A^{-1}\mathcal{C}_1^{-1}A^{-1} + \lambda\,\mathcal{C}_0^{-1} \qquad (1.6)$

and observe that the dependence of $B_\lambda$ on $n$ and $\tau$ is only through $\lambda$. Since

$B_\lambda m = A^{-1}\mathcal{C}_1^{-1}y, \qquad (1.7)$

the posterior mean also depends only on $\lambda$: $m = m_\lambda(y)$. This is not the case for the posterior covariance, since it depends on $n$ and $\lambda$ separately: $\mathcal{C} = \frac{1}{n}\,B_\lambda^{-1}$. In the following, we suppress the dependence of the posterior covariance on $n$ and $\lambda$ and we denote it by $\mathcal{C}$.

Observe that the posterior mean is the minimizer of the functional $J$, hence also of $J_0$; that is, the posterior mean is the Tikhonov-Phillips regularized approximate solution of problem (1.1), for the functional $J_0$ with $\lambda = \frac{1}{n\tau^2}$.
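To make the completion of the square explicit, here is the finite-dimensional computation behind (1.4), (1.5) and the link to $J_0$ (a sketch; in infinite dimensions these manipulations are formal and are justified in Section 5). Expanding $J$ and using the self-adjointness of $A^{-1}$ and $\mathcal{C}_1^{-1}$,

$J(u) = \frac{1}{2}\Big\langle u, \Big(nA^{-1}\mathcal{C}_1^{-1}A^{-1} + \frac{1}{\tau^2}\,\mathcal{C}_0^{-1}\Big)u \Big\rangle - \big\langle nA^{-1}\mathcal{C}_1^{-1}y,\, u \big\rangle + \frac{n}{2}\big\|\mathcal{C}_1^{-\frac{1}{2}}y\big\|^2 = \frac{1}{2}\big\langle u - m,\, \mathcal{C}^{-1}(u - m) \big\rangle + c(y),$

with $\mathcal{C}^{-1}$ and $m$ as in (1.4) and (1.5), and $c(y)$ independent of $u$. Moreover, $J(u) = n\,J_0(u)$ when $\lambda = \frac{1}{n\tau^2}$, so the two functionals share the same minimizer.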

In Mandelbaum (1984) and Lehtinen et al. (1989), formulae for the posterior covariance and mean are identified in the infinite-dimensional setting, which avoid using any of the inverses of the prior, posterior or noise covariance operators. They obtain

$\mathcal{C} = \tau^2\mathcal{C}_0 - \tau^2\mathcal{C}_0A^{-1}\big(A^{-1}\mathcal{C}_0A^{-1} + \lambda\,\mathcal{C}_1\big)^{-1}A^{-1}\mathcal{C}_0 \qquad (1.8)$

and

$m = \mathcal{C}_0A^{-1}\big(A^{-1}\mathcal{C}_0A^{-1} + \lambda\,\mathcal{C}_1\big)^{-1}y, \qquad (1.9)$

which are consistent with formulae (1.4) and (1.7) for the finite-dimensional case. In Mandelbaum (1984) this is done only for $\mathcal{C}_1$ of trace class, while in Lehtinen et al. (1989) the case of white observational noise was included. We will work in an infinite-dimensional setting where the formulae (1.4), (1.7) for the posterior covariance and mean can be justified. Working with the unbounded operator $B_\lambda$ opens the possibility of using tools of analysis, and also numerical analysis, familiar from the theory of partial differential equations.
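The consistency of (1.8), (1.9) with (1.4), (1.5) can be checked by a (formal) application of the Sherman-Morrison-Woodbury identity; with $\lambda = \frac{1}{n\tau^2}$,

$\Big(nA^{-1}\mathcal{C}_1^{-1}A^{-1} + \frac{1}{\tau^2}\,\mathcal{C}_0^{-1}\Big)^{-1} = \tau^2\mathcal{C}_0 - \tau^2\mathcal{C}_0A^{-1}\big(A^{-1}\mathcal{C}_0A^{-1} + \lambda\,\mathcal{C}_1\big)^{-1}A^{-1}\mathcal{C}_0,$

an identity which is rigorous in finite dimensions and motivates the infinite-dimensional statements.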

In our analysis we always assume that $\lambda\,\mathcal{C}_0^{-1}$ is regularizing, that is, we assume that $\mathcal{C}_0^{-1}$ dominates $A^{-1}\mathcal{C}_1^{-1}A^{-1}$ in the sense that it induces stronger norms. This is a reasonable assumption since otherwise we would have $B_\lambda \simeq A^{-1}\mathcal{C}_1^{-1}A^{-1}$ (here $\simeq$ is used loosely to indicate two operators which induce equivalent norms; we will make this notion precise in due course). This would imply that the posterior mean is $m_\lambda \simeq Ay$, meaning that we attempt to invert the data by applying the, generally discontinuous, operator $A$ (Engl et al., 1996, Proposition 2.7).

We study the consistency of the posterior in the frequentist setting. To this end, we consider data $y$ which is a realization of

$y = A^{-1}u^\dagger + \frac{1}{\sqrt{n}}\,\xi, \qquad \xi \sim \mathcal{N}(0, \mathcal{C}_1), \qquad (1.10)$

where $u^\dagger$ is a fixed element of $\mathcal{H}$; that is, we consider observations which are perturbations of the image of a fixed true solution $u^\dagger$ by an additive noise $\xi$, scaled by $\frac{1}{\sqrt{n}}$. Since the posterior depends through its mean on the data and also through its covariance operator on the scaling of the noise and the prior, this choice of data model gives as posterior distribution the Gaussian measure $\mathcal{N}(m_\lambda^\dagger, \mathcal{C})$, where $\mathcal{C}$ is given by (1.4) and

$B_\lambda m_\lambda^\dagger = A^{-1}\mathcal{C}_1^{-1}\Big(A^{-1}u^\dagger + \frac{1}{\sqrt{n}}\,\xi\Big). \qquad (1.11)$

We study the behavior of the posterior as the noise disappears ($n \to \infty$). Our aim is to show that it contracts to a Dirac measure centered on the fixed true solution $u^\dagger$. In particular, we aim to determine $\epsilon_n \to 0$ such that

$\mathbb{E}\,\mu^{y}\big\{u : \|u - u^\dagger\| \geq M\epsilon_n\big\} \to 0, \quad \text{as } n \to \infty, \qquad (1.12)$

where the expectation is with respect to the random variable $y$ distributed according to the data likelihood $\mathcal{N}\big(A^{-1}u^\dagger, \frac{1}{n}\mathcal{C}_1\big)$.

As in the deterministic theory of inverse problems, in order to get convergence in the small-noise limit, we let the regularization disappear in a carefully chosen way; that is, we will choose $\lambda = \lambda(n)$ such that $\lambda \to 0$ as $n \to \infty$. The assumption that $\mathcal{C}_0^{-1}$ dominates $A^{-1}\mathcal{C}_1^{-1}A^{-1}$ shows that $B_\lambda$ is a singularly perturbed unbounded (usually differential) operator, with an inverse which blows up in the limit $\lambda \to 0$. This, together with equation (1.7), opens up the possibility of using the analysis of such singular limits to study posterior contraction: on the one hand, as $\lambda \to 0$, $B_\lambda^{-1}$ becomes unbounded; on the other hand, as $n \to \infty$, we have more accurate data, suggesting that for an appropriate choice of $\lambda = \lambda(n)$ we can get $m_\lambda^\dagger \to u^\dagger$. In particular, we will choose the prior scaling $\tau = \tau(n)$ as a function of the scaling of the noise, under the restriction that the induced choice of $\lambda = \frac{1}{n\tau^2}$ is such that $\lambda \to 0$ as $n \to \infty$. This choice will be made in a way which optimizes the rate of posterior contraction $\epsilon_n$, defined in (1.12). In general there are three possible asymptotic behaviors of the scaling $\tau$ of the prior as $n \to \infty$ van der Vaart and van Zanten (2007); Knapik et al. (2011), quantified further below:

  1. $\tau \to \infty$: we increase the prior spread, if we know that draws from the prior are more regular than $u^\dagger$;

  2. $\tau$ fixed: draws from the prior have the same regularity as $u^\dagger$;

  3. $\tau \to 0$ at a rate slower than $n^{-\frac{1}{2}}$: we shrink the prior spread, when we know that draws from the prior are less regular than $u^\dagger$.
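In terms of the induced regularization parameter, these three regimes are an elementary consequence of the relation $\lambda = \frac{1}{n\tau^2}$:

$\tau \to \infty \;\Longrightarrow\; \lambda \ll n^{-1}, \qquad \tau \text{ fixed} \;\Longrightarrow\; \lambda \asymp n^{-1}, \qquad \tau \to 0 \text{ slower than } n^{-\frac{1}{2}} \;\Longrightarrow\; n^{-1} \ll \lambda \to 0.$

In all three regimes the constraint $\lambda \to 0$ as $n \to \infty$ is preserved.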

The problem of posterior contraction in this context is also investigated in Knapik et al. (2011) and Florens and Simoni (2010). In Knapik et al. (2011), sharp convergence rates are obtained in the case where $\mathcal{C}_0$, $\mathcal{C}_1$ and $A^{-1}$ are simultaneously diagonalizable, with eigenvalues decaying algebraically, and in particular $\mathcal{C}_1 = I$, that is, the data are polluted by white noise. In this paper we relax the assumptions on the relations between the operators $\mathcal{C}_0$, $\mathcal{C}_1$ and $A^{-1}$, by assuming that appropriate powers of them induce comparable norms (see Section 3). In Florens and Simoni (2010), the non-diagonal case is also examined; the three operators involved are related through domain inclusion assumptions. The assumptions made in Florens and Simoni (2010) can be quite restrictive in practice; our assumptions include settings not covered in Florens and Simoni (2010), and in particular the case of white observational noise.

1.1 Outline of the rest of the paper

In the following section we present our main results, which concern the identification of the posterior (Theorem 2.1) and the posterior contraction (Theorems 2.2 and 2.3). In Section 3 we present our assumptions and their implications. The proofs of the main results are built up in a series of intermediate results contained in Sections 4-7. In Section 4, we reformulate equation (1.7) as a weak equation in an infinite-dimensional space. In Section 5, we present a new method of identifying the posterior distribution: we first characterize it through its Radon-Nikodym derivative with respect to the prior (Theorem 5.1) and then justify the formulae (1.4), (1.7) for the posterior covariance and mean (proof of Theorem 2.1). In Section 6, we present operator norm bounds for $B_\lambda^{-1}$ in terms of the singular parameter $\lambda$, which are the key to the posterior contraction results contained in Section 7 and their corollaries in Section 2 (Theorems 7.1, 7.2 and 2.2, 2.3). In Section 8, we present some nontrivial examples satisfying our assumptions and provide the corresponding rates of convergence. In Section 9, we compare our results to known minimax rates of convergence in the case where $\mathcal{C}_0$, $\mathcal{C}_1$ and $A^{-1}$ are all diagonalizable in the same eigenbasis and have eigenvalues that decay algebraically. Finally, Section 10 is a short conclusion.

The entire paper rests on a rich set of connections between the theory of stochastic processes and various aspects of the theory of linear partial differential equations. In particular, since the Green’s function of the precision operator of a Gaussian measure corresponds to its covariance function, our formulation and analysis of the inverse problem via precision operators is very natural. Furthermore, estimates on the inverse of singular limits of these precisions, which have direct implications for localization of the Green’s functions, play a key role in the analysis of posterior consistency.

2 Main Results

In this section we present our main results. We postpone the rigorous presentation of our assumptions to the next section; the proofs and technical lemmas are presented, together with intermediate results of independent interest, in Sections 4-7. Recall that we assume a Gaussian prior $\mu_0 = \mathcal{N}(0, \tau^2\mathcal{C}_0)$ and a Gaussian noise distribution $\mathcal{N}(0, \mathcal{C}_1)$. Our first assumption concerns the decay of the eigenvalues of the prior covariance operator $\mathcal{C}_0$ and enables us to quantify the regularity of draws from the prior. This is encoded in the parameter $\alpha_0$; smaller $\alpha_0$ implies more regular draws from the prior. We also assume that $\mathcal{C}_1$ and $A^{-1}$ are linked to appropriate powers of $\mathcal{C}_0$ in the sense of equivalence of the induced norms, where $\simeq$ is used in the manner outlined in Section 1, and defined in detail in Section 3. Finally, we assume that the problem is sufficiently ill-posed with respect to the prior. This is quantified by a parameter which we assume to be sufficiently large; for a fixed prior, the larger this parameter is, the more ill-posed the problem.

2.1 Posterior Identification

Our first main theorem identifies the posterior measure as Gaussian and justifies formulae (1.4) and (1.7). This reformulation of the posterior in terms of the precision operator is key to our method of analysis of posterior consistency and opens the route to using methods from the study of partial differential equations (PDEs). These methods will also be useful for the development of numerical methods for the inverse problem.

Theorem 2.1.

Under the Assumptions 3.1, the posterior measure is Gaussian, $\mu^y = \mathcal{N}(m, \mathcal{C})$, where $\mathcal{C}$ is given by (1.4) and $m$ is a weak solution of (1.7). ∎

2.2 Posterior Contraction

We now present our results concerning frequentist posterior consistency of the Bayesian solution to the inverse problem. We assume to have data $y$ as in (1.10), and examine the behavior of the posterior $\mathcal{N}(m_\lambda^\dagger, \mathcal{C})$, where $m_\lambda^\dagger$ is given by (1.11), as the noise disappears ($n \to \infty$). The first convergence result concerns the convergence of the posterior mean $m_\lambda^\dagger$ to the true solution $u^\dagger$ in a range of weighted norms induced by powers of the prior covariance operator $\mathcal{C}_0$. The spaces $\mathcal{H}^t$ are rigorously defined in the following section. The second result provides rates of posterior contraction of the posterior measure to a Dirac centered on the true solution, as described in (1.12). In both results, we assume a priori known regularity of the true solution, $u^\dagger \in \mathcal{H}^\gamma$, and give the convergence rates as functions of $\gamma$.

Theorem 2.2.

Assume $u^\dagger \in \mathcal{H}^\gamma$, where $\gamma \geq 1$, and let $\tau = \tau(n)$ be chosen as indicated in each case below. Under the Assumptions 3.1, we have the following optimized rates of convergence, where $\epsilon > 0$ is arbitrarily small:

  1. if for

  2. if , for

  3. if and for

    If and then the method does not give convergence.

Theorem 2.3.

Assume $u^\dagger \in \mathcal{H}^\gamma$, where $\gamma \geq 1$. Under the Assumptions 3.1, we have the following optimized rates for the convergence in (1.12), where $\epsilon > 0$ is arbitrarily small:

  1. if for

  2. if for

To summarize, provided the problem is sufficiently ill-posed and the true solution is sufficiently regular, we get the convergence in (1.12) at the rates given above.

Figure 1: Exponents of the rates of contraction plotted against the regularity $\gamma$ of the true solution. In blue, the sharp convergence rates obtained in the diagonal case in Knapik et al. (2011); in green, the rates predicted by our method, which applies to the more general non-diagonal case.

Our rates of convergence agree, up to $\epsilon$ arbitrarily small, with the sharp convergence rates obtained in the diagonal case in Knapik et al. (2011) across a wide range of regularity assumptions on the true solution (Figure 1); yet, our rates cover a much more applicable range of non-simultaneously-diagonalizable problems. (The reason for the appearance of $\epsilon$ is that in the assumed non-diagonal setting we can only use information about the regularity of the noise as expressed in terms of the spaces $\mathcal{H}^t$ (cf. Lemma 3.5), rather than the explicit representation of the noise.)

The rates we obtain are not as strong as in the simultaneously diagonalizable case when the true solution is very regular; in particular, our rates saturate earlier as a function of increasing regularity, and we require a certain degree of regularity of the true solution in order to secure convergence. It is not known whether our results can be improved, but it would be interesting to try. Both discrepancies are attributed to the fact that our method relies on interpolating between rates in a strong and a weak norm of the error $m_\lambda^\dagger - u^\dagger$; on the one hand, the rate of the error in the weak norm saturates earlier, and on the other hand, the error in the strong norm requires additional regularity in order to converge (cf. Section 9).

3 The Setting

In this section we present the setting in which we formulate our results. First, we define the spaces in which we work; in particular, we define the Hilbert scale induced by the prior covariance operator $\mathcal{C}_0$. Then we define the probability measures relevant to our analysis. Furthermore, we state our main assumptions, which concern the decay of the eigenvalues of $\mathcal{C}_0$ and the connections between the operators $\mathcal{C}_0$, $\mathcal{C}_1$ and $A$, and present regularity results for draws from the prior, $\mu_0 = \mathcal{N}(0, \tau^2\mathcal{C}_0)$, and the noise distribution, $\mathcal{N}(0, \mathcal{C}_1)$. Finally, we briefly overview the way in which the Hilbert scale defined in terms of the prior covariance operator $\mathcal{C}_0$, which is natural for our analysis, links to scales of spaces defined independently of any prior model.

We start by defining the Hilbert scale which we will use in our analysis. Recall that $\mathcal{H}$ is an infinite-dimensional separable Hilbert space and $\mathcal{C}_0$ is a self-adjoint, positive-definite, trace-class linear operator. Since $\mathcal{C}_0$ is injective and self-adjoint we have that $\mathcal{H} = \overline{\mathcal{R}(\mathcal{C}_0)}$. This means that $\mathcal{C}_0^{-1}$ is a densely defined, unbounded, symmetric, positive-definite linear operator in $\mathcal{H}$. Hence it can be extended to a self-adjoint operator with domain $\mathcal{D}(\mathcal{C}_0^{-1})$; this is the Friedrichs extension Lax (2002). Thus, we can define the Hilbert scale $(\mathcal{H}^t)_{t\in\mathbb{R}}$, with $\mathcal{H}^t$ the completion of $\bigcap_{k\in\mathbb{N}}\mathcal{D}\big(\mathcal{C}_0^{-\frac{k}{2}}\big)$ with respect to the norm $\|\cdot\|_t$ Engl et al. (1996), where $\|u\|_t := \big\|\mathcal{C}_0^{-\frac{t}{2}}u\big\|$.
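For orientation, the scale admits an explicit spectral characterization in terms of the orthonormal eigenpairs $\{(\rho_k, \phi_k)\}$ of $\mathcal{C}_0$ introduced below (a standard fact about Hilbert scales, recorded here for the reader's convenience):

$\mathcal{H}^t = \Big\{ u = \sum_{k=1}^{\infty} u_k\phi_k : \|u\|_t^2 = \sum_{k=1}^{\infty} \rho_k^{-t}\,u_k^2 < \infty \Big\}, \qquad t \geq 0,$

with $\mathcal{H}^{-t}$ obtained by completion of $\mathcal{H}$ under $\|\cdot\|_{-t}$; in particular, $\mathcal{H}^0 = \mathcal{H}$ and $\mathcal{H}^1$ is the Cameron-Martin space of $\mathcal{N}(0, \mathcal{C}_0)$.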

The bounded linear operator $\mathcal{C}_1$ is assumed to be self-adjoint and positive-definite (but not necessarily trace class); thus $\mathcal{C}_1^{-1}$ can be extended in the same way to a self-adjoint operator with domain $\mathcal{D}(\mathcal{C}_1^{-1})$. Finally, recall that we assume that $A$ is a self-adjoint and positive-definite linear operator with bounded inverse, $A^{-1}$.

We assume that we have a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The expected value is denoted by $\mathbb{E}$, and $x \sim \mu$ means that the law of the random variable $x$ is the measure $\mu$.

Let $\mu_0 := \mathcal{N}(0, \tau^2\mathcal{C}_0)$ and $\mathbb{Q}_0 := \mathcal{N}(0, \frac{1}{n}\mathcal{C}_1)$ be the prior and noise distributions respectively. Furthermore, let $\nu$ denote the measure constructed by taking $u$ and $y - A^{-1}u$ as independent Gaussian random variables $\mathcal{N}(0, \tau^2\mathcal{C}_0)$ and $\mathcal{N}(0, \frac{1}{n}\mathcal{C}_1)$ respectively:

$\nu(du, dy) = \mathbb{Q}(dy\,|\,u)\,\mu_0(du),$

where $\mathbb{Q}(\cdot\,|\,u) := \mathcal{N}(A^{-1}u, \frac{1}{n}\mathcal{C}_1)$ is the data likelihood. We denote by $\nu_0$ the measure constructed by taking $u$ and $y$ as independent Gaussian random variables $\mathcal{N}(0, \tau^2\mathcal{C}_0)$ and $\mathcal{N}(0, \frac{1}{n}\mathcal{C}_1)$ respectively:

$\nu_0(du, dy) = \mu_0(du)\,\mathbb{Q}_0(dy).$

Let $\{(\rho_k, \phi_k)\}_{k\in\mathbb{N}}$ be orthonormal eigenpairs of $\mathcal{C}_0$ in $\mathcal{H}$. Thus, $\{\rho_k\}$ are the singular values and $\{\phi_k\}$ an orthonormal eigenbasis. Since $\mathcal{C}_0$ is trace class we have that $\sum_{k=1}^{\infty}\rho_k < \infty$. In fact we require a slightly stronger assumption; see Assumption 3.1(1) below.

3.1 Assumptions

We are now ready to present our assumptions. The first assumption enables us to quantify the regularity of draws from the prior, whereas the rest of the assumptions concern interrelations between the three operators $\mathcal{C}_0$, $\mathcal{C}_1$ and $A$; these assumptions reflect the idea that appropriate powers of $\mathcal{C}_0$, $\mathcal{C}_1$ and $A^{-1}$ induce comparable norms,

where $\simeq$ is used in the same manner as in Section 1. This is made precise by the inequalities presented in the following assumption, where the notation $a \simeq b$ means that there exist constants $c_1, c_2 > 0$ such that $c_1 b \leq a \leq c_2 b$.

Assumption 3.1.

Suppose there exist $\alpha_0 \in [0, 1)$, exponents linking $\mathcal{C}_1$ and $A^{-1}$ to powers of $\mathcal{C}_0$, and constants $c_i > 0$ such that

  1. $\mathcal{C}_0^{\kappa}$ is trace class for all $\kappa > \alpha_0$;

  2. , where

Notice that, by Assumption 3.1(2) in combination with Assumption 3.1(3), the norm induced by $\mathcal{C}_0^{-1}$ dominates the norm induced by $A^{-1}\mathcal{C}_1^{-1}A^{-1}$, capturing the idea that the regularization through $\lambda\,\mathcal{C}_0^{-1}$ is indeed a regularization. In fact the assumption connects the ill-posedness of the problem to the regularity of the prior. We exhibit this connection in the following example:

Example 3.2.

Assume that $\mathcal{C}_0$, $\mathcal{C}_1$ and $A^{-1}$ are simultaneously diagonalizable, with eigenvalues decaying algebraically, the decay of the eigenvalues of $\mathcal{C}_0$ being fast enough that $\mathcal{C}_0$ is trace class. Then Assumptions 3.1(1),(3)-(7) are trivially satisfied, with $\alpha_0$ determined by the decay exponent of the eigenvalues of $\mathcal{C}_0$. Assumption 3.1(2) is then equivalent to an inequality between the decay exponents. That is, for a certain degree of ill-posedness (encoded in the decay exponents of $A^{-1}$ and $\mathcal{C}_1$), we have a minimum requirement on the regularity of the prior (encoded in the decay exponent of $\mathcal{C}_0$). Put differently, for a certain prior, we require a minimum degree of ill-posedness.

We refer the reader to Section 8 for nontrivial examples satisfying Assumptions 3.1.

In the following, we exploit the regularity properties of white noise to determine the regularity of draws from the prior and the noise distributions, using Assumption 3.1(1). We consider white noise to be a draw from $\mathcal{N}(0, I)$, that is, a random variable $\zeta \sim \mathcal{N}(0, I)$. Even though the identity operator is not trace class in $\mathcal{H}$, it is trace class in a bigger space $\mathcal{H}^{-s}$, where $s$ is sufficiently large.

Lemma 3.3.

Under the Assumption 3.1(1) we have:

  1. Let $\zeta$ be a white noise. Then $\zeta \in \mathcal{H}^{-s}$ almost surely, for all $s > \alpha_0$.

  2. Let $u \sim \mu_0$. Then $u \in \mathcal{H}^{1-s}$, $\mu_0$-a.s. for every $s > \alpha_0$.

Proof.
  1. We have that $\mathbb{E}\|\zeta\|_{-s}^2 = \mathbb{E}\big\|\mathcal{C}_0^{\frac{s}{2}}\zeta\big\|^2 = \operatorname{Tr}\big(\mathcal{C}_0^{s}\big)$, thus $\zeta \in \mathcal{H}^{-s}$ almost surely is equivalent to $\mathcal{C}_0^{s}$ being of trace class. By the Assumption it suffices to have $s > \alpha_0$.

  2. We have $u = \tau\,\mathcal{C}_0^{\frac{1}{2}}\zeta$ in distribution, where $\zeta$ is a white noise, and $\big\|\mathcal{C}_0^{\frac{1}{2}}\zeta\big\|_{1-s} = \|\zeta\|_{-s}$; therefore using part (i) we get the result.

Remark 3.4.

Note that as $\alpha_0$ changes, both the Hilbert scale $(\mathcal{H}^t)_{t\in\mathbb{R}}$ and the decay of the coefficients of a draw from $\mu_0$ change. The norms $\|\cdot\|_t$ are defined through powers of the eigenvalues $\rho_k$. If $\alpha_0 > 0$, then $\mathcal{C}_0$ has eigenvalues that decay like $k^{-\frac{1}{\alpha_0}}$, thus an element $u \in \mathcal{H}^t$ has coefficients $u_k = \langle u, \phi_k\rangle$ that decay faster than $k^{-\frac{t}{2\alpha_0}-\frac{1}{2}}$. As $\alpha_0$ gets closer to zero, the space $\mathcal{H}^t$, for fixed $t$, corresponds to a faster decay rate of the coefficients. At the same time, by the last lemma, draws from $\mu_0$ belong to $\mathcal{H}^{1-s}$ for all $s > \alpha_0$. Consequently, as $\alpha_0$ gets smaller, not only do draws from $\mu_0$ belong to $\mathcal{H}^{1-s}$ for smaller $s$, but also the spaces $\mathcal{H}^{1-s}$ for fixed $s$ reflect faster decay rates of the coefficients. The case $\alpha_0 = 0$ corresponds to $\mathcal{C}_0$ having eigenvalues that decay faster than any negative power of $k$. A draw from $\mu_0$ in that case has coefficients that decay faster than any negative power of $k$.
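To spell out the arithmetic behind the remark (a worked instance, assuming exact algebraic decay $\rho_k = k^{-\frac{1}{\alpha_0}}$ rather than just asymptotic equivalence):

$\|u\|_t^2 = \sum_{k=1}^{\infty} \rho_k^{-t}\,u_k^2 = \sum_{k=1}^{\infty} k^{\frac{t}{\alpha_0}}\,u_k^2, \qquad u_k := \langle u, \phi_k\rangle,$

so for coefficients with exact power-law decay $u_k \asymp k^{-p}$, one has $u \in \mathcal{H}^t$ if and only if $\frac{t}{\alpha_0} - 2p < -1$, that is, $p > \frac{t}{2\alpha_0} + \frac{1}{2}$.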

In the next lemma, we use the interrelations between the operators to obtain additional regularity properties of draws from the prior, and also determine the regularity of draws from the noise distribution and the joint distribution of the unknown and the data.

Lemma 3.5.

Under the Assumptions 3.1 we have:

  1. -a.s. for all

  2. -a.s.

  3. -a.s. for all

  4. -a.s. for all .

Proof.
  1. We can choose an as in the statement by the Assumption 3.1(2). By Lemma 3.3(ii), it suffices to show that . Indeed,

  2. Under Assumption 3.1(3) it suffices to show that . Indeed, by Lemma 3.3(ii), we need to show that , which is true since and we assume , thus

  3. It suffices to show it for any . Noting that is a white noise, using Assumption 3.1(4), we have by Lemma 3.3(i)

    since .

  4. By (ii) we have that is -a.s. in the Cameron-Martin space of the Gaussian measures and , thus the measures and are -a.s. equivalent (Da Prato, 2006, Theorem 2.8) and (iii) gives the result.

3.2 Guidelines for applying the theory

The theory is naturally developed in the scale of Hilbert spaces defined via the prior. However, application of the theory may be more natural in a different functional setting. We explain how the two may be connected. Let $\{e_k\}_{k\in\mathbb{N}}$ be an orthonormal basis of the separable Hilbert space $\mathcal{H}$. We define the spaces $H^s$ as follows: for $s \geq 0$ we set

$H^s := \Big\{ u \in \mathcal{H} : \sum_{k=1}^{\infty} k^{2s}\,\langle u, e_k\rangle^2 < \infty \Big\},$

and the spaces $H^{-s}$, $s > 0$, are defined by duality,

$H^{-s} := (H^s)^{\ast}.$

For example, if we restrict ourselves to functions on a periodic domain and assume that $\{e_k\}$ is the Fourier basis of $L^2$, then the spaces $H^s$ can be identified with the Sobolev spaces of periodic functions, by rescaling (Robinson, 2001, Proposition 5.39).

In the case $\alpha_0 > 0$, as explained in Remark 3.4, we have algebraic decay of the eigenvalues of $\mathcal{C}_0$, and in particular decay like $k^{-\frac{1}{\alpha_0}}$. If $\mathcal{C}_0$ is diagonalizable in the basis $\{e_k\}$, that is, if $\phi_k = e_k$, then it is straightforward to identify the spaces $\mathcal{H}^t$ with the spaces $H^s$, with $s = \frac{t}{2\alpha_0}$. The advantage of this identification is that the spaces $H^s$ do not depend on the prior, so one can use them as a fixed reference point for expressing regularity, for example of the true solution.

In our subsequent analysis, we will require that the true solution $u^\dagger$ lives in the Cameron-Martin space $\mathcal{H}^1$ of the prior, which for different choices of the prior (different $\alpha_0$) is a different space. Furthermore, we will assume that the true solution lives in $\mathcal{H}^\gamma$ for some $\gamma \geq 1$ and provide the convergence rate depending on the parameters of the problem. The identification of $\mathcal{H}^t$ with $H^{\frac{t}{2\alpha_0}}$ and the intuitive relation between the spaces $H^s$ and the Sobolev spaces enable us to understand the meaning of the assumptions on the true solution.

We can now formulate the following guidelines for applying the theory presented in the present paper: we work in a separable Hilbert space $\mathcal{H}$ with an orthonormal basis $\{e_k\}$, and we have some prior knowledge about the true solution which can be expressed in terms of the spaces $H^s$. The noise is assumed to be Gaussian, $\xi \sim \mathcal{N}(0, \mathcal{C}_1)$, and the forward operator $A^{-1}$ is known; that is, $\mathcal{C}_1$ and $A^{-1}$ are known. We choose the prior $\mu_0 = \mathcal{N}(0, \tau^2\mathcal{C}_0)$, that is, we choose the covariance operator $\mathcal{C}_0$, and we can determine the value of $\alpha_0$. If the operator $\mathcal{C}_0$ is chosen to be diagonal in the basis $\{e_k\}$, then we can find the regularity of the true solution in terms of the spaces $\mathcal{H}^t$, that is, the value of $\gamma$ such that $u^\dagger \in \mathcal{H}^\gamma$, and check that $\gamma \geq 1$, which is necessary for our theory to work. We then find the exponents relating the three operators and verify the ill-posedness condition appearing in Assumption 3.1, checking that our choice of the prior is compatible with it. We now have all the necessary information required for applying the Theorems 2.2 and 2.3 presented in Section 2 to get the rate of convergence.
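As a purely illustrative complement to these guidelines, and not part of the original analysis, the following Python sketch computes the conjugate Gaussian posterior in a truncated, simultaneously diagonal toy model, in which formulae (1.4) and (1.5) reduce to scalar computations mode by mode. All decay exponents, the truncation level and the noise level are arbitrary choices made for the illustration, not values prescribed by the theory.

import numpy as np

# Truncated diagonal toy model: all operators diagonal in the same basis.
K = 1000
k = np.arange(1, K + 1)
c0 = k ** -2.0         # eigenvalues of the prior covariance C_0 (illustrative)
c1 = np.ones(K)        # eigenvalues of the noise covariance C_1 = I (white noise)
a_inv = k ** -1.0      # eigenvalues of the forward operator A^{-1} (illustrative)

n = 1e6                # inverse noise level; the small-noise limit is n -> infinity
tau = 1.0              # prior scaling
lam = 1.0 / (n * tau ** 2)   # regularization parameter lambda = 1/(n tau^2)

# Fixed true solution u^dagger and one realization of the data model (1.10):
# y = A^{-1} u^dagger + n^{-1/2} xi, with xi ~ N(0, C_1).
rng = np.random.default_rng(0)
u_true = k ** -1.5
y = a_inv * u_true + rng.standard_normal(K) * np.sqrt(c1 / n)

# Diagonal case of (1.4), (1.5): posterior precision and mean, mode by mode.
post_prec = n * a_inv ** 2 / c1 + 1.0 / (tau ** 2 * c0)
post_mean = n * (a_inv / c1) * y / post_prec
post_var = 1.0 / post_prec

print("lambda =", lam)
print("||m - u_true|| =", np.linalg.norm(post_mean - u_true))
print("Tr(C) =", post_var.sum())

Increasing n, with tau = tau(n) varied according to the regimes discussed in Section 1, illustrates numerically the contraction of the posterior around the true solution.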

Remark 3.6.

Observe that in the above-mentioned example of periodic functions, the identification of $\mathcal{H}^t$ with a Sobolev space of positive regularity, together with the standing assumption $u^\dagger \in \mathcal{H}^\gamma$ for $\gamma \geq 1$, implies, by the Sobolev embedding theorem (Robinson, 2001, Theorem 5.31), that the true solution is always assumed to be continuous. However, this is not a disadvantage of our method, since in many cases a Gaussian measure which charges such a Sobolev space with probability one can be shown to also charge the space of continuous functions with probability one (Stuart, 2010, Lemma 6.25).

4 Properties of the Posterior Mean and Covariance

We now make sense of the equation $B_\lambda m = A^{-1}\mathcal{C}_1^{-1}y$ weakly in the space $\mathcal{H}^1$, under the assumptions presented in the previous section. To do so, we define the operator $B_\lambda$ from (1.6) in $\mathcal{H}^1$ and examine its properties. In Section 5 we demonstrate that (1.4) and (1.7) do indeed correspond to the posterior covariance and mean.

Consider the equation

$B_\lambda v = z, \qquad (4.1)$

where $z \in \mathcal{H}^{-1}$.

Define the bilinear form $B : \mathcal{H}^1 \times \mathcal{H}^1 \to \mathbb{R}$,

$B(u, v) := \big\langle \mathcal{C}_1^{-\frac{1}{2}}A^{-1}u,\; \mathcal{C}_1^{-\frac{1}{2}}A^{-1}v \big\rangle + \lambda\,\big\langle \mathcal{C}_0^{-\frac{1}{2}}u,\; \mathcal{C}_0^{-\frac{1}{2}}v \big\rangle.$

Definition 4.1.

Let $z \in \mathcal{H}^{-1}$. An element $v \in \mathcal{H}^1$ is called a weak solution of (4.1), if $B(v, w) = \langle z, w \rangle$ for all $w \in \mathcal{H}^1$, where $\langle \cdot, \cdot \rangle$ denotes the duality pairing between $\mathcal{H}^{-1}$ and $\mathcal{H}^1$.

Proposition 4.2.

Under the Assumptions 3.1(2) and (3), for any $z \in \mathcal{H}^{-1}$, there exists a unique weak solution $v \in \mathcal{H}^1$ of (4.1).

Proof.

We use the Lax-Milgram theorem in the Hilbert space $\mathcal{H}^1$, since

  1. $B$ is coercive: $B(u, u) \geq \lambda\,\|u\|_1^2$;

  2. $B$ is continuous: indeed, by the Cauchy-Schwarz inequality and the Assumptions 3.1(2) and (3), $|B(u, v)| \leq c\,\|u\|_1\,\|v\|_1$ for a constant $c$ independent of $u$ and $v$.

Remark 4.3.

The Lax-Milgram theorem defines a bounded operator $\tilde{B}_\lambda : \mathcal{H}^1 \to \mathcal{H}^{-1}$, such that $\langle \tilde{B}_\lambda u, v \rangle = B(u, v)$ for all $u, v \in \mathcal{H}^1$, which has a bounded inverse $\tilde{B}_\lambda^{-1} : \mathcal{H}^{-1} \to \mathcal{H}^1$ such that $\tilde{B}_\lambda^{-1}\tilde{B}_\lambda u = u$ for all $u \in \mathcal{H}^1$. Henceforward, we identify $B_\lambda$ and $\tilde{B}_\lambda$. Furthermore, note that in Proposition 4.2, Lemma 4.4 below, and the three propositions in Section 6, we only require a weaker form of the assumptions and not the stronger one; however, in all our other results we actually need the stronger form.

Lemma 4.4.

Suppose the Assumptions 3.1(2) and (3) hold. Then the operator is identical to the operator , where is defined weakly in .

Proof.

The Lax-Milgram theorem implies that is bounded. Moreover, is bounded, thus the operator is also bounded and satisfies

(4.2)

Define weakly in , by the bilinear form given by

By Assumption 3.1(3), is coercive and continuous in , thus by the Lax-Milgram theorem, there exists a uniquely defined, boundedly invertible, operator such that for all . We identify with the bounded operator . By Assumption 3.1(2) we have hence

that is, is bounded. By the definition of and , this implies that . ∎

Proposition 4.5.

Under the Assumptions 3.1(1),(2),(3),(4),(7), there exists a unique weak solution, $m$, of equation (1.7), $\nu$-almost surely.

Proof.

It suffices to show that $A^{-1}\mathcal{C}_1^{-1}y \in \mathcal{H}^{-1}$, $\nu$-almost surely. Indeed, by Lemma 3.5(iv), $y$ $\nu$-almost surely has the regularity required for the Assumption 3.1(7) to apply, and the claim then follows using the Assumption 3.1(2). ∎

5 Characterization of the Posterior using Precision Operators

Suppose that in the problem (1.1) we have $u \sim \mathcal{N}(0, \tau^2\mathcal{C}_0)$ and $\xi \sim \mathcal{N}(0, \mathcal{C}_1)$, where $\xi$ is independent of $u$. Then we have that $y\,|\,u \sim \mathcal{N}\big(A^{-1}u, \frac{1}{n}\mathcal{C}_1\big)$. Let $\mu^y$ be the posterior measure on $u\,|\,y$.

In this section we prove a number of facts concerning the posterior measure $\mu^y$ for $u\,|\,y$. First, in Theorem 5.1 we prove that this measure has density with respect to the prior measure $\mu_0$, identify this density and show that it is Lipschitz in $y$, with respect to the Hellinger metric. Continuity in $y$ will require the introduction of a space to which $y$, drawn from the joint distribution of $(u, y)$, belongs almost surely. Secondly, we prove Theorem 2.1, where we show that $\mu^y$ is Gaussian and identify the covariance and mean via equations (1.4) and (1.7). This identification will form the basis for our analysis of posterior contraction in the following section.

Theorem 5.1.

Under the Assumptions 3.1(1),(2),(3),(4),(5),(6), the posterior measure $\mu^y$ is absolutely continuous with respect to the prior $\mu_0$ and

$\frac{d\mu^y}{d\mu_0}(u) = \frac{1}{Z(y)}\,\exp\Big( -\frac{n}{2}\,\big\|\mathcal{C}_1^{-\frac{1}{2}}A^{-1}u\big\|^2 + n\,\big\langle \mathcal{C}_1^{-\frac{1}{2}}y,\; \mathcal{C}_1^{-\frac{1}{2}}A^{-1}u \big\rangle \Big), \qquad (5.1)$

where