Some results on contraction rates for Bayesian inverse problems


Madhuresh TIFR Centre for Applicable Mathematics, Post Bag No. 6503, GKVK Post Office, Bangalore-560065, India madhuresh@math.tifrbng.res.in
Abstract.

We prove a general lemma for deriving contraction rates for linear inverse problems with nonparametric, non-conjugate priors. We then apply it to obtain contraction rates for both mildly and severely ill-posed linear inverse problems with Gaussian priors in non-conjugate cases. In the severely ill-posed case, our rates match the minimax rates using scalable priors with scales which do not depend upon the smoothness of the true solution. In the mildly ill-posed case, our rates match the minimax rates using scalable priors when the true solution is not too smooth.

Further, using the lemma, we find contraction rates for inversion of a semilinear operator with Gaussian priors, as well as contraction rates for a compactly supported prior. We also discuss the minimax rates applicable to our examples when the Sobolev balls in which the true solution lies are different from the usual Sobolev balls defined via the basis of the forward operator.

Key words and phrases:
Bayesian asymptotics
2010 Mathematics Subject Classification:
62Gxx
This research work benefited from the support of the AIRBUS Group Corporate Foundation Chair in Mathematics of Complex Systems established in CAM and ICTS, TIFR, Bangalore. The authors would like to thank Andrew Stuart for the initial discussions which motivated this work.

1. Introduction

Inverse problems can always be formulated as solutions of equations of the type $Tu = y$, where $y$ is known (typically some measurement or data from an experiment) and one is interested in solving the equation to find $u$ [16, Chapter 1]. The model which we shall investigate here is the statistical version of infinite dimensional linear inverse problems of this type. Specifically, we assume that $T$ is a linear, injective operator from a Hilbert space $H_1$ to a Hilbert space $H_2$ (both infinite dimensional). We treat $u$ as a random variable on $H_1$ and define the data $y$ by the following equation:

(1)   $y = Tu + \varepsilon \xi,$

where $\xi$ is a Gaussian noise on $H_2$, $\xi \sim N(0, \Sigma)$, and $\varepsilon > 0$ is the noise level. (We may choose $\xi$ to be the white noise, i.e. $\Sigma = I$. This makes the data space large, because the white noise is supported on a much bigger space than $H_2$. We shall explain white noise in reasonable detail in Section 2.) In a Bayesian formulation of the problem, also known as the nonparametric (parametric) Bayesian framework when $H_1$ is infinite (finite) dimensional, we assume that $u$ is distributed according to a measure $\Pi_\varepsilon$ on $H_1$ (we allow the choice of prior to depend on the level of noise), to be considered as the prior measure. We further assume that $u$ and $\xi$ are independent. Given the above assumptions, Bayes' theorem gives the distribution of the conditional random variable $u$ given $y$, called the posterior distribution, to be denoted henceforth by $\Pi_\varepsilon(\cdot \mid y)$. We are interested in the properties of the posterior in the small-noise limit as $\varepsilon \to 0$.
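To make the setup concrete, the following minimal Python sketch simulates model (1) in coordinates, assuming a diagonal forward operator and white noise; the decay exponents, truncation level, and the particular truth are hypothetical choices for illustration, not taken from this paper.

    import numpy as np

    # A sketch of model (1) in SVD coordinates, under assumed notation:
    # y_k = kappa_k * u0_k + eps * xi_k, with xi_k ~ N(0,1) i.i.d. (white noise).
    K = 1000                       # truncation level (illustrative only)
    k = np.arange(1, K + 1)
    kappa = k ** (-2.0)            # hypothetical mildly ill-posed decay
    u0 = k ** (-1.5) * np.sin(k)   # a hypothetical "true" solution
    eps = 1e-3                     # noise level

    rng = np.random.default_rng(0)
    y = kappa * u0 + eps * rng.standard_normal(K)  # observed coordinates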

In many practical applications, the inverse problem is ill-posed, in the sense that a solution $u$ may not exist for a given $y$, it may not be unique, or it may not depend continuously on $y$. In such cases, many methods of regularization are well developed for linear ill-posed inverse problems but are still being developed for many nonlinear problems; see, e.g., the book by Kirsch [16] and references therein. A related approach to dealing with ill-posed inverse problems is the statistical approach similar to the one introduced in the first paragraph. The statistical approach may turn a deterministic ill-posed inverse problem into a well-posed problem in the sense that, under certain assumptions on $T$, $\xi$ and the prior, the posterior distribution for the conditional random variable $u$ given $y$ is continuous with respect to $y$ even when $T^{-1}$ is not continuous (see, e.g., the papers [24, 8] for details, further discussions and references).

Thus the main object of interest in the statistical version is the posterior distribution $\Pi_\varepsilon(\cdot \mid y)$. Depending on the context, various properties and quantities related to the posterior distribution are of interest. A far from exhaustive list of recent studies includes credible sets of the posterior [18]; computational methods and finite dimensional approximations of equation (1) and of the posterior [15]; and convergence of MAP (maximum a posteriori) and CM (conditional mean) estimators. For an extensive bibliography, one may refer to [13].

Another major question in the statistical version is that of consistency of the posterior measure. Depending on the formulation of the problem, one expects the posterior measure to approach the Dirac delta supported on the true value of the unknown either as the number of observations goes to infinity (large data limit), or as the observations become more accurate (small noise limit). The first result in the large data limit is due to Doob [21], and this limit has been investigated in many different contexts since then. There has been recent interest in the small noise limit [22, 18, 3, 19]. When $T$ is the identity operator, the asymptotics of the large data and small noise limits are known to be equivalent [6, 7]. Consistency can be proved for all linear, injective $T$ when the prior is Gaussian using elementary methods; however, we have not been able to locate a proof in the literature. Methods from the former (large data limit) are exploited in the work of Ray [22] to deal with small noise limits and will be used explicitly by us in this work as well.

We shall focus on the small noise limit. In particular, we will assume that the data is obtained from a "true" unknown element $u_0$ in equation (1). We will be concerned with the rate of contraction of the Bayesian posterior distribution to the true solution, in an appropriate sense to be described later. Essentially, we look at the posterior measure of the complements of small balls around $u_0$, that is, $\Pi_\varepsilon(\{u : \|u - u_0\|_{H_1} \ge \delta_\varepsilon\} \mid y)$. Note that this is a random variable in $y$. We then study the convergence of this random variable to $0$ as $\varepsilon \to 0$. In case we are able to prove such a convergence, $\delta_\varepsilon$ is called a contraction rate.

The novelty of this work, compared to the previous studies [22, 18, 3, 19, 2], is to extend the class of models, mainly priors, but also the class of distributions of noise, for which we can obtain contraction rates. The details of the technical assumptions needed for our results are given in Section 3. We also give necessary and sufficient conditions for well-posedness of linear Bayesian models in Banach spaces (that is, model (1) with $H_1, H_2$ being Banach spaces).

In Section 2.1, we shall begin by presenting the basic set-up, followed in Section 2.2 by the precise definitions of consistency and contraction rates. We discuss our main contributions and relations to previous work in detail in Section 2.4. Our main results on contraction rates, Lemma 3.9, as well as Lemmas 3.14 and 3.17, which are extensions of, and use the methods from, the work of Ray [22], are stated and proved in Section 3. Finally, in Section 4, we discuss different classes of examples, one of which is semilinear, where we can apply our results.

2. Consistency and contraction rates

In this section, we first introduce the basic setup and Bayes' theorem in the context of our problem, followed by the definitions of consistency and contraction rates in Section 2.2. Heuristically, consistency implies that the random posterior measure concentrates around the true solution as the observations get more precise by way of the observational noise converging to zero in an appropriate way. In a similar fashion, contraction rates measure how quickly the above concentration happens.

2.1. Basic setup

As discussed in Section 1, we will consider statistical versions of linear inverse problems, as given in model (1). We will treat $u$ and $y$ as random variables defined on an abstract probability space, taking values in the Hilbert spaces $H_1$ and $H_2$, equipped with the inner products $\langle \cdot, \cdot \rangle_{H_1}$ and $\langle \cdot, \cdot \rangle_{H_2}$ and corresponding norms $\|\cdot\|_{H_1}$ and $\|\cdot\|_{H_2}$, respectively. Here recall that $y$ is referred to as the observed data and $u$ as the unknown, or the parameter to be determined.

We will consider the operator $T$ to be linear and injective. Our first lemma will require that $T$ have a singular value decomposition, i.e., the eigenvectors of $T^*T$ form a basis of $H_1$, which we denote by $\{e_k\}_{k \ge 1}$, with corresponding eigenvalues $\{\kappa_k^2\}$, i.e., $T^*T e_k = \kappa_k^2 e_k$ for $k \ge 1$. We will consider ill-posed inverse problems in the sense that the inverse of $T$ is discontinuous, i.e., $0$ is a limit point of the eigenvalues $\{\kappa_k^2\}$. Thus in this case, for any $y$ in the range of $T$, there is a unique $u$ satisfying $Tu = y$, but the dependence of $u$ on $y$ is not continuous. Also note that the data $y$ in (1) may not even belong to the range of $T$, or to $H_2$ at all, due to the white noise component. Two important cases of ill-posed inverse problems are as follows [22, 7].

  • The inverse problem is called severely ill-posed if

    (2)   $C_1 e^{-c k^{b}} \le \kappa_k \le C_2 e^{-c k^{b}}$

    for some constants $b, c > 0$ and $C_1, C_2 > 0$. Essentially, in this case the eigenvalues of $T^*T$ decay exponentially. An example is the classical inverse problem of determining the initial condition of the heat equation given the solution at a fixed later time (see the sketch after this list).

  • The problem is called mildly ill-posed if

    (3)   $C_1 k^{-p} \le \kappa_k \le C_2 k^{-p}$

    for some constants $p > 0$ and $C_1, C_2 > 0$.
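The sketch below contrasts the two decay regimes with hypothetical constants; for the heat equation on an interval observed at time $t$, the singular values of the solution map decay like $e^{-k^2 t}$, an instance of (2) with $b = 2$.

    import numpy as np

    # Illustrative sketch (hypothetical constants) of the two decay regimes.
    k = np.arange(1, 21)
    t = 0.1
    kappa_severe = np.exp(-(k**2) * t)   # heat equation: exponential decay, cf. (2)
    kappa_mild = k ** (-2.0)             # polynomial decay, cf. (3), with p = 2

    # Naive inversion multiplies the k-th noise coordinate by 1/kappa_k,
    # which is what makes the problem ill-posed.
    print(1 / kappa_severe[-1], 1 / kappa_mild[-1])  # ~2.4e17 versus 400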

We will consider the case of Gaussian observational noise, i.e., $\xi \sim N(0, \Sigma)$ on $H_2$, and also assume that the range of $T$ is a subset of $\mathcal{H}$, the Cameron-Martin space of the noise.

Remark 2.1.

Note that $N(0, \Sigma)$ is supported on $H_2$ if and only if $\Sigma$ is trace class, see [20]. However, covariance operators which are not trace class are also commonly used. For instance, the covariance of white noise is the identity, which is clearly not trace class. In such a case, the noise is not actually supported on $H_2$ but on a much larger space, which we define below. Let $\{f_k\}$ be any orthonormal basis of $H_2$ and let $Y_k$ be the one dimensional span of $f_k$. Let $\gamma$ be the standard Gaussian on $\mathbb{R}$ and define the white noise on the product space $\prod_k Y_k$ to be the product measure of $\gamma$ over the factors $Y_k$. Even though the above definition uses the basis $\{f_k\}$, it is easy to check that the measure is independent of the choice of the basis (see [20]).
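The following sketch illustrates the point of the remark: in any orthonormal basis, white noise has i.i.d. standard Gaussian coordinates, so its truncated squared norm grows linearly in the truncation level, and the noise cannot live in $H_2$ itself.

    import numpy as np

    # Sketch of the white-noise remark: the coordinates of white noise in any
    # orthonormal basis {f_k} of H_2 are i.i.d. N(0,1), so the truncated
    # squared norm sum_{k<=K} xi_k^2 grows like K.
    rng = np.random.default_rng(1)
    xi = rng.standard_normal(10_000)
    partial_norms_sq = np.cumsum(xi**2)        # ||P_K xi||^2 for K = 1..10000
    print(partial_norms_sq[[99, 999, 9999]])   # roughly 100, 1000, 10000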

Denote by $\mu_0$ the measure of the scaled noise $\varepsilon\xi$ and by $\mu^u$ the measure of $y$ given $u$, i.e., the measure of the scaled noise shifted by $Tu$, for any fixed $u \in H_1$. The above assumption that $Tu$ belongs to the Cameron-Martin space $\mathcal{H}$ of the noise implies that $\mu^u \ll \mu_0$. Indeed, the Radon-Nikodym derivative is given by the Cameron-Martin theorem [20]:

(4)   $\dfrac{d\mu^u}{d\mu_0}(y) = \exp(\ell_u(y)),$

where

(5)   $\ell_u(y) = \dfrac{1}{\varepsilon^2}\langle Tu, y \rangle_{\mathcal{H}} - \dfrac{1}{2\varepsilon^2}\|Tu\|_{\mathcal{H}}^2,$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the Cameron-Martin inner product (extended to $\mu_0$-almost every $y$ in the usual way). Note that the existence of $\ell_u$ assumes that $Tu \in \mathcal{H}$ for almost every $u$ with respect to the prior measure.

Following the Bayesian philosophy, we will assume $u$ to be an $H_1$-valued random variable independent of $\xi$. The distribution of $u$ will be interpreted as the prior for $u$, and will be denoted by $\Pi_\varepsilon$. We will be interested in the properties of the conditional distribution of $u$ given $y$, as given by Bayes' theorem.

We will now describe the posterior measure and show that it is well defined. The following proposition is a generalization of Theorem 4.1 in [24] as it does not depend on any assumptions on the likelihood other than its existence.

Claim 2.2.

Assume that the random variable $u \sim \Pi_\varepsilon$ is independent of the noise $\xi$. Further, assume that $Tu \in \mathcal{H}$ for almost all $u$ with respect to the prior measure. Then the conditional distribution of $u$ given $y$, denoted by $\Pi_\varepsilon(\cdot \mid y)$, is well-defined and also absolutely continuous with respect to the prior for $\mu_0$-a.e. $y$. Indeed, the posterior measure of any Borel set $B \subseteq H_1$ is given by

(6)   $\Pi_\varepsilon(B \mid y) = \dfrac{1}{Z(y)} \displaystyle\int_B \exp(\ell_u(y)) \, d\Pi_\varepsilon(u),$

where $\ell_u$ is defined in equation (5) and the constant

(7)   $Z(y) = \displaystyle\int_{H_1} \exp(\ell_u(y)) \, d\Pi_\varepsilon(u)$

is finite:

(8)   $0 < Z(y) < \infty$ for $\mu_0$-a.e. $y$.
Proof.

Due to the independence of $u$ and $\xi$, the joint distribution of $(u, y)$ is given by $\Pi_\varepsilon(du)\, \mu^u(dy)$. Using equation (1) and the definition of $\mu^u$ above, we see that the distribution of the random variable $y$ given $u$ is $\mu^u$, which is absolutely continuous w.r.t. $\mu_0$ with the Radon-Nikodym derivative given by the RHS of equation (4). Thus by Bayes' theorem, the conditional distribution of $u$ given $y$, denoted by $\Pi_\varepsilon(\cdot \mid y)$, is absolutely continuous with respect to $\Pi_\varepsilon$ with the same Radon-Nikodym derivative up to normalization, which proves equation (6), as long as the constant $Z(y)$ is finite and non-zero.

We now show $0 < Z(y) < \infty$ for $\mu_0$-a.e. $y$. For finiteness, note that

$\displaystyle\int Z(y)\, d\mu_0(y) = \int_{H_1} \int \frac{d\mu^u}{d\mu_0}(y)\, d\mu_0(y)\, d\Pi_\varepsilon(u) = \int_{H_1} 1 \, d\Pi_\varepsilon(u) = 1,$

so $Z(y) < \infty$ for $\mu_0$-a.e. $y$. For positivity, it suffices to show that $\int_A Z(y)\, d\mu_0(y) > 0$ for all sets $A$ of positive measure (under $\mu_0$); indeed,

$\displaystyle\int_A Z(y)\, d\mu_0(y) = \int_{H_1} \mu^u(A)\, d\Pi_\varepsilon(u).$

We have used the Tonelli theorem and the change of variable formula for translations of Gaussian measures in the above. Since $\mu^u$ and $\mu_0$ are absolutely continuous with respect to each other for almost all $u$, we have $\mu^u(A) > 0$ whenever $\mu_0(A) > 0$, and hence $Z(y) > 0$ for $\mu_0$-a.e. $y$. Therefore the posterior distribution exists and has the density given by equation (6). ∎

The above computation proves that the posterior is well defined for $\mu_0$-a.e. $y$. We also note here that the proof relies solely on the existence of $\ell_u$, which is a consequence of the assumption that $Tu \in \mathcal{H}$ for almost all $u$ with respect to the prior measure.

2.2. Definition of contraction rates

The posterior distribution, as seen above, can indeed be considered as a solution to a statistical inverse problem. However, such a proposed solution needs to be tested for some obvious flaws. Heuristically, assuming that the observation corresponds to a specific true solution $u_0$, the posterior distribution should concentrate around the true solution as the intensity of the noise decreases to zero.

The main idea is to estimate the posterior measure of complements of neighborhoods of the true solution, and show that they converge to zero.

In particular, we will consider $\Pi_\varepsilon(\{u : \|u - u_0\|_{H_1} \ge \delta\} \mid y)$, which is a random variable defined for each $y$ in the support of the data distribution. The consistency of the statistical inverse problem is then defined by convergence in probability of these random variables.

Definition 2.3.

The sequence of posteriors is said to be consistent if $\Pi_\varepsilon(\{u : \|u - u_0\|_{H_1} \ge \delta\} \mid y) \to 0$ in probability as $\varepsilon \to 0$, for every $\delta > 0$.

In fact, the above convergence may remain true even when we replace the fixed $\delta$ with a sequence $\delta_\varepsilon \to 0$, which essentially defines the contraction rate as follows.

Definition 2.4.

A sequence $\delta_\varepsilon \to 0$ is said to be a rate of contraction for the sequence of posterior measures if $\Pi_\varepsilon(\{u : \|u - u_0\|_{H_1} \ge M\delta_\varepsilon\} \mid y) \to 0$ in probability as $\varepsilon \to 0$, for all sufficiently large constants $M > 0$.
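For a diagonal conjugate toy model (every constant below is hypothetical, and the rate used is a placeholder rather than one derived in this paper), the quantity appearing in Definition 2.4 can be estimated by Monte Carlo:

    import numpy as np

    # Sketch (diagonal conjugate case; all constants hypothetical): estimate
    # the posterior mass outside the ball of radius M * delta_eps around u0
    # by sampling the coordinatewise Gaussian posterior.
    rng = np.random.default_rng(2)
    K, eps, M = 1000, 1e-3, 3.0
    k = np.arange(1, K + 1)
    kappa = k ** (-2.0)               # forward map singular values (assumed)
    lam = k ** (-2.0)                 # prior variances lambda_k (assumed)
    u0 = k ** (-1.5)                  # hypothetical true solution
    y = kappa * u0 + eps * rng.standard_normal(K)

    v = lam * eps**2 / (eps**2 + lam * kappa**2)      # posterior variances
    m = lam * kappa * y / (eps**2 + lam * kappa**2)   # posterior means

    samples = m + np.sqrt(v) * rng.standard_normal((4000, K))
    dist = np.linalg.norm(samples - u0, axis=1)
    delta_eps = eps ** 0.25           # placeholder rate, not the paper's
    print(np.mean(dist >= M * delta_eps))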

As mentioned earlier in the introduction, we prove a general lemma for deriving contraction rates for linear inverse problems. We also compare our rates with minimax rates. A description of minimax rates can be found in [5, 7]. Before we state and prove the lemma in Section 3, we discuss the relation of our results to previous work.

2.3. Definition of well-posedness

The concept of well-posedness captures the fact that the posterior varies continuously with the observation. In order to formalise the concept, we need appropriate metrics on the space of posteriors as well as on the space of observations. It is standard to use the Hellinger distance as a metric on the space of posteriors. We use the definition given in [24].

Definition 2.5.

Given two probability measures $\mu_1, \mu_2$, and a measure $\nu$ such that $\mu_1$ and $\mu_2$ have densities with respect to $\nu$, the Hellinger distance is defined as

$d_H(\mu_1, \mu_2) = \left( \dfrac{1}{2} \displaystyle\int \left( \sqrt{\dfrac{d\mu_1}{d\nu}} - \sqrt{\dfrac{d\mu_2}{d\nu}} \right)^2 d\nu \right)^{1/2}.$
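As a sanity check on the definition, the following sketch computes the Hellinger distance between two one-dimensional Gaussians by discretizing the defining integral (with $\nu$ the Lebesgue measure) and compares it with the known closed form for equal variances.

    import numpy as np

    # Sketch: Hellinger distance between N(m1, s^2) and N(m2, s^2), nu =
    # Lebesgue measure. A Riemann sum of the defining integral is compared
    # with the closed form d_H^2 = 1 - exp(-(m1 - m2)^2 / (8 s^2)).
    m1, m2, s = 0.0, 0.5, 1.0
    x = np.linspace(-10.0, 10.0, 200_001)
    dx = x[1] - x[0]
    p = np.exp(-(x - m1) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
    q = np.exp(-(x - m2) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

    h2_numeric = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx
    h2_closed = 1 - np.exp(-((m1 - m2) ** 2) / (8 * s**2))
    print(np.sqrt(h2_numeric), np.sqrt(h2_closed))  # both ~0.175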

Definition 2.6.

The posterior model is said to be well-posed if there exists a Banach space $Y$ such that $y \in Y$ almost surely (note that $y$ is the sum of the random variables $Tu$ and $\varepsilon\xi$) and $\Pi_\varepsilon(\cdot \mid y)$ is a continuous function of $y$ with respect to the corresponding metrics.

2.4. Our contribution and relation to previous work

Well-posedness of the posterior for inverse problems on Banach spaces has been discussed in [24] in a very general setting. That work provides certain technical assumptions which are sufficient to show well-posedness. However, it can be cumbersome to show that the assumptions hold even in simple cases, as can be seen in [2].

We give necessary and sufficient conditions for the posterior to be well-posed when the model (1) is defined on Banach spaces (Theorem 3.1).

In the finite dimensional setup, a nondegenerate linear $T$ ensures well-posedness of the inverse problem and makes the problem of finding contraction rates easy. The same clearly does not hold in the infinite dimensional case, which has received considerable attention only recently. We shall now outline some of these studies of contraction rates in infinite dimensions, pointing out the relations to our main result.

In the papers [18] and [3, 19], the authors deal with mildly ill-posed (3) and severely ill-posed (2) inverse problems, respectively. Both these studies use scalable Gaussian priors and white noise. They first calculate the mean and covariance operator of the posterior distribution (which is Gaussian as well) and find a bound for the posterior mass outside a ball around $u_0$, as defined above, using Markov's inequality, and then take the expectation of the resulting random variable. At this point, they use the assumption that $\phi_k = e_k$, where $\{\phi_k\}$ is the basis of $H_1$ with respect to which the covariance of the prior is diagonalizable (recall that $\{e_k\}$ is the eigenbasis of $T^*T$). Using this assumption, they show that the expectation equals the sum of a series, and the estimate for the value of the sum provides the contraction rate (a sketch of this kind of computation follows below). In [17], the authors provide adaptive priors to get contraction rates which match the minimax rates (except for a logarithmic term) when using priors which are diagonalizable in the eigenbasis of $T^*T$.
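The following sketch is our own illustration with hypothetical constants, not the computation of [18] itself; it shows the kind of series that arises in the conjugate diagonal case, where Markov's inequality bounds the expected posterior mass outside a ball by an expected squared distance that splits coordinatewise into variance, squared bias, and mean-fluctuation series.

    import numpy as np

    # Sketch of the series estimate in the conjugate diagonal case
    # (hypothetical constants): Markov's inequality gives
    #   E Pi(||u - u0|| >= M*delta | y) <= E ||u - u0||^2 / (M*delta)^2,
    # and E ||u - u0||^2 splits into three coordinatewise series.
    K, eps = 100_000, 1e-3
    k = np.arange(1, K + 1)
    kappa = k ** (-2.0)               # forward map (assumed)
    lam = k ** (-2.0)                 # prior variances (assumed)
    u0 = k ** (-1.5)                  # hypothetical truth

    c = lam * kappa / (eps**2 + lam * kappa**2)   # posterior mean: m_k = c_k y_k
    v = lam * eps**2 / (eps**2 + lam * kappa**2)  # posterior variances
    bias2 = (c * kappa - 1.0) ** 2 * u0**2        # squared bias at the truth
    fluct = c**2 * eps**2                         # noise-driven mean fluctuation
    print(np.sum(v + bias2 + fluct))              # -> E ||u - u0||^2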

In the paper [2], the authors again work with Gaussian priors, but where $\phi_k = e_k$ may not hold. Further, the observational noise may not be white, generalizing some recent results [9, 10, 18, 19]. They work with the explicit expressions for the mean and covariance of the Gaussian posterior, obtaining contraction rates for mildly ill-posed problems. The paper makes use of certain technical assumptions comparing the covariance operators of the noise and prior measures to the linear operator $T$. The paper gets contraction rates only when the true solution lies in the Cameron-Martin space of the prior, in particular when $u_0$ belongs to a Hilbert scale of sufficiently high order, as defined in Section 3 of [2].

Finally, in [22], the author discusses cases where the prior may not be Gaussian, and we shall be using methods from that paper. The main idea in [22] is to use test functions as introduced in some earlier works [11]. Similar test functions are defined later in our paper in Proposition 3.11. A key result, Theorem 2.1 in [22], is along the lines of similar results in [12], and is proved under the assumption that

(9)

holds for all $k$, where $\{\phi_k\}$ is an arbitrary basis of $H_1$. However, when considering Gaussian priors, the conditions required for the main Theorem 2.1 in [22] are verified only for the case $\phi_k = e_k$, which is also the case discussed in [18, 3, 19, 17]. In conclusion, there is a paucity of results on contraction rates in non-conjugate cases with Gaussian priors (non-conjugate meaning that the prior covariance is not diagonalizable in the eigenbasis $\{e_k\}$ of $T^*T$). The available contraction rates hold only when the true solution is in the Cameron-Martin space of the prior. Also, there are no results for severely ill-posed problems in non-conjugate cases with Gaussian priors. The main contributions of this paper are summarised below.

2.4.1. Our contribution

  • Theorem 3.1: We show that the posterior for model (1) is well-posed for Gaussian priors if and only if $Tu$ lies in the Cameron-Martin space of the noise almost surely with respect to the prior measure.

  • Lemma 3.14: We weaken condition (9) of [22] by demanding only that

    (10)

    holds for all $k$.

  • Subsection 4.1.1: We then verify the conditions imposed by the lemma for examples of mildly ill-posed problems with scalable Gaussian priors in non-conjugate cases, and achieve minimax rates when the true solution does not lie in the Cameron-Martin space of the prior. The rates, however, are suboptimal when the true solution lies in the Cameron-Martin space of the prior. Our examples strictly contain the class of examples discussed in the paper [2] (in the sense that we allow for a much larger class of perturbations), and we get contraction rates for true solutions belonging to all Hilbert scales/Sobolev classes (Subsection 4.1.2). We also get contraction rates for the deconvolution problem using Gaussian priors on the Meyer wavelet basis.

  • Subsection 4.1.3: We obtain the minimax rates for severely ill-posed problems in non-conjugate cases using scalable priors (with scales independent of the smoothness of the true solution) for all Sobolev classes of true solutions.

  • Subsection 4.1.4: Finally, under appropriate assumptions, we also obtain contraction rates for the posterior when the operator is semilinear.

3. Well-posedness and Contraction Lemmas

In this section, we shall present our main results concerning well-posedness of the posterior, and contraction rates for various choices of model parameters.

3.1. Well-posedness

We shall begin by stating our result on well-posedness of the posterior on separable Hilbert spaces, and then use it to show the result on Banach spaces.

Theorem 3.1.

Consider the model (1) with a Gaussian prior and Gaussian noise. Then the posterior measure is well-posed, and is locally Lipschitz in the observation with respect to the appropriate norm, if and only if $Tu$ lies almost surely (with respect to the prior) in the Cameron-Martin space of the noise.

Proof.

Following the proof of Proposition 3.1 in [18], we write the posterior explicitly as a Gaussian measure with

and

Note that is independent of the observation . Using Fernique's theorem ([20]) and the bounded convergence theorem, it is easy to see that is locally Lipschitz with respect to where is the usual norm on . If lies in the Cameron-Martin space of the noise almost surely with respect to the prior, then we have

where the norm is the Cameron-Martin norm of noise. This implies that the covariance of the pushforward of under is trace class. This implies that is trace class. Similar calculations show that if lies in the Cameron-Martin space of the posterior almost surely with respect to the distribution of (distributed as , that is, Gaussian with mean and covariance ), then is trace class. Since both the summands are positive, both and need to be trace class. With these facts in hand, we prove the theorem.

The if part

We assume that is trace class. Well-posedness of the posterior will follow if we show that is continuous on a space with an appropriate norm and almost surely. Note that lies in the Cameron-Martin space of almost surely with respect to . Hence, is absolutely continuous with respect to for almost all with respect to . Thus, it is sufficient to show that almost surely.

We choose , with the norm . The random variable belongs to almost surely, if almost surely. This holds if

which, in turn is true whenever the covariance operator of pushforward of under , given by is trace class.

Noting that is trace class if and only if is trace class, the above is equivalent to showing that

is trace class. We have seen that this follows by the fact that lies in the Cameron-Martin space of the noise almost surely with respect to the prior measure. Hence, almost surely.

Next, we show that is continuous on . For each , we have some such that and . Thus, is continuous on if is continuous on . Setting and , we have

and

Next, we note that is compact (since is trace class) and the eigenbasis of is also the eigenbasis of . Assuming that are the eigenvalues of , the eigenvalues of are , hence is continuous. Finally, we estimate the remaining term. Noting that is continuous, we have

for some . Hence, , proving the statement of the theorem.

The only if part

We need to show that is trace class, or equivalently, is Hilbert-Schmidt. Well-posedness of the posterior implies that lies in the Cameron-Martin space of the posterior almost surely with respect to the distribution of . This implies that and are trace class. In particular, this implies that and are Hilbert-Schmidt. Further, note that is bounded below since is bounded below and and are positive operators. Thus, is bounded below making Hilbert-Schmidt. Next, recalling that , we have

Since is Hilbert-Schmidt ( is Hilbert-Schmidt and is continuous), we have after explicitly writing out ,

Hence, is Hilbert-Schmidt. Let have eigenbasis . It is easy to see that is Hilbert-Schmidt and has eigenbasis as well.

Now, we use the fact that is Hilbert-Schmidt to show that is Hilbert-Schmidt. By arguments used before, it follows that is also Hilbert-Schmidt. After opening up , we have

As before, noting that is Hilbert-Schmidt and opening up , we have

Hence, is Hilbert-Schmidt. Assuming eigenvalues of are , we have

which, along with (since is compact), implies that , making trace class and hence Hilbert-Schmidt. ∎

Now, to show the result for Banach spaces, we shall embed the Banach spaces into appropriate Hilbert spaces and show that the well-posedness result on the Hilbert spaces implies the well-posedness result on the Banach spaces. We rewrite model (1) in Banach spaces:

(11)   $y = Tu + \varepsilon \xi,$

where $\xi$ is a Gaussian noise on $B_2$, the spaces $B_1, B_2$ are Banach spaces, and $\Pi_\varepsilon$ is the prior on $B_1$. If there exists a Hilbert space $H_1$ such that $B_1 \subseteq H_1$, then we can push forward the prior via the inclusion map to get a measure on $H_1$. Similarly, if there exists a Hilbert space $H_2$ such that $B_2 \subseteq H_2$, then we can rewrite model (11) as model (1) with minor changes: $T$ is a linear map which is defined almost everywhere with respect to the prior measure, and the noise stays the same as before. If we now write the posterior for this model and prior, we get the same expression for the posterior, and hence the well-posedness of model (1) implies the well-posedness of (11). The only thing left to show is that given any separable Banach space $B$, we can find a Hilbert space $H$ such that $B \subseteq H$.

Lemma 3.2.

Given a separable Banach space $B$, we can construct a separable Hilbert space $H$ such that $B \subseteq H$ with continuous inclusion.

Proof.

Since $B$ is separable, we can find a countable collection of linear functionals $\{f_k\}_{k \ge 1}$ of norm at most one such that $\|x\|_B = \sup_k |f_k(x)|$ for all $x \in B$. Define

$\langle x, z \rangle_H := \displaystyle\sum_{k=1}^{\infty} 2^{-k} f_k(x) f_k(z).$

Also, the norm $\|\cdot\|_H$ is smaller than the norm $\|\cdot\|_B$ since

$\|x\|_H^2 = \displaystyle\sum_{k=1}^{\infty} 2^{-k} f_k(x)^2 \le \|x\|_B^2 \sum_{k=1}^{\infty} 2^{-k} = \|x\|_B^2.$

Completing $B$ under the norm $\|\cdot\|_H$ gives us the desired Hilbert space $H$. ∎
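As an illustration of the construction, the sketch below takes $B = C([0,1])$ with the sup norm and point evaluations as the functionals; a finite set of points stands in for a dense sequence, and the weighted $\ell^2$ norm is indeed dominated by the Banach norm.

    import numpy as np

    # Sketch of the Lemma 3.2 construction for B = C([0,1]) with the sup
    # norm: point evaluations f_k(x) = x(q_k) at a dense sequence are
    # norm-one functionals. (A finite point set stands in for the sequence.)
    K = 50
    q = np.random.default_rng(3).random(K)    # stand-in for dense points q_k
    w = 2.0 ** -np.arange(1, K + 1)           # weights 2^{-k}

    def h_norm(x):
        # ||x||_H = ( sum_k 2^{-k} x(q_k)^2 )^{1/2}
        return np.sqrt(np.sum(w * x(q) ** 2))

    x = lambda t: np.cos(2 * np.pi * t)       # element of C([0,1]), sup norm 1
    print(h_norm(x), "<= 1.0")                # dominated by the sup norm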

This achieves the first part in Section 2.4.1. As an application, we shall discuss the well-posedness of posteriors for a certain class of examples which strictly contains the examples discussed in [2].

Example 3.3.

Consider the model (1), with the prior . Let the operator and noise for some continuous, self-adjoint, positive operators such that is trace class for some and . Assume further that is continuous.

Claim 3.4.

The posterior for the above example is well-posed.

Proof.

We need to show that $Tu$ lies in the Cameron-Martin space of the noise for almost all $u$ with respect to the prior measure. Following the previous proof, we can show that the above holds if is continuous for some . Hence, it is sufficient to show that is continuous. Noting that and are continuous (by the first part of Theorem LABEL:thm:functional), we have that

is continuous. Hence, we have proved the claim. ∎

We note here that, in the context of Example 8.3 in [2], we allow for a larger class of perturbations (in the classes of operators we can choose), both in the noise measure and in the operator to be inverted.

3.2. Contraction lemmas

In this section, we shall focus on providing tight conditions which enable us to compute (almost) optimal contraction rates. Each lemma in this section demands different conditions, which are closely related to one another. Depending on the kind of prior, one may find some of these conditions easier or harder to verify, and this may dictate which form of the lemma to use. We shall present the proof only for the first contraction lemma, and outline the proofs for the other two.

It has been noted by Knapik et al. [18] that it is possible to improve the contraction rates by choosing different priors for different noise levels $\varepsilon$. We shall adopt the same method in our quest for better contraction rates in our setting. All the priors, however, shall be defined using the same basis. As described in Section 2.2, a contraction rate for a posterior measure is a way of quantifying the concentration of the posterior measure around the true value. However, such a concentration phenomenon is not exhibited by the posterior if the prior distribution does not put enough probability mass around the true value. This can be avoided with the following assumption:

Assumption 3.5.

Assume that there exists a sequence of real numbers $\delta_\varepsilon$ such that $\delta_\varepsilon \to 0$ and $\delta_\varepsilon/\varepsilon \to \infty$, such that

(12)   $\Pi_\varepsilon\left( u : \|Tu - Tu_0\|_{\mathcal{H}} \le \delta_\varepsilon \right) \ge e^{-\delta_\varepsilon^2/\varepsilon^2},$

where $\|\cdot\|_{\mathcal{H}}$ is the Cameron-Martin norm of the noise as defined in (5), and $u_0$ is the true value.

Additionally, we also need to ensure that, with high probability (under the prior measure), elements of $H_1$ are well approximated by their finite dimensional projections. This is made precise in the following.

Assumption 3.6.

Assume that $T$ admits a singular value decomposition, and that $T^*T$ has the eigenpairs $(e_k, \kappa_k^2)$. Further, assume that there exist a sequence of real constants , sequences of positive integers , and a basis $\{\phi_k\}$ of $H_1$ satisfying

  • as ;

  • for some , where are as defined in the previous assumption;

  • as , and

  • writing and as the projections onto the subspace spanned by and respectively, we need

    (13)

    for some , and the same as in Assumption 3.5.

Remark 3.7.

Note that a sufficient condition to check the inequality in (13) is

(14)

and

(15)

Next is a technical assumption which describes the relationship between the eigenbasis $\{e_k\}$ of $T^*T$ and the basis $\{\phi_k\}$.

Assumption 3.8.

Let , then define

(16)

We assume that . Note that is finite whenever and are finite.

In short, the choice of prior and the operator $T$ will determine the sequence $\delta_\varepsilon$ in Assumption 3.5. Thereafter, $\delta_\varepsilon$, along with $T$ and the prior, will dictate the existence of the constants and sequences appearing in Assumptions 3.6 and 3.8. Finally, equipped with the above sequences, we shall see in the following lemma that $\delta_\varepsilon$ is the rate of contraction of the posterior measure defined in equation (6).

Lemma 3.9 (Contraction Lemma).

Consider the model given by equation (1), together with Assumptions 3.5-3.8 stated above. Also, let $u_0$ be such that . Then, for some $M > 0$,

(17)   $\Pi_\varepsilon\left(\{u : \|u - u_0\|_{H_1} \ge M\delta_\varepsilon\} \mid y\right) \to 0$ in probability as $\varepsilon \to 0$,

where, recall that, $\delta_\varepsilon$ is called the rate of contraction of the posterior measure around the true value $u_0$.

Remark 3.10.

Clearly, Assumption 3.8 does not place any a priori restrictions on the basis of the prior, unlike previous works (see [22]). We also note here that, at this stage, we do not know whether Assumption 3.6 can be verified for a given problem when is finite. However, in our next lemma we shall weaken Assumption 3.6 by setting .

We shall prove the above lemma in an indirect way, following the work of Ghosal et al. [11], where the authors established a close relation between contraction rates and the existence of sequences of tests, which are real valued functions defined on the data space satisfying certain regularity conditions, and which in turn translate into contraction rates. More precisely, we shall use the following proposition.

Proposition 3.11.

Let there exist tests for , satisfying

(18)
(19)

where all the constants appearing above are as defined in Assumptions 3.5-3.8. Then $\delta_\varepsilon$ is the contraction rate of the posterior measure around the true value $u_0$. Here is the composition of and .

In view of the above proposition, which will be proved later, the proof of Lemma 3.9 proceeds as follows: under Assumptions 3.5-3.8, we shall show that the same tests as were used in the work of Ray [22] satisfy conditions (18) and (19) above.

Proof of Lemma 3.9.

Recall the model $y = Tu + \varepsilon\xi$ from (1). We will define


Then, setting as the projection of onto , we can write

where