Robust and rate-optimal Gibbs posterior inference on the boundary of a noisy image

Nick Syring  and  Ryan Martin
Department of Statistics
North Carolina State University
(nasyring, rgmarti3)@ncsu.edu
June 29, 2019
Abstract

Detection of an image boundary when the pixel intensities are measured with noise is an important problem in image segmentation, with numerous applications in medical imaging and engineering. From a statistical point of view, the challenge is that likelihood-based methods require modeling the pixel intensities inside and outside the image boundary, even though these are typically of no practical interest. Since misspecification of the pixel intensity models can negatively affect inference on the image boundary, it would be desirable to avoid this modeling step altogether. Towards this, we develop a robust Gibbs approach that constructs a posterior distribution for the image boundary directly, without modeling the pixel intensities. We prove that, for a suitable prior on the image boundary, the Gibbs posterior concentrates asymptotically at the minimax optimal rate, adaptive to the boundary smoothness. Monte Carlo computation of the Gibbs posterior is straightforward, and simulation experiments show that the corresponding inference is more accurate than that based on existing Bayesian methodology.

Keywords and phrases: Adaptation; boundary detection; likelihood-free inference; model misspecification; posterior concentration rate.

1 Introduction

In image analysis, the boundary or edge of the image is one of its most important features, and extraction of this boundary is a critical step. An image consists of pixel locations and intensity values at each pixel, and the boundary can be thought of as a curve separating pixels of higher intensity from those of lower intensity. Applications of boundary detection are wide-ranging, e.g., Martin et al. (2004) use boundary detection to identify important features in pictures of natural settings, Li et al. (2010) identify boundaries in medical images, and in Yuan et al. (2016) boundary detection helps classify the type and severity of wear on machines. For images with noiseless intensity, boundary detection has received considerable attention in the applied mathematics and computer science literature; see, e.g., Ziou and Tabbone (1998), Maini and Aggarwal (2009), Li et al. (2010), and Anam et al. (2013). However, these approaches suffer from a number of difficulties. First, they can produce an estimate of the image boundary, but they do not quantify estimation uncertainty. Second, these methods use a two-stage approach in which the image is first smoothed to filter out noise and a boundary is then estimated based on a calculated intensity gradient. This two-stage approach makes theoretical analysis of the method challenging, and no convergence results are presently known to the authors. Third, in our examples, these methods perform poorly on noisy data, and we suspect one reason for this is that the intensity gradient is less informative about the boundary when the data are noisy. In the statistics literature, Gu et al. (2015) take a Bayesian approach to boundary detection and emphasize borrowing information to recover boundaries of multiple, similar objects in an image. Boundary detection using wombling is also a popular approach; see Liang et al. (2009), with applications to geography (Lu and Carlin 2004), public health (Ma and Carlin 2007), and ecology (Fitzpatrick et al. 2010). However, these statistical techniques are designed for areal, or spatially aggregated, data, and are not suitable for the pixel data encountered in image analysis.

In Section 2, we describe the image boundary problem mathematically, following the setup considered recently in Li and Ghosal (2015). They take a fully Bayesian approach, modeling the probability distributions of the pixel intensities both inside and outside the image. This approach is challenging because it often introduces nuisance parameters in addition to the image boundary. Therefore, in Section 3, we propose to use a so-called Gibbs model, where a suitable loss function is used to connect the data to the image boundary directly, rather than a probability model (e.g., Zhang 2006; Jiang and Tanner 2008; Syring and Martin 2016b; Bissiri et al. 2016).

We investigate the asymptotic convergence properties of the Gibbs posterior in Section 4, where we show that, if the boundary is -Hölder smooth, then the Gibbs posterior concentrates around the true boundary at the minimax optimal rate (Mammen and Tsybakov 1995), up to a logarithmic factor, relative to neighborhoods of the true boundary measured by the Lebesgue measure of a symmetric difference. Moreover, as a consequence of appropriately mixing over the number of knots in the prior, this rate is adaptive to the unknown smoothness. To our knowledge, this is the first Gibbs posterior convergence rate result for an infinite-dimensional parameter, so the proof techniques used herein may be of general interest. Further, since the Gibbs posterior concentrates at the optimal rate without requiring a model for the pixel intensities, we claim that the inference on the image boundary is robust.

Computation of the Gibbs posterior is relatively straightforward and, in Section 5, we present a reversible jump Markov chain Monte Carlo method; code to implement the proposed Gibbs posterior inference is available at https://github.com/nasyring/GibbsImage. A comparison of inference based on the proposed Gibbs posterior to that based on the fully Bayes approach in Li and Ghosal (2015) is shown in Section 6. For smooth boundaries, the two methods perform similarly, providing very accurate estimation. However, the Gibbs posterior is easier to compute, thanks to there being no nuisance parameters, and is notably more accurate than the Bayes approach when the Bayesian model is misspecified or when the image boundary is not everywhere smooth. The technical details of the proofs are deferred to Section 7.

2 Problem formulation

Let be a bounded region that represents the frame of the image; typically, will be a square, say, , but, generally, we assume only that is scaled to have unit Lebesgue measure. Our data consists of pairs , , where is a pixel location in and is an intensity measurement at that pixel. The range of is context-dependent, and we consider both binary and real-valued cases below. The model asserts that there is a region in , whose boundary is the object of interest, such that the intensity distribution is different depending on whether the pixel is inside or outside . We consider the following model for the joint distribution of pixel location and intensity, :

(1)

where is a density on , and are densities on the intensity space, and denotes an indicator function. That is, given the pixel location , the distribution of the pixel intensity depends only on whether is in or . Of course, more complex models are possible, e.g., where the pixel intensity distribution depends on the specific pixel location, but we will not consider such models here. We assume that there is a true, star-shaped region, denoted by , with a known reference point in its interior. That is, any point in can be connected to the reference point by a line segment fully contained in . The observations are iid samples from , and the goal is to make inference on or, equivalently, its boundary .

The density for the pixel locations is of no interest and, as is common, we take it to be known. So the question is how to handle the two conditional distributions, and . Li and Ghosal (2015) take a fully Bayesian approach, modeling both and . By specifying these parametric models, they are obligated to introduce priors and carry out posterior computations for the corresponding parameters. Besides the efforts needed to specify models and priors and to carry out posterior computations, there is also a concern that the models for the pixel intensities might be incorrect, potentially biasing the inference on . Since the forms of and , as well as any associated parameters, are of no inferential interest in the boundary detection problem, it is natural to ask if inference can be carried out robustly, without modeling the pixel intensities.

We answer this question affirmatively, developing a Gibbs model for in Section 3. In the present context, suppose we have a loss function that measures how well an observed pixel location–intensity pair agrees with a particular region . The defining characteristic of is that should be the unique minimizer of , where is the risk, i.e., the expected loss. A main contribution here, in Section 3.1, is specification of a loss function that meets this criterion. A necessary condition to construct such a loss function is that the distribution functions and are stochastically ordered. Imagine a gray-scale image; then, stochastic ordering means this image is lighter, on average, inside the boundary than outside the boundary, or vice-versa. In the specific context of binary images, this means that we assume, without loss of generality, , while for continuous images, again without loss of generality, for all . If we define the empirical risk,

(2)

then, given a prior distribution for , the Gibbs posterior distribution for , denoted by , has the formula

(3)

Proper scaling of the loss in the Gibbs model is important (e.g., Bissiri et al. 2016; Syring and Martin 2016a), and here we will provide some context-specific scaling; see Sections 4 and 5.2. Our choice of prior for is discussed in Section 3.2. Together, the loss function and the prior define the Gibbs model; no further modeling is required.
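For concreteness, the following R sketch (not the implementation in the repository linked above) evaluates the log of the unnormalized Gibbs posterior, assuming the standard construction in which the empirical risk in (2) enters through the factor exp{-n R_n}; the arguments `empirical_risk` and `log_prior` are generic placeholders for the problem-specific pieces developed in Section 3.

```r
# Minimal sketch of the Gibbs posterior construction: unnormalized log density
# log pi(Gamma) - n * R_n(Gamma), assuming the standard exp{-n R_n} weighting.
# 'gamma' encodes a candidate region, 'x' and 'y' are the pixel data, and
# 'empirical_risk' and 'log_prior' are user-supplied placeholder functions.
log_gibbs_kernel <- function(gamma, x, y, empirical_risk, log_prior) {
  n <- length(y)
  log_prior(gamma) - n * empirical_risk(gamma, x, y)
}
```

In a Monte Carlo sampler only differences of this quantity are needed, so the normalizing constant of the Gibbs posterior never has to be computed.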

3 Gibbs model for the image boundary

3.1 Loss function

To start, we consider inference on the image boundary when the pixel intensity is binary, i.e., . In this case, the conditional distributions, and , in (1) must be Bernoulli, so the likelihood is known. Even though the parametric form of the conditional distributions is known, the Gibbs approach only requires prior specification and posterior computation related to , whereas the Bayes approach must also deal with the nuisance parameters in these Bernoulli conditional distributions. In any case, the binary case is relatively simple and will provide insights into how to formulate a Gibbs model for the more challenging continuous intensity problem.

For the binary case, a reasonable choice for the loss function is the following weighted misclassification error loss, depending on a parameter :

(4)

Note that putting in (4) gives the usual misclassification error loss. In order for the Gibbs model to work, the risk, or expected loss, must be minimized at the true for some . Picking to ensure this property holds requires making a connection between the probability model in (1) and the loss in (4). The condition in (5) below is just the connection needed. With a slight abuse of notation, let and denote the conditional probabilities of the event , given and , respectively. Recall that our stochastic ordering assumption implies .

Proposition 1.

Using the notation in the previous paragraph, if is such that

(5)

then the risk function is minimized at .

Proof.

See Section 7.1. ∎

Either one—but not both—of the above inequalities can be made inclusive and the result still holds. The condition in (5) deserves additional explanation. For example, if we know , then we take , which means that in (4) we penalize both intensities of outside and intensities of inside by a loss of . If, however, we know the overall image brightness is higher so that then we take in (4) and penalize bright pixels outside by less than dull pixels inside . To see why this loss balancing is so crucial, suppose the second case above holds so that and , but we take anyway. Then, in (4), is very often equal to while is very often . We will likely minimize the expected loss then by incorrectly taking to be all of so that the first term in the loss vanishes. Knowing a working corresponds to having some prior information about and , but we can also use the data to estimate a good value of and we describe this data-driven strategy in Section 5.2.
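To make the preceding discussion concrete, the R sketch below evaluates a weighted misclassification empirical risk of the kind described above, penalizing bright pixels outside the candidate region and dull pixels inside it; the precise weighting in (4), and hence the role played by the parameter called `omega` here, is an assumption of this sketch rather than a reproduction of the display.

```r
# Illustrative weighted misclassification risk for a binary image; the exact
# weighting in (4) may differ, so treat 'omega' here as a stand-in.
# y: vector of 0/1 intensities; in_gamma: logical vector indicating whether
# each pixel lies inside the candidate region Gamma.
risk_binary <- function(y, in_gamma, omega = 0.5) {
  loss <- omega * (in_gamma & y == 0) +        # dull pixel inside Gamma
    (1 - omega) * (!in_gamma & y == 1)         # bright pixel outside Gamma
  mean(loss)
}
```

With `omega = 0.5`, this sketch reduces, up to a constant factor, to the usual misclassification error, consistent with the remark following (4).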

In the continuous case, we assume that the pixel intensity takes values in . The proposed strategy is to modify the misclassification error loss (4) by working with a suitably discretized pixel intensity measurement. In particular, consider the following version of the misclassification error, depending on parameters :

(6)

Again, we claim that, for suitable , the risk function is minimized at . Let and denote the distribution functions corresponding to the densities and in (1), respectively. Recall our stochastic ordering assumption implies .

Proposition 2.

If in (6) satisfies

(7)

then the risk function is minimized at .

Proof.

See Section 7.2. ∎

Again, either one—but not both—of the inequalities in (7) can be made inclusive and the result still holds. This implies that the true image can be identified by working with a suitable version of the loss (6). The parameters and in (6) determine the scale of the loss, as mentioned earlier. A condition similar to (7), namely Assumption 1 in Section 4, specifies the scaling needed for the Gibbs posterior to concentrate at the optimal rate. Although the conditions on the scaling all involve the unknown distribution , a good choice of can be made based on the data alone, without prior information, and we discuss this strategy in Section 5.2.
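A possible implementation of the discretization idea is sketched below in R: the continuous intensity is compared to a cut point, and scaled penalties are applied to low intensities inside the candidate region and high intensities outside it. The names `t0`, `a`, and `b` are this sketch's labels for the parameters of (6), whose exact form is not reproduced here, so the details should be read as assumptions.

```r
# Illustrative discretized misclassification-type risk for continuous
# intensities; 't0' (threshold) and 'a', 'b' (scales) are assumed stand-ins
# for the parameters appearing in (6).
risk_continuous <- function(y, in_gamma, t0, a, b) {
  loss <- a * (in_gamma & y <= t0) +      # low intensity inside Gamma
    b * (!in_gamma & y > t0)              # high intensity outside Gamma
  mean(loss)
}
```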

3.2 Prior specification

We specify a prior distribution for the boundary of the region by first expressing the pixel locations in terms of polar coordinates , an angle and a radius, where and . The specific reference point and angle used to define the polar coordinates are essentially arbitrary, subject to the requirement that any point in can be connected to the reference point by a line segment contained in . Li and Ghosal (2015) examined the choice of reference point in simulations and found it to have little influence on the results. Using polar coordinates, the boundary of can be determined by the parametric curve . We proceed to model this curve .

Whether one is taking a Bayes or Gibbs approach, a natural strategy to model the image boundary is to express as a linear combination of suitable basis functions, i.e., . Li and Ghosal (2015) use the eigenfunctions of the squared exponential periodic kernel as their basis functions. Here we consider a model based on free-knot b-splines, where the basis functions are defined recursively as

where , and are outer knots, are inner knots, and is the vector of coefficients. Note that we restrict the coefficient vector to be positive because the function values measure the radius of a curve from the origin. In the simulations in Section 6, the coefficients are free parameters, while is calculated deterministically to force the boundary to be closed, i.e. , and we require and ; all other inner knots are free. Our model based on the b-spline representation seem to perform as well as the eigenfunctions used in Li and Ghosal (2015) for smooth boundaries, but a bit better for boundaries with corners; see the examples in Section 6.

Therefore, the boundary curve is parametrized by an integer and a -vector . We introduce a prior on hierarchically as follows: has a Poisson distribution with rate and, given , the coordinates of are iid exponential with rate . These choices satisfy the technical conditions on detailed in Section 4. In our numerical experiments in Section 6, we take and .
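A draw from this hierarchical prior is immediate; in the R sketch below, the Poisson and exponential rates are placeholder values, not the values used in our experiments.

```r
# One draw from the hierarchical prior on (k, theta); the rates 'lambda' and
# 'beta' are placeholders. In practice, k is restricted to be large enough
# for the spline basis to be well defined.
draw_prior <- function(lambda = 3, beta = 1) {
  k <- rpois(1, lambda)           # number of b-spline coefficients
  theta <- rexp(k, rate = beta)   # iid exponential, hence positive, coefficients
  list(k = k, theta = theta)
}
```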

4 Gibbs posterior convergence

The Gibbs model depends on two inputs, namely, the prior and the loss function. In order to ensure that the Gibbs posterior enjoys desirable asymptotic properties, some conditions on both of these inputs are required. The first assumption listed below concerns the loss; the second concerns the true image boundary ; and the third concerns the prior. Here we will focus on the continuous intensity case, since the only difference between this and the binary case is that the latter provides the discretization for us.

Assumption 1.

Loss function parameters in (6) satisfy

(8)

Compared to the condition (7) that was enough to allow the loss function to identify the true , condition (8) in Assumption 1 is only slightly stronger. This can be seen from the following inequality:

However, if are small, then the three quantities in the above display are all approximately equal, so Assumption 1 is not much stronger than what is needed to identify . Again, these conditions on can be understood as providing a meaningful scale to the loss function. Intuitively, the difference in loss between observations receiving no loss and those receiving some loss, expressed by the parameters and , should reflect the level of information in the data. When and are far apart, the data can more easily distinguish between and , so we are free to assign larger losses than when and are close and the data are relatively less informative.

The ability of a statistical method to make inference on the image boundary will depend on how smooth the true boundary is. Li and Ghosal (2015) interpret as a function from the unit circle to the positive real line, and they formulate a Hölder smoothness condition for this function. Following the prior specification described in Section 3.2, we treat the boundary as a function from the interval to the positive reals, and formulate the smoothness condition on this arguably simpler version of the function. Since the reparametrization of the unit circle in terms of polar coordinates is smooth, it is easy to check that the Hölder smoothness condition (9) below is equivalent to that in Li and Ghosal (2015).

Assumption 2.

The true boundary function is -Hölder smooth in the sense that there exists a constant such that

(9)

where denotes the derivative of and denotes the largest integer less than or equal to . Following the description of in the introduction, we also assume that the reference point is strictly interior to , meaning that it is contained in an open set that is itself wholly contained in , so that is uniformly bounded away from zero. Moreover, the density for , as in (1), is uniformly bounded above by and below by on .

General results are available on the error in approximating an -Hölder smooth function by b-splines of the form specified in Section 3.2. Indeed, Theorem 6.10 in Schumaker (2007) implies that if satisfies (9), then

for all , there exists such that . (10)

Since , we can consider all coefficients to be positive; see Lemma 1(b) in Shen and Ghosal (2015). The next assumption about the prior makes use of the approximation property in (10).

Assumption 3.

Let , for , be as in (10). Then there exist positive constants and such that the prior for satisfies, for all ,

The first two conditions in Assumption 3 ensure that the prior on is sufficiently spread out while the second two conditions ensure that there is sufficient prior support near ’s that approximate well. Assumption 3 is also needed in Li and Ghosal (2015) for convergence of the Bayesian posterior at the optimal rate. However, the Bayes model also requires assumptions about the priors on the nuisance parameters, e.g., Assumption C in Li and Ghosal (2015), which are not necessary in our approach here.

Theorem 1.

With a slight abuse of notation, let denote the prior for , induced by that on , and the corresponding Gibbs posterior (3). Under Assumptions 1–3, for any positive sequence

where and is the smoothness coefficient in Assumption 2.

Proof.

See Section 7.3. ∎

Theorem 1 says that, as the sample size increases, the Gibbs posterior places its mass on a shrinking neighborhood of the true boundary . The rate, given by the size of the neighborhood, is optimal according to Mammen and Tsybakov (1995), up to a logarithmic factor, and adaptive since the prior does not depend on the unknown smoothness . In addition, the diverging sequence in Theorem 1 can be replaced by a constant that depends only on a particular feature of ; see the proof for the details of this claim.

5 Computation

5.1 Sampling algorithm

We use reversible jump Markov chain Monte Carlo, as in Green (1995), to sample from the Gibbs posterior. These methods have been used successfully in Bayesian free-knot spline regression problems; see, e.g., Denison et al. (1998) and DiMatteo et al. (2001). Although the sampling procedure is more complicated when allowing the number and locations of knots to be random versus using fixed knots, the resulting spline functions can do a better job fitting curves with low smoothness.

To summarize the algorithm, we start with the prior distribution for as discussed in Section 3.2. Next, we need to initialize the values of , the knot locations , and the values of . The value of is then calculated numerically to force closure. We choose with , , , , , and evenly spaced in . We set inner knots and , while the other inner knot locations remain free to change in birth, death, and relocation moves; we also set . Then the following three steps constitute a single iteration of the reversible jump Markov chain Monte Carlo algorithm, to be repeated until the desired number of samples has been obtained:

Step 1.

Use Metropolis-within-Gibbs steps to update the elements of the vector, again solving for to force closure at the end. In our examples, we use normal proposals centered at the current value of each element of the vector, with standard deviation . (A sketch of this update is given after Step 3.)

Step 2.

Randomly choose to attempt either a birth, death, or relocation move to add a new inner knot, delete an existing inner knot, or move an inner knot.

Step 3.

Attempt the jump move proposed in Step 2. The vector must be appropriately modified when adding or deleting a knot, and again we must solve for . Details on the calculation of acceptance probabilities for each move can be found in Denison et al. (1998) and DiMatteo et al. (2001).
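To illustrate Step 1, the following R sketch performs a single Metropolis-within-Gibbs update of one coefficient using a normal random-walk proposal. It assumes the Gibbs posterior kernel exp{-n R_n} together with the exponential prior from Section 3.2, and it ignores the closure constraint and the knot moves, so it is a simplified sketch rather than the sampler implemented in the repository below.

```r
# Simplified Metropolis-within-Gibbs update of the j-th spline coefficient,
# assuming the Gibbs kernel exp{-n * R_n(theta)} and an exponential(beta)
# prior on each coefficient; closure and knot moves are handled elsewhere.
# 'risk' maps a coefficient vector to the empirical risk R_n.
update_coef <- function(theta, j, risk, n, beta = 1, prop_sd = 0.05) {
  prop <- theta
  prop[j] <- rnorm(1, mean = theta[j], sd = prop_sd)
  if (prop[j] <= 0) return(theta)                  # zero prior mass: reject
  log_accept <- -n * (risk(prop) - risk(theta)) -  # Gibbs "likelihood" ratio
    beta * (prop[j] - theta[j])                    # exponential prior ratio
  if (log(runif(1)) < log_accept) prop else theta
}
```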

R code to implement this Gibbs posterior sampling scheme, along with the empirical loss scaling method described in Section 5.2, is available at https://github.com/nasyring/GibbsImage.

5.2 Loss scaling

It is not clear how to select to satisfy Assumption 1 without knowledge of and . However, it is fairly straightforward to select values of , based on the data, that are likely to meet the required condition. First, we need a notion of optimal values. If we knew and , then we would select to maximize because this choice of gives us the point at which and are most easily distinguished. Then, we would choose to be the largest values such that (8) holds. Intuitively, we want large values of so that the loss function in (6) is more sensitive to departures from .

Since we do not know and , we estimate and from the data. In order to do this, we need an estimate of to define the regions and . For a grid of values , we minimize (6) for each where are taken to satisfy . The resulting classifiers obtained from these minimizations are used to estimate the regions and and we estimate and by their corresponding empirical versions and , respectively. Then, we choose from among the such that . Based on this choice of , choose the final values of to satisfy (8) replacing and by their estimates and .
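The R sketch below illustrates the core of this strategy under two simplifying assumptions: a single preliminary boundary estimate is available (rather than refitting for each grid value as described above), and the parameters of (6) consist of an intensity threshold and two scale factors. The final rescaling step requires the exact form of (8) and is indicated only by a comment.

```r
# Illustrative data-driven choice of the intensity threshold in (6).
# y: continuous intensities; in_gamma_hat: logical vector from a preliminary
# boundary estimate. The scale parameters would subsequently be taken as
# large as the estimated version of (8) allows; that step is omitted here.
choose_threshold <- function(y, in_gamma_hat,
                             grid = quantile(y, seq(0.05, 0.95, by = 0.05))) {
  F_in  <- ecdf(y[in_gamma_hat])     # empirical CDF of intensities inside
  F_out <- ecdf(y[!in_gamma_hat])    # empirical CDF of intensities outside
  gap   <- abs(F_out(grid) - F_in(grid))
  list(t_hat = grid[which.max(gap)], gap = max(gap))
}
```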

Based on the simulations in Section 6, this method produces values of very close to their optimal values. Importantly, the estimated are much more likely to be smaller than their optimal values than larger, which makes our estimates more likely to satisfy (8). The reason for this comes from the stochastic ordering of and . Unless the classifier we obtain by minimizing (6) is perfectly accurate, we will tend to mix together samples from and in our estimates. If we estimate with some observations from and some from , we will tend to overestimate , and vice versa we will tend to underestimate . These errors will cause to be underestimated, and therefore more likely to satisfy (8).

6 Numerical examples

We tested our Gibbs model on data from both binary and continuous images, following much the same setup as in Li and Ghosal (2015). The pixel locations in are sampled by starting with a fixed grid in and making a small random uniform perturbation at each grid point. Several different pixel intensity distributions are considered. We consider two types of shapes for : an ellipse with center , rotated at an angle of 60 degrees, with major axis length and minor axis length ; and a centered equilateral triangle of height 0.5. The ellipse boundary tests the sensitivity of the model to boundaries that are off-center, while the triangle tests the model's ability to identify non-smooth boundaries.

We consider four binary intensity images and four continuous intensity images, and compare with the Bayesian method in Li and Ghosal (2015). Code for implementing their fully Bayesian approach is available via CRAN in the BayesBD package (Syring and Li 2016).

Binary Examples.

  • B1: Ellipse image, , and and are Bernoulli with parameters 0.5 and 0.2, respectively.

  • B2: Same as B1 but with triangle image.

  • B3: Ellipse image, , and and are Bernoulli with parameters 0.25 and 0.2, respectively.

  • B4: Same as B3 but with triangle image.

Continuous Examples.

  • C1: Ellipse image, , and and are and , respectively.

  • C2: Same as C1 but with triangle image.

  • C3: Ellipse image, , and and are , a normal mixture, and , respectively.

  • C4: Ellipse image, , and and are distributions with 3 degrees of freedom and non-centrality parameters 1 and 0, respectively.

For binary images, the likelihood must be Bernoulli, so the Bayesian model is correctly specified in cases B1–B4. For the continuous examples in C1–C4, we assume a Gaussian likelihood for the Bayesian model. Then, cases C1 and C2 will show whether or not the Gibbs model can compete with the Bayesian model when the model is correctly specified, while cases C3 and C4 will demonstrate the superiority of the Gibbs model over the Bayesian model when there is model misspecification. Again, the Gibbs model has the added advantage of not having to specify priors for or sample values of the mean and variance hyperparameters associated with the normal conditional distributions.

We replicated each example scenario 100 times for both the Gibbs and Bayesian models, each time producing a posterior sample of size 4000 after a burn-in of 1000 samples. We recorded the errors—Lebesgue measure of the symmetric difference—for each run, along with the estimated loss function parameters for the Gibbs models for continuous images. The results are summarized in Tables 1–2. We see that the Gibbs model is competitive with the fully Bayesian model in Examples B1–B4, C1, and C2, when the likelihood is correctly specified. When the likelihood is misspecified, there is a chance that the Bayesian model will fail, as in Examples C3 and C4. However, the Gibbs model does not depend on a likelihood, only on the stochastic ordering of and , and it continues to perform well in these non-Gaussian examples. From Table 2, we see that the empirical method described in Section 5.2 is able to select parameters for the loss function in (6) that are close to the optimal values and meet Assumption 1.

Model   B1            B2            B3            B4            C1            C2            C3            C4
Bayes   0.00 (0.00)   0.02 (0.00)   0.01 (0.00)   0.02 (0.01)   0.03 (0.03)   0.04 (0.03)   0.11 (0.06)   0.10 (0.05)
Gibbs   0.01 (0.00)   0.01 (0.00)   0.02 (0.01)   0.02 (0.01)   0.01 (0.00)   0.01 (0.00)   0.01 (0.01)   0.01 (0.01)

Table 1: Average errors (and standard deviations) for each example.
Parameter   C1            C2            C3              C4
            1.45 (1.86)   1.47 (1.86)   0.80 (1.27)     0.71 (0.80)
            2.30 (2.36)   2.29 (2.36)   0.26 (0.34)     0.71 (0.75)
            2.43 (2.40)   2.39 (2.40)   -1.83 (-1.76)   0.46 (0.46)

Table 2: Average (and optimal) values of the parameters in (6).

Figure 1 shows the results of the Bayesian and Gibbs models for one simulation run in each of Examples B1–B2 and C1–C4. The credible regions, as in Li and Ghosal (2015), are highlighted in gray around the posterior means. That is, let , where is the posterior boundary sample, is the pointwise posterior mean and the pointwise standard deviation of the samples. If is the percentile of the ’s, then a uniform credible band is given by . The results of cases B2 and C2 suggest that free-knot b-splines may do a better job of approximating non-smooth boundaries than the kernel basis functions used by Li and Ghosal (2015).
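The credible band construction just described can be computed directly from the Monte Carlo output, as in the following R sketch; here `gamma_samps` denotes a matrix of posterior draws of the boundary function evaluated on a common angle grid, and the 95% level matches the bands shown in Figure 1.

```r
# Uniform credible band from posterior samples of the boundary function.
# gamma_samps: M x G matrix of draws evaluated on a common grid of G angles.
uniform_band <- function(gamma_samps, level = 0.95) {
  m <- colMeans(gamma_samps)                                  # pointwise posterior mean
  s <- apply(gamma_samps, 2, sd)                              # pointwise posterior sd
  u <- apply(gamma_samps, 1, function(g) max(abs(g - m) / s)) # sup of studentized deviations
  L <- quantile(u, probs = level)                             # e.g., 95th percentile
  list(mean = m, lower = m - L * s, upper = m + L * s)
}
```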

Figure 1: From top, Examples B1–B2, C1–C4. In each row, the observed image is on the left, the Bayesian posterior mean estimator (Li and Ghosal 2015) is in the middle, and the Gibbs posterior mean estimator is on the right. Solid lines show the true image boundary, dashed lines are the estimates, and gray regions are 95% credible bands.

7 Proofs

7.1 Proof of Proposition 1

By the definition of the loss function in (4), for a fixed and for any , we have

Then the expectation of the loss difference above is

where is the probability relative to the distribution of . This quantity is zero if and only if . It can also be lower bounded by

Given the condition (5) in Proposition 1, both terms in the minimum are positive. Therefore, with equality if and only if , proving the claim.

7.2 Proof of Proposition 2

The proof here is very similar to that of Proposition 1. By the definition of the loss function in (6), for any fixed and for any , we get

Then, the expectation of the loss difference above is given by

where, again, is the probability relative to the distribution of . This quantity is zero if and only if . It can also be lower bounded by

Given the condition (7) in Proposition 2, both terms in the minimum are positive. Therefore, with equality if and only if , proving the claim.

7.3 Preliminary results

Towards proving Theorem 1, we need several lemmas. The first draws a connection between the distance on regions defined by the Lebesgue measure of the symmetric difference and the sup-norm distance between the corresponding boundary functions.

Lemma 1.

Suppose , with boundary , satisfies Assumption 2, in particular, . Take any , with , such that , and any such that satisfies , where . Then

where is the diameter of . So, if , then the lower bound is a positive multiple of .

Proof.

The first step is to connect the symmetric difference-based distance to the distance between boundary functions. A simple conversion to polar coordinates gives

If we let , then it is easy to verify that

Therefore,

(11)

Next, if , which is positive by Assumption 2, then it follows from the right-most inequality in (11) that and, by the triangle inequality,

We have which, by assumption, is less than . Consequently,

and, hence,

By the left-most inequality in (11), we get

which is the desired bound. It follows immediately that the lower bound is a positive multiple of if . ∎

The next lemma shows that we can control the expectation of the integrand in the Gibbs posterior under the condition (8) on the tuning parameters in the loss function definition.

Lemma 2.

If (8) holds, then for a constant .

Proof.

From the proof of Proposition 2, we have

The key observation is that, if , then the loss difference is 0 and, therefore, the exponential of the loss difference is 1. Taking the expectation with respect to , we get

From (8), we have

so that

Then the claim follows, with , since . ∎

The next lemma yields a necessary lower bound on the denominator of the Gibbs posterior distribution. The neighborhoods are simpler than those in Lemma 1 of Shen and Wasserman (2001), because the variance of our loss difference is automatically of the same order as its expectation, but the proof is otherwise very similar to theirs, so we omit the details.

Lemma 3.

Let be a sequence of positive numbers such that and set for some . Then with probability converging to as .

Our last lemma is a summary of various results derived in Li and Ghosal (2015) towards proving their Theorem 3.3. This helps us to fill in the details for the lower bound in Lemma 3 and to identify a sieve—a sequence of subsets of the parameter space—with high prior probability but relatively low complexity.

Lemma 4.

Let be as in Theorem 1 and let . Then, for some choice of from (10), the following hold:

  1. Define the neighborhood . Then for some depending on .

  2. Define the sieve . Then for some .

  3. The bracketing number of satisfies .

7.4 Proof of Theorem 1

Define the set

(12)

For the sieve in Lemma 4, part 2, we have . We want to show that both terms in the upper bound vanish, in , as .

It is helpful to start with a lower bound on , the denominator in both of the terms discussed above. First, write

where is defined in Lemma 3 with and to be determined. From Proposition 2 and Lemma 1

where is the term in the above display. Let denote the set of regions with boundary functions that satisfy . If , then we have

and, therefore, , where . From Lemma 3, we have , with probability converging to . We have shown that , so it follows from Lemma 4, part 1,

with