Frequentist coverage and sup-norm convergence rate in Gaussian process regression

Abstract

Gaussian process (GP) regression is a powerful interpolation technique due to its flexibility in capturing non-linearity. In this paper, we provide a general framework for understanding the frequentist coverage of point-wise and simultaneous Bayesian credible sets in GP regression. As an intermediate result, we develop a Bernstein-von Mises type result under the supremum norm in random design GP regression. Identifying both the mean and covariance function of the posterior distribution of the Gaussian process as regularized M-estimators, we show that the sampling distribution of the posterior mean function and the centered posterior distribution can be respectively approximated by two population-level GPs. By developing a comparison inequality between two GPs, we provide an exact characterization of the frequentist coverage probabilities of Bayesian point-wise credible intervals and simultaneous credible bands of the regression function. Our results show that inference based on GP regression tends to be conservative: when the prior is under-smoothed, the resulting credible intervals and bands have minimax-optimal sizes, with their frequentist coverage converging to a non-degenerate value between their nominal level and one. As a byproduct of our theory, we show that GP regression also yields a minimax-optimal posterior contraction rate relative to the supremum norm, which provides positive evidence toward the long-standing problem of the optimal supremum-norm contraction rate in GP regression.

Key words: Gaussian process regression; Bernstein-von Mises theorem; nonparametric regression; Gaussian comparison theorem; kernel estimator.

1 Introduction and preliminaries

Gaussian process (GP) regression is a popular Bayesian procedure for learning an infinite-dimensional function in the nonparametric regression model

where is the covariate variable and the response variable. Through specifying a GP as the prior measure over the infinite-dimensional parameter space , Bayes' rule yields a posterior measure for that can be used either to construct a point estimator as the posterior mean, or to characterize estimation uncertainties through the corresponding point-wise credible intervals and simultaneous credible bands. Examples of the wide usage of GPs include computer experiment emulation [35, 18, 25], spatial data modeling [17, 1], geostatistical kriging [40, 28], and machine learning [32].

Despite its long-standing popularity, formal investigations into the theoretical properties of GP regression from a frequentist perspective, assuming the existence of a true data-generating function, are a much more recent activity. A majority of this recent line of work has been directed towards understanding the large sample performance of point estimation, in the form of proving posterior consistency and deriving posterior contraction rates. As an incomplete survey, [46, 44] provide a general framework for studying the rate of posterior contraction in GP regression, and derive contraction rates for several commonly used covariance kernels. In [45] and [4], the authors show that by putting hyper-priors over inverse bandwidth parameters in the squared-exponential covariance kernel, GP regression can adapt to the unknown (anisotropic) smoothness levels of the target function. [49] show that a class of GP priors with Euclidean-metric based covariance kernels can additionally adapt to unknown low-dimensional manifold structure, even when the ambient space in which the covariate lives is of moderate-to-high dimension.

There is a comparatively limited literature on the validity of conducting statistical inference in GP regression, or more generally, in Bayesian nonparametric procedures, from a frequentist perspective. Uncertainty quantification for GPs plays an important role, sometimes even more important than point estimation itself, in applications such as design of experiments [6] and risk management [36], and hence it is of interest to investigate its frequentist validity. However, unlike finite-dimensional parametric models, where a Bernstein-von Mises (BvM) theorem under the total variation metric holds fairly generally and guarantees asymptotically nominal coverage of Bayesian credible intervals, the story of the frequentist validity of credible intervals/bands in infinite-dimensional models is more complicated and delicate [16, 20, 24, 27]. For the Gaussian white noise model, [10] showed that a BvM theorem holds in weaker topologies for some natural priors, which yields the correct asymptotic coverage of credible sets based on the weaker topology; however, their result is not applicable for understanding the coverage of credible intervals/bands. A similar result for the stronger -norm can be found in [11]. In the context of linear inverse problems with Gaussian priors, [26] showed that asymptotic coverage of credible sets can be guaranteed by under-smoothing the prior relative to the truth. [43] investigated the frequentist coverage of Bayesian credible sets in the Gaussian white noise model, and showed that depending on whether the smoothness parameter in the prior distribution matches that of the truth, the coverage of the corresponding credible sets can either fall significantly below the nominal confidence level, or converge to one as the sample size increases, even though the nominal level is fixed at a constant value. [43] also investigated the adaptivity of the credible sets to unknown smoothness using an empirical Bayes approach; see also [37, 33].

A majority of the work discussed in the previous paragraph focuses on the Gaussian white noise model or its equivalent infinite-dimensional normal mean formulation, where appropriate conjugate Gaussian priors lead to an analytically tractable posterior. A major difficulty in calculating the nominal coverage probabilities of Bayesian point-wise credible intervals and simultaneous credible bands in GP regression lies in the fact that the covariance structure in the posterior GP of involves unwieldy matrix inverses that are complicated to analyze directly. Moreover, when the design is random, the randomness in the covariance structure further complicates the analysis. A relevant work in this regard is [53], where the authors derive posterior contraction rates under the supremum norm for GP regression induced by tensor products of B-splines, and identify conditions under which the coverage of the simultaneous credible band tends to one as the sample size tends to infinity. However, their result requires the nominal level of the credible bands to also tend to one, and relies on a key property of their prior distribution, namely that the resulting posterior covariance matrix can be sandwiched between two similarly scaled identity matrices, which prevents the result from being applicable to a broader class of GP priors. Also relevant to the present discussion is [39], who obtained similar results for point-wise credible intervals using a scaled Brownian motion prior. The authors exploit explicit formulas for the eigenvalues and eigenfunctions of the covariance kernel of a Brownian motion when the design points are on a regular grid, which, along with other properties specific to Brownian motion, can be used to linearize the posterior mean and variance, aiding a direct analysis.

The goal of this article is to provide a general framework for understanding the frequentist coverage of Bayesian credible intervals and bands in GP regression by proving a BvM type theorem under the supremum norm in random design nonparametric regression. Towards this goal, we first show that the sampling distribution of the posterior mean function can be well approximated by a GP over the covariate space with an explicit expression for its covariance function. Second, we find a tractable population-level GP approximation to the centered posterior GP whose covariance function is non-random and also admits an explicit expression. The frequentist coverage of the Bayesian credible intervals and bands is derived from an interplay between these two population-level GPs. A salient feature of our technique is that it provides an explicit expression for the coverage along with explicit finite-sample error bounds: it is non-asymptotic and applies to any nominal level. Interestingly, we find that when the prior is under-smoothed, the Bayesian credible intervals and bands are always moderately conservative in the sense that their frequentist coverage is always higher than the nominal level, and converges to a fixed number (with an explicit expression) between the nominal level and one as the sample size grows to infinity. For example, when the covariate is one-dimensional and uniformly distributed over the unit interval, and the prior smoothness level is , the asymptotic frequentist coverage of a credible interval is 0.976 (0.969). This phenomenon is radically different from existing results, where the frequentist coverage tends to either zero or one but never converges to a non-degenerate value. As a byproduct of our theory, we show that GP regression also yields a minimax-optimal posterior contraction rate relative to the supremum norm, which provides positive evidence toward the long-standing problem of supremum-norm contraction rates for general Bayesian nonparametric procedures.

In our proofs, we employ the equivalent kernel representation [32, 38] of the kernel ridge regression estimator to establish a novel local linear expansion of the posterior mean function relative to the supremum norm, which is of independent interest and builds a link between GP regression and frequentist kernel type estimators [21]. This local linear expansion can be utilized to show the limiting GP approximation to the sampling distribution of the posterior mean function. Towards the proof of approximating the posterior GP with a population level GP, we develop a new Gaussian comparison inequality that provides explicit bounds on the Kolmogorov distance between two GPs through the supremum norm difference between their respective covariance functions.

Overall, our results reveal a delicate interplay between the frequentist coverage of Bayesian credible sets and properties of the prior measure in infinite-dimensional Bayesian procedures. We validate GP regression for conducting statistical inference by showing that, as long as the prior measure is not over-smoothed, Bayesian credible sets always provide moderately conservative uncertainty quantification with minimax-optimal sizes.

We begin the technical development by introducing notation used throughout the paper. A summary of all the major notation is provided in Table 2 in §5 for the reader's convenience.

1.1 Notation

Let be normed linear spaces. The Fréchet derivative of an operator at the point is the bounded linear operator denoted which satisfies

(1)

In particular, when , the Fréchet derivative is the Jacobian of , a linear operator which is represented by an matrix .

Throughout are generically used to denote positive constants whose values might change from one line to another, but are independent of everything else. We use $\lesssim$ and $\gtrsim$ to denote inequalities up to a constant multiple; we write $a \asymp b$ when both $a \lesssim b$ and $b \lesssim a$ hold. For , let denote the largest integer strictly smaller than .

1.2 Review of RKHS

We recall some key facts related to reproducing kernel Hilbert spaces (RKHS); further details and proofs can be found in Chapter 1 of [48]. Let denote a general index set. A symmetric function is positive definite (p.d.) if for any , , and ,

A (real) RKHS is a Hilbert space of real-valued functions on such that for any , the evaluation functional is bounded and linear, i.e., there exists a constant such that

In the above display, is the Hilbert space norm. Since the evaluation maps are bounded and linear, it follows from the Riesz representation theorem that for each , there exists an element such that ; this element is called the representer of evaluation at , and the kernel is easily shown to be p.d. Conversely, given a p.d. function on , one can construct a unique RKHS on with as its reproducing kernel. Given a kernel , we shall henceforth let .

Let denote the space of square integrable functions with . We denote the usual inner product on . If a p.d. kernel is continuous and , then by Mercer’s theorem, there exists an orthonormal sequence of continuous eigenfunctions in with eigenvalues , and

The RKHS determined by consists of functions satisfying , where . Further, for ,

(2)

For a stationary kernel with and being the Lebesgue measure, we can always choose and , , by expanding into a Fourier series over and exploiting the identity . Under these choices, we also have for (more details can be found in Appendix A of [50]).
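
For the reader's convenience, the standard forms of the Mercer expansion and the induced RKHS norm invoked above are recorded below; the symbols $\mu_j$, $\phi_j$ and $f_j$ are the notation we adopt here for the eigenvalues, eigenfunctions and basis coefficients.

```latex
% Mercer expansion of the kernel and the induced RKHS (notation \mu_j, \phi_j, f_j assumed)
K(x,z) \;=\; \sum_{j\ge 1} \mu_j\,\phi_j(x)\,\phi_j(z), \qquad
\mathbb{H} \;=\; \Big\{ f=\sum_{j\ge 1} f_j\phi_j \,:\, \|f\|_{\mathbb{H}}^2=\sum_{j\ge 1} \frac{f_j^2}{\mu_j}<\infty \Big\}, \qquad
\langle f,g\rangle_{\mathbb{H}} \;=\; \sum_{j\ge 1} \frac{f_j g_j}{\mu_j}.
```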

2 Framework

To begin with, we introduce Gaussian process regression and draw its connection (for both the posterior mean function and the posterior covariance function) with kernel ridge regression (KRR) in §2.1. In §2.2, we introduce the key mathematical tool in our proofs, the equivalent kernel representation of kernel ridge regression. In §2.3, we present our first result on the local linear expansion of the KRR estimator relative to the supremum norm, indicating the asymptotic equivalence between the KRR estimator and a carefully constructed kernel-type estimator.

2.1 GP regression

For ease of presentation, we focus on the univariate regression problem where ; our results can be straightforwardly extended to the multivariate case. Let , be i.i.d., with and , with joint density , where denotes the density, and is the unknown function of interest. Our goal is to estimate and perform inference on based on the data . We assume is known throughout the paper.

We consider a nonparametric regression model

(3)

and assume a mean-zero GP prior on the regression function , , where is a positive definite kernel and is a tuning parameter to be chosen later.

By conjugacy, it is easy to check that the posterior distribution of is also a GP, , with mean function and covariance function ,

(4)

Since the posterior distribution is a GP, it is completely characterized by and . These quantities can be explicitly calculated; we introduce some notation to express these succinctly. For vectors and , let denote the matrix . Also, let and . With these notations,

(5)
(6)

In particular, the posterior variance function is

The presence of the inverse kernel matrix in equations (5) and (6) renders analysis of the GP posterior unwieldy. A contribution of this article is to recognize both the mean function and the covariance function as regularized M-estimators and use an equivalent kernel trick to linearize the solutions, which avoids having to deal with matrix inverses. It is well known (see, e.g., Chapter 6 of [32]) that the posterior mean coincides with the kernel ridge regression (KRR) estimator

(7)

when the RKHS corresponds to the prior covariance kernel . The objective function in (7) combines the average squared-error loss with a squared RKHS norm penalty weighted by the prior precision parameter . It follows from the representer theorem for RKHSs [48] that the solution to (7) belongs to the linear span of the kernel functions . Given this fact, solving (7) only amounts to solving a quadratic program, and the solution coincides with in (5).
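
To make the equivalence concrete, the following minimal numerical sketch (our own illustration) fits GP regression and KRR on the same simulated data and checks that the two mean functions coincide. The Matérn-3/2 kernel, the prior scale tau, and the calibration lambda = sigma^2/(n*tau) are choices made for this sketch rather than quantities prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, z, ell=0.2):
    # Matern-3/2 kernel, used purely for illustration.
    r = np.abs(x[:, None] - z[None, :]) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

n, sigma, tau = 50, 0.3, 1.0           # sample size, noise sd, assumed prior scale
lam = sigma**2 / (n * tau)             # KRR penalty matching the GP prior (assumed calibration)
X = rng.uniform(0, 1, n)
f0 = lambda x: np.sin(2 * np.pi * x)
y = f0(X) + sigma * rng.normal(size=n)
grid = np.linspace(0, 1, 200)

Knn, Kgn = kernel(X, X), kernel(grid, X)

# GP posterior mean with prior covariance tau*K and noise variance sigma^2:
#   m(x) = k_n(x)^T (K_nn + (sigma^2/tau) I)^{-1} y
gp_mean = Kgn @ np.linalg.solve(Knn + (sigma**2 / tau) * np.eye(n), y)

# KRR: minimize (1/n) * sum (y_i - f(X_i))^2 + lam * ||f||_H^2; by the representer
# theorem the solution is f_hat(x) = k_n(x)^T alpha with alpha as below.
alpha = np.linalg.solve(Knn + n * lam * np.eye(n), y)
krr_mean = Kgn @ alpha

print(np.max(np.abs(gp_mean - krr_mean)))   # agreement up to numerical error
```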

A novel observation aiding our subsequent development is that the posterior covariance function can be related to the bias of a noise-free KRR estimator. Write as

where as defined earlier, and . Comparing with (5) and (7), it becomes apparent that is the solution to the following KRR problem with noiseless observations of at the random design points ,

(8)

where .

To summarize, the posterior mean corresponds to the usual KRR estimator, and an appropriately scaled version of the posterior covariance function can be recognized as the bias of a noiseless KRR estimator. This motivates us to study the performance of KRR estimators in the supremum norm, which, to the best of our knowledge, has not been carried out before. For past work on risk bounds for the KRR estimator in other norms, refer to [29, 54, 41, 19, 52].
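
In the same spirit, one can verify numerically that the posterior covariance at a pair of points agrees, up to the assumed prior scale, with the bias of a noiseless KRR fit to the kernel section at the second point. The sketch below is again our own illustration, using the same assumed calibration between the prior scale and the KRR penalty as above.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(x, z, ell=0.2):
    r = np.abs(x[:, None] - z[None, :]) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

n, sigma, tau = 50, 0.3, 1.0
lam = sigma**2 / (n * tau)             # assumed calibration between prior and penalty
X = rng.uniform(0, 1, n)
grid = np.linspace(0, 1, 100)

Knn, Kgn = kernel(X, X), kernel(grid, X)

# GP posterior covariance on the grid (prior covariance tau*K, noise variance sigma^2).
post_cov = tau * (kernel(grid, grid)
                  - Kgn @ np.linalg.solve(Knn + (sigma**2 / tau) * np.eye(n), Kgn.T))

# Noiseless KRR fit of g(.) = K(., z) from its values at the design points, for each
# grid point z:  g_hat(x) = k_n(x)^T (K_nn + n*lam*I)^{-1} k_n(z).  Its bias is
# K(x, z) - g_hat(x), which matches the posterior covariance after scaling by tau.
g_hat = Kgn @ np.linalg.solve(Knn + n * lam * np.eye(n), Kgn.T)
bias = kernel(grid, grid) - g_hat

print(np.max(np.abs(post_cov - tau * bias)))   # agreement up to numerical error
```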

2.2 Equivalent kernel representation of the KRR estimator

We first introduce an equivalent-kernel formulation that allows us to linearize the bias of a KRR estimate. Let be an RKHS, with reproducing kernel having eigenfunctions and eigenvalues with respect to . Fix . Define a new inner product on as

(9)

and let . Observe that is again an RKHS, since for any , for some constant depending on and , proving the boundedness of the evaluation maps.

Let and be elements of . Then,

where

(10)

Hence, consists of , with

(11)

The reproducing kernel associated with the new RKHS is thus

(12)

As before, we let denote the representer of the evaluation map, i.e., , so that for any , . The kernel is known as the equivalent kernel (see, e.g., Chapter 7 of [32]), motivated by the notion of equivalent kernels in the spline smoothing literature [38]. We comment more on connections with the equivalent kernel literature once we represent the KRR problem in terms of the equivalent kernel.
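
As a rough illustration of what the equivalent kernel looks like, the sketch below builds it from a truncated eigen-expansion with shrinkage weights mu_j/(mu_j + lambda); this shrinkage form is the usual equivalent-kernel construction (cf. Chapter 7 of [32]) and is our reading of the display (12). The Fourier basis on [0, 1] and the polynomial eigen-decay mu_j = j^{-2*alpha} are illustrative choices.

```python
import numpy as np

def fourier_basis(x, J):
    """First 2J + 1 Fourier basis functions on [0, 1] (a standard choice)."""
    cols = [np.ones_like(x)]
    for j in range(1, J + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.stack(cols, axis=-1)              # shape (len(x), 2J + 1)

def equivalent_kernel(x, z, lam, alpha=2.0, J=500):
    """Truncated expansion  sum_j mu_j/(mu_j + lam) * phi_j(x) * phi_j(z),
    with mu_j = j^{-2*alpha}, an illustrative decay satisfying Assumption (E)."""
    j = np.arange(1, 2 * J + 2, dtype=float)
    mu = j ** (-2.0 * alpha)
    shrink = mu / (mu + lam)
    return (fourier_basis(x, J) * shrink) @ fourier_basis(z, J).T

x = np.linspace(0, 1, 5)
print(equivalent_kernel(x, x, lam=1e-3))        # 5 x 5 equivalent-kernel matrix
```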

Before proceeding further, we introduce two operators which are routinely used subsequently. First, define a linear operator given by

We shall often use the abbreviation . The operator is easily recognized as a convolution with the equivalent kernel . If , a straightforward calculation yields . Thus, for functions , it follows from (11) that

The above display also immediately tells us that is a self-adjoint operator, i.e., for all . Define another self-adjoint operator given by , where id is the identity operator on . Then, it follows from the previous display and (9) that

(13)

Having developed the necessary groundwork, let us turn our attention back to the KRR estimate. Recall the objective function defined in (7). Writing , and using (13), we can express

Viewing and performing a Fréchet differentiation with respect to , one obtains a score equation for the KRR estimate as,

where is given by

(14)

Define to be the population version of the score equation, where the expectation is taken with respect to the true joint density . Recall the convolution operator . We then have,

(15)

since by definition. It is immediate that , which therefore implies that the function is the solution to the population level score equation. We shall henceforth refer to as the population-level KRR estimator.

In the above treatment, we first differentiated the objective function and then took an expectation with respect to the true distribution to arrive at the population-level KRR estimator . One arrives at an identical conclusion if the expectation of the objective function is minimized, which is equivalent to minimizing

Solving for , one obtains , and hence . This approach thus also leads to the equivalent kernel; see Chapter 7 of [32] for a detailed exposition along these lines.
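
Written out in basis coordinates, the population-level minimization admits the following closed form; this display is our restatement in the assumed notation, with mu_j and phi_j the eigenpairs of K and f_{0,j} the basis coefficients of the true regression function.

```latex
% Population-level KRR as coordinate-wise shrinkage (notation assumed)
f_\lambda
  \;=\; \operatorname*{arg\,min}_{f\in\mathbb{H}}
        \Big\{ \|f-f_0\|_{L^2(\mathbb{P}_X)}^2 + \lambda\|f\|_{\mathbb{H}}^2 \Big\}
  \;=\; \sum_{j\ge 1} \frac{\mu_j}{\mu_j+\lambda}\, f_{0,j}\,\phi_j ,
\qquad
f_0 - f_\lambda \;=\; \sum_{j\ge 1} \frac{\lambda}{\mu_j+\lambda}\, f_{0,j}\,\phi_j .
```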

2.3 Sup-norm bounds for the KRR estimator

We now use the equivalent kernel representation to derive error bounds in the supremum norm between a KRR estimator and its target function. We first lay down two kinds of parameter space considered for the true function . Recall the orthonormal basis of .

For any and , define

(16)

For integer and the Fourier basis, corresponds to the -smooth Sobolev functions with absolutely continuous derivatives and whose th derivative has uniformly bounded norm.

Next, for any and , define

(17)

The -subscript is used to indicate a correspondence of this class of functions with -Hölder functions. Indeed, under the Fourier basis, if , then has continuous derivatives, and

which implies that the th derivative of is Lipschitz continuous of order .

We next lay down some standard technical conditions on the eigenfunctions and eigenvalues of the kernel .

Assumption (B): There exist global constants such that the eigenfunctions of satisfy for all , and for all and .

Assumption (E): The eigenvalues of satisfy for some .

As a motivating example, the Matérn kernel with smoothness index satisfies (B) and (E) when expanded with respect to the Fourier basis

on ; see [3, 50] for more details. Observe that the RKHS of a Gaussian process with a Matérn covariance kernel of smoothness index (in the multivariate case, , with the dimensionality of ) is . As a passing comment, this space does not contain functions with smoothness less than , which includes functions with smoothness . Although for concreteness we focus on kernels with polynomially decaying eigenvalues in this paper, our theory is also applicable to other kernel classes, such as squared exponential kernels and polynomial kernels.
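
For concreteness, the Fourier basis on [0, 1] referred to above is conventionally taken as follows; the exact normalization used in [3, 50] may differ slightly.

```latex
\phi_1(x)=1,\qquad \phi_{2j}(x)=\sqrt{2}\,\cos(2\pi j x),\qquad \phi_{2j+1}(x)=\sqrt{2}\,\sin(2\pi j x),\qquad j=1,2,\ldots
```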

Observe that under Assumption (E). Bounding the sums by integrals, the following facts are easily observed and used repeatedly in the sequel,

(18)

Recall the KRR estimator from (7). Let , so that are independent for and also independent of . Recall the operators and from §2.2. For , we continue to denote

(19)

Since for all , and are elements of for any .

We now state a theorem which bounds the sup-norm distance between and the true function . To state the theorem in its most general form, we do not make any smoothness assumptions on yet, and state a high-probability bound on assuming only . Reductions of the bound when lies in either of the smoothness classes or are discussed subsequently.

Theorem 2.1 (Sup-norm bounds for KRR estimator).

Assume the kernel satisfies assumptions (B) and (E). Define via the equation . Then, with probability at least with respect to the randomness in , the following error estimate holds,

(20)

Moreover, with the same probability, the following higher-order expansion holds,

(21)

where

(22)

Here, and are constants independent of , and is the equivalent kernel of defined in §2.2.

Remark 2.1 (Sub-Gaussian errors).

An inspection of the proof of Theorem 2.1 reveals that we have not made explicit use of the normality of the errors. Indeed, the conclusions of the theorem, and all subsequent results, extend to sub-Gaussian errors.

The first-order bound in (20) has two components: a bias term, , and a variance term, . Since , measures the closeness (in terms of the supremum norm) between the true function and its convolution with the equivalent kernel , . Recall also that is the solution to the population-level score equation. The smoothing parameter governs the trade-off between bias and variance; larger (stronger regularization) reduces the variance at the cost of increasing the bias, and vice versa. Notice that a typical analysis of the KRR estimator using a basic inequality argument (such as [51]) requires a lower bound on the regularization parameter , while our analysis is free of such an assumption. Under additional assumptions on , more explicit bounds can be obtained for the bias term . For example, if , then one can show that . Similarly, for , one obtains the bound . Choosing optimally in either situation leads to the following explicit bounds.
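
The bias-variance trade-off in the regularization parameter is easy to see numerically. The toy experiment below (our own illustration, with a Matérn-3/2 kernel and a smooth truth, neither of which is prescribed by the theory) fits KRR over a grid of penalty values and reports Monte Carlo sup-norm errors, which are typically U-shaped in the penalty.

```python
import numpy as np

rng = np.random.default_rng(2)

def kernel(x, z, ell=0.2):
    r = np.abs(x[:, None] - z[None, :]) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

n, sigma, reps = 100, 0.5, 50
f0 = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)
grid = np.linspace(0, 1, 200)

for lam in [1e-6, 1e-4, 1e-2, 1e0]:
    sup_err = []
    for _ in range(reps):
        X = rng.uniform(0, 1, n)
        y = f0(X) + sigma * rng.normal(size=n)
        alpha = np.linalg.solve(kernel(X, X) + n * lam * np.eye(n), y)
        sup_err.append(np.max(np.abs(kernel(grid, X) @ alpha - f0(grid))))
    # Small lam: low bias, high variance; large lam: high bias, low variance.
    print(f"lambda = {lam:.0e}   mean sup-norm error = {np.mean(sup_err):.3f}")
```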

Corollary 2.1 (Minimax rates for Sobolev and Hölder classes).

Suppose defined in (16). Set

Then, with probability at least with respect to the randomness in ,

(23)

Next, suppose defined in (17). Set

Then, with probability at least with respect to the randomness in ,

(24)

Corollary 2.1 implies that the rate of convergence (up to logarithmic terms) of the KRR estimator to the truth in the supremum norm is and for -smooth Sobolev and Hölder functions, respectively. For functions in the Sobolev class, the KRR estimator concentrates at the usual minimax rate of under the norm [8, 54]. However, it is known [7, 5] that under the point-wise and/or supremum norm, the minimax rate for -smooth Sobolev functions deteriorates to . Hence, the KRR estimator achieves the minimax rate under the supremum norm as well. For the Hölder class, the minimax rate remains the same under the and norms, and the KRR estimator achieves the minimax rate.

The higher-order expansion in the display (21) provides finer insight into the distributional behavior of . Let denote the zero-mean random process with covariance function , so that . For any fixed , the law of can be approximated by the law of a distribution via the central limit theorem, and (21) can be used to establish that the law of is close to a distribution. Indeed, we shall establish a stronger result: the law of the process can be approximated by a Gaussian process in the Kolmogorov distance. These point-wise and uniform approximation results will be crucial for proving the asymptotic validity of Bayesian point-wise credible intervals and simultaneous credible bands in §3.

Theorem 2.1 specializes to the noiseless KRR problem in a straightforward fashion upon setting in the bounds (20) and (21). As noted in §2.1, the posterior variance function can be expressed as the solution to a noiseless KRR problem, motivating our interest in such situations. The following corollary states a general result for the noiseless case which is used subsequently to analyze the posterior variance function.

Corollary 2.2 (Sup-norm bounds for KRR estimator: noiseless case).

Consider a noiseless version of the KRR problem in (7) where . Assume the kernel satisfies assumptions (B) and (E). Define via the equation . Then, with probability at least with respect to the randomness in , the following error estimate holds,

(25)

Moreover, with the same probability,

(26)

where is as in (22). As before, and are constants independent of , and is the equivalent kernel of defined in §2.2.

The proof of Corollary 2.2 follows directly from tracking the proof of Theorem 2.1 with , and is hence omitted. Implications of Corollary 2.2 for the posterior variance function are discussed in the next section.

3 Convergence limit of Bayesian posterior

We now use the sup-norm KRR bounds developed in §2.3 to analyze the GP posterior defined in (4).
Implications for the posterior mean function: As noted in §2.1, the posterior mean under the prior coincides with the KRR estimator . Hence, the conclusions of Theorem 2.1 apply to , that is, with probability at least ,

(27)

In particular, for optimal choices of the prior precision parameter as in Corollary 2.1, the posterior mean achieves the minimax rates for the Sobolev and Hölder classes under the supremum norm.
Implications for the posterior variance function: Recall from §2.1 that the posterior covariance function admits the representation , where is the solution to the noiseless KRR problem in (8). From (25) in Corollary 2.2, it follows that with probability at least ,

A simple calculation yields

Since are uniformly bounded, we have

(28)

From (26) in Corollary 2.2, we obtain that with probability at least ,

Combining with the previous display,

(29)

In particular, this relationship between the rescaled posterior covariance function and the equivalent kernel function leads to a practically useful way of numerically approximating the equivalent kernel function without explicitly computing the eigen-decomposition of the original kernel function .
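
A minimal sketch of this numerical recipe is given below. We build the kernel from a truncated Mercer series so that the equivalent kernel is available in closed form for comparison, compute the GP posterior covariance on a grid, and rescale it by n/sigma^2; the rescaling factor and the prior calibration tau = 1, lambda = sigma^2/(n*tau) are assumptions of this illustration, chosen to match our reading of (29). Agreement is approximate at moderate sample sizes and improves as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)

J, alpha = 200, 2.0                         # truncation level and eigen-decay (illustrative)
mu = np.arange(1, 2 * J + 2, dtype=float) ** (-2.0 * alpha)

def basis(x):
    cols = [np.ones_like(x)]
    for j in range(1, J + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.stack(cols, axis=-1)

def K(x, z):                                # kernel defined via a truncated Mercer series
    return (basis(x) * mu) @ basis(z).T

n, sigma, tau = 200, 0.3, 1.0
lam = sigma**2 / (n * tau)                  # assumed calibration, as in the sketches above
X = rng.uniform(0, 1, n)
grid = np.linspace(0, 1, 50)

# GP posterior covariance with prior covariance tau*K and noise variance sigma^2.
Knn, Kgn = K(X, X), K(grid, X)
post_cov = tau * (K(grid, grid)
                  - Kgn @ np.linalg.solve(Knn + (sigma**2 / tau) * np.eye(n), Kgn.T))

# Equivalent kernel from the eigen-expansion, with shrinkage mu_j/(mu_j + lam).
eq_kernel = (basis(grid) * (mu / (mu + lam))) @ basis(grid).T

# The rescaled posterior covariance should roughly track the equivalent kernel.
print(np.max(np.abs((n / sigma**2) * post_cov - eq_kernel)))
```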

Let . Then the conditional distribution of given is the posterior distribution of the mean function , or in other words, the law of given is that of the centered posterior distribution. When studying second-order properties of the posterior, it will be useful to work with a suitably scaled version of the posterior, which has the same covariance function as ; . Write

The approximation error bound in (29) motivates us to define a related GP as , with