A Theoretical Framework for Bayesian Nonparametric Regression: Orthonormal Random Series and Rates of Contraction
Abstract
We develop a unifying framework for Bayesian nonparametric regression to study rates of contraction with respect to the integrated distance without assuming the regression function space to be uniformly bounded. The framework is built upon orthonormal random series in a flexible manner. A general theorem for deriving rates of contraction for Bayesian nonparametric regression is provided under the proposed framework. Three nontrivial applications of the framework are presented: the finite random series regression of a Hölder function, with adaptive rates of contraction up to a logarithmic factor, given independent and uniform design points; the unmodified block prior regression of a Sobolev function, with adaptive and exact rates of contraction, given independent and uniform design points; and the squared-exponential Gaussian process regression of a supersmooth function, with a near-parametric rate of contraction, under the condition that the design points are fixed and reasonably spread. These applications generalize or complement their respective results in the literature. An extension to sparse additive models in high dimensions is discussed as well.
A theoretical framework for Bayesian regression
Keywords: Bayesian nonparametric regression; integrated distance; orthonormal random series; rate of contraction
1 Introduction
Consider the standard nonparametric regression problem , , where the predictors are referred to as design (points) and take values in , ’s are independent and identically distributed (i.i.d.) mean-zero Gaussian noises with , and ’s are the responses. We follow the popular Bayesian approach by assigning a carefully selected prior, and perform inference tasks by finding the posterior distribution of given the observations .
We develop a theoretical framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated distance
The regression function is assumed to be represented using a set of appropriate orthonormal basis functions : . The coefficients are allowed to range over the entire real line and the space of regression functions is allowed to be unbounded, including the renowned Gaussian process priors as special examples.
Rates of contraction of posterior distributions for Bayesian nonparametric priors have been studied extensively. Following the earliest framework of generic rate-of-contraction theorems with i.i.d. data proposed by [12], specific examples for density estimation via Dirichlet process mixture models [4, 13, 16, 37] and location-scale mixture models [22, 46] are discussed. For nonparametric regression, the rates of contraction had not been discussed until [14], who develop a generic framework for fixed-design regression to study rates of contraction with respect to the empirical distance. There are extensive studies of various priors that fall into this framework, including location-scale mixture priors [9], conditional Gaussian tensor-product splines [10], and Gaussian processes [42, 44], among which adaptive rates are obtained in [9, 10, 44].
Although it is interesting to achieve adaptive rates of contraction with respect to the empirical distance for nonparametric regression, this might be restrictive since the empirical distance quantifies the convergence of functions only at the given design points. In nonparametric regression, one also expects the error between the estimated function and the true function to be globally small over the whole design space [47], i.e., small mean-squared error for out-of-sample prediction. Therefore the integrated distance is a natural choice. For Gaussian processes, [40] and [48] provide contraction rates for nonparametric regression with respect to integrated distances. A novel spike-and-slab wavelet series prior is constructed in [52] to achieve adaptive contraction with respect to the stronger distance. These examples, however, take advantage of their respective prior structures and may not be easily generalized. A closely related reference is [26], which discusses the rates of contraction of the rescaled Gaussian process prior for nonparametric random-design regression with respect to the integrated distance, which is weaker than the integrated distance. Although a generic framework for the integrated distance is presented in [20], the prior there is imposed on a uniformly bounded function space and hence rules out some popular priors, e.g., the popular Gaussian process prior [29].
It is therefore natural to ask the following fundamental question: for Bayesian nonparametric regression, can one build a unifying framework to study rates of contraction for various priors with respect to the integrated distance without assuming the uniform boundedness of the regression function space? In this paper we provide a positive answer to this question. Inspired by [11], we build the general framework upon series expansions of functions with respect to a set of orthonormal basis functions. The prior is not necessarily supported on a uniformly bounded function space. A general rate-of-contraction theorem with respect to the integrated distance for Bayesian nonparametric regression is provided under the proposed framework. Examples of applications falling into this framework include the finite random series prior [30, 35], the (unmodified) block prior [11], and the classical squared-exponential Gaussian process prior [29]. In particular, for the block prior regression, rather than modifying the block prior by conditioning on a truncated function space as in [11], which requires a known upper bound for the unknown true regression function, we prove that the unmodified block prior automatically yields rate-exact Bayesian adaptation for nonparametric regression without such a truncation. We further extend the proposed framework to sparse additive models in high dimensions. The analyses of the above applications and extension under the proposed framework also generalize their respective results in the literature. This is made clear in Sections 3 and 4.
The literature on rates of contraction for Bayesian nonparametric models is largely based on the prior-concentration-and-testing procedure proposed by [2] and [12]. The major challenge in establishing a general framework for Bayesian nonparametric regression is that, unlike the Hellinger distance for density estimation and the empirical distance for the fixed-design regression problem, the integrated distance for the random-design regression problem is not locally testable, complicating the construction of certain test functions. The exact definition of local testability of a distance is provided later, but a locally testable distance is the key ingredient for the prior-concentration-and-testing procedure to work. It turns out that by modifying the testing procedure, we only require the integrated distance to be locally testable over an appropriate class of functions. The above arguments are made precise in Section 2.
The layout of this paper is as follows. In Section 2 we introduce the random series framework for Bayesian nonparametric regression and present the main result concerning rates of contraction. As applications of the main result, we derive the rates of contraction of various renowned priors for nonparametric regression in the literature with substantial improvements in Section 3. Section 4 elaborates on extension of the proposed framework to sparse additive models in high dimensions. The technical proofs of the main results are deferred to Section 5.
Notations
For , we use to denote both the norm on any finite-dimensional Euclidean space and the integrated norm of a measurable function (with respect to the Lebesgue measure). In particular, for any function , we use to denote the integrated norm defined to be . We follow the convention that when , the subscript is omitted, i.e., . The Hilbert space denotes the space of sequences that are square-summable. We use to denote the largest integer no greater than , and to denote the smallest integer no less than . The notations and denote inequalities up to a positive multiplicative constant, and we write if and . Throughout, capital letters are used to denote generic positive constants whose values might change from line to line unless otherwise specified; they are universal and unimportant for the analysis.
We refer to as a statistical model if it consists of a class of densities on a sample space with respect to some underlying finite measure. Given a (frequentist) statistical model and i.i.d. data from some , the prior and the posterior distribution on are always denoted by and , respectively. Given a function , we use to denote the empirical measure of , and to denote the empirical process of , given the i.i.d. data . With a slight abuse of notation, when applying to a set of design points , we also write and for the empirical measure and empirical process, even when the design points are deterministic. In particular, denotes the probability density function of the (univariate) standard normal distribution, and we use the shorthand notation to denote the density of . For a metric space and any , the covering number of , denoted by , is defined to be the minimum number of balls of the form needed to cover .
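As a concrete instance of the covering-number definition, consider the unit interval under the absolute-value metric: it can be covered by intervals of radius ε, and the minimum number needed is ⌈1/(2ε)⌉. The sketch below (a toy illustration, not taken from the paper) computes this covering number and verifies the cover numerically.

```python
import math

def covering_number_interval(eps: float) -> int:
    """Covering number N(eps, [0,1], |.|): the minimum number of
    intervals of radius eps (i.e., length 2*eps) needed to cover [0, 1]."""
    return math.ceil(1.0 / (2.0 * eps))

def covers(eps: float) -> bool:
    """Check numerically that the centers c_k = (2k - 1) * eps,
    k = 1, ..., N, give an eps-cover of [0, 1]."""
    n = covering_number_interval(eps)
    centers = [(2 * k - 1) * eps for k in range(1, n + 1)]
    grid = [i / 1000 for i in range(1001)]
    return all(any(abs(x - c) <= eps + 1e-12 for c in centers) for x in grid)
```

The logarithm of such covering numbers (the metric entropy) is what enters the testing conditions later in Section 2.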
2 The framework and main results
Consider the nonparametric regression model: , where are i.i.d. mean-zero Gaussian noises with , and are design points taking values in . Unless otherwise stated, the design points are assumed to be independently and uniformly sampled for simplicity throughout the paper. The framework presented in this paper naturally adapts to the case where the design points are independently sampled from a density function that is bounded away from and . We assume that the responses ’s are generated from for some unknown , so the data can be regarded as i.i.d. samples from a distribution with joint density . Throughout we assume that the variance is known, but the framework can be easily extended to the case where is unknown by placing a prior on that is supported on a compact interval contained in with a density bounded away from and .
The regression model is parametrized by the function , and we follow the standard nonparametric Bayes procedure by assigning a prior distribution on as follows. Let be a set of orthonormal basis functions in , where is a countably infinite index set. Since , we impose the prior on a subclass of and write in terms of the orthonormal series expansion , where the coefficients are assigned a prior distribution such that a.s. The prior on can be very general as long as it yields realizations that are square-integrable with probability one. We shall also assume that the true function yields a series expansion , where , i.e., . We require that . The widely used tensor-product Fourier basis satisfies this requirement:
where , , , and the index set is the Cartesian product of copies of the set of all positive integers .
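For the one-dimensional case, a minimal numerical check of this orthonormality can be sketched as follows (assuming the standard sine–cosine Fourier basis on [0, 1]; the ordering of the basis functions here is illustrative):

```python
import math

def phi(j: int, x: float) -> float:
    """One-dimensional orthonormal Fourier basis on [0, 1]:
    phi_1 = 1, phi_{2k}(x) = sqrt(2) cos(2 pi k x),
    phi_{2k+1}(x) = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return 1.0
    k = j // 2
    trig = math.cos if j % 2 == 0 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * k * x)

def inner(i: int, j: int, n: int = 4096) -> float:
    """Midpoint-rule approximation of int_0^1 phi_i(x) phi_j(x) dx;
    equally spaced quadrature is essentially exact for these trigonometric
    integrands when n exceeds the frequencies involved."""
    h = 1.0 / n
    return h * sum(phi(i, (m + 0.5) * h) * phi(j, (m + 0.5) * h) for m in range(n))
```

Tensor-product basis functions in dimension d are products phi_{j_1}(x_1) ··· phi_{j_d}(x_d), and their orthonormality follows coordinate-wise from the one-dimensional case checked here.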
Remark 2.1.
The prior is supported on a function space that may not be uniformly bounded, and the setup here is more general than that in [20]. The proposed framework includes the well-known Gaussian process prior as a special case. Let be a positive definite covariance function that yields an eigenfunction expansion , where is a set of orthonormal eigenfunctions of , and are the corresponding eigenvalues. For a Gaussian process prior on , the Karhunen–Loève theorem [43] yields the following representation of : , where independently.
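A draw from such a prior can be sketched by truncating the Karhunen–Loève expansion. In the sketch below the eigenfunctions are taken to be the Fourier basis on [0, 1] and the eigenvalues follow a purely illustrative geometric decay; the actual spectrum is determined by the covariance function and is not specified here.

```python
import math
import random

def sample_gp_karhunen_loeve(J: int = 40, decay: float = 0.7, seed: int = 1):
    """One Gaussian-process sample path via a truncated Karhunen-Loeve
    expansion f(x) = sum_{j<=J} sqrt(lambda_j) Z_j psi_j(x), Z_j ~ N(0,1) i.i.d.
    Assumptions: psi_j is the Fourier basis and lambda_j = decay^j
    (illustrative only)."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(J)]

    def psi(j: int, x: float) -> float:
        if j == 1:
            return 1.0
        k = j // 2
        trig = math.cos if j % 2 == 0 else math.sin
        return math.sqrt(2.0) * trig(2.0 * math.pi * k * x)

    def f(x: float) -> float:
        return sum(math.sqrt(decay ** j) * z[j - 1] * psi(j, x)
                   for j in range(1, J + 1))

    return f
```

Because the eigenvalues are summable, the random coefficients are square-summable almost surely, so the realization lies in the function space on which the prior is supported.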
Before presenting the main result for studying convergence of Bayesian nonparametric regression under the proposed framework, let us first recall the extensively used prior-concentration-and-testing procedure for deriving rates of contraction proposed by [12]. The general setup is as follows. Suppose are i.i.d. data sampled from a distribution that yields a density with respect to some underlying finite measure. Let be the parameter space (typically infinite-dimensional for nonparametric problems) such that for some . By imposing a prior on the parameter (equipped with a suitable measurable structure), the posterior distribution can be written as
for any measurable in , where is the loglikelihood ratio
An appropriate distance on is crucial to study rates of contraction, since typically different distances would yield different rates, and when is infinite-dimensional, different distances are possibly not equivalent. In order that the posterior distribution contracts to at rate with respect to a distance on , i.e., in probability for some large constant , the prior-concentration-and-testing procedure requires that the following hold with some constants for sufficiently large :

The prior concentration condition holds:
(2.1) 
There exists a sequence of subsets of (often referred to as the sieves) and test functions such that
The major technical barrier to establishing a general framework for studying rates of contraction for nonparametric regression with respect to lies in the construction of the aforementioned test functions . Typically, the existence of suitable test functions can be guaranteed by elaborating on the covering numbers of certain function classes (see, for example, [12, 13, 16, 42], among others), provided that the metric of interest satisfies the local testability condition (see Section 7 of [12]). Formally, we say that a distance is locally testable if there exists a sequence of test functions such that the following holds for some constant when is sufficiently large, whenever :
where is a fixed constant. When the parameter space itself is a class of densities, i.e., is parametrized by itself, it is proved that the Hellinger distance is locally testable [23]. The widely used empirical distance for fixed-design nonparametric regression is also locally testable (see, for example, [14]). The integrated distance is, nonetheless, not locally testable when the space of functions is not uniformly bounded. To resolve this issue, we modify the general procedure above by imposing more assumptions on the sieves.
Since is a countable set, we may assume without loss of generality that . For a fixed integer and a positive constant , we denote by a class of functions in that satisfies the following property:
(2.2) 
The sieves are constructed by letting for certain sequences and .
Rather than considering the supremum over as in [12], in the proposed framework we consider the supremum over the smaller function class and obtain the following weaker result analogous to the local testability condition.
Lemma 2.1.
Let satisfy (2.2). Then for any with , there exists a test function such that
for some constant and .
Based on Lemma 2.1, we are able to establish the following lemma, which tests against the large composite alternative . In particular, it guides the construction of the aforementioned test functions with respect to the integrated distance .
Lemma 2.2.
Suppose that satisfies (2.2) for an and a . Let be a sequence with . Then there exists a sequence of test functions such that
where is the covering number of
and is some positive constant.
The prior concentration condition (2.1) is an important ingredient in the study of Bayes theory. It guarantees that the denominator appearing in the posterior distribution can be lower bounded by for some constant with large probability (see, for example, lemma 8.1 in [12]). In the context of normal regression, the Kullback–Leibler divergence is proportional to the squared integrated distance between two regression functions. Motivated by this observation, we establish the following lemma, which yields an exponential lower bound for the denominator under the proposed framework.
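To make this observation concrete, for uniform design on $[0,1]$ with known noise variance $\sigma^2$, the following standard calculation (supplied here for the reader; it is not quoted from the excerpt) gives the proportionality exactly:

```latex
\mathrm{KL}\left(p_{f_0}\,\|\,p_{f}\right)
  = \int_0^1 \mathrm{KL}\!\left(\mathrm{N}(f_0(x),\sigma^2)\,\big\|\,\mathrm{N}(f(x),\sigma^2)\right)\mathrm{d}x
  = \int_0^1 \frac{\{f_0(x)-f(x)\}^2}{2\sigma^2}\,\mathrm{d}x
  = \frac{\|f_0-f\|_2^2}{2\sigma^2}.
```

Hence prior mass conditions phrased in Kullback–Leibler neighborhoods of $p_{f_0}$ translate directly into prior mass on integrated $L_2$ balls around $f_0$.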
Lemma 2.3.
Denote
for any and . Suppose sequences and satisfy , , , and is some constant. Then for any constant ,
In some cases it is also straightforward to consider the prior concentration with respect to the stronger norm. For example, for a wide class of Gaussian process priors, the prior concentration has been extensively studied (see, for example, [15, 42, 44] for more details).
Lemma 2.4.
Suppose the sequence satisfies and . Then for any constant ,
Now we present the main result regarding rates of contraction for Bayesian nonparametric regression under the orthonormal random series framework. The proof is based on a modification of the prior-concentration-and-testing procedure.
Theorem 2.1 (Generic Contraction).
Let and be sequences such that as , with . Assume that the sieve satisfies (2.2) with for some constant . In addition, assume that there exists another sequence such that . Suppose the following conditions hold for some constant and sufficiently large and :
(2.3)  
(2.4)  
(2.5) 
where is the covering number of
and is defined in Lemma 2.3. Then
3 Applications
In this section we consider three concrete priors on for the nonparametric regression problem , . For simplicity the design points are assumed to independently follow the one-dimensional uniform distribution . For some of the examples, the results presented in this section can be easily generalized to the case where the design points are multidimensional by considering the tensor-product Fourier basis. The results for these applications under the proposed framework also generalize their respective counterparts in the literature.
3.1 Finite random series regression with adaptive rate
In this subsection the true regression function is assumed to be in the Hölder ball
where is the smoothness level, and is some positive constant. In particular, we do not assume that the smoothness level of is known a priori. In the literature of Bayes theory, such a procedure is referred to as adaptive. We shall also assume that a lower bound for is known: .
The finite random series prior [1, 30, 35, 45] is a popular prior in the literature of Bayesian nonparametric theory. It is a class of hierarchical priors that first draw an integer-valued random variable serving as the number of “terms” to be used in a finite sum, and then sample the “term-wise” parameters given the number of “terms.” The finite random series prior typically does not depend on the smoothness level of the true function and often yields minimax-optimal rates of contraction (up to a logarithmic factor) in many other nonparametric problems (e.g., density estimation [30, 35] and fixed-design regression [1]), and hence is fully adaptive. However, adaptive rates of contraction for the finite random series prior in random-design regression with respect to the integrated distance have not been established. In this subsection we address this issue within the orthonormal random series framework proposed in Section 2.
We now formally construct the finite random series prior for nonparametric regression. Let be represented by a set of orthonormal basis functions in : . The coefficients are assigned a prior distribution as follows: first sample an integer-valued random variable from a density function (with respect to the counting measure on ), and then, given , the coefficients ’s are independently sampled according to
where is an exponential power density for some [36] and . We further require that
(3.1) 
for some constants , . For instance, the zero-truncated Poisson distribution satisfies condition (3.1) [46].
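A draw from this hierarchical prior can be sketched as follows. The zero-truncated Poisson choice for the number of terms and the particular values of the rate λ, the power p, and the scale τ are illustrative assumptions; the construction above only requires condition (3.1) and an exponential power density for the coefficients.

```python
import math
import random

def zt_poisson(lam: float, rng: random.Random) -> int:
    """Zero-truncated Poisson via rejection, using Knuth's Poisson sampler."""
    while True:
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= rng.random()
        if k - 1 >= 1:          # reject draws equal to zero
            return k - 1

def exp_power(p: float, tau: float, rng: random.Random) -> float:
    """Draw from the density proportional to exp(-|b / tau|^p):
    if G ~ Gamma(1/p, 1), then sign * tau * G^(1/p) has this density."""
    g = rng.gammavariate(1.0 / p, 1.0)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * tau * g ** (1.0 / p)

def sample_finite_series_prior(lam=5.0, p=2.0, tau=1.0, seed=0):
    """One draw from the hierarchical prior: first the number of terms
    k ~ zero-truncated Poisson(lam), then k i.i.d. exponential-power
    coefficients (lam, p, tau are illustrative placeholders)."""
    rng = random.Random(seed)
    k = zt_poisson(lam, rng)
    return [exp_power(p, tau, rng) for _ in range(k)]
```

The returned coefficient vector defines the random regression function through the finite sum over the first k basis functions.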
The following theorem shows that the finite random series prior constructed above is adaptive and that the rate of contraction with respect to the integrated distance is minimax-optimal up to a logarithmic factor [38].
Theorem 3.1.
Suppose the true regression function for some and , and is imposed the prior given above. Then there exists some sufficiently large constant such that for any .
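For context, the frequentist benchmark here is the classical minimax rate over a Hölder ball of smoothness $\alpha$ under integrated $L_2$ loss (a standard fact, stated here for the reader):

```latex
\inf_{\hat f_n}\ \sup_{f_0 \in \mathcal{C}^{\alpha}(L)}
  \mathbb{E}_{f_0}\bigl\|\hat f_n - f_0\bigr\|_2^2 \;\asymp\; n^{-2\alpha/(2\alpha+1)}.
```

A contraction rate of the form $n^{-\alpha/(2\alpha+1)}(\log n)^{t}$ for some $t > 0$, obtained without knowledge of $\alpha$, is therefore adaptive and minimax-optimal up to the logarithmic factor; the exact value of the exponent $t$ is not recoverable from this excerpt.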
3.2 Block prior regression with adaptive and exact rate
In the literature on adaptive Bayesian procedures, the minimax-optimal rates of contraction are often obtained with an extra logarithmic factor, and it typically requires extra work to obtain the exact minimax-optimal rate. Gao and Zhou [11] elegantly construct a modified block prior that yields rate-adaptive (i.e., the prior does not depend on the smoothness level) and rate-exact contraction for a wide class of nonparametric problems, without the logarithmic factor. Nevertheless, for nonparametric regression, [11] modifies the block prior by conditioning on the space of uniformly bounded functions. Requiring a known upper bound for the unknown when constructing the prior is restrictive since it eliminates the popular Gaussian process priors. Besides this theoretical concern, the block prior itself is a conditional Gaussian process and such a modification is inconvenient for implementation. In this section, we address this issue by showing that for nonparametric regression such a modification is not necessary. Namely, the unmodified block prior itself also yields the exact minimax-optimal rate of contraction for in the Sobolev ball
and it does not depend on the smoothness level of the true regression function.
We now describe the block prior. Given a sequence in the square-summable sequence space , define the th block to be the integer index set and , where . We use to denote the coefficients with indices lying in the th block . Let be the Fourier basis, i.e., , , and , . The block prior is a prior distribution on induced by the following distribution on the Fourier coefficients :
where is a sequence of densities satisfying the following properties:

There exists such that for any and ,
(3.2) 
There exists such that for any ,
(3.3) 
There exists such that for any ,
(3.4)
The existence of a sequence of densities satisfying (3.2), (3.3), and (3.4) is verified in [11] (see proposition 2.1 in [11]).
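The blockwise Gaussian scale-mixture structure can be illustrated with the following toy sketch. Both the dyadic block layout and the exponential placeholder for the mixing densities are assumptions made purely for illustration: a faithful implementation must use the block definition above and mixing densities g_k satisfying (3.2)–(3.4).

```python
import math
import random

def sample_block_prior(K: int = 8, seed: int = 0):
    """Toy draw from a blockwise Gaussian scale-mixture prior on the
    coefficients theta_1, theta_2, ....
    Assumptions (placeholders, not the paper's exact construction):
    dyadic blocks B_k = {2^k, ..., 2^(k+1) - 1}, and a mixing draw
    A_k ~ Exponential with mean 2^(-k) standing in for g_k."""
    rng = random.Random(seed)
    theta = {}
    for k in range(K):
        A_k = rng.expovariate(2.0 ** k)       # placeholder for A_k ~ g_k
        sd = math.sqrt(A_k)
        for j in range(2 ** k, 2 ** (k + 1)):
            theta[j] = rng.gauss(0.0, sd)     # theta_{B_k} | A_k ~ N(0, A_k I)
    return theta
```

The key feature the sketch captures is that all coefficients within a block share a single random scale, so the prior shrinks entire blocks of Fourier coefficients jointly rather than coordinate by coordinate.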
Our major improvement for the block prior regression is the following theorem, which shows that the (unmodified) block prior yields rateexact Bayesian adaptation for nonparametric regression.
Theorem 3.2.
Suppose the true regression function for some and , and is imposed the block prior as described above. Then for some sufficiently large constant .
Rather than using the sieve proposed in theorem 2.1 in [11], which does not necessarily satisfy (2.2), we construct in a slightly different fashion:
with and . The covering number can be bounded using the metric entropy of Sobolev balls (see, for example, lemma 6.4 in [3]), and the remaining conditions in Theorem 2.1 can be verified using techniques similar to those in [11].
As discussed in Section 4.2 of [11], the block prior can be easily extended to wavelet basis functions and wavelet series. The wavelet basis functions are another widely used class of orthonormal basis functions for . Let be an orthonormal basis of compactly supported wavelets for , with referring to the so-called “resolution level,” and to the “dilation” (see Section E.3 in [15]). We adopt the convention that the index set for the th resolution level runs through . The exact definition and specific formulas for the wavelet basis are not of great interest to us; for a complete and thorough review of wavelets from a statistical perspective, we refer to [17]. We shall assume that the wavelet basis ’s are appropriately selected such that for any , the following inequalities hold [6, 7, 19]:
It is worth noticing that unlike the Fourier basis, the wavelet basis functions are typically not uniformly bounded in the supremum norm: . However, the supremum norm of a wavelet series can be upper bounded in terms of the wavelet coefficients, which allows us to modify the framework in Section 2 for wavelet basis functions.
Given a function with wavelet series expansion , the block prior for the wavelet series is introduced through the wavelet coefficients ’s as follows:
where , , and is given by
We further assume that lies in the Besov ball defined as follows [11] for some and :
which turns out to be equivalent to the aforementioned Sobolev ball. For the block prior regression via wavelet series, the rateexact Bayesian adaptation also holds.
Theorem 3.3.
Suppose the true regression function for some and , and is imposed the block prior for wavelet series as described above. Then there exists some sufficiently large constant such that .
3.3 Squaredexponential Gaussian process regression with fixed design
So far, the design points in this paper for the nonparametric regression problem have been assumed to be randomly sampled over . This is referred to as the random-design regression problem. There are, however, many cases where the design points are fixed and can be controlled. One example is the design and analysis of computer experiments [8, 34]. To emulate a computer model, the design points are typically manipulated so that they are reasonably spread. In some physical experiments the design points can also be required to be fixed [39].
In this subsection we consider one of the most widely used and perhaps the most popular Gaussian processes ([29]), with the covariance function of squared-exponential form, for the fixed-design nonparametric regression problem. We show that optimal rates of contraction with respect to the integrated distance are also attainable in this scenario when the design points are reasonably selected, in contrast to most of the Bayesian literature, which obtains rates of contraction with respect to the empirical distance. This can be done by slightly extending the framework in Section 2.
We first present an assumption on the design points. Suppose that the design points are one-dimensional and are fixed instead of randomly sampled. Intuitively, the design points need to be relatively “spread” so that the global behavior of the true signal can be recovered as much as possible. Formally, we require that the design points satisfy [51, 52]
(3.5) 
A simple example of such a design is the univariate equidistant design, i.e., . It clearly satisfies (3.5) (see, for example, [52]).
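Since the display for (3.5) did not survive extraction, the sketch below illustrates one common way such spread conditions are formalized (an assumption for illustration, not necessarily the paper's exact condition): the largest gap left by the design on [0, 1] should be of order 1/n, which the equidistant design achieves exactly.

```python
def max_spacing(design):
    """Largest gap of a design on [0, 1], counting the two boundary gaps."""
    xs = sorted(design)
    gaps = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])] + [1.0 - xs[-1]]
    return max(gaps)

n = 100
equidistant = [i / n for i in range(1, n + 1)]  # x_i = i/n, i = 1, ..., n
```

A clumped design, by contrast, leaves a gap of constant order, so no estimator can recover the signal there; this is the intuition behind requiring the design to be reasonably spread.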
Now we extend the orthonormal random series framework in Section 2 to the (one-dimensional) fixed-design regression problem with the Fourier basis: assume that yields the Fourier series expansion , where , , and , . We also assume that lies in the Hölder space with . Consider the sieve
(3.6) 
with . Notice that the requirement (3.6) is stronger than condition (2.2), as contains functions that permit term-by-term differentiation.
With the above ingredients, we are in a position to present the following modification of Theorem 2.1 for the fixeddesign regression, which might be of independent interest as well.