Mercer kernels and integrated variance experimental design: connections between Gaussian process regression and polynomial approximation

# Mercer kernels and integrated variance experimental design: connections between Gaussian process regression and polynomial approximation

## Abstract

This paper examines experimental design procedures used to develop surrogates of computational models, exploring the interplay between experimental designs and approximation algorithms. We focus on two widely used approximation approaches, Gaussian process (GP) regression and non-intrusive polynomial approximation. First, we introduce algorithms for minimizing a posterior integrated variance (IVAR) design criterion for GP regression. Our formulation treats design as a continuous optimization problem that can be solved with gradient-based methods on complex input domains, without resorting to greedy approximations. We show that minimizing IVAR in this way yields point sets with good interpolation properties, and that it enables more accurate GP regression than designs based on entropy minimization or mutual information maximization. Second, using a Mercer kernel/eigenfunction perspective on GP regression, we identify conditions under which GP regression coincides with pseudospectral polynomial approximation. Departures from these conditions can be understood as changes either to the kernel or to the experimental design itself. We then show how IVAR-optimal designs, while sacrificing discrete orthogonality of the kernel eigenfunctions, can yield lower approximation error than orthogonalizing point sets. Finally, we compare the performance of adaptive Gaussian process regression and adaptive pseudospectral approximation for several classes of target functions, identifying features that are favorable to the GP + IVAR approach.

Gaussian process regression

, experimental design, computer experiments, approximation theory, polynomial approximation, kernel interpolation, uncertainty quantification

## 1 Introduction

Computational simulations are essential for design, optimization, uncertainty quantification, and inference in complex systems. Yet these tasks typically require a large number of simulations over a range of parameter values, which can be computationally prohibitive. One method of mitigating this computational expense is to construct surrogates or “emulators” that replace the simulation in the relevant analyses. Because many computational simulations are available only as black-box or legacy codes, surrogates often must be constructed through a limited number of simulations at particular parameter values . These simulations are sometimes called “computer experiments” [52, 53] and choosing these parameter values is a question of experimental design.

Surrogate construction can be viewed as a function approximation problem in which one attempts to approximate an input-output relationship induced by the expensive simulation. One can do so deterministically, i.e., obtaining a single approximation , or probabilistically, i.e., obtaining a distribution over possible functions , where is a probability distribution on a suitable function space. In either case, three decisions must be made. First, one must choose an approximation space that contains candidate surrogate functions; second, one must select a set of parameter values or experiments at which to simulate the system; and finally, one must choose an algorithm to convert the simulation results into a particular function or distribution . Examples of approximation spaces include those spanned by polynomials of a certain degree, or by radial basis functions and other kernels. Possible experiments include designs produced by Monte Carlo sampling, obtained via optimization of information-based design criteria such as entropy and mutual information, or based on numerical quadrature rules. Finally, examples of algorithms include linear regression and pseudospectral approximation. These three decisions are usually not independent; for example, pseudospectral approximation requires experimental design procedures that preserve orthogonality of the relevant basis functions.

In this paper, we analyze two of the most commonly used surrogate construction approaches in uncertainty quantification: Gaussian process regression (GPR), which has seen much development in the statistics community, and pseudospectral approximation (PSA), which is widely used in applied mathematics and engineering. Both of these methods are routinely applied for similar purposes—replacing computationally expensive models with cheap-to-evaluate surrogates for complex analysis tasks. One of our primary goals is to compare these two approaches by analyzing the approximation spaces, experiments, and algorithms that they employ. The PSA method comes equipped with an experimental design procedure and an algorithm intended to produce approximations that are accurate in an sense. The GPR methodology is more flexible in that it does not come pre-equipped with an experimental design procedure. In order to compare these two methods, we will employ an experimental design criterion that also seeks accuracy in an sense.

Our main contributions are as follows. First, we develop a continuous optimization algorithm, based on sample-average approximation (SAA), to minimize an integrated posterior variance (IVAR) design criterion for Gaussian process regression. We compare our algorithm to approaches that maximize other information-based criteria (e.g., entropy or mutual information) by evaluating their computational costs, the properties of the resulting point sets, and the accuracy of the resulting approximations. Second, we provide a theoretical and numerical analysis comparing non-intrusive polynomial approximation—in particular pseudospectral polynomial approximation—with Gaussian process regression. While the relative performance of these methods may be a subject of broader debate, here we assess the impact of each of the three surrogate construction ingredients described above. Theoretically, we develop results to describe the difference between the approximations given the same experiments and similar approximation spaces. Numerically, we investigate the performance of the two approaches given similar approximation spaces but different experiments, chosen to be optimal for each.

The IVAR objective, which is equivalent to the IMSE criterion [52, 9], can be minimized either over a finite (and hence discrete) design space or over a continuous design space. In the discrete case, the criterion is sometimes called the ALC (“active learning Cohn”) [6] objective. Minimizing the ALC criterion involves sequentially adding experiments, chosen one at a time from a discrete and finite set of candidates, to minimize a weighted average of the predictive variance. ALC has also been investigated in the context of determining local designs for large-scale computer experiments [27]. ALC is often compared to other discrete design space criteria, namely ALM (“active learning MacKay”) and mutual information (MI) [34]. Minimizing the ALM criterion involves sequentially adding experiments at locations where the local predictive variance is maximized. Compared with ALM, the ALC criterion considers the effect of each experiment on the entire domain and therefore yields better designs; ALC is more expensive, however, because it requires a new variance computation for each potential design. The MI criterion sequentially seeks points that maximize the expected information gain at locations not yet chosen. MI design requires a good candidate set, which may be difficult to obtain for input domains with complex geometries, though strides have been made in this direction [3]. It also remains computationally expensive, with a complexity that grows cubically with the size of the candidate set.

Instead of dealing with the combinatorial optimization issues associated with discrete design spaces, we will pursue optimization in a continuous space. Our approach is similar to [51, 52] in that we use gradient-based methods to search for optimal designs, but we will explore opportunities presented by solving the full optimization problem, without ad hoc simplifications of the design space. We will employ a sample-average approximation (SAA) that proves to be effective for complex domain geometries. Benefits of this approach include generating batches of experiments with lower computational complexity than sequentially minimizing the ALC criterion; eliminating undesirable boundary clustering effects associated with radial basis function kernels, which plague ALM designs [48, 34]; and achieving better approximation performance than either ALM or MI. Finally, because we perform design on a continuous space of candidate points, it becomes natural to analyze the stability and accuracy of approximation with IVAR-optimal designs from the perspective of numerical analysis. For example, in Section 4 we will show that our algorithm generates point sets with good interpolatory properties, as measured by their Lebesgue constants. Finding these point sets via a statistical criterion raises interesting links with previous work in Bayesian numerical analysis [44, 15], particularly the average-case quadrature of [40, 46]. A continuous design procedure also facilitates more cleanly comparing GPR and pseudospectral approximation.

Our comparison of GPR with pseudospectral approximation has two elements. First, we use Mercer’s theorem to rewrite GP regression in terms of orthogonal eigenfunctions of the kernel, such that when these eigenfunctions contain the finite basis for a pseudospectral approximation, one can directly assess the difference between the two approximations. If experimental design is based on an orthogonalizing quadrature rule, the difference between the GP mean and the pseudospectral approximation is due to eigenfunctions of the GP kernel which are not in this finite basis. These leftover eigenfunctions also account entirely for the integrated variance of the GP. Furthermore, we illustrate through numerical examples that experiments achieving optimal IVAR for these GPs can differ qualitatively from standard quadrature rules, and that when the IVAR criterion is then used to select experiments for GP regression, GPR can outperform pseudospectral approximation in some settings. Second, we consider adaptive procedures for GPR that interleave IVAR-based experimental design with adaptation of the kernel hyperparameters. For test problems of moderate dimension, we find that GP approximations constructed in this way can again outperform certain adaptive pseudospectral approaches [7].

This paper is organized into two parts. The first part reviews Gaussian process regression (Section 2), introduces the IVAR criterion and its optimization (Section 3), and describes numerical comparisons of IVAR with other experimental design procedures for GP regression (Section 4). The second part provides a brief background on pseudospectral approximation (Section 5.1), then describes theoretical (Section 5.2) and numerical (Section 5.4) comparisons between PSA and GP regression.

## 2 Gaussian process regression

Gaussian process (GP) regression can be interpreted as a Bayesian method for function approximation [45, 49], providing a posterior probability distribution over functions. The method begins with a Gaussian process prior, specified via a prior mean function and a covariance kernel that is positive semidefinite and bounded. Suppose that simulations of the function are performed at parameter values , , yielding noisy function evaluations , where for and . The resulting posterior distribution is

 ~f|x,^y∼N(m(x),C(x,x′)),

where the posterior mean is

 (1) m(x)=m0(x)+αTK(x,x),

and the posterior covariance is

 (2) C(x,x′)=K(x,x′)−K(x,x)TRK(x,x′).

In the notation above, is a (column) vector in whose th component is . The covariance matrix has elements . Finally, the th element of the vector of coefficients is .

Gaussian process regression reverts to interpolation when . However, as the covariance matrix becomes ill-conditioned; a small value for , called a nugget, is often then introduced to stabilize the procedure [43]. Note also that we have not included inference of the prior mean in the Bayesian formulation above. If the prior mean is described via some parametric model , where is a vector of basis functions, then Bayesian inference of the coefficients would add terms to the posterior covariance. In practice, however, it is common and quite effective to assume either a zero or non-zero constant term for and to fix its value (for example, by maximizing the log-marginal likelihood) before performing the GP update; see, e.g., [30, 36, 27]. For simplicity, we fix the prior mean here. Doing so will also help focus the comparison of nonparametric GP regression with parametric PSA in Section 5 on its essential aspects.

### 2.1 Reproducing kernels and Mercer’s theorem

Many elements of this work rely on the interpretation of the covariance kernel through its eigenfunctions, and to this end we recall the properties of a Mercer kernel. Let the kernel be defined on a first-countable [16] space endowed with a strictly positive Borel measure . Suppose that the kernel is continuous, positive semi-definite,

 (3) ∫χ×χK(x,x′)g(x)g(x′)dμ(x)dμ(x′)≥0, ∀g∈L2μ(X),

and in

 (4) ∫χ|K(x,x)|dμ(x)<∞.

Additionally we define the integral operator such that This operator has a countable system of eigenvalues that are non-negative and that satisfy

 ∞∑j=1λ2j<∞.

The eigenfunctions of form an orthonormal basis of . These eigenfunctions and eigenvalues can be used to define the reproducing kernel Hilbert space (RKHS) associated with the kernel [49]. Mercer’s theorem lets us represent as a convergent series in terms of this eigensystem.

{theorem}

[Mercer] [39, 21, 4] Let be first countable or locally compact, a strictly positive Borel measure on , and a continuous function on satisfying (3) and (4). Then

 (5) K(x,x′)=∞∑i=1λiϕi(x)ϕi(x′),

where the series converges absolutely for each pair and uniformly on each compact subset of .

When comparing GP regression with pseudospectral approximation in later sections of this paper, we will also use the notion of a truncated kernel. These are kernels for which , and for which we can equivalently write (5) as

 K(x,x′)=ℓ∑i=1λiϕi(x)ϕi(x′).

One common example of a truncated kernel is a polynomial kernel, e.g., , where is a positive integer.

We can use Mercer’s theorem to write the integrated posterior variance of the Gaussian process in terms of the eigensystem . The posterior variance at any point in the domain is . The integrated variance then becomes

 ∫c(x)dμ(x) =∫(K(x,x)−K(x,x)TRK(x,x))dμ(x) =∫K(x,x)dμ(x)−∫∞∑i=1∞∑j=1λiλjϕTiRϕjϕi(x)ϕj(x)dμ(x) (6) =∞∑i=1λi−∞∑i=1λ2iϕTiRϕi,

where The first term is the integrated variance of the prior. The second term, which is always non-negative, reflects reduction in the integrated prior variance due to conditioning on the data. We will now examine how experimental design procedures can use this integrated variance as an optimization objective.

## 3 Integrated variance experimental design

We will design experiments to minimize the integrated posterior variance (IVAR) of the Gaussian process. This choice is motivated by inferential considerations. As opposed to design procedures based on Latin hypercube sampling, quasi-Monte Carlo, quadrature, or other “lattice” designs, the present design strategy directly aims to minimize a measure of the uncertainty associated with the approximation. One advantage of this approach is that it can be used to design experiments on a wide variety of input domains, not just domains with tensor-product or some other canonical structure. Another advantage of computing and monitoring a measure of uncertainty is that it provides useful feedback about the quality of the approximation; for example, if the data do not yield much reduction in uncertainty, one can adjust the approximation space or some other aspect of the surrogate construction methodology.

To put the IVAR criterion in context, we note that it is equivalent to an expected integrated squared error of the posterior mean. First consider the posterior expectation of the squared error in function values, integrated over the parameter space,

 E~f|x,^y[∫(~f(x)−f(x))2dμ(x)],

where can be thought of as a posterior realization of the Gaussian process and is the measure on the parameter space . We can divide this quantity into two terms,

 (7) E~f|x,^y[∫(~f(x)−f(x))2dμ(x)] =∫(m(x|x,^y)−f(x))2dμ(x)+∫c(x|x)dμ(x),

where the first term on the right-hand side is the integrated squared error of the posterior mean and the second term is the integrated posterior variance, i.e., the IVAR. We have explicitly indicated all the conditioning on the right-hand side of (7). Note that the second term is independent of the sampled values of the function .3 Computing the first term, on the other hand, requires the ability to evaluate . Directly using this term in a design criterion would defeat the purpose of experimental design, which is motivated by the desire to evaluate sparingly. Instead, we can consider the expectation of this squared error over the joint distribution of and , for a fixed design . We assume that is drawn from the prior , and therefore this expectation becomes the Bayes risk of the posterior mean under squared error loss, which is equivalent to the integrated mean squared error (IMSE) criterion proposed by [52]. Some manipulation shows that this Bayes risk is indeed equal to the integrated posterior variance, i.e., that after taking the expectation over and , the two terms on the right side of (7) are the same. For further details, see, e.g., [54, p. 92]. Thus IVAR minimization can also be understood as - or -optimal design [1], in the sense of minimizing the -weighted variance of the predictions over the design region .

A different connection to optimal design theory can be made by finding a finite parameterization (and hence, in general, an approximation) of the Mercer kernel and converting the problem into one of parametric model fitting and prediction [18, 20, 19, 28]. Such approaches decompose the kernel by computing a Karhunen-Loève expansion [19, 28] of the Gaussian process and then truncating it; alternatively, one can use a polar spectral approximation [61] of the kernel to avoid computing its eigenfunctions. In either case, once the kernel has been decomposed and truncated, one can approximate the posterior mean of the GP prediction as a linear combination of finitely many basis functions,

 (8) m(x)=m0(x)+ℓ∑i=1θiϕi(x).

In optimal design theory, one may then seek experiments to best learn about the mean structure , the parameters , or to minimize some uncertainty in the prediction. These choices correspond to different classical “alphabetic optimality” criteria applied to the corresponding information matrix, e.g., , , , or -optimal design.

The IVAR design procedure presented in this paper, however, does not require a finite parameterization of the kernel; instead it can use a closed-form expression for the posterior covariance, with no truncation. A direct comparison between such kernel-based procedures (for example, the Sacks-Ylvisaker approach described in [18]) and parametric optimal design theory is outside the scope of this paper; we refer readers to [18, 62, 42, 19] instead. We will, however, compare our IVAR optimization procedure to other algorithms and design objectives that do not require explicitly finding a finite parameterization of the kernel. Later, when comparing GP regression to another surrogate modeling methodology—pseudospectral polynomial approximation, in Section 5—we will return to the eigenfunction viewpoint of GP regression. In that context, our focus will not be on algorithms that require explicit access to the eigenfunctions of the kernel, but rather on how the eigenfunction viewpoint exposes the distinct modeling assumptions made by the two methodologies.

### 3.1 IVAR evaluation and minimization

For a chosen number of experiments and a fixed prior covariance kernel , our optimal experimental design is a set of evaluation points , , minimizing the integrated posterior variance:

 (9) x∗=argminx∈U∫Xc(x|x)dμ(x).

Here is the space of all feasible experiments and the posterior variance is specified via (2).

While this objective function is similar to that in [58], here we will employ a continuous space of possible experiments. Also, we will solve the optimization problem (9) both in a non-greedy fashion (finding all design points simultaneously) and using greedy updates with varying batch sizes. In this section the number of design points and the prior covariance kernel will be considered fixed. Later, in Section 5.4, we will consider closed-loop design procedures that alternate between batch minimization of IVAR and updates of the covariance kernel.

#### Sample-average approximation of IVAR

One method of minimizing IVAR involves numerically evaluating the objective in (9) via quadrature or a quasi-Monte Carlo or Monte Carlo (MC) sampling procedure. Since the variance is in general a smooth function of , quadrature schemes may work efficiently for low-to-moderate dimensional input spaces, but Monte Carlo will generally work better in higher dimensions; Monte Carlo also offers more flexibility for non-tensor-product domains . Monte Carlo sampling replaces the integral with the summation

 (10) ∫Xc(x|x)dμ(x)≈1NmcNmc∑i=1c(^xi|x)=:^Jmc(x),

where is the number of Monte Carlo samples and . Computing the integrated variance for a set of experiments then requires an inversion of the covariance matrix, an operation, and variance evaluations at points, an operation.

The sample-average approximation (SAA) [59] approach to optimization simply replaces the expectation in the objective (9) with a quadrature or Monte Carlo approximation at a fixed set of points and minimizes this objective. After one has chosen this set of fixed points, the minimization becomes a constrained deterministic minimization problem. We can use readily available analytical derivatives of the objective in this setting. In particular, given the form of the kernel , we can directly compute the gradient from (2); details are given in Appendix C. The gradient of the SAA objective obtained from Monte Carlo then becomes

 ∇x^Jmc(x)=1NmcNmc∑i=1∇xc(^xi|x),

Finally, we note that in some cases it may be more convenient to work with closed-form expressions for the eigenfunctions of the kernel rather than the kernel itself; such situations arise when one has a desired basis of approximation but the corresponding closed-form kernel is unknown, or if the eigenfunctions can otherwise be easily computed. In this case, one can rewrite the IVAR objective in terms of eigenfunctions and simply maximize the second term on the right of (6). Written in this form, the objective does not require integration with respect to the parameter measure . Indeed, having the eigenfunctions in hand is tantamount to already having performed the integration, as the eigenfunctions are solutions of the homogeneous Fredholm integral equation with operator . [22] uses this approach for IVAR minimization, exploring truncations of (6) to eigenfunctions.

#### Batch and greedy implementations

In this work, we have used gradient-based optimization algorithms from both the NLopt [31] and SciPy [32] optimization packages, with similar performance, to solve the optimization problem (9). This problem involves design points, each in dimensions, and thus has unknowns. It can become expensive to solve when is very large. In these cases it may be useful to solve a sequence of smaller optimization problems to achieve an approximate solution. In the numerical examples below, we will investigate constructing these smaller problems through the use of a greedy minimization procedure. In this procedure one decides how many training points are computationally feasible to minimize. Suppose that is an integer. Then the greedy procedure solves optimization problems of size . Once the th problem is solved, for , we have obtained the experiments . During the th iteration we find points to append to .

### 3.2 Entropy and mutual information

Statistical criteria underlie many other experimental design procedures for Gaussian process regression [53]. Two popular techniques include minimizing the conditional entropy of the Gaussian process at unobserved locations, or maximizing the mutual information (MI) between the locations at which experiments are performed and the rest of the design space. Our numerical results will compare IVAR designs with MI and entropy-based designs because the latter have algorithms specifically tailored for Gaussian process regression. For a comparison between these two methods and additional design procedures, e.g., based on classical alphabetic optimality criteria, we refer to [34].

The conditional entropy design procedure seeks experiments that reduce the uncertainty, as measured by entropy , across a typically finite set of possible simulation locations. If the set of candidate experimental locations is denoted by and are the chosen locations, then are the locations at which the entropy is evaluated. In particular, one seeks to minimize the conditional entropy , where is a random variable representing the outputs of the simulation model at a set of inputs and denotes a probability density. Minimizing this function is shown to be NP-hard in [33]. In practice, one instead employs greedy but suboptimal algorithms that add one experiment at a time—for example, adding each experiment at the location where the conditional entropy is largest [60, 9, 38]. In the context of GPR, and are jointly Gaussian; this procedure then becomes equivalent to greedily adding experiments at the locations of highest variance, and is commonly called the MacKay criterion (ALM). Other algorithms for choosing a subset of points to minimize the conditional entropy are based on the DETMAX algorithm [41], demonstrated for GP regression in [10]

The mutual information criterion for experimental design considers the change in the entropy of before and after performing experiments at locations : , which is equivalent to the mutual information of and . This objective is also typically maximized in a greedy fashion: from the candidate set of experiments , the element which yields the greatest MI is chosen at each iteration. Unlike the conditional entropy, however, MI is submodular, guaranteeing that a greedy approach performs within a constant factor of the full -experiment maximization [34]. The greedy procedure requires the inversion of a matrix containing the covariance between every pair of candidate experiments, which arises when computing the entropy on the set of all simulation locations. If one is choosing experiments from a set of candidates, each iteration then requires inverting a matrix of size . This expense is typically large and many recent efforts have aimed at reducing it. [34] reduces the cost of each iteration to by using specialized local kernels. [3] points out that the quality of an MI design depends crucially on the candidate set, and modifies the greedy algorithm of [34] to be more robust by resampling the set of candidate experiments after each new point is chosen.

Besides the choice of objective, our IVAR-based design algorithm differs from the entropy and MI approaches above in other fundamental ways. First, we select experiments from a continuous design space and thus avoid the challenges of combinatorial optimization. In this sense, we follow the approach of [52] by solving a continuous optimization problem using gradient descent algorithms. Second, as described in Section 3.1.2, our approach can identify multiple experiments simultaneously. Designing batch experiments is advantageous because interactions between the experiments are taken into account; we will demonstrate this advantage empirically in the next section. Our continuous design approach also takes several steps beyond [52]. First, we do not employ particular patterns or partitions to simplify the design space. And as described earlier, we introduce a sample-average approximation of the IVAR objective. This helps design experiments on complex input domain geometries by penalizing proximity to the domain boundaries; the objective automatically becomes large in locations where there are few samples. We will also address the “pileup” problem identified in [52], explaining it in terms of the design size relative to correlation kernel complexity.

### 3.3 Computational complexity

In some sense, if evaluating is sufficiently expensive, then the computational cost of finding a good design is immaterial. But the design procedures described above can have very different costs, and in practice one may not want the computational effort required for experimental design to be too large. Table 1 summarizes the computational complexity of evaluating and optimizing the various experimental design objectives considered above. In the IVAR scenarios, denotes the number of Monte Carlo points sufficient for (10), where typically . For the entropy and MI criteria we assume that the number of candidate designs . Finally, we assume that optimizing the IVAR objective requires objective and/or gradient evaluations, where typically .

Greedy minimization of the conditional entropy (ALM) requires iterations. In each iteration the variance must be computed at locations.4 At iteration , this variance computation requires an inversion followed by variance evaluations, each of complexity . Thus the complexity for step is . The total complexity is thus .

The ALM approach is often the computationally cheapest option, but not if . If and are of comparable magnitude, then the comparison of entropy and IVAR minimization depends on whether , i.e., how the number of optimization iterations (in the continuous case) compares to the number of candidate design points (in the discrete case). We also see that the MI design becomes very expensive for large ; a small candidate design set must be chosen for the MI procedure to remain tractable. Using a smaller candidate set, however, increases the possibility that the chosen designs will perform poorly.

## 4 Numerical examples of IVAR designs

We now provide numerical examples illustrating the quality of designs arising from continuous IVAR minimization. All of the examples in this section were performed using the freely available GPL-licensed GPEXP package [25] for python. First, we examine the interpolatory properties of IVAR design points. Then we compare IVAR, entropy, and MI-based designs on several domains. We also include comparisons with space-filling designs obtained by the method of Lekivetz and Jones (LJ) [35], which is well suited for many non-rectangular regions. The LJ algorithm works by grouping randomly sampled points into equally sized clusters using Ward’s mininum variance criterion; one design point is then extracted from each cluster (for instance, by computing the cluster centroid). The LJ approach encounters difficulties for non-convex domains, and thus we do not use it in the non-convex test case of Section 4.2.3.

### 4.1 Stability of interpolation

As discussed above, GP regression with (and a kernel rank ) corresponds to interpolation. In this setting, it is useful to analyze the quality of interpolation on IVAR design points. The Lebesgue constant is often used to bound the error of a polynomial interpolant relative to the best approximation in a polynomial space of equivalent degree, but it can also be used to analyze interpolation with positive definite kernels, corresponding here to the posterior mean of the GP. Let denote the interpolant or posterior mean, and let denote the best approximation of , in the sense, from the finite-dimensional kernel space . We restrict our attention to approximation on compact domains . Then the interpolation error is bounded as [17]

 ∥f−m∥L∞(X)≤(1+ΛK,xN)∥f−m∗∥L∞(X),

and moreover we have [13],

 ∥m∥L∞(X)≤ΛK,xN∥y∥ℓ∞.

The Lebesgue constant for prior kernel and design points is

 (11) ΛK,xN=maxx∈XN∑j=1|uj(x)|,

where are the cardinal functions satisfying , such that the interpolant can be written as with . We evaluate the cardinal functions as in [17] by solving the linear system

 (12) R−1u(x)=K(x,x),

where .

Now we conduct a simple numerical experiment to evaluate the Lebesgue constants of point sets arising from IVAR minimization. We find IVAR designs on the domain with squared exponential kernel . Figure 1 shows the associated Lebesgue constants as a function of number of design points , for various correlation lengths . Several interesting trends are observed. First, we see that for any value of , the Lebesgue constant is exactly one for sufficiently small . This observation is consistent with the asymptotic estimate for in [50]. The Lebesgue constant reverts to one for small point sets because the IVAR criterion ensures that the points are well separated, and hence the cardinal functions do not overlap. Once a certain threshold value of is attained, however, the Lebesgue constant begins to increase; this threshold value is smaller for larger values of the correlation length . This transition coincides with interactions among the cardinal functions: Figure 2 shows cardinal functions for and , just beyond the threshold where interaction becomes significant. In this regime, the Lebesgue constant increases steadily with . For a sufficiently large , becomes poorly conditioned and direct computations of the interpolant, the cardinal functions, and the Lebesgue constant are no longer numerically stable. From the Bayesian perspective, this ill-conditioning corresponds to the “complexity” of the RKHS associated with the prior kernel—i.e., the effective number of nonzero eigenvalues in (5)—being exceeded by the data.

The success of the IVAR optimization procedure itself also depends on how relates to the complexity of the kernel. In the small- and small- regime (intuitively, “too few” points relative to the complexity of the kernel), the IVAR cost function is relatively flat as a function of the design coordinates , and distinguishing the quality of different designs becomes difficult. By contrast, in the large- and large- regime (“too many” points relative to the complexity of the kernel), the IVAR value itself is exceedingly small and the problem is poorly conditioned, as described above. Away from these limiting regimes, however, the IVAR design procedure yields relatively slow growth in the Lebesgue constant and thus relatively stable interpolation.

### 4.2 Approximation comparisons

Here we illustrate IVAR designs on variety of input domains , and compare the performance of IVAR designs with that of designs obtained through conditional entropy minimization or MI maximization. We highlight designs and approximations on irregular input domains (e.g., domains that are neither hypercubes nor ), which are critical in many real-world applications. In particular, irregular domains often arise as a result of domain partitioning by a discontinuity detection algorithm, for models whose outputs are piecewise smooth. For example, in [29, 26] the authors automatically partition the input domain of a genetic toggle switch model that exhibits a phase transition. Following this partitioning, function approximation proceeds on two separate but irregular domains. In [55, 56] the authors study a climate model which exhibits a discontinuity. Discontinuity detection is used to split the input domain into two irregular subdomains, and is followed by function approximation. We will thus attempt to show the applicability of our algorithm to input domains that are characteristic of such problems, which have no guarantees of convexity or even connectedness.

In most of our numerical experiments, we employ an isotropic squared exponential kernel and we set ; the only exception is in Section 4.2.2, which uses a periodic kernel to contrast space-filling designs with designs that are adapted to the kernel. Entropy minimization is pursued through the ALM approach described earlier; each experiment is chosen by comparing the variance at possible experimental locations sampled according to the parameter measure on . For MI maximization, we select designs from candidate locations generated by the space-filling LJ design. The size of the MI candidate set is chosen so that the computational times of the different experimental design procedures are comparable; it is also comparable to sizes used in the literature [3]. We find that small changes in the size of this candidate set do not qualitatively change the results reported below.

Finally, we define the relative error of a GP approximation as , and we estimate it using Monte Carlo samples. While in the previous section we considered error, recall that the IVAR objective function reflects the expectation of a squared -averaged error. Thus is a logical metric of quality. Other efforts, e.g., [37], empirically evaluate the and errors of approximations resulting from a variety of other design criteria.

#### Circular domain

We first construct designs on a circular domain in with radius 0.7. We fix the correlation length in the kernel to ; the measure is uniform over the domain. Results from each design strategy are shown in Figure 3, for designs with , , and points. We consider full batch IVAR, designing all points simultaneously, along with greedy IVAR strategies that add points in groups of or . Full IVAR results in the most symmetric and regularly spaced points. But all of the IVAR strategies attempt to spread the points throughout the domain, whereas entropy design first places points on the boundaries, which is not a desirable feature for radial basis function kernels [48, 34]. The LJ designs are reasonably space-filling, as expected, but full IVAR yields even more consistent spacing. For timing reference, we note that finding a 40-point design with the full IVAR criterion took 11 seconds, IVAR-1 took 65 seconds, IVAR-4 took 29 seconds, entropy-based design took 10 seconds, and MI design took 79 seconds. Thus non-greedy IVAR design and entropy design take approximately the same amount of time, while MI maximization is slightly more costly, though of the same order of magnitude.

To evaluate the effectiveness of these designs for function approximation, we perform GP regression on 1000 functions independently sampled from the prior Gaussian process, with results shown in Figure 4(a). For each sampled function , we compute the relative error between the posterior mean of the GP and , as described above. We then report the average and standard deviation of this error, for each design strategy and different values of . The IVAR designs, including the IVAR- greedy strategies, clearly outperform the entropy and MI designs. Optimality of the full IVAR designs is to be expected, since we are essentially calculating the Bayes risk (the expectation of ) which is equivalent to IVAR, as discussed in Section 3. But it is noteworthy that even the greedy IVAR strategies show better performance than the MI and entropy designs. The LJ designs show errors comparable to the greedy IVAR strategies, but not as low as full IVAR. In Figure 4(b), we repeat this study for a higher-dimensional domain—a ball in —and observe similar trends. For reference, the computational times required to find 40-point designs in dimensions are: 13 seconds for batch IVAR, 125 seconds for IVAR-1, 45 seconds for IVAR-4, 10 seconds for entropy, and 82 seconds for MI. These results further support the idea that full IVAR is computationally competitive with entropy and that design with MI is more expensive.

#### Periodic kernel

The IVAR, MI, and entropy-based design criteria explicitly incorporate the kernel of the GP, while the space-filling LJ design approach does not. Thus it is reasonable to expect kernel-adapted designs to be more successful for a broader range of kernel specifications. To assess this behavior, we repeat the experiments of the previous subsection on a square domain with a periodic kernel. In particular, we use the kernel

 K(x,y;p,l)=exp(−2d∑i=1sin2(π|x−y|/p)l2),

on the domain , with period and correlation length . The resulting design points are shown in Figure 5 and the function approximation results are shown in Figure 6. Now, optimal designs for the IVAR, entropy, and MI schemes are not space filling. They are clustered in areas that reflect the periodic structure of the covariance kernel. The space-filling LJ nodes do not exploit this property and, as shown in Figure 6, yield higher errors than the kernel-adapted nodes.

#### Non-convex, non-simply connected domain

Figure 7 illustrates experimental design on a parameter domain that is neither convex nor simply connected, found via full (non-greedy) IVAR minimization. These designs are obtained with an isotropic squared exponential kernel with correlation length . The IVAR objective is minimized using the SAA approach described in Section 3.1.1. Optimization for each -point design began from a single randomly-generated point in the -dimensional design space; no multi-start procedure was used here. The designs do a good job covering the domain. While twelve- and sixteen-point designs are not completely evenly distributed, designs using 32 and 60 points are very well spaced and have interesting symmetries. The approximation performance of IVAR designs—even greedy IVAR- designs—is also superior to that of the entropy and MI approaches, as shown in Figure 8.

## 5 Comparisons with pseudospectral approximation

Pseudospectral approximation (PSA) is a well-established approach for computing polynomial approximations from pointwise function evaluations [63, 8, 7]. Recall that PSA comes equipped with an experimental design procedure and an algorithm which seeks accuracy in an sense. In Sections 24, we described an experimental design procedure for GP regression that also targets an expected -type error. Thus we are now well positioned to make a comparison between GP regression and PSA. In particular, our comparison of these surrogate construction methods arises from the perspective of the three choices described in the introduction: the approximation space, the experiments, and the algorithm for constructing the approximation itself. Our comparison comprises three components described in Sections 5.25.3, and 5.4. We first discuss a theoretical result bounding the error between the GP posterior mean and the PSA under orthogonalizing designs. Then we show what happens when we relax the idea of exact orthogonality. Finally, we provide some numerical comparisons between the two methods on canonical approximation problems.

Overall, our goal is not so much to compare performance but rather to understand the links and distinctions between these approaches. In particular, we would like to know: are GP regression and pseudospectral approximation ever equivalent? Do the point sets used for pseudospectral approximation yield low IVAR when used for GP regression? How do IVAR-minimizing designs compare with quadrature rules, for appropriate choices of kernel?

### 5.1 Pseudospectral approximation

In this section, we recall the basics of pseudospectral approximation. Consider a set of basis functions comprising a complete orthonormal system in . Pseudospectral approximation takes advantage of orthonormality

 ⟨ψi,ψj⟩μ=∫ψi(x)ψj(x)dμ(x)=δij,

to compute the projection of a function onto a subspace of spanned by basis functions. The orthogonal projection

 fℓ(x)=ℓ∑i=1⟨f,ψi⟩μψi(x).

converges in the sense as the subspace grows: . Pseudospectral approximation departs from exact orthogonal projection by using numerical quadrature to approximate the inner products . In particular, one seeks an orthogonalizing set of nodes and weights, , i.e., a quadrature rule that computes the inner products between the first basis functions exactly:

 (13) ⟨ψi,ψj⟩μ=N∑k=1wkψi(xk)ψj(xk), i,j=1,…,ℓ.

This choice eliminates internal aliasing error [7], though the pseudospectral approximation will still exhibit external aliasing error, since is not necessarily in the span of . Given an orthogonalizing rule (13), a pseudospectral approximation can be written as

 (14) ^fℓ(x)=ℓ∑i=1(N∑k=1wkf(xk)ψi(xk))ψi(x)=N∑k=1[wkf(xk)(ℓ∑i=1ψi(xk)ψi(x))].

In subsequent analysis, it will be convenient to express in matrix notation. Let , and . Then

 (15) ^fℓ(x)=yTWℓ∑i=1ψiψi(x).

Examples of and include one-dimensional orthogonal polynomial families and their corresponding Gaussian quadratures; or tensorized versions of each in multiple dimensions. However, the basis functions do not in general need to be polynomials, and other quadrature rules besides Gaussian rules may exist. Using Mercer’s theorem, we can already see that can be interpreted as a truncated kernel with eigenvalues that do not decay, and we can interpret the pseudospectral approximation given in (15) as a weighted combination of such kernels.

### 5.2 Same approximation space and experiments

To begin our comparison, we will fix two of the choices involved in surrogate modeling and evaluate the impact of the third. In particular, this section will compare the impact of different algorithms—i.e., using pseudospectral approximation versus GP regression—for identical experiments and for the same or similar approximation spaces. Because pseudospectral approximation requires experiments to be chosen according to an orthogonalizing rule, we will also use these experiments for GP regression. We are now left to relate the basis for the pseudospectral approximation to the GP kernel, as these objects determine the approximation space.

Theorem 2.1 lets us represent any Mercer kernel, and the corresponding GP posterior mean function, using the eigenfunctions. With this connection, we first develop a more general result than required. Specifically, we show that when the basis for pseudospectral approximation comprises a subset of the eigenfunctions of a given kernel, then the -norm of the difference between the resulting GP posterior mean function and the pseudospectral approximation may be bounded. Results for identical approximation spaces follow immediately as a corollary of this general case.

For the result below, we will assume that we have a Mercer kernel consisting of eigenfunctions, where could be finite or infinite, and a spectral expansion consisting of the first eigenfunctions, where is finite. The approximations will be constructed from evaluations of the function , performed at nodes of a quadrature rule , that orthogonalizes the first eigenfunctions. Clearly depends on . Let be the eigendecomposition of the matrix of covariances among all the design points computed without a nugget, i.e., . is a diagonal matrix with elements . Finally, assume a zero mean prior for the GP model. The latter assumption does not restrict the generality of our results. In fact, one can transform the problem from one of approximating to one of approximating , for some fixed function . Of course, this transformation requires one to translate the data accordingly. Under these conditions, we have the following result, whose proof is given in Appendix A.

{theorem}

Let be a pseudospectral approximation represented with basis functions , computed via an orthogonalizing quadrature rule as in (13)–(14). Let . Also, let . Let be the posterior mean of GP regression with prior covariance kernel and nugget term , constructed from function evaluations at the nodes of . If , for , and , then the difference between these two approximations is bounded as

We can consider a relative notion of error by dividing each side by . Furthermore, since we are dealing with bounded functions and finite amounts of data, we can always normalize and center the data to make . The difference between the two approximations described in Theorem 5.2 has two sources. First is the contribution of kernel eigenfunctions that are not in the pseudospectral approximation basis, i.e., the summations involving above. Second is the contribution of the noise term .

Because the kernel eigenfunctions span the RKHS containing the GP posterior mean, the case of equivalent approximation spaces for pseudospectral approximation and GP regression follows simply by setting . This case is highlighted by Corollary 5.2.

{corollary}

Let . Then the difference between the pseudospectral approximation and GP posterior mean defined in Theorem 5.2 is:

 (16) ∥m−^fℓ∥2L2μ≤∥y∥22ℓNM2w2max(σ2sN+σ2)2.

In this case, the difference between approximations is due only to the nugget and the smallest eigenvalue of the covariance matrix as indicated by . If we additionally have that , then the design covariance matrix remains invertible (i.e., with ) even as , yielding zero difference between the approximations. This occurs, for example, in the case of kernels constructed from fully tensorized polynomial eigenfunctions and tensorized Gaussian quadrature rules, where .

We also note that the bound in Theorem 5.2 depends on the minimum eigenvalue of the covariance matrix, represented (after diagonalization) by . If is nearly rank-deficient and the nugget is sufficiently small, then the bound can be large. This situation is not purely an artifact of the theory; indeed, it corresponds to a poorly conditioned numerical problem, where the actual difference between the two approximations may be large as well. One may imagine a case where is not invertible, e.g., too many quadrature nodes are used and the GP kernel is finite rank; in this case the computation of becomes unstable whereas the computation of can still proceed in a stable manner.

Besides bounding the difference between approximations, we can also analyze the impact of an orthogonalizing experimental design on the integrated variance of the GP posterior. Consider, again, a kernel consisting of eigenfunctions and a training set corresponding to the nodes of a quadrature rule that orthogonalizes the first eigenfunctions. We begin by splitting (6) into summations involving the first eigenfunctions and the remaining to eigenfunctions:

 ∫c(x)dμ(x) =ℓ∑i=1λi(1−λiϕTiRϕi)+ℓGP∑i=ℓ+1λi(1−λiϕTiRϕi).

The second term in this expansion represents the contribution of the extra eigenfunctions to the integrated variance. We cannot comment on how our training points will affect this term because they are designed to have special properties only for the first eigenfunctions. But we can use these properties to analyze the first term above, thus describing the impact of an orthogonalizing rule on the associated integrated variance. Note that the weights do not explicitly enter the GP regression; nonetheless, as we show in Appendix B, this portion of the integrated variance can be rewritten as:

 (17) ℓ∑i=1λi(1−λiϕTiRϕi) =ℓ∑i=1λiϕTiW⎛⎝ℓGP∑j=ℓ+1λjϕjϕTj+σ2I⎞⎠Rϕi.

This expression consists of interactions between the first eigenfunctions and the last eigenfunctions; it also includes the impact of the nugget. The eigenfunction interactions are expected because the design only orthogonalizes the first eigenfunctions; errors are incurred when the numerical inner product is taken between one of the first eigenfunctions and one of the remaining eigenfunctions. We see that if , the integrated variance is of the order of the noise as expected. When additionally , the integrated variance is zero. An orthogonalizing design ensures that the integrated variance captures only the contributions of eigenfunctions which are not orthogonalized.

The preceding results highlight some of the assumptions underlying the practical application of these two surrogate construction methodologies. When using a pseudospectral approximation, one implicitly assumes that the true function’s projection onto basis functions not included in the expansion (and hence not orthogonalized by the underlying design) is small. Otherwise, more points and basis functions should be added to the approximation; indeed, adaptive basis selection is the concern of a vast array of approximation methods, pseudospectral and otherwise. In GP regression, a properly chosen kernel is one whose eigenvalue decay matches the decay of the spectral coefficients of the true function. In this case the function will lie in the RKHS associated with the kernel and an accurate approximation can readily be achieved. These two approaches interact through the integrated variance. The orthogonalizing rule in a pseudospectral approximation ensures that the difference between and lies mostly in the span of the eigenfunctions . Correspondingly, this rule forces the IVAR, a measure of the uncertainty and error (via the Bayes risk) of the GP approximation, to retain contributions only from the same subspace (that spanned by ). We also see that the difference between the GP posterior mean and pseudospectral approximation is dominated by these extra eigenfunctions.

### 5.3 Relaxing exact orthogonality

It is instructive to consider the tradeoff between exactly orthogonalizing fewer basis functions or, for the same design points, generating an approximation using more basis functions than can be orthogonalized. The latter corresponds to performing GP regression under the conditions of Theorem 5.2 with .

Consider the function for with standard Gaussian weight . We will use a Gauss-Hermite quadrature rule with points to construct a pseudospectral approximation of with the first Hermite polynomials (degrees 0 to 19) as basis functions. We compare this approximation to Gaussian process regression on the same points using the Mehler kernel [57]

 (18) K(x,y;t)=1√1−t2exp(−12(x)2t2−2txy+(y)2t21−t2),

which is a closed-form expression for , where are again normalized Hermite polynomials. The nugget and the decay constant . In this way we compare two surrogates using identical experiments and closely related approximation spaces: the approximation space of the former is contained in that of the latter. But the two surrogates use different approaches for combining the same function evaluations into an approximation. GP regression approximates using an infinite collection of polynomial eigenfunctions, with more emphasis on those associated with higher eigenvalues, while pseudospectral approximation projects onto finite polynomial basis, with the exactness of the projection limited by external aliasing.

Figure 9 shows the results of this experiment. First, we note that the relative errors are for pseudospectral approximation and for the GP mean function. Perhaps more interesting than this improvement, however, is the spectrum of the error associated with each approximation. The left panel of Figure 9 shows the projections of the errors and onto each basis function , i.e., . For reference, the right panel shows how much energy has in each basis direction, via the magnitudes of the exact projections . (All projections are computed with extremely high-order quadrature.) We see that the projection of the pseudospectral approximation error onto the first few basis functions is small, even though itself has significant energy in these directions. The projection of then rises with the basis index and peaks around an index of 20; it then begins to decay, just as the exact coefficients themselves decay in magnitude. The projection of the GP error, on the other hand, is flatter across the indices. It rises slightly for the first few basis functions and then begins to decay somewhat more slowly. The basis functions for which the orange line (GP error) lies below the red line (PSA error) correspond to relatively large coefficient values; this results in a lower error overall.

As a preview of the next section, we include a third line in Figure 9 representing GP regression performed with a 20-point IVAR-minimizing design. The relative error of this approximation is even smaller: . And the spectrum of its error, shown with the grey line in the left panel, is even flatter than that of the quadrature-based GP approximation. This result amplifies the previous trend: compared to pseudospectral approximation, GP regression spreads its error more evenly over the spectrum. Departing from an orthogonalizing design allows this error to be spread even more broadly.

### 5.4 Similar approximation spaces, optimal experiments

Our next goal is to compare GP regression and pseudospectral approximation when the approximation spaces are similar, but the experimental designs (and the algorithms) differ. We have already seen that RKHS containing the GP mean can coincide with the range of the pseudospectral approximation operator in the case of a finite-rank GP kernel: the eigenfunctions of the kernel can simply be used as the basis for the pseudospectral approximation. In the numerical comparisons below, however, we would like to allow the GP kernel to have infinite rank. This choice is more representative of how GP regression is used in practice, and in principle may allow the GP mean to better approximate a wider variety of functions. Of course, the eigenvalues of the kernel need to decay, so that we have a bounded kernel.

We will therefore use the Mehler kernel (18) as in Section 5.3, and consider target functions whose inputs are endowed with standard Gaussian weight on . The eigenfunctions of the Mehler kernel are Hermite polynomials, which we use as basis functions for pseudospectral approximation. In dimensions, we use a tensorized version of the Mehler kernel: , where now governs the decay rate of the eigenvalues associated with the univariate eigenfunctions in dimension .

To design experiments for GP regression, we use IVAR minimization in combination with adaptation of the covariance kernel hyperparameters (e.g., correlation lengths) and the nugget . In particular, we choose these parameters by maximizing the log-marginal likelihood,

 (19) logp(^y|x,θ)=−12^yTR^y−12log|R−1|−N2log2π,

with respect to the complete set of hyperparameters (including ), denoted by . Our approach interleaves kernel adaptation with batch IVAR design. The batch size for each design is described in each example, but the number of Monte Carlo points used to evaluate the IVAR objective is fixed at .

For pseudospectral approximation, we use a state-of-the-art adaptive Smolyak algorithm [7], which adaptively enriches both the approximation basis and the experimental design using a greedy heuristic. The approximation is essentially a Smolyak sum of full tensor-product polynomial approximations, each computed using a pseudospectral approach. We use this approach because a regular tensor-product approach becomes infeasible for more than a few dimensions; hence many sparse grid [2] and dimension-adaptive [24] polynomial approximation algorithms have been developed. The Smolyak algorithm in [7] uses generalized sparse grids and dimension adaptivity and thus provides a useful benchmark for comparison with kernel-adaptive GP. It is implemented in the MIT Uncertainty Quantification framework (MUQ) [47]. Since the input domain is endowed with Gaussian measure, the sparse grid design is based on one-dimensional Gauss-Hermite quadrature rules, with the number of points growing exponentially with the level of the sparse grid.

The first problems we consider are two dimensional and additively separable. Additive separability is a property favorable to the Smolyak algorithm, in part because of the greedy heuristic the algorithm uses to enrich the polynomial basis. The target functions are:

 (20) f1(x) =x(1)+(x(2))2, (21) f2(x) =sin(2πx(1))+(x(2))2.

The batch size for the closed-loop IVAR scheme is one training point. Adaptation is performed for the eigenvalue decay rates and in each dimension and for the noise term . Figures 10 and 11 show the resulting relative errors and hyperparameter traces, as a function of the number of training points. As in Section 4, the relative errors are defined as and for Smolyak pseudospectral approximation and GP/IVAR, respectively. These errors are computed using 10000 Monte Carlo points.

In these examples, we observe that the adaptive Smolyak algorithm requires several function evaluations until it begins to converge. Once it converges, however, the relative error drops to machine precision. (Recall that the quadrature rules used in the Smolyak scheme require adding several points at a time.) The GP approximation error, on the other hand, decreases steadily but gradually as additional points are added and as the kernel is refined. In both cases, by the time the pseudospectral approximation becomes accurate, the GP approximation has already achieved at least relative error, perhaps sufficient for many applications. The hyperparameter traces show how the GP covariance kernel adapts to the function being approximated. Function (20) is fairly low order in both dimensions, and the hyperparameter values and both converge to fairly low numbers, indicating fast eigenvalue decay. The converged value of is small because only the constant and linear eigenfunctions (in the first dimension) are updated by the data; converges to a slightly higher value, thus slowing the eigenvalue decay, to account for the quadratic term in . Function (21), on the other hand, is relatively high order in the first dimension but low order in the second. We thus observe that converges to , corresponding to a slower decay, and that converges to roughly ; the hyperparameter optimization has “learned” that only the first few eigenfunctions in this dimension matter.

Next we perform the same experiments on the three-dimensional Ishigami function

 (22) f(x)=sin(x(1))+7sin2(x(2))+0.05(x(3))4sin(x(1)),

The Ishigami function is not additively separable, and we expect to see even better relative performance of GP regression in this example since the kernel eigenfunctions include the tensor products of all univariate Hermite polynomials. Here we use a batch size of experiments for IVAR design and hyperparameter adaptivity. The results are shown in Figure 12.

We see that the GP begins to outperform the pseudospectral approximation after roughly 20 function evaluations, with an error that decreases consistently. The error from the adaptive Smolyak approach, on the other hand, plateaus in several regions. After 50 iterations, the GP approximation reaches a relative error of , while it takes pseudospectral approximation 100 iterations to reach the same relative error. This behavior may be attributed to the fact that the Mehler kernel accounts for interactions among the input variables immediately, whereas the dimension-adaptive Smolyak approach requires some exploration (corresponding to the error plateaus) to find the basis functions that capture these interactions.

#### Ten-dimensional Genz function

Finally, we consider a higher-dimensional example using the oscillatory Genz function [23]:

 (23) f4(x)=cos(2πw1+10∑i=1x(i)ci),

where and the are chosen randomly from a uniform distribution on and then normalized to . This Genz function is typically evaluated on a hypercube domain with normalization . We have found the present approximation problem, with an unbounded and Gaussian-weighted domain, to be significantly more challenging, however, due to the tail behavior of the Hermite polynomials. A comparison of approximation errors, again between GP/IVAR using the Mehler kernel and adaptive Smolyak pseudospectral approximation, is shown in Figure 13. The GP approximation performs extremely well. We note that this Genz function involves non-additive coupling among all ten input dimensions, a feature that may amplify the benefits of including a fully tensorized set of eigenfunctions via the GP kernel. For the GP regression calculations, we initially added experiments in batches of , learning the hyperparameters after each batch, until obtaining 700 experiments total. Then we added experiments at a time. In the future, it may be useful to automatically stop adapting the hyperparameters once the changes in the hyperparameters between iterations fall below some threshold.

Note that we were able to compare GP and pseudospectral approximation in these examples because we had access to the Mehler kernel, and hence explicit evaluations of the Hermite basis functions were not required for GP regression. In practice, we may not have closed-form kernels for other basis function families, typically used in pseudospectral approximation. Truncated kernels whose eigenfunctions are polynomials, however, can often be represented using the Christoffel-Darboux formula [5, 11]. For normalized basis functions, the Christoffel-Darboux formula is essentially a finite rank kernel with equal eigenvalues, and thus for a fixed basis order, one would not be able to adapt this kernel.

## 6 Conclusion

This paper has examined experimental design criteria and optimization procedures used to develop surrogates of computational models, highlighting aspects of the interplay between experimental designs and approximation algorithms. In the first part of the paper, we discussed experimental design criteria for Gaussian process regression and presented an algorithm for minimizing the posterior integrated variance (IVAR) of a Gaussian process over a continuous design space. Our approach is adapted to the kernel structure; for instance, with isotropic squared exponential kernels, it yields well-spaced points on arbitrary complex domains, avoiding boundary clustering and other undesirable artifacts. IVAR points also have good interpolation stability, as measured by the Lebesgue constant for kernel interpolation. The underlying optimization problem is tractably solved with gradient descent methods, as long as the number of design points is not too small or too large relative to the complexity of the prior covariance kernel. Our numerical experiments also demonstrate that simultaneously designing multiple points can yield substantial benefits over greedy strategies. Nonetheless, even greedy minimization of IVAR yields better approximation performance than standard greedy algorithms for minimizing entropy or maximizing MI.

In the second part of the paper, we compare GP regression to polynomial approximation. For simplicity, we consider only pseudospectral approximation, using nodes and weights that orthogonalize the finite set of basis functions used for approximation. In this setting, when GP regression uses the same nodes and a kernel with polynomial eigenfunctions, the difference between the two approximations is due only to the GP “nugget” and truncation of the kernel. When instead coupled with an IVAR design, GP regression for an infinite-rank kernel sacrifices numerical orthogonality of any given set of eigenfunctions; compared to the pseudospectral approach, projection errors may be larger for the first few eigenfunctions but are more nearly constant over the eigenspectrum. This observation is reminiscent of the average-case quadrature of [40], which compares a Bayesian method for deriving quadrature nodes with Gauss rules of fixed degree. We follow these comparisons with an empirical study of approximation performance on various test functions, comparing adaptive variants of GP regression and sparse pseudospectral approximation. We observe that while additively separable functions lend themselves easily to sparse polynomial approximations, more strongly coupled functions can be more efficiently approximated using the GP approach. Kernel approximation is also well suited to complex (e.g., non-tensorized) input domains. In our current design approach, while eigenfunctions of the kernel integral operator on the domain need not be explicitly computed, eigenfunctions that couple the input dimensions are all implicitly present.

Future work can extend our analysis of experimental design for GP regression in many ways. It is useful to compare the present design criteria with other methods for choosing “good” interpolation nodes in the radial basis function literature [12, 14]. Comparing IVAR-optimal nodes to optimal nodes for “Bayesian quadrature” [40, 46] would also be of great interest. More broadly, we advocate for closer connections between the numerical analysis literature and the statistics literature on these topics. Finally, adaptive designs that interleave point selection with updates to the kernel would benefit from more rigorous study. Current sequential approaches, including the ones presented here, are perhaps reasonable heuristics. But look-ahead strategies that anticipate information gain from future designs, and that balance information for kernel adaptation with reduction of the conditional posterior variance, may lead to even more effective approximation procedures.

### Acknowledgments

The authors gratefully acknowledge BP for funding this research. We thank Tarek Moselhy, Akil Narayan, and Alessio Spantini for many helpful discussions. We also thank the referees and editors for their thoughtful comments and suggestions.

## Appendix A Proof of Theorem 5.2

First we recall some notation. The prior precision matrix is . We define and to be the matrices associated with the eigenvalue decomposition .

We assume, without loss of generality, that the prior mean is zero; from (1) we obtain

 m(x)=yTRℓGP∑j=1λjϕjϕj(x) =yTRℓ∑j=1λjϕjϕj(x)+yTRℓGP∑j=ℓ+1λjϕjϕj(x) =yTRℓ∑j=1λjϕjϕj(x)+a(x),

where we have replaced with its eigenfunctions and then separated the posterior mean into two terms, the first containing the first eigenfunctions and the second denoting the rest.

We now use the orthogonality rule (13) to obtain . Recall that the pseudospectral approximation