Evaluating Gaussian Process Metamodels and Sequential Designs for Noisy Level Set Estimation
Abstract
We consider the problem of learning the level set for which a noisy black-box function exceeds a given threshold. To efficiently reconstruct the level set, we investigate Gaussian process (GP) metamodels. Our focus is on strongly stochastic samplers, in particular with heavy-tailed simulation noise and low signal-to-noise ratio. To guard against noise misspecification, we assess the performance of three variants: (i) GPs with Student-$t$ observations; (ii) Student-$t$ processes (TPs); and (iii) classification GPs modeling the sign of the response. In conjunction with these metamodels, we analyze several acquisition functions for guiding the sequential experimental designs, extending existing stepwise uncertainty reduction criteria to the stochastic contour-finding context. This also motivates our development of (approximate) updating formulas to efficiently compute such acquisition functions. Our schemes are benchmarked by using a variety of synthetic experiments in 1–6 dimensions. We also consider an application of level set estimation for determining the optimal exercise policy of Bermudan options in finance.
1 Introduction
1.1 Statement of Problem
Metamodeling has become widespread for approximating expensive black-box functions that arise in applications ranging from engineering to environmental science and finance (Santner et al., 2013). Rather than aiming to capture the precise shape of the function over the entire region, in this article we are interested in estimating the level set where the function exceeds some particular threshold. Such problems are common in cases where we need to quantify the reliability of a system or its performance relative to a benchmark. Level set estimation also arises intrinsically in control frameworks where one wishes to rank the payoff from several available actions (Hu and Ludkovski, 2017).
We consider a setup where the latent $f : D \to \mathbb{R}$ is a continuous function over a $d$-dimensional input space $D \subseteq \mathbb{R}^d$. The level-set estimation problem consists in classifying every input $x \in D$ according to

(1.1) $S := \{x \in D : f(x) \geq 0\}$ versus its complement $D \setminus S = \{x \in D : f(x) < 0\}$.

Without loss of generality the threshold is taken to be zero, so that level set estimation is equivalent to learning the sign of the response function $f$. For later use we also define the corresponding zero-contour of $f$, namely the partition boundary $\partial S = \{x \in D : f(x) = 0\}$.
For any $x \in D$, we have access to a simulator $Y$ that generates noisy samples of $f$:

(1.2) $Y(x) = f(x) + \epsilon(x)$,

where $\epsilon(x)$ are realizations of independent, mean-zero random variables with variance $\sigma^2(x)$.
To assess a level-set estimation algorithm, we compare the resulting estimate $\hat{S}$ with the true $S$ in terms of their symmetric difference. Let $\mu$ be a probability measure on the Borel $\sigma$-algebra $\mathcal{B}(D)$ (e.g., the normalized Lebesgue measure). Then our loss function is

(1.3) $L(\hat{S}) := \mu(S \,\Delta\, \hat{S})$,

where $S \,\Delta\, \hat{S} = (S \setminus \hat{S}) \cup (\hat{S} \setminus S)$. Frequently, the inference is carried out by first producing an estimate $\hat{f}$ of the response function; in that case we take $\hat{S} = \{x \in D : \hat{f}(x) \geq 0\}$ and rewrite the loss as

(1.4) $L(\hat{f}) = \int_D \mathbb{1}\{\operatorname{sign} \hat{f}(x) \neq \operatorname{sign} f(x)\}\, \mu(dx)$,

where $\mathbb{1}\{\cdot\}$ is the indicator function.
1.2 Motivation
As a concrete example of level set estimation, consider the problem of evaluating the probability of failure, determined via the limit state $\{x : f(x) \leq 0\}$ of a performance function $f$ (Picheny and Ginsbourger, 2013). The system is safe when $f(x) > 0$, and fails otherwise. In the context where the performance function can be evaluated via deterministic experiments, the estimation of the safe zone $S$ (more precisely its volume $\mu(S)$) was carried out in Bect et al. (2012) and Mukhopadhyay et al. (2005) employing a Gaussian process approach with a sequential design. A related example dealing with the probability of failure in a nuclear fissile chain reaction appeared in Chevalier et al. (2014a).
Another application, which motivated the present investigation, comes from simulation-based algorithms for valuation of Bermudan options (Gramacy and Ludkovski, 2015; Ludkovski, 2018). This problem consists of maximizing the expected reward $h(\tau, X_\tau)$ over all stopping times $\tau$ bounded by the specified horizon $T$:

(1.5) $\sup_{\tau \leq T} \mathbb{E}\left[ h(\tau, X_\tau) \right]$,

where $X_t$ is the underlying asset price at time $t$, typically satisfying a stochastic differential equation, and $\Delta t$ is the frequency of exercising. The approach in the so-called Regression Monte Carlo methods (Longstaff and Schwartz, 2001; Tsitsiklis and Van Roy, 2001) is to convert the decision of whether to exercise the option or continue, when at intermediate step $t$, into comparing the immediate reward $h(t, x)$ vis-à-vis the reward-to-go $C(t, x)$. In turn this is equivalent to determining the zero level set (known as the continuation region) of the timing value $C(t, x) - h(t, x)$. The stopping problem (1.5) is now solved recursively by backward induction over $t$, which allows noisy samples of the timing value to be generated by simulating a trajectory emanating from $x$ and evaluating the respective pathwise reward-to-go. Probabilistically, this means that we are interested in (1.2) where $f$ corresponds to a conditional expectation related to a path-dependent functional of the Markov process $X$; the loss function (1.3) arises naturally as a metric regarding the quality of the estimated stopping rule in terms of the underlying distribution of $X_t$. We refer to Ludkovski (2018) for a summary of the existing state of the art and the connection to employing a GP metamodel for learning the timing value.
1.3 Design of Experiments for Contour Finding
Reconstructing $\partial S$ via a metamodel can be divided into two steps: the construction of the response model and the development of methods for efficiently selecting the simulation inputs $x_{1:n}$, known as design of experiments (DoE). Since the level set is intrinsically defined in terms of the unknown $f$, an adaptive DoE approach is needed that selects the $x_n$'s sequentially.
For the response modeling aspect, GP regression, or kriging, has emerged as the most popular nonparametric approach for both deterministic and stochastic black-box functions (Bect et al., 2012; Gramacy and Lee, 2009; Picheny et al., 2013a; Jalali et al., 2017). GPs have also been widely used for the level-set estimation problem; see Bryan and Schneider (2008); Gotovos et al. (2013); Hu and Ludkovski (2017); Picheny et al. (2010) and Ranjan et al. (2008). In a nutshell, at step $n$ the GP paradigm constructs a metamodel $\hat{f}^{(n)}$ that is then used to guide the selection of $x_{n+1}$ and also to construct the estimate $\hat{S}^{(n)}$. To this end, GPs are well suited for sequential design by offering a rich uncertainty quantification aspect that can be (analytically) exploited to construct information-theoretic DoE heuristics. The standard framework is to develop an acquisition function $I_n(x)$ that quantifies the value of information from taking a new sample at input $x$ conditional on an existing dataset, and then to myopically maximize $I_n$:

(1.6) $x_{n+1} = \arg\max_{x \in D} I_n(x)$.
Early level-set sampling criteria were proposed by Bichon et al. (2008), Picheny et al. (2010), and Ranjan et al. (2008) based on modifications to the Expected Improvement criterion (Jones et al., 1998) for response function optimization. A criterion more targeted to reduce the uncertainty about $S$ itself was first developed by Bect et al. (2012) using the concept of stepwise uncertainty reduction (SUR). Specifically, the SUR strategy aims to myopically maximize the global learning rate about $S$; see also Chevalier et al. (2014a) for related computational details. Recently, further criteria using tools from random set theory were developed in Chevalier et al. (2013); Azzimonti et al. (2016). Specifically, those works use the notions of Vorob'ev expectation and Vorob'ev deviation to choose inputs that minimize the posterior expected distance in measure between the level set $S$ and its estimate $\hat{S}$. This approach is computationally expensive, however, and requires conditional simulations of the posterior Gaussian field. Other works dealing with more conservative estimates are Bolin and Lindgren (2015); Azzimonti et al. (2019). A clear analysis comparing all these choices in the stochastic setting is currently lacking.
1.4 Summary of Contributions
Most of the cited papers consider only the deterministic setting without any simulation noise. The main goal of this article is to present a comprehensive assessment of GP-based surrogates for stochastic contour-finding. In that sense, our analysis complements the work of Picheny et al. (2013b) and Jalali et al. (2017), who benchmarked GP metamodels for Bayesian optimization (BO), where the objective is to evaluate $\max_{x \in D} f(x)$.
While simple versions (with constant or prespecified Gaussian noise) are easily handled, the literature on GP surrogates for complex stochastic simulators remains incomplete. Recently, several works focused on heteroskedastic simulation variance; see the Stochastic Kriging approach of Ankenman et al. (2010) and the earlier works by two of the authors (Binois et al., 2018, 2019). In the present article we instead target the non-Gaussian aspects, in particular the likely heavy-tailed property. This issue is fundamental to any realistic stochastic simulator where there is no justification for assuming Gaussian-distributed $\epsilon$ (as opposed to the physical experimental setup where $\epsilon$ represents observation noise and is expected to be Gaussian thanks to the central limit theorem). This motivates us to study alternative GP-based metamodels for learning $S$ that are more robust to non-Gaussian $\epsilon$ in (1.2). In parallel, we investigate which of the contour-finding heuristics outlined above perform best in such setups.
To stay within the sequential design paradigm, we continue to work with a GP-based setup but investigate several modifications that are relevant for learning $\partial S$.

To target the classification-like objective underlying (1.3), we consider the use of classification GPs that model the sign of the response via a probit model driven by a latent GP $Z$: $\mathbb{P}(Y(x) > 0) = \Phi(Z(x))$. Deployment of the probit link is expected to "wash out" non-Gaussian features in $\epsilon$ beyond its effect on the sign of the observations.

In a different vein, to exploit a structure commonly encountered in applications where the level set is connected, we study the performance of monotone GP regression/classification metamodels (Riihimäki and Vehtari, 2010) that force $\hat{f}$ (or $\hat{Z}$) to be monotone in the specified coordinates.
Our analysis is driven by the primary effect of noise on contour-finding algorithms. This effect was already documented in related studies, such as that of Jalali et al. (2017), who observed the strong impact of simulation noise on the performance of BO. Consequently, specialized metamodeling frameworks and acquisition functions are needed that can best handle the stochasticity for the given loss specification. Thus, the combination of the above tools with the GP framework aims to strike the best balance in carrying out uncertainty quantification and constructing a robust surrogate that is not too swayed by the simulation noise structure. In the context of GPs, this means accurate inference of the mean response and sampling noise that in turn drive the posterior mean $m_n(\cdot)$ and the posterior GP variance $s_n^2(\cdot)$. Both of the latter ingredients are needed to blend the exploitation objective to locally learn the contour and to explore less-sampled regions. These issues drive our choices of the metamodels and also factor in developing the respective acquisition functions $I_n$; see Section 3. On the latter front we consider four choices (MCU, cSUR, tMSE, ICU), including heuristics that depend only on the posterior standard deviation $s_n(\cdot)$, as well as those that anticipate the information gain from sampling at $x$ via the lookahead standard deviation $s_{n+1}(\cdot)$. Because in the non-Gaussian GPs $s_{n+1}$ depends on the not-yet-observed $y_{n+1}$, we develop tractable approximations for that purpose; see Propositions 4.2, 4.3, and 4.4.
To recap, our contributions can be traced along five directions. First, we investigate two ways to handle heavy-tailed simulation noise: via a GP with Student-$t$ observations and via a TP. As far as we are aware, this is the first application of either tool in sequential design and contour-finding contexts. Second, we present an original use of monotonic GP metamodels for level set estimation. This idea is related to a gray-box approach that aims to exploit known structural properties of $f$ (or $S$) so as to improve on the agnostic black-box strategies. Third, we analyze the performance of classification GP metamodels for contour-finding. This context offers an interesting and novel comparison between regression and classification approaches benchmarked against a shared loss function. Fourth, we develop and implement approximate lookahead formulas for all our metamodels that are used for the evaluation of acquisition functions. To our knowledge, this is the first presentation of such formulas for non-Gaussian GPs, as well as TPs. Fifth, beyond the metamodels themselves, we also provide a detailed comparison among the proposed acquisition functions, identifying the best-performing combinations of acquisition function and metamodel and documenting the complex interplay between design geometry and surrogate architecture.
The rest of the article is organized as follows. Section 2 describes the metamodels we employ. Section 3 develops the sequential designs for the levelset estimation problem, and Section 4 discusses the lookahead variance formulas for nonGaussian GPs. Section 5 compares the models using synthetic data where ground truth is known. Two case studies from derivative pricing are investigated in Section 6. In Section 7 we summarize our conclusions.
2 Statistical Model
2.1 Gaussian Process Regression with Gaussian Noise
We begin by discussing regression frameworks for contour finding that target learning the latent $f$ based on the loss (1.4). The Gaussian process paradigm treats $f$ as a random function whose posterior distribution is determined from its prior and the collected samples $(x_{1:n}, y_{1:n})$. We view $f$, a priori, as a realization of a Gaussian process completely specified by its mean function $m(x)$ and covariance function $K(x, x')$.
In the classical case (Williams and Rasmussen, 2006), the noise distribution is homoskedastic Gaussian, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and the prior mean is zero, $m(x) \equiv 0$. Given observations $\mathbf{y} = (y_1, \ldots, y_n)^\top$ at inputs $x_{1:n}$, the conditional distribution is then another Gaussian process, with posterior marginal mean and covariance given by (throughout we use subscripts to indicate the metamodel type)

(2.1) $m_n(x) = \mathbf{k}(x)^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$,

(2.2) $s_n^2(x, x') = K(x, x') - \mathbf{k}(x)^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x')$,

with the vector $\mathbf{k}(x)$ and matrix $\mathbf{K}$ defined by $\mathbf{k}(x)_i = K(x, x_i)$ and $\mathbf{K}_{ij} = K(x_i, x_j)$, $1 \leq i, j \leq n$.

The posterior mean $m_n(x)$ is treated as a point estimate of $f(x)$ and the posterior standard deviation $s_n(x) := s_n(x, x)^{1/2}$ as the uncertainty of this estimate. We use $\mathbf{f}$ to denote the random posterior vector $(f(x_1), \ldots, f(x_n))$.
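As an illustration, the posterior formulas (2.1)–(2.2) amount to a few lines of linear algebra. The sketch below (our own, in Python with NumPy; the one-dimensional setup and hyperparameter values are illustrative assumptions, not taken from the paper) uses a squared-exponential kernel:

```python
import numpy as np

def se_kernel(X1, X2, eta2=1.0, theta=0.5):
    # Squared-exponential covariance with process variance eta2 and lengthscale theta
    return eta2 * np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * theta ** 2))

def gp_posterior(x_star, X, y, sigma2=0.1):
    # Posterior mean (2.1) and variance (2.2) under homoskedastic Gaussian noise
    K = se_kernel(X, X) + sigma2 * np.eye(len(X))   # K + sigma^2 I
    k = se_kernel(x_star, X)                        # rows are k(x)^T per test point
    mean = k @ np.linalg.solve(K, y)
    var = se_kernel(x_star, x_star).diagonal() - np.einsum(
        "ij,ji->i", k, np.linalg.solve(K, k.T))
    return mean, var
```

Far from the data, the mean reverts to the zero prior and the variance to the process variance $\eta^2$; near a design point with small $\sigma^2$, the mean nearly interpolates the observation.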
Model Fitting: In this article, we model the covariance between the values of $f$ at two inputs $x$ and $x'$ with the squared exponential (SE) function:

(2.3) $K(x, x') = \eta^2 \exp\left( -\sum_{k=1}^{d} \frac{(x_k - x'_k)^2}{2 \theta_k^2} \right)$,

defined in terms of the hyperparameters $\eta^2$ and $\theta_{1:d}$, known as the process variance and lengthscales, respectively. The simulation variance $\sigma^2$ is also treated as unknown and part of the hyperparameter vector $\boldsymbol{\theta}$. Several common ways exist for estimating $\boldsymbol{\theta}$. Within a Bayesian approach we integrate against the prior using
(2.4) $p(\mathbf{f} \mid \mathbf{y}) = \dfrac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f})}{p(\mathbf{y})}$,

where $p(\mathbf{y} \mid \mathbf{f})$ is the likelihood and $p(\mathbf{f})$ is the latent function prior. Notice that following the Gaussian noise assumption, the likelihood is Gaussian. With a Gaussian prior $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$, the posterior is tractable and also follows a Gaussian distribution. The normalizing constant in the denominator is independent of the latent function and is called the marginal likelihood, given by

(2.5) $p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f})\, d\mathbf{f}$.

One may similarly express the posterior over the hyperparameters, $p(\boldsymbol{\theta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$, where $p(\mathbf{y} \mid \boldsymbol{\theta})$ plays the role of the likelihood. To avoid expensive MCMC integration, we use the Maximum Likelihood (ML) estimate $\hat{\boldsymbol{\theta}}$, which maximizes the likelihood (2.5). Given the estimated hyperparameters $\hat{\boldsymbol{\theta}}$, we take the posterior of $f$ as $p(f \mid \mathbf{y}, \hat{\boldsymbol{\theta}})$.
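For concreteness, the ML step can be sketched as minimizing the negative log of (2.5) — which under Gaussian noise is the log-density of $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K} + \sigma^2 \mathbf{I})$ — over log-scaled hyperparameters, e.g., with `scipy.optimize.minimize`. The parameterization below is an illustrative assumption:

```python
import numpy as np

def neg_log_marglik(log_params, X, y):
    # Negative log marginal likelihood (2.5): y ~ N(0, K + sigma^2 I) under the
    # SE kernel (2.3); log_params = (log eta2, log theta, log sigma2) so that a
    # generic unconstrained optimizer keeps all hyperparameters positive.
    eta2, theta, sigma2 = np.exp(log_params)
    K = eta2 * np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * theta ** 2))
    C = K + sigma2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (y @ np.linalg.solve(C, y) + logdet + len(X) * np.log(2 * np.pi))
```

Passing this function to any quasi-Newton optimizer yields the ML estimate $\hat{\boldsymbol{\theta}}$.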
2.2 Gaussian Process Regression with Student Noise
Taking the noise term as Gaussian is widely used since the marginal likelihood is then analytically tractable. In a stochastic simulation setting, however, the exact distribution of the outputs relative to their mean is unknown and often is clearly non-Gaussian. A more robust choice is to assume that $\epsilon$ has a Student-$t$ distribution (Jylänki et al., 2011). In particular, this may work better when the noise is heavy-tailed by making inference more resistant to outliers (O'Hagan, 1979). In the resulting GP formulation, $\epsilon$ is assumed to be $t$-distributed with variance $\sigma^2$ and $\nu$ degrees of freedom (the latter is treated as another hyperparameter). The marginal likelihood of observing $\mathbf{y}$ can be written as

(2.6) $p(\mathbf{y}) = \int \prod_{i=1}^{n} \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu \pi}\, \sigma} \left( 1 + \frac{(y_i - f_i)^2}{\nu \sigma^2} \right)^{-\frac{\nu+1}{2}} p(\mathbf{f})\, d\mathbf{f}$,
where $\Gamma$ is the Gamma function. The likelihood in (2.4) is no longer Gaussian, and integrating (2.6) against the Gaussian prior is intractable; we therefore use the Laplace approximation (LP) method (Williams and Barber, 1998) to calculate the posterior. A second-order Taylor expansion of $\log p(\mathbf{f} \mid \mathbf{y})$ around its mode, $\hat{\mathbf{f}} := \arg\max_{\mathbf{f}} p(\mathbf{f} \mid \mathbf{y})$, gives a Gaussian approximation

(2.7) $p(\mathbf{f} \mid \mathbf{y}) \approx \mathcal{N}\!\left( \hat{\mathbf{f}},\, \mathbf{\Lambda}^{-1} \right)$,

where $\mathbf{\Lambda}$ is the Hessian of the negative conditional log posterior density at $\hat{\mathbf{f}}$:

(2.8) $\mathbf{\Lambda} = -\nabla\nabla \log p(\mathbf{f} \mid \mathbf{y}) \big|_{\mathbf{f} = \hat{\mathbf{f}}} = \mathbf{K}^{-1} + \mathbf{W}$,

and $\mathbf{W} = -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f}) \big|_{\mathbf{f} = \hat{\mathbf{f}}}$ is diagonal, since the likelihood factorizes over observations.
Using (2.7), the approximate posterior distribution is also Gaussian, defined by its mean $m_n(x)$ and covariance $s_n^2(x, x')$:

(2.9) $m_n(x) = \mathbf{k}(x)^\top \mathbf{K}^{-1} \hat{\mathbf{f}}$,

(2.10) $s_n^2(x, x') = K(x, x') - \mathbf{k}(x)^\top (\mathbf{K} + \mathbf{W}^{-1})^{-1} \mathbf{k}(x')$.

Note the similarity to (2.1)–(2.2): with Student-$t$ likelihood the mode $\hat{\mathbf{f}}$ plays the role of $\mathbf{y}$ and $\mathbf{W}^{-1}$ replaces the noise matrix $\sigma^2 \mathbf{I}$. Critically, the latter implies that the posterior variance is a function of both the designs $x_{1:n}$ and the observations $\mathbf{y}$.
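The mode $\hat{\mathbf{f}}$ and the matrix $\mathbf{W}$ can be found by a short Newton iteration on the unnormalized log posterior. The sketch below is our own illustrative implementation (plain, undamped Newton; in practice step damping is advisable for strong outliers), using the Student-$t$ score and its derivative:

```python
import numpy as np

def laplace_student_t(K, y, nu=4.0, sigma2=0.1, iters=100):
    # Newton iteration for the mode f_hat of (2.7) under a Student-t likelihood.
    # Score of log p(y|f): (nu+1) r / (nu sigma^2 + r^2), with residual r = y - f.
    f = np.zeros_like(y)
    Kinv = np.linalg.inv(K)
    for _ in range(iters):
        r = y - f
        grad = (nu + 1) * r / (nu * sigma2 + r ** 2) - Kinv @ f
        # W = -d^2 log p(y|f)/df^2, diagonal; can turn negative for large residuals
        W = (nu + 1) * (nu * sigma2 - r ** 2) / (nu * sigma2 + r ** 2) ** 2
        f = f + np.linalg.solve(Kinv + np.diag(W), grad)
    return f, W
```

Substituting the returned $\hat{\mathbf{f}}$ and $\mathbf{W}$ into (2.9)–(2.10) gives the approximate posterior mean and variance.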
2.3 Gaussian Process Classification
Our target in (1.1) is to learn where the mean response is positive, which is equivalent to classifying each input $x$ as belonging either to $S$ or to $D \setminus S$. Assuming that the noise $\epsilon$ is symmetric, $f(x) \geq 0 \Leftrightarrow \mathbb{P}(Y(x) > 0) \geq 1/2$. This motivates us to consider the alternative of directly modeling the response sign (rather than its overall magnitude) via a classification GP model (ClGP) (Williams and Barber, 1998; Williams and Rasmussen, 2006). The idea is to model the probability of a positive observation by using probit regression: $\mathbb{P}(Y(x) > 0) = \Phi(Z(x))$, with $\Phi$ the standard normal cdf. The latent classifier function $Z$ is taken as a GP. After learning $Z$ we then set $\hat{S} = \{x \in D : \hat{Z}(x) \geq 0\}$.
To compute the posterior distribution of $Z$ conditional on the data, we use the fact that for an observation at $x_i$ and conditional on $z_i = Z(x_i)$, the likelihood of a positive sign is $\Phi(z_i)$. To simplify notation we use $y_i \in \{-1, +1\}$ to represent the signed responses driving ClGP, leading to $p(y_i \mid z_i) = \Phi(y_i z_i)$. The posterior of the latent $\mathbf{z} = Z(x_{1:n})$ is therefore

(2.11) $p(\mathbf{z} \mid \mathbf{y}) = \dfrac{p(\mathbf{z}) \prod_{i=1}^{n} \Phi(y_i z_i)}{p(\mathbf{y})}$.
Similar to the Student-$t$ GP, we use a Laplace approximation for the non-Gaussian posterior in Eq. (2.11) (details to be found in Appendix B). The posterior mean for $Z$ at $x$ is then expressed by using the GP predictive mean equation (2.1) and the LP approximation (B.1):

(2.12) $m_n(x) = \mathbf{k}(x)^\top \mathbf{K}^{-1} \hat{\mathbf{z}}$,

(2.13) $s_n^2(x, x') = K(x, x') - \mathbf{k}(x)^\top (\mathbf{K} + \mathbf{W}^{-1})^{-1} \mathbf{k}(x')$.

We again see the same algebraic structure, with the mode $\hat{\mathbf{z}}$ a stand-in for $\mathbf{y}$ in (2.1) and $\mathbf{W}^{-1}$ a stand-in for $\sigma^2 \mathbf{I}$ in (2.2). Also note that we may formally link the $Z$ of the ClGP metamodel to the $f$ of the GP used previously via the posterior probability that $Y(x) > 0$:

(2.14) $\mathbb{P}(Y(x) > 0 \mid \mathcal{D}_n) = \Phi\!\left( \frac{m_n(x)}{\sqrt{1 + s_n^2(x)}} \right)$.
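Eq. (2.14) comes from integrating the probit link against the Gaussian posterior of $Z(x)$: $\int \Phi(z)\, \mathcal{N}(z; m, s^2)\, dz = \Phi(m/\sqrt{1 + s^2})$. A one-line illustration (our own code, easily checked against Monte Carlo integration):

```python
import numpy as np
from scipy.stats import norm

def prob_positive(m, s2):
    # Posterior probability P(Y(x) > 0 | D_n) per Eq. (2.14): the Gaussian
    # integral of the probit link collapses to a single cdf evaluation
    return norm.cdf(m / np.sqrt(1.0 + s2))
```

With $m = 0$ the probability is exactly $1/2$ regardless of $s^2$, consistent with the sign classification rule $\hat{Z}(x) \geq 0$.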
2.4 Student Process Regression with Student Noise
Instead of just adopting a Student-$t$ likelihood for the observations, Shah et al. (2014) proposed $t$-processes (TPs) as an alternative to GPs, deriving closed-form expressions for the marginal likelihood and posterior distribution of the process by imposing an inverse Wishart process prior over the covariance matrix of a GP model. They found the $t$-process to be more robust to model misspecification and to be particularly promising for BO. Moreover, Shah et al. (2014) showed that TPs retain most of the appealing properties of GPs, including analytical expressions, with increased flexibility.
As noticed for example in Williams and Rasmussen (2006), dealing with noisy observations is less straightforward with TPs, since the sum of two independent Student-$t$ distributions has no closed form. Still, this drawback can be circumvented by incorporating the noise directly in the kernel. The corresponding data-generating mechanism is taken to be multivariate-$t$, $\mathbf{y} \sim \mathcal{T}_\nu(\mathbf{0}, \mathbf{K} + \sigma^2 \mathbf{I})$, and the posterior predictive distribution is then again Student-$t$, where the degrees of freedom are $\nu + n$ and (Shah et al., 2014)

(2.15) $m_n(x) = \mathbf{k}(x)^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$,

(2.16) $s_n^2(x, x') = \dfrac{\nu + \beta - 2}{\nu + n - 2} \left( K(x, x') - \mathbf{k}(x)^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x') \right)$,

with $\beta = \mathbf{y}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$.

Comparing with the regular GPs, we have the same posterior mean $m_n$, but the posterior covariance now depends on the observations $\mathbf{y}$ and is inflated by the factor $(\nu + \beta - 2)/(\nu + n - 2)$. Moreover, the latent function and the noise are uncorrelated but not independent. As noticed in Shah et al. (2014), assuming the same hyperparameters, as $\nu$ goes to infinity the above predictive distribution becomes Gaussian.
Inference for TPs can be performed similarly as for a GP, for instance based on the marginal likelihood:

(2.17) $\log p(\mathbf{y}) = -\frac{n}{2} \log\big((\nu - 2)\pi\big) - \frac{1}{2} \log \left| \mathbf{K} + \sigma^2 \mathbf{I} \right| + \log \frac{\Gamma\!\left(\frac{\nu + n}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} - \frac{\nu + n}{2} \log\left( 1 + \frac{\beta}{\nu - 2} \right)$.

One issue is the estimation of $\nu$, which plays a central role in the TP predictions. We find that restricting $\nu$ to be small is important in order to avoid degenerating to the plain Gaussian GP setup.
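The TP predictive equations (2.15)–(2.16) differ from the Gaussian case only through the multiplicative variance inflation. A sketch (our own code, reusing the illustrative SE kernel from before; hyperparameter values are assumptions):

```python
import numpy as np

def se_kernel(X1, X2, eta2=1.0, theta=0.5):
    return eta2 * np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * theta ** 2))

def tp_posterior(x_star, X, y, nu=5.0, sigma2=0.1):
    # TP posterior (2.15)-(2.16): GP mean, variance scaled by (nu+beta-2)/(nu+n-2)
    n = len(X)
    C = se_kernel(X, X) + sigma2 * np.eye(n)     # noise folded into the kernel
    k = se_kernel(x_star, X)
    Cinv_y = np.linalg.solve(C, y)
    mean = k @ Cinv_y
    beta = y @ Cinv_y                            # data-fit statistic
    gp_var = se_kernel(x_star, x_star).diagonal() - np.einsum(
        "ij,ji->i", k, np.linalg.solve(C, k.T))
    return mean, (nu + beta - 2.0) / (nu + n - 2.0) * gp_var
```

As $\nu \to \infty$ the inflation factor tends to 1 and the Gaussian GP predictive of Section 2.1 is recovered, mirroring the limit noted above.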
2.5 Metamodel Performance for Level Set Inference
To evaluate the performance of different metamodels, we consider several metrics. The first statistic is the error rate based on the loss function defined in Eq. (1.3), measuring the distance between the level set $S$ and its estimate $\hat{S}$:

(2.18) $\mathcal{ER} := \mu(S \,\Delta\, \hat{S}) = \int_D \mathbb{1}\{\operatorname{sign} m_n(x) \neq \operatorname{sign} f(x)\}\, \mu(dx)$.

For ClGP, we replace $m_n$ with the latent classifier mean in the above, namely, use $\hat{S} = \{x : \hat{Z}(x) \geq 0\}$. A related statistic is the bias, which is based on the signed (weighted) difference between $S$ and $\hat{S}$:

(2.19) $\mathcal{B} := \int_D \left( \mathbb{1}\{m_n(x) \geq 0\} - \mathbb{1}\{f(x) \geq 0\} \right) \mu(dx)$.
The error rate and bias evaluate the accuracy of the point estimate $\hat{S}$ when the ground truth $S$ is known. In a realistic case study when the latter is unavailable, we replace $\mathcal{ER}$ by its empirical counterpart, based on quantifying the uncertainty in $\hat{S}$ through the associated uncertainty of $m_n$. Following Azzimonti et al. (2016), we define the empirical error as the expected distance in measure between the random set $S$ and its estimate $\hat{S}$:

(2.20) $\mathcal{E}_n := \int_D p_n(x)\, \mu(dx)$,

with the local empirical error $p_n(x)$ calculated by using (2.1) and (2.2):

$p_n(x) = \Phi\!\left( -\frac{|m_n(x)|}{s_n(x)} \right)$.

The local empirical error $p_n(x)$ is the posterior probability of wrongly classifying $x$ conditional on the training dataset $\mathcal{D}_n$. It is intrinsically tied to the point estimate $m_n(x)$ and the associated posterior variance $s_n^2(x)$ through the Gaussian uncertainty quantification. For the TPs, the predictive distribution is Student-$t$, so the Gaussian cdf is replaced with the respective survival function.
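On a discretized domain, the empirical error (2.20) reduces to a weighted average of the local misclassification probabilities. A minimal sketch (our own; the grid and uniform default weights standing in for $\mu$ are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def empirical_error(mean, sd, weights=None):
    # Integrated empirical error (2.20): average of p_n(x) = Phi(-|m_n(x)|/s_n(x))
    # against the measure mu, represented here by a discrete weight vector
    p = norm.cdf(-np.abs(mean) / sd)
    if weights is None:
        weights = np.full(len(mean), 1.0 / len(mean))   # uniform mu on the grid
    return float(p @ weights)
```

At points right on the estimated contour ($m_n = 0$) the local error is $1/2$, its maximal value; far from the contour relative to $s_n$ it vanishes.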
Uncertainty Quantification: To quantify the overall uncertainty about $\hat{S}$ (rather than the local uncertainty about $\hat{f}(x)$), a natural criterion is the volume of the credible band that captures inputs whose sign classification remains ambiguous given $\mathcal{D}_n$. A simple definition at a credibility level $\alpha$ (e.g., $\alpha = 0.95$) would be

(2.21) $CI_n := \{x \in D : |m_n(x)| \leq z_\alpha\, s_n(x)\}$,

where $z_\alpha$ is the appropriate Gaussian/Student-$t$ quantile. Thus (2.21) evaluates the region where the sign of $\hat{f}$ is non-constant over the posterior CI of $f(x)$. Heuristically, however, $x \in CI_n$ is effectively equivalent to the local empirical error exceeding $\Phi(-z_\alpha)$, so that the volume of $CI_n$ is roughly proportional to the integrated empirical error $\mathcal{E}_n$.
In a more sophisticated approach based on random set theory, Chevalier et al. (2013) used the Vorob'ev deviation to define the uncertainty measure

(2.22) $\mathbb{E}\left[ \mu\big( Q_{\rho^*} \,\Delta\, S \big) \mid \mathcal{D}_n \right]$,

where

$Q_\rho := \{x \in D : \mathbb{P}(x \in S \mid \mathcal{D}_n) \geq \rho\}$

is the Vorob'ev quantile at level $\rho$. A level $\rho^*$ satisfying the unbiasedness condition

$\mu(Q_{\rho^*}) = \mathbb{E}\left[ \mu(S) \mid \mathcal{D}_n \right]$

is referred to as the Vorob'ev threshold and can be determined through dichotomy (Chevalier et al., 2013). If the Vorob'ev threshold is picked so that $Q_{\rho^*} = \hat{S}$, then the Vorob'ev deviation reduces to the empirical error $\mathcal{E}_n$. Because of the computational overhead of working with (2.22), we restrict attention to the credible bands defined through $CI_n$, which correspond to local uncertainty about $f(x)$ (or $Z(x)$) as in (2.21).
3 Sequential Design
We estimate the level set in a sequential design setting that assumes that $Y$ is expensive to evaluate, for example because of the complexity of the underlying stochastic simulator. Therefore efficient selection of the inputs $x_{1:n}$ is important. In sequential design, at each step the next sampling location is selected given all previous measurements. The Bayesian approach to sequential design is based on greedily optimizing an acquisition function as in (1.6). These strategies were popularized thanks to the success of the expected improvement (EI) criterion and the associated efficient global optimization (EGO) algorithm (Jones et al., 1998). The basic loop for sequential design is as follows:

1. Initialize $\mathcal{D}_{n_0} = \{(x_i, y_i)\}_{i=1}^{n_0}$.

2. Loop for $n = n_0, \ldots, N - 1$:

(a) Choose the next input $x_{n+1} = \arg\max_{x \in D} I_n(x)$, and sample $y_{n+1} = Y(x_{n+1})$.

(b) Augment $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(x_{n+1}, y_{n+1})\}$.

(c) Update the metamodel with $\mathcal{D}_{n+1}$.
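In code, this loop is a thin wrapper around the chosen metamodel and acquisition function. In the sketch below, `fit`, `acquire`, and `simulator` are user-supplied placeholders, and the candidate-set maximization of $I_n$ is an illustrative simplification (assumptions on our part, not an interface from the paper):

```python
import numpy as np

def sequential_design(simulator, fit, acquire, X0, y0, budget, candidates):
    # Initialize on D_{n0}, then repeatedly pick x_{n+1} = argmax I_n over a
    # candidate set, sample y_{n+1}, augment the data, and refit the metamodel.
    X, y = list(X0), list(y0)
    model = fit(np.array(X), np.array(y))
    for _ in range(budget):
        x_next = candidates[int(np.argmax(acquire(model, candidates)))]
        X.append(x_next)
        y.append(simulator(x_next))
        model = fit(np.array(X), np.array(y))
    return model, np.array(X), np.array(y)
```

Any of the acquisition functions of this section (MCU, cSUR, ICU, tMSE) can be slotted in as `acquire`, evaluated over the candidate grid.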

We now propose several metrics for the acquisition function $I_n$ in Eq. (1.6). The key plan is to target regions close to the boundary $\partial S$. A second strategy is to use the lookahead posterior standard deviation $s_{n+1}(\cdot)$ conditional on sampling at $x_{n+1}$, in order to assess the corresponding information gain. This links the constructed design to the metamodel for $f$, since different surrogate architectures quantify uncertainty differently.
The first metric, dubbed Maximum Contour Uncertainty (MCU), stems from the Upper Confidence Bound (UCB) strategies proposed by Srinivas et al. (2012) for Bayesian optimization. The idea of UCB is to express the exploitation-exploration tradeoff through the posterior mean $m_n$ and standard deviation $s_n$. Following the spirit of UCB, MCU blends the minimization of $|m_n|$ (exploitation) with maximization of the posterior uncertainty $s_n$ (exploration):

(3.1) $I_n^{MCU}(x) := -|m_n(x)| + \gamma_n\, s_n(x)$,

where $(\gamma_n)$ is a step-dependent sequence of weights. Thus, MCU targets inputs with high uncertainty (large $s_n(x)$) and close to the boundary (small $|m_n(x)|$). Small $\gamma_n$ leads to aggressive sampling concentrated along the estimated zero-contour; large $\gamma_n$ leads to space-filling sampling that effectively minimizes the integrated mean-squared error. Thus, the choice of the $\gamma_n$'s is critical for the performance; in particular, $\gamma_n$ should be increasing to avoid being trapped in local minima of $|m_n|$. In the original application to BO (Srinivas et al., 2012) it is proved that with a suitable logarithmically growing schedule for $\gamma_n$, the regret (a metric measuring the distance between the estimated optima and the truth in BO) of the estimate is guaranteed to converge. Further recipes for (3.1) for level set estimation were proposed in Gotovos et al. (2013) and Bogunovic et al. (2016); both papers mention that the above recommendation is too conservative and tends to over-explore, and each employed its own constant choice of $\gamma_n$. A constant choice of $\gamma_n \equiv 1.96$ corresponds to the Straddle scheme in Bryan et al. (2006) and leads to sampling within the 95% CI band of $\partial S$. Based on our experiments (see Appendix A), we find that a constant value of $\gamma_n$ may be problematic and recommend to adapt $\gamma_n$ to the relative ratio between the spread of $m_n$ (for steeper response surfaces $\gamma_n$ should be larger) and the average posterior standard deviation ($\gamma_n$ needs to rise as posterior uncertainty decreases). One recipe is to use $\gamma_n \propto \mathrm{IQR}(m_n)/\bar{s}_n$ (where $\bar{s}_n$ denotes the average of the standard deviation $s_n$ over $D$ and IQR the interquartile range of $m_n$), which keeps both terms in (3.1) approximately comparable as $n$ changes.
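In vectorized form the MCU score (3.1) is one line; the adaptive weight below follows the heuristic just described, with the constant of proportionality set to 1 as an illustrative assumption:

```python
import numpy as np

def mcu(mean, sd, gamma):
    # MCU acquisition (3.1): small |m_n| (exploitation) plus large s_n (exploration)
    return -np.abs(mean) + gamma * sd

def adaptive_gamma(mean, sd):
    # Heuristic weight (an assumption following the text): spread of the posterior
    # mean relative to the average sd, balancing the two terms of (3.1)
    q75, q25 = np.percentile(mean, [75, 25])
    return (q75 - q25) / np.mean(sd)
```

With equal posterior uncertainty everywhere, the maximizer of `mcu` is the grid point whose mean is closest to zero, i.e., nearest the estimated contour.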
Remark 3.1.
The local empirical error $p_n(x)$ as defined in Section 2.5 could be directly used as an acquisition function, i.e.,

(3.2) $I_n^{MEE}(x) := \Phi\!\left( -\frac{|m_n(x)|}{s_n(x)} \right)$.

This Maximal Empirical Error (MEE) acquisition function measures the local probability of misclassification and is similar to the sequential criteria in Bect et al. (2012); Echard et al. (2010); Ranjan et al. (2008); Bichon et al. (2008), all based on the idea of sampling at $x$ where the event $\{x \in S\}$ is most uncertain. However, (3.2) is not suitable for our purposes since it is maximized across the entire estimated zero-contour (namely at any $x$ where $m_n(x) = 0$), and so does not possess a unique maximizer as soon as that contour is nontrivial. One potential solution could be to maximize (3.2) over a finite candidate set, which however requires significant fine-tuning.
Our second strategy focuses on quickly reducing the local empirical error by comparing the current $p_n(x)$ given $\mathcal{D}_n$ and the expected $p_{n+1}(x)$ conditional on the one-step-ahead sample at $x_{n+1}$. This is achieved by integrating out the effect of $y_{n+1}$ on $p_{n+1}$:

(3.3) $I_n^{cSUR}(x) := \mathbb{E}_{y_{n+1}}\!\left[ \Phi\!\left( -\frac{|m_n(x)|}{s_n(x)} \right) - \Phi\!\left( -\frac{|m_{n+1}(x)|}{s_{n+1}(x)} \right) \,\Big|\, x_{n+1} = x \right]$.

The name cSUR is because (3.3) is directly related to the SUR strategy (Bect et al., 2012), modified to target contour-finding. Crucially, (3.3) ties the selection of $x_{n+1}$ to the lookahead mean $m_{n+1}$ and lookahead standard deviation $s_{n+1}$ that appear on the right-hand side. To compute the integral over $y_{n+1}$, we replace $m_{n+1}(x)$ with its average $\mathbb{E}[m_{n+1}(x)] = m_n(x)$. Similarly, we plug in the approximate one-step-ahead standard deviation $\hat{s}_{n+1}(x)$ discussed in Section 4 (especially Equations (4.4), (C.4), and (C.6)) for $s_{n+1}(x)$:

(3.4) $I_n^{cSUR}(x) \simeq \Phi\!\left( -\frac{|m_n(x)|}{s_n(x)} \right) - \Phi\!\left( -\frac{|m_n(x)|}{\hat{s}_{n+1}(x)} \right)$.

Note that if $x$ is such that $m_n(x) = 0$, then both terms above are $1/2$ and $I_n^{cSUR}(x) = 0$. Thus, the cSUR criterion will not place samples directly on the estimated zero-contour, but will aim to bracket it.
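With the plug-in approximation, the cSUR score (3.4) is just a difference of two normal cdf terms; the lookahead standard deviation is whatever approximation of Section 4 the metamodel supplies, treated here as an input (our own illustrative code):

```python
import numpy as np
from scipy.stats import norm

def csur(mean, sd_now, sd_next):
    # cSUR acquisition (3.4): reduction in the local misclassification
    # probability when s_n shrinks to the (approximate) lookahead sd_next
    return norm.cdf(-np.abs(mean) / sd_now) - norm.cdf(-np.abs(mean) / sd_next)
```

Consistent with the remark above, the score vanishes exactly at $m_n(x) = 0$ and is positive wherever sampling shrinks the posterior uncertainty near (but not on) the contour.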
In (3.4) cSUR only measures the local improvement in $p_n$ at the sampling location and consequently might be overly aggressive in targeting $\partial S$. This motivates us to target the global reduction in the uncertainty of $\hat{S}$, so as to take into account the spatial structure of the posterior. The resulting Integrated Contour Uncertainty (ICU) is linked to the already defined empirical error $\mathcal{E}_n$ from Section 2.5:

(3.5) $I_n^{ICU}(x) := \mathcal{E}_n - \mathbb{E}_{y_{n+1}}\!\left[ \mathcal{E}_{n+1} \mid x_{n+1} = x \right]$.

We apply the same approximation as for cSUR to simplify the expectation over $y_{n+1}$ and replace the integral over $D$ with a sum over a finite subset $\mathcal{M} \subset D$ of size $M$:

(3.6) $I_n^{ICU}(x) \simeq -\sum_{x_m \in \mathcal{M}} \Phi\!\left( -\frac{|m_n(x_m)|}{\hat{s}_{n+1}(x_m)} \right)$.

Then $I_n^{ICU}(x)$ can be viewed as measuring the overall information gain about $S$ from sampling at $x$. The motivation behind ICU is to myopically minimize the expected one-step-ahead empirical error $\mathbb{E}[\mathcal{E}_{n+1}]$, which would correspond to a one-step Bayes-optimal design.
As a last alternative, we utilize the targeted mean squared error (tMSE) criterion, a localized form of the targeted IMSE criterion in Picheny et al. (2010):

(3.7) $I_n^{tMSE}(x) := s_n^2(x)\, W_n(x)$,

where

(3.8) $W_n(x) = \frac{1}{\sqrt{2\pi}\, s_n(x)} \exp\left( -\frac{m_n(x)^2}{2\, s_n^2(x)} \right)$.

The tMSE criterion upweighs regions close to the zero contour through the weight function $W_n$, which measures the distance of $x$ to $\partial S$ using the Gaussian posterior density of $f(x)$. Like MCU, tMSE is based only on the posterior at step $n$ and does not integrate over future $y_{n+1}$'s.
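The tMSE score (3.7)–(3.8) needs only the current posterior. A sketch (our own; exposing the smoothing parameter of Remark 3.2 as an optional argument is an assumption on our part):

```python
import numpy as np

def tmse(mean, sd, eps=0.0):
    # tMSE acquisition (3.7)-(3.8): posterior variance s_n^2 weighted by the
    # Gaussian density of m_n(x) at the zero threshold; eps > 0 flattens W_n
    s2 = sd ** 2 + eps ** 2
    W = np.exp(-0.5 * mean ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return sd ** 2 * W
```

For fixed $s_n$, the score peaks where $m_n(x) = 0$, concentrating sampling effort on the estimated contour while the $s_n^2$ factor still rewards under-explored regions.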
Remark 3.2.
In Picheny et al. (2010) an additional parameter $\sigma_\varepsilon^2$ was added to the definition of $W_n$ by replacing $s_n^2(x)$ everywhere with $s_n^2(x) + \sigma_\varepsilon^2$. Larger $\sigma_\varepsilon$ yields more space-filling behavior as $W_n$ becomes flatter. Since Picheny et al. (2010) dealt with deterministic experiments, $\sigma_\varepsilon > 0$ was necessary to ensure that $W_n$ is well defined at existing design points (where $s_n = 0$), and the recommendation was for