Gibbs reference posterior for Robust Gaussian Process Emulation.
We propose an objective posterior distribution on correlation kernel parameters for Simple Kriging models in the spirit of reference posteriors. Because it is proper and defined through its conditional densities, it lends itself well to Gibbs sampling, thus making the full-Bayesian procedure tractable. Numerical examples show it has near-optimal frequentist performance in terms of prediction interval coverage.
Keywords. Gaussian process, Kriging, Jeffreys prior, Reference prior, integrated likelihood, Gibbs sampling, posterior propriety, frequentist coverage, computer model.
Gaussian Processes are widely used to model the spatial distribution of some real-valued quantity when said quantity is only observed at a few locations. This emulation technique is a convenient way to represent the uncertainty of the value of the quantity at unobserved points [SWN]. In this work, we focus on stationary Gaussian Processes with null mean function, a framework that is referred to as “Simple Kriging” in the geostatistical literature [JH78] and is also frequently used in the context of computer experiments and machine learning [Ras06]. The exact probability distribution of a Gaussian Process depends not only on its mean function (supposed here to be null), but also on a variance parameter and a correlation function (also known as “correlation kernel”) which itself depends on parameters.
We propose an “objective” posterior distribution on these parameters derived through techniques related to the Bernardo reference posterior, which we call “Gibbs reference posterior”.
The need for a prior distribution on the parameters of Kriging models arises from the lack of robustness of the Maximum Likelihood Estimator (MLE) in dealing with parameters of correlation kernels. Indeed, what makes estimating them “notoriously difficult”, as KOH01 put it, is that the likelihood function may often be quite flat [LS05]. To tackle this problem, one may stabilize the MLE by adding a nugget to the covariance kernel, namely adding a covariance component concentrated on the diagonal. However, as was noted by AC12, “the presence of a nugget is equivalent to the assumption that the simulator contains some variability that is not explainable by its inputs”. Alternatively, LS05 proposed penalizing the likelihood function, which may also be interpreted as using a prior distribution and then choosing the Maximum A Posteriori (MAP) estimate instead of the MLE. Of course, using a full-Bayesian approach obviates the problem of robustness of the estimator of the parameters, as one may simply use the integrated predictive distribution.
Whether one wishes to use a prior distribution as a penalizing function for the likelihood or to deploy the whole Bayesian machinery, one often faces the problem of a lack of a priori information. This is where Objective Bayes, which was first introduced in this context by BDOS01, is helpful. Their work on deriving the reference prior in this context and establishing posterior propriety was then successively extended by paulo05, RSS13 and Gu16. RSH12 and KP12 also derived and studied the reference prior in the case of an added nugget effect. However, the above cited works all make a restrictive assumption in order to guarantee posterior propriety. It turns out this assumption is untrue for twice differentiable correlation kernels, as we show in Appendix A. Therefore, to the author’s knowledge, there exists currently no proof of the propriety of the reference posterior for such standard covariance kernels as Matérn covariance functions with smoothness parameter . This, along with the wish to make the full-Bayesian process tractable in practice, led us to consider a different but similar “objective” posterior distribution. As it is defined through conditional densities and thus is well suited to Gibbs sampling, we call it the “Gibbs reference posterior".
In Section 2, we provide the background and basic idea behind the construction of the Gibbs reference posterior.
Because its definition involves reaching a compromise between incompatible conditional distributions, we give in Section 3 some theory as to how this may be achieved.
In Section 4, we offer theoretical guarantees of existence and propriety of the Gibbs reference posterior and provide an MCMC algorithm for sampling of the Gibbs reference posterior in the case where a Matérn class covariance kernel [Mat86] [HS93] with known smoothness parameter is used. For a thorough description of Matérn class covariance kernels and their properties, see Ste99 or Bac13.
In Section 5, we detail how to sample from the posterior distribution on parameters and contrast two ways of using it, namely the full-Bayesian method and the MAP plug-in approach.
The MAP plug-in approach has the intrinsic disadvantage of requiring parameter estimation. As its effectiveness depends on the quality of the estimation, it makes sense to compare its performance with that of the MLE. Section 6 shows that the MAP estimator improves inference robustness significantly when compared to the MLE.
Beyond parameter inference, what matters to us is how well we are able to account for uncertainty on values of the Gaussian Process at unobserved points. Section 7 shows that predictive intervals at unobserved points produced by the full-Bayesian method have effective coverage close to their theoretical level, while predictive intervals produced by estimator plug-in methods (MAP and a fortiori MLE) have substantially lower effective coverage.
2 Objective prior distribution
We assume that the real-valued random field of interest , being a bounded subset of , is Gaussian, with zero mean (or known mean) and with covariance of the form . thus denotes the variance of the Gaussian Process and , hereafter named the “vector of correlation lengths”, is the vector of scaling parameters used by the chosen class of correlation kernels .
denotes the vector of observations made at the points of the design set and the correlation matrix of the Gaussian vector of which is a realization. When applied to a matrix, refers to its determinant.
With these notations, the distribution of given and is and the likelihood of parameters and is :
Our work is based on Bernardo’s reference prior [Ber79] [BB92]. Under a few regularity conditions [CB94], the reference prior as defined in BBS09 coincides with the Jeffreys-rule prior. However, the Jeffreys-rule prior has known shortcomings in cases with several unknown parameters.
For example, consider the case of independent real-valued observations of the model [RCR09]. If is known but is not, the Jeffreys-rule prior on is . Then the posterior distribution of (i.e. its distribution knowing all () and ) is the chi-squared distribution with degrees of freedom.
If both and are unknown, the Jeffreys-rule joint prior distribution on and is . Let us denote . Then the posterior distribution of (i.e. its distribution knowing all () but not ) is also the chi-squared distribution with degrees of freedom. In other words, using the Jeffreys-rule prior in both situations implies that by simply substituting the empirical mean to the actual mean , we are able to reach the same state of knowledge about as if we actually knew . Alternatively, one may use the independence Jeffreys prior distribution, which is the joint prior distribution on and obtained by taking the product of the Jeffreys-rule prior on when is known (flat prior) and of the Jeffreys-rule prior on when is known, thus yielding the joint prior distribution . Then the posterior distribution of is the chi-squared distribution with degrees of freedom, which acknowledges the loss of information on when is unknown. This suggests that seperating parameters may improve performance of Jeffreys priors in multidimensional cases.
The reference prior algorithm was designed to overcome such shortcomings [BB92]. It first requires the user to specify an ordering on the parameters. Then, based on this ordering and for each parameter, the reference prior (i.e. the Jeffreys-rule prior in regular cases) on the parameter is computed, the parameter is integrated out of the likelihood function and the likelihood function is replaced by the new, “integrated” version.
Note that in the example, for all the chosen ordering of and , the reference prior coincides with the independence Jeffreys prior.
To apply the reference prior algorithm, we first need to decide on an ordering of and . Knowledge of implies knowledge of correlation matrix , which may then be used to decorrelate the vector of observations . For any matrix such that (the Cholesky decomposition for instance), if and are known, then follows the multivariate normal distribution . This means that if we assume only is known, the inference problem is reduced to the aforementioned case of independent observations of a normal model with known mean but unknown variance. Conversely, knowledge of may be used to rescale the vector of observations : if and are known, follows the multivariate normal distribution , but this inference problem is nearly as difficult as the original one. This dissymmetry motivates our choice to first derive the reference prior on when is known, and then the marginal prior on .
The Fisher information of the model on assuming to be known is . Thus we define the prior distribution on knowing as .
Let us average the likelihood with respect to this distribution :
Assuming correlation length to be known, is the MLE of . Plugging this estimator into the likelihood of the model (1), we get an expression that is proportional to integrated likelihood (2) : . Thus, using the MLE for is equivalent to integrating the likelihood of the model (1) over the prior distribution on .
At this point, we run into a problem : because the reference prior is improper, the integrated likelihood is improper as well. In order to interpret this fact, let us adopt an other parametrization for the model : denote and . This change yields the following density with respect to the Lebesgue measure on , which, for the sake of concision, we still denote :
Now let us compute the marginal density with respect to :
Notice that the marginal density does not depend on , so regardless of the (proper) prior we may use, integrating out yields the same marginal likelihood.
As the expression obtained in (4) is the same as the one obtained in (2), the latter should be interpreted in light of the former : it only pertains to . With no knowledge of , some part of the data , namely is useless when it comes to inferring .
From this point onwards, whenever is written, it is always implicitely assumed that is the density of a measure that is absolutely continous with respect to the Lebesgue measure on the unit sphere . As it integrates to 1, it is a proper probability density.
We conclude this preliminary study with a fact that will be useful in the following. Define a matrix such that (use for instance the Cholesky decomposition). Then is such that . Now define the mapping ; If we restrict to , it becomes bijective and the pushforward measure of the probability distribution with density with respect to the Lebesgue measure on is the uniform probability distribution on . To see this, suppose is a Gaussian vector with null mean and covariance matrix . Then is an independent Gaussian vector and clearly follows the Uniform distribution on .
In this section, we assume that is a scalar parameter. To emphasize this, we denote it .
Because the model with integrated likelihood is regular, the reference prior is once again the Jeffreys-rule prior, whose square is the Fisher information. It can be written :
It is more convenient to express in terms of and .
The Fisher information can now be simply expressed : given a random variable following the uniform distribution on , it is the variance of .
The reference prior on with respect to the model with likelihood is
Let us note that the full prior distribution is the same as the Jeffreys-rule prior distribution computed from the original likelihood function (which is also the reference prior distribution given by Theorem 2 of BDOS01 when taken in the Simple Kriging framework).
In the case where has dimension , there is no natural way of setting up a hierarchy between correlation lengths. It is thus tempting to group them. This requires us to compute the Fisher information matrix related to the correlation lengths.
Proposition 1 yields the diagonal coefficients of this matrix. Non-diagonal coefficients may be derived using the polarization formula. For every integer and between 1 and ,
This reference prior, which can be written , does not coincide with that of paulo05, nor with any reference prior given by RSS13. It is a different extension of BDOS01’s prior to the case of a multidimensional (although it is also a restriction, because this prior distribution only fits the Simple Kriging framework whereas BDOS01’s, paulo05’s and RSS13’s all fit the Universal Kriging framework).
However, this method has the disadvantage of requiring the use of a multidimensional Jeffreys-rule prior distribution, which may show the sort of undesirable behavior discussed at the beginning of this section.
Alternatively, we could draw inspiration from the one-dimensional case in the following way. Suppose that we know every entry of except one, . Then, according to equation (8), the prior density on knowing all entries () would be
A natural idea would be to regard all conditional densities () as implicitly defining a joint prior density on all entries (). This would have several advantages. First, it would allow us to treat all s separately instead of using a multidimensional Jeffreys-rule prior. Second, it would sidestep the need to specify an ordering on s, which would be arbitrary in the absence of additional prior knowledge. Third, it would enable Gibbs sampling (or rather Metropolis within Gibbs sampling), which is less computationally intensive than multidimensional Metropolis sampling.
The trouble is that no such joint distribution exists, because its existence would require that, , there exist two functions and such that
The conditional prior distributions are thus incompatible and a more subtle approach is required. For every integer , we may define a “conditional posterior distribution” . Naturally, these “conditional posterior distributions” are not compatible either, but we may define a joint “posterior” distribution that would, in some sense, act as the “best compromise” between them.
In Section 3, we give a formal definition of an optimal compromise between incompatible conditionals. In Section 4, we show that that when Matérn anisotropic correlation kernels are used, under certain conditions, there exists a unique optimal compromise between our conditional posterior distributions ().
3 Optimal compromise between incompatible conditionals
3.1 Definitions and notations
Consider a set of parameters . In the following, we use as a shorthand for the set of parameters .
If is a joint probability distribution, then , is a hypermarginal distribution.
Let be a sequence of conditional probability distributions fitting the sequence of parameters : , is a distribution on conditional to all parameters with and is thus written .
Let be a sequence of hypermarginal distributions : , is a distribution on all parameters () and is thus written .
, we denote the joint distribution defined by .
A sequence of hypermarginals is compatible with a sequence of conditionals if
With the usual abuse of notations, we may write this
In words, this definition means that a sequence of hypermarginals is considered compatible with a sequence of conditionals if any hypermarginal is the mixture, with equal probabilities, of the hypermarginals with respect to of all joint distributions ().
For any joint distribution and for any integer we define by the -th hypermarginal of : .
Let be a sequence of conditionals. A joint distribution is a compromise between the conditionals if the sequence of hypermarginals is compatible with .
Naturally, if the conditionals are compatible, then a joint distribution that agrees with them all is a compromise.
A joint distribution is an optimal compromise between a sequence of conditionals if it minimizes the functional over all compromises between , where is defined by :
Notice that the set of all compromises between is convex and that the functional is convex. If the conditionals are compatible and there exists a unique joint distribution that agrees with them all, then and is the unique optimal compromise.
3.2 Deriving the optimal compromise
In this section, we identify the optimal compromise between incompatible conditionals.
A Gibbs compromise between a sequence of conditionals is a joint probability distribution which satisfies :
Obviously, a Gibbs compromise is a compromise in the sense of Definition 4.
Any sequence of hypermarginals that is compatible with a sequence of conditionals is the sequence of hypermarginals of a Gibbs compromise between the conditionals .
Let be a sequence of hypermarginals that is compatible with . Define the joint distribution as follows :
Writing the -th hypermarginal of , we have
The main result follows trivially from Proposition 7 :
If there exists one and only one Gibbs compromise between a given sequence of conditionals , then it is the unique optimal compromise.
If there exists only one sequence of hypermarginals compatible with the conditionals , then all compromises have this sequence of hypermarginals. Therefore, the Gibbs compromise minimizes among all compromises the integrand of Equation 13 for any value of . A fortiori, it minimizes among them. ∎
The next result shows that, under the conditions of Theorem 8, the optimal compromise is parametrization-invariant for a fairly large class of reparametrizations.
Assume that the parameters belong to open subsets of .
Consider reparametrizations of in the form , where the ’s are diffeomorphisms. If all conditionals in a sequence are invariant for such reparametrizations, and if there exists a unique Gibbs compromise between them, then it is also invariant for such reparametrizations.
Without loss of generality, we may only consider a reparametrization such that , is the identity function : , . Indeed, any general reparametrization of the type given in this Proposition may be seen as the composition of reparametrizations of one parameter only.
Let be the conditional distributions after reparametrization : ,
Assume there exists a sequence of hypermarginals that is compatible with . For all , we may define as the image of by , that is
We now prove that the sequence of hypermarginals is compatible with the conditionals . Clearly, , so there is nothing to show about . Forall ,
As the sequence of hypermarginals is compatible with the conditionals , Proposition 7 gives us the existence of a Gibbs compromise (resp. ) between the whose hypermarginals are the (resp. ). Moreover, is clearly the image by of . Therefore, if either or is the unique Gibbs compromise between the or the respectively, then so is the other.
Appendix C provides further discussion of the notion of compatibility of hypermarginals, which is central to the definition of a compromise between incompatible conditionals.
4 Gibbs reference posterior with Matérn anisotropic kernels
In this section, we establish that, whenever Matérn anisotropic geometric or tensorized kernels with known smoothness parameter are used, under certain conditions to be detailed later, there exists a unique Gibbs compromise between the reference posterior conditionals, which thanks to Theorem 8 is the optimal compromise. Henceforth, it will be called “Gibbs reference posterior distribution”, even though this “posterior” has not been derived from a prior distribution using the Bayes rule.
All proofs for this section may be found in Appendix D.
In this work, we use the following convention for the Fourier transform : the Fourier transform of a smooth function verifies and .
Let us set up a few notations.
is the modified Bessel function of second kind with parameter ;
is the -dimensional Matérn isotropic covariance kernel with variance 1, correlation length 1 and smoothness and is its Fourier transform :
is the -dimensional Matérn tensorized covariance kernel with variance 1, correlation length 1 and smoothness and is its Fourier transform :
let us adopt the following convention : if , and .
We define the Matérn geometric anisotropic covariance kernel with variance parameter , correlation lengths (resp. inverse correlation lengths ) and smoothness as the function (resp. ).
Similarly, we define the Matérn tensorized covariance kernel with variance parameter , correlation lengths (resp. inverse correlation lengths ) and smoothness as the function (resp. ).
Thanks to Proposition 9, we may choose any parametrization we wish for the Matérn correlation kernels. We have found that the parametrization involving inverse correlation lengths makes proofs easier.
Several key passages in the following proofs involve a technical assumption on the design set :
A design set is said to have coordinate-distinct points, or simply to be coordinate-distinct, if for any distinct points in the set and , every component of vector differs from 0.
Randomly sampled design sets almost surely have coordinate-distinct points – for instance Latin Hypercube Sampling. Cartesian product design sets, however, do not.
4.2 Main result
The folowing result is valid for Simple Kriging models with the following characteristics:
the design set contains coordinate-distinct points in ( and are positive integers);
the covariance function is Matérn anisostropic geometric or tensorized with variance parameter , smoothness parameter and vector of correlation lengths (resp. inverse correlation lengths) (resp. ) ;
one of the following conditions is verified :
and and the Matérn kernel is tensorized ;
In a Simple Kriging model with the characteristics described above, there exists a hyperplane of such that, for any , there exists a unique Gibbs compromise (resp. ) between the reference posterior conditionals (resp. ). It is the unique stationary distribution of the Markovian kernel defined by:
The Markov kernel is uniformly ergodic: (resp. ), where is the total variation norm.
The reference posterior conditionals are invariant by reparametrization, so the Markovian kernel does not depend on whether the chosen parametrization uses correlation lengths or inverse correlation lengths and, as per Propostion 9, the Gibbs compromise does not either. The parametrization using inverse correlation lenghts is more convenient for proving this theorem, however, so we use it exclusively in the rest of this section.
Notice that in such a Kriging model, the vector of observations almost surely belongs to , so this assumption is of no practical consequence. Theorem 11 therefore asserts that the Gibbs compromise between the incompatible conditionals exists, is unique, and can be sampled using an MCMC algorithm. Slightly abusing denominations, we call this “Gibbs sampling”.
In a Simple Kriging model with the characteristics described above, there exists a hyperplane of such that, for any and any , there exists a measurable function such that, for all , the conditional reference posterior density verifies :
Indeed, this implies that ,
and thus , ,
Denoting , is a measurable positive function. Therefore is the density w.r.t. the Lebesgue measure of a positive measure with mass . So is a probability density w.r.t. the Lebesgue measure and the Markov kernel thus satisfies the uniform Doeblin condition :
This implies that is uniformly ergodic : it has a unique invariant probability distribution and , where