Nested Kriging predictions for datasets with large number of observations
Abstract
This work falls within the context of predicting the value of a real function at some input locations given a limited number of observations of this function. The Kriging interpolation technique (or Gaussian process regression) is often considered to tackle such a problem but the method suffers from its computational burden when the number of observation points is large. We introduce in this article nested Kriging predictors which are constructed by aggregating submodels based on subsets of observation points. This approach is proven to have better theoretical properties than other aggregation methods that can be found in the literature. Contrarily to some other methods it can be shown that the proposed aggregation method is consistent. Finally, the practical interest of the proposed method is illustrated on simulated datasets and on an industrial test case with observations in a 6dimensional space.
mycommfont
Keywords: Gaussian process regression big data aggregation methods best linear unbiased predictor spatial processes.
1 Introduction
Gaussian process regression models have proven to be of great interest in many fields when it comes to predict the output of a function , , based on the knowledge of inputoutput tuples for [stein2012interpolation, santner2013design, Rasmussen2006]. One asset of this method is to provide not only a mean predictor but also a quantification of the model uncertainty. The Gaussian process regression framework uses a (centered) realvalued Gaussian process over as a prior distribution for and approximates it by the conditional distribution of given the observations for . In this framework, we denote by the covariance function (or kernel) of : , and by the vector of observation points with entries for .
In the following, we use classical vectorial notations: for any functions , and for any vectors and , we denote by the real valued vector with components and by the real valued matrix with components , , . With such notations, the conditional distribution of given the vector of observations is Gaussian with mean, covariance and variance:
(1) 
Since we do not specify yet the values taken by at , the “mean predictor” is random so it is denoted by an uppercase letter . The approximation of given the observations is thus given by . This method is quite general since an appropriate choice of the kernel allows to recover the models obtained from various frameworks such as linear regression and splines models [wahba1990spline].
One limitation of such models is the computational time required for building models based on a large number of observations. Indeed, these models require computing and inverting the covariance matrix between the observed values , which leads to a complexity in space and in time. In practice, this computational burden makes Gaussian process regression difficult to use when the number of observation points is in the range or greater.
Many methods have been proposed in the literature to overcome this limit. Let us first mention that, when the observations are recorded on a regular grid, choosing a separable covariance function enables to drastically simplify the inversion of the covariance matrix , since the latter can be written as a Kronecker product. In the same context of gridded data, alternative approaches such as Gaussian Markov Random Fields are also available [rue05gaussian].
For irregularly spaced data, a common approach in machine learning relies on inducing points. It consists in introducing a set of pseudo input points and in approximating the full conditional distribution by . The challenge here is to find the best locations for the inducing inputs and to decide which values should be assigned to the outputs at . Various methods are suggested in the literature to answer these questions [guhaniyogi2011adaptive, hensman2013, katzfuss2013bayesian, zhang2015full]. One drawback of this kind of approximation is that the predictions do not interpolate the observation points any more. Note that this method has recently been combined with the Kronecker product method in [nickson2015blitzkriging].
Other methods rely on low rank approximations or compactly supported covariance functions. Both methods show limitations when the scale dependence is respectively short and large. For more details and references, see [stein14limitations, maurya16well, bachoc2017]. Another drawback – which to the best of our knowledge is little discussed in the literature – is the difficulty to use these methods when the dimension of the input space is large (say larger than 10, which is frequent in computer experiments or machine learning).
Let us also mention that the computation of , for an arbitrary vector can be performed using iterative algorithms, like the preconditionned conjugate gradient algorithm[golub12matrix]. Unfortunately, the algorithms need to be run many times when a posterior variance – involving the computation of – needs to be computed for a large set of prediction points.
The method proposed in this paper belongs to the socalled “mixture of experts” family. The latter relies on the aggregation of submodels based on subsets of the data which make them easy to compute. This kind of methods offers a great flexibility since it can be applied with any covariance function and in large dimension while retaining the interpolation property. Some existing “mixture of experts” methods are product of experts [hinton2002training], and the (robust) Bayesian committee machine [trespBCM, deisenroth2015]. All these methods are based on a similar approach: for a given point , each submodel provides its own prediction (a mean and a variance) and these predictions are then merged into one single mean and prediction variance. The differences between these methods lie in how to aggregate the predictions made by each submodel. It shall be noted that aggregating expert opinions is the topic of consensus statistical methods (sometimes referred to as opinion synthesis or averaging methods), where probability distributions representing expert opinions are joined together. Early references are [winkler1968consensus, winkler1981combining]. A detailed review and an annotated bibliography is given in [genest1986combining] (see also [satopaa2015modeling, ranjan2010combining] for recent related developments). From a probabilistic perspective, usual mixture of experts methods assume that there is some (conditional) independence between the submodels. Although this kind of hypothesis leads to efficient computations, it is often violated in practice and may lead to poor predictions as illustrated in [samo2016string]. Furthermore, these methods only provide pointwise confidence intervals instead of a full Gaussian process posterior distribution.
Since our method is part of the mixture of experts framework, it benefits from the properties of the mixture of experts techniques: it does not require the data to be on a grid, the predictions can interpolate the observations and it can be applied to data with small or large scale dependencies regardless of the input space dimension. Compared to other mixtures of experts, we relax the usually made independence assumption so that the prediction takes into account all pairwise crosscovariances between observations. We show that this addresses two main pitfalls of usual mixture of experts. First, the predictions are more accurate. Second, the theoretical consistency is ensured whereas it is not the case for the product of experts and the Bayesian committee machine methods. The detailed proofs of the later are out of the scope of this paper and we refer the interested reader to [bachoc2017] for further details. The proposed method remains computationally affordable: predictions are performed in a few seconds for and a few minutes for using a standard laptop and the proposed online implementation. Finally, the prediction method comes with a naturally associated inference procedure, which is based on cross validation errors.
The proposed method is presented in Section 2. In Section 3, we introduce an iterative scheme for nesting the predictors derived previously. A procedure for estimating the parameters of models is then given in Section 4. Finally, Section 5 compares the method with stateoftheart aggregation methods on both a simulated dataset and an industrial case study.
2 Pointwise aggregation of experts
Let us now address in more details the framework of this article. The method is based on the aggregation of submodels defined on smaller subsets of points. Let be subvectors of the vector of observations input points , it is thus possible to define associated submodels (or experts) . For example, the submodel can be a Gaussian process regression model based on a subset of the data
(2) 
however, we make no Gaussian assumption in this section. For a given prediction point , the submodels predictions are gathered into a vector . The random column vector is supposed to be centered with finite first two moments and we consider that both the covariance vector and the covariance matrix are given. Submodels aggregation (or mixture of experts) aims at merging all the pointwise submodels into one unique pointwise predictor of . We propose the following aggregation:
Definition 1 (Submodels aggregation).
For a given point , let , be submodels with covariance matrix . Then, when is invertible, we define the submodel aggregation as:
(3) 
In practice, the invertibility condition on can be avoided by using matrices pseudoinverses. Given the vector of observations , the associated prediction is
(4) 
Notice that we are here aggregating random variables rather than their distributions. For dependent nonelliptical random variables, expressing the probability density function of as a function of each expert density is not straightforward. This difference in the approaches implies that the proposed method differs from usual consensus aggregations. For example, aggregating random variables allows to specify the correlations between the aggregated prediction and the experts whereas aggregating expert distributions into a univariate prediction distribution does not characterize uniquely these correlations.
Proposition 1 (Blup).
is the best linear unbiased predictor of that writes . The mean squared error writes
(5) 
The coefficients are given by the vector .
Proof.
The standard proof applies: The square error writes . The value of minimising it can be found by differentiation: which leads to . Then, and the result holds. ∎
Proposition 2 (Basic properties).
Let be a given prediction point in .

Linear case: if is linear in , i.e. if there exists a deterministic matrix such that and if is invertible, then is linear in with
(6) where .

Interpolation case: if interpolates at , i.e. if for any component of the vector there is at least one index such that , and if is invertible for any component of , then is also interpolating, i.e.
(7) where is a vector with entries . This property can be extended when some are not invertible by using pseudoinverse in place of matrix inverse in Definition 1.

Gaussian case: if the joint distribution is multivariate normal, then the conditional distribution of given is normal with moments
(8)
Proof.
Linearity directly derives from and .
Interpolation: Let , and be an index such that . As , the line of is equal to . Setting the dimensional vector having entries except on its component, it is thus clear that . As is assumed to be invertible, then , so that . This result can be plugged into the definition of to obtain the second part of Eq. (7): .
Finally the Gaussian case can be proved directly by applying the usual multivariate normal conditioning formula.
∎
Let us assume here that conditions in items (i) and (ii) of Proposition 2 are satisfied, that is that is linear in , and that . Then, the proposed aggregation method also benefits from several other interesting properties:

First, the aggregation can be seen as an exact conditional process for a slightly different prior distribution on . One can indeed define a process as where is an independent replicate of and with as in (3). One can then show that and
Denoting the covariance function of , one can also show that for all and .

Second, the error between the aggregated model and the full model of Equation (1) can be bounded. For any norm , one can show that there exists some constants such that
One can also show that the differences between the full and aggregated models write as norm differences, where :

Third, contrarily to several other aggregation methods, when submodels are informative enough, the difference between the aggregated model and the full model vanishes: when where is a matrix with full rank, then and . Furthermore, in this fullinformation case,

Finally, in the Gaussian case and under some supplementary conditions, it can be proven that, contrarily to several other aggregation methods, the proposed method is consistent when the number of observation points tends toward infinity. Let be a triangular array of observation points such that for all , . For , let , let be any collection of simple Kriging predictors based on respective design points where (with a slight abuse of notation), then
One can also exhibit nonconsistency results for other aggregation methods of the literature that do not use covariances between submodels.
The full development of these properties is out of the scope of this paper and we dedicate a separate article to detail them [bachoc2017].
We now illustrate our aggregation method with two simple examples.
Example 1 (Gaussian process regression aggregation).
In this example, we set and we approximate the function based on a set of five observation points in : . These observations are gathered in two column vectors and . We use as prior a centered Gaussian process with squared exponential covariance in order to build two Kriging submodels, for :
(9) 
The expressions required to compute as defined in Eq. (3) are for :
(10) 
Recall and , as it can be seen in Figure 1, the resulting model appears to be a very good approximation of and there is only a slight difference between prediction variances and on this example.
Example 2 (Linear regression aggregation).
In this distributionfree example, we set and we consider the process where and are independent centered random variables with unit variance. is thus centered with covariance . Furthermore, we consider that is corrupted by some observation noise where is an independent white noise process with covariance . Note that we only make assumptions on the first two moments of , or but not on their laws. We introduce five observation points gathered in two column vectors: and and their associated outputs and . The linear regression submodels, obtained by square error minimization, are , , where stands for the identity matrix. Resulting covariances , and aggregated model , of Eq. (8) are then easily obtained. The resulting model is illustrated in Figure 2.
3 Iterative scheme
In the previous sections, we have seen how to aggregate submodels into one unique aggregated value . Now, starting from the same submodels, one can imagine creating several aggregated values, , each of them based on a subset of . One can show that these aggregated values can themselves be aggregated. This makes possible the construction of an iterative algorithm that merges submodels at successive steps, according to a tree structure. Such treebased schemes are sometimes used to reduce the complexity of models, see e.g. [tzeng2005fast], or to allow parallel computing [doi:10.1080/15481603.2014.1002379].
The aim of this section is to give a generic algorithm for aggregating submodels according to a tree structure and to show that the choice of the tree structure helps partially reducing the complexity of the algorithm. It also aims at giving perspectives for further large reduction of the global complexity.
Let us introduce some notations. The total height (i.e number of layers) of the tree is denoted and the number of node of a layer is . We associate to each node (say node in layer ) a submodel corresponding to the aggregation of its child node submodels. In other words, is the aggregation of where is the set of children of node in layer . These notations are summarized in Figure 3 which details the tree associated with Example 1. In practice, there will be one root node () and each node will have at least one parent: . Typically, the sets , , are a partition of but this assumption is not required and a child node may have several parents (which can generate a lattice rather than a tree).
3.1 TwoLayer aggregation
We discuss in this section the tree structure associated with the case as per the previous examples. With such settings, the first step consists in calculating the initial submodels of the layer and the second one is to aggregate all submodels of layer into one unique predictor (see for example Figure 3). This aggregation is obtained by direct application of Definition 1.
In practice the submodels can be any covariates, like gradients, nonGaussian underlying factors or even blackbox responses, as soon as crosscovariances and covariances with are known. When submodels are calculated from direct observations , the number of leave nodes at layer is . In further numerical illustrations of Section 5, the submodels are simple Kriging predictors of , with for ,
(11) 
With these particular simple Kriging initial submodels, the layer corresponds to the aggregation of covariates at the previous layer , .
3.2 Multiple Layer aggregation
In order to extend the twolayer settings, one needs to compute covariances among aggregated submodels. The following proposition gives covariances between aggregated models of a given layer.
Proposition 3 (aggregated models covariances).
Let us consider a layer and given aggregated models . Assume that the following covariances and are given, . Let be a number of new aggregated values. Consider subsets of , , and assume that is the aggregation of , . Then
(12) 
where the vectors of optimal weights are
and where corresponds to the subvector of of indices in and similarly for and the submatrix , which is assumed to be invertible.
Furthermore, .
Proof.
This follows immediately from Definition 1: as the aggregated values are linear expressions, the calculation of their covariances is straightforward. The last equality is simply obtained by inserting the value of into the expression of . ∎
The following algorithm, which is a generic algorithm for aggregating submodels according to a tree structure, is based on an iterative use of the previous proposition. It is given for one prediction point and it assumes that the submodels are already calculated, starting directly from layer 1. This allows a large variety of submodels, and avoids the storage of the possibly large covariance matrix .
Its outputs are the final scalar aggregated model, , and the scalar covariance from which one deduces the prediction error .
In order to give dimensions in the algorithm and to ease the calculation of complexities, we define as the number of children of the submodel , . We also denote the maximal number of children.
[H] \DontPrintSemicolon\CommentSty
Inputinputs
\SetKwInOutOutputoutputs
\Input, vector of length (submodels evaluated at )
, vector of length (covariance between and submodels at )
, matrix of size (covariance between submodels at )
, a list describing the tree structure
,
Create vectors , of size and matrix of size
Create vectors of size and matrix of size
Create vector of size subvector of on submatrix of on \KwStyif \KwSty then \KwStyelse \For submatrix of on , and all can be deleted
Notice that Algorithm 3 uses the result from Prop. 3: when we consider aggregated models (), we do not need to store and compute the vector any more. When , depending on the initial covariates, is not necessarily equal to (this is however the case when are simple Kriging predictors).
For the sake of clarity, some improvements have been omitted in the algorithm above. For instance, covariances can be stored in triangular matrices, one can store two couples instead of couples by using objects and . Furthermore, it is quite natural to adapt this algorithm to parallel computing, but this is out of the scope of this article.
3.3 Complexity
We study here the complexity of Algorithm 3 in space (storage footprint) and in time (execution time). For the sake of clarity we consider in this paragraph a simplified tree where is decreasing in and each child has only one parent. This corresponds to the most common structure of trees, without overlapping. Furthermore, at any given level , we consider that each node has the same number of children: for all . Such a tree will be called regular. In this setting, one easily sees that , . Complexities obviously depend on the choice of submodels, we give here complexities for Kriging submodels as in Eq. (11), but this can be adapted to other kinds of submodels.
For one prediction point , we denote by the storage footprint of Algorithm 3, and by its complexity in time, including submodels calculation. One can show that in a particular twolayers setting with submodels ( and ), a reachable global complexity for prediction points is (see assumptions below and expression details in the proof of Proposition 4)
(13) 
This is to be compared with for the same prediction with the full model. The aggregation of submodels can be useful when the number of prediction points is smaller than the number of observations. Notice that the storage needed for prediction points is the same as for one prediction point, but in some cases (as for leaveoneout errors calculation), it is worth using a storage to avoid recalculations of some quantities.
We now detail chosen assumptions on the calculation of and , and study the impact of the tree structure on these quantities. For one prediction point , including submodels calculation, the complexity in time can be decomposed into , where

is the complexity for computing all cross covariances among initial design points, which does not depend on the tree structure (neither on the number of prediction points).

is the complexity for building all aggregation predictors, i.e. the sum over of all operations in the loop in Algorithm 3 (excluding operations in the loop).

is the complexity for building the covariance matrices among these predictors, i.e. the sum over of all operations in the loop in Algorithm 3.
We assume here that there exists two constants and such that the complexity of operations inside the loop (excluding those of the loop) is , and the complexity of operations inside the loop is . Despite perfectible, this assumption follows from the fact that one usually considers that the complexity of matrix inversion is and the complexity of matrixvector multiplication is . We also assume that the tree height is finite, and that all numbers of children tend to as tends to . This excludes for example binary trees, but makes assumptions on complexities more reliable. Under these assumptions, the following proposition details how the tree structure affects the complexities.
Proposition 4 (Complexities).
The following storage footprint and complexities , hold for the respective tree structures, when the number of observations tends to .

The twolayer equilibrated tree, where , , is the optimal storage footprint tree, and
(14) 
The layer equilibrated tree, where , , is such that
(15) 
The optimal complexity tree is defined as the regular tree structure that minimizes , as it is not possible to reduce to orders lower than . This tree is such that
(16) with and