A Proof of Proposition 4

Nested Kriging predictions for datasets with large number of observations

Abstract

This work falls within the context of predicting the value of a real function at some input locations given a limited number of observations of this function. The Kriging interpolation technique (or Gaussian process regression) is often considered to tackle such a problem but the method suffers from its computational burden when the number of observation points is large. We introduce in this article nested Kriging predictors which are constructed by aggregating sub-models based on subsets of observation points. This approach is proven to have better theoretical properties than other aggregation methods that can be found in the literature. Contrarily to some other methods it can be shown that the proposed aggregation method is consistent. Finally, the practical interest of the proposed method is illustrated on simulated datasets and on an industrial test case with observations in a 6-dimensional space.

mycommfont

Keywords: Gaussian process regression big data aggregation methods best linear unbiased predictor spatial processes.

1 Introduction

Gaussian process regression models have proven to be of great interest in many fields when it comes to predict the output of a function , , based on the knowledge of input-output tuples for [stein2012interpolation, santner2013design, Rasmussen2006]. One asset of this method is to provide not only a mean predictor but also a quantification of the model uncertainty. The Gaussian process regression framework uses a (centered) real-valued Gaussian process over as a prior distribution for and approximates it by the conditional distribution of given the observations for . In this framework, we denote by the covariance function (or kernel) of : , and by the vector of observation points with entries for .

In the following, we use classical vectorial notations: for any functions , and for any vectors and , we denote by the real valued vector with components and by the real valued matrix with components , , . With such notations, the conditional distribution of given the vector of observations is Gaussian with mean, covariance and variance:

 Missing or unrecognized delimiter for \left (1)

Since we do not specify yet the values taken by at , the “mean predictor” is random so it is denoted by an upper-case letter . The approximation of given the observations is thus given by . This method is quite general since an appropriate choice of the kernel allows to recover the models obtained from various frameworks such as linear regression and splines models [wahba1990spline].

One limitation of such models is the computational time required for building models based on a large number of observations. Indeed, these models require computing and inverting the covariance matrix between the observed values , which leads to a complexity in space and in time. In practice, this computational burden makes Gaussian process regression difficult to use when the number of observation points is in the range or greater.

Many methods have been proposed in the literature to overcome this limit. Let us first mention that, when the observations are recorded on a regular grid, choosing a separable covariance function enables to drastically simplify the inversion of the covariance matrix , since the latter can be written as a Kronecker product. In the same context of gridded data, alternative approaches such as Gaussian Markov Random Fields are also available [rue05gaussian].

For irregularly spaced data, a common approach in machine learning relies on inducing points. It consists in introducing a set of pseudo input points and in approximating the full conditional distribution by . The challenge here is to find the best locations for the inducing inputs and to decide which values should be assigned to the outputs at . Various methods are suggested in the literature to answer these questions [guhaniyogi2011adaptive, hensman2013, katzfuss2013bayesian, zhang2015full]. One drawback of this kind of approximation is that the predictions do not interpolate the observation points any more. Note that this method has recently been combined with the Kronecker product method in [nickson2015blitzkriging].

Other methods rely on low rank approximations or compactly supported covariance functions. Both methods show limitations when the scale dependence is respectively short and large. For more details and references, see [stein14limitations, maurya16well, bachoc2017]. Another drawback – which to the best of our knowledge is little discussed in the literature – is the difficulty to use these methods when the dimension of the input space is large (say larger than 10, which is frequent in computer experiments or machine learning).

Let us also mention that the computation of , for an arbitrary vector can be performed using iterative algorithms, like the preconditionned conjugate gradient algorithm[golub12matrix]. Unfortunately, the algorithms need to be run many times when a posterior variance – involving the computation of – needs to be computed for a large set of prediction points.

The method proposed in this paper belongs to the so-called “mixture of experts” family. The latter relies on the aggregation of sub-models based on subsets of the data which make them easy to compute. This kind of methods offers a great flexibility since it can be applied with any covariance function and in large dimension while retaining the interpolation property. Some existing “mixture of experts” methods are product of experts [hinton2002training], and the (robust) Bayesian committee machine [trespBCM, deisenroth2015]. All these methods are based on a similar approach: for a given point , each sub-model provides its own prediction (a mean and a variance) and these predictions are then merged into one single mean and prediction variance. The differences between these methods lie in how to aggregate the predictions made by each sub-model. It shall be noted that aggregating expert opinions is the topic of consensus statistical methods (sometimes referred to as opinion synthesis or averaging methods), where probability distributions representing expert opinions are joined together. Early references are [winkler1968consensus, winkler1981combining]. A detailed review and an annotated bibliography is given in [genest1986combining] (see also [satopaa2015modeling, ranjan2010combining] for recent related developments). From a probabilistic perspective, usual mixture of experts methods assume that there is some (conditional) independence between the sub-models. Although this kind of hypothesis leads to efficient computations, it is often violated in practice and may lead to poor predictions as illustrated in [samo2016string]. Furthermore, these methods only provide pointwise confidence intervals instead of a full Gaussian process posterior distribution.

Since our method is part of the mixture of experts framework, it benefits from the properties of the mixture of experts techniques: it does not require the data to be on a grid, the predictions can interpolate the observations and it can be applied to data with small or large scale dependencies regardless of the input space dimension. Compared to other mixtures of experts, we relax the usually made independence assumption so that the prediction takes into account all pairwise cross-covariances between observations. We show that this addresses two main pitfalls of usual mixture of experts. First, the predictions are more accurate. Second, the theoretical consistency is ensured whereas it is not the case for the product of experts and the Bayesian committee machine methods. The detailed proofs of the later are out of the scope of this paper and we refer the interested reader to [bachoc2017] for further details. The proposed method remains computationally affordable: predictions are performed in a few seconds for and a few minutes for using a standard laptop and the proposed online implementation. Finally, the prediction method comes with a naturally associated inference procedure, which is based on cross validation errors.

The proposed method is presented in Section 2. In Section 3, we introduce an iterative scheme for nesting the predictors derived previously. A procedure for estimating the parameters of models is then given in Section 4. Finally, Section 5 compares the method with state-of-the-art aggregation methods on both a simulated dataset and an industrial case study.

2 Pointwise aggregation of experts

Let us now address in more details the framework of this article. The method is based on the aggregation of sub-models defined on smaller subsets of points. Let be subvectors of the vector of observations input points , it is thus possible to define associated sub-models (or experts) . For example, the sub-model can be a Gaussian process regression model based on a subset of the data

 Mi(x)=E[Y(x)|Y(Xi)]=k(x,Xi)k(Xi,Xi)−1Y(Xi), (2)

however, we make no Gaussian assumption in this section. For a given prediction point , the sub-models predictions are gathered into a vector . The random column vector is supposed to be centered with finite first two moments and we consider that both the covariance vector and the covariance matrix are given. Sub-models aggregation (or mixture of experts) aims at merging all the pointwise sub-models into one unique pointwise predictor of . We propose the following aggregation:

Definition 1 (Sub-models aggregation).

For a given point , let , be sub-models with covariance matrix . Then, when is invertible, we define the sub-model aggregation as:

 MA(x)=kM(x)tKM(x)−1M(x). (3)

In practice, the invertibility condition on can be avoided by using matrices pseudo-inverses. Given the vector of observations , the associated prediction is

 mA(x)=kM(x)tKM(x)−1m(x). (4)

Notice that we are here aggregating random variables rather than their distributions. For dependent non-elliptical random variables, expressing the probability density function of as a function of each expert density is not straightforward. This difference in the approaches implies that the proposed method differs from usual consensus aggregations. For example, aggregating random variables allows to specify the correlations between the aggregated prediction and the experts whereas aggregating expert distributions into a univariate prediction distribution does not characterize uniquely these correlations.

Proposition 1 (Blup).

is the best linear unbiased predictor of that writes . The mean squared error writes

 vA(x)=k(x,x)−kM(x)tKM(x)−1kM(x). (5)

The coefficients are given by the vector .

Proof.

The standard proof applies: The square error writes . The value of minimising it can be found by differentiation: which leads to . Then, and the result holds. ∎

Proposition 2 (Basic properties).

Let be a given prediction point in .

• Linear case: if is linear in , i.e. if there exists a deterministic matrix such that and if is invertible, then is linear in with

 {MA(x)=λA(x)tY(X)to0.0pt,vA(x)=k(x,x)−λA(x)tk(X,x)to0.0pt. (6)

where .

• Interpolation case: if interpolates at , i.e. if for any component of the vector there is at least one index such that , and if is invertible for any component of , then is also interpolating, i.e.

 {MA(X)=Y(X)to0.0pt,vA(X)=0nto0.0pt, (7)

where is a vector with entries . This property can be extended when some are not invertible by using pseudo-inverse in place of matrix inverse in Definition 1.

• Gaussian case: if the joint distribution is multivariate normal, then the conditional distribution of given is normal with moments

 {E[Y(x)|Mi(x),i∈A]=MA(x)to0.0pt,V[Y(x)|Mi(x),i∈A]=vA(x)to0.0pt. (8)
Proof.

Linearity directly derives from and .
Interpolation: Let , and be an index such that . As , the line of is equal to . Setting the dimensional vector having entries except on its component, it is thus clear that . As is assumed to be invertible, then , so that . This result can be plugged into the definition of to obtain the second part of Eq. (7): .
Finally the Gaussian case can be proved directly by applying the usual multivariate normal conditioning formula. ∎

Let us assume here that conditions in items (i) and (ii) of Proposition 2 are satisfied, that is that is linear in , and that . Then, the proposed aggregation method also benefits from several other interesting properties:

• First, the aggregation can be seen as an exact conditional process for a slightly different prior distribution on . One can indeed define a process as where is an independent replicate of and with as in (3). One can then show that and

 {MA(x)=E[YA(x)|YA(X)]to0.0pt,vA(x)=V[YA(x)|YA(X)]to0.0pt.

Denoting the covariance function of , one can also show that for all and .

• Second, the error between the aggregated model and the full model of Equation (1) can be bounded. For any norm , one can show that there exists some constants such that

 {|MA(x)−Mfull(x)|≤λ∥k(X,x)∥∥Y(X)∥to0.0pt,|vA(x)−vfull(x)|≤μ∥k(X,x)∥2to0.0pt.

One can also show that the differences between the full and aggregated models write as norm differences, where :

 {E[(MA(x)−Mfull(x))2]=∥k(X,x)−kA(X,x)∥2Kto0.0pt,vA(x)−vfull(x)=∥k(X,x)∥2K−∥kA(X,x)∥2Kto0.0pt.
• Third, contrarily to several other aggregation methods, when sub-models are informative enough, the difference between the aggregated model and the full model vanishes: when where is a matrix with full rank, then and . Furthermore, in this full-information case,

 {MA(x)=Mfull(x)to0.0pt,vA(x)=vfull(x)to0.0pt.
• Finally, in the Gaussian case and under some supplementary conditions, it can be proven that, contrarily to several other aggregation methods, the proposed method is consistent when the number of observation points tends toward infinity. Let be a triangular array of observation points such that for all , . For , let , let be any collection of simple Kriging predictors based on respective design points where (with a slight abuse of notation), then

 supx∈DE((Y(x)−MA(x))2)→n→∞0.

One can also exhibit non-consistency results for other aggregation methods of the literature that do not use covariances between sub-models.

The full development of these properties is out of the scope of this paper and we dedicate a separate article to detail them [bachoc2017].

We now illustrate our aggregation method with two simple examples.

Example 1 (Gaussian process regression aggregation).

In this example, we set and we approximate the function based on a set of five observation points in : . These observations are gathered in two column vectors and . We use as prior a centered Gaussian process with squared exponential covariance in order to build two Kriging sub-models, for :

 {Mi(x)=E[Y(x)|Y(Xi)]=k(x,Xi)k(Xi,Xi)−1Y(Xi)to0.0pt,mi(x)=E[Y(x)|Y(Xi)\resizebox7.499886pt0.0pt=f(Xi)]=k(x,Xi)k(Xi,Xi)−1f(Xi)to0.0pt. (9)

The expressions required to compute as defined in Eq. (3) are for :

 {(kM(x))i=Cov[Mi(x),Y(x)]=k(x,Xi)k(Xi,Xi)−1k(Xi,x)to0.0pt,(KM(x))i,j=Cov[Mi(x),Mj(x)]=k(x,Xi)k(Xi,Xi)−1k(Xi,Xj)k(Xj,Xj)−1k(Xj,x)to0.0pt. (10)

Recall and , as it can be seen in Figure 1, the resulting model appears to be a very good approximation of and there is only a slight difference between prediction variances and on this example.

Example 2 (Linear regression aggregation).

In this distribution-free example, we set and we consider the process where and are independent centered random variables with unit variance. is thus centered with covariance . Furthermore, we consider that is corrupted by some observation noise where is an independent white noise process with covariance . Note that we only make assumptions on the first two moments of , or but not on their laws. We introduce five observation points gathered in two column vectors: and and their associated outputs and . The linear regression sub-models, obtained by square error minimization, are , , where stands for the identity matrix. Resulting covariances , and aggregated model , of Eq. (8) are then easily obtained. The resulting model is illustrated in Figure 2.

3 Iterative scheme

In the previous sections, we have seen how to aggregate sub-models into one unique aggregated value . Now, starting from the same sub-models, one can imagine creating several aggregated values, , each of them based on a subset of . One can show that these aggregated values can themselves be aggregated. This makes possible the construction of an iterative algorithm that merges sub-models at successive steps, according to a tree structure. Such tree-based schemes are sometimes used to reduce the complexity of models, see e.g.  [tzeng2005fast], or to allow parallel computing [doi:10.1080/15481603.2014.1002379].

The aim of this section is to give a generic algorithm for aggregating sub-models according to a tree structure and to show that the choice of the tree structure helps partially reducing the complexity of the algorithm. It also aims at giving perspectives for further large reduction of the global complexity.

Let us introduce some notations. The total height (i.e number of layers) of the tree is denoted and the number of node of a layer is . We associate to each node (say node  in layer ) a sub-model corresponding to the aggregation of its child node sub-models. In other words, is the aggregation of where is the set of children of node  in layer . These notations are summarized in Figure 3 which details the tree associated with Example 1. In practice, there will be one root node () and each node will have at least one parent: . Typically, the sets , , are a partition of but this assumption is not required and a child node may have several parents (which can generate a lattice rather than a tree).

3.1 Two-Layer aggregation

We discuss in this section the tree structure associated with the case as per the previous examples. With such settings, the first step consists in calculating the initial sub-models of the layer and the second one is to aggregate all sub-models of layer into one unique predictor (see for example Figure 3). This aggregation is obtained by direct application of Definition 1.

In practice the sub-models can be any covariates, like gradients, non-Gaussian underlying factors or even black-box responses, as soon as cross-covariances and covariances with are known. When sub-models are calculated from direct observations , the number of leave nodes at layer is . In further numerical illustrations of Section 5, the sub-models are simple Kriging predictors of , with for ,

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩M1i(x)=k(x,Xi)k(Xi,Xi)−1Y(Xi)to0.0pt,Cov[M1i(x),Y(x)]=k(x,Xi)k(Xi,Xi)−1k(Xi,x)to0.0pt,Cov[M1i(x),M1j(x)]=k(x,Xi)k(Xi,Xi)−1k(Xi,Xj)k(Xj,Xj)−1k(Xj,x)to0.0pt. (11)

With these particular simple Kriging initial sub-models, the layer corresponds to the aggregation of covariates at the previous layer , .

3.2 Multiple Layer aggregation

In order to extend the two-layer settings, one needs to compute covariances among aggregated sub-models. The following proposition gives covariances between aggregated models of a given layer.

Proposition 3 (aggregated models covariances).

Let us consider a layer and given aggregated models . Assume that the following covariances and are given, . Let be a number of new aggregated values. Consider subsets of , , and assume that is the aggregation of , . Then

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(Mν+1(x))i=αν+1i(x)t(Mν(x)[Aν+1i])to0.0pt,Cov[Mν+1i(x),Y(x)]=αν+1i(x)t(kν(x)[Aν+1i])to0.0pt,Cov[Mν+1i(x),Mν+1j(x)]=αν+1i(x)t(Kν(x)[Aν+1i,Aν+1j])αν+1j(x)to0.0pt, (12)

where the vectors of optimal weights are and where corresponds to the sub-vector of of indices in and similarly for and the submatrix , which is assumed to be invertible.
Furthermore, .

Proof.

This follows immediately from Definition 1: as the aggregated values are linear expressions, the calculation of their covariances is straightforward. The last equality is simply obtained by inserting the value of into the expression of . ∎

The following algorithm, which is a generic algorithm for aggregating sub-models according to a tree structure, is based on an iterative use of the previous proposition. It is given for one prediction point and it assumes that the sub-models are already calculated, starting directly from layer 1. This allows a large variety of sub-models, and avoids the storage of the possibly large covariance matrix . Its outputs are the final scalar aggregated model, , and the scalar covariance from which one deduces the prediction error .

In order to give dimensions in the algorithm and to ease the calculation of complexities, we define as the number of children of the sub-model , . We also denote the maximal number of children.

{algorithm}

\SetKwInOut

Inputinputs \SetKwInOutOutputoutputs \Input, vector of length (sub-models evaluated at )
, vector of length (covariance between and sub-models at )
, matrix of size (covariance between sub-models at )
, a list describing the tree structure

\Output

,

Create vectors , of size and matrix of size

\For

Create vectors of size and matrix of size

\For

Create vector of size subvector of on submatrix of on \KwStyif \KwSty then \KwStyelse \For submatrix of on , and all can be deleted Nested Kriging algorithm

Notice that Algorithm 3 uses the result from Prop. 3: when we consider aggregated models (), we do not need to store and compute the vector any more. When , depending on the initial covariates, is not necessarily equal to (this is however the case when are simple Kriging predictors).

For the sake of clarity, some improvements have been omitted in the algorithm above. For instance, covariances can be stored in triangular matrices, one can store two couples instead of couples by using objects and . Furthermore, it is quite natural to adapt this algorithm to parallel computing, but this is out of the scope of this article.

3.3 Complexity

We study here the complexity of Algorithm 3 in space (storage footprint) and in time (execution time). For the sake of clarity we consider in this paragraph a simplified tree where is decreasing in and each child has only one parent. This corresponds to the most common structure of trees, without overlapping. Furthermore, at any given level , we consider that each node has the same number of children: for all . Such a tree will be called regular. In this setting, one easily sees that , . Complexities obviously depend on the choice of sub-models, we give here complexities for Kriging sub-models as in Eq. (11), but this can be adapted to other kinds of sub-models.

For one prediction point , we denote by the storage footprint of Algorithm 3, and by its complexity in time, including sub-models calculation. One can show that in a particular two-layers setting with sub-models ( and ), a reachable global complexity for prediction points is (see assumptions below and expression details in the proof of Proposition 4)

 S=O(n) and qC=O(n2q). (13)

This is to be compared with for the same prediction with the full model. The aggregation of sub-models can be useful when the number of prediction points is smaller than the number of observations. Notice that the storage needed for prediction points is the same as for one prediction point, but in some cases (as for leave-one-out errors calculation), it is worth using a storage to avoid recalculations of some quantities.

We now detail chosen assumptions on the calculation of and , and study the impact of the tree structure on these quantities. For one prediction point , including sub-models calculation, the complexity in time can be decomposed into , where

• is the complexity for computing all cross covariances among initial design points, which does not depend on the tree structure (neither on the number of prediction points).

• is the complexity for building all aggregation predictors, i.e. the sum over of all operations in the -loop in Algorithm 3 (excluding operations in the -loop).

• is the complexity for building the covariance matrices among these predictors, i.e. the sum over of all operations in the -loop in Algorithm 3.

We assume here that there exists two constants and such that the complexity of operations inside the -loop (excluding those of the -loop) is , and the complexity of operations inside the -loop is . Despite perfectible, this assumption follows from the fact that one usually considers that the complexity of matrix inversion is and the complexity of matrix-vector multiplication is . We also assume that the tree height is finite, and that all numbers of children tend to as tends to . This excludes for example binary trees, but makes assumptions on complexities more reliable. Under these assumptions, the following proposition details how the tree structure affects the complexities.

Proposition 4 (Complexities).

The following storage footprint and complexities , hold for the respective tree structures, when the number of observations tends to .

• The two-layer equilibrated -tree, where , , is the optimal storage footprint tree, and

 Extra open brace or missing close brace (14)
• The -layer equilibrated -tree, where , , is such that

 S=O(n2−2/¯ν),Cα∼αn1+2¯ν,Cβ∼β2n2. (15)
• The optimal complexity tree is defined as the regular tree structure that minimizes , as it is not possible to reduce to orders lower than . This tree is such that

 S=O(n2−1δ¯ν−1),Cα∼γαn1+1δ¯ν−1,Cβ∼β2n2, (16)

with and