When Gaussian Process Meets Big Data: A Review of Scalable GPs


Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai

Haitao Liu is with the Rolls-Royce@NTU Corporate Lab, Nanyang Technological University, Singapore 637460. E-mail: htliu@ntu.edu.sg. Yew-Soon Ong, Xiaobo Shen and Jianfei Cai are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798. E-mail: asysong@ntu.edu.sg, xbshen@ntu.edu.sg, asjfcai@ntu.edu.sg.
Abstract

The vast quantity of information brought by big data as well as the evolving computer hardware encourages success stories in the machine learning community. Meanwhile, it poses challenges for the Gaussian process (GP), a well-known non-parametric and interpretable Bayesian model, which suffers from cubic time complexity with respect to the training size. To improve the scalability while retaining desirable prediction quality, a variety of scalable GPs have been presented. But they have not yet been comprehensively reviewed and discussed in a unifying way so as to be well understood by both academia and industry. To this end, this paper is devoted to reviewing state-of-the-art scalable GPs involving two main categories: global approximations, which distill the entire data, and local approximations, which divide the data for subspace learning. Particularly, for global approximations, we mainly focus on sparse approximations, comprising prior approximations, which modify the prior but perform exact inference, and posterior approximations, which retain the exact prior but perform approximate inference; for local approximations, we highlight the mixture/product of experts that conducts model averaging over multiple local experts to boost predictions. To present a complete review, recent advances for improving the scalability and model capability of scalable GPs are reviewed. Finally, the extensions and open issues regarding the implementation of scalable GPs in various scenarios are reviewed and discussed to inspire novel ideas for future research avenues.

Gaussian process, big data, scalability, sparse approximations, mixture/product of experts

I Introduction

In the era of big data, the vast quantity of information poses the demand for effective and efficient analysis, interpretation and prediction to exploit the benefits that lie ahead. Thanks to big data, the machine learning community has told many success stories [1, 2, 3, 4, 5] while still facing many challenges. This article focuses on Gaussian process (GP) regression [6], also known as Kriging in geostatistics [7], and as surrogates or emulators in computer experiments [8]. The GP is a non-parametric statistical model which has been extensively used in various scenarios, e.g., active learning [9, 10], multi-task learning [11, 12], manifold learning [13], reinforcement learning [14], and optimization [15].

Given $n$ training points $\mathbf{X} = \{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^n$ and the relevant observations $\mathbf{y} = \{y_i \in \mathbb{R}\}_{i=1}^n$, GP seeks to infer the latent function $f: \mathbb{R}^d \mapsto \mathbb{R}$ in the function space $\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$ defined by the mean function $m(\cdot)$ and the kernel function $k(\cdot, \cdot)$. The most prominent weakness of standard GP inference is that it suffers from a cubic time complexity $\mathcal{O}(n^3)$ because of the calculation of the inversion and determinant of the $n \times n$ kernel matrix $\mathbf{K}_{nn} = k(\mathbf{X}, \mathbf{X})$. This limits the scalability of GP and confines it to datasets of size less than $\mathcal{O}(10^4)$.

Hence, scalable GPs are devoted to improving the scalability of the full GP while retaining favorable prediction quality for big data. Through an extensive literature review, which is summarized in Fig. 1, we classify scalable GPs into two main categories:

  • Global approximations which approximate the kernel matrix $\mathbf{K}_{nn}$ through global distillation. The distillation can be achieved by (i) a random subset of the training data with $m$ ($m \ll n$) points (subset-of-data [16]), resulting in a smaller kernel matrix $\mathbf{K}_{mm}$; (ii) the deletion of unimportant entries in $\mathbf{K}_{nn}$ (sparse kernels [17]), resulting in a sparse kernel matrix with many zero entries; and (iii) the low-rank representation measured between $m$ global inducing points and the $n$ training points (sparse approximations [18, 19, 2, 20]), resulting in the Nyström approximation $\mathbf{K}_{nn} \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$.

  • Local approximations which follow the divide-and-conquer (D&C) idea to focus on local subsets of the training data. Efficiently, local approximations only need to tackle a small local subset of the data at a time [21, 22]. Additionally, to produce smooth predictions equipped with valid uncertainty information, model averaging has been employed through the mixture-of-experts [23, 24, 25, 26] or the product-of-experts [27, 28, 29, 30].

Fig. 1: Percentages of the categories for (a) scalable GPs, (b) global approximations and (c) local approximations in the literature surveyed.

As depicted in Fig. 2, in terms of scalability, most of the sparse approximations using $m$ inducing points and the local approximations using $m$ data points for each expert have the same training complexity of $\mathcal{O}(nm^2)$, and they can be further sped up through parallel/distributed computing [31, 32, 33, 22, 34, 35]. When organizing the inducing points into a Kronecker structure, sparse approximations can further reduce the complexity to nearly $\mathcal{O}(n)$ [20, 36]. In the meantime, by reorganizing the variational lower bound, stochastic optimization is available for sparse approximations with a remarkable complexity of $\mathcal{O}(m^3)$ [2, 37, 38], enabling regression with up to one billion data points [38].

Fig. 2: Comparison of scalable GPs regarding scalability and model capability, where $m \ll n$; $m$ is the inducing size for sparse approximations, and the subset size for subset-of-data and local approximations.

It is notable that, as stated before, we welcome GPs with high scalability but require that they offer favorable predictions, i.e., good model capability. For example, though showing a very low complexity of $\mathcal{O}(m^3)$, we cannot expect the subset-of-data to perform well with increasing $n$. In terms of model capability, global approximations are capable of capturing the global patterns (long-term spatial correlations) but often filter out the local patterns due to the limited inducing set. On the contrary, due to their local nature, local approximations favor the capture of local patterns (non-stationary features), enabling them to possibly outperform the full GP for complicated tasks. The drawback however is that they ignore the global patterns and thus risk discontinuous predictions and local over-fitting. Recently, attempts have been made to improve the model capability through, for example, the inter-domain strategy [39], hierarchical structure [40], and combinations of local approximations or neural networks (NNs) [41, 42, 36], showcasing state-of-the-art performance on some public large-scale datasets [36, 43].

The development and success stories of scalable GPs pose the demand for a comprehensive review including their methodological characteristics and comparisons for better understanding. We use the taxonomy depicted in Fig. 1 to classify, review and analyze state-of-the-art scalable GPs for big data. Particularly, for global approximations, we mainly focus on sparse approximations which, to be clear, are classified into two categories: prior approximations, which modify the prior but perform exact inference, and posterior approximations, which retain the exact prior but perform approximate inference; for local approximations, we highlight the mixture-of-experts and the product-of-experts that conduct model averaging over multiple local experts in different ways to boost predictions. To present a complete review, recent advances for improving the scalability and model capability of global and local approximations are reviewed. Finally, the extensions and open issues regarding scalable GPs for various purposes are reviewed and discussed in order to inspire novel ideas for future research avenues.

The remainder of this article is organized as follows. Section II briefly introduces GP regression from the function-space view and the alternative weight-space view. Then, the two main categories of scalable GPs, global and local approximations, are comprehensively reviewed in Sections III and IV. Moreover, Section V reviews the improvements for scalable GPs in terms of scalability and model capability. Thereafter, Section VI discusses the extensions of scalable GPs in different scenarios to highlight potential research avenues. Finally, Section VII offers concluding remarks.

II Gaussian process regression revisited

GP regression is a non-parametric Bayesian model, which places a GP prior over the latent function $f$ as

$f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right).$ (1)

The mean function $m(\mathbf{x})$ is often taken as zero; the kernel function $k(\mathbf{x}, \mathbf{x}')$ controls the smoothness of the GP and is often taken as the squared exponential (SE) function equipped with automatic relevance determination (ARD)

$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^{\mathsf{T}} \boldsymbol{\Lambda}^{-1} (\mathbf{x} - \mathbf{x}')\right),$ (2)

where $\boldsymbol{\Lambda} = \mathrm{diag}(l_1^2, \cdots, l_d^2)$ comprises the length-scales along the $d$ input dimensions, and $\sigma_f^2$ is the signal variance.

Given the training data $\mathcal{D} = \{\mathbf{X}, \mathbf{y}\}$, where $y_i = f(\mathbf{x}_i) + \epsilon_i$ with the iid noise $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, we obtain the marginal likelihood (model evidence), since any finite set of GP function values follows a multivariate Gaussian distribution,

$p(\mathbf{y}|\boldsymbol{\theta}) = \mathcal{N}\left(\mathbf{y}\,|\,\mathbf{0},\, \mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right),$ (3)

where $\mathbf{K}_{nn} = k(\mathbf{X}, \mathbf{X})$, and $\boldsymbol{\theta}$ comprises the hyperparameters, which can be inferred by maximizing the log marginal likelihood

$\log p(\mathbf{y}|\boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^{\mathsf{T}}\left(\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right)^{-1}\mathbf{y} - \frac{1}{2}\log\left|\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right| - \frac{n}{2}\log 2\pi,$ (4)

which automatically achieves the bias-variance trade-off. For the sake of clarity, the hyperparameters are hereafter omitted from the conditioning of the distributions. It is observed that the computational bottleneck in (4) is the calculation of the linear system $(\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n)^{-1}\mathbf{y}$ and the log determinant $\log|\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n|$. Traditionally, we use the Cholesky decomposition $\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n = \mathbf{L}\mathbf{L}^{\mathsf{T}}$ such that the linear system is solved by two triangular back-substitutions and the log determinant is obtained as $2\sum_{i}\log L_{ii}$, resulting in a time complexity of $\mathcal{O}(n^3)$ [6].

Thereafter, given the training data and the inferred hyperparameters, the predictive distribution $p(f_*|\mathcal{D}, \mathbf{x}_*)$ at a test point $\mathbf{x}_*$ has the mean and variance respectively expressed as

$\mu(\mathbf{x}_*) = \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right)^{-1}\mathbf{y},$ (5a)
$\sigma^2(\mathbf{x}_*) = k_{**} - \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right)^{-1}\mathbf{k}_*,$ (5b)

where $\mathbf{k}_* = k(\mathbf{X}, \mathbf{x}_*)$ and $k_{**} = k(\mathbf{x}_*, \mathbf{x}_*)$. For the noisy output $y_*$, we need to additionally consider the noise such that the predictive variance becomes $\sigma^2(\mathbf{x}_*) + \sigma_\epsilon^2$.
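As a concrete illustration of (4) and (5), the following minimal NumPy sketch (helper names are ours, assuming a zero mean and the SE-ARD kernel in (2)) computes the log marginal likelihood and the predictive moments via the Cholesky decomposition; it is a didactic sketch rather than a reference implementation.

```python
import numpy as np

def se_ard(X1, X2, lengthscales, sigma_f):
    """SE-ARD kernel matrix (2) between two input sets."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_fit_predict(X, y, Xs, lengthscales, sigma_f, sigma_eps):
    """Exact GP: log marginal likelihood (4) and predictions (5) via Cholesky, O(n^3)."""
    n = X.shape[0]
    K = se_ard(X, X, lengthscales, sigma_f) + sigma_eps**2 * np.eye(n)
    L = np.linalg.cholesky(K)                          # the O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_marg = (-0.5 * y @ alpha
                - np.sum(np.log(np.diag(L)))           # = -0.5 * log|K|
                - 0.5 * n * np.log(2 * np.pi))
    Ks = se_ard(X, Xs, lengthscales, sigma_f)          # k_* for every test point
    mu = Ks.T @ alpha                                  # prediction mean (5a)
    V = np.linalg.solve(L, Ks)
    var = sigma_f**2 - np.sum(V**2, axis=0)            # noise-free variance (5b)
    return log_marg, mu, var

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(120)
Xs = np.linspace(-3, 3, 50)[:, None]
print(gp_fit_predict(X, y, Xs, np.array([0.5]), 1.0, 0.1)[0])
```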

Alternatively, we can interpret the GP from the weight-space view as an extension of the Bayesian linear model as

$f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^{\mathsf{T}}\mathbf{w},$ (6)

where the Gaussian prior is placed on the weights as $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_w)$; $\boldsymbol{\phi}(\cdot)$ is a feature mapping which maps the $d$-dimensional input into a $v$-dimensional feature space. Equivalently, we can recognize the kernel function as $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathsf{T}}\boldsymbol{\Sigma}_w\boldsymbol{\phi}(\mathbf{x}')$. Particularly, the SE kernel (2) can be recovered from an infinite number ($v \to \infty$) of Gaussian-shaped basis functions centered everywhere [44].

To improve the scalability of the standard GP for big data, scalable GPs have been extensively developed and studied in recent years. In what follows, we classify current scalable GPs into two main categories, global approximations and local approximations, and comprehensively review them to showcase their methodological characteristics.

III Global approximations

Global approximations achieve the sparsity of the full kernel matrix $\mathbf{K}_{nn}$, which is crucial for scalability, through (i) using a random subset of the training data (subset-of-data); (ii) removing the entries of $\mathbf{K}_{nn}$ with low correlations (sparse kernels); and (iii) employing $m$ inducing points to learn a low-rank representation (sparse approximations). In what follows, we comprehensively review these three kinds of global approximations.

III-A Subset-of-data

Subset-of-data (SoD) is the simplest strategy to approximate the full GP by using a subset of the training data $\mathcal{D}$. SoD retains the standard GP inference but at a lower time complexity of $\mathcal{O}(m^3)$, since it only operates on $m$ ($m \ll n$) data points, thereby approximating $\mathbf{K}_{nn}$ by a smaller $\mathbf{K}_{mm}$. Though SoD could produce an acceptable prediction mean when the data is redundant, it tends to produce overconfident prediction variances due to the limited subset.

Regarding the selection of the subset, one could (i) randomly choose $m$ data points from $\mathcal{D}$, (ii) use clustering techniques, e.g., $k$-means and the KD tree [45], to partition the data into $m$ subsets and choose their centroids as subset points, and (iii) employ active learning criteria, e.g., differential entropy [46], information gain [47] and matching pursuit [48], to sequentially select the data points at however a higher computing cost.
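The first two selection strategies above can be sketched as follows (a hedged example using scikit-learn's KMeans; for the centroid-based variant we keep the training point nearest to each centroid so that the corresponding outputs are available):

```python
import numpy as np
from sklearn.cluster import KMeans

def sod_random(X, y, m, rng):
    """SoD via a random subset: keep m of the n training points."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx], y[idx]

def sod_kmeans(X, y, m):
    """SoD via k-means: keep the training point closest to each of the m centroids."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
    idx = np.array([np.argmin(np.sum((X - c)**2, axis=1)) for c in km.cluster_centers_])
    return X[idx], y[idx]

# the selected subset is then handed to the exact GP routine of Section II,
# e.g. gp_fit_predict(X_sub, y_sub, X_test, ...), at an O(m^3) cost
```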

III-B Sparse kernels

Sparse kernels attempt to directly achieve a sparse representation of $\mathbf{K}_{nn}$ via a particularly designed compactly supported (CS) kernel function, which imposes $k(\mathbf{x}_i, \mathbf{x}_j) = 0$ when the distance $|\mathbf{x}_i - \mathbf{x}_j|$ exceeds a certain threshold. Therefore, only the non-zero elements in $\mathbf{K}_{nn}$ are involved in the calculation, which speeds up the inference. The main challenge in constructing valid CS kernels is to ensure that the kernel matrix remains positive semi-definite (PSD) [49, 17, 50, 51]. By employing a CS kernel, the complexity of GP inference still scales cubically, but with a much smaller coefficient than when using typical globally supported kernels. The GP using a CS kernel has the potential to capture local patterns due to the truncation property.
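As one concrete (hedged) example of a CS kernel, not necessarily one of the constructions in [49, 17, 50, 51], the Wendland function $k(r) = \sigma_f^2 (1-r)_+^4(4r+1)$ with $r = \|\mathbf{x}-\mathbf{x}'\|/\theta$ is PSD for input dimension $d \le 3$ and yields a sparse kernel matrix that sparse solvers can exploit:

```python
import numpy as np
from scipy import sparse

def wendland_cs_kernel(X1, X2, theta, sigma_f=1.0):
    """Compactly supported Wendland kernel, PSD for input dimension d <= 3:
    k(r) = sigma_f^2 * (1 - r)_+^4 * (4r + 1), with r = ||x - x'|| / theta."""
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1) / theta
    k = sigma_f**2 * np.clip(1.0 - r, 0.0, None)**4 * (4.0 * r + 1.0)
    return sparse.csr_matrix(k)   # entries beyond the support radius theta are exactly zero

# toy usage: the resulting kernel matrix is mostly zeros, which sparse solvers exploit
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
K = wendland_cs_kernel(X, X, theta=1.0)
print(f"non-zero fraction: {K.nnz / 500**2:.3f}")
```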

III-C Sparse approximations

Typically, one could conduct an eigen-decomposition of $\mathbf{K}_{nn}$ and keep the first $m$ eigenvalues to approximate the full-rank kernel matrix as $\mathbf{K}_{nn} \approx \mathbf{U}_m\boldsymbol{\Lambda}_m\mathbf{U}_m^{\mathsf{T}}$. Thereafter, it is straightforward to calculate the inversion using the Sherman-Morrison-Woodbury formula

$\left(\mathbf{U}_m\boldsymbol{\Lambda}_m\mathbf{U}_m^{\mathsf{T}} + \sigma_\epsilon^2\mathbf{I}_n\right)^{-1} = \sigma_\epsilon^{-2}\mathbf{I}_n - \sigma_\epsilon^{-2}\mathbf{U}_m\left(\sigma_\epsilon^2\boldsymbol{\Lambda}_m^{-1} + \mathbf{U}_m^{\mathsf{T}}\mathbf{U}_m\right)^{-1}\mathbf{U}_m^{\mathsf{T}},$ (7)

and the determinant using the Sylvester determinant theorem

$\left|\mathbf{U}_m\boldsymbol{\Lambda}_m\mathbf{U}_m^{\mathsf{T}} + \sigma_\epsilon^2\mathbf{I}_n\right| = \left(\sigma_\epsilon^2\right)^{n-m}\left|\boldsymbol{\Lambda}_m\mathbf{U}_m^{\mathsf{T}}\mathbf{U}_m + \sigma_\epsilon^2\mathbf{I}_m\right|,$ (8)

resulting in the time complexity of $\mathcal{O}(nm^2)$. However, the direct eigen-decomposition of $\mathbf{K}_{nn}$ is of limited interest since it is itself an $\mathcal{O}(n^3)$ operation. Therefore, to obtain computational gains, we approximate the eigen-functions of $\mathbf{K}_{nn}$ based on $m$ inducing points, leading to the Nyström approximation

$\mathbf{K}_{nn} \approx \mathbf{Q}_{nn} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn},$ (9)

which greatly improves large-scale kernel learning [52], and enables the naive Nyström GP [53]. This scalable GP however may produce negative prediction variances [54], since (i) it is not a complete generative probabilistic model, as the Nyström approximation is only imposed on the training data, and (ii) it cannot guarantee a PSD kernel matrix.
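In practice, (7)-(9) are used to avoid ever forming or inverting the full $n \times n$ matrix. The hedged NumPy sketch below (helper names are ours) solves the linear system and evaluates the log determinant of the Nyström-approximated covariance in $\mathcal{O}(nm^2)$:

```python
import numpy as np

def nystrom_woodbury_solve(K_nm, K_mm, sigma_eps, y):
    """Solve (Q_nn + s^2 I)^{-1} y and log|Q_nn + s^2 I| in O(n m^2),
    where Q_nn = K_nm K_mm^{-1} K_mn is the Nystrom approximation (9)."""
    n, m = K_nm.shape
    s2 = sigma_eps**2
    jitter = 1e-8 * np.eye(m)
    # inner m x m matrix of the Woodbury identity (7): K_mm + K_mn K_nm / s^2
    A = K_mm + K_nm.T @ K_nm / s2 + jitter
    L = np.linalg.cholesky(A)
    # (Q + s^2 I)^{-1} y = y/s^2 - K_nm A^{-1} K_mn y / s^4
    tmp = np.linalg.solve(L.T, np.linalg.solve(L, K_nm.T @ y))
    alpha = y / s2 - K_nm @ tmp / s2**2
    # Sylvester determinant theorem (8): log|Q + s^2 I| = log|A| - log|K_mm| + n log s^2
    Lm = np.linalg.cholesky(K_mm + jitter)
    logdet = (2 * np.sum(np.log(np.diag(L)))
              - 2 * np.sum(np.log(np.diag(Lm)))
              + n * np.log(s2))
    return alpha, logdet
```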

Inspired by the influential Nyström approximation, sparse approximations seek to incorporate it into GP to produce a unifying and principled generative probabilistic model. Sparse approximations, as the name suggests, achieve the sparsity via $m$ inducing points (also referred to as support points, active set or pseudo points) that optimally summarize the dependency of the whole training data. We introduce a set of inducing points $\mathbf{X}_m$ and the relevant inducing variables $\mathbf{u} = f(\mathbf{X}_m)$. The inducing variables, akin to the latent variables $\mathbf{f} = f(\mathbf{X})$ and $\mathbf{f}_*$, are assumed to be drawn from the same GP, resulting in the prior $p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{0}, \mathbf{K}_{mm})$. Besides, $\mathbf{u}$ is assumed to be a sufficient statistic for $\mathbf{f}$, i.e., for any variable $\mathbf{z}$ it holds that $p(\mathbf{z}|\mathbf{u}, \mathbf{f}) = p(\mathbf{z}|\mathbf{u})$. We recover the joint prior $p(\mathbf{f}, \mathbf{f}_*)$ by marginalizing out $\mathbf{u}$ as $p(\mathbf{f}, \mathbf{f}_*) = \int p(\mathbf{f}, \mathbf{f}_*|\mathbf{u})\,p(\mathbf{u})\,d\mathbf{u}$.

In what follows, depending on whether the joint prior is modified or not, we classify sparse approximations into two main categories: the prior approximations, which approximate the prior but perform exact inference, and the posterior approximations, which retain the exact prior but perform approximate inference.

III-C1 Prior approximations

Prior approximations seek to modify the joint prior, which is the origin of the cubic complexity. They impose the independence assumption $\mathbf{f} \perp \mathbf{f}_*\,|\,\mathbf{u}$ such that (the variables $\mathbf{u}$ are called inducing variables since the dependencies between $\mathbf{f}$ and $\mathbf{f}_*$ are only induced through them)

$p(\mathbf{f}, \mathbf{f}_*) \approx q(\mathbf{f}, \mathbf{f}_*) = \int p(\mathbf{f}|\mathbf{u})\,p(\mathbf{f}_*|\mathbf{u})\,p(\mathbf{u})\,d\mathbf{u},$ (10)

where the training and test conditionals write, given the Nyström notation $\mathbf{Q}_{ab} \triangleq \mathbf{K}_{am}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mb}$,

$p(\mathbf{f}|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathbf{K}_{nn} - \mathbf{Q}_{nn}\right),$ (11a)
$p(\mathbf{f}_*|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}_*\,|\,\mathbf{K}_{*m}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathbf{K}_{**} - \mathbf{Q}_{**}\right).$ (11b)

To obtain computational gains, we have the approximation

$q(\mathbf{f}, \mathbf{f}_*) = \int q(\mathbf{f}|\mathbf{u})\,q(\mathbf{f}_*|\mathbf{u})\,p(\mathbf{u})\,d\mathbf{u},$ (12)

with the training and test conditionals in (11) respectively approximated as

$q(\mathbf{f}|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \tilde{\mathbf{Q}}_{nn}\right),$ (13a)
$q(\mathbf{f}_*|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}_*\,|\,\mathbf{K}_{*m}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \tilde{\mathbf{Q}}_{**}\right).$ (13b)

Then, the log marginal likelihood $\log p(\mathbf{y})$ is approximated by $\log q(\mathbf{y})$ as

$\log q(\mathbf{y}) = \log\mathcal{N}\left(\mathbf{y}\,|\,\mathbf{0},\, \mathbf{Q}_{nn} + \tilde{\mathbf{Q}}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right).$ (14)

It is found that specific (e.g., diagonal or block-diagonal) selections of $\tilde{\mathbf{Q}}_{nn}$ enable the usage of (7) and (8) to calculate the inversion and determinant in (14), which then depend only on the $m \times m$ matrix $\mathbf{K}_{mm}$, resulting in the substantially reduced complexity of $\mathcal{O}(nm^2)$.

Particularly, the subset-of-regressors (SoR) [55, 56, 57] approximation, also called the deterministic inducing conditional (DIC) approximation, imposes both deterministic training and test conditionals, i.e., $\tilde{\mathbf{Q}}_{nn} = \mathbf{0}$ and $\tilde{\mathbf{Q}}_{**} = \mathbf{0}$, as

$q_{\mathrm{SoR}}(\mathbf{f}|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathbf{0}\right),$ (15a)
$q_{\mathrm{SoR}}(\mathbf{f}_*|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}_*\,|\,\mathbf{K}_{*m}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathbf{0}\right).$ (15b)

This is equivalent to applying the Nyström approximation to both the training and test data, resulting in a degenerate GP (degeneracy means that the kernel has a finite number of non-zero eigenvalues) with a rank-$m$ (at most) kernel

$k_{\mathrm{SoR}}(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{X}_m)\,\mathbf{K}_{mm}^{-1}\,k(\mathbf{X}_m, \mathbf{x}').$ (16)

Alternatively, we can interpret the SoR approximation from the weight-space view. It is known that the GP which employs a kernel function with an infinite expansion of the input in the feature space defined by dense basis functions is equivalent to a Bayesian linear model in (6) with infinitely many weights. To obtain computational gains, the relevance vector machine (RVM) [58] uses only $m$ basis functions for approximation as

$f(\mathbf{x}) = \boldsymbol{\phi}_m(\mathbf{x})^{\mathsf{T}}\mathbf{w},$ (17a)
$p(\mathbf{w}) = \mathcal{N}\left(\mathbf{w}\,|\,\mathbf{0}, \boldsymbol{\Sigma}_w\right),$ (17b)

where $\boldsymbol{\phi}_m(\mathbf{x}) = [\phi_1(\mathbf{x}), \cdots, \phi_m(\mathbf{x})]^{\mathsf{T}}$. As a consequence, from the function-space view, the RVM approximation is a GP with the kernel function

$k_{\mathrm{RVM}}(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}_m(\mathbf{x})^{\mathsf{T}}\boldsymbol{\Sigma}_w\boldsymbol{\phi}_m(\mathbf{x}'),$ (18)

which recovers $k_{\mathrm{SoR}}$ when $\boldsymbol{\phi}_m(\mathbf{x}) = k(\mathbf{X}_m, \mathbf{x})$ and $\boldsymbol{\Sigma}_w = \mathbf{K}_{mm}^{-1}$ [38]. The choice of $\boldsymbol{\Sigma}_w$ should produce a PSD kernel matrix. Alternatively, we can take the simple form in which $\boldsymbol{\Sigma}_w$ is a diagonal matrix [58, 59]; or we can build $\boldsymbol{\Sigma}_w$ from the eigenvalue and eigenvector matrices of $\mathbf{K}_{mm}$ [60], leading to a scaled Nyström approximation. However, as depicted in Fig. 3, the SoR approximation and the RVM-type models [58, 61, 60, 59] impose too restrictive assumptions on the training and test data such that they produce overconfident prediction variances when leaving the training data. This can be clearly seen in (16) and (18): (i) the kernels are degenerate and have the odd property of depending on the basis/inducing inputs; and (ii) they only have $m$ degrees of freedom.

To reverse the uncertainty behavior of SoR, the RVM is healed by augmenting the basis functions at the test point, at however a higher computing cost [62]. This augmentation, which includes the test latent variable into the inducing set, was also studied in [18]. Alternatively, the sparse spectrum GP (SSGP) [63] and its variational variants [64, 65, 66] elegantly address this issue by reconstructing the Bayesian linear model from a spectral representation (Fourier features), resulting in the stationary kernel

$k_{\mathrm{SSGP}}(\mathbf{x}, \mathbf{x}') = \frac{\sigma_f^2}{m}\sum_{i=1}^{m}\cos\left(2\pi\mathbf{s}_i^{\mathsf{T}}(\mathbf{x} - \mathbf{x}')\right),$ (19)

where the $d$-dimensional vector $\mathbf{s}_i$ represents the spectral frequencies. Another way, which will be reviewed below, is to advance the prior approximations such that they retain the computational gains but produce sensible uncertainty.

For instance, the deterministic training conditional (DTC) approximation [67, 68] imposes the deterministic training conditional but retains the exact test conditional as

$q_{\mathrm{DTC}}(\mathbf{f}|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathbf{0}\right),$ (20a)
$q_{\mathrm{DTC}}(\mathbf{f}_*|\mathbf{u}) = p(\mathbf{f}_*|\mathbf{u}).$ (20b)

As a result, the prediction mean is the same as that of SoR, but the prediction variance is always larger than that of SoR, and grows to the prior variance when leaving the inducing points, see Fig. 3. Notably, due to the inconsistent conditionals in (20), the DTC approximation is not an exact GP. Besides, DTC and SoR often do not perform well due to the restrictive prior assumption $\tilde{\mathbf{Q}}_{nn} = \mathbf{0}$.

Hence, the fully independent training conditional (FITC) [69] imposes a more informative full independence assumption to remove the dependency among the latent variables $\mathbf{f}$ such that, given $\mathbf{u}$,

$q_{\mathrm{FITC}}(\mathbf{f}|\mathbf{u}) = \prod_{i=1}^{n}p(f_i|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathrm{diag}[\mathbf{K}_{nn} - \mathbf{Q}_{nn}]\right),$ (21)

whereas the test conditional remains exact. It is found that the marginal variances in (21) are identical to those of the exact conditional (11a) due to the diagonal correction term $\mathrm{diag}[\mathbf{K}_{nn} - \mathbf{Q}_{nn}]$. Hence, compared to SoR and DTC, which throw away the uncertainty in (15) and (20), FITC partially retains it, leading to a closer approximation of the exact prior. Moreover, the full independence assumption can be extended to the test conditional to derive the fully independent conditional (FIC) approximation [18], which stands as a non-degenerate GP with the kernel $k_{\mathrm{FIC}}(\mathbf{x}, \mathbf{x}') = k_{\mathrm{SoR}}(\mathbf{x}, \mathbf{x}') + \delta_{\mathbf{x}\mathbf{x}'}\left[k(\mathbf{x}, \mathbf{x}') - k_{\mathrm{SoR}}(\mathbf{x}, \mathbf{x}')\right]$

where $\delta_{\mathbf{x}\mathbf{x}'}$ is the Kronecker delta. Now the FIC kernel has a constant prior variance but is not stationary. Note that the predictive distributions of FITC and FIC only differ when predicting multiple test points simultaneously. Alternatively, the approximation in (21) can be derived by minimizing the Kullback-Leibler (KL) divergence between the exact joint prior and the prior approximation [70]. Moreover, the FIT(C) approximation has been generalized to non-Gaussian likelihoods using expectation propagation (EP) [71, 72].

Particularly, the diagonal correction in (21) brings interesting properties for FITC, with the prediction mean and variance given as

$\mu(\mathbf{x}_*) = \mathbf{k}_{*m}\boldsymbol{\Sigma}\,\mathbf{K}_{mn}\mathbf{D}^{-1}\mathbf{y},$ (22a)
$\sigma^2(\mathbf{x}_*) = k_{**} - \mathbf{k}_{*m}\left(\mathbf{K}_{mm}^{-1} - \boldsymbol{\Sigma}\right)\mathbf{k}_{m*},$ (22b)

where $\mathbf{D} = \mathrm{diag}[\mathbf{K}_{nn} - \mathbf{Q}_{nn}] + \sigma_\epsilon^2\mathbf{I}_n$ and $\boldsymbol{\Sigma} = \left(\mathbf{K}_{mm} + \mathbf{K}_{mn}\mathbf{D}^{-1}\mathbf{K}_{nm}\right)^{-1}$. It is found that $\mathrm{diag}[\mathbf{K}_{nn} - \mathbf{Q}_{nn}]$ represents the posterior variances of the latent variables $\mathbf{f}$ given the inducing variables $\mathbf{u}$. Hence, these varying variances, which are exactly zero at the inducing points, enable FITC to capture noise heteroscedasticity, see Fig. 3, at the cost of (i) producing an invalid estimate (nearly zero) of the noise variance $\sigma_\epsilon^2$, and (ii) sacrificing the accuracy of the prediction mean [73].
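A hedged NumPy sketch of the FITC predictions (22), assuming a Gaussian likelihood and precomputed kernel blocks (the helper signature is ours, not that of a particular library):

```python
import numpy as np

def fitc_predict(K_nm, K_mm, K_sm, k_nn_diag, k_ss_diag, y, sigma_eps):
    """FITC predictions (22): K_nm = k(X, Xm), K_sm = k(X_*, Xm),
    k_nn_diag = diag k(X, X), k_ss_diag = diag k(X_*, X_*)."""
    m = K_mm.shape[0]
    Lm = np.linalg.cholesky(K_mm + 1e-8 * np.eye(m))
    V = np.linalg.solve(Lm, K_nm.T)                      # V^T V = Q_nn
    q_nn_diag = np.sum(V**2, axis=0)
    D = k_nn_diag - q_nn_diag + sigma_eps**2             # heteroscedastic "noise" vector
    A = K_mm + (K_nm / D[:, None]).T @ K_nm              # Sigma = A^{-1}
    La = np.linalg.cholesky(A + 1e-8 * np.eye(m))
    b = np.linalg.solve(La.T, np.linalg.solve(La, K_nm.T @ (y / D)))
    mu = K_sm @ b                                        # prediction mean (22a)
    t1 = np.linalg.solve(Lm, K_sm.T)                     # for K_sm K_mm^{-1} K_ms
    t2 = np.linalg.solve(La, K_sm.T)                     # for K_sm Sigma K_ms
    var = k_ss_diag - np.sum(t1**2, axis=0) + np.sum(t2**2, axis=0)   # (22b)
    return mu, var
```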

To improve FITC, the partially independent training conditional (PITC) [18] has

$q_{\mathrm{PITC}}(\mathbf{f}|\mathbf{u}) = \mathcal{N}\left(\mathbf{f}\,|\,\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\, \mathrm{blkdiag}[\mathbf{K}_{nn} - \mathbf{Q}_{nn}]\right).$ (23)

This equates to partitioning the training data into $M$ independent subsets (blocks) and taking into account the joint distribution of the latent variables within each subset. But it is argued that, though being a closer approximation to the exact training conditional, the blocking brings little improvement over FITC [41].

So far, we have reviewed state-of-the-art prior approximations. Regarding their implementation, the choice of inducing points is crucial. Similar to SoD, we could use clustering techniques to select a finite set of space-filling inducing points from $\mathbf{X}$, or we could employ some querying criteria [74, 57, 68, 48, 75] to choose informative inducing points at a higher computing cost. More flexibly, the inducing points can be regarded as pseudo-points, so that the optimization of their locations together with the hyperparameters under the same objective is available [69]. Furthermore, a recent work [76] shows the first attempt to simultaneously determine the number and locations of inducing points in the Bayesian framework by placing a prior on the inducing inputs. Finally, note that the optimization over inducing points, which additionally introduces $m \times d$ parameters, turns the inference into an intractable high-dimensional optimization task. Besides, with increasing $m$, the benefits brought by the optimization over a simple selection from the training data vanish. That means, in the case with many inducing points, we could simply use, for example, clustering techniques to query the inducing set.

Finally, the heteroscedasticity of FITC reveals that this approximation attempts to achieve a desirable predictive accuracy at low computing cost, rather than faithfully recovering the standard GP with increasing $m$. Indeed, if we have $m = n$ inducing points and place them on the training points, the prior approximations recover the full GP. But this configuration is not the global optimum when maximizing the approximate marginal likelihood (14) of the prior approximations, which makes them philosophically troubling. Besides, learning inducing points via the optimization of (14) may produce poor predictions, see the discussions of DTC in [19]. These issues are addressed by the posterior approximations reviewed in the next section, which retain the exact prior but perform approximate inference.

Fig. 3: Illustration of sparse approximations on a one-dimensional toy example. In each panel, the symbols represent the 120 training points; the top circles represent the initial locations of the inducing points, whereas the bottom triangles represent their optimized locations; the dotted green curves represent the prediction mean of the full GP; the green curves represent the 95% confidence interval of the full GP predictions; the red curves represent the prediction mean of the sparse approximations; the shaded regions represent the 95% confidence interval of the predictions of the sparse approximations. For SKI, the positions of the inducing points are not optimized.

III-C2 Posterior approximations

The most well-known posterior approximation is the elegant variational free energy (VFE) presented by Titsias [19] via variational inference [77]. Instead of modifying the prior, VFE directly approximates the posterior $p(\mathbf{f}, \mathbf{u}|\mathbf{y})$, the learning of which is a central task in statistical models, by introducing a variational distribution $q(\mathbf{f}, \mathbf{u})$. Then, we push the approximation towards the exact posterior by minimizing their KL divergence

$\mathrm{KL}\left(q(\mathbf{f}, \mathbf{u})\,\|\,p(\mathbf{f}, \mathbf{u}|\mathbf{y})\right) = \log p(\mathbf{y}) - \mathbb{E}_{q(\mathbf{f},\mathbf{u})}\left[\log\frac{p(\mathbf{y}, \mathbf{f}, \mathbf{u})}{q(\mathbf{f}, \mathbf{u})}\right] = \log p(\mathbf{y}) - F_q,$ (24)

where $\mathbb{E}_{q(\mathbf{f},\mathbf{u})}[\cdot]$ represents the expectation over the distribution $q(\mathbf{f}, \mathbf{u})$. (Matthews et al. [78] further extended the procedure to infinite index sets using the KL divergence between stochastic processes, such that the posterior is approximated over the entire process $f$.) It is found that minimizing the rigorously defined KL divergence is equivalent to maximizing $F_q$, since $\log p(\mathbf{y})$ is constant w.r.t. $q$. Thus, $F_q$ is called the evidence lower bound or variational free energy, and it permits us to jointly optimize the variational parameters and the hyperparameters. The variational distribution, akin to the exact posterior, can be factorized as $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})\,q(\mathbf{u})$. Notably, the inducing positions are regarded as variational parameters in $q$ rather than model parameters. It is observed that maximizing the bound w.r.t. the hyperparameters directly improves $F_q$, while maximizing $F_q$ w.r.t. the variational parameters implicitly drives the approximation to match both the posterior $p(\mathbf{f}, \mathbf{u}|\mathbf{y})$ and the marginal likelihood $p(\mathbf{y})$.

To obtain a tighter bound, the calculus of variations finds the optimal variational distribution $q^*(\mathbf{u})$ that removes the dependency of $F_q$ on $q(\mathbf{u})$ by setting the relevant derivative to zero, leading to

$F_q = \log\mathcal{N}\left(\mathbf{y}\,|\,\mathbf{0},\, \mathbf{Q}_{nn} + \sigma_\epsilon^2\mathbf{I}_n\right) - \frac{1}{2\sigma_\epsilon^2}\mathrm{tr}\left(\mathbf{K}_{nn} - \mathbf{Q}_{nn}\right).$ (25)

The bound (25) is a “collapsed” version of $F_q$ since it analytically collapses the variational distribution $q(\mathbf{u})$. Note that this bound differs from the DTC marginal likelihood only by the trace term, which however substantially improves the inference of VFE. The trace $\mathrm{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$ represents the total variance of predicting the latent variables $\mathbf{f}$ given the inducing variables $\mathbf{u}$. In order to maximize the bound, we need to decrease the trace. Particularly, a zero trace means that $\mathbf{u}$ carries all the information in $\mathbf{f}$ and we reproduce the full GP. Hence, the additional trace term brings about the following benefits: (i) it is a regularizer and guards against over-fitting; (ii) it seeks to deliver a good inducing set; and (iii) the bound always improves with increasing $m$, see the theoretical analysis in [73, 78]. The third property implies that, given enough resources, VFE will recover the full GP, see Fig. 3. On the contrary, without this trace term, the DTC approximation often risks over-fitting [79].
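A hedged NumPy sketch of evaluating the collapsed bound (25) in $\mathcal{O}(nm^2)$ via the identities (7) and (8) (function and variable names are ours):

```python
import numpy as np

def vfe_collapsed_bound(K_nm, K_mm, k_nn_diag, y, sigma_eps):
    """Titsias's collapsed bound (25): log N(y|0, Q_nn + s^2 I) - tr(K_nn - Q_nn)/(2 s^2)."""
    n, m = K_nm.shape
    s2 = sigma_eps**2
    Lm = np.linalg.cholesky(K_mm + 1e-8 * np.eye(m))
    V = np.linalg.solve(Lm, K_nm.T)                    # V^T V = Q_nn
    A = s2 * np.eye(m) + V @ V.T
    La = np.linalg.cholesky(A)
    b = np.linalg.solve(La, V @ y)
    # log N(y | 0, Q_nn + s^2 I) via the Woodbury / Sylvester identities (7)-(8)
    quad = (y @ y - b @ b) / s2
    logdet = 2 * np.sum(np.log(np.diag(La))) + (n - m) * np.log(s2)
    log_gauss = -0.5 * (quad + logdet + n * np.log(2 * np.pi))
    trace_term = (np.sum(k_nn_diag) - np.sum(V**2)) / (2 * s2)
    return log_gauss - trace_term
```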

Regarding improvements of VFE, it was extended to continuous and discrete inputs through an efficient QR factorization-based optimization over both inducing points and hyperparameters [80]. Another recent improvement over VFE seeks to boost the estimation of inducing points in an augmented feature space [81], which is similar to the inter-domain strategy [39]. The authors argued that the similarity of inducing points measured in the Euclidean space is inconsistent with that measured by the kernel function in GP. Hence, they assigned a mixture prior to the inducing points in the latent feature space, and derived a regularized variational lower bound wherein the additional regularization term helps choose good inducing points in the kernel space. Besides, Matthews et al. [78, 82] bridged the gap between the variational inducing-points framework and the more general KL divergence between stochastic processes, thus enhancing various applications. More interestingly, based on [78], Bui et al. [83] attempted to approximate the general, infinite joint prior, which comprises two inferential objects of interest: the posterior distribution and the marginal likelihood. As a consequence, minimizing their KL divergence encourages a direct approximation to both the posterior and the marginal likelihood. Based on this new interpretation and power EP (PEP), the FITC and VFE approximations are interpreted jointly as

(26)

where the approximate marginal likelihood takes a form akin to (14). By varying the parameter $\alpha$, we recover FITC when $\alpha = 1$ and VFE when $\alpha \to 0$. Besides, a hybrid approximation of VFE and FITC using a moderate $\alpha$, e.g., $\alpha = 0.5$, often produces better predictions.

To further improve the scalability of VFE, unlike the collapsed bound (25), Hensman et al. [2] retained an explicit global variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{m}, \mathbf{S})$ in the bound to obtain a factorization over data points as

$F = \sum_{i=1}^{n}\mathbb{E}_{q(f_i)}\left[\log p(y_i|f_i)\right] - \mathrm{KL}\left(q(\mathbf{u})\,\|\,p(\mathbf{u})\right),$ (27)

where $q(f_i) = \int p(f_i|\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$. The first term on the right-hand side of (27) is a sum of $n$ terms due to the iid observation noise. Hence, stochastic gradient descent (SGD) [84], which enables large-scale learning, may be employed to obtain an unbiased estimate of $F$ using a mini-batch $\mathcal{B} \subset \{1, \cdots, n\}$ as

$F \approx \frac{n}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbb{E}_{q(f_i)}\left[\log p(y_i|f_i)\right] - \mathrm{KL}\left(q(\mathbf{u})\,\|\,p(\mathbf{u})\right).$ (28)

Due to the difficulty of optimizing the variational parameters $\mathbf{m}$ and $\mathbf{S}$ in the Euclidean space, one can employ the recently proposed Stochastic Variational Inference (SVI) [85], wherein the variational parameters are inferred in the natural-gradient space via SGD (the superiority of natural gradients is verified in [86]), resulting in a greatly reduced complexity of $\mathcal{O}(m^3)$ when $m \ll n$, and, more interestingly, a parallel and online learning fashion. Therefore, a crucial property of the stochastic variational GP (SVGP) is that it can train a sparse GP at any time with a small and random subset of the training data in each iteration [37]. Another interesting property is that taking a unit step in the natural gradient direction is equivalent to performing an update in the Variational Bayes Expectation Maximization (VB-EM) framework [87]. Though showing high scalability and desirable approximation, SVGP has some drawbacks: (i) the bound is less tight than Titsias's bound because $q(\mathbf{u})$ is not optimally eliminated; (ii) SVGP needs to optimize over a large number ($\mathcal{O}(m^2)$) of variational parameters, thus requiring much time to complete one epoch of training; and (iii) the introduction of SVI brings the empirical requirement of carefully tuning the parameters of SGD.
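A hedged NumPy sketch of the mini-batch estimate (28) for a Gaussian likelihood, with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$; `kernel` is any callable $k(\mathbf{X}_1, \mathbf{X}_2)$ such as a wrapped version of the SE-ARD function above, and the SGD/natural-gradient updates of `m_u`, `S_u` and the hyperparameters are omitted:

```python
import numpy as np

def svgp_minibatch_elbo(X_batch, y_batch, n_total, Xm, m_u, S_u, kernel, sigma_eps):
    """Unbiased mini-batch estimate (28) of Hensman's bound (27), Gaussian likelihood,
    with q(u) = N(m_u, S_u) and inducing inputs Xm."""
    s2 = sigma_eps**2
    m = Xm.shape[0]
    K_mm = kernel(Xm, Xm) + 1e-8 * np.eye(m)
    K_bm = kernel(X_batch, Xm)
    Lm = np.linalg.cholesky(K_mm)
    A = np.linalg.solve(Lm.T, np.linalg.solve(Lm, K_bm.T)).T     # K_bm K_mm^{-1}
    mu_f = A @ m_u                                               # mean of q(f_i)
    k_bb_diag = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X_batch])
    var_f = k_bb_diag - np.sum(A * K_bm, axis=1) + np.sum((A @ S_u) * A, axis=1)
    # E_q[log N(y_i | f_i, s2)] = log N(y_i | mu_i, s2) - var_i / (2 s2)
    ell = (-0.5 * np.log(2 * np.pi * s2)
           - 0.5 * (y_batch - mu_f)**2 / s2
           - 0.5 * var_f / s2)
    # KL(q(u) || p(u)) between the two Gaussians
    Ls = np.linalg.cholesky(S_u)
    alpha = np.linalg.solve(Lm, m_u)
    trace = np.sum(np.linalg.solve(Lm, Ls)**2)
    kl = 0.5 * (trace + alpha @ alpha - m
                + 2 * np.sum(np.log(np.diag(Lm))) - 2 * np.sum(np.log(np.diag(Ls))))
    return n_total / X_batch.shape[0] * np.sum(ell) - kl
```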

Inspired by the idea of Hensman, Peng et al. [38] derived a similar factorized variational bound for GPs by taking the weight-space augmentation in (17). The weight-space view (i) allows the use of flexible basis functions in order to incorporate various low-rank structures, and (ii) provides a composite non-convex bound enabling a speedup using an asynchronous proximal gradient-based algorithm [88]. By deploying the variational model in the distributed machine learning platform PARAMETERSERVER [89], the authors first scaled GPs up to billions of data points. Similarly, Cheng and Boots [90] also derived a stochastic variational framework from the weight-space view, with the difference being that the mean and variance of the variational distribution use decoupled sets of basis functions, leading to more flexible inference. Inevitably, the weight-space sparse approximations suffer from some limitations, e.g., poor predictions and overconfident prediction variances, as previously discussed. Finally, a recent interesting work [37] presents a novel unifying, anytime variational framework akin to Hensman's for accommodating existing sparse approximations, e.g., SoR, DTC, FIT(C) and PIT(C), such that they can be trained via the efficient SGD, which achieves asymptotic convergence to the predictive distribution of the chosen sparse model. The key is to conduct a reverse variational inference, wherein “reverse” means that we can find a prior (not the conventional GP prior) such that the variational distribution for FIT(C) and PIT(C) is the maximizer of the variational lower bound. (For VFE and its stochastic variant SVGP, normally, we pre-define the prior and then derive an optimal $q(\mathbf{u})$ to maximize the variational lower bound.) Finally, the scalability of Hensman's model can be further reduced by introducing Kronecker structures for the inducing points and the variance of $q(\mathbf{u})$ [91, 92].

The applicability of Titsias's and Hensman's models has been further improved by using, e.g., (i) a Bayesian treatment of hyperparameters [93, 94, 95, 66] rather than traditional point estimation, which risks over-fitting especially when the training data is limited; and (ii) non-Gaussian likelihoods [94, 96, 97, 98] for regression and classification.

III-C3 Structured sparse approximations

A direct speedup to solving the linear system $(\mathbf{K}_{nn} + \sigma_\epsilon^2\mathbf{I}_n)^{-1}\mathbf{y}$ in the standard GP can be achieved through fast matrix-vector multiplication (MVM) [99, 100], which iteratively solves the linear system using conjugate gradients (CG) [101] with $j$ ($j \ll n$) iterations (the solution is the unique minimizer of a convex quadratic function), resulting in a time complexity of $\mathcal{O}(jn^2)$. It was argued by [16] that the direct implementation of MVMs has some open questions, e.g., the determination of $j$, the lack of meaningful speedups, and the handling of badly conditioned kernel matrices. Alternatively, the pre-conditioned CG (PCG) [102] employs a pre-conditioning matrix, obtained through for example the Nyström approximation, to improve the conditioning of the kernel matrix and accelerate the convergence of CG. Moreover, when the kernel matrix itself has some kind of structure or sparsity, an MVM that is capable of exploiting the algebraic structure in $\mathbf{K}_{nn}$ can provide massive scalability.

One representative is the Kronecker methods [103, 104], which exploit multi-variate grid inputs and a tensor product kernel of the form $k(\mathbf{x}, \mathbf{x}') = \prod_{j=1}^{d}k^j(x^j, x'^j)$; popular kernels, for example the SE kernel (2), fulfill this product structure. Then, the kernel matrix decomposes into a Kronecker product $\mathbf{K}_{nn} = \mathbf{K}^1 \otimes \cdots \otimes \mathbf{K}^d$, which eases the eigen-decomposition with a greatly reduced time complexity (for a detailed introduction of GPs with multiplicative kernels, please refer to section 2.2 of [104]). Another one is the Toeplitz methods [105]—complementary to the Kronecker methods—which exploit the Toeplitz structure of the kernel matrix constructed from regularly spaced one-dimensional points, resulting in a time complexity of $\mathcal{O}(n\log n)$ per MVM. The severe limitation of the Kronecker and Toeplitz methods is that they require grid inputs, preventing them from being applied to general, arbitrarily distributed data points (this limitation was relaxed in the partial grid and variable noise scenario by introducing virtual observations and inputs [106, 104]).

To handle large-scale arbitrary data while retaining the efficient Kronecker structure, the structured kernel interpolation (SKI) [20] imposes the grid constraint on the inducing points rather than the training points. Hence, the inducing kernel matrix $\mathbf{K}_{mm}$ admits the Kronecker structure for $d > 1$ and the Toeplitz structure for $d = 1$, whereas the cross kernel matrix $\mathbf{K}_{nm}$ is approximated, for example, by a local linear interpolation using adjacent grid inducing points as

$k(\mathbf{x}_i, \mathbf{u}_j) \approx w_i\,k(\mathbf{u}_a, \mathbf{u}_j) + (1 - w_i)\,k(\mathbf{u}_b, \mathbf{u}_j),$ (29)

where $\mathbf{u}_a$ and $\mathbf{u}_b$ are the two grid inducing points that most closely bound $\mathbf{x}_i$, and $w_i$ is the interpolation weight. Inserting the resulting approximation $\mathbf{K}_{nm} \approx \mathbf{W}\mathbf{K}_{mm}$ back into the Nyström approximation (9), we have

$\mathbf{K}_{nn} \approx \mathbf{W}\mathbf{K}_{mm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mm}\mathbf{W}^{\mathsf{T}} = \mathbf{W}\mathbf{K}_{mm}\mathbf{W}^{\mathsf{T}},$ (30)

where the weight matrix $\mathbf{W}$ is extremely sparse since it only has two non-zero entries per row for local linear interpolation, leading to an impressive time complexity of $\mathcal{O}(n + m\log m)$ per MVM for solving the linear system with iterative methods, which permits the use of very large $m$.
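A hedged one-dimensional sketch of the SKI construction (29)-(30): a sparse interpolation matrix $\mathbf{W}$ with two non-zeros per row, and a fast MVM with $\mathbf{W}\mathbf{K}_{mm}\mathbf{W}^{\mathsf{T}}$ (for simplicity the grid kernel product is done densely here, whereas SKI would exploit its Toeplitz/Kronecker structure):

```python
import numpy as np
from scipy import sparse

def ski_interp_weights(x, grid):
    """Sparse local linear interpolation weights W (two non-zeros per row), 1-D case."""
    idx = np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2)   # left grid neighbour
    w_right = (x - grid[idx]) / (grid[idx + 1] - grid[idx])
    rows = np.repeat(np.arange(len(x)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1.0 - w_right, w_right], axis=1).ravel()
    return sparse.csr_matrix((vals, (rows, cols)), shape=(len(x), len(grid)))

def ski_mvm(W, K_uu, v):
    """Fast MVM with K_SKI = W K_uu W^T (dense grid product here for simplicity)."""
    return W @ (K_uu @ (W.T @ v))

# toy usage: a regular 1-D grid of inducing points
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 2000))
grid = np.linspace(0, 10, 100)
K_uu = np.exp(-0.5 * (grid[:, None] - grid[None, :])**2 / 0.5**2)   # SE kernel on the grid
W = ski_interp_weights(x, grid)
print(ski_mvm(W, K_uu, rng.standard_normal(2000)).shape)
```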

The original SKI approximation has two main drawbacks. First, the number of grid inducing points grows exponentially with the dimensionality $d$, making SKI impractical for high-dimensional tasks. To address this issue, one could use dimensionality reduction or manifold learning to map the inducing points into a low-dimensional ($p \ll d$) latent space [107]; or, more interestingly, one can use the hierarchical structure of deep neural networks to extract a latent low-dimensional feature space [36, 108]. Furthermore, continual efforts [109, 92, 110] have been made to directly reduce the time complexity to be linear with $d$ by exploiting the row-partitioned Khatri-Rao structure [111] of $\mathbf{W}$, or by imposing a tensor train decomposition [112] and Kronecker products on the mean and variance of $q(\mathbf{u})$ in Hensman's variational framework. This linear complexity permits the use of numerous inducing points.

Second, the SKI approximation may produce discontinuous predictions due to the local weight interpolation, and it provides overconfident prediction variances when leaving the training data due to the restrictive SoR framework, as shown in Fig. 3. To smooth the predictions, Evans and Nair [109] directly exploited the row-partitioned Khatri-Rao structure of $\mathbf{K}_{nm}$ rather than using local weight interpolation. To obtain sensible uncertainty, a diagonal correction term akin to that of FITC has been considered [109, 113].

Finally, note that the ability to use many inducing points is expected to improve the model capability. But due to the grid structure, the structured sparse approximations (i) use fixed inducing points, (ii) resort to dimensionality reduction for tackling high-dimensional tasks, and (iii) place the vast majority of inducing points on the domain boundary with increasing $d$, which in turn may degrade the model capability.

IV Local approximations

Inspired by D&C, local approximations rely on multiple localized experts to achieve the scalability for handling big data. In comparison to global approximations, the local nature enables (i) capturing non-stationary features and (ii) distributing computational loads. In what follows, we comprehensively review the naive-local-experts which directly employs the pure local experts for prediction, and the mixture-of-experts and product-of-experts which inherit the advantages of naive-local-experts but boost the predictions through model averaging.

IV-A Naive-local-experts

It is known that a pair of points far away from each other is almost uncorrelated. Hence, we can use localized experts trained on subsets of $\mathcal{D}$ to produce sensible predictions with low computational complexity. Particularly, the simple naive-local-experts (NLE) approach makes the local expert $\mathcal{M}_i$ completely responsible for the subregion $\Omega_i$ defined by the subset $\mathcal{D}_i$, since $\mathcal{M}_i$ is trained on $\mathcal{D}_i$ and thus produces the most confident predictions in $\Omega_i$. Mathematically, we predict at $\mathbf{x}_* \in \Omega_i$ as

$p(y_*|\mathcal{D}, \mathbf{x}_*) \approx p_i(y_*|\mathcal{D}_i, \mathbf{x}_*), \quad \mathbf{x}_* \in \Omega_i.$ (31)

According to the partition of the data, we classify NLE into two main categories: (i) inductive NLE, which first partitions the input space and trains all the experts, and then chooses the appropriate expert for predicting at $\mathbf{x}_*$; and (ii) transductive NLE, which particularly chooses a neighborhood subset around $\mathbf{x}_*$ and trains the relevant expert for predicting at $\mathbf{x}_*$.

Inductive NLE employs a static partition of the whole data by using clustering techniques, e.g., Voronoi tessellations [114] and trees [21, 115, 116], and trains independent local GP experts in those subregions, resulting in an $\mathcal{O}(nm_0^2)$ complexity, where $m_0$ is the training size of each expert. Note that the partition and the experts are usually learned separately; or they can be learned jointly with a Bayesian treatment [117]. On the contrary, transductive NLE, e.g., the nearest-neighbors (NeNe) approach [118], which can induce a valid stochastic process [119, 120], employs a dynamic partition to choose $m_0$ neighbor points around $\mathbf{x}_*$, resulting in a complexity that scales with the number of test points. A key issue in transductive NLE is the definition of the neighborhood set around $\mathbf{x}_*$. The simplest way is to use geometric closeness criteria to choose the neighbor points (the selected points should be close to $\mathbf{x}_*$; meanwhile, they should be distributed uniformly to avoid conveying redundant information), which however is not optimal since it does not consider the spatial correlation of the data [121]. Hence, some GP-based active learning methods have been employed to sequentially update the neighborhood set [122, 123, 22, 34].
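A hedged sketch of the transductive NeNe strategy based on geometric closeness, reusing the exact-GP helper `gp_fit_predict` from the Section II sketch (in practice the hyperparameters would be learned per expert or shared, and active-learning criteria could replace the plain nearest-neighbor selection):

```python
import numpy as np

def nene_predict(X, y, Xs, m0, gp_fit_predict, **gp_kwargs):
    """Transductive NLE (nearest neighbours): for each test point, train an exact GP
    on its m0 nearest training points only, at O(m0^3) cost per test point."""
    mu, var = np.empty(len(Xs)), np.empty(len(Xs))
    for t, xs in enumerate(Xs):
        idx = np.argsort(np.sum((X - xs)**2, axis=1))[:m0]   # geometric closeness
        _, mu_t, var_t = gp_fit_predict(X[idx], y[idx], xs[None, :], **gp_kwargs)
        mu[t], var[t] = mu_t[0], var_t[0]
    return mu, var

# usage with the Section II sketch:
# nene_predict(X, y, Xs, 50, gp_fit_predict,
#              lengthscales=np.array([0.5]), sigma_f=1.0, sigma_eps=0.1)
```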

Whilst enjoying the capability of capturing non-stationary features due to the localized structure, NLE (i) produces discontinuous predictions on the boundaries of the subregions and (ii) suffers from poor generalization capability since it misses the long-term spatial correlations of the data, as depicted in Fig. 4. To address the discontinuity issue, the patched GPs [124, 125, 126] have been developed, which impose continuity conditions such that two adjacent local GPs are patched to share nearly identical predictions on the boundary. But the patched GPs suffer from inconsistent and even negative prediction variances and are only applicable in low-dimensional spaces [76, 126]. More popularly, the model averaging strategy, which is accomplished by the mixture/product of local GP experts elaborated below, smooths the predictions from multiple experts. To address the generalization issue, it is possible to (i) share the hyperparameters across experts, as in [28], or (ii) combine local approximations with global approximations, which will be reviewed in Section V-B.

Fig. 4: Illustration of local approximations using six individual experts on the toy example. Note that for MoE, we did not jointly learn the experts and the gating functions. For simplicity, we use the individual experts and a differential-entropy-based softmax gating function.

IV-B Mixture-of-experts

The mixture-of-experts (MoE) [23, 24] is devoted to combining the local and diverse behaviors of multiple experts, each owning individual hyperparameters, to improve the overall accuracy and reliability [127]. (This topic has been studied in a broader regime, called ensemble learning [128, 129], for aggregating various learning models to boost predictions.) As shown in Fig. 5, MoE generally expresses the combination as a Gaussian mixture model (GMM) [130]

$p(y|\mathbf{x}) = \sum_{i=1}^{M} g_i(\mathbf{x})\,p_i(y|\mathbf{x}),$ (32)

where $g_i(\mathbf{x})$ is the gating function, which usually takes a parametric form like the softmax or probit function [130, 131], and can be thought of as the probability that the expert indicator $z$ equals $i$, i.e., that $\mathbf{x}$ is assigned to expert $\mathcal{M}_i$; $p_i(y|\mathbf{x})$ comes from expert $\mathcal{M}_i$ and is responsible for component $i$ of the mixture. In (32), the gates manage the mixture through a probabilistic partition of the input space, defining the subregions that the individual experts are responsible for. The experts can be a variety of machine learning models, e.g., linear models and support vector machines [130, 132, 133].

Fig. 5: Illustration of mixture-of-experts.

The training of MoE usually assumes that the data is iid, such that the factorized log likelihood is maximized to learn the gating functions and the experts simultaneously by gradient-based methods [134] or, more popularly, the EM algorithm [130, 135, 132, 136]. The joint learning permits (i) a probabilistic (soft) partition of the input space driven by both the data and the experts themselves, and (ii) diverse experts specialized for different but overlapping subregions. Finally, the predictive distribution at $\mathbf{x}_*$ is

$p(y_*|\mathbf{x}_*, \mathcal{D}) = \sum_{i=1}^{M} g_i(\mathbf{x}_*|\mathcal{D})\,p_i(y_*|\mathbf{x}_*, \mathcal{D}),$ (33)

where $g_i(\mathbf{x}_*|\mathcal{D})$ can be regarded as the posterior probability $p(z_* = i|\mathbf{x}_*, \mathcal{D})$, called the responsibility.
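A hedged sketch of evaluating the MoE prediction (33) at a single test point, given the gating weights and the Gaussian predictions of the $M$ experts; if a single Gaussian output is desired, the mixture is moment-matched:

```python
import numpy as np

def moe_predict(gates, means, variances):
    """Combine M expert predictions at one test point via the mixture (33).
    gates: (M,) weights summing to one; means/variances: (M,) expert moments.
    Returns the moment-matched mean and variance of the Gaussian mixture."""
    gates = np.asarray(gates, dtype=float)
    gates = gates / np.sum(gates)
    means, variances = np.asarray(means), np.asarray(variances)
    mu = np.sum(gates * means)
    # total variance = expected expert variance + variance of the expert means
    var = np.sum(gates * (variances + means**2)) - mu**2
    return mu, var

# toy usage: the confident expert dominates through its gate
print(moe_predict(gates=[0.7, 0.2, 0.1], means=[1.0, 0.5, -2.0], variances=[0.1, 0.4, 2.0]))
```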

To advance the MoE, (i) the single-layer model is extended to a tree-structured hierarchical architecture [135]; (ii) the Bayesian approach is employed instead of simple maximum likelihood to get rid of over-fitting and noise-level underestimation [137, 138]; (iii) the t-distribution is considered to handle outliers [139]; and finally (iv) instead of following the conditional mixture (32), the input distribution is considered to form the joint likelihood $p(y, \mathbf{x})$ for a better assignment of experts [132].

Next, we review the mixture of GP experts for big data. It is observed that (i) the original MoE is designed for capturing multi-modal (non-stationary) features, and the individual global experts are responsible for all the data points, leading to a higher complexity in comparison to individual modeling; (ii) the iid data assumption does not hold for GP experts, which model the dependencies in the joint distribution; and (iii) the parametric gating function is not favored in the Bayesian non-parametric framework. In 2001, Tresp [140] first introduced the mixture of GP experts, which employs GP experts to respectively capture the mean, the noise variance, and the gate parameters, at the cost however of a nearly cubic complexity, which is unattractive for big data.

The mixture of GP experts for big data should address two issues: (i) how to reduce the computational complexity (model complexity), and (ii) how to determine the number of experts (model selection).

To address the model complexity issue while retaining the capability of describing non-stationary features, there are three core threads. The first is the localization of experts. For instance, the infinite mixture of GP experts (iMGPE) [25] uses a localized likelihood to get rid of the iid assumption as

(34)

Given a configuration of the expert indicators $\mathbf{z}$, the likelihood factorizes over the local experts, resulting in an $\mathcal{O}(nm_0^2)$ complexity when each expert has the same training size $m_0$. Similar to [132], the iMGPE model was further improved in [141] by employing the joint distribution $p(y, \mathbf{x})$ rather than the conditional $p(y|\mathbf{x})$ as

(35)

The full generative model is capable of handling partially specified data and providing inverse functional mappings, and thus has been widely used and extended in the literature. But the inference over (34) and (35) must resort to expensive Markov Chain Monte Carlo (MCMC) sampling. Alternatively, the localization can be achieved by the so-called hard-cut EM algorithm using a truncated representation, wherein the E-step of the EM algorithm directly assigns the data to the experts through the maximum a posteriori (MAP) estimate of the expert indicators or a threshold value [142, 143, 42, 144, 145]. As a result, the M-step only operates the localized experts on small subsets, enhancing the capability of handling big data. The second thread is to combine global experts with the sparse approximations reviewed in Section III-C under the variational EM framework, which is a faster alternative to MCMC sampling. The dependency among training outputs is broken to make variational inference feasible by (i) interpreting the GP as the finite Bayesian linear model in (17) [146, 26], or (ii) using FITC experts such that the likelihood factorizes over the latent variables given the inducing set [42, 145]. With $m$ inducing points for each expert, the complexity is $\mathcal{O}(nm^2)$, which can be further reduced with the help of the hard-cut EM [42, 145].

Note that the first two threads for reducing model complexity assign the data points dynamically according to the data properties and the experts' performance. Hence, they are denoted as the mixture of implicitly localized experts (MILE) [24]. The implicit partition of MILE determines the optimal allocation of experts, thus enabling the capture of interaction (overlapping) among experts. This advantage encourages its application to data association [147, 148]. The main drawback of MILE however is that, in the competitive learning process, some experts may be eliminated due to the zero-coefficient problem caused by unreasonable initial parameters [149].

To reduce the model complexity and overcome the problem of MILE, the third thread is to pre-partition the input space by clustering techniques and assign points to the relevant experts before model training [150, 151]. The mixture of explicitly localized experts (MELE) [24] (i) reduces the model complexity as well, and (ii) explicitly determines the architecture of MoE and employs distinct local experts. In the meantime, the drawback of MELE is that the clustering-based partition misses the information from the data labels and the experts, such that it cannot capture the interaction among experts.

Finally, to address the model selection issue, the Akaike information criterion [152] and the synchronously balancing criterion [153] have been employed to choose the number of experts over a set of candidate values at the cost of increased complexity. More elegantly, the input-dependent Dirichlet process (DP) [25] or the Polya urn distribution [141] is introduced over the expert indicators to act as the gating function, which in turn brings the capability of automatically inferring the number of experts from the data. This complex prior in practice however leads to difficulty in learning it well, and the infinite number of experts makes variational inference infeasible [26]. Therefore, for simplicity, a truncated representation of the DP with a fixed number of experts is usually used [145, 26].

IV-C Product-of-experts

Different from the MoE model, which employs a weighted sum of several probability distributions (experts) via an “or” operation, the product-of-experts (PoE) proposed by Hinton [27] conducts the combination by multiplying these probability distributions, which sidesteps the weight assignment in MoE and is similar to an “and” operation, as

$p(y|\mathbf{x}) = \frac{1}{Z}\prod_{i=1}^{M}p_i(y|\mathbf{x}),$ (36)

where $Z$ is a normalization factor. However, it is found that, given general experts for PoE, model training through maximizing the log likelihood is intractable because of the normalization factor $Z$ [154].

Fortunately, the product of GP experts sidesteps this issue since each expert's distribution is Gaussian. Hence, the product of Gaussian distributions is still (proportional to) a Gaussian distribution, resulting in a factorized marginal likelihood of PoE over the GP experts

$p(\mathbf{y}|\mathbf{X}) \approx \prod_{i=1}^{M}p_i(\mathbf{y}_i|\mathbf{X}_i),$ (37)

where $p_i(\mathbf{y}_i|\mathbf{X}_i) = \mathcal{N}(\mathbf{y}_i\,|\,\mathbf{0}, \mathbf{K}_{n_in_i} + \sigma_\epsilon^2\mathbf{I}_{n_i})$, with $\mathcal{D}_i = \{\mathbf{X}_i, \mathbf{y}_i\}$ and $n_i$ being the training size of expert $\mathcal{M}_i$. This factorization degenerates the full kernel matrix $\mathbf{K}_{nn}$ into a block-diagonal matrix $\mathrm{blkdiag}[\mathbf{K}_{n_1n_1}, \cdots, \mathbf{K}_{n_Mn_M}]$. Hence, the complexity is substantially reduced to $\mathcal{O}(nm_0^2)$ given an equal training size $m_0 = n/M$ for each expert. Besides, it is observed that the PoE likelihood (37) is a special case of the MoE likelihood (34): the MoE likelihood averages the PoE likelihood over the possible configurations of the expert indicators $\mathbf{z}$. Consequently, the joint learning of gating functions and experts makes MoE achieve an optimal allocation of experts, such that it generally outperforms PoE [155].

Generally, due to the weighted sum form (32), the MoE will never be sharper than the sharpest expert; on the contrary, due to the product form (36), the PoE can be sharper than any of its experts. This can be confirmed in Fig. 4: the PoE produces a poor prediction mean and overconfident prediction variances by aggregating the predictions from six independent experts, due to its inability to suppress poor experts; on the contrary, the MoE provides desirable predictions through the gating functions.
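A hedged sketch of the (possibly weighted) product-rule aggregation of $M$ Gaussian expert predictions at one test point; with unit weights this is the plain PoE of (36), while down-weighting poor experts is the role of the advanced aggregation criteria discussed below:

```python
import numpy as np

def poe_aggregate(means, variances, weights=None):
    """Product-rule aggregation of M Gaussian expert predictions at one test point:
    the aggregated precision is the (weighted) sum of the expert precisions."""
    means, variances = np.asarray(means), np.asarray(variances)
    w = np.ones_like(means) if weights is None else np.asarray(weights)
    prec = np.sum(w / variances)               # aggregated precision
    var = 1.0 / prec
    mu = var * np.sum(w * means / variances)   # precision-weighted mean
    return mu, var

# with unit weights this is the plain PoE of (36); the overconfident poor expert
# is not suppressed, which motivates the advanced aggregation criteria below
print(poe_aggregate(means=[1.0, 0.8, -2.0], variances=[0.1, 0.2, 5.0]))
```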

Hence, in order to improve the performance of PoE, we retain the effective PoE training process but modify the predicting process. Instead of following the simple product rule to aggregate the experts' predictions, various advanced aggregation criteria have been proposed to weaken the votes of poor experts. Note that the aggregation strategy is post-processing or transductive, since it is independent of the model training but depends on the test point location. Particularly, the aggregations are expected to have several properties [155]: (i) the aggregated prediction is a valid probabilistic model, and (ii) the aggregated prediction is robust to weak experts.

Given the GP experts with predictive distributions