Variational Inference for Gaussian Process Models with Linear Complexity


Ching-An Cheng
Institute for Robotics and Intelligent Machines
Georgia Institute of Technology
Atlanta, GA 30332
cacheng@gatech.edu
Byron Boots
Institute for Robotics and Intelligent Machines
Georgia Institute of Technology
Atlanta, GA 30332
bboots@cc.gatech.edu
Abstract

Large-scale Gaussian process inference has long faced practical challenges due to time and space complexity that is superlinear in dataset size. While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, we propose a novel variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space. We show that this new parametrization generalizes previous models. Furthermore, it yields a variational inference problem that can be solved by stochastic gradient ascent with time and space complexity that is only linear in the number of mean function parameters, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. We run several experiments on regression tasks and show that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.

 


1 Introduction

Gaussian process (GP) inference is a popular nonparametric framework for reasoning about functions under uncertainty. However, the expressiveness of GPs comes at a price: solving (approximate) inference for a GP with $N$ data instances has time and space complexities of $O(N^3)$ and $O(N^2)$, respectively. Therefore, GPs have traditionally been viewed as a tool for problems with small- or medium-sized datasets.

Recently, the concept of inducing points has been used to scale GPs to larger datasets. The idea is to summarize a full GP model with statistics on a sparse set of $M$ fictitious observations [18, 24]. By representing a GP with these inducing points, the time and the space complexities are reduced to $O(NM^2 + M^3)$ and $O(NM + M^2)$, respectively. To further process datasets that are too large to fit into memory, stochastic approximations have been proposed for regression [10] and classification [11]. These methods have similar complexity bounds, but with $N$ replaced by the size of a mini-batch $N_m$.

Despite the success of sparse models, the scalability issues of GP inference are far from resolved. The major obstruction is that the cubic complexity in $M$ in the aforementioned upper bound is also a lower bound, which results from the inversion of the $M$-by-$M$ covariance matrix defined on the inducing points. As a consequence, these models can only afford to use a small set of basis functions, limiting the expressiveness of GPs for prediction.

In this work, we show that superlinear complexity is not completely necessary. Inspired by the reproducing kernel Hilbert space (RKHS) representation of GPs [2], we propose a generalized variational GP model, called DGPs (Decoupled Gaussian Processes), which decouples the bases for the mean and the covariance functions. Specifically, let $M_a$ and $M_b$ be the numbers of basis functions used to model the mean and the covariance functions, respectively, and assume $M_a \geq M_b$. We show that, when DGPs are used as a variational posterior [24], the associated variational inference problem can be solved by stochastic gradient ascent with time and space complexity that is linear in $M_a$ (the remaining dependence is on $M_b$, the mini-batch size $N_m$, and the input dimension $D$). We name this algorithm svdgp. As a result, we can choose $M_a \gg M_b$, which allows us to keep the time and space complexity similar to previous methods (by choosing $M_b$ comparable to the $M$ above) while greatly increasing accuracy. To the best of our knowledge, this is the first variational GP algorithm that admits linear complexity in the number of mean parameters, without any assumption on the choice of kernel and likelihood.

While we design svdgp for general likelihoods, in this paper we study its effectiveness on Gaussian process regression (GPR) tasks. We consider this to be without loss of generality, as most sparse variational GPR algorithms in the literature can be modified to handle general likelihoods by introducing additional approximations (e.g. as in Hensman et al. [11] and Sheth et al. [22]). Our experimental results show that svdgp significantly outperforms the existing techniques, achieving higher variational lower bounds and lower prediction errors when evaluated on held-out test sets.

1.1 Related Work

Algorithm   Parameter updates   Shared basis   Variational   Time                              Space
svdgp       sga / sga / sga     false          true          O(N_m M_a + N_m M_b^2 + M_b^3)    O(M_a + M_b^2)
svi         snga / sga / sga    true           true          O(N_m M^2 + M^3)                  O(N_m M + M^2)
ivsgpr      sma / sma / sga     true           true          O(N_m M^2 + M^3)                  O(N_m M + M^2)
vsgpr       cg / cg / cg        true           true          O(N M^2 + M^3)                    O(N M + M^2)
gpr         cg / cg / cg        true           false         O(N^3)                            O(N^2)
Table 1: Comparison between svdgp and variational GPR algorithms: svi [10], ivsgpr [2], vsgpr [24], and gpr [19], where $N$ is the number of observations, $N_m$ the size of a mini-batch, $M$, $M_a$, $M_b$ the numbers of basis functions, and $D$ the input dimension. Here it is assumed that $M_a \geq M_b$. The parameter-update entries list the algorithms used to update the three groups of parameters: sga/snga/sma denote stochastic gradient/natural gradient/mirror ascent, and cg denotes batch nonlinear conjugate gradient ascent. The shared-basis and variational columns indicate whether the bases for mean and covariance are strictly shared, and whether a variational posterior can be used. The last two columns list the time and space complexity.

Our framework is based on the variational inference problem proposed by Titsias [24], which treats the inducing points as variational parameters to allow direct approximation of the true posterior. This is in contrast to Seeger et al. [21], Snelson and Ghahramani [23], Quiñonero-Candela and Rasmussen [18], and Lázaro-Gredilla et al. [15], which all use inducing points as hyper-parameters of a degenerate prior. While both approaches have the same time and space complexity, the latter additionally introduces a large set of unregularized hyper-parameters and, therefore, is more likely to suffer from over-fitting [1].

In Table 1, we compare svdgp with recent GPR algorithms in terms of the assumptions made and the time and space complexity. Each algorithm can be viewed as a special way to solve the maximization of the variational lower bound (5), presented in Section 3.2. Our algorithm svdgp generalizes the previous approaches to allow the basis functions for the mean and the covariance to be decoupled, so an approximate solution can be found by stochastic gradient ascent in linear complexity.

Figure 1: Comparison between models with shared and decoupled bases. Panels (a) and (c) show models with a shared basis (small and full-scale, respectively); panel (b) shows the model with a decoupled basis. In each panel, the red line denotes the ground truth; the blue circles denote the observations; the black line and the gray area denote the predictive mean and variance, respectively.

To illustrate the idea, we consider a toy GPR example in Figure 1. The dataset contains 500 noisy observations of a function. Given the same training data, we conduct experiments with three different GP models. Figure 1 (a) and (c) show the results of the traditional shared basis, which can be learned by any of the variational algorithms listed in Table 1, and Figure 1 (b) shows the result of the decoupled approach svdgp. The sizes of the bases and of the observation set are selected to emulate a large-dataset scenario. We can observe that svdgp achieves a nice trade-off between prediction performance and complexity: it achieves almost the same accuracy in prediction as the full-scale model in Figure 1 (c) and preserves the overall shape of the predictive variance.

In addition to the sparse algorithms above, some recent attempts aim to revive the nonparametric property of GPs through structured covariance functions. For example, Wilson and Nickisch [27] propose to space the inducing points on a multidimensional lattice, so that, with a product kernel, the time and space complexities become almost linear in the number of grid points. However, because the number of grid points is $m^D$, where $m$ is the number of grid points per dimension, the overall complexity is exponential in $D$ and infeasible for high-dimensional data. Another interesting approach by Hensman et al. [12] combines variational inference [24] with a sparse spectral approximation [15]. By equally spacing inducing points on the spectrum, they show that the covariance matrix on the inducing points has a diagonal-plus-low-rank structure. With MCMC, the algorithm can achieve complexity linear in the number of inducing frequencies. However, the proposed structure in [12] does not help to reduce the complexity when an approximate Gaussian posterior is favored or when the kernel hyper-parameters need to be updated.

Other kernel methods with linear complexity have been proposed based on functional gradient descent [14, 5]. However, because these methods use a model strictly the same size as the entire dataset, they fail to estimate the predictive covariance, which would require $O(N^2)$ space. Moreover, they cannot learn hyper-parameters online. The latter drawback also applies to greedy algorithms based on rank-one updates, e.g. the algorithm of Csató and Opper [4].

In contrast to these previous methods, our algorithm applies to all choices of inducing points, likelihoods, and kernels, and we allow both variational parameters and hyper-parameters to adapt online as more data are encountered.

2 Preliminaries

In this section, we briefly review inference for GPs and the variational framework proposed by Titsias [24]. For now, we focus on GPR for simplicity of exposition. We discuss the case of general likelihoods in the next section, when we introduce our framework, DGPs.

2.1 Inference for GPs

Let $f$ be a latent function defined on a compact domain $\mathcal{X} \subseteq \mathbb{R}^D$. Here we assume a priori that $f$ is distributed according to a Gaussian process $\mathcal{GP}(m, k)$ with mean function $m$ and covariance function $k$. That is, $\mathbb{E}[f(x)] = m(x)$, and $\mathbb{C}[f(x), f(x')] = k(x, x')$. In short, we write $f \sim \mathcal{GP}(m, k)$.

A GP probabilistic model is composed of a likelihood $p(y|f)$ and a GP prior $\mathcal{GP}(m, k)$; in GPR, the likelihood is assumed to be Gaussian, i.e. $y = f(x) + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ of variance $\sigma^2$. Usually, the likelihood and the GP prior are parameterized by some hyper-parameters, which we summarize as $\theta$. This includes, for example, the variance $\sigma^2$ and the parameters implicitly involved in defining the covariance function $k$. For notational convenience, and without loss of generality, we assume $m = 0$ in the prior distribution and omit explicitly writing the dependence of distributions on $\theta$.

Assume we are given a dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, in which $x_n \in \mathcal{X}$ and $y_n \in \mathbb{R}$. Let $X = \{x_n\}_{n=1}^N$ and $\mathbf{y} = (y_n)_{n=1}^N$. (In notation, we use boldface to distinguish finite-dimensional vectors (lower-case) and matrices (upper-case) that are used in computation from scalar and abstract mathematical objects.) Inference for GPs involves solving for the posterior of $f(x_*)$ for any new input $x_* \in \mathcal{X}$. For example, in GPR, because the likelihood is Gaussian, the predictive posterior is also Gaussian, with mean and covariance

$m_{|\mathbf{y}}(x_*) = \mathbf{k}_{*,X} (\mathbf{K}_{X} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}, \qquad k_{|\mathbf{y}}(x_*, x_*') = k_{*,*'} - \mathbf{k}_{*,X} (\mathbf{K}_{X} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_{X,*'}, \qquad (1)$

and the hyper-parameter $\theta$ can be found by maximizing the log marginal likelihood with nonlinear conjugate gradient ascent [19],

$\max_{\theta} \; \log p(\mathbf{y} \,|\, X; \theta) = \max_{\theta} \; \log \mathcal{N}(\mathbf{y} \,|\, \mathbf{0}, \mathbf{K}_{X} + \sigma^2 \mathbf{I}), \qquad (2)$

where $\mathbf{K}_{X}$, $\mathbf{k}_{*,X}$, and $k_{*,*'}$ denote the covariances between the sets in their subscripts (if the two sets are the same, only one is listed). One can show that these two functions, $m_{|\mathbf{y}}$ and $k_{|\mathbf{y}}$, define a valid GP. Therefore, given observations $\mathbf{y}$, we say $f \,|\, \mathbf{y} \sim \mathcal{GP}(m_{|\mathbf{y}}, k_{|\mathbf{y}})$.

Although theoretically GPs are nonparametric and can model any function as $N \to \infty$, in practice this is difficult. As inference has time complexity $O(N^3)$ and space complexity $O(N^2)$, applying vanilla GPs to large datasets is infeasible.
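For reference, the following sketch (our own NumPy illustration, not the paper's implementation; it assumes a squared-exponential kernel) computes the exact GPR posterior of (1) and the log marginal likelihood of (2); the Cholesky factorization of the $N$-by-$N$ Gram matrix is what incurs the $O(N^3)$ time and $O(N^2)$ memory.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; X1: (n1, d), X2: (n2, d).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gpr_exact(X, y, Xstar, noise_var=0.1):
    """Exact GPR: posterior mean/variance, cf. (1), and log marginal likelihood, cf. (2)."""
    N = X.shape[0]
    K = rbf_kernel(X, X) + noise_var * np.eye(N)         # O(N^2) memory
    L = np.linalg.cholesky(K)                            # O(N^3) time
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^{-1} y
    Ks = rbf_kernel(Xstar, X)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf_kernel(Xstar, Xstar).diagonal() - (v ** 2).sum(0)
    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)
    return mean, var, lml
```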

2.2 Variational Inference with Sparse GPs

To scale GPs to large datasets, Titsias [24] introduced a scheme to compactly approximate the true posterior with a sparse GP defined by the statistics of $M$ inducing function values $u_m = (L_m f)(z_m)$, where $L_m$ is a bounded linear operator and $z_m \in \mathcal{X}$. (Here we use the notation loosely for compactness of writing; rigorously, $L_m$ acts on the mean and covariance functions, not necessarily on all sample paths of $f$.) $L_m f$ is called an inducing function and $z_m$ an inducing point. Common choices of $L_m$ include the identity map (as used originally by Titsias [24]) and integral operators, which can achieve better approximations or incorporate multi-domain information [26, 7, 3]. Intuitively, we can think of $\{u_m\}_{m=1}^M$ as a set of potentially indirect observations that capture salient information about the unknown function $f$.

Titsias [24] solves for the sparse GP by variational inference. Let $Z = \{z_m\}_{m=1}^M$, and let $\mathbf{f}_X$ and $\mathbf{u}_Z$ be the (inducing) function values defined on $X$ and $Z$, respectively. Let $p(\mathbf{f}_X, \mathbf{u}_Z)$ be the prior given by $\mathcal{GP}(0, k)$ and define $q(\mathbf{f}_X, \mathbf{u}_Z)$ to be its variational posterior, where $q(\mathbf{u}_Z) = \mathcal{N}(\mathbf{u}_Z \,|\, \tilde{\mathbf{m}}, \tilde{\mathbf{S}})$ and $\tilde{\mathbf{m}}$, $\tilde{\mathbf{S}}$ are the mean and the covariance of the approximate posterior of $\mathbf{u}_Z$. Titsias [24] proposes to use $q(\mathbf{f}_X, \mathbf{u}_Z)$ as the variational posterior to approximate $p(\mathbf{f}_X, \mathbf{u}_Z \,|\, \mathbf{y})$ and to solve for $q(\mathbf{u}_Z)$ together with the hyper-parameter $\theta$ through

$\max_{\theta,\, q(\mathbf{u}_Z)} \; \mathcal{L}(q) := \mathbb{E}_{q(\mathbf{f}_X, \mathbf{u}_Z)}\!\left[\log \frac{p(\mathbf{y} \,|\, \mathbf{f}_X)\, p(\mathbf{f}_X, \mathbf{u}_Z)}{q(\mathbf{f}_X, \mathbf{u}_Z)}\right], \qquad (3)$

where $\mathcal{L}(q)$ is a variational lower bound of $\log p(\mathbf{y})$, $p(\mathbf{f}_X \,|\, \mathbf{u}_Z)$ is the conditional probability given in the prior $p$, and $q(\mathbf{f}_X, \mathbf{u}_Z) = p(\mathbf{f}_X \,|\, \mathbf{u}_Z)\, q(\mathbf{u}_Z)$.

At first glance, the specific choice of variational posterior seems heuristic. However, although parameterized finitely, it resembles a full-fledged GP, $\mathcal{GP}(\tilde{m}, \tilde{k})$, with

$\tilde{m}(x) = \mathbf{k}_{x,Z} \mathbf{K}_{Z}^{-1} \tilde{\mathbf{m}}, \qquad \tilde{k}(x, x') = k_{x,x'} - \mathbf{k}_{x,Z} \mathbf{K}_{Z}^{-1} \mathbf{k}_{Z,x'} + \mathbf{k}_{x,Z} \mathbf{K}_{Z}^{-1} \tilde{\mathbf{S}} \mathbf{K}_{Z}^{-1} \mathbf{k}_{Z,x'}. \qquad (4)$

This result is further studied in Matthews et al. [16] and Cheng and Boots [2], where it is shown that (3) is indeed minimizing a proper KL-divergence between Gaussian processes/measures.

By comparing (2) and (3), one can show that the time and the space complexities now reduce to $O(NM^2 + M^3)$ and $O(NM + M^2)$, respectively, due to the low-rank structure of the variational posterior [24]. To further reduce the complexity, stochastic optimization, such as stochastic natural gradient ascent [10] or stochastic mirror descent [2], can be applied. In this case, $N$ in the above asymptotic bounds is replaced by the size of a mini-batch $N_m$. The above results can be modified to consider general likelihoods as in [22, 11].
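To make (4) and its cost concrete, here is a small sketch of the standard sparse-GP predictive computation (our own illustration; it assumes identity inducing functions, a generic kernel callable, and a variational posterior $q(\mathbf{u}_Z) = \mathcal{N}(\tilde{\mathbf{m}}, \tilde{\mathbf{S}})$). The $M$-by-$M$ Cholesky factorization is the step responsible for the cubic cost in $M$.

```python
import numpy as np

def sparse_gp_predict(Xstar, Z, m_tilde, S_tilde, kernel, jitter=1e-6):
    """Predictive mean/variance of the sparse variational posterior, cf. (4)."""
    M = Z.shape[0]
    Kzz = kernel(Z, Z) + jitter * np.eye(M)
    Ksz = kernel(Xstar, Z)
    Lz = np.linalg.cholesky(Kzz)                           # O(M^3) bottleneck
    A = np.linalg.solve(Lz.T, np.linalg.solve(Lz, Ksz.T))  # K_Z^{-1} k_{Z,*}, shape (M, n*)
    mean = A.T @ m_tilde
    kss = kernel(Xstar, Xstar).diagonal()
    var = kss - (A * Ksz.T).sum(0) + (A * (S_tilde @ A)).sum(0)
    return mean, var
```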

3 Variational Inference with Decoupled Gaussian Processes

Despite the success of sparse GPs, the scalability issues of GPs persist. Although parameterizing a GP with inducing points/functions enables learning from large datasets, it also restricts the expressiveness of the model. As the time and the space complexities still scale in $O(M^3)$ and $O(M^2)$, we cannot learn or use a complex model with a large number of basis functions $M$.

In this work, we show that these two complexity bounds, which have long accompanied GP models, are not strictly necessary, but are due to the tangled representation canonically used in the GP literature. To elucidate this, we adopt the dual representation of Cheng and Boots [2], which treats GPs as linear operators in an RKHS. But, unlike Cheng and Boots [2], we show how to decouple the basis representation of the mean and covariance functions of a GP and derive a new variational problem, which can be viewed as a generalization of (3). We show that this problem, with arbitrary likelihoods and kernels, can be solved by stochastic gradient ascent with linear complexity in $M_a$, the number of parameters used to specify the mean function for prediction.

In the following, we first review the results in [2]. We next introduce the decoupled representation, DGPs, and its variational inference problem. Finally, we present svdgp and discuss the case with general likelihoods.

3.1 Gaussian Processes as Gaussian Measures

Let an RKHS $\mathcal{H}$ be a Hilbert space of functions with the reproducing property: for every $x \in \mathcal{X}$ there exists $\phi_x \in \mathcal{H}$ such that $f(x) = \langle \phi_x, f \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. (To simplify the notation, we write $\phi_x^\top f$ for $\langle \phi_x, f \rangle_{\mathcal{H}}$ and $\phi_x^\top \Sigma \phi_{x'}$ for the associated bilinear form of an operator $\Sigma$, even if $\mathcal{H}$ is infinite-dimensional.) A Gaussian process $\mathcal{GP}(m, k)$ is equivalent to a Gaussian measure on a Banach space $\mathcal{B}$ that possesses an RKHS $\mathcal{H}$ [2] (such an $\mathcal{H}$ can, w.l.o.g., be identified as the natural RKHS of the covariance function of a zero-mean prior GP): there is a mean functional $\mu \in \mathcal{H}$ and a bounded positive semi-definite linear operator $\Sigma: \mathcal{H} \to \mathcal{H}$, such that for any $x, x' \in \mathcal{X}$, we can write $m(x) = \phi_x^\top \mu$ and $k(x, x') = \phi_x^\top \Sigma \phi_{x'}$. The triple consisting of $\mathcal{B}$, $\mathcal{H}$, and the Gaussian measure is known as an abstract Wiener space [9, 6], in which $\mathcal{H}$ is also called the Cameron-Martin space. Here the restriction that $\mu$ and $\Sigma$ are RKHS objects is necessary so that the variational inference problem in the next section is well-defined.

We call this the dual representation of a GP in the RKHS (the mean function and the covariance function are realized as linear objects $\mu$ and $\Sigma$ defined in $\mathcal{H}$). With abuse of notation, we write $f \sim \mathcal{N}(\mu, \Sigma)$ in short. This notation does not mean that a GP has a Gaussian distribution in $\mathcal{H}$, nor does it imply that the sample paths from the GP are necessarily in $\mathcal{H}$. Precisely, $\mathcal{B}$ contains the sample paths of the GP and $\mathcal{H}$ is dense in $\mathcal{B}$. In most applications of GP models, $\mathcal{B}$ is the Banach space of continuous functions and $\mathcal{H}$ is the (closure of the) span of the covariance function. As a special case, if $\mathcal{H}$ is finite-dimensional, $\mathcal{H}$ and $\mathcal{B}$ coincide and $\mathcal{N}(\mu, \Sigma)$ becomes equivalent to a Gaussian distribution in a Euclidean space.

In relation to our previous notation in Section 2.1: suppose $k(x, x') = \langle \phi_x, \phi_{x'} \rangle$ and $x \mapsto \phi_x$ is a feature map to some Hilbert space $\mathcal{H}$. Then we have assumed a priori that the prior is a normal Gaussian measure $\mathcal{N}(0, I)$; that is, it samples functions of the form $f(\cdot) = \sum_i w_i \langle \phi_\cdot, e_i \rangle$, where $\{e_i\}$ is an orthonormal basis of $\mathcal{H}$ and $w_i \sim \mathcal{N}(0, 1)$ are independent. Note that if $\mathcal{H}$ is infinite-dimensional, with probability one such a sample is not in $\mathcal{H}$, but fortunately $\mathcal{B}$ is large enough for us to approximate the sampled functions. In particular, it can be shown that the posterior in GPR has a dual RKHS representation in the same RKHS as the prior GP [2].
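Concretely, the GPR posterior of Section 2.1 can be written in this dual form (our rendering of the result cited from [2], using the shorthand $\Psi_X := (\phi_{x_1}, \dots, \phi_{x_N})$ for the operator that stacks the feature maps of the training inputs):

$\mu_{|\mathbf{y}} = \Psi_X \left(\mathbf{K}_X + \sigma^2 \mathbf{I}\right)^{-1}\mathbf{y}, \qquad \Sigma_{|\mathbf{y}} = I - \Psi_X \left(\mathbf{K}_X + \sigma^2 \mathbf{I}\right)^{-1}\Psi_X^{*},$

and evaluating $\phi_{x_*}^\top \mu_{|\mathbf{y}}$ and $\phi_{x_*}^\top \Sigma_{|\mathbf{y}} \phi_{x_*'}$ recovers exactly the posterior mean and covariance in (1).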

3.2 Variational Inference in Gaussian Measures

Cheng and Boots [2] propose a dual formulation of (3) in terms of Gaussian measures. (We assume $q$ is absolutely continuous w.r.t. $p$, which is true as $p$ is non-degenerate; the first integral below denotes the expectation of the log-likelihood over $q$, and the KL-divergence is defined through the Radon-Nikodym derivative $\frac{dq}{dp}$.)

$\max_{\theta,\, q} \; \mathcal{L}(q) := \int \log p(\mathbf{y} \,|\, f)\, dq(f) - \mathrm{KL}(q \,\|\, p), \qquad (5)$

where $q = \mathcal{N}(\mu, \Sigma)$ is a variational Gaussian measure and $p = \mathcal{N}(0, I)$ is a normal prior. Its connection to the inducing points/functions in (3) can be summarized as follows [2, 3]: Define a linear operator $\Psi_Z: \mathbb{R}^M \to \mathcal{H}$ as $\Psi_Z \mathbf{c} = \sum_{m=1}^{M} c_m \psi_{z_m}$, where $\psi_{z_m} \in \mathcal{H}$ is defined such that $\langle \psi_{z_m}, f \rangle = (L_m f)(z_m)$. Then (3) and (5) are equivalent if $q$ has a subspace parametrization,

$\mu = \Psi_Z \mathbf{a}, \qquad \Sigma = I + \Psi_Z \mathbf{A} \Psi_Z^{*}, \qquad (6)$

with $\mathbf{a} \in \mathbb{R}^{M}$ and $\mathbf{A} \in \mathbb{R}^{M \times M}$ satisfying $\mathbf{a} = \mathbf{K}_{Z}^{-1} \tilde{\mathbf{m}}$ and $\mathbf{A} = \mathbf{K}_{Z}^{-1} (\tilde{\mathbf{S}} - \mathbf{K}_{Z}) \mathbf{K}_{Z}^{-1}$. In other words, the variational inference algorithms in the literature all use a variational Gaussian measure in which $\mu$ and $\Sigma$ are parametrized by the same basis $\Psi_Z$.

Compared with (3), the formulation in (5) is neater: it follows the definition of the most basic variational inference problem. This is not surprising, since GPs can be viewed as Bayesian linear models in an infinite-dimensional space. Moreover, in (5) all hyper-parameters are isolated in the likelihood $p(\mathbf{y} \,|\, f)$, because the prior is fixed as a normal Gaussian measure.

3.3 Disentangling the GP Representation with DGPs

While Cheng and Boots [2] treat (5) as an equivalent form of (3), here we show that it is a generalization. By further inspecting (5), it is apparent that sharing the basis $\Psi_Z$ between $\mu$ and $\Sigma$ in (6) is not strictly necessary, since (5) seeks to optimize two separate objects, $\mu$ and $\Sigma$. With this in mind, we propose a new parametrization that decouples the bases for $\mu$ and $\Sigma$:

$\mu = \Psi_{X_a} \mathbf{a}, \qquad \Sigma = \left(I + \Psi_{X_b} \mathbf{B} \Psi_{X_b}^{*}\right)^{-1}, \qquad (7)$

where $\mathbf{a} \in \mathbb{R}^{M_a}$, $\mathbf{B} \in \mathbb{R}^{M_b \times M_b}$, and $\Psi_{X_a}$ and $\Psi_{X_b}$ denote linear operators defined similarly to $\Psi_Z$, built on bases $X_a$ and $X_b$ of sizes $M_a$ and $M_b$. Compared with (6), here we parametrize $\Sigma$ through its inversion with $\mathbf{B}$, so the condition that $0 \preceq \Sigma \preceq I$ can be easily realized as $\mathbf{B} \succeq 0$. This form agrees with the posterior covariance in GPR [2] and will give a posterior that is strictly less uncertain than the prior.

The decoupled subspace parametrization (7) corresponds to a DGP, $\mathcal{GP}(\tilde{m}, \tilde{k})$, with mean and covariance functions given by

$\tilde{m}(x) = \mathbf{k}_{x, X_a} \mathbf{a}, \qquad \tilde{k}(x, x') = k_{x,x'} - \mathbf{k}_{x, X_b} \left(\mathbf{B}^{-1} + \mathbf{K}_{X_b}\right)^{-1} \mathbf{k}_{X_b, x'}. \qquad (8)$

(In practice, we can parametrize $\mathbf{B}$ with a Cholesky factor $\mathbf{B} = \mathbf{L}\mathbf{L}^\top$ so the problem is unconstrained. The required terms in (8) and later in (9) can be stably computed as $(\mathbf{B}^{-1} + \mathbf{K}_{X_b})^{-1} = \mathbf{L}(\mathbf{I} + \mathbf{L}^\top \mathbf{K}_{X_b} \mathbf{L})^{-1} \mathbf{L}^\top$ and $\log|\mathbf{I} + \mathbf{K}_{X_b} \mathbf{B}| = \log|\mathbf{I} + \mathbf{L}^\top \mathbf{K}_{X_b} \mathbf{L}|$.)

While the structure of (8) looks similar to (4), directly replacing the basis in (4) with $X_a$ and $X_b$ is not trivial. Because the equations in (4) are derived from the traditional viewpoint of GPs as statistics on function values, the original optimization problem (3) is not defined if $X_a \neq X_b$, and therefore it is not clear how to learn a decoupled representation traditionally. Conversely, by using the dual RKHS representation, the objective function for learning (8) follows naturally from (5), as we will show next.
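To make (8) concrete, the following sketch (our own illustration with a generic kernel callable; it uses the Cholesky parametrization $\mathbf{B} = \mathbf{L}\mathbf{L}^\top$ mentioned above) evaluates the decoupled predictive mean and variance. Note that the mean touches only the $M_a$ basis $X_a$, while the variance involves only the $M_b$ basis $X_b$.

```python
import numpy as np

def dgp_predict(Xstar, Xa, a, Xb, L, kernel):
    """Decoupled GP predictive mean/variance, cf. (8), with B = L L^T."""
    mean = kernel(Xstar, Xa) @ a                             # O(n* M_a) kernel evaluations
    Mb = Xb.shape[0]
    Kbb = kernel(Xb, Xb)
    C = np.linalg.cholesky(np.eye(Mb) + L.T @ Kbb @ L)       # O(M_b^3), independent of M_a
    V = np.linalg.solve(C, L.T @ kernel(Xb, Xstar))          # shape (M_b, n*)
    var = kernel(Xstar, Xstar).diagonal() - (V ** 2).sum(0)  # never exceeds the prior variance
    return mean, var
```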

3.4 svdgp: Algorithm and Analysis

Substituting the decoupled subspace parametrization (7) into the variational inference problem in (5) results in a numerical optimization problem over $(\theta, \mathbf{a}, \mathbf{B})$: $\max \; \mathcal{L}(q) = -\mathrm{KL}(q \,\|\, p) + \sum_{n=1}^{N} \mathbb{E}_{q}\left[\log p(y_n \,|\, f(x_n))\right]$, with

$\mathrm{KL}(q \,\|\, p) = \frac{1}{2} \mathbf{a}^\top \mathbf{K}_{X_a} \mathbf{a} + \frac{1}{2} \log\left|\mathbf{I} + \mathbf{K}_{X_b} \mathbf{B}\right| - \frac{1}{2} \operatorname{tr}\!\left(\left(\mathbf{B}^{-1} + \mathbf{K}_{X_b}\right)^{-1} \mathbf{K}_{X_b}\right), \qquad (9)$

$\sum_{n=1}^{N} \mathbb{E}_{q}\left[\log p(y_n \,|\, f(x_n))\right] = \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}\left(f_n \,\middle|\, \tilde{m}(x_n),\, \tilde{k}(x_n, x_n)\right)}\left[\log p(y_n \,|\, f_n)\right], \qquad (10)$

where each expectation is over a scalar Gaussian whose mean and variance are given by (8) as functions of $\mathbf{a}$ and $\mathbf{B}$. Our objective function contains that of [11] as a special case, which assumes $X_a = X_b$. In addition, we note that Hensman et al. [11] indirectly parametrize the posterior by $\tilde{\mathbf{m}}$ and $\tilde{\mathbf{S}}$, whereas we parametrize $\mu$ and $\Sigma$ directly, as in (6) and (7), with the decoupled basis for scalability and with the inverse parametrization of $\Sigma$ for better stability (which always reduces the uncertainty in the posterior compared with the prior).

We notice that $\mathbf{a}$ and $\mathbf{B}$ are completely decoupled in (9) and potentially combined again in (10). In particular, if the likelihood is Gaussian as in GPR, we have an additional decoupling, i.e. $\mathbb{E}_{q}\left[\log p(y_n \,|\, f(x_n))\right] = g_n(\mathbf{a}) + h_n(\mathbf{B})$ for some functions $g_n$ and $h_n$. Intuitively, the optimization over $\mathbf{a}$ aims to minimize the fitting error, and the optimization over $\mathbf{B}$ aims to memorize the samples encountered so far; the mean and the covariance functions only interact indirectly through the optimization of the hyper-parameter $\theta$.

One salient feature of svdgp is that it tends to overestimate, rather than underestimate, the variance when we select $M_a \gg M_b$. This is inherited from the non-degeneracy property of the variational framework [24] and can be seen in the toy example in Figure 1. In the extreme case when $M_b = 0$, the covariance in (8) becomes the same as the prior; moreover, the objective function of svdgp becomes similar to that of kernel methods (exactly the same as kernel ridge regression when the likelihood is Gaussian). The additional inclusion of expected log-likelihoods here allows svdgp to learn the hyper-parameters in a unified framework, as its objective function can be viewed as minimizing a generalization upper bound in PAC-Bayes learning [8].
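To make the kernel-ridge-regression connection explicit, here is a short worked instance of our own (assuming a Gaussian likelihood with variance $\sigma^2$ and the extreme choice $M_b = 0$, i.e. $\Sigma = I$): the $\mathbf{a}$-dependent part of the objective reduces to

$\sum_{n=1}^{N} \mathbb{E}_{q}\left[\log p(y_n \,|\, f(x_n))\right] - \mathrm{KL}(q \,\|\, p) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(y_n - \mathbf{k}_{x_n, X_a}\mathbf{a}\right)^2 - \frac{1}{2}\mathbf{a}^\top \mathbf{K}_{X_a}\mathbf{a} + \mathrm{const},$

whose maximizer in $\mathbf{a}$ is the kernel ridge regression solution with regularization $\sigma^2$ (e.g. $\mathbf{a} = (\mathbf{K}_{X} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$ when $X_a = X$).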

svdgp solves the above optimization problem by stochastic gradient ascent. Here we purposefully ignore the specific details of the likelihood to emphasize that svdgp can be applied to general likelihoods, as it only requires unbiased first-order information, which e.g. can be found in [22]. In addition to having a more adaptive representation, the main benefit of svdgp is that the computation of an unbiased gradient requires only complexity linear in $M_a$, as shown below (see Appendix A for details).

KL-Divergence

Assume $M_a \gg M_b$ and $M_a \gg N_m$. By (9), one can show that $\nabla_{\mathbf{a}} \mathrm{KL}(q \,\|\, p) = \mathbf{K}_{X_a} \mathbf{a}$ and that $\nabla_{\mathbf{B}} \mathrm{KL}(q \,\|\, p)$ involves only $M_b$-by-$M_b$ matrices. Therefore, the time complexity to compute an unbiased estimate of $\nabla_{\mathbf{a}} \mathrm{KL}(q \,\|\, p)$ can be reduced to $O(N_m M_a)$ if we sample over the columns of $\mathbf{K}_{X_a}$ with a mini-batch of size $N_m$. By contrast, the time complexity to compute $\nabla_{\mathbf{B}} \mathrm{KL}(q \,\|\, p)$ will always be $O(M_b^3)$ and cannot be further reduced, regardless of the parametrization of $\mathbf{B}$ (due to the log-determinant term in (9), the complexity would remain $O(M_b^3)$ even if $\mathbf{B}$ were constrained to be diagonal). The gradients with respect to the inducing points $X_a$ and $X_b$ can be derived similarly and have time complexity $O(N_m M_a D)$ and $O(M_b^3 + M_b^2 D)$, respectively.

Expected Log-Likelihood

Let $\tilde{\mathbf{m}}_X$ and $\tilde{\mathbf{s}}_X$ be the vectors of the means and variances of the scalar Gaussians in (10) evaluated at $x_1, \dots, x_N$. As (10) is a sum over $N$ terms, by sampling with a mini-batch of size $N_m$, an unbiased gradient of (10) with respect to $(\tilde{\mathbf{m}}_X, \tilde{\mathbf{s}}_X)$ can be computed in $O(N_m)$. To compute the full gradient with respect to $\mathbf{a}$ and $\mathbf{B}$, we compute the derivatives of $\tilde{\mathbf{m}}_X$ and $\tilde{\mathbf{s}}_X$ with respect to $\mathbf{a}$ and $\mathbf{B}$ and then apply the chain rule. These steps take $O(N_m M_a)$ and $O(N_m M_b^2 + M_b^3)$ for $\mathbf{a}$ and $\mathbf{B}$, respectively.

The above analysis shows that the curse of dimensionality in GPs originates in the covariance function. For space complexity, the decoupled parametrization (7) requires memory in $O(M_a D + M_b^2 + M_b D)$; for time complexity, an unbiased gradient with respect to $\mathbf{a}$ can be computed in $O(N_m M_a)$, but that with respect to $\mathbf{B}$ has time complexity $O(N_m M_b^2 + M_b^3)$. This motivates choosing $M_a \gg M_b$, with $M_b$ in $O(M_a^{1/3})$ or $O(M_a^{1/2})$, which maintains the same complexity as previous variational techniques while greatly improving the prediction performance.
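As an illustration of the linear-in-$M_a$ claim, the sketch below (our own; it assumes a Gaussian likelihood, holds $\mathbf{B}$ and $\theta$ fixed, and uses hypothetical variable names) forms an unbiased stochastic gradient of the objective with respect to $\mathbf{a}$. The cost is $O(N_m M_a)$ arithmetic plus $O(N_m M_a D)$ kernel evaluations, with no dependence on $M_b$.

```python
import numpy as np

def grad_a_minibatch(a, Xa, Xbatch, ybatch, N, sigma2, kernel, rng):
    """Unbiased stochastic gradient w.r.t. a (Gaussian likelihood, B and theta fixed)."""
    Ma, Nm = Xa.shape[0], Xbatch.shape[0]
    # Expected log-likelihood term, rescaled from the mini-batch to the full sum over N.
    Ksa = kernel(Xbatch, Xa)                                  # (N_m, M_a)
    grad_ell = (N / Nm) * Ksa.T @ (ybatch - Ksa @ a) / sigma2
    # KL term: unbiased estimate of K_{X_a} a by sampling N_m columns (requires N_m <= M_a).
    cols = rng.choice(Ma, size=Nm, replace=False)
    grad_kl = (Ma / Nm) * kernel(Xa, Xa[cols]) @ a[cols]
    return grad_ell - grad_kl
```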

Input: dataset $\mathcal{D}$, mini-batch size $N_m$, basis increment size, number of iterations $T$
Output: variational parameters $(\mathbf{a}, X_a, \mathbf{B}, X_b)$, hyper-parameters $\theta$
1:  $\theta \leftarrow$ initializeHyperparameters( sampleMinibatch($\mathcal{D}$, $N_m$) )
2:  for $t = 1, \dots, T$ do
3:      $\mathcal{D}_t \leftarrow$ sampleMinibatch($\mathcal{D}$, $N_m$)
4:      gp.addBasis($\mathcal{D}_t$, …)
5:      gp.updateModel($\mathcal{D}_t$, $t$)
6:  end for
Algorithm 1 Online Learning with DGPs
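A schematic rendering of Algorithm 1 (all function names below are placeholders of our own; the implementation used in Section 4 is in Matlab):

```python
def online_learning(sample_minibatch, init_hyperparameters, add_basis, update_model, T):
    """Schematic of Algorithm 1: stream mini-batches, grow the variational basis
    from incoming observations, then update all parameters with a stochastic step."""
    theta = init_hyperparameters(sample_minibatch())  # e.g. median trick on the first batch
    for t in range(1, T + 1):
        batch = sample_minibatch()                    # i.i.d. mini-batch of size N_m
        add_basis(batch)                              # initialize new basis functions at the new observations
        theta = update_model(batch, theta, t)         # stochastic ascent step on all parameters
    return theta
```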

4 Experimental Results

We compare our new algorithm, svdgp, with the state-of-the-art incremental algorithms for sparse variational GPR, svi [10] and ivsgpr [2], as well as the classical gpr and the batch algorithm vsgpr [24]. As discussed in Section 1.1, these methods can be viewed as different ways to optimize (5). Therefore, in addition to the normalized mean squared error (nMSE) [19] in prediction, we report the performance in terms of the variational lower bound (VLB) (5), which also captures the quality of the predictive variance and of hyper-parameter learning. (The exact marginal likelihood is computationally infeasible to evaluate for our large models.) These two metrics are evaluated on held-out test sets in all of our experimental domains.

Algorithm 1 summarizes the online learning procedure used by all stochastic algorithms (the algorithms differ only in whether the bases are shared and in how the model is updated; see Table 1), where each learner has to optimize all the parameters on-the-fly using i.i.d. data. The hyper-parameters are first initialized heuristically by the median trick using the first mini-batch. We incrementally build up the variational posterior by including observations in each mini-batch as the initialization of new variational basis functions. Then all the hyper-parameters and the variational parameters are updated online. These steps are repeated for $T$ iterations.

For all the algorithms, we assume the prior covariance is defined by the se-ard kernel [19] and we use the generalized se-ard kernel [2] as the inducing functions in the variational posterior (see Appendix B for details). We note that all algorithms in comparison use the same kernel and optimize both the variational parameters (including inducing points) and the hyperparameters.

In particular, we implement sga by adam [13] with its default parameters ($\beta_1 = 0.9$ and $\beta_2 = 0.999$). The step size for each stochastic algorithm is scheduled with a decaying multiplier whose initial value is selected manually for each algorithm to maximize the improvement in the objective function over the first 100 iterations. We test each stochastic algorithm for 2,000 iterations with mini-batches of fixed size and a fixed basis-increment size. Finally, the model sizes used in the experiments are as follows: a decoupled basis with $M_a \gg M_b$ for svdgp; 1,024 basis functions for svi; 256 for ivsgpr; 1,024 basis functions (trained on a subset of the data) for vsgpr; and a training subset of 1,024 observations for gp. These settings share a similar order of time complexity in our current Matlab implementation.
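The exact step-size schedule and its constants are not stated above; purely as an illustration of the described setup (Adam with its default moment parameters and a manually tuned, decaying step-size multiplier), one could write:

```python
import numpy as np

def step_size(t, eta0, tau=100.0, kappa=0.5):
    # Illustrative decaying multiplier; eta0 is tuned per algorithm on the first
    # 100 iterations, while tau and kappa are placeholder constants.
    return eta0 / (1.0 + t / tau) ** kappa

class Adam:
    """Minimal Adam ascent step with the default beta1 = 0.9, beta2 = 0.999."""
    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.m, self.v = np.zeros(dim), np.zeros(dim)
        self.beta1, self.beta2, self.eps, self.t = beta1, beta2, eps, 0

    def step(self, params, grad, eta):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return params + eta * m_hat / (np.sqrt(v_hat) + self.eps)  # ascent step
```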

4.1 Datasets

Inverse Dynamics of KUKA Robotic Arm

This dataset records the inverse dynamics of a KUKA arm performing rhythmic motions at various speeds [17]. The original dataset consists of two parts, each of which has 17,560 offline data points and 180,360 online data points with 28 attributes and 7 outputs. In the experiment, we mix the online and the offline data and then split them into 90% training data (178,128 instances) and 10% test data (19,792 instances) to satisfy the i.i.d. assumption.

Walking MuJoCo

MuJoCo (Multi-Joint dynamics with Contact) is a physics engine for research in robotics, graphics, and animation, created by Todorov et al. [25]. In this experiment, we gather 1,000 walking trajectories by running trpo [20]. In each time frame, the MuJoCo transition dynamics have a 23-dimensional input and a 17-dimensional output. We consider two regression problems to predict 9 of the 17 outputs from the input (because of the structure of the MuJoCo dynamics, the remaining 8 outputs can be trivially recovered from the input): one maps the input of the current frame (23 dimensions) to the output, and the other maps the inputs of the current and the previous frames (46 dimensions) to the output. In each problem, we randomly select 90% of the data as training data (842,745 instances) and 10% as test data (93,608 instances).

kuka - Variational Lower Bound ()
svdgp svi ivsgpr vsgpr gpr
mean 1.262 0.391 0.649 0.472 -5.335
std 0.195 0.076 0.201 0.265 7.777
(a)
kuka - Prediction Error (nMSE)
svdgp svi ivsgpr vsgpr gpr
mean 0.037 0.169 0.128 0.139 0.231
std 0.013 0.025 0.033 0.026 0.045
(b)
mujoco - Variational Lower Bound ()
svdgp svi ivsgpr vsgpr gpr
mean 6.007 2.178 4.543 2.822 -10312.727
std 0.673 0.692 0.898 0.871 22679.778
(c)
mujoco - Prediction Error (nMSE)
svdgp svi ivsgpr vsgpr gpr
mean 0.072 0.163 0.099 0.118 0.213
std 0.013 0.053 0.026 0.016 0.061
(d)
Table 2: Experimental results of kuka and mujoco after 2,000 iterations.
Figure 2: An example of online learning results (the 9th output of the mujoco dataset). (a) Sample complexity; (b) time complexity. The blue, red, and yellow lines denote svdgp, svi, and ivsgpr, respectively.

4.2 Results

We summarize part of the experimental results in Table 2 in terms of prediction nMSE and VLB. While each output is treated independently during learning, Table 2 presents the mean and the standard deviation over all the outputs, as the selected metrics are normalized. For the complete experimental results, please refer to Appendix C.

We observe that svdgp consistently outperforms the other approaches, with much higher VLBs and much lower prediction errors; svdgp also has smaller standard deviations. These results validate our initial hypothesis that adopting a large set of basis functions for the mean can help when modeling complicated functions. ivsgpr has the next best result after svdgp, despite using a basis size of 256, much smaller than the 1,024 used in svi, vsgpr, and gpr. Similar to svdgp, ivsgpr also generalizes better than the batch algorithms vsgpr and gpr, which only have access to a smaller set of training data and are more prone to over-fitting. By contrast, the performance of svi is surprisingly worse than that of vsgpr. We conjecture this might be due to the fact that the hyper-parameters and the inducing points/functions are only crudely initialized in online learning. We additionally find that svi is more sensitive to the choice of step size than the other methods. This might explain why in [10, 2] batch data was used to initialize the hyper-parameters, and the learning rate used to update the hyper-parameters was selected to be much smaller than that used for stochastic natural gradient ascent.

To further investigate the properties of the different stochastic approximations, we show the change of the VLB and the prediction error over iterations and over time in Figure 2. Overall, whereas ivsgpr and svi share a similar convergence rate, the behavior of svdgp is different. We see that ivsgpr converges the fastest, both in time and in sample complexity. Afterwards, svdgp starts to improve faster and surpasses the other two methods. From Figure 2, we can also observe that although svi has convergence similar to ivsgpr, it slows down earlier and therefore achieves a worse result. These phenomena are observed in multiple experiments.

5 Conclusion

We propose a novel, fully differentiable framework, Decoupled Gaussian Processes (DGPs), for large-scale GP problems. By decoupling the representation, we derive a variational inference problem that can be solved with stochastic gradients in linear time and space complexity. Compared with existing algorithms, svdgp can adopt a much larger set of basis functions to predict more accurately. Empirically, svdgp significantly outperforms state-of-the-art variational sparse GPR algorithms in multiple regression tasks. These encouraging experimental results motivate further application of svdgp to end-to-end learning with neural networks in large-scale, complex real-world problems.

Acknowledgments

This work was supported in part by NSF NRI award 1637758.

References

  • Bauer et al. [2016] Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1525–1533, 2016.
  • Cheng and Boots [2016] Ching-An Cheng and Byron Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4403–4411, 2016.
  • Cheng and Huang [2016] Ching-An Cheng and Han-Pang Huang. Learn the Lagrangian: A vector-valued RKHS approach to identifying Lagrangian systems. IEEE Transactions on Cybernetics, 46(12):3247–3258, 2016.
  • Csató and Opper [2001] Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. Advances in Neural Information Processing Systems, pages 444–450, 2001.
  • Dai et al. [2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
  • Eldredge [2016] Nathaniel Eldredge. Analysis and probability on infinite-dimensional spaces. arXiv preprint arXiv:1607.03591, 2016.
  • Figueiras-Vidal and Lázaro-gredilla [2009] Anibal Figueiras-Vidal and Miguel Lázaro-gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.
  • Germain et al. [2016] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
  • Gross [1967] Leonard Gross. Abstract Wiener spaces. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory, Part 1, pages 31–42. University of California Press, 1967.
  • Hensman et al. [2013] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
  • Hensman et al. [2015] James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, 2015.
  • Hensman et al. [2016] James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. arXiv preprint arXiv:1611.06740, 2016.
  • Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kivinen et al. [2004] Jyrki Kivinen, Alexander J Smola, and Robert C Williamson. Online learning with kernels. IEEE transactions on signal processing, 52(8):2165–2176, 2004.
  • Lázaro-Gredilla et al. [2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881, 2010.
  • Matthews et al. [2016] Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.
  • Meier et al. [2014] Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. In Advances in Neural Information Processing Systems, pages 972–980, 2014.
  • Quiñonero-Candela and Rasmussen [2005] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
  • Seeger et al. [2003] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics, 2003.
  • Sheth et al. [2015] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized GP models. In Proceedings of the 32nd International Conference on Machine Learning, pages 1302–1311, 2015.
  • Snelson and Ghahramani [2005] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
  • Titsias [2009] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Walder et al. [2008] Christian Walder, Kwang In Kim, and Bernhard Schölkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th international conference on Machine learning, pages 1112–1119. ACM, 2008.
  • Wilson and Nickisch [2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.

Appendix

Appendix A Variational Inference with Decoupled Gaussian Processes

Here we provide the details of the variational inference problem used to learn DGPs:

$\max_{\theta, \mathbf{a}, \mathbf{B}} \; \mathcal{L}(q) = -\mathrm{KL}(q \,\|\, p) + \sum_{n=1}^{N} \mathbb{E}_{q}\left[\log p(y_n \,|\, f(x_n))\right]. \qquad (11)$

A.1 KL Divergence

A.1.1 Evaluation

First, we show how to evaluate the KL-divergence. We do so by extending the KL-divergence between two finite-dimensional subspace-parametrized Gaussian measures to an infinite-dimensional space and showing that it is well-defined.

Recall that for two $d$-dimensional Gaussian distributions $\mathcal{N}(\mathbf{m}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\mathbf{m}_2, \boldsymbol{\Sigma}_2)$, the KL-divergence is given as follows.

Proposition 1.

$\mathrm{KL}\!\left(\mathcal{N}(\mathbf{m}_1, \boldsymbol{\Sigma}_1) \,\|\, \mathcal{N}(\mathbf{m}_2, \boldsymbol{\Sigma}_2)\right) = \frac{1}{2}\left(\operatorname{tr}\!\left(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1\right) + (\mathbf{m}_2 - \mathbf{m}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\mathbf{m}_2 - \mathbf{m}_1) - d + \log\frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|}\right).$

Now consider the case in which the two Gaussians are subspace parametrized as

(12)

By Proposition 1, we can derive a representation of the KL-divergence that is applicable even when the dimension is infinite. Recall that in the infinite-dimensional case, the means and the covariance operators are objects in the RKHS (Cameron-Martin space).

Theorem 1.

Assume the two Gaussian measures are subspace parametrized as in (12). Regardless of the dimension of the underlying space, the following holds:

(13)

where

In particular, if the second measure is the normal prior (i.e. $\mathcal{N}(0, I)$), the expression reduces to the KL-divergence used in (9).

Proof.


To prove the theorem, we derive each term in (13) as follows.

First, we derive . Define . Then we can write

(14)

Using (14), we can derive

and therefore

Note this term does not depend on the ambient dimension.

Second, we derive : Since

it holds that

Finally, we derive the quadratic term:

Notes to Practitioners
  1. The above expression is well defined even when $\mathbf{B}$ is singular, because it can be rewritten so that $\mathbf{B}^{-1}$ never needs to be formed explicitly. In particular, we can parametrize $\mathbf{B}$ with a Cholesky factor $\mathbf{B} = \mathbf{L}\mathbf{L}^\top$ in practice so the problem is unconstrained. The required terms can be stably computed as $(\mathbf{B}^{-1} + \mathbf{K}_{X_b})^{-1} = \mathbf{L}(\mathbf{I} + \mathbf{L}^\top \mathbf{K}_{X_b} \mathbf{L})^{-1}\mathbf{L}^\top$ and $\log|\mathbf{I} + \mathbf{K}_{X_b}\mathbf{B}| = \log|\mathbf{I} + \mathbf{L}^\top \mathbf{K}_{X_b}\mathbf{L}|$.

  2. In implementations, a Jacobi preconditioner can optionally be used. That is, we can parametrize $\mathbf{a}$ and $\mathbf{B}$ through a rescaling by the diagonal parts of the corresponding kernel matrices, which improves the stability of first-order methods; here the diagonal part of the kernel matrix acts as the preconditioner.

A.1.2 Gradients

Here we derive the gradients of the variational objective of svdgp. The purpose is to show the complexity of calculating the gradients. These equations are useful for implementing svdgp with basic linear algebra routines, although computational-graph libraries based on automatic differentiation are also applicable and easier to use.

To derive the gradients, we first introduce some short-hand